Wortschatz Leipzig


The Wortschatz Leipzig project, also known as the Leipzig Corpora Collection, has been providing information on many languages and their vocabularies since the mid-1990s. All of our applications and services are based on language-statistical algorithms and text-mining methods for analyzing large amounts of text. They rely on our own web-crawling infrastructure, self-developed toolchains, and quality assurance procedures.

The data is collected automatically from selected publicly available sources. The example sentences are selected automatically and do not express the opinion of the project; their authors are solely responsible for the content and opinions they contain. Trademarks reproduced in the vocabulary, such as common names, trade names, and product designations, are subject to legal provisions even where they are not specially labelled. The synonymous use of a trademark does not necessarily describe product-specific characteristics; it reflects how the term is used in a general linguistic context.