Andovar Localization Blog - tips & content for global growth

Corpus

Written by Steven Bussey | Feb 20, 2020 5:00:00 AM

A corpus (pl. corpora) is a large body of machine-readable text used for research purposes.

Corpora are either monolingual or multilingual. They often include annotations, such as part-of-speech tags or, in multilingual corpora, the alignment of corresponding segments across languages. Some corpora are kept private by their owners, while others are freely available to everyone. Large translation memories can also serve as multilingual corpora.
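To make the idea of annotation and alignment concrete, here is a minimal sketch in Python of how one entry in a small bilingual corpus might be represented. The `AlignedSegment` class, its field names, the tag labels, and the sample sentences are all illustrative assumptions; real corpora use their own formats and tag sets.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class AlignedSegment:
    """One aligned pair in a toy bilingual corpus (hypothetical structure)."""
    source: str                        # segment in the source language
    target: str                        # aligned segment in the target language
    source_pos: List[Tuple[str, str]]  # (token, part-of-speech tag) pairs


# Toy data, invented for illustration only.
corpus = [
    AlignedSegment(
        source="The cat sleeps.",
        target="Le chat dort.",
        source_pos=[("The", "DET"), ("cat", "NOUN"), ("sleeps", "VERB"), (".", "PUNCT")],
    ),
    AlignedSegment(
        source="The dog barks.",
        target="Le chien aboie.",
        source_pos=[("The", "DET"), ("dog", "NOUN"), ("barks", "VERB"), (".", "PUNCT")],
    ),
]


def tokens_with_tag(segments: List[AlignedSegment], tag: str) -> List[str]:
    """Return every source-language token annotated with the given tag."""
    return [tok for seg in segments for tok, t in seg.source_pos if t == tag]


nouns = tokens_with_tag(corpus, "NOUN")
```

A translation memory exported as source and target pairs could be loaded into a structure like this, with the part-of-speech layer added by a tagger of your choice.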

Research on monolingual corpora is used in language teaching, voice recognition, and terminology mining. Bilingual corpora are fundamental to training statistical machine translation (SMT) engines.
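As a rough illustration of terminology mining, the sketch below counts recurring two-word phrases in a tiny monolingual sample and keeps those that appear more than once. The function name `candidate_terms`, the stopword list, and the sample sentences are assumptions made for this example; production term extraction typically relies on much larger corpora and more careful statistics.

```python
import re
from collections import Counter

# A deliberately tiny stopword list, assumed for this example.
STOPWORDS = {"the", "a", "an", "of", "is", "in", "and", "to", "for", "are"}


def candidate_terms(texts, n=2, min_count=2):
    """Count n-word phrases that neither start nor end with a stopword."""
    counts = Counter()
    for text in texts:
        tokens = re.findall(r"[a-z]+", text.lower())
        for i in range(len(tokens) - n + 1):
            gram = tokens[i:i + n]
            if gram[0] in STOPWORDS or gram[-1] in STOPWORDS:
                continue
            counts[" ".join(gram)] += 1
    # Keep only phrases seen at least min_count times, most frequent first.
    return [(term, c) for term, c in counts.most_common() if c >= min_count]


# Toy monolingual sample, invented for illustration only.
sample = [
    "A translation memory stores aligned segments.",
    "The translation memory is a multilingual corpus.",
    "Machine translation engines are trained on a corpus.",
]

terms = candidate_terms(sample)
print(terms)  # [('translation memory', 2)]
```

Raising `min_count` or changing `n` trades recall for precision; on a real corpus the threshold would need tuning.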

Some of the largest freely available English corpora can be found online here.