Written by Steven Bussey
on February 20, 2020

A corpus (pl. corpora) is a large body of machine-readable text used for research purposes.

Corpora are either monolingual or multilingual. They often include extra information, such as part-of-speech tags or the alignment of corresponding segments across languages. Some corpora are kept private by their owners, while others are available for everyone to use free of charge. Large translation memories can also serve as multilingual corpora.
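As a rough illustration of the extra information mentioned above, a tiny bilingual corpus could be modelled as aligned segment pairs with optional part-of-speech tags. This is only a sketch: the `AlignedSegment` class, its field names, the tag labels, and the sample sentences are all hypothetical and do not follow any particular corpus format.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class AlignedSegment:
    """One source segment paired with its translation (hypothetical structure)."""
    source: str
    target: str
    # Optional (token, part-of-speech tag) pairs for the source segment.
    source_pos: List[Tuple[str, str]] = field(default_factory=list)


# A two-segment English-French toy corpus; only the first segment is tagged.
corpus = [
    AlignedSegment(
        source="The cat sleeps.",
        target="Le chat dort.",
        source_pos=[("The", "DET"), ("cat", "NOUN"), ("sleeps", "VERB")],
    ),
    AlignedSegment(source="Good morning.", target="Bonjour."),
]


def nouns(segments: List[AlignedSegment]) -> List[str]:
    """Collect every source token tagged as a noun."""
    return [tok for seg in segments for tok, tag in seg.source_pos if tag == "NOUN"]
```

Keeping the alignment and the tags on the same record means a query such as `nouns(corpus)` can be answered without re-processing the raw text.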

Research on monolingual corpora is used in language teaching, voice recognition, and terminology mining. Bilingual corpora are fundamental to training Statistical Machine Translation engines.
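To give a feel for terminology mining, here is a minimal sketch that counts frequent non-stop-words in a toy monolingual corpus. It assumes plain frequency counting as a stand-in for real term extraction, which typically relies on more sophisticated statistics; the `candidate_terms` function, the stop-word list, and the sample sentences are invented for illustration.

```python
import re
from collections import Counter

# A deliberately tiny, hypothetical stop-word list.
STOP_WORDS = {"the", "a", "an", "of", "is", "in", "and", "to", "for"}


def candidate_terms(texts, top_n=3):
    """Return the top_n most frequent non-stop-words as (word, count) pairs."""
    counts = Counter()
    for text in texts:
        # Lowercase the text and keep alphabetic tokens only.
        tokens = re.findall(r"[a-z]+", text.lower())
        counts.update(t for t in tokens if t not in STOP_WORDS)
    return counts.most_common(top_n)


sample = [
    "The translation memory stores segments.",
    "A translation memory is a database of segments.",
    "Segments in the memory are reused for translation.",
]

print(candidate_terms(sample))
```

On a real corpus the same idea would be applied to millions of words, and the candidates would usually be reviewed by a terminologist before being added to a glossary.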

Some of the largest freely available English corpora can be found online here.
