Informatics 1: Data & Analysis 2016/17 Lecture 12: Corpora
In computational linguistics and in theoretical linguistics, a corpus is a body of written or spoken text used for the study of a particular language or language variety. These corpora may be very large (billions of words) and provide the raw material for experimental investigation of real-world language use: the science of empirical linguistics.
This lecture briefly covers the aims and requirements of corpora, indicating how they are used and the considerations that go into building them: things like balancing and sampling; tokenization and annotation.
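As a rough illustration of tokenization, the sketch below splits a short text into word and punctuation tokens and counts how often each occurs. The `tokenize` function, its regular expression, and the sample sentence are assumptions made for this example, not material from the lecture; real corpus tools apply far more careful rules.

```python
import re
from collections import Counter

def tokenize(text):
    """Hypothetical tokenizer: lowercase the text, then return runs of
    word characters and individual punctuation marks as separate tokens."""
    return re.findall(r"\w+|[^\w\s]", text.lower())

# Made-up sample text standing in for a corpus.
sample = "The cat sat on the mat. The dog did not."

tokens = tokenize(sample)
counts = Counter(tokens)  # frequency of each token type

print(tokens)
print(counts.most_common(3))
```

Even this simple version forces decisions that a corpus builder has to make deliberately, such as whether to fold case and whether punctuation counts as a token.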