Building a Corpus


Definition: Corpus

(Translated from the TLF, with Wikipedia) A large, structured set of texts collected according to study criteria that may include exhaustiveness, domain or genre specificity, etc.

Theoretical considerations:

Relevance: The framework of the study (aims, approaches, ...) must be defined precisely.
Compliance: Define the requirements that ensure the corpus gives a realistic representation of the phenomenon under study, i.e., one that both matches the reality of the phenomenon and is regular enough to yield results.
Usability: The corpus must be structured, and it must be large enough to represent the phenomenon under study adequately. Statistical representativeness must be considered. If several phenomena are studied and compared, they must be equally represented in the corpus.
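
The equal-representation requirement can be checked mechanically once each document carries a category label. The following is a minimal sketch, not a prescribed method: the function names, the `tolerance` parameter, and the choice to count documents rather than words are all assumptions made for illustration.

```python
from collections import Counter


def category_shares(labels):
    """Return each category's share of the corpus as a fraction of all documents."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {category: n / total for category, n in counts.items()}


def is_balanced(labels, tolerance=0.1):
    """True if no category's share deviates from an equal split by more than `tolerance`.

    `tolerance` is a hypothetical threshold; choose it to suit the study.
    """
    shares = category_shares(labels)
    if not shares:
        return True  # an empty corpus has nothing to be unbalanced
    expected = 1 / len(shares)
    return all(abs(share - expected) <= tolerance for share in shares.values())
```

Counting words per category instead of documents may be more appropriate when documents differ greatly in length.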

Practical Work:

Always ask yourself:
Can I use an existing corpus? 

If you build your own corpus, keep the following issues in mind:
Assign consistent and relevant names to the files.
Keep track of the procedure used to build the corpus.
Before processing the corpus, compute descriptive statistics for both the corpus as a whole and its individual components: word counts and other information are useful (language, authors, keywords, categories, document type, etc.).
Make sure the study can be reproduced at a later time, so that parameters or methods can be compared.
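
The descriptive statistics step can be sketched as follows. This is an illustrative example under stated assumptions, not a reference implementation: the tokenizer is deliberately naive (any run of word characters counts as one word), and the function names, the `*.txt` pattern, and the UTF-8 encoding are hypothetical choices.

```python
import re
from pathlib import Path

# Naive tokenizer (an assumption): a run of word characters is one word.
WORD_RE = re.compile(r"\w+")


def tokenize(text):
    """Split text into lowercase word tokens."""
    return [token.lower() for token in WORD_RE.findall(text)]


def describe_document(text):
    """Return basic descriptive statistics for one document."""
    tokens = tokenize(text)
    return {"words": len(tokens), "distinct_words": len(set(tokens))}


def describe_corpus(documents):
    """Compute statistics per document and for the corpus as a whole.

    documents: dict mapping a file name to its text.
    Returns (per_document_stats, corpus_totals).
    """
    per_doc = {name: describe_document(text) for name, text in documents.items()}
    vocabulary = set()
    for text in documents.values():
        vocabulary.update(tokenize(text))
    totals = {
        "documents": len(documents),
        "words": sum(stats["words"] for stats in per_doc.values()),
        "distinct_words": len(vocabulary),
    }
    return per_doc, totals


def load_corpus(directory, pattern="*.txt", encoding="utf-8"):
    """Read every matching file in `directory` into a name -> text dict."""
    return {
        path.name: path.read_text(encoding=encoding)
        for path in sorted(Path(directory).glob(pattern))
    }
```

Typical use would be `per_doc, totals = describe_corpus(load_corpus("my_corpus"))`, where `my_corpus` is a placeholder directory name. Saving the resulting figures alongside the corpus also helps keep track of the building procedure and makes later runs comparable.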

