Definition: Corpus (TLF translated + Wikipedia)
- A large, structured set of texts collected according to various study
criteria, which may include exhaustiveness, domain or genre specificity,
etc.
Theoretical considerations:
- Relevance: The framework of the study (aims, approaches, ...) must be defined precisely.
- Compliance: Define the requirements ensuring that the corpus provides a realistic representation (i.e., one that both matches the reality of the phenomenon under study and exhibits sufficient regularity to yield results).
- Usability: The corpus needs to be structured, and it must be large enough to represent the phenomenon under study adequately. The issue of statistical representativeness must be considered. If several phenomena are studied and compared, they need to be equally represented within the corpus.
Practical Work:
Always ask yourself: can I use an existing corpus?
If building a corpus, keep the following issues in mind:
- Assign consistent and relevant names to the files.
- Keep track of the building procedure used.
- Before doing any processing on the corpus, compute descriptive statistics for the corpus as a whole and for each of its components: word counts and metadata (language, authors, keywords, categories, document type, etc.) are useful.
- Make sure the study can be reproduced at a later time, to allow parameters or methods to be compared.
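
The descriptive statistics mentioned above can be sketched in a few lines of Python. This is only a minimal illustration, not a prescribed method: the file names, the categories, the sample texts and the helper names (`tokenize`, `describe`) are all hypothetical, and the tokenizer is deliberately naive (a real study would choose one suited to its language and aims).

```python
import re
from collections import Counter

# Hypothetical in-memory corpus: file name -> (category, text).
# In practice the texts would be read from consistently named files.
corpus = {
    "news_001.txt": ("news", "The council voted on the new budget."),
    "news_002.txt": ("news", "Rain is expected over the weekend."),
    "blog_001.txt": ("blog", "I tried the new cafe and I loved it."),
}

def tokenize(text):
    """Lowercase the text and split it into word tokens (naive rule)."""
    return re.findall(r"\w+", text.lower())

def describe(corpus):
    """Return statistics for the whole corpus and for each document."""
    per_document = {}
    words_per_category = Counter()
    vocabulary = set()
    for name, (category, text) in corpus.items():
        tokens = tokenize(text)
        per_document[name] = {"category": category, "words": len(tokens)}
        words_per_category[category] += len(tokens)
        vocabulary.update(tokens)
    return {
        "documents": len(per_document),
        "words": sum(d["words"] for d in per_document.values()),
        "distinct_words": len(vocabulary),
        "words_per_category": dict(words_per_category),
        "per_document": per_document,
    }

stats = describe(corpus)
print(stats["words"], stats["words_per_category"])
```

Comparing the entries of `words_per_category` gives a quick first check of whether the phenomena to be compared are represented in similar proportions. Saving the resulting dictionary alongside the corpus also helps document the state of the corpus at the time of the study.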
Bibliography (in French):
- Définir un Corpus (général): excerpt from the thesis of B. Pincemin (1999)
- Construction et gestion des corpus (point de vue terminologique), E. Marshman, OLST 2003