Creation of Corpora and Language Data Retrieval: Methods, Models, Tools

Publisher: Vydavatelství Filozofické fakulty Univerzity Palackého v Olomouci
Place: Olomouc
Year: 2014

chapter 1: queries, freqlists
chapter 2: queries, tools, texts
chapter 3: queries, tools, texts, corpora
chapter 4: queries, commands, tools, scripts, texts
chapter 5: Manatee/Bonito - demo corpora, software


This book offers a systematic overview of technical processing of language data and effective data retrieval. It also presents the possibilities and means of creating one’s own text database (a language corpus).
Firstly, the basics of the Corpus Query Language (CQL) and elementary principles of corpus data extraction are introduced. Attention is also given to the basic methods of quantitative evaluation of corpus data, especially creating frequency lists and discovering collocations (and colligations) by means of common statistical tests, such as MI-score, t-score, Log-Likelihood, and Chi-square.
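As a rough illustration of two of the association measures named above, the following Python sketch computes the MI-score and t-score for a word pair from raw frequencies. It uses the formulas common in corpus linguistics (expected co-occurrence = f(node) × f(collocate) / N) and ignores the size of the collocation window; the function and parameter names are our own assumptions, not taken from the book or from any particular corpus tool.

```python
import math


def expected_cooccurrence(f_node, f_collocate, corpus_size):
    """Co-occurrence frequency expected if the two words were independent."""
    return f_node * f_collocate / corpus_size


def mi_score(f_pair, f_node, f_collocate, corpus_size):
    """Mutual information: log2 of observed over expected co-occurrence."""
    expected = expected_cooccurrence(f_node, f_collocate, corpus_size)
    return math.log2(f_pair / expected)


def t_score(f_pair, f_node, f_collocate, corpus_size):
    """t-score: (observed - expected) divided by the square root of observed."""
    expected = expected_cooccurrence(f_node, f_collocate, corpus_size)
    return (f_pair - expected) / math.sqrt(f_pair)


# Hypothetical counts: the pair occurs 4 times, each word 10 times,
# in a corpus of 100 tokens.
example_mi = mi_score(4, 10, 10, 100)
example_t = t_score(4, 10, 10, 100)
```

MI-score tends to favour rare, exclusive pairs, whereas t-score tends to favour frequent ones, which is one reason the two measures are usually read side by side.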
Considerable emphasis is placed on the technical aspects of corpus creation and text annotation, particularly on data formatting and character encoding, text segmentation, and the use of the Extensible Markup Language (XML) for annotation. These essential aspects of corpus database preparation are discussed both theoretically and with the help of practical illustrations. A good understanding of them is necessary not only for corpus creation, but also for more advanced corpus usage, for instance the use of regular expressions and CQL in more complex search masks. We specifically illustrate the variability of CQL queries, i.e. the fact that one query can be written in several ways in CQL, and the possible overgeneration of complex structured paradigms. Alternative querying methods are also presented, e.g. the use of extended sets of regular expressions (PCRE and POSIX metacharacters).
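The point that one query can be written in several ways can be sketched with plain regular expressions. The Python example below assumes, as is usual in Manatee-style CQL, that an attribute pattern such as the one in a hypothetical query `[word="colou?r"]` must match the whole token; `re.fullmatch` imitates that behaviour. The token list and the patterns are invented for illustration and are not taken from the book.

```python
import re

# Hypothetical token list standing in for the "word" attribute of a corpus.
tokens = ["color", "colour", "colours", "colored", "col", "colr"]

# Three differently written patterns intended to describe the same word forms.
patterns = ["colou?r", "color|colour", "colo(u)?r"]


def matching_tokens(pattern, token_list):
    """Return the tokens that the pattern matches in full, not just partially."""
    compiled = re.compile(pattern)
    return [token for token in token_list if compiled.fullmatch(token)]


results = [matching_tokens(pattern, tokens) for pattern in patterns]

for pattern, matched in zip(patterns, results):
    print(pattern, matched)
```

All three patterns select the same two tokens here, while a looser pattern such as `colou?r.*` would also return "colours" and "colored", a small-scale picture of overgeneration.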
An essential part of the book is devoted to corpus formats and XML, which is currently the most widely used standard for corpus database annotation. The basics of XML syntax and of the XML file layout are introduced, as well as the possibilities of corpus annotation in various formats. We highlight the principal connections between XML, as a method of database annotation, and CQL, as a search mask format, including the use of the so-called proximity operators.
Further, the book introduces selected computer programs for corpus data extraction, ranging from simple, single-purpose applications to complex corpus software tools. We primarily focus on the analysis of texts without linguistic annotation and demonstrate relatively simple ways of creating small corpora and extracting data from them, always providing a brief description of the tool used or the query language it implements. Extended functions of some concordance tools, especially AntConc and Xaira, are described as well, such as the creation of lemma lists, the display of the dispersion of expressions in texts, and the use of statistical tests for collocation and colligation searching.
The most technically demanding chapters deal with processing a text automatically into a structured database with the help of software tools and computer scripts. These chapters also introduce corpus compilation using the Manatee/Bonito system. The stages of computer data processing are presented one by one: the setup or conversion of the character encoding, the line-ending encoding, and the file format; segmentation or tokenization of the text; its processing into one of the corpus formats (e.g. the so-called vertical format); and annotation of various types and extent (particularly lemmatization and tagging). Since there is no graphical user interface for working with the necessary computer scripts, the essentials of command-line interface (CLI) operations (with various parameters) are also included in this technical part. We also illustrate the use of the CLI for certain basic corpus operations, such as keyword search, concordance generation, and compilation of frequency or alphabetical lists, all directly from the source data, without the need to import it into corpus software tools.
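Two of the steps described above, tokenization into a vertical format and the compilation of a frequency list directly from source text, can be sketched in a few lines of Python. This is a minimal illustration under our own assumptions, not the scripts distributed with the book: the tokenizer is a naive regular expression, the `<doc>` structure and its `id` attribute are hypothetical, and no lemmatization or tagging is performed.

```python
import re
from collections import Counter

# Naive tokenizer: runs of word characters, or single punctuation marks.
TOKEN_PATTERN = re.compile(r"\w+|[^\w\s]")


def tokenize(text):
    """Split raw text into word and punctuation tokens."""
    return TOKEN_PATTERN.findall(text)


def to_vertical(text, doc_id="doc1"):
    """Return the text in a simple vertical layout: one token per line,
    wrapped in a structural <doc> element (the attribute name is assumed)."""
    lines = [f'<doc id="{doc_id}">']
    lines.extend(tokenize(text))
    lines.append("</doc>")
    return "\n".join(lines)


def frequency_list(text):
    """Return (word, count) pairs sorted by descending frequency,
    ignoring case and skipping punctuation-only tokens."""
    words = [
        token.lower()
        for token in tokenize(text)
        if any(ch.isalnum() for ch in token)
    ]
    return Counter(words).most_common()


sample = "The cat saw the dog."
print(to_vertical(sample))
print(frequency_list(sample))
```

A real pipeline would add further columns to each token line (for example lemma and tag, usually separated by tabs) before the data is compiled by a corpus manager.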

This monograph is supplemented by a web repository, where readers can find a large number of related materials: software installation files, the computer scripts that we work with, and excerpts of source texts and example texts from the book.
Pořízka, Petr: Tvorba korpusů a vytěžování jazykových dat: metody, modely, nástroje. Vydavatelství Filozofické fakulty Univerzity Palackého, Olomouc 2014. 288 pp. ISBN 978-80-87895-17-7 (print); ISBN 978-80-87895-16-0 (iPDF)