How we administer and work with our corpora

Architecture

Our corpora are stored on GitHub, in two repositories per language: giellalt/corpus-xxx-orig holds the original files, and giellalt/corpus-xxx holds the processed files. Here xxx is the three-letter ISO 639-3 language code, for example sme for North Saami and nob for Norwegian Bokmål.
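The naming convention can be written down in a few lines. This is a minimal sketch for illustration, not part of our tooling: the helper name `corpus_repos` is hypothetical, and it only checks that the code looks like three lowercase letters, not that it is a registered ISO 639-3 code.

```python
import re


def corpus_repos(lang: str) -> dict[str, str]:
    """Return the two repository names for a language code.

    Hypothetical helper, for illustration only.
    """
    # Shape check only: three lowercase ASCII letters.
    if not re.fullmatch(r"[a-z]{3}", lang):
        raise ValueError(f"expected a 3-letter language code, got {lang!r}")
    return {
        "orig": f"giellalt/corpus-{lang}-orig",  # original files
        "processed": f"giellalt/corpus-{lang}",  # processed files
    }


print(corpus_repos("sme"))
```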

Here's a quick overview:

corpus-xxx-orig/          - original files (txt, pdf, html, doc, etc)
  ...<categories>/        - documents are sorted into subfolders, one folder
                            for each category (admin, facta, ficti, news, etc.)
    .../<document>        - The original document
    .../<document>.xsl    - Metadata for the document

corpus-xxx/               - processed files
  converted/              - texts extracted from the source files in corpus-xxx-orig;
                            each file is stored in xml format
    ...<categories>/      - subcategories, same structure as under 'corpus-xxx-orig'
       .../<document>.xml - Individual document in our internal xml format
  analysed/               - result of running analysis on converted files
    ...<categories>/
       ...<document>.xml  - Analysed document, in our internal xml format
  korp_mono/              - "Korp-ready files"
    ...<categories>/
      .../<document>.xml  - analysed document in Korp-ready xml format
  korp_tmx/               -
  tmx/                    - Files in TMX format, ready for further parallel
    <lang1>/                corpus processing
    <lang2>/
    <...langN>/           - one subfolder for each language
      ...<categories>/
        ...<document>.tmx - each document in tmx format
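Because the category folders are mirrored between the two repositories, the location of a converted file can be derived from the path of its original. The sketch below shows one way to do that. It assumes that `<document>` in the overview stands for the full original file name, extension included, so that `.xml` is appended rather than substituted. The function name `converted_path` and the example file name are hypothetical.

```python
from pathlib import PurePosixPath


def converted_path(orig_file: str) -> PurePosixPath:
    """Map a file in corpus-xxx-orig to its place under corpus-xxx/converted.

    Hypothetical helper; assumes ".xml" is appended to the full file name.
    """
    path = PurePosixPath(orig_file)
    parts = path.parts
    # Locate the repository directory, e.g. "corpus-sme-orig".
    idx = next(
        (i for i, part in enumerate(parts)
         if part.startswith("corpus-") and part.endswith("-orig")),
        None,
    )
    if idx is None or idx == len(parts) - 1:
        raise ValueError(f"not a file inside a corpus-xxx-orig folder: {orig_file}")
    repo = parts[idx][: -len("-orig")]  # "corpus-sme-orig" -> "corpus-sme"
    categories = parts[idx + 1 : -1]    # category subfolders are kept as-is
    return PurePosixPath(*parts[:idx], repo, "converted", *categories, path.name + ".xml")


print(converted_path("corpus-sme-orig/admin/example.html"))
```

The same mapping with `analysed` in place of `converted` would give the location of the analysed file.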

TMX (Translation Memory eXchange) is an XML format for storing translation units, that is, aligned text segments in two or more languages.

See the Wikipedia article: https://en.wikipedia.org/wiki/Translation_Memory_eXchange
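To make the format concrete, here is a small sketch that reads translation units from a TMX document using only the Python standard library. The sample document is hand-written for illustration: the header values and sentences are placeholders, not corpus data. The function name `translation_units` is hypothetical and is not one of our scripts.

```python
import xml.etree.ElementTree as ET

# The name under which ElementTree exposes the xml:lang attribute.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

# Hand-written sample; header values and sentences are placeholders.
SAMPLE_TMX = """<tmx version="1.4">
  <header srclang="nob" datatype="plaintext" segtype="sentence"
          adminlang="en" creationtool="example" creationtoolversion="0"
          o-tmf="none"/>
  <body>
    <tu>
      <tuv xml:lang="nob"><seg>Placeholder sentence, Norwegian side.</seg></tuv>
      <tuv xml:lang="sme"><seg>Placeholder sentence, North Saami side.</seg></tuv>
    </tu>
  </body>
</tmx>
"""


def translation_units(tmx_text: str) -> list[dict[str, str]]:
    """Return one {language: segment text} dict per <tu> element."""
    root = ET.fromstring(tmx_text)
    units = []
    for tu in root.iter("tu"):
        unit = {}
        for tuv in tu.findall("tuv"):
            seg = tuv.find("seg")
            text = "".join(seg.itertext()) if seg is not None else ""
            unit[tuv.get(XML_LANG, "")] = text
        units.append(unit)
    return units


for unit in translation_units(SAMPLE_TMX):
    print(unit)
```

Each `<tu>` holds one aligned pair (or group) of segments, with one `<tuv>` per language.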

What each script does

Script - Description
add_files_to_corpus - Copies original source files into corpus-xxx-orig and adds the .xsl metadata files. The corpus maintainer then adds any missing metadata about each document.
convert2xml - Reads documents in corpus-xxx-orig (or a subfolder or single file therein) and outputs an .xml file with the text extracted from each document. The script determines the file type automatically and uses a matching extractor to read the text.
analyse_corpus - Reads documents in corpus-xxx/converted and runs our language tools (hfst, etc.) on them. The resulting analysed xml documents are placed in corpus-xxx/analysed.
analyse_para - Analyses the sentence-aligned input files found in corpus-xxx/tmx and writes the output to corpus-xxx/tmx_analysed.
korp_mono - Reads documents in corpus-xxx/analysed and converts the cg3 analysis format into CWB input format, one .vrt file per document.
korp_para - Reads documents from corpus-xxx/tmx_analysed and converts the analysis format into CWB input format, one .vrt file per tmx document.
compile_cwb_mono - Concatenates the files in corpus-xxx/korp_mono into one .vrt file per genre, and runs the CWB tools on each of those to generate a CWB corpus (which is what Korp reads).
compile_cwb_para - Same as compile_cwb_mono, but for tmx_analysed.
parallelize - Sentence-aligns two parallel corpora and writes the output to corpus-xxx/tmx.
reparallelize - Uses information found in the input .tmx file to redo the sentence alignment (convert and parallelize). Useful when fixing misaligned .tmx files.
ccat - Prints plain text from an internal .xml file (an xml file produced by convert2xml or analyse_corpus).
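The table above also implies a running order, since each script reads what another one wrote. The sketch below records that data flow and follows it from a starting directory. It is an illustration only, not how the scripts are invoked. Directories the table does not state are marked `None`, and the output directory of korp_mono is an assumption inferred from the folder that compile_cwb_mono reads. compile_cwb_para is left out because its exact input folder is not spelled out.

```python
# Script -> (input directory, output directory), as read from the table.
# None marks a location the table does not state.
STAGES = {
    "convert2xml": ("corpus-xxx-orig", "corpus-xxx/converted"),
    "analyse_corpus": ("corpus-xxx/converted", "corpus-xxx/analysed"),
    # Output directory assumed: compile_cwb_mono reads corpus-xxx/korp_mono.
    "korp_mono": ("corpus-xxx/analysed", "corpus-xxx/korp_mono"),
    "compile_cwb_mono": ("corpus-xxx/korp_mono", None),
    "parallelize": (None, "corpus-xxx/tmx"),
    "analyse_para": ("corpus-xxx/tmx", "corpus-xxx/tmx_analysed"),
    "korp_para": ("corpus-xxx/tmx_analysed", None),
}


def scripts_from(directory: str) -> list[str]:
    """List the scripts in running order, following the data from `directory`."""
    order = []
    current = directory
    while current is not None:
        # Find the script that reads from the current directory, if any.
        step = next(
            (name for name, (src, _) in STAGES.items() if src == current),
            None,
        )
        if step is None:
            break
        order.append(step)
        current = STAGES[step][1]
    return order


print(scripts_from("corpus-xxx-orig"))
print(scripts_from("corpus-xxx/tmx"))
```

Starting from `corpus-xxx-orig` this follows the monolingual chain, and starting from `corpus-xxx/tmx` it follows the parallel chain.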