How we administer and work with our corpora

Architecture

Our corpora are stored on github, in two different repositories for each language. We have giellalt/corpus-xxx-orig for storing originals, and giellalt/corpus-xxx for storing processed files. The xxx is a 3-letter ISO-639-3 language code, for example sme for North Saami, and nob for Norwegian Bokmål.

Here's a quick overview:

corpus-xxx-orig/          - original files (txt, pdf, html, doc, etc)
  ...<categories>/        - documents are ordered in subfolder, one folder
                            for each category (admin, facta, ficti, news, etc...)
    .../<document>        - The original document
    .../<document>.xsl    - Metadata for the document

corpus-xxx/               - processed files
  converted/              - extracted texts from source files in corpus-xxx-orig
                            each file is stored in xml format
    ...<categories>/      - subcategores, same structure as under 'corpus-xxx-orig'
       .../<document>.xml - Individual document in our internal xml format
  analysed/               - result of running analysis on converted files
    ...<categories>/
       ...<document>.xml  - Analysed document, in our internal xml format
  korp_mono/              - "Korp-ready files"
    ...<categories>/
      .../<document>.xml  - analysed doucment in xml korp ready format
  korp_tmx/               -
  tmx/                    - Files in TMX format, ready for further parallel
    <lang1>/                corpus processing
    <lang2>/
    <...langN>/           - one subfolder for each language
      ...<categories>/
        ...<document>.tmx - each document in tmx format

TMX is an XML file for storing translation strings.

See the wikipedia article https://en.wikipedia.org/wiki/Translation_Memory_eXchange

What each script does

Script	Description
`add_files_to_corpus`	Copies original source files into `corpus-xxx-orig`. Adds the `.xsl` metadata files. The corpus maintainer will add missing metadata about each document.
`convert2xml`	Reads documents in `corpus-xxx-orig` (or some subfolder or single file therein), and outputs the `.xml` file of the extracted text from that document. The file automatically determines file type, and uses an extractor to read text from that filetype.
`analyse_corpus`	Reads documents in `corpus-xxx/converted`, and runs our language tools (`hfst, etc`) on them. The resulting analysed xml document is placed in `corpus-xxx/analysed`.
`analyse_para`	Analyses sentence-aligned input files found in `corpus-xxx/tmx` and outputs to `corpus-xxx/tmx_analysed`.
`korp_mono`	Reads documents in `corpus-xxx/analysed`, and converts the cg3-analysis format into a CWB-input format, one `.vrt` file per document.
`korp_para`	Reads documents from `corpus-xxx/tmx_analysed`, and converts the analysis format into CWB-input format, one `.vrt` file per tmx document.
`compile_cwb_mono`	Concatenates `korp_mono` (the files in `corpus-xxx/korp_mono`) into one `.vrt` file per genre, and runs CWB-tools on each of those to generate a CWB-corpus (which is what `Korp` reads).
`compile_cwb_para`	Ditto as mono, but for `tmx_analysed`
`parallelize`	Sentence-alignes two parallel corpora, outputs into `corpus-xxx/tmx`.
`reparallelize`	Uses information found in the input `.tmx` file, to re-do the sentence alignment (convert and parallelize). Useful when fixing mis-aligned `.tmx` files.
`ccat`	Prints plain text from an internal `.xml` (xml files produced by the `convert2xml` or `analyse_corpus`) file.