compile_cwb_mono.py
Concatenates all vrt-files in corpus-xxx/korp_mono/
(which is made by the
korp_mono
script), and run the Corpus WorkBench (CWB
) toolchain to
produce the final files needed for Korp (the files in the data/
and
registry/
CWB folders).
This is the last script in the process, and is very quick to run.
After running the script, .yaml
files need to be added to the Korp
backend configuration directory CORPUS_CONFIG/corpora
.
One .yaml file for each file in the created registry/
folder.
It contains information to Korp for how to present the corpus in the
web interface. Things like "description" and such go in there. Refer to the
Korp documentation for more information about that.
This only applies to v9 of Korp, as previously configuration was done
differently.
Some of these things could be done programmatically, but ultimately there are settings in there that we cannot determine from this script (at least not trivally), so automating it further is probably not worth it. As long as the documentation is good, it's really ok.
Basic usage
usage: compile_cwb_mono.py [-h] [--date DATE] [--cwb-binaries-dir CWB_BINARIES_DIR]
[--target TARGET] [--data-dir DATA_DIR] [--registry-dir REGISTRY_DIR]
directory
$ compile_cwb_mono --target TARGET KORP_MONO_DIRECTORY
Use --date
to set the date as it appears to CWB. It's optional, and today's
date will be used if not given.
Either (1) --target
or (2) both --data-dir
and --registry-dir
must
be given, even though the usage text of the program (the text that you get
when you run compile_cwb_mono --help
) suggests otherwise.
--data-dir
specifies where the CWB data/
directory resides, while
--registry-dir
specifies the CWB registry/
directory. If both the data/
and registry/
directory resides next to each other in the same parent directory,
it's easier to use --target
, which specifies the parent directory.
Options
If the script cannot find the CWB binaries (cwb-encode
, cwb-makeall
,
etc...), you can use --cwb-binaries-dir
to tell the script where they are
located.