analyse_corpus
Analyse converted corpus files.
analyse_corpus depends on these external programs:
- vislcg3
- hfst
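Before running an analysis it can help to confirm that both dependencies are reachable on PATH. The following is a minimal sketch, not part of analyse_corpus itself: the helper name missing_tools is hypothetical, and the executable names vislcg3 and hfst-tokenise are assumptions about what the two packages install on your system.

```shell
#!/bin/sh
# Hypothetical helper: print every command name from the argument list
# that cannot be found on PATH.
missing_tools() {
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
    done
}

# Assumed executable names; adjust them to match your installation.
missing=$(missing_tools vislcg3 hfst-tokenise)
if [ -n "$missing" ]; then
    printf 'Missing tools:\n%s\n' "$missing" >&2
fi
```

If nothing is printed, both assumed executables were found.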
Usage
To use this program, either install the nightly Giella packages or build the needed resources for each supported language (replace "sma" with "sme", "smj" and so on as needed):
cd $GTLANGS/langs-sma
Configure the language with at least these two options: --prefix=$HOME/.local and --enable-tokenisers
./configure --prefix=$HOME/.local --enable-tokenisers # add your own flags to taste
make install
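The build steps above can be repeated for several languages with a small wrapper. This is only a sketch under stated assumptions: the function name build_lang and the DRY_RUN switch are hypothetical, and it assumes each language lives in $GTLANGS/langs-<lang> as in the example above.

```shell
#!/bin/sh
# Hypothetical helper: configure and install one language.
# With DRY_RUN=1 the command line is printed instead of executed.
build_lang() {
    lang=$1
    dir="${GTLANGS:?GTLANGS must be set}/langs-$lang"
    if [ "${DRY_RUN:-0}" = 1 ]; then
        printf 'cd %s && ./configure --prefix=%s --enable-tokenisers && make install\n' \
            "$dir" "$HOME/.local"
        return 0
    fi
    # The subshell keeps the caller's working directory unchanged.
    (
        cd "$dir" &&
        ./configure --prefix="$HOME/.local" --enable-tokenisers &&
        make install
    )
}
```

A typical call would be: for lang in sma sme smj; do build_lang "$lang" || break; done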
Then you must convert the corpus files as explained in the convert2xml section.
When this is done you can analyse all files in the corpus repos:
analyse_corpus corpus-<lang>/converted # replace <lang> with your language code, e.g. sme, sma, mdf
The analysed files are written to corpus-<lang>/analysed
To analyse only one file, issue this command:
analyse_corpus --serial corpus-<lang>/converted/file.html.xml
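When several corpora need to be analysed, the documented --skip-existing and --ncpus options can be combined in a loop. The sketch below is an illustration only: the function name analyse_langs and the ANALYSER override are hypothetical, and it assumes the corpus-<lang> directories sit in the current working directory.

```shell
#!/bin/sh
# Hypothetical helper: analyse the converted files of each given language,
# skipping files that already have an analysed counterpart.
# ANALYSER defaults to analyse_corpus; setting ANALYSER=echo prints the
# arguments instead of running the analysis.
analyse_langs() {
    for lang in "$@"; do
        "${ANALYSER:-analyse_corpus}" --skip-existing --ncpus half \
            "corpus-$lang/converted" || return 1
    done
}
```

For example, analyse_langs sme sma would process corpus-sme/converted and then corpus-sma/converted, stopping at the first failure.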
The complete help text from the program:
usage: analyse_corpus [-h] [--version] [--ncpus NCPUS] [--skip-existing]
                      [--serial]
                      [-k {xfst,hfst,hfst_thirties,hfst_eighties,hfst_no_korp,trace-smegram-dev,trace-smegram}]
                      converted_entities [converted_entities ...]

Analyse files in parallel.

positional arguments:
  converted_entities    converted files or director(y|ies) where the converted
                        files exist

options:
  -h, --help            show this help message and exit
  --version             show program's version number and exit
  --ncpus NCPUS         The number of cpus to use. If unspecified, defaults to
                        using as many cpus as it can. Choose between 1-12,
                        some (3), half (6), most (9) or all (12).
  --skip-existing       Skip analysis of files thar are already analysed (that
                        already exist in the analysed/ folder
  --serial              When this argument is used files will be analysed one
                        by one.Using --serial takes priority over --ncpus
  -k {xfst,hfst,hfst_thirties,hfst_eighties,hfst_no_korp,trace-smegram-dev,trace-smegram}, --modename {xfst,hfst,hfst_thirties,hfst_eighties,hfst_no_korp,trace-smegram-dev,trace-smegram}
                        You can set the analyser pipeline explicitely if you
                        want.