ccat
Convert corpus format xml to clean text.
ccat has three usage modes. It prints to stdout the content of:
- converted files (produced by convert2xml)
- converted files containing errormarkup (produced by convert2xml)
- analysed files (produced by analyse_corpus)
Printing content of converted files to stdout
To print out all sme content of all the converted files found in $GTFREE/converted/sme/admin and its subdirectories, issue the command:
ccat -a -l sme $GTFREE/converted/sme/admin
It is also possible to print one file at a time:
ccat -a -l sme $GTFREE/converted/sme/admin/sd/other_files/vl_05_1.doc.xml
To print out the content of e.g. all converted pdf files found in a directory and its subdirectories, issue this command:
find converted/sme/science/ -name "*.pdf.xml" | xargs ccat -a -l sme
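The pipeline above splits file names on whitespace, so a path containing a space would reach ccat as two separate arguments. A minimal sketch of a null-delimited variant follows, assuming find and xargs support -print0 and -0 (the GNU and BSD versions do). The function name ccat_pdfs and the CCAT override variable are hypothetical helpers for illustration, not part of ccat.

```shell
# Sketch: null-delimited variant of the find | xargs pipeline.
# ccat_pdfs and CCAT are hypothetical names, not ccat features.
ccat_pdfs() {
    # $1: directory to search; all remaining arguments are passed to ccat
    search_dir=$1
    shift
    # -print0 / -0 keep file names with spaces or quotes intact
    find "$search_dir" -name "*.pdf.xml" -print0 | xargs -0 ${CCAT:-ccat} "$@"
}
```

With this helper, `ccat_pdfs converted/sme/science/ -a -l sme` should run the same ccat command as the pipeline above.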
Printing content of analysed files to stdout
The analysed files produced by analyse_corpus contain, among other things, one dependency element and one disambiguation element, which hold the dependency and disambiguation analyses of the original file's content.
ccat -dis sda/sda_2006_1_aikio1.pdf.xml
Prints the content of the disambiguation element.
ccat -dep sda/sda_2006_1_aikio1.pdf.xml
Prints the content of the dependency element.
The usage pattern for printing these elements is otherwise the same as printing the content of converted files.
Printing dependency elements
ccat -dep $GTFREE/analysed/sme/admin
ccat -dep $GTFREE/analysed/sme/admin/sd/other_files/vl_05_1.doc.xml
find analysed/sme/science/ -name "*.pdf.xml" | xargs ccat -dep
Printing disambiguation elements
ccat -dis $GTFREE/analysed/sme/admin
ccat -dis $GTFREE/analysed/sme/admin/sd/other_files/vl_05_1.doc.xml
find analysed/sme/science/ -name "*.pdf.xml" | xargs ccat -dis
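When the analysis of each file should end up in its own text file rather than in one stream, the commands above can be wrapped in a loop. The following is only a sketch: dump_analysis and the CCAT override variable are hypothetical names, the output naming scheme is an arbitrary choice, and files with the same base name in different subdirectories would overwrite each other.

```shell
# Sketch: write the -dis or -dep output of each analysed file to its own file.
# dump_analysis and CCAT are hypothetical names, not ccat features.
dump_analysis() {
    # $1: element flag (-dis or -dep)
    # $2: directory with analysed files
    # $3: directory that receives the text files
    mkdir -p "$3"
    # Reads one path per line, so file names containing newlines are not supported
    find "$2" -name "*.xml" | while IFS= read -r f; do
        out="$3/$(basename "$f" .xml)$1.txt"
        ${CCAT:-ccat} "$1" "$f" > "$out"
    done
}
```

For example, `dump_analysis -dep $GTFREE/analysed/sme/admin dep_out` would write files such as dep_out/vl_05_1.doc-dep.txt.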
Printing errormarkup content
This usage mode is used in the speller tests. Examples of this usage pattern are found in the makefiles in $GTBIG/prooftools.
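Outside the speller tests, the -typos option listed in the help text can also give a quick overview of the most frequent errors in a corpus. The sketch below assumes that -typos prints one error per line with its correction after a tab, as the help text describes. That output format has not been verified here, and top_typos and the CCAT override variable are hypothetical names.

```shell
# Sketch: rank the most frequent error/correction pairs.
# top_typos and CCAT are hypothetical names, not ccat features.
top_typos() {
    # $1: file or directory to scan
    # $2: number of pairs to show (default 10)
    ${CCAT:-ccat} -typos "$1" | sort | uniq -c | sort -rn | head -n "${2:-10}"
}
```

A call such as `top_typos $GTFREE/converted/sme/admin 20` would then list up to 20 pairs, each preceded by its count.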
The complete help text from the program
usage: ccat [-h] [--version] [-l LANG] [-T] [-L] [-t] [-a] [-c] [-C] [-ort]
            [-ortreal] [-morphsyn] [-syn] [-lex] [-format] [-foreign]
            [-noforeign] [-typos] [-f] [-S] [-dis] [-dep]
            [-hyph HYPH_REPLACEMENT]
            targets [targets ...]

Print the contents of a corpus in XML format The default is to print paragraphs
with no type (=text type).

positional arguments:
  targets               Name of the files or directories to process. If a
                        directory is given, all files in this directory and
                        its subdirectories will be listed.

optional arguments:
  -h, --help            show this help message and exit
  --version             show program's version number and exit
  -l LANG               Print only elements in language LANG. Default is all
                        languages.
  -T                    Print paragraphs with title type
  -L                    Print paragraphs with list type
  -t                    Print paragraphs with table type
  -a                    Print all text elements
  -c                    Print corrected text instead of the original typos &
                        errors
  -C                    Only print unclassified (§/<error..>) corrections
  -ort                  Only print ortoghraphic, non-word ($/<errorort..>)
                        corrections
  -ortreal              Only print ortoghraphic, real-word
                        (¢/<errorortreal..>) corrections
  -morphsyn             Only print morphosyntactic (£/<errormorphsyn..>)
                        corrections
  -syn                  Only print syntactic (¥/<errorsyn..>) corrections
  -lex                  Only print lexical (€/<errorlex..>) corrections
  -format               Only print format (‰/<errorformat..>) corrections
  -foreign              Only print foreign (∞/<errorlang..>) corrections
  -noforeign            Do not print anything from foreign (∞/<errorlang..>)
                        corrections
  -typos                Print only the errors/typos in the text, with
                        corrections tab-separated
  -f                    Add the source filename as a comment after each error
                        word.
  -S                    Print the whole text one word per line; typos have tab
                        separated corrections
  -dis                  Print the disambiguation element
  -dep                  Print the dependency element
  -hyph HYPH_REPLACEMENT
                        Replace hyph tags with the given argument