Skip to content

convert2xml

The convert2xml script runs corpustools.convertermanager:main.

Overview

Convert original files in a corpus to giellatekno/divvun xml format.

Dependencies

convert2xml depends on these external programs:

  • pdftotext
  • wvhtml

Usage

Convert all files in the directory $GTFREE/orig/sme and its subdirectories.

convert2xml $GTFREE/orig/sme

The converted files are placed in $GTFREE/converted/sme with the same directory structure as that in $GTFREE/orig/sme.

Convert only one file:

convert2xml $GTFREE/orig/sme/admin/sd/file1.html

The converted file is found in $GTFREE/orig/sme/admin/sd/file1.htm.xml

Convert all sme files in directories ending with corpus

convert2xml *corpus/orig/sme

If convert2xml is not able to convert a file these kinds of message will appear:

~/Dokumenter/corpus/freecorpus/orig/eng/admin/depts/regjeringen.no/calendar-for-the-ministry-of-children-an.html_id=308

A log file will be found in

~/Dokumenter/corpus/freecorpus/orig/eng/admin/depts/regjeringen.no/calendar-for-the-ministry-of-children-an.html_id=308.log

explaining what went wrong.

The complete help text from the program:

usage: convert2xml [-h] [--version] [--serial] [--lazy-conversion]
                   [--write-intermediate] [--goldstandard]
                   sources [sources ...]

Convert original files to giellatekno xml.

positional arguments:
  sources               The original file(s) or directory/ies where the
                        original files exist

optional arguments:
  -h, --help            show this help message and exit
  --version             show program's version number and exit
  --serial              use this for debugging the conversion process. When
                        this argument is used files will be converted one by
                        one.
  --lazy-conversion     Reconvert only if metadata have changed.
  --write-intermediate  Write the intermediate XML representation to
                        ORIGFILE.im.xml, for debugging the XSLT. (Has no
                        effect if the converted file already exists.)
  --goldstandard        Convert goldstandard and .correct files