ocrconverter
to_plaintext(path, language)
Convert a PDF containing OCR'd text to an iterable of text paragraphs.
Pick up the TIFF images created by to_tiff and use pytesseract to extract text from them.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `path` | `Path` | The path to the PDF file. | required |
| `language` | `str` | The language of the text in the PDF file. | required |
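The OCR step described above can be sketched roughly as follows. This is a minimal illustration, not the code in corpustools/ocrconverter.py: the helper names `split_paragraphs` and `ocr_tiff_pages` are hypothetical, and splitting paragraphs on blank lines is an assumption about how the OCR output is segmented. The only pytesseract call used is `image_to_string`, with the `lang` argument set to a Tesseract language code.

```python
import re
from pathlib import Path
from typing import Iterator


def split_paragraphs(raw_text: str) -> Iterator[str]:
    """Yield non-empty paragraphs, treating blank lines as separators (assumption)."""
    for block in re.split(r"\n\s*\n", raw_text):
        # Collapse line breaks and repeated spaces inside one paragraph.
        paragraph = " ".join(block.split())
        if paragraph:
            yield paragraph


def ocr_tiff_pages(tiff_paths: list[Path], language: str) -> Iterator[str]:
    """Hypothetical helper: OCR each TIFF page and yield its paragraphs."""
    # Imported lazily so split_paragraphs works without the OCR dependencies.
    import pytesseract
    from PIL import Image

    for tiff_path in sorted(tiff_paths):
        with Image.open(tiff_path) as image:
            raw_text = pytesseract.image_to_string(image, lang=language)
        yield from split_paragraphs(raw_text)
```

Because both helpers are generators, the caller gets an iterable of paragraphs without holding every page's text in memory at once.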
Source code in corpustools/ocrconverter.py
to_tiff(path)
Convert a PDF to a series of TIFF images.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `path` | `Path` | The path to the PDF file. | required |
Raises:
| Type | Description |
|---|---|
| `ConversionError` | If the conversion fails. |
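One way a PDF-to-TIFF step with this error contract might look is sketched below. The page above does not say which tool performs the conversion, so the use of poppler's `pdftoppm` is an assumption, as are the helper names `build_tiff_command` and `run_conversion` and the 300 dpi default. The `ConversionError` class here is a local stand-in for the exception named above.

```python
import subprocess
from pathlib import Path


class ConversionError(Exception):
    """Local stand-in for the ConversionError named in the table above."""


def build_tiff_command(pdf_path: Path, dpi: int = 300) -> list[str]:
    """Build an assumed pdftoppm call that writes one TIFF per page."""
    # pdftoppm names its output files from this prefix plus a page number.
    output_prefix = pdf_path.with_suffix("")
    return ["pdftoppm", "-tiff", "-r", str(dpi), str(pdf_path), str(output_prefix)]


def run_conversion(command: list[str]) -> None:
    """Run the external command and wrap any failure in ConversionError."""
    try:
        subprocess.run(command, check=True, capture_output=True)
    except (OSError, subprocess.CalledProcessError) as error:
        # OSError covers a missing executable; CalledProcessError a non-zero exit.
        raise ConversionError(f"{command[0]} failed: {error}") from error
```

Wrapping both failure modes in a single exception type lets callers handle a missing tool and a failed run in the same way.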
Source code in corpustools/ocrconverter.py
to_xml(path, language)
Convert a PDF containing OCR'd text to a Giella XML document.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `path` | `Path` | The path to the PDF file. | required |
| `language` | `str` | The language of the text in the PDF file. | required |
Returns:
| Type | Description |
|---|---|
| `_Element` | The XML document. |
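As a rough illustration of turning extracted paragraphs into an XML tree, the sketch below wraps each paragraph in a `p` element. It uses the standard library's `xml.etree.ElementTree` as a stand-in; the `_Element` return type suggests the real function builds an lxml tree, which is an inference rather than something stated here. The `document`/`body`/`p` layout, the `xml:lang` attribute, and the helper name `paragraphs_to_document` are all assumptions, not the confirmed Giella schema.

```python
import xml.etree.ElementTree as ET
from typing import Iterable

# Clark-notation name of the standard xml:lang attribute.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"


def paragraphs_to_document(paragraphs: Iterable[str], language: str) -> ET.Element:
    """Hypothetical helper: wrap paragraphs in an assumed document/body/p layout."""
    document = ET.Element("document", {XML_LANG: language})
    body = ET.SubElement(document, "body")
    for text in paragraphs:
        paragraph = ET.SubElement(body, "p")
        paragraph.text = text
    return document
```

The function accepts any iterable, so it can consume the paragraphs from to_plaintext directly without first collecting them into a list.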
Source code in corpustools/ocrconverter.py