htmlconverter
Convert html files to the Giella xml format.
HTMLError
Bases: Exception
Raise this error in this module.
Source code in corpustools/htmlconverter.py
remove_declared_encoding(content)
Remove the declared encoding.
lxml raises an error if it is given a decoded Unicode string that still carries an XML encoding declaration; see http://lxml.de/parsing.html#python-unicode-strings
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| content | str | the contents of a html document | required |
Returns:
| Type | Description |
|---|---|
| str | content sans the declared encoding |
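As a rough illustration of the idea, the sketch below strips a leading XML declaration that carries an `encoding` attribute from an already decoded string. It is not the code in `corpustools/htmlconverter.py`: the function name `strip_declared_encoding`, the constant `XML_ENCODING_DECLARATION`, and the regular expression are all assumptions made for this example.

```python
import re

# Hypothetical pattern (an assumption, not taken from corpustools):
# an XML declaration at the start of the document that has an encoding
# attribute, together with any whitespace that follows it.
XML_ENCODING_DECLARATION = re.compile(
    r"^\s*<\?xml\b[^>]*\bencoding\s*=\s*[\"'][^\"']+[\"'][^>]*\?>\s*",
    re.IGNORECASE,
)


def strip_declared_encoding(content: str) -> str:
    """Return content without a leading XML encoding declaration."""
    # count=1: only the declaration at the very start is removed.
    return XML_ENCODING_DECLARATION.sub("", content, count=1)


doc = '<?xml version="1.0" encoding="utf-8"?>\n<html><body><p>Hei</p></body></html>'
cleaned = strip_declared_encoding(doc)
```

Documents without such a declaration pass through unchanged, so the helper can be applied unconditionally before the string is handed to a parser.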
to_html_elt(filename)
Return the content of the html doc as an lxml HtmlElement.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| filename | Path | path to the webpage | required |
Returns:
| Type | Description |
|---|---|
| HtmlElement | The content of the webpage sent through lxml.html.html5parser. |