epubconverter
Convert epub documents to the Giella xml format.
Epub files are zip files that contain text in xhtml files. This class reads all xhtml files found in the archive. The body element of each file is converted to a div element and appended inside a new body element.
It is possible to filter away ranges of elements from this new xhtml file. Each range consists of a pair of xpath paths, specified inside the metadata file that belongs to the epub file.
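The merging step described above can be sketched roughly as follows. This is a minimal illustration, not the module's code: it uses the standard library's `xml.etree.ElementTree` in place of lxml, the names `merge_xhtml_bodies` and `make_demo_archive` are hypothetical, and files are read in sorted name order rather than in the reading order of the book.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

XHTML_NS = "http://www.w3.org/1999/xhtml"


def merge_xhtml_bodies(archive_bytes):
    """Collect every xhtml body in a zip archive as a div inside one new body."""
    html = ET.Element(f"{{{XHTML_NS}}}html")
    new_body = ET.SubElement(html, f"{{{XHTML_NS}}}body")
    with zipfile.ZipFile(io.BytesIO(archive_bytes)) as archive:
        # Assumption: sorted file names stand in for the real chapter order.
        for name in sorted(archive.namelist()):
            if not name.endswith(".xhtml"):
                continue
            tree = ET.fromstring(archive.read(name))
            body = tree.find(f"{{{XHTML_NS}}}body")
            if body is None:
                continue
            body.tag = f"{{{XHTML_NS}}}div"  # the old body becomes a div
            new_body.append(body)
    return html


def make_demo_archive():
    """Build a tiny in-memory zip with two xhtml files (hypothetical test data)."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        for number in (1, 2):
            archive.writestr(
                f"chapter{number}.xhtml",
                f'<html xmlns="{XHTML_NS}"><body><p>Chapter {number}</p></body></html>',
            )
    return buffer.getvalue()


merged = merge_xhtml_bodies(make_demo_archive())
divs = merged.findall(f"{{{XHTML_NS}}}body/{{{XHTML_NS}}}div")
```

Each former body ends up as one div child of the new body, so paths into the merged document can address a chapter by its div position.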
chapters(book, metadata)
Get all the linear chapters of the epub book.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
book | epub.Book | The epub book element | required |
Yields:
Type | Description |
---|---|
lxml.etree.Element | The body of an xhtml file found in the epub file. |
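In the EPUB format, the spine lists the reading order, and an entry marked `linear="no"` is auxiliary content outside that order. The filtering idea can be sketched with plain data structures. The function name `linear_items` and the shape of `spine` and `items` are assumptions made for illustration; they are not the epub library's API.

```python
from typing import Iterable, Iterator, Mapping, Tuple


def linear_items(
    spine: Iterable[Tuple[str, str]], items: Mapping[str, str]
) -> Iterator[str]:
    """Yield the documents of the spine entries that are in the linear reading order."""
    for item_id, linear in spine:
        if linear == "no":
            continue  # skip auxiliary content such as covers or notes
        yield items[item_id]


# Hypothetical spine: (item id, linear flag) pairs in reading order.
spine = [("cover", "no"), ("chap1", "yes"), ("chap2", "yes"), ("notes", "no")]
items = {
    "cover": "<body>cover</body>",
    "chap1": "<body>one</body>",
    "chap2": "<body>two</body>",
    "notes": "<body>notes</body>",
}
bodies = list(linear_items(spine, items))
```

Writing it as a generator means a caller can process one chapter at a time without holding the whole book in memory.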
Source code in /home/anders/projects/CorpusTools/corpustools/epubconverter.py
extract_content(filename, metadata)
Extract content from the epub file.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
filename | str | path to the document | required |
Returns:
Type | Description |
---|---|
lxml.etree.Element | the content of the epub file wrapped in an html element |
read_chapter(chapter)
Read the contents of an epub file chapter.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
chapter | epub.BookChapter | the chapter of an epub file | required |
Returns:
Type | Description |
---|---|
str | The contents of a chapter |
Raises:
Type | Description |
---|---|
util.ConversionException | on conversion error |
remove_first_element(path1, content)
Remove the first element in the range.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
path1 | str | path to the first element to remove. | required |
content | lxml.etree.Element | the xhtml document that should be altered. | required |
remove_range(path1, path2, content)
Remove a range of elements from an xhtml document.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
path1 | str | path to first element | required |
path2 | str | path to second element | required |
content | lxml.etree.Element | xhtml document | required |
remove_ranges(metadata, html)
Remove ranges of html elements.
remove_siblings_shorten_path(parts, content, preceding=False)
Remove all siblings before or after an element.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
parts | list of str | an xpath path split on / | required |
content | etree._Element | an xhtml document | required |
preceding | bool | When True, iterate through the preceding siblings of the found element, otherwise iterate through the following siblings. | False |
Returns:
Type | Description |
---|---|
list[str] | the path to the parent of parts. |
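The sibling removal can be sketched as below. This is a simplified stand-in, not the module's code: it uses `xml.etree.ElementTree` instead of lxml, leaves out xhtml namespaces, and the name `remove_siblings` is hypothetical. Because ElementTree elements have no parent pointer, the parent is looked up through the shortened path.

```python
import xml.etree.ElementTree as ET


def remove_siblings(parts, root, preceding=False):
    """Drop the siblings before or after the element at parts.

    parts is a path relative to root, already split on "/".
    Returns the path to the parent, which is parts without its last step.
    """
    target = root.find("/".join(parts))
    parent = root.find("/".join(parts[:-1])) if len(parts) > 1 else root
    children = list(parent)
    index = children.index(target)
    # Pick the siblings on the requested side of the target.
    doomed = children[:index] if preceding else children[index + 1:]
    for sibling in doomed:
        parent.remove(sibling)
    return parts[:-1]


doc = ET.fromstring("<body><div><p>a</p><p>b</p><p>c</p></div></body>")
parent_path = remove_siblings(["div", "p[2]"], doc)
remaining = [p.text for p in doc.iter("p")]

doc2 = ET.fromstring("<body><div><p>a</p><p>b</p><p>c</p></div></body>")
remove_siblings(["div", "p[2]"], doc2, preceding=True)
remaining2 = [p.text for p in doc2.iter("p")]
```

Returning the parent path lets a caller keep walking upwards, trimming one level of the tree per call.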
remove_trees_1(path1, path2, content)
Remove tree nodes that do not have the same parents.
While the parents of starts and ends are unequal (which means that starts and ends belong to different subtrees), remove the elements following starts and the elements preceding ends. Then shorten both paths to the parents of starts and ends, and remove more elements if needed. starts and ends are of equal length.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
path1 | str | path to first element | required |
path2 | str | path to second element | required |
content | etree._Element | xhtml document, where elements are removed. | required |
Returns:
Type | Description |
---|---|
tuple[list[str]] | paths to the new start and end element. |
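The loop described above can be sketched as follows, under some assumptions: the input is two already split paths of equal length rather than two path strings, `xml.etree.ElementTree` replaces lxml, namespaces are left out, and the names `drop_siblings` and `prune_until_same_parent` are hypothetical.

```python
import xml.etree.ElementTree as ET


def drop_siblings(parts, root, preceding):
    """Remove siblings on one side of the element at parts; return the parent path."""
    target = root.find("/".join(parts))
    parent = root.find("/".join(parts[:-1])) if len(parts) > 1 else root
    children = list(parent)
    index = children.index(target)
    for sibling in children[:index] if preceding else children[index + 1:]:
        parent.remove(sibling)
    return parts[:-1]


def prune_until_same_parent(starts, ends, root):
    """Trim both branches upwards until starts and ends share a parent.

    Precondition (assumed here): starts and ends have the same length.
    """
    while starts[:-1] != ends[:-1]:
        starts = drop_siblings(starts, root, preceding=False)  # after the start
        ends = drop_siblings(ends, root, preceding=True)  # before the end
    return starts, ends


doc = ET.fromstring(
    "<body>"
    "<div><p>first</p><p>gone</p></div>"
    "<div><p>gone</p><p>last</p></div>"
    "</body>"
)
starts, ends = prune_until_same_parent(["div[1]", "p[1]"], ["div[2]", "p[2]"], doc)
result = ET.tostring(doc, encoding="unicode")
```

In this example one round is enough: the two div elements share the body as parent, so the loop stops with both paths shortened to a single step.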
remove_trees_2(starts, ends, content)
Remove tree nodes that have the same parents.
Now that the parents of starts and ends are equal, remove the last trees of nodes between starts and ends (if necessary).
Parameters:
Name | Type | Description | Default |
---|---|---|---|
starts | list[str] | path to first element | required |
ends | list[str] | path to second element | required |
content | lxml.etree.Element | xhtml document, where elements are removed. | required |
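Once both paths share a parent, what remains is to remove the siblings lying between them. A minimal sketch, assuming that both endpoints themselves are kept at this stage, using `xml.etree.ElementTree` instead of lxml and the hypothetical name `remove_between`:

```python
import xml.etree.ElementTree as ET


def remove_between(starts, ends, root):
    """Remove the siblings strictly between the elements at starts and ends.

    Assumes starts and ends are split paths with the same parent.
    """
    first = root.find("/".join(starts))
    last = root.find("/".join(ends))
    parent = root.find("/".join(starts[:-1])) if len(starts) > 1 else root
    children = list(parent)
    # Slice out everything after the first element and before the last one.
    for sibling in children[children.index(first) + 1 : children.index(last)]:
        parent.remove(sibling)


doc = ET.fromstring("<body><p>start</p><p>x</p><div>y</div><p>end</p></body>")
remove_between(["p[1]"], ["p[3]"], doc)
result = ET.tostring(doc, encoding="unicode")
```

The paths are resolved before anything is removed, since positional steps such as `p[3]` would point elsewhere once earlier siblings are gone.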
shorten_longest_path(path1, path2, content)
Remove elements from the longest path.
If starts is longer than ends, remove the siblings following starts, then shorten starts by one step (going to the parent). If starts is still longer than ends, remove the siblings following the parent. This is repeated until starts and ends are of equal length.
If on the other hand ends is longer than starts, remove the siblings preceding ends, then shorten ends (going to its parent).
Parameters:
Name | Type | Description | Default |
---|---|---|---|
path1 | str | path to first element | required |
path2 | str | path to second element | required |
content | etree._Element | xhtml document, where elements are removed. | required |
Returns:
Type | Description |
---|---|
tuple[list[str]] | paths to the new start and end element, now with the same length. |
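The length-equalizing step can be sketched like this. It is an illustration under assumptions, not the module's code: `xml.etree.ElementTree` replaces lxml, namespaces are ignored, and `drop_siblings` and `equalize_paths` are hypothetical names.

```python
import xml.etree.ElementTree as ET


def drop_siblings(parts, root, preceding):
    """Remove siblings on one side of the element at parts; return the parent path."""
    target = root.find("/".join(parts))
    parent = root.find("/".join(parts[:-1])) if len(parts) > 1 else root
    children = list(parent)
    index = children.index(target)
    for sibling in children[:index] if preceding else children[index + 1:]:
        parent.remove(sibling)
    return parts[:-1]


def equalize_paths(path1, path2, root):
    """Shorten the longer path, trimming siblings, until both have equal length."""
    starts, ends = path1.split("/"), path2.split("/")
    while len(starts) > len(ends):
        # The start is deeper: drop what follows it, then climb one level.
        starts = drop_siblings(starts, root, preceding=False)
    while len(ends) > len(starts):
        # The end is deeper: drop what precedes it, then climb one level.
        ends = drop_siblings(ends, root, preceding=True)
    return starts, ends


doc = ET.fromstring(
    "<body>"
    "<div><p><b>keep</b><i>drop</i></p><p>drop too</p></div>"
    "<div>end</div>"
    "</body>"
)
starts, ends = equalize_paths("div[1]/p[1]/b", "div[2]", doc)
result = ET.tostring(doc, encoding="unicode")
```

Only one of the two loops ever runs for a given pair of paths, since at most one path can be the longer one.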
to_html_elt(filename)
Append all chapter bodies as divs to an html file.
Returns:
Type | Description |
---|---|
lxml.etree.Element | An etree.Element containing the content of all xhtml files found in the epub file as one xhtml document. |