epubconverter

Convert epub documents to the Giella xml format.

Epub files are zip archives that contain text in xhtml files. This module reads all the xhtml files found in the archive. The body element of each file is converted to a div element and appended inside a new body element.

It is possible to filter ranges of elements out of this new xhtml file. Each range is a pair of XPath paths, specified in the metadata file that belongs to the epub file.
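As a rough illustration of the first step, the sketch below reads the xhtml members of a zip archive and collects their bodies as divs inside a new body element. It is not the module's own code: the in-memory archive, the member names, and the helpers `make_demo_archive` and `xhtml_bodies` are assumptions made for the example, and the standard library's `xml.etree.ElementTree` stands in for lxml. A real epub declares its reading order in its package document, which this sketch ignores.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

XHTML = "{http://www.w3.org/1999/xhtml}"


def make_demo_archive():
    """Build a tiny in-memory zip that mimics an epub (hypothetical content)."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        for name, text in [("ch1.xhtml", "First"), ("ch2.xhtml", "Second")]:
            archive.writestr(
                name,
                '<html xmlns="http://www.w3.org/1999/xhtml">'
                f"<head/><body><p>{text}</p></body></html>",
            )
        # A non-xhtml member that the reader below should skip.
        archive.writestr("cover.jpg", b"not xhtml")
    buffer.seek(0)
    return buffer


def xhtml_bodies(archive_file):
    """Yield the body of every .xhtml member, renamed to a div."""
    with zipfile.ZipFile(archive_file) as archive:
        for name in archive.namelist():
            if name.endswith(".xhtml"):
                body = ET.fromstring(archive.read(name)).find(XHTML + "body")
                body.tag = XHTML + "div"
                yield body


new_body = ET.Element(XHTML + "body")
for div in xhtml_bodies(make_demo_archive()):
    new_body.append(div)
```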

chapters(book, metadata)

Get all the linear chapters of the epub book.

Parameters:

Name Type Description Default
book epub.Book

The epub book element

required
metadata xslsetter.MetadataHandler

The metadata of the epub file; chapters whose index is listed in its epub_excluded_chapters are skipped.

required

Yields:

Type Description
lxml.etree.Element

The body of an xhtml file found in the epub file.

Source code in /home/anders/projects/CorpusTools/corpustools/epubconverter.py
def chapters(book, metadata):
    """Get the all linear chapters of the epub book.

    Args:
        book (epub.Book): The epub book element

    Yields:
        (lxml.etree.Element): The body of an xhtml file found in the epub file.
    """
    excluded = metadata.epub_excluded_chapters
    for index, chapter in enumerate(book.chapters):
        if index not in excluded:
            chapterbody = read_chapter(chapter).find(
                "{http://www.w3.org/1999/xhtml}body"
            )
            chapterbody.tag = "{http://www.w3.org/1999/xhtml}div"
            yield chapterbody
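The filtering done by chapters can be tried in isolation. The following is a minimal sketch, assuming the chapters are already parsed documents and the excluded indexes are given as a set; `linear_bodies` and `demo_document` are hypothetical names, and `xml.etree.ElementTree` is used instead of lxml so that the example only needs the standard library.

```python
import xml.etree.ElementTree as ET

XHTML = "{http://www.w3.org/1999/xhtml}"


def linear_bodies(documents, excluded):
    """Yield each non-excluded document's body, renamed to a div."""
    for index, document in enumerate(documents):
        if index not in excluded:
            body = document.find(XHTML + "body")
            body.tag = XHTML + "div"
            yield body


def demo_document(text):
    """Parse a one-paragraph xhtml document (hypothetical test data)."""
    return ET.fromstring(
        '<html xmlns="http://www.w3.org/1999/xhtml">'
        f"<body><p>{text}</p></body></html>"
    )


documents = [demo_document(text) for text in ("cover", "chapter one", "colophon")]
# Skip the first and the last document, keep the one in the middle.
kept = list(linear_bodies(documents, excluded={0, 2}))
```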

extract_content(filename, metadata)

Extract content from the epub file.

Parameters:

Name Type Description Default
filename str

path to the document

required
metadata xslsetter.MetadataHandler

the metadata of the document

required

Returns:

Type Description
lxml.etree.Element

the content of the epub file wrapped in an html element

Source code in /home/anders/projects/CorpusTools/corpustools/epubconverter.py
def extract_content(filename, metadata):
    """Extract content from the epub file.

    Args:
        filename (str): path to the document
        metadata (xslsetter.MetadataHandler): the metadata of the document

    Returns:
        (lxml.etree.Element): the content of the epub file wrapped in an
            html element
    """
    mainbody = etree.Element("{http://www.w3.org/1999/xhtml}body")
    html = etree.Element("{http://www.w3.org/1999/xhtml}html")
    html.append(etree.Element("{http://www.w3.org/1999/xhtml}head"))
    html.append(mainbody)

    book = epub.Book(epub.open_epub(filename))

    for chapterbody in chapters(book, metadata):
        mainbody.append(chapterbody)

    return html
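The html skeleton that extract_content builds can be reproduced with the standard library alone. This is only a sketch under that assumption: `wrap_in_html` is a hypothetical helper, `xml.etree.ElementTree` replaces lxml, and the single div stands in for the chapter bodies.

```python
import xml.etree.ElementTree as ET

NS = "http://www.w3.org/1999/xhtml"
# Serialize the xhtml namespace as the default one instead of an ns0: prefix.
ET.register_namespace("", NS)


def wrap_in_html(divs):
    """Return an html element with an empty head and a body holding divs."""
    html = ET.Element(f"{{{NS}}}html")
    ET.SubElement(html, f"{{{NS}}}head")
    body = ET.SubElement(html, f"{{{NS}}}body")
    for div in divs:
        body.append(div)
    return html


div = ET.Element(f"{{{NS}}}div")
ET.SubElement(div, f"{{{NS}}}p").text = "Hello"
document = wrap_in_html([div])
serialized = ET.tostring(document, encoding="unicode")
```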

read_chapter(chapter)

Read the contents of an epub file chapter.

Parameters:

Name Type Description Default
chapter epub.BookChapter

the chapter of an epub file

required

Returns:

Type Description
lxml.etree.Element

The parsed contents of the chapter

Raises:

Type Description
util.ConversionError

on conversion error

Source code in /home/anders/projects/CorpusTools/corpustools/epubconverter.py
def read_chapter(chapter):
    """Read the contents of a epub_file chapter.

    Args:
        chapter (epub.BookChapter): the chapter of an epub file

    Returns:
        (str): The contents of a chapter

    Raises:
        util.ConversionException: on conversion error
    """
    try:
        return etree.fromstring(chapter.read())
    except KeyError as error:
        raise util.ConversionError(error)
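The error translation in read_chapter can be sketched with a plain zip archive. The example assumes that the KeyError signals a missing archive member, as it does for `zipfile.ZipFile.read`; the `ConversionError` class and `read_member` are hypothetical stand-ins, not the project's own names.

```python
import io
import zipfile
import xml.etree.ElementTree as ET


class ConversionError(Exception):
    """Hypothetical stand-in for util.ConversionError."""


def read_member(archive, name):
    """Parse one archive member; a missing member becomes ConversionError."""
    try:
        return ET.fromstring(archive.read(name))
    except KeyError as error:
        # ZipFile.read raises KeyError when the archive has no such member.
        raise ConversionError(f"{name}: {error}") from error


buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as writer:
    writer.writestr("ch1.xhtml", "<html><body><p>Hi</p></body></html>")
buffer.seek(0)

with zipfile.ZipFile(buffer) as archive:
    root = read_member(archive, "ch1.xhtml")
    try:
        read_member(archive, "missing.xhtml")
        failed = False
    except ConversionError:
        failed = True
```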

remove_first_element(path1, content)

Remove the first element in the range.

Parameters:

Name Type Description Default
path1 str

path to the first element to remove.

required
content lxml.etree.Element

the xhtml document that should be altered.

required
Source code in /home/anders/projects/CorpusTools/corpustools/epubconverter.py
def remove_first_element(path1, content):
    """Remove the first element in the range.

    Args:
        path1 (str): path to the first element to remove.
        content (lxml.etree.Element): the xhtml document that should
            be altered.
    """
    first_start = content.find(
        path1, namespaces={"html": "http://www.w3.org/1999/xhtml"}
    )
    first_start.getparent().remove(first_start)
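The lookup with a prefix map followed by removal from the parent can be sketched as below. It assumes `xml.etree.ElementTree` instead of lxml, so the parent has to be looked up in a hand-built map because ElementTree elements have no getparent(); `remove_by_path` and the sample path are hypothetical.

```python
import xml.etree.ElementTree as ET

NAMESPACES = {"html": "http://www.w3.org/1999/xhtml"}


def remove_by_path(path, content):
    """Remove the first element matching path from content."""
    # ElementTree has no getparent(), so build a child -> parent lookup first.
    parents = {child: parent for parent in content.iter() for child in parent}
    found = content.find(path, namespaces=NAMESPACES)
    parents[found].remove(found)


content = ET.fromstring(
    '<html xmlns="http://www.w3.org/1999/xhtml"><body>'
    "<div><p>drop</p><p>keep</p></div>"
    "</body></html>"
)
# find() returns the first match, so only the first paragraph is removed.
remove_by_path("html:body/html:div/html:p", content)
remaining = [p.text for p in content.iter("{http://www.w3.org/1999/xhtml}p")]
```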

remove_range(path1, path2, content)

Remove a range of elements from an xhtml document.

Parameters:

Name Type Description Default
path1 str

path to first element

required
path2 str

path to second element; when empty, only the element at path1 is removed

required
content lxml.etree.Element

xhtml document

required
Source code in /home/anders/projects/CorpusTools/corpustools/epubconverter.py
def remove_range(path1, path2, content):
    """Remove a range of elements from an xhtml document.

    Args:
        path1 (str): path to first element
        path2 (str): path to second element; when empty, only the
            element at path1 is removed
        content (lxml.etree.Element): xhtml document
    """
    if path2:
        starts, ends = remove_trees_1(path1, path2, content)
        remove_trees_2(starts, ends, content)
        remove_first_element(path1, content)
    else:
        found = content.find(path1, namespaces={"html": "http://www.w3.org/1999/xhtml"})
        found.getparent().remove(found)

remove_ranges(metadata, html)

Remove ranges of html elements.

Source code in /home/anders/projects/CorpusTools/corpustools/epubconverter.py
def remove_ranges(metadata, html):
    """Remove ranges of html elements.
    """
    if metadata.skip_elements:
        for pairs in metadata.skip_elements:
            remove_range(pairs[0], pairs[1], html)

remove_siblings_shorten_path(parts, content, preceding=False)

Remove all siblings before or after an element.

Parameters:

Name Type Description Default
parts list of str

an XPath path split on /

required
content etree._Element

an xhtml document

required
preceding bool

When True, iterate through the preceding siblings of the found element, otherwise iterate through the following siblings.

False

Returns:

Type Description
list[str]

the path to the parent of parts.

Source code in /home/anders/projects/CorpusTools/corpustools/epubconverter.py
def remove_siblings_shorten_path(parts, content, preceding=False):
    """Remove all siblings before or after an element.

    Args:
        parts (list of str): an XPath path split on /
        content (etree._Element): an xhtml document
        preceding (bool): When True, iterate through the preceding siblings
            of the found element, otherwise iterate through the following
            siblings.

    Returns:
        (list[str]): the path to the parent of parts.
    """
    path = "/".join(parts)
    found = content.find(path, namespaces={"html": "http://www.w3.org/1999/xhtml"})
    parent = found.getparent()
    for sibling in found.itersiblings(preceding=preceding):
        parent.remove(sibling)

    return parts[:-1]
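The sibling pruning can be tried on a small tree. The sketch below is an approximation that assumes `xml.etree.ElementTree`, which has no itersiblings(), so the siblings are taken from the parent's child list instead; `remove_siblings` and `names` are hypothetical helpers.

```python
import xml.etree.ElementTree as ET


def remove_siblings(parent, element, preceding=False):
    """Remove the siblings before (preceding=True) or after element."""
    children = list(parent)
    position = children.index(element)
    doomed = children[:position] if preceding else children[position + 1:]
    for sibling in doomed:
        parent.remove(sibling)


def names(parent):
    """Return the tag names of the children of parent."""
    return [child.tag for child in parent]


# Default: drop everything that follows <b/>.
after = ET.fromstring("<div><a/><b/><c/><d/></div>")
remove_siblings(after, after.find("b"))

# preceding=True: drop everything that comes before <c/>.
before = ET.fromstring("<div><a/><b/><c/><d/></div>")
remove_siblings(before, before.find("c"), preceding=True)
```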

remove_trees_1(path1, path2, content)

Remove tree nodes that do not have the same parents.

While the parents in starts and ends are unequal (that means that starts and ends belong in different trees), remove elements following starts and preceding ends. Shorten the path to the parents of starts and ends and remove more elements if needed. starts and ends are of equal length.

Parameters:

Name Type Description Default
path1 str

path to first element

required
path2 str

path to second element

required
content etree._Element

xhtml document, where elements are removed.

required

Returns:

Type Description
tuple[list[str]]

paths to the new start and end element.

Source code in /home/anders/projects/CorpusTools/corpustools/epubconverter.py
def remove_trees_1(path1, path2, content):
    """Remove tree nodes that do not have the same parents.

    While the parents in starts and ends are unequal (that means that
    starts and ends belong in different trees), remove elements
    following starts and preceding ends. Shorten the path to the parents
    of starts and ends and remove more elements if needed. starts and
    ends are of equal length.

    Args:
        path1 (str): path to first element
        path2 (str): path to second element
        content (etree._Element): xhtml document, where elements are
            removed.

    Returns:
        (tuple[list[str]]): paths to the new start and end element.
    """
    starts, ends = shorten_longest_path(path1, path2, content)

    while starts[:-1] != ends[:-1]:
        starts = remove_siblings_shorten_path(starts, content)
        ends = remove_siblings_shorten_path(ends, content, preceding=True)

    return starts, ends
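To see how the two paths converge on a common parent, the loop can be run on the path parts alone, without touching a document. This is a sketch under assumptions: `climb_to_common_parent` is a hypothetical helper that only records which elements would have their siblings pruned, and the two paths are invented examples of equal length.

```python
def climb_to_common_parent(starts, ends):
    """Shorten both paths until their parents match; record the pruning steps."""
    pruned = []
    while starts[:-1] != ends[:-1]:
        # The real function removes the siblings following starts ...
        pruned.append(("following", "/".join(starts)))
        # ... and the siblings preceding ends, before stepping up one level.
        pruned.append(("preceding", "/".join(ends)))
        starts, ends = starts[:-1], ends[:-1]
    return starts, ends, pruned


starts = ".//html:body/html:div[1]/html:p[3]".split("/")
ends = ".//html:body/html:div[2]/html:p[1]".split("/")
new_starts, new_ends, pruned = climb_to_common_parent(starts, ends)
```

With these inputs the loop runs once: the two paragraphs have different parents, while the two divs they sit in share the body element.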

remove_trees_2(starts, ends, content)

Remove tree nodes that have the same parents.

Now that the parents of starts and ends are equal, remove the last trees of nodes between starts and ends (if necessary).

Parameters:

Name Type Description Default
starts list[str]

path to first element

required
ends list[str]

path to second element

required
content lxml.etree.Element

xhtml document, where elements are removed.

required
Source code in /home/anders/projects/CorpusTools/corpustools/epubconverter.py
def remove_trees_2(starts, ends, content):
    """Remove tree nodes that have the same parents.

    Now that the parents of starts and ends are equal, remove the last
    trees of nodes between starts and ends (if necessary).

    Args:
        starts (list[str]): path to first element
        ends (list[str]): path to second element
        content (lxml.etree.Element): xhtml document, where elements are
            removed.
    """
    deepest_start = content.find(
        "/".join(starts), namespaces={"html": "http://www.w3.org/1999/xhtml"}
    )
    deepest_end = content.find(
        "/".join(ends), namespaces={"html": "http://www.w3.org/1999/xhtml"}
    )
    parent = deepest_start.getparent()
    for sibling in deepest_start.itersiblings():
        if sibling == deepest_end:
            break
        else:
            parent.remove(sibling)
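The final step, removing what lies between two siblings, can be sketched with the standard library. The example assumes that both elements are children of the same parent and that the end element follows the start element; `remove_between` is a hypothetical helper built on `xml.etree.ElementTree` rather than lxml.

```python
import xml.etree.ElementTree as ET


def remove_between(parent, start, end):
    """Remove the children of parent that lie strictly between start and end."""
    children = list(parent)
    first = children.index(start) + 1
    last = children.index(end)
    for sibling in children[first:last]:
        parent.remove(sibling)


body = ET.fromstring("<body><h1/><p/><p/><table/><h2/></body>")
remove_between(body, body.find("h1"), body.find("h2"))
left = [child.tag for child in body]
```

Both boundary elements survive this step; in remove_range the start element is removed afterwards by remove_first_element.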

shorten_longest_path(path1, path2, content)

Remove elements from the longest path.

If starts is longer than ends, remove the siblings following starts, then shorten starts by one step (going to the parent). If starts is still longer than ends, remove the siblings following the parent. This is done until starts and ends are of equal length.

If on the other hand ends is longer than starts, remove the siblings preceding ends, then shorten ends (going to its parent).

Parameters:

Name Type Description Default
path1 str

path to first element

required
path2 str

path to second element

required
content etree._Element

xhtml document, where elements are removed.

required

Returns:

Type Description
tuple[list[str]]

paths to the new start and end element, now with the same length.

Source code in /home/anders/projects/CorpusTools/corpustools/epubconverter.py
def shorten_longest_path(path1, path2, content):
    """Remove elements from the longest path.

    If starts is longer than ends, remove the siblings following starts,
    then shorten starts by one step (going to the parent). If starts is
    still longer than ends, remove the siblings following the parent. This
    is done until starts and ends are of equal length.

    If on the other hand ends is longer than starts, remove the siblings
    preceding ends, then shorten ends (going to its parent).

    Args:
        path1 (str): path to first element
        path2 (str): path to second element
        content (etree._Element): xhtml document, where elements are
            removed.

    Returns:
        (tuple[list[str]]): paths to the new start and end element, now
            with the same length.
    """
    starts, ends = path1.split("/"), path2.split("/")

    while len(starts) > len(ends):
        starts = remove_siblings_shorten_path(starts, content)

    while len(ends) > len(starts):
        ends = remove_siblings_shorten_path(ends, content, preceding=True)

    return starts, ends
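The length equalization can likewise be followed on the path parts alone. A minimal sketch, assuming invented paths and a hypothetical helper `equalize` that records the pruning steps instead of altering a document:

```python
def equalize(path1, path2):
    """Shorten the longer path until both have the same number of parts."""
    starts, ends = path1.split("/"), path2.split("/")
    pruned = []
    while len(starts) > len(ends):
        # The real function removes the siblings following this element.
        pruned.append(("following", "/".join(starts)))
        starts = starts[:-1]
    while len(ends) > len(starts):
        # The real function removes the siblings preceding this element.
        pruned.append(("preceding", "/".join(ends)))
        ends = ends[:-1]
    return starts, ends, pruned


starts, ends, pruned = equalize(
    ".//html:div[1]/html:ul/html:li[2]", ".//html:div[3]"
)
```

Here the start path is two steps deeper than the end path, so two pruning steps are recorded on the start side and none on the end side.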

to_html_elt(filename)

Append all chapter bodies as divs to an html file.

Returns:

Type Description
lxml.etree.Element

An etree.Element containing the content of all xhtml files found in the epub file as one xhtml document.

Source code in /home/anders/projects/CorpusTools/corpustools/epubconverter.py
def to_html_elt(filename):
    """Append all chapter bodies as divs to an html file.

    Returns:
        (lxml.etree.Element): An etree.Element containing the content of
            all xhtml files found in the epub file as one xhtml document.
    """
    metadata = xslsetter.MetadataHandler(filename + ".xsl", create=True)
    html = extract_content(filename, metadata)
    try:
        remove_ranges(metadata, html)
    except AttributeError:
        raise util.ConversionError(
            "Check that skip_elements in the "
            "metadata file has the correct format"
        )

    return html