sentencedivider
Classes and functions to sentence-align two files.
SentenceDivider
A class to divide plain text output into sentences.
Uses hfst-tokenise as the tokenisation engine.
Attributes:
Name | Type | Description |
---|---|---|
`stops` | `list[str]` | tokens that imply where a sentence ends. |
`lang` | `str` | three character language code |
`relative_path` | `str` | relative path to where files needed by modes.xml are found. |
`tokeniser` | `modes.Pipeline` | tokeniser pipeline |
Source code in /home/anders/projects/CorpusTools/corpustools/sentencedivider.py
__init__(lang, giella_prefix=None)
Set the files needed by the tokeniser.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
`lang` | `str` | language the analyser can tokenise | required |
make_sentences(tokenised_output)
Turn ccat output into cleaned up sentences.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
`tokenised_output` | `str` | plain text output of ccat. | required |
Yields:
Type | Description |
---|---|
`str` | a cleaned up sentence |
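As a rough illustration of how a list of stop tokens can drive sentence division, here is a minimal sketch. It is not the library's implementation: the function name `sketch_make_sentences`, the stop list `EXAMPLE_STOPS`, and the assumption that the input holds one token per line are all hypothetical and may not match what hfst-tokenise actually emits.

```python
from collections.abc import Iterator

# Hypothetical stop tokens, for illustration only; the real `stops`
# attribute of SentenceDivider may contain other values.
EXAMPLE_STOPS = [".", "?", "!"]


def sketch_make_sentences(tokenised_output: str, stops: list[str]) -> Iterator[str]:
    """Yield one space-joined sentence per run of tokens ending in a stop.

    Assumes one token per line, which is a simplification.
    """
    tokens: list[str] = []
    for line in tokenised_output.splitlines():
        token = line.strip()
        if not token:
            continue  # skip blank lines
        tokens.append(token)
        if token in stops:
            # A stop token closes the current sentence.
            yield " ".join(tokens)
            tokens = []
    if tokens:
        # Trailing tokens that were never closed by a stop token.
        yield " ".join(tokens)


example = "First\nsentence\n.\n\nSecond\none\n?\nleftover"
sentences = list(sketch_make_sentences(example, EXAMPLE_STOPS))
```

Like `make_sentences`, the sketch is a generator, so a caller that needs all sentences at once has to collect them, for example with `list(...)`.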
make_valid_sentences(ccat_output)
Turn ccat output into full sentences.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
`ccat_output` | `str` | the plain text output of ccat | required |
Returns:
Type | Description |
---|---|
`list[str]` | the ccat output as a list of full sentences |
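Unlike `make_sentences`, which yields sentences one at a time, `make_valid_sentences` returns a complete `list[str]`. The sketch below shows that pattern of draining an iterable of candidate sentences into a filtered list. The function name `sketch_valid_sentences` and the validity test (at least one letter or digit) are hypothetical stand-ins; this page does not say which rule the library applies.

```python
from collections.abc import Iterable


def sketch_valid_sentences(sentences: Iterable[str]) -> list[str]:
    """Return stripped sentences that contain at least one letter or digit.

    The validity test is a hypothetical stand-in, not the library's rule.
    """
    return [
        stripped
        for stripped in (sentence.strip() for sentence in sentences)
        if any(char.isalnum() for char in stripped)
    ]


# A generator stands in for the output of a sentence-yielding method.
candidates = (s for s in ["  A full sentence. ", "", "...", "Another one?"])
valid = sketch_valid_sentences(candidates)
```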
to_plain_text(file_path)
Turn an xml formatted file into clean text.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
`file_path` | `CorpusPath` | The path to the file | required |
Raises:
Type | Description |
---|---|
`UserWarning` | raised if the file contains no text |
Returns:
Type | Description |
---|---|
`str` | the content of ccat output |
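Because `to_plain_text` signals an empty document by raising `UserWarning` rather than returning an empty string, callers that process many files may want to catch it and move on. The following sketch shows one way to do that. The helpers `sketch_require_text` and `plain_text_or_none` are hypothetical and only mimic the documented behaviour; they do not call the library.

```python
from typing import Optional


def sketch_require_text(text: str, source: str) -> str:
    """Return text unchanged, or raise UserWarning if it has no content."""
    if not text.strip():
        # UserWarning is an Exception subclass, so it can be raised and caught.
        raise UserWarning(f"Empty file {source}")
    return text


def plain_text_or_none(text: str, source: str) -> Optional[str]:
    """Return the text, or None when the empty-text warning is raised."""
    try:
        return sketch_require_text(text, source)
    except UserWarning as error:
        print(f"Skipping: {error}")
        return None
```

Returning `None` is one possible design choice; logging the warning and re-raising it is another, depending on whether an empty file should stop the run.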