languagedetector
This file contains classes to fix converted documents.
LanguageDetector
Detect and set the languages of a document.
Source code in /home/anders/projects/CorpusTools/corpustools/languagedetector.py
inlangs
property
Return the predefined possible languages of the document.
mainlang
property
Return the main language (mainlang) of the file.
__init__(document, language_guesser)
Initialise the LanguageDetector class.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
document | etree.Element | An etree element. | required |
language_guesser | text_cat.Classifier | A text_cat.Classifier. | required |
detect_language()
Detect language in all the paragraphs in self.document.
remove_quote(paragraph)
staticmethod
Extract all text from the paragraph except the text inside quotes.
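A minimal sketch of the idea, not the CorpusTools implementation. It assumes quotes are marked up as `<span type="quote">` (the element name and attribute are assumptions, since this page does not show them), and it uses the standard library `xml.etree.ElementTree`. The function name `text_outside_quotes` is hypothetical.

```python
import xml.etree.ElementTree as ET


def text_outside_quotes(paragraph):
    """Collect the paragraph text, skipping text directly inside quote spans.

    Assumption: quotes are marked up as <span type="quote">.
    """
    parts = []
    for element in paragraph.iter():
        is_quote = element.tag == "span" and element.get("type") == "quote"
        # The text of a quote span is skipped.
        if not is_quote and element.text:
            parts.append(element.text)
        # The tail of a child element belongs to the surrounding paragraph,
        # so it is kept even when the child is a quote span.
        if element is not paragraph and element.tail:
            parts.append(element.tail)
    return "".join(parts)


paragraph = ET.fromstring(
    '<p>She said <span type="quote">bonjour</span> and left.</p>'
)
print(repr(text_outside_quotes(paragraph)))
```

Only the text directly inside a quote span is skipped here; elements nested inside a quote would need extra handling.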
set_paragraph_language(paragraph)
Set xml:lang of paragraph.
Extract the text outside the quotes and use it to set the language of the paragraph. Then set the language of each quote in the paragraph.
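The two steps described above can be sketched as follows. This is an illustration under stated assumptions, not the library's code: quotes are assumed to be direct `<span type="quote">` children of the paragraph, `guess_language` is a hypothetical stand-in for a real classifier such as `text_cat.Classifier`, and `tag_paragraph` is a made-up name.

```python
import xml.etree.ElementTree as ET

# ElementTree addresses the xml:lang attribute through the XML namespace URI.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"


def guess_language(text):
    """Hypothetical stand-in for a real language classifier."""
    return "fra" if "bonjour" in text.lower() else "eng"


def tag_paragraph(paragraph, guess):
    """Set xml:lang on the paragraph and on each of its quote spans."""
    outside = [paragraph.text or ""]
    for child in paragraph:
        if child.tag == "span" and child.get("type") == "quote":
            # Each quote is classified on its own text.
            child.set(XML_LANG, guess("".join(child.itertext())))
        else:
            # Non-quote children count as ordinary paragraph text.
            outside.append("".join(child.itertext()))
        # The tail always belongs to the paragraph, not to the child.
        outside.append(child.tail or "")
    # The paragraph language is guessed from the text outside the quotes.
    paragraph.set(XML_LANG, guess("".join(outside)))


paragraph = ET.fromstring(
    '<p>She said <span type="quote">bonjour</span> and left.</p>'
)
tag_paragraph(paragraph, guess_language)
print(ET.tostring(paragraph, encoding="unicode"))
```

Guessing from the text outside the quotes keeps a foreign-language quotation from skewing the language assigned to the paragraph as a whole.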
set_span_language(paragraph)
Set xml:lang of span element.