languagedetector
This file contains classes that fix converted documents.
LanguageDetector
Detect and set the languages of a document.
Source code in corpustools/languagedetector.py
inlangs
property
Return the predefined possible languages of the document.
mainlang
property
Return the main language of the document.
__init__(document, language_guesser)
Initialise the LanguageDetector class.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| document | Element | an etree element. | required |
| language_guesser | Classifier | a text_cat.Classifier. | required |
Source code in corpustools/languagedetector.py
detect_language()
Detect language in all the paragraphs in self.document.
Source code in corpustools/languagedetector.py
remove_quote(paragraph)
staticmethod
Extract all text except the text inside quote elements.
Source code in corpustools/languagedetector.py
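The idea of keeping only the text outside quotes can be sketched as follows. This is a minimal illustration, not the code in corpustools/languagedetector.py: it assumes that quotes are marked up as `<span type="quote">` elements, it uses the standard library's `xml.etree.ElementTree` in place of whichever etree implementation the package uses, and the function name `text_outside_quotes` is hypothetical.

```python
import xml.etree.ElementTree as ET


def text_outside_quotes(element):
    """Return the text of element, skipping everything inside quote spans.

    Assumption: a quote is a <span> element with type="quote".
    """
    parts = []
    if element.text:
        parts.append(element.text)
    for child in element:
        is_quote = child.tag == "span" and child.get("type") == "quote"
        if not is_quote:
            # Descend into ordinary inline elements such as <em>.
            parts.append(text_outside_quotes(child))
        if child.tail:
            # Tail text follows the child but belongs to the enclosing
            # element, so it is kept even when the child is a quote.
            parts.append(child.tail)
    return "".join(parts)


paragraph = ET.fromstring(
    '<p>She said <span type="quote">hello there</span> and <em>left</em>.</p>'
)
outside = text_outside_quotes(paragraph)
print(repr(outside))
```

Skipping the quote's subtree while keeping its tail means that the words surrounding a quote are what a language guesser would see, which matters when a paragraph in one language cites a sentence in another.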
set_paragraph_language(paragraph)
Set xml:lang of paragraph.
Extract the text outside the quotes and use it to set the language of the paragraph. Then set the language of the quotes in the paragraph.
Source code in corpustools/languagedetector.py
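The two steps described above (tag the paragraph from its non-quoted text, then tag each quote separately) might look roughly like this. It is a sketch under stated assumptions rather than the actual method: quotes are assumed to be `<span type="quote">` elements, `toy_guess` is a hypothetical stand-in for a real classifier such as text_cat.Classifier (whose API is not shown here), and `outer_text` and `tag_paragraph` are illustrative names.

```python
import xml.etree.ElementTree as ET

# Fully qualified name of the xml:lang attribute in ElementTree.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"


def toy_guess(text):
    """Hypothetical classifier: call text English if it has common English words."""
    words = set(text.lower().split())
    return "eng" if words & {"the", "and", "is"} else "sme"


def outer_text(paragraph):
    """Text of the paragraph without its quote spans (assumed markup)."""
    parts = [paragraph.text or ""]
    for child in paragraph:
        if not (child.tag == "span" and child.get("type") == "quote"):
            parts.append("".join(child.itertext()))
        parts.append(child.tail or "")
    return "".join(parts)


def tag_paragraph(paragraph, guess):
    """Set xml:lang on the paragraph, then on each quote span inside it."""
    paragraph.set(XML_LANG, guess(outer_text(paragraph)))
    for span in paragraph.iter("span"):
        if span.get("type") == "quote" and span.text:
            span.set(XML_LANG, guess(span.text))
    return paragraph


paragraph = ET.fromstring(
    '<p>Son dajai <span type="quote">the house is red</span> munnje.</p>'
)
tag_paragraph(paragraph, toy_guess)
span = paragraph.find("span")
print(ET.tostring(paragraph, encoding="unicode"))
```

Passing the guesser in as a plain callable keeps the sketch independent of any particular classifier; a real detector would also restrict guesses to the document's predefined languages.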
set_span_language(paragraph)
Set xml:lang of span element.
Source code in corpustools/languagedetector.py