dupe_finder
Classes to find and handle duplicate files in the repository.
The classes work on converted files.
DupeFinder
Handle duplicates in the corpus.
Source code in /home/anders/projects/CorpusTools/corpustools/dupe_finder.py
compare_files(filename1, filename2)
Compare two files.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
filename1 | str | Name of the first file. | required |
filename2 | str | Name of the second file. | required |
Source code in /home/anders/projects/CorpusTools/corpustools/dupe_finder.py
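The reference does not say how two files are compared. As a rough illustration only, the sketch below scores two text files with `difflib.SequenceMatcher` from the standard library. The function name `similarity` and the line-by-line comparison are assumptions made for this example, not the documented behaviour of `compare_files`.

```python
import difflib
from pathlib import Path


def similarity(filename1: str, filename2: str) -> float:
    """Return a similarity score between 0.0 and 1.0 for two text files.

    Hypothetical helper: compares the files line by line. The real
    compare_files method may use a different measure.
    """
    lines1 = Path(filename1).read_text(encoding="utf-8").splitlines()
    lines2 = Path(filename2).read_text(encoding="utf-8").splitlines()
    # ratio() is 1.0 for identical sequences and 0.0 when nothing matches.
    return difflib.SequenceMatcher(None, lines1, lines2).ratio()
```

A score near 1.0 would suggest that the two files are duplicates or near duplicates; where to draw that line is a design choice this sketch leaves open.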
get_parallel_texts(filename1)
staticmethod
Get the names of the parallel files.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
filename1 | str | Name of the file whose parallel files should be looked up. | required |
Source code in /home/anders/projects/CorpusTools/corpustools/dupe_finder.py
get_wc(filename)
staticmethod
Get the wordcount of a file.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
filename | str | Name of the file to retrieve the word count from. | required |
Returns:
Type | Description |
---|---|
float | The word count. |
Source code in /home/anders/projects/CorpusTools/corpustools/dupe_finder.py
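To make the return type concrete, here is a minimal sketch of a word count that yields a `float`. It assumes a plain UTF-8 text file and whitespace-separated tokens; the name `word_count` is hypothetical, and the real `get_wc` works on converted corpus files and may count words differently.

```python
from pathlib import Path


def word_count(filename: str) -> float:
    """Return the number of whitespace-separated tokens in a text file.

    Hypothetical stand-in for get_wc, for illustration only.
    """
    text = Path(filename).read_text(encoding="utf-8")
    # str.split() with no arguments splits on any run of whitespace.
    return float(len(text.split()))
```

Returning a float rather than an int keeps later ratio arithmetic free of type conversions.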
good_word_ratio(filename1, filename2)
Check whether the word counts of two files are nearly equal.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
filename1 | str | Name of the first file. | required |
filename2 | str | Name of the second file. | required |
Returns:
Type | Description |
---|---|
bool | True if the ratio is larger than 0.9, False if it is less. |
Source code in /home/anders/projects/CorpusTools/corpustools/dupe_finder.py
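The 0.9 threshold can be illustrated with a small sketch. It assumes the ratio is the smaller word count divided by the larger one, which the reference does not state explicitly; the function name `word_ratio_ok` and the handling of two empty files are likewise assumptions.

```python
def word_ratio_ok(wc1: float, wc2: float, threshold: float = 0.9) -> bool:
    """Return True if two word counts are within the given ratio of each other.

    Hypothetical helper: the ratio is taken as smaller count / larger count.
    """
    larger = max(wc1, wc2)
    if larger == 0:
        # Assumption: two empty files count as having equal word counts.
        return True
    return min(wc1, wc2) / larger > threshold
```

Under these assumptions, counts of 95 and 100 pass the check, while counts of 50 and 100 do not. Dividing the smaller count by the larger keeps the result independent of argument order.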
iterate_all_files(remove=False)
Compare all files to each other.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
remove | bool | If True, remove files; otherwise keep them. | False |
Source code in /home/anders/projects/CorpusTools/corpustools/dupe_finder.py
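Comparing all files to each other amounts to visiting every unordered pair once. The sketch below shows that pattern with `itertools.combinations`; the function `find_duplicate_pairs` and its `is_duplicate` callback are hypothetical names for this example, and the sketch only reports pairs rather than removing anything.

```python
from itertools import combinations
from typing import Callable, Iterable, List, Tuple


def find_duplicate_pairs(
    filenames: Iterable[str],
    is_duplicate: Callable[[str, str], bool],
) -> List[Tuple[str, str]]:
    """Return every pair of filenames that the callback flags as duplicates.

    Hypothetical helper: each unordered pair is checked exactly once.
    """
    pairs = []
    # Sorting makes the order of the reported pairs deterministic.
    for first, second in combinations(sorted(filenames), 2):
        if is_duplicate(first, second):
            pairs.append((first, second))
    return pairs
```

For n files this performs n(n-1)/2 comparisons, so a cheap pre-check such as a word count ratio is a natural way to skip expensive comparisons on large collections.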
remove_dupe_file(filename1, filename2)
Remove duplicate files.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
filename1 | str | Name of the first file to be compared. | required |
filename2 | str | Name of the second file to be compared. | required |
Source code in /home/anders/projects/CorpusTools/corpustools/dupe_finder.py