Working with corpora

Structure

The corpus for a given language is hosted on github. Raw source files, along with metadata, is stored in github.com/giellalt/corpus-xxx-orig (where xxx is a language code), while processed output is stored in github.com/giellalt/corpus-xxx.

Additionally, bound (or restricted) corpus are stored as github.com/giellalt/corpus-xxx-orig-x-closed and github.com/giellalt/corpus-xxx-x-closed.

Git LFS

The source repositories contains original files (as pdf, docx, etc), and many of them are large. We use Git Large File Storage (LFS) to handle them.

This means that they are not downloaded when you clone the repository, instead you will get files containing information for LFS about where the files are located, and their size.

Note

LFS is only required for handling raw source files (i.e. when working with corpus-xxx-orig). If you are not dealing with any raw material - and only dealing with the *corpus-xxx folder, you can skip this.

Installation

Installation is documented in their readme, at https://github.com/git-lfs/git-lfs#installing, but see also the main site at https://git-lfs.com/.