.. colibri documentation master file, created by
   sphinx-quickstart on Mon Oct 8 11:38:12 2012.
   You can adapt this file completely to your liking, but it should at least
   contain the root `toctree` directive.

****************************
Colibri Core Documentation
****************************

.. toctree::
   :maxdepth: 3

Introduction
===================

Colibri Core is software consisting of command-line tools as well as programming libraries to quickly and efficiently count and extract patterns from large corpus data, to extract various statistics on the extracted patterns, and to compute relations between the extracted patterns. The employed notion of pattern or construction encompasses the following categories:

* **n-gram** -- *n* consecutive words.
* **skipgram** -- An abstract pattern of predetermined length with one or multiple gaps (of specific size).
* **flexgram** -- An abstract pattern without predetermined length, with one or more gaps.

N-gram extraction may seem fairly trivial at first: with a few lines in your favourite scripting language, you can move a simple sliding window of size *n* over your corpus and store the results in some kind of hashmap. This trivial approach, however, makes an unnecessarily high demand on memory resources, which often becomes prohibitive when unleashed on large corpora. Colibri Core tries to minimise these space requirements in several ways:

* **Binary representation** -- Each word type is assigned a numeric class, which is encoded in a compact binary format in which highly frequent classes take less space than less frequent classes. Colibri Core always uses this representation rather than a full string representation, both on disk and in memory.
* **Informed counting** -- Counting is performed more intelligently by iteratively processing the corpus in several passes and quickly discarding patterns that won't reach the desired occurrence threshold.

Skipgram and flexgram extraction are computationally more demanding but have been implemented with similar optimisations. Skipgrams are computed by abstracting over n-grams, and flexgrams are in turn computed either by abstracting over skipgrams, or directly from n-grams on the basis of co-occurrence information (pointwise mutual information).

When patterns have been extracted, along with their counts and/or index references to the original corpus data, they form a so-called *pattern model*. At the heart of Colibri Core lies the tool ``colibri-patternmodeller``, which allows you to build, view, manipulate and query pattern models.

The Colibri software is developed in the scope of the Ph.D. research project **Constructions as Linguistic Bridges**. This research examines the identification and extraction of aligned constructions or patterns across natural languages, and the usage of such constructions in Machine Translation. The aligned constructions are not identified on the basis of an extensive and explicitly defined grammar or expert database of linguistic knowledge, but rather are implicitly distilled from large amounts of example data. Our notion of constructions is broad and transcends the idea of words or variable-length phrases.

This documentation will illustrate how to work with the various tools and the library of Colibri Core, as well as elaborate on the implementation of certain key aspects of the software.
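To make the three pattern categories concrete, here is a small illustration, using the gap notation ``{*}`` and ``{**}`` that Colibri Core itself uses when printing patterns. Given the corpus line *to be or not to be*, the following are examples of each category::

    or not to        (an n-gram: three consecutive words)
    to be {*} not    (a skipgram: a gap of exactly one token)
    be {**} be       (a flexgram: a gap of dynamic width)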
Installation
===============

Installation via LaMachine
------------------------------------------------------------

Colibri Core is included in the `LaMachine `_ distribution. This includes all dependencies and other NLP software. LaMachine can also run as a virtual machine on any host OS.

Installing dependencies
-------------------------------

To compile Colibri Core, you need a sane build environment; install the necessary dependencies for your distribution.

For Debian/Ubuntu::

    $ sudo apt-get install make gcc g++ pkg-config autoconf-archive libtool autotools-dev libbz2-dev zlib1g-dev libtar-dev python3 python3-dev cython3

For RedHat-based systems (run as root)::

    # yum install pkgconfig libtool autoconf automake autoconf-archive make gcc gcc-c++ libtar libtar-devel python3 python3-devel zlib zlib-devel python3-pip bzip2 bzip2-devel cython3

For Mac OS X with `homebrew `_::

    $ brew install autoconf automake libtool autoconf-archive python3 pkg-config

Arch Linux users can simply install Colibri Core and all dependencies directly from the `Arch User Repository `_; no further installation is necessary in this case.

Installation via the Python Package Index
------------------------------------------------------------

Colibri Core can be installed from the `Python Package Index `_ using the ``pip`` tool, often named ``pip3`` for the Python 3 version, which we recommend. This procedure will automatically download, compile, and install all of Colibri Core. First ensure you have installed all dependencies from the previous section! Colibri Core requires an up-to-date version of Cython first (0.23 or above), or the installation will fail; we use ``pip`` to compile it from scratch::

    $ sudo pip3 install cython

Then we can install Colibri Core itself::

    $ sudo pip3 install colibricore

For installation without root privileges we recommend creating a `Python Virtual environment `_, in which case all of Colibri Core will be installed under it::

    $ virtualenv --python=python3 coco
    $ . coco/bin/activate    #you will need to do this each time you want to use Colibri Core
    (coco)$ pip install cython
    (coco)$ pip install colibricore

Alternatively, pass a prefix::

    $ pip3 install --install-option="--prefix=/my/installation/directory" colibricore

**Important Note:** If you install Colibri Core locally (in a Python Virtual Environment or elsewhere), then you need to set ``LD_LIBRARY_PATH=$VIRTUAL_ENV/lib/`` prior to running Python for the Python binding to function (replace ``$VIRTUAL_ENV`` with the directory you used as a prefix if you do not use a Python Virtual Environment). Otherwise, you will be confronted with the error ``ImportError: libcolibricore.so.0: cannot open shared object file: No such file or directory``. If you get this error after a global installation, run ``sudo ldconfig``.
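For example, when working in a virtual environment, you would run the following before starting Python::

    $ export LD_LIBRARY_PATH=$VIRTUAL_ENV/lib/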
Installation from Github
-----------------------------

First ensure you have installed all dependencies from the dependencies section! Colibri Core is hosted on `github <https://github.com/proycon/colibri-core>`_ and should ideally be retrieved through the version control system ``git``. Provided git is installed on your system, this is done as follows::

    $ git clone https://github.com/proycon/colibri-core.git

Alternatively, you can download and extract release archives from the aforementioned Github page.

Now we can install Colibri Core itself. The following will install everything globally under ``/usr/`` and hence requires administrative privileges::

    $ sudo python3 setup.py install

If you install from within a `Python Virtual environment `_, everything will be installed under it. This does not require root privileges::

    $ virtualenv --python=python3 coco
    $ . coco/bin/activate    #you will need to do this each time you want to use Colibri Core
    (coco)$ pip install cython
    (coco)$ pip install colibricore

For local installation elsewhere, pass a prefix, for example::

    $ python3 setup.py install --prefix=/home/yourname/local

The note at the end of the previous section applies to any non-global installation!

Manual compilation and installation (advanced)
-------------------------------------------------

Installation via the Python Package Index or Github invokes compilation of the C++ sources for you. You can, however, also compile everything yourself. This is especially relevant if you are only interested in the command-line tools or the C++ library, and do not care about the Python library. Moreover, this route is also needed if you want FoLiA support (optional), for which you first need to install the following dependency. By default, FoLiA support is disabled.

* **libfolia**; obtainable from `the FoLiA website `_. Follow the instructions included with libfolia to install it.

Including pulling the sources from github, Colibri Core can be compiled and installed as follows::

    $ git clone https://github.com/proycon/colibri-core.git
    $ cd colibri-core
    $ bash bootstrap
    $ ./configure [--prefix=/usr] [--with-folia --with-folia-path=/path/to/libfolia]
    $ make
    $ sudo make install

If no prefix is set, installation will be under ``/usr/local/`` by default.

To compile the Python binding manually, we first create an empty file ``manual``, which signals the build process not to attempt to recompile the C++ library itself::

    $ touch manual

Then build as follows, provided Colibri Core is already in a globally accessible location::

    $ sudo python3 ./setup.py install

If you used a prefix and want to install in a customised non-global location, again use ``--prefix``, along with ``--include-dirs`` and ``--library-dirs`` to point to where the C++ headers and the library are installed::

    $ python3 ./setup.py build_ext --include-dirs=/path/to/include/colibri-core --library-dirs=/path/to/lib/ install --prefix=/path/to/somewhere/

Update ``$LD_LIBRARY_PATH`` and ``$PYTHONPATH`` where necessary.

Keeping Colibri Core up to date
--------------------------------

It is recommended to regularly check back for new versions of Colibri Core. If you used ``pip``, just run::

    $ pip3 install -U colibricore

For the git version::

    $ git pull

And then recompile as per the above instructions.

General usage instructions
---------------------------------

Colibri Core consists of various programs and scripts, each of which will output an extensive overview of available parameters if the parameter ``-h`` is passed. Each program is designed for a specialised purpose, with specific input and output formats. It is often needed to call multiple programs in succession to obtain the final analysis or model you desire.
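For example, to see all available parameters of the pattern modeller::

    $ colibri-patternmodeller -h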
Quick start: High-level scripts
=================================

Introduction
---------------

Colibri Core comes with a set of scripts that provide simpler access to the underlying tools and can be used from the command line by end users to get quick results. Input to these tools is always one or more plain-text files, in tokenised form, with one sentence per line.

Tokenisation
--------------

If your corpus is not tokenised yet, you can consider using the tokeniser `ucto `_, which is not part of Colibri Core. Debian/Ubuntu users may find it in the repository (``sudo apt-get install ucto``); Mac OS X users can find it in homebrew (``brew install naiaden/lama/ucto``). Ucto will also do sentence detection and, with the ``-n`` flag, output one line per sentence, as Colibri Core prefers::

    $ ucto -L en -n untokenisedcorpus.txt > tokenisedcorpus.txt

The ``-L`` flag specifies the language of your corpus (English in this case); several other languages are available as well. Of course, you can use any other tokeniser of your choice.

Scripts
-----------

In addition to the core tools, described in the remainder of this documentation, Colibri Core offers the following scripts:

* ``colibri-ngrams`` - Extracts n-grams of a particular size from the corpus text, in the order they occur, i.e. by moving a sliding window over the text.
* ``colibri-freqlist`` - Extracts all n-grams from one or more corpus text files and outputs a frequency list. Also allows for the extraction of skipgrams. By default all n-grams are extracted, but an occurrence threshold can be set with the ``-t`` flag.
* ``colibri-ngramstats`` - Prints a summary report on the n-grams in one or more corpus text files. To get the full details on interpreting the output report, read the section *Statistical Reports and Histograms*.
* ``colibri-histogram`` - Prints a histogram of n-gram/skipgram occurrence counts.
* ``colibri-queryngrams`` - Interactive tool allowing you to query n-grams from standard input; various statistics and relations can be output.
* ``colibri-reverseindex`` - Computes and prints a reverse index for the specified corpus text file. For each token position in the corpus, it will output what patterns are found there (i.e. start at that very same position).
* ``colibri-loglikelihood`` - Computes the log-likelihood between patterns in two or more corpus text files, which allows users to determine what words or patterns are significantly more frequent in one corpus than the other.
* ``colibri-coverage`` - Computes the overlap between a training corpus and a test corpus, and produces coverage metrics.
* ``colibri-findpatterns`` - Finds patterns (including skipgrams and flexgrams) in corpus data. You specify the patterns in a text file (one per line).

Users have to be aware, however, that these scripts only expose a limited amount of the functionality of Colibri Core. Nevertheless, they simplify a lot of common tasks people do with Colibri Core.

Corpus Class Encoding
================================

Introduction
----------------------

Computation on large datasets begs for solutions to keep memory consumption manageable. Colibri Core requires that input corpora are converted into a compressed binary form. The vocabulary of the corpus is converted to integer form, i.e. each word type in the corpus is represented by a numeric class. Highly frequent word types get assigned low class numbers and less frequent word types get higher class numbers. The class is represented in a dynamic-width byte array, rather than a fixed-width integer.

Patterns are encoded per word: each word starts with a size marker of one byte indicating the number of bytes used for that word, and the specified number of bytes that follow encode the word class. Instead of a size marker, byte values of 128 and above are reserved for special markers, such as those encoding gaps and structural data. Finally, the pattern as a whole is terminated by a null byte.

All internal computations of all tools in Colibri Core proceed on this internal representation rather than on actual textual strings, keeping running time shorter and the memory footprint significantly smaller.
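To make the encoding scheme concrete, here is a minimal, purely illustrative Python sketch of the decoding logic just described. It is not the actual implementation (which lives in the C++ library), and the big-endian byte order of the class bytes is an assumption for illustration only::

    def decode_pattern(data: bytes) -> list:
        """Decode a pattern following the scheme described above: each word
        is a one-byte size marker (< 128) followed by that many class bytes;
        marker bytes >= 128 are special (gaps, structure); 0 ends the pattern."""
        items = []
        i = 0
        while i < len(data) and data[i] != 0:  # a null byte terminates the pattern
            marker = data[i]
            if marker >= 128:
                items.append(("special", marker))  # gap or structural marker
                i += 1
            else:
                # the next `marker` bytes encode the word class
                items.append(("class", int.from_bytes(data[i+1:i+1+marker], "big")))
                i += 1 + marker
        return items

    # Two words: class 5 (one byte) and class 300 (two bytes), then the null byte
    print(decode_pattern(bytes([1, 5, 2, 1, 44, 0])))  # [('class', 5), ('class', 300)]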
.. _classencodetraintest:

Class-encoding your corpus
-----------------------------------

When working with Colibri Core, you first want to **class encode** your corpus. This is done by the program ``colibri-classencode``. It takes as input a *tokenised* monolingual corpus in plain-text format, containing *one sentence per line*, as a line is the only structural unit Colibri Core works with: extracted patterns will never cross line boundaries. Each line should be delimited by a single newline character (unix line endings). If you desire another structural unit (such as, for example, a tweet or a paragraph), simply make sure each is on one line. Colibri Core is completely agnostic when it comes to the character encoding of the input.

Given a corpus file ``yourcorpus``, class encoding is done as follows::

    $ colibri-classencode yourcorpus

This results in two files:

* ``yourcorpus.colibri.cls`` - This is the class file; it lists all word types and their class numbers.
* ``yourcorpus.colibri.dat`` - This is the corpus in encoded binary form. It is a lossless compression that is roughly half the size of the original.

If your corpus is not tokenised yet, you can consider using the tokeniser `ucto `_ (not part of Colibri Core); this will also do sentence detection and output one line per sentence::

    $ ucto -L en -n untokenisedcorpus.txt > tokenisedcorpus.txt

The above sample is for English (``-L en``); several other languages are supported as well.

In addition to this plain-text input, the class encoder also supports *FoLiA XML* (`folia website `_) if you compiled with FoLiA support. Make sure such files end with the extension ``xml`` and they will be automatically interpreted as FoLiA XML::

    $ colibri-classencode yourcorpus.xml

The class file is the vocabulary of your corpus; it simply maps word strings to integers. Whenever you are working with multiple models and you want to compare them, you must always ensure they use the exact same class file.

It is possible to encode multiple corpus files simultaneously, generating a joint class file::

    $ colibri-classencode yourcorpus1.txt yourcorpus2.txt

This results in ``yourcorpus1.colibri.cls``, ``yourcorpus1.colibri.dat`` and ``yourcorpus2.colibri.dat``. The class file spans both corpora despite its name; an explicit name can be passed with the ``-o`` flag.

It is also possible to encode multiple corpora into a single unified file by passing the ``-u`` flag. This is often desired if you want to train a pattern model on all the joined data::

    $ colibri-classencode -o out -u yourcorpus1.txt yourcorpus2.txt

This will produce ``out.colibri.dat`` and ``out.colibri.cls``. You can use the ``-l`` option to read input filenames from a file instead of from command-line arguments (one filename per line).

If you have a pre-existing class file, you can load it with the ``-c`` flag and use it to encode new data::

    $ colibri-classencode -c yourcorpus1.colibri.cls yourcorpus2.txt

This will produce a ``yourcorpus2.colibri.dat``, provided that all of the word types already existed in ``yourcorpus1.colibri.cls``. That usually is not the case, and an error will be shown. To circumvent this error, you have to specify how to deal with unknown words. There are two ways: the ``-U`` flag will encode all unknown words as a single word class dedicated to that task, whereas the ``-e`` flag will *extend* the specified class file with any new classes found. It has to be noted that this extension method spoils the optimal compression, as classes are no longer strictly sorted by frequency. If you can encode all the needed data in one go, then that is always preferred. This setup, however, is often seen in a train/test paradigm::

    $ colibri-classencode -f testset.txt -c trainset.colibri.cls -e

This will result in an encoded corpus ``testset.colibri.dat`` and an *extended* class file ``testset.colibri.cls``, which is a superset of the original ``trainset.colibri.cls``, adding only those classes that did not yet exist in the training data.
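The same class encoding can also be performed from Python. A minimal sketch using the ``colibricore`` binding (the filenames are illustrative)::

    import colibricore

    # Build a class file (vocabulary) from a tokenised plain-text corpus
    classencoder = colibricore.ClassEncoder()
    classencoder.build("yourcorpus.txt")
    classencoder.save("yourcorpus.colibri.cls")

    # Encode the corpus itself to the compressed binary form
    classencoder.encodefile("yourcorpus.txt", "yourcorpus.colibri.dat")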
Class-decoding your corpus
------------------------------

Given an encoded corpus and a class file, the original corpus can always be reconstructed (unless the ``-U`` option was used during encoding to allow unknown words). This we call *class decoding*; it is done using the ``colibri-classdecode`` program::

    $ colibri-classdecode -f yourcorpus.colibri.dat -c yourcorpus.colibri.cls

Partial decoding can be done by specifying start and end line numbers using the flags ``-s`` and ``-e``, respectively. Output will be to ``stdout``; you can redirect it to a file as follows::

    $ colibri-classdecode -f yourcorpus.colibri.dat -c yourcorpus.colibri.cls > yourcorpus.txt
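From Python, an existing class file can be loaded for encoding and decoding individual patterns as well. A minimal sketch with the ``colibricore`` binding (filenames illustrative)::

    import colibricore

    # Load an existing vocabulary, for encoding as well as decoding
    classencoder = colibricore.ClassEncoder("yourcorpus.colibri.cls")
    classdecoder = colibricore.ClassDecoder("yourcorpus.colibri.cls")

    # Encode a tokenised string to a binary pattern, then decode it again
    pattern = classencoder.buildpattern("to be or not to be")
    print(pattern.tostring(classdecoder))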
Pattern Modeller
===============================

Introduction
-----------------------

The ``colibri-patternmodeller`` program is used to create pattern models, capturing recurring patterns from a monolingual corpus. The extracted patterns are n-grams or skipgrams, where a skipgram is an n-gram with one or more gaps, either of a predefined size (containing unspecified or wildcard tokens) or of dynamic width. In the internal pattern representation, in the place of the size marker, byte value 128 is used for a fixed gap of a single token (and can be repeated for longer gaps), and byte value 129 is used for a gap of unspecified, dynamic width.

The pattern finding algorithm is iterative in nature and is guaranteed to find all n-grams above a specified occurrence threshold, optionally given a maximum size for n. It does so by iterating over the corpus n times, covering all possible values for *n* in ascending order. At each iteration, a sliding window extracts all n-grams of the size in question. An n-gram is counted in a hashmap data structure only if both (n-1)-grams it by definition contains were found during the previous iteration with an occurrence above the set threshold. The exception are unigrams, which are all counted by definition if they reach the threshold, as they are already atomic in nature. At the end of each iteration, n-grams not making the occurrence threshold are pruned. This simple iterative technique reduces the memory footprint compared to the more naive approach of immediately storing everything in a hashmap, as it prevents the storing of lots of patterns not making the threshold by discarding them at an earlier stage.

At the beginning of each iteration of n, all possible ways in which an n-gram of size *n* can contain gaps are computed. When an n-gram is found, the various skipgrams are tried in accordance with these gap configurations. This is accomplished by 'punching holes' in the n-gram, resulting in a skipgram. If all consecutive parts of this skipgram were counted during previous iterations, and thus made the threshold, then the skipgram as a whole is counted; otherwise it is discarded. After each iteration, pruning again takes place to remove skipgrams that are not frequent enough.

The pattern finder can create either indexed or unindexed models. For indexed models, the precise location where an n-gram or skipgram instance was found in the corpus is recorded. This comes at the cost of much higher memory usage, but it is necessary for more strongly constrained skipgram extraction, as well as for extracting relations between patterns at a later stage. Indexed models by default also maintain a reverse index, and even unindexed models do so during building. Note that for fixed-size skipgrams in indexed models, the various fillings of the gaps can be reconstructed precisely. If you are only interested in simple n-gram or simple skipgram counts, then an unindexed model may suffice.

Creating a pattern model
----------------------------

First make sure to have class-encoded your corpus. Given this encoded corpus, ``colibri-patternmodeller`` can be invoked to produce an indexed pattern model. Always specify the output file using the ``--outputmodel`` or ``-o`` flag. The occurrence threshold is specified with the parameter ``--threshold`` or ``-t``; patterns occurring less often will not be counted. The default value is two. The maximum value for n, i.e. the maximum n-gram/skipgram size, can be restricted using the parameter ``-l``::

    $ colibri-patternmodeller --datafile yourcorpus.colibri.dat --threshold 10 --outputmodel yourcorpus.colibri.indexedpatternmodel

The outputted model ``yourcorpus.colibri.indexedpatternmodel`` is stored in a binary format. To print it in a human-readable presentation, it needs to be decoded. The ``colibri-patternmodeller`` program can do this by specifying an input model using the ``--inputmodel`` or ``-i`` flag, the class file using the ``--classfile`` (``-c``) parameter, and the desired action to print it all using ``--print`` (``-P``)::

    $ colibri-patternmodeller --inputmodel yourcorpus.colibri.indexedpatternmodel --classfile yourcorpus.colibri.cls --print

Optionally, instead of or in addition to outputting a model to file using ``-o``, you can also print it directly with ``-P``. Output will be to ``stdout`` in a tab-delimited format, with the first line reserved for the header. This facilitates easy parsing, as you can just load it into any software accepting CSV-like files, such as spreadsheets. An excerpt follows::

    PATTERN   COUNT   TOKENS   COVERAGE   CATEGORY   SIZE   FREQUENCY   REFERENCES
    For       2       2        0.0059     ngram      1      0.0121      11:0 15:0
    death     2       2        0.0059     ngram      1      0.0121      11:5 23:7
    who       2       2        0.0059     ngram      1      0.0121      15:1 21:5
    .         4       4        0.0118     ngram      1      0.0242      5:6 9:4 10:6 13:4
    be        4       4        0.0118     ngram      1      0.0242      1:1 1:5 9:2 35:3
    flee      2       2        0.0059     ngram      1      0.0121      36:1 36:5
    not to    4       8        0.0235     ngram      2      0.1538      1:3 36:3 37:3 38:3

The various columns are:

* **Pattern** - The actual pattern. Gaps in skipgrams are represented as ``{*}`` (one for each gap); variable-width gaps are just ``{**}``.
* **Occurrence count** - The absolute number of times this pattern occurs.
* **Tokens** - The absolute number of tokens in the corpus that this pattern covers. Longer patterns by definition cover more tokens. This value's maximum is ``occurrencecount * n``; the value will be smaller if a pattern overlaps itself.
* **Coverage** - The number of covered tokens, as a fraction of the total number of tokens.
* **Category** - The type of pattern (ngram, skipgram or flexgram).
* **Size** - The length of the n-gram or skipgram in words/tokens.
* **Frequency** - The frequency of the pattern *within its category and size class*; for an n-gram of size two, the frequency indicates its frequency amongst all bigrams.
* **References** - A space-delimited list of indices in the corpus that correspond to an occurrence of this pattern. Indices are in the form ``sentence:token``, where sentences start at one and tokens start at zero. This column is only available for indexed models.
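The same model can be built from Python. A minimal sketch using the ``colibricore`` binding, assuming the files from the previous sections (``mintokens`` corresponds to ``--threshold`` and ``maxlength`` to ``-l``)::

    import colibricore

    # Train an indexed pattern model on the class-encoded corpus
    options = colibricore.PatternModelOptions(mintokens=10, maxlength=8)
    model = colibricore.IndexedPatternModel()
    model.train("yourcorpus.colibri.dat", options)
    model.write("yourcorpus.colibri.indexedpatternmodel")

    # Print each pattern with its occurrence count
    classdecoder = colibricore.ClassDecoder("yourcorpus.colibri.cls")
    for pattern, _ in model.items():
        print(pattern.tostring(classdecoder), model.occurrencecount(pattern))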
Creating a pattern model with skipgrams and/or flexgrams
----------------------------------------------------------

The pattern model created in the previous example did not yet include skipgrams; these have to be explicitly enabled with the ``--skipgrams`` (``-s``) flag. When this is used, another option becomes available for consideration:

* ``--skiptypes [value]`` (``-T`` for short) - Only skipgrams that have at least this many different types as skip content, i.e. possible options filling the gaps, will be considered. The default is set to two.

Here is an example of generating an indexed pattern model including skipgrams::

    $ colibri-patternmodeller --datafile yourcorpus.colibri.dat --threshold 10 --skipgrams --skiptypes 3 --outputmodel yourcorpus.colibri.indexedpatternmodel

If you want to generate unindexed models, simply add the flag ``--unindexed``, or ``-u`` for short. Do note that for unindexed models the parameter ``--skiptypes`` has no effect; it will extract all skipgrams it can find, as if ``--skiptypes`` were set to one! If you want decent skipgrams, you're best off with an indexed model. Note that indexed models can always be read and printed in an unindexed way (with the ``--unindexed`` flag), but unindexed models can not be read in an indexed way, as they simply lack indices::

    $ colibri-patternmodeller --inputmodel yourcorpus.colibri.indexedpatternmodel --classfile yourcorpus.colibri.cls --unindexed --print
    $ colibri-patternmodeller --inputmodel yourcorpus.colibri.unindexedpatternmodel --classfile yourcorpus.colibri.cls --unindexed --print

Flexgrams, non-consecutive patterns in which the gap (only one in the current implementation) is of dynamic width, can be generated in one of two ways:

* Extract flexgrams by abstracting from skipgrams: use the ``--flexgrams S`` flag.
* Extract flexgrams directly from n-gram co-occurrence: use the ``--flexgrams [threshold]`` flag, where the threshold is expressed as normalised pointwise mutual information [-1,1].

The skipgram approach has the advantage of allowing you to rely on the ``--skiptypes`` threshold, but comes with the disadvantage of having a maximum span. The co-occurrence approach allows for flexgrams over larger distances. Both methods come at the cost of more memory, especially the former.

Neither skipgrams nor flexgrams will cross the line boundaries of the original corpus data, so ensure your data is segmented into lines suitable for your purposes at the encoding stage.
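From Python, skipgram extraction is likewise enabled through the model options. A minimal sketch (``doskipgrams`` corresponds to ``--skipgrams``; a Python-level equivalent of ``--skiptypes`` is not shown here)::

    import colibricore

    # Train an indexed pattern model, this time including skipgrams
    options = colibricore.PatternModelOptions(mintokens=10, doskipgrams=True)
    model = colibricore.IndexedPatternModel()
    model.train("yourcorpus.colibri.dat", options)
    model.write("yourcorpus.colibri.indexedpatternmodel")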
Two-stage building
-----------------------

Generating an indexed pattern model takes considerably more memory than an unindexed model, as instead of mere counts, all indices have to be retained. The creation of a pattern model progresses through stages of counting and pruning. When construction of an indexed model with an occurrence threshold of 2 or higher reaches the limits of your system's memory capacity, two-stage building *may* offer a solution. Two-stage building first constructs an unindexed model (demanding less memory), and subsequently loads this model and searches the corpus for the indices of all patterns in the model. Whilst this method is more time-consuming, it prevents the memory bump (after counting, prior to pruning) that normal one-stage building of indexed models has. Two-stage building is enabled using the ``-2`` flag::

    $ colibri-patternmodeller -2 --datafile yourcorpus.colibri.dat --threshold 10 --skipgrams --skiptypes 3 --outputmodel yourcorpus.colibri.indexedpatternmodel

Statistical reports and histograms
----------------------------------

If you have a pattern model, you can generate a statistical report, which includes information on the number of occurrences and the number of types of patterns, grouped for n-grams or skipgrams for a specific value of *n*. A report is generated using the ``--report`` (``-R``) flag; the input model is specified using ``--inputmodel`` (``-i``)::

    $ colibri-patternmodeller --inputmodel yourcorpus.colibri.indexedpatternmodel --report

Example output::

    REPORT
    ----------------------------------
                 PATTERNS    TOKENS    COVERAGE    TYPES
    Total:              -       340           -      177
    Uncovered:          -       175      0.5147      136
    Covered:           69       165      0.4853       41

    CATEGORY    N (SIZE)    PATTERNS    TOKENS    COVERAGE    TYPES    OCCURRENCES
    all         all               69       165      0.4853       41            243
    all         1                 40       165      0.4853       40            165
    all         2                 11        26      0.0765       13             26
    all         3                  7        17      0.0500        9             19
    all         4                  5        10      0.0294        9             14
    all         5                  5         9      0.0265        9             17
    all         6                  1         2      0.0059        6              2
    n-gram      all               62       165      0.4853       40            215
    n-gram      1                 40       165      0.4853       40            165
    n-gram      2                 11        26      0.0765       13             26
    n-gram      3                  5        12      0.0353        8             12
    n-gram      4                  3         6      0.0176        6              6
    n-gram      5                  2         4      0.0118        6              4
    n-gram      6                  1         2      0.0059        6              2
    skipgram    all                7         7      0.0206        6             28
    skipgram    3                  2         7      0.0206        4              7
    skipgram    4                  2         4      0.0118        4              8
    skipgram    5                  3         5      0.0147        5             13

Some explanation is in order to correctly interpret this data. First of all, patterns are grouped by category (ngram, skipgram, flexgram) and size. There are various metrics:

* **Patterns** - The number of distinct patterns in this group; for category n-gram of size 2, this reflects the number of distinct bigrams.
* **Tokens** - The number of tokens that is covered by the patterns in the group. Longer patterns by definition cover more tokens. This is only available for indexed models; for unindexed models it is either omitted or the number shown is the maximum projection ``occurrencecount * size``.
* **Coverage** - The number of tokens covered, as a fraction of the total number of tokens. Only for indexed models.
* **Types** - The number of unique **word** types covered, i.e. the number of distinct unigrams.
* **Occurrences** - The cumulative occurrence count of all the patterns in the group; used as a basis for computing frequency. The occurrence count differs from **tokens**: the former expresses the number of times a pattern occurs in the corpus, the latter expresses how many tokens are part of the patterns.

To better understand these metrics, let's observe them on the following test sentence::

    to be or not to be , that is the question

If we generate an indexed pattern model purely on this sentence, **with threshold two**, we find the following three patterns::

    PATTERN   COUNT   TOKENS   COVERAGE   CATEGORY   SIZE   FREQUENCY   REFERENCES
    to        2       2        0.181818   ngram      1      0.5         1:0 1:4
    be        2       2        0.181818   ngram      1      0.5         1:1 1:5
    to be     2       4        0.363636   ngram      2      1           1:0 1:4

The report then looks as follows::

    REPORT
    ----------------------------------
                 PATTERNS    TOKENS    COVERAGE    TYPES
    Total:              -        11           -        9
    Uncovered:          -         7      0.6364        7
    Covered:            3         4      0.3636        2

    CATEGORY    N (SIZE)    PATTERNS    TOKENS    COVERAGE    TYPES    OCCURRENCES
    all         all                3         4      0.3636        2              6
    all         1                  2         4      0.3636        2              4
    all         2                  1         4      0.3636        2              2
    n-gram      all                3         4      0.3636        2              6
    n-gram      1                  2         4      0.3636        2              4
    n-gram      2                  1         4      0.3636        2              2

Our sentence has 11 tokens, 7 of which are not covered by the patterns found and 4 of which are. Since we have only n-grams and no skipgrams or flexgrams in this simple example, the data for *all* and *n-gram* is the same.
The **coverage** metric expresses this in a normalised fashion. In our data we have two unigrams *(to, be)* and one bigram *(to be)*; this is expressed by the **patterns** metric. Both the unigrams and the bigram cover the exact same four tokens in our sentence, i.e. tokens 0, 1, 4 and 5, so the TOKENS column reports four for all groups. If we look at the **types** column, we notice we only have two word types: *to* and *be*. The unigrams occur in four different instances and the bigram occurs in two different instances; this is expressed in the **occurrences** column. Combined, that makes six occurrences. We have 9 types in total, of which only 2 *(to, be)* are covered; the remaining 7 *(or not , that is the question)* remain uncovered, as we set the occurrence threshold for this model to two.

Pattern models store how many of the tokens and types in the original corpus were covered. Tokens and types not covered did not make the set thresholds. Make sure to use indexed models if you want accurate coverage data.

A histogram can also be generated, using the ``--histogram`` (``-H``) flag::

    $ colibri-patternmodeller --inputmodel yourcorpus.colibri.indexedpatternmodel --histogram

Example output::

    OCCURRENCES    PATTERNS
    2              39
    3              5
    4              13
    5              5
    6              1
    7              1
    8              1
    10             1
    13             1
    14             1
    15             1

Filtering models
--------------------------------

Pattern models can be read with ``--inputmodel`` and filtered by setting stricter thresholds prior to printing, reporting or outputting to file. An example::

    $ colibri-patternmodeller --inputmodel yourcorpus.colibri.indexedpatternmodel -t 20 --skiptypes 10 --outputmodel yourcorpus_filtered.colibri.indexedpatternmodel --print

You can also filter pattern models by intersecting them with another pattern model, using the ``--constraintmodel`` option. This only works when both are built on the same class file::

    $ colibri-patternmodeller --inputmodel yourcorpus.colibri.indexedpatternmodel --constraintmodel yourcorpus2.colibri.indexedpatternmodel --outputmodel yourcorpus_filtered.colibri.indexedpatternmodel

The output pattern model will contain only those patterns that were present in both the input model (``--inputmodel``) and the constraining model (``--constraintmodel``), which may be either indexed or unindexed regardless of the input model; it will always contain the counts/indices from the input model.
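A comparable threshold filter can be applied from Python by iterating over a loaded model. A minimal sketch (here the filtering happens in the Python loop; it merely reports the surviving patterns rather than producing a new model)::

    import colibricore

    classdecoder = colibricore.ClassDecoder("yourcorpus.colibri.cls")
    model = colibricore.UnindexedPatternModel("yourcorpus.colibri.unindexedpatternmodel")

    # Report only the patterns that meet a stricter occurrence threshold
    for pattern, count in model.items():
        if count >= 20:
            print(pattern.tostring(classdecoder), count)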
Training and testing coverage
--------------------------------

An important quality of pattern models lies in the fact that pattern models can be compared, provided they use a comparable vocabulary, i.e. are based on the same class file. More specifically, you can train a pattern model on one corpus and test it on another corpus, which yields another pattern model containing only those patterns that occur in both the training and the test data. The differences in count, frequency and coverage can then easily be compared. You build such a model by taking the intersection with a training model, using the ``--constraintmodel`` flag. Make sure to always use the same class file for all datasets you are comparing; instructions for this were given in :ref:`classencodetraintest`.

Training::

    $ colibri-patternmodeller --datafile trainset.colibri.dat --outputmodel trainset.colibri.indexedpatternmodel

This results in a model ``trainset.colibri.indexedpatternmodel``. Now proceed with testing on another corpus.

Testing::

    $ colibri-patternmodeller --datafile testset.colibri.dat --constraintmodel trainset.colibri.indexedpatternmodel --outputmodel testset.colibri.indexedpatternmodel

It is, however, more memory efficient to load the constraint model as the input model (using ``--inputmodel``) and to specify ``--constrained``; this will do an in-place rebuild of the model::

    $ colibri-patternmodeller --datafile testset.colibri.dat --inputmodel trainset.colibri.indexedpatternmodel --constrained --outputmodel testset.colibri.indexedpatternmodel

This results in a model ``testset.colibri.indexedpatternmodel`` that only contains patterns that also occur in the specified training model. Such an intersection of models can also be created at any later stage, using ``--inputmodel`` and ``--constraintmodel``, as shown in the previous section. If you are interested in coverage metrics of a training model on a test corpus, the ``colibri-coverage`` script provides a shortcut for this.

Reverse index
-----------------

Indexed pattern models have what is called a *forward index*: for each pattern, all of the positions ``(sentence,token)`` at which an instance of the pattern can be found are held. In Colibri Core, sentences always start at 1, whereas tokens start at 0. A reverse index is a mapping of references of the type ``(sentence,token)`` to a set of all the patterns that *begin* at that location. Such a reverse index can be constructed from the forward index of an indexed pattern model, or it can be given explicitly by simply passing the original corpus data to the model, which makes reverse indices available even for unindexed models. Explicitly providing a reverse index makes loading a model faster (especially for larger models), but at the cost of higher memory usage, especially in the case of sparse models. Passing the usual ``--datafile`` flag enables a reverse index by definition.

To compute and display the full reverse index, use the ``--printreverseindex`` (``-Z``) flag::

    $ colibri-patternmodeller --inputmodel yourcorpus.colibri.indexedpatternmodel --datafile yourcorpus.colibri.dat --classfile yourcorpus.colibri.cls --printreverseindex

Indexes and/or reverse indexes are required for various purposes, one of which is the extraction of relations and co-occurrence information. The ``colibri-reverseindex`` script provides a shortcut for this.

Query mode
--------------

The pattern modeller has a query mode which allows you to quickly extract patterns from test sentences or fragments thereof. The query mode is invoked by loading a pattern model (``--inputmodel``), a class file (``--classfile``) and passing the ``--querymode`` (``-Q``) flag. The query mode can be run interactively, as it takes input from ``stdin``, one *tokenised* sentence per line. The following example illustrates this; the sentence *"To be or not to be"* was typed as input::

    $ colibri-patternmodeller --inputmodel /tmp/data.colibri.patternmodel --classfile /tmp/hamlet.colibri.cls --querymode
    Loading class decoder from file /tmp/hamlet.colibri.cls
    Loading class encoder from file /tmp/hamlet.colibri.cls
    Loading indexed pattern model /tmp/data.colibri.patternmodel as input model...
    Colibri Patternmodeller -- Interactive query mode.
    Type ctrl-D to quit, type X to switch between exact mode and extensive mode (default: extensive mode).
    1>> To be or not to be
    1:0     To         8 8 0.0235294 ngram 1 0.0484848 1:0 5:7 9:5 10:0 22:0 36:0 37:0 38:0
    1:1     be         4 4 0.0117647 ngram 1 0.0242424 1:1 1:5 9:2 35:3
    1:2     or         4 4 0.0117647 ngram 1 0.0242424 1:2 36:2 37:2 38:2
    1:3     not        5 5 0.0147059 ngram 1 0.030303 1:3 27:7 36:3 37:3 38:3
    1:4     to         13 13 0.0382353 ngram 1 0.0787879 1:4 2:6 4:1 5:10 6:7 8:4 9:1 9:8 10:4 27:2 36:4 37:4 38:4
    1:2     or not     4 8 0.0235294 ngram 2 0.153846 1:2 36:2 37:2 38:2
    1:3     not to     4 8 0.0235294 ngram 2 0.153846 1:3 36:3 37:3 38:3
    1:2     or not to  4 12 0.0352941 ngram 3 0.333333 1:2 36:2 37:2 38:2

Each line of output starts with an index in the format ``sentence:token``, specifying where the pattern was found in your input. The next columns are the same as in the print output. The interactive query mode distinguishes two modes: extensive mode and exact mode. In extensive mode, your input string will be scanned for all patterns occurring in it. In exact mode, the input you specified needs to match exactly and as a whole. Type ``X`` to switch between the two modes.

In addition to the interactive query mode, there is also a command-line query mode ``--query`` (``-q``), in which you specify the pattern you want to query as an argument on the command line. Multiple patterns can be specified by repeating the ``-q`` flag. This mode always behaves according to exact mode::

    $ colibri-patternmodeller --inputmodel /tmp/data.colibri.patternmodel --classfile /tmp/hamlet.colibri.cls --query "to be"
    Loading class decoder from file /tmp/hamlet.colibri.cls
    Loading class encoder from file /tmp/hamlet.colibri.cls
    Loading indexed pattern model /tmp/data.colibri.patternmodel as input model...
    to be   2 4 0.0117647 ngram 2 0.0769231 1:4 9:1

If you are working with skipgrams or flexgrams and want to explicitly instantiate them to n-grams in the output, add the ``--instantiate`` flag.
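The same query can be performed from Python. A minimal sketch with the ``colibricore`` binding, reusing the files from the example above::

    import colibricore

    classencoder = colibricore.ClassEncoder("/tmp/hamlet.colibri.cls")
    classdecoder = colibricore.ClassDecoder("/tmp/hamlet.colibri.cls")
    model = colibricore.IndexedPatternModel("/tmp/data.colibri.patternmodel")

    # Build a pattern from a tokenised string and query it (exact match)
    querypattern = classencoder.buildpattern("to be")
    if querypattern in model:
        print(querypattern.tostring(classdecoder), model.occurrencecount(querypattern))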
Pattern Relations
---------------------

A pattern model contains a wide variety of patterns; the relationships between those patterns can be made explicit. These relationships can be imagined as a directed graph, in which the nodes represent the various patterns (n-grams and skipgrams) and the edges represent the relations. The following relations are distinguished; note that, as the graph is directed, relations often come in pairs, one relationship for each direction:

* **Subsumption relations** - Patterns that are subsumed by larger patterns are called *subsumption children*; the larger patterns are called *subsumption parents*. These are the two subsumption relations that can be extracted from an indexed pattern model.
* **Successor relations** - Patterns that follow each other are in a left-of/right-of relation.
* **Instantiation relations** - There is a relation between skipgrams and the patterns that instantiate them: ``to be {*1*} not {*1*} be`` is instantiated by ``to {*1*} or``, also referred to as the skip content. Similarly, ``to be or not to be`` is a full instantiation, and the skipgram can be referred to as a template.

You can extract all of these relations using the ``--relations`` (``-g``) flag, which is to be used in combination with ``--query`` or ``--querymode``. Consider the following sample::

    $ colibri-patternmodeller -i /tmp/data.colibri.patternmodel -c /tmp/hamlet.colibri.cls -q "to be" -g
    Loading class decoder from file /tmp/hamlet.colibri.cls
    Loading class encoder from file /tmp/hamlet.colibri.cls
    Loading indexed pattern model /tmp/data.colibri.patternmodel as input model...
    Post-read processing (indexedmodel)
    to be   2 4 0.0117647 ngram 2 0.0769231 1:4 9:1
        #PATTERN1   RELATION              PATTERN2          REL.COUNT  REL.FREQUENCY  COUNT2
        to be       SUBSUMES              to                2          0.5            13
        to be       SUBSUMES              be                2          0.5            4
        to be       RIGHT-NEIGHBOUR-OF    To {*1*} or not   1          0.25           4
        to be       RIGHT-NEIGHBOUR-OF    To {*2*} not      1          0.25           4
        to be       RIGHT-NEIGHBOUR-OF    not               1          0.25           5
        to be       RIGHT-NEIGHBOUR-OF    or not            1          0.25           4

The following columns are reported; all are indented with a single tab so parsers can distinguish the numbers for the queried pattern itself from the relationships with other patterns:

* **Pattern 1** -- The pattern you queried.
* **Relation** -- The nature of the relationship between pattern 1 and pattern 2.
* **Pattern 2** -- The pattern that is related to the queried pattern.
* **Relation Count** -- The number of times pattern 1 and pattern 2 occur in this relation.
* **Relation Frequency** -- The number of times pattern 1 and pattern 2 occur in this relation, as a fraction of all relations of this type.
* **Count 2** -- The absolute number of occurrences of pattern 2 in the model.

Co-occurrence
--------------------

Co-occurrence in Colibri Core measures which patterns co-occur on the same line (i.e. usually corresponding to a sentence, or whatever structural unit you decided upon when encoding your corpus). Co-occurrence is another relation, in addition to the ones described in the previous section. The degree of co-occurrence can be expressed either as an absolute occurrence number (``--cooc`` or ``-C``), or as normalised pointwise mutual information (``--npmi`` or ``-Y``). Both flags take a threshold; setting the threshold too low, especially for npmi, may cause very high memory usage. The following shows all patterns that occur at least five times in the same sentence. Note that the order of the pattern pairs does not matter: if there are two patterns X and Y, then the results for X Y and Y X would be the same, so only one of them is included in the output to prevent duplicating information::

    $ colibri-patternmodeller -i /tmp/data.colibri.patternmodel -c /tmp/hamlet.colibri.cls -C 5

The ``colibri-cooc`` script provides a shortcut for this.

Finding specific patterns
----------------------------

If you have a very specific, and limited, list of patterns that you want to find in large corpus data, then you can create a constraint model based on such a pattern list and apply it to your test data. This follows the same paradigm as shown in the section on training and testing. To make a model from a pattern list, create a file with one pattern per line. The patterns can be skipgrams or flexgrams, using ``{*}`` and ``{**}`` to mark gaps. As always, it is vital that the test data and the pattern list share the same class file. You can create one on the test data first and then extend it on the pattern list::

    $ colibri-classencode -o test testcorpus.txt
    $ colibri-classencode -c test.colibri.cls -e -o patternlist patternlist.txt

The extended class file will be ``patternlist.colibri.cls``. To build a model on the pattern list, in which each line contains one pattern only, use the ``--patternlist`` flag::

    $ colibri-patternmodeller --datafile patternlist.colibri.dat --outputmodel patternlist.colibri.patternmodel --patternlist

We can now load the pattern-list model as a constraint and build on the test data::

    $ colibri-patternmodeller --constrained --inputmodel patternlist.colibri.patternmodel --datafile test.colibri.dat --classfile test.colibri.cls --outputmodel test.colibri.patternmodel

For skipgram support, add ``--skipgrams`` to this last command.
If you want to immediately print the output and see how the skipgrams are instantiated, add ``--instantiate --print``. The ``colibri-findpatterns`` script provides a shortcut for this procedure.

Architecture Overview
=======================

.. image:: arch.png

Python Tutorial
=======================

Colibri Core offers both a C++ API and a Python API. These expose all of the functionality of the tools outlined above, and beyond. The Python API binds with the C++ code; although it is more limited than the C++ API, it still offers most of the higher-level functionality. The Colibri Core binding between C++ and Python is written in Cython.

A Python tutorial for Colibri Core is available in the form of an IPython Notebook, meaning that you can run it interactively and play with it. You can go to the static, read-only version `by clicking here `_.

Python API Reference
=======================

.. automodule:: colibricore
   :members:
   :undoc-members:
   :special-members:
   :show-inheritance: