Colibri Core Python Tutorial: Efficiently working with n-grams, skipgrams and flexgrams

(for Colibri Core v2.4 - revision 2016-06-10)

https://proycon.github.io/colibri-core

by Maarten van Gompel, Radboud University Nijmegen

This tutorial will show you how to work with Colibri Core's Python API, a tool for Natural Language Processing. It is assumed that you have already read the Colibri Core documentation, followed the installation instructions, and are familiar with its purpose and concepts. The documentation also provides an API reference for all the Python classes and methods. This tutorial is in the form of a Python Notebook, allowing you to participate interactively. Press shift+enter in a code field to evaluate it.

Colibri Core is written in C++ and the Python binding is written in Cython. This offers the advantage of native speed and memory efficiency, combined with the ease of a high-level Pythonic interface. We will be using Python 3 conventions here, but Colibri Core and this tutorial also work with Python 2.7.

We obviously start our adventure with an import of colibricore, so make sure you installed it properly:

In [1]:
#These first imports are for Python 2.7 compatibility, we approximate Python 3 as much as possible
from __future__ import print_function
from sys import version
PYTHON2 = version[0] == '2'
if PYTHON2:
    from io import open


import colibricore

TMPDIR = "/tmp/" #this is where we'll store intermediate files

Class encoding/decoding

To give us something to work with, we will take an excerpt of Shakespeare's Hamlet as our corpus text:

In [2]:
corpustext = """To be, or not to be, that is the question
Whether 'tis Nobler in the mind to suffer
The Slings and Arrows of outrageous Fortune,
Or to take Arms against a Sea of troubles,
And by opposing end them? To die, to sleep
No more; and by a sleep, to say we end
The Heart-ache, and the thousand Natural shocks
That Flesh is heir to? 'Tis a consummation
Devoutly to be wished. To die, to sleep,
To sleep, perchance to Dream; Aye, there's the rub,
For in that sleep of death, what dreams may come,
When we have shuffled off this mortal coil,
Must give us pause. There's the respect
That makes Calamity of so long life:
For who would bear the Whips and Scorns of time,
Th' Oppressor's wrong, the proud man's Contumely,
The pangs of despised Love, the Law’s delay,
The insolence of Office, and the Spurns
That patient merit of the unworthy takes,
When he himself might his Quietus make
With a bare Bodkin? Who would these Fardels bear,
To grunt and sweat under a weary life,
But that the dread of something after death,
The undiscovered Country, from whose bourn
No Traveler returns, Puzzles the will,
And makes us rather bear those ills we have,
Than fly to others that we know not of.
Thus Conscience does make Cowards of us all,
And thus the Native hue of Resolution
Is sicklied o'er, with the pale cast of Thought,
And enterprises of great pitch and moment,
With this regard their Currents turn awry,
And lose the name of Action. Soft you now,
The fair Ophelia. Nymph, in all thy Orisons
Be all my sins remembered"""

#first we do some very rudimentary tokenisation
# Yes, I realise this is a very stupid way ;)
corpustext = corpustext.replace(',',' ,')
corpustext = corpustext.replace('.',' .')
corpustext = corpustext.replace(':',' :')

if PYTHON2: corpustext = unicode(corpustext,'utf-8')

corpusfile_plaintext = TMPDIR + "hamlet.txt"

with open(corpusfile_plaintext,'w',encoding='utf-8') as f:
    f.write(corpustext)

To work with this data in Colibri Core, we need to class encode it, assigning an integer value to each word type. Using Python, a class encoder is built as follows:

In [3]:
classfile = TMPDIR + "hamlet.colibri.cls"

#Instantiate class encoder
classencoder = colibricore.ClassEncoder()

#Build classes
classencoder.build(corpusfile_plaintext)

#Save class file
classencoder.save(classfile)

print("Encoded ", len(classencoder), " classes, well done!")
Encoded  184  classes, well done!

Now that we have a class encoder, we can encode our corpus, producing a new encoded file (which tends to be roughly half the size of the original):

In [4]:
corpusfile = TMPDIR + "hamlet.colibri.dat" #this will be the encoded corpus file
classencoder.encodefile(corpusfile_plaintext, corpusfile)

To check whether that worked as planned, we will construct a Class Decoder, load our class file, and decode the corpus:

In [5]:
#Load class decoder from the classfile we just made
classdecoder = colibricore.ClassDecoder(classfile)

#Decode corpus data
decoded = classdecoder.decodefile(corpusfile)

#Show
print(decoded)
To be , or not to be , that is the question
Whether 'tis Nobler in the mind to suffer
The Slings and Arrows of outrageous Fortune ,
Or to take Arms against a Sea of troubles ,
And by opposing end them? To die , to sleep
No more; and by a sleep , to say we end
The Heart-ache , and the thousand Natural shocks
That Flesh is heir to? 'Tis a consummation
Devoutly to be wished . To die , to sleep ,
To sleep , perchance to Dream; Aye , there's the rub ,
For in that sleep of death , what dreams may come ,
When we have shuffled off this mortal coil ,
Must give us pause . There's the respect
That makes Calamity of so long life :
For who would bear the Whips and Scorns of time ,
Th' Oppressor's wrong , the proud man's Contumely ,
The pangs of despised Love , the Law’s delay ,
The insolence of Office , and the Spurns
That patient merit of the unworthy takes ,
When he himself might his Quietus make
With a bare Bodkin? Who would these Fardels bear ,
To grunt and sweat under a weary life ,
But that the dread of something after death ,
The undiscovered Country , from whose bourn
No Traveler returns , Puzzles the will ,
And makes us rather bear those ills we have ,
Than fly to others that we know not of .
Thus Conscience does make Cowards of us all ,
And thus the Native hue of Resolution
Is sicklied o'er , with the pale cast of Thought ,
And enterprises of great pitch and moment ,
With this regard their Currents turn awry ,
And lose the name of Action . Soft you now ,
The fair Ophelia . Nymph , in all thy Orisons

Playing with patterns

Now that we have a class encoder and decoder, we can toy around with the most basic units in Colibri Core: patterns. These are used for n-grams, skipgrams, flexgrams and any other kind of text. You would basically use an instance of Pattern where you'd normally use a string, as Patterns are much smaller in memory. Let's build a pattern from a string using the class encoder; note that we will only be able to use words that are known by the class encoder:

In [6]:
#Build a pattern from a string, using the class encoder
p = classencoder.buildpattern("To be or not to be")

#To print it we need the decoder
print(p.tostring(classdecoder))
print(len(p))
To be or not to be
6

Iteration over a pattern will produce all the tokens it is made up of. Note that the concept of characters is gone from patterns! As a consequence, the ability to lowercase or uppercase text is also lost.

In [7]:
#Iterate over the tokens in a pattern; each token will be a Pattern instance

for token in p:
    print(token.tostring(classdecoder))
    
To
be
or
not
to
be
In [8]:
#Extracting subpatterns by offset

#Get first token
print(p[0].tostring(classdecoder))

#Get last token
print(p[-1].tostring(classdecoder))

#Get slice
print(p[2:4].tostring(classdecoder))
    
To
be
or not

Given a pattern, we can now very easily extract all n-grams in it, one of the most common NLP tasks:

In [9]:
#let's get all bigrams
for ngram in p.ngrams(2):
    print(ngram.tostring(classdecoder))
To be
be or
or not
not to
to be
In [10]:
#or all n-grams:
for ngram in p.ngrams():
    print(ngram.tostring(classdecoder))
To
be
or
not
to
be
To be
be or
or not
not to
to be
To be or
be or not
or not to
not to be
To be or not
be or not to
or not to be
To be or not to
be or not to be
In [11]:
#or particular ngrams, such as unigrams up to trigrams:
for ngram in p.ngrams(1,3):
    print(ngram.tostring(classdecoder))
To
be
or
not
to
be
To be
be or
or not
not to
to be
To be or
be or not
or not to
not to be

The in operator can be used to check whether a token or an n-gram is part of a pattern:

In [12]:
#token
p2 = classencoder.buildpattern("be")
print(p2 in p)

#ngram
p3 = classencoder.buildpattern("or not")
print(p3 in p)
True
True

The following snippet is here just to prove that our Pattern representation is usually smaller than a string representation, and offers a sneak peek under the hood:

In [13]:
from sys import version
if version[0] == '3': #This works on Python 3 only
    print(bytes(p), len(bytes(p)))
    print(b"To be or not to be", len(b"To be or not to be"))
    len(bytes(p)) < len(b"To be or not to be")
b'\x0e\x16\x81\x01\x1d\t\x16' 7
b'To be or not to be' 18

Reading a corpus

If we want to read an entire corpus, we can use the IndexedCorpus class. We can use this, for example, if we are merely interested in moving a sliding window over our data and extracting n-grams without counting or storing them:

In [14]:
corpusdata = colibricore.IndexedCorpus(corpusfile) #encoded data, will be loaded into memory entirely

for sentence in corpusdata.sentences(): #will return a Pattern per sentence (generator)
    for trigram in sentence.ngrams(3):
        print(trigram.tostring(classdecoder), end= " | ")
To be , | be , or | , or not | or not to | not to be | to be , | be , that | , that is | that is the | is the question | Whether 'tis Nobler | 'tis Nobler in | Nobler in the | in the mind | the mind to | mind to suffer | The Slings and | Slings and Arrows | and Arrows of | Arrows of outrageous | of outrageous Fortune | outrageous Fortune , | Or to take | to take Arms | take Arms against | Arms against a | against a Sea | a Sea of | Sea of troubles | of troubles , | And by opposing | by opposing end | opposing end them? | end them? To | them? To die | To die , | die , to | , to sleep | No more; and | more; and by | and by a | by a sleep | a sleep , | sleep , to | , to say | to say we | say we end | The Heart-ache , | Heart-ache , and | , and the | and the thousand | the thousand Natural | thousand Natural shocks | That Flesh is | Flesh is heir | is heir to? | heir to? 'Tis | to? 'Tis a | 'Tis a consummation | Devoutly to be | to be wished | be wished . | wished . To | . To die | To die , | die , to | , to sleep | to sleep , | To sleep , | sleep , perchance | , perchance to | perchance to Dream; | to Dream; Aye | Dream; Aye , | Aye , there's | , there's the | there's the rub | the rub , | For in that | in that sleep | that sleep of | sleep of death | of death , | death , what | , what dreams | what dreams may | dreams may come | may come , | When we have | we have shuffled | have shuffled off | shuffled off this | off this mortal | this mortal coil | mortal coil , | Must give us | give us pause | us pause . | pause . There's | . There's the | There's the respect | That makes Calamity | makes Calamity of | Calamity of so | of so long | so long life | long life : | For who would | who would bear | would bear the | bear the Whips | the Whips and | Whips and Scorns | and Scorns of | Scorns of time | of time , | Th' Oppressor's wrong | Oppressor's wrong , | wrong , the | , the proud | the proud man's | proud man's Contumely | man's Contumely , | The pangs of | pangs of despised | of despised Love | despised Love , | Love , the | , the Law’s | the Law’s delay | Law’s delay , | The insolence of | insolence of Office | of Office , | Office , and | , and the | and the Spurns | That patient merit | patient merit of | merit of the | of the unworthy | the unworthy takes | unworthy takes , | When he himself | he himself might | himself might his | might his Quietus | his Quietus make | With a bare | a bare Bodkin? | bare Bodkin? Who | Bodkin? Who would | Who would these | would these Fardels | these Fardels bear | Fardels bear , | To grunt and | grunt and sweat | and sweat under | sweat under a | under a weary | a weary life | weary life , | But that the | that the dread | the dread of | dread of something | of something after | something after death | after death , | The undiscovered Country | undiscovered Country , | Country , from | , from whose | from whose bourn | No Traveler returns | Traveler returns , | returns , Puzzles | , Puzzles the | Puzzles the will | the will , | And makes us | makes us rather | us rather bear | rather bear those | bear those ills | those ills we | ills we have | we have , | Than fly to | fly to others | to others that | others that we | that we know | we know not | know not of | not of . 
| Thus Conscience does | Conscience does make | does make Cowards | make Cowards of | Cowards of us | of us all | us all , | And thus the | thus the Native | the Native hue | Native hue of | hue of Resolution | Is sicklied o'er | sicklied o'er , | o'er , with | , with the | with the pale | the pale cast | pale cast of | cast of Thought | of Thought , | And enterprises of | enterprises of great | of great pitch | great pitch and | pitch and moment | and moment , | With this regard | this regard their | regard their Currents | their Currents turn | Currents turn awry | turn awry , | And lose the | lose the name | the name of | name of Action | of Action . | Action . Soft | . Soft you | Soft you now | you now , | The fair Ophelia | fair Ophelia . | Ophelia . Nymph | . Nymph , | Nymph , in | , in all | in all thy | all thy Orisons | 

Now you may be very tempted to start storing and counting n-grams this way, but don't. This method is only suitable for iterating over the n-grams and quickly discarding them. Colibri Core has facilities to deal with storing and counting far more efficiently: pattern models, which we will discuss in the next section.

First some more about IndexedCorpus. We can also obtain any pattern using its index, a (sentence,token) tuple:

In [15]:
unigram = corpusdata[(2,3)]
print(unigram.tostring(classdecoder))
in

A slice syntax is also supported, but may never cross line/sentence boundaries. As is customary in Python, the last index is non-inclusive.

In [16]:
ngram = corpusdata[(2,3):(2,8)]
print(ngram.tostring(classdecoder))
in the mind to suffer

The number of sentences and the length of each sentence can be extracted as follows:

In [17]:
sentencecount = corpusdata.sentencecount()
for i in range(1, sentencecount+1): #note the 1..+1 range, sentences are 1-indexed (whereas tokens are 0-indexed)
    print("Length of sentence " + str(i) + ":", corpusdata.sentencelength(i))
Length of sentence 1: 12
Length of sentence 2: 8
Length of sentence 3: 8
Length of sentence 4: 10
Length of sentence 5: 10
Length of sentence 6: 11
Length of sentence 7: 8
Length of sentence 8: 8
Length of sentence 9: 11
Length of sentence 10: 12
Length of sentence 11: 12
Length of sentence 12: 9
Length of sentence 13: 8
Length of sentence 14: 8
Length of sentence 15: 11
Length of sentence 16: 9
Length of sentence 17: 10
Length of sentence 18: 8
Length of sentence 19: 8
Length of sentence 20: 7
Length of sentence 21: 10
Length of sentence 22: 9
Length of sentence 23: 9
Length of sentence 24: 7
Length of sentence 25: 8
Length of sentence 26: 10
Length of sentence 27: 10
Length of sentence 28: 9
Length of sentence 29: 7
Length of sentence 30: 11
Length of sentence 31: 8
Length of sentence 32: 8
Length of sentence 33: 11
Length of sentence 34: 10

You can also find specific patterns in IndexedCorpus data. However, it is usually more efficient to use a Pattern Model, as discussed in the next section.

In [18]:
searchpattern = classencoder.buildpattern("or not")
for (sentence,token), pattern in corpusdata.findpattern(searchpattern):
    print("Pattern found at: " + str(sentence) + ":" + str(token))
Pattern found at: 1:3

You can pass an extra parameter with a sentence index to findpattern() to limit your search to one particular sentence rather than the entire corpus.
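
For instance, here is a minimal sketch constraining the search above to sentence 1 only (assuming the sentence index is simply passed as the second argument to findpattern(), as described above):

searchpattern = classencoder.buildpattern("or not")
for (sentence,token), pattern in corpusdata.findpattern(searchpattern, 1): #restrict the search to sentence 1
    print("Pattern found at: " + str(sentence) + ":" + str(token))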

Pattern Models

Now it's time to build our first pattern model on the Hamlet excerpt. We will extract all patterns occurring at least twice and with maximum length 8.

In [19]:
#Set the options
options = colibricore.PatternModelOptions(mintokens=2,maxlength=8)

#Instantiate an empty unindexed model 
model = colibricore.UnindexedPatternModel()

#Train it on our corpus file (class-encoded data, not plain text)
model.train(corpusfile, options)

print("Found " , len(model), " patterns:")

#Let's see what patterns are in our model (the order will be 'random')
for pattern in model:
    print(pattern.tostring(classdecoder), end=" | ")
Found  54  patterns:
To die , to sleep | die , to sleep | To die , to | , and the | , to sleep | die , to | To die , | , the | we have | and the | , and | sleep , | to sleep | die , | To die | to be | be , | all | With | make | we | That | not | No | die | And | a | , | be | is | that | would | in | to | and | end | For | To | The | have | death | us | the | When | this | makes | death , | , to | by | life | sleep | bear | of | . | 

Rather than just outputting the patterns, we of course now have the counts as well; let's output them:

In [20]:
#Models behave much alike to Python dictionaries:
for pattern, count in model.items():
    print(pattern.tostring(classdecoder), count)
To die , to sleep 2
die , to sleep 2
To die , to 2
, and the 2
, to sleep 2
die , to 2
To die , 2
, the 2
we have 2
and the 2
, and 2
sleep , 3
to sleep 2
die , 2
To die 2
to be 2
be , 2
all 2
With 2
make 2
we 4
That 3
not 2
No 2
die 2
And 5
a 5
, 36
be 3
is 2
that 4
would 2
in 3
to 9
and 7
end 2
For 2
To 5
The 6
have 2
death 2
us 3
the 15
When 2
this 2
makes 2
death , 2
, to 3
by 2
life 2
sleep 5
bear 3
of 15
. 5

We can also query specific patterns:

In [21]:
querypattern = classencoder.buildpattern("sleep")

print("How much sleep?")
print(model[querypattern])
How much sleep?
5
In [22]:
#Like dictionaries, unknown patterns will trigger a KeyError
querypattern = classencoder.buildpattern("insolence")

print("How much insolence?")
try:
    print(model[querypattern])
except KeyError:
    print("Nope, KeyError, no such pattern in model..")
How much insolence?
Nope, KeyError, no such pattern in model..

We can check whether a pattern is in a model in the usual pythonic fashion:

In [23]:
if querypattern in model:
    print("Insolence in model!")
else:
    print("No insolence in model!")
No insolence in model!

Rather than the absolute counts, we can get the frequency of a pattern relative to the patterns of the same category and length. For example, the frequency of a bigram amongst all bigrams:

In [24]:
querypattern = classencoder.buildpattern("and the")

print(model.frequency(querypattern))
0.07692307692307693

To analyse the distribution of occurrences, we can extract a histogram from our model as follows:

In [25]:
for occurrencecount, frequency in model.histogram():
    print(occurrencecount , " occurrences by ", frequency , "patterns")
    
2  occurrences by  34 patterns
3  occurrences by  7 patterns
4  occurrences by  2 patterns
5  occurrences by  5 patterns
6  occurrences by  1 patterns
7  occurrences by  1 patterns
9  occurrences by  1 patterns
15  occurrences by  2 patterns
36  occurrences by  1 patterns

Once we have a model, we can save it to file and reload it later; loading is much faster than training:

In [26]:
patternmodelfile = TMPDIR + "hamlet.colibri.patternmodel"

model.write(patternmodelfile)

#and reload just to show we can:
model = colibricore.UnindexedPatternModel(patternmodelfile, options)

Unindexed models are much smaller in memory than indexed models, but their functionality is also limited. Let's take a look at indexed models. Indexed models keep a forward index to all locations in the original corpus where patterns occur. The references are 2-tuples in the form (sentence,token), where sentence is 1-indexed and token is 0-indexed.

In [27]:
#Set the options
options = colibricore.PatternModelOptions(mintokens=2,maxlength=8)

#Instantiate an empty indexed model 
model = colibricore.IndexedPatternModel()

#Train it on our corpus file (class-encoded data, not plain text)
model.train(corpusfile, options)

print("Found " , len(model), " patterns:")

#Let's see what patterns are in our model (the order will be 'random')
for pattern, indices in model.items():
    print(pattern.tostring(classdecoder),end=" ")
    for index in indices:
        print(index,end=" ") #(sentence,token) tuple, sentences start with 1, tokens with 0
    print()
        
Found  54  patterns:
To die , to sleep (5, 5) (9, 5) 
die , to sleep (5, 6) (9, 6) 
To die , to (5, 5) (9, 5) 
, and the (7, 2) (18, 4) 
, to sleep (5, 7) (9, 7) 
die , to (5, 6) (9, 6) 
To die , (5, 5) (9, 5) 
, the (16, 3) (17, 5) 
we have (12, 1) (26, 7) 
and the (7, 3) (18, 5) 
, and (7, 2) (18, 4) 
sleep , (6, 5) (9, 9) (10, 1) 
to sleep (5, 8) (9, 8) 
die , (5, 6) (9, 6) 
To die (5, 5) (9, 5) 
to be (1, 5) (9, 1) 
be , (1, 1) (1, 6) 
all (28, 7) (34, 7) 
With (21, 0) (32, 0) 
make (20, 6) (28, 3) 
we (6, 9) (12, 1) (26, 7) (27, 5) 
That (8, 0) (14, 0) (19, 0) 
not (1, 4) (27, 7) 
No (6, 0) (25, 0) 
die (5, 6) (9, 6) 
And (5, 0) (26, 0) (29, 0) (31, 0) (33, 0) 
a (4, 5) (6, 4) (8, 6) (21, 1) (22, 5) 
, (1, 2) (1, 7) (3, 7) (4, 9) (5, 7) (6, 6) (7, 2) (9, 7) (9, 10) (10, 2) (10, 7) (10, 11) (11, 6) (11, 11) (12, 8) (15, 10) (16, 3) (16, 8) (17, 5) (17, 9) (18, 4) (19, 7) (21, 9) (22, 8) (23, 8) (24, 3) (25, 3) (25, 7) (26, 9) (28, 8) (30, 3) (30, 10) (31, 7) (32, 7) (33, 10) (34, 5) 
be (1, 1) (1, 6) (9, 2) 
is (1, 9) (8, 2) 
that (1, 8) (11, 2) (23, 1) (27, 4) 
would (15, 2) (21, 5) 
in (2, 3) (11, 1) (34, 6) 
to (1, 5) (2, 6) (4, 1) (5, 8) (6, 7) (9, 1) (9, 8) (10, 4) (27, 2) 
and (3, 2) (6, 2) (7, 3) (15, 6) (18, 5) (22, 2) (31, 5) 
end (5, 3) (6, 10) 
For (11, 0) (15, 0) 
To (1, 0) (5, 5) (9, 5) (10, 0) (22, 0) 
The (3, 0) (7, 0) (17, 0) (18, 0) (24, 0) (34, 0) 
have (12, 2) (26, 8) 
death (11, 5) (23, 7) 
us (13, 2) (26, 2) (28, 6) 
the (1, 10) (2, 4) (7, 4) (10, 9) (13, 6) (15, 4) (16, 4) (17, 6) (18, 6) (19, 4) (23, 2) (25, 5) (29, 2) (30, 5) (33, 2) 
When (12, 0) (20, 0) 
this (12, 5) (32, 1) 
makes (14, 1) (26, 1) 
death , (11, 5) (23, 7) 
, to (5, 7) (6, 6) (9, 7) 
by (5, 1) (6, 3) 
life (14, 6) (22, 7) 
sleep (5, 9) (6, 5) (9, 9) (10, 1) (11, 3) 
bear (15, 3) (21, 8) (26, 4) 
of (3, 4) (4, 7) (11, 4) (14, 3) (15, 8) (17, 2) (18, 2) (19, 3) (23, 4) (27, 8) (28, 5) (29, 5) (30, 8) (31, 2) (33, 4) 
. (9, 4) (13, 4) (27, 9) (33, 6) (34, 3) 

One interesting feature we can get from indexed models is coverage information. This shows how many of the tokens in the original corpus data are covered by a particular pattern.

In [28]:
querypattern = classencoder.buildpattern("and the")

print(model.coverage(querypattern))
0.012698412698412698

Some numbers on the original corpus data can be obtained from the model:

In [29]:
print("Total amount of tokens in the corpus data:" , model.tokens() )
print("Total amount of word types in the corpus data:" , model.types() )
Total amount of tokens in the corpus data: 315
Total amount of word types in the corpus data: 180

While we already have a forward index, we can also include a reverse index in our pattern model, which allows us to look up what patterns begin at a particular location. To build a model with a reverse index, we explicitly need to instantiate an IndexedCorpus and pass it to the IndexedPatternModel constructor:

In [30]:
#Set the options
options = colibricore.PatternModelOptions(mintokens=2,maxlength=8)

#Load the corpus
corpus = colibricore.IndexedCorpus(corpusfile)

#Instantiate an empty indexed model, with the corpus as reverse index
model = colibricore.IndexedPatternModel(reverseindex=corpus)

#Train it on our corpus file (class-encoded data, not plain text)
model.train(corpusfile, options)

print("Found " , len(model), " patterns:")
Found  54  patterns:

Now that we have a model with a reverse index, we can look up what patterns from our model begin at a certain position, expressed as a (sentence,token) tuple:

In [31]:
print("Patterns at (1,5): ")
for pattern in model.getreverseindex( (1,5) ):
    print(pattern.tostring(classdecoder))
Patterns at (1,5): 
to be
to

You can also use this to easily get all patterns in a sentence that are in the model:

In [32]:
print("Patterns in first sentence")
for (sentence, token), pattern in model.getreverseindex_bysentence(1):
    print(sentence,token, " -- ", pattern.tostring(classdecoder))
Patterns in first sentence
1 0  --  To
1 1  --  be ,
1 1  --  be
1 2  --  ,
1 4  --  not
1 5  --  to be
1 5  --  to
1 6  --  be ,
1 6  --  be
1 7  --  ,
1 8  --  that
1 9  --  is
1 10  --  the

It is easy to iterate over all indices in the reverse index:

In [33]:
for ref in model.reverseindex():
    print(ref, end=" ") #ref is a (sentence,token) tuple
(1, 0) (1, 1) (1, 2) (1, 3) (1, 4) (1, 5) (1, 6) (1, 7) (1, 8) (1, 9) (1, 10) (1, 11) (2, 0) (2, 1) (2, 2) (2, 3) (2, 4) (2, 5) (2, 6) (2, 7) (3, 0) (3, 1) (3, 2) (3, 3) (3, 4) (3, 5) (3, 6) (3, 7) (4, 0) (4, 1) (4, 2) (4, 3) (4, 4) (4, 5) (4, 6) (4, 7) (4, 8) (4, 9) (5, 0) (5, 1) (5, 2) (5, 3) (5, 4) (5, 5) (5, 6) (5, 7) (5, 8) (5, 9) (6, 0) (6, 1) (6, 2) (6, 3) (6, 4) (6, 5) (6, 6) (6, 7) (6, 8) (6, 9) (6, 10) (7, 0) (7, 1) (7, 2) (7, 3) (7, 4) (7, 5) (7, 6) (7, 7) (8, 0) (8, 1) (8, 2) (8, 3) (8, 4) (8, 5) (8, 6) (8, 7) (9, 0) (9, 1) (9, 2) (9, 3) (9, 4) (9, 5) (9, 6) (9, 7) (9, 8) (9, 9) (9, 10) (10, 0) (10, 1) (10, 2) (10, 3) (10, 4) (10, 5) (10, 6) (10, 7) (10, 8) (10, 9) (10, 10) (10, 11) (11, 0) (11, 1) (11, 2) (11, 3) (11, 4) (11, 5) (11, 6) (11, 7) (11, 8) (11, 9) (11, 10) (11, 11) (12, 0) (12, 1) (12, 2) (12, 3) (12, 4) (12, 5) (12, 6) (12, 7) (12, 8) (13, 0) (13, 1) (13, 2) (13, 3) (13, 4) (13, 5) (13, 6) (13, 7) (14, 0) (14, 1) (14, 2) (14, 3) (14, 4) (14, 5) (14, 6) (14, 7) (15, 0) (15, 1) (15, 2) (15, 3) (15, 4) (15, 5) (15, 6) (15, 7) (15, 8) (15, 9) (15, 10) (16, 0) (16, 1) (16, 2) (16, 3) (16, 4) (16, 5) (16, 6) (16, 7) (16, 8) (17, 0) (17, 1) (17, 2) (17, 3) (17, 4) (17, 5) (17, 6) (17, 7) (17, 8) (17, 9) (18, 0) (18, 1) (18, 2) (18, 3) (18, 4) (18, 5) (18, 6) (18, 7) (19, 0) (19, 1) (19, 2) (19, 3) (19, 4) (19, 5) (19, 6) (19, 7) (20, 0) (20, 1) (20, 2) (20, 3) (20, 4) (20, 5) (20, 6) (21, 0) (21, 1) (21, 2) (21, 3) (21, 4) (21, 5) (21, 6) (21, 7) (21, 8) (21, 9) (22, 0) (22, 1) (22, 2) (22, 3) (22, 4) (22, 5) (22, 6) (22, 7) (22, 8) (23, 0) (23, 1) (23, 2) (23, 3) (23, 4) (23, 5) (23, 6) (23, 7) (23, 8) (24, 0) (24, 1) (24, 2) (24, 3) (24, 4) (24, 5) (24, 6) (25, 0) (25, 1) (25, 2) (25, 3) (25, 4) (25, 5) (25, 6) (25, 7) (26, 0) (26, 1) (26, 2) (26, 3) (26, 4) (26, 5) (26, 6) (26, 7) (26, 8) (26, 9) (27, 0) (27, 1) (27, 2) (27, 3) (27, 4) (27, 5) (27, 6) (27, 7) (27, 8) (27, 9) (28, 0) (28, 1) (28, 2) (28, 3) (28, 4) (28, 5) (28, 6) (28, 7) (28, 8) (29, 0) (29, 1) (29, 2) (29, 3) (29, 4) (29, 5) (29, 6) (30, 0) (30, 1) (30, 2) (30, 3) (30, 4) (30, 5) (30, 6) (30, 7) (30, 8) (30, 9) (30, 10) (31, 0) (31, 1) (31, 2) (31, 3) (31, 4) (31, 5) (31, 6) (31, 7) (32, 0) (32, 1) (32, 2) (32, 3) (32, 4) (32, 5) (32, 6) (32, 7) (33, 0) (33, 1) (33, 2) (33, 3) (33, 4) (33, 5) (33, 6) (33, 7) (33, 8) (33, 9) (33, 10) (34, 0) (34, 1) (34, 2) (34, 3) (34, 4) (34, 5) (34, 6) (34, 7) (34, 8) (34, 9) 

Alternatively, use model.items() to get the pattern as well; it will return (ref, pattern) tuples.

The reverse index, as returned by the reverseindex() method, is simply the same IndexedCorpus instance that we passed to the constructor earlier.

Skipgrams and flexgrams and relations between patterns

Skipgrams are n-grams with one or more gaps of a particular size. Flexgrams have gaps of dynamic size. Colibri Core can deal with both. Let's start with a new, and somewhat bigger, corpus, as the data in our previous example was too sparse to find any skipgrams. To that end, we will download Plato's Republic; this version is already tokenised and has one sentence per line, just as Colibri Core likes it:

In [34]:
corpusfile_plato_plaintext = TMPDIR + "republic.txt"

if PYTHON2:
    from urllib import urlopen
else:
    from urllib.request import urlopen
f = urlopen('http://lst.science.ru.nl/~proycon/republic.txt')
with open(corpusfile_plato_plaintext,'wb') as of:
    of.write(f.read())
print("Downloaded to " + corpusfile_plato_plaintext)
Downloaded to /tmp/republic.txt

Now we create a class file and class encode the corpus, but because we may later want to compare Shakespeare's Hamlet with Plato's Republic, we ensure that we use the same vocabulary. Note that it would have been better (more optimal classes, better compression) if we had built the original class encoder on both files right away, but you don't always have the luxury of foresight.

In [35]:
classfile_plato = TMPDIR + "republic.colibri.cls"
corpusfile_plato  = TMPDIR + "republic.colibri.dat"

#Build classes, re-using our classencoder from Hamlet! Let's reload it just for completion's sake
classencoder = colibricore.ClassEncoder(TMPDIR + "hamlet.colibri.cls")

#Now we will extend it by building classes on Plato's data. If we had done this earlier, 
# we could have passed a list of filenames, ensuring a more optimal encoding.
classencoder.build(corpusfile_plato_plaintext)

#Save new class file, this will be a superset of the original one.
classencoder.save(classfile_plato)

#Encode the corpus
classencoder.encodefile(corpusfile_plato_plaintext, corpusfile_plato)

#Load decoder because the old one will only handle Hamlet
classdecoder = colibricore.ClassDecoder(classfile_plato)

print("Done")
Done

Now that we have a proper class file and encoded corpus, we can build an indexed pattern model with skipgrams. Skipgrams can be extracted most efficiently using indexed models.

In [36]:
#Set the options, doskipgrams=True is the key to enabling skipgrams
options = colibricore.PatternModelOptions(mintokens=2,maxlength=8, doskipgrams=True)

#Instantiate an empty indexed model 
corpus_plato = colibricore.IndexedCorpus(corpusfile_plato)
model = colibricore.IndexedPatternModel(reverseindex=corpus_plato)

#Train it on our corpus file (class-encoded data, not plain text)
print("Training")
model.train(corpusfile_plato, options)

print("Found " , len(model), " patterns:")
Training
Found  84957  patterns:

Now how many of those patterns are skipgrams? We can find out ourselves by iterating over the patterns and checking their category.

In [37]:
skipgrams = 0
for pattern in model:
    if pattern.category() == colibricore.Category.SKIPGRAM:
        skipgrams += 1
print("Found",skipgrams," skipgrams")
        
    
Found 8568  skipgrams

However, it is much faster to do this using the built-in filter() method. It returns patterns above a certain occurrence threshold, and can additionally be constrained to a specific category, such as skipgrams, and to a specific length (third argument, not used in this cell):

In [38]:
skipgrams = 0
for pattern, occurrencecount in model.filter(0,colibricore.Category.SKIPGRAM): #the first parameter is the occurrence threshold
    skipgrams += 1
print("Found",skipgrams," skipgrams")
Found 8568  skipgrams
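
As a minimal sketch of the third argument (assuming it restricts the results to patterns of exactly that length, as described above), we could count only the skipgrams of length 3 that occur at least twice:

skiptrigrams = 0
for pattern, occurrencecount in model.filter(2, colibricore.Category.SKIPGRAM, 3): #threshold 2, skipgrams only, length 3
    skiptrigrams += 1
print("Found",skiptrigrams," skipgrams of length 3")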

Similar to filter() is the top() method, which we can use to extract the most frequent patterns; let's get the top 20 skipgrams. We still need to relay the result through a sorting function to get it in descending order:

In [39]:
for pattern, occurrencecount in sorted( model.top(20,colibricore.Category.SKIPGRAM), key=lambda x:x[1]*-1 ):
    print(pattern.tostring(classdecoder), " -- ", occurrencecount)
    
    
the {*} of  --  3392
, {*} the  --  1058
of {*} ,  --  947
the {*} ,  --  918
, {*} said  --  857
the {*} {*} the  --  805
, {*} {*} ,  --  777
the {*} of the  --  672
, {*} ,  --  562
, {*} is  --  536
, {*} {*} the  --  496
, {*} said ,  --  492
the {*} .  --  454
I {*} ,  --  445
of {*} .  --  395
the {*} and  --  391
he {*} .  --  389
of {*} and  --  388
, {*} he  --  386
of {*} {*} ,  --  379
, {*} {*} .  --  372

Each occurrence of {*} expresses a gap of exactly one word/token. We can create skipgrams from scratch using the same syntax with the class encoder; you can also use {*2*} for a gap covering exactly two words, and so on:

In [40]:
skipgram = classencoder.buildpattern("To {*} or not to {*} is the question")
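#you can also build a skipgram with a gap spanning exactly two tokens,
#using the {*2*} syntax described above (a small sketch):
skipgram2 = classencoder.buildpattern("To be {*2*} that is the question")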

The consecutive non-gap parts of a skipgram can be obtained using the parts() method. The skipgram above consists of three parts:

In [41]:
for part in skipgram.parts():
    print(part.tostring(classdecoder))
To
or not to
is the question

Because an indexed model stores all the locations at which a pattern occurs, and a reverse index allows us to fill in the gaps, we can easily obtain all n-grams of which the skipgram is an abstraction:

In [42]:
#let's pick a common skipgram from the data:
skipgram = classencoder.buildpattern("to the {*} of")

for ngram, occurrences in model.getinstances(skipgram):
    print(ngram.tostring(classdecoder), " -- occurring ", occurrences, " times" )
        
to the words of  -- occurring  2  times
to the Republic of  -- occurring  2  times
to the rest of  -- occurring  4  times
to the level of  -- occurring  2  times
to the plain of  -- occurring  2  times
to the sight of  -- occurring  3  times
to the devastation of  -- occurring  2  times
to the terms of  -- occurring  2  times
to the happiness of  -- occurring  2  times
to the authority of  -- occurring  3  times
to the tale of  -- occurring  2  times
to the eye of  -- occurring  2  times
to the voice of  -- occurring  2  times
to the importance of  -- occurring  2  times
to the question of  -- occurring  2  times
to the end of  -- occurring  3  times
to the mind of  -- occurring  2  times
to the idea of  -- occurring  8  times
to the law of  -- occurring  2  times
to the conditions of  -- occurring  3  times
to the class of  -- occurring  4  times
to the contemplation of  -- occurring  5  times
to the injury of  -- occurring  3  times
to the wants of  -- occurring  2  times
to the relation of  -- occurring  3  times
to the good of  -- occurring  5  times
to the vision of  -- occurring  2  times
to the sum of  -- occurring  4  times
to the knowledge of  -- occurring  3  times
to the number of  -- occurring  2  times
to the nature of  -- occurring  2  times

The reverse is also possible: given an n-gram, we can find what skipgrams are abstractions, or templates, of it:

In [43]:
#let's pick something that should be covered by a skipgram from the data:
ngram = classencoder.buildpattern("to the question of")

for skipgram, occurrences in model.gettemplates(ngram):
    print(skipgram.tostring(classdecoder), " -- occurring ", occurrences, " times" )    
to {*} {*} of  -- occurring  2  times
to the {*} of  -- occurring  2  times

Another trait of indexed pattern models is the ability to extract co-occurrence information using the getcooc() method. Let's see with what patterns the bigram "the law" co-occurs at least five times (the second argument specifies this threshold; using it is always more efficient than checking the returned occurrence count yourself):

In [44]:
ngram = classencoder.buildpattern("the law")

for coocngram, occurrences in sorted( model.getcooc(ngram,5), key=lambda x: x[1] *-1): #let's sort the output too
    print(coocngram.tostring(classdecoder), " -- occurring ", occurrences, " times")
        
        
,  -- occurring  99  times
the  -- occurring  97  times
and  -- occurring  75  times
of  -- occurring  72  times
to  -- occurring  45  times
in  -- occurring  34  times
is  -- occurring  34  times
.  -- occurring  32  times
a  -- occurring  30  times
the {*} of  -- occurring  28  times
;  -- occurring  24  times
, and  -- occurring  24  times
which  -- occurring  21  times
or  -- occurring  19  times
they  -- occurring  19  times
their  -- occurring  18  times
not  -- occurring  18  times
by  -- occurring  16  times
that  -- occurring  15  times
be  -- occurring  15  times
the {*} {*} the  -- occurring  14  times
of the  -- occurring  14  times
have  -- occurring  14  times
them  -- occurring  13  times
he  -- occurring  12  times
are  -- occurring  12  times
the {*} ,  -- occurring  12  times
this  -- occurring  11  times
, {*} {*} {*} ,  -- occurring  10  times
the {*} {*} {*} {*} {*} {*} the  -- occurring  10  times
, {*} the  -- occurring  9  times
we  -- occurring  9  times
; and  -- occurring  9  times
as  -- occurring  9  times
, the  -- occurring  9  times
man  -- occurring  8  times
in the  -- occurring  8  times
and {*} ,  -- occurring  8  times
and {*} {*} {*} the  -- occurring  8  times
the {*} of the  -- occurring  8  times
the {*} {*} {*} {*} ,  -- occurring  8  times
but  -- occurring  8  times
, {*} ,  -- occurring  8  times
his  -- occurring  8  times
of {*} {*} {*} {*} {*} the  -- occurring  8  times
when  -- occurring  8  times
these  -- occurring  8  times
for  -- occurring  7  times
is {*} {*} {*} the  -- occurring  7  times
to be  -- occurring  7  times
, {*} {*} {*} of  -- occurring  7  times
, {*} {*} {*} the  -- occurring  7  times
of a  -- occurring  7  times
, {*} {*} {*} {*} and  -- occurring  7  times
is not  -- occurring  7  times
And  -- occurring  7  times
, {*} {*} {*} to  -- occurring  7  times
I  -- occurring  7  times
the {*} {*} {*} and  -- occurring  6  times
the {*} {*} {*} {*} and  -- occurring  6  times
him  -- occurring  6  times
to the  -- occurring  6  times
--  -- occurring  6  times
, {*} {*} {*} {*} {*} the  -- occurring  6  times
with  -- occurring  6  times
of {*} {*} {*} {*} ,  -- occurring  6  times
of {*} and  -- occurring  6  times
the {*} {*} {*} {*} {*} {*} ,  -- occurring  6  times
, or  -- occurring  6  times
and {*} {*} {*} and  -- occurring  6  times
there  -- occurring  6  times
of {*} {*} the  -- occurring  6  times
of {*} ,  -- occurring  6  times
in {*} {*} ,  -- occurring  6  times
of these  -- occurring  6  times
of {*} {*} ,  -- occurring  6  times
by the  -- occurring  6  times
her  -- occurring  5  times
, {*} {*} to  -- occurring  5  times
father  -- occurring  5  times
who  -- occurring  5  times
in which  -- occurring  5  times
own  -- occurring  5  times
the {*} {*} of  -- occurring  5  times
what  -- occurring  5  times
all  -- occurring  5  times
they have  -- occurring  5  times
the {*} {*} {*} {*} {*} the  -- occurring  5  times
, {*} {*} {*} {*} {*} ,  -- occurring  5  times
, {*} {*} of  -- occurring  5  times
, {*} {*} and  -- occurring  5  times
no  -- occurring  5  times
of {*} {*} {*} ,  -- occurring  5  times
state  -- occurring  5  times
has  -- occurring  5  times
the {*} {*} {*} ,  -- occurring  5  times
one  -- occurring  5  times
, {*} {*} the  -- occurring  5  times
State  -- occurring  5  times
'  -- occurring  5  times
, {*} {*} {*} and  -- occurring  5  times
of {*} {*} {*} {*} and  -- occurring  5  times
is the  -- occurring  5  times
the {*} {*} {*} {*} {*} ,  -- occurring  5  times

There are also specific methods for extracting co-occurrences to the left or right of the pattern: getleftcooc() and getrightcooc(). Other relationships can be extracted in an identical fashion (a small sketch follows the list below):

  • getleftneighbours(pattern,threshold=0,category=0,size=0) -- returns the neighbours to the immediate left of a pattern (threshold, category and size are constraints which are set to 0, i.e. disabled, by default)
  • getrightneighbours(pattern,threshold=0,category=0,size=0) -- returns the neighbours to the immediate right of a pattern
  • getsubchildren(pattern,threshold=0,category=0,size=0) -- returns patterns that are a subpart of (i.e. are subsumed by) the specified pattern
  • getsubparents(pattern,threshold=0,category=0,size=0) -- the reverse of the above, returns patterns which subsume the specified pattern
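
As a minimal sketch (assuming these methods yield (pattern, occurrencecount) tuples, analogous to getcooc()), we could list the immediate right-hand neighbours of a pattern:

querypattern = classencoder.buildpattern("the law")
for neighbour, occurrences in model.getrightneighbours(querypattern): #neighbours directly to the right of "the law"
    print(neighbour.tostring(classdecoder), " -- occurring ", occurrences, " times")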

In addition to skipgrams, Colibri Core also supports flexgrams. Whereas the gaps in skipgrams are of a predefined size, in flexgrams they are by definition variable. A gap in a flexgram is represented as {**}. All of the existing functions that work on skipgrams, including the methods to extract relationships, should also work on flexgrams.

Flexgrams can be computed in two ways, but only on indexed pattern models:

  • by abstracting from the skipgrams in the model using the computeflexgrams_fromskipgrams() method.
  • by co-occurrence, based on normalised pointwise mutual information. In this case a flexgram will have only one gap. Use the computeflexgrams_fromcooc(threshold) method for this.

You have to explicitly choose one of these methods. An example of the first strategy:

In [45]:
#Set the options, doskipgrams=True is the key to enabling skipgrams
options = colibricore.PatternModelOptions(mintokens=2,maxlength=8, doskipgrams=True)

#Instantiate an empty indexed model 
corpus_plato = colibricore.IndexedCorpus(corpusfile_plato)
flexmodel = colibricore.IndexedPatternModel(reverseindex=corpus_plato)

#Train it on our corpus file (class-encoded data, not plain text)
flexmodel.train(corpusfile_plato, options)

#compute the flexgrams
found = flexmodel.computeflexgrams_fromskipgrams()

print("Found " , str(found), " flexgrams")
Found  6880  flexgrams
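
An example of the second strategy, as a minimal sketch (the threshold value is an arbitrary illustration, and we assume computeflexgrams_fromcooc() returns the number of flexgrams found, just like computeflexgrams_fromskipgrams() does):

cooc_options = colibricore.PatternModelOptions(mintokens=2,maxlength=8)

coocflexmodel = colibricore.IndexedPatternModel(reverseindex=corpus_plato)
coocflexmodel.train(corpusfile_plato, cooc_options)

#compute the flexgrams from co-occurrence, with a normalised pointwise mutual information threshold
found = coocflexmodel.computeflexgrams_fromcooc(0.7)

print("Found " , str(found), " flexgrams")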

Comparing pattern models

Pattern Models can be used in a train/test paradigm. You can create a Pattern Model on the training corpus and then generate a Pattern Model on the test corpus constrained by the training model. This allows you to test what patterns from the training corpus also occur in the test corpus, and how often. Statistics on these two differing counts can provide insight into how much the corpora differ.

We already saw the coverage metric previously, when applied to a train/test scenario it measures the number or ratio of tokens in the test corpus covered by patterns found during training. Let's perform such a comparison.

We made a Pattern Model on Plato's Republic and we have a small excerpt from Hamlet. Let's use the former as training material and the latter as test material.

When doing any kind of comparison, it is absolutely crucial that you make sure the training and test data are class encoded with the same classes. The best method for this is to build the class files for all data in advance. In the previous class encoding example we saw classencoder.build(), which is nothing more than a shortcut for calling classencoder.processcorpus() followed by classencoder.buildclasses(). To process multiple corpora, we do this ourselves:

In [46]:
classfile2 = TMPDIR + "platoandhamlet.colibri.cls"

#Instantiate class encoder
classencoder2 = colibricore.ClassEncoder()

#Build classes
classencoder2.processcorpus(corpusfile_plato_plaintext)
classencoder2.processcorpus(corpusfile_plaintext)
classencoder2.buildclasses()

#Save class file
classencoder2.save(classfile2)

print("Encoded ", len(classencoder2), " classes, well done!")
Encoded  11540  classes, well done!

It is important to realise that the Class Encoder we just built (classencoder2) is now not compatible with the earlier class encoder used for previous examples!

Often, however, you do not have all data available in advance. You may add a different test set later on, long after training. The way to make sure you have a proper class encoding is to extend your original class encoding. Rather than using the class encoder we just built, let us opt for that method, as this will keep all the classes we already had for the training data (Plato's Republic). We do this by calling the encodefile() method with two extra arguments set to True, indicating respectively that unknown words are allowed, and that unknown words are automatically added to the class encoding. If the second boolean is set to False, all unknown words would be encoded by one single class reserved for unknown words.

In [47]:
print("Class encoder has ", len(classencoder), " classes prior to extension")

testcorpusfile = TMPDIR + "hamlet_test.colibri.dat" #this will be the encoded test corpus file
classencoder.encodefile(corpusfile_plaintext, testcorpusfile, True, True)

classfile_test = TMPDIR + "platoplushamlet.colibri.cls"
classencoder.save(classfile_test)

print("Class encoder has ", len(classencoder), " classes after extension")
Class encoder has  11544  classes prior to extension
Class encoder has  11544  classes after extension

Do note that this method of encoding is not optimal; only encoding everything in one go ensures the smallest possible memory footprint.

We already created a pattern model on the training data in one of our earlier steps (called model). To create our test model, we train a constrained model on the test set; this model is constrained by the training model we made earlier and results in a new pattern model. The nomenclature may be a bit confusing at first. We simply do all this by instantiating a new model, calling the train() method, and passing the constraining model as the last argument.

In [48]:
#Set the options
options = colibricore.PatternModelOptions(mintokens=2,maxlength=8)

#Instantiate an empty indexed model 
testmodel = colibricore.IndexedPatternModel()

#Train it on our test corpus file (class-encoded data, not plain text)
testmodel.train(testcorpusfile, options, model)

Now we have a test model (effectively the intersection of an unconstrained model of the test corpus and the training model). We can see what patterns from the training corpus occur in the test corpus:

In [49]:
for pattern in testmodel:
    print(pattern.tostring(classdecoder))
When
death ,
, to
by
die
No
death
and the
of
To
end
sleep ,
And
The
For
be
in
,
a
have
sleep
the
bear
not
That
is
, and the
to sleep
, the
.
and
to be
die ,
we have
us
makes
be ,
life
this
would
, and
to
all
we
With
that
make

We can inspect the differences between the counts:

In [50]:
for pattern in testmodel:
    print(pattern.tostring(classdecoder), " ---  in training: ", model.occurrencecount(pattern), ", in test: ", testmodel.occurrencecount(pattern)   )
When  ---  in training:  78 , in test:  2
death ,  ---  in training:  14 , in test:  2
, to  ---  in training:  131 , in test:  3
by  ---  in training:  1324 , in test:  2
die  ---  in training:  13 , in test:  2
No  ---  in training:  85 , in test:  2
death  ---  in training:  60 , in test:  2
and the  ---  in training:  736 , in test:  2
of  ---  in training:  10374 , in test:  15
To  ---  in training:  88 , in test:  5
end  ---  in training:  103 , in test:  2
sleep ,  ---  in training:  3 , in test:  3
And  ---  in training:  1071 , in test:  5
The  ---  in training:  797 , in test:  6
For  ---  in training:  169 , in test:  2
be  ---  in training:  2930 , in test:  3
in  ---  in training:  4319 , in test:  3
,  ---  in training:  15352 , in test:  36
a  ---  in training:  3924 , in test:  5
have  ---  in training:  1485 , in test:  2
sleep  ---  in training:  13 , in test:  5
the  ---  in training:  14783 , in test:  15
bear  ---  in training:  30 , in test:  3
not  ---  in training:  2267 , in test:  2
That  ---  in training:  197 , in test:  3
is  ---  in training:  4619 , in test:  2
, and the  ---  in training:  310 , in test:  2
to sleep  ---  in training:  4 , in test:  2
, the  ---  in training:  551 , in test:  2
.  ---  in training:  7014 , in test:  5
and  ---  in training:  8517 , in test:  7
to be  ---  in training:  940 , in test:  2
die ,  ---  in training:  3 , in test:  2
we have  ---  in training:  138 , in test:  2
us  ---  in training:  436 , in test:  3
makes  ---  in training:  76 , in test:  2
be ,  ---  in training:  23 , in test:  2
life  ---  in training:  395 , in test:  2
this  ---  in training:  936 , in test:  2
would  ---  in training:  590 , in test:  2
, and  ---  in training:  2899 , in test:  2
to  ---  in training:  5917 , in test:  9
all  ---  in training:  867 , in test:  2
we  ---  in training:  1314 , in test:  4
With  ---  in training:  9 , in test:  2
that  ---  in training:  2830 , in test:  4
make  ---  in training:  224 , in test:  2

This isn't so informative unless we apply some normalisation, so let's get the coverage instead:

In [51]:
for pattern in testmodel:
    print(pattern.tostring(classdecoder), " ---  in training: ", model.coverage(pattern), ", in test: ", testmodel.coverage(pattern)   )
When  ---  in training:  0.0003101058733257265 , in test:  7.94148712287863e-06
death ,  ---  in training:  0.00011132005709128642 , in test:  1.588297424575726e-05
, to  ---  in training:  0.0010416376770684657 , in test:  2.382446136863589e-05
by  ---  in training:  0.005263848413887972 , in test:  7.94148712287863e-06
die  ---  in training:  5.168431222095441e-05 , in test:  7.94148712287863e-06
No  ---  in training:  0.0003379358875985481 , in test:  7.94148712287863e-06
death  ---  in training:  0.00023854297948132806 , in test:  7.94148712287863e-06
and the  ---  in training:  0.005852254429941915 , in test:  1.588297424575726e-05
of  ---  in training:  0.04124408115232162 , in test:  5.9561153421589724e-05
To  ---  in training:  0.00034986303657261447 , in test:  1.9853717807196577e-05
end  ---  in training:  0.0004094987814429465 , in test:  7.94148712287863e-06
sleep ,  ---  in training:  2.3854297948132806e-05 , in test:  2.382446136863589e-05
And  ---  in training:  0.004257992183741705 , in test:  1.9853717807196577e-05
The  ---  in training:  0.003168645910776974 , in test:  2.382446136863589e-05
For  ---  in training:  0.0006718960588724073 , in test:  7.94148712287863e-06
be  ---  in training:  0.011648848831338186 , in test:  1.1912230684317945e-05
in  ---  in training:  0.01717111880633093 , in test:  1.1912230684317945e-05
,  ---  in training:  0.06103519701662247 , in test:  0.00014294676821181534
a  ---  in training:  0.015600710858078855 , in test:  1.9853717807196577e-05
have  ---  in training:  0.005903938742162869 , in test:  7.94148712287863e-06
sleep  ---  in training:  5.168431222095441e-05 , in test:  1.9853717807196577e-05
the  ---  in training:  0.05877301442787454 , in test:  5.9561153421589724e-05
bear  ---  in training:  0.00011927148974066403 , in test:  1.1912230684317945e-05
not  ---  in training:  0.009012948908069512 , in test:  7.94148712287863e-06
That  ---  in training:  0.0007832161159636937 , in test:  1.1912230684317945e-05
is  ---  in training:  0.01836383370373757 , in test:  7.94148712287863e-06
, and the  ---  in training:  0.003697416181960585 , in test:  2.382446136863589e-05
to sleep  ---  in training:  3.1805730597510405e-05 , in test:  1.588297424575726e-05
, the  ---  in training:  0.0043812393898070585 , in test:  1.588297424575726e-05
.  ---  in training:  0.02788567430136725 , in test:  1.9853717807196577e-05
and  ---  in training:  0.03386117593737452 , in test:  2.7795204930075206e-05
to be  ---  in training:  0.0074743466904149455 , in test:  1.588297424575726e-05
die ,  ---  in training:  2.3854297948132806e-05 , in test:  1.588297424575726e-05
we have  ---  in training:  0.001097297705614109 , in test:  1.588297424575726e-05
us  ---  in training:  0.0017334123175643172 , in test:  1.1912230684317945e-05
makes  ---  in training:  0.0003021544406763489 , in test:  7.94148712287863e-06
be ,  ---  in training:  0.00018288295093568484 , in test:  1.588297424575726e-05
life  ---  in training:  0.0015704079482520763 , in test:  7.94148712287863e-06
this  ---  in training:  0.0037212704799087174 , in test:  7.94148712287863e-06
would  ---  in training:  0.0023456726315663925 , in test:  7.94148712287863e-06
, and  ---  in training:  0.023051203250545667 , in test:  1.588297424575726e-05
to  ---  in training:  0.023524313493183634 , in test:  3.5736692052953834e-05
all  ---  in training:  0.0034469460535051905 , in test:  7.94148712287863e-06
we  ---  in training:  0.005224091250641084 , in test:  1.588297424575726e-05
With  ---  in training:  3.578144692219921e-05 , in test:  7.94148712287863e-06
that  ---  in training:  0.011251277198869307 , in test:  1.588297424575726e-05
make  ---  in training:  0.0008905604567302914 , in test:  7.94148712287863e-06

Particularly the total coverage may be an interesting metric for similarity across corpora, which we can compute as follows:

In [52]:
coverage = testmodel.totaltokensingroup() / testmodel.tokens()

print(coverage)
0.0006750264054446836

To get a more traditional frequency metric for a pattern, you have to be aware that the total that is used in normalisation is impacted by the fact that the model is constrained! It will not include any unseen n-grams; for that you'd need an unconstrained model.

In [53]:
sleep = classencoder.buildpattern("to sleep")

print("Frequency in training:", model.frequency(sleep))

print("Frequency in test (constrained):", testmodel.frequency(sleep) )
print("Coverage in test (constrained):", testmodel.coverage(sleep) )

fullmodel = colibricore.IndexedPatternModel()
fullmodel.train(testcorpusfile, options)
print("Frequency in test (unconstrained):", fullmodel.frequency(sleep) )
print("Coverage in test (unconstrained):", fullmodel.coverage(sleep) )
Frequency in training: 2.1719535636328094e-05
Frequency in test (constrained): 0.08333333333333333
Coverage in test (constrained): 1.588297424575726e-05
Frequency in test (unconstrained): 0.07692307692307693
Coverage in test (unconstrained): 0.012698412698412698

Efficiently finding specific patterns in corpus data

Constrained models can be used if you want to search for a limited set of specific patterns in corpus data without the need to compute a full pattern model on the data. You create a pattern model from a pattern list, a plain text file with one pattern per line. Then you can use this model as a constraint model on your actual corpus data (the test data) and extract only occurrences of the patterns you are interested in, thereby conserving a lot of memory.

To do this most efficiently we are going to use in-place rebuilding, where we simply load the constraint model, reset any count information, and then recompute the patterns anew on the test data, telling the model to constrain on itself.

First, however, we construct a patternlist file, containing the patterns we want to extract from Plato's Republic. The pattern list can include skipgrams and flexgrams as well:

In [54]:
with open(TMPDIR + '/patternlist.txt','w', encoding='utf-8') as f:
    f.write(u'irony\n') #(the u'' syntax is so Python 2 works as well)
    f.write(u'and all\n')
    f.write(u'one part of\n')
    f.write(u'to {*} the\n') #skipgram
    f.write(u'both {**} and\n') #flexgram
    

#Load our existing class encoding that already covers Plato's Republic (and Hamlet, though we don't need it)
classencoder = colibricore.ClassEncoder(TMPDIR + "/platoplushamlet.colibri.cls")

#Encode the patternlist, adding any unseen classes to the encoder (none in this case though)
classencoder.encodefile(TMPDIR+"/patternlist.txt",TMPDIR+"/patternlist.colibri.dat",True,True)

#Save it
classencoder.save(TMPDIR+"/withpatternlist.colibri.cls")
#Load a class decoder
classdecoder = colibricore.ClassDecoder(TMPDIR+"/withpatternlist.colibri.cls")

Now we train an unindexed pattern model from our pattern list, by setting the dopatternperline option, which means our resulting model will only include exactly those patterns we specified in the list:

In [55]:
#Set the option to say we are dealing with a pattern list here
options = colibricore.PatternModelOptions(dopatternperline=True)

#Create an unindexed model for our patternlist, this will be the constraint model
patternlistmodel = colibricore.UnindexedPatternModel()
patternlistmodel.train(TMPDIR+'/patternlist.colibri.dat',options)

patternlistmodel.write(TMPDIR+"/patternlist.colibri.patternmodel")

print("Number of patterns in the model: ", len(patternlistmodel))
Number of patterns in the model:  5

The final step is to load our constraint model and train it on the test data with doreset=True, using the model as its own constraint model. We are going to build an indexed model. It is possible to load unindexed models as indexed, but since there are no indices you automatically lose all counts. As we explicitly force this anyway with doreset=True, this is of no concern. A reverse index is always required for this, so we load the corpus data into an IndexedCorpus:

In [56]:
testcorpus = colibricore.IndexedCorpus(corpusfile_plato)
testmodel = colibricore.IndexedPatternModel(TMPDIR+"/patternlist.colibri.patternmodel", reverseindex=testcorpus)

options = colibricore.PatternModelOptions(doreset=True,doskipgrams=True, mintokens=1)


testmodel.train("",options,testmodel)
#Notes:
#1 - No need to pass a filename as first parameter since we already have a testcorpus loaded     
#2 - The 3rd parameter is our constraint model, which is our own model. We thus constrain on our own pre-loaded model


print("Found " + str(len(testmodel)) + " patterns")
Found 5 patterns

Iterate over all patterns found and instantiate any skipgrams/flexgrams using getinstance():

In [57]:
for pattern, indices in testmodel.items():
    print(pattern.tostring(classdecoder), end=": ")
    for index in indices:
        if pattern.category() == colibricore.Category.NGRAM:
            print(index, end=" ")
        else:
            #let's find out the precise instance of the skipgram/flexgram
            instance = testmodel.getinstance(index, pattern)
            print(str(index) + "[" + instance.tostring(classdecoder) + "]", end=" ")
    print()
    
both {**} and: (730, 10)[both , and] (2683, 31)[both from and] 
one part of: (101, 24) (1143, 56) (7644, 31) (7662, 56) (8987, 36) 
irony: (33, 7) (183, 6) (359, 4) (836, 13) (1061, 58) (1722, 97) (2811, 31) (4303, 25) 
to {*} the: (50, 5)[to represent the] (94, 3)[to determine the] (121, 17)[to express the] (135, 26)[to bear the] (159, 3)[to raise the] (177, 23)[to escape the] (187, 4)[to continue the] (271, 34)[to be the] (303, 13)[to be the] (312, 16)[to speak the] (334, 4)[to accept the] (342, 23)[to know the] (365, 23)[to admit the] (366, 15)[to understand the] (402, 14)[to all the] (404, 9)[to extend the] (432, 18)[to consider the] (456, 23)[to them the] (456, 31)[to be the] (482, 19)[to strengthen the] (483, 27)[to us the] (514, 19)[to which the] (517, 2)[to feel the] (525, 2)[to establish the] (555, 14)[to read the] (556, 3)[to construct the] (566, 21)[to attract the] (629, 14)[to trace the] (630, 30)[to save the] (655, 21)[to identify the] (725, 4)[to play the] (744, 3)[to affirm the] (749, 14)[to corrupt the] (751, 24)[to it the] (811, 6)[to confine the] (1014, 5)[to Apollo the] (1021, 6)[to us the] (1051, 28)[to be the] (1068, 14)[to make the] (1113, 2)[to discover the] (1120, 2)[to be the] (1134, 4)[to identify the] (1193, 11)[to explain the] (1214, 23)[to acknowledge the] (1225, 7)[to Plato the] (1231, 5)[to enumerate the] (1255, 4)[to have the] (1279, 17)[to be the] (1387, 5)[to describe the] (1418, 6)[to describe the] (1431, 34)[to distinguish the] (1464, 10)[to view the] (1485, 11)[to lose the] (1550, 2)[to make the] (1557, 18)[to take the] (1561, 1)[to others the] (1630, 15)[to be the] (1682, 13)[to another the] (1771, 48)[to be the] (1839, 16)[to draw the] (1926, 3)[to combine the] (1995, 15)[to analyse the] (2083, 17)[to lead the] (2092, 36)[to pay the] (2097, 25)[to all the] (2109, 15)[to restrain the] (2161, 4)[to depress the] (2172, 14)[to lighten the] (2184, 15)[to admit the] (2184, 34)[to be the] (2217, 15)[to repel the] (2295, 4)[to be the] (2446, 12)[to strengthen the] (2451, 13)[to degrade the] (2490, 5)[to estimate the] (2514, 18)[to be the] (2532, 5)[to be the] (2619, 6)[to say the] (2675, 33)[to refuse the] (2677, 37)[to avoid the] (2733, 30)[to afford the] (2770, 15)[to enlist the] (2771, 21)[to express the] (2848, 9)[to express the] (2849, 3)[to represent the] (2856, 25)[to connect the] (2865, 5)[to refuse the] (2865, 10)[to choose the] (2927, 20)[to prepare the] (2933, 9)[to unite the] (2949, 8)[to become the] (2953, 3)[to address the] (2960, 4)[to consider the] (2976, 7)[to oppose the] (3009, 2)[to use the] (3043, 17)[to be the] (3065, 6)[to be the] (3117, 20)[to prove the] (3132, 23)[to which the] (3146, 7)[to be the] (3227, 5)[to probe the] (3232, 22)[to make the] (3261, 16)[to be the] (3282, 8)[to ravish the] (3288, 12)[to view the] (3303, 11)[to govern the] (3310, 3)[to be the] (3314, 6)[to paraphrase the] (3331, 2)[to be the] (3359, 11)[to comprehend the] (3362, 17)[to modify the] (3363, 4)[to admit the] (3381, 8)[to him the] (3389, 17)[to them the] (3395, 16)[to be the] (3444, 2)[to find the] (3445, 4)[to recognize the] (3468, 22)[to distinguish the] (3497, 17)[to impress the] (3499, 6)[to train the] (3566, 7)[to be the] (3640, 10)[to be the] (3684, 12)[to be the] (3686, 2)[to be the] (3691, 5)[to maintain the] (3706, 8)[to be the] (3742, 8)[to include the] (3842, 24)[to be the] (3961, 4)[to be the] (3996, 4)[to doubt the] (4164, 1)[to them the] (4202, 6)[to be the] (4332, 24)[to pieces the] (4557, 3)[to determine the] (4650, 10)[to be the] (4755, 10)[to take the] (4775, 4)[to exceed the] (4851, 3)[to view the] (5031, 22)[to enjoy the] (5084, 3)[to be the] (5198, 11)[to disprove the] (5223, 10)[to let the] (5348, 1)[to mention the] (5438, 6)[to support the] (5500, 
15)[to overtake the] (5987, 50)[to be the] (5987, 59)[to do the] (5996, 12)[to introduce the] (6042, 7)[to have the] (6089, 5)[to make the] (6291, 5)[to imitate the] (6343, 4)[to admit the] (6493, 9)[to discern the] (6649, 6)[to require the] (6764, 14)[to practise the] (6768, 10)[to stimulate the] (6839, 2)[to be the] (6903, 11)[to possess the] (6904, 12)[to possess the] (7232, 13)[to them the] (7256, 16)[to alter the] (7299, 6)[to make the] (7411, 34)[to take the] (7490, 7)[to be the] (7571, 9)[to be the] (7598, 4)[to be the] (7729, 9)[to be the] (8031, 3)[to tell the] (8064, 14)[to require the] (8109, 5)[to have the] (8173, 7)[to reach the] (8180, 11)[to have the] (8272, 6)[to have the] (8366, 18)[to consider the] (8498, 8)[to which the] (8503, 14)[to be the] (8586, 6)[to be the] (8588, 18)[to utter the] (8607, 7)[to be the] (8653, 4)[to mention the] (8747, 8)[to be the] (8803, 9)[to order the] (8900, 16)[to devastate the] (8934, 11)[to prove the] (8976, 13)[to show the] (9069, 18)[to distinguish the] 
and all: (43, 30) (136, 40) (445, 55) (540, 12) (970, 51) (1131, 30) (1437, 38) (1519, 9) (1573, 50) (1592, 25) (1735, 8) (1739, 15) (2102, 38) (2128, 10) (2538, 15) (2555, 12) (2942, 33) (2980, 7) (2988, 49) (3018, 9) (3622, 53) (3955, 94) (5422, 86) (5676, 29) (6299, 26) (6319, 44) (6331, 8) (6644, 101) (6674, 24) (7010, 10) (7021, 31) (7027, 107) (7048, 14) (7992, 31) (8653, 37) (9056, 24) (9374, 16) (9435, 87) (9892, 69) (10073, 7) (10861, 46) (10895, 7) (10994, 42) (11165, 54) (11332, 28) (11361, 46) (11361, 95) (11597, 7) (11730, 111) (12064, 17) (12090, 26) (12157, 13) (12623, 23) (12623, 42) (12939, 22) (12950, 58) (13058, 18) (13062, 25) (13356, 10) (13447, 24) (13488, 18) (13501, 69) (13593, 37) (13664, 7) (13779, 25)