(for Colibri Core v2.4 - revision 2016-06-10)
https://proycon.github.io/colibri-core
by Maarten van Gompel, Radboud University Nijmegen
This tutorial will show you how to work with Colibri Core's Python API, a tool for Natural Language Processing. It is assumed that you have already read the Colibri Core documentation, followed the installation instructions, and are familiar with its purpose and concepts. The documentation also provides an API reference for all the Python classes and methods. This tutorial is in the form of a Python Notebook, allowing you to participate interactively. Press shift+enter in a code field to evaluate it.
Colibri Core is written in C++ and the Python binding is written in Cython. This offers the advantage of native speed and memory efficiency, combined with the ease of a high-level pythonic interface. We will be using Python 3 conventions here, but Colibri Core and this tutorial also work with Python 2.7.
We obviously start our adventure with an import of colibricore, so make sure you installed it properly:
#These first imports are for Python 2.7 compatibility, we approximate Python 3 as much as possible
from __future__ import print_function
from sys import version
PYTHON2 = version[0] == '2'
if PYTHON2:
from io import open
import colibricore
TMPDIR = "/tmp/" #this is where we'll store intermediate files
To give us something to work with, we will take an excerpt of Shakespeare's Hamlet as our corpus text:
corpustext = """To be, or not to be, that is the question
Whether 'tis Nobler in the mind to suffer
The Slings and Arrows of outrageous Fortune,
Or to take Arms against a Sea of troubles,
And by opposing end them? To die, to sleep
No more; and by a sleep, to say we end
The Heart-ache, and the thousand Natural shocks
That Flesh is heir to? 'Tis a consummation
Devoutly to be wished. To die, to sleep,
To sleep, perchance to Dream; Aye, there's the rub,
For in that sleep of death, what dreams may come,
When we have shuffled off this mortal coil,
Must give us pause. There's the respect
That makes Calamity of so long life:
For who would bear the Whips and Scorns of time,
Th' Oppressor's wrong, the proud man's Contumely,
The pangs of despised Love, the Law’s delay,
The insolence of Office, and the Spurns
That patient merit of the unworthy takes,
When he himself might his Quietus make
With a bare Bodkin? Who would these Fardels bear,
To grunt and sweat under a weary life,
But that the dread of something after death,
The undiscovered Country, from whose bourn
No Traveler returns, Puzzles the will,
And makes us rather bear those ills we have,
Than fly to others that we know not of.
Thus Conscience does make Cowards of us all,
And thus the Native hue of Resolution
Is sicklied o'er, with the pale cast of Thought,
And enterprises of great pitch and moment,
With this regard their Currents turn awry,
And lose the name of Action. Soft you now,
The fair Ophelia. Nymph, in all thy Orisons
Be all my sins remembered"""
#first we do some very rudimentary tokenisation
# Yes, I realise this is a very stupid way ;)
corpustext = corpustext.replace(',',' ,')
corpustext = corpustext.replace('.',' .')
corpustext = corpustext.replace(':',' :')
if PYTHON2: corpustext = unicode(corpustext,'utf-8')
corpusfile_plaintext = TMPDIR + "hamlet.txt"
with open(corpusfile_plaintext,'w',encoding='utf-8') as f:
f.write(corpustext)
To work with this data in Colibri Core, we need to class encode it, assigning an integer value to each word type. Using Python, a class encoder is built as follows:
classfile = TMPDIR + "hamlet.colibri.cls"
#Instantiate class encoder
classencoder = colibricore.ClassEncoder()
#Build classes
classencoder.build(corpusfile_plaintext)
#Save class file
classencoder.save(classfile)
print("Encoded ", len(classencoder), " classes, well done!")
Now that we have a class encoder, we can encode our corpus, producing a new encoded file (which tends to be about 50% smaller than the original):
corpusfile = TMPDIR + "hamlet.colibri.dat" #this will be the encoded corpus file
classencoder.encodefile(corpusfile_plaintext, corpusfile)
To check whether that worked as planned, we will construct a Class Decoder, load our class file, and decode the corpus:
#Load class decoder from the classfile we just made
classdecoder = colibricore.ClassDecoder(classfile)
#Decode corpus data
decoded = classdecoder.decodefile(corpusfile)
#Show
print(decoded)
Now that we have a class encoder and decoder, we can toy around with the most basic units in Colibri Core: patterns. These are used for n-grams, skipgrams, flexgrams and any kind of text. You would basically use an instance of Pattern where you'd normally use a string, as Patterns are much smaller in memory. Let's build a pattern from a string using the class encoder; note that we will only be able to use words that are known to the class encoder:
#Build a pattern from a string, using the class encoder
p = classencoder.buildpattern("To be or not to be")
#To print it we need the decoder
print(p.tostring(classdecoder))
print(len(p))
Iteration over a pattern will produce all the tokens it is made up of. Note that the concept of characters is gone from patterns! As a consequence, the ability to lowercase or uppercase text is also lost.
#Iterate over the tokens in a pattern, each token will be a Pattern instance
for token in p:
print(token.tostring(classdecoder))
#Extracting subpatterns by offset
#Get first token
print(p[0].tostring(classdecoder))
#Get last token
print(p[-1].tostring(classdecoder))
#Get slice
print(p[2:4].tostring(classdecoder))
Given a pattern, we can now very easily extract all n-grams in it, one of the most common NLP tasks:
#let's get all bigrams
for ngram in p.ngrams(2):
print(ngram.tostring(classdecoder))
#or all n-grams:
for ngram in p.ngrams():
print(ngram.tostring(classdecoder))
#or particular ngrams, such as unigrams up to trigrams:
for ngram in p.ngrams(1,3):
print(ngram.tostring(classdecoder))
The in operator can be used to check whether a token or an n-gram is part of a pattern:
#token
p2 = classencoder.buildpattern("be")
print(p2 in p)
#ngram
p3 = classencoder.buildpattern("or not")
print(p3 in p)
The following snippet is here just to prove that our Pattern representation is usually smaller than a string representation, and offers a sneak peek under the hood:
from sys import version
if version[0] == '3': #This works on Python 3 only
print(bytes(p), len(bytes(p)))
print(b"To be or not to be", len(b"To be or not to be"))
len(bytes(p)) < len(b"To be or not to be")
If we want to read an entire corpus, we can use the IndexedCorpus class. We can use this, for example, if we are merely interested in moving a sliding window over our data and extracting n-grams without counting or storing them:
corpusdata = colibricore.IndexedCorpus(corpusfile) #encoded data, will be loaded into memory entirely
for sentence in corpusdata.sentences(): #will return a Pattern per sentence (generator)
for trigram in sentence.ngrams(3):
print(trigram.tostring(classdecoder), end= " | ")
Now you may be very tempted to start storing and counting n-grams this way, but don't. This method is only suitable for iterating over n-grams and quickly discarding them. Colibri Core has facilities to store and count patterns far more efficiently: pattern models, which we will discuss in the next section.
First some more about IndexedCorpus. We can also obtain any pattern using its index, a (sentence,token) tuple:
unigram = corpusdata[(2,3)]
print(unigram.tostring(classdecoder))
A slice syntax is also supported, but may never cross line/sentence boundaries. As is customary in Python, the last index is non-inclusive.
ngram = corpusdata[(2,3):(2,8)]
print(ngram.tostring(classdecoder))
The number of sentences and the length of each sentence can be extracted as follows:
sentencecount = corpusdata.sentencecount()
for i in range(1, sentencecount+1): #note the 1..+1 range, sentences are 1-indexed (whereas tokens are 0-indexed)
print("Length of sentence " + str(i) + ":", corpusdata.sentencelength(i))
You can also find specific patterns in IndexedCorpus data. However, it is usually more efficient to use a Pattern Model, as discussed in the next section.
searchpattern = classencoder.buildpattern("or not")
for (sentence,token), pattern in corpusdata.findpattern(searchpattern):
print("Pattern found at: " + str(sentence) + ":" + str(token))
You can pass an extra parameter with a sentence index to findpattern() to limit your search to one particular sentence rather than the entire corpus.
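For example, a minimal sketch that restricts the search to the first sentence (assuming the sentence index is simply passed as the second argument, as described above):
searchpattern = classencoder.buildpattern("or not")
#Only search within sentence 1 (the extra sentence-index argument is described above)
for (sentence, token), pattern in corpusdata.findpattern(searchpattern, 1):
    print("Pattern found in sentence 1 at token " + str(token))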
Now it's time to build our first pattern model on the Hamlet excerpt. We will extract all patterns occurring at least twice and with maximum length 8.
#Set the options
options = colibricore.PatternModelOptions(mintokens=2,maxlength=8)
#Instantiate an empty unindexed model
model = colibricore.UnindexedPatternModel()
#Train it on our corpus file (class-encoded data, not plain text)
model.train(corpusfile, options)
print("Found " , len(model), " patterns:")
#Let's see what patterns are in our model (the order will be 'random')
for pattern in model:
print(pattern.tostring(classdecoder), end=" | ")
Rather than just outputting the patterns, we of course also have the counts; let's output them:
#Models behave much alike to Python dictionaries:
for pattern, count in model.items():
print(pattern.tostring(classdecoder), count)
We can also query specific patterns:
querypattern = classencoder.buildpattern("sleep")
print("How much sleep?")
print(model[querypattern])
#Like dictionaries, unknown patterns will trigger a KeyError
querypattern = classencoder.buildpattern("insolence")
print("How much insolence?")
try:
print(model[querypattern])
except KeyError:
print("Nope, KeyError, no such pattern in model..")
We can check whether a pattern is in a model in the usual pythonic fashion:
if querypattern in model:
print("Insolence in model!")
else:
print("No insolence in model!")
Rather than the absolute count, we can get the frequency of a pattern relative to all patterns of the same type and class, for example the frequency of a bigram amongst all bigrams:
querypattern = classencoder.buildpattern("and the")
print(model.frequency(querypattern))
To analyse the distribution of occurrences, we can extract a histogram from our model as follows:
for occurrencecount, frequency in model.histogram():
print(occurrencecount , " occurrences by ", frequency , "patterns")
Once we have a model, we can save it to file and reload it later; loading is much faster than training:
patternmodelfile = TMPDIR + "hamlet.colibri.patternmodel"
model.write(patternmodelfile)
#and reload just to show we can:
model = colibricore.UnindexedPatternModel(patternmodelfile, options)
Unindexed models are much smaller in memory than indexed models, but their functionality is also limited. Let's take a look at indexed models. Indexed models keep a forward index to all locations in the original corpus where patterns occur. The references are 2-tuples in the form (sentence,token), where sentence is 1-indexed and token is 0-indexed.
#Set the options
options = colibricore.PatternModelOptions(mintokens=2,maxlength=8)
#Instantiate an empty indexed model
model = colibricore.IndexedPatternModel()
#Train it on our corpus file (class-encoded data, not plain text)
model.train(corpusfile, options)
print("Found " , len(model), " patterns:")
#Let's see what patterns are in our model (the order will be 'random')
for pattern, indices in model.items():
print(pattern.tostring(classdecoder),end=" ")
for index in indices:
print(index,end=" ") #(sentence,token) tuple, sentences start with 1, tokens with 0
print()
One interesting feature we can get from indexed models is coverage information: how many of the tokens in the original corpus data are covered by a particular pattern.
querypattern = classencoder.buildpattern("and the")
print(model.coverage(querypattern))
Some numbers on the original corpus data can be obtained from the model:
print("Total amount of tokens in the corpus data:" , model.tokens() )
print("Total amount of word types in the corpus data:" , model.types() )
In addition to the forward index, we can also include a reverse index in our pattern model, which then allows us to look up what patterns begin at a particular location. To build a model with a reverse index, we explicitly need to instantiate an IndexedCorpus and pass it to the IndexedPatternModel constructor:
#Set the options
options = colibricore.PatternModelOptions(mintokens=2,maxlength=8)
#Load the corpus
corpus = colibricore.IndexedCorpus(corpusfile)
#Instantiate an empty indexed model
model = colibricore.IndexedPatternModel(reverseindex=corpus)
#Train it on our corpus file (class-encoded data, not plain text)
model.train(corpusfile, options)
print("Found " , len(model), " patterns:")
Now that we have a model with a reverse index, we can see what patterns from our model begin at a certain index, expressed as a (sentence,token) tuple:
print("Patterns at (1,5): ")
for pattern in model.getreverseindex( (1,5) ):
print(pattern.tostring(classdecoder))
You can also use this to easily get all patterns in a sentence that are in the model:
print("Patterns in first sentence")
for (sentence, token), pattern in model.getreverseindex_bysentence(1):
print(sentence,token, " -- ", pattern.tostring(classdecoder))
It is easy to iterate over all indices in the reverse index:
for ref in model.reverseindex():
print(ref, end=" ") #ref is a (sentence,token) tuple
Alternatively, use model.items() to get the pattern as well; it will return (ref, pattern) tuples. The reverse index, as returned by the reverseindex() method, is just the same instance of IndexedCorpus that we passed to the constructor earlier.
Skipgrams are n-grams with one or more gaps of a particular size. Flexgrams have a gap of dynamic size. Colibri Core can deal with both. Let's start with a new, somewhat bigger, corpus, as the data in our previous example was too sparse to find any skipgrams. To that end, we will download Plato's Republic; this version is already tokenised and has one sentence per line, just as Colibri Core likes it:
corpusfile_plato_plaintext = TMPDIR + "republic.txt"
if PYTHON2:
from urllib import urlopen
else:
from urllib.request import urlopen
f = urlopen('http://lst.science.ru.nl/~proycon/republic.txt')
with open(corpusfile_plato_plaintext,'wb') as of:
of.write(f.read())
print("Downloaded to " + corpusfile_plato_plaintext)
Now we create a class file and class encode the corpus, but because we may later on want to compare Shakespeare's Hamlet with Plato's Republic, we ensure that we use the same vocabulary. Note that it would have been better (more optimal classes, better compression) if we had built the original class encoder on both files right away, but you don't always have the luxury of foresight.
classfile_plato = TMPDIR + "republic.colibri.cls"
corpusfile_plato = TMPDIR + "republic.colibri.dat"
#Build classes, re-using our classencoder from Hamlet! Let's reload it just for completeness' sake
classencoder = colibricore.ClassEncoder(TMPDIR + "hamlet.colibri.cls")
#Now we will extend it by building classes on Plato's data. If we had done this earlier,
# we could have passed a list of filenames, ensuring more optimal encoding.
classencoder.build(corpusfile_plato_plaintext)
#Save new class file, this will be a superset of the original one.
classencoder.save(classfile_plato)
#Encode the corpus
classencoder.encodefile(corpusfile_plato_plaintext, corpusfile_plato)
#Load decoder because the old one will only handle Hamlet
classdecoder = colibricore.ClassDecoder(classfile_plato)
print("Done")
Now that we have a proper class file and encoded corpus, we can build an indexed pattern model with skipgrams. Skipgrams can be computed most efficiently using indexed models.
#Set the options, doskipgrams=True is the key to enabling skipgrams
options = colibricore.PatternModelOptions(mintokens=2,maxlength=8, doskipgrams=True)
#Instantiate an empty indexed model
corpus_plato = colibricore.IndexedCorpus(corpusfile_plato)
model = colibricore.IndexedPatternModel(reverseindex=corpus_plato)
#Train it on our corpus file (class-encoded data, not plain text)
print("Training")
model.train(corpusfile_plato, options)
print("Found " , len(model), " patterns:")
Now how many of those patterns are skipgrams? We can find out ourselves by iterating over the patterns and checking their category.
skipgrams = 0
for pattern in model:
if pattern.category() == colibricore.Category.SKIPGRAM:
skipgrams += 1
print("Found",skipgrams," skipgrams")
However, it is much faster to do this using the built-in filter() method, which can also be used to filter patterns above a certain occurrence threshold; we can constrain it to a specific category such as skipgrams, and to a specific length (third argument, not used here):
skipgrams = 0
for pattern, occurrencecount in model.filter(0,colibricore.Category.SKIPGRAM): #the first parameter is the occurrence threshold
skipgrams += 1
print("Found",skipgrams," skipgrams")
Similar to filter() is the top() method, which we can use to extract the top patterns; let's get the top 20 skipgrams. We still need to relay the result through a sorting function to get it in descending order:
for pattern, occurrencecount in sorted( model.top(20,colibricore.Category.SKIPGRAM), key=lambda x:x[1]*-1 ):
print(pattern.tostring(classdecoder), " -- ", occurrencecount)
Each occurrence of {*} expresses a gap of exactly one word/token. We can create skipgrams from scratch using the same syntax with the class encoder; you can also use {*2*} for a gap covering two words, etc.:
skipgram = classencoder.buildpattern("To {*} or not to {*} is the question")
The consecutive non-gap parts of a skipgram can be obtained using the parts() method. The skipgram above consists of three parts:
for part in skipgram.parts():
print(part.tostring(classdecoder))
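The same works for skipgrams with larger gaps; a minimal sketch using the {*2*} syntax mentioned earlier:
#Build a skipgram with a single gap spanning two tokens, as described above
skipgram2 = classencoder.buildpattern("To be {*2*} to be")
#This skipgram consists of two non-gap parts
for part in skipgram2.parts():
    print(part.tostring(classdecoder))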
Because an indexed model stores all the locations at which a pattern occurs, and a reverse index allows us to fill missing gaps, we can easily obtain all n-grams of which the skipgram is an abstraction:
#let's pick a common skipgram from the data:
skipgram = classencoder.buildpattern("to the {*} of")
for ngram, occurrences in model.getinstances(skipgram):
print(ngram.tostring(classdecoder), " -- occurring ", occurrences, " times" )
The reverse is also possible: given an n-gram, we can find what skipgrams are abstractions, or templates, of it:
#let's pick something that should be covered by a skipgram from the data:
ngram = classencoder.buildpattern("to the question of")
for skipgram, occurrences in model.gettemplates(ngram):
print(skipgram.tostring(classdecoder), " -- occurring ", occurrences, " times" )
Another trait of indexed pattern models is the ability to extract co-occurrence information using the getcooc() method. Let's see with what patterns the n-gram "the law" co-occurs more than five times (the second argument specifies this threshold; using it is always more efficient than checking the returned occurrences variable):
ngram = classencoder.buildpattern("the law")
for coocngram, occurrences in sorted( model.getcooc(ngram,5), key=lambda x: x[1] *-1): #let's sort the output too
print(coocngram.tostring(classdecoder), " -- occurring ", occurrences, " times")
There are also specific methods for extracting co-occurrences left or right of the pattern: getleftcooc() and getrightcooc(). Other relationships can be extracted in an identical fashion (see the sketch after this list):
getleftneighbours(pattern, threshold=0, category=0, size=0) -- returns the neighbours to the immediate left of a pattern (threshold, category and size are constraints which are set to 0 by default)
getrightneighbours(pattern, threshold=0, category=0, size=0) -- returns the neighbours to the immediate right of a pattern
getsubchildren(pattern, threshold=0, category=0, size=0) -- returns patterns that are a subpart of (are subsumed by) the specified pattern
getsubparents(pattern, threshold=0, category=0, size=0) -- the reverse of the above, returns patterns which subsume the specified pattern
In addition to skipgrams, Colibri Core also supports flexgrams. Whereas the gaps in skipgrams are of a predefined size, in flexgrams they are by definition variable. A gap in a flexgram is represented as {**}. All of the existing functions that work on skipgrams, including the methods to extract relationships, should also work on flexgrams.
Flexgrams can be computed in two ways, but only on indexed pattern models:
From the skipgrams already in the model, using the computeflexgrams_fromskipgrams() method.
From co-occurrence information, using the computeflexgrams_fromcooc(threshold) method.
You have to explicitly choose one of these methods. An example of the first strategy (a sketch of the second follows after it):
#Set the options, doskipgrams=True is the key to enabling skipgrams
options = colibricore.PatternModelOptions(mintokens=2,maxlength=8, doskipgrams=True)
#Instantiate an empty indexed model
corpus_plato = colibricore.IndexedCorpus(corpusfile_plato)
flexmodel = colibricore.IndexedPatternModel(reverseindex=corpus_plato)
#Train it on our corpus file (class-encoded data, not plain text)
flexmodel.train(corpusfile_plato, options)
#compute the flexgrams
found = flexmodel.computeflexgrams_fromskipgrams()
print("Found " , str(found), " flexgrams")
Pattern Models can be used in a train/test paradigm. You can create a Pattern Model on the training corpus and then generate a Pattern Model on the test corpus constrained by the training model. This allows you to test what patterns from the training corpus also occur in the test corpus, and how often. Statistics on these two differing counts can provide insight into how much the corpora differ.
We already saw the coverage metric previously; when applied to a train/test scenario, it measures the number or ratio of tokens in the test corpus covered by patterns found during training. Let's perform such a comparison.
We made a Pattern Model on Plato's Republic and we have a small excerpt from Hamlet. Let's use the former as training data and the latter as test data.
When doing any kind of comparison, it is absolutely crucial that the training and test data are class encoded with the same classes. The best method for this is to build the class files for all data in advance. In the previous class encoding example we saw classencoder.build(), which is nothing more than a shortcut for calling classencoder.processcorpus() followed by classencoder.buildclasses(). To process multiple corpora, we do this ourselves:
classfile2 = TMPDIR + "platoandhamlet.colibri.cls"
#Instantiate class encoder
classencoder2 = colibricore.ClassEncoder()
#Build classes
classencoder2.processcorpus(corpusfile_plato_plaintext)
classencoder2.processcorpus(corpusfile_plaintext)
classencoder2.buildclasses()
#Save class file
classencoder2.save(classfile2)
print("Encoded ", len(classencoder2), " classes, well done!")
It is important to realise that the class encoder we just built (classencoder2) is not compatible with the earlier class encoder used for the previous examples!
Often, however, you do not have all data available in advance. You may add a different test set later on, long after training. The way to make sure you have a proper class encoding is then to extend your original class encoding. Rather than using the class encoder we just built, let us opt for that method, as this will keep all the classes we already had for the training data (Plato's Republic). We do this by calling the encodefile() method with two extra arguments set to True, indicating respectively that unknown words are allowed and that unknown words are automatically added to the class encoding. If the second boolean were set to False, all unknown words would be encoded by one single class reserved for unknown words.
print("Class encoder has ", len(classencoder), " classes prior to extension")
testcorpusfile = TMPDIR + "hamlet_test.colibri.dat" #this will be the encoded test corpus file
classencoder.encodefile(corpusfile_plaintext, testcorpusfile, True, True)
classfile_test = TMPDIR + "platoplushamlet.colibri.cls"
classencoder.save(classfile_test)
print("Class encoder has ", len(classencoder), " classes after extension")
Do note that this method of encoding is not optimal; only encoding everything in one go ensures the smallest possible memory footprint.
We already created a pattern model on the training data in one of our earlier steps (called model). To create our test model, we train a constrained model on the test set; this model is constrained by the training model we made earlier and will result in a new pattern model. The nomenclature may be a bit confusing at first. We do all this simply by instantiating a new model, calling the train() method, and passing the constraining model as the last argument.
#Set the options
options = colibricore.PatternModelOptions(mintokens=2,maxlength=8)
#Instantiate an empty indexed model
testmodel = colibricore.IndexedPatternModel()
#Train it on our test corpus file (class-encoded data, not plain text)
testmodel.train(testcorpusfile, options, model)
Now we have a test model (effectively the intersection of an unconstrained model of the test corpus and the training model). We can see what patterns from the training corpus occur in the test corpus:
for pattern in testmodel:
print(pattern.tostring(classdecoder))
We can inspect the differences between the counts:
for pattern in testmodel:
print(pattern.tostring(classdecoder), " --- in training: ", model.occurrencecount(pattern), ", in test: ", testmodel.occurrencecount(pattern) )
This isn't so informative unless we apply some normalisation, so let's get the coverage instead:
for pattern in testmodel:
print(pattern.tostring(classdecoder), " --- in training: ", model.coverage(pattern), ", in test: ", testmodel.coverage(pattern) )
Particularly the total coverage may be an interesting metric for similarity across corpora, which we can compute as follows:
coverage = testmodel.totaltokensingroup() / testmodel.tokens()
print(coverage)
To get a more traditional frequency metric for a pattern, you have to be aware that the total used in normalisation is impacted by the fact that the model is constrained! It will not include any unseen n-grams; for that you'd need an unconstrained model.
sleep = classencoder.buildpattern("to sleep")
print("Frequency in training:", model.frequency(sleep))
print("Frequency in test (constrained):", testmodel.frequency(sleep) )
print("Coverage in test (constrained):", testmodel.coverage(sleep) )
fullmodel = colibricore.IndexedPatternModel()
fullmodel.train(testcorpusfile, options)
print("Frequency in test (unconstrained):", fullmodel.frequency(sleep) )
print("Coverage in test (unconstrained):", fullmodel.coverage(sleep) )
Constrained models can be used if you want to search for a limited set of specific patterns in corpus data, without the need to compute a full pattern model on the data. You create a pattern model from a pattern list, a plain-text file with one pattern per line. Then you can use this model as a constraint model on your actual corpus data (the test data) and extract only occurrences of the patterns you are interested in, thereby conserving a lot of memory.
To do this most efficiently we are going to use in-place rebuilding: we simply load the constraint model, reset any count information, and then recompute the patterns anew on the test data, telling it to constrain on itself.
First, however, we construct a patternlist file, containing the patterns we want to extract from Plato's Republic. The pattern list can include skipgrams and flexgrams as well:
with open(TMPDIR + '/patternlist.txt','w', encoding='utf-8') as f:
f.write(u'irony\n') #(the u'' syntax is so Python 2 works as well)
f.write(u'and all\n')
f.write(u'one part of\n')
f.write(u'to {*} the\n') #skipgram
f.write(u'both {**} and\n') #flexgram
#Load our existing class encoding that already covers Plato's Republic (and Hamlet, though we don't need it)
classencoder = colibricore.ClassEncoder(TMPDIR + "/platoplushamlet.colibri.cls")
#Encode the patternlist, adding any unseen classes to the encoder (none in this case, though)
classencoder.encodefile(TMPDIR+"/patternlist.txt",TMPDIR+"/patternlist.colibri.dat",True,True)
#Save it
classencoder.save(TMPDIR+"/withpatternlist.colibri.cls")
#Load a class decoder
classdecoder = colibricore.ClassDecoder(TMPDIR+"/withpatternlist.colibri.cls")
Now we train an unindexed pattern model from our pattern list by setting the dopatternperline option, which means our resulting model will include exactly those patterns we specified in the list:
#Set the option to say we are dealing with a pattern list here
options = colibricore.PatternModelOptions(dopatternperline=True)
#Create an unindexed model for our patternlist, this will be the constraint model
patternlistmodel = colibricore.UnindexedPatternModel()
patternlistmodel.train(TMPDIR+'/patternlist.colibri.dat',options)
patternlistmodel.write(TMPDIR+"/patternlist.colibri.patternmodel")
print("Number of patterns in the model: ", len(patternlistmodel))
The final step is to load our constraint model and train it on the test data with doreset=True, using the model as its own constraint model. We are going to build an indexed model. It is possible to load unindexed models as indexed, but since there are no indices you automatically lose all counts; as we explicitly force this with doreset=True anyway, this is of no concern. A reverse index is always required for this, so we load the corpus data into an IndexedCorpus:
testcorpus = colibricore.IndexedCorpus(corpusfile_plato)
testmodel = colibricore.IndexedPatternModel(TMPDIR+"/patternlist.colibri.patternmodel", reverseindex=testcorpus)
options = colibricore.PatternModelOptions(doreset=True,doskipgrams=True, mintokens=1)
testmodel.train("",options,testmodel)
#Notes:
#1 - No need to pass a filename as first parameter since we already have a testcorpus loaded
#2 - The 3rd parameter is our constraint model, which is our own model. We thus constrain on our own pre-loaded model
print("Found " + str(len(testmodel)) + " patterns")
Iterate over all patterns found and instantiate any skipgrams/flexgrams using getinstance():
for pattern, indices in testmodel.items():
print(pattern.tostring(classdecoder), end=": ")
for index in indices:
if pattern.category() == colibricore.Category.NGRAM:
print(index, end=" ")
else:
#let's find out the precise instance of the skipgram/flexgram
instance = testmodel.getinstance(index, pattern)
print(str(index) + "[" + instance.tostring(classdecoder) + "]", end=" ")
print()