Colibri Core
ClassEncoder Class Reference

Class for encoding plain-text to binary class-encoded data.

#include <classencoder.h>

Public Types

typedef std::unordered_map< std::string, unsigned int >::const_iterator const_iterator
 

Public Member Functions

 ClassEncoder (const unsigned int minlength=0, const unsigned int maxlength=0)
 
 ClassEncoder (const std::string &filename, const unsigned int minlength=0, const unsigned int maxlength=0)
 
void load (const std::string &filename, const unsigned int minlength=0, const unsigned int maxlength=0)
 
void build (const std::string &filename, unsigned int threshold=0)
 
void build (std::vector< std::string > &files, bool quiet=false, unsigned int threshold=0)
 
void buildclasses (const std::unordered_map< std::string, unsigned int > &freqlist, unsigned int threshold=0)
 
void processcorpus (const std::string &filename, std::unordered_map< std::string, unsigned int > &freqlist)
 
void processcorpus (std::istream *in, std::unordered_map< std::string, unsigned int > &freqlist)
 
int outputlength (const std::string &line)
 
int encodestring (const std::string &line, unsigned char *outputbuffer, bool allowunknown, bool autoaddunknown=false)
 
void encodefile (const std::string &inputfilename, const std::string &outputfilename, bool allowunknown, bool autoaddunknown=false, bool append=false, bool quiet=false)
 
void encodefile (std::istream *IN, std::ostream *OUT, bool allowunknown, bool autoaddunknown, bool quiet=false, bool append=false)
 
std::vector< unsigned int > encodeseq (const std::vector< std::string > &seq)
 
Pattern buildpattern (const std::string &patternstring, bool allowunknown=false, bool autoaddunknown=false)
 
Pattern buildpattern_safe (const std::string &patternstring, bool allowunknown=false, bool autoaddunknown=false)
 
void add (const std::string &, const unsigned int cls)
 
unsigned int gethighestclass ()
 
void save (const std::string &filename)
 
int size () const
 
unsigned int operator[] (const std::string &key)
 
const_iterator begin () const
 
const_iterator end () const
 

Public Attributes

std::unordered_map< unsigned int, std::string > added
 

Static Public Attributes

static const unsigned char delimiterclass = 0
 
static const unsigned char unknownclass = 2
 
static const unsigned char skipclass = 3
 
static const unsigned char flexclass = 4
 

Detailed Description

Class for encoding plain-text to binary class-encoded data. The ClassEncoder maintains a mapping of words to classes (integers). It allows a corpus to be losslessly compressed by replacing words with their classes. Classes are assigned by word frequency: frequent words receive a lower class number, which can be represented in fewer bytes, and rare words receive a higher class number.

Member Typedef Documentation

typedef std::unordered_map<std::string, unsigned int>::const_iterator ClassEncoder::const_iterator

Constructor & Destructor Documentation

ClassEncoder::ClassEncoder ( const unsigned int  minlength = 0,
const unsigned int  maxlength = 0 
)

Constructor for an empty ClassEncoder

Parameters
minlength: Minimum supported length of words (default: 0)
maxlength: Maximum supported length of words (default: 0 = unlimited)
ClassEncoder::ClassEncoder ( const std::string &  filename,
const unsigned int  minlength = 0,
const unsigned int  maxlength = 0 
)

Constructor for a ClassEncoder read from file

Parameters
filename: The filename (*.colibri.cls)
minlength: Minimum supported length of words (default: 0)
maxlength: Maximum supported length of words (default: 0 = unlimited)

Member Function Documentation

void ClassEncoder::add ( const std::string &  s,
const unsigned int  cls 
)

Add the word with the specified class to the class encoding

const_iterator ClassEncoder::begin ( ) const
inline
void ClassEncoder::build ( const std::string &  filename,
unsigned int  threshold = 0 
)

Build a class encoding from a plain-text corpus

Parameters
filename: A plain-text corpus with the units of interest (e.g. sentences) each on one line
threshold: Occurrence threshold; words occurring less often are pruned (default: 0)
void ClassEncoder::build ( std::vector< std::string > &  files,
bool  quiet = false,
unsigned int  threshold = 0 
)

Build a class encoding from multiple plain-text corpus files

Parameters
files: A list of plain-text corpus files with the units of interest (e.g. sentences) each on one line
quiet: If true, do not output progress to stderr (default: false)
threshold: Occurrence threshold; words occurring less often are pruned (default: 0)
void ClassEncoder::buildclasses ( const std::unordered_map< std::string, unsigned int > &  freqlist,
unsigned int  threshold = 0 
)

Assign classes based on the computed frequency list. This method should only be called once.

Parameters
freqlist: The data structure that contains the frequency list (word to occurrence count); it is passed as a const reference and only read
threshold: Occurrence threshold; words occurring less often are pruned (default: 0)
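
The frequency-based assignment described here can be sketched in standalone C++. This is a minimal illustration, not Colibri Core's implementation: the function name `assign_classes`, the starting class number `first_class = 5` (chosen only to stay clear of the reserved classes listed above), the inclusive threshold comparison, and the alphabetical tie-break are all assumptions.

```cpp
#include <algorithm>
#include <string>
#include <unordered_map>
#include <utility>
#include <vector>

// Hypothetical stand-in for frequency-based class assignment.
// first_class is an assumed starting point above the reserved classes.
std::unordered_map<std::string, unsigned int> assign_classes(
    const std::unordered_map<std::string, unsigned int>& freqlist,
    unsigned int threshold = 0,
    unsigned int first_class = 5) {
    // Keep only the words that meet the occurrence threshold (assumed inclusive).
    std::vector<std::pair<std::string, unsigned int>> kept;
    for (const auto& entry : freqlist) {
        if (entry.second >= threshold) kept.emplace_back(entry.first, entry.second);
    }
    // Most frequent first; ties are broken alphabetically so the result is deterministic.
    std::sort(kept.begin(), kept.end(),
              [](const std::pair<std::string, unsigned int>& a,
                 const std::pair<std::string, unsigned int>& b) {
                  if (a.second != b.second) return a.second > b.second;
                  return a.first < b.first;
              });
    // Hand out consecutive class numbers in that order.
    std::unordered_map<std::string, unsigned int> classes;
    unsigned int next = first_class;
    for (const auto& entry : kept) classes[entry.first] = next++;
    return classes;
}
```

With the counts the=10, cat=3, sat=3, zebra=1 and a threshold of 2, this sketch assigns 5 to "the", 6 to "cat", 7 to "sat", and drops "zebra".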
Pattern ClassEncoder::buildpattern ( const std::string &  patternstring,
bool  allowunknown = false,
bool  autoaddunknown = false 
)

Build a pattern from a string. Note: This function is not thread-safe! Use buildpattern_safe() instead if you need thread safety!

Parameters
patternstring: The string you want to turn into a Pattern
allowunknown: If the string contains unknown words, represent those using a single unknown class. If set to false, an exception will be raised when unknown words are present. (default: false)
autoaddunknown: If the string contains unknown words, automatically add these words to the class encoding. Note that the class encoding will no longer be optimal if this is used. (default: false)
Returns
a Pattern
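
The three ways of handling unknown words (raise an exception, map to the unknown class, or add the word on the fly) can be modelled with a toy encoder. This is a sketch under stated assumptions, not the library's code: `ToyEncoder`, its members, and `kUnknownClass` are hypothetical names, the value 2 mirrors `unknownclass` from this reference, and the precedence of `autoaddunknown` over `allowunknown` when both are set is assumed.

```cpp
#include <sstream>
#include <stdexcept>
#include <string>
#include <unordered_map>
#include <vector>

// Mirrors the documented value of ClassEncoder::unknownclass.
const unsigned int kUnknownClass = 2;

// Toy model of the documented unknown-word policies; all names are hypothetical.
struct ToyEncoder {
    std::unordered_map<std::string, unsigned int> classes;
    unsigned int highest = 4;  // assumed: highest class handed out so far

    std::vector<unsigned int> encode(const std::string& line,
                                     bool allowunknown = false,
                                     bool autoaddunknown = false) {
        std::vector<unsigned int> out;
        std::istringstream words(line);
        std::string word;
        while (words >> word) {
            auto it = classes.find(word);
            if (it != classes.end()) {
                out.push_back(it->second);          // known word
            } else if (autoaddunknown) {
                classes[word] = ++highest;          // extend the encoding in place
                out.push_back(highest);
            } else if (allowunknown) {
                out.push_back(kUnknownClass);       // collapse to the unknown class
            } else {
                throw std::runtime_error("unknown word: " + word);
            }
        }
        return out;
    }
};
```

Words added on the fly receive the next free number regardless of how often they occur, which is why the documentation warns that the encoding is no longer optimal after auto-adding.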
Pattern ClassEncoder::buildpattern_safe ( const std::string &  patternstring,
bool  allowunknown = false,
bool  autoaddunknown = false 
)

Build a pattern from a string (thread-safe variant, slightly slower due to buffer allocation)

Parameters
patternstring: The string you want to turn into a Pattern
allowunknown: If the string contains unknown words, represent those using a single unknown class. If set to false, an exception will be raised when unknown words are present. (default: false)
autoaddunknown: If the string contains unknown words, automatically add these words to the class encoding. Note that the class encoding will no longer be optimal if this is used. (default: false)
Returns
a Pattern
void ClassEncoder::encodefile ( const std::string &  inputfilename,
const std::string &  outputfilename,
bool  allowunknown,
bool  autoaddunknown = false,
bool  append = false,
bool  quiet = false 
)

Create a class-encoded corpus file from a plain-text corpus file. Each of the units of interest (e.g. sentences) should occupy a single line (i.e., newline-delimited).

Parameters
inputfilename: Filename of the input file, a plain-text corpus file
outputfilename: Filename of the output file (binary class-encoded corpus file, *.colibri.dat)
allowunknown: If the input contains unknown words, represent those using a single unknown class. If set to false, an exception will be raised when unknown words are present.
autoaddunknown: If the input contains unknown words, automatically add these words to the class encoding. Note that the class encoding will no longer be optimal if this is used. (default: false)
append: Set to true if this is not the first file to write to the stream (default: false)
quiet: Set to true to suppress any output (default: false)
void ClassEncoder::encodefile ( std::istream *  IN,
std::ostream *  OUT,
bool  allowunknown,
bool  autoaddunknown,
bool  quiet = false,
bool  append = false 
)

Create a class-encoded corpus file from a plain-text corpus file. Each of the units of interest (e.g. sentences) should occupy a single line (i.e., newline-delimited).

Parameters
IN: Input stream of a plain-text corpus file
OUT: Output stream of a binary class-encoded corpus file (*.colibri.dat)
allowunknown: If the input contains unknown words, represent those using a single unknown class. If set to false, an exception will be raised when unknown words are present.
autoaddunknown: If the input contains unknown words, automatically add these words to the class encoding. Note that the class encoding will no longer be optimal if this is used.
quiet: Set to true to suppress any output (default: false)
append: Set to true if this is not the first file to write to the stream (default: false)
std::vector< unsigned int > ClassEncoder::encodeseq ( const std::vector< std::string > &  seq)
int ClassEncoder::encodestring ( const std::string &  line,
unsigned char *  outputbuffer,
bool  allowunknown,
bool  autoaddunknown = false 
)

Low-level function to encode a string of words as a binary representation of classes

Parameters
line: The string of words you want to encode
outputbuffer: Pointer to the output buffer, must be pre-allocated and have enough space
allowunknown: If the string contains unknown words, represent those using a single unknown class. If set to false, an exception will be raised when unknown words are present.
autoaddunknown: If the string contains unknown words, automatically add these words to the class encoding. Note that the class encoding will no longer be optimal if this is used. (default: false)
Returns
The number of bytes written to outputbuffer
const_iterator ClassEncoder::end ( ) const
inline
unsigned int ClassEncoder::gethighestclass ( )
inline

Returns the highest assigned class in the class encoding

void ClassEncoder::load ( const std::string &  filename,
const unsigned int  minlength = 0,
const unsigned int  maxlength = 0 
)

Load a class encoding from file

Parameters
filename: The filename (*.colibri.cls)
minlength: Minimum supported length of words (default: 0)
maxlength: Maximum supported length of words (default: 0 = unlimited)
unsigned int ClassEncoder::operator[] ( const std::string &  key)
inline

Return the class for the given word

int ClassEncoder::outputlength ( const std::string &  line)

Computes how many bytes the class representation for this input line would take

void ClassEncoder::processcorpus ( const std::string &  filename,
std::unordered_map< std::string, unsigned int > &  freqlist 
)

Count word frequency in a given plain-text corpus.

Parameters
filename: The corpus file
freqlist: The resulting frequency list, should be shared between multiple calls to processcorpus()
void ClassEncoder::processcorpus ( std::istream *  in,
std::unordered_map< std::string, unsigned int > &  freqlist 
)

Count word frequency in a given plain-text corpus.

Parameters
in: The input stream
freqlist: The resulting frequency list, should be shared between multiple calls to processcorpus()
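
Sharing one frequency list across several calls, as recommended for processcorpus(), simply means the counts accumulate in the same map. A minimal standalone sketch of that idea follows; `count_words` is a hypothetical stand-in, and splitting on whitespace is an assumption about tokenisation rather than documented behaviour.

```cpp
#include <istream>
#include <sstream>
#include <string>
#include <unordered_map>

// Hypothetical stand-in for frequency counting over a line-based corpus.
// The caller owns freqlist, so counts accumulate across calls.
void count_words(std::istream& in,
                 std::unordered_map<std::string, unsigned int>& freqlist) {
    std::string line;
    while (std::getline(in, line)) {   // one unit of interest per line
        std::istringstream words(line);
        std::string word;
        while (words >> word) {        // assumed: whitespace-separated tokens
            ++freqlist[word];          // operator[] value-initialises new entries to 0
        }
    }
}
```

Calling `count_words` once per corpus stream with the same map gives corpus-wide totals, which is the form of input buildclasses() expects.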
void ClassEncoder::save ( const std::string &  filename)

Save the class encoding to file

int ClassEncoder::size ( ) const
inline

Returns the number of classes, i.e. word types

Member Data Documentation

std::unordered_map<unsigned int, std::string> ClassEncoder::added
const unsigned char ClassEncoder::delimiterclass = 0
static
const unsigned char ClassEncoder::flexclass = 4
static
const unsigned char ClassEncoder::skipclass = 3
static
const unsigned char ClassEncoder::unknownclass = 2
static
