Centre for Language and Speech Technology

FoLiA Format for Linguistic Annotation

“A practical XML-based Format for Linguistic Annotation”

About FoLiA

FoLiA is an XML-based annotation format, suitable for the representation of linguistically annotated language resources. FoLiA’s intended use is as a format for storing and/or exchanging language resources, including corpora. Our aim is to introduce a single rich format that can accommodate a wide variety of linguistic annotation types through a single generalised paradigm. We do not commit to any label set, language or linguistic theory. These choices are always left to the developer of the language resource, which provides maximum flexibility.

XML is an inherently hierarchical format. FoLiA does justice to this by making maximal use of a hierarchical, inline setup.

Our aim with FoLiA is not to introduce yet another format, but to build a rich and practical infrastructure around this format. This includes tools, programming libraries, converters, visualisations and annotation environments.

Features

The FoLiA format makes mixed use of inline and stand-off annotation. Inline annotation is used for annotations pertaining to single tokens, whilst stand-off annotation in separate annotation layers is adopted for annotation types that span multiple tokens. This provides FoLiA with the necessary flexibility and extensibility to deal with various kinds of annotations.
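The distinction between inline and stand-off annotation can be illustrated with a small sketch. The fragment below is a simplified, FoLiA-like example and not a valid FoLiA document: the namespace, metadata and annotation declarations are omitted, plain `id` attributes stand in for `xml:id`, and the labels (`DET`, `N`, `V`, `loc`) belong to a hypothetical label set. The helper `read_annotations` is likewise hypothetical and uses only Python's standard library; consult the FoLiA documentation for the actual element inventory.

```python
import xml.etree.ElementTree as ET

# Simplified, FoLiA-like fragment (assumption: NOT a valid FoLiA document;
# namespace and declarations are omitted, and the label set is made up).
FRAGMENT = """
<s id="s.1">
  <w id="s.1.w.1"><t>The</t><pos class="DET"/></w>
  <w id="s.1.w.2"><t>Hague</t><pos class="N"/></w>
  <w id="s.1.w.3"><t>sleeps</t><pos class="V"/></w>
  <entities>
    <entity class="loc">
      <wref id="s.1.w.1"/>
      <wref id="s.1.w.2"/>
    </entity>
  </entities>
</s>
"""


def read_annotations(xml_text):
    """Return (inline, spans) for the simplified fragment above.

    inline: list of (token text, part-of-speech class), read from inside
            each token element (inline annotation).
    spans:  list of (entity class, covered text), read from a separate
            layer that points back to tokens by id (stand-off annotation).
    """
    root = ET.fromstring(xml_text.strip())

    # Index tokens by id so the stand-off layer can resolve its references.
    tokens = {w.get("id"): w for w in root.iter("w")}

    # Inline: the annotation is a child of the token it describes.
    inline = [(w.findtext("t"), w.find("pos").get("class"))
              for w in tokens.values()]

    # Stand-off: the annotation lives in its own layer and refers to tokens.
    spans = []
    for entity in root.iter("entity"):
        words = [tokens[ref.get("id")].findtext("t")
                 for ref in entity.iter("wref")]
        spans.append((entity.get("class"), " ".join(words)))

    return inline, spans


inline, spans = read_annotations(FRAGMENT)
print(inline)
print(spans)
```

In this sketch the part-of-speech tag sits inside the token it describes, whereas the entity spanning two tokens is stored in a separate layer and refers to those tokens by identifier, so the token elements themselves stay unchanged when span annotations are added.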

FoLiA Paradigm

[Figure: the FoLiA paradigm]

Resources

The documentation, validation schema and other resources for the latest FoLiA version can be found below. Consult the FoLiA GitHub repository for all available resources.

Two software libraries are available for working with the FoLiA format from within your own scripts and applications. Make sure to check back regularly for updates, as both are still being actively developed.

A web-based annotation tool is available, allowing creation and editing of FoLiA documents:

For additional support, use our issue tracker or mail lamasoftware@science.ru.nl.

Publications

If you make use of FoLiA in your work, please cite one or more of the following publications:

Posters & Presentations

FoLiA is currently used in a wide variety of projects in the Dutch and Flemish Natural Language Processing community. The largest Dutch corpus, SoNaR, is delivered in FoLiA format, and various CLARIN projects make use of it as well. Support for FoLiA is integrated into various software projects, including ucto, Frog and Valkuil.net.

FoLiA was developed by Maarten van Gompel, initially at Tilburg University and now at Radboud University Nijmegen, with input from Antal van den Bosch, Ko van der Sloot, Martin Reynaert and many other people in the academic community.

FoLiA is open source and all technical resources are licensed under the GNU General Public License v3.
