Maarten van Gompel

Maarten van Gompel (proycon) BSc MA

PhD Candidate & Scientific Programmer, Centre for Language Studies & Centre for Language and Speech Technology, since Sept. 1, 2011
E4.06
proycon@anaproy.nl
http://proycon.anaproy.nl/
proycon
computer aided language learning, machine translation, multilingualism, word sense disambiguation, language resource formatting and infrastructure

Research Projects

Colibri: Constructions as Linguistic Bridges

Sept. 1, 2011 -- Sept. 1, 2016 Maarten van Gompel

Research into the modelling of source-side context in Machine Translation

Dream research

June 2, 2014 -- Antal van den Bosch, Maarten van Gompel, Florian Kunneman, Ali Hürriyetoğlu, Folgert Karsdorp, Iris Hendrickx, Martin Reynaert, Wessel Stoop, Louis Onrust

Dreams, the involuntary perceptions that occur in our minds during sleep, have been the topic of studies in many fields of research, including psychiatry, psychology, neurobiology, and religious studies. Their narrative content also links dreams to other forms of storytelling, with sharp distinctions (such as the focus on one's personal life and the typical personal perspective) but also interesting overlaps with genres such as orally transmitted folktales. We present a study aimed at the large-scale analysis of dreams using text analytics.

Publications

M. Reynaert, M. van Gompel, K. van der Sloot, and A. van den Bosch
PICCL: Philosophical Integrator of Computational and Corpus Libraries
Proceedings of CLARIN Annual Conference 2015 -- Book of Abstracts, CLARIN ERIC, 2015

M. van Gompel and A. van den Bosch
Translation Assistance by Translation of L1 Fragments in an L2 Context
Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Association for Computational Linguistics, 2014

H. P. Maat, R. Kraf, A. van den Bosch, N. Dekker, M. van Gompel, S. Kleijn, T. Sanders, and K. van der Sloot
T-Scan: a new tool for analyzing Dutch text
Computational Linguistics in the Netherlands Journal, 4, 2014

M. van Gompel and M. Reynaert
CLAM: Quickly deploy NLP command-line tools on the web
Proceedings of COLING 2014, the 25th International Conference on Computational Linguistics, 2014

M. van Gompel
CLAM: Computational Linguistics Application Mediator
Language and Speech Technology Technical Report Series 14-02, Radboud University Nijmegen, 2014

M. van Gompel
FoLiA: Format for Linguistic Annotation. Documentation
Language and Speech Technology Technical Report Series 14-01, Radboud University Nijmegen, 2014

M. van Gompel, A. van den Bosch, and A. Dykstra
Oersetter: Frisian-Dutch statistical machine translation
Philologia Frisica anno 2012, 2014

M. van Gompel, I. Hendrickx, A. van den Bosch, E. Lefever, and V. Hoste
Semeval-2014 Task 5: L2 writing assistant
Proceedings of the 8th International Workshop on Semantic Evaluation (SemEval 2014), 2014

M. van Gompel and M. Reynaert
FoLiA: A practical XML Format for Linguistic Annotation - a descriptive and comparative study
Computational Linguistics in the Netherlands Journal, 3, 2013

M. van Gompel and A. van den Bosch
WSD2: parameter optimisation for memory-based cross-lingual word-sense disambiguation
Proceedings of the 7th International Workshop on Semantic Evaluation (SemEval 2013), in conjunction with the Second Joint Conference on Lexical and Computational Semantics, New Brunswick, NJ: Association for Computational Linguistics, 2013

M. Reynaert, I. Schuurman, V. Hoste, N. Oostdijk, and M. van Gompel
Beyond SoNaR: towards the facilitation of large corpus building efforts
Proceedings of the Eighth International conference on Language Resources and Evaluation (LREC), 2012

P. Vossen, A. Görög, F. Laan, M. van Gompel, R. Izquierdo-Bevia, and A. van den Bosch
DutchSemCor: building a semantically annotated corpus for Dutch
Electronic lexicography in the 21st century: New Applications for New Users: Proceedings of eLex 2011, Bled, 10-12 November 2011, 2011

M. van Gompel
UvT-WSD1: A cross-lingual word sense disambiguation system
Proceedings of the 5th international workshop on semantic evaluation, 2010

M. van Gompel, A. van den Bosch, and P. Berck
Extending memory-based machine translation to phrases
Proceedings of the Third Workshop on Example-Based Machine Translation, 2009

Software

CLAM

by Maarten van Gompel https://proycon.github.io/clam

CLAM allows you to quickly and transparently transform your Natural Language Processing application into a RESTful webservice, with which both human end-users and automated clients can interact.

Colibri Core

by Maarten van Gompel https://proycon.github.io/colibri-core

Colibri Core is software consisting of command-line tools as well as programming libraries. It is designed to quickly and efficiently count and extract patterns from large corpus data, to extract various statistics on the extracted patterns, and to compute relations between the extracted patterns.
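
As a rough illustration of what counting patterns and computing relations between them can mean, the following Python sketch counts n-grams in a token list and finds which larger patterns contain a given pattern. It is not Colibri Core's API or implementation; the function names, the `max_n` and `threshold` parameters, and the example sentence are assumptions made for this sketch only.

```python
from collections import Counter


def count_ngrams(tokens, max_n=3, threshold=2):
    """Count all n-grams up to length max_n and keep those
    that occur at least `threshold` times (hypothetical helper)."""
    counts = Counter()
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            counts[tuple(tokens[i:i + n])] += 1
    return {pattern: c for pattern, c in counts.items() if c >= threshold}


def subsumed_by(pattern, patterns):
    """Return the larger patterns that contain `pattern`
    as a contiguous sub-sequence (one simple kind of relation)."""
    n = len(pattern)
    return [p for p in patterns
            if len(p) > n
            and any(p[i:i + n] == pattern for i in range(len(p) - n + 1))]


tokens = "to be or not to be".split()
patterns = count_ngrams(tokens)
parents = subsumed_by(("to",), patterns)
```

Here `patterns` holds the frequent n-grams with their counts, and `parents` lists the frequent patterns that extend the unigram `to`. A real tool for large corpora would need far more compact data structures than Python tuples in a dictionary.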

Colibri MT

by Maarten van Gompel https://github.com/proycon/colibri-mt

A Machine Translation framework that wraps around the Moses decoder and enables k-NN classifier techniques to be used for modelling source-side context.

Colibrita

by Maarten van Gompel https://github.com/proycon/colibrita

Colibrita is a proof-of-concept translation assistance system, translating L1 fragments in an L2 context, using machine learning and statistical machine translation techniques.

FLAT: FoLiA Linguistic Annotation Tool

by Maarten van Gompel https://github.com/proycon/flat

FLAT is a web-based linguistic annotation environment based around the FoLiA format (http://proycon.github.io/folia), a rich XML-based format for linguistic annotation. FLAT allows users to view annotated FoLiA documents and enrich these documents with new annotations. A wide variety of linguistic annotation types is supported through the FoLiA paradigm.

FoLiA: Format for Linguistic Annotation

by Maarten van Gompel https://proycon.github.io/folia

FoLiA is an XML-based annotation format, suitable for the representation of linguistically annotated language resources. FoLiA’s intended use is as a format for storing and/or exchanging language resources, including corpora.

Frog

by Antal van den Bosch, Maarten van Gompel, Ko van der Sloot https://languagemachines.github.io/frog

Frog is an integration of memory-based natural language processing (NLP) modules developed for Dutch. All NLP modules are based on Timbl, the Tilburg memory-based learning software package. Most modules were created in the 1990s at the ILK Research Group (Tilburg University, the Netherlands) and the CLiPS Research Centre (University of Antwerp, Belgium). Over the years they have been integrated into a single text processing tool, which is currently maintained and developed by the Language Machines Research Group and the Centre for Language and Speech Technology at Radboud University Nijmegen. A dependency parser, a base phrase chunker, and a named-entity recognizer module were added more recently. Where possible, Frog makes use of multi-processor support to run subtasks in parallel.

Gecco

by Maarten van Gompel https://github.com/proycon/gecco

Gecco is a generic, modular, and distributed framework for spelling correction. It aims to let you build a complete context-aware spelling correction system from your own data set. Most modules will be language-independent and trainable from a source corpus; training is explicitly included in the framework. The framework aims to be easily extensible, and modules can be written in Python 3. Moreover, the framework is scalable and can be distributed over multiple servers. Given an input text, Gecco will add various suggestions for correction. The system can be invoked from the command line, as a Python binding, as a RESTful webservice, or through the web application (two interfaces).

LaMachine

by Maarten van Gompel https://proycon.github.io/LaMachine

LaMachine is not a single tool but a distribution of almost all our software, bundled in three different ways to facilitate use on a wide variety of systems. LaMachine can be used as a Virtual Machine (the easiest option, allowing you to run our software on any host OS), as a Docker application, or as a compilation/installation script in a virtual environment. It contains software such as TiMBL, Ucto, Frog, Colibri Core and all the Python bindings.

Oersetter

by Maarten van Gompel http://oersetter.nl/

Oersetter is a Frisian-Dutch, Dutch-Frisian Machine Translation system developed in collaboration with the Fryske Akademy.

PyNLPl: Python Natural Language Processing Library

by Maarten van Gompel https://github.com/proycon/pynlpl/

PyNLPl, pronounced as "pineapple", is a Python (2 & 3) library for Natural Language Processing. It contains various modules useful for common, and less common, NLP tasks.

T-scan: Tekst Complexiteits Analyse voor het Nederlands

by Maarten van Gompel, Ko van der Sloot, Rogier Kraf, Martijn van der Klis https://github.com/proycon/tscan/

T-scan is an analysis tool for Dutch texts that assesses the complexity of a text. It is based on original work by Rogier Kraf (Utrecht University) [See: Kraf et al., 2009]. The code has been reimplemented and extended by Ko van der Sloot (Tilburg University), and is currently maintained and continued by Martijn van der Klis (Utrecht University).

TiMBL: Tilburg Memory-Based Learner

by Antal van den Bosch, Maarten van Gompel, Ko van der Sloot, Walter Daelemans, Jakub Zavrel https://languagemachines.github.io/timbl

TiMBL is an open source software package implementing several memory-based learning algorithms, among which IB1-IG, an implementation of k-nearest neighbor classification with feature weighting suitable for symbolic feature spaces, and IGTree, a decision-tree approximation of IB1-IG. All implemented algorithms have in common that they store some representation of the training set explicitly in memory. During testing, new cases are classified by extrapolation from the most similar stored cases. For over fifteen years TiMBL has been mostly used in natural language processing as a machine learning classifier component, but its use extends to virtually any supervised machine learning domain. Due to its particular decision-tree-based implementation, TiMBL is in many cases far more efficient in classification than a standard k-nearest neighbor algorithm would be.
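
The idea behind k-nearest neighbor classification with information-gain feature weighting over symbolic features can be sketched in a few lines of Python. This is a plain, unoptimised illustration and not TiMBL's code or API: the function names and the toy weather data are assumptions, the training set is simply kept in a list, and `k` here counts neighbours rather than anything TiMBL-specific.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total)
                for c in Counter(labels).values())


def information_gain(instances, labels, feature):
    """Entropy of the labels minus the expected entropy
    after splitting the data on one symbolic feature."""
    by_value = {}
    for inst, label in zip(instances, labels):
        by_value.setdefault(inst[feature], []).append(label)
    remainder = sum(len(subset) / len(labels) * entropy(subset)
                    for subset in by_value.values())
    return entropy(labels) - remainder


def classify(instances, labels, query, k=1):
    """Majority vote among the k stored instances nearest to `query`,
    using an information-gain-weighted overlap distance."""
    weights = [information_gain(instances, labels, f)
               for f in range(len(query))]

    def distance(inst):
        # Each mismatching feature adds its weight to the distance.
        return sum(w for w, a, b in zip(weights, inst, query) if a != b)

    ranked = sorted(zip(instances, labels), key=lambda pair: distance(pair[0]))
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]


# Toy data (hypothetical): the first feature fully predicts the class.
instances = [("sunny", "hot"), ("sunny", "mild"),
             ("rainy", "hot"), ("rainy", "mild")]
labels = ["out", "out", "in", "in"]
prediction = classify(instances, labels, ("rainy", "cold"))
```

In this toy set the second feature carries no information, so its weight is zero and the unseen value `cold` does not affect the distance. The sketch scans every stored instance for each query, which is the cost that a tree-based index is meant to reduce.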

Ucto: Unicode Tokenizer

by Maarten van Gompel, Ko van der Sloot https://languagemachines.github.io/ucto

Ucto tokenizes text files: it separates words from punctuation and splits sentences. It also offers several other basic preprocessing steps, such as changing case, all of which you can use to make your text suitable for further processing such as indexing, part-of-speech tagging, or machine translation.
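
To make the task concrete, here is a minimal Python sketch of regular-expression tokenisation with naive sentence splitting and optional lowercasing. It does not use Ucto and does not reflect its rules or interface; the pattern, the `tokenize` function, and its `lowercase` parameter are assumptions chosen for illustration.

```python
import re

# A word (optionally with internal hyphens or apostrophes),
# or any single non-space, non-word character such as punctuation.
TOKEN = re.compile(r"\w+(?:[-']\w+)*|[^\w\s]")
SENTENCE_END = {".", "!", "?"}


def tokenize(text, lowercase=False):
    """Split text into sentences, each a list of word and punctuation tokens."""
    sentences, current = [], []
    for token in TOKEN.findall(text):
        current.append(token.lower() if lowercase else token)
        if token in SENTENCE_END:
            # Naive rule: every ., ! or ? closes the current sentence.
            sentences.append(current)
            current = []
    if current:
        sentences.append(current)
    return sentences


sentences = tokenize("Hello, world! It's fine.")
```

The naive sentence rule breaks on abbreviations such as "Mr." and on decimal numbers, which is one reason a practical tokeniser needs more elaborate rules than this sketch.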

Valkuil.net

by Antal van den Bosch, Maarten van Gompel http://valkuil.net/

Valkuil is a Dutch spelling correction system.

python-frog

by Maarten van Gompel https://github.com/proycon/python-frog

This is a Python binding to the Natural Language Processing suite Frog. Frog is intended for Dutch and performs part-of-speech tagging, lemmatisation, morphological analysis, named entity recognition, shallow parsing, and dependency parsing. The tool itself is implemented in C++.

python-timbl

by Maarten van Gompel, Sander Canisius https://github.com/proycon/python-timbl

python-timbl, originally developed by Sander Canisius, is a Python extension module wrapping the full TiMBL C++ programming interface. With this module, all functionality exposed through the C++ interface is also available to Python scripts. Being able to access the API from Python greatly facilitates prototyping TiMBL-based applications.

python-ucto

by Maarten van Gompel https://github.com/proycon/python-ucto

This is a Python binding to the tokenizer Ucto. Tokenisation is one of the first steps in almost any Natural Language Processing task, yet it is not always as trivial a task as it appears to be. This binding makes the power of the Ucto tokeniser available to Python. Ucto itself is a regular-expression-based, extensible, and advanced tokeniser written in C++.