SoNaR
Persistent Identifierauto
http://hdl.handle.net/21.11114/COLL-0000-000B-CABA-0
Description0-1
SoNaR is a 500-million-word reference corpus of contemporary written Dutch for use in different types of linguistic (incl. lexicographic) and HLT research and the development of applications. The STEVIN-funded SoNaR project (2008-2011) built on the results obtained in the D-Coi and Corea projects, which were awarded funding in the first call for proposals within the STEVIN programme. (SOURCE: CLARIN)
The SoNaR Corpus consists of three parts:
- SoNaR 500: 500 M words, automatically tokenized, lemmatized and POS-tagged
- SoNaR 1: 1 M words with hand-checked semantic annotations: named entities, co-reference relations, semantic roles and spatio-temporal relations
- SoNaR New Media Corpus: subpart of SoNaR 500 containing the material from new media only (tweets, chats, SMS)
LandingPage1
http://hdl.handle.net/11372/LRT-1498
Title(s)1-n
[1]:
SoNaR,
[2]:
SoNaR Reference Corpus
Genre(s)0-n
newspaper-article, other, fiction, social-media-texts, non-academic-non-fiction, academic-nonfiction
Domain(s)0-n
A large reference corpus of written Dutch is invaluable for linguistic research and the development of profitable services that require advanced language technology (SOURCE: N. Oostdijk et al.)
Language(s)1-n
Flemish [nld], Dutch (Northern) [nld]
CLARIN centre0-1
HLT Centre (INL)
Creator(s)0-n
[1]:
Dr. N. Oostdijk (CLST, Radboud University Nijmegen),
[2]:
Dr. N. Oostdijk (Tilburg University (ILK)-Utrecht University (UiL-OTS)-Hogeschool Gent-Katholieke Universiteit Leuven (CCL)-Twente University)
Project(s)0-n
SoNaR-corpus site (Funder: NTU STEVIN)
Resource(s)1-n
Resource 1
Description0-1
SoNaR500 is a 500-million-word reference corpus of contemporary written Dutch for use in different types of linguistic (incl. lexicographic) and HLT research and the development of applications.
SoNaR500 was automatically tokenized, lemmatized and POS-tagged (except for the new media part).
Annotation0-n
[1]:
[syntacticAnnotation] [mixed] [text/xml],
[2]:
[lemmatization] [mixed] [text/xml],
[3]:
[posTagging] [mixed] [text/xml]
Resource 2
Description0-1
SoNaR-1 is a dataset comprising one million words. Although largely a subset of SoNaR-500, SoNaR-1 includes far fewer text types. SoNaR-1 provides several types of semantic annotation, viz. named entity labelling, annotation of co-reference relations, semantic role labelling and annotation of spatial and temporal relations. All annotations have been manually verified.
Annotation0-n
[1]:
[semanticAnnotation-coreference] [manual] [text/html],
[2]:
[semanticAnnotation-namedEntities] [manual] [text/xml],
[3]:
[semanticAnnotation-roles] [manual] [text/xml],
[4]:
[semanticAnnotation-relations] [manual] [other]
Resource 3
Description0-1
SoNaR New Media texts (tweets, chats and SMS messages) were also collected in the STEVIN project SoNaR, but they are not part of the SoNaR corpus 1.0 and can be obtained separately as the SoNaR New Media Corpus.
Annotation0-n
[noAnnotation] [unknown] [other]
Provenance(s)0-n
Provenance 1
Country0-1
Netherlands (the) NL
Accessibility0-1
Accessibility
License name(s)0-n
License via HLT-centre, Dutch Language Union
Licence URL(s)0-n
http://tst-centrale.org/nl/tst-materialen/corpora/sonar-corpus-detail
Non-commercial usage0-1
yes
Website(s)0-n
http://lands.let.ru.nl/projects/SoNaR/intro.html
Contact(s)0-n
Dr. Nelleke Oostdijk: Centre for Language and Speech (n.oostdijk@let.ru.nl)
Documentation0-1
Documentation
URL(s)0-n
http://ticclops.uvt.nl/SoNaR_end-user_documentation_v.1.0.4.pdf