SoNaR

Persistent Identifierauto
http://hdl.handle.net/21.11114/COLL-0000-000B-CABA-0
Description0-1
SoNaR is a 500-million-word reference corpus of contemporary written Dutch for use in different types of linguistic (incl. lexicographic) and HLT research and in the development of applications. The STEVIN-funded SoNaR project (2008-2011) built on the results of the D-Coi and Corea projects, which were awarded funding in the first call for proposals of the STEVIN programme (SOURCE: CLARIN). The SoNaR Corpus consists of three parts:
- SoNaR-500: 500 M words, automatically tokenized, lemmatized and POS-tagged
- SoNaR-1: 1 M words with hand-checked semantic annotations: named entities, co-reference relations, semantic roles and spatio-temporal relations
- SoNaR New Media Corpus: the subpart of SoNaR-500 containing only the material from new media (tweets, chats, SMS)
LandingPage1
http://hdl.handle.net/11372/LRT-1498
Title(s)1-n
[1]: SoNaR,
[2]: SoNaR Reference Corpus
Owner(s)0-n
Nederlandse Taalunie
Genre(s)0-n
newspaper-article , other , fiction , social-media-texts , non-academic-non-fiction , academic-nonfiction
Language disorder(s)0-n
none
Domain(s)0-n
A large reference corpus of written Dutch is invaluable for linguistic research and the development of profitable services that require advanced language technology (SOURCE: N. Oostdijk et al.)
Language(s)1-n
Flemish [nld] , Dutch (Northern) [nld]
CLARIN centre0-1
HLT Centre (INL)
Version0-1
1.2.1
Size(s)0-n
500000000 words , 60 GB
Creator(s)0-n
[1]: Dr. N. Oostdijk (CLST, Radboud University Nijmegen),
[2]: Dr. N. Oostdijk (Tilburg University (ILK)-Utrecht University (UiL-OTS)-Hogeschool Gent-Katholieke Universiteit Leuven (CCL)-Twente University)
Project(s)0-n
SoNaR-corpus site (Funder: NTU STEVIN)
Resource(s)1-n
Description0-1
SoNaR-500 is a 500-million-word reference corpus of contemporary written Dutch for use in different types of linguistic (incl. lexicographic) and HLT research and in the development of applications. SoNaR-500 was automatically tokenized, lemmatized and POS-tagged (except for the new media part).
Dublin-Core Type1
Text
subtype0-1
-
Modality1-n
written
WC authors0-1
0
WC auth. demogr0-1
-
Size0-n
500000000 words
Annotation0-n
[1]: [syntacticAnnotation] [mixed] [text/xml],
[2]: [lemmatization] [mixed] [text/xml],
[3]: [posTagging] [mixed] [text/xml]
Media0-n
text/xml
Media0-n
text/folia
Description0-1
SoNaR-1 is a dataset comprising one million words. Although largely a subset of SoNaR-500, SoNaR-1 includes far fewer text types. SoNaR-1 comes with several types of semantic annotation, viz. named entity labelling, annotation of co-reference relations, semantic role labelling and annotation of spatial and temporal relations. All annotations have been manually verified.
Dublin-Core Type1
Text
subtype0-1
-
Modality1-n
written
WC authors0-1
0
WC auth. demogr0-1
-
Size0-n
1000000 words
Annotation0-n
[1]: [semanticAnnotation-coreference] [manual] [text/html],
[2]: [semanticAnnotation-namedEntities] [manual] [text/xml],
[3]: [semanticAnnotation-roles] [manual] [text/xml],
[4]: [semanticAnnotation-relations] [manual] [other]
Media0-n
text/folia
Description0-1
SoNaR New Media texts (tweets, chats and SMS messages) were also collected in the STEVIN project SoNaR, but they are not part of SoNaR corpus 1.0 and can be obtained separately as the SoNaR New Media Corpus.
Dublin-Core Type1
Text
subtype0-1
-
Modality1-n
written
WC authors0-1
0
WC auth. demogr0-1
-
Annotation0-n
[noAnnotation] [unknown] [other]
Media0-n
text/folia
Provenance(s)0-n
Temporal0-1
1954-2002
Country0-1
Belgium BE
Country0-1
Netherlands (the) NL
Linguality0-1
Type0-n
monolingual
Nativeness0-n
native
AgeGroup0-n
unknown
Status0-n
normal
Variant0-n
standard , dialect
MultiType0-n
unknown
Accessibility0-1
Name1
SoNaR
Availability0-n
academic , restricted
License name(s)0-n
License via HLT-centre, Dutch Language Union
Licence URL(s)0-n
http://tst-centrale.org/nl/tst-materialen/corpora/sonar-corpus-detail
Non-commercial usage0-1
yes
Website(s)0-n
http://lands.let.ru.nl/projects/SoNaR/intro.html
ISBN0-1
-
ISLRN0-1
-
Contact(s)0-n
Dr. Nelleke Oostdijk: Centre for Language and Speech (n.oostdijk@let.ru.nl)
Medium(s)0-n
internet
Documentation0-1
Language(s)1-n
English [eng]
Type(s)0-n
website , manual , other
File(s)0-n
SoNaR User Documentation
URL(s)0-n
http://ticclops.uvt.nl/SoNaR_end-user_documentation_v.1.0.4.pdf
Validation0-1
Type0-1
formal/content
Method(s)0-n
automatic , manual
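Usage sketch
The resources above list text/folia as a media type and lemmatization and POS tagging as annotation layers. As a rough illustration of what reading such a file could look like, the following minimal Python sketch extracts token text, lemma and POS class from a FoLiA-style XML fragment using only the standard library. The fragment, the tag classes in it and the helper name read_tokens are hypothetical and written for this example; they are not taken from the SoNaR distribution, and real FoLiA files carry additional metadata and declarations that are omitted here.

```python
import xml.etree.ElementTree as ET

# Namespace used by FoLiA documents.
FOLIA_NS = {"f": "http://ilk.uvt.nl/folia"}

# Hypothetical, hand-written fragment for illustration only.
# It is NOT taken from SoNaR; the class values are placeholders.
SAMPLE = """<FoLiA xmlns="http://ilk.uvt.nl/folia" xml:id="example">
  <text xml:id="example.text">
    <s xml:id="example.s.1">
      <w xml:id="example.s.1.w.1"><t>De</t><pos class="LID"/><lemma class="de"/></w>
      <w xml:id="example.s.1.w.2"><t>katten</t><pos class="N"/><lemma class="kat"/></w>
    </s>
  </text>
</FoLiA>"""


def read_tokens(xml_text):
    """Return a list of (text, lemma, pos) tuples, one per <w> element."""
    root = ET.fromstring(xml_text)
    tokens = []
    for w in root.iterfind(".//f:w", FOLIA_NS):
        # <t> holds the token text; <lemma> and <pos> carry a class attribute.
        text = w.findtext("f:t", default="", namespaces=FOLIA_NS)
        lemma = w.find("f:lemma", FOLIA_NS)
        pos = w.find("f:pos", FOLIA_NS)
        tokens.append((
            text,
            lemma.get("class") if lemma is not None else None,
            pos.get("class") if pos is not None else None,
        ))
    return tokens


if __name__ == "__main__":
    for text, lemma, pos in read_tokens(SAMPLE):
        print(text, lemma, pos)
```

Tokens without a lemma or POS element yield None in that position, which may be useful for the new media part, described above as not automatically annotated. For files of the size listed in this record, a streaming parser would likely be a better fit than loading each document whole.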
 