Persistent Identifierauto

http://hdl.handle.net/21.11114/COLL-0000-000B-CABA-0

Description0-1

SoNaR is a 500-million-word reference corpus of contemporary written Dutch for use in different types of linguistic (incl. lexicographic) and HLT research and the development of applications. The STEVIN-funded SoNaR project (2008-2011) built on the results obtained in the D-Coi and Corea projects, which were awarded funding in the first call for proposals within the STEVIN programme. (SOURCE: CLARIN).

The SoNaR Corpus consists of three parts:

- SoNaR 500: 500 M words, automatically tokenized, lemmatized and POS-tagged

- SoNaR 1: 1 M words with hand-checked semantic annotations: named entities, co-reference relations, semantic roles and spatio-temporal relations

- SoNaR New Media Corpus: subpart of SoNaR 500 containing the material from new media only (tweets, chats, SMS)

LandingPage1

http://hdl.handle.net/11372/LRT-1498

Title(s)1-n

[1]: SoNaR,

[2]: SoNaR Reference Corpus

Owner(s)0-n

Nederlandse Taalunie

Genre(s)0-n

newspaper-article, other, fiction, social-media-texts, non-academic-non-fiction, academic-nonfiction

Language disorder(s)0-n

none

Domain(s)0-n

A large reference corpus of written Dutch is invaluable for linguistic research and the development of profitable services that require advanced language technology (SOURCE: N. Oostdijk et al.)

Language(s)1-n

Flemish [nld], Dutch (Northern) [nld]

CLARIN centre0-1

HLT Centre (INL)

Version0-1

1.2.1

Size(s)0-n

500000000 words, 60 GB

Creator(s)0-n

[1]: Dr. N. Oostdijk (CLST, Radboud University Nijmegen),

[2]: Dr. N. Oostdijk (Tilburg University (ILK)-Utrecht University (UiL-OTS)-Hogeschool Gent-Katholieke Universiteit Leuven (CCL)-Twente University)

Project(s)0-n

SoNaR-corpus site (Funder: NTU STEVIN)

Resource(s)1-n

Resource 1

Description0-1

SoNaR500 is a 500-million-word reference corpus of contemporary written Dutch for use in different types of linguistic (incl. lexicographic) and HLT research and the development of applications. SoNaR500 was automatically tokenized, lemmatized and POS-tagged (except for the new media part).

Dublin-Core Type1

Text

subtype0-1

Modality1-n

written

WC authors0-1

WC auth. demogr0-1

Size0-n

500000000 words

Annotation0-n

[1]: [syntacticAnnotation] [mixed] [text/xml],

[2]: [lemmatization] [mixed] [text/xml],

[3]: [posTagging] [mixed] [text/xml]

Media0-n

text/xml

Media0-n

text/folia

Resource 2

Description0-1

SoNaR-1 is a dataset comprising one million words. Although largely a subset of SoNaR-500, SoNaR-1 includes far fewer text types. SoNaR-1 is provided with several types of semantic annotation, viz. named entity labelling, annotation of co-reference relations, semantic role labelling and annotation of spatial and temporal relations. All annotations have been manually verified.

Dublin-Core Type1

Text

subtype0-1

Modality1-n

written

WC authors0-1

WC auth. demogr0-1

Size0-n

1000000 words

Annotation0-n

[1]: [semanticAnnotation-coreference] [manual] [text/html],

[2]: [semanticAnnotation-namedEntities] [manual] [text/xml],

[3]: [semanticAnnotation-roles] [manual] [text/xml],

[4]: [semanticAnnotation-relations] [manual] [other]

Media0-n

text/folia

Resource 3

Description0-1

SoNaR New Media texts (tweets, chats and SMS messages) were also collected in the STEVIN project SoNaR, but they are not part of the SoNaR corpus 1.0 and can be obtained separately as the SoNaR New Media Corpus.

Dublin-Core Type1

Text

subtype0-1

Modality1-n

written

WC authors0-1

WC auth. demogr0-1

Annotation0-n

[noAnnotation] [unknown] [other]

Media0-n

text/folia

Provenance(s)0-n

Provenance 1

Temporal0-1

1954-2002

Country0-1

Belgium BE

Country0-1

Netherlands (the) NL

Linguality0-1

Linguality

Type0-n

monolingual

Nativeness0-n

native

AgeGroup0-n

unknown

Status0-n

normal

Variant0-n

standard, dialect

MultiType0-n

unknown

Accessibility0-1

Accessibility

Name1

SoNaR

Availability0-n

academic, restricted

License name(s)0-n

License via HLT-centre, Dutch Language Union

Licence URL(s)0-n

http://tst-centrale.org/nl/tst-materialen/corpora/sonar-corpus-detail

Non-commercial usage0-1

yes

Website(s)0-n

http://lands.let.ru.nl/projects/SoNaR/intro.html

ISBN0-1

ISLRN0-1

Contact(s)0-n

Dr. Nelleke Oostdijk: Centre for Language and Speech (n.oostdijk@let.ru.nl)

Medium(s)0-n

internet

Documentation0-1

Documentation

Language(s)1-n

English [eng]

Type(s)0-n

website, manual, other

File(s)0-n

SoNaR User Documentation

URL(s)0-n

http://ticclops.uvt.nl/SoNaR_end-user_documentation_v.1.0.4.pdf

Validation0-1

Validation

Type0-1

formal/content

Method(s)0-n

automatic, manual