Persistent Identifierauto

http://hdl.handle.net/21.11114/COLL-0000-000B-CA99-5

Description0-1

The Spoken Dutch Corpus project aimed to construct a database of contemporary standard Dutch as spoken by adults in the Netherlands and Flanders. The intended size of the corpus was ten million words (about 1,000 hours of speech), two thirds of which would originate from the Netherlands and one third from Flanders. The total number of words available is nearly 9 million (800 hours of speech): some 3.3 million words were collected in Flanders and well over 5.6 million in the Netherlands. The corpus comprises a large number of samples of (recorded) spoken text. The entire corpus has been transcribed orthographically, and the transcripts have been linked to the speech files. The orthographic transcription served as the starting point for the lemmatization and part-of-speech tagging of the corpus. For a selection of one million words, a (verified) broad phonetic transcription has been produced, and for this part of the corpus the alignment of the transcripts and the speech files has also been verified at the word level. In addition, a selection of one million words has been annotated syntactically. Finally, for a more modest part of the corpus, approximately 250,000 words, a prosodic annotation is available. The corpus comes with metadata pertaining to speakers and recording conditions. A lexicon and frequency lists (word tokens, lemmas, POS tags, pronunciation variants) are also available with the corpus.
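As a quick sanity check on the figures quoted in this description, the following sketch recomputes the total word count, the regional split, and the average number of words per hour. It assumes the rounded values given in the text (3.3 million, 5.6 million, 800 hours); the function name `corpus_summary` is purely illustrative and not part of any CGN tooling.

```python
# Rounded figures taken from the description ("some 3.3 million",
# "well over 5.6 million", "800 hours"); exact counts will differ.
WORDS_FLANDERS = 3_300_000
WORDS_NETHERLANDS = 5_600_000
HOURS_OF_SPEECH = 800


def corpus_summary(flanders, netherlands, hours):
    """Return total words, regional shares, and words per hour."""
    total = flanders + netherlands
    return {
        "total_words": total,
        "share_netherlands": netherlands / total,
        "share_flanders": flanders / total,
        "words_per_hour": total / hours,
    }


summary = corpus_summary(WORDS_FLANDERS, WORDS_NETHERLANDS, HOURS_OF_SPEECH)
print(summary["total_words"])                  # 8900000
print(round(summary["share_netherlands"], 2))  # 0.63
print(round(summary["share_flanders"], 2))     # 0.37
print(round(summary["words_per_hour"]))        # 11125
```

With these rounded inputs the split comes out at roughly 63% Netherlands and 37% Flanders, slightly more Flemish material than the intended one third.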

LandingPage1

http://tst-centrale.org/images/stories/producten/documentatie/cgn_website/doc_English/start.htm

Title(s)1-n

[1]: CGN 2.0,

[2]: Corpus Gesproken Nederlands,

[3]: The Spoken Dutch Corpus,

[4]: Corpus of Spoken Dutch

Owner(s)0-n

Nederlandse Taalunie

Genre(s)0-n

conversation , radio/TV-broadcast , interviews , prompted speech , speeches , other , fiction

Language disorder(s)0-n

none

Domain(s)0-n

Research and development with respect to contemporary spoken Dutch

Language(s)1-n

Dutch (Northern) [nld] , Flemish [nld]

CLARIN centre0-1

INT, formerly hosted at TST-centrale (INL)

Version0-1

2.0

Size(s)0-n

9000000 words , 33 DVDs , 115 GB , 800 hours

Relation(s)0-n

[CGN2.0] isNewVersionOf [CGN1.0]

Creator(s)0-n

Dr. N. Oostdijk - Ir. W. Goedertier (ELIS, Rijks Universiteit Gent - CLS, Radboud Universiteit (formerly known as Katho))

Project(s)0-n

The Spoken Dutch Corpus project site (Funder: the Flemish and Dutch governments and the Netherlands Organization for Scientific Research (NWO))

Resource(s)1-n

Resource 1

Description0-1

This part of the CGN is the core corpus containing a selection of one million words which has been annotated syntactically. For this core corpus a (manually verified) broad phonetic transcription has been produced, and also the alignment of the transcripts and the speech files has been verified at the word level. Finally, for a more modest part of the corpus, approximately 250,000 words, a prosodic annotation is available. The corpus comes with metadata pertaining to speakers and recording conditions.

Dublin-Core Type1

Sound

subtype0-1

speech

Modality1-n

speech

Recording environment0-n

home/office , other

Recording condition0-n

various

Channel0-n

broadcasting , face-to-face , telephone , other

Social context0-n

family , private , public

Planning type0-n

semi-spontaneous , spontaneous , planned

Interactivity0-n

interactive , non-interactive , semi-interactive

Involvement0-n

not-observed , elicited , non-elicited

Audience0-n

large , small , medium

SC duration speech0-1

unknown

SC duration full0-1

80 hours

SC speakers0-1

unknown

SC sp. demogr0-1

various

Size0-n

900000 words

Annotation0-n

[1]: [phoneticTranscription] [manually-verified] [text/x-cgn-bpt+xml],

[2]: [orthographicTranscription] [manually-verified] [text/praat-textgrid],

[3]: [posTagging] [manually-verified] [text/x-cgn-tag+xml],

[4]: [soundToTextAlignment] [manually-verified] [text/x-cgn-bpt+xml],

[5]: [segmentation] [manually-verified] [text/x-cgn-bpt+xml],

[6]: [lemmatization] [mixed] [text/x-cgn-tag+xml],

[7]: [prosodicAnnotation] [mixed] [text/x-cgn-prx+xml],

[8]: [syntacticAnnotation] [manually-verified] [text/x-cgn-tig+xml]
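Each annotation entry in this record follows the same three-field pattern: annotation type, annotation mode, and format (a MIME type, which may be empty). The sketch below shows one way such entries could be parsed into structured values, for example to select the manually verified layers. The `Annotation` class, the `ENTRY` pattern, and the `parse_annotation` helper are hypothetical names introduced for illustration; they are not part of any CGN or CLARIN tool.

```python
import re
from typing import NamedTuple


class Annotation(NamedTuple):
    kind: str       # e.g. "posTagging"
    mode: str       # e.g. "manually-verified", "mixed", "automatic"
    mime_type: str  # may be an empty string in some records


# Optional "[n]:" index, then three bracketed fields; a trailing comma is allowed.
ENTRY = re.compile(
    r"^(?:\[\d+\]:\s*)?\[([^\]]*)\]\s*\[([^\]]*)\]\s*\[([^\]]*)\],?\s*$"
)


def parse_annotation(line: str) -> Annotation:
    """Parse one annotation entry as it appears in this record."""
    match = ENTRY.match(line.strip())
    if match is None:
        raise ValueError(f"not an annotation entry: {line!r}")
    return Annotation(*match.groups())


lines = [
    "[1]: [phoneticTranscription] [manually-verified] [text/x-cgn-bpt+xml],",
    "[6]: [lemmatization] [mixed] [text/x-cgn-tag+xml],",
    "[8]: [syntacticAnnotation] [manually-verified] [text/x-cgn-tig+xml]",
]
annotations = [parse_annotation(entry) for entry in lines]
verified = [a.kind for a in annotations if a.mode == "manually-verified"]
print(verified)  # ['phoneticTranscription', 'syntacticAnnotation']
```

The same pattern also accepts unnumbered entries with an empty format field, such as `[noAnnotation] [unknown] []`.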

Media0-n

audio/x-wav

Media0-n

audio/x-alaw

Resource 2

Description0-1

The CGN Lexicon comprises 14 columns, the first 4 of which (Id-Nummer Woordvorm, Orthografie Woordvorm, Woordsoort and Lemma) always contain information. The column Gebruik contains only codes for regional and stylistic variants. The columns Syntax, Uitspraak (4 subcolumns), Morfologie and Definitie contain codes that originate from (one of) the source lexicons, CELEX (Centrum voor Lexicale Informatie) and RBN (Referentiebestand Nederlands), or that have been generated on the basis of the pronunciations in CELEX and FONILEX (Fonetisch Lexicon Vlaams). An updated and extended version is available as e-lex.

Dublin-Core Type1

Dataset

subtype0-1

lexicon

Modality1-n

written

Size0-n

229104 entries

Annotation0-n

[1]: [phoneticTranscription] [manual] [text/x-cgn-lxk+xml],

[2]: [orthographicTranscription] [manual] [text/x-cgn-lxk+xml],

[3]: [lemmatization] [manual] [text/x-cgn-lxk+xml],

[4]: [posTagging] [manual] [text/x-cgn-lxk+xml],

[5]: [morphology] [automatic] [text/x-cgn-lxk+xml]

Media0-n

text/xml

Resource 3

Description0-1

This resource is the frequency list of word types belonging to the Spoken Dutch Corpus (CGN). The list is available both as a ranked and as an alphabetical list. Other frequency lists for the CGN cover pronunciation variants, POS tags, and lemmas. Separate lists are available for the Netherlands and Flanders.
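To illustrate the difference between a ranked and an alphabetical frequency list, here is a minimal sketch that builds both from a toy token sequence. It does not read the CGN files and makes no claim about their column layout; the function name `frequency_lists` and the sample tokens are assumptions made for the example.

```python
from collections import Counter


def frequency_lists(tokens):
    """Return (ranked, alphabetical) lists of (word_type, count) pairs."""
    counts = Counter(tokens)
    # Ranked: highest count first; ties are broken alphabetically.
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    # Alphabetical: ordered by word type only.
    alphabetical = sorted(counts.items())
    return ranked, alphabetical


tokens = "ja dat is ja wel zo ja dat".split()  # toy input, not CGN data
ranked, alphabetical = frequency_lists(tokens)
print(ranked)        # [('ja', 3), ('dat', 2), ('is', 1), ('wel', 1), ('zo', 1)]
print(alphabetical)  # [('dat', 2), ('is', 1), ('ja', 3), ('wel', 1), ('zo', 1)]
```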

Dublin-Core Type1

Dataset

subtype0-1

frequencylist-wordtypes

Modality1-n

written

Annotation0-n

[noAnnotation] [unknown] [other]

Media0-n

text/csv

Resource 4

Description0-1

This resource is the frequency list of lemmas belonging to the Spoken Dutch Corpus (CGN). Other frequency lists for the CGN cover word types, POS tags, and pronunciation variants. Separate lists are available for the Netherlands and Flanders.

Dublin-Core Type1

Dataset

subtype0-1

frequencylist-lemmas

Modality1-n

written

Annotation0-n

[noAnnotation] [unknown] []

Media0-n

text/csv

Resource 5

Description0-1

This resource is the frequency list of POS tags belonging to the Spoken Dutch Corpus (CGN). Other frequency lists for the CGN cover word types, lemmas, and pronunciation variants. Separate lists are available for the Netherlands and Flanders.

Dublin-Core Type1

Dataset

subtype0-1

frequencylist-POStags

Modality1-n

written

Annotation0-n

[noAnnotation] [unknown] []

Media0-n

text/csv

Resource 6

Description0-1

This resource is the frequency list of pronunciation variants belonging to the Spoken Dutch Corpus (CGN). Other frequency lists for the CGN cover word types, POS tags, and lemmas. Separate lists are available for the Netherlands and Flanders.

Dublin-Core Type1

Dataset

subtype0-1

frequencylist-other

Modality1-n

written

Annotation0-n

[noAnnotation] [unknown] []

Media0-n

text/csv

Resource 7

Description0-1

This part is the full CGN Corpus. The total number of words available is nearly 9 million (800 hours of speech): some 3.3 million words were collected in Flanders and well over 5.6 million in the Netherlands. The corpus comprises a large number of samples of (recorded) spoken text. The entire corpus has been transcribed orthographically, and the transcripts have been linked to the speech files. The orthographic transcription served as the starting point for the lemmatization and part-of-speech tagging of the corpus. In addition, automatically generated word segmentations and automatically generated broad phonetic transcriptions are available.

Dublin-Core Type1

Sound

subtype0-1

speech

Modality1-n

speech

Recording environment0-n

home/office , other

Recording condition0-n

various

Channel0-n

broadcasting , face-to-face , telephone , other

Social context0-n

family , private , public

Planning type0-n

semi-spontaneous , spontaneous , planned

Interactivity0-n

interactive , non-interactive , semi-interactive

Involvement0-n

elicited , non-elicited , not-observed

Audience0-n

large , small , medium

SC duration speech0-1

unknown

SC duration full0-1

800 hours

SC speakers0-1

4235

SC sp. demogr0-1

various

Size0-n

9000000 words

Size0-n

800 hours

Size0-n

115 GB

Annotation0-n

[1]: [phoneticTranscription] [automatic] [text/x-cgn-bpt+xml],

[2]: [orthographicTranscription] [manual] [],

[3]: [posTagging] [manually-verified] [text/x-cgn-tag+xml],

[4]: [soundToTextAlignment] [automatic] [text/x-cgn-bpt+xml],

[5]: [segmentation] [automatic] [text/x-cgn-bpt+xml],

[6]: [lemmatization] [manually-verified] [text/x-cgn-tag+xml]

Media0-n

audio/x-wav

Media0-n

audio/x-alaw

Provenance(s)0-n

Provenance 1

Temporal0-1

1991-2003

Country0-1

Belgium BE

Country0-1

Netherlands (the) NL

Linguality0-1

Linguality

Type0-n

monolingual

Nativeness0-n

native

AgeGroup0-n

adult

Status0-n

normal

Variant0-n

standard , dialect

MultiType0-n

unknown

Accessibility0-1

Accessibility

Name1

CGN

Availability0-n

academic , restricted

License name(s)0-n

licenses for research and commercial use

Licence URL(s)0-n

http://tst-centrale.org/nl/tst-materialen/corpora/corpus-gesproken-nederlands-detail

Non-commercial usage0-1

Website(s)0-n

http://tst-centrale.org/nl/tst-materialen/corpora/corpus-gesproken-nederlands-detail

ISBN0-1

ISLRN0-1

Contact(s)0-n

Unknown: TST-centrale p/a Instituut v, (servicedesk@tst-centrale.org)

Medium(s)0-n

hard disk

Documentation0-1

Documentation

Language(s)1-n

English [eng] , Dutch (Northern) [nld]

Type(s)0-n

website , manual

File(s)0-n

unknown

URL(s)0-n

http://lands.let.ru.nl/cgn/ehome.htm , http://lands.let.ru.nl/cgn/doc_English/topics/version_1.0/overview.htm

Validation0-1

Validation

Type0-1

formal/content

Method(s)0-n

semi-automatic , automatic , manual