CGN2.0

Persistent Identifierauto
http://hdl.handle.net/21.11114/COLL-0000-000B-CA99-5
Description0-1
The Spoken Dutch Corpus project aimed to construct a database of contemporary standard Dutch as spoken by adults in the Netherlands and Flanders. The intended size of the corpus was ten million words (about 1,000 hours of speech), two thirds of which would originate from the Netherlands and one third from Flanders. The total number of words available is nearly 9 million (800 hours of speech). Some 3.3 million words were collected in Flanders and well over 5.6 million in the Netherlands. The corpus comprises a large number of samples of recorded spoken text. The entire corpus has been transcribed orthographically, and the transcripts have been linked to the speech files. The orthographic transcription was used as the starting point for the lemmatization and part-of-speech tagging of the corpus. For a selection of one million words, a verified broad phonetic transcription has been produced, and for this part of the corpus the alignment of the transcripts and the speech files has also been verified at the word level. In addition, a selection of one million words has been annotated syntactically. Finally, for a more modest part of the corpus, approximately 250,000 words, a prosodic annotation is available. The corpus comes with metadata pertaining to speakers and recording conditions. A lexicon and frequency lists (word tokens, lemmas, POS tags, pronunciation variants) are also available with the corpus.
LandingPage1
http://tst-centrale.org/images/stories/producten/documentatie/cgn_website/doc_English/start.htm
Title(s)1-n
[1]: CGN 2.0,
[2]: Corpus Gesproken Nederlands,
[3]: The Spoken Dutch Corpus,
[4]: Corpus of Spoken Dutch
Owner(s)0-n
Nederlandse Taalunie
Genre(s)0-n
conversation , radio/TV-broadcast , interviews , prompted speech , speeches , other , fiction
Language disorder(s)0-n
none
Domain(s)0-n
Research and development with respect to contemporary spoken Dutch
Language(s)1-n
Dutch (Northern) [nld] , Flemish [nld]
CLARIN centre0-1
INT, formerly hosted at TST-centrale (INL)
Version0-1
2.0
Size(s)0-n
9000000 words , 33 DVDs , 115 GB , 800 hours
Relation(s)0-n
[CGN2.0] isNewVersionOf [CGN1.0]
Creator(s)0-n
Dr. N. Oostdijk - Ir. W. Goedertier (ELIS, Rijks Universiteit Gent - CLS, Radboud Universiteit (formerly known as Katho))
Project(s)0-n
The Spoken Dutch Corpus project site (Funder: the Flemish and Dutch governments and the Netherlands Organization for Scientific Research (NWO))
Resource(s)1-n
Description0-1
This part of the CGN is the core corpus containing a selection of one million words which has been annotated syntactically. For this core corpus a (manually verified) broad phonetic transcription has been produced, and also the alignment of the transcripts and the speech files has been verified at the word level. Finally, for a more modest part of the corpus, approximately 250,000 words, a prosodic annotation is available. The corpus comes with metadata pertaining to speakers and recording conditions.
Dublin-Core Type1
Sound
subtype0-1
speech
Modality1-n
speech
Recording environment0-n
home/office , other
Recording condition0-n
various
Channel0-n
broadcasting , face-to-face , telephone , other
Social context0-n
family , private , public
Planning type0-n
semi-spontaneous , spontaneous , planned
Interactivity0-n
interactive , non-interactive , semi-interactive
Involvement0-n
not-observed , elicited , non-elicited
Audience0-n
large , small , medium
SC duration speech0-1
unknown
SC duration full0-1
80 hours
SC speakers0-1
unknown
SC sp. demogr0-1
various
Size0-n
900000 words
Annotation0-n
[1]: [phoneticTranscription] [manually-verified] [text/x-cgn-bpt+xml],
[2]: [orthographicTranscription] [manually-verified] [text/praat-textgrid],
[3]: [posTagging] [manually-verified] [text/x-cgn-tag+xml],
[4]: [soundToTextAlignment] [manually-verified] [text/x-cgn-bpt+xml],
[5]: [segmentation] [manually-verified] [text/x-cgn-bpt+xml],
[6]: [lemmatization] [mixed] [text/x-cgn-tag+xml],
[7]: [prosodicAnnotation] [mixed] [text/x-cgn-prx+xml],
[8]: [syntacticAnnotation] [manually-verified] [text/x-cgn-tig+xml]
Media0-n
audio/x-wav
Media0-n
audio/x-alaw
Description0-1
The CGN Lexicon comprises 14 columns, the first 4 of which (Id-Nummer Woordvorm, Orthografie Woordvorm, Woordsoort and Lemma) always contain information. The column Gebruik contains only codes for regional and stylistic variants. The columns Syntax, Uitspraak (4 subcolumns), Morfologie and Definitie contain codes that originate from one or both of the source lexicons, CELEX (Centrum voor Lexicale Informatie) and RBN (Referentiebestand Nederlands), or that have been generated on the basis of the pronunciations in CELEX and FONILEX (Fonetisch Lexicon Vlaams). An updated and extended version is available as e-lex.
Dublin-Core Type1
Dataset
subtype0-1
lexicon
Modality1-n
written
Size0-n
229104 entries
Annotation0-n
[1]: [phoneticTranscription] [manual] [text/x-cgn-lxk+xml],
[2]: [orthographicTranscription] [manual] [text/x-cgn-lxk+xml],
[3]: [lemmatization] [manual] [text/x-cgn-lxk+xml],
[4]: [posTagging] [manual] [text/x-cgn-lxk+xml],
[5]: [morphology] [automatic] [text/x-cgn-lxk+xml]
Media0-n
text/xml
Description0-1
This resource is the frequency list of word types belonging to the Spoken Dutch Corpus (CGN). The list is available both as a ranked and an alphabetical list. Other frequency lists for the CGN are: pronunciation variants, POS tags, and lemmas. Separate lists are available for the Netherlands and Flanders.
Dublin-Core Type1
Dataset
subtype0-1
frequencylist-wordtypes
Modality1-n
written
Annotation0-n
[noAnnotation] [unknown] [other]
Media0-n
text/csv
Description0-1
This resource is the frequency list of lemmas belonging to the Spoken Dutch Corpus (CGN). Other frequency lists for the CGN are: word types, POS tags, and pronunciation variants. Separate lists are available for the Netherlands and Flanders.
Dublin-Core Type1
Dataset
subtype0-1
frequencylist-lemmas
Modality1-n
written
Annotation0-n
[noAnnotation] [unknown] []
Media0-n
text/csv
Description0-1
This resource is the frequency list of POS tags belonging to the Spoken Dutch Corpus (CGN). Other frequency lists for the CGN are: word types, lemmas, and pronunciation variants. Separate lists are available for the Netherlands and Flanders.
Dublin-Core Type1
Dataset
subtype0-1
frequencylist-POStags
Modality1-n
written
Annotation0-n
[noAnnotation] [unknown] []
Media0-n
text/csv
Description0-1
This resource is the frequency list of pronunciation variants belonging to the Spoken Dutch Corpus (CGN). Other frequency lists for the CGN are: word types, POS tags, and lemmas. Separate lists are available for the Netherlands and Flanders.
Dublin-Core Type1
Dataset
subtype0-1
frequencylist-other
Modality1-n
written
Annotation0-n
[noAnnotation] [unknown] []
Media0-n
text/csv
Description0-1
This part is the full CGN corpus. The total number of words available is nearly 9 million (800 hours of speech). Some 3.3 million words were collected in Flanders and well over 5.6 million in the Netherlands. The corpus comprises a large number of samples of recorded spoken text. The entire corpus has been transcribed orthographically, and the transcripts have been linked to the speech files. The orthographic transcription was used as the starting point for the lemmatization and part-of-speech tagging of the corpus. In addition, automatically generated word segmentations and automatically generated broad phonetic transcriptions are available.
Dublin-Core Type1
Sound
subtype0-1
speech
Modality1-n
speech
Recording environment0-n
home/office , other
Recording condition0-n
various
Channel0-n
broadcasting , face-to-face , telephone , other
Social context0-n
family , private , public
Planning type0-n
semi-spontaneous , spontaneous , planned
Interactivity0-n
interactive , non-interactive , semi-interactive
Involvement0-n
elicited , non-elicited , not-observed
Audience0-n
large , small , medium
SC duration speech0-1
unknown
SC duration full0-1
800 hours
SC speakers0-1
4235
SC sp. demogr0-1
various
Size0-n
9000000 words
Size0-n
800 hours
Size0-n
115 GB
Annotation0-n
[1]: [phoneticTranscription] [automatic] [text/x-cgn-bpt+xml],
[2]: [orthographicTranscription] [manual] [],
[3]: [posTagging] [manually-verified] [text/x-cgn-tag+xml],
[4]: [soundToTextAlignment] [automatic] [text/x-cgn-bpt+xml],
[5]: [segmentation] [automatic] [text/x-cgn-bpt+xml],
[6]: [lemmatization] [manually-verified] [text/x-cgn-tag+xml]
Media0-n
audio/x-wav
Media0-n
audio/x-alaw
Provenance(s)0-n
Temporal0-1
1991-2003
Country0-1
Belgium BE
Country0-1
Netherlands (the) NL
Linguality0-1
Type0-n
monolingual
Nativeness0-n
native
AgeGroup0-n
adult
Status0-n
normal
Variant0-n
standard , dialect
MultiType0-n
unknown
Accessibility0-1
Name1
CGN
Availability0-n
academic , restricted
License name(s)0-n
licenses for research and commercial use
Licence URL(s)0-n
http://tst-centrale.org/nl/tst-materialen/corpora/corpus-gesproken-nederlands-detail
Non-commercial usage0-1
no
Website(s)0-n
http://tst-centrale.org/nl/tst-materialen/corpora/corpus-gesproken-nederlands-detail
ISBN0-1
-
ISLRN0-1
-
Contact(s)0-n
Unknown: TST-centrale p/a Instituut v, (servicedesk@tst-centrale.org)
Medium(s)0-n
hard disk
Documentation0-1
Language(s)1-n
English [eng] , Dutch (Northern) [nld]
Type(s)0-n
website , manual
File(s)0-n
unknown
URL(s)0-n
http://lands.let.ru.nl/cgn/ehome.htm , http://lands.let.ru.nl/cgn/doc_English/topics/version_1.0/overview.htm
Validation0-1
Type0-1
formal/content
Method(s)0-n
semi-automatic , automatic , manual
 