CGN1.0

Persistent Identifierauto
http://hdl.handle.net/21.11114/COLL-0000-000B-CA98-6
Description0-1
The Spoken Dutch Corpus project was aimed at the construction of a database of contemporary standard Dutch as spoken by adults in the Netherlands and Flanders. The intended size of the corpus was ten million words (about 1,000 hours of speech), two thirds of which would originate from the Netherlands and one third from Flanders. The total number of words available is nearly 9 million (800 hours of speech). Some 3.3 million words were collected in Flanders, well over 5.6 million in the Netherlands. The corpus comprises a large number of samples of (recorded) spoken text. The entire corpus has been transcribed orthographically, and the transcripts have been linked to the speech files. The orthographic transcription was used as the starting point for the lemmatization and part-of-speech tagging of the corpus. For a selection of one million words, a (verified) broad phonetic transcription has been produced, and for this part of the corpus the alignment of the transcripts and the speech files has also been verified at the word level. In addition, a selection of one million words has been annotated syntactically. Finally, for a more modest part of the corpus, approximately 250,000 words, a prosodic annotation is available. The corpus comes with metadata pertaining to speakers and recording conditions. Furthermore, a lexicon and frequency lists (word tokens, lemmas, POS tags, pronunciation variants) are available with the corpus. These are, however, not available with this version 1.0 at MPI.
LandingPage1
https://hdl.handle.net/1839/00-0000-0000-0001-53A5-2@view
Title(s)1-n
[1]: CGN 1.0,
[2]: Corpus Gesproken Nederlands,
[3]: The Spoken Dutch Corpus,
[4]: Corpus of Spoken Dutch
Owner(s)0-n
Nederlandse Taalunie
Genre(s)0-n
conversation , radio/TV-broadcast , interviews , prompted speech , speeches , other , fiction
Language disorder(s)0-n
none
Domain(s)0-n
Research and development with respect to contemporary spoken Dutch
Language(s)1-n
Dutch (Northern) [nld] , Flemish [nld]
CLARIN centre0-1
MPI, TLA
Persistent identifier(s)0-n
https://hdl.handle.net/1839/00-0000-0000-0001-53A5-2
Version0-1
1.0
Size(s)0-n
9000000 words , 33 dvds , 115 GB , 800 hours
Relation(s)0-n
[CGN1.0] isPreviousVersionOf [CGN2.0]
Creator(s)0-n
Dr. N. Oostdijk-Ir. W. Goedertier (ELIS, Rijks Universiteit Gent-CLS, Radboud Universiteit (formerly known as Katho))
Project(s)0-n
The Spoken Dutch Corpus project site (Funder: the Flemish and Dutch governments and the Netherlands Organization for Scientific Research (NWO))
Resource(s)1-n
Description0-1
This part of the CGN is the core corpus, a selection of one million words which has been annotated syntactically. For this core corpus a (manually verified) broad phonetic transcription has been produced, and the alignment of the transcripts and the speech files has also been verified at the word level. Finally, for a more modest part of the corpus, approximately 250,000 words, a prosodic annotation is available. The corpus comes with metadata pertaining to speakers and recording conditions.
Dublin-Core Type1
Sound
subtype0-1
speech
Modality1-n
speech
Recording environment0-n
home/office , other
Channel0-n
broadcasting , face-to-face , telephone , other
Social context0-n
family , private , public
Planning type0-n
semi-spontaneous , spontaneous , planned
Interactivity0-n
interactive , non-interactive , semi-interactive
Involvement0-n
not-observed , elicited , non-elicited
Audience0-n
large , small , medium
SC duration speech0-1
unknown
SC duration full0-1
80 hours
SC speakers0-1
unknown
SC sp. demogr0-1
various
Size0-n
900000 words
Annotation0-n
[1]: [phoneticTranscription] [manually-verified] [text/x-cgn-bpt+xml-text/x-eaf+xml],
[2]: [orthographicTranscription] [manually-verified] [text/x-eaf+xml],
[3]: [posTagging] [manually-verified] [text/x-cgn-tag+xml-text/x-eaf+xml],
[4]: [soundToTextAlignment] [manually-verified] [text/x-cgn-bpt+xml-text/x-eaf+xml],
[5]: [segmentation] [manually-verified] [text/x-cgn-bpt+xml-text/x-eaf+xml],
[6]: [lemmatization] [mixed] [text/x-cgn-tag+xml-text/x-eaf+xml],
[7]: [prosodicAnnotation] [mixed] [text/x-cgn-prx+xml],
[8]: [syntacticAnnotation] [manually-verified] [text/x-cgn-tig+xml]
Media0-n
audio/x-wav, audio/x-mpeg4
Media0-n
audio/x-alaw, audio/x-mpeg4
Description0-1
This part is the full CGN corpus. The total number of words available is nearly 9 million (800 hours of speech). Some 3.3 million words were collected in Flanders, well over 5.6 million in the Netherlands. The corpus comprises a large number of samples of (recorded) spoken text. The entire corpus has been transcribed orthographically, and the transcripts have been linked to the speech files. The orthographic transcription was used as the starting point for the lemmatization and part-of-speech tagging of the corpus. In addition, automatically generated word segmentations and automatically generated broad phonetic transcriptions are available.
Dublin-Core Type1
Sound
subtype0-1
speech
Modality1-n
speech
Recording environment0-n
home/office , other
Channel0-n
broadcasting , face-to-face , telephone , other
Social context0-n
family , private , public
Planning type0-n
semi-spontaneous , spontaneous , planned
Interactivity0-n
interactive , non-interactive , semi-interactive
Involvement0-n
elicited , non-elicited , not-observed
Audience0-n
large , small , medium
SC duration speech0-1
unknown
SC duration full0-1
800 hours
SC speakers0-1
4235
SC sp. demogr0-1
various
Size0-n
9000000 words
Size0-n
800 hours
Size0-n
115 GB
Annotation0-n
[1]: [phoneticTranscription] [automatic] [text/x-cgn-bpt+xml-text/x-eaf+xml],
[2]: [orthographicTranscription] [manual] [text/x-eaf+xml],
[3]: [posTagging] [manually-verified] [text/x-cgn-tag+xml-text/x-eaf+xml],
[4]: [soundToTextAlignment] [automatic] [text/x-cgn-bpt+xml-text/x-eaf+xml],
[5]: [segmentation] [automatic] [text/x-cgn-bpt+xml-text/x-eaf+xml],
[6]: [lemmatization] [manually-verified] [text/x-cgn-tag+xml-text/x-eaf+xml]
Media0-n
audio/x-wav, audio/x-mpeg4
Media0-n
audio/x-alaw, audio/x-mpeg4
Provenance(s)0-n
Temporal0-1
1991-2003
Country0-1
Belgium BE
Country0-1
Netherlands (the) NL
Linguality0-1
Type0-n
monolingual
Nativeness0-n
native
AgeGroup0-n
adult
Status0-n
normal
Variant0-n
standard , dialect
MultiType0-n
unknown
Accessibility0-1
Name1
CGN
Availability0-n
academic , restricted
License name(s)0-n
licenses for research and commercial use
Licence URL(s)0-n
http://tst-centrale.org/nl/tst-materialen/corpora/corpus-gesproken-nederlands-detail
Non-commercial usage0-1
no
Website(s)0-n
http://tst-centrale.org/nl/tst-materialen/corpora/corpus-gesproken-nederlands-detail
ISBN0-1
-
ISLRN0-1
-
Contact(s)0-n
Unknown: TST-centrale p/a Instituut v, (servicedesk@tst-centrale.org)
Medium(s)0-n
hard disk
Documentation0-1
Language(s)1-n
English [eng] , Dutch (Northern) [nld]
Type(s)0-n
website , manual
File(s)0-n
unknown
URL(s)0-n
http://lands.let.ru.nl/cgn/ehome.htm , http://lands.let.ru.nl/cgn/doc_English/topics/version_1.0/overview.htm
Validation0-1
Type0-1
formal/content
Method(s)0-n
semi-automatic , automatic , manual
 