
Available collections

id
Description
Status
36 Aarssen-Bos This database contains 1021 transcripts collected in the Netherlands, Turkey, and Morocco by Jeroen Aarssen and Petra Bos, at Tilburg University. Bilingual data (either Turkish-Dutch or Moroccan Arabic-Dutch) were collected within the framework of a longitudinal study into development of bilingualism among Turkish and Moroccan children in the Netherlands.
17 ACADEMIA The Netherlands Institute for Sound and Vision (NISV) Academia collection contains audiovisual sources that can be used in higher education in the Netherlands. This unique selection from the NISV's archives has been made available and has been described by means of metadata specifically for education and science. These data are accessible only to education and research organisations that have an Academia License.
14 ADHD and SLI corpus UvA Video recordings of 67 children, made to compare the language and executive functioning profiles of children with ADHD, children with SLI, and typically developing (TD) children: 26 Dutch children with ADHD, 19 Dutch children with SLI, and 22 Dutch control children.
35 Asymmetries Corpus The Asymmetries Project collection contains Dutch language productions gathered in Groningen and neighboring towns in the northern Netherlands, between 2007 and 2012. The research was carried out by members of the NWO/Vici project “Asymmetries in Grammar” at the University of Groningen. This project investigates asymmetries between production and comprehension in unimpaired children, in young and elderly adults, and in autistic and ADHD children and adolescents. It is funded by a grant from the Netherlands Organization for Scientific Research (NWO) awarded to Petra Hendriks (grant no. 277-70-005). All participants are native Dutch speakers. The participants in the CK sub-corpus have no history of language problems. The CK sub-corpus includes 31 typically developing children (4;3-6;5, mean 5;6), 20 young adults (18-35, mean 26;2), and 20 elderly adults (69-87, mean 78;8). The groups are balanced for sex.
13 Bilingual deaf children RU Video recordings of 11 deaf children, for a longitudinal investigation of the bilingual language and communication development of young deaf children in Sign Language of the Netherlands (SLN) and Dutch (D).
37 Bol-Kuiken This corpus includes data from the SLI children of the GRAMAT research, carried out by Gerard Bol and Folkert Kuiken between 1984 and 1988. This research was supported by a grant from the Praeventiefonds at The Hague in the Netherlands (Nr. 28-798). The 31 Dutch normally developing children (15 male and 16 female) in this research ranged in age from 1;07.11 to 3;07.5. Sixteen children were recorded at two moments in time, so that the total number of recordings is 47. The 20 Dutch children with Down’s syndrome (10 male and 10 female) in this research ranged in age from 4;04.21 to 18;11.5. Their intelligence level, determined with a Dutch adaptation of the Merrill-Palmer Preschool Performance Tests, ranged from 20 to 56. Their mental age was at least 3;6 years. The 20 Dutch SLI children (5 female and 15 male) in this research ranged in age from 4;01.16 to 8;01.17. The children all lacked sufficient intellectual or physiological impairment to account for their difficulties in language production. The speech of the 20 SLI children was audiotaped at their school, while they were playing with their speech therapist in a free-play situation. One of the two investigators was present in the room of the speech therapist. From time to time the investigator participated in the conversation. From each child 100 analyzable utterances were transcribed by the investigator who had been present at the recording. After that the utterances were analyzed according to the GRAMAT framework. GRAMAT (Grammatical Analysis of Developmental Language Disorders) is a Dutch adaptation of the descriptive morphosyntactic framework used by Crystal, Fletcher and Garman (1976). The 20 Dutch hearing impaired children (11 male and 9 female) in this research ranged in age from 3;11.14 to 9;00.15. They all suffered from sensorineural or mixed hearing loss. Their hearing impairment had been diagnosed before they were 1;6 years of age.
Their hearing loss ranged from 40 to 85 dB pure tone average in the better ear. The children were of normal intelligence. The speech of the 20 hearing impaired children was audiotaped at their school, while they were playing with their speech therapist in a free-play situation. One of the two investigators was present in the room of the speech therapist. From time to time the investigator participated in the conversation. From each child 100 analyzable utterances were transcribed by the investigator who had been present at the recording.
50 CGN1.0 The Spoken Dutch Corpus project was aimed at the construction of a database of contemporary standard Dutch as spoken by adults in The Netherlands and Flanders. The intended size of the corpus was ten million words (about 1,000 hours of speech), two thirds of which would originate from the Netherlands and one third from Flanders. The total number of words available is nearly 9 million (800 hours of speech). Some 3.3 million words were collected in Flanders, well over 5.6 million in The Netherlands. The corpus comprises a large number of samples of (recorded) spoken text. The entire corpus has been transcribed orthographically, while the transcripts have been linked to the speech files. The orthographic transcription was used as the starting-point for the lemmatization and part-of-speech tagging of the corpus. For a selection of one million words, a (verified) broad phonetic transcription has been produced, while for this part of the corpus also the alignment of the transcripts and the speech files has been verified at the word level. In addition, a selection of one million words has been annotated syntactically. Finally, for a more modest part of the corpus, approximately 250,000 words, a prosodic annotation is available. The corpus comes with metadata pertaining to speakers and recording conditions. Furthermore with the Corpus a lexicon and frequency lists (word tokens, lemmas, POS tags, pronunciation variants) are available. These are, however, not available with this version 1.0 at MPI.
1 CGN2.0 The Spoken Dutch Corpus project was aimed at the construction of a database of contemporary standard Dutch as spoken by adults in The Netherlands and Flanders. The intended size of the corpus was ten million words (about 1,000 hours of speech), two thirds of which would originate from the Netherlands and one third from Flanders. The total number of words available is nearly 9 million (800 hours of speech). Some 3.3 million words were collected in Flanders, well over 5.6 million in The Netherlands. The corpus comprises a large number of samples of (recorded) spoken text. The entire corpus has been transcribed orthographically, while the transcripts have been linked to the speech files. The orthographic transcription was used as the starting-point for the lemmatization and part-of-speech tagging of the corpus. For a selection of one million words, a (verified) broad phonetic transcription has been produced, while for this part of the corpus also the alignment of the transcripts and the speech files has been verified at the word level. In addition, a selection of one million words has been annotated syntactically. Finally, for a more modest part of the corpus, approximately 250,000 words, a prosodic annotation is available. The corpus comes with metadata pertaining to speakers and recording conditions. Furthermore with the Corpus a lexicon and frequency lists (word tokens, lemmas, POS tags, pronunciation variants) are available.
38 CLPF Corpus Audio recordings of twelve Dutch children in the stage of early words. Longitudinal study. Orthographic and phonetic transcriptions in CHAT.
62 CoDoSiS This data collection comprises a set of test files developed in the CoDoSiS project of the CLARIAH program. The project ‘Combining Data on slavery in Surinam’ (CoDoSiS) aimed to develop a strategy to convert existing datasets on Surinam slavery into Linked Data by using the CLARIAH wp4-tool Cows and to combine them into one database network with relevant connections using the CLARIN-tool TICCL. This collection contains the test files that were used in the project. The test files are subsets of the complete datasets of the complete registers focussed around the year 1846 and were used to test the two issues defined above in the CoDoSiS project. Included in the collection are: - Slavenregisters_Picturae_plantages_1846 (TICCL and LOD) - Slavenregisters_Everaert_1846 (TICCL and LOD) - Monsterrol_Catharina_Sophia_1846 (TICCL and LOD) - Slavenregisters_Picturae_eigenaren_1846 (TICCL) - Wijkregisters_Paramaribo_1846 (TICCL) - Surinaamsche_almanak_plantages_1846 (LOD) - Surinaamsche_almanak_personen_1846 (TICCL)
4 D-LUCEA The LUCEA corpus (Longitudinal University College Utrecht Corpus of English Accents) was collected to study phonetic convergence in a multilingual environment. Students and teachers at University College Utrecht (UCU) come from various countries and native languages, yet they all use English as the lingua franca on campus. Hence, phonetic convergence may result in a unique international version of English, influenced by the speakers’ native languages and accents. The corpus now contains data from about 850 interviews from 282 unique students. Each interview contains about 20 minutes of speech. The speech corpus is augmented with participants’ responses from entry and exit questionnaires, and supplementary data about the participants and about each recording. When finished in 2016, the total corpus will contain about 3 TB (about 3000 GB) of audio data.
18 dbd The Dutch Bilingual Database (DBD) is a rather substantial collection of data (over 1,500 sessions) from a number of projects and research programmes that were directed at investigating multilingualism and comprises data originating from Dutch, Sranan, Sarnami, Papiamentu, Moroccan(-Arabic), Berber and Turkish speakers.
15 Deaf adults RU database Recordings of the writing process to investigate the acquisition of Dutch by deaf Dutch adults (late L1/early L2) and comparison to hearing Turkish and Moroccan-Arabic L2-learners of Dutch (late L2) on morphosyntactic aspects. Participants: 46 deaf Dutch adults, 38 hearing Turkish adults, 24 hearing Moroccan adults, and 10 Dutch controls.
19 Degand 2.2 The Degand 2.2 subcorpus was compiled for a study of the causal connectives 'aangezien', 'want' and 'omdat' in Dutch news. It consists of 143 cases from a Dutch newspaper (NRC Handelsblad from 1994). The DiscAn corpus is a collection of subcorpora of Dutch language that have been annotated at the level of discourse. These subcorpora form a set of Dutch corpus analyses of coherence relations and discourse connectives that have been compiled and annotated by researchers at several universities in The Netherlands and Belgium. In the DiscAn project, funded by CLARIN-NL, this set of corpus analyses has been standardized (both in terms of raw data – the texts – and analyses) and opened up for further scientific research.
39 DeHouwer Corpus This corpus of Dutch child language and child-directed speech was collected in Antwerp, Belgium. The corpus consists of 15 recordings transcribed orthographically and phonetically. Some transcripts also contain variety codes, speaker codes, addressee codes and utterance numbers (see further below). Participants are four children between the ages of ca. 4;9 and 5;0 (two boys Dieter and Michiel, and two girls Kim and Katrien) and their families, with some other persons on occasion present as well. The families are lower-middle to middle-middle class. All children are addressed in some form of Dutch common around the city of Antwerp and go to school fulltime (second year of nursery school). They are being raised monolingually. The interactions are mostly free and spontaneous, but include some structured interactions as well, in which the mother or father had a conversation with the 4-year-old about the past day at school, or prompted the child to describe a picture and tell a picture book story.
10 DIDDD The data in DiDDD (Diversity in Dutch DP Design; http://www.meertens.knaw.nl/diddd/) were collected between 2005 and 2009 with oral and written interviews in about 200 locations in the Dutch language area, with a methodology highly parallel to DynaSAND. The data involve translations of and judgements on test sentences. The DiDDD data cover the morphosyntactic variation within nominal groups, in particular possessives, partitives, noun ellipsis, the demonstrative system, the numeral modification system, what-for constructions, quantitative er, adjectival inflection, negation and exclamatives.
23 DuELME DUELME LMF is an electronic lexicon that contains more than 5,000 Dutch multiword expressions (MWEs). MWEs with the same syntactic pattern are grouped in the same equivalence class. The DUELME LMF lexicon is suitable for theoretical research on multiword expressions as well as for use in NLP systems.
9 DynaSand The dynamic syntactic atlas of the Dutch dialects. The data in DynaSAND, the dynamic syntactic atlas of the Dutch dialects (http://www.meertens.knaw.nl/sand/), were collected between 2000 and 2005 by oral interviews (fieldwork and telephone) in about 300 locations across The Netherlands, Belgium and a small part of north-west France. Dialect speakers were asked to judge and/or translate some 150 test sentences. DynaSAND makes available the full recordings and transcriptions of these interviews. Together, the DynaSAND data cover the syntactic variation in the Dutch language area in the left periphery of the clause (the complementizer system and complementizer agreement), variation in subject pronoun form depending on syntactic position, subject pronoun doubling, cliticization on YES/NO, the reflexive system, fronting constructions (Wh-clauses, relative clauses, topicalization), word order and morphological variation in verb clusters, negation and quantification.
34 ESF The European Science Foundation Second Language Acquisition by Adult Immigrants collected spontaneous second language acquisition data of forty adult immigrant workers living in Western Europe. The program ran over 5 ½ years and was preceded by a one-year pilot study. It had been planned as a longitudinal comparative study in five European countries: France, (Federal Republic of) Germany, Great Britain, The Netherlands, and Sweden. Financial support came from the Max-Planck-Gesellschaft (Germany) and the research councils of France (CNRS), The Netherlands (NWO), Switzerland (FNS), and Norway (NAVS).
28 ETCBC4 The ETCBC database of the Hebrew Bible (formerly known as the WIVU database) contains the scholarly text of the Hebrew Bible with linguistic markup.
61 FAME Radio Broadcast Corpus A large broadcast database is created by collecting recordings from the archives of the regional broadcaster Omrop Fryslân, and annotating them with various information such as the language switches and speaker details. The collection comprises over 3000 hours and the transcription and speaker annotation have been performed automatically by the speech and speaker recognition technology developed in the NWO FAME! project. Metadata provided on the paper labels of the original audio tapes were digitized by Fryske Hannen under supervision of Omrop Fryslân and Tresoar. The stereo audio data has a sampling frequency of 48 kHz and 16-bit resolution per sample. Transcriptions with time alignments are provided as CTM files. Speaker information is provided in RTTM files.
60 FAME Speech Corpus The components of the Frisian data collection are speech and language resources gathered for building a large vocabulary ASR system for the Frisian language. Firstly, a new broadcast database is created by collecting recordings from the archives of the regional broadcaster Omrop Fryslân, and annotating them with various information such as the language switches and speaker details. The second component of this collection is a language model created on a text corpus with diverse vocabulary. Thirdly, a Frisian phonetic dictionary with the mappings between the Frisian words and phones is built to make the ASR viable for this under-resourced language. Finally, an ASR recipe is provided which uses all previous resources to perform recognition and present the recognition performances. The Corpus consists of short utterances extracted from 203 audio segments of approximately 5 minutes each, which are parts of various radio programs covering a time span of almost 50 years (1966-2015), adding a longitudinal dimension to the database. The content of the recordings is very diverse, including radio programs about culture, history, literature, sports, nature, agriculture, politics, society and languages. The total duration of the manually annotated radio broadcasts sums up to 18 hours, 33 minutes and 57 seconds. The stereo audio data has a sampling frequency of 48 kHz and 16-bit resolution per sample. The available meta-information helped the annotators to identify these speakers and mark them either using their names or the same label (if the name is not known). There are 309 identified speakers in the FAME! Speech Corpus, 21 of whom appear at least 3 times in the database. These speakers are mostly program presenters and celebrities appearing multiple times in different recordings over years. There are 233 unidentified speakers due to lack of meta-information. The total number of word- and sentence-level code-switching cases in the FAME!
Speech Corpus is equal to 3837. Music portions have been removed, except where these overlap with speech. Later, the components for speaker clustering and verification experiments were added by adding around 80 hours of raw speech data and reorganizing the manually annotated data respectively. Music portions of the raw data have been automatically removed. Moreover, we applied a publicly available speaker diarization system to the raw speech data and included the output in the corpus. Further details about the speaker clustering and verification database are available in the last reference below. A full description of the FAME! Speech Corpus is provided in: Yilmaz, E., Heuvel, H. van den, Van de Velde, H., Kampstra, F., Algra, J., Leeuwen, D. van (2016): Open Source Speech and Language Resources for Frisian Language. In: Proceedings Interspeech 2016, San Francisco, CA, USA, Sept. 2016. For the details of the ASR corpus, we refer the reader to: Yılmaz, E., Andringa, M., Kingma, S., Dijkstra, J., Kuip, van der F., Van de Velde, H., Kampstra, F., Algra, J., Heuvel, H. van den, Leeuwen, D. Van (2016): A Longitudinal Bilingual Frisian-Dutch Radio Broadcast Database Designed for Code-switching Research. In: Proceedings LREC, pp. 4666-4669, Portorož, Slovenia, May 2016. The details of the speaker clustering and verification corpus are provided in: Yılmaz, E., Dijkstra, J., Kuip, van der F., Van de Velde, H., Kampstra, F., Algra, J., Heuvel, H. van den, Leeuwen, D. Van (2017): Longitudinal Speaker Clustering and Verification Corpus with Code-switching Frisian-Dutch Speech. In: Proceedings Interspeech, pp. 37-41, Stockholm, Sweden, August 2017.
40 Gillis Corpus This corpus contains longitudinal data from a boy learning Dutch. The corpus was donated to CHILDES by Steven Gillis, Department of Germanic Linguistics, University of Antwerp, Belgium. The data are in CHAT format without English glosses. The child, Maarten, was a Flemish boy learning Dutch. Biweekly videotapings were made at the child’s home between the ages of 0;11.15 and 1;11.28. Recordings began when the child’s vocalizations exhibited what Dore, Franklin, Miller, and Ramer (1976) called phonetically consistent forms. They lasted until the child’s MLU exceeded 1.5 for three consecutive sessions. The entire corpus consists of 29,324 intelligible child utterances. The child was recorded for an average of 3 hours a week for a total of 104 hours of recording (average: 1:18 hours per recording, with a range of 0:15:18 hours to 3:44:52 hours). The sessions included interactions between the child and an adult (usually his mother) as well as solitary play. All recordings were made in an unstructured regular home setting.
41 Groningen Corpus This corpus contains longitudinal data from seven Dutch children (six boys and one girl) between 1;05 and 3;07. The data (208 audio recordings totaling more than 170 hours) have been gathered in a research project supported by the Dutch Organisation for Scientific Research (NWO) grants
11 GTRP The data in GTRP (Goeman, Taeldeman, van Reenen Project; http://www.meertens.knaw.nl/mand/database/) were collected between 1979 and 2000 with oral interviews in about 600 locations in the Dutch language area. Informants were asked to translate words or short sentences. Part of the transcriptions have been lined up with the sound recordings. The morphological data in GTRP include plural forms of nouns, diminutives, gender on nouns and adjectives, comparatives, superlatives, verbal inflection including participles, subject, object and possessive pronouns.
31 IFA speech The IFA Spoken Language corpus is a free (GPL) database of hand-segmented Dutch speech. It was constructed with off-the-shelf software using speech from 8 speakers in a variety of speaking styles. For a total of 50,000 words (41 minutes/speaker), speech acquisition and preparation took around 3 person-weeks per speaker. Hand segmentation took 1,000 hours of labeling altogether. The asymptotic segmentation speed was about one word, or four boundaries, per minute.
30 IFAVID The IFA Dialog Video corpus is a collection of annotated video recordings of friendly Face-to-Face dialogs. It is modelled on the Face-to-Face dialogs in the Spoken Dutch Corpus (CGN). The procedures and design of the corpus were adapted to make this corpus useful for other researchers of Dutch speech. For this corpus 20 dialog conversations of 15 minutes were recorded and annotated, in total 5 hours of speech. To stay close to the very useful Face-to-Face dialogs in the CGN, pairs of well acquainted participants were selected, either good friends, relatives, or long-time colleagues. The participants were allowed to talk about any topic they wanted.
24 INTER-VIEWS The Netherlands Veterans Institute (VI) hosts about 250 interviews (audio) in which Dutch former military personnel speak about their experiences during World War II (interviews about the years 1935-1945) and decolonisation in the Dutch East Indies (1945-1950) and Dutch New Guinea (1960-1962). In the project Living Oral History Workbench these interviews have been indexed by automatic speech recognition techniques. The list of interviews and their metadata are available at the CLARIN Center; researchers may apply to VI for access to the data.
25 IPNV The IPNV Corpus is a corpus originally compiled by the Veteraneninstituut (VI). It comprises a collection of more than 1,100 (recorded) interviews with veterans who were involved in wars and other military actions that the Dutch military forces took part in. The average duration of an interview is 2.5 hours. Most interviews are with veterans of World War II, the decolonization wars with Indonesia and New Guinea, the UN action in Korea, the UN observer mission in Lebanon, UN missions in Cambodia and former Yugoslavia, and the NATO missions in Iraq and Afghanistan.
26 LAISEANG The geographical region of insular South East Asia and New Guinea is well-known as an area of mega-biodiversity. Less well-known is the extreme linguistic diversity in this area: over a quarter of the world’s 6000 languages are spoken here. As small minority languages, most of these will cease to be spoken in the coming few generations. The LAISEANG corpus ensures the preservation of unique records of languages and the cultures encapsulated by them in the region. The language resources have been gathered by twenty linguists at, or in collaboration with, Dutch universities over the last 40 years, and are compiled and archived in collaboration with The Language Archive (TLA) in Nijmegen.
5 LESLLA The LESLLA corpus dates from 2003-2005 and contains speech of 15 low-educated learners of Dutch as a second language. The learners are all women. Eight learners have a Turkish background, seven learners have a Moroccan background. From these learners, data were collected over time in three cycles, with an interval of 5 months. In each cycle the participants took part in three types of tasks: (1) production tasks, (2) perception tasks, and (3) a perception task with a metalinguistic component. The production tasks included open elicitation (14,000 utterances), closed completion (4,000 utterances) and imitation (6,000 utterances). Apart from the audio files, the data consist of orthographic transcriptions. For all 15 learners there are also metadata available.
29 LIEDNL The Dutch Song Database (Nederlandse Liederenbank) contains about 170,000 Dutch songs (as of mid-2014). In principle these are songs in the Dutch language, from both the Netherlands and Flanders. The database spans about 900 years, from the Middle Ages to the twenty-first century. The type of songs and the degree of coverage vary by period, collection and repertoire. In percentage terms, the Middle Ages have the highest share of songs included. The percentage gradually decreases towards the present. From the twentieth century, the database mainly contains songs from folk song collections and fieldwork recordings. Even so, this still amounts to many thousands of songs.
27 NEHOL NEHOL is a digitally accessible and searchable database with the Dutch-lexifier Creole language Negerhollands, in the same format as the parallel SUCA (SUriname Creole Archive) corpus, coordinated by Margot van den Berg (Radboud University Nijmegen). The NEHOL project was coordinated by Pieter Muysken (Radboud University Nijmegen), and technically supported by the TLA (‘The Language Archive’) unit at the MPI for Psycholinguistics in Nijmegen.
33 NGT The Corpus NGT is an open access online corpus of movies with annotations of Sign Language of the Netherlands (abbreviated as SLN or NGT).
2 NPCMC The Nijmegen Parsed Corpus of Modern Chechen contains a number of manually corrected syntactically annotated texts. Some of the texts originally come from a corpus of Chechen texts created by Ron Zacharski & Jim Cowie at the New Mexico State University. These texts were originally available at http://guidetodatamining.com/appendices/corpora/. They are not available there anymore, but see the paper talking about this corpus: http://mt-archive.info/AMTA-2006-Abdelali.pdf.
53 NRC2011 Newspaper texts taken from printed and digital versions of the NRC newspaper (edition 2011). The texts cover blogs, hard news, background articles, opinion articles on related topics. Metadata per text are available in CMDI XML files. The 'NRC2011' corpus has been created for the CLARIAH sponsored ACAD project. See https://www.clariah.nl/projecten/research-pilots/acad/acad and https://cesar.science.ru.nl/. Cooperators: Micha Hulsbosch (Radboud University Nijmegen, Faculty of Arts, Humanities Lab, TSG); Wilbert Spooren (Radboud University Nijmegen, Faculty of Arts, Dutch language); Erwin R. Komen (Radboud University Nijmegen, Faculty of Arts, Humanities Lab, TSG). The corpus contains 2225 newspaper texts. FLAT can be used to open and view the FoLiA files; see https://flat.science.ru.nl/. The file textlist-folia.json contains an overview of all available texts in JSON format. The file NRCLicentieovereenkomst.pdf contains the License Agreement with NRC.
20 PanderMaatSanders The PanderMaatSanders subcorpus was compiled for a study of the causal connectives 'daardoor', 'daarom' and 'dus' in Dutch news. It consists of 100 cases from a Dutch newspaper (de Volkskrant 1994 and 1995). The DiscAn corpus is a collection of subcorpora of Dutch language that have been annotated at the level of discourse. These subcorpora form a set of Dutch corpus analyses of coherence relations and discourse connectives that have been compiled and annotated by researchers at several universities in The Netherlands and Belgium. In the DiscAn project, funded by CLARIN-NL, this set of corpus analyses has been standardized (both in terms of raw data – the texts – and analyses) and opened up for further scientific research.
21 Pit The Pit narrdow and totadow subcorpora were compiled for a study of the causal connectives "aangezien", "doordat", "omdat" and "want" in Dutch, German and French narratives and news. The narrdow subcorpus consists of cases from 22 Dutch novels published between 1990 and 1996. The totadow subcorpus consists of cases from a Dutch newspaper (de Volkskrant) from 1995. The DiscAn corpus is a collection of subcorpora of Dutch language that have been annotated at the level of discourse. These subcorpora form a set of Dutch corpus analyses of coherence relations and discourse connectives that have been compiled and annotated by researchers at several universities in The Netherlands and Belgium. In the DiscAn project, funded by CLARIN-NL, this set of corpus analyses has been standardized (both in terms of raw data – the texts – and analyses) and opened up for further scientific research.
22 SandersSpooren The SandersSpooren subcorpus was compiled for a study of the causal connectives 'want' and 'omdat' in several types of discourse. It consists of cases from news, spontaneous conversation and chat data. It comprises 553 cases in total of omdat and want from newspapers, spontaneous conversations and chat interaction: 100 cases of omdat and 102 cases of want from newspapers (D-Coi); 100 cases of omdat and 100 cases of want from conversations (CGN); 51 cases of omdat and 100 cases of want from chat interaction (VU-Chat-corpusI). The DiscAn corpus is a collection of subcorpora of Dutch language that have been annotated at the level of discourse. These subcorpora form a set of Dutch corpus analyses of coherence relations and discourse connectives that have been compiled and annotated by researchers at several universities in The Netherlands and Belgium. In the DiscAn project, funded by CLARIN-NL, this set of corpus analyses has been standardized (both in terms of raw data – the texts – and analyses) and opened up for further scientific research.
42 Schaerlaekens Corpus The original database consists of the spontaneous language of two triplets (in total six children) between the ages of 1;10.18 and 3;1.7 for the first set and 1;6.17 and 2;10.23 for the second set.
12 SLI RU-Kentalis Corpus for investigation of the expression of spatial relations by children with SLI and normally developing children in their spoken language production.
3 SoNaR SoNaR is a 500-million-word reference corpus of contemporary written Dutch for use in different types of linguistic (incl. lexicographic) and HLT research and the development of applications. The STEVIN funded SoNaR project (2008-2011) built on the results obtained in the D-Coi and Corea projects which were awarded funding in the first call of proposals within the STEVIN programme. (SOURCE: CLARIN). The SoNaR Corpus consists of three parts: - SONAR 500: 500 M words automatically tokenized, lemmatized and POS-tagged - SONAR 1: 1 M words with handchecked semantic annotations: named entities, co-reference relations, semantic roles and spatio-temporal relations - SONAR New Media Corpus: Subpart of SoNaR 500 containing the material from new media only (tweets, chats, SMS)
51 test_sle20 description
43 Van Kampen Corpus The van Kampen corpus is based on recordings of two Dutch girls. Laura was studied from the age of 1;9.18 to 5;10.9 and Sarah from 1;6.16 to 6;0. The child’s age at each session is given inside each file. The recordings were made roughly once or twice every month by the mother of the children (Jacqueline van Kampen). The Laura corpus consists of 72 45-minute recordings. The Sarah corpus consists of 50 45-minute recordings.
44 Van Oosten Bilingual Corpus Picture descriptions by 20 children in the age range of 4-13 years. Half of the children were Italian-Dutch bilinguals and the other 10 were monolinguals (5 Dutch monolinguals and 5 Italian monolinguals). The study is reported in an MA thesis at the University of Utrecht entitled “Lo sviluppo dell’acquizione del soggetto nei bambini bilingui ital-olandesi.” Funding was obtained through a scholarship of the Royal Dutch Institute in Rome. This research seeks to determine whether the hypothesis of Müller & Hulk (2001) also holds for other linguistic phenomena that occur at the interface between syntax and pragmatics, such as subject acquisition. In early Dutch, the subject can be dropped when it contains old information. Adult Italian has the same option, due to the same pragmatic rule. However, in early Dutch the omission of the subject is constrained by the position of the specifier of the root, while there is no such constraint in Italian. In adult Italian there is just one structural analysis when it comes to omitting the subject: when the subject contains old information it has to be omitted, wherever the element may be located inside the phrase structure. If this analysis is correct, we predict that Italian grammar may influence Dutch grammar, because the bilingual child will choose the analysis that is favored by both languages. Italian/Dutch bilinguals should produce more null subjects than their monolingual peers, as bilinguals would overgeneralize the pragmatic rule that is common to both languages. In this study, bilingual children did produce significantly more null subjects in their Dutch corpus than their monolingual Dutch peers did. Müller & Hulk (2001) predict this influence to arise before the C-system is completed. However, the subjects of my research had completed the acquisition of their C-system, as we can see from their use of embedded structures and WH-phrases.
16 VU-DNC VU-DNC is a unique diachronic corpus of Dutch newspaper articles from five major Dutch newspapers from 1950/1951 and 2002 (2 MW). The VU-DNC has been annotated for quotations, which enables the researcher to differentiate between the words directly under responsibility of the journalist.
6 WBD Together with the Dictionary of the Limburgian Dialects (WLD) and the Dictionary of the Flemish dialects (WVD), the Dictionary of Brabantic Dialects (WBD) covers the entire Southern Dutch-speaking region below the major rivers, using the same type of descriptive dialect lexicography. This area stretches over three countries: the Netherlands, Belgium and France. For WBD, the area under study included Flemish Brabant and Antwerp in Flanders and Brabant in the Netherlands. The collection comprises three parts: I: Agricultural vocabulary II: Non-agricultural vocabulary III: General vocabulary For each part the information is available as PDFs of the books, LMF-versions of the lexicon and text-versions (CSV) of the lexicon. Information per keyword comprises: - lemmatisation - dialect entry (more or less phonetic) - comments - locations (in Kloeke codes and place names) - source information for the dialect entries
7 WGD The Dictionary of the dialects of Gelderland describes specific parts of the general vocabulary. The dictionary has been written around a certain theme. Unlike other local dictionaries that have been published in our province, the vocabulary has been arranged thematically rather than alphabetically. The dictionary presently comprises two areas: the River Area (Rivierengebied) and the Veluwe. For each part the information is available as PDFs of the books, LMF-versions of the lexicon and text-versions (CSV) of the lexicon. Information per keyword comprises: - lemmatisation - dialect entry (more or less phonetic) - comments - locations (place names)
55 Whatsapp corpus Berntzen WhatsApp conversations collected by master students Communication & Information Studies (2013-2014; 2014-2015). All participants in the conversations are over 18 and have signed consent forms. Metadata per conversation are available in CMDI XML files. The 'WhatsAppManon' corpus has been made available for the CLARIAH-sponsored ACAD project. See https://www.clariah.nl/projecten/research-pilots/acad/acad and https://cesar.science.ru.nl/. Cooperators: Micha Hulsbosch - Radboud University Nijmegen, Faculty of Arts, Humanities Lab, TSG Wilbert Spooren - Radboud University Nijmegen, Faculty of Arts, Dutch language Erwin R. Komen - Radboud University Nijmegen, Faculty of Arts, Humanities Lab, TSG Patrick Sonsma - Radboud University Nijmegen, Faculty of Arts, Dutch language Original researcher: Manon Berntzen - Radboud University Nijmegen, Faculty of Arts, Dutch language The corpus contains 60 WhatsApp chat sessions that were collected by Manon Berntzen for the course on "New media-new methods" and then for her Bachelor thesis. The exact date of each chat is included in the <event> tag attributes in the .folia.xml files. FLAT can be used to open and view the FoLiA files. See https://flat.science.ru.nl/ The participants have all indicated that their chats can be used (in an anonymized form) for research purposes. The file textlist-folia.json contains an overview of all available texts in JSON format. Note: the files are numbered 001-063 consecutively, but 058-060 (as well as 064) are excluded, because they lack permission.
52 Whatsapp corpus Verheijen WhatsApp data collected for the PhD research of Lieke Verheijen (Radboud University). Informed consent was obtained only from the contributor and not from the conversational partner. Consequently, the subcorpus only contains contributions from the submitter. Metadata per conversation are available in CMDI XML files. Ref: Verheijen, L., & Stoop, W. (2016, September). Collecting facebook posts and whatsapp chats. In International Conference on Text, Speech, and Dialogue (pp. 249-258). Springer, Cham. The corpus has been made available for the CLARIAH-sponsored ACAD project. See https://www.clariah.nl/projecten/research-pilots/acad/acad and https://cesar.science.ru.nl/. Cooperators: Micha Hulsbosch - Radboud University Nijmegen, Faculty of Arts, Humanities Lab, TSG Wilbert Spooren - Radboud University Nijmegen, Faculty of Arts, Dutch language Erwin R. Komen - Radboud University Nijmegen, Faculty of Arts, Humanities Lab, TSG Patrick Sonsma - Radboud University Nijmegen, Faculty of Arts, Dutch language Original researcher: Lieke Verheijen - Radboud University Nijmegen, Faculty of Arts, Dutch language The corpus contains 218 WhatsApp chat sessions that were collected by Lieke Verheijen in 2012-2014 in the Netherlands. The exact date of each chat is included in the <event> tag attributes in the .folia.xml files. FLAT can be used to open and view the FoLiA files. See https://flat.science.ru.nl/ The participants have all indicated that their chats can be used (in an anonymized form) for research purposes. The file textlist-folia.json contains an overview of all available texts in JSON format.
45 Wijnen Corpus The corpus is based on home tapings of one Dutch boy, Niek, between the ages of 2;7 and 3;10. The recordings were made by Niek’s father (Frank Wijnen). The data were mainly used in a project focusing on the relation between language acquisition and developmental disfluency.
8 WLD Together with the Dictionary of the Brabantic Dialects (WBD) and the Dictionary of the Flemish dialects (WVD), the Dictionary of Limburgian Dialects (WLD) covers the entire Southern Dutch-speaking region below the major rivers, using the same type of descriptive dialect lexicography. This area stretches over three countries: the Netherlands, Belgium and France. For WLD, the area under study included both Limburg and the northeast of Liege. The collection comprises three parts: I: Agricultural vocabulary II: Non-agricultural vocabulary III: General vocabulary For each part the information is available as PDFs of the books, LMF-versions of the lexicon and text-versions (CSV) of the lexicon. Information per keyword comprises: - lemmatisation - dialect entry (more or less phonetic) - comments - locations (in Kloeke codes and place names) - source information for the dialect entries
46 Zink Corpus The recordings for this corpus were made in Leuven, Brabant, Belgium (3 children: Meinder, Judith, Laurien) and Antwerp, Belgium (1 child: David). The participants were recorded every two weeks from 8 months to 25 months of age. Each recording session lasted approximately 60 minutes.