Available collections

id	Identifier	Description
194	Aari Wordlist	Dataset on Aari in the form of audio material, collected in Dimeka, Ethiopia.
36	Aarssen-Bos	This database contains 1021 transcripts collected in the Netherlands, Turkey, and Morocco by Jeroen Aarssen and Petra Bos, at Tilburg University. Bilingual data (either Turkish-Dutch or Moroccan Arabic-Dutch) were collected within the framework of a longitudinal study into development of bilingualism among Turkish and Moroccan children in the Netherlands.
173	Aasax, Akiek Dataset	Dataset on Aasax and Akiek in the form of audio material, collected in Tanzania.
17	ACADEMIA	The Netherlands Institute for Sound and Vision (NISV) Academia collection contains audiovisual sources that can be used in higher education in the Netherlands. This unique selection from the NISV's archives has been made available and has been described by means of metadata specifically for education and science. These data are accessible only to education and research organisations that have an Academia License.
65	ACADEMIA_65	The Netherlands Institute for Sound and Vision (NISV) Academia collection contains audiovisual sources that can be used in higher education in the Netherlands. This unique selection from the NISV's archives has been made available and has been described by means of metadata specifically for education and science. These data are accessible only to education and research organisations that have an Academia License.
110	Adamorobe Sign Language Dataset	Dataset on Adamorobe Sign Language, Berbey Sign Language and Malian Sign Language in the form of 50 hours of audiovisual material, collected in Adamorobe and Berbey, Ghana.
150	Adamorobe Sign Language Dataset 2	Dataset on Adamorobe Sign Language in the form of 27 GB of audiovisual material, collected in Adamorobe, Ghana.
14	ADHD and SLI corpus UvA	Video recordings of 67 children, made to compare the language and executive functioning profiles of children with ADHD, children with SLI, and TD children: 26 Dutch children with ADHD, 19 Dutch children with SLI, and 22 Dutch control children.
206	Akan Dataset	Dataset on Akan in the form of 5 hours of audio material, collected in Accra, Ghana. Transcription included, type unknown.
179	Akiek Dataset	Dataset on Akiek in the form of audio material, collected in Tanzania.
90	Alagwa Dataset	Dataset on Alagwa in the form of notebooks, collected in Tanzania. Transcription included, type unknown.
141	Alorese Dataset	Dataset on Alorese in the form of 2 hours of audio material, collected in Alor and Pantar, Indonesia. Transcription included, type unknown.
139	Alorese, Adang Dataset	Dataset on Alorese and Adang in the form of 35 hours of audio material, collected in Alor and Pantar, Indonesia. Transcription included, type unknown.
144	American Sign Language Dataset	Dataset on American Sign Language and Ghanaian Sign Language in the form of 21 hours of audiovisual material, collected in Accra and Koforidua, Ghana. Transcription included, type unknown.
142	Arabic Dataset	Dataset on Arabic dialects in the form of 60 interviews, collected in Al-Qasimi, Saudi Arabia.
159	Artificial Languages Dataset	Dataset on syllabic patterns in Artificial Language in the form of an EEG study on syllabic patterns, done in the Baby Lab in Leiden, The Netherlands. Transcription included, type unknown.
160	Artificial Languages Dataset 2	Dataset on Artificial Language in the form of reaction time measuring, collected in the Baby Lab in Leiden, The Netherlands.
35	Asymmetries Corpus	The Asymmetries Project collection contains Dutch language productions gathered in Groningen and neighboring towns in the northern Netherlands, between 2007 and 2012. The research was carried out by members of the NWO/Vici project “Asymmetries in Grammar” at the University of Groningen. This project investigates asymmetries between production and comprehension in unimpaired children, in young and elderly adults, and in autistic and ADHD children and adolescents. It is funded by a grant from the Netherlands Organization for Scientific Research (NWO) awarded to Petra Hendriks (grant no. 277-70-005). All participants are native Dutch speakers. The participants in the CK sub-corpus have no history of language problems. The CK sub-corpus includes 31 typically developing children (4;3-6;5, mean 5;6), 20 young adults (18-35, mean 26;2), and 20 elderly adults (69-87, mean 78;8). The groups are balanced for sex.
109	Austronesian languages Datatset	Simulation data on Austronesian languages in the form of cross-linguistic data formats (CLDF), collected in Indonesia. Aggregated from published sources and simulations. Transcription included, type unknown.
96	Automated Similarity Judgement Program Dataset	Database of the Automated Similarity Judgment Program (ASJP), containing wordlists for roughly two-thirds of the world's languages. https://asjp.clld.org/software. Transcription included, type unknown.
176	Awa Pit, Spanish Dataset	Dataset on Awa Pit and Spanish in the form of 17 hours of audio material, collected in Ecuador. Transcription included, type unknown.
197	Aymara, Spanish Dataset	Dataset on Aymara and Spanish in the form of book scans, collected in Peru. Transcription included, type unknown.
123	Bambara Dataset	Dataset on Bambara in the form of digitalized cassettes, collected in Mali.
13	Bilingual deaf children RU	Video recordings of 11 deaf children, for a longitudinal investigation of the bilingual language and communication development of young deaf children in Sign Language of the Netherlands (SLN) and Dutch (D).
37	Bol-Kuiken	This corpus includes data from the SLI children of the GRAMAT research, carried out by Gerard Bol and Folkert Kuiken between 1984 and 1988. This research was supported by a grant from the Praeventiefonds at The Hague in the Netherlands (Nr. 28-798). The 31 Dutch normally developing children (15 male and 16 female) in this research ranged in age from 1;07.11 to 3;07.5. Sixteen children have been recorded at two moments in time so that the total number of recordings is 47. The 20 Dutch children with Down’s syndrome (10 male and 10 female) in this research ranged in age from 4;04.21 to 18;11.5. Their intelligence level, determined with a Dutch adaptation of the Merrill-Palmer Preschool Performance Tests, ranged from 20 to 56. Their mental age was at least 3;6 years. The 20 Dutch SLI children (5 female and 15 male) in this research ranged in age from 4;01.16 to 8;01.17. The children all lacked sufficient intellectual or physiological impairment to account for their difficulties in language production. The speech of the 20 SLI children was audiotaped at their school, while they were playing with their speech therapist in a free-play situation. One of the two investigators was present in the room of the speech therapist. From time to time the investigator participated in the conversation. From each child 100 analyzable utterances were transcribed by the investigator who had been present at the recording. After that the utterances were analyzed according to the GRAMAT framework. GRAMAT (Grammatical Analysis of Developmental Language Disorders) is a Dutch adaptation of the descriptive morphosyntactic framework used by Crystal, Fletcher and Garman (1976). The 20 Dutch hearing impaired children (11 male and 9 female) in this research ranged in age from 3;11.14 to 9;00.15. They all suffered from sensorineural or mixed hearing loss. Their hearing impairment had been diagnosed before they were 1;6 years of age. Their hearing loss ranged from 40 to 85 dB pure tone average on the better ear. The children were of normal intelligence. The speech of the 20 hearing impaired children was audiotaped at their school, while they were playing with their speech therapist in a free-play situation. One of the two investigators was present in the room of the speech therapist. From time to time the investigator participated in the conversation. From each child 100 analyzable utterances were transcribed by the investigator who had been present at the recording.
118	Bondei Dataset	Dataset on Bondei in the form of digitalized cassettes, collected in Tanzania.
153	Bouakako Sign Language Dataset	Dataset on Bouakako Sign Language in the form of 155 GB of audiovisual material, collected in Ivory Coast.
81	Bété Dataset	Dataset on Bété in the form of 7 digitalized cassettes, collected in Ivory Coast.
63	CAREGIVER	A multi-lingual speech corpus used for modeling language acquisition called CAREGIVER has been designed and recorded within the framework of the EU funded Acquisition of Communication and Recognition Skills (ACORNS) project. The motivation behind the corpus and its design relies on current knowledge regarding infant language acquisition. Instead of recording infants and children, the voices of their primary and secondary caregivers were captured in both infant-directed and adult-directed speech modes over four languages in a read speech manner. The challenges and methods applied to obtain similar prompts in terms of complexity and semantics across different languages, as well as the normalized recording procedures employed at different locations, are covered. An orthographic transcription is available for every utterance. Also, time-aligned word and phone annotations for some of the sub-corpora exist. The design of the corpus which is a good source of documentation is described in a paper published in LREC 2010: Altosaar, T., Bosch, L. ten, Aimetti, G., Koniaris, Chr., Demuynck, K., Heuvel, H. van den (2010): A Speech Corpus for Modeling Language Acquisition: CAREGIVER. Proceedings LREC2010, Malta, pp. 1062-1068. http://www.lrec-conf.org/proceedings/lrec2010/pdf/597_Paper.pdf. However, in the actual corpus there are a couple of deviations from this setup. The corpus contains nearly 66,000 utterance-based audio files spoken over a two-year period by 16 male and 14 female native speakers of Dutch, English, and Finnish. Swedish is missing. For Dutch only Y2 recordings are available. Here is an overview: ACORNS_Y1/UK/: SpeakerD SpeakerC SpeakerB SpeakerA: Four English speakers Y1 recordings (2 male, 2 female). There are 1000 recordings and orthographic transcriptions (in xml) per speaker. ACORNS_Y1/fin/: FIN-M-SF FIN-M-MT FIN-F-KA FIN-F-JL Four Finnish speakers Y1 recordings (2 male, 2 female). 
There are 2000 recordings and orthographic transcriptions (in xml) per speaker. ACORNS_Y2/UK/: recordings of 10 speakers, Speaker01-04 are the same as for Y1. For each speaker there are 2397 recordings. The other six speakers (3 male, 3 female) are test speakers with each 600 recordings. Y2-UK-XML: orthographic transcriptions in xml Y2-UK-WAV: speech recordings old_xml: old version of Y2-UK-XML. May be discarded. annotation: ACORNS-Y2-UK-v2-FA: time stamps at word level by Forced Alignment Y2-UK-v2-FA-phone: time stamps at phone level by Forced Alignment list_of_errors: errors in time stamps at word level ACORNS_Y2/NL/: Recordings of Dutch speakers. 4 speakers were recorded twice, 2 males (henk, peter) and 2 females (els, margot), the other six were test speakers with one recording session, 4 males (eric, folkert, helmer, vico) and 2 females (daphne, hella). The .cor files contain the orthographic transcriptions with time stamps (sentence level only). ACORNS_Y2/FIN/: recordings of 10 speakers, Speaker01-04 are the same as for Y1. For each speaker there are 2397 recordings. The other six speakers (3 male, 3 female) are test speakers with each 600 recordings. Y2-FIN-XML: orthographic transcriptions in xml Y2-FIN-WAV: speech recordings
50	CGN1.0	The Spoken Dutch Corpus project was aimed at the construction of a database of contemporary standard Dutch as spoken by adults in The Netherlands and Flanders. The intended size of the corpus was ten million words (about 1,000 hours of speech), two thirds of which would originate from the Netherlands and one third from Flanders. The total number of words available is nearly 9 million (800 hours of speech). Some 3.3 million words were collected in Flanders, well over 5.6 million in The Netherlands. The corpus comprises a large number of samples of (recorded) spoken text. The entire corpus has been transcribed orthographically, while the transcripts have been linked to the speech files. The orthographic transcription was used as the starting-point for the lemmatization and part-of-speech tagging of the corpus. For a selection of one million words, a (verified) broad phonetic transcription has been produced, while for this part of the corpus also the alignment of the transcripts and the speech files has been verified at the word level. In addition, a selection of one million words has been annotated syntactically. Finally, for a more modest part of the corpus, approximately 250,000 words, a prosodic annotation is available. The corpus comes with metadata pertaining to speakers and recording conditions. Furthermore with the Corpus a lexicon and frequency lists (word tokens, lemmas, POS tags, pronunciation variants) are available. These are however not available with this version 1.0 at MPI.
1	CGN2.0	The Spoken Dutch Corpus project was aimed at the construction of a database of contemporary standard Dutch as spoken by adults in The Netherlands and Flanders. The intended size of the corpus was ten million words (about 1,000 hours of speech), two thirds of which would originate from the Netherlands and one third from Flanders. The total number of words available is nearly 9 million (800 hours of speech). Some 3.3 million words were collected in Flanders, well over 5.6 million in The Netherlands. The corpus comprises a large number of samples of (recorded) spoken text. The entire corpus has been transcribed orthographically, while the transcripts have been linked to the speech files. The orthographic transcription was used as the starting-point for the lemmatization and part-of-speech tagging of the corpus. For a selection of one million words, a (verified) broad phonetic transcription has been produced, while for this part of the corpus also the alignment of the transcripts and the speech files has been verified at the word level. In addition, a selection of one million words has been annotated syntactically. Finally, for a more modest part of the corpus, approximately 250,000 words, a prosodic annotation is available. The corpus comes with metadata pertaining to speakers and recording conditions. Furthermore with the Corpus a lexicon and frequency lists (word tokens, lemmas, POS tags, pronunciation variants) are available.
95	Classical Armenian Dataset	Etymological database of sustenance terminology of (Classical) Armenian. It also includes cognate info from other languages.
38	CLPF Corpus	Audio recordings of twelve Dutch children in the stage of early words. Longitudinal study. Orthographic and phonetic transcriptions in CHAT.
102	Coastal Terengganu Malay	Dataset on Coastal Terengganu Malay in the form of 9 GB of audio material, collected in Terengganu, Malaysia. Considers a local variety. Transcription included, type unknown.
62	CoDoSiS	This data collection comprises a set of test files developed in the CoDoSiS project of the CLARIAH program. The project ‘Combining Data on slavery in Surinam’ (CoDoSiS) aimed to develop a strategy to convert existing datasets on Surinam slavery into Linked Data by using the CLARIAH wp4-tool Cows and to combine them into one database network with relevant connections using the CLARIN-tool TICCL. This collection contains the test files that were used in the project. The test files are subsets of the complete datasets of the complete registers focussed around the year 1846 and were used to test the two issues defined above in the CoDoSiS project. Included in the collection are: - Slavenregisters_Picturae_plantages_1846 (TICCL and LOD) - Slavenregisters_Everaert_1846 (TICCL and LOD) - Monsterrol_Catharina_Sophia_1846 (TICCL and LOD) - Slavenregisters_Picturae_eigenaren_1846 (TICCL) - Wijkregisters_Paramaribo_1846 (TICCL) - Surinaamsche_almanak_plantages_1846 (LOD) - Surinaamsche_almanak_personen_1846 (TICCL)
216	Connecting Conditionals (Reuneker 2022; dissertation)	Connecting Conditionals: A Corpus-based Approach to Conditionals in Dutch
92	Cuneiform Luwian Dataset	Dataset on Cuneiform Luwian in the form of a Python script, stemming from the following dictionary: Melchert (1993) Cuneiform Luvian Lexicon.
203	Cypriot Greek Dataset	Dataset on urban speech in Cypriot Greek in the form of 110 GB of IPA transcripts, collected in Cyprus. Transcription included, type unknown.
4	D-LUCEA	The LUCEA corpus (Longitudinal University College Utrecht Corpus of English Accents) was collected to study this type of phonetic convergence in a multilingual environment. Students and teachers at University College Utrecht (UCU) come from various countries and native languages, yet they all use English as the lingua franca on campus. Hence, phonetic convergence may result in a unique international version of English, influenced by the speakers’ native languages and accents. The corpus now contains data from about 850 interviews from 282 unique students. Each interview contains about 20 minutes of speech. The speech corpus is augmented with participants’ responses from entry and exit questionnaires, and supplementary data about the participants and about each recording. When finished in 2016, the total corpus will contain about 3 TB (about 3000 GB) of audio data.
103	Danish Datatset	Dataset on dialects in Danish in the form of 300 GB of audio material, collected in Denmark.
214	Dataset on Danish kæmpe	Dataset on Danish kæmpe
215	Dataset on Voicing in Danish stops	Dataset on voicing in Danish stops
217	Dataset on Ælfric vocabulary	Dataset on Ælfric vocabulary
121	Datoga Dataset	Dataset on Datoga in the form of 1 digitalized cassette, collected in Tanzania. Contains singing.
85	Datooga Dataset	Dataset on Datooga in the form of 1 digitalized cassette, collected in Tanzania.
18	dbd	The Dutch Bilingual Database (DBD) is a rather substantial collection of data (over 1,500 sessions) from a number of projects and research programmes that were directed at investigating multilingualism and comprises data originating from Dutch, Sranan, Sarnami, Papiamentu, Moroccan(-Arabic), Berber and Turkish speakers.
15	Deaf adults RU database	Recordings of the writing process to investigate the acquisition of Dutch by deaf Dutch adults (late L1/early L2) and comparison to hearing Turkish and Moroccan-Arabic L2-learners of Dutch (late L2) on morphosyntactic aspects. Participants: 46 deaf Dutch adults, 38 hearing Turkish adults, 24 hearing Moroccan adults, and 10 Dutch controls.
19	Degand 2.2	The Degand 2.2 subcorpus was compiled for a study of the causal connectives 'aangezien', 'want' and 'omdat' in Dutch news. It consists of 143 cases from a Dutch newspaper (NRC Handelsblad from 1994). The DiscAn corpus is a collection of subcorpora of Dutch language that have been annotated at the level of discourse. These subcorpora form a set of Dutch corpus analyses of coherence relations and discourse connectives that have been compiled and annotated by researchers at several universities in The Netherlands and Belgium. In the DiscAn project, funded by CLARIN-NL, this set of corpus analyses has been standardized (both in terms of raw data – the texts – and analyses) and opened up for further scientific research.
39	DeHouwer Corpus	This corpus of Dutch child language and child-directed speech was collected in Antwerp, Belgium. The corpus consists of 15 recordings transcribed orthographically and phonetically. Some transcripts also contain variety codes, speaker codes, addressee codes and utterance numbers (see further below). Participants are four children between the ages of ca. 4;9 and 5;0 (two boys Dieter and Michiel, and two girls Kim and Katrien) and their families, with some other persons on occasion present as well. The families are lower-middle to middle-middle class. All children are addressed in some form of Dutch common around the city of Antwerp and go to school fulltime (second year of nursery school). They are being raised monolingually. The interactions are mostly free and spontaneous, but include some structured interactions as well, in which the mother or father had a conversation with the 4-year-old about the past day at school, or prompted the child to describe a picture and tell a picture book story.
234	Deliver	This is a collection of transcriptions of interviews in the medical domain. All interviews are in Dutch. These interviews are between medical specialists and patients, and are collected and hosted at Nivel, Utrecht (https://www.nivel.nl/en). The interviews were transcribed in the context of the Homed project (https://homed.ruhosting.nl/) and intended for finetuning automatic speech recognition for Dutch in the medical domain. This collection belongs to part: Verloskunde
242	Dialoog	This is a collection of transcriptions of interviews in the medical domain. All interviews are in Dutch. These interviews are between medical specialists and patients, and are collected and hosted at Nivel, Utrecht (https://www.nivel.nl/en). The interviews were transcribed in the context of the Homed project (https://homed.ruhosting.nl/) and intended for finetuning automatic speech recognition for Dutch in the medical domain. This collection belongs to part: Nefrologie
10	DIDDD	The data in DiDDD (Diversity in Dutch DP Design; http://www.meertens.knaw.nl/diddd/) were collected between 2005 and 2009 with oral and written interviews in about 200 locations in the Dutch language area, with a methodology highly parallel to DynaSAND. The data involve translations of and judgements on test sentences. The DIDDD data cover the morphosyntactic variation within nominal groups, in particular possessives, partitives, noun ellipsis, the demonstrative system, the numeral modification system, what-for constructions, quantitative er, adjectival inflection, negation and exclamatives.
189	Diverse Sign Language Dataset	Dataset on several Sign Languages in the form of 60 GB of audiovisual material, collected in Ghana and Ivory Coast. Transcription included, type unknown.
152	Dogon Sign Language Dataset	Dataset on Dogon Sign Language in the form of 89 GB of audiovisual material, collected in the Dogon area, Mali.
23	DuELME	DUELME LMF is an electronic lexicon that contains more than 5,000 Dutch multiword expressions (MWEs). MWEs with the same syntactic pattern are grouped in the same equivalence class. The DUELME LMF lexicon is suitable for theoretical research on multiword expressions as well as for use in NLP systems.
104	Dutch Dataset	Collection of 638 Dutch quotes, collected from online corpora. Short citations stemming from national and international sources.
207	Dutch Dataset 10	Dataset on Dutch in the form of an EEG study on semantics and syntax, collected in Leiden, The Netherlands.
107	Dutch Dataset 2	Grammaticality judgements on Dutch fragments, questions and ellipsis.
108	Dutch Dataset 3	Dataset on Dutch in the form of 1000+ letters, collected in various regions of The Netherlands.
138	Dutch Dataset 4	Dataset on Dutch dialects in the form of 4 GB of texts stemming from corpora.
163	Dutch Dataset 5	Dataset on Dutch in the form of an EEG study considering code-switching, long distance dependencies, gender mismatch and negative polarity items.
168	Dutch Dataset 6	Dataset on Dutch in the form of SPSS analyses of a free writing task, collected in Zeeland and Noord-Holland, The Netherlands.
186	Dutch Dataset 7	Dataset on Dutch in the form of 1000+ interviews, SPSS analyses and PRAAT text grids, stemming from Corpus Gesproken Nederlands.
188	Dutch Dataset 8	Dataset on Dutch in the form of 2 hours of audio material, collected in Leiden, The Netherlands.
222	Dutch Dataset 9	Dataset on Dutch in the form of SPSS analyses etc. of orthographic transcription, repeating tasks and rating accent tasks, collected in Leiden, The Netherlands.
155	Dutch, English Dataset	Dataset on Dutch and English in the form of EEG study and reading tasks, collected in Leiden, The Netherlands.
157	Dutch, English Dataset 2	Dataset on Dutch and English in the form of 4 GB of audio material, collected in The Netherlands. Transcription included, type unknown.
140	Dutch, English, German Dataset	Dataset on Dutch, English and German in the form of 78 KB of text stemming from corpora from The Netherlands, UK and Germany.
145	Dutch, French Dataset	Dataset on Dutch and French in the form of 1000+ letters, collected in London, UK.
9	DynaSand	The dynamic syntactic atlas of the Dutch dialects. The data in DynaSAND, the dynamic syntactic atlas of the Dutch dialects (http://www.meertens.knaw.nl/sand/), were collected between 2000 and 2005 by oral interviews (fieldwork and telephone) in about 300 locations across The Netherlands, Belgium and a small part of north-west France. Dialect speakers were asked to judge and/or translate some 150 test sentences. DynaSAND makes available the full recordings and transcriptions of these interviews. Together, the DynaSAND data cover the syntactic variation in the Dutch language area in the left periphery of the clause (the complementizer system and complementizer agreement), variation in subject pronoun form depending on syntactic position, subject pronoun doubling, cliticization on YES/NO, the reflexive system, fronting constructions (Wh-clauses, relative clauses, topicalization), word order and morphological variation in verb clusters, negation and quantification.
143	Early Modern Dutch, French Dataset	Dataset on Early Modern Dutch and French in the form of scans and newspapers. Ongoing project.
220	East Rote languages Dataset	Dataset on East Rote languages in the form of 4 word lists, collected in East Rote Island, Indonesia. Transcription included, type unknown.
170	Ecuador Spanish Dataset	Dataset on Ecuador Spanish in the form of 110 hours of audio material, collected in Imbabura, Ecuador. Transcription included, type unknown.
106	English Dataset	Grammaticality judgements on English fragments, questions and ellipsis.
147	English Dataset 2	Dataset on English in the form of letters, in total 300,000 words.
148	English Dataset 3	Dataset on 160 letters of Jane Austen in English.
149	English Dataset 4	Dataset on English from 1770 until 2010 in the form of 77 usage guides, collected in the UK and USA.
34	ESF	The European Science Foundation Second Language Acquisition by Adult Immigrants collected spontaneous second language acquisition data of forty adult immigrant workers living in Western Europe. The program ran over 5 ½ years and was preceded by a one-year pilot study. It had been planned as a longitudinal comparative study in five European countries: France, (Federal Republic of) Germany, Great Britain, The Netherlands, and Sweden. Financial support came from the Max-Planck-Gesellschaft (Germany) and the research councils of France (CNRS), The Netherlands (NWO), Switzerland (FNS), and Norway (NAVS).
28	ETCBC4	The ETCBC database of the Hebrew Bible (formerly known as WIVU database), contains the scholarly text of the Hebrew Bible with linguistic markup.
205	Ewe Dataset	Dataset on Ewe in the form of 50 hours of audio material, collected in Anfoega, Ho, Tegbi and Keta, Ghana. Transcription included, type unknown.
61	FAME Radio Broadcast Corpus	A large broadcast database is created by collecting recordings from the archives of the regional broadcaster Omrop Fryslân, and annotating them with various information such as the language switches and speaker details. The collection comprises over 3000 hours and the transcription and speaker annotation have been performed automatically by the speech and speaker recognition technology developed in the NWO FAME! project. Metadata provided on the paper labels of the original audio tapes were digitized by Fryske Hannen under supervision of Omrop Fryslân and Tresoar. The stereo audio data has a sampling frequency of 48 kHz and 16-bit resolution per sample. Transcriptions with time alignments are provided as CTM files. Speaker information is provided in RTTM files.
60	FAME Speech Corpus	The components of the Frisian data collection are speech and language resources gathered for building a large vocabulary ASR system for the Frisian language. Firstly, a new broadcast database is created by collecting recordings from the archives of the regional broadcaster Omrop Fryslân, and annotating them with various information such as the language switches and speaker details. The second component of this collection is a language model created on a text corpus with diverse vocabulary. Thirdly, a Frisian phonetic dictionary with the mappings between the Frisian words and phones is built to make the ASR viable for this under-resourced language. Finally, an ASR recipe is provided which uses all previous resources to perform recognition and present the recognition performances. The Corpus consists of short utterances extracted from 203 audio segments of approximately 5 minutes long which are parts of various radio programs covering a time span of almost 50 years (1966-2015), adding a longitudinal dimension to the database. The content of the recordings are very diverse including radio programs about culture, history, literature, sports, nature, agriculture, politics, society and languages. The total duration of the manually annotated radio broadcasts sums up to 18 hours, 33 minutes and 57 seconds. The stereo audio data has a sampling frequency of 48 kHz and 16-bit resolution per sample. The available meta-information helped the annotators to identify these speakers and mark them either using their names or the same label (if the name is not known). There are 309 identified speakers in the FAME! Speech Corpus, 21 of whom appear at least 3 times in the database. These speakers are mostly program presenters and celebrities appearing multiple times in different recordings over years. There are 233 unidentified speakers due to lack of meta-information. The total number of word- and sentence-level code-switching cases in the FAME! 
Speech Corpus is equal to 3837. Music portions have been removed, except where these overlap with speech. Later, the components for speaker clustering and verification experiments are added by adding around 80 hours of raw speech data and reorganizing the manually annotated data respectively. Music portions of the raw data have been automatically removed. Moreover, we applied a publicly available speaker diarization system to the raw speech data and included the output in the corpus. Further details about the speaker clustering and verification database are available in the last reference below. A full description of the FAME! Speech Corpus is provided in: Yilmaz, E., Heuvel, H. van den, Van de Velde, H., Kampstra, F., Algra, J., Leeuwen, D. van (2016): Open Source Speech and Language Resources for Frisian Language. In: Proceedings Interspeech 2016, San Francisco, CA, USA, Sept. 2016. For the details of the ASR corpus, we refer the reader to: Yılmaz, E., Andringa, M., Kingma, S., Dijkstra, J., Kuip, van der F., Van de Velde, H., Kampstra, F., Algra, J., Heuvel, H. van den, Leeuwen, D. Van (2016): A Longitudinal Bilingual Frisian-Dutch Radio Broadcast Database Designed for Code-switching Research. In Proceedings LREC, pp. 4666-4669, Portorož, Slovenia, May 2016. The details of the speaker clustering and verification corpus are provided in: Yılmaz, E., Dijkstra, J., Kuip, van der F., Van de Velde, H., Kampstra, F., Algra, J., Heuvel, H. van den, Leeuwen, D. Van (2017): Longitudinal Speaker Clustering and Verification Corpus with Code-switching Frisian-Dutch Speech. In Proceedings Interspeech, pp. 37-41, Stockholm, Sweden, August 2017.
164	Figuig Berber Dataset	Dataset on Figuig Berber in the form of 5 hours of digitalized cassettes, collected in Figuig, Morocco. Transcription included, type unknown.
119	French Dataset	Dataset on French in the form of 13 GB of audio material, collected in Nantes, France. Contains PRAAT text grids.
180	French Dataset 2	Dataset on French in the form of an EEG study on code-switching, collected in Lyon, France.
195	French Dataset 3	Straattaal (slang) corpus on French in the form of 91 KB of audio material, collected in Bretagne, France. Transcription included, type unknown
133	Funai Helong Dataset	Dataset on Funai Helong in the form of 2 word lists, collected in Oeletsala, Indonesia. Transcription included, type unknown
137	German, Dutch Dataset	Dataset on German and Dutch in the form of 3 GB of texts stemming from corpora from The Netherlands and Belgium.
40	Gillis Corpus	This is a longitudinal corpus of a boy learning Dutch. The corpus was donated to CHILDES by Steven Gillis, Department of Germanic Linguistics, University of Antwerp, Belgium. The data are in CHAT format without English glosses. The child, Maarten, was a Flemish boy learning Dutch. Biweekly video recordings were made at the child’s home between the ages of 0;11.15 and 1;11.28. Recordings began when the child’s vocalizations exhibited what Dore, Franklin, Miller, and Ramer (1976) called phonetically consistent forms. They lasted until the child’s MLU exceeded 1.5 for three consecutive sessions. The entire corpus consists of 29,324 intelligible child utterances. The child was recorded for an average of 3 hours a week for a total of 104 hours of recording (average: 1:18 hours per recording, with a range of 0:15:18 hours to 3:44:52 hours). The sessions included interactions between the child and an adult (usually his mother) as well as solitary play. All recordings were made in an unstructured regular home setting.
238	Goed Begrepen	This is a collection of transcriptions of interviews in the medical domain. All interviews are in Dutch. These interviews are between medical specialists and patients, and are collected and hosted at Nivel, Utrecht (https://www.nivel.nl/en). The interviews were transcribed in the context of the Homed project (https://homed.ruhosting.nl/) and intended for finetuning automatic speech recognition for Dutch in the medical domain. This collection belongs to part: Palliatieve zorg
41	Groningen Corpus	This corpus contains longitudinal data from seven Dutch children (six boys and one girl) between 1;05 and 3;07. The data (208 audio recordings totaling more than 170 hours) have been gathered in a research project supported by the Dutch Organisation for Scientific Research (NWO) grants
11	GTRP	The data in GTRP (Goeman, Taeldeman, van Reenen Project; http://www.meertens.knaw.nl/mand/database/) were collected between 1979 and 2000 through oral interviews in about 600 locations in the Dutch language area. Informants were asked to translate words or short sentences. Some of the transcriptions have been aligned with the sound recordings. The morphological data in GTRP include plural forms of nouns, diminutives, gender on nouns and adjectives, comparatives, superlatives, verbal inflection including participles, and subject, object and possessive pronouns.
239	Gyneacologen	This is a collection of transcriptions of interviews in the medical domain. All interviews are in Dutch. These interviews are between medical specialists and patients, and are collected and hosted at Nivel, Utrecht (https://www.nivel.nl/en). The interviews were transcribed in the context of the Homed project (https://homed.ruhosting.nl/) and intended for finetuning automatic speech recognition for Dutch in the medical domain. This collection belongs to part: Gynaecologie
115	Hadza Dataset	Dataset on Hadza in the form of digitalized cassettes, collected in Tanzania.
191	Hamar Dataset	Dataset on Hamar in the form of 5 hours of audio material, collected in Dimeka, Ethiopia. Transcription included, type unknown
190	Hausa Dataset	Grammaticality judgements on pluractionals in Hausa, collected in Sokoto, Nigeria.
91	Hieroglyphic Luwian Dataset	Dataset on Hieroglyphic Luwian, stemming from the following dictionary: Hawkins, J.D. (2000) Corpus of Hieroglyphic Luwian Inscriptions. Contains cognate forms.
97	Hieroglyphic Luwian Dataset 2	Dataset on Hieroglyphic Luwian in the form of an AWK script which reads Hawkins (2000): Corpus of Hieroglyphic Luwian Inscriptions.
240	Huisartsgeneeskunde	This is a collection of transcriptions of interviews in the medical domain. All interviews are in Dutch. These interviews are between medical specialists and patients, and are collected and hosted at Nivel, Utrecht (https://www.nivel.nl/en). The interviews were transcribed in the context of the Homed project (https://homed.ruhosting.nl/) and intended for finetuning automatic speech recognition for Dutch in the medical domain. This collection belongs to part: Huisartsgeneeskunde
31	IFA speech	The IFA Spoken Language corpus is a free (GPL) database of hand-segmented Dutch speech. It was constructed with off-the-shelf software using speech from 8 speakers in a variety of speaking styles. For a total of 50,000 words (41 minutes/speaker), speech acquisition and preparation took around 3 person-weeks per speaker. Hand segmentation took 1,000 hours of labeling altogether. The asymptotic segmentation speed was about one word, or four boundaries, per minute.
30	IFAVID	The IFA Dialog Video corpus is a collection of annotated video recordings of friendly Face-to-Face dialogs. It is modelled on the Face-to-Face dialogs in the Spoken Dutch Corpus (CGN). The procedures and design of the corpus were adapted to make this corpus useful for other researchers of Dutch speech. For this corpus, 20 dialog conversations of 15 minutes each were recorded and annotated, in total 5 hours of speech. To stay close to the very useful Face-to-Face dialogs in the CGN, pairs of well acquainted participants were selected, either good friends, relatives, or long-time colleagues. The participants were allowed to talk about any topic they wanted.
94	Indo-European Etymologies	Etymologies of Indo-European words. Mostly Indo-European, but also occasionally Semitic, Berber, etc.
101	Inland Terengganu Malay Dataset	Dataset on Inland Terengganu Malay in the form of 18 GB of audio material, collected in Terengganu, Malaysia. Considers a local variety. Transcription included, type unknown
24	INTER-VIEWS	The Netherlands Veterans Institute (VI) hosts about 250 interviews (audio) in which Dutch former military personnel speak about their experiences during World War II (interviews about the years 1935-1945) and decolonisation in the Dutch East Indies (1945-1950) and Dutch New Guinea (1960-1962). In the project Living Oral History Workbench these interviews have been indexed by automatic speech recognition techniques. The list of interviews and their metadata are available at the CLARIN Center; researchers may apply to VI for access to the data.
237	Internisten Zwolle	This is a collection of transcriptions of interviews in the medical domain. All interviews are in Dutch. These interviews are between medical specialists and patients, and are collected and hosted at Nivel, Utrecht (https://www.nivel.nl/en). The interviews were transcribed in the context of the Homed project (https://homed.ruhosting.nl/) and intended for finetuning automatic speech recognition for Dutch in the medical domain. This collection belongs to part: Diabetologie
25	IPNV	The IPNV Corpus is a corpus originally compiled by the Veteraneninstituut (VI). It comprises a collection of more than 1,100 (recorded) interviews with veterans who were involved in wars and other military actions that the Dutch military forces took part in. The average duration of an interview is 2.5 hours. Most interviews are with veterans of World War II, the decolonization wars with Indonesia and New Guinea, the UN action in Korea, the UN observer mission in Lebanon, UN missions in Cambodia and former Yugoslavia, and the NATO missions in Iraq and Afghanistan.
64	Iraqw dataset	Dataset on Iraqw in the form of 3 digitalized cassettes, collected in Tanzania. It includes Iraqw Verbal Art.
73	Iraqw Dataset	Dataset on Iraqw in the form of 3 digitalized cassettes, collected in Tanzania. It includes Iraqw Verbal Art.
89	Iraqw Dataset 2	Dataset on Iraqw in the form of notebooks, collected in Tanzania. Transcription included, type unknown
112	Iraqw Dataset 3	Dataset on Iraqw in the form of 1 digitalized cassette, collected in Tanzania.
192	Iraqw, Swahili Dataset	Dataset on Iraqw and Swahili in the form of 2 GB of audio material, collected in Dar Es Salaam, Tanzania.
193	Karo Dataset	Dataset on Kara/Karo in the form of audio material, collected in Dimeka, Ethiopia.
100	Kelantan Malay Dataset	Dataset on Kelantan Malay in the form of 16 GB of audio material, collected in Kelantan, Malaysia. Transcription included, type unknown
135	Kemak, Welaun Dataset	Dataset on Kemak and Welaun in the form of 9 word lists, collected in West Timor, Indonesia. Transcription included, type unknown
172	Kikuyu Dataset	Dataset on Kikuyu in the form of audio material, collected in Kenya.
241	Kinderartsen	This is a collection of transcriptions of interviews in the medical domain. All interviews are in Dutch. These interviews are between medical specialists and patients, and are collected and hosted at Nivel, Utrecht (https://www.nivel.nl/en). The interviews were transcribed in the context of the Homed project (https://homed.ruhosting.nl/) and intended for finetuning automatic speech recognition for Dutch in the medical domain. This collection belongs to part: Kindergeneeskunde
77	Konso Dataset	Dataset on Konso in the form of 9 digitalized cassettes, collected in Ethiopia.
116	Konso Dataset 2	Dataset on Konso in the form of 30 VHS tapes, collected in Ethiopia.
124	Konso Dataset 3	Dataset on Konso in the form of digitalized cassettes, collected in Ethiopia.
131	Kopas Dataset	Dataset on Kopas in the form of 9 word lists and 9 stories, collected in West Timor, Indonesia. Transcription included, type unknown
128	Kotos Amarasi Dataset	Dataset on Kotos Amarasi in the form of 15 hours of audio material, collected in Nekmese, Indonesia. Transcription included, type unknown
134	Kusa Manea Dataset	Dataset on Kusa Manea in the form of 4 hours of audio material and 2 word lists, collected in As Manuela, Indonesia. Transcription included, type unknown
26	LAISEANG	The geographical region of insular South East Asia and New Guinea is well-known as an area of mega-biodiversity. Less well-known is the extreme linguistic diversity in this area: over a quarter of the world’s 6000 languages are spoken here. As small minority languages, most of these will cease to be spoken in the coming few generations. The LAISEANG corpus ensures the preservation of unique records of languages and the cultures encapsulated by them in the region. The language resources have been gathered by twenty linguists at, or in collaboration with, Dutch universities over the last 40 years, and are compiled and archived in collaboration with The Language Archive (TLA) in Nijmegen.
98	Latin Dataset	Dataset on Latin in the form of 10772 entries, stemming from dictionaries.
209	Leiden Learner Corpus	Learner corpus on Romance languages acquired by Dutch native speakers.
5	LESLLA	The LESLLA corpus dates from 2003-2005 and contains speech of 15 low-educated learners of Dutch as a second language. The learners are all women. Eight learners have a Turkish background, seven learners have a Moroccan background. From these learners, data were collected over time in three cycles, with an interval of 5 months. In each cycle the participants took part in three types of tasks: (1) production tasks, (2) perception tasks, and (3) a perception task with a metalinguistic component. The production tasks included open elicitation (14,000 utterances), closed completion (4,000 utterances) and imitation (6,000 utterances). Apart from the audio files, the data consist of orthographic transcriptions. For all 15 learners there are also metadata available.
29	LIEDNL	The Dutch Song Database (Nederlandse Liederenbank) contains some 170,000 Dutch songs (as of mid-2014). In principle these are Dutch-language songs, from both the Netherlands and Flanders. The database spans some 900 years, from the Middle Ages to the twenty-first century. The types of songs and the degree of coverage vary by period, collection and repertoire. In percentage terms, coverage is highest for the Middle Ages and decreases gradually towards the present. From the twentieth century, the database mainly contains songs from folk-song collections and field recordings, which still amounts to many thousands of songs.
182	LiLi Wu Dataset	Dataset on LiLi Wu in the form of 300 hours of audio material, collected in LiLi and Suzhou, China. Considers tone. Transcription included, type unknown
114	Lingala Youth Language Dataset	Dataset on Lingala Youth Language in the form of digitalized cassettes, collected in Congo.
208	Luganda Dataset	Dataset on Luganda in the form of 24 GB of audio material, collected in Kampala, Uganda.
151	Malian Sign Language Dataset	Dataset on Malian Sign Language in the form of audiovisual material, collected in Bamako, Mali.
161	Mandarin Dataset	Dataset on Mandarin Chinese in the form of an EEG study on in-situ questions and prosody, collected in Xuzhou, China.
146	Mapudungun Dataset	Dataset on Mapudungun in the form of surveys, collected in Chile.
120	Mawayana Dataset	Dataset on Mawayana and Wayana in the form of 14 hours of audio material, collected in Kwamalasamutu and Apetina, Surinam. Considers oral tradition.
219	Mbugu Dataset	Dataset on Mbugu in the form of 5 digitalized cassettes, collected in Tanzania.
83	Mbugu Dataset 2	Dataset on Mbugu in the form of 4 digitalized cassettes, collected in Tanzania. Considers topological relations.
113	Mbugu Dataset 3	Dataset on Mbugu in the form of 4 digitalized cassettes, collected in Tanzania. Contains music and singing.
178	Mbugu Dataset 4	Dataset on Mbugu in the form of audio material, collected in Tanzania.
87	Minkana Dataset	Dataset on Minkana in the form of digitalized cassettes, collected in Cameroon.
154	Multilingual Dataset	33 interviews on language attitude, considering the following languages: Russian, Croatian, Berber, British English, Haags, Romanian, Mexican Spanish, Icelandic, Sarnámi, Papiamento, French, Dutch, Ngiemboon, Malay, German, Italian, Chechen, Greek, Polish, Kurdish, Chinese, Georgian, Hungarian, Armenian, Catalan, Turkish, Moluccan Malay, Farsi, Finnish, Biaks, Fries, collected in The Hague, The Netherlands.
184	Multilingual Dataset 2	Dataset on the following languages: Malay, Malaysia Chinese, Mandarin, Cantonese, English, Telugu, Hakka, Hokkien, Teochew, Fuchou, Tamil, Hindi, Malayalam, Punjabi, Arabic in the form of 12 hours of audio material, collected in Malaysia.
202	Multilingual Dataset 3	Dataset on Maale, Oyda, Wolaytta and Zergulla (Omotic languages) in the form of 220 hours of audio material, collected in Shafite, Koibe, Imalle and Soddo, Ethiopia. Transcription included, type unknown
27	NEHOL	NEHOL is a digitally accessible and searchable database with the Dutch-lexifier Creole language Negerhollands, in the same format as the parallel SUCA (SUriname Creole Archive) corpus, coordinated by Margot van den Berg (Radboud University Nijmegen). The NEHOL project was coordinated by Pieter Muysken (Radboud University Nijmegen), and technically supported by the TLA (‘The Language Archive’) unit at the MPI for Psycholinguistics in Nijmegen.
33	NGT	The Corpus NGT is an open access online corpus of movies with annotations of Sign Language of the Netherlands (abbreviated as SLN or NGT).
2	NPCMC	The Nijmegen Parsed Corpus of Modern Chechen contains a number of manually corrected syntactically annotated texts. Some of the texts originally come from a corpus of Chechen texts created by Ron Zacharski & Jim Cowie at the New Mexico State University. These texts were originally available at http://guidetodatamining.com/appendices/corpora/. They are not available there anymore, but see the paper talking about this corpus: http://mt-archive.info/AMTA-2006-Abdelali.pdf.
53	NRC2011	The corpus contains 2225 newspaper texts taken from printed and digital versions of the NRC newspaper (edition 2011). The texts cover blogs, hard news, background articles, and opinion articles on related topics. Metadata per text are available in CMDI XML files. The 'NRC2011' corpus has been created for the CLARIAH-sponsored ACAD project. See https://www.clariah.nl/projecten/research-pilots/acad/acad and https://cesar.science.ru.nl/. Cooperators: Micha Hulsbosch (Radboud University Nijmegen, Faculty of Arts, Humanities Lab, TSG); Wilbert Spooren (Radboud University Nijmegen, Faculty of Arts, Dutch language); Erwin R. Komen (Radboud University Nijmegen, Faculty of Arts, Humanities Lab, TSG). FLAT can be used to open and view the FoLiA files. See https://flat.science.ru.nl/. The file textlist-folia.json contains an overview of all available texts in JSON format. The file NRCLicentieovereenkomst.pdf contains the License Agreement with NRC.
174	Nyoro Dataset	Dataset on Nyoro in the form of audio material, collected in Cameroon.
99	Old Khotanese Dataset	Dataset on Old Khotanese in the form of an AWK script which reads Hawkins (2000) Corpus of Hieroglyphic Luwian Inscriptions, stemming from dictionaries and grammars.
20	PanderMaatSanders	The PanderMaatSanders subcorpus was compiled for a study of the causal connectives 'daardoor', 'daarom' and 'dus' in Dutch news. It consists of 100 cases from a Dutch newspaper (de Volkskrant 1994 and 1995). The DiscAn corpus is a collection of subcorpora of Dutch language that have been annotated at the level of discourse. These subcorpora form a set of Dutch corpus analyses of coherence relations and discourse connectives that have been compiled and annotated by researchers at several universities in The Netherlands and Belgium. In the DiscAn project, funded by CLARIN-NL, this set of corpus analyses has been standardized (both in terms of raw data – the texts – and analyses) and opened up for further scientific research.
162	Papiamento, Dutch Dataset	Dataset on Papiamento and Dutch in the form of visual and written tests on code-switching.
111	Papuan Dataset	Dataset on Papuan and Austronesian languages in the form of 1000 GB of audio material and texts, collected in Indonesia. Transcription included, type unknown
235	PatientVOICE	This is a collection of transcriptions of interviews in the medical domain. All interviews are in Dutch. These interviews are between medical specialists and patients, and are collected and hosted at Nivel, Utrecht (https://www.nivel.nl/en). The interviews were transcribed in the context of the Homed project (https://homed.ruhosting.nl/) and intended for finetuning automatic speech recognition for Dutch in the medical domain. This collection belongs to part: Oncologie
21	Pit	The Pit narrdow and totadow subcorpora were compiled for a study of the causal connectives "aangezien", "doordat", "omdat" and "want" in Dutch, German and French narratives and news. The narrdow subcorpus consists of cases from 22 Dutch novels published between 1990 and 1996. The totadow subcorpus consists of cases from a Dutch newspaper (de Volkskrant) from 1995. The DiscAn corpus is a collection of subcorpora of Dutch language that have been annotated at the level of discourse. These subcorpora form a set of Dutch corpus analyses of coherence relations and discourse connectives that have been compiled and annotated by researchers at several universities in The Netherlands and Belgium. In the DiscAn project, funded by CLARIN-NL, this set of corpus analyses has been standardized (both in terms of raw data – the texts – and analyses) and opened up for further scientific research.
158	Portuguese, English Dataset	Dataset on Portuguese and English in the form of 117 surveys. Ongoing project.
93	Proto-Germanic Dictionary	Dictionary of Proto-Germanic (reconstructed) forms.
199	Proto-Quechua, Proto-Aymara Dataset	Dataset on Proto-Quechua and Proto-Aymara in the form of fragments from dictionaries, collected in The Netherlands. Transcription included, type unknown
198	Puquina, Quechua Dataset	Dataset on the following languages: Puquina, Quechua, Aymara and Spanish, in the form of one book, collected in Peru. Transcription included, type unknown
125	Purepecha Dataset	Dataset on Purepecha in the form of 13 hours of audio material, collected in Michoacán, Mexico.
127	Purepecha Dataset 2	Dataset on Purepecha in the form of 14 hours of audio material and texts, collected in Michoacán, Mexico. Elicitation test children in Michoacán have to do for school.
196	Quechua Dataset	Dataset on Southern Peruvian: Matsigenka in the form of 250 hours of audio material, collected in Peru. Transcription included, type unknown
129	Ro'is Amarasi Dataset	Dataset on Ro'is Amarasi in the form of 5 hours of audio material, collected in Burain and Tumbaun, Indonesia. Transcription included, type unknown
156	Rukiga Dataset	Dataset on Rukiga in the form of 18 GB of audio material, collected in Kabale, Uganda. Transcription included, type unknown
201	Russian Dataset	Dataset on Russian in the form of analyses from corpus data, collected in Russia.
22	SandersSpooren	The SandersSpooren subcorpus was compiled for a study of the causal connectives 'want' and 'omdat' in several types of discourse. It consists of cases from news, spontaneous conversation and chat data. It comprises 553 cases in total of omdat and want: 100 cases of omdat and 102 cases of want from newspapers (D-Coi); 100 cases of omdat and 100 cases of want from conversations (CGN); and 51 cases of omdat and 100 cases of want from chat interaction (VU-Chat-corpusI). The DiscAn corpus is a collection of subcorpora of Dutch language that have been annotated at the level of discourse. These subcorpora form a set of Dutch corpus analyses of coherence relations and discourse connectives that have been compiled and annotated by researchers at several universities in The Netherlands and Belgium. In the DiscAn project, funded by CLARIN-NL, this set of corpus analyses has been standardized (both in terms of raw data – the texts – and analyses) and opened up for further scientific research.
42	Schaerlaekens Corpus	The original database consists of the spontaneous language of two triplets (in total six children) between the ages of 1;10.18 and 3;1.7 for the first set and 1;6.17 and 2;10.23 for the second set.
204	Sekpele Dataset	Dataset on Sekpele in the form of 30 hours of audio material, collected in Likpele, Ghana. Transcription included, type unknown
79	Seme Dataset	Dataset on Seme in the form of 3 digitalized cassettes, collected in Burkina Faso.
88	Seme Dataset 1	Dataset on Seme in the form of notebooks, collected in Burkina Faso. Transcription included, type unknown
183	Serbian, English Dataset	Dataset on English with code-switching to Serbian in the form of 10 hours of audio material, collected in Belgrade, Novi Sad and Nis, Serbia. Data collected online through Facebook and Twitter. Transcription included, type unknown
75	Sereer Dataset	Dataset on Sereer in the form of digitalized cassettes, collected in Senegal.
181	ShuangFeng Xian Dataset	Dataset on ShuangFeng Xiang in the form of 12 hours of audio material, collected in Loudi, China. Considers tone. Transcription included, type unknown
175	Siona, Spanish Dataset	Dataset on Siona and Spanish in the form of 47 hours of audio material, collected in Ecuador. Transcription included, type unknown
12	SLI RU-Kentalis	Corpus for investigation of the expression of spatial relations by children with SLI and normally developing children in their spoken language production.
117	Somali Dataset	Dataset on Somali in the form of digitalized cassettes, collected in Somalia.
3	SoNaR	SoNaR is a 500-million-word reference corpus of contemporary written Dutch for use in different types of linguistic (incl. lexicographic) and HLT research and the development of applications. The STEVIN funded SoNaR project (2008-2011) built on the results obtained in the D-Coi and Corea projects which were awarded funding in the first call of proposals within the STEVIN programme. (SOURCE: CLARIN). The SoNaR Corpus consists of three parts: - SONAR 500: 500 M words automatically tokenized, lemmatized and POS-tagged - SONAR 1: 1 M words with handchecked semantic annotations: named entities, co-reference relations, semantic roles and spatio-temporal relations - SONAR New Media Corpus: Subpart of SoNaR 500 containing the material from new media only (tweets, chats, SMS)
132	South Amanuban	Dataset on South Amanuban in the form of word lists, collected in Nekamese, Indonesia. Transcription included, type unknown
126	Spanish Dataset	Dataset on Spanish in the form of half an hour of audio material, collected in Michoacán, Mexico.
169	Spanish, Dutch Dataset	Dataset on aspect and imperfect in Spanish and Dutch in the form of 1192 tokens. Transcription included, type unknown
166	Spanish, English Dataset	Dataset on Spanish and English in the form of 30 hours of audio material, collected in Miami, USA. Transcription included, type unknown
167	Spanish, English Dataset 2	Acceptability judgements on code-switching in Spanish and English, collected in New Mexico, USA.
165	Tarifyt Berber Dataset	Dataset on Tarifyt Berber in the form of 10 hours of digitalized cassettes, collected in Driouch, Morocco. Transcription included, type unknown
187	Tassawaq, Hausa Dataset	Dataset on Tassawaq and Hausa in the form of 2 interviews of audio material, collected in Agadez, Niger. Transcription included, type unknown
51	test_sle20	description
210	The Andreas Thesaurus Linked Data Annotation	The Andreas Thesaurus is a set of Linked Data annotations that reference the IRIs of A Thesaurus of Old English (https://oldenglishthesaurus.arts.gla.ac.uk/). The dataset can be browsed as a textual thesaurus within the web application Evoke and is available via http://evoke.ullet.net/content.
211	The Beowulf Thesaurus Linked Data Annotation	The Beowulf Thesaurus is a set of Linked Data annotations that reference the IRIs of A Thesaurus of Old English (https://oldenglishthesaurus.arts.gla.ac.uk/). The dataset can be browsed as a textual thesaurus within the web application Evoke and is available via http://evoke.ullet.net/content.
212	The Old English Martyrology Thesaurus Linked Data Annotation	The Old English Martyrology Thesaurus is a set of Linked Data annotations that reference the IRIs of A Thesaurus of Old English (https://oldenglishthesaurus.arts.gla.ac.uk/). The dataset can be browsed as a textual thesaurus within the web application Evoke and is available via http://evoke.ullet.net/content.
236	Thuiszorg	This is a collection of transcriptions of interviews in the medical domain. All interviews are in Dutch. These interviews are between medical specialists and patients, and are collected and hosted at Nivel, Utrecht (https://www.nivel.nl/en). The interviews were transcribed in the context of the Homed project (https://homed.ruhosting.nl/) and intended for finetuning automatic speech recognition for Dutch in the medical domain. This collection belongs to part: Thuiszorg
130	Timaus Dataset	Dataset on Timaus in the form of 2 hours of audio material, collected in Sanenu and Oekona, Indonesia. Transcription included, type unknown
80	Toussian Dataset	Dataset on Toussian in the form of 2 digitalized cassettes, collected in Burkina Faso.
86	Tunen Dataset	Dataset on Tunen in the form of 5 digitalized cassettes, collected in Cameroon.
105	Typological Dataset	Typological Database System (TDS) that has 1200 different properties of languages. In total, about 200 different sources were used to create the collection (some collected in fieldwork, some taken from books).
43	Van Kampen Corpus	The van Kampen corpus is based on tapings of two Dutch girls. Laura was studied from the age of 1;9.18 to 5;10.9 and Sarah from 1;6.16 to 6;0. The child’s age at each session is given inside each file. The recordings were made roughly once or twice every month by the mother of the children (Jacqueline van Kampen). The Laura corpus consists of 72 45-minute recordings. The Sarah corpus consists of 50 45-minute recordings.
44	Van Oosten Bilingual Corpus	Picture descriptions by 20 children in the age range of 4-13 years. Half of the children were Italian-Dutch bilinguals and the other 10 were monolinguals (5 Dutch monolinguals and 5 Italian monolinguals). The study is reported in an MA thesis at the University of Utrecht entitled “Lo sviluppo dell’acquizione del soggetto nei bambini bilingui ital-olandesi.” Funding was obtained through a scholarship of the Royal Dutch Institute in Rome. This research seeks to determine whether Müller & Hulk (2001)’s hypothesis also holds for other linguistic phenomena that occur at the interface between syntax and pragmatics, such as subject acquisition. In early Dutch it is possible to drop the subject when it contains old information. Adult Italian has the same option, due to the same pragmatic rule. However, in early Dutch the omission of the subject is constrained by the position of the specifier of the root, while there is no such constraint in the Italian case. In adult Italian there is just one structural analysis when it comes to omitting the subject: when the subject contains old information it has to be omitted, wherever the element may be located inside the phrase structure. If this analysis is correct, we predict that Italian grammar may influence Dutch grammar, because the bilingual child will choose the analysis that is favored by both languages. Italian/Dutch bilinguals should produce more null subjects than their monolingual peers, as bilinguals would overgeneralize the pragmatic rule that is common to both languages. In this study, bilingual children did produce significantly more null subjects in their Dutch corpus than their monolingual Dutch peers did. Müller & Hulk (2001) predict this influence to arise before the C-system is completed. However, the subjects of my research had completed the acquisition of their C-system, as we can see from their use of embedded structures and WH-phrases.
200	Variants of English Dataset	Dataset on Old English, Middle English, Modern English and Present-Day English, stemming from the Penn Parsed corpora of historical English and the British National Corpus.
171	Variants of Spanish Dataset	Dataset on Argentinian, Peruvian and Peninsular Spanish in the form of 88 interviews on dialects, collected in Argentina, Peru and Spain.
213	Voice Onsets in Danish Dialects	Dataset on voice onset time in Danish dialects
16	VU-DNC	VU-DNC is a unique diachronic corpus of Dutch newspaper articles from five major Dutch newspapers from 1950/1951 and 2002 (2 MW). The VU-DNC has been annotated for quotations, which enables the researcher to differentiate between the words directly under responsibility of the journalist.
185	Walikan, Javanese	Dataset on Basa Walikan Malangan, a colloquial language variety of Javanese in the form of 50 hours of audio material, collected in Malang, Indonesia. Transcription included, type unknown
6	WBD	Together with the Dictionary of the Limburgian Dialects (WLD) and the Dictionary of the Flemish Dialects (WVD), the Dictionary of Brabantic Dialects (WBD) covers the entire Southern Dutch-speaking region below the major rivers, using the same type of descriptive dialect lexicography. This area stretches over three countries: the Netherlands, Belgium and France. For WBD, the area under study included Flemish Brabant and Antwerp in Flanders and Brabant in the Netherlands. The collection comprises three parts: I: Agricultural vocabulary; II: Non-agricultural vocabulary; III: General vocabulary. For each part the information is available as PDFs of the books, LMF-versions of the lexicon and text-versions (CSV) of the lexicon. Information per keyword comprises: lemmatisation; dialect entry (more or less phonetic); comments; locations (in Kloeke codes and place names); source information for the dialect entries.
136	West Timor Dataset	1100 reconstructions and cognate lists of West Timor languages stemming from dictionaries.
7	WGD	The Dictionary of the dialects of Gelderland describes specific parts of the general vocabulary. The dictionary has been written around a certain theme. Unlike in other local dictionaries that have been published in our province, the vocabulary has been arranged thematically rather than alphabetically. The dictionary presently comprises two areas: the River Area (Rivierengebied) and the Veluwe. For each part the information is available as PDFs of the books, LMF versions of the lexicon and text versions (CSV) of the lexicon. Information per keyword comprises: lemmatisation; dialect entry (more or less phonetic); comments; locations (place names).
55	Whatsapp corpus Berntzen	WhatsApp conversations collected by master's students of Communication & Information Studies (2013-2014; 2014-2015). All participants in the conversations are over 18 and have signed consent forms. Metadata per conversation are available in CMDI XML files. The 'WhatsAppManon' corpus has been made available for the CLARIAH sponsored ACAD project. See https://www.clariah.nl/projecten/research-pilots/acad/acad and https://cesar.science.ru.nl/. Cooperators: Micha Hulsbosch - Radboud University Nijmegen, Faculty of Arts, Humanities Lab, TSG; Wilbert Spooren - Radboud University Nijmegen, Faculty of Arts, Dutch language; Erwin R. Komen - Radboud University Nijmegen, Faculty of Arts, Humanities Lab, TSG; Patrick Sonsma - Radboud University Nijmegen, Faculty of Arts, Dutch language. Original researcher: Manon Berntzen - Radboud University Nijmegen, Faculty of Arts, Dutch language. The corpus contains 60 WhatsApp chat sessions that have been collected by Manon Berntzen for the course on "New media-new methods" and then for her Bachelor thesis. The exact date of each chat is included in the <event> tag attributes in the .folia.xml files. FLAT can be used to open and view the FoLiA files. See https://flat.science.ru.nl/. The participants have all indicated that their chats can be used (in an anonymized form) for research purposes. The file textlist-folia.json contains an overview of all available texts in JSON format. Note: the files are numbered 001-063 consecutively, but 058-060 (as well as 064) are excluded, because they lack permission.
52	Whatsapp corpus Verheijen	WhatsApp data collected for the PhD research of Lieke Verheijen (Radboud University). Informed consent was obtained only from the contributor and not from the conversational partner. Consequently, the subcorpus only contains contributions from the submitter. Metadata per conversation are available in CMDI XML files. Ref: Verheijen, L., & Stoop, W. (2016, September). Collecting Facebook posts and WhatsApp chats. In International Conference on Text, Speech, and Dialogue (pp. 249-258). Springer, Cham. The corpus has been made available for the CLARIAH sponsored ACAD project. See https://www.clariah.nl/projecten/research-pilots/acad/acad and https://cesar.science.ru.nl/. Cooperators: Micha Hulsbosch - Radboud University Nijmegen, Faculty of Arts, Humanities Lab, TSG; Wilbert Spooren - Radboud University Nijmegen, Faculty of Arts, Dutch language; Erwin R. Komen - Radboud University Nijmegen, Faculty of Arts, Humanities Lab, TSG; Patrick Sonsma - Radboud University Nijmegen, Faculty of Arts, Dutch language. Original researcher: Lieke Verheijen - Radboud University Nijmegen, Faculty of Arts, Dutch language. The corpus contains 218 WhatsApp chat sessions that have been collected by Lieke Verheijen in 2012-2014 in the Netherlands. The exact date of each chat is included in the <event> tag attributes in the .folia.xml files. FLAT can be used to open and view the FoLiA files. See https://flat.science.ru.nl/. The participants have all indicated that their chats can be used (in an anonymized form) for research purposes. The file textlist-folia.json contains an overview of all available texts in JSON format.
45	Wijnen Corpus	The corpus is based on home tapings of one Dutch boy, Niek, between the ages of 2;7 and 3;10. The recordings were made by Niek’s father (Frank Wijnen). The data were mainly used in a project focusing on the relation between language acquisition and developmental disfluency.
8	WLD	The Dictionary of Limburgian Dialects (WLD), together with the Dictionary of the Brabantic Dialects (WBD) and the Dictionary of the Flemish dialects (WVD), covers the entire Southern Dutch-speaking region below the major rivers, using the same type of descriptive dialect lexicography. This area stretches over three countries: the Netherlands, Belgium and France. The area under study for the WLD included both Limburg and the northeast of Liege. The collection comprises three parts: I: Agricultural vocabulary; II: Non-agricultural vocabulary; III: General vocabulary. For each part the information is available as PDFs of the books, LMF versions of the lexicon and text versions (CSV) of the lexicon. Information per keyword comprises: lemmatisation; dialect entry (more or less phonetic); comments; locations (in Kloeke codes and place names); source information for the dialect entries.
177	Wolof Dataset	Dataset on Wolof in the form of audio material, collected in Senegal.
76	Yaaku Dataset	Dataset on Yaaku in the form of 4 digitalized cassettes, collected in Kenya.
221	Yaaku Dataset 2	Dataset on Yaaku in the form of audio material, collected in Kenya.
84	Zigula Dataset	Dataset on Zigula in the form of 2 digitalized cassettes, collected in Tanzania.
46	Zink Corpus	The recordings for this corpus were made in Leuven, Brabant, Belgium (3 children: Meinder, Judith, Laurien) and Antwerp, Belgium (1 child: David). The participants were recorded every two weeks from 8 months to 25 months of age. Each recording session lasted approximately 60 minutes.