Persistent Identifier (auto)

http://hdl.handle.net/21.11114/COLL-0000-000E-BCB3-4

Description (0-1)


CAREGIVER is a multi-lingual speech corpus for modeling language acquisition. It was designed and recorded within the framework of the EU-funded Acquisition of Communication and Recognition Skills (ACORNS) project. The motivation behind the corpus and its design draws on current knowledge of infant language acquisition. Instead of recording infants and children, the voices of their primary and secondary caregivers were captured as read speech, in both infant-directed and adult-directed modes, across four languages. The challenges and methods applied to obtain prompts of similar complexity and semantics across the languages, as well as the normalized recording procedures employed at the different locations, are covered. An orthographic transcription is available for every utterance. Time-aligned word and phone annotations also exist for some of the sub-corpora.

The design of the corpus is described in a paper published at LREC 2010, which is a good source of documentation: Altosaar, T., Bosch, L. ten, Aimetti, G., Koniaris, Chr., Demuynck, K., Heuvel, H. van den (2010): A Speech Corpus for Modeling Language Acquisition: CAREGIVER. Proceedings LREC2010, Malta, pp. 1062-1068. http://www.lrec-conf.org/proceedings/lrec2010/pdf/597_Paper.pdf

The actual corpus deviates from this setup in a few respects. It contains nearly 66,000 utterance-based audio files spoken over a two-year period by 16 male and 14 female native speakers of Dutch, English, and Finnish. Swedish, the fourth language of the original design, is missing. For Dutch, only Y2 recordings are available. Here is an overview:

ACORNS_Y1/UK/: Y1 recordings of four English speakers (2 male, 2 female): SpeakerA, SpeakerB, SpeakerC, SpeakerD. There are 1000 recordings and orthographic transcriptions (in XML) per speaker.

ACORNS_Y1/fin/: Y1 recordings of four Finnish speakers (2 male, 2 female): FIN-M-SF, FIN-M-MT, FIN-F-KA, FIN-F-JL. There are 2000 recordings and orthographic transcriptions (in XML) per speaker.

ACORNS_Y2/UK/: Recordings of 10 speakers. Speaker01-04 are the same speakers as in Y1, with 2397 recordings each. The other six speakers (3 male, 3 female) are test speakers with 600 recordings each.
    Y2-UK-XML: orthographic transcriptions in XML
    Y2-UK-WAV: speech recordings
    old_xml: old version of Y2-UK-XML; may be discarded
    annotation:
        ACORNS-Y2-UK-v2-FA: time stamps at word level, obtained by forced alignment
        Y2-UK-v2-FA-phone: time stamps at phone level, obtained by forced alignment
        list_of_errors: errors in the time stamps at word level

ACORNS_Y2/NL/: Recordings of 10 Dutch speakers. Four speakers were recorded twice: 2 males (henk, peter) and 2 females (els, margot). The other six were test speakers with one recording session each: 4 males (eric, folkert, helmer, vico) and 2 females (daphne, hella). The .cor files contain the orthographic transcriptions with time stamps (sentence level only).

ACORNS_Y2/FIN/: Recordings of 10 speakers. Speaker01-04 are the same speakers as in Y1, with 2397 recordings each. The other six speakers (3 male, 3 female) are test speakers with 600 recordings each.
    Y2-FIN-XML: orthographic transcriptions in XML
    Y2-FIN-WAV: speech recordings
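The directory overview in the description can be checked against a downloaded copy with a short script. The following is a minimal sketch, not part of the corpus distribution: the function name inventory is hypothetical, and it assumes only that the copy preserves the top-level layout named above (for example ACORNS_Y2/UK/). It makes no assumption about file names or extensions; it simply counts whatever file suffixes it finds below each sub-corpus directory.

```python
from collections import Counter
from pathlib import Path


def inventory(corpus_root):
    """Count files per sub-corpus directory, grouped by lowercase file suffix.

    Hypothetical helper: the two-level key (e.g. "ACORNS_Y2/UK") mirrors the
    directory names given in the record's description, nothing more.
    """
    root = Path(corpus_root)
    counts = {}
    for path in root.rglob("*"):
        if not path.is_file():
            continue
        parts = path.relative_to(root).parts
        # Files nested at least two directories deep are grouped under their
        # first two path components; shallower files under their parent.
        key = "/".join(parts[:2]) if len(parts) > 2 else "/".join(parts[:-1])
        counts.setdefault(key, Counter())[path.suffix.lower()] += 1
    return counts
```

Calling inventory on the local corpus root returns a dictionary such as {"ACORNS_Y2/UK": Counter(...)}, whose per-suffix counts can then be compared by hand with the per-speaker figures listed in the description.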

LandingPage (1)

Title(s) (1-n)

A Speech Corpus for Modeling Language Acquisition: CAREGIVER

Owner(s) (0-n)

[1]: Faculty of Arts of the Radboud University, Erasmusplein 1, 6525 HT Nijmegen, the Netherlands

[2]: Aalto Univ. School of Science and Tech., Dept. of Signal Proc. & Acoustics, P.O. Box 3000, FI-02015 TKK, Finland

[3]: Univ. of Sheffield, Speech & Hearing group, Dept. of Computer Science, 211 Portobello Street, Sheffield, S1 4DP, UK

Genre(s) (0-n)

prompted speech, conversation

Domain(s) (0-n)

Building and testing computational models of the speech understanding component of first language acquisition, Analysis of child-directed speech

Language(s) (1-n)

English [eng], Dutch (Northern) [nld], Finnish [fin]

CLARIN centre (0-1)

TLA MPI, Nijmegen, the Netherlands

Version (0-1)

1.0

Size(s) (0-n)

17 GB

Project(s) (0-n)

Acquisition of Communication and Recognition Skills (ACORNS) (Funder: EC, contract no. FP6-034362)

Resource(s) (1-n)

Resource 1

Description (0-1)

Dublin-Core Type (1)

Dataset

subtype (0-1)

Modality (1-n)

Provenance(s) (0-n)

Provenance 1

Temporal (0-1)

2006-2009

Linguality (0-1)

Linguality

Status (0-n)

normal

Variant (0-n)

standard

Accessibility (0-1)

Accessibility

Name (1)

Open

Availability (0-n)

public

License name(s) (0-n)

??MPI

Non-commercial usage (0-1)

ISBN (0-1)

ISLRN (0-1)

???

Contact(s) (0-n)

Henk van den Heuvel: CLST, Radboud University (h.vandenheuvel@let.ru.nl), Louis ten Bosch: CLST, Radboud University (l.tenbosch@let.ru.nl)

Medium(s) (0-n)

internet

Documentation (0-1)

Documentation

Language(s) (1-n)

English [eng]

Type(s) (0-n)

manual

URL(s) (0-n)

http://www.lrec-conf.org/proceedings/lrec2010/

Validation (0-1)

Validation

not specified