Persistent Identifierauto

http://hdl.handle.net/21.11114/COLL-0000-000B-D210-5

Description0-1

The components of the Frisian data collection are speech and language resources gathered for building a large-vocabulary ASR system for the Frisian language. First, a new broadcast database was created by collecting recordings from the archives of the regional broadcaster Omrop Fryslân and annotating them with information such as language switches and speaker details. The second component is a language model built on a text corpus with diverse vocabulary. Third, a Frisian phonetic dictionary mapping Frisian words to phones was built to make ASR viable for this under-resourced language. Finally, an ASR recipe is provided that uses all of the preceding resources to perform recognition and report recognition performance.
The corpus consists of short utterances extracted from 203 audio segments, each approximately 5 minutes long, taken from various radio programs covering a time span of almost 50 years (1966-2015), which adds a longitudinal dimension to the database. The content of the recordings is very diverse and includes radio programs about culture, history, literature, sports, nature, agriculture, politics, society and languages. The total duration of the manually annotated radio broadcasts is 18 hours, 33 minutes and 57 seconds. The stereo audio data has a sampling frequency of 48 kHz and a resolution of 16 bits per sample. The available meta-information helped the annotators to identify the speakers and mark them either by name or, if the name is not known, with a consistent label. There are 309 identified speakers in the FAME! Speech Corpus, 21 of whom appear at least 3 times in the database. These speakers are mostly program presenters and celebrities who appear in different recordings over the years. There are 233 unidentified speakers owing to a lack of meta-information. The total number of word- and sentence-level code-switching cases in the FAME! Speech Corpus is 3837.
Music portions have been removed, except where they overlap with speech. Later, components for speaker clustering and verification experiments were added by including around 80 hours of raw speech data and reorganizing the manually annotated data, respectively. Music portions of the raw data have been removed automatically. Moreover, we applied a publicly available speaker diarization system to the raw speech data and included the output in the corpus. Further details about the speaker clustering and verification database are available in the last reference below.
A full description of the FAME! Speech Corpus is provided in: Yılmaz, E., Heuvel, H. van den, Van de Velde, H., Kampstra, F., Algra, J., Leeuwen, D. van (2016): Open Source Speech and Language Resources for Frisian Language. In: Proceedings Interspeech 2016, San Francisco, CA, USA, Sept. 2016.
For the details of the ASR corpus, we refer the reader to: Yılmaz, E., Andringa, M., Kingma, S., Dijkstra, J., Kuip, F. van der, Van de Velde, H., Kampstra, F., Algra, J., Heuvel, H. van den, Leeuwen, D. van (2016): A Longitudinal Bilingual Frisian-Dutch Radio Broadcast Database Designed for Code-switching Research. In: Proceedings LREC, pp. 4666-4669, Portorož, Slovenia, May 2016.
The details of the speaker clustering and verification corpus are provided in: Yılmaz, E., Dijkstra, J., Kuip, F. van der, Van de Velde, H., Kampstra, F., Algra, J., Heuvel, H. van den, Leeuwen, D. van (2017): Longitudinal Speaker Clustering and Verification Corpus with Code-switching Frisian-Dutch Speech. In: Proceedings Interspeech, pp. 37-41, Stockholm, Sweden, August 2017.
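As a rough check on the audio figures quoted in the description, the sketch below converts the annotated duration to seconds and estimates the raw PCM data rate and volume. It assumes uncompressed PCM and that the full 18 hours, 33 minutes and 57 seconds is stored at the stated format; the size of the distributed corpus may well differ, since music portions were removed and file headers are ignored here. All names in the code are illustrative.

```python
# Figures taken from the description; everything else is an illustrative assumption.
SAMPLE_RATE_HZ = 48_000   # 48 kHz sampling frequency
BYTES_PER_SAMPLE = 2      # 16-bit resolution
CHANNELS = 2              # stereo


def hms_to_seconds(hours, minutes, seconds):
    """Convert an hours/minutes/seconds duration to whole seconds."""
    return hours * 3600 + minutes * 60 + seconds


# Total duration of the manually annotated broadcasts: 18 h 33 min 57 s.
annotated_seconds = hms_to_seconds(18, 33, 57)

# Raw PCM data rate: samples per second x bytes per sample x channels.
bytes_per_second = SAMPLE_RATE_HZ * BYTES_PER_SAMPLE * CHANNELS

# Upper-bound style estimate of the raw audio volume (no headers, no compression).
estimated_bytes = annotated_seconds * bytes_per_second

print(f"annotated duration: {annotated_seconds} s")
print(f"PCM data rate:      {bytes_per_second} bytes/s")
print(f"estimated volume:   {estimated_bytes / 1e9:.2f} GB")
```

Under these assumptions the annotated material corresponds to roughly 12.8 GB of raw audio, which is only an order-of-magnitude guide for storage planning.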

LandingPage1

https://fame.ruhosting.nl/wordpress/?page_id=34

Title(s)1-n

FAME! Speech Corpus

Owner(s)0-n

[1]: Omrop Fryslân,

[2]: Fryske Akademy,

[3]: Radboud University

Genre(s)0-n

radio/TV-broadcast

Language disorder(s)0-n

none

Language(s)1-n

Western Frisian [fry] , Dutch (Northern) [nld]

CLARIN centre0-1

CLST & ELRA

Version0-1

1.0

Size(s)0-n

18 hours

Relation(s)0-n

[FAME Speech Corpus] isSiblingOf [FAME Radio Broadcast Corpus]

Creator(s)0-n

[1]: Frederik Kampstra (Omrop Fryslân),

[2]: Jelske Dijkstra-Hans Van de Velde (Fryske Akademy),

[3]: Emre Yilmaz-Henk van den Heuvel-David van Leeuwen (CLST, Radboud University)

Project(s)0-n

FAME! site (Funder: NWO Creative Industry)

Resource(s)1-n

Resource 1

Description0-1

The components of the Frisian data collection are speech and language resources gathered for building a large-vocabulary ASR system for the Frisian language. First, a new broadcast database was created by collecting recordings from the archives of the regional broadcaster Omrop Fryslân and annotating them with information such as language switches and speaker details. The second component is a language model built on a text corpus with diverse vocabulary. Third, a Frisian phonetic dictionary mapping Frisian words to phones was built to make ASR viable for this under-resourced language. Finally, an ASR recipe is provided that uses all of the preceding resources to perform recognition and report recognition performance.
The corpus consists of short utterances extracted from 203 audio segments, each approximately 5 minutes long, taken from various radio programs covering a time span of almost 50 years (1966-2015), which adds a longitudinal dimension to the database. The content of the recordings is very diverse and includes radio programs about culture, history, literature, sports, nature, agriculture, politics, society and languages. The total duration of the manually annotated radio broadcasts is 18 hours, 33 minutes and 57 seconds. The stereo audio data has a sampling frequency of 48 kHz and a resolution of 16 bits per sample. The available meta-information helped the annotators to identify the speakers and mark them either by name or, if the name is not known, with a consistent label. There are 309 identified speakers in the FAME! Speech Corpus, 21 of whom appear at least 3 times in the database. These speakers are mostly program presenters and celebrities who appear in different recordings over the years. There are 233 unidentified speakers owing to a lack of meta-information. The total number of word- and sentence-level code-switching cases in the FAME! Speech Corpus is 3837.
Music portions have been removed, except where they overlap with speech. The audio segments containing speech total approximately 14 hours. This data is divided into training, development and test sets for ASR experiments. The training data comprises 8.5 hours of speech from Frisian speakers and 3 hours of speech from Dutch speakers. The development and test sets each consist of 1 hour of speech from Frisian speakers and 20 minutes of speech from Dutch speakers.
For the details of the ASR corpus, we refer the reader to: Yılmaz, E., Andringa, M., Kingma, S., Dijkstra, J., Kuip, F. van der, Van de Velde, H., Kampstra, F., Algra, J., Heuvel, H. van den, Leeuwen, D. van (2016): A Longitudinal Bilingual Frisian-Dutch Radio Broadcast Database Designed for Code-switching Research. In: Proceedings LREC, pp. 4666-4669, Portorož, Slovenia, May 2016.
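The split sizes quoted in this description can be summed to see how they relate to the stated total of approximately 14 hours. The sketch below is plain arithmetic on the rounded figures from the text; the dictionary layout and names are illustrative and do not reflect how the corpus itself is organized.

```python
from fractions import Fraction

# Hours of speech per speaker group, as quoted in the description (rounded figures).
train = {"Frisian": Fraction(17, 2), "Dutch": Fraction(3)}   # 8.5 h and 3 h
dev = {"Frisian": Fraction(1), "Dutch": Fraction(1, 3)}      # 1 h and 20 min
test = dict(dev)                                             # same sizes as dev


def total_hours(split):
    """Sum the hours of all speaker groups in one split."""
    return sum(split.values())


overall = total_hours(train) + total_hours(dev) + total_hours(test)

# Report each split and the overall total in hours and minutes.
for name, split in (("train", train), ("dev", dev), ("test", test)):
    print(f"{name:5s} {float(total_hours(split)):.2f} h")
whole_hours, remainder = divmod(overall, 1)
print(f"total {int(whole_hours)} h {int(remainder * 60)} min")
```

The rounded split sizes add up to 14 hours and 10 minutes, which is consistent with the description's "approximately 14 hours".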

Dublin-Core Type1

Sound

subtype0-1

speech

Modality1-n

speech , transcribed

Recording environment0-n

home/office , studio , public-place

Channel0-n

broadcasting

Social context0-n

public

Planning type0-n

semi-spontaneous , spontaneous

Interactivity0-n

interactive , semi-interactive

Involvement0-n

unknown

Audience0-n

large

SC duration speech0-1

unknown

SC duration full0-1

over 3000 hours

SC speakers0-1

unknown

SC sp. demogr0-1

Size0-n

18 hours

Annotation0-n

[1]: [orthographicTranscription] [manual] [text/praat-textgrid],

[2]: [alignment] [manual] [text/praat-textgrid]
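The annotations are listed as Praat TextGrid files. The sketch below shows one way to pull labelled intervals out of a TextGrid in Praat's long text format using only the standard library. It is a minimal illustration, not a full parser: it assumes the file has already been decoded to a string, it ignores tier boundaries, and the sample content (tier name and label) is invented for the example and does not come from the corpus.

```python
import re

# Matches one interval entry in Praat's long TextGrid text format.
# Doubled quotes ("") inside a label are Praat's escape for a literal quote.
INTERVAL_RE = re.compile(
    r'intervals \[\d+\]:\s*'
    r'xmin = ([-+\d.eE]+)\s*'
    r'xmax = ([-+\d.eE]+)\s*'
    r'text = "((?:[^"]|"")*)"'
)


def read_intervals(textgrid_text):
    """Return (start, end, label) tuples for every non-empty interval."""
    intervals = []
    for xmin, xmax, label in INTERVAL_RE.findall(textgrid_text):
        label = label.replace('""', '"').strip()
        if label:  # skip silent or unlabelled intervals
            intervals.append((float(xmin), float(xmax), label))
    return intervals


# Hypothetical sample; the tier name and transcription are made up.
SAMPLE = '''File type = "ooTextFile"
Object class = "TextGrid"

xmin = 0
xmax = 2.5
tiers? <exists>
size = 1
item []:
    item [1]:
        class = "IntervalTier"
        name = "speaker1"
        xmin = 0
        xmax = 2.5
        intervals: size = 2
        intervals [1]:
            xmin = 0
            xmax = 1.2
            text = "goeie moarn"
        intervals [2]:
            xmin = 1.2
            xmax = 2.5
            text = ""
'''

labelled = read_intervals(SAMPLE)
speech_seconds = sum(end - start for start, end, _ in labelled)
print(labelled, speech_seconds)
```

For real use, a dedicated TextGrid library is likely a better choice, since Praat also writes a short text format and UTF-16 encoded files that this sketch does not handle.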

Media0-n

audio/x-wav
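Since the media type is WAV and the description states stereo, 48 kHz, 16-bit audio, a downloaded file can be checked against those values with Python's standard `wave` module. The sketch below is an assumption-laden illustration: the helper names are hypothetical, and the demo runs on a synthetic one-second file rather than on corpus data, whose distributed files may use a different format.

```python
import os
import tempfile
import wave

# Format stated in the description; individual files may differ.
EXPECTED = {"channels": 2, "sample_width_bytes": 2, "frame_rate_hz": 48_000}


def describe_wav(path):
    """Read basic format information from a WAV file header."""
    with wave.open(path, "rb") as wav:
        return {
            "channels": wav.getnchannels(),
            "sample_width_bytes": wav.getsampwidth(),
            "frame_rate_hz": wav.getframerate(),
            "duration_seconds": wav.getnframes() / wav.getframerate(),
        }


def matches_description(info):
    """True if the file uses the stereo, 16-bit, 48 kHz format."""
    return all(info[key] == value for key, value in EXPECTED.items())


# Demo on a synthetic file: one second of stereo 16-bit silence at 48 kHz.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "demo.wav")
    with wave.open(demo_path, "wb") as out:
        out.setnchannels(2)
        out.setsampwidth(2)
        out.setframerate(48_000)
        out.writeframes(b"\x00" * (48_000 * 2 * 2))
    demo_info = describe_wav(demo_path)

print(demo_info, matches_description(demo_info))
```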

Resource 2

Description0-1

This part of the collection is known as the FAME! Speaker Clustering and Verification Corpus. The components for speaker clustering and verification experiments were added by including around 80 hours of raw speech data and reorganizing the manually annotated data, respectively. Music portions of the raw data have been removed automatically. Moreover, we applied a publicly available speaker diarization system to the raw speech data and included the output in the corpus. The details of the speaker clustering and verification corpus are provided in: Yılmaz, E., Dijkstra, J., Kuip, F. van der, Van de Velde, H., Kampstra, F., Algra, J., Heuvel, H. van den, Leeuwen, D. van (2017): Longitudinal Speaker Clustering and Verification Corpus with Code-switching Frisian-Dutch Speech. In: Proceedings Interspeech, pp. 37-41, Stockholm, Sweden, August 2017.

Dublin-Core Type1

Sound

subtype0-1

speech

Modality1-n

transcribed

Recording environment0-n

studio , home/office , public-place

Channel0-n

broadcasting

Social context0-n

public

Planning type0-n

semi-spontaneous , spontaneous

Interactivity0-n

interactive , semi-interactive

Involvement0-n

other

Audience0-n

large

SC duration speech0-1

unknown

SC duration full0-1

over 3000 hours

SC speakers0-1

unknown

SC sp. demogr0-1

Size0-n

18 hours

Annotation0-n

[speakerIdentification] [manual] [text/praat-textgrid]

Media0-n

audio/x-wav

Provenance(s)0-n

Provenance 1

Temporal0-1

1966-2015

Cities0-n

Frisia

Country0-1

Netherlands (the) NL

Linguality0-1

Linguality

Status0-n

normal

Variant0-n

dialect , standard

MultiType0-n

codeSwitching

Accessibility0-1

Accessibility

Name1

ELRA

License name(s)0-n

NTU CGN license , ELRA license (End User or VAR)

Non-commercial usage0-1

Website(s)0-n

http://catalog.elra.info/en-us/repository/browse/the-fame-speech-corpus/652adfe0a9ef11e7a093ac9e1701ca02c9639239f1e84440a99e2c0e92546aef/ , http://catalog.elra.info/en-us/

ISBN0-1

ISLRN0-1

340-994-352-616-4

Contact(s)0-n

Henk van den Heuvel: CLST, Radboud University, Nij, (clst@let.ru.nl) , Valerie Mapelli: ELRA/ELDA, (info@elda.org)

Medium(s)0-n

internet

Documentation0-1

Documentation

Language(s)1-n

English [eng]

Type(s)0-n

manual

Validation0-1

Validation

not specified