TalTech Estonian Speech Dataset 1.0

This dataset contains long-form speech data in the Estonian language, with manual transcriptions.

The dataset contains the following materials:

Most subsets are divided into training, dev and test splits, but some contain only training data. The training split contains 1334 hours of audio in total, but not all of it is speech (some segments contain music, etc.). The development and test splits contain 21 and 23 hours of audio, respectively.

Transcriptions were produced using Transcriber according to the attached manual (see file „Trankribeerimise_juhend.pdf“). Most of the material was transcribed by non-professional transcribers, so it contains a noticeable number of errors; the error rate should not, however, exceed 5% of the words.

The Transcriber trs files are pre-converted to STM and VTT formats.
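
As an illustration of how the STM files might be read, here is a minimal Python sketch. It assumes the files follow the conventional NIST STM layout (recording ID, channel, speaker, start time, end time, an optional label in angle brackets, then the transcript, with `;;` marking comment lines); the exact layout used in this dataset should be checked against the files themselves. The names `StmSegment`, `parse_stm` and `total_seconds` are hypothetical helpers introduced for this example, not part of the dataset.

```python
from dataclasses import dataclass
from typing import Iterable, Iterator, Optional


@dataclass
class StmSegment:
    recording: str
    channel: str
    speaker: str
    start: float          # segment start, in seconds
    end: float            # segment end, in seconds
    label: Optional[str]  # e.g. "<o,f0,male>", or None when absent
    text: str


def parse_stm(lines: Iterable[str]) -> Iterator[StmSegment]:
    """Yield segments from STM lines (assumed NIST-style layout)."""
    for raw in lines:
        line = raw.strip()
        # Skip blank lines and ';;' comment lines.
        if not line or line.startswith(";;"):
            continue
        fields = line.split(maxsplit=5)
        if len(fields) < 5:
            continue  # malformed line: ignore it in this sketch
        recording, channel, speaker, start, end = fields[:5]
        rest = fields[5] if len(fields) > 5 else ""
        label = None
        # The optional label is a single token wrapped in angle brackets.
        if rest.startswith("<"):
            label, _, rest = rest.partition(" ")
        yield StmSegment(recording, channel, speaker,
                         float(start), float(end), label, rest.strip())


def total_seconds(segments: Iterable[StmSegment]) -> float:
    """Sum of segment durations in seconds."""
    return sum(seg.end - seg.start for seg in segments)
```

To use it on a real file, pass an open file handle, for example `parse_stm(open(path, encoding="utf-8"))`, where `path` points to one of the STM files in the archive.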

Most of the material has been transcribed over the past 20 years by the Laboratory of Language Technology at Tallinn University of Technology, with funding provided by the national programme for Estonian language technology.

Licence

CC BY-SA 4.0: https://creativecommons.org/licenses/by-sa/4.0/

Copyright of the audio material in the dataset belongs to the respective rights holders.

Downloading

taltech-asr-speech-dataset-1.0.tar (150 GB)

Contact

Tanel Alumäe <tanel.alumae@taltech.ee>

Citing

Tanel Alumäe, Joonas Kalda, Külliki Bode, and Martin Kaitsa. 2023. Automatic Closed Captioning for Estonian Live Broadcasts. In Proceedings of the 24th Nordic Conference on Computational Linguistics (NoDaLiDa), pages 492–499, Tórshavn, Faroe Islands. University of Tartu Library.

@inproceedings{alumae-etal-2023-automatic,
    title = "Automatic Closed Captioning for {E}stonian Live Broadcasts",
    author = {Alum{\"a}e, Tanel  and
      Kalda, Joonas  and
      Bode, K{\"u}lliki  and
      Kaitsa, Martin},
    booktitle = "Proceedings of the 24th Nordic Conference on Computational Linguistics (NoDaLiDa)",
    month = may,
    year = "2023",
    address = "T{\'o}rshavn, Faroe Islands",
    publisher = "University of Tartu Library",
    url = "https://aclanthology.org/2023.nodalida-1.49",
    pages = "492--499"
}