This dataset contains long-form speech data in the Estonian language, with manual transcriptions.
The dataset contains the following materials:
ERR2020, aktuaalne2021, intervjuukorpus, paevakaja, MKK, aktuaalne-kaamera, er-uudised, jutusaated
konverentsid, veebiseminarid
Riigikogu_salvestused
Most subsets are divided into training, dev and test splits, but some contain only training data. The training split contains 1334 hours of audio in total, but not all of it is speech (some segments contain music, etc.). The development and test splits contain 21 and 23 hours of audio, respectively.
Transcriptions were produced using Transcriber according to the attached manual (see the file „Trankribeerimise_juhend.pdf“). Most of the material was transcribed by non-professional transcribers and contains a significant number of errors, although these should not exceed 5% of the words.
The Transcriber trs files are pre-converted to STM and VTT formats.
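As an illustration of working with the STM files, the following is a minimal sketch of a reader for the conventional NIST STM line layout (recording, channel, speaker, start time, end time, optional label in angle brackets, transcript). It assumes the files in this dataset follow that layout; the names StmSegment and parse_stm_lines are hypothetical helpers, not part of the dataset or of any library.

```python
from dataclasses import dataclass
from typing import Iterable, Iterator, Optional


@dataclass
class StmSegment:
    recording: str
    channel: str
    speaker: str
    start: float          # segment start, in seconds
    end: float            # segment end, in seconds
    label: Optional[str]  # optional "<...>" field, if present
    text: str             # transcript of the segment


def parse_stm_lines(lines: Iterable[str]) -> Iterator[StmSegment]:
    """Yield one StmSegment per STM line (assumed NIST STM layout)."""
    for raw in lines:
        line = raw.strip()
        # Skip blank lines and ";;" comment lines.
        if not line or line.startswith(";;"):
            continue
        # Five fixed fields, then the remainder (label and/or transcript).
        fields = line.split(None, 5)
        if len(fields) < 5:
            continue  # not a well-formed segment line
        recording, channel, speaker, start, end = fields[:5]
        rest = fields[5] if len(fields) > 5 else ""
        label = None
        if rest.startswith("<") and ">" in rest:
            head, _, tail = rest.partition(">")
            label = head + ">"
            rest = tail.strip()
        yield StmSegment(recording, channel, speaker,
                         float(start), float(end), label, rest)


# Made-up example lines, for illustration only.
example = [
    ";; comment line",
    "rec1 1 spk1 0.00 2.50 <o,f0,male> tere hommikust",
    "rec1 1 spk2 2.50 4.00 aitäh",
]
segments = list(parse_stm_lines(example))
transcribed_seconds = sum(s.end - s.start for s in segments)
```

Summing segment durations in this way gives the amount of transcribed audio per file, which can be useful because not all of the audio in the dataset is speech.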
Most of the material has been transcribed in the past 20 years by the Laboratory of Language Technology at Tallinn University of Technology, using funding provided by the national programme for Estonian language technology.
CC BY-SA 4.0: https://creativecommons.org/licenses/by-sa/4.0/
Copyright of the audio material in the dataset belongs to the respective parties.
taltech-asr-speech-dataset-1.0.tar (150 GB)
Tanel Alumäe tanel.alumae@taltech.ee
Tanel Alumäe, Joonas Kalda, Külliki Bode, and Martin Kaitsa. 2023. Automatic Closed Captioning for Estonian Live Broadcasts. In Proceedings of the 24th Nordic Conference on Computational Linguistics (NoDaLiDa), pages 492–499, Tórshavn, Faroe Islands. University of Tartu Library.
@inproceedings{alumae-etal-2023-automatic,
title = "Automatic Closed Captioning for {E}stonian Live Broadcasts",
author = {Alum{\"a}e, Tanel and
Kalda, Joonas and
Bode, K{\"u}lliki and
Kaitsa, Martin},
booktitle = "Proceedings of the 24th Nordic Conference on Computational Linguistics (NoDaLiDa)",
month = may,
year = "2023",
address = "T{\'o}rshavn, Faroe Islands",
publisher = "University of Tartu Library",
url = "https://aclanthology.org/2023.nodalida-1.49",
pages = "492--499"
}