Noise-Robust, Domain-Adaptable, Large-Vocabulary Automatic Speech Recognition System for the Romanian Language
[Project's page]

Coordinated by: Horia Cucu

Other department members involved in this project: Dragoş Burileanu, Lucian Petrică


The main goal of this project was to develop a Rich Speech Transcription (RST) service for audio documents. The final outcome of the project is a web service that enables individuals to access the textual content of an audio document (news bulletin, interview, lecture, meeting recording, etc.) without listening to it. This capability is critically important in many applications, such as multimedia database indexing and retrieval, real-time radio/TV monitoring, and transcription of self-recorded documents.

The RST service is based on the first speaker-independent, large-vocabulary continuous speech recognition (LVCSR) system for Romanian, developed by our research laboratory in 2011. Developing the RST service required enhancing the LVCSR system and adapting it to the particularities of multimedia document transcription. To achieve the main objective, the existing LVCSR system was augmented with several modules:

a speech enhancement module that reduces the effect of noise on transcription accuracy,
a speaker diarization module that divides the speech signal into segments (based on the speaker who uttered the speech) and identifies the speaker (from a set of previously known speakers),
a text post-processing module that formats paragraphs, numbers, dates, etc. and restores diacritics, punctuation marks and capital letters, increasing the intelligibility of the output text,
better acoustic and language models that improve the accuracy of the system.
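
The way these modules might fit together can be sketched in a few lines of Python. This is only an illustrative sketch, not the project's implementation: the order of the stages, the `Segment` structure, the `transcribe` function and all `toy_*` stand-ins are assumptions introduced here for illustration.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Segment:
    """One speaker-homogeneous stretch of audio (hypothetical structure)."""
    speaker: str          # label assigned by the diarization stage
    samples: List[float]  # audio samples belonging to this segment
    text: str = ""        # filled in after recognition and post-processing


def transcribe(
    audio: List[float],
    enhance: Callable[[List[float]], List[float]],
    diarize: Callable[[List[float]], List[Segment]],
    recognize: Callable[[List[float]], str],
    postprocess: Callable[[str], str],
) -> List[Segment]:
    """Chain the stages; the ordering is an assumption, not stated in the text."""
    cleaned = enhance(audio)            # speech enhancement (noise reduction)
    segments = diarize(cleaned)         # split the signal by speaker
    for seg in segments:
        raw = recognize(seg.samples)    # LVCSR decoding to a raw word sequence
        seg.text = postprocess(raw)     # formatting, punctuation, capitals
    return segments


# Toy stand-ins so the sketch runs; real modules would replace them.
def toy_enhance(audio: List[float]) -> List[float]:
    return list(audio)  # no-op "enhancement"


def toy_diarize(audio: List[float]) -> List[Segment]:
    mid = len(audio) // 2  # pretend the speaker changes halfway through
    return [Segment("speaker_1", audio[:mid]), Segment("speaker_2", audio[mid:])]


def toy_recognize(samples: List[float]) -> str:
    return "buna ziua"  # fixed placeholder output, not a real decoder


def toy_postprocess(text: str) -> str:
    return text[:1].upper() + text[1:] + "."  # capitalize and add a full stop


if __name__ == "__main__":
    for seg in transcribe([0.1, 0.2, 0.3, 0.4], toy_enhance, toy_diarize,
                          toy_recognize, toy_postprocess):
        print(f"{seg.speaker}: {seg.text}")
```

Passing each stage in as a function keeps the sketch independent of any particular enhancement, diarization or recognition toolkit, so a single module can be swapped without touching the rest.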

The first version of the service transcribes Romanian speech in multimedia documents into text, while future versions may be adapted for other under-resourced languages as well. As opposed to high-resourced languages, such as English, Spanish and Mandarin Chinese, under-resourced languages are those for which the available acoustic, phonetic and linguistic databases are insufficient for the straightforward development of spoken language technology (SLT) systems and applications.

We believe that adapting the system to other under-resourced languages will have an important social and economic impact, because for many such languages there are currently no automatic solutions for speech transcription. At the same time, the continuous growth of multimedia production, sharing and consumption leaves us with large multimedia databases that cannot be efficiently accessed and exploited. Their content can only be classified and accessed based on metadata, which is insufficient when one wants to find multimedia documents on specific topics or sub-topics. Moreover, complete and rich transcriptions of these multimedia documents can only be generated manually, a process that does not scale and is costly in both time and money.

The beneficiaries of this service could be: a) individuals and companies that need to transcribe multimedia documents, b) companies that possess large, unannotated multimedia databases and have no means of efficiently accessing and exploiting them, and c) individual users of public multimedia libraries and online multimedia-sharing websites.