Romanian Language Phonetic Analysis: Study and Applications
Coordinated by: Dragoş Burileanu
Other persons involved in the project:
Ștefan Diaconescu (Softwin)
Studies that have produced significant results in Natural Language Processing (NLP) for the Romanian language are carried out in several Romanian research centers, including "Politehnica" University of Bucharest, the "Iorgu Iordan - Al. Rosetti" Institute of Linguistics, Bucharest, the Romanian Academy Research Institute for Artificial Intelligence, The Technical University of Cluj-Napoca, The Military Technical Academy, The Alexandru Ioan Cuza University of Iasi, and SOFTWIN SRL (The Research and Development Department). However, many aspects still need further investigation before large linguistic knowledge bases, dictionaries, tools, and applications can be built. The current project will integrate the results of previous work (existing linguistic knowledge bases, linguistic tools, applications for exploiting linguistic data, etc.) in order to develop several products with scientific and commercial value:
1. Phonetic Study for the Romanian Language, starting from the existing linguistic data written in GRAALAN. This study will also specify the method for building the dictionary, which will be created according to the GRAALAN principles and with the aid of already developed tools such as LKT (Language Knowledge Tool) and MKT (Morphological Knowledge Tool). These linguistic tools will help enrich the linguistic databases so that they cover:
a. Approx. 90,000 to 100,000 lemmas at the level of the lexicon;
b. Approx. 1,250,000 inflected forms (single words) with phonetic transcriptions accompanied by phonetic/morphological syllabifications;
c. Approx. 2,500,000 inflection situations (synthetic forms), meaning various morphological categories in agreement with the Morphological Configurator written in GRAALAN;
d. Approx. 12,500,000 inflected forms (multi-words) with phonetic transcriptions accompanied by phonetic/morphological syllabifications;
e. Approx. 18,750,000 inflection situations corresponding to the analytic forms.
2. Romanian Morphological and Phonetic Dictionary. The Dictionary will include:
a. Lemmas, synthetic and analytic inflected forms, as well as phonetic transcriptions and morphological/phonetic syllabifications; every dictionary entry will contain details about the corresponding inflection situation;
b. An application designed to give fast and easy access to the information.
3. The Phonetic Dictionary of Romanian Syllables. The dictionary will contain:
a. A description of the GRAALAN methods and principles, as well as of the linguistic tools used for entering the linguistic data, for example MKT (Morphological Knowledge Tool).
b. Syllables written in both the ordinary and the phonetic alphabet, along with indications of their stress patterns.
c. Software associated with the dictionary, which will facilitate access to the information and provide various options for filtering the large amount of data.
4. Speech Recognition Application for the Romanian Language. The application will involve:
a. Use of original, patented signal shape analysis algorithms.
b. Use of syllable databases of audio representations, encoded and normalized using specific VTS (Variable Time Stretching) algorithms.
c. Use of the GRAALAN-encoded linguistic knowledge, such as phonetic transcription, syllabification, and other linguistic information required by the processes of spelling and grammar checking.
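
As a rough illustration of the kind of entry described for the Romanian Morphological and Phonetic Dictionary (item 2a), the Python sketch below shows one possible record layout and a simple lookup by inflected form. It is only a sketch under stated assumptions: the class, field, and function names are hypothetical, the sample entry is hand-written for illustration, and nothing here reflects the actual GRAALAN format or the project's tools. The sample also shows how a single inflected form can correspond to more than one inflection situation, which is consistent with the plan listing more inflection situations than inflected forms.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class DictionaryEntry:
    """Hypothetical record for one inflected form (not the GRAALAN format)."""
    lemma: str                               # dictionary headword
    form: str                                # inflected form as written
    transcription: str                       # phonetic transcription (IPA assumed)
    phonetic_syllables: Tuple[str, ...]      # syllabification by pronunciation
    morphological_syllables: Tuple[str, ...] # split at morpheme boundaries
    inflection_situations: Tuple[str, ...]   # morphological categories of the form


def build_index(entries: List[DictionaryEntry]) -> Dict[str, List[DictionaryEntry]]:
    """Group entries by inflected form so a form can be looked up directly."""
    index: Dict[str, List[DictionaryEntry]] = {}
    for entry in entries:
        index.setdefault(entry.form, []).append(entry)
    return index


# Hand-written illustrative sample; the morphological split "cas" + "e"
# is an assumption made for this sketch, not project data.
SAMPLE_ENTRIES = [
    DictionaryEntry(
        lemma="casă",
        form="case",
        transcription="ˈkase",
        phonetic_syllables=("ca", "se"),
        morphological_syllables=("cas", "e"),
        inflection_situations=(
            "noun, feminine, plural, nominative-accusative, indefinite",
            "noun, feminine, singular, genitive-dative, indefinite",
        ),
    ),
]

if __name__ == "__main__":
    index = build_index(SAMPLE_ENTRIES)
    for entry in index.get("case", []):
        print(entry.lemma, "-".join(entry.phonetic_syllables))
        for situation in entry.inflection_situations:
            print("  ", situation)
```

Indexing by the written form keeps lookup simple; an application of the kind mentioned in item 2b would presumably also need filtering by lemma or by morphological category, which is left out of this sketch.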