SEMINAR

 

DEPARTMENT OF COMPUTER ENGINEERING

 

ABSTRACT

 

Statistical Modeling of Agglutinative Languages

 

by

 

Dilek Zeynep Hakkani Tür

 

Recent advances in computer hardware and the availability of very large corpora have made the application of statistical techniques to natural language processing a feasible and very appealing research area. Many good results have been obtained by applying these techniques to English (and similar languages) in parsing, word sense disambiguation, part-of-speech tagging, and speech recognition. However, languages like Turkish, which have a number of characteristics that differ from English, have mainly been left unstudied. Turkish presents an interesting problem for statistical modeling. In languages like English, a given root word has only a very small number of possible word forms; in languages like Turkish or Finnish, with very productive agglutinative morphology, it is possible to produce thousands of forms for a given root word. This causes a serious data sparseness problem for language modeling.

 

In this Ph.D. thesis, I present our work on the development and application of statistical language modeling techniques for Turkish, and test these techniques on basic applications of natural language and speech processing such as morphological disambiguation, spelling correction, and n-best list rescoring for speech recognition.

 

The seminar will be held on August 28, 2000, at 10:00 in EA 331.