SEMINAR
DEPARTMENT OF COMPUTER ENGINEERING
ABSTRACT
APPLICATION OF FEATURE PROJECTION BASED TEXT CATEGORIZATION ALGORITHM ON THE TURKISH DATASET
Ufuk İlhan
M.S. in Computer Engineering
Supervisor: Assoc. Prof. Halil Altay Güvenir
February 21, 2001 at 14:40 in EB267
This thesis presents the compilation of a Turkish dataset, called Anadolu Agency Newsgroup, for the study of text categorization. Nearly all research in this area has focused on English or on languages morphologically similar to English. In such languages, words contain only a small number of affixes, or none at all; almost all parsing models for them treat affix recognition as trivial and therefore perform no morphological analysis. This property makes it easy to stem words to their roots. In agglutinative languages such as Turkish, by contrast, words contain no direct indication of where the morpheme boundaries are, and morphemes take shapes that depend on the morphological and phonological context. In Turkish, adding one suffix after another can produce relatively long words, which often carry as much semantic information as a whole English phrase, clause, or sentence. Because of this complex morphological structure, a single Turkish word can give rise to a very large number of variants. Turkish therefore requires text processing techniques different from those used for English and similar languages.
This thesis also presents an evaluation and comparison of the well-known k-NN classification algorithm and a variant of k-NN, called the Feature Projection Text Categorization (FPTC) algorithm, which is based on the idea of representing training instances as their projections on each feature dimension.
Keywords: text categorization, classification, feature projections, stemming, wild card matching, stopword.
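
The contrast drawn in the abstract between k-NN and classification on feature projections can be illustrated with a small sketch. It is not the FPTC algorithm evaluated in the thesis, whose details the abstract does not give. It assumes a simple scheme in which each feature dimension stores the training values projected onto it, the k nearest projections on each dimension vote for their category, and the votes are summed over all dimensions. The term-count vectors, the category names, and the function names project, classify_fp, and classify_knn are hypothetical.

```python
from collections import Counter

# Hypothetical toy data: each instance is (term-count vector, category).
TRAIN = [
    ([3, 2, 0], "sports"),
    ([4, 1, 0], "sports"),
    ([2, 3, 1], "sports"),
    ([0, 0, 4], "economy"),
    ([1, 0, 3], "economy"),
    ([0, 1, 5], "economy"),
]


def project(train):
    """Store, for every feature, the sorted (value, category) projections."""
    n_features = len(train[0][0])
    return [sorted((x[f], y) for x, y in train) for f in range(n_features)]


def classify_fp(projections, query, k=3):
    """Assumed voting scheme: k nearest projections per feature cast one vote each."""
    votes = Counter()
    for f, proj in enumerate(projections):
        # Distance is measured on this single feature dimension only.
        nearest = sorted(proj, key=lambda vl: abs(vl[0] - query[f]))[:k]
        for _, label in nearest:
            votes[label] += 1
    return votes.most_common(1)[0][0]


def classify_knn(train, query, k=3):
    """Plain k-NN for comparison: squared Euclidean distance over all features."""
    def dist(instance):
        return sum((a - b) ** 2 for a, b in zip(instance[0], query))

    nearest = sorted(train, key=dist)[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]


if __name__ == "__main__":
    projections = project(TRAIN)
    for q in ([3, 2, 0], [0, 0, 4]):
        print(q, classify_fp(projections, q), classify_knn(TRAIN, q))
```

In this sketch the projections are computed once from the training set, and each query is then compared with the training values one dimension at a time, whereas plain k-NN computes a full distance to every training instance.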