To achieve this goal, Web pages and queries will first be categorized into semantic groups, using either taxonomies from Web directories or text clustering algorithms. These semantic groups will then be exploited in index pruning and caching methods. For pruning, we will concentrate on pruning semantic groups (partially or entirely) instead of pruning pages or terms as proposed in the literature. For index caching, we will investigate methods that cache a single result list for a set of queries in the same semantic group, or that cache only a particular set of semantic groups for each term in the collection. The proposed methods will be evaluated on the largest datasets available in the literature in terms of result quality, index size, query processing speed, and cache hit rate, among other measures. Finally, a prototype search engine will be implemented for two major areas, namely news and health, which are frequently queried by users. The promising approaches from the earlier stages will also be embedded in this prototype. We expect Web pages and queries in these areas to be amenable to forming semantic groups. Thus, a search engine operating on these areas will be a reasonable application for observing the viability of the proposed approaches in a real-life setting.
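As an illustration only, the idea of caching a single result list for all queries in the same semantic group could be sketched as follows. This is a minimal sketch under stated assumptions, not the project's design: the class name GroupResultCache, the LRU eviction policy, and the keyword-based toy_group_of classifier are all hypothetical stand-ins (the project would derive groups from a Web directory taxonomy or a clustering algorithm).

```python
from collections import OrderedDict
from typing import Callable, Hashable, List, Optional


class GroupResultCache:
    """LRU cache keyed by semantic group instead of by query string.

    Hypothetical sketch: all queries mapped to the same group share
    one cached result list.
    """

    def __init__(self, capacity: int, group_of: Callable[[str], Hashable]):
        self.capacity = capacity
        self.group_of = group_of  # assumed query -> group classifier
        self._store: "OrderedDict[Hashable, List[str]]" = OrderedDict()
        self.hits = 0
        self.misses = 0

    def lookup(self, query: str) -> Optional[List[str]]:
        """Return the cached result list for the query's group, if any."""
        group = self.group_of(query)
        if group in self._store:
            self._store.move_to_end(group)  # mark as most recently used
            self.hits += 1
            return self._store[group]
        self.misses += 1
        return None

    def store(self, query: str, results: List[str]) -> None:
        """Cache one result list on behalf of the query's whole group."""
        group = self.group_of(query)
        self._store[group] = results
        self._store.move_to_end(group)
        if len(self._store) > self.capacity:
            self._store.popitem(last=False)  # evict least recently used group

    def hit_rate(self) -> float:
        total = self.hits + self.misses
        return self.hits / total if total else 0.0


def toy_group_of(query: str) -> str:
    # Hypothetical stand-in for a taxonomy lookup or clustering model.
    health_terms = {"flu", "influenza", "fever"}
    return "health" if health_terms & set(query.lower().split()) else "other"
```

With this sketch, storing results for "flu symptoms" lets a later lookup for "influenza treatment" be answered from the cache, since both queries fall into the same toy group. Whether a shared result list is accurate enough for every query in a group is exactly the result-quality question the evaluation would need to answer.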
If the proposed methods prove successful, they may yield significant savings in the resources used by search engines. These techniques would also help sustain local, national and/or alternative systems competing against giant search engine companies with enormous resources. The proposed approaches are not limited to general-purpose search engines; they can also be used by vertical search engines and Web portals. Consequently, this project may also provide valuable efficiency and scalability improvements for such systems, which typically have rather limited resources. Finally, the prototype search engine implemented in this project would be available as a by-product to Turkish-speaking Web users.