Automatic Construction and Sophisticated Querying of Domain-Specific Web Search Engines
(Project No: 105E024)

SPONSORS: Scientific and Technical Research Council of Turkey - TÜBITAK

ABSTRACT: The goal of this project is to develop an automated system for constructing specialized Web portals with sophisticated querying features. In particular, constructing such a portal requires crawling all the Web pages related to a specific target domain or topic, extracting structured information from these pages (with the least possible human intervention), and finally providing effective and advanced querying features over the whole Web repository. Algorithms used by existing domain-specific search engines and portals cannot be easily adapted to different domains. Moreover, these systems still do not provide more advanced querying options than a typical keyword-based search engine. In the first stage of this project, a focused crawler that can be trained to identify and gather all relevant pages in a target domain will be constructed. Next, by using information extraction techniques, both generic (such as name, location, e-mail address, etc.) and more specific (e.g., "data structure" names from the computer science domain) structured information will be obtained from these Web pages and stored in a traditional relational DBMS. In the last stage, sophisticated querying features will be supported over both the Web pages in the repository and the structured information associated with these pages. These queries differ from typical database queries in that they allow ranked outputs: during query evaluation, the scores associated with the records are modified according to the predicates in the query. In this sense, the domain-specific portal to be constructed would be neither a plain compilation of links nor yet another domain-specific engine that only allows keyword-based queries. The extraction of information in the second stage and the sophisticated score-based querying of the database in the last stage will make the proposed system superior to existing ones.
During this project, sophisticated algorithms will be tailored for focused crawling and information extraction, and SQL will be extended to natively support score management. In addition, a prototype Web portal for computer science resources (in both English and Turkish) will be constructed and maintained. This will be a valuable service, as we expect the portal to be visited frequently by both researchers and students. Finally, the modular system constructed in this project may be adapted to other domains and may facilitate building other domain-specific portals and search engines.
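
The score-based querying idea described in the abstract can be illustrated with a minimal Python sketch. It is not the project's SQL extension. The names Record, SoftPredicate and ranked_query, the multiplicative boost/penalty scheme, and the sample data are all assumptions made for illustration only. Each record starts with a score, every predicate in the query rescales that score during evaluation, and the output is ranked by the final score.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Record:
    url: str
    fields: dict
    score: float = 1.0  # hypothetical initial relevance score


@dataclass
class SoftPredicate:
    test: Callable[[dict], bool]
    boost: float    # multiplier applied when the predicate holds
    penalty: float  # multiplier applied when it does not (0.0 filters out)


def ranked_query(records, predicates, top_k=10):
    """Evaluate predicates, adjusting scores, and return ranked (score, url) pairs."""
    results = []
    for rec in records:
        score = rec.score
        for pred in predicates:
            score *= pred.boost if pred.test(rec.fields) else pred.penalty
        if score > 0:  # records whose score dropped to zero are excluded
            results.append((score, rec.url))
    # Highest score first; ties broken by URL for a deterministic order.
    results.sort(key=lambda pair: (-pair[0], pair[1]))
    return results[:top_k]


if __name__ == "__main__":
    pages = [
        Record("a", {"topic": "data structure", "lang": "en"}, 0.8),
        Record("b", {"topic": "data structure", "lang": "tr"}, 0.9),
        Record("c", {"topic": "networks", "lang": "en"}, 1.0),
    ]
    query = [
        # Hard requirement: non-matching records get score 0 and are dropped.
        SoftPredicate(lambda f: f["topic"] == "data structure", 2.0, 0.0),
        # Soft preference: Turkish pages are boosted, others are unchanged.
        SoftPredicate(lambda f: f["lang"] == "tr", 1.5, 1.0),
    ]
    for score, url in ranked_query(pages, query):
        print(url, round(score, 2))
```

In this sketch a penalty of 0.0 makes a predicate behave like an ordinary Boolean filter, while any other penalty turns it into a preference that only reorders the results.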

DURATION: October 2005 - October 2007
PRINCIPAL INVESTIGATOR: Özgür Ulusoy
RESEARCHERS: Ismail Sengor Altingovde, Rıfat Özcan
BUDGET: 124,560 YTL (~$95,000)
