Automatic Construction and Sophisticated Querying of Domain-Specific Web Search Engines
(Project No: 105E024)
SPONSORS:
Scientific and Technical Research Council of Turkey - TÜBITAK
ABSTRACT:
The goal of this project is to develop an automated system for constructing
specialized Web portals with sophisticated querying features. In particular,
the construction of such a specialized portal requires crawling all the Web
pages that are related to a specific target domain or topic, extracting structured
information from these pages (with the least possible human intervention), and finally
providing effective and advanced querying features over the whole Web repository.
Algorithms used by existing domain-specific search engines and portals
cannot be easily adapted to different domains. Moreover, these systems still
do not provide more advanced querying options than a typical keyword-based search
engine. In the first stage of this project, a focused crawler that can be trained
to identify and gather all relevant pages on a target domain will be constructed.
Next, by using information extraction techniques, both generic (such as
name, location, e-mail address, etc.) and more specific (e.g., "data structure"
names from the computer science domain) structured information will be obtained
from these Web pages and stored in a traditional relational DBMS. At the last stage,
sophisticated querying features will be supported over both the Web pages in
the repository and the structured information associated with these pages. These
queries differ from typical database queries in that they can produce ranked
output: during query evaluation, the scores associated with the records are
modified according to the predicates in the query. In this sense, the domain-specific
portal to be constructed would be neither a plain compilation of a bunch of links
nor yet another domain-specific engine that can only allow keyword-based queries.
The extraction of information at the second stage and sophisticated
score-based querying of the database at the last stage will make the proposed system
superior to the existing ones.
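
The idea of score-based querying can be illustrated with a minimal sketch. It
is not the SQL extension proposed in this project; it only emulates the
behaviour on top of plain SQL. The table name pages, its columns, the helper
ranked_query, and the multiplicative boost per satisfied predicate are all
assumptions made for illustration.

```python
import sqlite3

# Hypothetical schema: one row per crawled page, with a base relevance score.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE pages (url TEXT, topic TEXT, language TEXT, score REAL)")
conn.executemany(
    "INSERT INTO pages VALUES (?, ?, ?, ?)",
    [
        ("a.example/heap", "data structure", "en", 0.5),
        ("b.example/agac", "data structure", "tr", 0.4),
        ("c.example/news", "other", "en", 0.9),
    ],
)


def ranked_query(conn, predicates):
    """Return (url, final_score) rows, highest score first.

    predicates is a list of (condition, params, boost) tuples. Instead of
    filtering rows out, each satisfied predicate multiplies the row's score
    by its boost (an assumed scoring rule, chosen only for this sketch).
    The condition text is inserted into the SQL as-is, so it must come from
    trusted code; only its values are passed as bound parameters.
    """
    score_expr = "score"
    params = []
    for condition, cond_params, boost in predicates:
        score_expr += f" * (CASE WHEN {condition} THEN ? ELSE 1.0 END)"
        params.extend(cond_params)  # placeholders inside the condition
        params.append(boost)        # placeholder after THEN
    sql = (
        f"SELECT url, {score_expr} AS final_score "
        "FROM pages ORDER BY final_score DESC"
    )
    return conn.execute(sql, params).fetchall()


rows = ranked_query(
    conn,
    [("topic = ?", ["data structure"], 3.0), ("language = ?", ["tr"], 1.5)],
)
for url, final_score in rows:
    print(url, round(final_score, 2))
```

With these sample rows, the page that satisfies both predicates is ranked
first and the page that satisfies neither is ranked last, even though the
latter has the highest base score. No record is discarded by the predicates.
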
During this project, sophisticated algorithms will be tailored for focused crawling
and information extraction, and SQL will be extended to natively support
score management. In addition, a prototype Web portal for computer science resources
(both in English and Turkish) will be constructed and maintained. This will
be a valuable service as we expect this portal to be a frequently visited site
by both researchers and students. Finally, the modular system constructed in this
project may be adapted to facilitate building other domain-specific portals
and search engines.