BU-CE-1101: PDF
Nearest-Neighbor based Metric Functions for Indoor Scene Recognition
Indoor scene recognition is a challenging problem in the classical scene recognition domain due to the severe intra-class variations and inter-class similarities of man-made indoor structures. State-of-the-art scene recognition techniques such as capturing holistic representations of a scene demonstrate low performance on indoors. Other methods that introduce intermediate steps such as identifying objects and associating them to scenes have the handicap of successfully localizing and recognizing the objects in a highly cluttered and sophisticated environment.We propose a classification method that can handle such difficulties of the problem domain by employing a metric function based on the nearest-neighbor classification procedure using the bag-of-visual words scheme, the so-called codebooks. Considering the codebook construction as a Voronoi tessellation of the feature space, we have observed that, given an image, a learned weighted distance of the extracted feature vectors to the center of the Voronoi cells gives a strong indication of its category. Our method outperforms state-of-the-art approaches on an indoor scene recognition benchmark and achieves competitive results on a general scene dataset, using a single type of descriptor.
Keywords: Scene classification; indoor scene recognition; nearest neighbor classifier; bag-of-visual words.
BU-CE-1102: PDF
Efficient Query Processing on Term-Based-Partitioned Inverted Indexes
In a shared-nothing, parallel text retrieval system, queries are processed over an inverted index that is partitioned among a number of index servers. In practice, the inverted index is either document-based or term-based partitioned, depending on properties of the underlying hardware infrastructure, query traffic, and some performance and availability constraints. In query processing on term-based-partitioned indexes, the high communication overhead incurred due to transfer of large amounts of information from index servers to the central broker forms a major performance bottleneck, deteriorating the scalability of the parallel text retrieval system. In this work, to alleviate this problem, we propose a novel combinatorial model that tries to assign concurrently accessed index entries to the same index servers, based on the inverted index access patterns extracted from past query logs. The model aims to minimise the communication overhead incurred by future queries, also enforcing a balance on workloads of index servers. We evaluate the performance of this model over a real-life text collection and a query log obtained from Yahoo!. Our results indicate significant reduction in the communication overhead, relative to a baseline strategy that only balances query workloads.BU-CE-1103: PDF
Combining Term and Document Popularities with Query Views for Static Index Pruning
Static index pruning techniques permanently remove a presumably redundant part of an inverted file, to reduce the file size and query processing time. These techniques differ in deciding which parts of an index can be removed safely; i.e., without changing the top-ranked query results. As defined in the literature, the query view of a document is the set of query terms that access to this particular document, i.e., retrieves this document among its top results. In this study, we first propose using query views to improve the quality of the top results compared against the original results. We incorporate query views in a number of static pruning strategies, namely term-centric, document-centric, term popularity based and document access popularity based approaches, and show that the new strategies considerably outperform their counterparts especially for the higher levels of pruning and for both disjunctive and conjunctive query processing. Additionally, we combine the notions of term and document access popularity to form new pruning strategies, and further extend these strategies with the query views. The new strategies improve the result quality especially for the conjunctive query processing, which is the default and most common search mode of a search engine.BU-CE-1104: PDF
Proposal and Referee Biclustering for Panel Formation
We are given a set of referees and and a set of proposals to be coclustered such that the proficiency between the referees and the proposals within a cluster is maximized. Due to capacity of the referees, one should consider an additional constraint that balances the number of proposals and the number of referees within each cluster, as well. Current biclustering approaches do not capture this balance constraint. As a remedy to this deficiency, in this study, we redefined the problem and proposed a graph partitioning based model to solve.BU-CE-1105: PDF
Nearest-Neighbor Based Metric Functions for Indoor Scene Recognition
Indoor scene recognition is a challenging problem in the classical scene recognition domain due to the severe intra-class variations and inter-class similarities of man-made indoor structures. State-of-the-art scene recognition techniques such as capturing holistic representations of an image demonstrate low performance on indoor scenes. Other methods that introduce intermediate steps such as identifying objects and associating them with scenes have the handicap of successfully localizing and recognizing the objects in a highly cluttered and sophisticated environment.We propose a classification method that can handle such difficulties of the problem domain by employing a metric function based on the nearest-neighbor classication procedure using the bag-of-visual words scheme, the so-called codebooks. Considering the codebook construction as a Voronoi tessellation of the feature space, we have observed that, given an image, a learned weighted distance of the extracted feature vectors to the center of the Voronoi cells gives a strong indication of the image's category. Our method outperforms state-of-the-art approaches on an indoor scene recognition benchmark and achieves competitive results on a general scene dataset, using a single type of descriptor.
In this study although our primary focus is indoor scene categorization, we also employ the proposed metric function to create a baseline implementation for the auto-annotation problem. With the growing amount of digital media, the problem of auto-annotating images with semantic labels has received signi cant interest from researches in the last decade. Traditional approaches where such content is manually tagged has been found to be too tedious and a time-consuming process. Hence, succesfully labeling images with keywords describing the semantics is a crucial task yet to be accomplished.
Keywords: scene classification, indoor scene recognition, nearest neighbor classifier, bag-of-visual words, image auto-annotation.
BU-CE-1106: PDF
Caching Techniques for Large Scale Web Search Engines (Ph. D. Thesis)
Large scale search engines have to cope with increasing volume of web content and increasing number of query requests each day. Caching of query results is one of the crucial methods that can increase the throughput of the system. In this thesis, we propose a variety of methods to increase the efficiency of caching for search engines.We first provide cost-aware policies for both static and dynamic query result caches. We show that queries have significantly varying costs and processing cost of a query is not proportional to its frequency. Based on this observation, we develop caching policies that take the query cost into consideration in addition to frequency, while deciding which items to cache. Second, we propose a query intent aware caching scheme such that navigational queries are identified and cached differently from other queries. Query results are cached and presented in terms of pages, which typically includes 10 results each. In navigational queries, the aim is to reach a particular web site which would be typically listed at the top ranks by the search engine, if found. We argue that caching and presenting the results of navigational queries in this 10-per-page manner is not cost effective and thus we propose alternative result presentation models and investigate the effect of these models on caching performance. Third, we propose a cluster based storage model for query results in a static cache. Queries with common result documents are clustered using single link clustering algorithm. We provide a compact storage model for those clusters by exploiting the overlap in query results. Finally, a five-level static cache that consists of all cacheable data items (query results, part of index, and document contents) in a search engine setting is presented. A greedy method is developed to determine which items to cache. This method prioritizes items for caching based on gains computed using items' past frequency, estimated costs, and storage overheads. This approach also considers the inter-dependency between items such that caching of an item may affect the gain of items that are not cached yet.
We experimentally evaluate all our methods using a real query log and document collections. We provide comparisons to corresponding baseline methods in the literature and we present improvements in terms of throughput, number of cache misses, and storage overhead of query results.
Keywords: Search engine, caching techniques, cost-aware caching, navigational queries.
BU-CE-1107: PDF
A Result Cache Invalidation Scheme for Web Search Engines (M. Sc. Thesis)
The result cache is a vital component for efficiency of large-scale web search engines, and maintaining the freshness of cached query results is a current research challenge. As a remedy to this problem, our work proposes a new mechanism to identify queries whose cached results are stale. The basic idea behind our mechanism is to maintain and compare generation time of query results with update times of posting lists and documents to decide on staleness of query results.The proposed technique is evaluated using a Wikipedia document collection with real update information and a real-life query log. Throughout the experiments, we compare our approach with two baseline strategies from literature together with a detailed evaluation. We show that our technique has good prediction accuracy, relative to the baseline based on the time-to-live (TTL) mechanism. Moreover, it is easy to implement and incurs less processing overhead on the system relative to a recently proposed, more sophisticated invalidation mechanism.
Keywords: Web search, result cache, cache invalidation, time-to-live, freshness, adaptive.
BU-CE-1108: PDF
Software Design, Implementation, Application, and Refinement of a Bayesian Approach for the Assessment of Content and User Qualities (M. Sc. Thesis)
The internet provides unlimited access to vast amounts of information. Technical innovations and internet coverage allow more and more people to supply contents for the web. As a result, there is a great deal of material which is either inaccurate or out-of-date, making it increasingly difficult to find relevant and up-to-date content. In order to solve this problem, recommender systems based on collaborative filtering have been introduced. These systems cluster users based on their past preferences, and suggest relevant contents according to user similarities. Trustbased recommender systems consider the trust level of users in addition to their past preferences, since some users may not be trustworthy in certain categories even though they are trustworthy in others. Content quality levels are important in order to present the most current and relevant contents to users. The study presented here is based on a model which combines the concepts of content quality and user trust. According to this model, the quality level of contents cannot be properly determined without considering the quality levels of evaluators. The model uses a Bayesian approach, which allows the simultaneous co-evaluation of evaluators and contents. The Bayesian approach also allows the calculation of the updated quality values over time. In this thesis, the model is further refined and configurable software is implemented in order to assess the qualities of users and contents on the web. Experiments were performed on a movie data set and the results showed that the Bayesian co-evaluation approach performed more effectively than a classical approach which does not consider user qualities. The approach also succeeded in classifying users according to their expertise level.Keywords: Information quality, web 2.0, collaborative systems, recommender systems, collaborative filtering, Bayesian networks, co-evaluation, trust-based systems.