All technical report files are in compressed (gzip) postscript format.
The file names are of the sort BU-CE-YYXX.ps.gz or BU-CE-YYXX.pdf
BU-CE-0201
TITLE: A Histogram-Based Approach for Object-Based Query-by-Shape-and-Color in
Multimedia Databases
AUTHORS: Ediz Saykol, Ugur Güdükbay and Özgür Ulusoy
ABSTRACT:
Considering the fact that most of the multimedia database systems
include querying by object features, a novel approach for color and
shape querying and retrieval is proposed. The approach is
histogram-based and it is supported by an auxiliary object extraction
tool in the query-by-feature component of a multimedia database
system. The object extraction tool is semi-automatic, hence it can
successfully capture salient object regions for most of the images
and/or video frames. The use of the object extraction tool facilitates
object-based querying of color and shape content. In the
histogram-based approach, the color and shape can be integrated to
improve the performance of querying. It is shown through performance
experiments that the histogram-based approach overcomes the
deficiencies of the existing methods for querying by shape. The
evaluation of the effectiveness and the robustness of the
histogram-based approach is also presented via performance experiments.
Keywords: Histogram-based approach, query-by-shape, query-by-color,
histogram comparison, object-based querying, kinematics model for
polygon simplification, random noise model, retrieval effectiveness.
BU-CE-0202
TITLE: Componentwise Bounds for Nearly Completely Decomposable Markov
Chains using Stochastic Comparison and Reordering
AUTHORS: Nihal Pekergin, Tugrul Dayar, and Denizhan N. Alparslan
ABSTRACT:
This paper presents an improved version of a componentwise bounding
algorithm for the state probability vector of nearly completely
decomposable Markov chains, and on an application it provides the
first numerical results with the type of algorithm discussed. The
given two-level algorithm uses aggregation and stochastic comparison
with the strong stochastic (st) order. In order to improve accuracy,
it employs reordering of states and a better componentwise probability
bounding algorithm given st upper- and lower-bounding probability
vectors. A thorough analysis of the two-level algorithm from the point
of view of irreducibility is provided. Results in sparse storage show
that there are cases in which the given algorithm proves to be useful.
BU-CE-0203
TITLE: Permuting Sparse Rectangular Matrices into Block-Diagonal Form
AUTHORS: Cevdet Aykanat, Ali Pinar, and Umit V. Catalyurek
ABSTRACT:
This work investigates the problem of permuting a sparse rectangular
matrix into block diagonal form.
Block diagonal form of a matrix grants an inherent parallelism for the
solution of the deriving problem, which has been recently investigated
in the context of mathematical programming, LU factorization and
QR factorization.
We propose bipartite graph and hypergraph models to represent the nonzero
structure of a matrix, which reduce the permutation problem to those
of graph partitioning by vertex separator and hypergraph partitioning,
respectively.
Besides proposing the models to represent sparse matrices and
investigating related combinatorial problems, we provide a detailed
survey of relevant literature to bridge the gap between different
societies, investigate existing techniques for partitioning and
propose new ones, and finally present a thorough empirical study of
these techniques.
Our experiments on a wide range of matrices, using state-of-the-art graph
and hypergraph partitioning tools MeTiS and PaToH, revealed that the
proposed methods yield very effective solutions both in terms of
solution quality and runtime.
BU-CE-0204
TITLE: Efficient Parallel Frequency Mining Based On a Novel Top-Down
Partitioning Scheme for Transactional Data
AUTHORS: Eray Ozkural
ABSTRACT:
In recent years, large quantities of data have been amassed with advances in
data acquisition capabilities. Automated detection of useful information is
required for vast data obtained from scientific and business domains. Data
Mining is the application of efficient algorithmic solutions on a variety of
immense data for such knowledge discovery.
Frequency mining discovers all frequent patterns in a transaction or
relational database and it comprises the core of several data mining
algorithms such as association rule mining and sequence mining. Frequency
pattern discovery has become a challenge for parallel programming since it
is a highly complex operation on huge datasets demanding efficient and
scalable algorithms.
In this thesis, we propose a new family of parallel frequency mining
algorithms. We introduce a novel transaction set partitioning scheme that
can be used to divide the frequency mining task in a top-down fashion. The
method operates on the graph of frequent patterns with length two (GF2) from
which a graph partitioning by vertex separator (GPVS) is mapped to a two-way
partitioning on the transaction set. The two parts obtained can be mined
independently and therefore can be utilized for concurrency. In order for
this property to hold, there is an amount of replication dictated by the
separator in GF2 which is minimized by the GPVS algorithm. A k-way
partitioning is derived from recursive application of 2-way partitioning
scheme which is used in the design of a generic parallel frequency mining
algorithm. First we compute GF2 in parallel, succeeding that we designate a
k-way partitioning of the database for k processors with a parallel
recursive procedure. The database is redistributed such that each processor
is assigned one part. Subsequent mining proceeds simultaneously and
independently at each processor with a given serial mining algorithm. A
complete implementation in which we employ FP-Growth as the sequential
algorithm has been achieved. The performance study of the algorithm on a
Beowulf system demonstrates favorable performance for synthetic databases.
For hard instances of the problem, we have gained approximately twice the
speedup of a state-of-the-art algorithm.
We also present a correction and optimization to FP-Growth algorithm.
Keywords:
Parallel Data Mining, Frequency Mining
BU-CE-0205
TITLE: ARG: A Tool for Automatic Report Generation
AUTHORS: Murat Karakaya and H. Altay Güvenir
ABSTRACT:
The expansion of on-line text with the rapid growth of the Internet
imposes utilizing Data Mining techniques to reveal the information
embedded in these documents. Therefore text classification and text
summarization are two of the most important application areas. In this
work, we attempt to integrate these two techniques to help the user to
compile and extract the information that is needed. Basically, we
propose a two-phase algorithm in which the paragraphs in the documents
are first classified according to given topics and then each topic is
summarized to constitute the automatically generated report.
Keywords: Data mining, text summarization, text classification,
automatic report generation.
BU-CE-0206
TITLE: Diagnosis of Gastric Carcinoma Tumors by Classification on Feature Projections
AUTHORS: H. Altay Güvenir, Narin Emeksiz and Necati Örmeci
ABSTRACT:
A new classification algorithm, called BCFP (for Benefit Maximizing
Classifier on Feature Projections), is developed and applied to the
problem of diagnosis of gastric carcinoma tumors. The domain contains
records of patients with known diagnosis through gastroscopy
results. Given a training set of such records, the BCFP classifier
learns how to differentiate a new case in the domain. BCFP represents
a concept in the form of feature projections on each feature dimension
separately. Classification in the BCFP algorithm is based on a voting
among the individual predictions made on each feature. In the gastric
carcinoma tumors domain, a lesion can be an indicator of one of nine
different levels of gastric carcinoma tumors, from early to late
stages. The benefit of correct classification of early levels is much
more than that of late cases. Also, the cost of wrong classification
is not symmetric. In the training phase, the BCFP algorithm learns
classification rules that maximize the benefit of classification. In
the querying phase, using these rules, the BCFP algorithm tries to
make a prediction maximizing the benefit. A genetic algorithm is
applied to select the relevant features. The performance of the BCFP
classifier is evaluated in terms of accuracy and running time. It is
also compared with human experts.
Keywords: Machine learning, voting, multi-class classification,
benefit maximization, feature selection, gastric carcinoma tumors.
BU-CE-0207
TITLE: Predicting Next Page Access by Time Length Reference in the
Scope of Effective Use of Resources
AUTHORS: Berkan Yalcinkaya
ABSTRACT:
Access log file is like a box of treasure waiting to be exploited
containing valuable information for the web usage mining system. We
can convert this information hidden in the access log files into
knowledge by analyzing them. Analysis of web server access data can
help understand the user behavior and provide information on how to
restructure a web site for increased effectiveness, thereby improving
the design of this collection of resources. We designed and developed
a new system in this thesis to make dynamic recommendation according
to the interest of the visitors by recognizing them through the
web. The system keeps all user information and uses this information
to recognize the other user visiting the web site. After the visitor
is recognized, the system checks whether she/he has visited the web
site before or not. If the visitor has visited the web site before, it
makes recommendation according to his/her past actions. Otherwise, it
makes recommendation according to the visitors coming from the parent
domain. Here, "parent domain" identifies the domain in which the
identity belongs to. For instance, "bilkent.edu.tr" is the parent
domain of the "cs.bilkent.edu.tr". The importance of the pages that
the visitors are really interested in and the identity information
forms the skeleton of the system. The assumption that the amount of
time a user spends on page correlates to whether the page should be
classified as a navigation or content page for that user. The other
criterion, the identity information, is another important point of the
thesis. In case of having no recommendation according to the past
experiences of the visitor, the identity information is located into
appropriate parent domain or class to get other recommendation
according to the interests of the visitors coming from its parent
domain or class because we assume that the visitors from the same
domain will have similar interests. Besides, the system is designed in
such a way that it uses the resources of the system
efficiently. "Memory Management", "Disk Capacity" and "Time Factor"
options have been used in our system in the scope of "Efficient Use of
the Resources" concept. We have tested the system on the web site of
CS Department of Bilkent University. The results of the experiments
have shown the efficiency and applicability of the system.
Keywords:
Access log file, personalization, identity information, recommendation.
BU-CE-0208
TITLE: Benefit Maximizing Classification Using Feature Intervals
AUTHORS: Nazli Ikizler
ABSTRACT:
For a long time, classification algorithms have focused on minimizing
the quantity of prediction errors by assuming that each possible error
has identical consequences. However, in many real-world situations,
this assumption is not convenient. For instance, in a medical
diagnosis domain, misdiagnosing a sick patient as healthy is much more
serious than its opposite. For this reason, there is a great need for
new classification methods that can handle asymmetric cost and benefit
constraints of classifications. In this thesis, we discuss
cost-sensitive classification concepts and propose a new
classification algorithm called Benefit Maximization with Feature
Intervals (BMFI) that uses the feature projection based knowledge
representation. In the framework of BMFI, we introduce five different
voting methods that are shown to be effective over different
domains. A number of generalization and pruning methodologies based on
benefits of classification are implemented and experimented. Empirical
evaluation of the methods has shown that BMFI exhibits promising
performance results compared to recent wrapper cost-sensitive
algorithms, despite the fact that classifier performance is highly
dependent on the benefit constraints and class distributions in the
domain. In order to evaluate cost-sensitive classification techniques,
we describe a new metric, namely benefit accuracy which computes the
relative accuracy of the total benefit obtained with respect to the
maximum possible benefit achievable in the domain.
Keywords: machine learning, classification, cost-sensitivity, benefit
maximization, feature intervals, voting, pruning.
BU-CE-0209
TITLE: KiMPA: A Kinematics-Based Method for Polygon Approximation
AUTHORS: Ediz Saykol, Gurcan Gulesir, Ugur Gudukbay, Ozgur Ulusoy
ABSTRACT:
In different types of information systems, such as multimedia
information systems and geographic information systems, object-based
information is represented via polygons corresponding to the
boundaries of object regions. In many applications, the polygons have
large number of vertices and edges, thus a way of representing the
polygons with less number of vertices and edges is developed. This
approach, called polygon approximation, or polygon simplification, is
basically motivated with the difficulties faced in processing polygons
with large number of vertices. Besides, large memory usage and disk
requirements, and the possibility of having relatively more noise can
also be considered as the reasons for polygon simplification. In this
paper, a kinematics-based method for polygon approximation is
proposed. The vertices of polygons are simplified according to the
velocities and accelerations of the vertices with respect to the
centroid of the polygon. Another property of the proposed method is
that the user may set the number of vertices to be in the approximated
polygon, and may hierarchically simplify the output. The approximation
method is demonstrated through the experiments based on a set of
polygonal objects.
BU-CE-0210
TITLE:
Rule-Based Spatio-Temporal Query Processing for Video Databases
AUTHORS: Mehmet Emin Dönderler, Özgür Ulusoy and Ugur Güdükbay
ABSTRACT:
In our earlier work, we proposed a novel architecture
for a Web-based video database management system providing an
integrated support for both spatio-temporal and semantic video
queries. In this paper, we focus on the task of query processing for
spatio-temporal queries, which may contain any combination of
directional, topological, 3D-relation, external-predicate,
object-appearance, trajectory-projection, and similarity-based
object-trajectory conditions. We also propose an SQL-like textual
video query language that has the capability to handle a broad range
of spatio-temporal queries. The query language is rule-based in that
it allows users to express the spatial query conditions on video data
in terms of Prolog-type predicates.
In our video database management system, called BilVideo (Bilkent
Video), spatio-temporal query processing is carried out in five main
stages: lexical analysis (lexer), syntax checking and parse-tree
production (parser), query decomposition and query-tree construction,
subquery and interval processing, and final query result formation.
Keywords: Spatio-temporal relations, spatio-temporal query processing,
content-based retrieval, knowledge representation, inference rules,
video databases, multimedia databases.
BU-CE-0211
TITLE: Data Modeling and Querying for Video Databases
AUTHORS: Mehmet Emin Dönderler
ABSTRACT:
With the advances in information technology, the amount of multimedia
data captured, produced and stored is increasing rapidly. As a
consequence, multimedia content is widely used for many applications
in today's world, and hence, a need for organizing this data and
accessing it from repositories with vast amount of information has
been a driving stimulus both commercially and academically. In
compliance with this inevitable trend, first image and especially
later video database management systems have attracted a great deal of
attention since traditional database systems are not suitable to be
used for multimedia data.
In this thesis, a novel architecture for a video database system is
proposed. The architecture is original in that it provides full
support for spatio-temporal queries that contain any combination of
spatial, temporal, object-appearance, external-predicate,
trajectory-projection and similarity-based object-trajectory
conditions by a rule-based system built on a knowledge-base, while
utilizing an object-relational database to respond to semantic
(keyword, event/activity and category-based) and low-level (color,
shape and texture) video queries. Research results obtained from this
thesis work have been realized by a prototype video database
management system, which we call BilVideo. Its tools, Fact-Extractor
and Video-Annotator, its Web-based visual query interface and its
SQL-like textual query language are presented. Moreover, the query
processor of BilVideo and our spatio-temporal query processing
strategy are also discussed.
Keywords: Video databases, multimedia databases, information
systems, video data modeling, content-based retrieval, spatio-temporal
relations, spatio-temporal query processing, video query languages.
BU-CE-0212
TITLE: Data Modeling and Querying for Video Databases
AUTHORS: Eyup Sabri Konak
ABSTRACT:
In recent years, very large collections of images and videos have
grown rapidly. In parallel with this growth, content-based
retrieval and querying the indexed collections are required to
access visual information. Two of the main components of the
visual information are texture and color. In this thesis, a
content-based image retrieval system is presented that computes
texture and color similarity among images. The underlying
technique is based on the adaptation of a statistical approach to
texture analysis. An optimal set of five second-order texture
statistics are extracted from the Spatial Grey Level Dependency
Matrix of each image, so as to render the feature vector for each
image maximally informative, and yet to obtain a low vector
dimensionality for efficiency in computation. The method for color
analysis is the color histograms, and the information captured
within histograms is extracted after a pre-processing phase that
performs color transformation, quantization, and filtering. The
features thus extracted and stored within feature vectors are
later compared with an intersection-based method. The system is
also extended for pre-processing images to segment regions with
different textural quality, rather than operating globally over
the whole image. The system also includes a framework for
object-based color and texture querying, which might be useful for
reducing the similarity error while comparing rectangular regions
as objects. It is shown through experimental results and
precision-recall analysis that the content-based retrieval system
is effective in terms of retrieval and scalability.
Keywords: Texture Analysis, Color Histograms, Texture Similarity
Measurement, Content-Based Image Retrieval, Image Databases.
BU-CE-0213
TITLE: An Efficient Query Optimization Strategy for Spatio-Temporal Queries in Video Databases
AUTHORS: Gulay Unel
ABSTRACT:
The interest for multimedia database management systems has grown
rapidly due to the need for the storage of huge volumes of multimedia
data in computer systems. An important building block of a multimedia
database system is the query processor, and a query optimizer embedded
to the query processor is needed to answer user queries
efficiently. Query optimization problem has been widely studied for
conventional database systems, however it is a new research area for
multimedia database systems. Due to the differences in query
processing strategies, query optimization techniques used in
multimedia database systems are different from those used in
traditional databases. In this paper, a query optimization strategy is
proposed for processing spatio-temporal queries in video database
systems. The proposed strategy includes reordering algorithms to be
applied on query execution tree. The performance results obtained by
testing the reordering algorithms on different query sets are also
presented.
Keywords:
Video databases, query optimization, query tree, querying of video data.
BU-CE-0214
TITLE: A Semantic Data Model and Query Language for Video Data
AUTHORS: Umut Arslan
ABSTRACT:
Advances in compression techniques, decreasing cost of storage, and
high-speed transmission have facilitated the way video is created,
stored and distributed. As a consequence, video is now being used in
many application areas. The increase in the amount of video data
deployed and used in today's applications not only caused video to
draw more attention as a multimedia data type, but also led to the
requirement of efficient management of video data. Management of video
data paved the way for new research areas, such as indexing and
retrieval of videos with respect to their spatio-temporal, visual and
semantic contents. In this thesis, semantic content of video is
studied, where video metadata, activities, actions and objects of
interest are considered within the context of video semantic
content. A data model is proposed to model video semantic content,
which is extracted from video data by a video annotation tool. A video
query language is also provided to support semantic queries on video
data.
The outcome of this thesis work will be used in a project, which
proposes a video database system architecture with spatio-temporal,
object-trajectory and object-apperance query facilities so as to build
a complete video search system to query video data by its
spatio-temporal, visual and semantic features.
Keywords:
Video databases, semantic video modeling, annotation of video data,
semantic querying of video data
BU-CE-0215
TITLE: Modeling and Animating Personalized Faces
AUTHORS: Fatih Erol
ABSTRACT:
A very important and challenging problem in computer graphics is
modeling and animation of individualized face models. In this thesis,
we describe a facial modeling and animation system attempting to
address this problem. The system uses muscle-based generic face model
and deforms it using deformation techniques to model individualized
faces. Two orthogonal photos of the real faces are used for this
purpose. Image processing techniques are employed to extract certain
features on the photographs, which are then refined manually by the
user through the facilities of the user interface of the system. The
feature points located on the frontal and side views of a real face
are used to deform the generic model. Then, the muscle vectors in the
individualized face model are arranged accordingly. Individualized
face models produced in this manner are animated using parametric
interpolation techniques.
Keywords:
Facial animation, rendering, texture mapping, deformation, feature
extraction.
BU-CE-0216
TITLE: Simplification of Tetrahedral Meshes by Scalar Value Assignment
AUTHORS: Emre Can Sezer
ABSTRACT:
A new approach to simplification of volumetric data over an
unstructured tetrahedral mesh is presented. The data consist of sample
values of a scalar field defined over a spatial domain, which is
subdivided with a tetrahedral mesh. Simplification is performed by
means of contraction of the tetrahedra and also of the edges. The
simplification algorithm can provide a continuum of aproximate models
of the given dataset with any desired degree of accuracy. Hence, the
simplification method is suitable for multi-resolution modeling.
The novelty of the approach comes from the arbitrariness in the
selection of the point to which a tetrahedron or an edge is
contracted. Unlike most of the existing methods, the final vertex of
the contraction need not be a vertex of the original mesh. The scalar
value to be assigned to the final vertex of contraction is determined
by an extremely simple method (both conceptually and computationally),
which also provides an estimate of the error of simplification.
The proposed method is applied to two volumetric grids to illustrate
its effectiveness in simplification of volumetric data.
Keywords:
Volumetric dataset, tetrahedral mesh, tetrahedron contraction,
edge contraction, scalar value assignment.
BU-CE-0217
TITLE: Comparison of Four Approximating Subdivision Surface Schemes
AUTHORS: Tekin Kabasakal
ABSTRACT:
The idea of subdivision surfaces was first introduced in 1978, and
there are many methods proposed till now. A subdivision surface is
defined as the limit of repeated recursive refinements. In this
thesis, we studied the properties of approximating subdivision surface
schemes. We started by modeling a complex surface with splines that
typically requires a number of spline patches, which must be smoothly
joined, making splines burdensome to use. Unlike traditional spline
surfaces, subdivision surfaces are defined
algorithmically. Subdivision schemes generalize splines to domains of
arbitrary topology. Thus, subdivision functions can be used to model
complex surfaces without the need to deal with patches. We studied
four well-known schemes Catmull-Clark, Doo-Sabin, Loop and the
sqrt(3)-subdivision. The first two of these schemes are quadrilateral
and the other two are triangular surface subdivision schemes. Modeling
sharp features, such as creases, corners or darts, using subdivision
schemes requires some modifications in subdivision procedures and
sometimes special tagging in the mesh. We developed the rules of
sqrt(3)-subdivision to model such features and compared the results
with the extended Loop scheme. We have implemented exact normals of
Loop and p3-subdivision since using interpolated normals causes
creases and other sharp features to appear smooth.
Keywords:
Computational geometry and object modeling, subdivision surfaces,
Loop, Catmull-Clark, Doo-Sabin, sqrt(3)-subdivision, modeling sharp features.
BU-CE-0218
TITLE: On-Line New Event Detection and Event Clustering Using the
Concepts of the Cover Coefficient Based Clustering Methodology
AUTHORS: Ahmet Vural
ABSTRACT:
In this study, we use the concepts of the cover coefficient-based
clustering methodology (C3M) for on-line new event
detection and event clustering. The main idea of the study is to use
the seed selection process of the C3M algorithm for the
purpose of detecting new events. Since C3M works in a
retrospective manner, we modify the algorithm to work in an on-line
environment. Furthermore, in order to prevent producing oversized
event clusters, and to give equal chance to all documents to be the
seed of a new event, we employ the window size concept. Since we
desire to control the number of seed documents, we introduce a
threshold concept to the event clustering algorithm. We also use the
threshold concept, with a little modification, in the on-line event
detection. In the experiments we use TDT1 corpus, which is also used
in the original topic detection and tracking study. In event
clustering, we use both binary and weighted versions of TDT1 corpus.
With the binary implementation, we obtain better results. When we
compare our on-line event detection results to the results of UMASS
approach, we obtain better performance in terms of false alarm rates.
Keywords:
Clustering, on-line event clustering, on-line event detection.