We described BilVideo-7, our prototype MPEG-7 compatible video database system, which supports different types of multimodal queries in an integrated way. To our knowledge, BilVideo-7 is the most comprehensive MPEG-7 compatible video database system currently available in terms of the range of MPEG-7 descriptors supported and the variety of querying options. The MPEG-7 profile used to represent the videos, together with the flexible query processing and bottom-up subquery result fusion architecture, enables the system to respond to complex queries. The user can formulate very complex queries easily using the Visual Query Interface. Its Composite Query Interface is novel in that a query is formulated by describing a video segment as a composition of several video segments along with their descriptors. The broad functionality of the system was demonstrated with sample queries, which the system handles effectively. The retrieval performance depends heavily on the MPEG-7 descriptors and the distance measures used. The low-level MPEG-7 descriptors have been found effective, which is consistent with our observations, and they are therefore widely used by researchers in the computer vision, pattern recognition and multimedia retrieval communities. We will investigate distance measures other than the ones recommended by MPEG-7 [ref].
The multi-threaded query execution architecture is suitable for parallelization, which is required to keep the response time of the system at interactive rates for video databases of realistic size. In a parallel architecture, each query processing node may keep the data for a subset of descriptions (e.g., text, color, texture, shape) and execute only the relevant subqueries. A central Query Processor can coordinate the operation of the query processing nodes.
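The parallel arrangement outlined above can be illustrated with a minimal sketch. It is not the BilVideo-7 implementation: the class names (`QueryNode`, `QueryProcessor`, `Subquery`), the precomputed toy similarity scores, and the weighted-average fusion rule are all assumptions made for illustration, and threads stand in for what would be separate machines in a real deployment.

```python
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass


@dataclass(frozen=True)
class Subquery:
    modality: str   # e.g. "text", "color", "texture", "shape"
    weight: float   # relative importance in the fused score (assumed)


class QueryNode:
    """Hypothetical node holding data for a subset of description types."""

    def __init__(self, scores_by_modality):
        # modality -> {segment_id: similarity in [0, 1]}; toy precomputed values
        self.scores_by_modality = scores_by_modality

    def handles(self, modality):
        return modality in self.scores_by_modality

    def execute(self, subquery):
        # A real node would match descriptors here; this returns stored scores.
        return dict(self.scores_by_modality[subquery.modality])


class QueryProcessor:
    """Central coordinator: routes subqueries to nodes and fuses the results."""

    def __init__(self, nodes):
        self.nodes = nodes

    def run(self, subqueries):
        with ThreadPoolExecutor(max_workers=max(1, len(subqueries))) as pool:
            futures = []
            for sq in subqueries:
                # Send each subquery only to a node that stores its modality.
                node = next(n for n in self.nodes if n.handles(sq.modality))
                futures.append((sq, pool.submit(node.execute, sq)))
            results = [(sq, f.result()) for sq, f in futures]
        return self.fuse(results)

    @staticmethod
    def fuse(results):
        # Assumed fusion rule: weighted average of subquery similarities;
        # a segment missing from a subquery result contributes 0 for it.
        total = sum(sq.weight for sq, _ in results)
        fused = {}
        for sq, scores in results:
            for seg, s in scores.items():
                fused[seg] = fused.get(seg, 0.0) + sq.weight * s / total
        return sorted(fused.items(), key=lambda kv: kv[1], reverse=True)


nodes = [
    QueryNode({"text": {"seg1": 1.0, "seg2": 0.5}}),
    QueryNode({"color": {"seg1": 0.5, "seg3": 1.0}, "texture": {"seg2": 1.0}}),
]
processor = QueryProcessor(nodes)
ranking = processor.run([Subquery("text", 2.0), Subquery("color", 2.0)])
print(ranking)
```

In this sketch the texture data on the second node is never touched, because the query contains no texture subquery; this mirrors the idea that each node executes only the subqueries relevant to the data it holds.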
The major bottleneck for the system is the manual generation of the MPEG-7 representations of videos, which is time-consuming, error-prone and affected by human subjectivity. This hinders the construction of a video database of realistic size. Therefore, our current focus is on equipping the MPEG-7 compatible video feature extraction and annotation tool with as many automatic processing capabilities as possible, in order to reduce manual processing time, error and human subjectivity during region selection and annotation.
Finally, future versions of BilVideo-7 will also support representation and querying of audio and image data. The multimodal query processing architecture makes it easy to add new descriptors in new modalities (e.g., audio descriptors). Images can be considered a special case of Keyframes, which are decomposed into Still Regions, and hence can be supported easily.