MPEG-7 is an ISO/IEC standard developed by MPEG (Moving Picture Experts Group), the committee that also developed the MPEG-1, MPEG-2 and MPEG-4 standards. Unlike those earlier standards, which encode the audiovisual content itself, MPEG-7 is designed to describe the content of multimedia. It is formally called "Multimedia Content Description Interface" and was announced in 2001.
MPEG-7 offers a comprehensive set of audiovisual description tools in the form of Descriptors (D) and Description Schemes (DS) that describe the multimedia data, forming a common basis for applications and enabling efficient and effective access to the data. The Description Definition Language (DDL) is based on W3C XML Schema, with some MPEG-7-specific extensions such as vectors and matrices. MPEG-7 documents are therefore XML documents that conform to particular MPEG-7 schemas for describing multimedia content. Descriptors describe features, attributes or groups of attributes of multimedia content. Description Schemes describe entities or relationships pertaining to multimedia content; they specify the structure and semantics of their components, which may be Description Schemes, Descriptors or datatypes.
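For concreteness, the sketch below shows what a small MPEG-7-style description might look like as XML and how it can be read with Python's standard library. The fragment is schematic: real MPEG-7 documents declare namespaces, use xsi:type attributes, and must validate against the MPEG-7 schemas.

```python
# Schematic MPEG-7-style description fragment (simplified: no namespaces
# or xsi:type attributes), parsed with the standard library.
import xml.etree.ElementTree as ET

fragment = """
<Mpeg7>
  <Description>
    <MultimediaContent>
      <Image>
        <VisualDescriptor type="DominantColorType">
          <Value>14 3 9</Value>
        </VisualDescriptor>
      </Image>
    </MultimediaContent>
  </Description>
</Mpeg7>
"""

root = ET.fromstring(fragment)
for d in root.iter("VisualDescriptor"):
    print(d.get("type"), "->", d.findtext("Value"))
```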
The MPEG-7 eXperimentation Model (XM) Reference Software is the framework for all reference code of the MPEG-7 standard and implements its normative components. MPEG-7 standardizes the description of multimedia content but does not specify how the description is produced: it is up to MPEG-7-compatible application developers how descriptors are extracted from the multimedia, provided that the output conforms to the standard. MPEG-7 Visual Description Tools consist of basic structures and Descriptors that cover the following basic visual features: color, texture, shape, motion, localization, and face recognition.
Color Descriptors: Color Structure Descriptor (CSD) represents an image by both the color distribution and the spatial structure of the colors. Scalable Color Descriptor (SCD) is a Haar-transform-based encoding of a color histogram in the HSV color space. Dominant Color Descriptor (DCD) specifies up to eight representative (dominant) colors in an image or image region. Color Layout Descriptor (CLD) is a compact, resolution-invariant descriptor that efficiently represents the spatial distribution of colors. The GoF/GoP Descriptor captures color-based features of multiple images or frames in a video segment and is an alternative to single-keyframe representations of video segments; it is obtained by aggregating the histograms of the individual images or frames and encoding the final histogram with the Scalable Color Descriptor (SCD).
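To illustrate the flavor of the SCD, the sketch below computes a coarse hue histogram and applies one level of a Haar transform, so the descriptor can be truncated to fewer coefficients for scalability. It is a simplified stand-in, not the normative extraction (the real SCD uses a finer HSV histogram and a full Haar decomposition).

```python
# Illustrative sketch of a Scalable Color-style descriptor (not the
# normative MPEG-7 SCD): quantize pixels into a hue-only histogram,
# then apply one Haar level (low-pass sums, high-pass differences).
import colorsys

def scalable_color_sketch(pixels, bins=16):
    """pixels: iterable of (r, g, b) tuples with components in [0, 1]."""
    hist = [0] * bins
    for r, g, b in pixels:
        h, s, v = colorsys.rgb_to_hsv(r, g, b)
        hist[min(int(h * bins), bins - 1)] += 1  # hue-only quantization
    # One Haar decomposition level over neighboring bins.
    sums = [(hist[i] + hist[i + 1]) / 2 for i in range(0, bins, 2)]
    diffs = [(hist[i] - hist[i + 1]) / 2 for i in range(0, bins, 2)]
    return sums + diffs  # drop trailing diffs for a coarser descriptor

print(scalable_color_sketch([(1, 0, 0), (0, 1, 0), (0, 0, 1)]))
```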
Texture Descriptors: Edge Histogram Descriptor (EHD) specifies the spatial distribution of edges in an image. Homogeneous Texture Descriptor (HTD) characterizes the texture of a region using the mean energy and energy deviation of a set of frequency channels, which are modeled with Gabor functions. Texture Browsing Descriptor (TBD) characterizes textures perceptually in terms of regularity, coarseness and directionality.
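The sketch below illustrates the EHD idea: divide the image into 4 x 4 sub-images, classify small blocks into five edge categories with simple 2 x 2 filters, and histogram the results (80 bins in total). The filter masks and threshold are illustrative choices, not claimed to be the normative values.

```python
# Simplified Edge Histogram-style sketch (not the normative MPEG-7 EHD):
# per 2x2 block, pick the strongest of five edge filters (vertical,
# horizontal, 45-degree, 135-degree, non-directional) and count it in
# the histogram of the enclosing sub-image.
import numpy as np

FILTERS = {
    "vertical":        np.array([[1, -1], [1, -1]], dtype=float),
    "horizontal":      np.array([[1, 1], [-1, -1]], dtype=float),
    "diag_45":         np.array([[2 ** 0.5, 0], [0, -(2 ** 0.5)]]),
    "diag_135":        np.array([[0, 2 ** 0.5], [-(2 ** 0.5), 0]]),
    "non_directional": np.array([[2, -2], [-2, 2]], dtype=float),
}

def edge_histogram_sketch(img, threshold=10.0):
    """img: 2-D grayscale array; returns 4 x 4 x 5 = 80 histogram bins."""
    h, w = img.shape
    hists = np.zeros((4, 4, len(FILTERS)))
    for y in range(0, h - 1, 2):
        for x in range(0, w - 1, 2):
            block = img[y:y + 2, x:x + 2]
            responses = [abs((block * f).sum()) for f in FILTERS.values()]
            best = int(np.argmax(responses))
            if responses[best] > threshold:  # ignore flat blocks
                hists[4 * y // h, 4 * x // w, best] += 1
    return hists.reshape(-1)

print(edge_histogram_sketch(np.random.rand(64, 64) * 255).shape)  # (80,)
```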
Shape Descriptors: Contour Shape Descriptor (CShD) describes the closed contour of a 2-D region based on the Curvature Scale Space (CSS) representation of the contour. Region Shape Descriptor (RSD) uses the Angular Radial Transform (ART) to describe the shape of regions composed of a single connected region, multiple connected regions, or regions with holes; it considers all pixels constituting the shape, both boundary and interior.
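The CSS idea behind the Contour Shape Descriptor can be sketched as follows: smooth the contour with Gaussians of increasing scale and track where curvature zero-crossings survive; the scales at which they vanish underlie the descriptor. This is an illustrative sketch only (assuming SciPy for the Gaussian filtering), not the normative extraction.

```python
# Sketch of the Curvature Scale Space idea: concavities of a star-shaped
# contour produce curvature zero-crossings that disappear as the
# smoothing scale (sigma) grows and the contour becomes convex.
import numpy as np
from scipy.ndimage import gaussian_filter1d

def _cyclic_gradient(a):
    """Central differences on a closed (cyclic) sequence."""
    return (np.roll(a, -1) - np.roll(a, 1)) / 2.0

def css_zero_crossings(x, y, sigma):
    """x, y: closed-contour coordinates; returns curvature sign-flip indices."""
    xs = gaussian_filter1d(x, sigma, mode="wrap")
    ys = gaussian_filter1d(y, sigma, mode="wrap")
    dx, dy = _cyclic_gradient(xs), _cyclic_gradient(ys)
    ddx, ddy = _cyclic_gradient(dx), _cyclic_gradient(dy)
    curvature = (dx * ddy - dy * ddx) / (dx ** 2 + dy ** 2) ** 1.5
    return np.where(np.diff(np.sign(curvature)) != 0)[0]

t = np.linspace(0, 2 * np.pi, 200, endpoint=False)
r = 1 + 0.3 * np.cos(5 * t)                     # five-pointed star contour
for sigma in (1, 4, 16):                        # crossings vanish with smoothing
    print(sigma, len(css_zero_crossings(r * np.cos(t), r * np.sin(t), sigma)))
```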
Motion Descriptors: Camera Motion Descriptor characterizes the motion of the camera itself (e.g., panning, tilting, zooming, tracking, booming, dollying and rolling). Motion Trajectory Descriptor describes the displacement of a moving object over time as a list of keypoints. Parametric Motion Descriptor describes the motion of objects or regions with a 2-D parametric model (e.g., translation, rotation, affine or perspective motion). Motion Activity Descriptor captures the intuitive notion of the intensity or pace of action in a video segment (e.g., a car chase is high activity, a news anchor shot is low activity).
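As an illustration of the idea behind the Motion Activity Descriptor's intensity element, the sketch below quantizes the standard deviation of motion-vector magnitudes into five levels. The thresholds and the helper name are our own stand-ins, not the normative values.

```python
# Toy sketch of motion-activity intensity: quantize the spread of
# motion-vector magnitudes to 1..5 (thresholds are made-up stand-ins).
import numpy as np

def motion_activity_intensity(motion_vectors, thresholds=(1.0, 2.0, 4.0, 8.0)):
    """motion_vectors: (N, 2) array of block motion vectors (dx, dy)."""
    sigma = np.linalg.norm(motion_vectors, axis=1).std()
    return 1 + sum(sigma > t for t in thresholds)  # quantized to 1..5

rng = np.random.default_rng(0)
print(motion_activity_intensity(rng.standard_normal((100, 2)) * 3))
```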
Localization Descriptors: Region Locator specifies the location of a region within an image using a box or a polygon. Spatio-temporal Locator specifies the location of a video segment within a video sequence, both spatially and temporally.
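As a toy illustration of the two region forms, one might represent them with simple data holders like these (our own ad-hoc classes, not the normative MPEG-7 syntax):

```python
# Ad-hoc stand-ins for Region Locator-style localizations: a region is
# given either as an axis-aligned box or as a polygon of vertices.
from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class BoxLocator:
    """Axis-aligned box, in image coordinates."""
    left: int
    top: int
    right: int
    bottom: int

@dataclass
class PolygonLocator:
    """Polygonal region given by its vertices, in image coordinates."""
    vertices: List[Tuple[int, int]]

print(PolygonLocator([(10, 10), (80, 12), (60, 70)]))
```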
Face Recognition Descriptor is a Principal Component Analysis (PCA)-based descriptor that represents the projection of a face onto a set of 48 basis vectors spanning the space of all possible face vectors.
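A minimal sketch of this projection step follows, assuming a normalized face image of 56 x 46 pixels; the basis vectors and mean face below are random stand-ins, whereas MPEG-7 fixes a normative set.

```python
# Sketch of the PCA projection behind the Face Recognition Descriptor.
# Basis and mean face are random stand-ins; MPEG-7 specifies a fixed,
# normative set of 48 basis vectors derived from training faces.
import numpy as np

rng = np.random.default_rng(0)
basis = rng.standard_normal((48, 56 * 46))  # stand-in for the 48 basis faces
mean_face = rng.standard_normal(56 * 46)    # stand-in for the mean face

def face_descriptor(face_image):
    """face_image: 2-D array normalized to 56 x 46; returns 48 coefficients."""
    v = face_image.reshape(-1) - mean_face  # center on the mean face
    return basis @ v                        # project onto the basis vectors

print(face_descriptor(rng.standard_normal((56, 46))).shape)  # -> (48,)
```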
In MPEG-7, the semantic content of multimedia (e.g., objects, events, concepts) can be described by text annotation (free text, keyword, structured, and dependency structure) and/or by the semantic entity and semantic relation tools. Free text annotations describe the content using unstructured natural language text (e.g., Barack Obama visits Turkey in April). Such annotations are easy for humans to understand but difficult for computers to process. Keyword annotations use a set of keywords (e.g., Barack Obama, visit, Turkey, April) and are easier for computers to process. Structured annotations strike a balance between simplicity (in terms of processing) and expressiveness. They consist of elements, each answering one of the following questions: who, what object, what action, where, when, why and how (e.g., who: Barack Obama, what action: visit, where: Turkey, when: April). Dependency structure represents the linguistic structure of an annotation based on a linguistic theory called dependency grammar, which explains a sentence's grammatical structure in terms of dependencies between its elements.
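As a rough illustration, the snippet below builds a structured annotation for the example above with Python's standard library. The element names follow the spirit of the MPEG-7 MDS schema but are simplified; real descriptions carry namespaces and nest term elements.

```python
# Schematic construction of a structured annotation (element names are
# simplified from the MPEG-7 MDS schema; real documents use namespaces).
import xml.etree.ElementTree as ET

ann = ET.Element("StructuredAnnotation")
for tag, text in [("Who", "Barack Obama"), ("WhatAction", "visit"),
                  ("Where", "Turkey"), ("When", "April")]:
    ET.SubElement(ann, tag).text = text

print(ET.tostring(ann, encoding="unicode"))
```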
More detailed descriptions of semantic entities such as objects, events, concepts, places and times can be stored using the semantic entity tools. The semantic relation tools describe the semantic relations between semantic entities using either the normative semantic relations standardized by MPEG-7 (e.g., agent, agentOf, patient, patientOf, result, resultOf, similar, opposite, user, userOf, location, locationOf, time, timeOf) or non-normative relations [1].
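To make the relation tools concrete, the toy snippet below stores entity-relation-entity triples using relation names from the normative list above; the triple representation itself is ours, not part of the standard.

```python
# Toy store of semantic entities linked by normative MPEG-7 relation
# names; the triple representation is our own, not the standard's syntax.
relations = [
    ("Barack Obama", "agentOf", "visit"),
    ("Turkey", "locationOf", "visit"),
    ("April", "timeOf", "visit"),
]

# Query: which entities are agents of the "visit" event?
print([s for s, r, o in relations if r == "agentOf" and o == "visit"])
```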
The semantic tools of MPEG-7 provide methods to create very brief or very extensive semantic descriptions of multimedia content. The choice of description tool for a system depends on the type of semantic queries to be supported and on the annotation tool to be used. Some descriptions can be obtained automatically, while most require manual labeling. Automatic speech recognition (ASR) output can be used as free text annotation of video segments. Keyword and structured annotations can be obtained automatically to some extent using state-of-the-art auto-annotation techniques. Descriptions of semantic entities and the relations between them cannot be obtained automatically with the current state of the art; therefore, a considerable amount of manual work is needed for this kind of semantic annotation.