MPEG-7 is an ISO/IEC standard developed by MPEG (Moving Picture Experts Group), the committee that also developed the MPEG-1, MPEG-2 and MPEG-4 standards. Unlike those previous MPEG standards, which address the coding of audiovisual content, MPEG-7 is designed to describe multimedia content. It is formally called "Multimedia Content Description Interface" and was announced in 2001.
MPEG-7 offers a comprehensive set of audiovisual description tools in the form of Descriptors (D) and Description Schemes (DS) that describe the multimedia data, forming a common basis for applications and enabling efficient and effective access to the data. The Description Definition Language (DDL) is based on W3C XML Schema with some MPEG-7 specific extensions, such as vectors and matrices. MPEG-7 documents are therefore XML documents that conform to particular MPEG-7 schemas for describing multimedia content. Descriptors describe features, attributes or groups of attributes of multimedia content. Description Schemes describe entities or relationships pertaining to multimedia content; they specify the structure and semantics of their components, which may be Description Schemes, Descriptors or datatypes.
The MPEG-7 eXperimentation Model (XM) Reference Software is the framework for all the reference code of the MPEG-7 standard.
It implements the normative components of MPEG-7.
MPEG-7 standardizes multimedia content description, but it does not specify how the description is produced.
It is up to developers of MPEG-7 compatible applications how descriptors are extracted from the multimedia, provided that the output conforms to the standard.
MPEG-7 Audio Description Tools consist of basic structures and Descriptors that cover basic audio features.
The MPEG-7 low-level descriptors (LLDs) form the foundation layer of the standard. This layer consists of a collection of simple, low-complexity audio features that can be used to characterize any type of sound. The LLDs give the standard flexibility, allowing new applications to be built in addition to those that can be designed with the MPEG-7 high-level tools. The foundation layer comprises a series of 18 generic LLDs, each consisting of a normative part (the syntax and semantics of the descriptor) and an optional, non-normative part that recommends possible extraction and/or similarity matching methods. The temporal and spectral LLDs can be classified into the following groups:
Basic Descriptors
- AudioWaveform (AWF): The AudioWaveform Descriptor describes the audio waveform envelope, typically for display purposes, and allows economical display of an audio waveform. For example, a sound editing application can display a summary of an entire audio file immediately without processing the audio data, and the data may be displayed and edited over a network. Whatever the number of samples, the waveform may be displayed using a small set of values that represent the extrema (min and max) of frames of samples. The min and max values are stored as a scalable time series within the AudioWaveform Descriptor. They may also be used for fast comparison between waveforms.
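The per-frame min/max envelope described above can be sketched in a few lines. MPEG-7 does not mandate an extraction method, so this is a non-normative illustration; the frame size here is a plain sample count standing in for the descriptor's hopSize.

```python
import numpy as np

def audio_waveform(signal, frame_size):
    """Per-frame (min, max) pairs: a sketch of the AudioWaveform envelope.

    frame_size is an assumption standing in for the hopSize attribute of
    a real MPEG-7 extractor.
    """
    n_frames = len(signal) // frame_size
    frames = np.asarray(signal[:n_frames * frame_size]).reshape(n_frames, frame_size)
    return frames.min(axis=1), frames.max(axis=1)

# 64 samples summarized by 8 (min, max) pairs, 8 samples per frame.
sig = np.sin(2 * np.pi * np.arange(64) / 16)
mins, maxs = audio_waveform(sig, 8)
```

However long the input, a display only needs the `n_frames` extrema pairs, which is what makes the descriptor economical.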
- AudioPower (AP): The AudioPower Descriptor describes the temporally smoothed instantaneous power (the square of the waveform values), P(t) = |s(t)|², a useful measure of the amplitude of a signal as a function of time. Instantaneous power is calculated by squaring the waveform samples; these values are averaged over time intervals of length corresponding to hopSize and stored in the Mean field of a SeriesOfScalarType. In association with the AudioSpectrumCentroid and AudioSpectrumSpread Descriptors, the AudioPower Descriptor provides an economical description of the power spectrum (spreading the power over the spectral range specified by the centroid and spread) that can be compared with a log-frequency spectrum. Another possibility is to store instantaneous power at high temporal resolution, together with a high-spectral-resolution power spectrum at low temporal resolution, to obtain a cheap representation of the power spectrum that combines both spectral and temporal resolution. Instantaneous power is coherent with the power spectrum: a signal labeled with the former can meaningfully be compared to a signal labeled with the latter. Note, however, that the temporal smoothing operations are not quite the same, so values may differ slightly for identical signals.
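The averaging of squared samples over hopSize-length intervals can be sketched as follows; the hop is again a plain sample count, a non-normative simplification.

```python
import numpy as np

def audio_power(signal, hop):
    """Mean of squared samples per hop-sized frame (temporally smoothed
    instantaneous power): a sketch of the AudioPower descriptor."""
    n_frames = len(signal) // hop
    frames = np.asarray(signal[:n_frames * hop]).reshape(n_frames, hop)
    return (frames ** 2).mean(axis=1)

# A unit-amplitude sine with a 16-sample period: averaging the squared
# samples over one full period gives exactly A^2 / 2 = 0.5 per frame.
sig = np.sin(2 * np.pi * np.arange(160) / 16)
p = audio_power(sig, 16)
```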
Basic Spectral Descriptors
- AudioSpectrumEnvelope (ASE): The AudioSpectrumEnvelope Descriptor describes the short-term power spectrum of the audio waveform as a time series of spectra with a logarithmic frequency axis. It may be used to display a spectrogram, to synthesize a crude "auralization" of the data, or as a general-purpose descriptor for search and comparison. A logarithmic frequency axis is used to reconcile the requirements of concision and descriptive power; peripheral frequency analysis in the ear roughly follows a logarithmic axis. The power spectrum is used because of its scaling properties (the power spectrum over an interval is equal to the sum of power spectra over subintervals).
- AudioSpectrumCentroid (ASC): The AudioSpectrumCentroid Descriptor describes the center of gravity of the log-frequency power spectrum, defined as the power-weighted log-frequency centroid. The spectrum centroid is an economical description of the shape of the power spectrum. It indicates whether the power spectrum is dominated by low or high frequencies and, additionally, it is correlated with a major perceptual dimension of timbre, i.e., sharpness.
There are many different ways to design a spectrum centroid, according to the scale used for the values (amplitude, power, log power, cubic root power, etc.) and frequencies (linear or logarithmic scale) of the spectrum coefficients. Perceptual weighting and masking can also be taken into account in more sophisticated measures. This particular design of the AudioSpectrumCentroid Descriptor was chosen to be coherent with other descriptors, in particular the AudioSpectrumEnvelope Descriptor, so that a signal labeled with the former can reasonably be compared to a signal labeled with the latter.
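A power-weighted centroid on a log-frequency axis can be sketched as below. The 1 kHz reference point and the 62.5 Hz floor follow MPEG-7 conventions, but a conformant extractor also folds low-frequency bins together, which this non-normative sketch omits.

```python
import numpy as np

def spectrum_centroid_logfreq(power, freqs_hz):
    """Power-weighted centroid on a log-frequency axis, in octaves
    relative to 1 kHz: a sketch of AudioSpectrumCentroid."""
    freqs_hz = np.maximum(freqs_hz, 62.5)   # avoid log of very low/zero bins
    logf = np.log2(freqs_hz / 1000.0)
    return float(np.sum(power * logf) / np.sum(power))

# All power at 2 kHz: the centroid lands one octave above 1 kHz.
p = np.array([0.0, 0.0, 1.0, 0.0])
f = np.array([500.0, 1000.0, 2000.0, 4000.0])
c = spectrum_centroid_logfreq(p, f)
```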
- AudioSpectrumSpread (ASS): AudioSpectrumSpread Descriptor describes the second moment of the log-frequency power spectrum. Spectrum spread is an economical descriptor of the shape of the power spectrum that indicates whether it is concentrated in the vicinity of its centroid, or else spread out over the spectrum. It allows differentiating between tone-like and noise-like sounds. As for the spectrum centroid, there are many different ways to design a spectrum spread measure. This definition follows the same criteria as AudioSpectrumCentroid Descriptor, with which it is coherent.
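The second moment around the centroid distinguishes a concentrated, tone-like spectrum from a spread-out, noise-like one. A non-normative sketch on the same log-frequency axis as the centroid:

```python
import numpy as np

def spectrum_spread_logfreq(power, freqs_hz):
    """Power-weighted standard deviation around the log-frequency centroid:
    a sketch of AudioSpectrumSpread (octave units relative to 1 kHz)."""
    freqs_hz = np.maximum(freqs_hz, 62.5)
    logf = np.log2(freqs_hz / 1000.0)
    centroid = np.sum(power * logf) / np.sum(power)
    return float(np.sqrt(np.sum(power * (logf - centroid) ** 2) / np.sum(power)))

# A single spectral line has zero spread; evenly spread power does not.
f = np.array([500.0, 1000.0, 2000.0])
tone = spectrum_spread_logfreq(np.array([0.0, 1.0, 0.0]), f)
noise = spectrum_spread_logfreq(np.array([1.0, 1.0, 1.0]), f)
```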
- AudioSpectrumFlatness (ASF): The AudioSpectrumFlatness Descriptor describes the flatness properties of the short-term power spectrum of an audio signal within a given number of frequency bands. It expresses the deviation of the signal's power spectrum over frequency from a flat shape (corresponding to a noise-like or impulse-like signal). A high deviation from a flat shape may indicate the presence of tonal components. The spectral flatness analysis is calculated for a number of frequency bands, and the result may be used as a feature vector for robust matching between pairs of audio signals.
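A common flatness measure, which MPEG-7 also builds on, is the ratio of the geometric to the arithmetic mean of the power coefficients in a band: close to 1 for a flat, noise-like band and close to 0 for a peaky, tonal one. A per-band sketch (the band partitioning and normative details are omitted):

```python
import numpy as np

def spectral_flatness(power_band):
    """Geometric mean over arithmetic mean of one band's power
    coefficients: a sketch of the per-band flatness measure."""
    power_band = np.asarray(power_band, dtype=float)
    gm = np.exp(np.mean(np.log(power_band + 1e-12)))  # epsilon guards log(0)
    am = np.mean(power_band)
    return float(gm / am)

flat = spectral_flatness([1.0, 1.0, 1.0, 1.0])       # noise-like band
peaky = spectral_flatness([1.0, 1e-6, 1e-6, 1e-6])   # tonal band
```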
Basic Signal Parameters
- AudioHarmonicity (AH): The AudioHarmonicity Descriptor describes the degree of harmonicity of an audio signal. A harmonicity measure allows distinguishing between sounds that have a harmonic spectrum (musical sounds, voiced speech, etc.) and those that have a non-harmonic spectrum (noise, unvoiced speech, dense mixtures of instruments, etc.). Together with the AudioFundamentalFrequency Descriptor, the AudioHarmonicity Descriptor describes the harmonic structure of sound. These features are orthogonal and complementary to a descriptor such as the AudioSpectrumEnvelope Descriptor. The exact definitions of the measures (HarmonicRatio and UpperLimitOfHarmonicity) are designed to be easy to extract and coherent with the definitions of other descriptors (most of which are based on power).
- AudioFundamentalFrequency (AFF): The AudioFundamentalFrequency Descriptor describes the fundamental frequency of the audio signal. Fundamental frequency is a good predictor of musical pitch and speech intonation, and as such is an important descriptor of an audio signal. This descriptor is not designed to be a descriptor of melody, but it may nevertheless be possible to make meaningful comparisons between data labeled with a melody descriptor and data labeled with fundamental frequency. Fundamental frequency is complementary to the log-frequency spectrum in that, together with the AudioHarmonicity Descriptor, it specifies aspects of the detailed harmonic structure of periodic sounds that the logarithmic spectrum cannot represent for lack of resolution. The inclusion of a confidence measure, using the Weight field of the SeriesOfScalarType, is an important part of the design that allows proper handling and scaling of portions of the signal that lack clear periodicity.
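As noted earlier, MPEG-7 standardizes the descriptor but not its extraction. One common, non-normative way to estimate the fundamental frequency is autocorrelation peak picking, sketched below (real extractors also compute the confidence weight, omitted here):

```python
import numpy as np

def estimate_f0(signal, sr, fmin=50.0, fmax=1000.0):
    """Autocorrelation-based fundamental-frequency estimate in Hz.
    One possible (non-normative) extraction method for AFF."""
    sig = np.asarray(signal, dtype=float)
    sig = sig - sig.mean()
    ac = np.correlate(sig, sig, mode='full')[len(sig) - 1:]
    lo, hi = int(sr / fmax), int(sr / fmin)   # search plausible pitch lags
    lag = lo + int(np.argmax(ac[lo:hi + 1]))
    return sr / lag

# A 200 Hz sine at 8 kHz: the strongest lag is one period (40 samples).
sr = 8000
t = np.arange(sr // 4) / sr
f0 = estimate_f0(np.sin(2 * np.pi * 200.0 * t), sr)
```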
Temporal Timbral Descriptors:
- LogAttackTime (LAT): The log attack time (LAT) characterizes how long a signal takes to rise from a minimum threshold to its maximum amplitude. Its main motivation is the description of the onsets of single sound samples from different musical instruments. In the MPEG-7 standard, LAT is defined as the logarithm (base 10) of the duration from the time Tstart when the signal starts to the time Tstop when it reaches its maximum value (for a percussive sound) or its sustained part (for a sustained sound, i.e. one with no decay phase).
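The definition can be sketched directly over an energy envelope. The 2% and 99% threshold fractions below are illustrative assumptions; the standard leaves the exact thresholds to the extractor.

```python
import math

def log_attack_time(envelope, env_rate, start_frac=0.02, stop_frac=0.99):
    """log10 of the time (s) for the envelope to rise from start_frac to
    stop_frac of its maximum: a sketch of LAT. env_rate is the envelope
    sampling rate; threshold fractions are illustrative assumptions."""
    peak = max(envelope)
    t_start = next(i for i, v in enumerate(envelope) if v >= start_frac * peak)
    t_stop = next(i for i, v in enumerate(envelope) if v >= stop_frac * peak)
    return math.log10(max(t_stop - t_start, 1) / env_rate)

# A linear ~100 ms attack ramp sampled at 1 kHz gives LAT near -1.
env = [i / 100 for i in range(101)]
lat = log_attack_time(env, 1000)
```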
- TemporalCentroid (TC): The TemporalCentroid is defined as the time average over the energy envelope of the signal. The signal envelope and the frame sampling rate are used to compute TC.
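The energy-weighted time average can be sketched as follows (a non-normative illustration; the envelope is assumed to be uniformly sampled at the given frame rate):

```python
def temporal_centroid(envelope, frame_rate):
    """Envelope-weighted mean time in seconds: a sketch of
    TemporalCentroid over a uniformly sampled energy envelope."""
    total = sum(envelope)
    return sum(i * v for i, v in enumerate(envelope)) / (total * frame_rate)

# A symmetric triangular envelope centres its energy at the midpoint.
env = [0, 1, 2, 3, 2, 1, 0]
tc = temporal_centroid(env, 1.0)   # frame rate of 1 Hz for readability
```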
Spectral Timbral Descriptors:
- HarmonicSpectralCentroid (HSC): The HarmonicSpectralCentroid is computed as the average, over the sound segment duration, of the instantaneous HarmonicSpectralCentroid within a running window. The instantaneous HarmonicSpectralCentroid is computed as the amplitude-weighted (linear scale) mean of the harmonic peaks of the spectrum. The use of a linear frequency scale instead of a logarithmic one is derived from experimental results on human perception of timbre similarity: the linear scale significantly improves the explanation of those results.
- HarmonicSpectralDeviation (HSD): The HarmonicSpectralDeviation is computed as the average over the sound segment duration of the instantaneous HarmonicSpectralDeviation within a running window. The instantaneous HarmonicSpectralDeviation is computed as the spectral deviation of log-amplitude components from a global spectral envelope. The use of a logarithmic amplitude scale instead of a linear one is derived from experimental results on human perception of timbre similarity. The use of a logarithmic scale instead of a linear one significantly increases the explanation of these experimental results.
- HarmonicSpectralSpread (HSS): The HarmonicSpectralSpread is computed as the average, over the sound segment duration, of the instantaneous HarmonicSpectralSpread within a running window. The instantaneous HarmonicSpectralSpread is computed as the amplitude-weighted standard deviation of the harmonic peaks of the spectrum, normalized by the instantaneous HarmonicSpectralCentroid. As with the spectral centroid, there are many different ways to design a spectrum spread measure. This definition follows the same criteria as the HarmonicSpectralCentroid Descriptor, with which it is coherent.
- HarmonicSpectralVariation (HSV): The HarmonicSpectralVariation is defined as the mean over the sound segment duration of the instantaneous HarmonicSpectralVariation. The instantaneous HarmonicSpectralVariation is defined as the normalized correlation between the amplitude of the harmonic peaks of two adjacent frames.
- SpectralCentroid (SC): The SpectralCentroid is computed as the power-weighted average of the frequency of the bins in the power spectrum. This descriptor is very similar to the AudioSpectrumCentroid (ASC) described above, but is more specifically designed for distinguishing musical instrument timbres. Like the two other spectral centroid definitions contained in the MPEG-7 standard (ASC and HSC), it is highly correlated with the perceptual feature of the sharpness of a sound. The spectral centroid is commonly associated with the brightness of a sound; it has been found that increased loudness also increases the amount of high-frequency content in a signal, thus making a sound brighter.
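The power-weighted average of bin frequencies on a linear axis can be sketched directly from an FFT (a non-normative illustration; windowing and framing are omitted):

```python
import numpy as np

def spectral_centroid_hz(signal, sr):
    """Power-weighted mean of FFT bin frequencies in Hz: a sketch of the
    timbral SpectralCentroid on a linear frequency axis."""
    spec = np.abs(np.fft.rfft(np.asarray(signal, dtype=float))) ** 2
    freqs = np.fft.rfftfreq(len(signal), d=1.0 / sr)
    return float(np.sum(freqs * spec) / np.sum(spec))

# A pure 1 kHz tone concentrates its power at the 1 kHz bin, so the
# centroid sits at 1 kHz (1024 samples at 8 kHz puts the tone exactly
# on bin 128, avoiding spectral leakage).
sr = 8000
t = np.arange(1024) / sr
sc = spectral_centroid_hz(np.sin(2 * np.pi * 1000.0 * t), sr)
```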
Spectral Basis Representations:
- Audio Spectrum Basis (ASB): The AudioSpectrumBasis Descriptor contains basis functions that are used to project high-dimensional spectrum descriptions into a low-dimensional representation. Spectrum dimensionality reduction plays a substantial role in automatic classification applications by compactly representing salient statistical information about audio segments. These features have been shown to perform well for automatic classification and retrieval applications.
- AudioSpectrumProjection (ASP): The AudioSpectrumProjection Descriptor is a low-dimensional representation of a spectrum obtained by projection against spectral basis functions. The projected data is stored in a SeriesOfVector Descriptor, whose dimensions depend upon the usage model: for stationary basis components the dimension attribute is set to dim="N K+1", where N is the spectrum length and K is the number of basis functions; for time-varying basis components, dim="M N K+1", where M is the number of blocks, N is the spectrum length and K is the number of basis functions per block. The AudioSpectrumProjection Descriptor is the complement to the AudioSpectrumBasis Descriptor and is used to represent low-dimensional features of a spectrum after projection against a reduced-rank basis. These two types are always used together. The low-dimensional features of the AudioSpectrumProjection Descriptor consist of a SeriesOfVectors, one vector for each frame of the normalized input spectrogram.
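The basis/projection pair can be sketched with a singular value decomposition, which is the core of the MPEG-7 basis extraction (the standard additionally allows an ICA step after the SVD, omitted in this non-normative sketch):

```python
import numpy as np

def spectrum_basis_and_projection(spectrogram, k):
    """SVD-based dimensionality reduction of a spectrogram. The leading
    right singular vectors play the role of AudioSpectrumBasis functions;
    the projected frames play the role of AudioSpectrumProjection."""
    X = np.asarray(spectrogram, dtype=float)   # frames x spectrum bins
    _, _, vt = np.linalg.svd(X, full_matrices=False)
    basis = vt[:k].T                           # N x K basis functions
    projection = X @ basis                     # frames x K low-dim features
    return basis, projection

# 20 frames of a 10-bin spectrum reduced to 3 basis functions.
rng = np.random.default_rng(0)
X = rng.random((20, 10))
B, P = spectrum_basis_and_projection(X, 3)
```

As described above, the two descriptors travel together: the basis is needed to interpret (or approximately reconstruct) the projected features.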
In MPEG-7, the semantic content of multimedia can be described by text annotation (free text, keyword, structured and dependency structure) and/or semantic entity and semantic relation tools. Free text annotations describe the content using unstructured natural language text (e.g., Barack Obama visits Turkey in April). Such annotations are easy for humans to understand but difficult for computers to process. Keyword annotations use a set of keywords (e.g., Barack Obama, visit, Turkey, April) and are easier for computers to process. Structured annotations strike a balance between simplicity (in terms of processing) and expressiveness. They consist of elements, each answering one of the following questions: who, what object, what action, where, when, why and how (e.g., who: Barack Obama, what action: visit, where: Turkey, when: April). Dependency structure represents the linguistic structure of an annotation based on a linguistic theory called dependency grammar, which explains a sentence's grammatical structure in terms of dependencies between its elements.
The semantic tools of MPEG-7 provide methods to create very brief or very extensive semantic descriptions of multimedia content. The choice of which description tool to use in a system is affected by the type of semantic queries to be supported and by the annotation tool to be used. Some of the descriptions can be obtained automatically, while most of them require manual labeling. Keyword and structured annotations can be obtained automatically to some extent using state-of-the-art auto-annotation techniques. Descriptions of semantic entities and the relations between them cannot be obtained automatically with the current state of the art; therefore, a considerable amount of manual work is needed for this kind of semantic annotation.