Efficient Methods for Revealing Structure in Symbolic Data

Vera Afreixo, Paulo Jorge Ferreira, Óscar Mortágua Pereira, Manuel Santos, António Valente, "Efficient Methods for Revealing Structure in Symbolic Data", Proc. 1st European Conference in Functional Genomics , Prague, Czech Republic, May 2003


There is a great deal of interest in methods for detecting and revealing structure, i.e., short and long range correlations, periodicities and structure at several scales, in DNA sequences. Fourier analysis and related tools are routinely used to expose these structures, but require an underlying algebraic framework (e.g., group structure) that is absent in symbolic data.
Fourier analysis on symbolic data becomes possible after mapping symbols to numbers. Independence of the results with respect to particular labellings can be achieved by mapping symbols to vectors, and expressing them using an orthonormal basis in 3D space. For each frequency, the sum of the squared modulus of the three Fourier coefficients provides a measure of spectral content with the required invariance properties, which can also be regarded as the Fourier transform of the inner-product autocorrelation of the mapped vectors. Another approach is to use four indicator (binary) sequences. Again, the total Fourier spectrum of the symbolic data is defined as the sum of the four individual spectra.
In here we present an approach for the analysis of symbolic data based on the concept of "autocorrelation of symbolic data". The Fourier transform of this sequence can be expressed in terms of the binary indicator sequences, shedding light on the interplay between several methods for the analysis of symbolic data.
Additionally, efficient algorithms for the comparative treatment of entire genomes is also presented. A recent approach that generalizes the spectral envelope concept requires solving eigenproblems at each step. We propose a more efficient scheme that springs directly from the Fourier transforms of the indicator sequences. The interest of this is twofold: it clarifies the connection between this approach and others, and leads to much faster analysis.


Conference: 1st European Conference in Functional Genomics in Prague, Czech Republic

ISSN: 978-3-642-31137-6