\title{A Comparative Study of Biological Distances for EST Clustering}
\author{ Scott Hazelhurst\inst{1} \and Zsuzsanna Lipt\'ak\inst{2} \and Judith Zimmerman\inst{3} }
\institute{ School of Computer Science, University of the Witwatersrand, Johannesburg,\\ Private Bag 3, 2050 Wits, South Africa \email{}\footnote{Partially supported by SA National Research Foundation (GUN2053410)}
\and Universit\"at Bielefeld, Technische Fakult\"at, AG Genominformatik,\\ 33594 Bielefeld, Germany \email{}\footnote{Most of this work was done while Zs.L. was working in the Research Group `Algorithms, Data Structures, and Applications', Institute of Theoretical Computer Science, ETH Zurich, and at the South African National Institute of Bioinformatics (SANBI), Cape Town}
\and Research Group `Algorithms, Data Structures, and Applications', Institute of Theoretical Computer Science,\\ ETH Zurich, CH-8092 Zurich \email{} }
\date{}
\bibliographystyle{plain}
\begin{document}
\maketitle
\begin{abstract}
This paper presents the results of an experimental study in which different string distance measures were compared and evaluated with respect to their applicability to EST clustering. We implemented two tools, SeqGen (Sequence Generator) and ECLEST (Evaluator for Clusterings of ESTs). SeqGen generates simulated ESTs from input human cDNAs; ECLEST runs EST clustering on these ESTs and computes a score for the quality of the resulting clustering. We advocate the use of simulated data for comparative studies of this type, because they allow evaluation w.r.t.\ a known ideal solution (in this case, the correct clustering), which is not possible in most cases with real-life data. The distance measures we compared include both subword-based and alignment-based measures. We ran a large number of tests and obtained statistically significant results on the applicability of the distance measures studied.
For example, we show that, in a significant number of cases, certain subword-based measures produce output comparable to that of alignment-based measures, and that certain easy-to-compute measures are well suited for a preprocessing step. Our results have significant applications in studies of gene expression and in the discovery of products of alternative splicing, where there is a pressing need for fast clustering of increasingly large sets of ESTs.
\end{abstract}
\keywordname{ string distance measures, EST clustering, simulated data, clustering evaluation, benchmarks}