SeqGen: A tool for creating benchmarks for EST clustering algorithms
Hazelhurst and Bergheim

Clustering algorithms for biological data such as ESTs have many important applications. One is clustering ESTs produced from mRNAs extracted from a cell to determine which gene products are present in that cell. Another is identifying the products of alternative gene splicing. The problem is, given a set of ESTs, to cluster them so that each cluster contains fragments from a different gene product. Clustering algorithms use some measure of sequence similarity to determine whether ESTs should be clustered together. Many distance measures have been proposed, with various arguments advanced for them and against others, including performance, quality, and biological faithfulness. Whether one measure is better than another in terms of computational cost can be determined relatively easily. It is more difficult to assess a similarity measure with respect to the quality of the clusters found by a clustering algorithm that uses it. Ultimately, quality should be measured against the right answer. But what is the right answer? After all, if we knew the right answer, we would not need to cluster. This paper presents a methodology for generating artificial data that allows users to choose an error model and then experimentally validate which distance function or clustering algorithm is appropriate.
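
The idea of benchmarking against artificial data with a known answer can be illustrated with a minimal Python sketch. It is not the SeqGen implementation: the function names, the uniformly random "gene" model, the independent per-base error model, and the default rates are all assumptions made for illustration. The sketch samples EST-like fragments from synthetic genes, corrupts them under a user-chosen error model, keeps the true gene label of each fragment, and scores any candidate clustering against those labels with the Rand index.

```python
import random
from itertools import combinations

BASES = "ACGT"

def random_gene(length, rng):
    """Hypothetical gene model: a uniformly random nucleotide string."""
    return "".join(rng.choice(BASES) for _ in range(length))

def mutate(seq, rng, sub=0.01, ins=0.005, dele=0.005):
    """Assumed error model: independent per-base substitution,
    insertion, and deletion with the given probabilities."""
    out = []
    for base in seq:
        r = rng.random()
        if r < dele:
            continue  # base is deleted
        if r < dele + sub:
            # substitute with a different base
            out.append(rng.choice([b for b in BASES if b != base]))
        else:
            out.append(base)
        if rng.random() < ins:
            out.append(rng.choice(BASES))  # spurious inserted base
    return "".join(out)

def make_benchmark(n_genes=5, gene_len=1000, ests_per_gene=10,
                   est_len=300, seed=0, **error_rates):
    """Return (ests, labels); labels[i] is the gene that ests[i] came from."""
    rng = random.Random(seed)
    ests, labels = [], []
    for gene_id in range(n_genes):
        gene = random_gene(gene_len, rng)
        for _ in range(ests_per_gene):
            start = rng.randrange(gene_len - est_len + 1)
            fragment = gene[start:start + est_len]
            ests.append(mutate(fragment, rng, **error_rates))
            labels.append(gene_id)
    return ests, labels

def rand_index(truth, predicted):
    """Fraction of sequence pairs on which two clusterings agree
    (both place the pair together, or both place it apart)."""
    pairs = list(combinations(range(len(truth)), 2))
    agree = sum((truth[i] == truth[j]) == (predicted[i] == predicted[j])
                for i, j in pairs)
    return agree / len(pairs)
```

Because the true labels are known by construction, the output of any clustering algorithm or distance function run on `ests` can be compared with `labels` via `rand_index`, and the experiment can be repeated under different error rates to see how the score changes.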