




EST2uni is an open, parallel tool for automated EST analysis and database creation, with a data-mining web interface. It is capable of converting, in a fully automatic way, a set of chromatograms or plain sequences into a highly structured and annotated EST database with a user-oriented web interface.

The EST analysis pipeline includes standard pre-processing, clustering, and annotation programs, and the software is highly modular, which facilitates the incorporation of new methods and analyses. Running options are easily adapted to local needs by modifying a single, extensively documented configuration text file that provides the parameters to be used. Once configured, the pipeline runs without user assistance, and the database is filled with the analysis results in real time.

The analysis can be run either on a single standard computer or on a PC cluster, thus benefiting from the multiprocessor capabilities of the latter.

The deployed web site is a powerful data-mining tool with a rich yet easy-to-use query interface that provides bulk data retrieval and download. Access to the data can be password-protected to keep it private. The web site also eases the use of several tools, such as primer design and BLAST searches against the database.

This tool has been developed through the joint effort of the Spanish Melon Genome Project and the Citrus Functional Genomics Project. The software is free, has been released under the GPL license, and its development is open and collaborative. Any researcher is free to use it, to develop it, and to deploy a web site with their own sequences. The system is continually improved by adding new analyses and integrating new tools, and we are open to suggestions or collaborations with the genomics community.


The package offers the following features:

EST analysis pipeline

The EST analysis provided by the EST2uni pipeline includes the following steps:

Below we describe the analysis methods included in the current distribution of the package. However, the modular design of the software and the autonomy of the database-interacting objects make it easy to include new methods in the analysis.

1) EST pre-processing.

- Sequence and quality extraction from trace files:

- Low quality regions trimming:

- Cloning vector trimming:

- Contamination vector removal:

- Masking of low complexity regions:

- Masking of repetitive elements:

2) EST assembly.

- Identification of the unigenes set:

- Graphical representation of the contigs alignment:

- Unigenes orientation:

- Identification of "superunigenes":

- Selection of a representative cDNA clone for each unigene/superunigene:

3) Unigenes structural annotation.

- Microsatellites:

- SNPs:

- ORF prediction:

- Putative full-length clones identification:

- Integration of information about RFLP genetic markers:

- Integration of information about PCR genetic markers:

4) Unigenes functional annotation.

- BLAST annotation:

- Protein domains:

- Gene Ontology (GO) classification:

- Orthologs identification:
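The four stages listed above can be sketched as a simple sequential driver. This is a minimal Python illustration only: the actual EST2uni driver is a Perl script, each stage wraps external tools, and the function bodies here are trivial stand-ins for the real trimming, clustering, and annotation steps.

```python
# Minimal sketch of the four-stage EST pipeline described above.
# Illustrative only: the real EST2uni driver is written in Perl and each
# stage calls external programs; these stand-ins just show the data flow.

def preprocess(ests):
    """Stage 1: pre-processing (stand-in for quality/vector trimming and masking)."""
    cleaned = [seq.strip("N") for seq in ests]          # crude stand-in for trimming
    return [seq for seq in cleaned if len(seq) >= 4]    # stand-in for a length filter

def assemble(ests):
    """Stage 2: assembly (stand-in: collapse identical reads into 'unigenes')."""
    return sorted(set(ests))

def annotate(unigenes):
    """Stages 3 and 4: attach structural and functional annotation slots per unigene."""
    return {u: {"orf": None, "ssr": None, "blast_hits": []} for u in unigenes}

def run_pipeline(ests):
    # Stages run in order; each consumes the previous stage's output.
    return annotate(assemble(preprocess(ests)))

db = run_pipeline(["NNACGTACGT", "ACGTACGT", "NNNAC"])
```

The point of the sketch is the staged structure: each step takes the previous step's output, so a new analysis can be slotted in without touching the others.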

Database and web site

The pipeline creates a structured relational database, and the package includes the files necessary to set up a web site with access to that database through a powerful data-mining tool with an advanced query interface and tight integration among all kinds of data.

It is not just a collection of simple tables showing the data, or a simple query form. On the contrary, it allows different functional and structural annotation criteria to be combined in a single query. For instance, it is very easy to look for unigenes with ESTs from several given libraries that are annotated as transcription factors and have SSRs or SNPs.

Unigenes can also be queried using a BLAST search, or with a file containing a list of unigene names, ortholog names, or IDs from a previous database version.

The web site also provides tools for bulk download of chromatograms, unigene and EST names, FASTA sequences, contig alignments, orthologs, or unigene names from previous database versions.

In order to analyze the GO terms of a result set globally, the orthologs in selected model species can be automatically submitted to Babelomics.

The web site also generates statistics for each library, such as the number of ESTs, singletons, contigs, unigenes, and GO terms, as well as novelty and redundancy.

In addition, the unigene page shows graphical and textual summaries of the assembly and the annotations.

Moreover, Primer3 is integrated into the web site, so primers can be designed and the graphical results viewed with just two mouse clicks.

Software Implementation

The package consists of:

The main Perl script manages the execution of several standard tools commonly used for EST analysis. The script can be run either in sequential mode or in a parallel environment using the load-distribution tool Condor. The analysis results are recorded in a normalized MySQL database, and the web site is built using PHP.

The behaviour of the pipeline is highly configurable: the parameters used by the pipeline and the external tools are stored in a plain-text configuration file (est_pipe.conf) which can be easily modified to adapt the pipeline to particular needs. Once configured, the system runs without user assistance, and the database is filled with the analysis results in real time.
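To give a feel for the configuration style, here is a hypothetical excerpt of est_pipe.conf. The parameter names and values below are illustrative only; the actual keys and defaults are documented in the file shipped with the package.

```
# est_pipe.conf -- hypothetical excerpt; the real parameter names may differ
project_name   = melon_est
input_dir      = /data/chromatograms
min_quality    = 20                 # minimum phred-style base quality
vector_db      = /db/UniVec         # contamination screening database
run_mode       = condor             # or: sequential
blast_db       = /db/uniprot_sprot  # database for functional annotation
```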

Each annotation analysis is controlled by a separate Perl module, and the results from different analyses are stored in independent tables to facilitate the future addition of new analyses. As a consequence, all analyses can be run independently at different times, and the annotation modules can be re-run at any time against updated databases.
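The one-module-per-analysis design can be sketched as a small plugin registry. This is a Python illustration of the idea only (the real modules are Perl, and the analysis names and functions below are made up): each analysis registers itself under a name and writes to its own store, so any subset can be run or re-run independently.

```python
# Sketch of the modular-analysis idea: each analysis is an independent,
# individually runnable unit. Python for illustration; the real EST2uni
# modules are Perl, and these analyses are simplified stand-ins.
ANALYSES = {}

def analysis(name):
    """Register an analysis function under a name."""
    def register(fn):
        ANALYSES[name] = fn
        return fn
    return register

@analysis("ssr")
def find_ssrs(unigene):
    # Stand-in for microsatellite detection.
    return {"ssr_found": "AGAGAG" in unigene}

@analysis("orf")
def find_orf(unigene):
    # Stand-in for ORF prediction.
    return {"has_start_codon": "ATG" in unigene}

def run(selected, unigene):
    # Each analysis would write to its own table; here, its own dict key.
    return {name: ANALYSES[name](unigene) for name in selected}

result = run(["ssr"], "TTAGAGAGCC")
```

Because `run` takes the list of analyses to execute, adding a new analysis or re-running a single one with an updated database touches nothing else.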

The package is distributed with a complete working web site which should be very easy for an administrator to adapt. The visual design is controlled with CSS to facilitate changes, and deeper adaptation should also be easy because the PHP code that appears on the browseable web pages is minimal (e.g., the queries page has just 14 PHP lines). The system also supports several projects on the same server: each project resides in its own directory with its browseable web pages, all of them using the common PHP functions stored in a shared directory.

System Requirements

1) Hardware

No hardware restrictions apply to our package. However, note that some of the external programs used here may not be available for some architectures. Memory requirements are limited to those of the external programs used in the pipeline. In general, it runs without problems on standard medium-level ix86 boxes.

2) Operating systems

We have tested and correctly configured this software on ix86 boxes running the following GNU/Linux distributions:

It should, however, run without problems on any Unix installation as long as all the required software is installed. It could also be easily ported to Microsoft Windows once the external programs used here become available for that platform.

3) Software

The following software is required to run the EST2uni pipeline (click here to see installation instructions for external software).

In addition, the following tools are required to perform the analyses (click here to see installation instructions for external software), although other similar tools could be used in addition to or instead of them (with a little hacking, of course). Some of them are optional, provided that you do not need the corresponding analysis.


Please, go here for EST2uni installation and configuration instructions.


EST2uni can be freely downloaded.

EST2uni is free software distributed under the GPL license, and its development is open to anyone interested.

The development is done on a public Subversion server; you can download the latest version with the command svn checkout svn:// If you want to contribute some code, just contact us.

A mailing list is devoted to the use and development of this tool. Feel free to ask for help, suggest improvements, or report bugs.

Contact information