EST2uni is an open, parallel tool for automated EST analysis and database creation, with a data mining web interface. It converts, in a fully automatic way, a set of chromatograms or plain sequences into a highly structured and annotated EST database with a user-oriented web interface.
The EST analysis pipeline includes standard pre-processing, clustering and annotation programs, and the software is highly modular, which facilitates the incorporation of new methods and analyses. Running options are adapted to local needs by modifying a single, extensively documented configuration text file that provides the parameters to be used. Once configured, the pipeline runs without user assistance, and the database is filled with the analysis results in real time.
The analysis can be run either on a single standard computer or on a PC cluster, thus benefiting from the multiprocessor capabilities of these systems.
The deployed web site is a powerful mining tool with a complex, yet easy to use, query interface that provides bulk data retrieval and download. Access to the data can be restricted by passwords to keep the data private. The web site also simplifies the use of several tools, such as primer design and BLAST searches against the database.
This tool has been developed by the joint effort of the Spanish Melon Genome Project and the Citrus Functional Genomics Project. The software is free, has been released under the GPL license, and its development is open and collaborative. Any researcher is free to use it, to develop it, and to deploy a web site with their own sequences. The system is continually being improved by adding new analyses and integrating new tools, and we are open to suggestions or collaborations with the genomics community.
The package offers the following features:
1) the EST analysis is fully automated by a highly configurable pipeline covering all the steps (EST pre-processing, clustering, annotation and database creation) from the input chromatograms and/or sequences to a clean and annotated web-searchable EST database,
2) although it includes a series of commonly used analysis methods, its design is highly modular, which facilitates the incorporation of new methods and analyses,
3) it uses, when possible, third-party, freely available, commonly used programs, which facilitates the incorporation of improvements made by other programmers,
4) it is able to run in parallel on a personal computer (PC) cluster, thus benefiting from the multiprocessor capabilities of these systems,
5) it provides a highly configurable and extensible user-friendly database interface via the web, which allows complex queries and data mining combining any search criteria,
6) the whole package is easily customizable:
- Perl script customizable through a configuration text file.
- Web page format based on a CSS file.
- Ready-to-insert PHP functions for custom web pages.
- Database interaction isolated from HTML printing through PHP objects.
7) it is ready for Linux and other Unix systems, and could be easily ported to Windows, and
8) the software is released under an open source license (GPL) to allow continuous development by a community of users and programmers.
The EST analysis provided by the EST2uni pipeline includes the following steps:
1) EST pre-processing, to get clean, high-quality EST sequences,
2) EST clustering or assembly, to get a set of unique gene consensus sequences or unigenes, and
3) unigene annotation, to provide structural and functional information about the unigenes.
Below we describe the analysis methods included in the current distribution of the package. However, the modular design of the software and the autonomy of the database-interacting objects make it easy to include new methods in the analysis.
1) EST pre-processing.
- Sequence and quality extraction from trace files:
Both chromatogram files and FASTA sequence files, with or without quality values, are accepted as the entry point to the analysis. Base calling and quality score assignment from chromatograms are performed with phred.
- Low quality regions trimming:
Low quality regions are removed with Lucy.
- Cloning vector trimming:
Cloning vector sequences are removed with Lucy.
- Contamination vector removal:
- Masking of low complexity regions:
Low complexity regions are masked with SeqClean.
- Masking of repetitive elements:
Repetitive elements are masked with RepeatMasker.
2) EST assembly.
- Identification of the unigenes set:
- Graphical representation of the contigs alignment:
Color-based images providing a graphical display of the EST assemblies in contigs are created using the GPL Perl script contigimage.pl (distributed with the EST2uni package).
- Unigenes orientation:
Poly(A/T) tails are detected and used to reverse the sequences when necessary, using a locally developed algorithm.
- Identification of "superunigenes":
Very similar unigenes, probably corresponding to, for example, gene families or splicing variants, are grouped into so-called "superunigenes" using a locally developed algorithm. For microarray purposes, a "superunigene" groups different unigenes with the same expected mRNA target under standard hybridization conditions, as judged by extensive sequence overlap.
- Selection of a representative cDNA clone for each unigene/superunigene:
One cDNA clone can be selected as the best representative for each unigene/superunigene, using a locally developed algorithm based on a number of user-defined preferences (e.g., the most 5' or 3' clone, or the longest or shortest one). These clones could be used as probes to be printed on microarrays.
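The unigene orientation step described above can be illustrated with a minimal sketch. This is not the locally developed EST2uni algorithm, whose details are not given here; it only assumes that a poly(A) run near the 3' end marks a correctly oriented sequence, while a poly(T) run near the 5' end marks a sequence that should be reverse-complemented. The function names, the minimum run length (12) and the window size (30) are hypothetical choices made for illustration.

```python
import re
from typing import Tuple

# Complement table for DNA; any other symbol (e.g. N) is left unchanged.
_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


def orient_by_poly_tail(seq: str, min_run: int = 12, window: int = 30) -> Tuple[str, str]:
    """Orient a sequence using poly(A/T) evidence (illustrative only).

    Returns (sequence, evidence), where evidence is one of
    'polyA_3prime', 'polyT_5prime_reversed' or 'undetermined'.
    """
    head = seq[:window].upper()
    tail = seq[-window:].upper()
    has_poly_a_end = re.search("A{%d,}" % min_run, tail) is not None
    has_poly_t_start = re.search("T{%d,}" % min_run, head) is not None

    if has_poly_a_end and not has_poly_t_start:
        # Poly(A) tail at the 3' end: already in the sense orientation.
        return seq, "polyA_3prime"
    if has_poly_t_start and not has_poly_a_end:
        # Poly(T) at the 5' end: the read is antisense, so reverse it.
        return reverse_complement(seq), "polyT_5prime_reversed"
    # No signal, or conflicting signals: leave the sequence untouched.
    return seq, "undetermined"
```

A real implementation would probably also tolerate a few mismatches inside the tail and combine the evidence from all ESTs in a contig; the sketch deliberately ignores both.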
3) Unigenes structural annotation.
- SSR identification:
SSR microsatellites can be detected with a modified version of sputnik.
- SNP identification:
Putative single nucleotide polymorphisms (SNPs) can be found by using a locally developed algorithm.
- ORF prediction:
Open reading frames of the unigenes are obtained using ESTScan.
- Putative full-length clones identification:
Unigenes with ESTs coming from putative full-length clones are identified, using a locally developed algorithm.
- Integration of information about RFLP genetic markers:
When a file listing associations between clones and RFLP markers is provided, these markers are associated with the unigenes that have ESTs coming from these clones.
- Integration of information about PCR genetic markers:
When a file including a set of primer pairs is provided, in-silico PCR experiments can be performed with ipcress to integrate information about PCR-based molecular markers.
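To give an idea of what the SSR identification step looks for, the following sketch finds perfect tandem repeats with a regular expression. It is not sputnik (which the pipeline actually uses, in a modified version) and does not reproduce its scoring; the function name and the thresholds (units of 2 to 5 bases, at least 4 repeats) are assumptions chosen only for illustration.

```python
import re
from typing import List, Tuple


def find_ssrs(seq: str, min_unit: int = 2, max_unit: int = 5,
              min_repeats: int = 4) -> List[Tuple[int, str, int]]:
    """Return (start, unit, repeat_count) for each perfect tandem repeat.

    Only perfect repeats are reported; homopolymer runs are ignored.
    """
    # A unit of min_unit..max_unit bases (shortest first) followed by
    # at least (min_repeats - 1) further copies of itself.
    pattern = re.compile(
        r"([ACGT]{%d,%d}?)\1{%d,}" % (min_unit, max_unit, min_repeats - 1)
    )
    hits = []
    for match in pattern.finditer(seq.upper()):
        unit = match.group(1)
        if len(set(unit)) == 1:
            # Units such as 'AA' are just homopolymer stretches.
            continue
        repeats = len(match.group(0)) // len(unit)
        hits.append((match.start(), unit, repeats))
    return hits
```

For example, find_ssrs("GGGACACACACTTT") reports one dinucleotide repeat, (3, "AC", 4), with a 0-based start position.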
4) Unigenes functional annotation.
- BLAST annotation:
Comparisons against a set of user-defined nucleotide and/or protein databases are carried out by using NCBI BLAST. Unigenes are annotated with the descriptions of the most informative BLAST hits.
- Protein domains:
- Gene Ontology (GO) classification:
Annotation of unigenes with GO terms can be done from BLAST results against a user-defined GO-annotated database.
- Orthologs identification:
A bi-directional BLAST comparison can also be performed with a number of user-defined species-specific sequence databases in order to obtain a set of putative orthologs, using a locally developed algorithm.
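The bi-directional comparison used for ortholog identification can be sketched as a reciprocal best hit search. The EST2uni algorithm itself is not described here, so this is only the common textbook approach, under the assumption that both BLAST runs were saved in the standard 12-column tab-separated tabular format (query, subject, ..., e-value, bit score). The function names are hypothetical.

```python
from typing import Dict, Iterable, List, Tuple


def best_hits(tabular_lines: Iterable[str]) -> Dict[str, str]:
    """Map each query ID to the subject ID with the highest bit score."""
    best: Dict[str, Tuple[float, str]] = {}
    for line in tabular_lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment lines
        fields = line.split("\t")
        query, subject = fields[0], fields[1]
        bitscore = float(fields[11])  # 12th column of tabular output
        if query not in best or bitscore > best[query][0]:
            best[query] = (bitscore, subject)
    return {query: subject for query, (_, subject) in best.items()}


def reciprocal_best_hits(forward: Iterable[str],
                         backward: Iterable[str]) -> List[Tuple[str, str]]:
    """Return (unigene, ortholog) pairs that are each other's best hit.

    forward:  unigenes searched against the other species' database.
    backward: the other species' sequences searched against the unigenes.
    """
    fwd = best_hits(forward)
    bwd = best_hits(backward)
    return sorted((q, s) for q, s in fwd.items() if bwd.get(s) == q)
```

Both arguments can be open file handles or lists of lines. Ties in bit score are resolved in favour of the first hit read, which is a simplification.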
The pipeline creates a structured relational database, and the package includes the files needed to set up a web site with access to that database through a powerful data mining tool with an advanced querying interface and high integration among all kinds of data.
It is not just a collection of simple tables showing the data, or a simple query form. On the contrary, it allows different functional and structural annotation criteria to be combined in the queries. For instance, it is very easy to look for unigenes that have ESTs from several given libraries, are transcription factors, and have SSRs or SNPs.
Unigenes can also be queried using a BLAST search, or a file with a list of unigene names, orthologs, or previous database version IDs.
The web site also provides tools for bulk download of chromatograms, unigene and EST names, FASTA sequences, contig alignments, orthologs, or previous database unigene names.
In order to globally analyze the GO terms of the result set, it is also possible to automatically submit the orthologs in some model species to Babelomics.
The web site also generates some statistics for each library, like number of ESTs, singletons, contigs, unigenes, GO terms, novelty, and redundancy.
The unigene web page view also shows graphical and textual summaries of the assembly and annotations.
Moreover, Primer3 is integrated in the web site, so primers can be designed and the graphical results viewed with just two mouse clicks.
The package consists of:
1) a main Perl script (perl/est2uni) to perform both the EST analysis pipeline and the database creation,
2) a set of Perl modules (perl/*.pm) with the functions called by the main Perl script,
3) a set of PHP scripts (php/est2uni/*.php) to generate the browseable database-interacting web pages, and
4) a set of PHP modules (php/estpipe/*.php) with the functions called by the PHP scripts.
The main Perl script manages the execution of several standard tools commonly used for EST analysis. The script can be run either in sequential mode or in a parallel environment using the load distribution tool CONDOR. The analysis results are recorded in a normalized MySQL database and the web site is built by using PHP.
The pipeline is highly configurable. The configuration parameters used by the pipeline and the external tools are stored in a plain text configuration file (est_pipe.conf), which can be easily modified to adapt the pipeline to particular needs. Once configured, the system runs without user assistance and the database is filled with the analysis results in real time.
Each annotation analysis is controlled by a different Perl module. The results from different analyses are stored in independent tables to facilitate the future addition of new analyses. As a consequence, all analyses can be run independently at different times, and the annotation modules can be re-run at any time with updated databases.
The package is distributed with a complete working web site which should be very easy for an administrator to adapt. The visual design is controlled with CSS to facilitate changes, and deeper adaptation should also be easy because the PHP code that appears in the browseable web pages is minimal (e.g., the queries page has just 14 PHP lines). The system also supports different projects using the same server. Each project should be placed in a specific directory with its browseable web pages, all of them using the common PHP functions stored in a shared directory.
1) Hardware
No hardware restrictions apply to our package. However, some of the external programs used here may not be available for some architectures. Memory requirements are limited to those of the external programs used in the pipeline. In general, it runs without problems on standard medium-level ix86 boxes.
2) Operating systems
We have successfully tested and configured this software on ix86 boxes running the following GNU/Linux distributions:
-Fedora Core 5 (FC)
-Ubuntu 6.06 (Ubt)
-SuSE Linux 8.2
It should, however, run without problems on any Unix installation as long as all the required software is installed. It could also be easily ported to Microsoft Windows, provided that the external programs used here are available for that platform.
The following software is required to run the EST2uni pipeline (click here to see installation instructions for external software).
In addition, the following tools are required to perform the analysis (click here to see installation instructions for external software), although other similar tools could be used in addition to or instead of them (with a little hacking, of course). Some of them are optional and can be skipped if the corresponding analysis is not needed.
- go-perl Perl package
Please, go here for EST2uni installation and configuration instructions.
EST2uni can be freely downloaded.
EST2uni is free software distributed under the GPL license and its development is open to anyone interested.
The development is done on a public Subversion server. You can download the latest version with the command svn checkout svn://phobos.agr.upv.es/estpipe. If you want to contribute some code, just contact us.
A mailing list is devoted to the use and development of this tool. Feel free to ask for help, suggest improvements, or report bugs.