Protocol for Processing Large-Scale Wheat EST Data
This protocol was developed at Albany, CA. It uses two programs from the University of Washington, Seattle, phred and cross_match,
together with Perl scripts written in house. The scripts are available at the Perl
scripts site.
- Step 1 : Converting raw sequence data into fasta format. As raw sequence data emerge from the sequencing
machine, either Beckman's CEQ2000 or Applied Biosystems' ABI3700, phred analysis is carried out to convert them
into three files: a fasta file, a quality score file and a histogram file. Phred is a base-calling program
developed at the University of Washington, Seattle.
- Step 2 : Trimming bases with low quality scores at both ends of a sequence read. This is done using a Perl
script called phredclean.pl (or sweep1.pl). Bases with a quality score below 20, which corresponds to a base-call
error rate above 1%, are trimmed off at this
step. (Note: newer versions of phred can themselves remove low-quality sequence, that is, sequence with a phred score < 20.)
- Step 3 : Screening for vector sequence. Each read is compared against a file containing the cloning vector sequence
using the cross_match program. Cross_match will find
and mask the vector sequence.
- Step 4 : Vector sequence removal. A Perl script called cmclean.pl (or sweep2.pl) removes the vector
sequence found in Step 3.
- Step 5 : Sequence editing. In this step, undesirable sequences are removed manually. A set of guidelines has been
developed to ensure the consistency of the editing process. These guidelines will be incorporated into the package
so that a fully automated sequence editing process can be implemented in the future.
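The end trimming in Step 2 and the vector removal in Step 4 can be sketched in Python as follows. This is a minimal illustration, not the in-house phredclean.pl or cmclean.pl: the function names are hypothetical, and the sketch assumes that quality scores are available as a list of integers aligned with the bases and that cross_match has masked vector bases with the character X. It only handles masked vector at the ends of a read.

```python
def trim_low_quality_ends(seq, quals, min_q=20):
    """Drop bases from both ends until a base with quality >= min_q is reached."""
    if len(seq) != len(quals):
        raise ValueError("sequence and quality list must have the same length")
    start, end = 0, len(seq)
    # Walk inward from the 5' end past low-quality bases.
    while start < end and quals[start] < min_q:
        start += 1
    # Walk inward from the 3' end past low-quality bases.
    while end > start and quals[end - 1] < min_q:
        end -= 1
    return seq[start:end], quals[start:end]


def strip_masked_ends(seq, mask_char="X"):
    """Remove runs of the mask character (assumed to be X) from both ends."""
    return seq.strip(mask_char.upper() + mask_char.lower())


if __name__ == "__main__":
    read, scores = trim_low_quality_ends("ACGTACGT", [5, 10, 30, 40, 40, 25, 12, 8])
    print(read, scores)
    print(strip_masked_ends("XXXXACGTNACGTXX"))
```

A read whose bases all fall below the threshold comes back empty, so it would then be dropped by the length rule applied during sequence editing.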
Rules for manual sequence editing:
- Remove any sequence less than 100 bases
- Remove control sequences
- Remove sequences with microsatellite repeats only
- Trim long poly A sequences (> 100 bases) to less than 50 bases
- Remove sequences less than 100 bases after removal of poly A tail
- Remove all sequences with a poly T run of more than 40 bases at the beginning of the read
- Remove any read in which more than 35% of the base calls are below phred score 20
- Check and remove rRNA sequence contamination
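Several of the rules above are mechanical enough to sketch in code. The Python function below is a hedged illustration only, not part of the in-house package: its name and constants are hypothetical, it reads "less than 50 bases" as keeping 49 A's, it treats only a trailing run of at least 10 A's as a poly A tail (an assumption), and it leaves out the rules that need outside knowledge (control sequences, microsatellite-only reads, rRNA contamination).

```python
MIN_LENGTH = 100           # rule: remove sequences shorter than 100 bases
LONG_POLY_A = 100          # rule: poly A runs longer than this are trimmed
POLY_A_KEEP = 49           # assumption: "less than 50 bases" read as 49 A's
MAX_LEADING_POLY_T = 40    # rule: remove reads starting with > 40 T's
MAX_LOW_Q_FRACTION = 0.35  # rule: remove reads with > 35% of calls below Q20
MIN_Q = 20
MIN_TAIL = 10              # assumption: shortest trailing A run treated as a tail


def apply_editing_rules(seq, quals):
    """Return the edited (seq, quals), or None if the read should be removed."""
    if len(seq) != len(quals):
        raise ValueError("sequence and quality list must have the same length")
    seq = seq.upper()
    if len(seq) < MIN_LENGTH:
        return None
    # Fraction of base calls below the phred threshold.
    low_q = sum(1 for q in quals if q < MIN_Q)
    if low_q / len(quals) > MAX_LOW_Q_FRACTION:
        return None
    # Length of the run of T's at the start of the read.
    leading_t = len(seq) - len(seq.lstrip("T"))
    if leading_t > MAX_LEADING_POLY_T:
        return None
    # Length of the run of A's at the end of the read.
    tail_len = len(seq) - len(seq.rstrip("A"))
    if tail_len >= MIN_TAIL:
        body_len = len(seq) - tail_len
        if body_len < MIN_LENGTH:
            return None
        if tail_len > LONG_POLY_A:
            keep = body_len + POLY_A_KEEP
            seq, quals = seq[:keep], quals[:keep]
    return seq, quals
```

Returning None for rejected reads keeps the decision separate from any file handling, so the same function could be applied to reads from either sequencing machine.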