Protocol for Processing Large-Scale Wheat EST Data


This protocol was developed at Albany, CA. It uses phred and cross_match, both from the University of Washington, Seattle, together with Perl scripts written in house. The scripts are available at the Perl scripts site.

  1. Step 1 : Converting raw sequence data into fasta format. As raw sequence data emerge from the sequencing machine, either Beckman's CEQ2000 or Applied Biosystems' ABI3700, phred analysis converts them into three files: a fasta file, a quality score file and a histogram file. Phred is a base-calling program developed at the University of Washington, Seattle.
  2. Step 2 : Trimming bases with low quality scores at both ends of a sequence read. This is done with a Perl script called phredclean.pl (or sweep1.pl). Bases with a phred quality score below 20, that is, an estimated base-call error rate above 1%, are trimmed off at this step. (Note: newer versions of phred can themselves remove low-quality sequence with phred score < 20.)
  3. Step 3 : Screening vector sequence. The reads are compared against a file containing the cloning vector sequence using the cross_match program, which finds and masks the vector sequence.
  4. Step 4 : Vector sequence removal. A Perl script called cmclean.pl (or sweep2.pl) removes the masked vector sequence.
  5. Step 5 : Sequence editing. In this step undesirable sequences are removed manually. A set of guidelines has been developed to keep the editing process consistent. These guidelines will be incorporated into the package so that sequence editing can be fully automated in the future.
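
The trimming in Step 2 and the vector removal in Step 4 can be sketched as follows. This is a minimal Python illustration, not the in-house phredclean.pl or cmclean.pl scripts, whose exact logic is not described here. It assumes that quality scores are available as one integer per base, that masked vector bases appear as the letter X, and that keeping the longest unmasked stretch is an acceptable way to remove vector. All function names are hypothetical.

```python
def phred_to_error(q):
    """Phred definition: Q = -10 * log10(P), so P = 10 ** (-Q / 10)."""
    return 10 ** (-q / 10)


def trim_low_quality_ends(seq, quals, min_q=20):
    """Drop bases from both ends until a base with quality >= min_q is met.

    Assumption: trimming stops at the first acceptable base on each side;
    low-quality bases in the interior of the read are left in place.
    """
    start, end = 0, len(seq)
    while start < end and quals[start] < min_q:
        start += 1
    while end > start and quals[end - 1] < min_q:
        end -= 1
    return seq[start:end], quals[start:end]


def longest_unmasked_run(seq, mask="X"):
    """Return the longest stretch of bases not masked as vector.

    Assumption: vector bases were replaced by `mask` in the screening step.
    """
    best_start, best_len = 0, 0
    run_start = None
    # A trailing mask character acts as a sentinel that closes the last run.
    for i, base in enumerate(seq + mask):
        if base.upper() != mask:
            if run_start is None:
                run_start = i
        elif run_start is not None:
            if i - run_start > best_len:
                best_start, best_len = run_start, i - run_start
            run_start = None
    return seq[best_start:best_start + best_len]
```

With a threshold of 20, phred_to_error(20) gives the 1% error rate quoted in Step 2. A read whose bases all fall below the threshold comes back empty from trim_low_quality_ends and would then be discarded by the length rule in Step 5.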

    Rules for manual sequence editing:
  1. Remove any sequence less than 100 bases
  2. Remove control sequences
  3. Remove sequences with microsatellite repeats only
  4. Trim long poly A sequences (> 100 bases) to less than 50 bases
  5. Remove sequences less than 100 bases after removal of poly A tail
  6. Remove all sequences with a poly T run of > 40 bases at the beginning of the read
  7. Remove any sequence in which over 35% of the base calls are below phred score 20
  8. Check and remove rRNA sequence contamination
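
As an illustration of how some of these rules might be automated, the sketch below applies rules 1, 4, 5, 6 and 7 to a single read. It is a hedged Python example, not part of the existing package. Rules 2, 3 and 8 are left out because they need information not given here (the control sequences, a definition of "microsatellite repeats only", and an rRNA reference set). The function name edit_read and its parameters are hypothetical. Two further assumptions: a long poly A run is cut to 49 bases, and for rule 5 any trailing run of A is treated as the poly A tail.

```python
import re


def edit_read(seq, quals, min_len=100, min_q=20):
    """Return the edited sequence, or None if the read should be removed."""
    seq = seq.upper()

    # Rule 1: remove any sequence shorter than 100 bases.
    if len(seq) < min_len:
        return None

    # Rule 7: remove reads with over 35% of base calls below phred 20.
    low = sum(1 for q in quals if q < min_q)
    if quals and low / len(quals) > 0.35:
        return None

    # Rule 6: remove reads that start with more than 40 T bases.
    if re.match(r"T{41,}", seq):
        return None

    # Rule 4: trim poly A runs longer than 100 bases to fewer than 50
    # (49 is an assumed target length).
    seq = re.sub(r"A{101,}", "A" * 49, seq)

    # Rule 5: remove reads shorter than 100 bases once the poly A tail
    # is ignored (assumption: every trailing A belongs to the tail).
    if len(seq.rstrip("A")) < min_len:
        return None

    return seq
```

A read that passes comes back with only its poly A runs shortened; the quality scores are not adjusted to match, so a fuller implementation would need to trim them in parallel.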