PorToL Lifecycle

Lifecycle of a PorToL Project Specimen

The “Wet” Phase

  1. Collection The first step in sequencing a sponge specimen is to collect it. Specimens are collected around the world and brought back to DNA sequencing labs (e.g. the Smithsonian Institution).
  2. Sample Preparation Some portion of the sample's DNA is chosen for sequencing. This could be a particular gene, the mitochondrial DNA, or another region of the specimen DNA. The sample is then prepared for a run in a sequencing machine. The sequencing machines have a number of limitations. First, they require many copies of a sample in order to detect the base sequence, so PCR is used to make millions of copies of the sample. Second, they can only sequence relatively short fragments of DNA, on the order of 300-500 base pairs, before the quality becomes too poor, so the sample is broken into smaller fragments. Finally, a primer or vector has to be introduced which provides a known sequence at a reliable location, so that the sequencing process has a known starting point and later processing has a known signature to look for when doing the alignment. Sometimes a cloning vector is used (for example, a plasmid grown in E. coli) and the sample is spliced into the vector at a known location. In other instances, a primer is used which binds to known locations in DNA sequences.
  3. Sequencer Run The sample is run through the sequencer. The sample is spread across a tray of wells. Each well is aligned with a capillary tube, and the sample solution is drawn into the capillary tube. This process is similar to chromatography in that fragments travel through the tubes at different rates based on their size, thereby separating the sample by fragment length. Each tube is scanned with a laser, which distinguishes the four possible bases at each location in the sequence by the fluorescent dye attached to each one. The sequencer creates an output stream which represents the signal strength of each of the four bases at each location. A graphical representation looks like: (IMAGE GOES HERE) This data is stored in a file format which is particular to the model of sequencer (e.g. ABI format).

At this point, we have a series of files which represent a set of related DNA fragments (which all came from the same initial sample). This marks the end of the “wet lab” portion of the process. Now we begin the computational portion of the process, and our starting point is typically a directory full of sequencer data files from a single run (a directory of fragments which were derived from a single sample).
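
To make the sequencer output described in step 3 concrete, here is a minimal sketch of what "signal strength of each of the four bases at each location" could look like once loaded into memory. The list-of-dicts layout, the function name call_strongest, and the sample numbers are all assumptions made for illustration; a real ABI file is parsed by dedicated tools, and picking the strongest channel is only the most naive possible reading of a trace, not what Phred does.

```python
BASES = "ACGT"

def call_strongest(trace):
    """For each position, return (base, share), where base is the channel with
    the highest signal and share is its fraction of the total signal."""
    calls = []
    for signals in trace:
        base = max(BASES, key=lambda b: signals[b])
        total = sum(signals[b] for b in BASES)
        share = signals[base] / total if total else 0.0
        calls.append((base, share))
    return calls

# Hypothetical trace: one dict of channel intensities per position.
trace = [
    {"A": 900, "C": 30, "G": 50, "T": 20},     # clean peak
    {"A": 40, "C": 60, "G": 850, "T": 50},     # clean peak
    {"A": 300, "C": 280, "G": 310, "T": 110},  # noisy, ambiguous position
]

calls = call_strongest(trace)
sequence = "".join(base for base, _ in calls)
print(sequence)
for base, share in calls:
    print(base, round(share, 2))
```

The third position shows why a confidence measure matters: a base is still reported, but it carries less than half of the total signal, which is the kind of location the quality scoring step is meant to flag.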

The Computational Phase

  1. Base Calling and Quality Scoring (Phred) At this point in the process, we have a set of sequencer data files which contain signal strengths for each of the four bases at each location in each fragment of the original sample (remember, we chopped the original sample into fragments 300-500 base pairs long). Base calling is the process of looking at each location within each fragment and deciding whether it is an A, C, G or T. Quality scoring is a statement of confidence that the base call for that location is correct. This data is output into one file which holds the called bases and another which holds the quality scores (there may also be a combined file format; that is unknown at the moment).
  2. Trimming Ends Trimming is the process of removing the portions of the called sequences whose quality is considered too poor to be accurate. It can also be performed by Phred as a post-processing step. The threshold for this determination is configurable, and it is often useful to try various thresholds. Generally there is a small segment (5-10 bases) at the beginning of a sequence which needs to be trimmed due to noise in the sequencer process, and then a long tail which needs to be trimmed (starting anywhere from base 150 to base 500) because the further along the sequence the sequencer goes, the less accurate the results become due to reduced amplitude of sample. (Is that correct?)
  3. Vector Removal (cross_match) This step removes the portions of the sequences that can be attributed to the vector or the primer which we introduced in the sequencing process. For example, if a cloning vector grown in E. coli was used, we would expect to find fragments of vector DNA in these sequences. cross_match consults a file of known vector sequences and looks for them within the output files of the Phred process.
  4. Contig Assembly (Phrap) Now we are ready to try to fit all of the cleaned-up fragment sequences into a single, much longer best-fit sequence. This is accomplished by comparing all of the sequences to each other and finding areas of overlap. If the quality of the samples is high enough, it will be possible to arrange most or all of the fragments into one long “consensus contig” by joining them at their overlapping sections. We can also generate quality scores for the consensus sequence at this step to determine an overall quality for this specimen.
  5. Primer Removal If a primer was used as a starting point for the sequencer process, we now need to remove the portion of the consensus contig which corresponds to the primer. Phrap consults a file of known primer sequences, then looks for and removes those sequences from the consensus contig.
  6. Compare against GenBank (BLAST) Now we have a consensus contig for a particular gene for a particular specimen. Compare this to the GenBank data for sponges to see if we get a good match to an already known species.
  7. Compare against PorToL (BLAST) Repeat the BLAST against PorToL project data to determine if this specimen has been previously cataloged by a member of the PorToL project but not yet published to GenBank.
  8. Align contigs (MUSCLE) Align this specimen’s contig with that of other known sponge specimens.
  9. Create tree (GARLI) Build a phylogenetic tree from the alignment using GARLI, with the specimen’s contig placed alongside the known sponge contigs.
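
The quality, trimming and assembly steps above (steps 1, 2 and 4) can be sketched in a few lines. This is an illustration under stated assumptions, not the algorithm Phred or Phrap actually uses: the function names, the threshold of 20, the minimum overlap of 4 and the sample reads are all made up for the example. The one fixed fact is the definition of a Phred quality score, Q = -10 * log10(P), where P is the estimated probability that the base call is wrong.

```python
def error_probability(q):
    """Convert a Phred quality score Q into an error probability, 10^(-Q/10)."""
    return 10 ** (-q / 10)

def trim_by_quality(bases, quals, threshold=20):
    """Keep the span from the first to the last base whose quality meets the
    threshold. A deliberately simple rule, not Phred's own trimming method."""
    keep = [i for i, q in enumerate(quals) if q >= threshold]
    if not keep:
        return "", []
    start, end = keep[0], keep[-1] + 1
    return bases[start:end], quals[start:end]

def merge_on_overlap(left, right, min_overlap=4):
    """Join two reads where a suffix of `left` exactly matches a prefix of
    `right`, trying the longest overlap first. Returns None if none is found.
    Real assemblers tolerate mismatches and weigh quality scores."""
    for size in range(min(len(left), len(right)), min_overlap - 1, -1):
        if left.endswith(right[:size]):
            return left + right[size:]
    return None

# Hypothetical read with poor quality at both ends.
bases = "NNACGTACGTTTGANN"
quals = [5, 8, 30, 35, 40, 40, 38, 36, 35, 30, 28, 25, 22, 20, 9, 4]

trimmed, trimmed_quals = trim_by_quality(bases, quals, threshold=20)
contig = merge_on_overlap(trimmed, "TTGACCGGA")

print(error_probability(20))
print(trimmed)
print(contig)
```

Raising or lowering the threshold argument is the code-level equivalent of trying various trimming thresholds: a stricter value gives shorter but more trustworthy reads, which in turn leaves less overlap for the assembly step to work with.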