Prophage Finder Help
This web application was designed to give researchers a tool for quickly predicting potential prophage loci in prokaryotic genome sequences. It does not predict whether an identified prophage is functional, and it is important to note that the identified region will most likely not represent the entire prophage.
Prophage Finder first uses BLASTX to compare the input DNA sequences to a database of predicted proteins from all phage genomes sequenced as of April 2005, available at http://www.ncbi.nlm.nih.gov/genomes/static/phg.html. The BLAST results are then processed by a Perl program to identify potential prophage. This Perl program predicts prophage by clustering the BLAST hits based on the number of base pairs between them and then reporting the clusters greater than a specified size.
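The clustering step described above can be sketched in a few lines of Python. This is only an illustration of distance-based clustering, not the actual Perl program: the function name, the `max_gap` and `min_hits` parameters, their values, and the reading of "size" as a number of hits are all assumptions. The first eight coordinate pairs are the hits listed for >PredictedProphage1 in the Summary example of this tutorial; the ninth is made up.

```python
from typing import List, Tuple

Hit = Tuple[int, int]  # (left, right) coordinates of one BLAST hit on the input sequence


def cluster_hits(hits: List[Hit], max_gap: int, min_hits: int) -> List[List[Hit]]:
    """Group hits separated by at most max_gap bp; keep clusters with >= min_hits hits."""
    clusters: List[List[Hit]] = []
    current: List[Hit] = []
    current_right = 0
    for left, right in sorted(hits):
        # Start a new cluster when the gap to the previous cluster end is too large.
        if current and left - current_right > max_gap:
            clusters.append(current)
            current = []
        current.append((left, right))
        current_right = right if len(current) == 1 else max(current_right, right)
    if current:
        clusters.append(current)
    return [c for c in clusters if len(c) >= min_hits]


example_hits = [
    (418447, 418605), (418623, 419117), (419156, 419377), (419836, 420606),
    (421931, 422311), (422344, 422523), (422532, 423356), (423683, 424462),
    (900000, 900300),  # hypothetical isolated hit, not from the tutorial data
]
# max_gap and min_hits are illustrative values, not Prophage Finder defaults.
clusters = cluster_hits(example_hits, max_gap=5000, min_hits=5)
for c in clusters:
    print(c[0][0], max(r for _, r in c), len(c))  # prints: 418447 424462 8
```

The isolated hit forms a cluster of one and is dropped, while the eight neighbouring hits are reported as a single region spanning 418447 to 424462.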
A. Input
1. For this tutorial you will need to download an example sequence file called rsph2.txt. Download the file by clicking here and save it on your computer.
2. After downloading the file, open the Prophage Finder homepage and select the file rsph2.txt as your upload file. Next, check the box for tRNAscan, enter your email address, and then click on the submit button.
3. After you click the submit button and the DNA sequence has been uploaded, you will be notified that your request has been submitted and that your results will be sent to you via email. For this tutorial, you should receive your results within an hour.
4. You can either wait for your results or continue with the results files provided below. There should be seven output files.
| BlastOut | Summary | DNASeqs | GeneSeqs | ProtSeqs | Complete | tRNAscan |
B. Output
The BlastOut and tRNAscan output files are standard BLAST and tRNAscan output files; more information on them can be found at their respective websites. Five additional output files are provided by Prophage Finder.
ProphageFinderDNASeqs.txt - This file contains the entire predicted DNA sequence for each of the predicted prophage based upon coordinates from BLAST hit clusters. In the example below, the input DNA sequence was named rsph2.txt (located in the temp directory) and this file contained a DNA sequence with the FASTA header >Contig2.
Example:
Sequence file: temp/rsph2.txt
Sequence Header: >Contig2
>PredictedProphage1
TGGATCGATCCCGAGCTGAACACCGAGGCCACGATGAAGGCCGGGAAGCTCTTTCTCGACTTCGACATCGAGCC...
>PredictedProphage2
GGCAAGGACGAGGTTCGTCGCGGACTCGCGACCGTTGCGGTCGGTTTCCATGTCGTACGTCAGGGCCTGACCGT...
>PredictedProphage3
GCCGAAGGCGCCGAGAAGCTGGGGCGCCTGCTGGGTGAAGGCGATCATCGCCGACTGACCCGAGGCGACCTGCA...
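If you want to work with the predicted prophage sequences in a script, a file like the one above can be read with a few lines of Python. This is a minimal sketch that assumes each `>` header sits on its own line and that the `Sequence file:` and `Sequence Header:` lines can be skipped; the function name is hypothetical, and the sample text uses shortened sequences rather than the full output.

```python
from typing import Dict, Optional

PREAMBLE_PREFIXES = ("Sequence file:", "Sequence Header:")


def read_predicted_prophage(text: str) -> Dict[str, str]:
    """Return {record name: sequence} from FASTA-style text, skipping preamble lines."""
    records: Dict[str, str] = {}
    name: Optional[str] = None
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith(PREAMBLE_PREFIXES):
            continue
        if line.startswith(">"):
            name = line[1:]
            records[name] = ""
        elif name is not None:
            records[name] += line  # sequence may be wrapped over several lines
    return records


# Shortened sample for illustration only; real files contain the full sequences.
sample = """Sequence file: temp/rsph2.txt
Sequence Header: >Contig2
>PredictedProphage1
TGGATCGATCCC
GAGCTGAACACC
>PredictedProphage2
GGCAAGGACGAG
"""
records = read_predicted_prophage(sample)
for name, seq in records.items():
    print(name, len(seq))
```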
ProphageFinderGeneSeqs.txt - This file contains the DNA sequences of all of the gene regions with BLAST hits for each prophage. In the example below, >PredictedProphage1 has 8 hits. The DNA sequence is listed for each of the 8 hits. They do not all start with the typical ATG start codon or end with stop codons because BLAST hits usually do not cover the entire length of a gene, and this data is parsed from the BLAST report.
Example:
Sequence file: temp/rsph2.txt
Sequence Header: >Contig2
>PredictedProphage1-Hit1
TGGATCGATCCCGAGCTGAACACCGAGGCCACGATGAAGGCCGGGAAGCTCTTTCTCGACTTCGACATCGAGCCG...
>PredictedProphage1-Hit2
ATGGCCCTTCCGCGCACGATCAGGAACTTCAACGCCTTCGTCGACGGGGTGAGCTACTTCGGCATCGTCGACGAG...
>PredictedProphage1-Hit3
CAGCCGCTGAAGCGAGCCTCGGGCGACATCGCGAGCGTCACGGTGCGCAAGCCTGACGTCGGCAGTCTGCGCGGC...
>PredictedProphage1-Hit4
TTCCTGCGCCCGGCGGCCGAGTTCGAGCGCTTCCGGGTGCAGCTGACCAATCTCGAAGGCTCGGCCGAGGGCGCC...
>PredictedProphage1-Hit5
GCCAGCCTCGTGATGATGGCGCTCGGCAGCTTCCGCTTCGGCGTGAACCGGGCGGGCTACCAGAGCTTCAGCCGA...
>PredictedProphage1-Hit6
CTCGCCGGCGACATGCTGGATGCGATCTGCAAGGCCCGGCTCGGGTCCGAGCGGCATGTACCGGCGGTGCTGGCG...
>PredictedProphage1-Hit7
ATGATCCCGGCCTTCCGGCTGACCGTCGATGGCGAGGACGCGACCGGTGCCGTGGCCGACCGGCTCCTGAGCCTC...
>PredictedProphage1-Hit8
CAGCCTGTGGCGCCCTGGATCGGCGGCAAGCGCAACCTCGCCCGCCGGATCTGCGCCATCCTCGACCGCAGCCCC...
ProphageFinderProtSeqs.txt - This file contains the amino acid sequences for the predicted proteins based on BLAST hits for each prophage. In the example below, the protein sequences for the 8 hits within >PredictedProphage1 are listed. For the same reason that the genes do not all start with ATG, the proteins do not all start with methionine (M).
Example:
Sequence file: temp/rsph2.txt
Sequence Header: >Contig2
>PredictedProphage1-Hit1
WIDPELNTEATMKAGKLFLDFDIEPPAPLEHLTLQAHRNGDYYEELVLSVTGN
>PredictedProphage1-Hit2
MALPRTIRNFNAFVDGVSYFGIVDEGKLPAVKIQTEAHRGAGMDGPIGIDVGMEALASKMSFSEWVPAIVKKLGR...
>PredictedProphage1-Hit3
QPLKRASGDIASVTVRKPDVGSLRGLKLTDILQMDVTALSRLLPRITEPALLPDEVAALDPADFLGLSAAVVGF
>PredictedProphage1-Hit4
FLRPAAEFERFRVQLTNLEGSAEGADRALRWIEDFATRTPLQLNDTVAAYARLKAFGLDPTKGAMQALVDTMAAT...
>PredictedProphage1-Hit5
ASLVMMALGSFRFGVNRAGYQSFSRSASWRWEAQDRLGRAPALQYLGPGSDEITLEGVIYPHFRGGLRQVELMRL...
>PredictedProphage1-Hit6
LAGDMLDAICKARLGSERHVPAVLAANPHLAALGSVYPAGVLITLPEVAEPVATGQIRLW
>PredictedProphage1-Hit7
MIPAFRLTVDGEDATGAVADRLLSLVITDEDGTKADRLEIELDDRDGRLAFPDTEARIEVALGFAGQPLAAMGVF...
>PredictedProphage1-Hit8
QPVAPWIGGKRNLARRICAILDRSPCLTYAEPFVGMAGIFLRRSSRPRAEVINDRGRDVANLFRILQRHYPQFLD...
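To see how a GeneSeqs entry relates to its ProtSeqs entry, the visible part of the Hit1 DNA sequence can be translated with the standard genetic code. The sketch below is an illustration only: Prophage Finder takes its protein sequences from the BLAST report, and the function name here is hypothetical. It translates the forward strand in the first reading frame, so a hit on the reverse strand would first need to be reverse-complemented.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code with codons ordered TTT, TTC, TTA, TTG, TCT, ... ('*' = stop).
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(dna: str) -> str:
    """Translate a DNA string in frame 1; unknown codons become 'X', a partial last codon is ignored."""
    dna = dna.upper()
    usable = len(dna) - len(dna) % 3
    return "".join(CODON_TABLE.get(dna[i:i + 3], "X") for i in range(0, usable, 3))


# Visible (truncated) start of >PredictedProphage1-Hit1 from the GeneSeqs example.
hit1_dna = "TGGATCGATCCCGAGCTGAACACCGAGGCCACGATGAAGGCCGGGAAGCTCTTTCTCGACTTCGACATCGAGCCG"
hit1_protein = translate(hit1_dna)
print(hit1_protein)  # prints: WIDPELNTEATMKAGKLFLDFDIEP
```

The result matches the beginning of the >PredictedProphage1-Hit1 protein above, which starts with tryptophan (W) rather than methionine because the hit begins inside the gene.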
ProphageFinderSummary.txt - This is the Prophage Finder summary output file.
It contains several pieces of information. First, it shows information about the
input sequence including its size, GC content, and the predicted GC content at first, second, and third codon positions
(based on the GC content and equations provided by Lawrence and Ochman.
1997. Amelioration of bacterial genomes: rates of change and exchange.
J Mol Evol. 44(4):383-97).
Second, this summary file shows information about each predicted prophage. This
information is displayed in the following order: length (in base pairs), number of hits, GC
content, GC of the coding regions, GC of each codon position in the predicted
prophage genes, and description of each top BLAST hit in the predicted prophage
with associated coordinates.
Example:
Prophage Finder Output
Sequence file: temp/rsph2.txt
Sequence Header: >Contig2
Sequence size: 943022
Sequence GC content: 69.01
Predicted Codon Position GC: 1st: 69.34 2nd: 45.33 3rd: 84.47

>PredictedProphage1
Length: 6016
Number of hits = 8
GC      Coding  First   Second  Third
71.21   70.47   69.63   47.99   93.78
Left    Right   Strand  Best Blast Hit  Score   Evalue
418447  418605  1       ref|NP_046778.1|gpFI    50.8bits(120)   2e-04
418623  419117  1       ref|NP_052270.1|similar to P2 tail tube protein FII, Swiss-Prot Accession Number P22502 64.7bits(156)   1e-08
419156  419377  1       ref|NP_490623.1|hypothetical protein    76.6bits(187)   3e-12
419836  420606  1       ref|NP_050646.1|putative tape measure protein   80.1bits(196)   3e-13
421931  422311  1       ref|NP_758935.1|ORF44   116bits(291)    3e-24
422344  422523  1       ref|NP_758937.1|ORF46   169bits(427)    5e-40
422344  422523  1       ref|NP_878205.1|gpX     52.4bits(124)   6e-05
423683  424462  1       ref|YP_164085.1|Dam modification methylase      219bits(559)    2e-55
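The GC columns in the summary are simple base counts. The following sketch shows how overall GC content and GC content at each codon position could be computed for an in-frame coding sequence. It is an assumption-labeled illustration with hypothetical function names: it does not reproduce the Lawrence and Ochman equations used for the predicted codon-position values, and the 12-base input is just the start of >PredictedProphage1-Hit2, so the numbers are not comparable to the summary table.

```python
from typing import Tuple


def gc_percent(seq: str) -> float:
    """Percentage of G and C bases in seq (0.0 for an empty sequence)."""
    seq = seq.upper()
    if not seq:
        return 0.0
    return 100.0 * sum(base in "GC" for base in seq) / len(seq)


def codon_position_gc(coding_seq: str) -> Tuple[float, float, float]:
    """GC percentage at the 1st, 2nd and 3rd codon positions of an in-frame sequence."""
    usable = len(coding_seq) - len(coding_seq) % 3  # ignore a trailing partial codon
    first, second, third = (gc_percent(coding_seq[pos:usable:3]) for pos in range(3))
    return first, second, third


# First four codons of >PredictedProphage1-Hit2 (ATG GCC CTT CCG), for illustration.
fragment = "ATGGCCCTTCCG"
overall = gc_percent(fragment)
by_position = codon_position_gc(fragment)
print(round(overall, 2), by_position)  # prints: 66.67 (75.0, 50.0, 75.0)
```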
ProphageFinderComplete.txt - This file is the complete Prophage Finder results file. It contains all of the information from the other four files plus one piece of additional information. This addition is a codon usage profile which contains the following information: amino acid, codon, total number of occurrences of the codon in the prophage, percent usage of the codon for the amino acid, and number of occurrences of the codon per 1000 codons.
Example of Codon Usage Profile:
Codon Usage:
AA    Codon   Total   Percent   #per1000
Leu   TTG     9       0.11      10.34
Leu   TTA     42      0.53      48.28
Leu   CTG     3       0.04      3.45
Leu   CTA     6       0.07      6.90
Leu   CTT     16      0.20      18.39
Leu   CTC     4       0.05      4.60
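The two derived columns can be checked by hand. In the example, the Percent column behaves as a fraction of all codons for that amino acid (the six Leu codons total 80), and #per1000 is the codon count scaled to 1000 codons. The sketch below recomputes both; the total of 870 codons is an assumption inferred from the #per1000 column, not a value stated in the output file, and the function name is hypothetical.

```python
from typing import Dict, List, Tuple

# Counts copied from the Leu rows of the example codon usage profile.
leu_counts: Dict[str, int] = {"TTG": 9, "TTA": 42, "CTG": 3, "CTA": 6, "CTT": 16, "CTC": 4}
TOTAL_CODONS = 870  # assumption: inferred from the #per1000 column, not given in the file


def usage_rows(counts: Dict[str, int], total_codons: int) -> List[Tuple[str, int, float, float]]:
    """Return (codon, count, fraction within the amino acid, occurrences per 1000 codons)."""
    aa_total = sum(counts.values())
    return [
        (codon, n, n / aa_total, 1000.0 * n / total_codons)
        for codon, n in counts.items()
    ]


rows = usage_rows(leu_counts, TOTAL_CODONS)
for codon, n, fraction, per_1000 in rows:
    print(f"Leu {codon} {n} {fraction:.2f} {per_1000:.2f}")
```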
5. The primary output file that we will look at in this tutorial is the Summary file. Open this file by clicking on the Summary link above. As you can see, Prophage Finder has predicted three potential prophage in this sequence. One of the most important characteristics of a predicted prophage is its number of hits: the higher the number of hits, the more likely it is that the predicted prophage is an actual one. With very few exceptions, predicted prophage with more than ten hits are actual prophage. All three of the predicted prophage in this example have seven or eight hits, which makes it rather likely that they are actual prophage. It is also important to note that a single prophage locus can be detected as multiple clusters, and a single cluster can represent multiple prophage loci.
6. Important words to look for in the hit description include: tail, integrase, portal, protease, capsid, terminase, tape measure, methylase, methyltransferase, packaging, and helicase. Examination of the three predicted prophage shows that each of them contains at least two hits with these words in them. This provides support for these prophage because these words represent genes that are important in phage biology. It is important to note that the hits presented in the Summary file are the best hits for each region of the prophage, and there may be weaker hits for a region that correspond to one of these genes. Also, many of the genes present in the database have not yet been characterized and may be equally important.
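Scanning hit descriptions for these words is easy to automate. The sketch below is one possible way to do it, with a hypothetical function name; the keyword list is the one given in step 6, and the descriptions are taken from the >PredictedProphage1 rows of the Summary example.

```python
import re
from typing import List

KEYWORDS = [
    "tail", "integrase", "portal", "protease", "capsid", "terminase",
    "tape measure", "methylase", "methyltransferase", "packaging", "helicase",
]


def phage_keywords(description: str) -> List[str]:
    """Return the keywords that occur as whole words in a BLAST hit description."""
    text = description.lower()
    return [kw for kw in KEYWORDS if re.search(r"\b" + re.escape(kw) + r"\b", text)]


descriptions = [
    "similar to P2 tail tube protein FII, Swiss-Prot Accession Number P22502",
    "hypothetical protein",
    "putative tape measure protein",
    "Dam modification methylase",
]
supporting = {d: phage_keywords(d) for d in descriptions}
for description, found in supporting.items():
    print(found, "<-", description)
```

Whole-word matching is used so that, for example, "detail" would not count as "tail". Uncharacterized hits such as "hypothetical protein" return an empty list, which does not mean they are unimportant.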
7. One thing to be careful of when looking at the hits is gene duplication. An example of gene duplication is in >PredictedProphage2: the first two hits are to the exact same gene. When trying to determine whether a predicted prophage is real, it is best to subtract any duplicates from the total number of hits for that prophage. In this case that would bring the number of hits for >PredictedProphage2 down to seven, which is still a fairly good number. In some cases, there will be four or more copies of the same gene present, which often results in a false positive.
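Counting hits without duplicates can be sketched as follows. The example assumes that two hits are duplicates when they share the same database accession (the second field of labels such as ref|NP_046778.1|gpFI); that criterion and the function names are assumptions. The hit list is hypothetical, because the hits of >PredictedProphage2 are not reproduced in this help page.

```python
from typing import List


def accession(hit_label: str) -> str:
    """Extract the accession from a label like 'ref|NP_046778.1|gpFI'."""
    parts = hit_label.split("|")
    return parts[1] if len(parts) > 1 else hit_label


def unique_hit_count(hit_labels: List[str]) -> int:
    """Number of hits after collapsing repeated accessions into one."""
    return len({accession(label) for label in hit_labels})


# Hypothetical hit list with one duplicated gene (the first two entries).
hits = [
    "ref|NP_046778.1|gpFI",
    "ref|NP_046778.1|gpFI",
    "ref|NP_490623.1|hypothetical protein",
    "ref|NP_050646.1|putative tape measure protein",
]
print(len(hits), unique_hit_count(hits))  # prints: 4 3
```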
8. In some cases GC content can also provide support for a predicted prophage being real. In this example, all three of the prophage have a GC content within 5% of the host sequence, and GC content has been shown to be a poor indicator of prophage. However, greater deviations may be found in other genome sequences and may help provide insight into potential prophage there.
9. Codon usage data is provided in the Complete output file and can be compared with genome codon usage statistics available through a link near the bottom of the homepage. Again, significant deviations between prophage and genome data may support assignment of prophage loci.
10. It is observed that prophage are often inserted into genome sequences next to tRNA genes. The tRNAscan output can be used to determine if the predicted prophage are near tRNA genes. When comparing the tRNA locations to the location of the predicted prophage, it is important to realize the predicted prophage most likely represents only a portion of the actual prophage. In the case of our example, >PredictedProphage1 and >PredictedProphage3 are potentially inserted next to tRNAs 9 and 11 respectively.
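Comparing tRNA locations with a predicted prophage amounts to an interval distance check. The sketch below is a hypothetical illustration: the tRNA names and coordinates are made up (the tRNAscan output for rsph2.txt is not reproduced here), and the 5000 bp window is an arbitrary choice. Only the prophage coordinates come from the >PredictedProphage1 Summary example. Because the predicted region usually covers only part of the actual prophage, a generous window is reasonable.

```python
from typing import Dict, List, Tuple

Interval = Tuple[int, int]  # (left, right) coordinates on the input sequence


def distance_between(region: Interval, feature: Interval) -> int:
    """Gap in bp between two intervals; 0 if they overlap."""
    (r_left, r_right), (f_left, f_right) = region, feature
    if f_right < r_left:
        return r_left - f_right
    if f_left > r_right:
        return f_left - r_right
    return 0


def nearby_trnas(region: Interval, trnas: Dict[str, Interval], window: int) -> List[str]:
    """Names of tRNA genes lying within `window` bp of the region."""
    return [name for name, coords in trnas.items()
            if distance_between(region, coords) <= window]


prophage1 = (418447, 424462)  # from the Summary example
# Hypothetical tRNA coordinates, for illustration only.
trnas = {"tRNA-A": (425000, 425075), "tRNA-B": (600000, 600076)}
close = nearby_trnas(prophage1, trnas, window=5000)
print(close)  # prints: ['tRNA-A']
```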
11. Finally, are these predicted prophage real prophage? Based on the criteria above, it is likely that they are, but without further examination this cannot be determined conclusively. However, these three predicted prophage have been examined carefully and do represent actual prophage-related loci. >PredictedProphage1 is actually ~9.3 kilobase pairs long, >PredictedProphage2 is ~58.6 kilobase pairs long, and >PredictedProphage3 is 35.5 kilobase pairs long. Interestingly, all three prophage are inserted next to tRNAs.
12. Examples of weak and strong prophage predictions are provided below. The Weak example is most likely not a prophage because of its low number of hits and the fact that one of its hits is a duplicate. The Strong example is a clear indicator of a prophage-related locus because of its large number of hits and the fact that many of its hits are to important phage-related genes.
III. Downloading and Installing Prophage Finder
If you intend to run Prophage Finder with multiple sets of parameter values, it will be faster to download Prophage Finder and run it locally on your computer. The local version of Prophage Finder does not run BLAST, so you must already have a BLAST result file from running Prophage Finder on the website. The .zip file containing Prophage Finder can be downloaded by clicking here. After the .zip file has been downloaded, all of its files must be unzipped into the same directory. Files contained in the .zip file include:
ProphageFinderMulti2.pl - This is the main Perl program that must be executed to run Prophage Finder.
Prophage.pm - This Perl module is used to create Prophage objects.
ProphageSubs.pm - This Perl module contains various subroutines required for ProphageMulti2.pl to run.
Readme.txt - This file contains instructions on how to install and run Prophage Finder.
Other requirements:
Perl must be installed on your computer.
http://www.perl.com/download.csp
Bioperl must be installed on your computer.
http://bio.perl.org/Core/Latest/index.shtml
After all of these components are installed the program is ready to be run. Further instructions can be found in the Readme.txt included in the .zip file.
Michael Bose
B.S. Molecular Biology & Bioinformatics
Dr. Robert Barber
Assistant Professor of Biological Sciences
University of Wisconsin-Parkside
Altschul, Stephen F., Thomas L. Madden, Alejandro A. Schäffer, Jinghui Zhang, Zheng Zhang, Webb Miller, and David J. Lipman (1997), "Gapped BLAST and PSI-BLAST: a new generation of protein database search programs", Nucleic Acids Res. 25:3389-3402.