Prophage Finder Help

I.    Introduction

II.   Tutorial

III.  Downloading and Installing Prophage Finder

IV.  Contact Information

V.   Additional References

 

I. Introduction

    This web application was designed to give researchers a tool for quickly predicting potential prophage loci in prokaryotic genome sequences.  It does not predict whether an identified prophage is functional, and the identified prophage region will most likely not represent the entire prophage.

    Prophage Finder first uses BLASTX to compare input DNA sequences to a database of predicted proteins from all sequenced phage genomes as of April 2005, available at http://www.ncbi.nlm.nih.gov/genomes/static/phg.html. The BLAST results are then processed by a Perl program to identify potential prophage. This Perl program predicts prophage by clustering the BLAST hits based on the number of base pairs between them and then reporting the clusters greater than a specified size.
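
    The clustering idea described above can be sketched in a few lines of Python. This is only an illustration, not the Perl code that Prophage Finder runs: the names cluster_hits, max_gap and min_hits are hypothetical, the threshold values are made up for the example, and "size" is assumed here to mean the number of hits in a cluster.

```python
def cluster_hits(hits, max_gap, min_hits):
    """Group (left, right) BLAST hit coordinates into clusters.

    A hit joins the current cluster when the number of base pairs between
    it and the cluster's right edge is at most max_gap. Clusters with fewer
    than min_hits hits are discarded. Both thresholds are assumptions.
    """
    clusters, current, current_end = [], [], None
    for left, right in sorted(hits):
        if current and left - current_end - 1 > max_gap:
            clusters.append(current)
            current = []
        current.append((left, right))
        current_end = right if len(current) == 1 else max(current_end, right)
    if current:
        clusters.append(current)
    return [c for c in clusters if len(c) >= min_hits]


# Coordinates of the eight hits in the tutorial's >PredictedProphage1,
# plus one distant hit (hypothetical) that should not form a cluster.
example_hits = [
    (418447, 418605), (418623, 419117), (419156, 419377), (419836, 420606),
    (421931, 422311), (422344, 422523), (422532, 423356), (423683, 424462),
    (900000, 900300),
]
clusters = cluster_hits(example_hits, max_gap=5000, min_hits=5)
print(len(clusters), len(clusters[0]))
```

    With these made-up thresholds the eight neighbouring hits form one cluster and the isolated hit is dropped.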

II. Tutorial

A. Input

1.   For this tutorial you will need to download an example sequence file called rsph2.txt.  Download the file by clicking here and save it on your computer.

2.  After downloading the file, open the Prophage Finder homepage and select the file rsph2.txt as your upload file.  Next, check the box for tRNAscan, enter your email address, and  then click on the submit button.

3.  After you click on the submit button and the DNA sequence has been uploaded, you will be notified that your request has been submitted and that your results will be sent to you via email.  For this tutorial, you should receive your results within an hour.

4.  You can either wait for your results or continue with the results files provided below.  There should be seven output files.

BlastOut Summary DNASeqs GeneSeqs ProtSeqs Complete tRNAscan

B. Output

The BlastOut and tRNAscan output files are standard BLAST and tRNAscan output files; more information on these output files can be found at their respective websites.  Five additional output files are provided by Prophage Finder.

ProphageFinderDNASeqs.txt - This file contains the entire predicted DNA sequence for each of the predicted prophage based upon coordinates from BLAST hit clusters.  In the example below, the input DNA sequence was named rsph2.txt (located in the temp directory) and this file contained a DNA sequence with the FASTA header >Contig2.

Example:

	Sequence file:  temp/rsph2.txt
	Sequence Header:  >Contig2

	>PredictedProphage1
	TGGATCGATCCCGAGCTGAACACCGAGGCCACGATGAAGGCCGGGAAGCTCTTTCTCGACTTCGACATCGAGCC...

	>PredictedProphage2
	GGCAAGGACGAGGTTCGTCGCGGACTCGCGACCGTTGCGGTCGGTTTCCATGTCGTACGTCAGGGCCTGACCGT...
	
	>PredictedProphage3
	GCCGAAGGCGCCGAGAAGCTGGGGCGCCTGCTGGGTGAAGGCGATCATCGCCGACTGACCCGAGGCGACCTGCA...
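
The layout shown above (two "Sequence" lines followed by FASTA records) can be read with a short script. The sketch below is an assumption-laden illustration in Python: read_prophage_fasta is a hypothetical helper, it assumes the real files follow the example layout exactly, and the sample sequences are only the truncated prefixes printed above.

```python
def read_prophage_fasta(text):
    """Return (metadata, records) parsed from Prophage Finder FASTA-style text.

    metadata holds the "Sequence file" and "Sequence Header" values;
    records maps each record name (without ">") to its sequence.
    """
    metadata, records, name = {}, {}, None
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if line.startswith("Sequence file:"):
            metadata["file"] = line.split(":", 1)[1].strip()
        elif line.startswith("Sequence Header:"):
            metadata["header"] = line.split(":", 1)[1].strip()
        elif line.startswith(">"):
            name = line[1:]
            records[name] = ""
        elif name is not None:
            # Join wrapped sequence lines, in case a record spans several lines.
            records[name] += line
    return metadata, records


sample = """
Sequence file:  temp/rsph2.txt
Sequence Header:  >Contig2

>PredictedProphage1
TGGATCGATCCCGAGCTGAACACC
>PredictedProphage2
GGCAAGGACGAGGTTCGTCGCGGA
"""
metadata, records = read_prophage_fasta(sample)
print(metadata["header"], sorted(records))
```

Because the GeneSeqs and ProtSeqs files described below use the same layout in their examples, the same helper should apply to them as well.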

ProphageFinderGeneSeqs.txt - This file contains the DNA sequences of all of the gene regions with BLAST hits for each prophage.  In the example below, >PredictedProphage1 has 8 hits.  The DNA sequence is listed for each of the 8 hits.  They do not all start with the typical ATG start codon or end with stop codons because BLAST hits usually do not cover the entire length of a gene, and this data is parsed from the BLAST report.

Example:

	Sequence file:  temp/rsph2.txt
	Sequence Header:  >Contig2

	>PredictedProphage1-Hit1
	TGGATCGATCCCGAGCTGAACACCGAGGCCACGATGAAGGCCGGGAAGCTCTTTCTCGACTTCGACATCGAGCCG...
	>PredictedProphage1-Hit2
	ATGGCCCTTCCGCGCACGATCAGGAACTTCAACGCCTTCGTCGACGGGGTGAGCTACTTCGGCATCGTCGACGAG...
	>PredictedProphage1-Hit3
	CAGCCGCTGAAGCGAGCCTCGGGCGACATCGCGAGCGTCACGGTGCGCAAGCCTGACGTCGGCAGTCTGCGCGGC...
	>PredictedProphage1-Hit4
	TTCCTGCGCCCGGCGGCCGAGTTCGAGCGCTTCCGGGTGCAGCTGACCAATCTCGAAGGCTCGGCCGAGGGCGCC...
	>PredictedProphage1-Hit5
	GCCAGCCTCGTGATGATGGCGCTCGGCAGCTTCCGCTTCGGCGTGAACCGGGCGGGCTACCAGAGCTTCAGCCGA...
	>PredictedProphage1-Hit6
	CTCGCCGGCGACATGCTGGATGCGATCTGCAAGGCCCGGCTCGGGTCCGAGCGGCATGTACCGGCGGTGCTGGCG...
	>PredictedProphage1-Hit7
	ATGATCCCGGCCTTCCGGCTGACCGTCGATGGCGAGGACGCGACCGGTGCCGTGGCCGACCGGCTCCTGAGCCTC...
	>PredictedProphage1-Hit8
	CAGCCTGTGGCGCCCTGGATCGGCGGCAAGCGCAACCTCGCCCGCCGGATCTGCGCCATCCTCGACCGCAGCCCC...

 

ProphageFinderProtSeqs.txt - This file contains the amino acid sequences for the predicted proteins based on BLAST hits for each prophage.  In the example below, the protein sequences for the 8 hits within >PredictedProphage1 are listed.  For the same reason that the genes do not all start with ATG, the proteins do not all start with methionine (M).

Example:

	Sequence file:  temp/rsph2.txt
	Sequence Header:  >Contig2

	>PredictedProphage1-Hit1
	WIDPELNTEATMKAGKLFLDFDIEPPAPLEHLTLQAHRNGDYYEELVLSVTGN
	>PredictedProphage1-Hit2
	MALPRTIRNFNAFVDGVSYFGIVDEGKLPAVKIQTEAHRGAGMDGPIGIDVGMEALASKMSFSEWVPAIVKKLGR...
	>PredictedProphage1-Hit3
	QPLKRASGDIASVTVRKPDVGSLRGLKLTDILQMDVTALSRLLPRITEPALLPDEVAALDPADFLGLSAAVVGF
	>PredictedProphage1-Hit4
	FLRPAAEFERFRVQLTNLEGSAEGADRALRWIEDFATRTPLQLNDTVAAYARLKAFGLDPTKGAMQALVDTMAAT...
	>PredictedProphage1-Hit5
	ASLVMMALGSFRFGVNRAGYQSFSRSASWRWEAQDRLGRAPALQYLGPGSDEITLEGVIYPHFRGGLRQVELMRL...
	>PredictedProphage1-Hit6
	LAGDMLDAICKARLGSERHVPAVLAANPHLAALGSVYPAGVLITLPEVAEPVATGQIRLW
	>PredictedProphage1-Hit7
	MIPAFRLTVDGEDATGAVADRLLSLVITDEDGTKADRLEIELDDRDGRLAFPDTEARIEVALGFAGQPLAAMGVF...
	>PredictedProphage1-Hit8
	QPVAPWIGGKRNLARRICAILDRSPCLTYAEPFVGMAGIFLRRSSRPRAEVINDRGRDVANLFRILQRHYPQFLD...
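
The relationship between the GeneSeqs and ProtSeqs files can be checked by translating a hit's DNA sequence. The Python sketch below uses the standard genetic code and assumes the hit is reported in reading frame 1 of the listed sequence; translate is a hypothetical helper, and the input is the truncated Hit1 prefix from the GeneSeqs example.

```python
from itertools import product

# Standard genetic code, codons ordered with T, C, A, G at each position.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(dna):
    """Translate DNA in frame 1; unknown codons become 'X', '*' marks a stop."""
    dna = dna.upper()
    usable = len(dna) - len(dna) % 3  # ignore a trailing partial codon
    return "".join(
        CODON_TABLE.get(dna[i:i + 3], "X") for i in range(0, usable, 3)
    )


# Truncated prefix of >PredictedProphage1-Hit1 from the GeneSeqs example.
hit1_dna = "TGGATCGATCCCGAGCTGAACACCGAGGCCACGATGAAGGCCGGGAAGCTCTTTCTCGACTTCGACATCGAGCCG"
print(translate(hit1_dna))
```

The result begins with W rather than M, and it matches the start of the Hit1 protein listed above.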

 

ProphageFinderSummary.txt - This is the Prophage Finder summary output file. It contains several pieces of information. First, it shows information about the input sequence including its size, GC content, and the predicted GC content at first, second, and third codon positions (based on the GC content and equations provided by Lawrence and Ochman.  1997.  Amelioration of bacterial genomes: rates of change and exchange. J Mol Evol. 44(4):383-97). Second, this summary file shows information about each predicted prophage. This information is displayed in the following order: length (in base pairs), number of hits, GC content, GC of the coding regions, GC of each codon position in the predicted prophage genes, and description of each top BLAST hit in the predicted prophage with associated coordinates.

Example:

	Prophage Finder Output

	Sequence file:  temp/rsph2.txt
	Sequence Header:  >Contig2

	Sequence size:  943022
	Sequence GC content:  69.01
	Predicted Codon Position GC:  	1st:  69.34	 2nd:  45.33	 3rd:  84.47

	 >PredictedProphage1 
	 Length:   6016 
	 Number of hits =   8 
	 GC	Coding	First	Second	Third
	 71.21 	 70.47 	 69.63 	 47.99 	 93.78 
	 Left	Right	Strand	Best Blast Hit	Score	Evalue
	 418447	418605	1	ref|NP_046778.1|gpFI 	50.8bits(120)	2e-04
	 418623	419117	1	ref|NP_052270.1|similar to P2 tail tube protein FII, Swiss-Prot Accession Number P22502	64.7bits(156)	1e-08
	 419156	419377	1	ref|NP_490623.1|hypothetical protein	76.6bits(187)	3e-12
	 419836	420606	1	ref|NP_050646.1|putative tape measure protein	80.1bits(196)	3e-13
	 421931	422311	1	ref|NP_758935.1|ORF44 	116bits(291)	3e-24
	 422344	422523	1	ref|NP_878205.1|gpX 	52.4bits(124)	6e-05
	 422532	423356	1	ref|NP_758937.1|ORF46 	169bits(427)	5e-40
	 423683	424462	1	ref|YP_164085.1|Dam modification methylase	219bits(559)	2e-55
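
The hit table and GC columns in the Summary example can be explored with a small Python sketch. The helper names are hypothetical and the hit rows are assumed to be tab-separated as in the example. The GC functions compute observed values from a sequence; they do not reproduce the Lawrence and Ochman equations used for the "Predicted Codon Position GC" line.

```python
def gc_percent(seq):
    """Percentage of G and C bases in seq (0.0 for an empty sequence)."""
    seq = seq.upper()
    if not seq:
        return 0.0
    return 100.0 * (seq.count("G") + seq.count("C")) / len(seq)


def codon_position_gc(coding_seq):
    """Observed GC percent at codon positions 1, 2 and 3 of an in-frame sequence."""
    usable = len(coding_seq) - len(coding_seq) % 3
    return tuple(gc_percent(coding_seq[pos:usable:3]) for pos in range(3))


def parse_hit_row(row):
    """Split one tab-separated hit row from the Summary file into named fields."""
    left, right, strand, description, score, evalue = row.strip().split("\t")
    return {
        "left": int(left),
        "right": int(right),
        "strand": strand,
        "description": description.strip(),
        "score": score,
        "evalue": float(evalue),
    }


# First and last hit rows of >PredictedProphage1 from the example above.
rows = [
    "\t 418447\t418605\t1\tref|NP_046778.1|gpFI \t50.8bits(120)\t2e-04",
    "\t 423683\t424462\t1\tref|YP_164085.1|Dam modification methylase\t219bits(559)\t2e-55",
]
hits = [parse_hit_row(r) for r in rows]
span = max(h["right"] for h in hits) - min(h["left"] for h in hits) + 1
print(span)  # 6016, the same value as "Length" in the example
```

In this example the distance from the leftmost to the rightmost hit coordinate, counted inclusively, equals the reported prophage length.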
 

ProphageFinderComplete.txt - This file is the complete Prophage Finder results file. It contains all of the information from the other four files plus one additional item: a codon usage profile.  The profile lists the amino acid, the codon, the total number of occurrences of the codon in the prophage, the usage of the codon for that amino acid (shown as a fraction in the example below, so 0.11 corresponds to 11%), and the number of occurrences of the codon per 1000 codons.

Example of Codon Usage Profile:

	Codon Usage:
	AA	Codon	Total	Percent	#per1000
	Leu	TTG	9	0.11	10.34
	Leu	TTA	42	0.53	48.28
	Leu	CTG	3	0.04	3.45
	Leu	CTA	6	0.07	6.90
	Leu	CTT	16	0.20	18.39
	Leu	CTC	4	0.05	4.60
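
The columns of the codon usage profile can be recomputed from the counts. The Python sketch below is an illustration only: codon_usage_rows is a hypothetical helper, and the total of 870 codons is not stated in the file. It is back-calculated from the #per1000 column of the example (9 occurrences reported as 10.34 per 1000).

```python
def codon_usage_rows(counts_for_aa, total_codons):
    """Return {codon: (count, fraction_for_amino_acid, per_1000_codons)}.

    counts_for_aa maps each synonymous codon of one amino acid to its count;
    total_codons is the number of codons counted across the whole prophage.
    """
    aa_total = sum(counts_for_aa.values())
    return {
        codon: (n, n / aa_total, 1000.0 * n / total_codons)
        for codon, n in counts_for_aa.items()
    }


# Leucine counts from the example; 870 is an inferred total, not a given one.
leu = {"TTG": 9, "TTA": 42, "CTG": 3, "CTA": 6, "CTT": 16, "CTC": 4}
usage = codon_usage_rows(leu, total_codons=870)
for codon, (n, fraction, per_1000) in usage.items():
    print(f"Leu\t{codon}\t{n}\t{fraction:.2f}\t{per_1000:.2f}")
```

The six leucine counts sum to 80, so TTG's share is 9/80, which is consistent with the 0.11 shown in the Percent column.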

 

 5.  The primary output file that we are going to look at in the tutorial is the Summary file.  Open this file by clicking on the Summary link above.  As you can see, Prophage Finder has predicted three potential prophage in this sequence.  One of the most important characteristics of a predicted prophage is its number of hits.  The higher the number of hits, the more likely it is that the predicted prophage is an actual one.  Generally, predicted prophage with more than ten hits are actual prophage, with very few exceptions.  All three of the predicted prophage in this example have seven or eight hits, which makes it rather likely that they are actual prophage.  It is also important to note that a single prophage locus can be detected as multiple clusters, or a single cluster can represent multiple prophage loci.

6.   Important words to look for in the hit description include: tail, integrase, portal, protease, capsid, terminase, tape measure, methylase, methyltransferase, packaging, and helicase.  Examination of the three predicted prophage shows that each of them contains at least two hits with these words in them.  This provides support for these prophage because these words represent genes that are important in phage biology.  It is important to note that the hits presented in the Summary file are the best hits for that region of the prophage and there may be weaker hits for that region that correspond to one of these genes.  Also many of the genes present in the database have not yet been characterized and may be equally important.

7.  One thing to be careful of when looking at the hits is gene duplication.  An example of gene duplication is in >PredictedProphage2.  The first two hits are to the exact same gene.  When trying to determine if a predicted prophage is real, it is best to subtract any duplicates from the total number of hits for that prophage.  In this case that would bring the number of hits for >PredictedProphage2 down to seven, which is still a pretty good number.  In some cases, there will be four or more copies of the same gene present, which often results in a false positive.

8.  In some cases GC content can also provide support for a predicted prophage being real.  In this example all three of the prophage have a GC content within 5% of the host sequence, and GC content has in general been shown to be a poor indicator for prophage.  Nevertheless, greater deviations may be found in other genome sequences and may help provide insight into potential prophage.

9.  Codon usage data is provided in the Complete output file and can be compared with genome codon usage statistics available through a link  near the bottom of the homepage.  Again, significant deviations between prophage and genome data may support assignment of prophage loci.

10.  It is observed that prophage are often inserted into genome sequences next to tRNA genes.  The tRNAscan output can be used to determine if the predicted prophage are near tRNA genes.  When comparing the tRNA locations to the location of the predicted prophage, it is important to realize the predicted prophage most likely represents only a portion of the actual prophage.   In the case of our example,  >PredictedProphage1 and >PredictedProphage3 are potentially inserted next to tRNAs 9 and 11 respectively.

11.  Finally, are these predicted prophage real prophage?  Based on the criteria above it is likely that they are, but without further examination this cannot be determined conclusively.  However, these three predicted prophage have been examined carefully and do represent actual prophage-related loci. >PredictedProphage1 is actually ~9.3 kilobase pairs long, >PredictedProphage2 is ~58.6 kilobase pairs long, and >PredictedProphage3 is 35.5 kilobase pairs long.  Interestingly, all three prophage are inserted next to tRNAs.

12.  Examples of weak and strong prophage predictions are provided below.  The Weak example is most likely not a prophage due to its low number of hits and the fact that one of its hits is a duplicate.  The Strong example is a clear indicator of a prophage-related locus due to its large number of hits and the fact that many of its hits are to important phage-related genes.

Weak

Strong
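
The checks walked through in steps 5 to 10 (hit count, phage keywords, duplicate hits, GC deviation, and distance to a tRNA) can be collected in a short Python sketch. It is an illustration under stated assumptions, not part of Prophage Finder: the function names are hypothetical, hits with identical descriptions are treated as duplicates, GC deviation is taken as an absolute difference in percentage points, and each tRNA is reduced to a single coordinate.

```python
# Keywords listed in step 6 of the tutorial.
PHAGE_KEYWORDS = (
    "tail", "integrase", "portal", "protease", "capsid", "terminase",
    "tape measure", "methylase", "methyltransferase", "packaging", "helicase",
)


def summarize_evidence(descriptions, prophage_gc, host_gc):
    """Collect the simple indicators discussed in the tutorial."""
    unique = set(descriptions)  # identical descriptions count as one gene
    with_keyword = [
        d for d in unique if any(k in d.lower() for k in PHAGE_KEYWORDS)
    ]
    return {
        "total_hits": len(descriptions),
        "unique_hits": len(unique),
        "keyword_hits": len(with_keyword),
        "gc_deviation": abs(prophage_gc - host_gc),
    }


def distance_to_nearest(left, right, positions):
    """Base pairs from the region [left, right] to the closest position."""
    return min(
        0 if left <= p <= right else min(abs(left - p), abs(p - right))
        for p in positions
    )


# Hit descriptions and GC values of >PredictedProphage1 from the Summary example.
prophage1 = [
    "ref|NP_046778.1|gpFI",
    "ref|NP_052270.1|similar to P2 tail tube protein FII, Swiss-Prot Accession Number P22502",
    "ref|NP_490623.1|hypothetical protein",
    "ref|NP_050646.1|putative tape measure protein",
    "ref|NP_758935.1|ORF44",
    "ref|NP_878205.1|gpX",
    "ref|NP_758937.1|ORF46",
    "ref|YP_164085.1|Dam modification methylase",
]
evidence = summarize_evidence(prophage1, prophage_gc=71.21, host_gc=69.01)
print(evidence)
```

For >PredictedProphage1 this reports eight hits with no duplicates and three descriptions containing a listed keyword (tail, tape measure, methylase). To use distance_to_nearest, pass the prophage coordinates from the Summary file and tRNA positions taken from the tRNAscan output.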

 

III. Downloading and Installing Prophage Finder

    If you intend to run Prophage Finder with multiple sets of values for the various parameters, it will be faster to download Prophage Finder and run it locally on your computer.  The local version of Prophage Finder does not run BLAST, so you must already have a BLAST result file from running Prophage Finder on the website.  The .zip file containing Prophage Finder can be downloaded by clicking here.  After the .zip file has been downloaded, all of its files must be unzipped into the same directory.  Files contained in the .zip file include:

    ProphageFinderMulti2.pl - This is the main Perl program that must be executed to run the program.

    Prophage.pm - This Perl module is used to create Prophage objects.

    ProphageSubs.pm - This Perl module contains various subroutines required for ProphageFinderMulti2.pl to run.

    Readme.txt - This file contains instructions on how to install and run Prophage Finder.

Other requirements:

    Perl must be installed on your computer.

    http://www.perl.com/download.csp

    Bioperl must be installed on your computer.

    http://bio.perl.org/Core/Latest/index.shtml

After all of these components are installed the program is ready to be run.  Further instructions can be found in the Readme.txt included in the .zip file.
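
Before running the local version, it can help to confirm that Perl and Bioperl are available. The Python sketch below is one hedged way to do that: check_prerequisites is a hypothetical helper, and it assumes that being able to load the Bio::SeqIO module is a reasonable sign that Bioperl is installed. It does not run Prophage Finder itself; see Readme.txt for that.

```python
import shutil
import subprocess


def check_prerequisites():
    """Report whether perl is on the PATH and whether Bio::SeqIO can be loaded."""
    perl = shutil.which("perl")
    result = {"perl": perl is not None, "bioperl": False}
    if perl:
        try:
            # "perl -MBio::SeqIO -e 1" exits with status 0 only if the module loads.
            proc = subprocess.run(
                [perl, "-MBio::SeqIO", "-e", "1"],
                capture_output=True,
                timeout=30,
            )
            result["bioperl"] = proc.returncode == 0
        except (OSError, subprocess.TimeoutExpired):
            pass
    return result


status = check_prerequisites()
print(status)
```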

 

IV. Contact Information

Michael Bose

B.S. Molecular Biology & Bioinformatics

mikebose27@yahoo.com

Dr. Robert Barber

Assistant Professor of Biological Sciences

University of Wisconsin-Parkside

barber@uwp.edu

 

V. Additional References

  1. Altschul, Stephen F., Thomas L. Madden, Alejandro A. Schäffer, Jinghui Zhang, Zheng Zhang, Webb Miller, and David J. Lipman (1997), "Gapped BLAST and PSI-BLAST: a new generation of protein database search programs",  Nucleic Acids Res. 25:3389-3402.

  2. Canchaya, Carlos, Caroline Proux, Ghislain Fournous, Anne Bruttin, and Harald Brüssow (2003), "Prophage Genomics", Microbiol. Mol. Biol. Rev. 67:238-276.
     
  3. Lowe, T.M. and S.R. Eddy (1997), "tRNAscan-SE: a program for improved detection of transfer RNA genes in genomic sequence", Nucleic Acids Res. 25:955-964.