NAME

BlastSniffer, a script for gene localization based on similarity.


SYNOPSIS

perl bsniffer.pl [project_name] [gene_name]


DESCRIPTION

This program allows the user to define the location of a set of genes in a set of genomic contigs. It is particularly suited for similarity-based predictions applied to gene families. Usually, GeneTuner is run downstream of BlastSniffer to define the exact exon boundaries of the gene. The purpose of both programs is to automatically perform several time-consuming steps in the manual annotation of genes by homology.

Usually, annotations start with a TBLASTN comparison between a known protein and a genomic sequence. The user must then decide which resulting hits correspond to bona fide exons of new putative genes. BlastSniffer handles this task. Then, every exon junction has to be manually assessed and corrected, and the new gene must be rebuilt by copying the specified subsequences from the template contig. GeneTuner lets the user choose exon junctions, and also add and remove exons, directly from the genomic sequence. To help with this task, the template sequence is shown aligned to the contig.

The recommended procedure to use these programs consists of the following steps:

  1. Store all of the starting protein sequences in one folder (ppath) with a given extension (aa_ext). Optionally, store the corresponding nucleotide sequences with the same names in a different folder (npath) and/or with a different extension (nt_ext).
  2. Run TBLASTN to compare each of the starting protein sequences with the template genomic sequence (dbpath). Store the output files in a folder (tbnpath) with the extension .tbn.
  3. Run BlastSniffer on those files and decide which BLAST hits describe each gene.
  4. Open GeneTuner and load the project files created by BlastSniffer. See GeneTuner documentation.
  5. Edit the predicted exons, save the result, and exit. Repeat steps 4 and 5 until you have processed every tbn file.
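
Step 2 can be scripted. The following is a minimal sketch and not part of BlastSniffer. It assumes the legacy NCBI blastall program (its "Query:"/"Sbjct:" report layout matches the sample in this document), that dbpath names a database already formatted for BLAST, and that the hash keys mirror the project file fields. The function name tblastn_commands is hypothetical.

```perl
use strict;
use warnings;
use File::Spec;

# Build one legacy-BLAST command per protein file in ppath (step 2).
# Commands are returned as array refs and are NOT executed here, so they
# can be reviewed first and later run with system(@$cmd).
sub tblastn_commands {
    my (%project) = @_;    # expects ppath, aa_ext, dbpath, tbnpath

    # Assumption: aa_ext may be given with or without a leading dot.
    (my $ext = $project{aa_ext}) =~ s/^\.//;

    opendir(my $dh, $project{ppath})
        or die "Cannot open $project{ppath}: $!";
    my @files = sort grep { /\.\Q$ext\E$/ } readdir($dh);
    closedir($dh);

    my @commands;
    for my $file (@files) {
        # The output file keeps the name of the protein, with extension .tbn.
        (my $gene = $file) =~ s/\.\Q$ext\E$//;
        push @commands, [
            'blastall', '-p', 'tblastn',
            '-i', File::Spec->catfile($project{ppath}, $file),
            '-d', $project{dbpath},
            '-o', File::Spec->catfile($project{tbnpath}, "$gene.tbn"),
        ];
    }
    return @commands;
}
```

Each returned command can be printed with join(' ', @$cmd) for inspection before it is run.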

Welcome screen

Unless a valid project name is provided on the command line, the user will be taken to the welcome screen. This screen allows the user to create a new project or load an existing project. Once a project is created or loaded, BlastSniffer processes the next tbn file and takes the user to the edition screen.

Simplified TBLASTN file

Once a tbn file is loaded, a new subfolder named after the gene is created in the results folder. Inside this new folder, a second tbn file is created with a simplified format:

 c19h_usp10
 
 Lenght: 798
 
 >chr11
 
 
 0 ---> *c19h_usp1*
 Query: 8        YIFGDFSPDEFNQFFVTPRSSVEL 31
                 YIFG+FSPDEFNQFFVTPR SVE+
 Sbjct: 18795784 YIFGEFSPDEFNQFFVTPRCSVEV 18795855
 
 
 ...
 
 
 7 ---> 
 Query: 484      ALGDKIVRDIRPGAAFEPTYIYRLLTVNKSSLSEK 518
                 ALGDKIVRDIRPGAAFEPTYIYRLLTV KSSLSEK
 Sbjct: 18813207 ALGDKIVRDIRPGAAFEPTYIYRLLTVIKSSLSEK 18813311
 
 
 
 8 ---> 
 Query: 518      KGRQEDAEEYLGFILNGLHEEMLNLKKLLSPSNE 551
                 +GRQEDAEEYLGFILNGLHEEML LKKLLSP NE
 Sbjct: 18814019 QGRQEDAEEYLGFILNGLHEEMLTLKKLLSPHNE 18814120

This format gives, for each hit, a hit number for identification, the direction of the hit ('--->' for + and '<---' for -), and a warning if this hit has been used in another annotation (the name of that gene between asterisks). This allows the description of a gene as a combination of hit numbers. Users can read this file at this point to decide which combination of hits best describes the novel gene.
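
The simplified format is easy to read programmatically. The following is a minimal sketch based only on the sample above, not BlastSniffer's own parser. The function name parse_simplified_tbn and the record keys are hypothetical, and the arrow pattern deliberately accepts any number of dashes.

```perl
use strict;
use warnings;

# Parse the text of a simplified tbn file into a list of hit records.
sub parse_simplified_tbn {
    my ($text) = @_;
    my (@hits, $contig, $hit);

    for my $line (split /\n/, $text) {
        if ($line =~ /^\s*>(\S+)/) {
            $contig = $1;    # contig header, e.g. ">chr11"
        }
        elsif ($line =~ /^\s*(\d+)\s+(-+>|<-+)\s*(?:\*(.+?)\*)?\s*$/) {
            # Hit header: number, direction arrow, optional *gene* warning.
            $hit = {
                number  => $1,
                strand  => (substr($2, -1) eq '>' ? '+' : '-'),
                used_in => $3,        # undef when no warning is present
                contig  => $contig,
            };
            push @hits, $hit;
        }
        elsif ($hit && $line =~ /^\s*(Query|Sbjct):\s+(\d+)\s+\S+\s+(\d+)\s*$/) {
            # Keep the first start and the last end, in case an alignment
            # is printed as several Query/Sbjct blocks (an assumption).
            my $key = lc $1;
            $hit->{"${key}_start"} //= $2;
            $hit->{"${key}_end"} = $3;
        }
    }
    return @hits;
}
```

For hit 0 of the sample, this yields strand '+', used_in 'c19h_usp1', query coordinates 8 to 31 and subject coordinates 18795784 to 18795855.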

Alternatively, BlastSniffer considers all of the possible hit combinations and assigns a raw score to each. The higher the score, the better the resulting alignment. However, the user may have other reasons to choose a combination with a lower score. Therefore, BlastSniffer sorts the combinations by score and shows them to the user in the edition screen.

Edition screen

These are the elements of the edition screen:

 **************                                           
 c19h_usp10                          (Gene name          )
 Contig: Chr11                       (Contig             )
 Combination: 0                      (Combination number )
 Raw score: 2774                    
 10, 9, 8, 7, 6, 5, 4, 3, 2, 1, 0    (Current combination)
 ************************                                 
 ************************                                 
 1.- Accept combination                                   
 2.- Previous combination                                 
 3.- Next combination                                     
 4.- Previous contig                                      
 5.- Next contig                                          
 6.- Enter manually                                       
 7.- Mark used hits                                       
 8.- Pass gene                                            
 9.- Exit                                                 
 ************************                                 
 ************************                                 
 Please type option number

  1. Accept current combination, save result, and go to next gene.
  2. Show previous combination (better score).
  3. Show next combination (worse score).
  4. Go to previous contig.
  5. Go to next contig.
  6. Enter a combination manually.
  7. Modify the new tbn file to show which hits have been used in the annotation of other genes.
  8. Save nothing and continue to next gene. GeneTuner will ignore this gene. Useful if there is no orthologue of the query protein in the template genome.
  9. Exit. Nothing will be saved for the current gene, but GeneTuner will not ignore this entry.
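
Options 2 and 3 walk through a list sorted by raw score. The following sketch illustrates that navigation under stated assumptions: the record layout and the function names are hypothetical, the scoring itself is not reproduced, and stopping at either end of the list (rather than wrapping around) is a guess about the behaviour.

```perl
use strict;
use warnings;

# Sort combinations by raw score, best first, so that index 0 is the
# combination shown when the edition screen opens.
sub sort_combinations {
    my (@combinations) = @_;
    return sort { $b->{score} <=> $a->{score} } @combinations;
}

# Move through the sorted list: step -1 for "Previous combination"
# (better score), step +1 for "Next combination" (worse score).
# Assumption: the index is clamped to the ends of the list.
sub move_combination {
    my ($index, $step, $count) = @_;
    my $new = $index + $step;
    $new = 0          if $new < 0;
    $new = $count - 1 if $new > $count - 1;
    return $new;
}
```

With three combinations, move_combination(0, 1, 3) moves from the best combination to the second best, and move_combination(0, -1, 3) stays on the best one.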

At the beginning, the hit combination with the best raw score is shown. At this point, the user may want to choose Mark used hits to modify the tbn file and make sure that a hit is not used in more than one gene. Depending on personal preferences, several workflows are feasible. Some users may start with the combinations offered by BlastSniffer and decide which one is correct for the gene. Other users may prefer to read the simplified tbn file and decide which combination describes the gene without any previous advice. Once the decision has been made, users must make sure that the correct contig is displayed in the edition screen. If not, it can be reached by repeatedly going to the previous or next contig. Then, the combination can be reached by repeatedly displaying the previous or next combination. Users can also enter the combination manually by providing a properly sorted list of hit numbers separated by commas or spaces.
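
As an illustration of the manual entry format, the following sketch splits a typed combination on commas or spaces and rejects anything that is not a known hit number. It is not BlastSniffer's own input handling: the function name parse_combination and the max_hit argument are hypothetical, and the sort order is not checked because its exact definition is not given here.

```perl
use strict;
use warnings;

# Turn user input such as "10, 9 8,7" into a list of hit numbers.
# Returns an empty list if the input is empty or if any token is not
# an integer between 0 and $max_hit.
sub parse_combination {
    my ($input, $max_hit) = @_;
    $input =~ s/^\s+|\s+$//g;               # trim surrounding whitespace
    my @tokens = split /[\s,]+/, $input;    # commas and/or spaces
    return () unless @tokens;
    for my $token (@tokens) {
        return () unless $token =~ /^\d+$/ && $token <= $max_hit;
    }
    return map { $_ + 0 } @tokens;
}
```

For example, parse_combination('10, 9 8,7', 10) returns (10, 9, 8, 7), while parse_combination('3, x', 10) returns an empty list.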

It must be noted that the algorithm in BlastSniffer only considers sequence similarity. Therefore, the combination with the best score is not guaranteed to describe a real gene.

Project files

Project files are saved into the projects folder with .gt extension. They contain the following information:

 tbnpath  => path to the folder where the starting ".tbn" files are stored
 basepath => path to the folder where the result folders will be stored
 dbpath   => path to the genomic (template) sequence
 ppath    => path to the starting protein sequences folder
 aa_ext   => extension of the starting protein sequences
 npath    => path to the starting nucleotide sequences folder (optional)
 nt_ext   => extension of the starting nucleotide sequences (optional)

Both GeneTuner and BlastSniffer can create compatible project files from user input.
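
Before starting a long annotation session, it can help to check that a project defines every required field. The sketch below works on a plain Perl hash holding the fields listed above. It does not read .gt files, whose on-disk layout is not described here, and the function names are hypothetical.

```perl
use strict;
use warnings;

# Field names taken from the project file description.
my @REQUIRED = qw(tbnpath basepath dbpath ppath aa_ext);
my @OPTIONAL = qw(npath nt_ext);

# Return the required fields that are missing or empty in a project hash.
sub missing_project_keys {
    my ($project) = @_;
    return grep { !defined $project->{$_} || $project->{$_} eq '' } @REQUIRED;
}

# Return the fields that are neither required nor optional (likely typos).
sub unknown_project_keys {
    my ($project) = @_;
    my %known = map { $_ => 1 } @REQUIRED, @OPTIONAL;
    return sort grep { !$known{$_} } keys %$project;
}
```

A project is complete when missing_project_keys returns an empty list. npath and nt_ext may be left out.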


ARGUMENTS

 --help        print this help

OPTIONS

none implemented


INPUT

The input for BlastSniffer is a set of TBLASTN results in text files with extension tbn.


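OUTPUT

For each processed gene, BlastSniffer creates a subfolder named after the gene in the results folder (basepath). The subfolder holds the simplified tbn file. Accepting a combination saves the result for that gene. Project files are saved into the projects folder with .gt extension.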


LICENSE

This program is free software and can be redistributed under the same terms as Perl. See http://www.perl.com/pub/a/language/misc/Artistic.html


AUTHOR

Copyright (C) 2008, Victor Quesada

e-mail: quesadavictor@uniovi.es