BIOPERL INTERVIEW QUESTIONS


Most Important Frequently Asked Bioperl Interview Questions




Interview Questions on BioPerl

    1. Question 1. What Is Bioperl?

      Answer :

      BioPerl is a toolkit of Perl modules useful in building bioinformatics solutions in Perl. It is built in an object-oriented manner, so many modules depend on each other to achieve a task. The collection of modules in the bioperl-live repository constitutes the core functionality of BioPerl. Additionally, auxiliary modules for creating graphical interfaces (bioperl-gui), persistent storage in an RDBMS (bioperl-db), running and parsing the results from hundreds of bioinformatics applications (the Run package), and software to automate bioinformatic analyses (bioperl-pipeline) are all available as CVS modules in our repository.

    2. Question 2. What Is The Difference Between 1.5.1 And 1.4.0? What Do You Mean Developer Release?

      Answer :

      The 1.4.x series was released in 2004 and represented a stable release series. The 1.5.0 release was made in early 2005, but it included several annoying bugs. The 1.5.1 release in October fixed those bugs and added a number of new modules. See the Changes file for more information.

      Developer releases are odd-numbered releases (1.3, 1.5, etc.) not intended to be completely stable (although all tests should pass). Stable releases are even-numbered (1.0, 1.2, 1.6) and intended to provide a stable API, so that modules will continue to respect the API throughout a stable release series. We cannot guarantee that APIs are stable between release series (i.e. 1.6 may not be completely compatible with scripts written for 1.4), but we endeavor to keep the API stable so that upgrading is easy.

      The 0.7.X series (0.7.0, 0.7.2) were all released in 2001 and were stable releases on the 0.7 branch. This means they had a set of functionality that was maintained throughout (no experimental modules) and were guaranteed to pass all tests, and subsequent bug-fix releases with the 0.7 designation would not have any API changes.

      The 0.9.X series was our first attempt at releasing so-called developer releases. These are snapshots of the actively developed code that, at a minimum, pass all our tests.

    3. Question 3. Can You Explain The Object Model Design And Rationale?

      Answer :

      There is no simple answer to this question. Simply put, this is a toolkit which has grown organically. The goals and user audience have evolved. Some decisions have been made and we have been forced to live by them rather than destroy backward compatibility. In addition, there are different philosophies of software development. The major developers on the project have tried to impose a set of standards on the code so that the project can be coordinated without every commit being cleared by a few key individuals (see Eric S. Raymond's essay "The Cathedral and the Bazaar" for different styles of running an open source project - we are clearly on the Bazaar end). Advanced BioPerl talks more about specific design goals.

      The clear consensus of the project developers is that BioPerl should be consistent. This may cause us to pay the price of some copy-and-paste code, with the get/set accessor methods and the decision not to use AUTOLOAD being sore spots for some. By being consistent, we hope that someone can grok the gist of a module from the basic documentation, see example code, and get a set of methods from the API documentation. We aim to make the core object design easy to understand. This has not been realized by any stretch of the imagination, as the toolkit has well over 1000 modules in bioperl-live and bioperl-run alone.

      That said, we do want to improve things. We want to experiment with newer modules which make Perl more object-oriented. We have high hopes for some of the promises of Perl 6. To try to realize this goal, we are encouraging developers to play with new object models in a bioperl-experimental project.

    4. Question 4. How Do I Submit A Patch Or Enhancement To Bioperl?

      Answer :

      We suggest the following. Post your idea to the appropriate mailing list. If it is a really new idea, consider taking us through your thought process. We'll help you tease out the necessary information, such as what methods you'll want and how it can interact with other BioPerl modules. If it is a port of something you've already worked on, give us a summary of the current methods. Make sure there is an interface to the module, not just an implementation, and make sure there will be a set of tests in the t/ directory to ensure that your module is tested. If you have a suggested patch and/or code enhancement, the SubmitPatch HOWTO gives guidelines on how to properly submit them via Bugzilla. See also Advanced BioPerl for more information.

    5. Question 5. What Is Bioperl-pedigree?

      Answer :

      The Pedigree package was started by Jason Stajich and provides basic tools for interacting with pedigree data and rendering pedigree plots.

    6. Question 6. What Is Bioperl-gui?

      Answer :

      The GUI package provides a Graphical User Interface for interacting with sequence and feature objects. It is used as part of the Genquire project.

    7. Question 7. What Is Bioperl-microarray?

      Answer :

      The Microarray package provides some basic tools for microarray functionality. It was started by Allen Day and may need some more work before it is a mature product.

    8. Question 8. What Is Bioperl-db?

      Answer :

      The BioPerl db package contains interfaces and adaptors that work with a BioSQL database to serialize and de-serialize Bioperl objects. Hilmar Lapp strongly recommends you use the CVS version with the latest biosql-schema.

    9. Question 9. Bioperl-ext Won't Compile The Staden Io Lib Part - What Do I Do?

      Answer :

      Make sure you read the README about copying files over. Here are some more items to check off before asking:

      • Are you sure io_lib is installed where you think it is, and that the install path is seen by Perl (did you answer the questions during perl Makefile.PL)?
      • Did you copy the various missing .h files (os.h and config.h, if I remember right) from your io_lib source directory into the install include directory when installing io_lib?
      • When you ran make for io_lib, did you see any errors or messages about how you should probably run "ranlib" on the library object?
      • Did you run "ranlib" on the installed libread file(s)?

    10. Question 10. What Is Bioperl-ext?

      Answer :

      bioperl-ext is a package of code for C extensions (hence the 'ext') to BioPerl. These include an interface to the Staden IO library (io_lib) for reading chromatogram files, and Bio::Ext::Align, which is a Smith-Waterman implementation.

    11. Question 11. I'm Trying To Run Bio::Tools::Run::StandAloneBlast And I'm Seeing Error Messages Like Can't Locate Bio/Tools/Run/WrapperBase.pm - How Do I Fix This?

      Answer :

      This file is missing in version 1.2. Two possible solutions: install version 1.2.1 or greater or retrieve and copy WrapperBase.pm to the proper location.

    12. Question 12. What Does The Future Hold For Running Applications Within Bioperl?

      Answer :

      We are trying to build a standard starting point for analysis applications, which will probably look like Bio::Tools::Run::AnalysisFactory and will allow the user to request which type of remote or local server they want to use to run their analyses. This will connect to Pasteur's PISE server and the EBI's Novella server, as well as be aware of wrappers to run applications locally.

    13. Question 13. Hey, I Want To Run Clustalw Within Bioperl, I Used

      Answer :

      Most of the Bio/Tools/Run directory was moved to a new package, bioperl-run, to help make the size of the core code smaller and separate out the more specialized nature of application running from the rest of BioPerl. You can get these modules by installing the bioperl-run package. Download it via Getting BioPerl. This changeover began in the bioperl 1.1 developer release.

    14. Question 14. How Do I Tell Blast To Search Multiple Databases Using Bio::Tools::Run::StandAloneBlast?

      Answer :

      Put the names of the databases, separated by spaces, in a single string, like so:

      use Bio::SeqIO;
      use Bio::Tools::Run::StandAloneBlast;

      # note the double quotes nested inside the single quotes
      my $dbs = '"/dba/BMC.fsa /dba/ALC.fsa /dba/HCC.fsa"';
      # $dir is assumed to hold the path of your output directory
      my @params = ( d           => $dbs,
                     program     => "BLASTN",
                     _READMETHOD => "Blast",
                     outfile     => "$dir/est.bls" );

      my $factory = Bio::Tools::Run::StandAloneBlast->new(@params);
      my $seqio   = Bio::SeqIO->new(-file => 't/amino.fa', -format => 'Fasta');
      my $seqobj  = $seqio->next_seq();
      $factory->blastall($seqobj);

    15. Question 15. How Do I Run Blast From Within Bioperl?

      Answer :

      Use the module Bio::Tools::Run::StandAloneBlast. It will give you access to many of the search tools in the NCBI BLAST suite including blastall, bl2seq, blastpgp. The basic structure is like this.

      use Bio::PrimarySeq;
      use Bio::Tools::Run::StandAloneBlast;
      my $factory = Bio::Tools::Run::StandAloneBlast->new(p => 'blastn',
                                                          d => 'nt',
                                                          e => '1e-5');
      my $seq = Bio::PrimarySeq->new(-id  => 'test1',
                                     -seq => 'AGATCAGTAGATGATAGGGGTAGA');
      my $report = $factory->blastall($seq); # get back a Bio::SearchIO report

    16. Question 16. How Do I Merge A Set Of Sequences Along With Their Features And Annotations?

      Answer :

      Try the cat() method in Bio::SeqUtils:

       $merged_seq = Bio::SeqUtils->cat(@seqs)

      This method uses the first sequence in the array as a foundation and adds the subsequent sequences to it, along with their features and annotations.

    17. Question 17. Can I Query Medline Or Other Bibliographic Repositories Using Bioperl?

      Answer :

      Yes! The solution lies in Bio::Biblio*, a set of modules that provide access to MEDLINE and OpenBQS-compliant servers using SOAP. 

    18. Question 18. How Do I Do Motif Searches With Bioperl? Can I Do "find All Sequences That Are 75% Identical" To A Given Motif?

      Answer :

      There are a number of approaches. Within BioPerl take a look at Bio::Tools::SeqPattern. Or, take a look at the TFBS package. This BioPerl-compliant package specializes in pattern searching of nucleotide sequence using matrices.

      It's also conceivable that the combination of BioPerl and Perl's regular expressions could do the trick. You might also consider the CPAN module String::Approx (this module addresses the percent match query), but experienced users question whether its distance estimates are correct; the Unix agrep command is thought to be faster and more accurate.
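
      As a rough illustration of the percent-identity question, the sketch below slides a motif along a sequence and reports every window whose identity meets a cutoff. It is plain Perl, not a BioPerl API: the name find_similar_windows and the sample sequence are made up for this example, and it only handles ungapped matches of the same length as the motif.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of BioPerl): return [start, window, identity]
# for every ungapped window whose identity to $motif is >= $min_identity.
# Start positions are 1-based.
sub find_similar_windows {
    my ($seq, $motif, $min_identity) = @_;
    my $len = length $motif;
    my @hits;
    for my $i (0 .. length($seq) - $len) {
        my $window = substr($seq, $i, $len);
        # XOR of two equal-length strings gives "\0" wherever the characters match
        my $matches  = (($window ^ $motif) =~ tr/\0//);
        my $identity = $matches / $len;
        push @hits, [$i + 1, $window, $identity] if $identity >= $min_identity;
    }
    return @hits;
}

# Made-up sequence and motif, with a 75% identity cutoff
for my $hit (find_similar_windows('AAACGTTTACGA', 'ACGT', 0.75)) {
    printf "%d\t%s\t%.2f\n", @$hit;
}
```

      For anything beyond short exact-length motifs (gaps, ambiguity codes, position weight matrices), the packages named above are the better starting point.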

    19. Question 19. How Do I Find All The ORFs In A Nucleotide Sequence? Antigenic Sites In A Protein? Calculate Nucleotide Melting Temperature? Find Repeats?

      Answer :

      In fact, none of these functions are built into BioPerl, but they are all available in the EMBOSS package, along with many others. The BioPerl developers created a simple interface to EMBOSS so that any EMBOSS program can be run from within BioPerl. See Bio::Factory::EMBOSS, which is in the bioperl-run package, for more information.

      If you can't find the functionality you want in BioPerl, make sure to look for it in EMBOSS; these packages integrate quite gracefully with BioPerl. Of course, you will have to install EMBOSS to get this functionality.

      In addition, BioPerl after version 1.0.1 contains the Pise/BioPerl modules. The Pise package was designed to provide a uniform interface to bioinformatics applications, and currently provides wrappers to more than 250 such applications! Included amongst these wrapped apps are HMMER, PHYLIP, BLAST, GENSCAN, and the EMBOSS suite. Use of the Pise/BioPerl modules does not require installing Pise locally, as it runs over HTTP.

    20. Question 20. I Get The Warning (Old Style Annotation) On New Style Annotation::Collection. What Is Wrong?

      Answer :

      You're using an old version! You'll see this warning because the modules and interface changed starting with BioPerl 1.0. Before v1.0 there was a Bio::Annotation module with add_Comment, add_Reference, each_Comment, and each_Reference methods.

      After v1.0 there is a Bio::Annotation::Collection module with add_Annotation('comment', $ann) and get_Annotations('comment').

      Please update your code in order to avoid seeing these warning messages. In the future the Reference objects will likely be implemented by the Bio::Biblio system but we hope to maintain a compatible API for these.

    21. Question 21. How Do I Get The Reverse-complement Of A Sequence Using The Subseq Method?

      Answer :

      One way is to pass the location to subseq in the form of a Bio::LocationI object. This object holds strand information as well as coordinates.

      use Bio::Location::Simple;
      my $location = Bio::Location::Simple->new(-start  => $start,
                                                -end    => $end,
                                                -strand => -1);
      # assume we already have a sequence object
      my $rev_comp_substr = $seq_obj->subseq($location);

    22. Question 22. How Do I Get The Complete Spliced Nucleotide Sequence From The CDS Section?

      Answer :

      You can use the spliced_seq method. For example:

      # $db is assumed to be a sequence database object you have already created
      my $seq_obj = $db->get_Seq_by_id($gi);
      foreach my $feat ( $seq_obj->top_SeqFeatures ) {
        if ( $feat->primary_tag eq 'CDS' ) {
          my $cds_obj = $feat->spliced_seq;
          print "CDS sequence is ",$cds_obj->seq,"\n";
        }
      }

    23. Question 23. How Do I Retrieve A Nucleotide Coding Sequence When I Have A Protein GI Number?

      Answer :

      You could go through the protein's feature table and find the coded_by value. The trick is to associate the coded_by nucleotide coordinates to the nucleotide entry, which you'll retrieve using the accession number from the same feature.

      use Bio::Factory::FTLocationFactory;
      use Bio::SeqFeature::Generic;
      use Bio::DB::GenPept;
      use Bio::DB::GenBank;
      my $gp = Bio::DB::GenPept->new;
      my $gb = Bio::DB::GenBank->new;
      # factory to turn strings into Bio::Location objects
      my $loc_factory = Bio::Factory::FTLocationFactory->new;
      my $prot_obj = $gp->get_Seq_by_id($protein_gi);
      foreach my $feat ( $prot_obj->top_SeqFeatures ) {
        if ( $feat->primary_tag eq 'CDS' ) {
          # example: 'coded_by="U05729.1:1..122"'
          my @coded_by = $feat->each_tag_value('coded_by');
          my ($nuc_acc,$loc_str) = split /:/, $coded_by[0];
          my $nuc_obj = $gb->get_Seq_by_acc($nuc_acc);
          # create Bio::Location object from a string
          my $loc_object = $loc_factory->from_string($loc_str);
          # create a Feature object by using a Location
          my $feat_obj = Bio::SeqFeature::Generic->new(-location => $loc_object);
          # associate the Feature object with the nucleotide Seq object
          $nuc_obj->add_SeqFeature($feat_obj);
          my $cds_obj = $feat_obj->spliced_seq;
          print "CDS sequence is ",$cds_obj->seq,"\n";
        }
      }
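
      The string-handling step in that answer, separating the accession from the coordinates in a coded_by value, can be tried on its own. The sketch below is plain Perl with a made-up helper name (parse_coded_by). It assumes the simple ACCESSION:start..end form shown in the comment above; values wrapped in complement(...) or join(...) would need extra handling.

```perl
use strict;
use warnings;

# Hypothetical helper: split a coded_by value such as "U05729.1:1..122"
# into the nucleotide accession and the location string.
sub parse_coded_by {
    my ($coded_by) = @_;
    # a limit of 2 keeps any later ':' inside the location part
    my ($acc, $loc) = split /:/, $coded_by, 2;
    return ($acc, $loc);
}

my ($nuc_acc, $loc_str) = parse_coded_by('U05729.1:1..122');
print "accession=$nuc_acc location=$loc_str\n";
```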

    24. Question 24. How Do I Parse The CDS Join Or Complement Statements In GenBank Or EMBL Files To Get The Sub-locations?

      Answer :

      For example, how can I get the coordinates 45 and 122 in join(45..122,233..267)?

      You could use primary_tag to find the CDS features and the Bio::Location::SplitLocationI object to get the coordinates:

      foreach my $feature ( $seqobj->top_SeqFeatures ) {
        if ( $feature->location->isa('Bio::Location::SplitLocationI')
             and $feature->primary_tag eq 'CDS' ) {
          foreach my $location ( $feature->location->sub_Location ) {
            print $location->start, "..", $location->end, "\n";
          }
        }
      }
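
      To make the expected output concrete, here is a small stand-alone sketch that pulls the start and end pairs out of a location string with a regular expression. It is only an illustration of what the sub-locations contain, not how BioPerl parses locations. The helper name sub_ranges is made up, and it ignores fuzzy coordinates, strands, and remote accessions. For real data, stay with the location objects.

```perl
use strict;
use warnings;

# Hypothetical helper: return [start, end] pairs from a simple feature-table
# location string. Handles only plain "start..end" ranges.
sub sub_ranges {
    my ($loc_str) = @_;
    my @ranges;
    while ($loc_str =~ /(\d+)\.\.(\d+)/g) {
        push @ranges, [$1, $2];
    }
    return @ranges;
}

print "$_->[0]..$_->[1]\n" for sub_ranges('join(45..122,233..267)');
```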

    25. Question 25. How Do I Retrieve All The Features From A Sequence? How About All The Features Which Are Exons Or Have A /note Field That Contains A Certain Gene Name?

      Answer :

      To get all the features:

      my @features = $seq->all_SeqFeatures();

      To get only the features whose primary tag (i.e. feature type) is exon:

      my @exons = grep { $_->primary_tag eq 'exon' }
                  $seq->all_SeqFeatures();

      To get only the features that have a /note tag whose value contains the requested string $noteval:

      my @f_with_note = grep { my @a = $_->has_tag('note') ? $_->each_tag_value('note') : ();
                               grep { /$noteval/ } @a; }
                        $seq->all_SeqFeatures();

    26. Question 26. Does Bio::SearchIO Parse The HTML Output That Blast Creates Using The -T Option?

      Answer :

      Yes, with a twist. You can override the BLAST parser's _readline() method so that it reads in the HTML and strips the tags using the HTML::Strip module.

      Please note: we do not suggest parsing BLAST HTML output if it can be avoided. We actively support XML, tabular, and text output parsing of NCBI BLAST reports only. We have never supported parsing of NCBI BLAST HTML output directly through BioPerl, and we will not attempt to fix problems where parsing tag-stripped HTML output breaks but parsing text output works. Consider this fair warning.

      use Bio::SearchIO;
      use HTML::Strip;
      my $hs = HTML::Strip->new();
      # replace the blast parser's _readline method with one that
      # auto-strips HTML:
      package Bio::SearchIO::blast;
      sub Bio::SearchIO::blast::_readline {
        my ($self, @args) = @_;
        my $line = $self->SUPER::_readline(@args);
        return unless defined $line;
        return $hs->parse($line);
      }
      # now parse using the BLAST format module; $file is the path to your report
      my $in = Bio::SearchIO->new(-format => 'blast', -file => $file);

    27. Question 27. Can I Get Domain Number From Hmmpfam Or Hmmsearch Output?

      Answer :

      For example, can I recover the "domain 2 of 2" part of a line like this?

      SH2_5: domain 2 of 2, from 349 to 432: score 104.4, E = 1.9e-26

      Not directly, but you can compute it, since the domains are numbered by their order on the protein:

      my @domains = sort { $a->start <=> $b->start } $hit->domains;
      my $total = scalar @domains;
      my $domainnum = 1;
      foreach my $domain ( @domains ) {
        print "domain $domainnum of $total,\n";
        $domainnum++;
      }
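
      The same numbering logic can be checked without an HMMER report. The sketch below uses plain hashes as stand-ins for the domain objects; the second record (start 120) and the helper name number_domains are invented for the example.

```perl
use strict;
use warnings;

# Made-up domain records standing in for the objects returned by $hit->domains
my @domains = (
    { name => 'SH2_5', start => 349, end => 432 },
    { name => 'SH2_5', start => 120, end => 203 },
);

# Hypothetical helper: return copies of the records in sequence order,
# each with a 'number' key giving its position along the protein.
sub number_domains {
    my @sorted = sort { $a->{start} <=> $b->{start} } @_;
    my $n = 0;
    return map { +{ %$_, number => ++$n } } @sorted;
}

my @numbered = number_domains(@domains);
printf "%s: domain %d of %d, from %d to %d\n",
    $_->{name}, $_->{number}, scalar @numbered, $_->{start}, $_->{end}
    for @numbered;
```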

    28. Question 28. How Do I Get The Frame For A Translated Search?

      Answer :

      I'm using Bio::Search* and its frame() to parse BLAST but I'm seeing 0, 1, or 2 instead of the expected -3, -2, -1, +1, +2, +3.

      Why am I seeing these different numbers and how do I get the frame according to BLAST?

      These are GFF frames, so +1 is 0 in GFF, and -3 is encoded as a frame of 2 with the strand set to -1.

      Frames are relative to the hit or query sequence, so you need to ask for the frame of the sequence you are interested in:

      $hsp->hit->strand;
      $hsp->hit->frame;

      or

      $hsp->query->strand;
      $hsp->query->frame;

      So the value shown in a BLAST report, such as -3, can be reconstructed as:

      my $blastframe = ($hsp->query->frame + 1) * $hsp->query->strand;
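
      The arithmetic can be checked in isolation. The sketch below wraps the same formula in a stand-alone function (the name blast_frame is made up) and prints all six combinations. It assumes a strand of 1 or -1, as in a translated search.

```perl
use strict;
use warnings;

# Hypothetical helper: combine a GFF-style frame (0, 1, 2) and a strand (1 or -1)
# into the signed frame a BLAST report shows (+1..+3 or -1..-3).
sub blast_frame {
    my ($gff_frame, $strand) = @_;
    return ($gff_frame + 1) * $strand;
}

for my $strand (1, -1) {
    for my $frame (0 .. 2) {
        printf "frame=%d strand=%+d -> BLAST frame %+d\n",
            $frame, $strand, blast_frame($frame, $strand);
    }
}
```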

    29. Question 29. How Can I Generate A Pairwise Alignment Of Two Sequences?

      Answer :

      Look at Bio::Factory::EMBOSS to see how to use the water and needle alignment programs that are part of the EMBOSS suite. Bio::Factory::EMBOSS is part of the bioperl-run package.

      Or you can use the pSW module for protein alignments or the dpAlign module for DNA or protein alignments. These are part of the bioperl-ext package; download it via Getting BioPerl.

      You can also use prss34 (part of FASTA package) to assess the significance of a pairwise alignment by shuffling the sequences.

    30. Question 30. I Want To Parse FASTA Or NCBI -m7 (XML) Format, How Do I Do This?

      Answer :

      It is as simple as parsing text BLAST results - you simply need to specify the format as fasta or blastxml and the parser will load the appropriate module for you. You can use the exact logic and code for all of these formats as we have generalized the modules for sequence database searching. The page describing Bio::SearchIO provides a table showing how the formats match up to particular modules. Note that, for parsing BLAST XML output, you will need XML::SAX and that XML::SAX::ExpatXS is recommended to speed up parsing.

    31. Question 31. What Was Wrong With Bio::Tools::Blast?

      Answer :

      Bio::Tools::Blast* is no longer supported, as of BioPerl version 1.1. Nothing is really wrong with it, it has just been outgrown by a more generic approach to reports. This generic approach allows us to just write pluggable modules for FASTA and BLAST parsing while using the same framework. This is completely analogous to the Bio::SeqIO system of parsing sequence files. However, the objects produced are of the Bio::SearchIO rather than Bio::Seq variety.

    32. Question 32. I Want To Parse Blast, How Do I Do This?

      Answer :

      As of version 1.1, BioPerl only supports one approach - the Bio::SearchIO interface. There are other BLAST parsing modules in the package, but they remain just to support older legacy code. Bio::SearchIO supports:

      • BLAST
      • MegaBLAST (PSL)
      • PSIBLAST
      • HMMER
      • WABA
      • BLASTZ (AXT)
      • exonerate
      • SIM4
      • Wise tools
      • FASTA reports

    33. Question 33. I Would Like To Make My Own Custom Fasta Header - How Do I Do This?

      Answer :

      You want to use the method preferred_id_type().

      Here's some example code:

      use Bio::SeqIO;
      my $seqin = Bio::SeqIO->new(-file => $file,
                                  -format => 'genbank');
      my $seqout = Bio::SeqIO->new(-fh => *STDOUT,
                                  -format => 'fasta');
      # From Bio::SeqIO::fasta
      $seqout->preferred_id_type('display');
      my $count = 1;
      while (my $seq = $seqin->next_seq) {
          # override the regular display_id with your own
          $seq->display_id('foo'.$count);
          $seqout->write_seq($seq);
          $count++;
      }

      You can pass one of the following values to preferred_id_type: "accession", "accession.version", "display", "primary". The description line is automatically appended to the preferred id type, but it can also be set explicitly, like so:

      $seq->desc($some_string);

    34. Question 34. Accession Numbers Are Not Present For Fasta Sequence Files. If You Parse A Fasta Sequence Format File With Bio::SeqIO The Sequences Won't Have The Accession Number. What To Do?

      Answer :

      All the data is in $seq->display_id; it just needs to be parsed out. Here is some code to set the accession number:

      my ($gi,$acc,$locus);

      (undef,$gi,undef,$acc,$locus) = split(/\|/,$seq->display_id);

      $seq->accession_number($acc);

      Why don't we just go ahead and do this? For one, we don't make any assumptions about the format of the ID part of the sequence. Perhaps the parser code could try to detect whether it is a GenBank-formatted ID and go ahead and set the accession number field. It would be trivial to do, but no one has volunteered the time. Put it on the Project priority list if you think it is important and, better yet, volunteer the code patch!
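
      As a sketch of what such detection might look like, the stand-alone function below splits an NCBI-style identifier only when it starts with "gi|". The function name parse_ncbi_id, the GI number, and the locus name are invented for illustration, and the function is not part of BioPerl.

```perl
use strict;
use warnings;

# Hypothetical helper: split an NCBI-style "gi|GI|db|ACCESSION|LOCUS" identifier.
# Returns an empty list if the ID does not start with "gi|".
sub parse_ncbi_id {
    my ($id) = @_;
    return unless $id =~ /^gi\|/;
    # the pipe must be escaped; an unescaped | means alternation in a regex
    my (undef, $gi, undef, $acc, $locus) = split /\|/, $id;
    return ($gi, $acc, $locus);
}

# Made-up identifier in the NCBI pipe-delimited style
my ($gi, $acc, $locus) = parse_ncbi_id('gi|12345|gb|U05729.1|EXAMPLE');
print "gi=$gi accession=$acc locus=$locus\n";
```

      In a real script the returned accession would then be passed to $seq->accession_number, as in the snippet above.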

    35. Question 35. How Do I Parse A Sequence File?

      Answer :

      Use the Bio::SeqIO system. This will create Bio::Seq objects for you. 

    36. Question 36. I Can't Get Sequences With Bio::DB::GenBank Any More, Why Not?

      Answer :

      NCBI changed the web CGI script that provided this access, so old BioPerl versions can no longer retrieve sequences. You must use a modern version like 1.4.x or 1.5.x.

    37. Question 37. How Can I Get NT_ Or NM_ Or NP_ Accessions From NCBI (Reference Sequences)?

      Answer :

      To retrieve GenBank reference sequences, or RefSeqs, use Bio::DB::RefSeq rather than Bio::DB::GenBank or Bio::DB::GenPept. This is still an area of active development because the data providers have not provided the best interface for us to query. EBI has provided a mirror with their dbfetch system, which is accessible through the Bio::DB::RefSeq object; however, there are cases where NT_ accession numbers will not be retrievable.

    38. Question 38. How Can I Use Bio::SeqIO To Parse Sequence Data To Or From A String?

      Answer :

      Use this code to parse sequence records from a string:

      use IO::String;
      use Bio::SeqIO;
      my $stringfh = IO::String->new($string);
      my $seqio = Bio::SeqIO->new(-fh     => $stringfh,
                                  -format => 'fasta');
      while ( my $seq = $seqio->next_seq ) {
        # process each seq
      }

      And here is how to write to a string:

      use IO::String;
      use Bio::SeqIO;
      my $s;
      my $io = IO::String->new($s);
      my $seqOut = Bio::SeqIO->new(-format => 'swiss', -fh => $io);
      $seqOut->write_seq($seq1);
      print $s; # $s now holds the record in Swiss-Prot format

    39. Question 39. How Do I Use Bio::Index::Fasta And Index On Different IDs?

      Answer :

      I'm using Bio::Index::Fasta in order to retrieve sequences from my indexed FASTA file, but I keep seeing "MSG: Did not provide a valid Bio::PrimarySeqI object" when I call fetch followed by write_seq() on a Bio::SeqIO handle. Why?

      It's likely that fetch didn't retrieve a Bio::Seq object. There are a few possible explanations, but the most common cause is that the id you're passing to fetch is not the key to that sequence in the index. For example, if the FASTA header is >gi|12366 and your id is 12366, then fetch won't find the sequence; it expects to see gi|12366. You need to pass your own ID-parsing subroutine to the id_parser method to specify the key used in indexing, like this:

      # -write_flag is needed to create or update the index
      $inx = Bio::Index::Fasta->new(-filename => $indexname, -write_flag => 1);
      $inx->id_parser(\&get_id);
      $inx->make_index($fastaname);
      sub get_id {
        my $header = shift;
        $header =~ /^>gi\|(\d+)/;
        $1;
      }

      The same issue arises when you use Bio::DB::Fasta, but in that case the code might look like this:

      $inx = Bio::DB::Fasta->new($fastaname, -makeid => \&get_id);
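
      An ID-parsing callback of this kind can be tried on its own before building an index. The sketch below is plain Perl. It assumes the callback receives the header line including the leading '>', and the sample headers are made up. How BioPerl treats a header for which the callback returns no key is not covered here.

```perl
use strict;
use warnings;

# ID parser sketch: takes a FASTA header line (with the leading '>')
# and returns the GI number to index under, or undef if there is none.
sub get_id {
    my ($header) = @_;
    return $header =~ /^>gi\|(\d+)/ ? $1 : undef;
}

for my $header ('>gi|12366 example sequence', '>seq1 no gi here') {
    my $key = get_id($header);
    print defined $key ? "$header => $key\n" : "$header => (no key)\n";
}
```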

    40. Question 40. Cannot Get An Accession From GenBank When I Know It Is There?

      Answer :

      I'm using Bio::DB::GenBank to query GenBank and I'm certain that the id is there, but I'm seeing the error "MSG: acc does not exist". This is a bug in versions 1.2 and 1.2.1, but it is fixed in 1.2.2. Either upgrade to 1.2.2 or higher, or edit the module Bio::DB::GenBank and change protein to nucleotide in the BEGIN block.


