Biopython 1.61 was released a little less than a month ago. New features and
fixes came along with it, and I'm naturally excited since it's
the first official release that includes SearchIO.
It's still subject to change, made clear by an experimental warning
(new in 1.61 too!) that pops up every time the module is imported. You can use
it in your scripts, but the details may change between versions, until the
experimental warning is gone. Of course, you should have no problem if you
update and test your scripts every now and then.
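If the warning gets noisy in a script, Python's standard warnings module can filter it. Here is a minimal sketch, assuming the warning category is importable as Bio.BiopythonExperimentalWarning. The fallback class is only a stand-in so the snippet runs where Biopython isn't installed, and ignore_experimental is a hypothetical helper name, not part of Biopython.

```python
import contextlib
import warnings

try:
    # Assumed location of the warning category used for experimental modules.
    from Bio import BiopythonExperimentalWarning
except ImportError:
    class BiopythonExperimentalWarning(Warning):
        """Stand-in so this sketch runs without Biopython installed."""


@contextlib.contextmanager
def ignore_experimental():
    """Temporarily ignore experimental warnings, and nothing else."""
    with warnings.catch_warnings():
        warnings.simplefilter("ignore", BiopythonExperimentalWarning)
        yield


# Usage (needs Biopython):
# with ignore_experimental():
#     from Bio import SearchIO
```

The filter is scoped to the with block, so other warnings (and experimental warnings raised elsewhere) still show up as usual.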
The module has been undergoing some internal changes, since the last time I
posted an update about it. Most notably, there is now a parser for HMMER2
(thanks to Kai Blin) and BGZF-compatible indexing
functions (thanks to Peter Cock). For all of you
who are interested, the complete set of functions and objects is now documented
in Biopython's API page.
There's also a walkthrough that you can follow along (with sample files and code)
in the official tutorial.
These tutorials and APIs are the de facto reference for SearchIO, so refer to
them whenever possible. Of course, if you do need additional help, the Biopython
mailing list is only a few clicks away.
From here on, my personal focus for SearchIO will be to maintain it and squash
any bugs that pop up. There will be updates to the code internals as well,
but it really depends on the real-world use cases I will be seeing.
If a new search program shows up, for example, that may prompt the creation of new
I do have some thoughts about adding a couple ...
Google Summer of Code 2012 has finally drawn to a close.
It's been a great learning experience, one I would not hesitate to recommend to anyone. I've learned a lot in the past few months, not just about writing open source software but also about several bioinformatics applications (that I hope to continue to use in the foreseeable future) and even about myself.
I'm deeply thankful first and foremost to Peter, my mentor for the summer, and to the Biopython community in general. Peter was the person who reviewed my first ever open source contribution last year (also to Biopython) and got me interested in doing more. Throughout the summer, he didn't just answer my questions about pieces of code that I had trouble with, but also explained a lot more about working in modern bioinformatics in general. As a person who hopes to one day make a living out of bytes and basepairs, I really couldn't have asked for a better mentor.
As for the Biopython community (and by extension the Open Bioinformatics Foundation), I would like to convey my gratitude as well. I have to say that I learned Python in the first place with considerable influence from Biopython. Initially I simply intended to use it to automate some of my repetitive lab tasks. I guess it was a slippery slope that led me down this road :). So thank you to everyone involved in Biopython, and thank you OBF for accepting my proposal this year. It was not the best of proposals, I'm sure, but I hope I have delivered.
And finally, to Google, thank you for having the initiative to organize the Summer of Code. I hope the program will get bigger in scope and gain more visibility in the years to come. Better yet, I hope more companies start similar initiatives.
My whole personal experience through the summer probably deserves a post of its own. For now, I want to briefly recap what I have done these past few months and outline the future ...
It's been a while since I posted my GSoC updates.
The main reason was a considerable change to the main SearchIO object model. It turns out that the trio of QueryResult, Hit, and HSP objects I had been using was not sufficient to consistently model outputs from all the search programs I have encountered. So with Peter's guidance, I've spent most of my time writing and rewriting several different models, trying to find out which one is best. As the model forms the base of all the parsing logic, every alternative required some parser rewriting, which takes time. The good thing is that we've finally settled on one alternative, so I thought this was another update worth posting.
I'll try to explain what the new object model looks like in this post. There's also another update to the main API that I'll mention afterwards.
As always, this is done using the latest code available from my main development repo. I should also mention that I've been using Travis for continuous testing on this branch. Travis has exposed several version-specific bugs that I have managed to fix, making the codebase better.
Anyway, on to the updates!
Improved SearchIO Model
I started developing SearchIO using a hierarchy of three objects: the
QueryResult object to represent search queries, the
Hit object to represent a matching entry in the search database, and the
HSP object to represent a region of significant alignment between the query and hit sequence. This worked well for the early search programs I worked with: BLAST, HMMER, and FASTA. The outputs of these programs can be parsed easily into this hierarchy, and interacting with ...
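To make the three-level hierarchy concrete, here is a toy sketch of it using plain dataclasses. These are not the actual SearchIO classes. The field names (query_start, evalue, and so on) and the sample values are made up for illustration, and only the nesting of query, hits, and HSPs reflects the model described above.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class HSP:
    """A region of significant alignment between query and hit."""
    query_start: int
    query_end: int
    hit_start: int
    hit_end: int
    evalue: float


@dataclass
class Hit:
    """A matching database entry, holding one or more HSPs."""
    id: str
    hsps: List[HSP] = field(default_factory=list)


@dataclass
class QueryResult:
    """One search query, holding the hits it produced."""
    id: str
    hits: List[Hit] = field(default_factory=list)


# Made-up data: one query with two hits, the second hit having two HSPs.
result = QueryResult(
    id="query_1",
    hits=[
        Hit(id="hit_A", hsps=[HSP(0, 50, 100, 150, 1e-20)]),
        Hit(id="hit_B", hsps=[HSP(10, 40, 5, 35, 1e-5),
                              HSP(60, 90, 70, 100, 1e-3)]),
    ],
)

# Walk the hierarchy top-down: query -> hits -> HSPs.
for hit in result.hits:
    for hsp in hit.hsps:
        print(result.id, hit.id, hsp.query_start, hsp.query_end, hsp.evalue)
```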
One of the things I've enjoyed while developing SearchIO in the past few weeks is that I get to play with many different programs and see how they behave. Even for programs that I thought I was familiar with, I sometimes still see unanticipated behaviors (hint: sequence coordinates). It's like the old days of Windows 95, when you would try to delete a file and see how that affects your computer (and occasionally discover later that you need the file for proper boot-up). Except this time, it's mostly with command line programs and biological sequences (plus there's almost no risk I would render my computer useless permanently).
You can probably imagine how I feel when the programs I'm playing with are totally new to me.
That's what I experienced when I started playing with Exonerate, but it's a very interesting program on its own. Take a look at a sample output below to get an idea of what it's capable of (I should note that I deliberately chose this output because of the dense information it has, without checking its biological correctness).
This is a single alignment (HSP, really) resulting from a sequence search against the yeast genome using the
genome2genome model. Basically, this model assumes that there can exist introns in both the query and/or hit sequences, along with assuming that parts of both sequences may be open-reading frames. It was generated using Exonerate 2.2 with ...
BLAST plain text output is a tricky beast. It's the output format easiest for us humans to read, but it's arguably harder for computers to read compared to its XML or tabular counterparts. One reason is that NCBI themselves give no guarantee that the output stays the same between different BLAST versions. This means that for each different BLAST version, there is a chance that a given parser breaks. It's still a useful format, nonetheless, giving the reader instant feedback or visualization of their search results. That is why some Bio* libraries (and other programs, no doubt) continue to try to write parsers for the plain text format.
In Biopython, BLAST plain text parsing support is officially obsolete: the plain text parser ships without any guarantee that it will work. However, the code itself is still there and for the most part it still works. Just to give an illustration, the plain text parser's test suite contains files from legacy BLAST version 2.0.10 (released more than 10 years ago) up to the latest version (2.2.26+, released last year). Although these files do not cover all possible outputs one can generate from BLAST and some versions in between them are not covered, it speaks to how versatile the parser is.
Now, when I started writing parsers for SearchIO, the plain text output was indeed one format that crossed my mind. I was warned, though, that officially supporting it would take much work and might not be the best thing to do. Plus, it was hard to imagine that I could write a new parser from scratch that handles all variations in the plain text output during the GSoC period (on top of the other parsers I needed to write). After talking with my mentor, we decided that for now the best thing to do is perhaps to write a SearchIO wrapper around Biopython's current BLAST parser.
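The wrapper idea can be sketched roughly like this: leave the existing parser alone and only re-shape what it returns into the query, hits, and HSPs layout. This is an illustration of the approach, not the actual SearchIO code. The wrap_legacy_record function is a hypothetical name, and the attribute names on the stand-in record (query, alignments, title, hsps, expect) are assumptions about what a legacy-style record looks like.

```python
from types import SimpleNamespace


def wrap_legacy_record(record):
    """Re-shape a legacy-style record into a query -> hits -> HSPs layout."""
    return {
        "query_id": record.query,
        "hits": [
            {
                "hit_id": alignment.title,
                "hsps": [{"evalue": hsp.expect} for hsp in alignment.hsps],
            }
            for alignment in record.alignments
        ],
    }


# Hand-built stand-in for what an existing parser might return (made-up data).
legacy = SimpleNamespace(
    query="query_1",
    alignments=[
        SimpleNamespace(title="hit_A", hsps=[SimpleNamespace(expect=1e-20)]),
    ],
)

wrapped = wrap_legacy_record(legacy)
```

The appeal of this design is that all the version-specific parsing quirks stay in the old, already-tested code, while the wrapper only has to know about the record's shape.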
I've done a similar thing with the BLAST XML parser (→