PyPy, Biopython and Fasta/q Parsing
tl;dr pypy and a good algorithm can give you speed
Can Pypy speed up your Sequence analysis? Pypy is a compiler for python that boasts a speedup of your programs relative to python. Anyone parsing through large fasta/fastq files knows that speedups are always welcome. How does pypy work for something simple and ubiquitous like iterating through a bunch of records and writing them back out? I use seqtk as a baseline for how fast it is possible to go and compare a few different iteration methods using python2, pypy2, python3 and pypy3. Full code is available in this Gist.
Tests#
#read/write fasta with SeqIO.parse()
from Bio import SeqIO
fasta_records = SeqIO.parse(open("test.fasta",'r'),'fasta')
SeqIO.write(fasta_records, "out.fasta","fasta")
#read/write fastq with SeqIO.parse()
from Bio import SeqIO
fastq_records = SeqIO.parse(open("test.fastq",'r'),'fastq')
SeqIO.write(fastq_records, "out.fastq","fastq")
#use readfq to read/write
fastq_records = readfq(open("test.fastq",'r'))
with open("out.fastq",'w') as f:
for name, seq, qual in fastq_records:
f.write("@seq_{}\n{}\n+\n{}\n".format(name,seq,qual))
# use fastqiterator to read/write
from Bio.SeqIO.QualityIO import FastqGeneralIterator
fastq_records = FastqGeneralIterator(open("test.fastq",'r'))
with open("out.fastq",'w') as f:
for name, seq, qual in fastq_records:
f.write("@seq_{}\n{}\n+\n{}\n".format(name,seq,qual))
Results#
# Seqtk (baseline)
- seqtk (fasta) 5
- seqtk (fastq) 7
# SeqIO.Parse() fasta
- py 28
- pypy 7
- py3 27
- pypy3 112
# SeqIO.Parse() fastq
- py 105
- pypy 30
- py3 103
- pypy3 43
# readfq
- py 11
- pypy 10
- py3 11
- pypy3 17
# fastqiterator
- py 12
- pypy 9
- py3 8
- pypy3 17
Conclusions#
Although this is a simple set of benchmarks we can spot a few interesting things. The first thing to notice is that seqtk is indeed speed king. (I am using this excellent BioStars as reference) and can parse 1 million fasta or fastq records in only a few seconds.
Python implementations of fastq parsers that are pared down toa bare minimum (readfq
and fastqiterator
) approach seqtk in speed in this simple assay. I suppose they should as it comes down to straight I/O. In this case the use of pypy does give a modest gain. Pypy’s values og 10s and 9s respectively are within 50% of the seqtk fastq parsing speed. So thats pretty nice.
If, however, we want to use BioPython’s SeqIO.parse()
function, we know we are going to incur some costs. We need to create a Seq
object for each sequence and will pay a penalty that loos to be about a 6x slowdown for fasta parsing and a ~14x slowdown for fastq parsing. Thats a pretty big hit. Does pypy help here? Tremendously. Pypy speeds up the parsing by 3-4x in both cases. This is a great result although the pypy results remain far slower than seqtk (30s vs 7s).
What did we learn? Pypy is not a completely magic solution, however, when used instead of the normal python runtime with an appropriate algorithm, this can be within striking distance of seqtk. Because pypy can be used with python’s Multiprocessing module it should be straightforward to parallelize sequence processing in such a way as to surmount these small differences. Next we will need a benchmark where something is done to the sequences - trimming, barcode multiplex lookup or something of the like.