PyPy, Biopython and Fasta/q Parsing

tl;dr pypy and a good algorithm can give you speed

Can Pypy speed up your Sequence analysis? Pypy is a compiler for python that boasts a speedup of your programs relative to python. Anyone parsing through large fasta/fastq files knows that speedups are always welcome. How does pypy work for something simple and ubiquitous like iterating through a bunch of records and writing them back out? I use seqtk as a baseline for how fast it is possible to go and compare a few different iteration methods using python2, pypy2, python3 and pypy3. Full code is available in this Gist.

Tests#

#read/write fasta with SeqIO.parse()
from Bio import SeqIO
fasta_records = SeqIO.parse(open("test.fasta",'r'),'fasta')
SeqIO.write(fasta_records, "out.fasta","fasta")

#read/write fastq with SeqIO.parse()
from Bio import SeqIO
fastq_records = SeqIO.parse(open("test.fastq",'r'),'fastq')
SeqIO.write(fastq_records, "out.fastq","fastq")

#use readfq to read/write
fastq_records = readfq(open("test.fastq",'r'))
with open("out.fastq",'w') as f:
    for name, seq, qual in fastq_records:
        f.write("@seq_{}\n{}\n+\n{}\n".format(name,seq,qual))

# use fastqiterator to read/write
from Bio.SeqIO.QualityIO import FastqGeneralIterator
fastq_records = FastqGeneralIterator(open("test.fastq",'r'))
with open("out.fastq",'w') as f:
    for name, seq, qual in fastq_records:
        f.write("@seq_{}\n{}\n+\n{}\n".format(name,seq,qual))


Results#

# Seqtk (baseline)
- seqtk (fasta)	5
- seqtk (fastq)	7

# SeqIO.Parse() fasta
- py	28
- pypy	7
- py3	27
- pypy3	112

# SeqIO.Parse() fastq
- py	105
- pypy	30
- py3	103
- pypy3	43

# readfq
- py	11
- pypy	10
- py3	11
- pypy3	17

# fastqiterator
- py	12
- pypy	9
- py3	8
- pypy3	17

Conclusions#

Although this is a simple set of benchmarks we can spot a few interesting things. The first thing to notice is that seqtk is indeed speed king. (I am using this excellent BioStars as reference) and can parse 1 million fasta or fastq records in only a few seconds.

Python implementations of fastq parsers that are pared down toa bare minimum (readfq and fastqiterator) approach seqtk in speed in this simple assay. I suppose they should as it comes down to straight I/O. In this case the use of pypy does give a modest gain. Pypy’s values og 10s and 9s respectively are within 50% of the seqtk fastq parsing speed. So thats pretty nice.

If, however, we want to use BioPython’s SeqIO.parse() function, we know we are going to incur some costs. We need to create a Seq object for each sequence and will pay a penalty that loos to be about a 6x slowdown for fasta parsing and a ~14x slowdown for fastq parsing. Thats a pretty big hit. Does pypy help here? Tremendously. Pypy speeds up the parsing by 3-4x in both cases. This is a great result although the pypy results remain far slower than seqtk (30s vs 7s).

What did we learn? Pypy is not a completely magic solution, however, when used instead of the normal python runtime with an appropriate algorithm, this can be within striking distance of seqtk. Because pypy can be used with python’s Multiprocessing module it should be straightforward to parallelize sequence processing in such a way as to surmount these small differences. Next we will need a benchmark where something is done to the sequences - trimming, barcode multiplex lookup or something of the like.