Process Substitution and BioPython
I recently picked up a copy of Vince Buffalo’s Bioinformatics Data Skills and have learned a few gems. There is one section where he discusses the uses of named pipes and process substitution, (blogpost about it here) and I wanted to try it. I also have been using the Click library for making command line scripts in python. Can I combine these together fruitfully and get the ease writing code with some of the memory-saving and computational speedups of using process substition? Indeed. Here I build up a simple script that aims to concatenate each record of a forward/reverse fastq pair and gzip the output.
Script Version 1#
See if you can pipe stdin to Biopython’s fastq iterator and pipe in the file from the command line.
#test.py
#use sys.stdin and pipe a fastqfile into your script
import sys
from Bio.SeqIO.QualityIO import FastqGeneralIterator
fastq_f = FastqGeneralIterator(sys.stdin)
with open("out.fq",'w') as f:
for fastq in fastq_f:
f.write(str(fastq))
# pipe the fastq into our iterating/writing script
cat fastqf | python test.py -
Script Version 2#
Add Click to handle arguments instead of stdin so you can allow multiple inputs down the line. Check the input works by using process substitution to cat the file.
#test.py
#add argument handling using click
import click
from Bio.SeqIO.QualityIO import FastqGeneralIterator
@click.command()
@click.argument('input1', type=click.File('rb'))
def fastqprocess(input1):
fastq_f = FastqGeneralIterator(input1)
with open("out.fq",'w') as f:
for fq in fastqs:
f.write(str(fq))
if __name__ == "__main__":
fastqprocess()
#use process substitution to give the first argument even
#though its not necessary (but image if it was gzipped...)
python test.py <(cat fastqf)
Script Version 3#
Add multiple arguments so that both the forward and reverse fastq file can be entered as well as the output file.
Notice all opening and closing of files is handled by click
. Use process substitution to cat the input files and also
to gzip the output file.
import click
from Bio.SeqIO.QualityIO import FastqGeneralIterator
@click.command()
@click.argument('input1', type=click.File('rb'))
@click.argument('input2', type=click.File('rb'))
@click.argument('output', type=click.File('w'))
def fastqprocess(input1,input2, output):
fastq_f = FastqGeneralIterator(input1)
fastq_r = FastqGeneralIterator(input2)
for forward, reverse in zip(fastq_f, fastq_r):
name1, seq1, qual1 = forward
name2, seq2, qual2 = reverse
output.write("@seq_{}\n{}\n+\n{}\n".format(name1,seq1+seq2,qual1+qual2))
if __name__ == "__main__":
fastqprocess()
# add in your two fastq files and gzip the output
python test.py <(zcat fastqf.gz) <(zcat fastqr.gz) >(gzip > out.fq.gz)
zcat out.fq.gz | head
Conclusions#
I like process substition. It lets the you tap into some nice performance benefits as long as you learn a few simple conventions. The major benefit of the named pipe/process substitution model is that you avoid writing your intermediate results. This can save a lot of disk space AND have the benefit of a dramatic speedup since I/O to disk takes a lot more time than working in memory. Furthermore, it is nearly trivial to add gzip/gunzip which allows you to easily incorporate compression into all of your proccesing scripts and Makefiles.
Also, I can now recommend using the click.File() settings in your program to handle the opening and closing of file resources. It lets you reduce the number of with open(…) in your code for simpler, sleeker function definitions. (compare Version 3 with this). It looks as if the click.Path() funcitons can also serve thhe useful funtion of checking for the existence of files on your path.