Wildcards in Make

Make is a tool initially developed by software developers to build their source code, but it has found another home in data analysis and reproducible research. Mike Bostock’s excellent make overview outlines the main benefits: a concise way to show all of the steps necessary to build something - in his case a visualization. Karl Broman’s make tutorial is a second excellent introduction to make, this time in the context of creating a completely reproducible research paper. One of Karl’s examples is the use of wildcards, which let you use pattern matching to process lots of files in a simple, declarative way. Anyone new to make will probably have a bit of difficulty with the syntax, but once mastered it is an incredibly useful way to represent the data processing steps of large, multistep projects. I want to highlight how wildcards can be used for data processing.

Imagine I have a folder, data, with a few hundred fastq files. I would like to quality-trim these files and blast them against a database. Assuming for a second that I don’t want to concatenate the files, I could reach for a for loop, or for something as simple as this:

#1. define the fastq files
data = $(wildcard data/*.fastq.gz)

#2. substitute the folder and suffix
qc_data = $(data:data/%.fastq.gz=processed/%.fasta)

#3. substitute the folder and suffix again
processed = $(qc_data:processed/%.fasta=blast/%.blastout)

#4. make target using wildcards
# trim and convert to fasta using seqtk
processed/%.fasta: data/%.fastq.gz
	mkdir -p processed
	seqtk trimfq $< | seqtk seq -a > $@

#5. blast
blast/%.blastout: processed/%.fasta
	mkdir -p blast
	blastn -query $< -out $@ -db somedb

#6. call them all
all: $(qc_data) $(processed)

There are six things happening here:

  1. The wildcard function serves as a glob: your initial data will be all the files in the data directory ending in .fastq.gz. (Note that the glob character inside $(wildcard ...) is *; the % character is make’s pattern-matching stem, used in the substitutions and rules below.)
  2. Define a substitution for each of the data files, changing the directory and suffix of the original names. These new names will be the outputs of the trimming and fasta-conversion step.
  3. Once again I redefine the names, this time changing directories and suffixes again. These will be the blast output.
  4. Define the trimming step. This is a target with wildcards on both sides. It works. Amazingly. I have a one-to-one mapping of input to output so I don’t know what would happen if the expansions were different lengths - there might be an issue. But for this example, the expansion will work correctly. The excellent seqtk toolkit is used to trim (trimfq) and convert (seq -a) using the input, $<, and output, $@, automatic variables.
  5. Define the blast step. Same expansion and notation.
  6. Define an all target that depends on every output. Since make builds the first target in the file by default, either put this rule at the top of the makefile or invoke it explicitly as make all.
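The substitution references in steps 1-3 can be sketched in plain Python; the helper and the sample file names below are mine, purely for illustration:

```python
def substitute(names, old_prefix, old_suffix, new_prefix, new_suffix):
	"""Mimic $(var:old_prefix%old_suffix=new_prefix%new_suffix):
	match the % stem and swap the surrounding prefix/suffix."""
	out = []
	for name in names:
		if name.startswith(old_prefix) and name.endswith(old_suffix):
			stem = name[len(old_prefix):len(name) - len(old_suffix)]
			out.append(new_prefix + stem + new_suffix)
		else:
			out.append(name)  # make leaves non-matching words untouched
	return out

data = ["data/sample1.fastq.gz", "data/sample2.fastq.gz"]
qc_data = substitute(data, "data/", ".fastq.gz", "processed/", ".fasta")
processed = substitute(qc_data, "processed/", ".fasta", "blast/", ".blastout")
```

Each list is derived from the previous one purely by rewriting names, which is exactly what lets make connect the three stages without ever listing files by hand.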

I will admit the notation can be tough to get used to, but I have just defined a simple data processing pipeline using pattern matching and rule-making. This is about as declarative as it gets: no for loops, just a description of what to do and a set of files to do it to. The wildcard expansion is a really nice trick that makes Make a phenomenal tool. How does this compare to scripting the same job in Python?

import glob
import os
from subprocess import call

os.makedirs("processed", exist_ok=True)
os.makedirs("blast", exist_ok=True)

for p in glob.glob("data/*.fastq.gz"):
	name = os.path.basename(p).split(".")[0]
	seqtkname = "processed/{}.fasta".format(name)
	blastname = "blast/{}.blastout".format(name)
	seqtk = "seqtk trimfq {} | seqtk seq -a > {}".format(p, seqtkname)
	blast = "blastn -query {} -out {} -db somedb".format(seqtkname, blastname)
	call(seqtk, shell=True)
	call(blast, shell=True)

The python version is not too long and will get the job done; both are fairly succinct and straightforward. However, the make version gives you all the nice time-stamp goodness that is really very useful when working with large files: make skips any target that is already newer than its inputs, so you won’t mistakenly rerun large/expensive jobs so long as you don’t change any of the files upstream of your processed file. The Python script, by contrast, reruns everything every time.
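The time-stamp check make performs for every rule can be sketched in a few lines of Python; needs_rebuild and the temporary file names are my own invention for illustration, not part of the pipeline above:

```python
import os
import tempfile

def needs_rebuild(target, prerequisite):
	"""True if target is missing or older than its prerequisite,
	i.e. the condition under which make runs a recipe."""
	if not os.path.exists(target):
		return True
	return os.path.getmtime(target) < os.path.getmtime(prerequisite)

with tempfile.TemporaryDirectory() as d:
	fastq = os.path.join(d, "sample1.fastq.gz")
	fasta = os.path.join(d, "sample1.fasta")
	open(fastq, "w").close()
	missing = needs_rebuild(fasta, fastq)   # no output yet, so rebuild
	open(fasta, "w").close()
	t = os.path.getmtime(fastq)
	os.utime(fasta, (t + 1, t + 1))         # stamp the output newer than the input
	up_to_date = needs_rebuild(fasta, fastq)
```

This is the whole trick behind make’s incremental behaviour: the decision is per file, based only on modification times, which is why touching an upstream file cascades rebuilds down the chain.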

Hopefully this example helps someone who is struggling with Make’s syntax.