EUtils Gene Download

I’ve written before about the power of Entrez Direct, a set of command line utilities for accessing NCBI’s immense data. Although the web interface is great for many things, if you find yourself clicking and clicking the CLI is probably a better route. I wanted to be able to download all of the bacterial genes annotated by some name but ran into an issue whereby I was unintentionally downloading the entire genome - not good. Heres the problem and the workaround.

Below is an esearch command that will lookup bacterial ornithine decarboxylases, elink to the nucleotide database (nuccore), and download (efetch) the fasta records of each.

# this will download all the bacterial genomes in which 
# an ornithine decarboxylate is found.... multiple Gb of data
esearch -db gene -query "ornithine decarboxylase AND bacteria [Organism]" | \
  elink -target nuccore | \
  efetch -format fasta > orn_decarb.fna

So lets see whats going on step by step. Here we can see the esearch call is finding 1977 matching genes to my query.

# theres 1977 matching genes
esearch -db gene -query "ornithine decarboxylase AND bacteria [Organism]" 

<ENTREZ_DIRECT>
  <Db>gene</Db>
  <WebEnv>NCID_1_372671251_130.14.22.215_9001_1438825162_1074538627_0MetA0_S_MegaStore_F_1</WebEnv>
  <QueryKey>1</QueryKey>
  <Count>1977</Count>
  <Step>1</Step>
</ENTREZ_DIRECT>

Lets take a look at a few entries to see what the data looks like:

esearch -db gene -query "ornithine decarboxylase AND bacteria [Organism]" | efetch

95. speC
ornithine decarboxylase [Escherichia coli IAI39]
Other Aliases: ECIAI39_3389
Annotation: NC_011750.1 (3542125..3544260, complement)
ID: 7152283

96. adiA
biodegradative arginine decarboxylase [Escherichia coli IAI39]
Other Aliases: ECIAI39_4541
Annotation: NC_011750.1 (4745988..4748258, complement)
ID: 7150691

So now we can see that the nuccore reference is not a single nucleotide reference with a unique ID, but instread is a link to a genome Accession with information about the stop and start locations and the strand. If we use elink that location data will be ignored and the entire nucleotide accession record will be downloaded:

# this will pull in the entire NC_011750.1 sequence
esearch -db gene -id  7150691 | elink -target nuccore | fetch -format fasta

So we need a workaround. I sent an email to NCBI’s helpdesk and they were quick to respond with a partial solution:

# create a three column list of genes and start/stops
esearch -db gene -query "ornithine decarboxylase AND bacteria [Organism]" | \
  esummary | \
  xtract -pattern GenomicInfoType -element ChrAccVer,ChrStart,ChrStop > locations.txt
  
...
NC_002491.1	850656	850357
NC_002771.1	789191	788922
NC_003116.1	1959076	1958813
NC_002570.2	1424067	1423792
NC_000921.1	77725	77994
NC_000908.2	460950	460684
...

The main issue is that there is no way to pipe the location into elink to get the locations so we must settle for the next best thing - obtaining the genomes and the locations of the genes. To pull out the genes you need a script. Heres my quick and dirty script to pump each line into an efetch command.

#!/usr/bin/env python

import sys
from subprocess import call

def process_file(f, outfile):
    for line in open(f,'r'):
        accession, start, stop = line.split()
        if stop > start:
            strand = 1
        else:
            strand=2
            
        cmd = "efetch -db nuccore -id {} -seq_start {} -seq_stop {} -strand {} -format fasta >> {}".format(
                  accession, start, stop, strand, outfile)
        call(cmd, shell=True)

if __name__ == "__main__":
    f = sys.argv[1]
    outfile = sys.argv[2]
    process_file(f,outfile)

In the end, this is not as clean as it could be. Ideally, for genes referencing a genome location, the default elink/efetch behavior would be to download only the region of interest. The tradeoff of this particular approach is more calls to the server (one for each gene) but far less sequence to download. As long as the number of genes of interest are of a reasonable order, I think this is good tradeoff. Normally NCBI prefers you to state upfront what you are going to download so it can prepare/cache the downloads - a process that happens transparently for you when using the CLI. I have requested that they consider changing the default behavior for this gene-to-nuccore database linkage.

#all together now in a Makefile
locations.txt:
  esearch -db gene -query "ornithine decarboxylase AND bacteria [Organism]" | \
    esummary | \
    xtract -pattern GenomicInfoType -element ChrAccVer,ChrStart,ChrStop > $@

genes.fna: locations.txt
  python fetch genes.py $< $@