Exploring Mibig with Datalog

As a follow-up to the previous post about Mibig, this post shows how to use Datalog to examine Mibig data. I came across the excellent datamaps library, which lets you ingest Clojure maps and vectors as a database that can then be queried with Datomic's Datalog engine. Below are a few examples to give a flavor of how it works. Perhaps the best illustration of its flexibility is the query that reads as “get all molecules that have tyrosine-encoding adenylation domains.”

Code available here.

load the data

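; assumed setup: cheshire and datamaps on the classpath; jsonfiles is a seq of paths to the Mibig JSON files
(require '[cheshire.core :refer [parse-stream]]
         '[clojure.java.io :as io]
         '[datamaps.core :as dm])

(def mibigdata
  (mapv #(parse-stream (io/reader %) true) jsonfiles))

; create a fact database
(def mibigdata-dm (dm/facts mibigdata))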

get compounds of a single Mibig entry

(dm/q '[:find ?c
        :where [?e    :mibig_accession "BGC0001305"]
               [?e    :compounds ?cmps]
               [?cmps :compound ?c]]
  mibigdata-dm)

["fujikurins"]

get Mibig ID and compound names of polyketides

(dm/q '[:find ?mb ?c
        :where
        [?e   :mibig_accession ?mb]
        [?e   :biosyn_class "Polyketide"]
        [?e   :compounds ?cmps]
        [?cmps :compound ?c]]
      mibigdata-dm)

...
 ["BGC0001401" "6’-Chloromelleolide F"]
 ["BGC0001401" "6’-Bromomelleolide F"]
 ["BGC0001403" "trypacidin"]
 ["BGC0001404" "sorbicillin"]
 ["BGC0001405" "chaetoviridin"]
 ["BGC0001405" "chaetomugilin"]
 ["BGC0001409" "dutomycin"])

get Mibig ID and compound names of terpenes

(dm/q '[:find ?mb ?c
        :where
        [?e   :mibig_accession ?mb]
        [?e   :biosyn_class "Terpene"]
        [?e   :compounds ?cmps]
        [?cmps :compound ?c]]
      mibigdata-dm)

["BGC0001323" "monoterpenes-diterpenes"]
 ["BGC0001324" "monoterpenes-diterpenes"]
 ["BGC0001361" "sodorifen"]
 ["BGC0001372" "penigequinolone"]
 ["BGC0001375" "penitrem A"]
 ["BGC0001375" "penitrem B"]
 ["BGC0001375" "penitrem C"]
 ["BGC0001375" "penitrem D"]
 ["BGC0001375" "penitrem E"]
 ["BGC0001375" "penitrem F"])
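The polyketide and terpene queries differ only in the class string. Because a query is ordinary Clojure data, one way to avoid the duplication is a small function that builds it. This is only a sketch: compounds-by-class-query is a hypothetical helper name, not something datamaps provides.

```clojure
;; Hypothetical helper: builds the same query shape as the two queries above,
;; with the biosynthetic class passed in as an argument.
(defn compounds-by-class-query
  [biosyn-class]
  [:find '?mb '?c
   :where
   ['?e    :mibig_accession '?mb]
   ['?e    :biosyn_class    biosyn-class]
   ['?e    :compounds       '?cmps]
   ['?cmps :compound        '?c]])

;; The result is plain data, identical to the hand-written quoted form.
(compounds-by-class-query "Terpene")
```

Passing the result to dm/q, as in (dm/q (compounds-by-class-query "Terpene") mibigdata-dm), should behave like the quoted terpene query, since dm/q receives the same data structure either way.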

get all PKS subclasses

(->
  (dm/q '[:find ?pksub
          :where
          [?e    :Polyketide  ?pk]
          [?pk   :pk_subclass ?pksub]]
        mibigdata-dm)
  flatten
  set)


#{"Benzoisochromanequinone"
  "Angucycline"
  "Ansamycin"
  "Enediyine"
  "None"
  "Macrolide"
  "Polyether"
  "Anthracycline"
  "Polyene"
  "Tetracycline"
  "Aryl polyene"
  "Chalcone"
  "Other"
  "Polyphenol"
  "Tetracenomycin"}
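The flatten and set steps are there because each match comes back as its own tuple. Assuming the rows are a sequence of single-element vectors, as the printed results in this post suggest, the post-processing can be tried on its own with made-up rows:

```clojure
;; Hypothetical rows shaped like the query result: one single-element tuple per match.
(def rows [["Macrolide"] ["Polyene"] ["Macrolide"] ["Other"]])

;; Same pipeline as above: unwrap the tuples, then drop duplicates.
(def subclasses (-> rows flatten set))

;; Equivalent without flatten: take the first element of each tuple.
(def subclasses-alt (into #{} (map first) rows))

;; Both are the set of "Macrolide", "Polyene" and "Other".
(= subclasses subclasses-alt)
;; => true
```

The into version only unwraps one level, which is a little safer if a value ever happens to be a collection itself.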

get all NRPS subclasses

(->
  (dm/q '[:find ?nrpsub
          :where
          [?e    :NRP  ?nrp]
          [?nrp  :subclass ?nrpsub]]
        mibigdata-dm)
  flatten
  set)

get all molecules that have tyrosine-encoding adenylation domains

(->
  (dm/q '[:find ?mb ?c
          :where
          [?e       :mibig_accession ?mb]
          [?e       :compounds ?cmps]
          [?cmps    :compound ?c]
          [?e       :NRP  ?nrp]
          [?nrp     :nrps_genes ?nrpgene]
          [?nrpgene :nrps_module ?nrpsmod]
          [?nrpsmod :a_substr_spec ?adom]
          [?adom    :prot_adom_spec "Tyrosine"]]
        mibigdata-dm)
  set)

#{["BGC0001178" "UK-68,597"]
  ["BGC0001095" "fengycin"]
  ["BGC0000429" "Skyllamycin A"]
  ["BGC0001028" "nostopeptolide A"]
  ["BGC0001049" "Tenellin"]
  ["BGC0000970" "Chondrochloren"]
  ["BGC0000289" "A40926"]
  ["BGC0000440" "teicoplanin"]
  ["BGC0001050" "Thalassospiramide A"]
  ["BGC0000290" "A47934"]
  ["BGC0001136" "desmethylbassianin"]
  ["BGC0000393" "Myxoprincomide-c506"]
  ["BGC0000306" "Arylomycin"]
  ["BGC0001152" "fusaricidin B"]
  ["BGC0001090" "bacillomycin D"]}
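To see what the Datalog version saves, here is a sketch of the same question answered by walking the maps by hand. The toy entries are hypothetical: their nesting is inferred from the attributes used in the query above, not copied from a real Mibig file.

```clojure
;; Hypothetical toy data; real Mibig entries carry many more fields.
(def toy-entries
  [{:mibig_accession "BGC-TOY-1"
    :compounds [{:compound "toy-peptide"}]
    :NRP {:nrps_genes
          [{:nrps_module
            [{:a_substr_spec {:prot_adom_spec "Tyrosine"}}
             {:a_substr_spec {:prot_adom_spec "Glycine"}}]}]}}
   {:mibig_accession "BGC-TOY-2"
    :compounds [{:compound "toy-terpene"}]}])

(defn tyrosine-compounds
  "Hand-written traversal: one binding per level of nesting."
  [entries]
  (set
   (for [e      entries
         gene   (get-in e [:NRP :nrps_genes])
         module (:nrps_module gene)
         :when  (= "Tyrosine" (get-in module [:a_substr_spec :prot_adom_spec]))
         cmp    (:compounds e)]
     [(:mibig_accession e) (:compound cmp)])))

(tyrosine-compounds toy-entries)
;; => #{["BGC-TOY-1" "toy-peptide"]}
```

The traversal has to spell out the exact path to every value, while the query only states how the attributes relate to one another.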

all contributors

(->
  (dm/q '[:find ?name ?institute
          :where
          [?person :submitter_name ?name]
          [?person :submitter_institution ?institute]]
        mibigdata-dm)
  set)

  ...
  ["Robin Teufel" "Scripps Institution of Oceanography; UCSD"]
  ["Eric J. N. Helfrich" "Institute of Microbiology, Eidgenössische Technische Hochschule (ETH) Zurich, Piel lab"]
  ["Keishi Ishida" "Leibniz Institute for Natural Product Research and Infection Biology"]
  ["Stefan Diethelm" "Scripps Institution of Oceanography"]}

contributors by frequency

(->
  (dm/q '[:find (count ?name) ?name ?institute
          :where
          [?person :submitter_name ?name]
          [?person :submitter_institution ?institute]]
        mibigdata-dm)
  sort)

...
 [7 "Neha Garg" "University of California at San Diego"]
 [8 "Yit-Heng Chooi" "The Australian National University"]
 [9 "David Fewer" "University of Helsinki"]
 [9 "Jose A. Salas" "University of Oviedo"]
 [9 "Yuta Tsunematsu" "University of Shizuoka"]
 [10 "Xiaohui Yan" "The Scripps Research Institute"]
 [13 "Fengan Yu" "University of Michigan"]
 [14 "Keishi Ishida" "Leipniz Institute for Natural Product Research and Infection Biology"]
 [20 "Daniel Krug" "HIPS"])
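Plain sort orders the tuples ascending by count, so the most active contributors end up at the bottom. A small sketch of listing them first instead, using a few rows copied from the output above; top-contributors is a hypothetical helper name:

```clojure
;; A handful of rows taken from the result above, deliberately out of order.
(def rows
  [[9 "David Fewer" "University of Helsinki"]
   [13 "Fengan Yu" "University of Michigan"]
   [20 "Daniel Krug" "HIPS"]
   [10 "Xiaohui Yan" "The Scripps Research Institute"]])

(defn top-contributors
  "Returns the n rows with the highest count, largest first."
  [n rows]
  (take n (sort-by first > rows)))

(top-contributors 2 rows)
;; => ([20 "Daniel Krug" "HIPS"] [13 "Fengan Yu" "University of Michigan"])
```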

misspellings of the Rockefeller University

(->
  (dm/q '[:find ?institute (count ?institute)
          :where
          [?person :submitter_name "Sean Brady"]
          [?person :submitter_institution ?institute]]
        mibigdata-dm)
  sort)

(["The Rockefeller Univeristy" 2] ["The Rockefeller University" 4])
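A typo like this one swaps two letters, so both spellings contain exactly the same characters. That suggests a rough way to flag candidates across all institutions: group names by their sorted letters. This is only a heuristic sketch with hypothetical helper names; it would not catch a substituted letter.

```clojure
(require '[clojure.string :as str])

(defn letter-key
  "Lower-cased characters of s in sorted order; transposed letters give the same key."
  [s]
  (apply str (sort (str/lower-case s))))

(defn transposition-groups
  "Groups of distinct names that are made of exactly the same characters."
  [names]
  (->> (distinct names)
       (group-by letter-key)
       vals
       (filter #(> (count %) 1))))

(transposition-groups ["The Rockefeller Univeristy"
                       "The Rockefeller University"
                       "University of Helsinki"])
;; => (["The Rockefeller Univeristy" "The Rockefeller University"])
```

The names to check could come from a query over :submitter_institution like the ones above.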