Exploring MIBiG with Datalog
As a follow-up to the previous post about MIBiG, this post shows how to use Datalog to examine MIBiG data. I came across the excellent datamaps library, which lets you ingest Clojure maps and vectors as a database that can then be queried using Datomic's Datalog engine. Below are a few examples to give a flavor of how it works. Perhaps the best example of the flexibility is the query that can be read as "get all molecules that have tyrosine-encoding adenylation domains."
Code available here.
load the data
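; parse-stream comes from cheshire; the datamaps namespace is assumed to be datamaps.core
(require '[cheshire.core :refer [parse-stream]]
         '[datamaps.core :as dm])
; jsonfiles (defined elsewhere) is a seq of MIBiG JSON files, parsed with keyword keys
(def mibigdata
  (into [] (map #(parse-stream (clojure.java.io/reader %) true) jsonfiles)))
; create a fact database
(def mibigdata-dm (dm/facts mibigdata))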
get compounds of a single MIBiG entry
(dm/q '[:find ?c
        :where [?e :mibig_accession "BGC0001305"]
               [?e :compounds ?cmps]
               [?cmps :compound ?c]]
      mibigdata-dm)
["fujikurins"]
get MIBiG ID and compound names of polyketides
(dm/q '[:find ?mb ?c
        :where
        [?e :mibig_accession ?mb]
        [?e :biosyn_class "Polyketide"]
        [?e :compounds ?cmps]
        [?cmps :compound ?c]]
      mibigdata-dm)
...
["BGC0001401" "6’-Chloromelleolide F"]
["BGC0001401" "6’-Bromomelleolide F"]
["BGC0001403" "trypacidin"]
["BGC0001404" "sorbicillin"]
["BGC0001405" "chaetoviridin"]
["BGC0001405" "chaetomugilin"]
["BGC0001409" "dutomycin"])
get MIBiG ID and compound names of terpenes
(dm/q '[:find ?mb ?c
        :where
        [?e :mibig_accession ?mb]
        [?e :biosyn_class "Terpene"]
        [?e :compounds ?cmps]
        [?cmps :compound ?c]]
      mibigdata-dm)
["BGC0001323" "monoterpenes-diterpenes"]
["BGC0001324" "monoterpenes-diterpenes"]
["BGC0001361" "sodorifen"]
["BGC0001372" "penigequinolone"]
["BGC0001375" "penitrem A"]
["BGC0001375" "penitrem B"]
["BGC0001375" "penitrem C"]
["BGC0001375" "penitrem D"]
["BGC0001375" "penitrem E"]
["BGC0001375" "penitrem F"])
get all PKS subclasses
(->
 (dm/q '[:find ?pksub
         :where
         [?e :Polyketide ?pk]
         [?pk :pk_subclass ?pksub]]
       mibigdata-dm)
 flatten
 set)
#{"Benzoisochromanequinone"
"Angucycline"
"Ansamycin"
"Enediyine"
"None"
"Macrolide"
"Polyether"
"Anthracycline"
"Polyene"
"Tetracycline"
"Aryl polyene"
"Chalcone"
"Other"
"Polyphenol"
"Tetracenomycin"}
get all NRPS subclasses
(->
 (dm/q '[:find ?nrpsub
         :where
         [?e :NRP ?nrp]
         [?nrp :subclass ?nrpsub]]
       mibigdata-dm)
 flatten
 set)
get all molecules that have tyrosine-encoding adenylation domains
(->
 (dm/q '[:find ?mb ?c
         :where
         [?e :mibig_accession ?mb]
         [?e :compounds ?cmps]
         [?cmps :compound ?c]
         [?e :NRP ?nrp]
         [?nrp :nrps_genes ?nrpgene]
         [?nrpgene :nrps_module ?nrpsmod]
         [?nrpsmod :a_substr_spec ?adom]
         [?adom :prot_adom_spec "Tyrosine"]]
       mibigdata-dm)
 set)
#{["BGC0001178" "UK-68,597"]
["BGC0001095" "fengycin"]
["BGC0000429" "Skyllamycin A"]
["BGC0001028" "nostopeptolide A"]
["BGC0001049" "Tenellin"]
["BGC0000970" "Chondrochloren"]
["BGC0000289" "A40926"]
["BGC0000440" "teicoplanin"]
["BGC0001050" "Thalassospiramide A"]
["BGC0000290" "A47934"]
["BGC0001136" "desmethylbassianin"]
["BGC0000393" "Myxoprincomide-c506"]
["BGC0000306" "Arylomycin"]
["BGC0001152" "fusaricidin B"]
["BGC0001090" "bacillomycin D"]}
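To make the path this query walks more concrete, here is a rough plain-Clojure sketch of the same traversal without any Datalog. It is only an illustration: the toy entries, their accession strings, and the function name tyrosine-compounds are made up, and the nesting is inferred from the attribute names in the query above, so the real MIBiG JSON may be shaped differently.

```clojure
;; Hypothetical toy entries; the keys mirror the attributes used in the
;; Datalog query, but the nesting is an assumption, not the real MIBiG schema.
(def toy-entries
  [{:mibig_accession "TOY0001"
    :compounds [{:compound "toy-peptide"}]
    :NRP {:nrps_genes
          [{:nrps_module
            [{:a_substr_spec {:prot_adom_spec "Tyrosine"}}
             {:a_substr_spec {:prot_adom_spec "Glycine"}}]}]}}
   {:mibig_accession "TOY0002"
    :compounds [{:compound "toy-ketide"}]
    :NRP {:nrps_genes
          [{:nrps_module
            [{:a_substr_spec {:prot_adom_spec "Alanine"}}]}]}}])

(defn tyrosine-compounds
  "Return the set of [accession compound] pairs for entries that have at
  least one module whose adenylation domain is annotated as Tyrosine."
  [entries]
  (set
   (for [entry  entries
         gene   (get-in entry [:NRP :nrps_genes])
         module (:nrps_module gene)
         :when  (= "Tyrosine"
                   (get-in module [:a_substr_spec :prot_adom_spec]))
         cmpd   (:compounds entry)]
     [(:mibig_accession entry) (:compound cmpd)])))

(tyrosine-compounds toy-entries)
;; => #{["TOY0001" "toy-peptide"]}
```

Every step of the nesting has to be spelled out by hand here, whereas in the Datalog version each step is one more clause, which is where the flexibility comes from.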
all contributors
(->
 (dm/q '[:find ?name ?institute
         :where
         [?person :submitter_name ?name]
         [?person :submitter_institution ?institute]]
       mibigdata-dm)
 set)
...
["Robin Teufel" "Scripps Institution of Oceanography; UCSD"]
["Eric J. N. Helfrich" "Institute of Microbiology, Eidgenössische Technische Hochschule (ETH) Zurich, Piel lab"]
["Keishi Ishida" "Leibniz Institute for Natural Product Research and Infection Biology"]
["Stefan Diethelm" "Scripps Institution of Oceanography"]}
contributors by frequency
(->
 (dm/q '[:find (count ?name) ?name ?institute
         :where
         [?person :submitter_name ?name]
         [?person :submitter_institution ?institute]]
       mibigdata-dm)
 sort)
...
[7 "Neha Garg" "University of California at San Diego"]
[8 "Yit-Heng Chooi" "The Australian National University"]
[9 "David Fewer" "University of Helsinki"]
[9 "Jose A. Salas" "University of Oviedo"]
[9 "Yuta Tsunematsu" "University of Shizuoka"]
[10 "Xiaohui Yan" "The Scripps Research Institute"]
[13 "Fengan Yu" "University of Michigan"]
[14 "Keishi Ishida" "Leipniz Institute for Natural Product Research and Infection Biology"]
[20 "Daniel Krug" "HIPS"])
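For comparison, the same kind of tally can be sketched in plain Clojure with frequencies. This is a minimal sketch under assumptions: toy-submissions and contributor-counts are hypothetical names, and the input is assumed to be one [name institute] pair per submitted entry. Unlike the sort above, it lists the most frequent contributors first.

```clojure
;; Hypothetical [name institute] pairs, one per submitted entry.
(def toy-submissions
  [["Ada" "Inst A"] ["Bob" "Inst B"] ["Ada" "Inst A"]
   ["Ada" "Inst A"] ["Bob" "Inst B"] ["Cy" "Inst C"]])

(defn contributor-counts
  "Count identical [name institute] pairs and return
  [count name institute] rows, most frequent first."
  [submissions]
  (->> (frequencies submissions)
       (map (fn [[[person institute] n]] [n person institute]))
       (sort-by first >)))

(contributor-counts toy-submissions)
;; => ([3 "Ada" "Inst A"] [2 "Bob" "Inst B"] [1 "Cy" "Inst C"])
```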
misspellings of the Rockefeller University
(->
 (dm/q '[:find ?institute (count ?institute)
         :where
         [?person :submitter_name "Sean Brady"]
         [?person :submitter_institution ?institute]]
       mibigdata-dm)
 sort)
(["The Rockefeller Univeristy" 2] ["The Rockefeller University" 4])
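The same check could be run for every submitter at once. The following is a small sketch in plain Clojure, assuming the [name institute] pairs have already been pulled out of the database (for example with the "all contributors" query, before it is collapsed into a set). The names toy-pairs and inconsistent-institutes, and the toy data, are hypothetical.

```clojure
;; Hypothetical [name institute] pairs; "Sam" appears with two spellings.
(def toy-pairs
  [["Sam" "Alpha Univeristy"]
   ["Sam" "Alpha University"]
   ["Sam" "Alpha University"]
   ["Kim" "Beta Institute"]])

(defn inconsistent-institutes
  "Map each submitter listed under more than one institution string
  to the set of strings used for them."
  [pairs]
  (->> pairs
       (group-by first)
       (keep (fn [[person rows]]
               (let [variants (set (map second rows))]
                 (when (> (count variants) 1)
                   [person variants]))))
       (into {})))

(inconsistent-institutes toy-pairs)
;; => {"Sam" #{"Alpha Univeristy" "Alpha University"}}
```

A submitter who has genuinely moved between institutions would show up here too, so the output is a list of candidates to inspect rather than a list of typos.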