Notes From BioData - API All the Things
I recently attended the inaugural BioData conference at Cold Spring Harbor Laboratory. Organized by Mike Schatz, Anne Carpenter, and Matt Wood, the conference was an attempt to bring into a single room the many divergent groups of people who are interested in combining biology and data science. Overall I enjoyed the conference and met a number of interesting, bright people while learning a fair number of new things. In terms of talks (abstracts here), the subject of large data dominated a number of the sessions. These talks were engineering/compsci-heavy, with the investigators trying to answer questions like: how do you design a datacenter/supercomputer to analyze 100,000 genomes? How can you scale from analyzing 100 genomes to analyzing 100,000? What software designs should we follow? APIs or files/programs? How do we store petabytes of genome sequences? These talks were at a much bigger scale than I am accustomed to, but it was interesting to see how the major players are scaling, especially with human genome datasets.
There were also a number of less interesting pipeline talks/posters where people described how they processed a large dataset. As there are lots of datasets and lots of tools, this genre of talk is inevitable but of less general interest. Finally, there were a number of talks about domain-specific applications of data science to particular biological questions, including a talk by Gunnar Ratsch on applying natural language processing to physicians' notes in order to google-ify the tracking of patients with cancer at MSKCC. He showed several natural language processing tools that are being used to categorize and organize patient data to allow physicians to better track and compare large cohorts of patients.
Keynote 1: API All the Things
David Haussler's keynote can be summed up as: "API all the things!". As a leader in the Global Alliance for Genomics and Health, he is pitching for a global set of standards that can be used to reliably store and query the deluge of genomic data being produced around the world. His plea/argument is based on the growing amount of data being generated, which can no longer be housed in a central repository. As such he is focused on establishing a set of standards for identifying datasets (a hash scheme for uniquely ID-ing datasets) and a reliable means of communicating via an agreed-upon API that all participating organizations will need to implement. For example, you may need to ask "Do you have a genome sequence with an 'A' on chromosome 19 position 23244345?" and the API would need to be able to respond "YES", "NO", or "Cannot Respond" (a made-up sketch of such a query appears at the end of this section). More difficult, costly, or private queries would need to go through authentication, but the idea is that a unified API could eventually allow more complicated queries as well. He also brought up several technical issues:
- We need a new means of specifying reference locations and should adopt a graph-based structure for genomes, where each position is uniquely identified not by an integer offset from the end of the chromosome but by a minimal set of left- and right-flanking sequences that can uniquely and unequivocally locate a sequence in a genome (a toy illustration follows this list).
- We need better/faster methods for querying stored data; he pointed to Matt Massie as an example of someone trying to bring speed to large queries with Spark.
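The flanking-sequence idea is easier to grasp with a toy example. The sketch below is entirely my own: it uses a plain linear reference string rather than the graph structure Haussler is proposing, and the function names are made up. It just shows the principle of naming a position by the shortest context that occurs exactly once in the reference, rather than by an integer offset:

```python
# Toy, linear-genome illustration of identifying a position by its minimal
# unique flanking context (my own sketch, not the graph formalism proposed).

def minimal_unique_flanks(ref, pos):
    """Grow symmetric flanks around `pos` until the context occurs exactly once."""
    for k in range(1, len(ref)):
        left = ref[max(0, pos - k):pos]
        right = ref[pos + 1:pos + 1 + k]
        context = left + ref[pos] + right
        if ref.count(context) == 1:
            return left, right
    return None

def locate(ref, left, base, right):
    """Recover the integer position from the flanking context."""
    return ref.find(left + base + right) + len(left)

ref = "ACGTACGGTTACGTA"
left, right = minimal_unique_flanks(ref, 6)
print(left, ref[6], right)               # minimal unique context around position 6
print(locate(ref, left, ref[6], right))  # -> 6
```

The appeal, as I understand it, is that a context-based identifier stays meaningful even when integer coordinates shift between assemblies or graph paths.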
Also, I got to see firsthand the legendary Hawaiian shirts.
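And here is the kind of allele query I understood the proposed API to standardize, as a made-up sketch: the endpoint URL, parameter names, and response format are all my invention, not the actual GA4GH specification.

```python
# Made-up sketch of a standardized allele-presence query. Every participating
# site would expose the same endpoint and answer YES / NO / CANNOT_RESPOND.
# URL, parameter names, and JSON shape are hypothetical, not the GA4GH spec.

import json
from urllib.request import urlopen

SITE_URL = "https://example.org/api/allele"   # hypothetical participating site

def query_allele(chromosome, position, base):
    """Ask a site whether any genome it holds has `base` at this position."""
    url = f"{SITE_URL}?chrom={chromosome}&pos={position}&base={base}"
    with urlopen(url) as response:
        reply = json.load(response)
    return reply.get("answer", "CANNOT_RESPOND")  # "YES", "NO", or "CANNOT_RESPOND"

# query_allele("19", 23244345, "A")  ->  "YES", "NO", or "CANNOT_RESPOND"
```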
Keynote 2: Encryption for the Cloud
The second keynote speaker, Kristin Lauter, is a cryptographer at Microsoft and spoke about a technique called homomorphic encryption. Homo-what? Yeah, I had no idea either. The basic idea: can you encrypt data in such a way that a third party can perform a function on the encrypted data, at which point the result can be returned to you for decryption? In the era of cloud computing this capability is clearly needed, but how does it work? First of all, none of the encryption schemes commonly in use are "homomorphic", which roughly means that if two functions f() and g() act on some data, the result is the same whether you compute f(g(data)) or g(f(data)). In this example g() and f() represent an encryption step and a query. If you can satisfy this homomorphic property, you can upload encrypted data to an untrusted cloud server and perform secure queries on it. Dr. Lauter walked us through the basics, and although I think the math behind it has got to be pretty interesting, it's way over my head. She did give us a hint, though, by saying that the encryption keys can be thought of as the coefficients of a large polynomial. The encrypted data can then undergo certain limited operations (addition, multiplication) and be decrypted with the same keys. Hmmm. Yeah, I really don't know how it works, but I will take her word that it's a very exciting time in the mathematical/scientific world of cryptographic method development.
A Few Other Thoughts
Perhaps my favorite talk of the meeting was by Thomas Wu of Genentech. Thom writes software and, along with Jens Reeder and Matt Bauer, also presented a poster on HTSeqGenie, a platform for processing data used in-house at Genentech. Thom's talk was about the algorithms and data structures underlying his alignment software GSNAP. As a non-computer-scientist I was very happy that Thom walked us through some of the low-level, "in the weeds" details of how to make software performant. Much of Thom's talk dealt with how to store sequences as bytes in the memory of modern CPUs. If one knows how the instruction sets operate, one can use a data representation that takes advantage of them to reduce overall computation. Thom showed us one example where the bitstrings of a variant sequence were laid out "vertically" (bits stacked) instead of "horizontally" (bits adjacent), which allowed the CPU to calculate the difference as an XOR. In the vertical layout, the presence of a "1" automatically means that position is a variant, whereas the horizontal layout requires an extra step to get the position. Therefore by changing the layout in memory you can save a step - useful if you are doing millions or tens of millions of queries. It was a pleasure to see how a computer scientist works to identify potential data structures or algorithms to make his program efficient, even if it's unlikely that you will need to get that low-level yourself. Although others spoke a bit about making performant programs (Matt Massie, Mauricio Carniero), Thom was the only person, with the possible exception of Anthony Cox in discussing his BEETL library, who really went into the nitty-gritty details, and that was quite enjoyable - thanks Thom. He also mentioned the possibility of extending GSNAP to allow querying against the entire NR/NT database, and I want to look at it as an alternative to BLAST.
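To make the vertical-versus-horizontal idea concrete, here is my own toy sketch (not GSNAP code; the 2-bit encoding and function names are my invention) of DNA packing in both layouts. It shows why the bit-sliced "vertical" layout yields a per-position mismatch mask directly from XOR/OR, while the packed "horizontal" layout needs an extra fold over each 2-bit lane:

```python
# Toy illustration of "horizontal" vs "vertical" (bit-sliced) 2-bit DNA packing
# and how the vertical layout turns mismatch detection into plain XOR/OR.

CODE = {"A": 0b00, "C": 0b01, "G": 0b10, "T": 0b11}

def pack_horizontal(seq):
    """Pack bases as adjacent 2-bit fields in one integer."""
    word = 0
    for i, base in enumerate(seq):
        word |= CODE[base] << (2 * i)
    return word

def pack_vertical(seq):
    """Bit-sliced layout: one word of high bits, one word of low bits."""
    hi = lo = 0
    for i, base in enumerate(seq):
        bits = CODE[base]
        hi |= ((bits >> 1) & 1) << i
        lo |= (bits & 1) << i
    return hi, lo

def mismatches_horizontal(a, b, n):
    """Horizontal layout: XOR, then an extra step to inspect each 2-bit lane."""
    x = pack_horizontal(a) ^ pack_horizontal(b)
    return [i for i in range(n) if (x >> (2 * i)) & 0b11]

def mismatches_vertical(a, b, n):
    """Vertical layout: XOR each bit plane and OR them; bit i set == variant at i."""
    ahi, alo = pack_vertical(a)
    bhi, blo = pack_vertical(b)
    mask = (ahi ^ bhi) | (alo ^ blo)
    return [i for i in range(n) if (mask >> i) & 1]

ref, alt = "ACGTACGT", "ACGAACGC"
print(mismatches_horizontal(ref, alt, len(ref)))  # -> [3, 7]
print(mismatches_vertical(ref, alt, len(ref)))    # -> [3, 7]
```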
Machine Learning and Patients
Two interesting talks on the last morning were affiliated with Gunnar Ratsch's group at MSKCC. The first, by Kana Shimizu, described a homomorphic encryption system that lets a user securely query a SMILES chemical database. Molecules were represented in vector format based on a fingerprinting scheme, and the group had found a way to send the data vector (a vector of fingerprints) and perform the query (the Tanimoto distance between the query vector and each row in the database) while both were encrypted. They could therefore query a database and receive similar chemicals back in a totally secure way. This was a beautiful example of what the keynote from the night before had described, and it was made possible because the heart of the algorithm is addition. Gunnar gave the other talk, about the application of natural language processing to understand and organize patient records, with the idea of providing physicians with tools to help them track and understand the progression of cancer.
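The reason addition is enough becomes clearer if you write out the Tanimoto calculation on binary fingerprints: it is essentially just bit-counts. The sketch below is plain, unencrypted Python with made-up fingerprints, simply to show the arithmetic an additively homomorphic scheme would have to support:

```python
# Plain (unencrypted) Tanimoto similarity on binary fingerprints, to show that
# the core of the query is counting shared "on" bits. Fingerprints are made up.

def tanimoto(fp_a, fp_b):
    """Tanimoto similarity between two binary fingerprints (lists of 0/1)."""
    both = sum(a & b for a, b in zip(fp_a, fp_b))   # |A AND B|
    a_on = sum(fp_a)                                 # |A|
    b_on = sum(fp_b)                                 # |B|
    return both / (a_on + b_on - both)

query = [1, 0, 1, 1, 0, 0, 1, 0]
database = [
    [1, 0, 1, 0, 0, 0, 1, 0],
    [0, 1, 0, 1, 1, 0, 0, 1],
]
for row in database:
    print(tanimoto(query, row))   # -> 0.75, then ~0.143
```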
Wrap Up
This was my first CSHL meeting and I enjoyed it. I met a lot of interesting people, got to experience the legendary CSHL after-meeting bar, and made some professional contacts and some friends. The next #BioData will be in 2016 and it will be interesting to see what that brings. I would like to see some more algorithm material, as that was fascinating and eye-opening. I also liked the domain-specific applications of data science tools to particular biological questions, which gave me some good ideas for my own research, so I would like to see more of that as well. Kudos to Mike, Anne, and Matt for a great meeting; thanks.