Biodata 18

These are my notes from the 2018 CSHL Biological Data Science Meeting.

This was an enjoyable, genomics-heavy meeting exploring the application of data science techniques to biological problems. The conference was filled with high-caliber science talks/posters (abstracts here) taking place over four days at CSHL. I was only able to attend Thursday/Friday and therefore missed the sessions on Algorithms (W) and Imaging (S), which by all accounts were quite strong. Even so, I got my fill of new ideas and exposure to applications in a wide variety of fields. One issue that came up a lot was the recent Illumina acquisition of PacBio. Despite having people from ILMN and PacBio present, no one seemed to know what that deal would ultimately mean for the business and science of genomics. However, there were several talks/posters that demonstrated the superiority of long reads for identifying entire classes of structural variations, so it may signify the beginning of a broader incorporation of long reads into our genetic standard of care.

On a personal note, it was a pleasure to make the acquaintance of several scientific/Twitter luminaries. In particular, I enjoyed meeting Susan Holmes of phyloseq / dada2 fame, whose new book Modern Statistics for Modern Biology looks to be fantastic. It was also great to discuss PacBio reads with the incomparable Jason Chin, whose pioneering work on denoising and assembling long reads was elegantly recapped in a recent blog post, and to learn a bit about Ben Langmead’s thoughts on the future of efficiency-at-scale for the next generation of genomics tools.

Scientific Highlights

Anne Carpenter’s Keynote

Anne Carpenter’s talk centered on the use of CellProfiler for classifying cells. Although the initial work in image classification was supervised by users, newer versions support unsupervised image classification. Anne showed that with a few inexpensive dyes, a series of “painting” assays could be used to capture major features of cellular morphology, allowing high-throughput image analysis. These painting assays could be used to classify images from mutant cell lines and, amazingly, the genotype could be correctly inferred: mutations along the same signaling pathways perturb the cells in visually similar ways. It seems there is a predictable cellular phenotype for perturbations to the RAS and Hippo pathways. Might this hold across a much wider class of gene knockdowns, so that genotype could be rapidly inferred by classification using a large pretrained model?

Although the examples shown were genetic mutations that dramatically affect morphology, the scalability and simplicity of the assay let her team screen cells against large chemical libraries. Image classification can predict which pathways the compounds are involved in. Anne stated that her primary interest is in drug discovery, but it’s clear that this tool can be used for many screening strategies, including target identification, compound prioritization, or the early discovery of toxic properties.

Arjun Raj’s Talk

Arjun Raj’s rhetorical hook was to ask whether or not cells have “free will”. Although I was initially dubious due to this anthropomorphic flourish, I quickly warmed to the talk when Arjun showed a plate of melanoma cells being treated with a drug, highlighting the subset of the cells that were resistant. He was able to track down the cells, show there are no genome-level differences between dead cells and resistant cells, and further demonstrate that resistance was due to stochastic gene expression. This cell-to-cell variability that conferred different “behavior” in the presence of the drug was his “free will”. How might you find other factors that, when over-expressed, contribute to the resistance phenotype? He devised a series of clever experiments to passage cells over time and use RNA-seq to identify genes that are variably expressed between single cells. He could show that many of these factors were among his previous favorite candidates for conferring resistance. This was beautiful work using new techniques to reboot and extend earlier concepts. Further, the implications for identifying factors that could severely curtail the long-term effectiveness of chemotherapeutics open the door to adjuvants that would minimize the occurrence of resistance.

Rachel Sherman’s Talk

The current standard of care for clinical genomics is to identify SNPs by mapping short read sequences to a reference genome and looking for differences. While effective for identifying what short reads can see, Rachel Sherman demonstrated there are very large classes of structural variants that are completely invisible to short read analysis. As a student in the Schatz and Salzberg labs, Rachel looked for mutations in a series of breast cancer cell lines with ILMN, PacBio and ONT reads. There were a staggeringly large number (>20k) of variants that were seen by PacBio and not by ILMN. This is tremendous! And it speaks to the amazing revolution in genome biology offered by long read technology. Although she did not dwell on it extensively, the ONT data did not seem to stack up well - with a large number of false positive calls most likely due to small indels on the reads.

Steven Salzberg’s Keynote

Steven Salzberg’s talk was an enjoyable recap of “adventures and misadventures” in genomics. His talk, broken into four parts, recounted tales from his time at TIGR, as a member of the Human Genome Consortium, and as a long-time quality-watcher of sequences in public databases. His talk had two major themes. First, a careful use of the public databases can lead to interesting findings. For example, he spoke of the discovery of several new bacterial species found in the deposited genomes/draft-genomes of other organisms, pointing out that most people deposit reads in the SRA and then never look at them again. It is a potential gold mine if you have a good question and know how to search. Second, while the public databases may be full of goodies, they may also contain misleading or downright false data. He spent quite a bit of time debunking and re-debunking claims of horizontal gene transfer (HGT) into humans where the evidence for these genes turned out to be artifactual proteins in GenBank. He also pointed out that because the human reference genome is incomplete, there are potentially very many human contaminants in other draft genomes. An interesting tidbit that I picked up in the talk was a bona fide example of HGT from Agrobacterium into domesticated sweet potato.

Steve’s talk was charming and easygoing. At the end of the talk he exhorted the crowd not just to build tools but also to use them. There is no one better at using your tool than you are - why let the biologists have all the fun?

Major Trends

Single Cell sequencing has arrived. It is a fertile area for methods development and is being applied to many areas of biology. The volume of data is driving innovation in projection/clustering methods, in visualization and in data-processing/interactivity tools.
Imaging. I missed the Imaging (Saturday morning) session but it is clear that deep learning is incredibly well suited to image analysis and that the same sort of feature recognition patterns that drive consumer applications are being applied to research questions across the board. I expect this to be a particularly fruitful marriage.
Reproducibility of ML models. It seems that building, applying, and distributing models for ML applications has been a pain point. There were a number of workflow suggestions aimed at the reproducibility and extensibility not only of the raw tools themselves but also of the intermediate representations necessary for reuse.
Reproducibility more broadly. There was a major emphasis on reproducibility and I believe that just about every speaker had a GitHub repo (or three) for code (full list here) and sang the praises of Bioconda/Bioconductor/Docker. Declarative workflows are here to stay. This represents an excellent trend that is now also making its way to the cloud for CI and for reproducible distribution of, e.g., ML models.
Everybody is doing data science (or wants to). Although the conference was academic-centric, there was a reasonable showing of industry personnel from a wide range of sectors, from companies that develop and deploy machine learning at scale (Google, Amazon) to a larger number of mid-sized to large companies whose personnel were picking up ideas on how to apply some of these techniques to their own problem domains (Pepsi, DSM Biotech, DNAnexus).
NCBI is trying to make cloud computing first class. Multiple NCBI scientists/engineers presented posters on efforts to facilitate the use of NCBI tools on the cloud. One was about the BLAST AWS AMI; the other was about packaging up the PGAP program with docker/CWL. Repo here
Not nearly as much k-mer stuff as 2014 (it seems). There were some updates to the Kraken program and some talks/posters on the use of minhashes for distance calculation, but this was a relatively small portion of the total, whereas my impression from 2014 was that k-mer-based approaches were ascendant. Peak k-mer? Or perhaps just mature.
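
For readers who have not run into minhashes, here is a minimal sketch of how a MinHash "bottom-s" sketch can estimate the Jaccard similarity between the k-mer sets of two sequences. This is my own illustration of the general idea, not the implementation of any tool presented at the meeting; the function names, the choice of hash, and the default `k` and `size` values are all assumptions made for the example.

```python
import hashlib


def kmers(seq, k):
    """Return the set of all k-mers (substrings of length k) in seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def hash_kmer(kmer):
    """Map a k-mer to a 64-bit integer (hash choice is illustrative only)."""
    digest = hashlib.blake2b(kmer.encode(), digest_size=8).digest()
    return int.from_bytes(digest, "big")


def sketch(seq, k=5, size=64):
    """Bottom-s sketch: the `size` smallest distinct k-mer hashes of seq."""
    return sorted(hash_kmer(km) for km in kmers(seq, k))[:size]


def jaccard_estimate(sketch_a, sketch_b, size=64):
    """Estimate Jaccard similarity from two bottom-s sketches.

    Take the `size` smallest hashes of the merged sketches and count
    how many of them appear in both inputs.
    """
    merged = sorted(set(sketch_a) | set(sketch_b))[:size]
    if not merged:
        return 0.0  # no k-mers at all; nothing to compare
    shared = set(sketch_a) & set(sketch_b)
    return sum(1 for h in merged if h in shared) / len(merged)


if __name__ == "__main__":
    a = sketch("ACGTACGTAC", k=4)
    b = sketch("ACGTTTTT", k=4)
    print(jaccard_estimate(a, b))
```

When both sequences have fewer distinct k-mers than the sketch size, as in this toy usage, the estimate equals the exact Jaccard similarity; on real genomes the sketch is far smaller than the k-mer set and the value is an approximation whose error shrinks as `size` grows.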

A Few Thoughts on Training

The panel on the future of biological data science was chaired by Adam Phillippy and featured a question about how the panelists train their students and what the implications are for NIH grant efforts. Is there such a thing as data science, and should it be taught/trained independently? While there was not exactly a consensus, I was partial to the arguments of Melissa Gymrek, who more or less dismissed the idea of data science as an independent subject of study. While “data science”, per se, may not be a field in and of itself, it is clear that computation and statistics are transforming just about every conceivable field.

A more compelling take-home for technical training was, however, clearly on display at this meeting: the exceptional power of good mentorship. At this meeting there was a significant presence from those under the Salzberg academic umbrella. Among the first-generation Salzberg academic descendants were Mike Schatz, Adam Phillippy, and Ben Langmead. There were also many excellent students and postdocs of each of those people as well as of Steven Salzberg himself. I expressed to a few people at dinner my admiration of the mutually reinforcing respect amongst the members of the group and of the high quality of science coming out across the board. It would be hard to overstate the importance of these lineages, and I don’t mean that in the pejorative sense of having allies on the right committees (also important). Rather, it seems that the best thing for a person interested in research science is to fall within the umbra of a powerful research talent. However, we cannot all train with the Steven Salzbergs of the world - it is not a big enough pipeline to fill all the growing roles & niches popping up in the data world. So there will need to be other approaches as well. But from my perspective, expanding the support of classic research-based training with strong, proven scientists seems to be the best idea for the long-term success of any research endeavor - computational or otherwise.