AWS HCLS Conference 2019

Over the past few days I attended the AWS Healthcare & Life Sciences Cloud Symposium, a gathering of companies in the healthcare field (broadly defined) that use AWS to run their business. According to one of the organizers, Elliot Menschik, the symposium (which doesn’t seem to have an online presence) was organized at the behest of people in his sphere who are curious about how others in their field are deploying services to the cloud. “Healthcare” and “life science” businesses are in the same economic sector; however, the cultures and core business forces of each are quite different. Many of the healthcare companies, large or small, are conventional businesses, and the products they are building aim to improve core, existing business processes such as clinical trial enrollment, medical billing, or hospital document management. The life science companies were mostly venture-backed startups, heavily invested in research and much more in flux with regard to their business processes. The innovation on the healthcare side seems to be introducing new technology to revolutionize existing businesses, while on the life science side it is attempting to create a business or product that does not yet exist. We’ll see from feedback whether AWS thinks keeping these groups together in the future is a good idea. Personally, I felt there was much to learn from both sides, although I exclusively attended the life science track on Day 2.

Biotech-Specific Take-homes

As someone charged with establishing and growing the informatics capability at a small startup, I was especially keen to learn how others approached the same problems we are trying to solve. How did they handle technical issues like data storage, cloud compute, and AI/ML? What do their operations look like: how big are the teams, and how are the roles split? Here are a few general observations grouped by topic:

DataStorage: Use/leverage serverless where you can. Many presenters are using CSV/JSON/Parquet on S3 as their datastore. It is cheap and queryable using Athena/Presto/Drill/Snowflake. We have certainly found this to be a useful strategy.
DataStorage: S3 is great, but there are many use cases where you want a file or a file system. This point was made by David Ficenec of Decibel and reiterated by David Johnson of Moderna. One suggestion is File Gateway, a new AWS product that presents the S3 filestore as a filesystem and acts as a local machine which understands file-sharing protocols (e.g. SMB). This is definitely a product/approach to watch, as it simplifies/obviates a whole slew of file-related issues. However, several people commented that earlier versions of this tech were buggy/faulty, so it’s not yet clear how robust it will be.
DataStorage: The cost-benefit of regular/automated data capture in a structured way was a common theme among many presenters and represents one of the unique challenges for early-stage life science companies, whose data formats, assays, and possibly business directions are often in flux. Two case studies were presented: the monolithic approach represented by Benchling and the microservice approach described by David Johnson of Moderna. Benchling’s value proposition is to create a centralized and integrated source of data that can be used/manipulated without programming experience or technology expertise. The Benchling team presented a case study in which they had iterated through 27 versions of a protocol with a client, the final version of which was highly reproducible/scalable. The take-home message of this study was supposed to be that, once established, a centralized LIMS/ELN can scale. However, the case study also exposed, possibly inadvertently, the difficulty of standardizing protocols in early-stage life sciences. Somewhat in contrast, Moderna presented their creation of a microservice that automatically verifies RNA sequences using Sanger-sequencing data. In this case, data from a third-party provider is uploaded to S3 and triggers an EC2 job that scores/validates the RNA and sends the result to a DB/app which is accessible to the scientific team. While these two approaches were not meant to be juxtaposed, to me they seemed to highlight two successful but competing philosophies on how data should be stored/managed.
Compute: Use Batch. Everyone is doing it. Dockerize your software and scale your analyses using Batch. Basically everyone uses Batch. You should too. It’s cost-effective and scalable. However, there is a learning curve - I remember being confused when we started by the many things that need setting up: Job Definitions, Job Queues, Compute Environments. But once you figure it out, it’s great.
Compute: There is no consensus on launching Batch jobs. Perhaps this is too problem-specific. Some people are using Step Functions, others launch Nextflow locally, and others launch jobs from SQS queues/Lambda functions. This is an area where I think there could be some major improvements/simplifications that would facilitate use/adoption.
Compute: Migrating to the cloud can be tough. For larger organizations the transition costs can be mitigated by encapsulating software components in Docker and splitting monolithic applications into microservices, which can be migrated piecemeal. Database migration was also an issue, but there wasn’t much info on particular cases here. A migration of the large computational infrastructure of Ginkgo seems to be underway, led by the talented and excellently named Florencio Mazzoldi.
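To make the "S3 upload triggers a job" pattern a bit more concrete, here is a minimal sketch of one of the launch styles mentioned above: a Lambda function that receives an S3 object-created notification and submits one Batch job per uploaded file. This is my own illustration, not anything a presenter showed. The queue name, job definition name, and the INPUT_S3_URI environment variable are hypothetical placeholders, and I am assuming the container reads its input location from that variable.

```python
from urllib.parse import unquote_plus

# Hypothetical names: replace with your own Batch job queue and job definition.
JOB_QUEUE = "analysis-queue"
JOB_DEFINITION = "analysis-job:1"


def build_submit_job_request(bucket, key):
    """Build the keyword arguments for a Batch SubmitJob call for one S3 object."""
    # Batch job names may only contain letters, numbers, hyphens, and
    # underscores (up to 128 characters), so replace everything else.
    safe = "".join(
        c if (c.isascii() and c.isalnum()) or c in "-_" else "-" for c in key
    )
    return {
        "jobName": ("process-" + safe)[:128],
        "jobQueue": JOB_QUEUE,
        "jobDefinition": JOB_DEFINITION,
        "containerOverrides": {
            "environment": [
                # Assumed convention: the container reads its input from this variable.
                {"name": "INPUT_S3_URI", "value": f"s3://{bucket}/{key}"},
            ]
        },
    }


def requests_from_s3_event(event):
    """Turn an S3 event notification into a list of SubmitJob requests."""
    requests = []
    for record in event.get("Records", []):
        bucket = record["s3"]["bucket"]["name"]
        # Object keys arrive URL-encoded in S3 event notifications.
        key = unquote_plus(record["s3"]["object"]["key"])
        requests.append(build_submit_job_request(bucket, key))
    return requests


def lambda_handler(event, context):
    """Lambda entry point: submit one Batch job per uploaded object."""
    import boto3  # imported here so the helpers above can be tested without AWS

    batch = boto3.client("batch")
    job_ids = []
    for request in requests_from_s3_event(event):
        response = batch.submit_job(**request)
        job_ids.append(response["jobId"])
    return {"submitted": job_ids}
```

Keeping the request-building separate from the actual `submit_job` call means the event parsing can be checked locally before anything is wired up to a real queue.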
ML/AI: Roughly half of the Biotechs that presented used ML/AI in their analysis pipelines; I will highlight three use cases that I thought were exemplary:
1. Gritstone. In the most impressive talk of the two days, Roman Yelensky presented the Gritstone platform for defining neoantigens. The entire idea is impressive and hot: develop personalized cancer therapeutics by training a patient’s own T-cells to target neoantigens on the tumor’s cell surface. They sequence the tumor (and a non-tumor match) to identify neoantigens. Of all the potential neoantigens, Roman and his team must determine which will most effectively be presented as antigens via the MHC I/II systems, and they have created an ML model to score the neoantigens, the top 20 of which are administered via viral nanoparticle. It’s a beautiful scientific process and the most mature demonstration of ML/AI models in production, where the ML/AI model is required for the fast and reliable prediction of neoantigens. Really quite impressive.
2. XtalPi. Virginia Burger of XtalPi described the cloud workflow that backs their Crystal Structure Prediction service. Unbeknownst to me, the packing of small molecules in the microcrystals of pills can potentially have a strong effect on bioavailability. Virginia mentioned the case of ritonavir, an antiretroviral drug that worked splendidly for two years until a lower-energy crystal configuration appeared, at which point all the crystals in the entire supply chain slowly adopted this new, and completely insoluble, form of the drug. Although a workaround (syrup form) was found, crystallization of active ingredients is now of immense importance during the formulation phase of drug development. Virginia and her team use AWS to model all possible crystal configurations to determine whether there are putative crystal forms that have yet to be discovered. They have a sophisticated, broad, highly cloud-embracing computational pipeline for building, scoring, and testing their models. It is the sort of large-scale computational chemistry that can only be done with the sorts of scalable resources that few cloud providers can accommodate.
3. CRISPR Therapeutics. Andrew Kernystsky presented on the use of machine learning models to predict off-target cuts for particular guide RNAs. In the end his finding was relatively simple - one is likely to be most successful with gRNA sequences that are most distinct from any other sequence in the genome (those with the fewest sites at 1, 2, and 3 mismatches are best) - but he used a fairly extensive suite of test models to make this assessment.
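The "fewest sites with 1, 2, and 3 mismatches" heuristic from the last talk is easy to sketch. What follows is my own toy illustration and not the method that was presented: it assumes a brute-force scan of the forward strand only, ignores PAM requirements and the reverse complement, and uses hypothetical function names. A real pipeline would use an indexed search over the whole genome.

```python
def hamming(a, b):
    """Number of positions at which two equal-length sequences differ."""
    return sum(x != y for x, y in zip(a, b))


def mismatch_profile(guide, genome, max_mismatches=3):
    """Count genome windows matching the guide with 0..max_mismatches mismatches."""
    guide = guide.upper()
    genome = genome.upper()
    counts = {d: 0 for d in range(max_mismatches + 1)}
    # Slide a guide-length window along the (forward-strand) sequence.
    for start in range(len(genome) - len(guide) + 1):
        d = hamming(guide, genome[start:start + len(guide)])
        if d <= max_mismatches:
            counts[d] += 1
    return counts


def rank_guides(guides, genome, max_mismatches=3):
    """Order guides so that those with the fewest near-matches come first."""
    def sort_key(guide):
        profile = mismatch_profile(guide, genome, max_mismatches)
        # Compare exact-match counts first (extra exact copies are the worst
        # case), then the 1-, 2-, and 3-mismatch counts in that order.
        return tuple(profile[d] for d in range(max_mismatches + 1))

    return sorted(guides, key=sort_key)
```

For example, `rank_guides(candidates, genome_sequence)` returns the candidate guides with the most distinct sequences first. Note that the on-target site itself shows up as one of the zero-mismatch hits.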

General Take-Homes for ML in Healthtech

Based on the keynote and healthtech sessions from day 1:

Every company in the world wants to do AI. Apparently there are thousands of AI startups out there, each one striving for a niche. The barriers to entry seem to be the availability of copious, good-quality data; institutional resistance; and a talent shortage.
AWS is attempting to make AI/ML applications as turnkey as possible, both through complete AI-as-a-service offerings and through intermediate tools that facilitate deploying ML/AI models on the cloud with minimal engineering requirements, i.e. SageMaker. Note to self: check out SageMaker.
Most of the “low-hanging fruit” is in the rather boring nitty-gritty internals of things like medical billing, form aggregation, etc. When asked what he was most excited about for ML/AI in 2019, Manu Tandon, the CIO of Beth Israel Deaconess Medical Center, lamented issues within the hospital that prevent patients from being discharged simply because forms can’t be found or some trivial formality has not been processed - in short, the most pressing problems he hopes AI/ML can reduce/eliminate lie in internal business processes. These sorts of high-friction problems can cripple the hospital when it approaches full capacity.
The “big players” are envisioning and attempting to create a new healthcare world around IoT and AI/ML: devices that track and monitor patients would be used to predict and avoid disease and, indeed, possibly to eliminate the hospital entirely. However, those are the grand dreams. The current practical concerns are integrating insurance forms and making medical records accessible.

In an amusing side note, it seems that the AI/ML world is already in arms-race territory and that the winners are those data providers with high-quality datasets. One presenter spoke of using AI/ML to improve their marketing and sales, and another conference-goer confided that his company was attempting to do the same thing... with the same dataset! I can envision both competing sales teams arriving at the same doctor’s office on the same day due to the same AI-driven recommendation.