Data Science Salon Miami, Conference, Formulated.by, N2value, data science, machine learning, miami, south florida, SoFla, SoFLo, SoBe,CIC

Data Science Salon: Miami

There is a growing data science, machine learning, and deep learning community in the South Florida area that I support.  I was invited by Data Science Salon to attend their Miami conference, and I was really pleased to do so.  The topics were diverse, from business intelligence to online ad buying to health tech.

The conference was hosted by Formulated.by and was held at Miami’s CIC, near University of Miami/Jackson Memorial Hospital.  It was a two-day conference – I attended only the second day.

Vendors participating in and hosting the conference were: Dataiku, Vertica, Plot.ly, Formulated.by, O’Reilly, Alteryx, and Domino Data Lab.

Here is the Thursday conference agenda:

I got through the traffic in Miami just in time to make the tail end of the Meditation exercise.  I’ll be honest – talking about data science gets me excited, so I really wasn’t in the mood to calm down.  Miami traffic also doesn’t make me calm down.  But it was fun, nonetheless.

Brian MacDonald of the Florida Panthers started off with an interesting presentation on how the organization solved the problem of how much to charge for seats at a game.  It ended up being a very traditional data science problem: exploring the data, discerning relationships in it, and then creating predictive models.  It turns out that demand for seats is related to the day of the week, the opposing team, home team performance, holidays (some, like Valentine’s Day, were highly negative), and how late in the season the game is played.  They used a regression model controlling for independent variables, and thereafter were able to predictively model sales, attendance, and even season ticket holder renewals.
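As a rough illustration (not the Panthers’ actual model – the features, effect sizes, and data here are all invented), the kind of demand regression described can be sketched in a few lines of NumPy:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500

# Hypothetical game-level features (all invented for illustration).
weekend = rng.integers(0, 2, n)        # game falls on a weekend
opponent = rng.random(n)               # opposing-team draw, scaled 0-1
holiday = rng.integers(0, 2, n)        # e.g. Valentine's Day
season_pct = rng.random(n)             # 0 = opener, 1 = season finale

# Simulated demand: weekends and strong opponents help, holidays hurt.
demand = (5000 + 800 * weekend + 1500 * opponent
          - 1200 * holiday + 900 * season_pct + rng.normal(0, 100, n))

# Ordinary least squares with an intercept, controlling for each factor.
X = np.column_stack([np.ones(n), weekend, opponent, holiday, season_pct])
coef, *_ = np.linalg.lstsq(X, demand, rcond=None)
print(coef.round(0))   # coefficients should land near the simulated effects
```

With the relationships estimated, the same design matrix can then be used to predict sales for an upcoming schedule, which is essentially the workflow described in the talk.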

Michael Conway from Bidtellect spoke on their self-service predictive analytics platform for online ad bidding, which runs on the Vertica service.   It was eye-opening (for me as a physician) that they participate in 15,000,000,000 (yes, that number is accurate) auctions daily for online ad placement.  He communicated that engagement rates are important, and that by measuring post-click consumer activity you can document the value of the ad.

Relationship Mapping by Carnival Data Science Team used in social selling

The data science team of Kevin U and Mark Fridson from Carnival Cruise Lines spoke next – this was a really excellent talk, first about the digital transformation of a traditional Fortune 500 company, and then some nuts and bolts.  Kevin hammered home the importance of a data-driven culture, which must flow from the highest levels of the organization to spur adoption and deal with “change management” (that exists in healthcare too, by the way).  One reality of being in South Florida is the skills gap – qualified data people are hard to come by.

Mark discussed the importance of multichannel engagement via snail mail, email, and social media, sharing insights closely tied to generational cohorts.  For each age group, Carnival has an “ideal customer” which they try to match as closely as possible.  Boomers respond best via snail mail (USPS), while Gen X and Millennials use email and social media.  For Generation Z, it’s all social media, but each platform serves a different purpose: Snapchat creates exposure, while Instagram represents captured moments; Facebook is for acquaintance updates and communication, and Twitter is most useful for interests and influencers.  I thought that breakdown was particularly useful for those in marketing.

Propensity Modeling by Carnival Data Science Team for Customer Lifetime Value

They use propensity modeling with Bayesian analysis to calculate CLV (Customer Lifetime Value).  Content personalization is performed with demographics, frequency, booking patterns, after-purchase add-ons, and even an element of serendipity (remember that piece on antifragility I did?  These guys get it).  They do use social relationship mapping and have been using some NLP text analysis, but feel it’s hard to apply NLP-based AI to social media.
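Carnival’s actual models aren’t public, but the Bayesian flavor of propensity modeling can be sketched with a toy beta-binomial update (the prior, the segment data, and the per-cruise value are all invented):

```python
# Toy beta-binomial propensity model (all numbers invented): start from
# a Beta prior on the probability that a past guest re-books, update it
# with observed behavior, then fold the posterior into a CLV estimate.
prior_a, prior_b = 2.0, 8.0            # weak prior: ~20% re-booking rate

bookings, offers = 30, 100             # one segment booked on 30 of 100 offers
post_a = prior_a + bookings
post_b = prior_b + (offers - bookings)

propensity = post_a / (post_a + post_b)         # posterior mean = 32/110
clv = propensity * 1200.0                       # hypothetical $1,200 per cruise
print(round(propensity, 3), round(clv, 2))      # 0.291 349.09
```

The appeal of the Bayesian framing is that sparse segments are pulled toward the prior rather than producing wild estimates, and the posterior updates naturally as new bookings arrive.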

Catalina Arango spoke next; her talk was non-technical, aimed at beginners and managers wanting to implement data science in their enterprises.  Since this was a refresher for me, I took the opportunity to speak with the Dataiku and Vertica folks.

Cancer Vaccine for Melanoma from Dana Farber using neoantigens

Next up was Alex Rubynstein from Mt. Sinai in NYC.  Mt. Sinai is one of the more proactive medical centers in the country regarding analytics and recognizing the value of data; I have seen them advertising for multiple positions to monetize their research.

Tumor Control with personalized vaccine
4/6 vaccine recipients were disease-free 25 months after vaccination, while the 2/6 with recurrent disease were subsequently treated and experienced complete cancer regression.

This was an interesting take on personalized medicine and genomics using big data for analysis.  Because of cancer’s lethality, more experimentation is possible, which has resulted in some novel therapies that approach cure, or at least transform cancer into a chronic condition.   The cancer vaccine approach treats the patient’s immune system either to enhance the immune response (to overcome immune suppression) or to increase the sensitivity of the immune system to the cancer (to overcome immune escape).  They take the patient’s gene sequence and the tumor gene sequence, filter the two against each other, and target on the order of 5-20 mutations, combining the vaccine with an adjuvant.  They use machine learning to rank the candidate targets, as the number of mutations exceeds the number of targets.  They are continuing to expand their sample size, which is extremely small, and because of the individualized nature of the therapy, very costly.  Nevertheless, early results are promising.  The primary limitation is the individualized, handcrafted nature of the vaccine.
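As a cartoon of that filtering step (the variant IDs, counts, and scores below are entirely made up), the germline-versus-tumor comparison is essentially a set difference followed by an ML-style ranking of the survivors:

```python
# Cartoon of neoantigen target selection (IDs and scores invented):
# subtract the patient's germline variants from the tumor's variants,
# then rank the tumor-specific remainder and keep a handful of targets.
germline = {f"var{i:02d}" for i in range(40)}            # baseline genome
tumor = germline | {f"mut{i:02d}" for i in range(60)}    # plus somatic hits

somatic = tumor - germline          # mutations unique to the tumor

# Stand-in for the machine-learning ranking of candidate targets;
# a real pipeline would score predicted immunogenicity per mutation.
score = {m: (int(m[3:]) * 37) % 100 for m in somatic}
targets = sorted(somatic, key=score.get, reverse=True)[:20]
print(len(somatic), "candidates ->", len(targets), "vaccine targets")
```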

Lunch followed – Subway sandwich boxes, which were fine.  Networking at a data science conference can be tough (stereotypes, anyone?), but I managed to find a few good folks to chat with.

A panel followed, composed of four speakers: Dr. Irma Fernandez, chief academic officer of St. Thomas University; Colleen Farrelly, data scientist at Kaplan; Mauro Damo, chief data scientist at Dell; and Anton Antonov, consultant at Accendo Data.  A broad range of topics was discussed.  The main points: publishing data can be damaging, so be aware of what you are putting out there; and we have narrow AI only at this time – no general AI (we know that)!  This was a good, in-the-field survey of current trends and issues.

Markov Chain Sparse Matrices

Athanassios Kintaskis, Sr. Machine Learning Engineer at Capital One, had an interesting presentation on MCL (Markov Clustering) on sparse graphs – this was a good technical talk, some of which went over my head.  As opposed to K-means clustering, which is sensitive to initialization and can’t tell you how many groups are present (you need to choose), this approach simulates random walks in a graph and uses flow dynamics to create the clusters.

Markov Chain Clustering flowchart

Markov chain transitions can be modeled as a matrix, and that’s about as far as I got before I was interrupted by a phone call.  This was an interesting and meaty talk, and I probably need to read up more on the topic before publicly displaying my ignorance.
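For the curious, the core of MCL is short enough to sketch in NumPy.  This is a simplified version of the standard algorithm (not the speaker’s code): alternate an expansion step (matrix power, simulating random-walk flow) with an inflation step (elementwise power, strengthening strong flows) until the matrix stops changing, and read the clusters off the attractor rows.

```python
import numpy as np

def mcl(adj, expansion=2, inflation=2.0, iters=50, tol=1e-6):
    """Minimal Markov Clustering: alternate expansion (matrix power)
    and inflation (elementwise power + renormalization) to convergence."""
    A = adj.astype(float) + np.eye(len(adj))     # self-loops aid convergence
    M = A / A.sum(axis=0)                        # column-stochastic matrix
    for _ in range(iters):
        prev = M
        M = np.linalg.matrix_power(M, expansion)  # expansion: flow spreads
        M = M ** inflation                        # inflation: rich get richer
        M = M / M.sum(axis=0)                     # re-normalize columns
        if np.abs(M - prev).max() < tol:
            break
    # Attractor rows (nonzero diagonal) define the clusters.
    clusters = []
    for i in np.flatnonzero(M.diagonal() > tol):
        members = frozenset(np.flatnonzero(M[i] > tol))
        if members not in clusters:
            clusters.append(members)
    return clusters

# Two triangles joined by a single weak edge (nodes 0-2 and 3-5).
adj = np.zeros((6, 6))
for a, b in [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]:
    adj[a, b] = adj[b, a] = 1
clusters = mcl(adj)
print(clusters)   # the bridge edge is pruned; the two triangles separate
```

No number of clusters is specified anywhere – the inflation parameter controls granularity, which is exactly the contrast with K-means drawn in the talk.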

Anabetsy Rivero of Metastatic AI gave a nice introductory presentation on convolutional networks in medical imaging (head over to my other blog, www.ai-imaging.org, or read my prior articles for more on that).  Anabetsy is a machine learning practitioner focusing on breast cancer diagnostics.

There were a few other presentations but this is a blog, not a manifesto!

All in all, I appreciated what Formulated.by did to bring this type of conference to Miami.  It is a necessary part of growing the Miami data science community, and I would love to see more events like Data Science Salon in the future.  A second Data Science Salon: Miami is slated for November 6-7, 2018.

FULL DISCLOSURE: Because of my involvement in the South Florida Data Science and Machine Learning community, I received complimentary entrance.


Are computers better than doctors? Will the computer see you now? What we learnt from the CheXNet paper for pneumonia diagnosis …

Author’s Note: This was a fun side-project for the American College of Radiology’s Residents and Fellows Section.  Judy Gichoya and I co-wrote the article.   The original article was posted by Judy to Medium and appeared on HackerNoon.  It was really an enlightening gathering of experts in the field.  There is a small, but hopefully growing number of radiologists who are also deep learning practitioners.

 

Written by Judy Gichoya & Stephen Borstelmann MD

 

In December 2017, we (radiologists in training, staff radiologists, and AI practitioners) discussed our role as knowledge experts in a world of AI, summarized here: https://becominghuman.ai/radiologists-as-knowledge-experts-in-a-world-of-artificial-intelligence-summary-of-radiology-ec63a7002329. For the month of January, we addressed the performance of deep learning algorithms for disease diagnosis, specifically focusing on the paper by the Stanford group — CheXNet: Radiologist-Level Pneumonia Detection on Chest X-Rays with Deep Learning. We continue to generate large interest in the journal club, with 347 people registered, 150 of whom signed in on January 24th, 2018 to participate in the discussion.

The paper has had 3 revisions and is available here: https://arxiv.org/abs/1711.05225. Like many deep learning papers that claim superhuman performance, the paper was widely circulated in the news media, in several blog posts, and on Reddit and Twitter.


Please note that findings of superhuman performance are increasingly being reported in medical AI papers. For example, one article declares that “Medical AI May Be Better at Spotting Eye Disease Than Real Doctors.”


To help critique the CheXNet paper, we constituted a panel composed of the author team (most of the authors listed on the paper were kind enough to be in attendance — thank you!); Dr. Luke Oakden-Rayner (blog) and Dr. Paras Lakhani (blog), who had critiqued the data used; and Jeremy Howard (past president and chief scientist of Kaggle, a data analytics competition site, ex-CEO of Enlitic, a healthcare imaging company, and current CEO of Fast.ai, a deep learning educational site), who provided insight into deep learning methodology.


In this blog we summarise the methodology of reviewing medical AI papers.

Radiology 101

The CheXNet paper compares the performance of AI versus 4 trained radiologists in diagnosing pneumonia. Pneumonia is a clinical diagnosis — a patient will present with fever and cough, and may get a chest X-ray (CXR) to identify complications of pneumonia. Patients will usually get blood cultures to supplement the diagnosis. Pneumonia on a CXR is not easily distinguishable from other findings that fill the alveolar spaces — specifically pus, blood, or fluid — or from collapsed lung, called atelectasis. The radiologists interpreting these studies can therefore use terms like infiltrates, consolidation, and atelectasis interchangeably.

Show me the data

The data used for this study is the ChestX-ray14 dataset, the largest publicly available imaging dataset, which consists of 112,120 frontal chest X-ray radiographs of 30,805 unique patients and expands the ChestX-ray8 dataset described by Wang et al. Each radiograph is labeled with one or more of 14 different pathology labels, or a ‘no finding’ label.

Labeling of the radiographs was performed using Natural Language Processing (NLP) by mining the text in the radiology reports. Individual case labels were not assigned by humans.

Critique: Labeling medical data remains a big challenge, especially because the radiology report is a tool for communicating with ordering doctors, not a description of the images. For example, an ICU film with a central line, tracheostomy tube, and chest tube may be reported as “stable lines and tubes” without a detailed description of every individual finding on the CXR. This can be misclassified by NLP as a study without findings. This image-report discordance occurs at a high rate in this dataset.

Moreover, reportable findings could be ignored by the NLP technique and/or labeling schema, either through error or because the pathology falls outside the 14 labels. The paper’s claims of 90%+ NLP mining accuracy do not appear to hold up (SMB, LOR, JH). One of the panelists, Luke, reviewed several hundred examples and found the NLP labeling about 50% accurate overall compared to the image, with the pneumonia labeling worse — 30-40%.

Jeremy Howard notes that the use of an old NLP tool contributes to the inaccuracy, with the preponderance of ‘No Findings’ cases in the dataset skewing the data — he doesn’t think the precision of the normal label in this dataset is likely better than random. Looking at the pneumonia label, it is only 60% accurate. Much of the discrepancy traces back to the core NLP method, which he characterized as “massively out of date and known to be inaccurate.” He feels a re-characterization of the labels with a more up-to-date NLP system is appropriate.
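To see why report mining goes wrong, consider a deliberately naive keyword labeler (this is NOT the tool the paper used — the rules, labels, and reports below are hypothetical). Both of the failure modes discussed above fall out immediately:

```python
import re

LABELS = ["pneumonia", "effusion", "atelectasis"]

def naive_label(report):
    """Hypothetical keyword labeler: flag a pathology if its name
    appears in a sentence, unless that sentence also contains a
    crude negation cue ("no"/"without")."""
    found = set()
    for sentence in report.lower().split("."):
        negated = re.search(r"\b(no|without)\b", sentence)
        for label in LABELS:
            if label in sentence and not negated:
                found.add(label)
    return found or {"no finding"}

# Terse ICU shorthand: real findings hide behind "stable lines and tubes",
# so the labeler emits 'no finding' for a film full of hardware.
print(naive_label("Stable lines and tubes. No focal consolidation."))

# Negation scope trips it up: "pneumonia is suspected" is discarded
# because "no" appears earlier in the same sentence.
print(naive_label("No effusion, but right basilar pneumonia is suspected."))
```

Both reports come back as “no finding,” which is exactly the kind of image-report discordance the panel described.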

Chest X-ray showing a tracheostomy tube, right internal jugular dialysis line, and diffuse infiltrates, likely pulmonary edema. The lines and tubes for an ICU patient are easily reported as “stable.”

The Stanford group tackled the labeling challenge by having 4 radiologists (one specializing in thoracic imaging and 3 non-thoracic radiologists) assign labels to a subset of the data used for evaluation, created through stratified random sampling for a minimum of 50 positive cases of each label, with a final N = 420.

Critique: ChestX-ray14 contains many patients with only one radiograph, but those who had multiple studies tended to have many. While the text-mined reports may match the clinical information, any mismatch between the assigned label and the radiographic appearance hurts the predictive power of the dataset.

Moreover, what do the labels actually mean? Dr. Oakden-Rayner questions whether they mean a radiologic pneumonia or a clinical pneumonia. In an immunocompromised patient, radiography of a pneumonia might be negative, largely because the patient cannot mount an immune response to the pathogen. This does not mean the clinical diagnosis of pneumonia is inaccurate; the imaging appearance and the clinical diagnosis simply would not match.

The closeness of four of the labels — pneumonia, consolidation, infiltration, and atelectasis — introduces a further level of complexity. Pneumonia is a subset of consolidation, and infiltration is a superset of consolidation. While the dataset labels these as 4 separate entities, to the radiologic practitioner they may not be separate at all. It is important to have experts look at the images when doing an image classification task.

See a great summary of the data problems in this blog post from Luke, one of the panelists.

Model

The CheXNet algorithm is a 121-layer deep 2D convolutional neural network: a DenseNet, after Huang & Liu. The DenseNet’s dense shortcut connections reuse features, reducing parameters and training time and allowing a deeper, more powerful model. The model accepts a two-dimensional image of size 224 by 224 pixels.

DenseNet

To improve trust in CheXNet’s output, a Class Activation Mapping (CAM) heatmap was utilized, after Zhou et al. This allows the human user to “see” which areas of the radiograph provide the strongest activation of the DenseNet for the highest-probability label.
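Mechanically, a CAM is just a classifier-weighted sum of the last convolutional feature maps. Here is a sketch with random stand-in activations (the shapes echo a DenseNet-121-scale network; the values are not CheXNet’s):

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in activations: 1024 feature maps at 7x7, plus one label's
# classifier weights; values are random, not CheXNet's.
features = rng.random((1024, 7, 7))
weights = rng.random(1024)

# CAM = classifier-weighted sum of the final feature maps, giving a
# coarse spatial map of where the evidence for that label comes from.
cam = np.tensordot(weights, features, axes=1)        # shape (7, 7)
cam = (cam - cam.min()) / (cam.max() - cam.min())    # normalize to [0, 1]
print(cam.shape)    # upsample to 224x224 to overlay on the radiograph
```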

Critique: Jeremy notes that preprocessing by resizing to 224×224 pixels and adding random horizontal flips is fairly standard, but leaves room for improvement, as effective data augmentation is one of the best ways to improve a model. Downsizing to 224×224 is a known issue — from both research and practical experience at Enlitic, larger images perform better in medical imaging (SMB: multiple top-5 winners of the 2017 RSNA Bone Age challenge used image sizes near 512×512). Mr. Howard feels there is no reason to keep ImageNet-trained models at this size any longer. Regarding the model choice, the DenseNet is adequate, but NASNets in the last 12 months have shown significant improvement (50%) over older models.

Pre-trained ImageNet weights were used, which is fine and a standard approach; but Jeremy felt it would be nice if we had a medical ImageNet for semi-supervised training of an AutoML encoder, or a Siamese network to cross-validate patients — leaving room for improvement. Consider that ImageNet consists of color images of dogs, cats, planes, and trains — and we are getting great results on X-rays? While better than nothing, any network pretrained on medical images in any modality would probably perform superiorly.

The Stanford team’s best idea was to train on multiple labels at the same time. It is best to build a single model that predicts multiple classes — counterintuitive, but it bears out in deep learning models, and it is likely responsible for their model yielding better results than prior studies. The more classes you train the model on properly, the better the results you can expect.
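Concretely, multi-label training means giving every pathology its own sigmoid output and summing binary cross-entropies, rather than forcing a single softmax choice. A NumPy sketch of that objective (not the authors’ code; the label indices are arbitrary):

```python
import numpy as np

def multilabel_bce(logits, targets, eps=1e-12):
    """Sum of per-label binary cross-entropies: each of the 14 pathology
    labels gets its own sigmoid "present?" output, so one model can say
    yes to several findings at once (softmax would force them to compete)."""
    p = 1.0 / (1.0 + np.exp(-logits))
    return -(targets * np.log(p + eps) + (1 - targets) * np.log(1 - p + eps)).sum()

# One chest film carrying two of the 14 labels at once.
targets = np.zeros(14)
targets[[3, 7]] = 1.0                   # say, infiltration + pneumonia

confident = np.full(14, -4.0)           # strong "absent" everywhere...
confident[[3, 7]] = 4.0                 # ...except the two true labels
uncertain = np.zeros(14)                # sigmoid(0) = 0.5 for every label

print(multilabel_bce(confident, targets) < multilabel_bce(uncertain, targets))  # True
```

Because the labels share one backbone, gradients from all 14 tasks shape the same features — the likely source of the “more classes, better results” effect.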

Results

F1 scores were used to evaluate both the CheXNet model and the Stanford radiologists.

Calculating the F1 score

Each radiologist’s F1 score was calculated by treating the other three radiologists as ground truth. CheXNet’s F1 score was calculated against all 4 radiologists. A bootstrap calculation was added to yield 95% confidence intervals.
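The F1-plus-bootstrap recipe is easy to reproduce. Below is a sketch with invented reader calls (the agreement rate and study count are stand-ins, not the paper’s data):

```python
import numpy as np

def f1_score(truth, pred):
    """F1 = harmonic mean of precision and recall on binary calls."""
    tp = int(((truth == 1) & (pred == 1)).sum())
    fp = int(((truth == 0) & (pred == 1)).sum())
    fn = int(((truth == 1) & (pred == 0)).sum())
    return 2 * tp / (2 * tp + fp + fn)

rng = np.random.default_rng(0)

# Invented calls on 420 studies (1 = pneumonia): a reader who agrees
# with the pooled "ground truth" about 80% of the time.
truth = rng.integers(0, 2, 420)
reader = np.where(rng.random(420) < 0.8, truth, 1 - truth)
point = f1_score(truth, reader)

# Nonparametric bootstrap: resample studies with replacement, recompute
# F1, and take the 2.5th/97.5th percentiles as a 95% confidence interval.
boot = [f1_score(truth[idx], reader[idx])
        for idx in (rng.integers(0, 420, 420) for _ in range(2000))]
lo, hi = np.percentile(boot, [2.5, 97.5])
print(f"F1 = {point:.3f}, 95% CI [{lo:.3f}, {hi:.3f}]")
```

The width of that interval on only 420 studies is worth noticing when comparing overlapping human and model scores.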

CheXNet’s results are as follows:

From the results, CheXNet outperforms the human radiologists. The varying F1 scores can be interpreted to imply that, for each study, the 4 radiologists do not entirely agree with each other on findings. However, there is an outlier (rad 4, with an F1 score of 0.442) — the thoracic-trained radiologist — who performs better than CheXNet.

Moreover, CheXNet shows state-of-the-art (SOTA) performance on all 14 pathologies compared to prior publications.

In my (JG) search, the Machine Intelligence Lab at the Institute of Computer Science & Technology, Peking University, directed by Prof. Yadong Mu, reports performance superior to the Stanford group’s. The code is open source and available here: https://github.com/arnoweng/CheXNet

Results from various implementations of CheXNet

Critique — Various studies assessing cognitive fit show that human performance can be degraded by a lack of clinical information or prior comparisons. Moreover, before the most recent version of the paper, human performance was unfairly scored against the machine.

Clinical significance

With the majority of labelled CXRs with pneumothorax also having chest tubes present, the question must be raised: are we training the DenseNet to recognize pneumothoraces, or chest tubes?

Peer review

Luke Oakden-Rayner MD, a radiologist in Australia with expertise in AI and deep learning who was on our panel, independently evaluated the ChestX-ray14 dataset and CheXNet. He praises the Stanford team for their openness and patience in discussing the paper’s methodology, and for their willingness to modify the paper to correct a methodologic flaw that biased the evaluation against the radiologists.

Summary

For the second AI journal club, we analysed the pipeline of AI papers in medicine. Make sure you are asking the right clinical question and not building algorithms for the sake of doing something. Thereafter, understand whether your data will help you answer that question, looking into the details of how it was collected and labeled.

To determine human level or super human performance, ensure the baseline metrics are adequate and not biased against one group.

Pipeline for AI in medicine

The model appears to perform at the level of expert humans, or better than less-trained practitioners. This is in line with published research and Enlitic’s experience; we should not be surprised, as research on convolutional neural networks has consistently reported near-human or superhuman performance.

Take Aways

  1. There exists a critical gap in the labeling of medical data.
  2. Do not forget the clinical significance of your results.
  3. Embrace peer review, especially in medicine and AI.

These were the best tweets regarding the problem of labeling medical data — i.e., do not get discouraged from attempting deep learning for medicine.


The journal club was a success, so if you are a doctor or an AI scientist, join us at https://tribe.radai.club to continue the conversations on AI and medicine. You can listen to the recording of this journal club here: https://youtu.be/xoUpKjxbeC0. Our next guest, on 22nd February 2018, is Timnit Gebru, who worked on US demographic household prediction using Google Street View images. She will be talking on using deep learning and Google Street View to estimate the demographic makeup of neighborhoods across the United States (http://www.pnas.org/content/114/50/13108).

Coming soon

For the journal club we developed a human-versus-AI competition for interpreting the CXRs in the dataset, hosted at https://radai.club. We will be publishing the outcome of our crowdsourced labels soon, with a detailed analysis to check whether model performance improves.

Say thanks

I would like to thank the panelists, including Jeremy Howard, Paras Lakhani, Luke Oakden-Rayner, and the Stanford ML team. Thanks also to the ACR RFS AI advisory council members, including Kevin Seals.

Article corrections made

  1. This article referred to Jeremy Howard as Ex-CEO of Kaggle — updated to “president and chief scientist of Kaggle.”
  2. The article stated that NLP performance on the dataset was not likely improved over random. Jeremy clarified that it was the precision of the normal finding that was not likely improved over random.