
Are computers better than doctors? Will the computer see you now? What we learnt from the CheXNet paper for pneumonia diagnosis…

Author’s Note: This was a fun side-project for the American College of Radiology’s Residents and Fellows Section. Judy Gichoya and I co-wrote the article. The original article was posted by Judy to Medium and appeared on HackerNoon. It was a really enlightening gathering of experts in the field. There is a small, but hopefully growing, number of radiologists who are also deep learning practitioners.

 

Written by Judy Gichoya & Stephen Borstelmann MD

 

In December 2017, we (radiologists in training, staff radiologists, and AI practitioners) discussed our role as knowledge experts in a world of AI, summarized here: https://becominghuman.ai/radiologists-as-knowledge-experts-in-a-world-of-artificial-intelligence-summary-of-radiology-ec63a7002329. For the month of January, we addressed the performance of deep learning algorithms for disease diagnosis, specifically focusing on the paper by the Stanford group — CheXNet: Radiologist-Level Pneumonia Detection on Chest X-Rays with Deep Learning. The journal club continues to generate large interest, with 347 people registered, 150 of whom joined on January 24th, 2018 to participate in the discussion.

The paper has had 3 revisions and is available here: https://arxiv.org/abs/1711.05225. Like many deep learning papers that claim superhuman performance, the paper was widely circulated in the news media, in several blog posts, and on Reddit and Twitter.


Please note that findings of superhuman performance are increasingly being reported in medical AI papers. For example, this article notes that “Medical AI May Be Better at Spotting Eye Disease Than Real Doctors.”


To help critique the CheXNet paper, we constituted a panel composed of the author team (most of the authors listed on the paper were kind enough to be in attendance — thank you!), Dr. Luke Oakden-Rayner (blog) and Dr. Paras Lakhani (blog), who had critiqued the data used, and Jeremy Howard (past president and chief scientist of Kaggle, a data analytics competition site, ex-CEO of Enlitic, a healthcare imaging company, and current CEO of fast.ai, a deep learning educational site) to provide insight into deep learning methodology.


In this blog post we summarise a methodology for reviewing medical AI papers.

Radiology 101

The CheXNet paper reviews the performance of AI versus 4 trained radiologists in diagnosing pneumonia. Pneumonia is a clinical diagnosis — a patient will present with fever and cough, and may get a chest X-ray (CXR) to identify complications of pneumonia. Patients will usually get blood cultures to supplement the diagnosis. Pneumonia on a CXR is not easily distinguishable from other findings that fill the alveolar spaces — specifically pus, blood, or fluid — or from collapsed lung, called atelectasis. The radiologists interpreting these studies may therefore use terms like infiltrate, consolidation, and atelectasis interchangeably.

Show me the data

The data used for this study is the ChestX-ray14 dataset, the largest publicly available imaging dataset, which consists of 112,120 frontal chest X-ray radiographs of 30,805 unique patients and expands the ChestX-ray8 dataset described by Wang et al. Each radiograph is labeled with one or more of 14 different pathology labels, or a ‘No Finding’ label.

Labeling of the radiographs was performed using Natural Language Processing (NLP) by mining the text in the radiology reports. Individual case labels were not assigned by humans.

Critique: Labeling medical data remains a big challenge, especially because the radiology report is a tool for communicating with ordering doctors, not a description of the images. For example, an ICU film with a central line, tracheostomy tube, and chest tube may be reported as “stable lines and tubes” without a detailed description of every individual finding on the CXR. Such a study can be misclassified by NLP as a study without findings. This image-report discordance occurs at a high rate in this dataset.
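As a toy illustration of how this happens (our sketch, deliberately naive, and not the dedicated medical NLP pipeline actually used to build the dataset), consider a keyword-based label miner:

```python
# Toy keyword-based label miner: a deliberately naive stand-in for the
# medical NLP tools actually used to build ChestX-ray14.
FINDING_KEYWORDS = {
    "pneumonia": ["pneumonia"],
    "effusion": ["effusion"],
    "pneumothorax": ["pneumothorax"],
}

def mine_labels(report_text):
    text = report_text.lower()
    return [label for label, words in FINDING_KEYWORDS.items()
            if any(w in text for w in words)]

# An ICU report written for the clinician, not as a description of the image:
print(mine_labels("Stable lines and tubes. No acute change."))
# -> []  i.e. the study is mined as "No Finding" despite visible hardware
```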

Moreover, reportable findings could be missed by the NLP technique and/or labeling schema, either through error or because the pathology falls outside the 14 labels. The paper’s claim of 90%+ NLP mining accuracy does not appear to hold up (SMB, LOR, JH). One of the panelists, Luke Oakden-Rayner, reviewed several hundred examples and found the NLP labeling about 50% accurate overall compared to the image, with the pneumonia labeling worse — 30–40%.

Jeremy Howard notes that the use of an old NLP tool contributes to the inaccuracy, with the preponderance of ‘No Finding’ cases in the dataset skewing the data — he does not think the precision of the ‘No Finding’ label in this dataset is likely better than random. Looking at the pneumonia label, it is only 60% accurate. A lot of the discrepancy can be traced back to the core NLP method, which he characterized as “massively out of date and known to be inaccurate”. He feels a re-characterization of the labels with a more up-to-date NLP system is appropriate.

Chest X-ray showing a tracheostomy tube, a right internal jugular dialysis line, and diffuse infiltrates, likely pulmonary edema. The lines and tubes for an ICU patient are easily reported as “stable”.

The Stanford group tackled the labeling challenge by having 4 radiologists (one specializing in thoracic imaging and 3 non-thoracic radiologists) assign labels to an evaluation subset of the data, created through stratified random sampling to guarantee a minimum of 50 positive cases of each label, for a final N = 420.

Critique: ChestX-ray14 contains many patients with only one radiograph, but those who had multiple studies tended to have many. While the text-mined reports may match the clinical information, any mismatch between the assigned label and the radiographic appearance hurts the predictive power of the dataset.

Moreover, what do the labels actually mean? Dr. Oakden-Rayner questions whether they mean a radiologic pneumonia or a clinical pneumonia. In an immunocompromised patient, radiography of a pneumonia might be negative, largely because the patient cannot mount an immune response to the pathogen. This does not mean the clinical diagnosis of pneumonia is inaccurate. The imaging appearance and the clinical appearance/diagnosis therefore would not match.

The closeness of four of the labels (Pneumonia, Consolidation, Infiltration, and Atelectasis) introduces another level of complexity. Pneumonia is a subset of consolidation, and infiltration is a superset of consolidation. While the dataset treats these as 4 separate entities, to the radiologic practitioner they may not be separate at all. It is important to have experts look at the images when doing an image classification task.

See a great summary of the data problems in this blog post from Luke Oakden-Rayner, one of the panelists, here.

Model

The CheXNet algorithm is a 121-layer deep 2D Convolutional Neural Network: a DenseNet, after Huang & Liu. The DenseNet’s many dense skip connections reduce parameters and training time, allowing a deeper, more powerful model. The model accepts a two-dimensional image resized to 224 by 224 pixels.

The DenseNet architecture
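As a concrete sketch of the architecture choice (our illustration in PyTorch, not the authors’ released code), DenseNet-121 can be instantiated with a 14-logit head like so:

```python
import torch.nn as nn
import torchvision

# DenseNet-121 pretrained on ImageNet, with the 1000-way ImageNet
# classifier swapped for a 14-way head (one logit per pathology label).
model = torchvision.models.densenet121(pretrained=True)
model.classifier = nn.Linear(model.classifier.in_features, 14)
```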

To improve trust in CheXNet’s output, a Class Activation Mapping (CAM) heatmap was utilized, after Zhou et al. This allows the human user to “see” which areas of the radiograph provide the strongest activation of the DenseNet for the highest-probability label.
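A minimal sketch of the CAM computation, assuming the DenseNet-121 above (CAM applies directly here because torchvision’s DenseNet ends in global average pooling followed by a single linear layer):

```python
import torch
import torch.nn.functional as F

def class_activation_map(model, x, class_idx):
    # x: (1, 3, 224, 224) preprocessed image; model: the DenseNet-121 above
    feats = F.relu(model.features(x))            # (1, 1024, 7, 7) feature maps
    w = model.classifier.weight[class_idx]       # (1024,) weights for this label
    cam = torch.einsum('c,bchw->bhw', w, feats)  # weighted sum over channels
    cam = F.interpolate(cam.unsqueeze(1), size=x.shape[-2:],
                        mode='bilinear', align_corners=False)
    cam = (cam - cam.min()) / (cam.max() - cam.min() + 1e-8)  # normalize to [0, 1]
    return cam                                   # (1, 1, 224, 224) heatmap
```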

Critique: Jeremy notes that image preprocessing of resizing to 224×224 pixels and adding random horizontal flips is fairly standard, but leaves room for improvement, as effective data augmentation is one of the best ways to improve a model. Downsizing images to 224×224 is a known issue — both research and practical experience at Enlitic show that larger images perform better in medical imaging (SMB: multiple top-5 winners of the 2017 RSNA Bone Age challenge used image sizes near 512×512). Mr. Howard feels there is no reason to keep ImageNet-trained models at this size any longer. Regarding the model choice, the DenseNet is adequate, but NASNets in the last 12 months have shown significant improvement (50%) over older models.
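For reference, the preprocessing described corresponds roughly to the following torchvision pipeline (a sketch; the normalization constants are the standard ImageNet statistics, which we assume rather than confirm from the paper):

```python
from torchvision import transforms

train_tfms = transforms.Compose([
    transforms.Resize((224, 224)),          # the downsizing criticized above
    transforms.RandomHorizontalFlip(),      # the paper's only augmentation
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],   # ImageNet statistics
                         std=[0.229, 0.224, 0.225]),
])
```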

Pre-trained ImageNet weights were used, which is fine and a standard approach; but Jeremy felt it would be nice if we had a medical ImageNet for some semi-supervised training of an autoencoder, or a Siamese network to cross-validate patients — leaving room for improvement. Consider that ImageNet consists of color images of dogs, cats, planes, and trains — and we are getting great results on X-rays? While better than nothing, ANY pretrained network trained on medical images in any modality would probably perform better.

The Stanford team’s best idea was to train on multiple labels at the same time: building a single model that predicts multiple classes. This is counterintuitive, but bears out in deep learning models, and is likely responsible for their model yielding better results than prior studies. The more classes you properly train the model on, the better the results you can expect.
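A hedged sketch of what that multi-label setup looks like in practice: one network emits 14 logits, and a per-label binary cross-entropy treats each pathology as an independent yes/no question (the optimizer settings here are illustrative, not necessarily the paper’s):

```python
import torch
import torch.nn as nn
import torchvision

model = torchvision.models.densenet121(pretrained=True)
model.classifier = nn.Linear(model.classifier.in_features, 14)

criterion = nn.BCEWithLogitsLoss()   # sigmoid + binary cross-entropy per label
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)  # illustrative settings

def train_step(images, labels):
    # images: (batch, 3, 224, 224); labels: (batch, 14) floats in {0.0, 1.0}
    optimizer.zero_grad()
    logits = model(images)           # one logit per pathology, not a softmax
    loss = criterion(logits, labels)
    loss.backward()
    optimizer.step()
    return loss.item()
```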

Results

F1 scores were used to evaluate both the CheXNet model and the Stanford radiologists.

Calculating the F1 score
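For reference, the F1 score is the harmonic mean of precision and recall:

```latex
\mathrm{Precision} = \frac{TP}{TP+FP}, \qquad
\mathrm{Recall} = \frac{TP}{TP+FN}, \qquad
F_1 = 2\cdot\frac{\mathrm{Precision}\cdot\mathrm{Recall}}{\mathrm{Precision}+\mathrm{Recall}}
```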

Each radiologist’s F1 score was calculated by considering the other three radiologists as ground truth. CheXNet’s F1 score was calculated against all 4 radiologists. A bootstrap calculation was added to yield 95% confidence intervals.
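A generic percentile-bootstrap sketch of how such a confidence interval can be computed (our illustration; the paper’s exact resampling procedure may differ):

```python
import numpy as np
from sklearn.metrics import f1_score

rng = np.random.default_rng(0)

def bootstrap_f1_ci(y_true, y_pred, n_boot=10000, alpha=0.05):
    # y_true, y_pred: 1-D binary arrays over the N = 420 evaluation cases
    n = len(y_true)
    scores = np.empty(n_boot)
    for b in range(n_boot):
        idx = rng.integers(0, n, size=n)      # resample cases with replacement
        scores[b] = f1_score(y_true[idx], y_pred[idx])
    lo, hi = np.quantile(scores, [alpha / 2, 1 - alpha / 2])
    return f1_score(y_true, y_pred), (lo, hi)  # point estimate + 95% CI
```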

CheXNet’s results are as follows:
Evaluation results

From the results, CheXNet outperforms the human radiologists. The varying F1 scores can be interpreted to imply that, for a given study, the 4 radiologists do not always agree with each other on the findings. However, there is an outlier (rad 4, with an F1 score of 0.442): the thoracic-trained radiologist, who performs better than CheXNet.

Moreover, CheXNet showed state-of-the-art (SOTA) performance on all 14 pathologies compared to prior publications.
Evaluation against prior benchmarks

In my (JG) search, the Machine Intelligence Lab at the Institute of Computer Science & Technology, Peking University, directed by Prof. Yadong Mu, reports performance superior to the Stanford group’s. The code is open source and available here: https://github.com/arnoweng/CheXNet

Results from various implementations of CheXNet

Critique: Various studies that assess cognitive fit show that human performance can be degraded by a lack of clinical information or prior comparison studies. Moreover, before the most recent revision of the paper, human performance was unfairly scored against the machine.

Clinical significance

With the majority of CXRs labeled as pneumothorax also having chest tubes present, the question must be raised: are we training the DenseNet to recognize pneumothoraces, or chest tubes?

Peer review

Luke Oakden-Rayner MD, a radiologist in Australia with expertise in AI and deep learning who was on our panel, independently evaluated the ChestX-ray14 dataset and CheXNet. He praises the Stanford team for their openness and patience in discussing the paper’s methodology, and their willingness to modify the paper to correct a methodologic flaw that biased the evaluation against the radiologists.

Summary

For the second AI journal club we analysed the pipeline of AI papers in medicine. You must make sure you are asking the right clinical question, and not building algorithms for the sake of doing something. Thereafter, understand whether your data will help you answer that question, looking into the details of how the data was collected and labeled.

To determine human level or super human performance, ensure the baseline metrics are adequate and not biased against one group.

Pipeline for AI in medicine

The model appears to give at-human performance compared to experts, and better-than-human performance compared to less-trained practitioners. This is in line with research findings and Enlitic’s experience. We should not be surprised by that; research in Convolutional Neural Networks has consistently reported near-human or superhuman performance.

Take Aways

  1. There exists a critical gap in the labeling of medical data.
  2. Do not forget the clinical significance of your results.
  3. Embrace peer review, especially in medicine and AI.

These were the best tweets regarding the problem of labeling medical data — aka, do not get discouraged from attempting deep learning for medicine.


The journal club was a success, so if you are a doctor or an AI scientist, join us at https://tribe.radai.club to continue the conversation on AI and medicine. You can listen to the recording of this journal club here: https://youtu.be/xoUpKjxbeC0. Our next guest, on 22nd February 2018, is Timnit Gebru, who worked on predicting US household demographics using Google Street View images. She will be talking on Using deep learning and Google Street View to estimate the demographic makeup of neighborhoods across the United States (http://www.pnas.org/content/114/50/13108).

Coming soon

For the journal club we developed a human-versus-AI competition for interpreting the CXRs in the dataset, hosted at https://radai.club. We will be publishing the outcome of our crowdsourced labels soon, with a detailed analysis to check whether the model performance improves.

Say thanks

We would like to thank the panelists, including Jeremy Howard, Paras Lakhani, Luke Oakden-Rayner, and the Stanford ML team. Thanks also to the ACR RFS AI advisory council members, including Kevin Seals.

Article corrections made

  1. This article referred to Jeremy Howard as “Ex-CEO of Kaggle” — updated to “president and chief scientist of Kaggle”.
  2. The article stated that NLP performance on the dataset was not likely improved over random. Jeremy clarified that it was the precision of the normal finding that was not likely improved over random.

 

 

 

OODA loop revisited – medical errors, heuristics, and AI.

My OODA loop post is actually one of the most popular on this site. I blame Venkatesh Rao of Ribbonfarm and his book Tempo, and John Robb’s Brave New War, for introducing me to Boyd’s methodology. Venkatesh focuses on philosophy and management consulting; Robb focuses on COIN and human social networks. Both are far removed from healthcare, but Boyd’s principles apply to medicine as well: our enemy is disease, perhaps even ourselves.

Consider aerial dogfighting.  The human OODA loop is – Observe, Orient, Decide, Act.   You want to “get inside your opponent’s OODA loop” and out-think them, knowing their actions before they do, assuring victory.  If you know your opponent’s next move, you can anticipate where to shoot and end the conflict decisively.  Quoting Sun Tzu in The Art of War:


If you know the enemy and know yourself, you need not fear the result of a hundred battles. If you know yourself but not the enemy, for every victory gained you will also suffer a defeat. If you know neither the enemy nor yourself, you will succumb in every battle.

Focused, directed, lengthy and perhaps exhausting training for a fighter pilot enables them to “know their enemy” and anticipate action in a high-pressure, high-stakes aerial battle.  The penalty for failure is severe – loss of the pilot’s life.   Physicians prepare similarly – a lengthy and arduous training process in often adverse circumstances.  The penalty for failure is also severe – a patient’s death.  Given adequate intelligence and innate skill, successful pilots and physicians internalize their decision trees – transforming the OODA loop to a simpler OA loop – Observe and Act.  Focused practice allows the Orient and Decide portions of the loop to become automatic and intuitive, almost Zen-like.  This is what some people refer to as ‘Flow’ – an effortlessly hyperproductive state where total focus and immersion in a task suspends the perception of the passage of time.

For a radiologist, ‘flow’ is when you sit down at your PACS at 8am, continuously reading cases, making one great diagnosis after another, smiling as the words appear on Powerscribe. You’re killing the cases and you know it.  Then your stomach rumbles – probably time for lunch – you look up at the clock and it is 4pm.  That’s flow.

Flow is one of the reasons why experienced professionals are highly productive – and a smart manager will try to keep a star employee ‘in the zone’ as much as possible, removing extraneous interruptions, unnecessary low-value tasks, and distractions.

Kahneman defines this as fast Type 1 thinking, intuitive and heuristic: quick, easy, and, with sufficient experience/training, usually accurate. But Type 1 thinking can fail: a complex process masquerades as a simple one, additional important data is undiscovered or ignored, or a novel agent is introduced. In these circumstances Type 2 critical thinking is needed: slow, methodical, deductive, and logical. But humans err, substituting heuristic thinking for analytical thinking, and we get it wrong.

For the enemy fighter pilot, it’s the scene in Top Gun where Tom Cruise hits the air brakes to drop behind an attacking MiG and deliver a kill shot with his last missile. For a physician, it is an uncommon or rare disease presenting like a common one, resulting in a missed diagnosis and a lawsuit.

To those experimenting in deep learning and artificial intelligence: the time to train the network far exceeds the time needed to process an unknown through the trained network. Training can take hours to days; evaluation takes seconds.
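As a quick illustration of that asymmetry (a sketch timing a single forward pass through an off-the-shelf DenseNet-121; absolute numbers depend entirely on hardware):

```python
import time
import torch
import torchvision

model = torchvision.models.densenet121(pretrained=True).eval()
x = torch.randn(1, 3, 224, 224)   # one "unknown" input image

with torch.no_grad():
    start = time.perf_counter()
    _ = model(x)
    elapsed = time.perf_counter() - start
print(f"single-image inference: {elapsed:.3f} s")  # typically well under a second
```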

Narrow AIs like Convolutional Neural Networks take advantage of their speed to go through the OODA loop quickly, in a process called inference. I suggest a deep learning algorithm functions as an OA loop on the specific type of data it has been trained on. Inference is quick.

I believe that OODA loops are Kahneman’s Type 2 slow thinking.  OA loops are Kahneman’s Type 1 fast thinking.  Narrow AI inference is a type 1 OA loop.   An AI version of type 2 slow thinking doesn’t yet exist.*

And like humans, Narrow AI can be fooled.

Can your classifier tell the difference between a chihuahua and a blueberry muffin?

If you haven’t seen the Chihuahua vs. blueberry muffin clickbait picture, consider yourself sheltered. Claims that narrow AI can’t tell the difference are largely, but not entirely, bogus. While narrow AI is generally faster than people, and potentially more accurate, it can still make errors. But so can people. In general, classification errors can be reduced by creating a more powerful, or ‘deeper’, network. I think collectively we have yet to decide how much error to tolerate in our AIs. If we are willing to tolerate an error rate of 5% in humans, are we willing to tolerate the same in our AIs, or do we expect 97.5% accuracy? Or 99%? Or 99.9%?

The single-pixel attack is a bit more interesting. While deceptively similar images such as the ones above probably won’t pass careful human scrutiny, frankly adversarial images, unrecognizable to humans, can still be confidently misinterpreted by a classifier:

Convolutional Neural Networks can be fooled by adversarial images

Selecting and perturbing a single pixel is much more subtle, and probably could escape human scrutiny. Jiawei Su et al. address this in their “One Pixel Attack” paper, where the modification of one pixel in an image had a 66% to 73% chance of changing the classification of that image. By changing more than one pixel, success rates rose further. The paper used older, shallower narrow AIs like VGG-16 and Network-in-Network. Newer models such as DenseNets and ResNets might be harder to fool. This type of “attack” represents a real-world situation where the OA loop fails to account for unexpected new (or perturbed) information, and is incorrect.
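To make the idea concrete, here is a toy version of the attack (our sketch: blind random search in place of the paper’s differential evolution, against any hypothetical image classifier `model`):

```python
import torch

def random_one_pixel_attack(model, image, true_label, n_trials=500):
    # Toy stand-in for the one-pixel attack: random search rather than the
    # paper's differential evolution. image: (3, H, W) tensor in [0, 1].
    model.eval()
    _, h, w = image.shape
    for _ in range(n_trials):
        candidate = image.clone()
        y = torch.randint(h, (1,)).item()
        x = torch.randint(w, (1,)).item()
        candidate[:, y, x] = torch.rand(3)           # perturb a single pixel
        pred = model(candidate.unsqueeze(0)).argmax(dim=1).item()
        if pred != true_label:
            return candidate, (y, x)                 # classification flipped
    return None, None                                # attack failed
```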

Contemporaneous update: Google has developed images that use an adversarial attack to uniformly defeat classification attempts by standard CNN models. By making “stickers” out of these processed images, the presence of such a sticker, even at less than 20% of the image size, is sufficient to change the classification to the class the sticker was trained to produce, rather than the primary object in the image. They look like this:

Adversarial “patch” images capable of overriding a CNN classifier (https://arxiv.org/pdf/1712.09665.pdf)

 

I am not aware of definitive solutions to these problems. The obvious images that fool the classifier can probably be dealt with by ensembling with other, more traditional forms of computer vision image analysis, such as HOG features or SVMs. For a one-pixel attack, perhaps widening the network and increasing the number of training samples, by either data augmentation or adversarially generated features, might make the network more robust. This probably falls into the “too soon to tell” category.
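A minimal sketch of such a classical ensemble member (purely illustrative; the feature parameters here are assumptions, not a vetted recipe): HOG features feeding a linear SVM, whose agreement with the CNN could be checked before trusting a prediction.

```python
import numpy as np
from skimage.feature import hog
from sklearn.svm import LinearSVC

def hog_features(images):
    # images: iterable of 2-D grayscale arrays of equal size
    return np.array([hog(im, orientations=9,
                         pixels_per_cell=(16, 16),
                         cells_per_block=(2, 2))
                     for im in images])

# Train a classical baseline alongside the CNN...
clf = LinearSVC()
# clf.fit(hog_features(train_images), train_labels)

# ...and flag cases where the two disagree for human review:
# suspicious = cnn_preds != clf.predict(hog_features(test_images))
```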

There has been a great deal of interest and emphasis placed lately on understanding black-box models. I’ve written about some of these techniques in other posts. Some investigators feel this is less relevant. However, by understanding how the models fail, they can be strengthened. I’ve also written about this, but from a management standpoint. There is a trade-off between accuracy at speed, robustness, and serendipity. I think the same principle applies to our AIs as well. By understanding the frailty of speedy accuracy versus redundancies that come at the expense of cost, speed, and sometimes accuracy, we can build systems and processes that not only work but are less likely to fail in unexpected and spectacular ways.

Let’s acknowledge the likelihood of failure of narrow AI where it is most likely to fail, and design our healthcare systems and processes around that as we begin to incorporate AI into our practice and management. If we do that, we will truly get inside the OODA loop of our opponent – disease – and eradicate it before it even has a chance. What a world to live in, where the only thing disease can say is, “I never saw it coming.”

 

*I believe OODA loops have mathematical analogues. The OODA loop is inherently Bayesian: next actions are iteratively decided by prior probabilities. Iterative deep learning constructs include LSTMs, RNNs (Recurrent Neural Networks), and of course Generative Adversarial Networks (GANs). There have been attempts not only to use Bayesian learning for hyperparameter optimization but also to combine it with RL (Reinforcement Learning) and GANs. Only time will tell if this brings us closer to the vaunted AGI (Artificial General Intelligence)**.

**While I don’t think we will soon solve the AGI question, I wouldn’t be surprised if complex combinations of these methods, along with ones not yet invented, bring us close to top human expert performance in a Narrow AI. But I also suspect that once we start coding creativity and resilience into these algorithms, we will take a hit in accuracy as we approach less narrow forms of AI.  We will ultimately solve for the best performance of these systems, and while it may even eventually exceed human ability, there will likely always be an error present.  And in that area of error is where future medicine will advance.

© 2018