
Combining AI and Value – the COVID-19 coronavirus as a potential use case.

Note (updated 4/22/20): Since this post was first written, a great deal of information has been released about the novel coronavirus COVID-19, with an equal amount of confusion. Many preprints have been released to satisfy the hunger for information about this new and concerning disease. Unfortunately, there are problems with both quality and veracity; source material must be weighed carefully, as not all of it is of the same quality, and frankly, some might be intentional misinformation.

After this blog post was published, radiology leadership from the organized specialty societies recommended NOT using imaging on COVID-19 patients for diagnosis, reserving it for investigation of a worsening clinical course not otherwise explainable. Many AI projects of extremely poor or questionable quality were initiated, some using as few as 5 cases for training (!). If you are looking for such a product, I’ll instead suggest an adaptation of the 2018 RSNA pneumonia challenge winner, which is being hosted on the Arterys platform. (Full disclosure: I have no financial interest in Arterys or the makers of that product.)

I’ve tried hard to avoid hype for the purposes of this blog. So, as to the question – is AI useful in detecting COVID via imaging? At this time I don’t know; further imaging data and study will be needed. My hope is that the point is somewhat moot: if accurate rapid PCR tests and antibody testing are widely available, detection via imaging becomes less of a concern, and I think that is why the specialty societies argued against imaging as a front-line diagnostic tool.

First post in a while. This was supposed to be “Whither Value? – an update on Value-Based Care in 2020.” But that can wait, as the coronavirus situation in China gives us pause.

First off, our hearts go out to the afflicted in China, especially Wuhan. Diseases like COVID afflict randomly and pass sentence indiscriminately upon their victims, uncaring of race, creed, national origin, or political affiliation. We can all appreciate the primary risk is simply being alive and in the wrong place at the wrong time. As human beings, we all are potentially vulnerable.

I write this after the near-doubling of reported cases following a change in diagnostic criteria, first brought to my attention on 2/12/20 by Scott Gottlieb, former FDA commissioner. Previously, the only reported cases were those confirmed by PCR, and some commented that cases were rising steadily at about 3,000/day – possibly the capacity limit of national PCR testing. Clearly, though, imaging (specifically a CT scan showing bilateral patchy ground glass opacities, fairly typical of viral pneumonitis but nonspecific) is becoming part of the diagnostic algorithm, and revealing the more extensive spread that anecdotal social media reports had suggested.

Which raises the question – could AI help?

 https://doi.org/10.1148/radiol.2020200230
Radiology – CT Imaging Features of 2019 Novel Coronavirus

As CT imaging is now part of the diagnostic criteria, we have an opportunity to take positive CT scans from patients with positive COVID-19 PCR and assemble an initial, standardized dataset in DICOM format. It should be as high quality as possible, and ideally have the ground glass opacities and progressive consolidation / ARDS-like appearance labeled with bounding boxes. If that Chinese startup with $100 million in funding doesn’t have the appropriate tools, my friend George over at MD.AI does. Once the basic dataset is created, a second dataset of similar, presumptive cases based upon the new WHO criteria (CT scan +, Wuhan contact, lymphopenia) can be assembled with weak labeling, perhaps aided by some semi-supervised learning techniques to make the labeling less onerous (similar to what we did in this paper). So as not to build an ARDS detector, initial presenting imaging should be used. A similar number of control negative cases should be added to the dataset with a negative label. Then a classifier can be trained for the presence or absence of disease.
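To make this concrete, here is a minimal sketch of the kind of binary classifier I have in mind, assuming the CT slices have already been exported from DICOM, windowed, and sorted into positive/negative folders. The folder layout, backbone, and hyperparameters are my own illustrative choices, not a prescription:

```python
# Minimal sketch: binary COVID/non-COVID classifier over exported CT slices.
# Assumes a hypothetical layout data/{positive,negative}/*.png; all names and
# hyperparameters are illustrative only.
import torch
import torch.nn as nn
from torchvision import datasets, models, transforms

tfm = transforms.Compose([
    transforms.Grayscale(num_output_channels=3),  # CT slices are single-channel
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
])
ds = datasets.ImageFolder("data", transform=tfm)  # classes: negative=0, positive=1
loader = torch.utils.data.DataLoader(ds, batch_size=16, shuffle=True)

model = models.resnet18(pretrained=True)          # transfer learning as a starting point
model.fc = nn.Linear(model.fc.in_features, 1)     # single logit: disease present/absent
opt = torch.optim.Adam(model.parameters(), lr=1e-4)
loss_fn = nn.BCEWithLogitsLoss()

model.train()
for epoch in range(5):
    for x, y in loader:
        opt.zero_grad()
        loss = loss_fn(model(x).squeeze(1), y.float())
        loss.backward()
        opt.step()
```

The same skeleton, pointed at presenting chest radiographs instead of CT slices, would cover the screening use case discussed below.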

While such a CT-based classifier would prove helpful in the developed world and urban China, it does little for the developing world. For that reason, it would be prudent to create a dataset of standard chest radiographs from the ground-truth-positive patients identified in the two CT datasets above. This would hold value with or without the bounding boxes and could be used for screening purposes.

Chest X-ray-based deep learning screening would be useful for the rest of the world, where access is limited not only to advanced imaging but also to radiologists. In some developing countries, studies can wait up to 60 days to be interpreted by a physician. Chest X-ray equipment is also less bulky and more portable, and could be used at borders and immigration control points in conjunction with temperature checks to assess potential geographic spread of COVID-19.

Part of the value of AI in this circumstance is the rapid reproduction and dissemination of knowledge in the face of a clear and present danger. Physicians might simply not know COVID-19’s findings, or might not recognize them because they are so unlikely in their practice. Those first few cases slip by and allow for increased local infections. The ability to be proactive is valuable. The dataset, and to a lesser extent any actual model, represents a rapid transfer of knowledge among physicians. Dissemination of the dataset, and eventually an AI classifier, is limited only by internet access and end-user integration. The cost of developing such a single-purpose classifier would be miniscule – on the order of a research grant. More is being lost to economic slowdown on an hourly basis.

With the dataset open-sourced, new cases could be added either collectively, via a registry created by governments or NGOs for this purpose, or administered by each country’s health department/ministry. Additional cases could be added and supervised by local academic physicians, who would be best suited to detect spectrum shift in disease presentation should the virus mutate and change its pathologic presentation.

So, this post is predominantly aimed at my Asian colleagues, who by virtue of their location are present at the epicenter of this illness. To the healthcare professionals of the Middle Kingdom – it must be difficult to witness what you are seeing; I cannot even conceive it. You have a long tradition of excellence in the healing arts, stretching back to Hua Tuo of the Three Kingdoms and beyond. Receive my heartfelt respect for taking up the rod of Aesculapius at personal risk to yourselves in order to minister to the needs of others. In conjunction with your close research colleagues, you are in a unique position to create this database – you have the clinical case knowledge and the PCR data. Furloughed or idle technologists, residents, and radiology physicians can annotate the data you provide, adding not only +disease / -disease labels on CT slice images, but bounding boxes on the areas of pulmonary involvement. While you are doing that, take the presenting chest X-ray and do the same! Those steps, by the way, can be done from home, over the internet. Once assembled into a high-quality labeled dataset of a sufficient number of cases, release it publicly for the good of all people and nations. Models can then be created and shared on the internet once the data is released. At the last RSNA I saw a neat open-source data lake for imaging data/DICOM called KHEOPS – perhaps this would be a way to accomplish this across users and platforms.

Coronavirus image from sciencemag.org – used under fair use principles for scientific purposes.

I’m aware that this post might not age well. On the other hand, I have a bit of a bully pulpit so I am going to use it. It is far too easy to fall either into hysteria or indifference when faced with a fearsome circumstance. I believe that rationality combined with empathy is the best response.


Data Science Salon: Miami

I was invited by Data Science Salon to attend, and I was really pleased to do so. There is a developing data science, machine learning, and deep learning community in the South Florida area that I support. The topics were diverse, from business intelligence to online ad buying to health tech.

The conference was hosted by Formulated.by and held at Miami’s CIC, near University of Miami/Jackson Memorial Hospital. It was a two-day conference; I attended only the second day.

Vendors participating in and hosting the conference were Dataiku, Vertica, Plot.ly, Formulated.by, O’Reilly, Alteryx, and Domino Data Lab.

Here is the Thursday conference agenda:

I got through Miami traffic just in time to make the tail end of the meditation exercise. I’ll be honest – talking about data science gets me excited, so I really wasn’t in the mood to calm down. Miami traffic doesn’t make me calm down either. But it was fun, nonetheless.

Brian MacDonald of the Florida Panthers started off with an interesting presentation on how the Panthers, as an organization, solved the problem of how much to charge for seats at a game. It ended up being a very traditional data science problem: exploring the data, discerning relationships within it, and then creating predictive models. It turns out that demand for seats is related to day of week, opposing team, home team performance, holidays (some were highly negative, like Valentine’s Day), and how late in the season the game is played. They utilized a regression model controlling for these independent variables, and thereafter were able to predictively model sales, attendance, and even season ticket holder renewals.
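As a toy illustration of that kind of model – the feature names and numbers below are my guesses at the factors mentioned, not the Panthers’ actual data:

```python
# Illustrative only: regression of ticket demand on the factors named in the
# talk (day of week, opponent, team performance, holidays, point in season).
import pandas as pd
from sklearn.linear_model import LinearRegression

games = pd.DataFrame({
    "day_of_week":       ["Sat", "Tue", "Fri", "Sat"],
    "opponent":          ["TBL", "BUF", "NYR", "BOS"],
    "win_pct":           [0.55, 0.55, 0.58, 0.58],   # home team performance to date
    "is_holiday":        [0, 1, 0, 0],               # e.g., Valentine's Day was highly negative
    "games_into_season": [10, 25, 41, 60],
    "tickets_sold":      [15200, 11800, 16900, 17500],
})

X = pd.get_dummies(games.drop(columns="tickets_sold"),
                   columns=["day_of_week", "opponent"])  # encode categoricals
y = games["tickets_sold"]
model = LinearRegression().fit(X, y)                     # control for each factor
print(dict(zip(X.columns, model.coef_.round(1))))        # per-factor effect on demand
```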

Michael Conway from Bidtellect spoke on their self-service predictive analytics platform for online ad bidding, which runs on the Vertica service. It was eye-opening (for me as a physician) that they participate in 15,000,000,000 (yes, that number is accurate) auctions daily for online ad placement. He communicated that engagement rates are important, and that by measuring post-click consumer activity you can document the value of the ad.

Relationship Mapping by Carnival Data Science Team used in social selling

The data science team of Kevin U and Mark Fridson from Carnival Cruise Lines spoke next – this was a really excellent talk, first about the digital transformation of a traditional Fortune 500 company, and then some nuts and bolts. Kevin hammered home the importance of a data-driven culture that flows from the highest levels of the organization, to spur adoption and deal with “change management” (that exists in healthcare too, by the way). One reality of being in South Florida is the skills gap – qualified data people are hard to come by.

Mark discussed the importance of multichannel engagement via snail mail, email, and social media, sharing insights closely tied to generational cohorts. For each age group, Carnival has an “ideal customer” which they try to match as closely as possible. Boomers respond best via snail mail (USPS), while Gen X and Millennials use email and social media. For Generation Z, it’s all social media, but for different purposes: Snapchat creates exposure, while Instagram represents captured moments. Facebook is for acquaintance updates and communication, and Twitter is most useful for interests and influencers. I thought that breakdown was particularly useful for those in marketing.

Propensity Modeling by Carnival Data Science Team for Customer Lifetime Value

They use propensity modeling with Bayesian analysis to calculate CLV (Customer Lifetime Value). Content personalization is performed with demographics, frequency, booking patterns, after-purchase add-ons, and even an element of serendipity (! – remember that piece on antifragility I did? These guys get it). They do use social relationship mapping and have been applying some NLP text analysis, but they feel it’s hard to use AI NLP on social media.
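For flavor, here is a toy version of a Bayesian propensity-to-CLV calculation – illustrative only, and certainly not Carnival’s actual model:

```python
# Toy sketch: a Beta posterior over re-booking propensity is updated from a
# hypothetical customer's history, then CLV is the discounted expected margin
# over future booking windows. All numbers are made up.
bookings_taken, offers_seen = 3, 8
alpha = 1 + bookings_taken                        # Beta(1,1) prior, updated
beta = 1 + (offers_seen - bookings_taken)
p_book = alpha / (alpha + beta)                   # posterior mean propensity to book

margin_per_cruise = 900.0                         # illustrative dollar margin
discount = 0.9                                    # annual discount factor
horizon_years = 10
clv = sum(p_book * margin_per_cruise * discount**t for t in range(horizon_years))
print(f"propensity={p_book:.2f}, CLV~${clv:,.0f}")
```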

Catalina Arango spoke next; her talk was non-technical, aimed at beginners and managers wanting to implement data science in their enterprises. As this was a refresher for me, I took the opportunity to speak with the Dataiku and Vertica folks.

Cancer Vaccine for Melanoma from Dana Farber using neoantigens

Next up was Alex Rubynstein from Mt. Sinai in NYC. Mt. Sinai is one of the more proactive medical centers in the country regarding analytics and recognizing the value of data; I have seen them advertising multiple positions to monetize their research.

Tumor control with personalized vaccine: 4/6 vaccine recipients were disease-free 25 months after vaccination, while the 2/6 with recurrent disease were subsequently treated and experienced complete cancer regression.

This was an interesting take on personalized medicine and genomics using big data for analysis. Because of cancer’s lethality, more experimentation is possible, which has resulted in some novel therapies that approach cure, or at least transform cancer into a chronic condition. The cancer vaccine approach works on the patient’s immune system either to enhance the immune response (to overcome immune suppression) or to increase the immune system’s sensitivity to the cancer (to overcome immune escape). They take the patient’s gene sequence and the tumor’s gene sequence, filter the two, and target on the order of 5-20 mutations, combining the vaccine with an adjuvant. They use machine learning to rank the candidate targets, as the number of mutations exceeds the number of targets. They are continuing to expand their sample size, which is extremely small, and because of the individualized nature of the therapy, very costly. Nevertheless, early results are promising. The primary limitation is the individualized, handcrafted nature of the vaccine.
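Schematically, the filter-and-rank step might look like the sketch below; the scoring function is a stand-in for their machine-learned binding/immunogenicity model, which I don’t have:

```python
# Schematic only: keep mutations present in the tumor but not the germline,
# score each candidate, and keep the top handful for the vaccine. All variant
# names are hypothetical examples.
patient_tumor_variants = {"KRAS_G12D", "TP53_R175H", "BRAF_V600E", "EGFR_L858R"}
germline_variants = {"EGFR_L858R"}                    # inherited, not tumor-specific

candidates = patient_tumor_variants - germline_variants   # tumor-specific mutations only

def predicted_immunogenicity(mutation: str) -> float:
    """Stand-in for an ML model scoring MHC binding / immunogenicity."""
    return hash(mutation) % 100 / 100.0               # placeholder score

ranked = sorted(candidates, key=predicted_immunogenicity, reverse=True)
vaccine_targets = ranked[:20]                         # 5-20 targets, per the talk
print(vaccine_targets)
```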

Lunch followed – Subway sandwich boxes, which were fine.    Networking at a data science conference can be tough (stereotypes anyone?)  but I managed to find a few good folks to chat with.

A panel followed, composed of four speakers: Dr. Irma Fernandez, chief academic officer of St. Thomas University; Colleen Farrelly, data scientist at Kaplan; Mauro Damo, chief data scientist at Dell; and Anton Antonov, consultant at Accendo Data. A broad range of topics was discussed. The main points: publishing data can be damaging, so be aware of what you are putting out there; and it’s narrow AI only at this time – no general AI (we know that)! This was a good, in-the-field survey of current trends and issues.

Markov Chain Sparse Matrices

Athanassios Kintaskis, Sr. Machine Learning Engineer at Capital One, gave an interesting presentation on MCL (Markov Clustering) over sparse graphs – a good technical talk, some of which went over my head. As opposed to K-means clustering algorithms, which are sensitive but can’t tell you how many groups are present (you need to choose), this approach simulates random walks on a graph and uses a flow dynamic to create the clusters.

Markov Chain Clustering flowchart

Markov chain transitions can be modeled as a matrix, and that’s about as far as I got before I was interrupted by a phone call. This was an interesting and meaty talk, and I probably need to read up more on the topic before publicly displaying my ignorance.
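For the curious, here is a minimal numpy sketch of the MCL idea on a toy graph: simulate flow (expansion), then sharpen it (inflation), repeating until clusters emerge. The graph and parameters are illustrative defaults, not anything from the talk:

```python
# Toy MCL: expansion spreads random-walk flow; inflation strengthens strong
# flows and prunes weak ones. Uses the common defaults (expansion=2, inflation=2).
import numpy as np

A = np.array([[1, 1, 1, 0, 0, 0],    # two obvious 3-node cliques,
              [1, 1, 1, 0, 0, 0],    # weakly joined at nodes 2-3
              [1, 1, 1, 1, 0, 0],
              [0, 0, 1, 1, 1, 1],
              [0, 0, 0, 1, 1, 1],
              [0, 0, 0, 1, 1, 1]], dtype=float)

M = A / A.sum(axis=0)                          # column-normalize: transition matrix
for _ in range(50):
    M = np.linalg.matrix_power(M, 2)           # expansion: let flow spread
    M = M ** 2                                 # inflation: sharpen strong flows
    M = M / M.sum(axis=0)                      # renormalize columns

# rows that retain mass are "attractors"; their nonzero columns form a cluster
for i, row in enumerate(M):
    if row.max() > 1e-6:
        print(f"cluster around node {i}: {np.nonzero(row > 1e-6)[0].tolist()}")
```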

Anabetsy Rivero of Metastatic AI gave a nice introductory presentation on convolutional networks in medical imaging (head over to my other blog, www.ai-imaging.org, for more on that, or read my prior articles here). Anabetsy is a machine learner focusing on breast cancer diagnostics.

There were a few other presentations but this is a blog, not a manifesto!

All in all, I appreciated what Formulated.by did to bring this type of conference to Miami. It is a necessary part of growing the Miami data science community, and I would love to see more events like Data Science Salon in the future. A second Data Science Salon: Miami is slated for November 6-7, 2018.


FULL DISCLOSURE: Because of my involvement in the South Florida Data Science and Machine Learning community, I received complimentary entrance.

OODA loop revisited – medical errors, heuristics, and AI.

My OODA loop post is actually one of the most popular on this site. I blame Venkatesh Rao of Ribbonfarm (and his book Tempo) and John Robb’s Brave New War for introducing me to Boyd’s methodology. Venkatesh focuses on philosophy and management consulting; Robb focuses on COIN and human social networks. Both are removed from healthcare, but Boyd’s principles apply to medicine too: our enemy is disease, perhaps even ourselves.

Consider aerial dogfighting.  The human OODA loop is – Observe, Orient, Decide, Act.   You want to “get inside your opponent’s OODA loop” and out-think them, knowing their actions before they do, assuring victory.  If you know your opponent’s next move, you can anticipate where to shoot and end the conflict decisively.  Quoting Sun Tzu in The Art of War:


If you know the enemy and know yourself, you need not fear the result of a hundred battles. If you know yourself but not the enemy, for every victory gained you will also suffer a defeat. If you know neither the enemy nor yourself, you will succumb in every battle.

Focused, directed, lengthy and perhaps exhausting training for a fighter pilot enables them to “know their enemy” and anticipate action in a high-pressure, high-stakes aerial battle.  The penalty for failure is severe – loss of the pilot’s life.   Physicians prepare similarly – a lengthy and arduous training process in often adverse circumstances.  The penalty for failure is also severe – a patient’s death.  Given adequate intelligence and innate skill, successful pilots and physicians internalize their decision trees – transforming the OODA loop to a simpler OA loop – Observe and Act.  Focused practice allows the Orient and Decide portions of the loop to become automatic and intuitive, almost Zen-like.  This is what some people refer to as ‘Flow’ – an effortlessly hyperproductive state where total focus and immersion in a task suspends the perception of the passage of time.

For a radiologist, ‘flow’ is when you sit down at your PACS at 8am, continuously reading cases, making one great diagnosis after another, smiling as the words appear on Powerscribe. You’re killing the cases and you know it.  Then your stomach rumbles – probably time for lunch – you look up at the clock and it is 4pm.  That’s flow.

Flow is one of the reasons why experienced professionals are highly productive – and a smart manager will try to keep a star employee ‘in the zone’ as much as possible, removing extraneous interruptions, unnecessary low-value tasks, and distractions.

Kahneman defines this as fast Type 1 thinking – intuitive and heuristic: quick, easy, and, with sufficient experience/training, usually accurate. But Type 1 thinking can fail: a complex process masquerades as a simple one, additional important data goes undiscovered or ignored, or a novel agent is introduced. In these circumstances, Type 2 critical thinking is needed: slow, methodical, deductive, and logical. But humans err, substituting heuristic thinking for analytical thinking, and we get it wrong.

For the enemy fighter pilot, it’s the scene in Top Gun where Tom Cruise hits the air brakes to drop behind an attacking MiG and deliver a kill shot with his last missile. For a physician, it is an uncommon or rare disease presenting like a common one, resulting in a missed diagnosis and a lawsuit.

For those experimenting in deep learning and artificial intelligence, the time to train or teach the network far exceeds the time needed to process an unknown through the trained network. Training can take hours to days; evaluation takes seconds.

Narrow AI’s like convolutional neural networks take advantage of their speed to go through the OODA loop quickly, in a process called inference. I suggest that a deep learning algorithm functions as an OA loop on the specific type of data it has been trained on. Inference is quick.
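A toy timing comparison illustrates the asymmetry; the network and sizes below are arbitrary, and only the ratio between the two numbers matters:

```python
# Rough illustration of the train-vs-inference asymmetry: one forward pass
# (inference) is orders of magnitude cheaper than a training loop.
import time
import torch
import torch.nn as nn

model = nn.Sequential(nn.Conv2d(1, 16, 3), nn.ReLU(), nn.Flatten(),
                      nn.Linear(16 * 26 * 26, 2))
opt = torch.optim.SGD(model.parameters(), lr=0.01)
x, y = torch.randn(64, 1, 28, 28), torch.randint(0, 2, (64,))

t0 = time.time()
for _ in range(100):                       # "training": many passes over the data
    opt.zero_grad()
    nn.functional.cross_entropy(model(x), y).backward()
    opt.step()
train_time = time.time() - t0

t0 = time.time()
with torch.no_grad():                      # "inference": a single OA-style pass
    model(x[:1])
print(f"train: {train_time:.2f}s, inference: {time.time() - t0:.4f}s")
```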

I believe that OODA loops are Kahneman’s Type 2 slow thinking.  OA loops are Kahneman’s Type 1 fast thinking.  Narrow AI inference is a type 1 OA loop.   An AI version of type 2 slow thinking doesn’t yet exist.*

And like humans, Narrow AI can be fooled.

Can your classifier tell the difference between a chihuahua and a blueberry muffin?

If you haven’t seen the chihuahua vs. blueberry muffin clickbait picture, consider yourself sheltered. Claims that narrow AI can’t tell the difference are largely, but not entirely, bogus. While narrow AI is generally faster than people, and potentially more accurate, it can still make errors. But so can people. In general, classification errors can be reduced by creating a more powerful, or “deeper,” network. I think collectively we have yet to decide how much error to tolerate in our AI’s. If we are willing to tolerate an error of 5% in humans, are we willing to tolerate the same in our AI’s, or do we expect 97.5% accuracy? Or 99%? Or 99.9%?

The single-pixel attack is a bit more interesting. Similar-looking images such as the ones above probably won’t pass careful human scrutiny, but, frankly, adversarial images unrecognizable to humans can still be confidently misinterpreted by a classifier:

Convolutional Neural Networks can be fooled by adversarial images

Selecting and perturbing a single pixel is much more subtle, and probably could escape human scrutiny. Jiawei Su et al. address this in their “One Pixel Attack” paper, where the modification of one pixel in an image had a 66% to 73% chance of changing the classification of that image. By changing more than one pixel, success rates rose accordingly. The paper used older, shallower narrow AI’s like VGG-16 and Network-in-Network; newer models such as DenseNets and ResNets might be harder to fool. This type of “attack” represents a real-world situation where the OA loop fails to account for unexpected new (or perturbed) information and is incorrect.
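For intuition, here is a much-simplified sketch of the idea. Su et al. search for the pixel with differential evolution; plain random search stands in for it here, and `model` is whatever image classifier you hand in:

```python
# Simplified one-pixel attack: try random single-pixel perturbations until the
# model's predicted class flips (random search stands in for the paper's
# differential-evolution search).
import torch

def one_pixel_attack(model, image, true_label, tries=500):
    """Return the first single-pixel perturbation that changes the class."""
    c, h, w = image.shape                        # image: (C, H, W) tensor in [0, 1]
    model.eval()
    with torch.no_grad():
        for _ in range(tries):
            candidate = image.clone()
            y = torch.randint(h, (1,)).item()
            x = torch.randint(w, (1,)).item()
            candidate[:, y, x] = torch.rand(c)   # overwrite one pixel
            pred = model(candidate.unsqueeze(0)).argmax(dim=1).item()
            if pred != true_label:
                return candidate, (x, y)         # success: classification changed
    return None, None                            # no fooling pixel found in budget
```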

Contemporaneous update: Google has developed images that use an adversarial attack to uniformly defeat classification attempts by standard CNN models. By making “stickers” out of these processed images, the presence of such a sticker, even at less than 20% of the image size, is sufficient to change the classification to the class the patch was trained for (over an ensemble of models), rather than the primary object in the image. They look like this:

adversarial images capable of overriding CNN classifier
https://arxiv.org/pdf/1712.09665.pdf

 

I am not aware of definitive solutions to these problems. The obvious images that fool the classifier can probably be dealt with by ensembling other, more traditional forms of computer vision analysis such as HOG or SVMs. For a one-pixel attack, perhaps widening the network and increasing the number of training samples, by either data augmentation or adversarially generated examples, might make the network more robust. This probably falls into the “too soon to tell” category.

There has been a great deal of interest and emphasis placed lately on understanding black-box models. I’ve written about some of these techniques in other posts. Some investigators feel this is less relevant. However, by understanding how models fail, they can be strengthened. I’ve also written about this, but from a management standpoint: there is a trade-off between accuracy at speed, robustness, and serendipity. I think the same principle applies to our AI’s. By understanding the frailty of speedy accuracy versus redundancies that come at the expense of cost, speed, and sometimes accuracy, we can build systems and processes that not only work, but are less likely to fail in unexpected and spectacular ways.

Let’s acknowledge the likelihood of failure of narrow AI where it is most likely to fail, and design our healthcare systems and processes around that as we begin to incorporate AI into our practice and management. If we do that, we will truly get inside the OODA loop of our opponent – disease – and eradicate it before it ever has a chance. What a world to live in, where the only thing disease can say is, “I never saw it coming.”

 

*I believe OODA loops have mathematical analogues. The OODA loop is inherently Bayesian – next actions iteratively decided by prior probabilities. Iterative deep learning constructs include LSTMs, RNNs (recurrent neural networks), and of course Generative Adversarial Networks (GANs). There have been attempts not only to use Bayesian learning for hyperparameter optimization but also to combine it with RL (reinforcement learning) and GANs. Only time will tell if this brings us closer to the vaunted AGI (Artificial General Intelligence)**.

**While I don’t think we will solve the AGI question soon, I wouldn’t be surprised if complex combinations of these methods, along with ones not yet invented, bring us close to top human expert performance in a narrow AI. But I also suspect that once we start coding creativity and resilience into these algorithms, we will take a hit in accuracy as we approach less narrow forms of AI. We will ultimately solve for the best performance of these systems, and while it may eventually exceed human ability, there will likely always be some error present. And in that area of error is where future medicine will advance.

© 2018


CheXNet – a brief evaluation

Chest Radiograph from ChestX-ray14 dataset processed with the deep dream algorithm trained on ImageNet

1/25/18 NOTE: Since the November release of the CheXNet paper on arXiv, there has been a healthy and extensive online discussion on Twitter, Reddit, and blogs. The Stanford paper has undergone at least two revisions with substantial modifications, most importantly the replacement of ROC curves with F1 scores and a bootstrap calculation of significance. Some details about the methodology that were not released in the original version have since come out, particularly the “re-labeling” of ground truth by Stanford radiologists. My comment about the thoracic specialist has been completely borne out by the released information. And the problems with ChestX-ray14’s labeling (the reason the Stanford docs re-labeled) are now well known.

The investigation and discussion of this paper has been spearheaded by Luke Oakden-Rayner, who has spent months corresponding with the authors and discussing the paper. For further information, see the links below.

The discussion on CheXNet appears to be over, and there has been a great deal of collective learning from it. The Stanford group should be lauded for their willingness to engage in open peer review and to modify their paper substantially afterward. There is no question that a typical 18-24 month process of review and discussion was fast-tracked into the last two months. Relevant blog links are below, after my December addendum. This will be my last update on this post, as it is “not so brief” any longer!

 

Andrew Ng released CheXNet yesterday on arXiv (citation) and promoted it with a tweet, which caused a bit of a stir on the internet and radiology social media sites like Aunt Minnie. Before radiologists throw away their board certifications and look for jobs as Uber drivers, a few comments on what this does and does not do.

First off, from the machine learning perspective, the methodologies check out. It uses a 121-layer DenseNet, a powerful convolutional neural network. While code has not yet been provided, the DenseNet seems similar to online code repositories where 121 layers are a pre-made format. The 80/20 training/validation split seems pretty reasonable (per my friend Kirk Borne); random initialization, minibatches of 16 with oversampling of positive classes, and a learning rate that decays as validation loss plateaus are all pretty standard. Class activation mappings are used to visualize the areas of the image most indicative of the activated class (in this case, pneumonia) – an interesting technique that provides some human-interpretable insight into the potentially opaque DenseNet.

The last fully connected (FC) layer is replaced by a single output (only one class is being tested for – pneumonia) coupled to a sigmoid function (an activation function – see here) to give a probability between 0 and 1. Again, pretty standard for binary classification. The multiclass portion of the study was performed separately/later.
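A sketch of what that architecture change (and the class activation mapping) might look like in PyTorch; the weights and preprocessing here are my assumptions, as the paper’s code was not released:

```python
# Sketch: torchvision DenseNet-121 with its classifier swapped for a single
# sigmoid output, plus the class activation map from the final feature maps.
import torch
import torch.nn as nn
from torchvision import models

net = models.densenet121(pretrained=True)
net.classifier = nn.Linear(net.classifier.in_features, 1)  # one class: pneumonia

x = torch.randn(1, 3, 224, 224)                 # stand-in for a preprocessed chest film
features = torch.relu(net.features(x))          # final conv maps: (1, 1024, 7, 7)
pooled = nn.functional.adaptive_avg_pool2d(features, 1).flatten(1)
prob = torch.sigmoid(net.classifier(pooled))    # probability of pneumonia, 0..1

# class activation map: weight each feature map by the classifier's weights
w = net.classifier.weight.view(-1, 1, 1)        # (1024, 1, 1)
cam = (features[0] * w).sum(dim=0)              # (7, 7) heatmap over the image
print(prob.item(), cam.shape)
```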

The test portion of the study used 420 chest X-rays read by four radiologists, one of whom was a thoracic specialist. They could choose between the 14 pathologies in the ChestX-ray14 dataset, reading blind without any clinical data.

So, an ROC curve was created, showing three radiologists similar to each other and one outlier. The radiologists lie slightly under the ROC curve of the CheXNet classifier. But a miss is as good as a mile, so the claims of at-or-above-radiologist performance are accurate, because math. As Luke Oakden-Rayner points out, this would probably not pass statistical muster.

So that’s the study. Now I will pick a few bones with it.

First, including only one thoracic radiologist matters if you are going to make ground truth the agreement of three out of four radiologists. (Addendum: and, for statistical and methodological reasons discussed online, the 3-out-of-4 implementation was initially flawed as scored.) General radiologists will be less specific than specialist radiologists, and that is one of the reasons we have moved to specialty-specific reads over the last 20 years. If the three general rads disagreed with the thoracic rad, the thoracic rad’s ground truth would be discarded. Think about this – you would take the word of the generalist over the specialist, despite the latter’s greater training. (1/25 addendum: proven right on this one. The thoracic radiologist is an outlier with a higher F1 score.) Even Google didn’t do this in their retinal machine learning paper; instead, Google used their three retinal specialists as ground truth and then looked at how the non-specialty ophthalmologists evaluated that data and what it meant for the training dataset. (Thanks, Melody!) Nevertheless, all rads lie reasonably along the same ROC curve, so methodologically it checks out: the radiologists are likely of equal ability but different sensitivities/specificities.

Second, ChestX-ray14 (Wang et al.) is a dataset that was data-mined from NIH radiology reports. This means that for the dataset, ground truth was whatever the radiologists said it was. I’m not casting aspersions on the NIH radiologists, as I am sure they are pretty good. I’m simply saying that the dataset’s ground truth is what the report says it is, not necessarily what the patient’s clinical condition was. As evidence, here are a few cells from the findings field of this dataset.

Findings field from the ChestX-ray14 dataset (representative)

In any case, the NIH radiologists more than a few times perhaps couldn’t tell either, or identified one finding as the cause of another (“infiltrate” and “pneumonia” mentioned side by side); and at the top you have the three fields “atelectasis,” “consolidation,” and “pneumonia” – is this concurrent pneumonia with consolidation and some atelectasis elsewhere, or is it “atelectasis vs. consolidation, cannot r/o pneumonia” (as radiologists, we say these things)? While the text miner purports to use several advanced NLP tools to avoid these kinds of problems, in practice it does not seem to do so (see addendum below; further confirmed by Jeremy Howard). Dr. Ng, if you read this, I have the utmost respect for you and your team, and I have learned from you. But I would love to know your rebuttal, and I would urge you to publish those results. Or perhaps someone should do it for reproducibility purposes.

Finally, I’m bringing up these points not to be a killjoy, but to be balanced. I think it is important to see this and prevent someone from making the really boneheaded decision of firing their radiologists to put in a computer diagnostic system (not in the US, but elsewhere) and realizing it doesn’t work after spending a vast sum of money on it. Startups competing in the field who do not have deep healthcare experience need to be aware of potential pitfalls in their products. I’m saying this because real people could be really hurt and impacted if we don’t manage this transition into AI well. Maybe all parties involved in medical image analysis should join us in taking the Hippocratic Oath, CEOs and developers included.

Thanks for reading, and feel free to comment here, comment on Twitter, or connect with me on LinkedIn: @drsxr

December addendum: ChestX-ray14 is based on the ChestX-ray8 database, described in a paper released on arXiv by Xiaosong Wang et al. The text mining is based upon a hand-crafted, rule-based parser using weak labeling designed to account for negation and uncertainty, not merely the application of regular expressions. Relationships between multiple labels are expressed, and while labels can stand alone, for the label “pneumonia” the most common associated label is “infiltrate.” A graph showing relationships between the different labels in the dataset is shown here (from Wang et al.):

Label map from the ChestX-ray14 dataset by Wang et al.

Pneumonia is purple, with 2,062 cases, and one can see that the largest association is with infiltration, then edema and effusion. A few associations with atelectasis also exist (thinner line).

The dataset methodology claims to account for these issues with up to 90% precision reported in ChestX-ray8, and similar precision inferred for ChestX-ray14.

“No Findings” (!) – two example images from the NIH ChestX-ray14 dataset, both labeled as such.

However, expert review of ChestX-ray14 does not support this. In fact, there are significant concerns that the labeling of the dataset is a good deal weaker. I’ll pick out just the two examples above, which show a patient (likely post right lobectomy, with attendant findings) classified as “No Findings,” and a lateral chest X-ray that doesn’t even belong in a study database of exclusively PA and AP films. These sorts of findings aren’t isolated – Dr. Luke Oakden-Rayner addresses this extensively in this post, from which his observations below are taken:

Dr. Luke Oakden-Rayner’s positive predictive values from visual inspection of 130 images, versus those reported for the ChestX-ray14 dataset

His final judgment is that the ChestX-ray14 dataset is not fit for training medical AI systems to do diagnostic work. He makes a compelling argument, but I think it is primarily a labeling problem, where the proposed 90% accuracy of the NLP data-mining techniques of Wang et al. does not hold up. ChestX-ray14 is a useful dataset for the images alone, but the labels are suspect. I would call upon the NIH group to address this and learn from the experience. In that light, I am surprised that the system did not do a great deal better than the human radiologists in Dr. Ng’s group’s study, and I don’t really have a good explanation for that.

The evaluation of CheXNet by these individuals should be recognized:

Luke Oakden-Rayner: CheXNet an in-depth review

Paras Lakhani: Dear Mythical Editor: Radiologist-level Pneumonia Detection in CheXNet

Bálint Botz: A Few Thoughts About CheXNet

Copyright © 2017