Andrew Ng released CheXNet yesterday on arXiv (citation) and promoted it with a tweet, which caused a bit of a stir on the internet and on radiology social media sites like Aunt Minnie. Before radiologists throw away their board certifications and look for jobs as Uber drivers, a few comments on what this study does and does not do.
First off, from the machine learning perspective, the methodology checks out. CheXNet uses a 121-layer DenseNet, a powerful convolutional neural network. While code has not yet been provided, the model appears similar to publicly available implementations, where DenseNet-121 is a standard pre-made configuration. The 80/20 training/validation split seems reasonable (per my friend, Kirk Borne), and random initialization, minibatches of 16 with oversampling of the positive class, and a progressively decaying validation loss are utilized. Class activation mappings are used to visualize the areas of the image most indicative of the activated class (in this case, pneumonia). This is an interesting technique that can provide some human-interpretable insight into the otherwise opaque DenseNet.
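For readers unfamiliar with class activation mapping, the idea fits in a few lines: the heatmap is just the final convolutional feature maps, weighted by the output layer's weights for the class of interest. This is a toy NumPy sketch of that idea with random stand-in values, not the paper's implementation; the 1024 feature maps and 7x7 spatial size are assumptions matching a typical DenseNet-121 final block.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for the final conv block's output: 1024 feature maps of size 7x7
feature_maps = rng.random((1024, 7, 7))
# Stand-in for the learned weights of the single-output (pneumonia) layer
class_weights = rng.normal(size=1024)

# CAM = sum over k of w_k * F_k; in practice this 7x7 map is then
# upsampled and overlaid on the original chest X-ray
cam = np.tensordot(class_weights, feature_maps, axes=1)  # shape (7, 7)
cam = (cam - cam.min()) / (cam.max() - cam.min() + 1e-8)  # normalize to [0, 1]
print(cam.shape)
```

The weighted sum lights up the spatial regions whose features pushed the pneumonia output high, which is what makes the resulting heatmap human-interpretable.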
The last fully connected (FC) layer is replaced by a single output (only one class is being tested for: pneumonia) coupled to a sigmoid function (an activation function; see here) to give a probability between 0 and 1.
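Since the CheXNet code has not been released, here is a minimal NumPy sketch of that final-layer replacement: a single linear output over the network's pooled features, squashed by a sigmoid into a probability. The 1024-dimensional feature size matches DenseNet-121's last layer; the weights here are random placeholders, not the trained model.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    # Maps any real number into (0, 1), i.e. a probability
    return 1.0 / (1.0 + np.exp(-z))

# Random stand-ins for the learned single-output layer
W = rng.normal(scale=0.01, size=(1024, 1))
b = np.zeros(1)

# A minibatch of 16 pooled feature vectors, as in the paper's batch size
features = rng.normal(size=(16, 1024))
probs = sigmoid(features @ W + b)  # one pneumonia probability per image
print(probs.shape)
```

A single sigmoid output treats the task as binary (pneumonia vs. not); a 14-class multi-label version would simply use 14 sigmoid outputs, one per pathology.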
The test portion of the study consisted of 420 chest X-rays read by four radiologists, one of whom was a thoracic specialist. They could choose between the 14 pathologies in the ChestX-ray14 dataset, reading blind without any clinical data.
So, a ROC curve was created, showing three radiologists clustered close to each other and one outlier. The radiologists' operating points lie slightly under the ROC curve of the CheXNet classifier. But a miss is as good as a mile, so the claims of at-or-above-radiologist performance are accurate, because math.
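To make that comparison concrete: a classifier traces a full ROC curve, while each radiologist is a single (sensitivity, false positive rate) operating point. "Slightly under the curve" means the model achieves higher sensitivity at the radiologist's false positive rate. The sketch below uses synthetic stand-in labels and scores, not the study's 420 films, and a hypothetical radiologist operating point.

```python
import numpy as np

rng = np.random.default_rng(42)
# Synthetic stand-ins: 420 ground-truth labels and mildly informative scores
y_true = rng.integers(0, 2, size=420)
scores = 0.3 * y_true + rng.normal(0.5, 0.25, size=420)

# AUC via the rank-sum (Mann-Whitney) formulation: the probability that a
# random positive case scores higher than a random negative case
pos = scores[y_true == 1]
neg = scores[y_true == 0]
auc = (pos[:, None] > neg[None, :]).mean()

# Hypothetical radiologist operating point (not the paper's numbers)
rad_sens, rad_fpr = 0.70, 0.25
# Model threshold matched to the radiologist's false positive rate
threshold = np.quantile(neg, 1 - rad_fpr)
model_sens_at_rad_fpr = (pos >= threshold).mean()
print(auc, model_sens_at_rad_fpr)
```

If `model_sens_at_rad_fpr` exceeds `rad_sens`, the radiologist's point lies under the curve, which is the entire basis of the "at or above radiologist" claim.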
So that’s the study. Now let me pick a few bones with it.
First, including only one thoracic radiologist matters if ground truth is defined as agreement of three out of four radiologists. General radiologists read less specifically than subspecialists, which is one of the reasons we have moved to specialty-specific reads over the last 20 years. If the three general radiologists disagreed with the thoracic radiologist, the thoracic radiologist’s read would be discarded as ground truth. Think about this: you would take the word of the generalists over the specialist, despite the specialist’s greater training. Even Google didn’t do this in their retinal machine learning paper. Instead, Google used their three retinal specialists as ground truth and then looked at how the non-specialty ophthalmologists evaluated that data and what it meant for the training dataset. (Thanks, Melody!)
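The majority-vote problem fits in a few lines of code: with 3-of-4 agreement as ground truth, the thoracic specialist's read is discarded whenever the three generalists agree against it. The labels below are hypothetical, chosen only to illustrate the mechanism.

```python
from collections import Counter

def majority_label(reads):
    """Ground truth = the label chosen by at least 3 of the 4 readers."""
    label, count = Counter(reads).most_common(1)[0]
    return label if count >= 3 else None  # no consensus otherwise

# Hypothetical reads: three generalists agree, the specialist dissents
generalists = ["atelectasis", "atelectasis", "atelectasis"]
thoracic_specialist = ["pneumonia"]

truth = majority_label(generalists + thoracic_specialist)
print(truth)  # "atelectasis" -- the specialist's read is overruled
```

No weighting by expertise exists anywhere in this scheme, which is exactly the objection: the subspecialist's vote counts the same as each generalist's.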
Second, the Wang ChestX-ray14 dataset was, I believe, text-mined from NIH radiology reports. This means that for this dataset, ground truth was whatever the radiologists said it was. I’m not casting aspersions on the NIH radiologists, as I am sure they are quite good. I’m simply saying that the dataset’s ground truth is what its reports say it is, not necessarily what the patient’s clinical condition was. As evidence, here are a few cells from the findings field of the dataset. I’m not sure whether the multiple classes strengthen or weaken Dr. Ng’s argument.
In any case, the NIH radiologists more than a few times perhaps couldn’t tell either, or identified one finding as the cause of another (Infiltrate and Pneumonia mentioned side by side). And at the top you have the three fields “atelectasis,” “consolidation,” and “pneumonia”: is this concurrent pneumonia with consolidation and some atelectasis elsewhere, or is it “atelectasis vs. consolidation, cannot r/o pneumonia” (as radiologists, we say these things)? Perhaps I am missing something here and the classifier is making a stronger decision between the pathologies. (See the addendum below: on further review, the dataset methodology claims to account for these issues, at up to 90% precision reported in ChestX-ray8 (see the label map below), and it is acknowledged that the dataset is weakly labeled.) But without the sigmoid activation percentages for each of the 14 classes, I can’t tell. Dr. Ng, if you read this, I have the utmost respect for you and your team, and I have learned from you. But I would love to know your rebuttal, and I would urge you to publish those results. Or perhaps someone should do it for reproducibility’s sake.
Finally, I’m bringing up these points not to be a killjoy, but to be balanced. It is important to see this clearly and prevent an administrator from making the boneheaded decision to fire their radiologists and put in a computer diagnostic system (not in the US, but elsewhere), only to realize it doesn’t work after spending a vast sum of money on it. Startups competing in this field without healthcare experience need to be aware of these pitfalls in their products. I say this because real people could be hurt if we don’t manage the transition to AI well. Maybe all parties involved in medical image analysis should join us in taking the Hippocratic Oath.
Thanks for reading. Feel free to comment here, reach me on Twitter at @drsxr, or connect with me on LinkedIn.
Addendum: ChestX-ray14 is based on the ChestX-ray8 database, described in a paper released on arXiv by Xiaosong Wang et al. The text mining is based on a hand-crafted, rule-based parser designed to account for “negation & uncertainty,” not merely the application of regular expressions. Relationships between multiple labels are expressed, and while labels can stand alone, for the label “pneumonia” the most common associated label is “infiltrate.” A graph showing the relationships between the different labels in the dataset is here (from Wang et al.).
Pneumonia is shown in purple with 2,062 cases, and the largest association is with infiltration, then edema and effusion. A few associations with atelectasis also exist (thinner lines).
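To illustrate what "rule-based with negation handling" means in practice, here is a deliberately tiny sketch. This is not the pipeline used to build ChestX-ray8; the label list and negation cues are my own placeholders, and the real parser also models uncertainty and inter-label relationships.

```python
import re

# Toy label vocabulary and negation/uncertainty cues (illustrative only)
LABELS = ["pneumonia", "infiltrate", "atelectasis", "consolidation"]
NEGATION_CUES = re.compile(r"\b(no|without|negative for|r/o|cannot rule out)\b")

def extract_labels(report: str):
    """Return labels mentioned in the report, unless a negation or
    uncertainty cue appears earlier in the same sentence."""
    found = set()
    for sentence in report.lower().split("."):
        for label in LABELS:
            idx = sentence.find(label)
            if idx == -1:
                continue
            # Drop the label if a cue precedes it within the sentence
            if not NEGATION_CUES.search(sentence[:idx]):
                found.add(label)
    return sorted(found)

print(extract_labels("Right lower lobe infiltrate. No pneumonia."))
# ['infiltrate']
```

Even this toy version shows why mined labels are "weak": hedged radiology phrasing like "cannot r/o pneumonia" gets forced into a binary keep-or-drop decision, which is precisely the ambiguity discussed above.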
Some additional edits and clarifications were made to this piece as it has garnered quite a few hits.
Copyright © 2017