Andrew Ng released CheXNet on ArXiv yesterday (citation) and promoted it with a tweet, which caused a bit of a stir on the internet and on radiology social media sites like Aunt Minnie. Before radiologists throw away their board certifications and look for jobs as Uber drivers, a few comments on what this paper does and does not do.
First off, from the machine learning perspective, the methodology checks out. CheXNet uses a 121-layer DenseNet, a powerful convolutional neural network. While code has not yet been released, the network appears similar to publicly available implementations, where the 121-layer configuration is a standard, pre-built variant. The 80/20 training/validation split seems reasonable (per my friend Kirk Borne), and the authors use random initialization, minibatches of 16 with oversampling of the positive class, and a progressively decaying validation loss. Class activation maps are used to visualize the areas of the image most indicative of the activated class (in this case, pneumonia). This is an interesting technique that provides some human-interpretable insight into the otherwise opaque DenseNet.
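For readers who want to see what a setup like that looks like in practice, here is a minimal sketch assuming a PyTorch/torchvision implementation. The code has not been released, so everything below (library choice, hyperparameters, the placeholder labels) is my own illustration, not the authors' implementation.

```python
# Illustrative sketch only - CheXNet's code is not public, so these choices
# (PyTorch/torchvision, Adam, the placeholder labels) are assumptions.
import torch
from torchvision import models
from torch.utils.data import WeightedRandomSampler

model = models.densenet121()  # the standard, pre-built 121-layer DenseNet

# One common way to oversample a rare positive class: weight each example
# inversely to its class frequency and let the sampler draw with replacement.
labels = torch.randint(0, 2, (1000,))          # placeholder labels for illustration
class_counts = torch.bincount(labels, minlength=2).float()
weights = 1.0 / class_counts[labels]
sampler = WeightedRandomSampler(weights, num_samples=len(weights))
# train_loader = DataLoader(train_dataset, batch_size=16, sampler=sampler)  # hypothetical dataset

optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
# Drop the learning rate whenever the validation loss plateaus.
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(optimizer, mode="min", factor=0.1)
```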
The last fully connected (FC) layer is replaced by a single output (only one class, pneumonia, is being tested for) coupled to a sigmoid function (an activation function – see here) to give a probability between 0 and 1.
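Concretely, that head swap (and the class activation maps mentioned above) can be sketched as follows. Again, this assumes a torchvision DenseNet and is my reconstruction, not the authors' code:

```python
import torch
import torch.nn as nn
from torchvision import models

model = models.densenet121()
num_features = model.classifier.in_features        # input size of the final FC layer
model.classifier = nn.Linear(num_features, 1)      # single output: pneumonia vs. not

x = torch.randn(1, 3, 224, 224)                    # dummy chest X-ray, ImageNet-sized
logit = model(x)
probability = torch.sigmoid(logit)                 # squashed to a value between 0 and 1

# Class activation map sketch: weight the final convolutional feature maps by
# the classifier weights to highlight the regions driving the prediction.
features = torch.relu(model.features(x))           # shape (1, 1024, 7, 7)
cam = torch.einsum("c,bchw->bhw", model.classifier.weight[0], features)
```

In training, one would typically pair the single logit with a binary cross-entropy loss (e.g., nn.BCEWithLogitsLoss) rather than baking the sigmoid into the network itself.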
The test portion of the study consisted of 420 chest X-rays read by four radiologists, one of whom was a thoracic subspecialist. They could choose among the 14 pathologies in the ChestX-ray14 dataset, reading blind without any clinical data.
So, an ROC curve was created, showing three radiologists performing similarly to each other and one outlier. The radiologists' operating points lie slightly under the ROC curve of the CheXNet classifier. But a miss is as good as a mile, so the claim of at-or-above-radiologist performance is technically accurate, because math. Addendum – even though this claim would likely not reach statistical significance. Thanks, Luke.
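For readers unfamiliar with the comparison being made, here is a sketch of how a classifier's ROC curve is compared against radiologists, who each give a binary call and therefore appear as single operating points rather than curves. The labels and scores below are synthetic stand-ins, not the study's data:

```python
# Synthetic illustration of the ROC comparison - not the study's data.
import numpy as np
from sklearn.metrics import roc_curve, auc

rng = np.random.default_rng(0)
y_true = rng.integers(0, 2, size=420)                                     # ground-truth labels
y_score = np.clip(y_true * 0.3 + rng.normal(0.5, 0.25, size=420), 0, 1)   # model probabilities

fpr, tpr, _ = roc_curve(y_true, y_score)
print(f"Classifier AUC: {auc(fpr, tpr):.3f}")

# Each radiologist is a single (false positive rate, sensitivity) point.
radiologist_calls = rng.integers(0, 2, size=(4, 420))                     # 4 hypothetical readers
for i, calls in enumerate(radiologist_calls):
    sens = np.sum((calls == 1) & (y_true == 1)) / np.sum(y_true == 1)
    fpr_r = np.sum((calls == 1) & (y_true == 0)) / np.sum(y_true == 0)
    print(f"Radiologist {i + 1}: FPR = {fpr_r:.2f}, sensitivity = {sens:.2f}")
```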
So that's the study. Now let me pick a few bones with it.
First, including only one thoracic radiologist matters if ground truth is defined as agreement of three out of four radiologists. General radiologists will be less specific than subspecialists, and that is one of the reasons we have moved to specialty-specific reads over the last 20 years. If the three general radiologists disagreed with the thoracic radiologist, the thoracic radiologist's read would be discarded as ground truth. Think about this – you would take the word of the generalists over the specialist, despite the specialist's greater training. Google didn't do this in their retinal machine learning paper. Instead, Google used their three retinal specialists as ground truth and then looked at how the non-specialty ophthalmologists evaluated that data and what it meant for the training dataset. (Thanks, Melody!)
Second, the Wang ChestX-ray14 dataset is, I believe, a dataset data-mined from NIH radiology reports. This means that, for the dataset, ground truth was whatever the reporting radiologist said it was. I'm not casting aspersions on the NIH radiologists, as I am sure they are quite good. I'm simply saying that the dataset's ground truth is what the report says it is, not necessarily what the patient's clinical condition was. As evidence, here are a few cells from the findings field of this dataset. I'm not sure whether the multiple classes strengthen or weaken Mr. Ng's argument.
In any case, more than a few times the NIH radiologists perhaps couldn't tell either, or identified one finding as the cause of another (Infiltrate and Pneumonia mentioned side by side), and at the top you have an "atelectasis vs consolidation vs pneumonia" (as radiologists, we say these things). Perhaps I am missing something here and the classifier is making a stronger distinction between the pathologies. But without the sigmoid activation percentages for each of the 14 classes, I can't tell. Andrew, if you read this, I have the utmost respect for you and your team, and I have learned from you. But I would love to hear your rebuttal, and I would urge you to publish those results. Or perhaps someone should do it for reproducibility's sake.
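To make concrete what I'm asking for: a multi-label head over the ChestX-ray14 findings yields an independent sigmoid probability per class, and those per-class probabilities on the test set are what I would like to see published. A sketch, again assuming a torchvision DenseNet, with made-up numbers and an illustrative label ordering:

```python
import torch
import torch.nn as nn
from torchvision import models

# The 14 ChestX-ray14 findings (ordering here is illustrative).
FINDINGS = ["Atelectasis", "Cardiomegaly", "Effusion", "Infiltration", "Mass",
            "Nodule", "Pneumonia", "Pneumothorax", "Consolidation", "Edema",
            "Emphysema", "Fibrosis", "Pleural_Thickening", "Hernia"]

model = models.densenet121()
model.classifier = nn.Linear(model.classifier.in_features, len(FINDINGS))

x = torch.randn(1, 3, 224, 224)                 # dummy image
probs = torch.sigmoid(model(x))[0]              # one independent probability per finding
for name, p in zip(FINDINGS, probs.tolist()):
    print(f"{name:>20}: {p:.2f}")
```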
Addendum: Other radiologists with machine learning chops whom I respect are also concerned about how ground truth was determined in the ChestX-ray14 dataset. This needs to be looked into further.
Finally, I'm bringing up these points not to be a killjoy, but to be balanced. It is important to see this clearly so that an administrator (not in the US, but perhaps elsewhere) doesn't make the boneheaded decision to fire their radiologists, install a computer diagnostic system, and only then realize it doesn't work after spending a vast sum of money on it. Startups competing in this field without healthcare experience need to be aware of these pitfalls in their products. I say this because real people could be hurt if we don't manage the transition into AI well.
Thanks for reading, and feel free to comment here or reach me on Twitter or LinkedIn: @drsxr