CheXNet – a brief evaluation

Chest radiograph from the ChestX-ray14 dataset processed with the deep dream algorithm trained on ImageNet

Andrew Ng released CheXNet yesterday on ArXiv (citation) and promoted it with a tweet, which caused a bit of a stir on the internet and on related radiology social media sites like Aunt Minnie.  Before radiologists throw away their board certifications and look for jobs as Uber drivers, here are a few comments on what this work does and does not do.

First off, from the machine learning perspective, the methodology checks out.  It uses a 121-layer DenseNet, which is a powerful convolutional neural network.  While code has not yet been provided, the DenseNet appears similar to implementations available in online code repositories, where the 121-layer variant is a pre-made format.  The 80/20 training/validation split seems reasonable (hat tip to my friend Kirk Borne), and random initialization, minibatches of 16 with oversampling of the positive class, and a progressively decaying validation loss are used.  Class activation mappings are used to visualize the areas of the image most indicative of the activated class (in this case, pneumonia).  This is an interesting technique that can provide some human-interpretable insight into the potentially opaque DenseNet.
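
For readers who want a concrete picture, here is a rough sketch of what that training setup might look like in PyTorch.  The training code has not been released, so the toy tensors, the choice of Adam, and the plateau-based learning-rate decay below are my own assumptions, not the authors' implementation.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset, WeightedRandomSampler

# Illustrative sketch only: the training code has not been released, so the toy
# data, Adam optimizer, and plateau-based learning-rate decay here are my own
# assumptions, loosely mirroring the setup described above (minibatches of 16,
# oversampling of the positive pneumonia class, and a schedule driven by the
# decaying validation loss).
labels = torch.randint(0, 2, (64,))                      # stand-in pneumonia labels (0/1)
images = torch.randn(64, 3, 224, 224)                    # stand-in chest X-ray tensors
train_dataset = TensorDataset(images, labels)

pos = max(int(labels.sum()), 1)
neg = max(len(labels) - pos, 1)
weights = [1.0 / pos if y == 1 else 1.0 / neg for y in labels.tolist()]
sampler = WeightedRandomSampler(weights, num_samples=len(labels), replacement=True)
train_loader = DataLoader(train_dataset, batch_size=16, sampler=sampler)

# Placeholder model; stands in for the DenseNet head sketched further below.
model = torch.nn.Sequential(torch.nn.Flatten(),
                            torch.nn.Linear(3 * 224 * 224, 1),
                            torch.nn.Sigmoid())
optimizer = torch.optim.Adam(model.parameters())
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(optimizer, mode="min", factor=0.1)
# After each validation pass one would call: scheduler.step(validation_loss)
```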

The last Fully Connected (FC) layer is replaced by a single output (only one class is being tested for – pneumonia) coupled to a sigmoid function (an activation function – see here) to give a probability between 0 and 1.
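
In code terms, the head swap described above might look something like the following sketch, built on torchvision's DenseNet-121.  This is my reconstruction from the paper's description, not the authors' (unreleased) code.

```python
import torch
import torch.nn as nn
from torchvision import models

# Minimal sketch of the head swap described above, using torchvision's
# DenseNet-121.  Reconstruction from the paper's description, not the
# authors' code (which has not been released).
model = models.densenet121(pretrained=False)        # randomly initialized
num_features = model.classifier.in_features         # 1024 for DenseNet-121
model.classifier = nn.Sequential(
    nn.Linear(num_features, 1),                     # single output: pneumonia
    nn.Sigmoid(),                                   # probability between 0 and 1
)

# Example forward pass on a dummy batch of two 224x224 RGB images:
probability = model(torch.randn(2, 3, 224, 224))    # shape (2, 1), values in (0, 1)
```

Training such a single-output head would typically use binary cross-entropy against the sigmoid probability.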

The test portion of the study consisted of 420 chest X-rays read by four radiologists, one of whom was a thoracic specialist.  They could choose among the 14 pathologies in the ChestX-ray14 dataset and read blind, without any clinical data.

So, a ROC curve was created, showing three radiologists performing similarly to each other and one outlier.  The radiologists lie slightly under the ROC curve of the CheXNet classifier.  But a miss is as good as a mile, so the claims of at-or-above-radiologist performance are accurate, because math.
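
For context on how such a comparison is usually drawn: the classifier's sigmoid scores trace out a full ROC curve, while each radiologist, having made binary calls, contributes only a single sensitivity/specificity operating point.  The sketch below uses synthetic scores and hypothetical radiologist points, since the study's underlying data are not available to me.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.metrics import roc_curve, auc

# Synthetic stand-ins -- the study's per-case scores and radiologist reads are
# not public, so random numbers are used purely to show the mechanics.
rng = np.random.default_rng(0)
y_true = rng.integers(0, 2, size=420)                           # ground-truth pneumonia labels
y_score = np.clip(y_true * 0.3 + rng.random(420) * 0.7, 0, 1)   # model sigmoid outputs

fpr, tpr, _ = roc_curve(y_true, y_score)
plt.plot(fpr, tpr, label=f"Classifier (AUC = {auc(fpr, tpr):.2f})")

# Each radiologist yields one (sensitivity, specificity) pair, which is why
# they appear as dots plotted against the model's continuous curve.
radiologists = {"Rad 1": (0.70, 0.90), "Rad 2": (0.68, 0.92)}   # hypothetical points
for name, (sens, spec) in radiologists.items():
    plt.plot(1 - spec, sens, "o", label=name)

plt.xlabel("1 - Specificity (false positive rate)")
plt.ylabel("Sensitivity (true positive rate)")
plt.legend()
plt.show()
```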

So that’s the study.  Now I will pick a few bones with it.

First, including only one thoracic radiologist matters if ground truth is defined as agreement of three out of the four radiologists.  General radiologists will be less specific than specialist radiologists, and that is one of the reasons we have moved to specialty-specific reads over the last 20 years.  If the three general rads disagreed with the thoracic rad, the thoracic rad’s ground truth would be discarded.  Think about this – you would take the word of the generalists over the specialist, despite the specialist’s greater training.  Even Google didn’t do this in their retinal machine learning paper.  Instead, Google used their three retinal specialists as ground truth and then looked at how the non-specialty ophthalmologists were able to evaluate that data and what that meant for the training dataset.  (Thanks, Melody!)

Second, the Wang ChestX-ray14 dataset was, I believe, data-mined from NIH radiology reports.  This means that for the dataset, ground truth was whatever the radiologists said it was.  I’m not casting aspersions on the NIH radiologists, as I am sure they are pretty good.  I’m simply saying that the dataset’s ground truth is what it says it is, not necessarily what the patient’s clinical condition was.  As proof of that, here are a few cells from the findings field of this dataset.  I’m not sure whether the presence of multiple classes strengthens or weakens Dr. Ng’s argument.

Findings field from the ChestX-ray14 dataset (representative)

In any case, the NIH radiologists more than a few times perhaps couldn’t tell either, or identified one finding as the cause of the other (infiltrate and pneumonia mentioned side by side), and at the top you have the three fields “atelectasis”, “consolidation”, and “pneumonia” – is this concurrent pneumonia with consolidation and some atelectasis elsewhere, or is it “atelectasis vs. consolidation, cannot rule out pneumonia” (as radiologists, we say these things)?  Perhaps I am missing something here and the classifier is making a stronger decision between the pathologies.  (See the addendum below – on further review, the dataset methodology claims to account for these issues, at up to 90% precision reported in ChestX-ray8 – see the label map below – and it is acknowledged that the dataset is weakly labeled.)  But without the sigmoid activation percentages for each of the 14 classes, I can’t tell.  Dr. Ng, if you read this, I have the utmost respect for you and your team, and I have learned from you.  But I would love to know your rebuttal, and I would urge you to publish those results.  Or perhaps someone should do it for reproducibility purposes.
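
To be concrete about what I mean by the sigmoid activation percentages for each of the 14 classes: a multi-label head with 14 independent sigmoid outputs would report one probability per pathology.  The sketch below is my own construction to illustrate that, not the authors' code.

```python
import torch
import torch.nn as nn
from torchvision import models

# The 14 ChestX-ray14 labels.
CLASSES = ["Atelectasis", "Cardiomegaly", "Effusion", "Infiltration", "Mass",
           "Nodule", "Pneumonia", "Pneumothorax", "Consolidation", "Edema",
           "Emphysema", "Fibrosis", "Pleural_Thickening", "Hernia"]

# Hypothetical multi-label head: 14 independent sigmoid outputs, one per label
# (not a softmax -- the labels are not mutually exclusive).
multi_model = models.densenet121(pretrained=False)
multi_model.classifier = nn.Sequential(
    nn.Linear(multi_model.classifier.in_features, len(CLASSES)),
    nn.Sigmoid(),
)

probs = multi_model(torch.randn(1, 3, 224, 224))     # shape (1, 14)
for name, p in zip(CLASSES, probs[0].tolist()):
    print(f"{name}: {p:.2f}")                        # one probability per pathology
```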

Finally, I’m bringing up these points not to be a killjoy, but to be balanced.  I think it is important to see this and prevent an administrator from making the really boneheaded decision of firing their radiologists to put in a computer diagnostic system (not in the US, but elsewhere) and realizing it doesn’t work after spending a vast sum of money on it.  Startups competing in this field that do not have healthcare experience need to be aware of these pitfalls in their products.  I’m saying this because real people could be really hurt if we don’t manage this transition to AI well.  Maybe all parties involved in medical image analysis should join us in taking the Hippocratic Oath.

Thanks for reading, and feel free to comment here, reach out on Twitter, or connect with me on LinkedIn: @drsxr

Addendum: ChestX-ray14 is based on the ChestX-ray8 database, which is described in a paper released on ArXiv by Xiaosong Wang et al. The text mining is based upon a hand-crafted rule-based parser designed to account for “negation & uncertainty”, and is not merely an application of regular expressions. Relationships between multiple labels are expressed, and while labels can stand alone, for the label ‘pneumonia’ the most common associated label is ‘infiltrate’.  A graph showing the relationships between the different labels in the dataset is shown below (from Wang et al.).
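
As a toy illustration of why negation handling matters for mined labels (the actual NIH parser is far more sophisticated than this, as noted above), consider the following sketch of my own:

```python
import re

# Toy illustration only: the NIH pipeline is a hand-crafted rule-based parser,
# not this regex.  This merely shows the kind of negated or hedged phrasing
# that report-based label mining has to handle.
NEGATED_PNEUMONIA = re.compile(
    r"\b(no|without|negative for|cannot rule out|r/o)\b[^.]*\bpneumonia\b",
    re.IGNORECASE,
)

def mine_pneumonia_label(report_text: str) -> int:
    """Return 1 if 'pneumonia' appears un-negated in the report, else 0."""
    if "pneumonia" not in report_text.lower():
        return 0
    return 0 if NEGATED_PNEUMONIA.search(report_text) else 1

print(mine_pneumonia_label("Right lower lobe consolidation, likely pneumonia."))   # 1
print(mine_pneumonia_label("No focal consolidation; no evidence of pneumonia."))   # 0
```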

Label map from the ChestX-ray14 dataset by Wang et al.

Pneumonia is purple with 2062 cases, and one can see the largest association is with infiltration, then edema and effusion.  A few associations with atelectasis also exist (thinner line).

Some additional edits and clarifications were made to this piece as it has garnered quite a few hits.


Dear Doctor (a letter to a doctor)

Physician reviewing records

This is a post from a person I interact with on social media. It has been heavily modified to preserve anonymity. I have obtained express consent from this person to share their views here.

Dear Dr. — Thanks for seeing my child today & conducting a comprehensive exam. We were pleased with your care & the recommendations received.

However, please work with your staff on:
1 – Don’t tell me ‘1 hour’ if I ask how long the appointment will last and then expect me to be happy after more than three.  Yes, I do know I will have to wait – a range would be helpful.
2 – When called to reconfirm by your staff, I asked if they had all of our reports sent 2 months ago, which were printed for you (it’s a little complicated).  Don’t have them tell me ‘yes’ when the answer was ‘NO’.  Putting a ‘see me’ post-it note on the file from a staff member who is out of the office is not helpful.
3 – You are excellent in what you do.  I’m happy to pay for your knowledge and expertise, but not for your data entry skills (see above).
4 – When I explain to your staff that my child is uncomfortable going to physicians’ offices and I need to prepare him for what to expect, please don’t giggle.  Is this the first time your staff has been asked this question?  I can’t believe that.

Thank you.

 

Comments:

-A friend once sent a bill to his doctor for making him wait 3 hours.

-I hear you. Waiting forever is the worst! Some health professionals need to brush up on their interpersonal skills.

-(We) were just talking about the medical practitioners we’ve left over the years…because of their staff!!

-…staff was really frustrating. …tried to give feedback constructively and professionally but the attitude was unreal.

 

Can anyone not relate to this?  (Unless you are a practicing physician or administrator and you are so busy you have no time to go to the doctor!)  I view this as a systems failure.  The processes to make sure that this patient had an excellent experience were not there – the doctor seems to be doing all he can to make the experience great (except for the ubiquitous data-entry EMR curse that patients hate as much as physicians do!), but the staff undermines his efforts, and this visit goes squarely into the negative category.  Regardless of where you want to place accountability (the staff, the physician, the office manager, the administrator), the root cause of this negative experience could be examined and improved.

What the patient (patient’s parent/responsible party) wanted in this circumstance was:

  1. Accurate scheduling (responsible booking, integration with MD’s calendar)
  2. Accurate information (saying “you should block off your afternoon, but we will try to get you out in an hour” would go a long way here)
  3. No data entry (hire a scribe or switch your EMR system!)
  4. Transmissible review of information by a staff member (no “see me” post-its – that’s poor continuity of care)
  5. To be treated with respect and dignity (NO giggling or attitude).

The last item is the most concerning – I know that we are starting to recognize ‘compassion fatigue’ and ‘burnout’ in docs in increasing numbers, and it almost certainly crosses over to support staff.  But these offending staff members need to be trained and educated, or shown the door.  Someone else’s discomfort is never cause for a healthcare staffer’s entertainment.  Better to create systems and processes that rein in the chaos, allow these staffers to feel less besieged, and give a level of care that supports the hard-working doctor’s efforts rather than negating them.