CheXNet – a brief evaluation

Chest radiograph from the ChestX-ray14 dataset, processed with the deep dream algorithm trained on ImageNet – our AI & deep learning future

1/25/18 NOTE: Since the November release of the CheXNet paper on arXiv, there has been a healthy and extensive online discussion on Twitter, Reddit, and blogs. The Stanford paper has undergone at least two revisions with some substantial modifications, most importantly the replacement of ROC curves with F1 scores and a bootstrap calculation of significance. Some details about the methodology which were not released in the original version have since come out, particularly the “re-labeling” of ground truth by Stanford radiologists. My comment about the thoracic specialist has been completely borne out by the further release of information, and the problems with ChestX-ray14’s labeling (the reason the Stanford docs re-labeled) are now well known.

The investigation and discussion of this paper has been spearheaded by Luke Oakden-Rayner, who has spent months corresponding with the authors and discussing the paper.  For further information, see below.

The discussion on CheXNet appears to be over, and there has been a great deal of collective learning from it.  The Stanford group should be lauded for their willingness to engage in open peer review and to modify their paper substantially in response.  There is no question that a typical 18-24 month process of review and discussion was fast-tracked into the last two months.  Relevant blog links are below, after my December addendum.  This will be my last update on this post, as it is “not so brief” any longer!


Andrew Ng released CheXNet yesterday on arXiv (citation) and promoted it with a tweet which caused a bit of a stir on the internet and on radiology social media sites like Aunt Minnie.  Before radiologists throw away their board certifications and look for jobs as Uber drivers, a few comments on what this does and does not do.

First off, from the machine learning perspective, the methodology checks out.  It uses a 121-layer DenseNet, which is a powerful convolutional neural network.  While code has not yet been provided, the DenseNet appears similar to publicly available implementations, where the 121-layer variant is a standard, pre-built configuration.  An 80/20 split for training/validation seems pretty reasonable (from my friend, Kirk Borne).  Random initialization, minibatches of 16 with oversampling of the positive class, and a progressively decaying validation loss are utilized, all of which are pretty standard.  Class activation mappings are used to visualize the areas in the image most indicative of the activated class (in this case, pneumonia).  This is an interesting technique that can provide some human-interpretable insight into the otherwise opaque DenseNet.
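
For readers who want to see what a class activation map actually is, here is a minimal sketch using torchvision’s DenseNet-121. This is the generic CAM recipe (weighting the final convolutional feature maps by the classifier weights for one class), not the authors’ code, which had not been released at the time of writing; the attribute names (features, classifier) are torchvision’s.

```python
# Minimal CAM sketch, assuming torchvision's densenet121 (random weights here;
# older torchvision versions use pretrained=False instead of weights=None).
# This is the generic CAM recipe, not CheXNet's released code.
import torch
import torch.nn.functional as F
from torchvision import models

model = models.densenet121(weights=None)
model.eval()

def class_activation_map(x, class_idx=0):
    """Return a CAM heatmap for a single image tensor x of shape [1, 3, H, W]."""
    feats = F.relu(model.features(x))                  # [1, 1024, h, w] final feature maps
    weights = model.classifier.weight[class_idx]       # [1024] FC weights for one class
    cam = torch.einsum("c,nchw->nhw", weights, feats)  # weighted sum over channels
    cam = F.interpolate(cam.unsqueeze(1), size=x.shape[-2:],
                        mode="bilinear", align_corners=False)
    cam = cam - cam.min()
    return (cam / (cam.max() + 1e-8)).squeeze()        # normalized to [0, 1]

heatmap = class_activation_map(torch.randn(1, 3, 224, 224))
print(heatmap.shape)  # torch.Size([224, 224])
```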

The last fully connected (FC) layer is replaced by a single output (only one class is being tested for – pneumonia) coupled to a sigmoid function (an activation function – see here) to give a probability between 0 and 1.  Again, pretty standard for binary classification.  The multiclass portion of the study was performed separately, later.
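
A rough sketch of that head swap, assuming a standard torchvision DenseNet-121 and a weighted binary cross-entropy loss as a stand-in for the paper’s positive-class handling; the pos_weight value, learning rate, and random data below are placeholders, not the paper’s settings.

```python
# Sketch of the binary head swap, assuming torchvision's densenet121.
# The pos_weight, learning rate, and random data are illustrative placeholders.
import torch
import torch.nn as nn
from torchvision import models

model = models.densenet121(weights=None)
model.classifier = nn.Linear(model.classifier.in_features, 1)     # single pneumonia logit

criterion = nn.BCEWithLogitsLoss(pos_weight=torch.tensor([5.0]))  # upweight positives (illustrative)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)

images = torch.randn(16, 3, 224, 224)             # minibatch of 16, as described
labels = torch.randint(0, 2, (16, 1)).float()     # 1 = pneumonia, 0 = not

optimizer.zero_grad()
logits = model(images)
loss = criterion(logits, labels)
loss.backward()
optimizer.step()

probabilities = torch.sigmoid(logits)             # per-image probability in [0, 1]
```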

The test portion of the study was 420 chest X-rays read by four radiologists, one of whom was a thoracic specialist.  They could choose between the 14 pathologies in the ChestX-ray14 dataset, reading blind without any clinical data.

So, an ROC curve was created, showing three radiologists similar to each other and one outlier.  The radiologists lie slightly under the ROC curve of the CheXNet classifier.  But a miss is as good as a mile, so the claims of at-or-above-radiologist performance are accurate, because math.  As Luke Oakden-Rayner points out, this would probably not pass statistical muster.
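
For context, here is a hedged sketch of the comparison the figure makes: a continuous ROC curve for the model against single operating points for the radiologists. All scores and radiologist sensitivities/specificities below are invented placeholders, not the study’s numbers.

```python
# Fabricated scores and radiologist operating points, purely to show the
# shape of the comparison (a curve for the model, single points for humans).
import numpy as np
import matplotlib.pyplot as plt
from sklearn.metrics import roc_auc_score, roc_curve

rng = np.random.default_rng(0)
y_true = rng.integers(0, 2, 420)                                    # 420 test films
y_score = np.clip(0.3 * y_true + rng.normal(0.5, 0.25, 420), 0, 1)  # fake model scores

fpr, tpr, _ = roc_curve(y_true, y_score)
plt.plot(fpr, tpr, label=f"model (AUC = {roc_auc_score(y_true, y_score):.2f})")

# Each radiologist is one (1 - specificity, sensitivity) point, not a curve.
radiologists = {"rad 1": (0.10, 0.62), "rad 2": (0.12, 0.65),
                "rad 3": (0.11, 0.60), "thoracic rad": (0.08, 0.70)}
for name, (fp, tp) in radiologists.items():
    plt.scatter(fp, tp, label=name)

plt.xlabel("1 - specificity")
plt.ylabel("sensitivity")
plt.legend()
plt.show()
```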

So that’s the study.  Now, I will pick some bones with the study.

First, including only one thoracic radiologist matters if you are going to define ground truth as the agreement of three out of four radiologists.  (Addendum: and, for statistical and methodological reasons discussed online, the 3-out-of-4 implementation was initially flawed as scored.)  General radiologists will be less specific than specialist radiologists, and that is one of the reasons why we have moved to specialty-specific reads over the last 20 years.  If the three general rads disagreed with the thoracic rad, the thoracic rad’s ground truth would be discarded.  Think about this: you would take the word of the generalist over the specialist, despite the specialist’s greater training.  (1/25 addendum: proven right on this one.  The thoracic radiologist is an outlier with a higher F1 score.)  Even Google didn’t do this in their retinal machine learning paper.  Instead, Google used their three retinal specialists as ground truth and then looked at how the non-specialty ophthalmologists were able to evaluate that data and what it meant for the training dataset.  (Thanks, Melody!)  Nevertheless, all rads lie reasonably along the same ROC curve, so methodologically it checks out: the radiologists are likely of similar ability but with different sensitivity/specificity trade-offs.
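
To make the ground-truth point concrete, a toy illustration (the reads are invented, not the study’s data): with a 3-of-4 majority rule, the specialist’s dissenting read simply disappears from the label.

```python
# Toy illustration, not the study's scoring code. 1 = pneumonia, 0 = no pneumonia.
reads = {"general_1": 1, "general_2": 1, "general_3": 1, "thoracic": 0}
majority_label = int(sum(reads.values()) >= 3)
print(majority_label)  # 1 - the thoracic specialist's dissenting read never surfaces
```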

Second, the Wang ChestX-ray14 dataset was data-mined from NIH radiology reports.  This means that, for the dataset, ground truth was whatever the radiologists said it was.  I’m not casting aspersions on the NIH radiologists, as I am sure they are pretty good.  I’m simply saying that the dataset’s ground truth is what the report says it is, not necessarily what the patient’s clinical condition was.  As proof of that, here are a few cells from the findings field of this dataset.

Findings field from the ChestX-ray14 dataset (representative)

In any case, the NIH radiologists more than a few times perhaps couldn’t tell either, or identified one finding as the cause of the other (infiltrate and pneumonia mentioned side by side).  And at the top you have the three fields “atelectasis”, “consolidation”, and “pneumonia” – is this concurrent pneumonia with consolidation and some atelectasis elsewhere, or is it “atelectasis vs consolidation, cannot r/o pneumonia” (as radiologists, we say these things)?  While the text miner purports to use several advanced NLP tools to avoid these kinds of problems, in practice it does not seem to do so (see the addendum below; further addendum confirmed by Jeremy Howard).  Dr. Ng, if you read this, I have the utmost respect for you and your team, and I have learned from you.  But I would love to know your rebuttal, and I would urge you to publish those results.  Or perhaps someone should do it for reproducibility purposes.
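
To illustrate why that kind of sentence is hard for a labeler, here is a deliberately naive rule-based sketch with a crude negation window. It is not the NIH pipeline (which is reported to build on tools like DNorm and MetaMap plus hand-crafted negation/uncertainty rules); it just shows how a hedged dictation can collapse into firm labels.

```python
# Deliberately naive rule-based labeler with a crude negation window.
# Not the NIH pipeline - just an illustration of the failure mode.
import re

LABELS = ["atelectasis", "consolidation", "pneumonia", "infiltrate", "effusion"]
NEGATIONS = [r"\bno\b", r"\bwithout\b", r"\bcannot r/o\b", r"\bnegative for\b"]

def mine_labels(report: str) -> set:
    """Return labels mentioned in the report and not (naively) negated."""
    text = report.lower()
    found = set()
    for label in LABELS:
        for match in re.finditer(label, text):
            window = text[max(0, match.start() - 30):match.start()]  # look-behind window
            if not any(re.search(neg, window) for neg in NEGATIONS):
                found.add(label)
    return found

print(mine_labels("Atelectasis vs consolidation, cannot r/o pneumonia."))
# Prints {'atelectasis', 'consolidation'}: the radiologist's hedge becomes two
# firm positive labels, and the pneumonia mention is silently dropped.
```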

Finally, I’m bringing up these points not to be a killjoy, but to be balanced.  I think it is important to see this and to prevent someone from making a really boneheaded decision: firing their radiologists to put in a computer diagnostic system (not in the US, but elsewhere) and realizing it doesn’t work only after spending a vast sum of money on it.  Startups competing in the field who do not have deep healthcare experience need to be aware of potential pitfalls in their products.  I’m saying this because real people could be really hurt if we don’t manage this transition to AI well.  Maybe all parties involved in medical image analysis should join us in taking the Hippocratic Oath, CEOs and developers included.

Thanks for reading, and feel free to comment here, reach out on Twitter, or connect with me on LinkedIn: @drsxr

December Addendum: ChestX-ray14 is based on the ChestX-ray8 database, which is described in a paper released on arXiv by Xiaosong Wang et al.  The text mining is based upon a hand-crafted rule-based parser using weak labeling designed to account for “negation & uncertainty”, not merely the application of regular expressions.  Relationships between multiple labels are expressed, and while labels can stand alone, for the label ‘pneumonia’ the most common associated label is ‘infiltrate’.  A graph showing relationships between the different labels in the dataset is here (from Wang et al.):

Label map from the ChestX-ray14 dataset by Wang et al.

Pneumonia is purple with 2062 cases, and one can see the largest association is with infiltration, then edema and effusion.  A few associations with atelectasis also exist (thinner line).
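
If you want to reproduce those co-occurrence counts yourself, a minimal sketch along these lines should work against the Data_Entry_2017.csv index distributed with ChestX-ray14, where the “Finding Labels” column holds pipe-separated labels per image (treat the file and column names as assumptions if your copy differs).

```python
# Counts label co-occurrence from the ChestX-ray14 index file. The file name
# (Data_Entry_2017.csv) and column ("Finding Labels", pipe-separated) are the
# NIH-distributed conventions; adjust if your copy differs.
import csv
from collections import Counter
from itertools import combinations

label_counts, pair_counts = Counter(), Counter()
with open("Data_Entry_2017.csv", newline="") as f:
    for row in csv.DictReader(f):
        labels = sorted(row["Finding Labels"].split("|"))
        label_counts.update(labels)
        pair_counts.update(combinations(labels, 2))

print(label_counts["Pneumonia"])                       # total pneumonia cases
print([pc for pc in pair_counts.most_common() if "Pneumonia" in pc[0]][:5])
```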

The dataset methodology claims to account for these issues, with up to 90% precision reported in ChestX-ray8 and similar precision inferred for ChestX-ray14.

Two chest radiographs labeled “No Findings” (!) from the NIH ChestX-ray14 dataset

However, expert review of the ChestX-ray14 dataset does not support this.  In fact, there are significant concerns that the labeling of the dataset is a good deal weaker.  I’ll just pick out the two examples above: a patient likely post right lobectomy, with attendant findings, classified as “No Findings”, and a lateral chest X-ray which doesn’t even belong in a study database of all PA and AP films.  These sorts of findings aren’t isolated – Dr. Luke Oakden-Rayner addresses this extensively in this post, from which his own observations are shown below:

Dr. Luke Oakden-Rayner’s sampled positive predictive value on visual inspection of 130 images from the ChestX-ray14 dataset vs the reported values

His final judgment is that the ChestX-ray14 dataset is not fit for training medical AI systems to do diagnostic work.  He makes a compelling argument, but I think it is primarily a labeling problem, where the roughly 90% precision claimed for the NLP data-mining techniques of Wang et al. does not hold up.  ChestX-ray14 is a useful dataset for the images alone, but the labels are suspect.  I would call upon the NIH group to address this and learn from the experience.  In that light, I am surprised that the system did not do a great deal better than the human radiologists involved in Dr. Ng’s group’s study, and I don’t really have a good explanation for it.
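
For anyone repeating that kind of spot check, a hedged sketch of the arithmetic: a sampled PPV with a binomial (Wilson) confidence interval. The counts below are placeholders, not Dr. Oakden-Rayner’s actual figures.

```python
# Wilson score interval for a hand-checked PPV; the counts are placeholders.
import math

reviewed, correct = 130, 80        # images reviewed, labels judged correct (illustrative)
p_hat = correct / reviewed
z = 1.96                           # ~95% confidence
denom = 1 + z**2 / reviewed
centre = (p_hat + z**2 / (2 * reviewed)) / denom
half = (z / denom) * math.sqrt(p_hat * (1 - p_hat) / reviewed + z**2 / (4 * reviewed**2))
print(f"sampled PPV = {p_hat:.2f}, 95% CI {centre - half:.2f}-{centre + half:.2f}")
```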

The evaluation of CheXNet by these individuals should be recognized:

Luke Oakden-Rayner: CheXNet an in-depth review

Paras Lakhani: Dear Mythical Editor: Radiologist-level Pneumonia Detection in CheXNet

Bálint Botz: A few thoughts about CheXNet


Black Swans, Antifragility, Six Sigma and Healthcare Operations – What medicine can learn from Wall St Part 7


I am an admirer of Nassim Nicholas Taleb – a mercurial options trader who has evolved into a philosopher-mathematician.  The focus of his work is on the effects of randomness: how we sometimes mistake randomness for predictable change, and how we fail to prepare for randomness by excluding outliers from statistics and decision making.  These “black swans” arise unpredictably and cause great harm, amplified by the ‘fragile’ systems we have put into place.

Perhaps the best example of a black swan event is the period of financial uncertainty we have lived through during the last decade.  A quick recap: the 2008 global financial crisis was caused by a bubble in US real estate assets.  This, in turn, stemmed from legislation mandating lower lending standards (subprime, Alt-A) and facilitating the securitization of these loans, which allowed the proverbial passing of the ‘hot potato’.  These mortgages were packaged into derivatives named collateralized debt obligations (CDOs), using statistical models to gauge default risks in the underlying loans.  Loans more likely to default were blended with loans less likely to default, yielding an overall package that was statistically unlikely to default.  However, as owners of these securities found out, the statistical models that made them unlikely to default were based on a short sample period in which there were few defaults.  The models indicated that the financial crisis was a 25-sigma (standard deviation) event that should only happen once in:

a number of years with well over a hundred zeroes in it (c.f. Wolfram Alpha).

Of course, the default events happened in the first five years of their existence, proving that calculation woefully inadequate.
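
As a back-of-the-envelope check on just how absurd that figure is, here is the tail probability of a 25-sigma event under a normal model; the ~252 trading days per year is my assumption, not the original calculation’s.

```python
# Tail probability of a 25-sigma daily move under a normal model, and the
# implied waiting time assuming ~252 trading days per year (my assumption).
import math

p = 0.5 * math.erfc(25 / math.sqrt(2))   # P(Z > 25) for a standard normal
years = 1 / (p * 252)
print(f"P(Z > 25) ~ {p:.2e}, i.e. once in roughly {years:.1e} years")
# ~3e-138, on the order of 10^135 years: the Gaussian model, not the world,
# is what failed.
```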

The problem with major black swans is that they are sufficiently rare and impactful that it is difficult to plan for them: global pandemics, the Fukushima reactor accident, and the like.  By designing robust systems that expect perturbations, you can mitigate their effects when they occur and shake off the more frequent minor black (grey) swans – perturbations that occur occasionally (but more often than you expect); 5-10 sigma events that are not devastating but disruptive (like local disease outbreaks or power outages).

Taleb classifies how things react to randomness into three categories: Fragile, Robust, and Anti-Fragile.  While the interested reader would benefit from reading the original work, here is a brief summary:

1. The Fragile consists of things that hate, or break from, randomness.  Think about tightly controlled processes, just-in-time delivery, tightly scheduled areas like the OR when cases are delayed or extended, etc.
2. The Robust consists of things that resist randomness and try not to change.  Think about warehoused inventories, overstaffing to mitigate surges in demand, checklists and standard order sets, etc.
3. The Anti-Fragile consists of things that love randomness and improve with serendipity.  Think about cross-trained floater employees, serendipitous CEO-employee hallway meetings, and lunchroom physician-physician interactions where the patient benefits.

In thinking about Fragile/Robust/Anti-Fragile, be cautious about injecting bias into the meaning.  After all, we tend to avoid breakable objects, preferring things that are hardy or robust.  So there is a natural tendency to consider fragility ‘bad’, robustness ‘good’, and anti-fragility, therefore, ‘great!’  Not true – not when we approach these categories from an operational or administrative viewpoint.

Fragile processes and systems are those prone to breaking.  They hate variation and randomness and respond well to six-sigma analyses and productivity/quality improvement.  I believe that fragile systems and processes are those that will benefit the most from automation and technology: removing human input and interference decreases cycle time and defects.  While the fragile may be prone to breaking, that is not necessarily bad.  Think of the new entrepreneur’s mantra, ‘fail fast’.  Agile/Scrum development, most common in software (but perhaps useful in healthcare?), relies on rapid iteration to adapt to a moving target.  Fragile systems and processes cannot be avoided – instead, they should be highly optimized with the least human involvement.  They need careful monitoring (daily? hourly?) to detect failure, at which point a ready team can swoop in, fix whatever has caused the breakage, re-optimize if necessary, and restore the system to functionality.  If a fragile process breaks too frequently and causes significant disruption, it probably should be made into a Robust one.

Robust systems and processes are those that resist failure through redundancy and relative waste.  These are probably your ‘mission critical’ processes, where some variation in the input is expected but there is a need to produce a standardized output.  From time to time your ER is overcome by more patients than available beds, so you create a second holding area for less-acute cases or patients awaiting transfers or tests.  This keeps your ER from shutting down.  While it can be wasteful to run this area when the ER is at half-capacity, the waste is tolerable versus the lost revenue and reputation of patients leaving your ER for your competitor’s, or the litigation cost of a patient expiring in the ER after waiting 8 hours.  The redundant patient histories taken by physicians, nurses, and medical students serve a similar purpose – increasing diagnostic accuracy – though the practice is useful only when additional critical information is volunteered to one but not the others.  Attempting to tightly manage robust processes may either be a waste of time or turn a robust process into a fragile one by depriving it of sufficient resilience – essentially creating a bottleneck.  I suspect that robust processes can be optimized to the first or second sigma, but no more.

Anti-fragile processes and systems benefit from randomness, serendipity, and variability.  I believe that many of these are human-centric.  The automated process that breaks is fragile, but the team that swoops in to repair it – they’re anti-fragile.  The CEO wandering the halls to speak to his or her front-line employees four or five levels down the organizational tree for information – anti-fragile.  Clinicians who practice ‘high-touch’ medicine, generating goodwill toward the hospital and the unexpected multi-million dollar bequest of a grateful donor 20 years later – that’s very anti-fragile.  While anti-fragile elements can exist at any level, I suspect that more of them are present in higher-level executive and professional roles in the healthcare delivery environment.  Automating or tightly managing anti-fragile systems and processes will likely make them LESS productive and efficient.  Would the bequest have happened if that physician was tasked and bonused to spend only 5.5 minutes per patient encounter?  Six-sigma management here will cause the opposite of the desired results.

I think a lot more can be written on this subject, particularly from an operational standpoint.  Systems and processes in healthcare can be labeled fragile, robust, or anti-fragile as defined above.  Fragile components should have human input reduced to the bare minimum possible, and then you optimize the heck out of them.  Expect them to break – but that’s OK – have a plan and a team ready for dealing with it, fix it fast, and re-optimize until the next failure.  Robust systems should undergo some optimization, with some resilience or redundancy built in – and then be left the heck alone!  Anti-fragile systems should focus on people, and great caution should be used not only in optimization but also in the metrics used to manage these systems – lest you take an anti-fragile process, force it into a fragile paradigm, and cause the failure of that system and process.  It is the medical equivalent of forcing a square peg into a round hole.  I suspect that when an anti-fragile process fails, this is why.