{"id":13421,"date":"2017-11-17T11:48:08","date_gmt":"2017-11-17T16:48:08","guid":{"rendered":"http:\/\/n2value.com\/blog\/?p=13421"},"modified":"2018-01-25T11:42:35","modified_gmt":"2018-01-25T16:42:35","slug":"chexnet-a-brief-evaluation","status":"publish","type":"post","link":"https:\/\/n2value.com\/blog\/chexnet-a-brief-evaluation\/","title":{"rendered":"CheXNet \u2013 a brief evaluation"},"content":{"rendered":"<figure id=\"attachment_13422\" aria-describedby=\"caption-attachment-13422\" style=\"width: 1024px\" class=\"wp-caption alignnone\"><img loading=\"lazy\" decoding=\"async\" class=\"alignnone wp-image-13422\" src=\"http:\/\/n2value.com\/blog\/wp-content\/uploads\/2017\/11\/dreamcxr.png\" alt=\"Chest X-Ray deep dreamed - our AI &amp; deep learning future\" width=\"1024\" height=\"1024\" srcset=\"https:\/\/n2value.com\/blog\/wp-content\/uploads\/2017\/11\/dreamcxr.png 1024w, https:\/\/n2value.com\/blog\/wp-content\/uploads\/2017\/11\/dreamcxr-150x150.png 150w, https:\/\/n2value.com\/blog\/wp-content\/uploads\/2017\/11\/dreamcxr-300x300.png 300w, https:\/\/n2value.com\/blog\/wp-content\/uploads\/2017\/11\/dreamcxr-768x768.png 768w\" sizes=\"auto, (max-width: 1024px) 100vw, 1024px\" \/><figcaption id=\"caption-attachment-13422\" class=\"wp-caption-text\">Chest Radiograph from ChestX-ray14 dataset processed with the deep dream algorithm trained on ImageNet<\/figcaption><\/figure>\n<p><em><strong>1\/25\/18: NOTE:\u00a0 Since the November release of the CheXNet paper on ArXiV, there has been a healthy and extensive online discussion on twitter, reddit, and online blogs.\u00a0 The Stanford paper has undergone at least two revisions with some substantial modifications, most importantly the replacement of ROC curves with F1 scores and a bootstrap calculation of significance.\u00a0 Some details about the methodology which were not released in the original version have come out, particularly the &#8220;re-labeling&#8221; of ground truth by Stanford 
radiologists.\u00a0 My comment about the thoracic specialist has been completely borne out on further release of information. And the problems with ChestX-ray14&#8217;s labeling (why the Stanford docs re-labeled) are now well-known.<br \/>\n<\/strong><\/em><\/p>\n<p><em><strong>The investigation and discussion of this paper has been spearheaded by Luke Oakden-Rayner, who has spent months corresponding with the author and discussing the paper.\u00a0 For further information, see below.<\/strong><br \/>\n<\/em><\/p>\n<p><em><strong>The discussion on CheXNet appears to be over, and there has been a great deal of collective learning from it.\u00a0 The Stanford group should be lauded for their willingness to engage in open peer review and to modify their paper substantially in response.\u00a0 There is no question that a typical 18-24 month process of review and discussion was fast-tracked into the last two months.\u00a0 Relevant blog links are below, after my December addendum.\u00a0 This will be my last update on this post, as it is &#8220;not so brief&#8221; any longer! 
<\/strong><\/em><\/p>\n<p>&nbsp;<\/p>\n<p>Andrew Ng released <a href=\"https:\/\/stanfordmlgroup.github.io\/projects\/chexnet\/\">CheXNet<\/a> yesterday on <a href=\"https:\/\/arxiv.org\/abs\/1711.05225\" target=\"_blank\" rel=\"noopener\">ArXiv (citation)<\/a> and <a href=\"https:\/\/twitter.com\/AndrewYNg\/status\/931026446717296640\">promoted it with a tweet<\/a> which caused a <a href=\"https:\/\/twitter.com\/AndrewYNg\/status\/930938692310482944\">bit of a stir on the internet<\/a> and <a href=\"https:\/\/www.reddit.com\/r\/Radiology\/comments\/7d8f5k\/chexnet_radiologistlevel_pneumonia_detection_on\/\">related radiology social media sites<\/a> like Aunt Minnie.\u00a0 Before radiologists throw away their board certifications and look for jobs as Uber drivers, a few comments on what this does and does not do.<\/p>\n<p>First off, from the machine learning perspective, the methodology checks out.\u00a0 It uses a 121-layer <a href=\"http:\/\/openaccess.thecvf.com\/content_cvpr_2017\/papers\/Huang_Densely_Connected_Convolutional_CVPR_2017_paper.pdf\">DenseNet<\/a>, which is a powerful convolutional neural network.\u00a0 While code has not yet been provided, the DenseNet seems similar to code repositories online, where the 121-layer configuration is a pre-made format.\u00a0 An 80\/20 training\/validation split seems pretty reasonable (per my friend, <a href=\"https:\/\/twitter.com\/kirkdborne\">Kirk Borne<\/a>).\u00a0 Random initialization, minibatches of 16 with oversampling of positive classes, and a progressively decaying validation loss are utilized, all of which are pretty standard.\u00a0 <a href=\"http:\/\/cnnlocalization.csail.mit.edu\/supp.pdf\">Class activation mappings are used to visualize areas in the image most indicative of the activated class<\/a> (in this case, pneumonia).\u00a0 This is an interesting technique that can be used to provide some human-interpretable insights into the potentially opaque DenseNet.<\/p>\n<p>The last Fully Connected (FC) layer is replaced by a single
output (only one class is being tested for &#8211; pneumonia) coupled to a sigmoid function (<a href=\"https:\/\/towardsdatascience.com\/activation-functions-and-its-types-which-is-better-a9a5310cc8f\">an activation function &#8211; see here<\/a>) to give a probability between 0 and 1.\u00a0 Again, pretty standard for a binary classification.\u00a0 The multiclass portion of the study was performed separately\/later.<\/p>\n<p>The test portion of the study consisted of 420 chest X-rays read by four radiologists, one of whom was a thoracic specialist.\u00a0 They could choose among the 14 pathologies in the ChestX-ray14 dataset, reading blind without any clinical data.<\/p>\n<p>So, a ROC curve was created, showing three radiologists similar to each other and one outlier.\u00a0 The radiologists lie <span style=\"text-decoration: underline;\">slightly<\/span> under the ROC curve of the CheXNet classifier.\u00a0 But, a miss is as good as a mile, so the claims of at-or-above-radiologist performance are accurate, because math.\u00a0 As Luke Oakden-Rayner points out, this would probably not pass statistical muster.<\/p>\n<p>So that&#8217;s the study.\u00a0 Now, I will pick some bones with it.<\/p>\n<p>First, including only one thoracic radiologist matters if you are going to define ground truth as agreement of 3 out of 4 radiologists.\u00a0 <em>(Addendum: And, for statistical and methodological reasons discussed online, the 3-out-of-4 implementation was initially flawed as scored.)<\/em>\u00a0 General radiologists will be less specific than specialist radiologists, and that is one of the reasons why we have moved to specialty-specific reads over the last 20 years.\u00a0 If the three general rads disagreed with the thoracic rad, the thoracic rad&#8217;s ground truth would be discarded.\u00a0 Think about this &#8211; you would take the word of the generalist over the specialist, despite the specialist&#8217;s greater training.\u00a0 (1\/25 <strong>Addendum: proven right on this one.\u00a0 The
thoracic radiologist is an outlier with a higher F1 score<\/strong>)\u00a0 Even <a href=\"https:\/\/jamanetwork.com\/journals\/jama\/fullarticle\/2588763\">Google didn&#8217;t do this in their retinal machine learning paper<\/a>.\u00a0 Instead, Google used their three retinal specialists as ground truth and then looked at how the non-specialty ophthalmologists were able to evaluate that data and what it meant to the training dataset.\u00a0 (Thanks, Melody!)\u00a0 Nevertheless, all rads lie reasonably along the same ROC curve, so <del>methodologically it checks out<\/del>\u00a0<em>the radiologists are likely of equal ability but different sensitivities\/specificities<\/em>.<\/p>\n<p>Second, the Wang ChestX-ray14 dataset was data-mined from NIH radiology reports.\u00a0 This means that for the dataset, ground truth was whatever the radiologists said it was.\u00a0 I&#8217;m not casting aspersions on the NIH radiologists, as I am sure they are pretty good.\u00a0 I&#8217;m simply saying that the dataset&#8217;s ground truth is what it says it is, not necessarily what the patient&#8217;s clinical condition was.\u00a0 As proof of that, here are a few cells from the findings field of this dataset.<\/p>\n<p><img loading=\"lazy\" decoding=\"async\" class=\"wp-image-13473\" src=\"http:\/\/n2value.com\/blog\/wp-content\/uploads\/2017\/11\/Pneumonia-vs-infiltrate.png\" alt=\"Findings field from the ChestX-ray14 dataset (representative)\" width=\"378\" height=\"274\" srcset=\"https:\/\/n2value.com\/blog\/wp-content\/uploads\/2017\/11\/Pneumonia-vs-infiltrate.png 378w, https:\/\/n2value.com\/blog\/wp-content\/uploads\/2017\/11\/Pneumonia-vs-infiltrate-300x217.png 300w\" sizes=\"auto, (max-width: 378px) 100vw, 378px\" \/><\/p>\n<p>In any case, the NIH radiologists more than a few times perhaps couldn&#8217;t tell either, or identified one finding as the cause of the other (Infiltrate &amp; Pneumonia mentioned side by side) and at the top you have the 
three fields &#8220;atelectasis&#8221; &#8220;consolidation&#8221; &amp; &#8220;Pneumonia&#8221; &#8211; is this concurrent pneumonia with consolidation with some atelectasis elsewhere, or is it &#8220;atelectasis vs consolidation cannot r\/o pneumonia&#8221; (as radiologists we say these things). While the text miner purports to use several advanced NLP tools to avoid these kinds of problems, in practice it does not seem to do so. <strong>(See addendum below, further addendum, confirmed by Jeremy Howard<\/strong>)\u00a0 Dr. Ng, if you read this, I have the utmost respect for you and your team, and I have learned from you.\u00a0 But I would love to know your rebuttal, and I would urge you to publish those results.\u00a0 Or perhaps someone should do it for reproducibility purposes.<\/p>\n<p>Finally, I&#8217;m bringing up these points not to be a killjoy, but to be balanced.\u00a0 I think it is important to see this and prevent someone from making a really boneheaded decision of firing their radiologists to put in a computer diagnostic system (not in the US, but elsewhere) and realizing it doesn&#8217;t work after spending a vast sum of money on it.\u00a0 Startups competing in the field who do not have deep healthcare experience need to be aware of potential pitfalls in their product.\u00a0 I&#8217;m saying this because real people could be really hurt and impacted if we don&#8217;t manage this transition into AI well.\u00a0 Maybe all parties involved in medical image analysis should join us in taking the Hippocratic Oath, CEO&#8217;s and developers included.<\/p>\n<p>Thanks for reading, and feel free to comment here or on twitter or connect on linkedin to me: <a href=\"https:\/\/twitter.com\/drsxr\">@drsxr<\/a><\/p>\n<p><span style=\"text-decoration: underline;\">December Addendum<\/span>: ChestX-ray14 is based on the ChestX-ray8 database which is described in a paper released on <a href=\"https:\/\/arxiv.org\/abs\/1705.02315v4\">ArXiv<\/a> by Xiaosong Wang et al. 
The text mining is based upon a hand-crafted rule-based parser using <a href=\"https:\/\/hazyresearch.github.io\/snorkel\/blog\/weak_supervision.html\" target=\"_blank\" rel=\"noopener\">weak labeling<\/a> designed to account for &#8220;negation &amp; uncertainty&#8221;, not merely the application of regular expressions. Relationships between multiple labels are expressed, and while labels can stand alone, for the label &#8216;pneumonia&#8217;, the most common associated label is &#8216;infiltrate&#8217;.\u00a0 A graph showing relationships between the different labels in the dataset is here (from Wang et al.)<\/p>\n<p><img loading=\"lazy\" decoding=\"async\" class=\"alignnone size-full wp-image-13590\" src=\"http:\/\/n2value.com\/blog\/wp-content\/uploads\/2017\/11\/WangCXR14figA.png\" alt=\"Label map from the ChestX-ray14 dataset by Wang et al.\" width=\"535\" height=\"551\" srcset=\"https:\/\/n2value.com\/blog\/wp-content\/uploads\/2017\/11\/WangCXR14figA.png 535w, https:\/\/n2value.com\/blog\/wp-content\/uploads\/2017\/11\/WangCXR14figA-291x300.png 291w\" sizes=\"auto, (max-width: 535px) 100vw, 535px\" \/><\/p>\n<p>Pneumonia is purple with 2062 cases, and one can see the largest association is with infiltration, then edema and effusion.\u00a0 A few associations with atelectasis also exist (thinner line).<\/p>\n<p>The dataset methodology claims to account for these issues, with up to 90% precision reported in ChestX-ray8 and similar precision inferred in ChestX-ray14.<\/p>\n<figure id=\"attachment_13620\" aria-describedby=\"caption-attachment-13620\" style=\"width: 450px\" class=\"wp-caption aligncenter\"><img loading=\"lazy\" decoding=\"async\" class=\"alignnone wp-image-13620\" src=\"http:\/\/n2value.com\/blog\/wp-content\/uploads\/2018\/01\/00001437_048.png\" alt=\"No Findings (!) 
from NIH CXR14 dataset\" width=\"450\" height=\"450\" srcset=\"https:\/\/n2value.com\/blog\/wp-content\/uploads\/2018\/01\/00001437_048.png 1024w, https:\/\/n2value.com\/blog\/wp-content\/uploads\/2018\/01\/00001437_048-150x150.png 150w, https:\/\/n2value.com\/blog\/wp-content\/uploads\/2018\/01\/00001437_048-300x300.png 300w, https:\/\/n2value.com\/blog\/wp-content\/uploads\/2018\/01\/00001437_048-768x768.png 768w\" sizes=\"auto, (max-width: 450px) 100vw, 450px\" \/><figcaption id=\"caption-attachment-13620\" class=\"wp-caption-text\">&#8220;No Findings&#8221;<\/figcaption><\/figure>\n<figure id=\"attachment_13621\" aria-describedby=\"caption-attachment-13621\" style=\"width: 432px\" class=\"wp-caption aligncenter\"><img loading=\"lazy\" decoding=\"async\" class=\"alignnone wp-image-13621\" src=\"http:\/\/n2value.com\/blog\/wp-content\/uploads\/2018\/01\/00005192_001.png\" alt=\"No Findings (!) from NIH CXR14 Dataset\" width=\"432\" height=\"432\" srcset=\"https:\/\/n2value.com\/blog\/wp-content\/uploads\/2018\/01\/00005192_001.png 1024w, https:\/\/n2value.com\/blog\/wp-content\/uploads\/2018\/01\/00005192_001-150x150.png 150w, https:\/\/n2value.com\/blog\/wp-content\/uploads\/2018\/01\/00005192_001-300x300.png 300w, https:\/\/n2value.com\/blog\/wp-content\/uploads\/2018\/01\/00005192_001-768x768.png 768w\" sizes=\"auto, (max-width: 432px) 100vw, 432px\" \/><figcaption id=\"caption-attachment-13621\" class=\"wp-caption-text\">&#8220;No Findings&#8221;<\/figcaption><\/figure>\n<p>However, expert review of the dataset (ChestX-ray14) does not support this.\u00a0 In fact, there are significant concerns that the labeling of the dataset is a good deal weaker.\u00a0 I&#8217;ll just pick out two examples above that show a patient likely post R lobectomy with attendant findings classified as &#8220;No Findings&#8221; and the lateral chest X-ray which doesn&#8217;t even belong in the study database of all PA and AP films.\u00a0 These sorts of findings aren&#8217;t isolated 
&#8211; Dr. Luke Oakden-Rayner <a href=\"https:\/\/lukeoakdenrayner.wordpress.com\/2017\/12\/18\/the-chestxray14-dataset-problems\/\" target=\"_blank\" rel=\"noopener\">addresses this extensively in this post<\/a>, from which his own observations are garnered below:<\/p>\n<figure id=\"attachment_13604\" aria-describedby=\"caption-attachment-13604\" style=\"width: 300px\" class=\"wp-caption aligncenter\"><img loading=\"lazy\" decoding=\"async\" class=\"alignnone size-full wp-image-13604\" src=\"http:\/\/n2value.com\/blog\/wp-content\/uploads\/2017\/12\/cxr14-accuracy.png\" alt=\"Sampled PPV for ChestX-ray14 dataset vs reported\" width=\"300\" height=\"128\" \/><figcaption id=\"caption-attachment-13604\" class=\"wp-caption-text\">Dr. Luke Oakden-Rayner&#8217;s own Positive Predictive Value on visual inspection of 130 images vs reported<\/figcaption><\/figure>\n<p>His final judgment is that the ChestX-ray14 dataset is not fit for training medical AI systems to do diagnostic work.\u00a0 He makes a compelling argument, but I think it is primarily a labeling problem, where the proposed 90% accuracy of the NLP data mining techniques of Wang et al. does not hold up.\u00a0 ChestX-ray14 is a useful dataset for the images alone, but the labels are suspect.\u00a0 I would call upon the NIH group to address this and learn from this experience.\u00a0 In that light, I am surprised that the system did not do a great deal better than the human radiologists involved in Dr. 
Ng&#8217;s group&#8217;s study, and I don&#8217;t really have a good explanation for it.<\/p>\n<p><strong><em>The evaluation of CheXNet by these individuals should be recognized:<\/em><\/strong><\/p>\n<p>Luke Oakden-Rayner: <a href=\"https:\/\/lukeoakdenrayner.wordpress.com\/2018\/01\/24\/chexnet-an-in-depth-review\/\" target=\"_blank\" rel=\"noopener\">CheXNet: an in-depth review<\/a><\/p>\n<p>Paras Lakhani: Dear Mythical Editor: <a href=\"https:\/\/medium.com\/@paras42\/dear-mythical-editor-radiologist-level-pneumonia-in-chexnet-c91041223526\" target=\"_blank\" rel=\"noopener\">Radiologist-level Pneumonia Detection in CheXNet<\/a><\/p>\n<p>Balint Botz: <a href=\"https:\/\/medium.com\/@BalintBotz\/a-few-thoughts-about-chexnet-and-the-way-human-performance-should-and-should-not-be-measured-68031dca7bf\" target=\"_blank\" rel=\"noopener\">A few thoughts about CheXNet<\/a><\/p>\n<p>Copyright \u00a9 2017<\/p>\n","protected":false},"excerpt":{"rendered":"<p>1\/25\/18: NOTE:\u00a0 Since the November release of the CheXNet paper on ArXiV, there has been a healthy and extensive online discussion on twitter, reddit, and online blogs.\u00a0 The Stanford paper has undergone at least two revisions with some substantial modifications, most importantly the replacement of ROC curves with F1 scores and a bootstrap calculation of [&hellip;]<\/p>\n","protected":false},"author":1,"featured_media":13422,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"_monsterinsights_skip_tracking":false,"_monsterinsights_sitenote_active":false,"_monsterinsights_sitenote_note":"","_monsterinsights_sitenote_category":0,"jetpack_post_was_ever_published":false,"_jetpack_newsletter_access":"","_jetpack_dont_email_post_to_subs":false,"_jetpack_newsletter_tier_id":0,"_jetpack_memberships_contains_paywalled_content":false,"_jetpack_memberships_contains_paid_content":false,"footnotes":"","jetpack_publicize_message":"My thoughts on @andrewng 's CheXNet.  
Special N2value.com post. #radiology #deeplearning","jetpack_publicize_feature_enabled":true,"jetpack_social_post_already_shared":true,"jetpack_social_options":{"image_generator_settings":{"template":"highway","enabled":false},"version":2}},"categories":[29,22,4,24],"tags":[20,27,15],"class_list":["post-13421","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-ai","category-computer-vision","category-data-science","category-radiology","tag-antifragile","tag-machine-learning","tag-patient-care"],"jetpack_publicize_connections":[],"aioseo_notices":[],"jetpack_featured_media_url":"https:\/\/n2value.com\/blog\/wp-content\/uploads\/2017\/11\/dreamcxr.png","jetpack_shortlink":"https:\/\/wp.me\/p4mtfP-3ut","jetpack_sharing_enabled":true,"jetpack_likes_enabled":true,"_links":{"self":[{"href":"https:\/\/n2value.com\/blog\/wp-json\/wp\/v2\/posts\/13421","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/n2value.com\/blog\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/n2value.com\/blog\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/n2value.com\/blog\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/n2value.com\/blog\/wp-json\/wp\/v2\/comments?post=13421"}],"version-history":[{"count":81,"href":"https:\/\/n2value.com\/blog\/wp-json\/wp\/v2\/posts\/13421\/revisions"}],"predecessor-version":[{"id":13630,"href":"https:\/\/n2value.com\/blog\/wp-json\/wp\/v2\/posts\/13421\/revisions\/13630"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/n2value.com\/blog\/wp-json\/wp\/v2\/media\/13422"}],"wp:attachment":[{"href":"https:\/\/n2value.com\/blog\/wp-json\/wp\/v2\/media?parent=13421"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/n2value.com\/blog\/wp-json\/wp\/v2\/categories?post=13421"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/n2value.com\/blog\/wp-json\/wp\/v2\/tags?post=13421"}],"curies":[{"name":"wp","hr
ef":"https:\/\/api.w.org\/{rel}","templated":true}]}}