Aug. 29, 2018, 11:52 a.m.
When doing binary image segmentation, segmenting images into foreground and background, cross entropy is far from ideal as a loss function. As these datasets tend to be highly unbalanced, with far more background pixels than foreground, the model will usually score best by predicting everything as background. I have confronted this issue during my work with mammography and my solution was to use a weighted sigmoid cross entropy loss function giving the foreground pixels higher weight than the background.
While this worked it was far from ideal, for one thing it introduced another hyperparameters - the weight - and altering the weight had a large impact on the model. Higher weights favored predicting pixels as positive, increasing recall and decreasing precision, and lowering the weight had the opposite effect. When training my models I usually began with a high weight to encourage the model to make positive predictions and gradually decayed the weight to encourage it to make negative predictions.
For these types of segmentation tasks Intersection over Union tends to be the most relevant metric as pixel level accuracy, precision and recall do not account for the overlap between predictions and ground truth. Especially for this task, where overlap can be the difference between life and death for the patient, accuracy is not as relevant as IOU. So why not use IOU as a loss function?
The reason was because IOU was not differentiable so can not be used for gradient descent. However Wang et al have written a paper - Optimizing Intersection-Over-Union in Deep Neural Networks for Image Segmentation - which provides an easy way to use IOU as a loss function. In addition, this site provides code to implement this loss function in TensorFlow.
The essence of this method is that rather than using the binary predictions to calculate IOU we use the sigmoid probability output by the logits to estimate it which allows IOU to provide gradients. At first I was skeptical of this method, mostly because I understood cross entropy better and it is more common, but after I hit a performance wall with my mammography models I decided to give it a try.
My models using cross-entropy loss had ceased to improve validation performance so I switched the loss function and trained them for a few more epochs. The validation metrics began to improve, so I decided to train a copy of the model from scratch with the IOU loss. This has been a resounding success. The IOU loss accounts for the imbalanced data, eliminating the need to weight the cross entropy. With the cross entropy loss the models usually began with recall of near 1 and precision of near 0 and then the precision would increase while the recall slowly decreased until it plateaued. With IOU loss they both start near 0 and gradually increase, which to me seems more natural.
Training with an IOU loss has two concrete benefits for this task - it has allowed the model to detect more subtle abnormalities which models trained with cross entropy loss did not detect; and it has reduced the number of false positives significantly. As the false positives are on a pixel level this effectively means that the predictions are less noisy and the shapes are more accurate.
The biggest benefit is that we are directly optimizing for our target metric rather than attempting to use an imperfect substitute which we hope will approximate the target metric. Note that this method only works for binary segmentation at the moment. It also is a bit slower than using cross entropy, but if you are doing binary segmentation the performance boost is well worth it.
July 17, 2018, 7:34 a.m.
When I began working on this project my intention was to do multi-class classification of the images. To this end I built my graph with logits and a cross-entropy loss function. I soon realized that the decision to do multi-class classification was quite ambitious, and scaled back to doing binary classification into positive and negative. My goal was to implement the multi-class approach once I had the binary approach working reasonably well, so I left the cross entropy in place.
Over the months I have been working on this I have realized that, for many reasons, the multi-class classification was a bad idea. For an academic project it might have made sense, but for any sort of real world use case it made none. There is really no use to outputting a simple classification for something as important as detecting cancer. A much more useful output is the probabilities that each area of the image contains an abnormality as this could aid a radiologist in diagnosing abnormalities rather than completely replacing her. Yet for some reason I never bothered to change the output or the loss function.
The limiting factor on the size of the model has been the GPU memory of the Google Cloud instance I am training this monstrosity on. So I've been trying to optimize the model to run within the RAM constraints and train in a reasonable amount of time. Mostly this has involved trying to keep the number of parameters to a minimum, but today I was looking at the model and realized that the logits were definitely not helping the situation.
For this problem classification was absolutely the wrong approach. We aren't trying to classify the content of the image, we are trying to detect abnormalities. The negative class was not really a separate class, but the absence of any abnormalities, and the graph and the loss function should reflect this. In order to coerce the logits into an output that reflected the reality just described, I put the logits through a softmax and then discarded the negative probability - as I said the negative class doesn't really exist. However the cross entropy function does not know this and it places equal importance on the imaginary negative class as on the positive class (subject to the cross entropy weighting of course.) This means that the gradients placed equal weight on trying to find imaginary "normal" patterns, despite the fact that this information is discarded and never used.
So I reduced the logits layer to one unit, replaced the softmax activation with a sigmoid activation, and replaced the cross entropy with binary cross entropy. And the change has been more impactful than I imagined it would be. The model immediately began performing better than the same model with the logits/cross entropy structure. It seems obvious that this would be the case as now the model can focus on detecting abnormalities rather than wasting half of it's efforts on trying to detect normal patterns.
I am not sure why I waited so long to make this change and my best guess is that I was seduced by the undeniable elegance of the cross entropy loss function. For multi-class classification it is truly a thing of beauty, and I may have been blinded by that into attempting to use it in a situation it was not designed for.
July 5, 2018, 10:37 a.m.
I wrote about this paper before, but I am going to again because it has been so enormously useful to me. I am still working on segmentation of mammograms to highlight abnormalities and I recently decided to scrap the approach I had been taking to upsampling the image and start that part from scratch.
When I started I had been using the earliest approach to upsampling, which basically was take my classifier, remove the last fully-connected layer and upsample that back to full resolution with transpose convolutions. This worked well enough, but the network had to upsample images from 2x2x1024 to 640x640x2 and in order to do this I needed to add skip connections from the downsizing section to the upsampling section. This caused problems because the network would add features of the input image to the output, regardless of whether the features were relevant to the label. I tried to get around this by adding bottleneck layers before the skip connection in order to only select the pertinent features, but this greatly slowed down training and didn't help much and the output ended up with a lot of weird artifacts.
In "Deconvolution and Checkerboard Artifacts", Odena et al. have demonstrated that replacing transpose convolutions with nearest neighbors resizing produces smoother images than using transpose convolutions. I tried replacing a few of my tranpose convolutions with resizes and the results improved.
Then I started reading about dilated convolutions and I started wondering why I was downsizing my input from 640x640 to 5x5 just to have to resize it back up. I removed all the fully-connected layers (which in fact were 1x1 convolutions rather than fully-connected layers) and then replaced the last max pool with a dilated convolution.
I replaced all of the transpose convolutions with resizes, except for the last two layers, as suggested by Odena et al, and the final tranpose convolution has a stride of 1 in order to smooth out artifacts.
In the downsizing section, the current model reduces the input from 640x640x1 to 20x20x512, then it is upsampled by using nearest neighbors resizing followed by plain convolutions to 320x320x32. Finally there is a tranpose convolution with a stride of 2 followed by a transpose convolution with a stride of 1 and then a softmax for the output. As an added bonus, this version of the model trains significantly faster than upsampling with transpose convolutions.
I just started training this model, but I am fairly confident it will perform better than previous upsampling schemes as when I extracted the last downsizing convolutional layer from the model that layer appeared closer to the label (although much smaller) than the final output did. I will update when I have actual results.
Update - After training the model for just one epoch, with the downsizing layer weights initialized from a previous model, the results are already significantly better than under the previous scheme.
June 6, 2018, 3:07 p.m.
Even though I only have 1/5 of the images uploaded so far, I decided to do some tests to see if this method would work. It does, but it took quite a bit of tweaking to get performance to reasonable levels.
At first I just plugged the new dataset into the old graph, and this worked but was incredibly slow with the GPU sitting idle most of the time. I tried quite a few different methods to speed the pre-processing bottleneck up, but the solution was simpler than I had thought it would be.
The biggest factor was increasing number of threads in the tf.train.batch from the default of 1. This one change made a huge difference, cutting the training time down to about 1/4 of what it had been.
I also experimented with moving some pre-processing operations around, including resizing the images individually when loading them and after being batched. This had negligible effects, but resizing them individually was slightly faster than doing it as a batch. In general I found that the more pre-processing operations I moved to the queue (and the CPU) the better the performance.
This version still trains at about 1/2 the speed the tfrecords version did, which is a big difference, but the size of the training set has increased by orders of magnitude so I guess I can live with it.
The code is available on my GitHub.
June 6, 2018, 1:32 p.m.
I am continuing to work with the CBIS-DDSM datasets and recently decided to take a new direction with the training data. Previously I had been locally segmenting the raw scans into images of varying sizes and writing those images to tfrecords to use as training data. I started by classifying the images by pathology with categorical labels, and while I got decent results using this approach, the models performed terribly on images from different datasets and on full-size images. I suspected the model was using features of the images that were not related to the actual ROIs to make its predictions, such as the amount of contrast or presence of extremely high pixel values.
To address this I started using the masks as labels and training the model to do segmentation of the images into normal and ROI. This had the added advantage of allowing me to exclude images from the DDSM dataset and only use CBIS-DDSM images which eliminated the features I believed the previous models had been relying on, as the DDSM and CBIS-DDSM datasets had substantially different variances, mins, maxes and means. The disadvantage of this approach was that the dataset was double the size due to the fact that the labels are now the same size as the images.
I started with a dataset of 320x320 images, however models trained on this dataset often had trouble with images which had bright patches running of the edge of the image and images with high contrast, misclassifying the bright patches as positive. To attempt to address this I started training the model on 320x320 images, and then switched to another dataset of 640x640 images after training through 50 or so epochs.
The dataset of 640x640 images only had 13,000 training examples in it, about 1/3 the number of examples in the 320x320 dataset, but was still larger due to the fact that each example and label is four times the size of the 320x320 images. I considered making another dataset with either more or larger images, but saw that this process could continue indefinitely as I had to keep creating new datasets of larger and larger size.
Instead I decided to create one new dataset which could be used indefinitely, for all purposes. To do this I loaded each image in the CBIS-DDSM dataset into Python. While the JPEGS are RGB, the images are grayscale so I only kept one channel of each image. I Some images have multiple masks, and rather than have multiple versions of each image with different masks, which could confuse the model, I combined all masks for each image into one mask, and then added that as the second channel of each image. In order to be able to save the array as an image I added a third channel of all 0s. Each new images was then saved as a PNG.
The resulting dataset is about 12GB, about four times the size of the largest tfrecords dataset, but the entirety of the CBIS-DDSM dataset (minus a few images which had masks of incorrect sizes and were discarded) is now represented. Now, in my model, I load each full image and then take a random crop of it and use that as training data. Since the mask is part of the image I can use TensorFlow's random crop function to crop the full image, and then separate the channels into the training example and it's label.
This not only increases the size of the training data set exponentially, but since my model is fully convolutional, I can also easily change the crop size without having to create a new dataset.
The major problem with this approach is that the mean of the labels is very low - around 0.015 - meaning that only 1% of the pixels have a positive label and the rest are negative. The previous dataset had a mean of 0.05. This will be addressed by raising the cross entropy weight from 20 to 75 so that the model doesn't just predict everything as negative. When creating the images I had trimmed as much background as possible from them to avoid having a large amount of training images of pure black, but still the random cropping produces a large number of images with little to no actual content.
At the moment I am uploading the data to S3 which should take another couple days. Once this is done I will attempt to train on this new dataset and see if the empty images cause major problems.
May 28, 2018, 9:38 a.m.
My original work with the DDSM and CBIS-DDSM dataset yielded good accuracy and recall on the test and validation data, but the model didn't perform so well when applied to the MIAS images, which came from a completely different dataset. Additional analysis of the images indicated that the negative images (from the DDSM) and the positive images (from the CBIS-DDSM) were different in some subtle but important ways:
We had become concerned about point 2 when we discovered that increasing the contrast of any image made it more likely to be predicted as positive and discovered point 1 while investigating this further. When applying our fully convolutional model trained on the combined data to complete scans, rather than the 299x299 images we had trained on, we noticed that more than 50% of the sections of a positive image were predicted as positive, even if the ROI was, in fact, only present in one section. This indicates that the model was using some feature of the images other than the ROI in its prediction.
When starting this project, we had initially planned to segment the CBIS-DDSM images and use images which did not contain an ROI as negative images, but we were not certain that there were not differences in the tissue of positive and negative scans which might make this approach not generalize to completely negative scans. When we realized that the scans had been pre-processed differently we attempted to adjust the negative images in such a way as to make them more similar to the positive images but were unable to do so without knowledge of how they had been processed.
Our solution to both of these issues was to train the model to do the segmentation of the scans rather than simple classification, using the masks as the labels. This approach had several advantages:
We recreated the model to do semantic segmentation by removing the last "fully connected" layer (which were implemented as a 1x1 convolution) and the logits layer and upsampling the results with transpose convolutions. In order for the upsampling to work properly we needed to have the size of the images be a multiple of 2 so that the dimension reduction could be properly undone, so we used images of size 320x320.
We were able to get fairly good results training on this data with a pixel level accuracy of about 90% and a pixel recall of 70%. The image level accuracy and recall were 70% and 87%, respectively. While these results were respectable, we noticed certain patterns of incorrect predictions. Images which were mostly dark, with patches that were much brighter, tended to have the bright patches predicted positive regardless of the actual label. This pattern was mostly observed when the bright patch ran off the edge of the image.
We know that the context of an ROI is important in detecting and diagnosing it, and we suspected that in the absence of context the model was predicting any patch substantially brighter than it's surroundings to be positive. While for cancer detection, it is better to make a false positive than a false negative we thought that this pattern might become problematic when applying the model to images larger than those it was trained on. To address this issue we decided to create a dataset of larger images and continue training our model on those.
We created a dataset of 640x640 images and adjusted our existing model to take those as input. As the model is fully convolutional we can restore the model trained on 320x320 images and continue training it on the larger images with no problems, which we are currently in the process of doing. If the results of this are promising we may create another dataset of even larger images are fine-tune this model on those images until we have a model which takes complete images as input.
May 23, 2018, 10:02 a.m.
For a course I was taking at EPFL I was working on classifying images from the DDSM dataset with ConvNets. I had some success, although not as much as I would have liked, and I posted an edited version of my report on Medium.
Although the course is over I am still working on this project, attempting to fix some of the issues that came up during the first stage.