July 5, 2018, 10:37 a.m.
I wrote about this paper before, but I am going to again because it has been so enormously useful to me. I am still working on segmentation of mammograms to highlight abnormalities and I recently decided to scrap the approach I had been taking to upsampling the image and start that part from scratch.
When I started I was using the earliest approach to upsampling, which was basically to take my classifier, remove the last fully-connected layer, and upsample its output back to full resolution with transpose convolutions. This worked well enough, but the network had to upsample images from 2x2x1024 to 640x640x2, and in order to do this I needed to add skip connections from the downsizing section to the upsampling section. This caused problems because the network would add features of the input image to the output, regardless of whether those features were relevant to the label. I tried to get around this by adding bottleneck layers before the skip connections in order to select only the pertinent features, but this greatly slowed down training and didn't help much, and the output ended up with a lot of weird artifacts.
In "Deconvolution and Checkerboard Artifacts", Odena et al. demonstrate that replacing transpose convolutions with nearest neighbors resizing followed by a plain convolution produces smoother images. I tried replacing a few of my transpose convolutions with resizes and the results improved.
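The uneven overlap behind those artifacts is easy to see in one dimension. The following is a minimal NumPy sketch, not the model code from this project: `transpose_conv1d` and `resize_conv1d` are illustrative helpers, and the all-ones input and kernel are assumptions chosen to make the pattern obvious.

```python
import numpy as np

def transpose_conv1d(x, kernel, stride):
    """Transpose convolution: each input value stamps a scaled copy of the kernel."""
    k = len(kernel)
    out = np.zeros((len(x) - 1) * stride + k)
    for i, v in enumerate(x):
        out[i * stride:i * stride + k] += v * kernel
    return out

def resize_conv1d(x, kernel, scale):
    """Nearest neighbors resize followed by an ordinary 'same' convolution."""
    upsampled = np.repeat(x, scale)
    return np.convolve(upsampled, kernel, mode="same")

x = np.ones(6)        # a perfectly flat input signal
kernel = np.ones(3)   # kernel size 3 is not divisible by the stride of 2

tc = transpose_conv1d(x, kernel, stride=2)
rc = resize_conv1d(x, kernel, scale=2)

# Ignore the borders and compare the interiors of the two outputs.
print("transpose conv interior:", tc[2:-2])
print("resize + conv interior: ", rc[1:-1])
```

With a flat input, the transpose convolution output alternates between two values away from the borders because neighboring kernel stamps overlap unevenly, while the resize followed by a convolution stays flat.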
Then I started reading about dilated convolutions and I started wondering why I was downsizing my input from 640x640 to 5x5 just to have to resize it back up. I removed all the fully-connected layers (which in fact were 1x1 convolutions rather than fully-connected layers) and then replaced the last max pool with a dilated convolution.
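To illustrate why a dilated convolution can stand in for a pooling step, here is a small one-dimensional NumPy sketch. The helpers `dilated_conv1d` and `max_pool1d` are hypothetical names written for this example, and the sizes are arbitrary; the point is only that dilation widens the area each output sees without shrinking the feature map.

```python
import numpy as np

def dilated_conv1d(x, kernel, dilation):
    """Zero-padded 1D convolution with gaps of `dilation` between kernel taps."""
    k = len(kernel)
    span = (k - 1) * dilation + 1   # how many inputs one output position covers
    xp = np.pad(x, span // 2)
    return np.array([
        sum(kernel[j] * xp[i + j * dilation] for j in range(k))
        for i in range(len(x))
    ])

def max_pool1d(x, size=2):
    """Non-overlapping max pooling, which shrinks the signal by `size`."""
    return x[:len(x) // size * size].reshape(-1, size).max(axis=1)

x = np.arange(20.0)
pooled = max_pool1d(x)                             # resolution is halved
dilated = dilated_conv1d(x, np.ones(3), dilation=2)  # resolution is kept

# A single impulse shows which output positions a 3-tap kernel reaches.
impulse = np.zeros(20)
impulse[10] = 1.0
reach = np.flatnonzero(dilated_conv1d(impulse, np.ones(3), dilation=2))
print(len(pooled), len(dilated), reach)
```

The three taps land two positions apart, so each output covers a span of five inputs while the output keeps the full input length.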
I replaced all of the transpose convolutions with resizes, except for the last two layers, as suggested by Odena et al., and the final transpose convolution has a stride of 1 in order to smooth out artifacts.
In the downsizing section, the current model reduces the input from 640x640x1 to 20x20x512. It is then upsampled to 320x320x32 using nearest neighbors resizing followed by plain convolutions. Finally there is a transpose convolution with a stride of 2, followed by a transpose convolution with a stride of 1, and then a softmax for the output. As an added bonus, this version of the model trains significantly faster than the one that upsampled with transpose convolutions.
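The spatial sizes in that description can be checked with a little arithmetic. This sketch assumes that every downsizing step halves the width and height and every resize doubles them, which gives five halvings (640 to 20) and four doublings (20 to 320). The post does not state the exact layer counts, so `trace_spatial_size` and its arguments are illustrative only.

```python
def trace_spatial_size(size, down_steps, resize_steps):
    """Return the side length after each stage, assuming factor-2 steps throughout."""
    sizes = [size]
    for _ in range(down_steps):      # downsizing section
        size //= 2
        sizes.append(size)
    for _ in range(resize_steps):    # nearest neighbors resize + plain convolution
        size *= 2
        sizes.append(size)
    size *= 2                        # transpose convolution with a stride of 2
    sizes.append(size)
    sizes.append(size)               # transpose convolution with a stride of 1
    return sizes

sizes = trace_spatial_size(640, down_steps=5, resize_steps=4)
print(sizes)
```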
I just started training this model, but I am fairly confident it will perform better than the previous upsampling schemes: when I extracted the last downsizing convolutional layer from the earlier model, that layer appeared closer to the label (although much smaller) than the final output did. I will update when I have actual results.
Update - After training the model for just one epoch, with the downsizing layer weights initialized from a previous model, the results are already significantly better than under the previous scheme.
May 28, 2018, 9:38 a.m.
My original work with the DDSM and CBIS-DDSM datasets yielded good accuracy and recall on the test and validation data, but the model didn't perform so well when applied to the MIAS images, which came from a completely different dataset. Additional analysis of the images indicated that the negative images (from the DDSM) and the positive images (from the CBIS-DDSM) were different in some subtle but important ways:
We had become concerned about point 2 when we discovered that increasing the contrast of any image made it more likely to be predicted as positive and discovered point 1 while investigating this further. When applying our fully convolutional model trained on the combined data to complete scans, rather than the 299x299 images we had trained on, we noticed that more than 50% of the sections of a positive image were predicted as positive, even if the ROI was, in fact, only present in one section. This indicates that the model was using some feature of the images other than the ROI in its prediction.
When starting this project, we had initially planned to segment the CBIS-DDSM images and use images which did not contain an ROI as negative images, but we could not rule out differences in the tissue of positive and negative scans that might keep this approach from generalizing to completely negative scans. When we realized that the scans had been pre-processed differently, we attempted to adjust the negative images to make them more similar to the positive images, but were unable to do so without knowledge of how they had been processed.
Our solution to both of these issues was to train the model to do the segmentation of the scans rather than simple classification, using the masks as the labels. This approach had several advantages:
We recreated the model to do semantic segmentation by removing the last "fully connected" layer (which was implemented as a 1x1 convolution) and the logits layer and upsampling the results with transpose convolutions. In order for the upsampling to work properly we needed to have the size of the images be a multiple of 2 so that the dimension reduction could be properly undone, so we used images of size 320x320.
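The size constraint can be checked numerically. This is a rough sketch assuming five halving steps that round down and five matching doublings; the real number of steps in the model is not stated here, and `survives_round_trip` is a hypothetical helper. Strictly, the size has to be divisible by 2 once for every halving step, not just once.

```python
def survives_round_trip(size, halvings):
    """True if halving `size` repeatedly and then doubling it back restores it."""
    reduced = size
    for _ in range(halvings):
        reduced //= 2                 # each downsizing step rounds down
    return reduced * 2 ** halvings == size

# 299 is the original classifier input size, 320 the size chosen for segmentation,
# and 322 is an even size that still fails after the first halving.
results = {size: survives_round_trip(size, halvings=5) for size in (299, 320, 322)}
print(results)
```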
We were able to get fairly good results training on this data with a pixel level accuracy of about 90% and a pixel recall of 70%. The image level accuracy and recall were 70% and 87%, respectively. While these results were respectable, we noticed certain patterns of incorrect predictions. Images which were mostly dark, with patches that were much brighter, tended to have the bright patches predicted positive regardless of the actual label. This pattern was mostly observed when the bright patch ran off the edge of the image.
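For reference, this is one way pixel-level and image-level metrics like these can be computed from predicted and true masks. It is a sketch on toy data, not the evaluation code used for the numbers above, and it assumes an image counts as positive when any of its pixels is positive, which the post does not spell out.

```python
import numpy as np

def accuracy_and_recall(pred, label):
    """Accuracy and recall for boolean arrays of any shape."""
    pred = np.asarray(pred, dtype=bool)
    label = np.asarray(label, dtype=bool)
    accuracy = float(np.mean(pred == label))
    positives = int(label.sum())
    recall = float((pred & label).sum() / positives) if positives else float("nan")
    return accuracy, recall

def to_image_level(masks):
    """Collapse each mask to one flag: does it contain any positive pixel? (assumption)"""
    masks = np.asarray(masks, dtype=bool)
    return masks.reshape(len(masks), -1).any(axis=1)

# Toy data: two 4x4 masks. Image 0 has a 2x2 ROI, image 1 is negative.
labels = np.zeros((2, 4, 4), dtype=bool)
labels[0, 1:3, 1:3] = True
preds = np.zeros((2, 4, 4), dtype=bool)
preds[0, 1, 1:3] = True    # finds half of the ROI
preds[1, 0, 0] = True      # one false positive pixel in the negative image

pixel_acc, pixel_recall = accuracy_and_recall(preds, labels)
image_acc, image_recall = accuracy_and_recall(to_image_level(preds), to_image_level(labels))
print(pixel_acc, pixel_recall, image_acc, image_recall)
```

The toy example also shows how the two levels can diverge: a single false positive pixel barely moves pixel accuracy but flips the whole image at the image level.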
We know that the context of an ROI is important in detecting and diagnosing it, and we suspected that in the absence of context the model was predicting any patch substantially brighter than its surroundings to be positive. While for cancer detection it is better to make a false positive than a false negative, we thought that this pattern might become problematic when applying the model to images larger than those it was trained on. To address this issue we decided to create a dataset of larger images and continue training our model on those.
We created a dataset of 640x640 images and adjusted our existing model to take those as input. As the model is fully convolutional, we can restore the model trained on 320x320 images and continue training it on the larger images with no problems, which we are currently in the process of doing. If the results of this are promising, we may create another dataset of even larger images and fine-tune this model on those images until we have a model which takes complete images as input.
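The reason the weights carry over is that a convolution kernel has no dependence on the input size. The sketch below shows this with a single random 3x3 kernel in NumPy; it is not the actual model, and `conv2d_same` is an illustrative helper. The kernel simply stands in for weights learned on the smaller images.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

def conv2d_same(image, kernel):
    """Zero-padded 2D cross-correlation; output height and width match the input."""
    kh, kw = kernel.shape
    padded = np.pad(image, ((kh // 2, kh // 2), (kw // 2, kw // 2)))
    windows = sliding_window_view(padded, (kh, kw))
    return np.einsum("ijkl,kl->ij", windows, kernel)

rng = np.random.default_rng(0)
kernel = rng.normal(size=(3, 3))      # stands in for weights trained on 320x320 inputs
large = rng.normal(size=(640, 640))
small = large[:320, :320]             # the top-left 320x320 crop of the same image

out_small = conv2d_same(small, kernel)
out_large = conv2d_same(large, kernel)
print(out_small.shape, out_large.shape)
```

The same kernel runs on both sizes unchanged, and away from the crop boundary the two outputs agree, since each output pixel only depends on its local neighborhood.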