June 6, 2018, 1:32 p.m.
I am continuing to work with the CBIS-DDSM datasets and recently decided to take a new direction with the training data. Previously I had been locally segmenting the raw scans into images of varying sizes and writing those images to tfrecords to use as training data. I started by classifying the images by pathology with categorical labels, and while I got decent results using this approach, the models performed terribly on images from different datasets and on full-size images. I suspected the model was using features of the images that were not related to the actual ROIs to make its predictions, such as the amount of contrast or presence of extremely high pixel values.
To address this I started using the masks as labels and training the model to do segmentation of the images into normal and ROI. This had the added advantage of allowing me to exclude images from the DDSM dataset and only use CBIS-DDSM images which eliminated the features I believed the previous models had been relying on, as the DDSM and CBIS-DDSM datasets had substantially different variances, mins, maxes and means. The disadvantage of this approach was that the dataset was double the size due to the fact that the labels are now the same size as the images.
I started with a dataset of 320x320 images, however models trained on this dataset often had trouble with images which had bright patches running of the edge of the image and images with high contrast, misclassifying the bright patches as positive. To attempt to address this I started training the model on 320x320 images, and then switched to another dataset of 640x640 images after training through 50 or so epochs.
The dataset of 640x640 images only had 13,000 training examples in it, about 1/3 the number of examples in the 320x320 dataset, but was still larger due to the fact that each example and label is four times the size of the 320x320 images. I considered making another dataset with either more or larger images, but saw that this process could continue indefinitely as I had to keep creating new datasets of larger and larger size.
Instead I decided to create one new dataset which could be used indefinitely, for all purposes. To do this I loaded each image in the CBIS-DDSM dataset into Python. While the JPEGS are RGB, the images are grayscale so I only kept one channel of each image. I Some images have multiple masks, and rather than have multiple versions of each image with different masks, which could confuse the model, I combined all masks for each image into one mask, and then added that as the second channel of each image. In order to be able to save the array as an image I added a third channel of all 0s. Each new images was then saved as a PNG.
The resulting dataset is about 12GB, about four times the size of the largest tfrecords dataset, but the entirety of the CBIS-DDSM dataset (minus a few images which had masks of incorrect sizes and were discarded) is now represented. Now, in my model, I load each full image and then take a random crop of it and use that as training data. Since the mask is part of the image I can use TensorFlow's random crop function to crop the full image, and then separate the channels into the training example and it's label.
This not only increases the size of the training data set exponentially, but since my model is fully convolutional, I can also easily change the crop size without having to create a new dataset.
The major problem with this approach is that the mean of the labels is very low - around 0.015 - meaning that only 1% of the pixels have a positive label and the rest are negative. The previous dataset had a mean of 0.05. This will be addressed by raising the cross entropy weight from 20 to 75 so that the model doesn't just predict everything as negative. When creating the images I had trimmed as much background as possible from them to avoid having a large amount of training images of pure black, but still the random cropping produces a large number of images with little to no actual content.
At the moment I am uploading the data to S3 which should take another couple days. Once this is done I will attempt to train on this new dataset and see if the empty images cause major problems.
May 28, 2018, 9:38 a.m.
My original work with the DDSM and CBIS-DDSM dataset yielded good accuracy and recall on the test and validation data, but the model didn't perform so well when applied to the MIAS images, which came from a completely different dataset. Additional analysis of the images indicated that the negative images (from the DDSM) and the positive images (from the CBIS-DDSM) were different in some subtle but important ways:
We had become concerned about point 2 when we discovered that increasing the contrast of any image made it more likely to be predicted as positive and discovered point 1 while investigating this further. When applying our fully convolutional model trained on the combined data to complete scans, rather than the 299x299 images we had trained on, we noticed that more than 50% of the sections of a positive image were predicted as positive, even if the ROI was, in fact, only present in one section. This indicates that the model was using some feature of the images other than the ROI in its prediction.
When starting this project, we had initially planned to segment the CBIS-DDSM images and use images which did not contain an ROI as negative images, but we were not certain that there were not differences in the tissue of positive and negative scans which might make this approach not generalize to completely negative scans. When we realized that the scans had been pre-processed differently we attempted to adjust the negative images in such a way as to make them more similar to the positive images but were unable to do so without knowledge of how they had been processed.
Our solution to both of these issues was to train the model to do the segmentation of the scans rather than simple classification, using the masks as the labels. This approach had several advantages:
We recreated the model to do semantic segmentation by removing the last "fully connected" layer (which were implemented as a 1x1 convolution) and the logits layer and upsampling the results with transpose convolutions. In order for the upsampling to work properly we needed to have the size of the images be a multiple of 2 so that the dimension reduction could be properly undone, so we used images of size 320x320.
We were able to get fairly good results training on this data with a pixel level accuracy of about 90% and a pixel recall of 70%. The image level accuracy and recall were 70% and 87%, respectively. While these results were respectable, we noticed certain patterns of incorrect predictions. Images which were mostly dark, with patches that were much brighter, tended to have the bright patches predicted positive regardless of the actual label. This pattern was mostly observed when the bright patch ran off the edge of the image.
We know that the context of an ROI is important in detecting and diagnosing it, and we suspected that in the absence of context the model was predicting any patch substantially brighter than it's surroundings to be positive. While for cancer detection, it is better to make a false positive than a false negative we thought that this pattern might become problematic when applying the model to images larger than those it was trained on. To address this issue we decided to create a dataset of larger images and continue training our model on those.
We created a dataset of 640x640 images and adjusted our existing model to take those as input. As the model is fully convolutional we can restore the model trained on 320x320 images and continue training it on the larger images with no problems, which we are currently in the process of doing. If the results of this are promising we may create another dataset of even larger images are fine-tune this model on those images until we have a model which takes complete images as input.
May 23, 2018, 10:02 a.m.
For a course I was taking at EPFL I was working on classifying images from the DDSM dataset with ConvNets. I had some success, although not as much as I would have liked, and I posted an edited version of my report on Medium.
Although the course is over I am still working on this project, attempting to fix some of the issues that came up during the first stage.
May 17, 2018, 3:34 p.m.
I was training a ConvNet and everything was working fine during training. But when I evaluated the model on the validation data I was getting NaN for the cross entropy. I thought it was the cross entropy attempting to take the log of 0 and added a small epsilon value of 1e-10 to the logits to address that. I thought that would fix the problem but it did not.
Further investigation indicated that the NaNs were being introduced somewhere early in the network, in one of the convolutional layers. I checked the validation and training data to make sure there wasn't some fundamental difference between the two, thinking that maybe one was being pre-processed differently than the other, but that was not the case.
In my graph I am using tf.metrics and gathering all of the update ops into one op to be executed during training with:
extra_update_ops = tf.get_collection(tf.GraphKeys.UPDATE_OPS)
Also gathered into this op was the updates to the batch norms. I had done this many times before with no problems at all so never thought this could be a factor. But when I removed the extra update op from the evaluation code the problem went away. Including the ops generated to update my metrics individually caused no problems.
I am not sure what the issue actually was, but I assume it has something to do with the batch normalization, or maybe there is another op created somewhere in my graph that caused this issue.
Update - I had been restoring the weights from a pre-trained model and I think the restored batch norms caused the problem. NOT restoring the batch norms when loading the weights seems to solve this problem completely. Otherwise the issue still occurred sporadically.