Aug. 29, 2018, 11:52 a.m.
When doing binary image segmentation, segmenting images into foreground and background, cross entropy is far from ideal as a loss function. As these datasets tend to be highly unbalanced, with far more background pixels than foreground, the model will usually score best by predicting everything as background. I have confronted this issue during my work with mammography and my solution was to use a weighted sigmoid cross entropy loss function giving the foreground pixels higher weight than the background.
While this worked it was far from ideal, for one thing it introduced another hyperparameters - the weight - and altering the weight had a large impact on the model. Higher weights favored predicting pixels as positive, increasing recall and decreasing precision, and lowering the weight had the opposite effect. When training my models I usually began with a high weight to encourage the model to make positive predictions and gradually decayed the weight to encourage it to make negative predictions.
For these types of segmentation tasks Intersection over Union tends to be the most relevant metric as pixel level accuracy, precision and recall do not account for the overlap between predictions and ground truth. Especially for this task, where overlap can be the difference between life and death for the patient, accuracy is not as relevant as IOU. So why not use IOU as a loss function?
The reason was because IOU was not differentiable so can not be used for gradient descent. However Wang et al have written a paper - Optimizing Intersection-Over-Union in Deep Neural Networks for Image Segmentation - which provides an easy way to use IOU as a loss function. In addition, this site provides code to implement this loss function in TensorFlow.
The essence of this method is that rather than using the binary predictions to calculate IOU we use the sigmoid probability output by the logits to estimate it which allows IOU to provide gradients. At first I was skeptical of this method, mostly because I understood cross entropy better and it is more common, but after I hit a performance wall with my mammography models I decided to give it a try.
My models using cross-entropy loss had ceased to improve validation performance so I switched the loss function and trained them for a few more epochs. The validation metrics began to improve, so I decided to train a copy of the model from scratch with the IOU loss. This has been a resounding success. The IOU loss accounts for the imbalanced data, eliminating the need to weight the cross entropy. With the cross entropy loss the models usually began with recall of near 1 and precision of near 0 and then the precision would increase while the recall slowly decreased until it plateaued. With IOU loss they both start near 0 and gradually increase, which to me seems more natural.
Training with an IOU loss has two concrete benefits for this task - it has allowed the model to detect more subtle abnormalities which models trained with cross entropy loss did not detect; and it has reduced the number of false positives significantly. As the false positives are on a pixel level this effectively means that the predictions are less noisy and the shapes are more accurate.
The biggest benefit is that we are directly optimizing for our target metric rather than attempting to use an imperfect substitute which we hope will approximate the target metric. Note that this method only works for binary segmentation at the moment. It also is a bit slower than using cross entropy, but if you are doing binary segmentation the performance boost is well worth it.
July 30, 2018, 9:37 a.m.
I recently began using the early stopping feature of Light GBM, which allows you to stop training when the validation score doesn't improve for a certain number of rounds. This is especially useful if you are bagging models, as you don't need to watch each one and figure out when training should stop. The way it works is you specify a number of rounds, and if the validation score doesn't improve during that number of rounds the training is stopped and the round with the best validation score is used.
When working with this I noticed that often the best validation round is a very early round, which has a very good validation score but an incredibly low training score. As an example here is the output from a model I am currently training. Normally the training F1 gets up to the high 0.90s:
Early stopping, best iteration is:  train's macroF1: 0.525992 valid's macroF1: 0.390373
Out of at least 400 rounds of training, the best performance on the validation set was on the 7th, at which time it was performing incredibly poorly on the training data. This indicates overfitting to the validation set, which is just as bad as overfitting to the training set in that the model is not likely to generalize well.
So what to do about this issue? The obvious solution would be to provide a minimum number of rounds and begin to monitor the validation score for early stopping once that number of rounds has passed, but I don't see any way to do this through the LGB API.
I am running this code using sklearn's joblib to do parallel processing, so I have create a list of the estimators to fit and then pass that list to the parallel processing which calls a function which fits the estimator to the data and returns it. The early stopping is taken care of by LGB, so what I did is after the estimator is fit I manually get the validation results and the train performance for the best validation round. If the train performance is above a specified threshold I return the estimator as normal. If, however, the train performance is below that threshold I recursively call the function again.
The downside to this is that it is possible to get into an infinite loop, but if the thresholds are properly tuned this should be easily avoidable.
July 23, 2018, 11:10 a.m.
I recently started looking at a Kaggle Challenge about predicting poverty levels in Costa Rica. I used sklearn train_test_split to split the training data into train and validation sets and fit a few models. The first thing I noticed was that my submissions scored significantly lower than my validation sets: 0.36 on the submission vs. .96 on my validation data.
The data consists of information about individuals with the target as their poverty level. The features include both information relating to that individual as well as information for the household they live in. The data includes multiple individuals from the same household, and some exploratory data analysis indicated that most of the features were on a household level rather than the individual level.
This means that doing a random split ends up including data from the same household in both the train and validation datasets, which will result in the leakage that artificially raised my initial validation scores. This also means that my models were all tuned on a validation dataset which was essentially useless.
To fix this I did the split on unique household IDs, so no household would be included in both datasets. After re-tuning the models appropriately, the validation f1 scores had gone down from 0.96 to 0.65. The submissions scores went up to 0.41, which was not a huge increase, but it was much closer to the validation scores.
The moral of this story is never forget to make sure that your training and validation sets don't contain overlap or leakage, or the validation set becomes useless.
July 17, 2018, 7:34 a.m.
When I began working on this project my intention was to do multi-class classification of the images. To this end I built my graph with logits and a cross-entropy loss function. I soon realized that the decision to do multi-class classification was quite ambitious, and scaled back to doing binary classification into positive and negative. My goal was to implement the multi-class approach once I had the binary approach working reasonably well, so I left the cross entropy in place.
Over the months I have been working on this I have realized that, for many reasons, the multi-class classification was a bad idea. For an academic project it might have made sense, but for any sort of real world use case it made none. There is really no use to outputting a simple classification for something as important as detecting cancer. A much more useful output is the probabilities that each area of the image contains an abnormality as this could aid a radiologist in diagnosing abnormalities rather than completely replacing her. Yet for some reason I never bothered to change the output or the loss function.
The limiting factor on the size of the model has been the GPU memory of the Google Cloud instance I am training this monstrosity on. So I've been trying to optimize the model to run within the RAM constraints and train in a reasonable amount of time. Mostly this has involved trying to keep the number of parameters to a minimum, but today I was looking at the model and realized that the logits were definitely not helping the situation.
For this problem classification was absolutely the wrong approach. We aren't trying to classify the content of the image, we are trying to detect abnormalities. The negative class was not really a separate class, but the absence of any abnormalities, and the graph and the loss function should reflect this. In order to coerce the logits into an output that reflected the reality just described, I put the logits through a softmax and then discarded the negative probability - as I said the negative class doesn't really exist. However the cross entropy function does not know this and it places equal importance on the imaginary negative class as on the positive class (subject to the cross entropy weighting of course.) This means that the gradients placed equal weight on trying to find imaginary "normal" patterns, despite the fact that this information is discarded and never used.
So I reduced the logits layer to one unit, replaced the softmax activation with a sigmoid activation, and replaced the cross entropy with binary cross entropy. And the change has been more impactful than I imagined it would be. The model immediately began performing better than the same model with the logits/cross entropy structure. It seems obvious that this would be the case as now the model can focus on detecting abnormalities rather than wasting half of it's efforts on trying to detect normal patterns.
I am not sure why I waited so long to make this change and my best guess is that I was seduced by the undeniable elegance of the cross entropy loss function. For multi-class classification it is truly a thing of beauty, and I may have been blinded by that into attempting to use it in a situation it was not designed for.