May 17, 2018, 3:34 p.m.
I was training a ConvNet and everything worked fine during training, but when I evaluated the model on the validation data I got NaN for the cross entropy. I assumed the cross entropy was trying to take the log of 0, so I added a small epsilon of 1e-10 to the logits. That did not fix the problem.
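For reference, here is a minimal sketch of the usual epsilon guard in plain Python rather than TensorFlow. The function name cross_entropy and its arguments are made up for illustration. It assumes the guard is applied to the predicted probabilities that go into the log, since that is where a 0 causes trouble.

```python
import math

EPSILON = 1e-10  # same magnitude as the epsilon mentioned above


def cross_entropy(probs, label, eps=EPSILON):
    """Cross entropy for one example.

    probs: predicted class probabilities (e.g. softmax outputs).
    label: index of the true class.
    """
    # Clip the probability into [eps, 1.0] so log() never receives 0.
    p = min(max(probs[label], eps), 1.0)
    return -math.log(p)


# A probability of exactly 0 now gives a large but finite loss.
print(cross_entropy([0.0, 1.0], 0))
```

A guard like this only protects against log(0). If a NaN is already present in the probabilities because it was produced earlier in the network, it passes straight through, which is consistent with the epsilon not helping here.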
Further investigation indicated that the NaNs were being introduced somewhere early in the network, in one of the convolutional layers. I checked the validation and training data to make sure there wasn't some fundamental difference between the two, thinking that maybe one was being pre-processed differently than the other, but that was not the case.
In my graph I am using tf.metrics and gathering all of the update ops into one op to be executed during training with:
extra_update_ops = tf.get_collection(tf.GraphKeys.UPDATE_OPS)
Also gathered into this op were the updates to the batch norms. I had done this many times before with no problems at all, so I never thought it could be a factor. But when I removed the extra update op from the evaluation code, the problem went away. Running the ops that update my metrics individually caused no problems.
I am not sure what the issue actually was, but I assume it has something to do with the batch normalization, or maybe there is another op created somewhere in my graph that caused this issue.
Update - I had been restoring the weights from a pre-trained model and I think the restored batch norms caused the problem. NOT restoring the batch norms when loading the weights seems to solve this problem completely. Otherwise the issue still occurred sporadically.
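A rough sketch of how the batch norm variables could be left out when restoring. This only shows the name-filtering step in plain Python. The helper restorable_names and the variable names are hypothetical, and the "batch_normalization" substring is an assumption about how the layers are named in the graph.

```python
def restorable_names(all_names, skip_substrings=("batch_normalization",)):
    """Return the variable names that should be restored from the checkpoint."""
    return [
        name for name in all_names
        if not any(skip in name for skip in skip_substrings)
    ]


# Hypothetical variable names, for illustration only.
names = [
    "conv1/kernel",
    "conv1/bias",
    "conv1/batch_normalization/gamma",
    "conv1/batch_normalization/moving_mean",
]

# Only the convolution weights remain; the batch norm variables are skipped.
print(restorable_names(names))
```

In TensorFlow 1.x the variables whose names survive the filter could then be passed as var_list to tf.train.Saver, so that only those are restored and the batch norms keep their fresh initial values.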
May 16, 2018, 3:20 p.m.
I was training a model on a Google Cloud instance with a Tesla K80 GPU. This particular model required more data pre-processing than normal. The model was training very slowly, with GPU usage oscillating between 0% and 75-100%. I thought the CPU was the bottleneck and was trying to put as much pre-processing on the GPU as possible.
I read TensorFlow's optimization guide, which suggested forcing the pre-processing onto the CPU by enclosing it in a tf.device block that pins it to the CPU.
Since I thought the CPU was the bottleneck I didn't think that would help, but I tried it anyway because I had no other good ideas and was surprised that it worked like magic! The GPU usage now stays constant around 95-100% while the CPU usage stays at about the same levels as before.
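A minimal sketch of pinning pre-processing to the CPU, assuming the enclosing block from the guide is TensorFlow's tf.device scope with the '/cpu:0' device string. The preprocess function and the dummy image are hypothetical stand-ins for the real input pipeline.

```python
import tensorflow as tf


def preprocess(image):
    """Hypothetical pre-processing step: random flips on a single image."""
    image = tf.image.random_flip_left_right(image)
    image = tf.image.random_flip_up_down(image)
    return image


# Everything created inside this scope is placed on the CPU,
# leaving the GPU free for the model itself.
with tf.device('/cpu:0'):
    dummy_image = tf.zeros([32, 32, 1])  # stand-in for a real input image
    processed = preprocess(dummy_image)
```

The model layers would be built outside the scope so that TensorFlow can still place them on the GPU.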
May 4, 2018, 8:41 a.m.
I have been working on a project to detect abnormalities in mammograms. I have been training it on Google Cloud instances with Nvidia Tesla K80 GPUs, which allow a model to be trained in days rather than weeks or months. However, when I tried to do online data augmentation it became a huge bottleneck because the augmentation ran on the CPU.
I had been using tf.image.random_flip_left_right and tf.image.random_flip_up_down but since those operations were run on the CPU the training slowed down to a crawl as the GPU sat idle waiting for the queue to be filled.
I found this post on Medium, Data Augmentation on GPU in Tensorflow, which uses tf.contrib.image instead of tf.image. tf.contrib.image is written to run on the GPU, so using this code allows the data augmentation to be performed on the GPU instead of the CPU and thus eliminates the bottleneck.
This has been a life saver for me. Adding it to my graph allows me to train for longer without overfitting and thus get better results.
April 23, 2018, 1:48 p.m.