May 17, 2018, midnight
By : Eric A. Scuccimarra
I was training a ConvNet and everything was working fine during training, but when I evaluated the model on the validation data I was getting NaN for the cross entropy. My first thought was that the cross entropy was attempting to take the log of 0, so I added a small epsilon of 1e-10 to the logits to address that, but it did not fix the problem.
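The post doesn't show the loss code, but the usual form of that epsilon fix looks roughly like this (a minimal sketch; the placeholder shapes and names are mine, not from the original network):

import tensorflow as tf

# Hypothetical shapes -- the post does not show the network definition.
logits = tf.placeholder(tf.float32, [None, 10])
labels = tf.placeholder(tf.float32, [None, 10])  # one-hot

# A small constant keeps the argument of the log strictly positive,
# so log(0) is never evaluated.
eps = 1e-10
probs = tf.nn.softmax(logits)
cross_entropy = -tf.reduce_sum(labels * tf.log(probs + eps), axis=1)
loss = tf.reduce_mean(cross_entropy)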
Further investigation indicated that the NaNs were being introduced somewhere early in the network, in one of the convolutional layers. I checked the validation and training data to make sure there wasn't some fundamental difference between the two, thinking that maybe one was being pre-processed differently than the other, but that was not the case.
In my graph I am using tf.metrics and gathering all of the update ops into one op to be executed during training with:
extra_update_ops = tf.get_collection(tf.GraphKeys.UPDATE_OPS)
Also gathered into this op were the updates to the batch norms. I had done this many times before with no problems, so I never thought it could be a factor, but when I removed the extra update op from the evaluation code the problem went away. Running the ops that update my metrics individually caused no problems.
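For context, here is a sketch of that pattern with an illustrative model; none of this is the actual code from the post, and the names are placeholders:

import tensorflow as tf

# Illustrative model -- not the actual network from the post.
x = tf.placeholder(tf.float32, [None, 32, 32, 3])
y = tf.placeholder(tf.int64, [None])
is_training = tf.placeholder(tf.bool)

net = tf.layers.conv2d(x, 32, 3, activation=tf.nn.relu)
net = tf.layers.batch_normalization(net, training=is_training)
logits = tf.layers.dense(tf.layers.flatten(net), 10)
loss = tf.losses.sparse_softmax_cross_entropy(labels=y, logits=logits)

# Route the tf.metrics update op into the same collection as the
# batch norm moving-average updates.
accuracy, acc_update = tf.metrics.accuracy(
    labels=y, predictions=tf.argmax(logits, 1),
    updates_collections=tf.GraphKeys.UPDATE_OPS)

# Everything added to UPDATE_OPS (batch norm updates, metric updates) ends up here.
extra_update_ops = tf.get_collection(tf.GraphKeys.UPDATE_OPS)

train_op = tf.train.AdamOptimizer(1e-3).minimize(loss)

# Training step: run the update ops together with the train op, e.g.
#   sess.run([train_op, extra_update_ops], feed_dict={..., is_training: True})
# Evaluation step: run only the loss and metrics -- leaving extra_update_ops
# out of the evaluation run is what made the NaNs go away, e.g.
#   sess.run([loss, acc_update], feed_dict={..., is_training: False})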
I am not sure what the issue actually was, but I assume it had something to do with the batch normalization, or perhaps with another op created somewhere in my graph.
Update - I had been restoring the weights from a pre-trained model, and I think the restored batch norm statistics caused the problem. NOT restoring the batch norms when loading the weights seems to solve the problem completely; when I did restore them, the issue still occurred sporadically.
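One way to do that (a sketch, assuming the batch norm variables carry the default "batch_normalization" naming from tf.layers) is to build the Saver from a filtered variable list:

import tensorflow as tf

# Restore everything except the batch norm variables. Adjust the name
# filter to match how the batch norms are named in your graph.
vars_to_restore = [v for v in tf.global_variables()
                   if 'batch_normalization' not in v.name]
saver = tf.train.Saver(var_list=vars_to_restore)

# with tf.Session() as sess:
#     sess.run(tf.global_variables_initializer())
#     saver.restore(sess, '/path/to/pretrained/checkpoint')  # hypothetical path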
Labels: python , machine_learning , tensorflow