June 13, 2018, 1:53 p.m.
As I continue to work on my mammography project I save a lot of time by re-using weights from models I have already trained rather than training every iteration of every model from scratch, which would be very time consuming. However a drawback to this method is that if I add a new layer or change a layer when I continue training the model the layers which have not changed are prone to overfit as they have been trained for substantially longer than the new layers.
I tried only training certain variables, but when the checkpoint is saved only the trained variables are included in it, which means that the checkpoint can not be restored as it is missing many variables. This could be overcome by restoring certain variables from one checkpoint and others from a different checkpoint, but that is overly complicated and not very convenient.
Earlier today, I had added another deconvolution layer to my model. When I trained just that layer the accuracy of the model went very high very quickly, much more quickly than training all of the layers. But then I couldn't continue training all of the layers because the checkpoint only contained the layer trained. I don't have the time to retrain the entire monstrosity from scratch, so I found an ugly hack that allows me to train mostly the layers I want to train while saving all of the weights in the checkpoint.
I create two training ops - one for all variables (train_op_1) and one for the variables I want to train (train_op_2). I run train_op_2 most of the time. But right before I save the checkpoint I do one iteration of train_op_1 which updates all layers, so all variables are saved in the checkpoint. It's not pretty, but it works and best of all, the code doesn't have to be changed depending on what I want to train. I specify whether I want to train all vars or just the subset as a command line arg and if I want to train all vars, then set train_op_2 = train_op_1.
I just ran a few quick tests with no issues, hopefully this will continue to work.
June 12, 2018, 1:34 p.m.
June 7, 2018, 9:24 a.m.
Most of the videos I see on YouTube discussing the dangers of AI and machine learning are by people who really have no idea about the subject. I recently started to watch a TED talk by a guy who said that machine learning programs wrote themselves. I had to turn it off about 30 seconds in because the guy obviously had learned about AI from watching Terminator or some other Hollywood movie. I also get pissed off when I hear Elon Musk talk about the "dangers" of AI. It amazes me that someone who is obviously intelligent is so clueless about the subject.
I just watched a discussion about AI from the World Science Festival which was notable for being an intelligent discussion of the subject by people who are actually familiar with the technology. It is rather long, but it touches on many subjects and every subject is discussed intelligently.
June 6, 2018, 3:07 p.m.
Even though I only have 1/5 of the images uploaded so far, I decided to do some tests to see if this method would work. It does, but it took quite a bit of tweaking to get performance to reasonable levels.
At first I just plugged the new dataset into the old graph, and this worked but was incredibly slow with the GPU sitting idle most of the time. I tried quite a few different methods to speed the pre-processing bottleneck up, but the solution was simpler than I had thought it would be.
The biggest factor was increasing number of threads in the tf.train.batch from the default of 1. This one change made a huge difference, cutting the training time down to about 1/4 of what it had been.
I also experimented with moving some pre-processing operations around, including resizing the images individually when loading them and after being batched. This had negligible effects, but resizing them individually was slightly faster than doing it as a batch. In general I found that the more pre-processing operations I moved to the queue (and the CPU) the better the performance.
This version still trains at about 1/2 the speed the tfrecords version did, which is a big difference, but the size of the training set has increased by orders of magnitude so I guess I can live with it.
The code is available on my GitHub.