Sept. 22, 2019, 1:34 p.m.
I started working on a variational auto-encoder (VAE) for faces a few months ago. I was easily able to make a non-variational autoencoder to reproduce images that worked incredibly well, but since it was not variational there wasn't much you could do with it other than compress images. I wanted to be able to play with interpolation and such, and for that you need a VAE. So I converted my auto-encoder to a variational one, but the problem was that the resulting images were very blurry and the quality wasn't all that great. So I thought maybe I could attach a GAN to this to make the images look more realistic. And I tried that but unfortunately it didn't work very well, the GAN was trying to produce to generate images of what it though were faces will the autoencoder was trying to reproduce its input, as seen in the images below:
After fighting with this for a few months I decided to try to make sure that the GAN was working properly before I added on the autoencoder, and although I had to fight with the GAN quite a bit and was never able to get it to generate really high quality images, I was sure that it was working properly. So I decided to try to hook it up to the autoencoder again.
Then I discovered this paper Autoencoding beyond pixels using a learned similarity metric, which does the same thing I was trying to do but in a much smarter way. What I had been doing was using the MSE between the input and the generated images for my VAE loss, and training both the encoder and the decoder with the GAN loss. Obviously this did not work.
What they do in the paper is basically separate the encoder and leave the decoder and discriminator as the GAN, which is trained as usual. I had tried to think of ways to train the encoder and decoder separately, but my ideas were much more primitive and didn't work at all. What they do that is train the encoder separately, using the KLD loss and - this is the brilliant part - instead of using MSE between the input and the recreation they use the MSE between a feature map from an intermediate layer of the discriminator for the real and faked images. So rather than trying to produce an exact duplicate of the input, the encoder is trying to produce something that the discriminator thinks is close to the input.
It took me a few hours to rewrite my code to make use of this new loss, and come up with a version that would be able to run without having to keep all of the graphs in memory and be able to train in a reasonable amount of time, and I think everything is finally working. Hopefully this works better than my previous attempts, and next time I will try to remember to review the literature before trying to implement a new idea on my own.
Aug. 24, 2019, 7:20 a.m.
I had been trying to train my autoencoder with a GAN component on and off for a couple of months and it just didn't seem to be working very well. I thought that maybe the autoencoder and the discriminator errors were somehow cancelling each other out or something. Just for the hell of it I decided to try to use the discriminator to optimize a reconstructed image to look real, just to see what the result would be. Instead of optimizing the weights, I created a Variable of the input and optimized that instead. To my surprise I ended up with weird splotches of primary colors against a white background, it actually made the image look less and less real rather than more. After seeing that I decided that there must be some major problem with my code so I went through it in greater detail.
I decided to train all three networks from scratch (the three being the encoder, the decoder and the discriminator) to see what would happen. I was surprised to see that the generator did not seem to be learning ANYTHING and neither did the discriminator. I found a tutorial on creating a GAN in PyTorch and I went through the training code to see how it differed from mine.
I had written my code to optimize it for speed, training the autoencoder without the GAN already took about 4 hours per epoch on a (free) K80 on Colab so I didn't want to slow that down much more, so I tried to minimize the numebr of times data had to be passed through the networks. The tutorial did not do that. First it ran a batch of real data through the discriminator, computed the gradients but did NOT back propagate them. Then it used the generator to generate a batch of faked data, passed that through the discriminator, computed the gradients, added them to the gradients from the first batch and THEN did the back prop. Then it ran the same batch of faked data through the discriminator again, and used that to update the generator. This was different from my code in several major ways:
After updating my code to bring it more in line with the tutorial both networks began to learn, I think that major change was detaching the reconstructed images before putting them through the discriminator. However I noticed a few strange things regarding the discriminator batches:
I read in a couple of places that using separate batches was a trick to make GANs train better, but no one really had an explanation for why this worked. What I am currently doing it using separate batches most of the time, before every n batches I use a single batch to encourage the discriminator to learn a bit more. I've tested values for n of 8, 16, 32 and 64. Most of those seemed to result in the worst of both worlds, nothing really seemed to improve, but with n = 64 the autoencoder loss is again decreasing, although slowly, and the discriminator accuracy is hovering around 52% rather than the 49-50% it was at using all separate batches.
To me using separate batches doesn't intuitively make sense, I don't see how the network can really learn to differentiate between classes when it only sees one class at a time. Of course the gradients are then added, and the differences should cancel out, with what's left indicating how to differentiate the classes; but to me it seems much more efficient to learn from mixed batches. One would never consider training a network on, say, the CIFAR dataset with each batch consisting exclusively of a single class. Maybe that's the point, to slow down the discriminator's learning enough for the generator to keep up? Anyway I will continue to experiment and see what works and what doesn't work.
Aug. 3, 2019, 7:33 a.m.
I am still working on my face autoencoder in my spare time, although I have much less spare time lately. My non-variational autoencoder works great - it can very accurately reconstruct any face in my dataset of 400,000 faces, but it doesn't work at all for interpolation or anything like that. So I have also been trying to train a variational autoencoder, but it has a lot more difficulty learning.
For a face which is roughly centered and looking in the general direction of the camera it can do a somewhat decent job, but if the picture is off in any way - there is another face off to the side, there is something blocking the face, the face is at a strange angle, etc it does a pretty bad job. And since I want to try to use this for interpolation training it on these bad faces doesn't really help anything.
One of the biggest datasets I am using is this one from ETHZ. The dataset was created to train a network to predict the age of the person, and while the images are all of good quality it does include many images that have some of the issues I mentioned above, as well as pictures that are not faces at all - like drawings or cartoons. Other datasets I am using consist entirely of properly cropped faces as I described above, but this dataset is almost 200k images, so omitting it completely significantly reduces the size of my training data.
The other day I decided I needed to improve the quality of my training dataset if I ever want to get this variational autoencoder properly trained, and to do that I need to filter out the bad images from the ETHZ IMDB dataset. They had already created the dataset using face detectors, but I want to remove faces that have certain attributes:
I started trying to curate them manually, but after going through 500 images of the 200k I realized that would not be feasible. It would be easy to train a neural network to classify the faces, but that would require training data, but that still means manually classifying the faces. So, what I did is I took another dataset of faces that were all good and added about 700 bad faces from the IMDB dataset for a total size of about 7000 images and made a new dataset. Then I took a pre-trained discriminator I had previously used as part of a GAN to try to generate faces and retrained it to classify the faces as good or bad.
I ran this for about 10 epochs, until it was achieving very good accuracy, and then I used it to evaluate the IMDB dataset. Any image which it gave a less than 0.03 probability of being good I moved into the bad training dataset, and any images which it gave a 0.99 probability of being good I moved to the good training dataset. Then I continued training it and so on and so on.
This is called weak supervision or semi-supervised learning, and it works a lot better than I thought it would. After training for a few hours, the images which are moved all seem to be correctly classified, and after each iteration the size of the training dataset grows to allow the network to continue learning. Since I only move images which have very high or very low probabilities, the risk of a misclassification should be relatively low, and I expect to be able to completely sort the IMDB dataset by the end of tomorrow, maybe even sooner. What would have taken weeks or longer to do manually has been reduced to days thanks to transfer learning and weak supervision!
June 18, 2019, 9:02 a.m.
In one of my classes last semester we had to make a variational auto-encoder for the MNIST dataset. MNIST is a pretty simple and small dataset so that wasn't very difficult. Looking for something more challenging I decided to try to make a face autoencoder.
The first challenge was finding a suitable dataset. There are several public datasets, but they are inconsistent in terms of image size, shape and other important details. I ended up using several datasets - the celebA was the easiest one to use without having to do much work on it. But it is relatively small, consisting of only 200,000 images. I ended up using the ETHZ dataset from Wikipedia and IMDB, which are quite large but the images are all different shapes and sizes. So I had to do some pre-processing of the data to remove unusable images. I removed any images small than the input size I was using of 162x190, and I also removed any images that were wider than they were higher or bigger than 500x500. This dataset also contains some images which have been stretched out at the edges to bizarre proportions. I removed these by deleting any images where the 10th row or column was identical to the first row or column. Finally I resized the large images down to a more reasonable size. This resulted in a dataset of about 390,000 faces, all of which were roughly the right size and shape.
I decided to train my autoencoder as a normal autoencoder rather than a variational one, mostly due to the extra overhead required for the variational layers. I used a latent space of size 4096, and after training for 12 hours a day for a few weeks on Google CoLab the results were surprisingly accurate. Once the model seemed to start overfitting the training data I stopped training it so I could play around with it.
I wanted to try to do interpolation between faces, which was when I realized what the advantage of making the auto-encoder variational was. When I tried to interpolate between faces, because the latent space was not continuous, rather than working as one would expect it was more like adding the faces together. Training the autoencoders as variational forces the latent space to be continuous which makes interpolation possible, so I am currently trying to retrain the model as variational.
Since the non-variational autoencoder had started to overfit the training data I wanted to try to find other ways to improve the quality, so I added an discriminative network which I am also currently training as a GAN, using the autoencoder as the generator. I will update with results of that when I have results worth reporting.
The notebooks used are available on GitHub, and the datasets I used are on Google Cloud Storage, although due to their size and the cost of downloading them they are not publicly available.