Eric Antoine Scuccimarra

Creating a Dataset of Faces for my Autoencoder with Semi-supervised Learning

Aug. 3, 2019, 7:33 a.m.

I am still working on my face autoencoder in my spare time, although I have much less spare time lately. My non-variational autoencoder works great - it can very accurately reconstruct any face in my dataset of 400,000 faces, but it doesn't work at all for interpolation or anything like that. So I have also been trying to train a variational autoencoder, but it has a lot more difficulty learning.

For a face which is roughly centered and looking in the general direction of the camera it can do a somewhat decent job, but if the picture is off in any way (there is another face off to the side, something is blocking the face, the face is at a strange angle, etc.) it does a pretty bad job. And since I want to use this for interpolation, training it on these bad faces doesn't really help anything.

One of the biggest datasets I am using is this one from ETHZ. The dataset was created to train a network to predict the age of the person, and while the images are all of good quality it does include many images that have some of the issues I mentioned above, as well as pictures that are not faces at all - like drawings or cartoons. Other datasets I am using consist entirely of properly cropped faces as I described above, but this dataset is almost 200k images, so omitting it completely significantly reduces the size of my training data.

The other day I decided I needed to improve the quality of my training dataset if I ever want to get this variational autoencoder properly trained, and to do that I need to filter out the bad images from the ETHZ IMDB dataset. They had already created the dataset using face detectors, but I want to remove faces that have certain attributes:

Multiple faces or parts of faces in the image
Images with something blocking part of the face
Images where the faces are not generally facing forward, such as profiles

I started trying to curate them manually, but after going through 500 images of the 200k I realized that would not be feasible. It would be easy to train a neural network to classify the faces, but that would require labeled training data, which still means manually classifying faces. So what I did is take another dataset of faces that were all good and add about 700 bad faces from the IMDB dataset, making a new dataset with a total size of about 7000 images. Then I took a pre-trained discriminator I had previously used as part of a GAN to try to generate faces and retrained it to classify the faces as good or bad.

I ran this for about 10 epochs, until it was achieving very good accuracy, and then I used it to evaluate the IMDB dataset. Any image to which it gave a probability of less than 0.03 of being good I moved into the bad training dataset, and any image to which it gave a probability of 0.99 or more of being good I moved to the good training dataset. Then I continued training it on the enlarged dataset, evaluated the remaining images again, and so on.

This is called weak supervision or semi-supervised learning, and it works a lot better than I thought it would. After training for a few hours, the images which are moved all seem to be correctly classified, and after each iteration the size of the training dataset grows to allow the network to continue learning. Since I only move images which have very high or very low probabilities, the risk of a misclassification should be relatively low, and I expect to be able to completely sort the IMDB dataset by the end of tomorrow, maybe even sooner. What would have taken weeks or longer to do manually has been reduced to days thanks to transfer learning and weak supervision!
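The sorting step in each iteration can be sketched roughly as follows. This is a minimal, framework-free sketch: the function name, the file names and the example scores are all made up for illustration, and treating 0.99 as an inclusive lower bound is an assumption. In practice the scores would come from the retrained discriminator.

```python
def split_by_confidence(scores, low=0.03, high=0.99):
    """Partition image ids by the classifier's predicted probability of 'good'.

    scores maps an image id to P(good). Only very confident predictions are
    moved into the training data; everything else waits for the next round.
    """
    good, bad, unsure = [], [], []
    for image_id, p_good in scores.items():
        if p_good >= high:
            good.append(image_id)      # confident "good" face
        elif p_good < low:
            bad.append(image_id)       # confident "bad" face
        else:
            unsure.append(image_id)    # left unlabeled for now
    return good, bad, unsure


# Hypothetical scores for four images (not real model output)
scores = {"a.jpg": 0.995, "b.jpg": 0.01, "c.jpg": 0.60, "d.jpg": 0.99}
good, bad, unsure = split_by_confidence(scores)
```

The images in `unsure` stay in the unsorted pool and are scored again after the next round of training.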

Labels: coding , data_science , machine_learning , pytorch , autoencoders

CatBoost

Jan. 10, 2019, 2:01 p.m.

Usually when you think of a gradient boosted decision tree you think of XGBoost or LightGBM. I'd heard of CatBoost but I'd never tried it and it didn't seem too popular. I was looking at a Kaggle competition which had a lot of categorical data and I had squeezed just about every drop of performance I could out of LGBM so I decided to give CatBoost a try. I was extremely impressed.

Out of the box, with all default parameters, CatBoost scored better than the LGBM I had spent about a week tuning. CatBoost trained significantly slower than LGBM, but it will run on a GPU and doing so makes it train just slightly slower than the LGBM. Unlike XGBoost it can handle categorical data, which is nice because in this case we have far too many categories to do one-hot encoding. I've read the documentation several times but I am still unclear as to how exactly it encodes the categorical data, but whatever it does works very well.

I am just beginning to try to tune the hyperparameters so it is unclear how much (if any) extra performance I'll be able to squeeze out of it, but I am very, very impressed with CatBoost and I highly recommend it for any datasets which contain categorical data. Thank you Yandex!

Labels: coding , data_science , machine_learning , kaggle , catboost

Exercise Log

Nov. 27, 2018, 4:27 p.m.

I exercise quite a lot and I have not been able to find an app to keep track of it which satisfies all of my criteria. Most fitness trackers are geared towards cardio and I also do a lot of strength training. After spending a year trying to make do with combinations of various fitness trackers and other apps I decided to just write my own, which could do everything I wanted and could show all of the reports I wanted.

I did that and after using it for a few weeks put it online at workout-log.com. It's not fancy and it is quite likely very buggy at this point, but it is open to anyone who wants to use it.

It's written with Django and jQuery and uses ChartJS for the charts.

Labels: python , django , data_science , machine_learning

Early Stopping

July 30, 2018, 9:37 a.m.

I recently began using the early stopping feature of LightGBM, which allows you to stop training when the validation score doesn't improve for a certain number of rounds. This is especially useful if you are bagging models, as you don't need to watch each one and figure out when training should stop. The way it works is you specify a number of rounds, and if the validation score doesn't improve during that number of rounds the training is stopped and the round with the best validation score is used.

When working with this I noticed that often the best validation round is a very early round, which has a very good validation score but an incredibly low training score. As an example here is the output from a model I am currently training. Normally the training F1 gets up to the high 0.90s:

Early stopping, best iteration is:
[7]	train's macroF1: 0.525992	valid's macroF1: 0.390373

Out of at least 400 rounds of training, the best performance on the validation set was on the 7th, at which time it was performing incredibly poorly on the training data. This indicates overfitting to the validation set, which is just as bad as overfitting to the training set in that the model is not likely to generalize well.

So what to do about this issue? The obvious solution would be to provide a minimum number of rounds and begin to monitor the validation score for early stopping once that number of rounds has passed, but I don't see any way to do this through the LGB API.

I am running this code using sklearn's joblib to do parallel processing, so I create a list of the estimators to fit and then pass that list to the parallel processing, which calls a function that fits the estimator to the data and returns it. The early stopping is taken care of by LGB, so what I did is after the estimator is fit I manually get the validation results and the train performance for the best validation round. If the train performance is above a specified threshold I return the estimator as normal. If, however, the train performance is below that threshold I recursively call the function again.

The downside to this is that it is possible to get into an infinite loop, but if the thresholds are properly tuned this should be easily avoidable.
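A rough sketch of that retry logic is below. It differs from what I described in one way: it uses a bounded loop instead of recursion, so it cannot run forever. `fit_once` is a hypothetical callable that fits one estimator and returns it together with the train score at its best validation round; the fake scores are made up so the example runs without LightGBM.

```python
def fit_with_min_train_score(fit_once, min_train_score, max_retries=5):
    """Refit until the train score at the best validation round is high enough.

    fit_once: hypothetical callable returning (model, train_score_at_best_round).
    Returns (model, train_score, attempts). After max_retries attempts the last
    fit is returned even if it is still below the threshold.
    """
    for attempt in range(1, max_retries + 1):
        model, train_score = fit_once()
        if train_score >= min_train_score:
            break  # early stopping picked a round where the model had actually learned
    return model, train_score, attempt


# Stand-in for real fits: the first two stop far too early, the third is fine
fake_scores = iter([0.53, 0.61, 0.97])

def fake_fit():
    score = next(fake_scores)
    return {"train_f1": score}, score

model, score, attempts = fit_with_min_train_score(fake_fit, min_train_score=0.9)
```

With a real estimator, `fit_once` would do the fit with early stopping and then look up the train metric for the best round in the recorded evaluation results.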

Labels: coding , data_science , machine_learning , lightgbm

The Importance of Cross Validation

July 23, 2018, 11:10 a.m.

I recently started looking at a Kaggle Challenge about predicting poverty levels in Costa Rica. I used sklearn's train_test_split to split the training data into train and validation sets and fit a few models. The first thing I noticed was that my submissions scored significantly lower than my validation scores: 0.36 on the submission vs. 0.96 on my validation data.

The data consists of information about individuals with the target as their poverty level. The features include both information relating to that individual as well as information for the household they live in. The data includes multiple individuals from the same household, and some exploratory data analysis indicated that most of the features were on a household level rather than the individual level.

This means that doing a random split ends up including data from the same household in both the train and validation datasets, which will result in the leakage that artificially raised my initial validation scores. This also means that my models were all tuned on a validation dataset which was essentially useless.

To fix this I did the split on unique household IDs, so no household would be included in both datasets. After re-tuning the models appropriately, the validation F1 scores had gone down from 0.96 to 0.65. The submission scores went up to 0.41, which was not a huge increase, but it was much closer to the validation scores.
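Splitting on household IDs can be sketched like this. It is a minimal standard-library sketch, not the code I actually ran: the `household_id` key, the toy rows and the function name are assumptions for illustration.

```python
import random


def split_by_group(rows, group_key, valid_fraction=0.2, seed=0):
    """Split rows so that every group lands entirely in train or entirely in valid."""
    groups = sorted({row[group_key] for row in rows})
    rng = random.Random(seed)
    rng.shuffle(groups)
    n_valid = max(1, int(len(groups) * valid_fraction))
    valid_groups = set(groups[:n_valid])
    train = [row for row in rows if row[group_key] not in valid_groups]
    valid = [row for row in rows if row[group_key] in valid_groups]
    return train, valid


# Toy data: several individuals per household (hypothetical column names)
rows = [
    {"household_id": "h1", "person": 1},
    {"household_id": "h1", "person": 2},
    {"household_id": "h2", "person": 3},
    {"household_id": "h3", "person": 4},
    {"household_id": "h3", "person": 5},
    {"household_id": "h4", "person": 6},
]
train, valid = split_by_group(rows, "household_id", valid_fraction=0.25)
train_ids = {row["household_id"] for row in train}
valid_ids = {row["household_id"] for row in valid}
```

scikit-learn also ships group-aware splitters such as GroupShuffleSplit and GroupKFold, which do the same thing for arrays.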

The moral of this story is never forget to make sure that your training and validation sets don't contain overlap or leakage, or the validation set becomes useless.

Labels: data_science , machine_learning , kaggle

More on Deconvolution

July 5, 2018, 10:37 a.m.

I wrote about this paper before, but I am going to again because it has been so enormously useful to me. I am still working on segmentation of mammograms to highlight abnormalities and I recently decided to scrap the approach I had been taking to upsampling the image and start that part from scratch.

When I started I had been using the earliest approach to upsampling, which was basically to take my classifier, remove the last fully-connected layer and upsample its output back to full resolution with transpose convolutions. This worked well enough, but the network had to upsample images from 2x2x1024 to 640x640x2, and in order to do this I needed to add skip connections from the downsizing section to the upsampling section. This caused problems because the network would add features of the input image to the output, regardless of whether the features were relevant to the label. I tried to get around this by adding bottleneck layers before the skip connections in order to select only the pertinent features, but this greatly slowed down training and didn't help much, and the output ended up with a lot of weird artifacts.

In "Deconvolution and Checkerboard Artifacts", Odena et al. have demonstrated that replacing transpose convolutions with nearest neighbors resizing followed by a plain convolution produces smoother images than using transpose convolutions. I tried replacing a few of my transpose convolutions with resizes and the results improved.

Then I started reading about dilated convolutions and I started wondering why I was downsizing my input from 640x640 to 5x5 just to have to resize it back up. I removed all the fully-connected layers (which in fact were 1x1 convolutions rather than fully-connected layers) and then replaced the last max pool with a dilated convolution.

I replaced all of the transpose convolutions with resizes, except for the last two layers, as suggested by Odena et al., and the final transpose convolution has a stride of 1 in order to smooth out artifacts.

In the downsizing section, the current model reduces the input from 640x640x1 to 20x20x512, then it is upsampled to 320x320x32 by using nearest neighbors resizing followed by plain convolutions. Finally there is a transpose convolution with a stride of 2 followed by a transpose convolution with a stride of 1 and then a softmax for the output. As an added bonus, this version of the model trains significantly faster than upsampling with transpose convolutions.
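To make the resizing step concrete, here is a tiny pure-Python sketch of nearest neighbors upsampling on a 2-D grid, followed by the size arithmetic from the paragraph above. It leaves out the learned convolution that follows each resize in the real model, and the assumption that 20 to 320 is done as four 2x resizes is mine.

```python
def resize_nearest(image, factor):
    """Upsample a 2-D list by an integer factor, copying each value into a factor x factor block."""
    return [
        [row[x // factor] for x in range(len(row) * factor)]
        for row in image
        for _ in range(factor)
    ]


small = [[1, 2],
         [3, 4]]
big = resize_nearest(small, 2)

# Spatial sizes: 20 -> 320 through resizes, then a stride-2 transpose convolution
size = 20
for _ in range(4):              # assumption: four 2x resizes
    size *= 2
after_resizes = size            # 320
final_size = after_resizes * 2  # the stride-2 transpose convolution doubles it to 640
```

Because every output pixel is a plain copy of one input pixel, the resize itself cannot produce the uneven overlap that causes checkerboard patterns with strided transpose convolutions.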

I just started training this model, but I am fairly confident it will perform better than previous upsampling schemes as when I extracted the last downsizing convolutional layer from the model that layer appeared closer to the label (although much smaller) than the final output did. I will update when I have actual results.

Update - After training the model for just one epoch, with the downsizing layer weights initialized from a previous model, the results are already significantly better than under the previous scheme.

Labels: coding , data_science , tensorflow , mammography , convnets , ddsm

Only Training Certain Variables in TensorFlow

June 13, 2018, 1:53 p.m.

As I continue to work on my mammography project I save a lot of time by re-using weights from models I have already trained rather than training every iteration of every model from scratch, which would be very time consuming. However, a drawback to this method is that if I add a new layer or change a layer, when I continue training the model the layers which have not changed are prone to overfit, as they have been trained for substantially longer than the new layers.

I tried only training certain variables, but when the checkpoint is saved only the trained variables are included in it, which means that the checkpoint cannot be restored as it is missing many variables. This could be overcome by restoring certain variables from one checkpoint and others from a different checkpoint, but that is overly complicated and not very convenient.

Earlier today, I had added another deconvolution layer to my model. When I trained just that layer the accuracy of the model went very high very quickly, much more quickly than training all of the layers. But then I couldn't continue training all of the layers because the checkpoint only contained the layer trained. I don't have the time to retrain the entire monstrosity from scratch, so I found an ugly hack that allows me to train mostly the layers I want to train while saving all of the weights in the checkpoint.

I create two training ops - one for all variables (train_op_1) and one for the variables I want to train (train_op_2). I run train_op_2 most of the time. But right before I save the checkpoint I do one iteration of train_op_1 which updates all layers, so all variables are saved in the checkpoint. It's not pretty, but it works and best of all, the code doesn't have to be changed depending on what I want to train. I specify whether I want to train all vars or just the subset as a command line arg and if I want to train all vars, then set train_op_2 = train_op_1.

I just ran a few quick tests with no issues, hopefully this will continue to work.
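The scheduling part of this hack can be sketched without TensorFlow. This is only the control flow: the function name, the `save_every` interval and the convention that a checkpoint is written after every `save_every`-th step are assumptions, and the returned strings stand in for the real ops.

```python
def pick_train_op(step, save_every, train_all):
    """Return the name of the op to run at this step.

    'train_op_1' updates all variables, 'train_op_2' only the chosen subset.
    """
    if train_all:
        return "train_op_1"   # same effect as setting train_op_2 = train_op_1
    if step % save_every == save_every - 1:
        return "train_op_1"   # last step before a checkpoint: touch every variable
    return "train_op_2"


# Ten steps with a checkpoint after every fifth step, training only the subset
schedule = [pick_train_op(step, save_every=5, train_all=False) for step in range(10)]
```

In the real training loop the chosen op would be passed to sess.run() and the saver would be called right after the steps that return train_op_1.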

Labels: python , data_science , machine_learning , tensorflow

TensorFlow Queues and Validation

March 22, 2018, 1:36 p.m.

I am currently working with a dataset that is far too large to store in memory so I am using tfrecords and queues to feed the data in. This works great, except that I was not able to evaluate the model on the validation dataset every epoch like I usually do.

After spending quite a bit of time trying to figure out ways around this, none of which worked, I found an easy solution that does work.

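import tensorflow as tf  # TensorFlow 1.x queue API
# read a single (image, label) example from the training tfrecords file
image, label = read_and_decode_single_example([train_path])
# default inputs: shuffled batches drawn from the queue
X_def, y_def = tf.train.shuffle_batch([image, label], batch_size=8, capacity=2000, min_after_dequeue=1000)
# the placeholders fall back to the queue batches unless a feed_dict overrides them
X = tf.placeholder_with_default(X_def, shape=[None, 299, 299, 1])
y = tf.placeholder_with_default(y_def, shape=[None])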

I have a function that reads the data in from the tfrecords file (read_and_decode_single_example()). I then create the default features and labels using shuffle_batch. Finally I create X and y as placeholders with default, with the shuffled batches as the defaults.

Then when I am training I don't pass the feed dict, and it defaults to using the data from the tfrecords file. When it is time to evaluate, I pass the data in via a feed_dict and it uses that.

This is not a great solution, it is kind of ugly, and it does require loading the validation data into memory, but it works and is simple. I had also tried using tf.cond() to switch between reading the data from a train.tfrecords file and a test.tfrecords file but was unable to get that to work.

The research I did indicates that the preferred way to handle this is to use different sessions, or different graphs with weight sharing, but that just seems ridiculous to me. It shouldn't be that complicated.

Labels: python , data_science , machine_learning , tensorflow

TensorFlow GPU Errors on Windows

Feb. 15, 2018, 1:50 p.m.

I have been loving TensorFlow lately and have installed tensorflow-gpu on my Windows 10 laptop. Given that the GPU on my laptop is not a really great one I have run into quite a few issues, most of which I have solved. My GPU is an Nvidia GeForce GT 750M with 2GB of RAM and I am running the latest release of tensorflow as of February 2018, with Python 3.6.

If you are running into errors I would suggest you try these things in this order:

Try reducing the batch size for training AND validation. I always used batches for training but would evaluate on the validation data all at once. By using batches for validation and averaging the results I am able to avoid most of the memory errors.
If this doesn't work try to restrict the amount of GPU RAM available to tensorflow with:
config = tf.ConfigProto()
config.gpu_options.per_process_gpu_memory_fraction = 0.7
which restricts the amount available to 70%. Note that I have never been able to run the GPU with the memory fraction above 0.7.
If all else fails turn the GPU off and use the CPU:
config = tf.ConfigProto(device_count = {'GPU': 0})

The difference between using the CPU and the GPU is like night and day... With the CPU it takes all day to train through 20 epochs, with the GPU the same can be done in a few hours. I think the main roadblock with my GPU is the amount of RAM, which can easily be managed by controlling the batch size and the config settings above. Just remember to feed the config into the session.
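Evaluating the validation data in batches, as suggested in the first tip, might look roughly like this. It is a framework-free sketch: `predict` stands in for running the model on one batch, and the toy data is made up. Counting correct predictions across batches, rather than averaging per-batch accuracies, keeps the result exact when the last batch is smaller than the others.

```python
def batched(items, batch_size):
    """Yield consecutive slices of at most batch_size items."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


def evaluate_in_batches(examples, labels, predict, batch_size):
    """Accuracy over the whole validation set, computed one batch at a time."""
    correct = 0
    for x_batch, y_batch in zip(batched(examples, batch_size), batched(labels, batch_size)):
        predictions = predict(x_batch)  # only one batch is in memory at a time
        correct += sum(int(p == y) for p, y in zip(predictions, y_batch))
    return correct / len(labels)


# Toy stand-in for a model: predicts whether a number is odd
examples = list(range(10))
labels = [0, 1, 0, 1, 0, 1, 0, 1, 1, 1]   # the label for 8 disagrees with the "model"
accuracy = evaluate_in_batches(examples, labels, lambda xs: [x % 2 for x in xs], batch_size=4)
```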

Labels: python , data_science , machine_learning , tensor_flow

Batch Normalization with TensorFlow

Feb. 13, 2018, 1:44 p.m.

I was trying to use batch normalization in order to improve the accuracy of my CIFAR classifier with tf.layers.batch_normalization, and it seemed to have little to no effect. According to this StackOverflow post you need to do something extra, which is not mentioned in the documentation, in order to get the batch normalization to work.

extra_update_ops = tf.get_collection(tf.GraphKeys.UPDATE_OPS)
sess.run([train_op, extra_update_ops], ...)

The batch norm update operations are added to the UPDATE_OPS collection, so you need to fetch that collection and run it in the session along with the training op. Before I added the extra_update_ops the batch normalization was definitely not running properly; now it is. Whether it helps or not remains to be seen.

Also make sure to pass training=[BOOLEAN | TENSOR] in the call to batch_normalization(), so that during evaluation it normalizes with the stored moving averages instead of the statistics of the current batch. I use a placeholder and pass whether it is training or not in via the feed_dict:

training = tf.placeholder(dtype=tf.bool)

And then use this in my batch norm and dropout layers:

training=training

There were a few other things I had to do to get batch normalization to work properly:

I had been using local response normalization, which apparently doesn't help that much. I removed those layers and replaced them with batch normalization layers.
Remove the activation from the conv2d layers. I run the output through the batch normalize layers and then apply the relu.

Before I made these changes the model with the batch normalization didn't seem to be training at all; the accuracy was just going up and down right around the baseline of 0.10. After these changes it seems to be training properly.
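Why the update ops and the training flag matter can be shown with a toy batch norm in plain Python. This is a simplified sketch, not TensorFlow's implementation: it works on a list of scalars, has no learned scale or offset, and the `run_update_ops` switch is a made-up stand-in for whether UPDATE_OPS gets run.

```python
class TinyBatchNorm:
    """Batch norm over a list of scalars, with moving statistics for inference."""

    def __init__(self, momentum=0.9, eps=1e-5):
        self.momentum = momentum
        self.eps = eps
        self.moving_mean = 0.0
        self.moving_var = 1.0

    def __call__(self, batch, training, run_update_ops=True):
        if training:
            # training: normalize with the statistics of the current batch
            mean = sum(batch) / len(batch)
            var = sum((x - mean) ** 2 for x in batch) / len(batch)
            if run_update_ops:
                # this is the part that is skipped when UPDATE_OPS is never run
                m = self.momentum
                self.moving_mean = m * self.moving_mean + (1 - m) * mean
                self.moving_var = m * self.moving_var + (1 - m) * var
        else:
            # evaluation: normalize with the stored moving averages
            mean, var = self.moving_mean, self.moving_var
        return [(x - mean) / (var + self.eps) ** 0.5 for x in batch]


batch = [10.0, 12.0, 14.0]
skipped = TinyBatchNorm()
updated = TinyBatchNorm()
for _ in range(200):
    skipped(batch, training=True, run_update_ops=False)
    updated(batch, training=True)

eval_skipped = skipped(batch, training=False)  # moving stats never moved: barely normalized
eval_updated = updated(batch, training=False)  # centered around zero
```

The TensorFlow documentation for tf.layers.batch_normalization suggests a variant of the fix above: create the train op inside tf.control_dependencies(update_ops) so the updates run automatically with every training step.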

Labels: data_science , machine_learning , tensor_flow

My Favorite Languages

Jan. 23, 2018, 11:25 a.m.

My favorite languages, in order:

MatLab - MatLab is just a beautiful language. It is simple but very powerful and makes it easy to do very complex things. While the fact that it is designed solely for numeric computation is a drawback as far as using it for other things, that is one of the reasons I love it.
Python - Python is by far my favorite scripting language. It is very powerful with a lot of features, but that also makes it a bit complex. It isn't as elegant as MatLab, but it is way more useful.
R - I consider R to be somewhere in between MatLab and Python. It is optimized for numerical computing, but can also be used with text and character data. For statistics it stands alone - Python can do pretty much anything R can do, but R is simpler and easier. However, it is more a functional language than a programming language.
SQL - I've worked with MySQL for almost 20 years and I know SQL very well. It is great at working with normalized data, however as storage costs have gone down and RAM has gone up, I'm not sure normalizing data really makes all that much sense these days. Having to join a bunch of tables can really impact performance, which when you just need to get a string out of a joined table doesn't really seem worth it. For web sites it makes sense, but for computational purposes I'm not sure it's really necessary, unless you have more data than you can fit in memory. However I will always remember SQL as my first love.
C - I used C in university, but not much since then. I have forgotten most of what I once knew, but I plan on learning C again because of its speed and efficiency. The fact that you can use C to write extensions for R, Python, MatLab etc. makes it very useful.
PHP - PHP is a really ugly language. It has a lot of features, but is inconsistent in syntax and not designed for manipulating data. Its main advantage is that it is easy to learn, but this also makes it very easy to do badly. PHP can be done very well, but good PHP programmers are few and far between and they seem to be getting crowded out by mediocre programmers.
Javascript - Javascript has really been maturing recently. I started using Javascript back in 1996 or so, when all it could really do was alerts and confirmations. It can do a whole lot more than that these days. I have not yet worked with Node.js so I am not all that familiar with the full power of it. I don't really know what my problem with Javascript is. Maybe I still see it as the silly little language it was when I first learned it. Anyway, the fact that it is on the bottom of my list should in no way be taken as a reflection of its value.

Labels: coding , data_science

Data vs Web

Jan. 18, 2018, 11:16 a.m.

In 1998, when I graduated from university, the internet was in its infancy and the dot com bubble was just getting ramped up. At the time I thought the internet would change the world by making information easily accessible and available, and I was excited to be a part of this new thing that would be revolutionary.

That started to change a few years ago. The focus of internet companies had shifted from providing useful services to collecting as much data as possible on the users to be used to better target advertisements. Rather than providing useful and informative content and services, the emphasis was on keeping users online for as much time as possible. While the negative effects of this business model on the users and society are becoming more and more noticeable, notably in the recent US election, the tech companies continue to ignore them. This is no longer the industry that I signed up to work for, and I no longer want to be a part of it.

Anyone who is familiar with the work of Kahneman and Tversky knows that the human brain is very poor at processing and analyzing data. Most of our decisions are made using heuristics, or rules of thumb, that allow us to make quick and easy judgements. These result in cognitive biases, which are ways in which our brains distort reality for the purpose of making decisions. One of the most famous cognitive biases is the "confirmation bias" - which is how people interpret new information in a way such as to support their existing beliefs. Kahneman and Tversky conducted experiments on people ranging from college undergraduates to statistics professors, and everyone was subject to these biases - even PhD level statisticians who should know better. This is why data science is so important.

Our brains are not designed to gather and analyze large amounts of data and we are incredibly bad at doing so. We tend to draw conclusions from small, isolated, but memorable bits of information rather than looking at the overall big picture. One example is how Americans are all very worried about terrorism even though on average only six Americans die per year from foreign terrorism. The media likes to report these stories because they are sensational and memorable, but doing so greatly exaggerates the real risks. There are also numerous medications which are commonly prescribed despite having minimal positive effects, or having no benefits at all.

Data science is a way to draw knowledge from actual observation of the world, rather than just whatever thoughts happen to be strung together in our heads, or whatever sound bites relating to a given subject most easily come to mind. I can come up with whatever theories and ideas I want, but unless they actually reflect the real world it's all meaningless. This is the basis of scientific inquiry, and this is why I am getting out of web development and into data science.

Labels: personal , coding , data_science
