Exploratory Data Analysis

Dec. 24, 2020, 11:34 a.m.

When I start working on a machine learning project my first impulse is always to jump straight to fitting models. By the end of the project I always remember how important exploratory data analysis is, and wish I had remembered sooner. Even on projects where EDA doesn't seem necessary, it usually is.

I have been working on an instance detection challenge, and what use could EDA possibly be on a dataset of annotated images? It turns out, a lot. After doing some EDA I found that many of the annotations were wrong, and simply by correcting them I was able to greatly increase my model's performance.

In addition, by doing some EDA on the predictions from a fitted model I was able to identify some common causes of errors and attempt to address them.

Labels: machine_learning

No comments

I've had good luck with multi-scale training for image detection, so I wanted to try it for classifying images of different sizes with objects at differing scales. I found some base code here, but it is based on PyTorch datasets rather than ImageFolders. I wrote some code to extend it to ImageFolders, which is in the gist below:
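As a rough illustration of the idea (this is just a minimal sketch, not the gist itself): pick a random scale per batch with a custom collate_fn on top of torchvision's ImageFolder. The scale list and dataset path are placeholders.

import random
import torch
from torchvision import datasets, transforms

# One size is chosen per batch and every image in the batch is resized to it.
SCALES = [224, 288, 352, 416]

def multiscale_collate(batch):
    size = random.choice(SCALES)
    to_tensor = transforms.Compose([
        transforms.Resize((size, size)),
        transforms.ToTensor(),
    ])
    images = torch.stack([to_tensor(img) for img, _ in batch])
    labels = torch.tensor([label for _, label in batch])
    return images, labels

# No transform on the ImageFolder itself, so the collate_fn receives PIL images.
dataset = datasets.ImageFolder('path/to/train')
loader = torch.utils.data.DataLoader(dataset, batch_size=32, shuffle=True,
                                     collate_fn=multiscale_collate)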

Labels: machine_learning , pytorch

No comments

Azure Spot Instances

Nov. 26, 2020, 6:39 a.m.

I have some free Azure student credit, so yesterday I decided to try using Azure VMs to train some of my models. I soon realized that a student account does not include a quota for any GPU more powerful than a K80, and there is no way to request an increased quota on a student account. However, the account does include a quota for "low priority" or spot instances, which are pre-emptible, so I set up a spot VM.

On AWS, spot VMs can sometimes run for days before being pre-empted. Not so on Azure. I tried about half a dozen times, and no instance ever lasted long enough to complete even half an epoch, or about an hour. I was very disappointed because the spot prices were much better than AWS's. Azure spot instances let you set a maximum price you are willing to pay, but even setting it above the on-demand price didn't make any difference.

My final complaint about Azure VMs is the shortage of images. AWS has a huge number of images for deep learning, so you can basically just start the instance and you are set to go. Azure only has a few such images, and they still required considerable configuration and installation of packages, which was made especially difficult by the fact that the instances kept shutting down.

I may use Azure on-demand VMs in the future, but the spot instances were largely useless.

Labels: machine_learning , azure

No comments

CoLab Pro

Nov. 13, 2020, 9:40 a.m.

I have been using CoLab for quite a few years now and have always really appreciated the ability to get access to GPUs (and TPUs) for free. So when I recently found out about CoLab Pro I was reluctant to pay $10 a month for something I had been getting for free. However, at the same time I was paying hundreds of dollars a month for cloud GPU instances. Last week, after going well over my AWS budget last month, I decided to give CoLab Pro a try, and I am very glad I did.

CoLab Pro gives you priority on high-end GPUs - so far I have never not gotten a V100. This is the same GPU I was paying a $0.90/hour spot (preemptible) rate for on AWS. For me, the main disadvantage of CoLab was that each instance usually lasted about 10 hours before shutting down, and sessions would time out if left unattended. CoLab Pro instances will last up to 24 hours, and they will not time out. I had one running at work the other day and when I got home I figured it had timed out, but when I went back the next morning it was still running!

Obviously, CoLab Pro is better suited to running experiments than to long training runs, and it doesn't support multiple GPUs. And if you are using TensorFlow you also have access to TPUs (I prefer PyTorch). In the past I have repeatedly kicked myself after spending hundreds of dollars training a model, and then finding a small mistake. In the future I will be running my experiments on CoLab Pro and only using VMs when I am sure everything is correct and I need to train models quickly.


Labels: machine_learning , aws , gpu , colab

No comments

ModuleNotFoundError: No module named 'mmcv._ext'

If you are getting this error when trying to run the latest version of mmdet with the latest version of mmcv (installed via pip), this solves the problem:
# remove any pip-installed copies of mmcv first
pip uninstall mmcv mmcv-full
# then build mmcv from source with the compiled ops (mmcv._ext) included
git clone https://github.com/open-mmlab/mmcv.git
cd mmcv
MMCV_WITH_OPS=1 pip install -e .

Labels: machine_learning , pytorch

1 comment

VAE GAN

Dec. 8, 2019, 11:09 a.m.

I had been trying to train a version of VAE-GAN for a few weeks and it wasn't working as well as I had hoped. As suggested in the VAE-GAN paper, I had added an auxiliary output to the discriminator which attempted to predict the 40 attributes provided with each image in the celeb-a dataset, and I was scaling that loss to bring it in line with the GAN discriminator loss. But I was doing the scaling incorrectly, so the auxiliary loss ended up overwhelming the GAN loss. (I was summing rather than averaging the losses, and the lambda I was using to scale the loss was appropriate for a mean loss; with 40 features the summed auxiliary loss was 40x the GAN loss at base, so I needed to divide the lambda by 40 to get the effect I wanted.)
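In code, the difference boils down to the loss reduction; a toy illustration (the tensors and the lambda are made up, not my actual values):

import torch
import torch.nn.functional as F

# Stand-ins for the real losses: aux_logits/aux_targets are the 40-attribute
# head, gan_loss is the ordinary discriminator loss.
aux_logits = torch.randn(8, 40)
aux_targets = torch.randint(0, 2, (8, 40)).float()
gan_loss = torch.tensor(0.7)

lam = 0.1  # a lambda chosen with a mean-reduced auxiliary loss in mind

# Summing over the 40 attributes makes the auxiliary term roughly 40x larger
# per example than the mean-reduced version...
aux_sum = F.binary_cross_entropy_with_logits(aux_logits, aux_targets,
                                              reduction='sum') / aux_logits.size(0)
# ...so either average over the attributes instead:
aux_mean = F.binary_cross_entropy_with_logits(aux_logits, aux_targets,
                                               reduction='mean')
total = gan_loss + lam * aux_mean
# ...or keep the sum and divide the lambda by the number of attributes:
total_equivalent = gan_loss + (lam / 40) * aux_sum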

After having corrected that error I am finally making some progress with these models. Below are sample images from two models I am training. The first outputs images at 160x160, the second at 128x128.

I guess the moral of this story is if something isn't working the way you expect it to, double check your math before you continue training it!

Labels: python , machine_learning , pytorch , gan

3 comments

GAN Hacks

Sept. 21, 2019, 1:18 p.m.

I've now been trying to train my GANs for quite a while and still haven't been too successful, but I have learned some tricks. I found this excellent article a while ago and I didn't really understand it completely at first, but after having tried a lot of its tricks I understand them now. Here are my thoughts and some additional tricks I have used:

  1. Item 5 from the article - use convolutional layers with strides of 2 rather than pools: one of the biggest problems in training GANs is maintaining the gradients. Since the gradients for the generator come from the discriminator, vanishing or exploding gradients are a huge problem and need to be avoided at all costs. Max pools eliminate all of the gradients but one, so convolutional layers with a stride of 2 are a better way to downsample. Average pooling will also work, but I've found that stride-2 layers work better.
  2. With apologies to Frank Herbert, "the gradients must flow." I've had luck using dense convnets as the discriminator because of the improved gradient flow they provide.
  3. Item 6 - soft and noisy labels - this has helped a LOT. I haven't tried using random labels, but I have had luck using labels that are slightly off from 0 or 1, like 0.1 or 0.99 (see the sketch after this list). This keeps the discriminator from becoming too confident in its predictions and keeps the gradient to the generator from exploding. I've learned that when training GANs, exploding gradients are just as bad as vanishing gradients in that the generator learns nothing.
  4. The article also suggests occasionally flipping the labels, which I'm not sure exactly how to interpret. In practice, if the discriminator gets too strong I will occasionally flip the labels for a few training steps to confuse it a bit and then flip them back. This seems to help the generator catch up a bit.
  5. One other thing I have found is that using smaller batch sizes seems to work better. When I started using the V100 GPUs I immediately increased my batch size to the max the GPU could handle, but the generator did not learn well at all. Reducing the batch size helps a lot, possibly by introducing some additional regularization to the discriminator. 
  6. Dropout - the article mentions using dropout in the generator, which I haven't tried. I do use dropout in the discriminator, which I wasn't sure about since it will reduce the gradients, but it does help slow down the discriminator which seems to help training.
  7. Item 11 - I have tried to do this and wasted a lot of time. If your training has collapsed it is not likely you will be able to uncollapse it by training one network more than the other. I would suggest that rather than training one network more than the other you make sure that the networks are roughly equally matched from the start. Training the generator more, for example, tends to lead to mode collapse; training the discriminator more tends to lead to the gradients exploding or vanishing.
  8. Item 12 - I haven't tried this one yet but it is interesting. I've heard a lot about using auxiliary outputs to provide regularization and if I had labelled images I would definitely try this one. In fact, I may try to label my images somehow in order to do so.
  9. One thing that was not mentioned in the article, but which I have found very helpful, is using separate batches for real and fake images when training the discriminator. At first I thought this was a bizarre idea and wasn't sure how it would help, but it really does.
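Here is a minimal sketch of tips 3 and 4 above - soft, slightly noisy labels with an occasional flip. The exact values and the flip probability are just illustrative guesses, not anything from the article:

import torch

def discriminator_labels(batch_size, real=True, flip_prob=0.05):
    # Soft labels: slightly off from 0 and 1, with a little noise, so the
    # discriminator never becomes completely confident.
    base = 0.9 if real else 0.1
    labels = torch.full((batch_size, 1), base) + 0.05 * torch.randn(batch_size, 1)
    labels = labels.clamp(0.0, 1.0)
    # Occasional flip: briefly confuse a discriminator that is getting too strong.
    if torch.rand(1).item() < flip_prob:
        labels = 1.0 - labels
    return labels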

Some additional tips on how to construct a GAN:

  1. Start small - when I started playing with GANs I immediately made two large, deep convnets and tried to train them and they learned nothing. I recommend you start with a very small network, train it enough to make sure it is learning something, then add a layer and repeat. I still don't know what the problem with my original networks was, or if I just wasn't patient enough, but it's a lot easier to find problems if you add one layer at a time (or one block at a time) than if you start off with a 100 layer network.
  2. Keep things simple - training a GAN involves making sure that two networks are learning at roughly the same pace; it's a delicate dance, and I would recommend not throwing too many bells and whistles into it. As in the previous tip, make sure everything is working properly before you add some newfangled loss function or dynamic loss weighting or anything else into it.


Labels: machine_learning , gan

No comments

K80 vs V100

Sept. 16, 2019, 10:57 a.m.

Discovering how much cheaper spot EC2 instances were than normal on-demand instances gave me the courage to try out a faster GPU. I had been using K80s, which are painfully slow but very cheap. The spot price for the V100 is about the same as the on-demand price of the K80, so using V100s with spot instances won't be any cheaper, but it won't be more expensive either.

I didn't think the V100s would be that much better, so I wasn't expecting them to be worth the extra cost. How wrong I was. Training the network I am currently playing with on a K80 with a batch size of 48 took about 8-12 hours per epoch. Training it on a V100 with a batch size of 64 looks like it will take about 2 hours. With the V100s priced at about 4x the K80s, that works out to about the same cost per unit of compute, or a little bit cheaper, depending on exactly how long an epoch took on the K80.

When you factor in the value of not having to wait an entire day to see the results of an epoch, this is a no-brainer as far as I'm concerned. Unfortunately, I'm sure my AWS bill is going to increase substantially. That's how they get you... Once you have a taste of HPC they know you'll be back for more...

Labels: machine_learning , ec2 , aws , gpu

No comments

AWS EC2 Spot Instances

Sept. 12, 2019, 4:35 p.m.

My major complaint about using EC2 GPU instances was the cost: it gets very expensive to run a GPU instance for more than a few hours. Last week I wondered why I wasn't using spot instances, so I set up a request, and it has been running for a few days now. A spot instance is about 1/4 the price of a normal instance, so it's not much more expensive than renting a CPU-only on-demand instance. I was hoping to get a better GPU than the K80, but I settled for the K80 because it was more readily available than the better GPUs; next time I may request a better one and see what happens.

The downside of spot instances is that they will be terminated if the capacity is needed for an on-demand instance, and my instance was terminated the other night. But then I spun up a new one in the morning and that one has been running for a few days now. I can't believe I haven't used these before.

Labels: machine_learning , aws

No comments

Amazon EC2 Deep Learning AMI Instance

Aug. 27, 2019, 6:44 a.m.

It is difficult to play around with the structure for the GAN I am working on in Colab since it trains so slowly. I can usually get maybe 2 or 3 epochs in a day, which means that I need to wait a day before evaluating each change I make. I decided to rent a GPU in the cloud for a few days so I could train it a bit more quickly and figure out what works and what doesn't work before going back to Colab.

I already have a Google Cloud GPU instance I was using for my work with mammography, but it was running CUDA 9.0, which apparently is not supported by PyTorch out of the box. I tried to upgrade CUDA to 10, but I think I ended up just making things worse. Rather than spend a whole day trying to fix the Google Cloud instance, and since I have some AWS credits, I decided to try an AWS Deep Learning AMI instance, which already has everything configured.

It was incredibly easy to get set up. The AMI comes pre-configured with virtual environments for the different deep learning frameworks and packages, so there is no need to install CUDA or drivers or anything like that, which is a huge advantage; back when I was setting up the Google Cloud instance it took me a few days to get everything installed and working. One thing I quickly noticed was that the default disk size was not even close to big enough - after downloading a few data files I was already running out of space - but it was very easy to increase the disk size.

Then all I had to do was activate the pytorch environment, launch a notebook and everything was running smoothly. I did run into a few minor issues, none of which were difficult to resolve:

  • If I launch tmux from within a virtual environment it launches a session that does NOT have the environment activated. Then if I activate the environment from within tmux it doesn't have access to the proper modules. This was resolved by launching tmux from outside of the venv, and then activating the venv from inside tmux.
  • My notebook didn't seem to have access to PyTorch, but this was because I hadn't selected the proper kernel from the Kernel -> Change Kernel menu. I wasn't even aware that one could select the kernel like that.

I used to prefer Google Cloud to AWS because it was more configurable and easier to use. While AWS does have a bit of a learning curve, they really have thought of and provided for just about every possible contingency. We use AWS at my work, and it really is very impressive. I still like the simplicity of Google Cloud, but even simple things like AMIs make such a huge difference in set-up time that I think I'll be using AWS more often now.

Labels: machine_learning , aws , pytorch

No comments

I had been trying to train my autoencoder with a GAN component on and off for a couple of months and it just didn't seem to be working very well. I thought that maybe the autoencoder and the discriminator errors were somehow cancelling each other out or something. Just for the hell of it I decided to use the discriminator to optimize a reconstructed image to look real, just to see what the result would be. Instead of optimizing the weights, I created a Variable of the input and optimized that instead. To my surprise I ended up with weird splotches of primary colors against a white background; the optimization actually made the image look less and less real rather than more. After seeing that I decided there must be some major problem with my code, so I went through it in greater detail.

I decided to train all three networks from scratch (the three being the encoder, the decoder and the discriminator) to see what would happen. I was surprised to see that the generator did not seem to be learning ANYTHING and neither did the discriminator. I found a tutorial on creating a GAN in PyTorch and I went through the training code to see how it differed from mine. 

I had written my code to optimize for speed: training the autoencoder without the GAN already took about 4 hours per epoch on a (free) K80 on Colab, and I didn't want to slow that down much more, so I tried to minimize the number of times data had to be passed through the networks. The tutorial did not do that. First it ran a batch of real data through the discriminator and computed the gradients, but did NOT update the weights yet. Then it used the generator to generate a batch of fake data, passed that through the discriminator, computed the gradients and added them to the gradients from the first batch, and only THEN updated the discriminator. Then it ran the same batch of fake data through the discriminator again and used that to update the generator. This was different from my code in several major ways (a sketch of the tutorial-style step appears after the list):

  1. I was using a single batch containing half real images and half reconstructed images to train my discriminator.
  2. I was only passing data through each network a single time per batch.
  3. I wasn't detaching the reconstructed data before passing them through the discriminator.
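For reference, here is a rough sketch of the tutorial-style step adapted to my autoencoder-plus-discriminator setup. The module, optimizer and loss choices are placeholders, and the discriminator is assumed to output a single logit per image:

import torch
import torch.nn.functional as F

def train_step(autoencoder, discriminator, real_imgs, opt_d, opt_g):
    real_labels = torch.ones(real_imgs.size(0), 1)
    fake_labels = torch.zeros(real_imgs.size(0), 1)

    # --- discriminator update ---
    opt_d.zero_grad()
    # real batch: backward() accumulates gradients but doesn't step yet
    loss_real = F.binary_cross_entropy_with_logits(discriminator(real_imgs), real_labels)
    loss_real.backward()
    # reconstructed batch, detached so nothing flows back into the autoencoder
    recon = autoencoder(real_imgs)
    loss_fake = F.binary_cross_entropy_with_logits(discriminator(recon.detach()), fake_labels)
    loss_fake.backward()
    opt_d.step()

    # --- autoencoder / generator update ---
    opt_g.zero_grad()
    # the same reconstructions go through the (updated) discriminator again,
    # this time without detaching, so the autoencoder learns to fool it
    adv_loss = F.binary_cross_entropy_with_logits(discriminator(recon), real_labels)
    recon_loss = F.mse_loss(recon, real_imgs)
    (recon_loss + adv_loss).backward()
    opt_g.step()

    return loss_real.item() + loss_fake.item(), recon_loss.item(), adv_loss.item()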

After updating my code to bring it more in line with the tutorial, both networks began to learn; I think the major change was detaching the reconstructed images before putting them through the discriminator. However, I noticed a few strange things regarding the discriminator batches:

  • If I used a single batch containing both real and reconstructed images to train the discriminator, it learned very quickly - its loss approached 0 - and the discriminator loss component of the generator loss overwhelmed the autoencoder loss, which fluctuated but didn't decrease very much.
  • If I trained using two batches, each containing only images for a single label, its accuracy hovered around 50% and the autoencoder loss decreased rapidly.

I read in a couple of places that using separate batches was a trick to make GANs train better, but no one really had an explanation for why this worked. What I am currently doing is using separate batches most of the time, but every n batches I use a single mixed batch to encourage the discriminator to learn a bit more. I've tested values for n of 8, 16, 32 and 64. Most of those seemed to result in the worst of both worlds - nothing really improved - but with n = 64 the autoencoder loss is again decreasing, although slowly, and the discriminator accuracy is hovering around 52% rather than the 49-50% it was at using only separate batches.

To me, using separate batches doesn't intuitively make sense: I don't see how the network can really learn to differentiate between classes when it only sees one class at a time. Of course the gradients are then added, and the differences should cancel out, with what's left indicating how to differentiate the classes; but it still seems much more efficient to learn from mixed batches. One would never consider training a network on, say, the CIFAR dataset with each batch consisting exclusively of a single class. Maybe that's the point - to slow down the discriminator's learning enough for the generator to keep up? Anyway, I will continue to experiment and see what works and what doesn't.


Labels: machine_learning , pytorch , autoencoders , gan

No comments

I am still working on my face autoencoder in my spare time, although I have much less spare time lately. My non-variational autoencoder works great - it can very accurately reconstruct any face in my dataset of 400,000 faces, but it doesn't work at all for interpolation or anything like that. So I have also been trying to train a variational autoencoder, but it has a lot more difficulty learning.

For a face which is roughly centered and looking in the general direction of the camera it can do a somewhat decent job, but if the picture is off in any way - another face off to the side, something blocking the face, the face at a strange angle, etc. - it does a pretty bad job. And since I want to use this for interpolation, training it on these bad faces doesn't really help anything.

One of the biggest datasets I am using is this one from ETHZ. The dataset was created to train a network to predict the age of the person, and while the images are all of good quality it does include many images that have some of the issues I mentioned above, as well as pictures that are not faces at all - like drawings or cartoons. Other datasets I am using consist entirely of properly cropped faces as I described above, but this dataset is almost 200k images, so omitting it completely significantly reduces the size of my training data.

The other day I decided that I needed to improve the quality of my training dataset if I ever want to get this variational autoencoder properly trained, and to do that I need to filter the bad images out of the ETHZ IMDB dataset. The dataset was already created using face detectors, but I want to remove faces that have certain attributes:

  • Multiple faces or parts of faces in the image
  • Images with something blocking part of the face
  • Images where the faces are not generally facing forward, such as profiles

I started trying to curate them manually, but after going through 500 of the 200k images I realized that would not be feasible. It would be easy to train a neural network to classify the faces, but that would require training data - which still means manually classifying faces. So what I did was take another dataset of faces that were all good, add about 700 bad faces from the IMDB dataset, and make a new dataset of about 7,000 images. Then I took a pre-trained discriminator I had previously used as part of a GAN for generating faces and retrained it to classify faces as good or bad.

I ran this for about 10 epochs, until it was achieving very good accuracy, and then I used it to evaluate the IMDB dataset. Any image which it gave a less than 0.03 probability of being good I moved into the bad training dataset, and any image which it gave a greater than 0.99 probability of being good I moved to the good training dataset. Then I continued training on the enlarged dataset, and so on.
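A rough sketch of the sorting pass, using the thresholds above; the directory layout and predict_fn are placeholders for however the probabilities are actually computed:

import shutil
from pathlib import Path
import torch

GOOD_THRESHOLD = 0.99   # confident "good" faces join the good training folder
BAD_THRESHOLD = 0.03    # confident "bad" faces join the bad training folder

def sort_unlabelled(model, predict_fn, unlabelled_dir, good_dir, bad_dir):
    model.eval()
    with torch.no_grad():
        for path in Path(unlabelled_dir).glob('*.jpg'):
            p_good = predict_fn(model, path)  # probability that the face is usable
            if p_good >= GOOD_THRESHOLD:
                shutil.move(str(path), str(Path(good_dir) / path.name))
            elif p_good <= BAD_THRESHOLD:
                shutil.move(str(path), str(Path(bad_dir) / path.name))
            # anything in between stays unlabelled for a later pass

# Outer loop: retrain on the growing labelled folders, then sort again.
# for _ in range(n_rounds):
#     train(model, good_dir, bad_dir)
#     sort_unlabelled(model, predict_fn, unlabelled_dir, good_dir, bad_dir)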

This is called weak supervision or semi-supervised learning, and it works a lot better than I thought it would. After training for a few hours, the images which are moved all seem to be correctly classified, and after each iteration the size of the training dataset grows to allow the network to continue learning. Since I only move images which have very high or very low probabilities, the risk of a misclassification should be relatively low, and I expect to be able to completely sort the IMDB dataset by the end of tomorrow, maybe even sooner. What would have taken weeks or longer to do manually has been reduced to days thanks to transfer learning and weak supervision!

Labels: coding , data_science , machine_learning , pytorch , autoencoders

1 comment

PyTorch Update

April 18, 2019, 7:26 a.m.

After another couple of weeks using PyTorch my initial enthusiasm has somewhat faded. I still like it a lot, but I have encountered many disadvantages. For one, I can now see the advantage of TensorFlow's static graphs - they make the API easier to use. Since the graph is completely defined and then compiled, you can just tell each layer how many units it should have and it will infer the number of inputs from whatever its input is. In PyTorch you need to manually specify the inputs and outputs of each layer, which isn't a big deal, but it makes it more difficult to tune networks: to change the number of units in a layer you need to change the inputs to the next layer, the batch normalization, etc., whereas with TensorFlow you can just change one number and everything is magically adjusted.

I also think that the TensorFlow API is better than PyTorch's. Some things which are very easy to do in TensorFlow become incredibly complicated in PyTorch, like adding different regularization amounts to different layers. In TensorFlow there is a layer parameter that controls the regularization; in PyTorch you apparently need to loop through all of the parameters and know which ones to add what amount of regularization to.
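For what it's worth, the standard PyTorch way around this is optimizer parameter groups; a minimal sketch with a toy model (not my actual network):

import torch
import torch.nn as nn

# Toy two-layer model; per-layer weight decay is set by giving the optimizer
# separate parameter groups rather than a per-layer argument.
model = nn.Sequential(nn.Linear(10, 20), nn.ReLU(), nn.Linear(20, 2))

optimizer = torch.optim.Adam([
    {'params': model[0].parameters(), 'weight_decay': 1e-5},  # light decay on the first layer
    {'params': model[2].parameters(), 'weight_decay': 1e-3},  # heavier decay on the last layer
], lr=1e-4)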

I suppose one could easily get around these limitations with custom functions and such, and it shouldn't be surprising that TensorFlow seems more mature given that it has the weight of Google behind it, is considered the "industry standard", and has been around for longer. But I now see that TensorFlow has some advantages over PyTorch.

Labels: python , machine_learning , tensorflow , pytorch

1 comment

PyTorch

April 8, 2019, 2:31 p.m.

When I first started with neural networks I learned them with TensorFlow and it seemed like TensorFlow was pretty much the industry standard. I did however keep hearing about PyTorch which was supposedly better than TensorFlow in many ways, but I never really got around to learning it. Last week I had to do one of my assignments in PyTorch so I finally got around to it, and I am already impressed.

The biggest problem I always had with TensorFlow was that the graphs are static. The entire graph must be defined and compiled before it is run and it can't be altered at runtime. You feed data into the graph and it returns output. This results in the rather awkward tf.Session() which must be created before you can do anything, and which contains all of the parameters for the model.

PyTorch has dynamic graphs which are built at runtime. This means that you can change things as you go, including altering the graph while it is running, and you don't need to have all the dimensions of all of the data specified in advance like you do in TensorFlow. You can also do things like change the number of neurons in a layer dynamically and drop entire layers at runtime, which you can't do with TensorFlow.

Debugging PyTorch is a lot easier since you can just make a change and test it - you don't need to recreate the graph and instantiate a session to test it out. You can just run an optimization step whenever you want. Coming from TensorFlow that is just a breath of fresh air.

TensorFlow still has many advantages, including the fact that it is still an industry standard, is easier to deploy and is better supported. But PyTorch is definitely a worthy competitor, is far more flexible, and solves many of the problems with TensorFlow.

Labels: python , machine_learning , tensorflow , pytorch

No comments

CatBoost

Jan. 10, 2019, 2:01 p.m.

Usually when you think of gradient boosted decision trees you think of XGBoost or LightGBM. I'd heard of CatBoost, but I'd never tried it and it didn't seem too popular. I was looking at a Kaggle competition which had a lot of categorical data, and since I had squeezed just about every drop of performance I could out of LGBM, I decided to give CatBoost a try. I was extremely impressed.

Out of the box, with all default parameters, CatBoost scored better than the LGBM I had spent about a week tuning. CatBoost trained significantly slower than LGBM, but it will run on a GPU, and doing so makes it only slightly slower than the LGBM. Unlike XGBoost, it can handle categorical data natively, which is nice because in this case there are far too many categories for one-hot encoding. I've read the documentation several times but I am still unclear as to how exactly it encodes the categorical data; whatever it does works very well.
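Getting started is about as simple as it sounds; a toy sketch (the data frame and column names are made up):

import pandas as pd
from catboost import CatBoostClassifier

# Made-up data: one categorical column, one numeric column.
df = pd.DataFrame({
    'city':   ['NYC', 'LA', 'NYC', 'SF', 'LA', 'SF'],
    'income': [50_000, 62_000, 48_000, 90_000, 70_000, 85_000],
    'target': [0, 1, 0, 1, 1, 1],
})

model = CatBoostClassifier(iterations=100, verbose=False)
# cat_features tells CatBoost which columns to encode internally,
# so no one-hot encoding is needed.
model.fit(df[['city', 'income']], df['target'], cat_features=[0])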

I am just beginning to try to tune the hyperparameters so it is unclear how much (if any) extra performance I'll be able to squeeze out of it, but I am very, very impressed with CatBoost and I highly recommend it for any datasets which contain categorical data. Thank you Yandex! 

Labels: coding , data_science , machine_learning , kaggle , catboost

2 comments

Exercise Log

Nov. 27, 2018, 4:27 p.m.

I exercise quite a lot and I have not been able to find an app to keep track of it which satisfies all of my criteria. Most fitness trackers are geared towards cardio, and I also do a lot of strength training. After spending a year trying to make do with combinations of various fitness trackers and other apps, I decided to just write my own, which could do everything I wanted and show all of the reports I wanted.

I did that and after using it for a few weeks put it online at workout-log.com. It's not fancy and it is quite likely very buggy at this point, but it is open to anyone who wants to use it. 

It's written with Django and jQuery and uses ChartJS for the charts. 

Labels: python , django , data_science , machine_learning

1 comment

CoLab TPUs One Month Later

Oct. 31, 2018, 11:49 a.m.

After having used both CoLab GPUs and TPUs for almost a month I must significantly revise my previous opinion. Even for a Keras model not written or optimized for TPUs, with some minimal configuration changes TPUs perform much faster - at a minimum twice the speed. In addition to making sure that all operations are TPU compatible, the only major configuration change required is increasing the batch size by a factor of 8. At first I was playing around with the batch size, but I realized that this was unnecessary. TPUs have 8 shards, so you simply multiply the GPU batch size by 8 and that should be a good baseline.

The model I am currently training on a TPU and a GPU simultaneously is training 3-4x faster on the TPU than on the GPU and the code is exactly the same. I have this block of code:

use_tpu = True
# if we are using the tpu copy the keras model to a new var and assign the tpu model to model
if use_tpu:
    TPU_WORKER = 'grpc://' + os.environ['COLAB_TPU_ADDR']
    
    # create network and compiler
    tpu_model = tf.contrib.tpu.keras_to_tpu_model(
        model, strategy = tf.contrib.tpu.TPUDistributionStrategy(
            tf.contrib.cluster_resolver.TPUClusterResolver(TPU_WORKER)))
    
    BATCH_SIZE = BATCH_SIZE * 8

The model is created with Keras and the only change I make is setting use_tpu to True on the TPU instance. 

One other thing I thought I would mention is that CoLab creates separate instances for GPU, TPU and CPU, so you can run multiple notebooks without sharing RAM or processor if you give each one a different type.

Labels: machine_learning , tensorflow , google , google_cloud

4 comments

CoLab TPUs

Oct. 23, 2018, 9:41 a.m.

The other day I was having problems with a CoLab notebook and I was trying to debug it when I noticed that TPU is now an option for runtime type. I found no references to this in the CoLab documentation, but apparently it was quietly introduced only recently. If anyone doesn't know, TPUs are chips designed by Google specifically for matrix multiplications and are supposedly incredibly fast. Last I checked the cost to rent one through GCP was about $6 per hour, so the ability to have access to one for free could be a huge benefit.

As TPUs are specialized chips you can't just run the same code as on a CPU or a GPU. TPUs do not support all TensorFlow operations and you need to create a special optimizer to be able to take advantage of the TPU at all. The model I was working with at the time was created using TensorFlow's Keras API so I decided to try to convert that to be TPU compatible in order to test it.

Normally you would have to use a cross shard optimizer, but there is a shortcut for Keras models:

TPU_WORKER = 'grpc://' + os.environ['COLAB_TPU_ADDR']

# create network and compiler
tpu_model = tf.contrib.tpu.keras_to_tpu_model(
    keras_model, strategy=tf.contrib.tpu.TPUDistributionStrategy(
        tf.contrib.cluster_resolver.TPUClusterResolver(TPU_WORKER)))

The first line finds an available TPU and gets its address. The second line takes your Keras model as input and converts it to a TPU-compatible model. Then you would train the model using tpu_model.fit() instead of keras_model.fit(). This was the easy part.

For this particular model I am using a lot of custom functions for loss and metrics. Many of the functions turned out not to be compatible with TPUs and had to be rewritten. While this was annoying at the time, it turned out to be worth it regardless of the TPU, because it forced me to optimize the functions. The operations which were not compatible were non-matrix ops - logical operations and boolean masks specifically. Some of the code was downright hideous, and this forced me to sit down, think it through and rewrite it in a much cleaner manner, vectorizing as much as possible.
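As an illustration of the kind of rewrite involved, here is one way to replace a boolean-mask based calculation with plain element-wise math (toy tensors, not my actual metric code):

import tensorflow as tf

labels = tf.constant([1., 0., 1., 1.])
probs = tf.constant([0.8, 0.6, 0.2, 0.9])
preds = tf.cast(probs > 0.5, tf.float32)

# Instead of tf.boolean_mask(preds, labels > 0), multiply by the mask
# so everything stays a dense element-wise op that the TPU can handle.
true_positives = tf.reduce_sum(preds * labels)
recall = true_positives / (tf.reduce_sum(labels) + 1e-7)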

After all that effort, so far my experience with the TPUs hasn't been all that great. I can train my model with a significantly larger batch size - whereas on an Nvidia K80 the maximum batch size was 16, I am currently training with batches of 64 on the TPU and may be able to push that even higher. However, the time per epoch hasn't really improved all that much - it is about 1750 seconds on the TPU versus 1850 seconds on the K80. I have read that code may need to be altered more to take full advantage of TPUs, and I have not yet played with the batch size to see how that changes the performance.

I suspect that if I did some more research about TPUs and coded the model to be optimized for a TPU from scratch there might be a more noticeable performance gain, but this is based solely on having heard other people talk about how fast they are and not from my experience. 

Update - I have realized that the data augmentation is the bottleneck which is limiting the speed of training. I am training with a Keras generator which performs the augmentation on the CPU and if this is removed or reduced the TPUs do, in fact, train significantly faster than a GPU and also yield better results.

Labels: coding , machine_learning , google_cloud

No comments

I have previously written about Google CoLab, which provides free access to Nvidia K80 GPUs, but only for 12 hours at a time. After a few months of using Google Cloud instances with GPUs I have run up a substantial bill and have reverted to using CoLab whenever possible. The main problem with CoLab is that the instance is terminated after 12 hours, taking all of its files with it, so anything you want to keep needs to be saved somewhere else.

Until recently I had been saving my files to Google Drive with this method, but while it is easy to save files to Drive, it is much more difficult to read them back. As far as I can tell, in order to do this with the API you need the file id from Drive, and even then it is not straightforward to get the files into CoLab. To deal with this I had been uploading frequently-used files to an AWS S3 bucket and downloading them to CoLab with wget, which works fine, but there is a much simpler way to do the same thing using Google Cloud Storage instead of S3.

First you need to authenticate CoLab to your Google account with:

from google.colab import auth

auth.authenticate_user()

Once this is done you need to set your project and bucket name and then update the gcloud config.
project_id = 'your-project-id'      # replace with your GCP project id
bucket_name = 'your-bucket-name'    # replace with your GCS bucket name
!gcloud config set project {project_id}

After this has been done, files can be quickly uploaded to or downloaded from the bucket with the following simple commands:

# download
!gsutil cp gs://{bucket_name}/foo.bar ./foo.bar

# upload
!gsutil cp  ./foo.bar gs://{bucket_name}/foo.bar

I have actually been adding the upload command to my training code so the weights are automatically uploaded to GCS every couple of epochs, which removes the need for me to manually back them up throughout the day.
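One way to wire this into a Keras training loop is a small callback; this is just a sketch, and the bucket, path and interval are placeholders:

import os
from tensorflow import keras

class GCSBackup(keras.callbacks.Callback):
    """Copy the latest weights to a GCS bucket every few epochs."""
    def __init__(self, weights_path, bucket, every=2):
        super().__init__()
        self.weights_path = weights_path
        self.bucket = bucket
        self.every = every

    def on_epoch_end(self, epoch, logs=None):
        if (epoch + 1) % self.every == 0:
            self.model.save_weights(self.weights_path)
            os.system('gsutil cp {} gs://{}/{}'.format(
                self.weights_path, self.bucket, os.path.basename(self.weights_path)))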

Labels: coding , python , machine_learning , google , google_cloud

1 comment

Keras

Sept. 26, 2018, 1:03 p.m.

When I first started working with TensorFlow I didn't really like Keras. It seemed like a dumbed-down interface to TensorFlow, and I preferred having greater control over everything to the ease of use of Keras. However, I have recently changed my mind. When you use Keras with a TensorFlow back-end you can still drop down to TensorFlow if you need to tweak something that you can't in Keras, but otherwise Keras just provides an easier way to access TensorFlow's functionality. This is especially useful for prototyping models, since you can easily make changes without having to write or rewrite a lot of code. I used to write my own functions to do things like make a convolutional layer, but most of that was duplicating functionality that already exists in Keras.

My original opinion was incorrect: Keras is a valuable tool for creating neural networks, and since you can mix TensorFlow in, there is nothing lost by using it.

Labels: machine_learning , tensorflow , keras

1 comment

IOU Loss

Aug. 29, 2018, 11:52 a.m.

When doing binary image segmentation, segmenting images into foreground and background, cross entropy is far from ideal as a loss function. As these datasets tend to be highly unbalanced, with far more background pixels than foreground, the model will usually score best by predicting everything as background. I have confronted this issue during my work with mammography and my solution was to use a weighted sigmoid cross entropy loss function giving the foreground pixels higher weight than the background.

While this worked, it was far from ideal; for one thing it introduced another hyperparameter - the weight - and altering the weight had a large impact on the model. Higher weights favored predicting pixels as positive, increasing recall and decreasing precision, and lowering the weight had the opposite effect. When training my models I usually began with a high weight to encourage the model to make positive predictions and gradually decayed the weight to encourage it to make negative predictions.

For these types of segmentation tasks Intersection over Union tends to be the most relevant metric as pixel level accuracy, precision and recall do not account for the overlap between predictions and ground truth. Especially for this task, where overlap can be the difference between life and death for the patient, accuracy is not as relevant as IOU. So why not use IOU as a loss function?

The reason is that IOU is not differentiable, so it cannot be used directly for gradient descent. However, Wang et al. have written a paper - Optimizing Intersection-Over-Union in Deep Neural Networks for Image Segmentation - which provides an easy way to use IOU as a loss function. In addition, this site provides code to implement the loss in TensorFlow.

The essence of this method is that rather than using the binary predictions to calculate IOU, we use the sigmoid probabilities output from the logits to estimate it, which allows the IOU to provide gradients. At first I was skeptical of this method, mostly because I understood cross entropy better and it is more common, but after I hit a performance wall with my mammography models I decided to give it a try.
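The core of the idea fits in a few lines; here is a sketch of the soft IOU loss (my paraphrase of the approach, not the exact code from the linked site):

import tensorflow as tf

def soft_iou_loss(logits, labels, eps=1e-7):
    # Use sigmoid probabilities instead of hard 0/1 predictions so that
    # the intersection and union are differentiable.
    probs = tf.nn.sigmoid(logits)
    labels = tf.cast(labels, tf.float32)
    intersection = tf.reduce_sum(probs * labels)
    union = tf.reduce_sum(probs) + tf.reduce_sum(labels) - intersection
    return 1.0 - intersection / (union + eps)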

My models using cross-entropy loss had ceased to improve validation performance so I switched the loss function and trained them for a few more epochs. The validation metrics began to improve, so I decided to train a copy of the model from scratch with the IOU loss. This has been a resounding success. The IOU loss accounts for the imbalanced data, eliminating the need to weight the cross entropy. With the cross entropy loss the models usually began with recall of near 1 and precision of near 0 and then the precision would increase while the recall slowly decreased until it plateaued. With IOU loss they both start near 0 and gradually increase, which to me seems more natural. 

Training with an IOU loss has two concrete benefits for this task - it has allowed the model to detect more subtle abnormalities which models trained with cross entropy loss did not detect; and it has reduced the number of false positives significantly. As the false positives are on a pixel level this effectively means that the predictions are less noisy and the shapes are more accurate.

The biggest benefit is that we are directly optimizing for our target metric rather than attempting to use an imperfect substitute which we hope will approximate the target metric. Note that this method only works for binary segmentation at the moment. It also is a bit slower than using cross entropy, but if you are doing binary segmentation the performance boost is well worth it.


Labels: python , machine_learning , mammography , convnets

No comments

Early Stopping

July 30, 2018, 9:37 a.m.

I recently began using the early stopping feature of Light GBM, which allows you to stop training when the validation score doesn't improve for a certain number of rounds. This is especially useful if you are bagging models, as you don't need to watch each one and figure out when training should stop. The way it works is you specify a number of rounds, and if the validation score doesn't improve during that number of rounds the training is stopped and the round with the best validation score is used.

When working with this I noticed that often the best validation round is a very early round, which has a very good validation score but an incredibly low training score. As an example here is the output from a model I am currently training. Normally the training F1 gets up to the high 0.90s:

Early stopping, best iteration is:
[7]	train's macroF1: 0.525992	valid's macroF1: 0.390373

Out of at least 400 rounds of training, the best performance on the validation set was on the 7th, at which time it was performing incredibly poorly on the training data. This indicates overfitting to the validation set, which is just as bad as overfitting to the training set in that the model is not likely to generalize well.

So what to do about this issue? The obvious solution would be to provide a minimum number of rounds and begin to monitor the validation score for early stopping once that number of rounds has passed, but I don't see any way to do this through the LGB API. 

I am running this code using sklearn's joblib for parallel processing: I create a list of estimators to fit and then pass that list to the parallel processor, which calls a function that fits each estimator to the data and returns it. The early stopping is taken care of by LGB, so what I did is, after the estimator is fit, manually check the training performance at the best validation round. If the training performance is above a specified threshold I return the estimator as normal. If, however, it is below that threshold, I recursively call the function again.
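In rough form, the fitting function looks something like the sketch below. It assumes the sklearn-style LightGBM API of the time (fit taking eval_set and early_stopping_rounds); make_estimator is a placeholder factory that should vary the random seed between calls, and the retry cap is an extra guard against the infinite-loop risk mentioned below:

from sklearn.metrics import f1_score

def fit_with_floor(make_estimator, X_train, y_train, X_val, y_val,
                   train_f1_floor=0.80, max_retries=5):
    est = make_estimator()
    est.fit(X_train, y_train,
            eval_set=[(X_val, y_val)],
            early_stopping_rounds=50)
    train_f1 = f1_score(y_train, est.predict(X_train), average='macro')
    if train_f1 < train_f1_floor and max_retries > 0:
        # the best validation round came too early; throw this fit away and retry
        return fit_with_floor(make_estimator, X_train, y_train, X_val, y_val,
                              train_f1_floor, max_retries - 1)
    return est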

The downside to this is that it is possible to get into an infinite loop, but if the thresholds are properly tuned this should be easily avoidable. 


Labels: coding , data_science , machine_learning , lightgbm

2 comments

The Importance of Cross Validation

July 23, 2018, 11:10 a.m.

I recently started looking at a Kaggle challenge about predicting poverty levels in Costa Rica. I used sklearn's train_test_split to split the training data into train and validation sets and fit a few models. The first thing I noticed was that my submissions scored significantly lower than my validation scores: 0.36 on the submission vs. 0.96 on my validation data.

The data consists of information about individuals with the target as their poverty level. The features include both information relating to that individual as well as information for the household they live in. The data includes multiple individuals from the same household, and some exploratory data analysis indicated that most of the features were on a household level rather than the individual level.

This means that doing a random split ends up including data from the same household in both the train and validation datasets, which will result in the leakage that artificially raised my initial validation scores. This also means that my models were all tuned on a validation dataset which was essentially useless.

To fix this I did the split on unique household IDs, so no household would be included in both datasets. After re-tuning the models appropriately, the validation F1 scores went down from 0.96 to 0.65. The submission scores went up to 0.41, which was not a huge increase, but it was much closer to the validation scores.
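sklearn's GroupShuffleSplit does exactly this kind of split; a toy sketch (the household id column name is from memory, and the rest of the frame is made up):

import pandas as pd
from sklearn.model_selection import GroupShuffleSplit

# Toy frame: two rows per household so the grouping matters.
df = pd.DataFrame({
    'idhogar': ['a', 'a', 'b', 'b', 'c', 'c'],
    'rooms':   [3, 3, 2, 2, 5, 5],
    'target':  [1, 1, 2, 2, 4, 4],
})

splitter = GroupShuffleSplit(n_splits=1, test_size=0.34, random_state=42)
train_idx, val_idx = next(splitter.split(df, groups=df['idhogar']))
train_df, val_df = df.iloc[train_idx], df.iloc[val_idx]
# every household's rows now land entirely on one side of the split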

The moral of this story: always make sure that your training and validation sets don't contain overlap or leakage, or the validation set becomes useless.

Labels: data_science , machine_learning , kaggle

No comments

The Undeniable Beauty of Cross Entropy

July 17, 2018, 7:34 a.m.

When I began working on this project my intention was to do multi-class classification of the images. To this end I built my graph with logits and a cross-entropy loss function. I soon realized that the decision to do multi-class classification was quite ambitious, and scaled back to doing binary classification into positive and negative. My goal was to implement the multi-class approach once I had the binary approach working reasonably well, so I left the cross entropy in place.

Over the months I have been working on this I have realized that, for many reasons, the multi-class classification was a bad idea. For an academic project it might have made sense, but for any sort of real-world use case it made none. There is really no use in outputting a simple classification for something as important as detecting cancer. A much more useful output is the probability that each area of the image contains an abnormality, as this could aid a radiologist in diagnosing abnormalities rather than completely replacing her. Yet for some reason I never bothered to change the output or the loss function.

The limiting factor on the size of the model has been the GPU memory of the Google Cloud instance I am training this monstrosity on. So I've been trying to optimize the model to run within the RAM constraints and train in a reasonable amount of time. Mostly this has involved trying to keep the number of parameters to a minimum, but today I was looking at the model and realized that the logits were definitely not helping the situation.

For this problem classification was absolutely the wrong approach. We aren't trying to classify the content of the image; we are trying to detect abnormalities. The negative class is not really a separate class, but the absence of any abnormalities, and the graph and the loss function should reflect this. In order to coerce the logits into an output that reflected this reality, I put the logits through a softmax and then discarded the negative probability - as I said, the negative class doesn't really exist. However, the cross entropy function does not know this, and it places as much importance on the imaginary negative class as on the positive class (subject to the cross entropy weighting, of course). This means that the gradients placed equal weight on trying to find imaginary "normal" patterns, despite the fact that this information is discarded and never used.

So I reduced the logits layer to one unit, replaced the softmax activation with a sigmoid activation, and replaced the cross entropy with binary cross entropy. The change has been more impactful than I imagined it would be. The model immediately began performing better than the same model with the logits/cross entropy structure. It seems obvious that this would be the case, as now the model can focus on detecting abnormalities rather than wasting half of its effort on trying to detect normal patterns.
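The change itself is small; a TF 1.x-style sketch with made-up shapes (not my actual graph):

import tensorflow as tf

features = tf.placeholder(tf.float32, [None, 512])
labels = tf.placeholder(tf.float32, [None, 1])   # 1 = abnormality present

# One output unit instead of two logits: "normal" is just the absence
# of an abnormality, not a class of its own.
logit = tf.layers.dense(features, 1)
probability = tf.nn.sigmoid(logit)
loss = tf.reduce_mean(
    tf.nn.sigmoid_cross_entropy_with_logits(labels=labels, logits=logit))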

I am not sure why I waited so long to make this change and my best guess is that I was seduced by the undeniable elegance of the cross entropy loss function. For multi-class classification it is truly a thing of beauty, and I may have been blinded by that into attempting to use it in a situation it was not designed for.

Labels: coding , machine_learning , tensorflow , mammography

No comments

DeConvolution Artifacts

June 14, 2018, 9:21 a.m.

If you have ever used deconvolutions to upsample layers of convnets you have probably seen artifacts and possibly checkerboard patterns. This article explains why and gives some useful tips as to how to avoid the problem. I have implemented some of the suggestions and, while it's a bit early to evaluate their efficacy, so far they seem to be helping.


Labels: coding , machine_learning

No comments

As I continue to work on my mammography project I save a lot of time by re-using weights from models I have already trained rather than training every iteration of every model from scratch, which would be very time consuming. However, a drawback to this method is that if I add or change a layer and then continue training the model, the layers which have not changed are prone to overfit, as they have been trained for substantially longer than the new layers.

I tried only training certain variables, but when the checkpoint is saved only the trained variables are included in it, which means that the checkpoint can not be restored as it is missing many variables. This could be overcome by restoring certain variables from one checkpoint and others from a different checkpoint, but that is overly complicated and not very convenient.

Earlier today, I added another deconvolution layer to my model. When I trained just that layer, the accuracy of the model went very high very quickly, much more quickly than when training all of the layers. But then I couldn't continue training all of the layers, because the checkpoint only contained the layer I had trained. I don't have the time to retrain the entire monstrosity from scratch, so I found an ugly hack that allows me to train mostly the layers I want to train while saving all of the weights in the checkpoint.

I create two training ops - one for all variables (train_op_1) and one for the variables I want to train (train_op_2). I run train_op_2 most of the time, but right before I save the checkpoint I do one iteration of train_op_1, which updates all layers, so all variables are saved in the checkpoint. It's not pretty, but it works, and best of all the code doesn't have to be changed depending on what I want to train: I specify whether to train all vars or just the subset as a command-line arg, and if I want to train all vars I simply set train_op_2 = train_op_1.
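A TF 1.x-style sketch of the trick with a toy model; the scope name 'new_layer' stands in for whatever layers were just added:

import tensorflow as tf

x = tf.placeholder(tf.float32, [None, 4])
y = tf.placeholder(tf.float32, [None, 1])
hidden = tf.layers.dense(x, 8, activation=tf.nn.relu, name='old_layer')
output = tf.layers.dense(hidden, 1, name='new_layer')
loss = tf.losses.mean_squared_error(y, output)

optimizer = tf.train.AdamOptimizer(1e-4)
all_vars = tf.trainable_variables()
new_vars = [v for v in all_vars if v.name.startswith('new_layer')]

train_op_1 = optimizer.minimize(loss, var_list=all_vars)   # updates everything
train_op_2 = optimizer.minimize(loss, var_list=new_vars)   # updates only the new layers

# Run train_op_2 for most iterations; run one step of train_op_1 right before
# saver.save() so that every variable makes it into the checkpoint.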

I just ran a few quick tests with no issues, hopefully this will continue to work.

Labels: python , data_science , machine_learning , tensorflow

No comments

Intelligent Conversation About AI

June 7, 2018, 9:24 a.m.

Most of the videos I see on YouTube discussing the dangers of AI and machine learning are by people who really have no idea about the subject. I recently started to watch a TED talk by a guy who said that machine learning programs wrote themselves. I had to turn it off about 30 seconds in because the guy obviously had learned about AI from watching Terminator or some other Hollywood movie. I also get pissed off when I hear Elon Musk talk about the "dangers" of AI. It amazes me that someone who is obviously intelligent is so clueless about the subject.

I just watched a discussion about AI from the World Science Festival which was notable for being an intelligent discussion of the subject by people who are actually familiar with the technology. It is rather long, but it touches on many subjects and every subject is discussed intelligently.


Labels: machine_learning

No comments

Update on CBIS-DDSM Training Images

June 6, 2018, 3:07 p.m.

Even though I only have 1/5 of the images uploaded so far, I decided to do some tests to see if this method would work. It does, but it took quite a bit of tweaking to get performance to reasonable levels.

At first I just plugged the new dataset into the old graph, and this worked but was incredibly slow with the GPU sitting idle most of the time. I tried quite a few different methods to speed the pre-processing bottleneck up, but the solution was simpler than I had thought it would be.

The biggest factor was increasing the number of threads in tf.train.batch from the default of 1. This one change made a huge difference, cutting the training time down to about 1/4 of what it had been.
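The change itself is one argument; a toy sketch with random tensors standing in for the real queue pipeline:

import tensorflow as tf

image = tf.random_uniform([320, 320, 1])
label = tf.random_uniform([320, 320, 1])

images, labels = tf.train.batch(
    [image, label],
    batch_size=32,
    num_threads=8,      # was the default of 1; this keeps the GPU fed
    capacity=320)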

I also experimented with moving some pre-processing operations around, including resizing the images individually when loading them versus after they are batched. This had negligible effects, although resizing them individually was slightly faster than doing it as a batch. In general I found that the more pre-processing operations I moved to the queue (and the CPU), the better the performance.

This version still trains at about 1/2 the speed the tfrecords version did, which is a big difference, but the size of the training set has increased by orders of magnitude so I guess I can live with it. 

The code is available on my GitHub.

Labels: python , machine_learning , tensorflow , mammography

No comments

CBIS-DDSM Mammography Training Data

June 6, 2018, 1:32 p.m.

I am continuing to work with the CBIS-DDSM datasets and recently decided to take a new direction with the training data. Previously I had been locally segmenting the raw scans into images of varying sizes and writing those images to tfrecords to use as training data. I started by classifying the images by pathology with categorical labels, and while I got decent results using this approach, the models performed terribly on images from different datasets and on full-size images. I suspected the model was using features of the images that were not related to the actual ROIs to make its predictions, such as the amount of contrast or presence of extremely high pixel values.

To address this I started using the masks as labels and training the model to segment the images into normal and ROI. This had the added advantage of allowing me to exclude images from the DDSM dataset and only use CBIS-DDSM images, which eliminated the features I believed the previous models had been relying on, as the DDSM and CBIS-DDSM datasets had substantially different variances, minimums, maximums and means. The disadvantage of this approach was that the dataset doubled in size, since the labels are now the same size as the images.

I started with a dataset of 320x320 images; however, models trained on this dataset often had trouble with images which had bright patches running off the edge of the image, and with high-contrast images, misclassifying the bright patches as positive. To attempt to address this I started by training the model on 320x320 images, and then switched to another dataset of 640x640 images after training for 50 or so epochs.

The dataset of 640x640 images only had 13,000 training examples in it, about 1/3 the number of examples in the 320x320 dataset, but it was still larger on disk because each example and label is four times the size of a 320x320 image. I considered making another dataset with either more or larger images, but realized that this process could continue indefinitely, with me creating new datasets of larger and larger sizes.

Instead I decided to create one new dataset which could be used indefinitely, for all purposes. To do this I loaded each image in the CBIS-DDSM dataset into Python. While the JPEGs are RGB, the images are grayscale, so I kept only one channel of each image. Some images have multiple masks; rather than have multiple versions of each image with different masks, which could confuse the model, I combined all of the masks for each image into one mask and added that as the second channel of the image. In order to be able to save the array as an image I added a third channel of all 0s. Each new image was then saved as a PNG.

The resulting dataset is about 12GB, about four times the size of the largest tfrecords dataset, but the entirety of the CBIS-DDSM dataset (minus a few images which had masks of incorrect sizes and were discarded) is now represented. Now, in my model, I load each full image, take a random crop of it and use that as training data. Since the mask is part of the image I can use TensorFlow's random crop function to crop the full image, and then separate the channels into the training example and its label.
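A sketch of the crop-and-split step: since the mask rides along as channel 1 of the saved PNG, a single random crop stays aligned between the scan and its label (the crop size is just an example):

import tensorflow as tf

CROP_SIZE = 320

def crop_example(full_image):
    # full_image: [H, W, 3] tensor - scan in channel 0, mask in channel 1,
    # channel 2 is the all-zero padding channel.
    crop = tf.random_crop(full_image, [CROP_SIZE, CROP_SIZE, 3])
    scan = crop[:, :, 0:1]   # training example
    mask = crop[:, :, 1:2]   # label
    return scan, mask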

This not only dramatically increases the effective size of the training set, but since my model is fully convolutional, I can also easily change the crop size without having to create a new dataset.

The major problem with this approach is that the mean of the labels is very low - around 0.015 - meaning that only about 1.5% of the pixels have a positive label and the rest are negative. The previous dataset had a mean of 0.05. This will be addressed by raising the cross entropy weight from 20 to 75 so that the model doesn't just predict everything as negative. When creating the images I had trimmed as much background as possible from them to avoid having a large number of training images of pure black, but the random cropping still produces a large number of crops with little to no actual content. 
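
The weighting can be done with something like TensorFlow's weighted cross entropy - a sketch, not necessarily the exact loss in my model, assuming logits and labels are defined elsewhere:

# weight positive pixels more heavily so the model doesn't just predict everything as negative
loss = tf.reduce_mean(
    tf.nn.weighted_cross_entropy_with_logits(targets=labels, logits=logits, pos_weight=75))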

At the moment I am uploading the data to S3 which should take another couple days. Once this is done I will attempt to train on this new dataset and see if the empty images cause major problems.

Labels: coding , python , machine_learning , mammography

No comments

Segmentation of CBIS-DDSM Images

May 28, 2018, 9:38 a.m.

My original work with the DDSM and CBIS-DDSM dataset yielded good accuracy and recall on the test and validation data, but the model didn't perform so well when applied to the MIAS images, which came from a completely different dataset. Additional analysis of the images indicated that the negative images (from the DDSM) and the positive images (from the CBIS-DDSM) were different in some subtle but important ways:

  1. The negative images had a lower mean, lower maximum and higher minimum. 
  2. The negative images also had lower contrast.

We became concerned about point 2 when we discovered that increasing the contrast of any image made it more likely to be predicted as positive, and we discovered point 1 while investigating this further. When applying our fully convolutional model trained on the combined data to complete scans, rather than the 299x299 images we had trained on, we noticed that more than 50% of the sections of a positive image were predicted as positive, even if the ROI was, in fact, only present in one section. This indicated that the model was using some feature of the images other than the ROI in its prediction.

When starting this project, we had initially planned to segment the CBIS-DDSM images and use images which did not contain an ROI as negative images, but we were not certain whether the tissue in positive scans differed from the tissue in negative scans in ways that would keep this approach from generalizing to completely negative scans. When we realized that the scans had been pre-processed differently we attempted to adjust the negative images to make them more similar to the positive images, but were unable to do so without knowing how they had been processed.

Our solution to both of these issues was to train the model to do the segmentation of the scans rather than simple classification, using the masks as the labels. This approach had several advantages:

  1. Using the mask as the label tells the model where it needs to look, so we can ensure that it actually uses the ROIs rather than other features of the images, such as the contrast or maximum pixel value.
  2. This allows us to exclude the DDSM images and only use images from one dataset, as the ROI of most scans only encompasses a small portion of the image.

We recreated the model to do semantic segmentation by removing the last "fully connected" layer (which was implemented as a 1x1 convolution) and the logits layer, and upsampling the results with transpose convolutions. In order for the upsampling to work properly we needed the size of the images to be a multiple of 2 so that the dimension reduction could be properly undone, so we used images of size 320x320.
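
The upsampling head looks roughly like this - a simplified sketch of the idea rather than the exact architecture, where features stands in for the downsampled feature maps:

# undo the dimension reduction with strided transpose convolutions
up1 = tf.layers.conv2d_transpose(features, filters=64, kernel_size=4, strides=2, padding='same', activation=tf.nn.relu)
up2 = tf.layers.conv2d_transpose(up1, filters=32, kernel_size=4, strides=2, padding='same', activation=tf.nn.relu)

# one logit per pixel for the normal / ROI segmentation
logits = tf.layers.conv2d(up2, filters=1, kernel_size=1, padding='same', activation=None)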

We were able to get fairly good results training on this data with a pixel level accuracy of about 90% and a pixel recall of 70%. The image level accuracy and recall were 70% and 87%, respectively. While these results were respectable, we noticed certain patterns of incorrect predictions. Images which were mostly dark, with patches that were much brighter, tended to have the bright patches predicted positive regardless of the actual label. This pattern was mostly observed when the bright patch ran off the edge of the image. 

We know that the context of an ROI is important in detecting and diagnosing it, and we suspected that in the absence of context the model was predicting any patch substantially brighter than its surroundings to be positive. While for cancer detection it is better to make a false positive than a false negative, we thought that this pattern might become problematic when applying the model to images larger than those it was trained on. To address this issue we decided to create a dataset of larger images and continue training our model on those.

We created a dataset of 640x640 images and adjusted our existing model to take those as input. As the model is fully convolutional we can restore the model trained on 320x320 images and continue training it on the larger images with no problems, which we are currently in the process of doing. If the results of this are promising we may create another dataset of even larger images and fine-tune this model on those images until we have a model which takes complete images as input.

Labels: machine_learning , tensorflow , mammography , convnets , ddsm

No comments

DDSM Mammography

May 23, 2018, 10:02 a.m.

For a course I was taking at EPFL I was working on classifying images from the DDSM dataset with ConvNets. I had some success, although not as much as I would have liked, and I posted an edited version of my report on Medium.

The source code used to create and train the models is available in this GitHub repo, and the code used to create the data and do EDA is available here.

Although the course is over I am still working on this project, attempting to fix some of the issues that came up during the first stage.

Labels: python , machine_learning , mammography , convnets

No comments

I was training a ConvNet and everything was working fine during training. But when I evaluated the model on the validation data I was getting NaN for the cross entropy. I thought the cross entropy was attempting to take the log of 0, so I added a small epsilon value of 1e-10 to the logits to address that. I thought that would fix the problem but it did not.
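
For reference, the usual guard against log(0) looks something like this - a sketch of the general idea (clipping the probabilities) rather than exactly what I did, assuming one-hot labels y and softmax probabilities:

# clip the probabilities away from 0 before taking the log
safe_probs = tf.clip_by_value(probabilities, 1e-10, 1.0)
cross_entropy = -tf.reduce_mean(y * tf.log(safe_probs))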

Further investigation indicated that the NaNs were being introduced somewhere early in the network, in one of the convolutional layers. I checked the validation and training data to make sure there wasn't some fundamental difference between the two, thinking that maybe one was being pre-processed differently than the other, but that was not the case.

In my graph I am using tf.metrics and gathering all of the update ops into one op to be executed during training with:

extra_update_ops = tf.get_collection(tf.GraphKeys.UPDATE_OPS)

Also gathered into this op were the updates to the batch norms. I had done this many times before with no problems at all, so I never thought this could be a factor. But when I removed the extra update op from the evaluation code the problem went away. Including the ops generated to update my metrics individually caused no problems. 

I am not sure what the issue actually was, but I assume it has something to do with the batch normalization, or maybe there is another op created somewhere in my graph that caused this issue.

Update - I had been restoring the weights from a pre-trained model and I think the restored batch norms caused the problem. NOT restoring the batch norm variables when loading the weights seems to solve this problem completely; otherwise the issue still occurred sporadically.
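
Skipping the batch norm variables on restore can be done along these lines - a sketch, where the variable-name filter is an assumption about how the layers are named and checkpoint_path is a placeholder:

# restore everything except the batch norm variables
vars_to_restore = [v for v in tf.global_variables() if 'batch_normalization' not in v.name]
saver = tf.train.Saver(var_list=vars_to_restore)
saver.restore(sess, checkpoint_path)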

Labels: python , machine_learning , tensorflow

No comments

TensorFlow GPU Bottlenecks

May 16, 2018, 3:20 p.m.

I was training a model on a Google Cloud instance with a Tesla K80 GPU. This particular model required more data pre-processing than normal. The model was training very slowly, with the GPU usage oscillating between 0% and 75-100%. I thought the CPU was the bottleneck and was trying to put as much pre-processing on the GPU as possible.

I read TensorFlow's optimization guide, which suggested forcing the pre-processing to be on the CPU by enclosing it with:

with tf.device('/cpu:0'):

Since I thought the CPU was the bottleneck I didn't think that would help, but I tried it anyway because I had no other good ideas and was surprised that it worked like magic! The GPU usage now stays constant around 95-100% while the CPU usage stays at about the same levels as before.
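
For context, the pattern from the optimization guide looks roughly like this - a sketch, where read_and_preprocess is a stand-in for whatever the pre-processing actually is:

# pin the input pipeline to the CPU so the GPU is free to run the model
with tf.device('/cpu:0'):
    image, label = read_and_preprocess(train_path)
    batch_images, batch_labels = tf.train.shuffle_batch(
        [image, label], batch_size=32, capacity=2000, min_after_dequeue=1000)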

Labels: machine_learning , tensorflow , google_cloud

No comments

I have been working on a project to detect abnormalities in mammograms. I have been training it on Google Cloud instances with Nvidia Tesla K80 GPUs, which allow a model to be trained in days rather than weeks or months. However when I tried to do online data augmentation it became a huge bottleneck because it did the data augmentation on the CPU.

I had been using tf.image.random_flip_left_right and tf.image.random_flip_up_down but since those operations were run on the CPU the training slowed down to a crawl as the GPU sat idle waiting for the queue to be filled.

I found this post on Medium, Data Augmentation on GPU in Tensorflow, which uses tf.contrib.image instead of tf.image. tf.contrib.image is written to run on the GPU, so using this code allows the data augmentation to be performed on the GPU instead of the CPU and thus eliminates the bottleneck.

This has been a life saver for me. Adding it to my graph allows me to train for longer without overfitting and thus get better results.
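
The kind of thing the post describes looks roughly like this - a sketch, assuming images is a batch tensor that already lives on the GPU (the exact ops in my graph differ, but the idea is the same):

# random rotations computed on the GPU via tf.contrib.image
angles = tf.random_uniform([tf.shape(images)[0]], -0.2, 0.2)
images = tf.contrib.image.rotate(images, angles)

# random left-right flip of the whole batch, also on the GPU
flip = tf.random_uniform([], 0, 1)
images = tf.cond(flip > 0.5, lambda: tf.reverse(images, axis=[2]), lambda: images)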

Labels: python , machine_learning , tensorflow

No comments

Machine Learning

April 16, 2018, 9:35 a.m.

While we tend to think of ourselves as being at the pinnacle of evolution, in reality humans are barely a step up from monkeys. Our only real differentiation from them is that we have language, which allows us to communicate knowledge to others and to preserve that knowledge through time. Language gives us a huge advantage, allowing us to progressively accumulate new knowledge by building up on previous discoveries, but in the end we are just animals who have evolved to survive, like all other animals. We are not adapted to having civilizations and technology, we evolved to find food and procreate and the results of this can be seen all over - from how tech companies use simple tricks like noises and bright colors and intermittent rewards to keep us hooked, to how food companies load their food with salt, fat and sugar to keep us eating unhealthy food, to the cognitive biases and heuristics we use to make decisions under uncertainty. The point of all of this is that humans evolved to find food and avoid predators, and our brains are incredibly ill-suited to processing the large amounts of data that are required to make evaluations about the types of issues that we face everyday in today's complex world.

Computers, on the other hand, are designed for processing large amounts of data - they can do this very efficiently if programmed correctly. However, they lack our creativity - the ability to combine seemingly unrelated ideas into new ideas, and to come up with novel solutions to problems. Machine learning combines our creativity with the ability of computers to handle large amounts of data, specifically the ability to find patterns in data. In 2015 a journalist ran a study on chocolate, the results of which were that chocolate helps you lose weight. The study was commissioned as an example of "junk science" and only had 15 participants with 18 measurements for each participant. The author said “here’s a dirty little science secret: If you measure a large number of things about a small number of people, you are almost guaranteed to get a ‘statistically significant’ result.” Unfortunately much of science is conducted like this - the authors start a study to prove a hypothesis, and maybe they'll ignore some results which contradict the hypothesis if other results confirm it. No one likes to be wrong, and if the results are a bit ambiguous you can maybe just cherry pick the numbers you like. As the saying goes - often attributed to the economist Ronald Coase - "if you torture the data long enough it will confess to anything," and Darrell Huff's 1954 book "How to Lie with Statistics" shows just how easy that is to do. 

Machine learning is about letting the data speak for itself. With machine learning you set up a system of symbolic equations for transforming data into predictions and then feed the data into that system and see what happens. If you don't like the results you can change the system or the data, but the process is far too complex to be able to cherry pick the data you like and discard the rest. This combines the strengths of humans with the strengths of computers - the humans use their creativity and domain knowledge to create the system which they hope will find patterns in the data and the computers run the data through the system. While technically possible to do so, the process of analyzing the data is far more complex than could ever be done without a computer, and the computers can only do what they are told to do - they can not create novel ideas from nothing.

In my opinion, machine learning is the most important scientific technology in recent history. Just like electricity allowed energy to become uncoupled from the previous sources - fire and animal energy - machine learning uncouples the ability to process data from the constraints of the human brain. Properly used, I think machine learning will be as revolutionary as electricity was.

Labels: machine_learning

No comments

I am working on classifying mammography scans with a TensorFlow ConvNet. The scans are classified into five classes:

  • Normal
  • Benign Calcification
  • Malignant Calcification
  • Benign Mass
  • Malignant Mass

I was unsure of how I wanted to classify the scans so I created the model in such a way that it would work for any combination of classes. I initially started training with binary classification - normal or abnormal, with the goal of then expanding the number of classes once I had a model that made decent predictions on the binary case.

For the binary prediction I used precision, recall and a pr curve as metrics. When I expanded to multiple classes those metrics obviously no longer worked. As far as precision and recall are concerned I don't really care what type of abnormality the scan shows - I just care that it is abnormal at all. And I wanted to have the same metrics to compare across all my models, so I had to figure out a way to do precision and recall for all versions of the model.

The solution I came to was to "squash" my multi-class labels and predictions down into binary labels and predictions and feed those into the p/r metrics. I set up the classes so that 0 was always normal, so I can do the squashing as follows:

zero = tf.constant(0, dtype=tf.int64)
# True wherever the prediction / label is any abnormal class, False where it is normal (0)
collapsed_predictions = tf.greater(predictions, zero)
collapsed_labels = tf.greater(y, zero)

Collapsed_predictions and collapsed_labels will then contain True if the prediction or label is NOT 0 and False if it is. Then I can feed these into my precision and recall metrics:

recall, rec_op = tf.metrics.recall(labels=collapsed_labels, predictions=collapsed_predictions)
precision, prec_op = tf.metrics.precision(labels=collapsed_labels, predictions=collapsed_predictions)

I also created a pr curve metric to see how the thresholds would affect the predictions. First I convert the logits to probabilities via a softmax and then feed that into a pr_curve_streaming_op as the predictions. In order to make this work with multi-class classification I squash the probabilities down to the probability that the item is NOT normal. Since my labels are created such that normal is always 0, the probability that it is not normal is just 1 - the probability that it is:

probabilities = tf.nn.softmax(logits, name="probabilities")
_, update_op = summary_lib.pr_curve_streaming_op(name='pr_curve',
                                                predictions=(1 - probabilities[:, 0]),
                                                labels=collapsed_labels,
                                                updates_collections=tf.GraphKeys.UPDATE_OPS,
                                                num_thresholds=20)

 

Labels: python , machine_learning , tensorflow

No comments

TensorFlow and Google Cloud GPU Instances

April 1, 2018, 10:06 a.m.

I decided to try a Google Cloud GPU instance as well as EC2. Once I had my quotas set properly and was able to start the instance it took me all day to get TensorFlow running with GPU. The instructions Google provides are for CUDA 8.0, and the latest version of TensorFlow requires CUDA 9.0.

To get everything running follow these steps:

  1. curl -O https://developer.download.nvidia.com/compute/cuda/repos/ubuntu1604/x86_64/cuda-repo-ubuntu1604_9.0.176-1_amd64.deb
  2. sudo dpkg -i cuda-repo-ubuntu1604_9.0.176-1_amd64.deb
  3. sudo apt-get update
  4. sudo apt-get install cuda-9-0
  5. sudo nvidia-smi -pm 1

These are the steps from the instructions with the proper repo for CUDA 9.0 inserted.

Then I had to install cudnn, which isn't mentioned at all in Google's instructions. I downloaded libcudnn7_7.0.4.31-1+cuda9.0_amd64.deb from the Nvidia cudnn site, and then uploaded it to the instance with scp. Then install it with:

sudo dpkg -i libcudnn7_7.0.4.31-1+cuda9.0_amd64.deb

Then you need to export the path with:

echo 'export CUDA_HOME=/usr/local/cuda' >> ~/.bashrc
echo 'export PATH=$PATH:$CUDA_HOME/bin' >> ~/.bashrc
echo 'export LD_LIBRARY_PATH=$CUDA_HOME/lib64' >> ~/.bashrc
echo 'export LD_LIBRARY_PATH=/usr/local/cuda/extras/CUPTI/lib64:$LD_LIBRARY_PATH' >> ~/.bashrc
source ~/.bashrc

And finally install TensorFlow:

sudo apt-get install python-dev python-pip libcupti-dev
sudo pip install tensorflow-gpu

I used pip3 and python3, but the rest is the same. 

Update: I thought it was working fine but I was still getting errors about locating libcupti.so.9.0. That was fixed by making symlinks as described here.

I ran these commands and now it seems to be working...

  1. # Put symlinks in /usr/local/cuda
  2. sudo mkdir /usr/local/cuda
  3. cd /usr/local/cuda
  4. sudo ln -s /usr/lib/x86_64-linux-gnu/ lib64
  5. sudo ln -s /usr/include/ include
  6. sudo ln -s /usr/bin/ bin
  7. sudo ln -s /usr/lib/x86_64-linux-gnu/ nvvm
  8. sudo mkdir -p extras/CUPTI
  9. cd extras/CUPTI
  10. sudo ln -s /usr/lib/x86_64-linux-gnu/ lib64
  11. sudo ln -s /usr/include/ include

Another Update: TensorFlow requires version 7.0.4 of cudnn; I had originally downloaded 7.1.2. The code has been updated accordingly.

Final Update: I set up another instance and followed this process and it almost worked. I needed to export another path, which I added here. The commands to export the paths were temporary and had to be repeated every time the instance was booted, so I changed that to echo the paths into .bashrc so they would be set automatically.

Labels: coding , machine_learning , tensorflow , google_cloud

No comments

Amazon EC2 Deep Learning Instances

March 28, 2018, 2:53 p.m.

To resolve the problems I was having yesterday I ended up paying for an Amazon EC2 instance with the Deep Learning Ubuntu AMI. The instance type is p2.xlarge which costs $0.90/hour, but seems to be well worth it so far. In the last ten minutes I've been training a relatively small model on Google Cloud, which has been able to get through 60 steps. In contrast, on the EC2 instance the much larger model, training on the same data, has gone through 375 steps, where each epoch is 687 steps.

I did have some trouble accessing TensorBoard on the EC2 instance, but was able to get it running by following the tutorial. I also got Jupyter Notebook running and accessible from the outside world, again by following the tutorial, although I had to comment out the lines about the SSL certificates in the jupyter conf file in order to be able to connect. I decided to not use Jupyter Notebook, but it's nice to have it as an option.

Since this is just a project I am working on for myself, I'd prefer to not have to pay for the compute, but $0.90 per hour is manageable, and well worth it for the 10x increase in training speed. 

Labels: machine_learning , tensorflow , google_cloud , ec2

No comments

Google CoLab and Google Cloud

March 23, 2018, 6:30 p.m.

While it was amazing for running smaller models, apparently CoLab has its limitations. I'm working on a ConvNet that takes 299x299 images as input, and trying to train it on Google CoLab kept crashing the runtime with no error messages provided. The training data totalled about 2.3 GB, and I guess CoLab just couldn't handle it for whatever reason. 

I tried training on my laptop, but I estimated it would take about 6 hours per epoch, which is ridiculous, so then I tried to use Google Cloud's free trial to set up an instance with GPUs. Unfortunately the free trial no longer supports the ability to add GPUs, so that didn't work. I did set up an instance without GPUs which is training faster than my laptop right now, but not that much faster. My current estimate is about 2 hours per epoch.

My plan is to let this train overnight and see how it goes. If it is too slow I may try to use Google's TPUs, which are ostensibly optimized for TensorFlow. However they are very expensive at $6/hr. Amazon EC2 instances with GPUs are about the same price, which doesn't leave me many options. 

Labels: python , machine_learning , tensorflow , google , google_cloud

No comments

TensorFlow Queues and Validation

March 22, 2018, 1:36 p.m.

I am currently working with a dataset that is far too large to store in memory so I am using tfrecords and queues to feed the data in. This works great, except that I was not able to evaluate the model on the validation dataset every epoch like I usually do.

After spending quite a bit of time trying to figure out ways around this, none of which worked, I found an easy solution that does work.

# read a single example and its label from the tfrecords file
image, label = read_and_decode_single_example([train_path])
# the default features and labels come from the shuffle batch queue
X_def, y_def = tf.train.shuffle_batch([image, label], batch_size=8, capacity=2000, min_after_dequeue=1000)
# placeholders which fall back to the queued batches when nothing is fed in
X = tf.placeholder_with_default(X_def, shape=[None, 299, 299, 1])
y = tf.placeholder_with_default(y_def, shape=[None])

I have a function that reads that data in from the tfrecords file (read_and_decode_single_example()). I then create the default features and labels using shuffle batch. Finally I create X and y as placeholders with default, with the shuffled batches as the defaults.

Then when I am training I don't pass the feed dict, and it defaults to using the data from the tfrecords file. When it is time to evaluate, I pass the data in via a feed_dict and it uses that.
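
So the training and evaluation loops end up looking something like this - a simplified sketch, where train_op and accuracy stand in for my actual ops and X_val and y_val are the in-memory validation arrays:

# training: no feed_dict, so X and y default to the shuffled batches from the queue
sess.run(train_op)

# evaluation: feed the validation data in explicitly via the placeholders
val_acc = sess.run(accuracy, feed_dict={X: X_val, y: y_val})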

This is not a great solution, it is kind of ugly, and it does require loading the validation data into memory, but it works and is simple. I had also tried using tf.cond() to switch between reading the data from a train.tfrecords file and a test.tfrecords file but was unable to get that to work.

The research I did indicates that the preferred way to handle this is to use different sessions, or different graphs with weight sharing, but that just seems ridiculous to me. It shouldn't be that complicated.

Labels: python , data_science , machine_learning , tensorflow

1 comment

Google CoLaboratory File Persistence

Feb. 25, 2018, 10:59 a.m.

It took me a while to figure out exactly what was going on with the files I was uploading and creating using Google's CoLaboratory. Each user has a VM where their notebooks run and the VM only runs for 12 hours before it is spun down and recycled, taking with it any files you may have downloaded or created. The second day I used it I was surprised that the files I had spent time downloading, unzipping and importing were no longer there, and I had deleted the code to do that, so if you are using CoLab make sure you keep the code to get your data files!

I also tried to have two notebooks running at the same time thinking it would speed up some work I was doing, but it seems as if all of a user's notebooks run in the same VM, so there really is no advantage to having multiple notebooks running.

There is an instruction notebook that explains how to save files to Google Drive, which works very well and is easy to use. To do that run:

from google.colab import auth
from googleapiclient.http import MediaFileUpload
from googleapiclient.discovery import build

auth.authenticate_user()

Then you have to enter a code to authenticate yourself. Then I use this function to save files:

drive_service = build('drive', 'v3')

def save_file_to_drive(name, path):
  file_metadata = {
    'name': name,
    'mimeType': 'application/octet-stream'
  }

  media = MediaFileUpload(path,
                          mimetype='application/octet-stream',
                          resumable=True)

  created = drive_service.files().create(body=file_metadata,
                                         media_body=media,
                                         fields='id').execute()

  print('File ID: {}'.format(created.get('id')))
  return created

The function takes two arguments, the name of the file and the path to it, and writes the file to the root of your Google Drive.
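
Using it is then just something like this (with a made-up file name):

save_file_to_drive('results.csv', 'results.csv')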

Note - This post was updated because my original guess as to how the VMs work was completely wrong. The VM instance exists for 12 hours; it is not tied to the runtime.

Labels: coding , machine_learning , tensorflow , google

No comments

Google CoLab

Feb. 20, 2018, 6:43 p.m.

On my laptop it takes forever to train my TensorFlow models, so I was looking for cheap online services where I could run the code and wasn't having any luck finding anything. Google Cloud does give you $300 worth of free processing time, but that's not really free. Then I found Google Colab, which is a Python notebook based environment where you can run code for free, and it includes GPU support!

It took me a little while to get everything set up, but it was relatively easy and it runs incredibly fast. The tricky part was getting my data into the notebook. While Colab saves the notebooks to your Google Drive, they do not run on your Google Drive so you can't just put the data on the Drive and then access it.

I used wget to download the data from a URL to wherever the notebook is running, then unzipped it with Python and then I was able to read the data, so it wasn't all that complicated. When I tried to follow the instructions on importing data from Google Drive via an API I was unable to get it to work - I kept getting errors about directories and files not existing despite the fact that they showed up when I did !ls.
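
The download-and-unzip step was just something along these lines (the URL is a placeholder):

# shell command to download the data into the VM running the notebook
!wget https://example.com/data.zip

# unzip it with Python
import zipfile
with zipfile.ZipFile('data.zip', 'r') as z:
    z.extractall('data')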

They have Tesla K80 GPUs available and the code runs incredibly fast. I'm still training my first model, but it seems like it's going to finish in about 20 minutes whereas it would have taken 3+ hours to train it locally. This difference in speed makes it possible to do things like tune the learning rate and hyperparameters, which are not practical to do locally if it takes hours to train the model.

This is an amazing service from Google and I am already using it heavily, just hours after having discovered it.

Labels: coding , python , machine_learning , google

No comments

Update on TensorFlow GPU Windows Errors

Feb. 16, 2018, 9:07 a.m.

After playing with TensorFlow GPU on Windows for a few days I have more information on the errors. I am running TensorFlow 1.6, currently the latest version, with Python 3.6 and Nvidia CUDA 9.0 on an Nvidia GeForce GT 750M.

When the Python Windows process crashes with an error that says CUDA_ERROR_LAUNCH_FAILED, the problem can be solved by reducing the fraction of the GPU memory available with:

config = tf.ConfigProto()
config.gpu_options.per_process_gpu_memory_fraction = 0.7  # limit TensorFlow to 70% of the GPU memory
sess = tf.Session(config=config)  # remember to pass the config into the session

If the Python script fails with an error about exhausted resources or being unable to allocate enough memory, then you need to use a smaller batch size. This problem does not crash the Python process; Python throws an exception but keeps running.

Once I figured these out, I have had no problems running models on the GPU at all.

Labels: python , machine_learning , tensorflow

No comments

TensorFlow GPU Errors on Windows

Feb. 15, 2018, 1:50 p.m.

I have been loving TensorFlow lately and have installed tensorflow-gpu on my Windows 10 laptop. Given that the GPU on my laptop is not a really great one I have run into quite a few issues, most of which I have solved. My GPU is an Nvidia GeForce GT 750M with 2GB of RAM and I am running the latest release of tensorflow as of February 2018, with Python 3.6. 

If you are running into errors I would suggest you try these things in this order:

  1. Try reducing the batch size for training AND validation. I always use batches for training but would evaluate on the validation data all at once. By using batches for validation and averaging the results I am able to avoid most of the memory errors.
  2. If this doesn't work, try to restrict the amount of GPU RAM available to tensorflow with config.gpu_options.per_process_gpu_memory_fraction = 0.7,
    which restricts the amount available to 70%. Note that I am never able to run the GPU with the memory fraction above 0.7.
  3. If all else fails, turn the GPU off and use the CPU: 
    config = tf.ConfigProto(device_count={'GPU': 0})

The difference between using the CPU and the GPU is like night and day... With the CPU it takes all day to train through 20 epochs, while with the GPU the same can be done in a few hours. I think the main roadblock with my GPU is the amount of RAM, which can easily be managed by controlling the batch size and the config settings above. Just remember to feed the config into the session.
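
For reference, feeding the config into the session is just something like this (using the CPU-only option from step 3 above):

# turn the GPU off and run everything on the CPU
config = tf.ConfigProto(device_count={'GPU': 0})
sess = tf.Session(config=config)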

Labels: python , data_science , machine_learning , tensor_flow

No comments

Batch Normalization with TensorFlow

Feb. 13, 2018, 1:44 p.m.

I was trying to use batch normalization in order to improve the accuracy of my CIFAR classifier with tf.layers.batch_normalization, and it seemed to have little to no effect. According to this StackOverflow post you need to do something extra, which is not mentioned in the documentation, in order to get the batch normalization to work.

extra_update_ops = tf.get_collection(tf.GraphKeys.UPDATE_OPS)
sess.run([train_op, extra_update_ops], ...)

The batch norm update operations are added to the UPDATE_OPS collection, so you need to gather them into an op and then feed it into the session along with the training op. Before I added the extra_update_ops the batch normalization was definitely not running; now it is, though whether it helps or not remains to be seen.

Also make sure to pass training=[BOOLEAN | TENSOR] in the call to batch_normalization() so that the moving averages, rather than the batch statistics, are used during evaluation. I use a placeholder and pass whether or not it is training in via the feed_dict:

training = tf.placeholder(dtype=tf.bool)

And then use this in my batch norm and dropout layers:

training=training

There were a few other things I had to do to get batch normalization to work properly:

  1. I had been using local response normalization, which apparently doesn't help that much. I removed those layers and replaced them with batch normalization layers.
  2. Remove the activation from the conv2d layers. I run the output through the batch normalization layers and then apply the relu.

Before I made these changes the model with the batch normalization didn't seem to be training at all - the accuracy was just going up and down right around the baseline of .10. After these changes it seems to be training properly.
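
The resulting pattern looks roughly like this - a simplified sketch, where inputs stands in for the layer input and training is the bool placeholder from above:

# convolution with no activation
conv = tf.layers.conv2d(inputs, filters=64, kernel_size=3, padding='same', activation=None)

# batch norm, with training passed in so the moving averages are used at evaluation time
bn = tf.layers.batch_normalization(conv, training=training)

# apply the relu after the batch norm
out = tf.nn.relu(bn)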

Labels: data_science , machine_learning , tensor_flow

No comments

Python code to make sure two data frames have the same columns in the same order. I used this to make sure that two dataframes had the same dummy columns after using pd.get_dummies:

# add any columns that are in X1 but missing from X2, filled with 0
missing_cols = set(X1.columns) - set(X2.columns)
for c in missing_cols:
    X2[c] = 0
# reorder X2's columns to match X1
X2 = X2[X1.columns]

Labels: coding , python , machine_learning

No comments
