PyTorch Update

April 18, 2019, 7:26 a.m.

After another couple of weeks using PyTorch my initial enthusiasm has somewhat faded. I still like it a lot, but I have encountered a number of disadvantages. For one, I can now see the advantage of TensorFlow's static graphs: they make the API easier to use. Since the graph is completely defined and then compiled, you can just tell each layer how many units it should have and it will infer the number of inputs from whatever its input is. In PyTorch you need to manually specify the inputs and outputs of each layer. That isn't a big deal, but it makes it more difficult to tune networks: to change the number of units in a layer you also need to change the inputs to the next layer, the batch normalization, etc., whereas with TensorFlow you can just change one number and everything is magically adjusted.

I also think that the TensorFlow API is better than PyTorch's. There are some things which are very easy to do in TensorFlow which become incredibly complicated with PyTorch, like adding different regularization amounts to different layers. In TensorFlow there is a parameter to the layer that controls the regularization; in PyTorch you apparently need to loop through all of the parameters and know which ones to add what amount of regularization to.
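To make the "loop through all of the parameters" idea concrete, here is a small framework-agnostic sketch. It is not PyTorch's actual API: the parameter names, the `l2_by_layer` mapping and the plain lists standing in for weight tensors are all made up for illustration, and skipping the biases is just a common convention I am assuming.

```python
# Hypothetical sketch: a different L2 penalty for each layer.
# Plain lists stand in for weight tensors; every name here is illustrative.

def l2_penalty(named_params, l2_by_layer, default=0.0):
    """Sum lambda * ||w||^2 over the parameters, picking lambda by layer name."""
    total = 0.0
    for name, weights in named_params.items():
        # assumed convention: biases are not regularized
        if name.endswith(".bias"):
            continue
        # "conv1.weight" belongs to layer "conv1"
        layer = name.split(".")[0]
        lam = l2_by_layer.get(layer, default)
        total += lam * sum(w * w for w in weights)
    return total


named_params = {
    "conv1.weight": [1.0, 2.0],
    "conv1.bias": [5.0],
    "fc.weight": [3.0],
}
l2_by_layer = {"conv1": 0.1, "fc": 0.01}

penalty = l2_penalty(named_params, l2_by_layer)
print(penalty)
```

The resulting penalty would then be added to the loss before the backward pass.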

I suppose one could easily get around these limitations with custom functions and such, and it shouldn't be surprising that TensorFlow seems more mature given that it has the weight of Google behind it, is considered the "industry standard", and has been around for longer. But I now see that TensorFlow has some advantages over PyTorch.
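As an example of the kind of custom function I mean, the sketch below derives every layer's input and output size from a single list, so changing one number updates the layers around it. The helper name `layer_dims` and the sizes are hypothetical, and the pairs would still have to be passed to whatever layer constructor you use.

```python
# Hypothetical helper: derive each layer's (inputs, outputs) from one list.

def layer_dims(n_inputs, hidden_units, n_outputs):
    """Return (in_features, out_features) pairs for a stack of dense layers."""
    sizes = [n_inputs] + list(hidden_units) + [n_outputs]
    return list(zip(sizes[:-1], sizes[1:]))


dims = layer_dims(784, [256, 128], 10)
print(dims)  # [(784, 256), (256, 128), (128, 10)]
```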

Labels: python , machine_learning , tensorflow , pytorch

PyTorch

April 8, 2019, 2:31 p.m.

When I first started with neural networks I learned them with TensorFlow and it seemed like TensorFlow was pretty much the industry standard. I did however keep hearing about PyTorch which was supposedly better than TensorFlow in many ways, but I never really got around to learning it. Last week I had to do one of my assignments in PyTorch so I finally got around to it, and I am already impressed.

The biggest problem I always had with TensorFlow was that the graphs are static. The entire graph must be defined and compiled before it is run and it can't be altered at runtime. You feed data into the graph and it returns output. This results in the rather awkward tf.Session() which must be created before you can do anything, and which contains all of the parameters for the model.

PyTorch has dynamic graphs which are compiled at runtime. This means that you can change things as you go, including altering the graph while it is running, and you don't need to have all the dimensions of all of the data specified in advance like you do in TensorFlow. You can also do things like change the number of neurons in a layer dynamically and drop entire layers at runtime, which you can't do with TensorFlow.

Debugging PyTorch is a lot easier since you can just make a change and test it - you don't need to recreate the graph and instantiate a session to test it out. You can just run an optimization step whenever you want. Coming from TensorFlow that is just a breath of fresh air.

TensorFlow still has many advantages, including the fact that it is still an industry standard, is easier to deploy and is better supported. But PyTorch is definitely a worthy competitor, is far more flexible, and solves many of the problems with TensorFlow.

Labels: python , machine_learning , tensorflow , pytorch

CoLab TPUs One Month Later

Oct. 31, 2018, 11:49 a.m.

After having used both CoLab GPUs and TPUs for almost a month I must significantly revise my previous opinion. Even for a Keras model not written or optimized for TPUs, with some minimal configuration changes TPUs perform much faster, at a minimum twice the speed. In addition to making sure that all operations are TPU compatible, the only major configuration change required is increasing the batch size by a factor of 8. At first I was playing around with the batch size, but I realized that this was unnecessary. TPUs have 8 shards, so you simply multiply the GPU batch size by 8 and that should be a good baseline.

The model I am currently training on a TPU and a GPU simultaneously is training 3-4x faster on the TPU than on the GPU and the code is exactly the same. I have this block of code:

import os
import tensorflow as tf

use_tpu = True
# if we are using the TPU, convert the Keras model (defined earlier) to a TPU model
if use_tpu:
    TPU_WORKER = 'grpc://' + os.environ['COLAB_TPU_ADDR']

    # wrap the existing Keras model so that it runs on the TPU
    tpu_model = tf.contrib.tpu.keras_to_tpu_model(
        model, strategy=tf.contrib.tpu.TPUDistributionStrategy(
            tf.contrib.cluster_resolver.TPUClusterResolver(TPU_WORKER)))

    # one GPU-sized batch for each of the 8 TPU shards
    BATCH_SIZE = BATCH_SIZE * 8

The model is created with Keras and the only change I make is setting use_tpu to True on the TPU instance. 

One other thing I thought I would mention is that CoLab creates separate instances for GPU, TPU and CPU, so you can run multiple notebooks without sharing RAM or processor if you give each one a different type.

Labels: machine_learning , tensorflow , google , google_cloud

Keras

Sept. 26, 2018, 1:03 p.m.

When I first started working with TensorFlow I didn't really like Keras. It seemed like a dumbed down interface to TensorFlow and I preferred having greater control over everything to the ease of use of Keras. However I have recently changed my mind. When you use Keras with a TensorFlow back-end you can still use TensorFlow if you need to tweak something that you can't in Keras, but otherwise Keras just provides an easier to use way to access TensorFlow's functionality. This is especially useful for prototyping models since you can easily make changes without having to write or rewrite a lot of code. I used to write my own functions to do things like make a convolutional layer, but most of that was duplicating functionality that already exists in Keras. 

My original opinion was incorrect: Keras is a valuable tool for creating neural networks, and since you can mix TensorFlow in, there is nothing lost by using it.

Labels: machine_learning , tensorflow , keras

The Undeniable Beauty of Cross Entropy

July 17, 2018, 7:34 a.m.

When I began working on this project my intention was to do multi-class classification of the images. To this end I built my graph with logits and a cross-entropy loss function. I soon realized that the decision to do multi-class classification was quite ambitious, and scaled back to doing binary classification into positive and negative. My goal was to implement the multi-class approach once I had the binary approach working reasonably well, so I left the cross entropy in place.

Over the months I have been working on this I have realized that, for many reasons, the multi-class classification was a bad idea. For an academic project it might have made sense, but for any sort of real world use case it made none. There is really no use in outputting a simple classification for something as important as detecting cancer. A much more useful output is the probability that each area of the image contains an abnormality, as this could aid a radiologist in diagnosing abnormalities rather than completely replacing her. Yet for some reason I never bothered to change the output or the loss function.

The limiting factor on the size of the model has been the GPU memory of the Google Cloud instance I am training this monstrosity on. So I've been trying to optimize the model to run within the RAM constraints and train in a reasonable amount of time. Mostly this has involved trying to keep the number of parameters to a minimum, but today I was looking at the model and realized that the logits were definitely not helping the situation.

For this problem classification was absolutely the wrong approach. We aren't trying to classify the content of the image, we are trying to detect abnormalities. The negative class was not really a separate class, but the absence of any abnormalities, and the graph and the loss function should reflect this. In order to coerce the logits into an output that reflected the reality just described, I put the logits through a softmax and then discarded the negative probability - as I said the negative class doesn't really exist. However the cross entropy function does not know this and it places equal importance on the imaginary negative class as on the positive class (subject to the cross entropy weighting of course.) This means that the gradients placed equal weight on trying to find imaginary "normal" patterns, despite the fact that this information is discarded and never used.

So I reduced the logits layer to one unit, replaced the softmax activation with a sigmoid activation, and replaced the cross entropy with binary cross entropy. The change has been more impactful than I imagined it would be. The model immediately began performing better than the same model with the logits/cross entropy structure. In hindsight it seems obvious that this would be the case, as the model can now focus on detecting abnormalities rather than wasting half of its efforts on trying to detect normal patterns.
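For reference, the single-unit version of the output and loss can be written out in a few lines of plain Python. This is only an illustrative sketch for one example, not the code from my graph; the function names and the `eps` clamp are my own choices here.

```python
import math


def sigmoid(z):
    """Map a single logit to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def binary_cross_entropy(label, prob, eps=1e-7):
    """Loss for one example; eps keeps log() away from 0."""
    prob = min(max(prob, eps), 1.0 - eps)
    return -(label * math.log(prob) + (1 - label) * math.log(1.0 - prob))


# One output unit: the logit only has to say "abnormal or not".
logit = 0.0
p_abnormal = sigmoid(logit)
loss = binary_cross_entropy(1, p_abnormal)
print(p_abnormal, loss)
```

There is no second output for a "normal" class, so no gradient is spent on it.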

I am not sure why I waited so long to make this change and my best guess is that I was seduced by the undeniable elegance of the cross entropy loss function. For multi-class classification it is truly a thing of beauty, and I may have been blinded by that into attempting to use it in a situation it was not designed for.

Labels: coding , machine_learning , tensorflow , mammography

More on Deconvolution

July 5, 2018, 10:37 a.m.

I wrote about this paper before, but I am going to again because it has been so enormously useful to me. I am still working on segmentation of mammograms to highlight abnormalities and I recently decided to scrap the approach I had been taking to upsampling the image and start that part from scratch.

When I started I had been using the earliest approach to upsampling, which was basically to take my classifier, remove the last fully-connected layer and upsample that back to full resolution with transpose convolutions. This worked well enough, but the network had to upsample images from 2x2x1024 to 640x640x2, and in order to do this I needed to add skip connections from the downsizing section to the upsampling section. This caused problems because the network would add features of the input image to the output, regardless of whether the features were relevant to the label. I tried to get around this by adding bottleneck layers before the skip connection in order to select only the pertinent features, but this greatly slowed down training and didn't help much, and the output ended up with a lot of weird artifacts.

In "Deconvolution and Checkerboard Artifacts", Odena et al. demonstrate that replacing transpose convolutions with nearest-neighbor resizing followed by a plain convolution produces smoother images. I tried replacing a few of my transpose convolutions with resizes and the results improved.
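To show what the resize step does, here is a minimal pure-Python sketch of nearest-neighbor upsampling on a tiny 2-D grid. It is only an illustration with a made-up function name; in the actual model the resize is a TensorFlow op and is followed by a plain convolution, which is not shown here.

```python
def upsample_nearest(image, factor=2):
    """Nearest-neighbor resize of a 2-D list by an integer factor."""
    out = []
    for row in image:
        # repeat every pixel horizontally...
        wide = [pixel for pixel in row for _ in range(factor)]
        # ...and every widened row vertically
        out.extend([list(wide) for _ in range(factor)])
    return out


small = [[1, 2],
         [3, 4]]
big = upsample_nearest(small)
for row in big:
    print(row)
```

Every output pixel is a copy of exactly one input pixel, so unlike a strided transpose convolution there is no uneven overlap between kernel positions.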

Then I started reading about dilated convolutions and I started wondering why I was downsizing my input from 640x640 to 5x5 just to have to resize it back up. I removed all the fully-connected layers (which in fact were 1x1 convolutions rather than fully-connected layers) and then replaced the last max pool with a dilated convolution.

I replaced all of the transpose convolutions with resizes, except for the last two layers, as suggested by Odena et al., and the final transpose convolution has a stride of 1 in order to smooth out artifacts.

In the downsizing section, the current model reduces the input from 640x640x1 to 20x20x512, then it is upsampled to 320x320x32 by using nearest-neighbor resizing followed by plain convolutions. Finally there is a transpose convolution with a stride of 2 followed by a transpose convolution with a stride of 1 and then a softmax for the output. As an added bonus, this version of the model trains significantly faster than upsampling with transpose convolutions.

I just started training this model, but I am fairly confident it will perform better than previous upsampling schemes as when I extracted the last downsizing convolutional layer from the model that layer appeared closer to the label (although much smaller) than the final output did. I will update when I have actual results.

Update - After training the model for just one epoch, with the downsizing layer weights initialized from a previous model, the results are already significantly better than under the previous scheme.

Labels: coding , data_science , tensorflow , mammography , convnets , ddsm

As I continue to work on my mammography project I save a lot of time by re-using weights from models I have already trained rather than training every iteration of every model from scratch, which would be very time consuming. However a drawback to this method is that if I add a new layer or change a layer when I continue training the model the layers which have not changed are prone to overfit as they have been trained for substantially longer than the new layers.

I tried only training certain variables, but when the checkpoint is saved only the trained variables are included in it, which means that the checkpoint cannot be restored as it is missing many variables. This could be overcome by restoring certain variables from one checkpoint and others from a different checkpoint, but that is overly complicated and not very convenient.

Earlier today, I had added another deconvolution layer to my model. When I trained just that layer the accuracy of the model went very high very quickly, much more quickly than training all of the layers. But then I couldn't continue training all of the layers because the checkpoint only contained the layer trained. I don't have the time to retrain the entire monstrosity from scratch, so I found an ugly hack that allows me to train mostly the layers I want to train while saving all of the weights in the checkpoint.

I create two training ops - one for all variables (train_op_1) and one for the variables I want to train (train_op_2). I run train_op_2 most of the time. But right before I save the checkpoint I do one iteration of train_op_1 which updates all layers, so all variables are saved in the checkpoint. It's not pretty, but it works and best of all, the code doesn't have to be changed depending on what I want to train. I specify whether I want to train all vars or just the subset as a command line arg and if I want to train all vars, then set train_op_2 = train_op_1.
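The scheduling logic is easier to see in isolation. The sketch below is plain Python rather than the real training loop: the op names follow the description above, while `pick_train_op`, `save_every` and the assumption that a checkpoint is written at a fixed step interval are hypothetical.

```python
def pick_train_op(step, save_every, train_all):
    """Return the name of the op to run at this step."""
    if train_all:
        # train_op_2 is simply set to train_op_1 in this mode
        return "train_op_1"
    if (step + 1) % save_every == 0:
        # last step before a checkpoint: update every variable once
        return "train_op_1"
    return "train_op_2"


schedule = [pick_train_op(s, save_every=5, train_all=False) for s in range(10)]
print(schedule)
```

In the real loop the returned name would correspond to the op passed to the session, with the checkpoint saved immediately after each train_op_1 step.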

I just ran a few quick tests with no issues; hopefully this will continue to work.

Labels: python , data_science , machine_learning , tensorflow

Update on CBIS-DDSM Training Images

June 6, 2018, 3:07 p.m.

Even though I only have 1/5 of the images uploaded so far, I decided to do some tests to see if this method would work. It does, but it took quite a bit of tweaking to get performance to reasonable levels.

At first I just plugged the new dataset into the old graph, and this worked but was incredibly slow with the GPU sitting idle most of the time. I tried quite a few different methods to speed the pre-processing bottleneck up, but the solution was simpler than I had thought it would be.

The biggest factor was increasing the number of threads in tf.train.batch from the default of 1. This one change made a huge difference, cutting the training time down to about 1/4 of what it had been.

I also experimented with moving some pre-processing operations around, including resizing the images individually when loading them and after being batched. This had negligible effects, but resizing them individually was slightly faster than doing it as a batch. In general I found that the more pre-processing operations I moved to the queue (and the CPU) the better the performance.

This version still trains at about 1/2 the speed the tfrecords version did, which is a big difference, but the size of the training set has increased by orders of magnitude so I guess I can live with it. 

The code is available on my GitHub.

Labels: python , machine_learning , tensorflow , mammography

Segmentation of CBIS-DDSM Images

May 28, 2018, 9:38 a.m.

My original work with the DDSM and CBIS-DDSM dataset yielded good accuracy and recall on the test and validation data, but the model didn't perform so well when applied to the MIAS images, which came from a completely different dataset. Additional analysis of the images indicated that the negative images (from the DDSM) and the positive images (from the CBIS-DDSM) were different in some subtle but important ways:

  1. The negative images had a lower mean, lower maximum and higher minimum. 
  2. The negative images also had lower contrast.

We had become concerned about point 2 when we discovered that increasing the contrast of any image made it more likely to be predicted as positive and discovered point 1 while investigating this further. When applying our fully convolutional model trained on the combined data to complete scans, rather than the 299x299 images we had trained on, we noticed that more than 50% of the sections of a positive image were predicted as positive, even if the ROI was, in fact, only present in one section. This indicates that the model was using some feature of the images other than the ROI in its prediction.
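The comparison behind points 1 and 2 amounts to a handful of summary statistics per image. The sketch below shows one way to compute them; the pixel values are made up, and measuring contrast as the standard deviation of the intensities is an assumption here rather than necessarily the definition used in our analysis.

```python
from statistics import mean, pstdev


def image_stats(pixels):
    """Summary statistics for a flat list of pixel intensities."""
    return {
        "mean": mean(pixels),
        "min": min(pixels),
        "max": max(pixels),
        # assumed definition: RMS contrast, the std. dev. of the intensities
        "contrast": pstdev(pixels),
    }


# Made-up intensities, chosen only to mimic the pattern described above.
negative = image_stats([40, 50, 60, 50])
positive = image_stats([0, 50, 200, 150])
print(negative)
print(positive)
```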

When starting this project, we had initially planned to segment the CBIS-DDSM images and use images which did not contain an ROI as negative images, but we were concerned that there might be differences between the tissue of positive and negative scans which would keep this approach from generalizing to completely negative scans. When we realized that the scans had been pre-processed differently we attempted to adjust the negative images in such a way as to make them more similar to the positive images but were unable to do so without knowledge of how they had been processed.

Our solution to both of these issues was to train the model to do the segmentation of the scans rather than simple classification, using the masks as the labels. This approach had several advantages:

  1. Using the mask as the label tells the model where it needs to look, so we can ensure that it actually uses the ROIs rather than other features of the images, such as the contrast or maximum pixel value.
  2. This allows us to exclude the DDSM images and only use images from one dataset, as the ROI of most scans only encompasses a small portion of the image.

We recreated the model to do semantic segmentation by removing the last "fully connected" layer (which was implemented as a 1x1 convolution) and the logits layer and upsampling the results with transpose convolutions. In order for the upsampling to work properly we needed to have the size of the images be a multiple of 2 so that the dimension reduction could be properly undone, so we used images of size 320x320.

We were able to get fairly good results training on this data with a pixel level accuracy of about 90% and a pixel recall of 70%. The image level accuracy and recall were 70% and 87%, respectively. While these results were respectable, we noticed certain patterns of incorrect predictions. Images which were mostly dark, with patches that were much brighter, tended to have the bright patches predicted positive regardless of the actual label. This pattern was mostly observed when the bright patch ran off the edge of the image. 
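For clarity about what the pixel-level numbers mean, here is a minimal sketch of how accuracy and recall can be computed from a label mask and a predicted mask. The masks are made-up ten-pixel examples and the function name is hypothetical; the real metrics were computed inside the TensorFlow graph.

```python
def pixel_metrics(label_mask, predicted_mask):
    """Pixel-level accuracy and recall for two flat 0/1 masks."""
    pairs = list(zip(label_mask, predicted_mask))
    correct = sum(1 for y, p in pairs if y == p)
    true_pos = sum(1 for y, p in pairs if y == 1 and p == 1)
    actual_pos = sum(label_mask)
    accuracy = correct / len(pairs)
    # recall is undefined without positive pixels; report 0.0 in that case
    recall = true_pos / actual_pos if actual_pos else 0.0
    return accuracy, recall


label = [0, 0, 1, 1, 1, 0, 0, 0, 0, 0]
pred = [0, 0, 1, 1, 0, 0, 0, 1, 0, 0]
accuracy, recall = pixel_metrics(label, pred)
print(accuracy, recall)
```

Because most pixels in a scan are negative, accuracy can be high while recall is noticeably lower, which is why both are reported.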

We know that the context of an ROI is important in detecting and diagnosing it, and we suspected that in the absence of context the model was predicting any patch substantially brighter than its surroundings to be positive. While for cancer detection it is better to make a false positive than a false negative, we thought that this pattern might become problematic when applying the model to images larger than those it was trained on. To address this issue we decided to create a dataset of larger images and continue training our model on those.

We created a dataset of 640x640 images and adjusted our existing model to take those as input. As the model is fully convolutional we can restore the model trained on 320x320 images and continue training it on the larger images with no problems, which we are currently in the process of doing. If the results of this are promising we may create another dataset of even larger images and fine-tune this model on those images until we have a model which takes complete images as input.

Labels: machine_learning , tensorflow , mammography , convnets , ddsm

I was training a ConvNet and everything was working fine during training. But when I evaluated the model on the validation data I was getting NaN for the cross entropy. I thought it was the cross entropy attempting to take the log of 0 and added a small epsilon value of 1e-10 to the logits to address that. I thought that would fix the problem but it did not.

Further investigation indicated that the NaNs were being introduced somewhere early in the network, in one of the convolutional layers. I checked the validation and training data to make sure there wasn't some fundamental difference between the two, thinking that maybe one was being pre-processed differently than the other, but that was not the case.

In my graph I am using tf.metrics and gathering all of the update ops into one op to be executed during training with:

extra_update_ops = tf.get_collection(tf.GraphKeys.UPDATE_OPS)

Also gathered into this op were the updates to the batch norms. I had done this many times before with no problems at all so never thought this could be a factor. But when I removed the extra update op from the evaluation code the problem went away. Including the ops generated to update my metrics individually caused no problems.

I am not sure what the issue actually was, but I assume it has something to do with the batch normalization, or maybe there is another op created somewhere in my graph that caused this issue.

Update - I had been restoring the weights from a pre-trained model and I think the restored batch norms caused the problem. NOT restoring the batch norms when loading the weights seems to solve this problem completely. Otherwise the issue still occurred sporadically.

Labels: python , machine_learning , tensorflow

TensorFlow GPU Bottlenecks

May 16, 2018, 3:20 p.m.

I was training a model on a Google Cloud instance with a Tesla K80 GPU. This particular model had more data pre-processing required than normal. The model was training very slowly, with the GPU usage oscillating between 0% and 75-100%. I thought the CPU was the bottleneck and was trying to put as much pre-processing on the GPU as possible.

I read TensorFlow's optimization guide, which suggested forcing the pre-processing to be on the CPU by enclosing it with:

with tf.device('/cpu:0'):

Since I thought the CPU was the bottleneck I didn't think that would help, but I tried it anyway because I had no other good ideas and was surprised that it worked like magic! The GPU usage now stays constant around 95-100% while the CPU usage stays at about the same levels as before.

Labels: machine_learning , tensorflow , google_cloud

I have been working on a project to detect abnormalities in mammograms. I have been training it on Google Cloud instances with Nvidia Tesla K80 GPUs, which allow a model to be trained in days rather than weeks or months. However when I tried to do online data augmentation it became a huge bottleneck because it did the data augmentation on the CPU.

I had been using tf.image.random_flip_left_right and tf.image.random_flip_up_down but since those operations were run on the CPU the training slowed down to a crawl as the GPU sat idle waiting for the queue to be filled.

I found this post on Medium, Data Augmentation on GPU in Tensorflow, which uses tf.contrib.image instead of tf.image. tf.contrib.image is written to run on the GPU, so using this code allows the data augmentation to be performed on the GPU instead of the CPU and thus eliminates the bottleneck.

This has been a life saver for me. Adding it to my graph allows me to train for longer without overfitting and thus get better results.

Labels: python , machine_learning , tensorflow

I am working on classifying mammography scans with a TensorFlow ConvNet. The scans are classified into five classes:

  • Normal
  • Benign Calcification
  • Malignant Calcification
  • Benign Mass
  • Malignant Mass

I was unsure of how I wanted to classify the scans so I created the model in such a way that it would work for any combination of classes. I initially started training with binary classification - normal or abnormal, with the goal of then expanding the number of classes once I had a model that made decent predictions on the binary case.

For the binary prediction I used precision, recall and a pr curve as metrics. When I expanded to multiple classes obviously those metrics no longer worked. As far as precision and recall go, I don't really care what type of abnormal the scan is - I just care that it is abnormal at all. And I wanted to have the same metrics to compare for all my models, so I had to figure out a way to do precision and recall for all versions of the model.

The solution I came to was to "squash" my multi-class labels and predictions down into binary labels and predictions and feed those into the p/r metrics. I set up the classes so that 0 was always normal, so I can do the squashing as follows:

zero = tf.constant(0, dtype=tf.int64)
collapsed_predictions = tf.greater(predictions, zero)
collapsed_labels = tf.greater(y, zero)

collapsed_predictions and collapsed_labels will then contain True if the prediction or label is NOT 0 and False if it is. Then I can feed these into my precision and recall metrics:

recall, rec_op = tf.metrics.recall(labels=collapsed_labels, predictions=collapsed_predictions)
precision, prec_op = tf.metrics.precision(labels=collapsed_labels, predictions=collapsed_predictions)
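To check that the squashing does what I expect, the same idea can be reproduced in plain Python on a handful of made-up labels and predictions. This is only a sanity-check sketch with hypothetical helper names, not a replacement for the tf.metrics ops above.

```python
def collapse(values):
    """Map class ids to booleans: True for anything that is not class 0."""
    return [v > 0 for v in values]


def precision_recall(labels, predictions):
    """Binary precision and recall from two lists of booleans."""
    pairs = list(zip(labels, predictions))
    tp = sum(1 for y, p in pairs if y and p)
    fp = sum(1 for y, p in pairs if not y and p)
    fn = sum(1 for y, p in pairs if y and not p)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


y = [0, 1, 4, 2, 0, 3]            # made-up labels, 0 is always normal
predictions = [0, 2, 4, 0, 1, 3]  # made-up predicted classes
precision, recall = precision_recall(collapse(y), collapse(predictions))
print(precision, recall)
```

Note that predicting class 2 for a scan labelled class 1 still counts as a true positive, which is exactly the behaviour I want from these metrics.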

I also created a pr curve metric to see how the thresholds would affect the predictions. First I convert the logits to probabilities via a softmax and then feed that into a pr_curve_streaming_op as the predictions. In order to make this work with multi-class classification I squash the probabilities down to the probability that the item is NOT normal. Since my labels are created such that normal is always 0, the probability that it is not normal is just 1 - the probability that it is:

# summary_lib is assumed to be TensorBoard's summary module
from tensorboard import summary as summary_lib

probabilities = tf.nn.softmax(logits, name="probabilities")
_, update_op = summary_lib.pr_curve_streaming_op(name='pr_curve',
                                                predictions=(1 - probabilities[:, 0]),
                                                labels=collapsed_labels,
                                                updates_collections=tf.GraphKeys.UPDATE_OPS,
                                                num_thresholds=20)
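The "1 - probability of normal" step can be checked by hand with a small pure-Python softmax. The logits below are made-up values for the five classes, and the sketch only illustrates the arithmetic, not the TensorFlow ops.

```python
import math


def softmax(logits):
    """Numerically stable softmax over one example's logits."""
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


logits = [2.0, 1.0, 0.5, 0.5, 0.0]  # made-up values, class 0 is normal
probabilities = softmax(logits)

# probability that the scan is NOT normal
p_abnormal = 1.0 - probabilities[0]
print(p_abnormal)
```

Since the probabilities sum to 1, this is the same as adding up the probabilities of the four abnormal classes.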

 

Labels: python , machine_learning , tensorflow

TensorFlow and Google Cloud GPU Instances

April 1, 2018, 10:06 a.m.

I decided to try a Google Cloud GPU instance as well as EC2. Once I had my quotas set properly and was able to start the instance it took me all day to get TensorFlow running with GPU. The instructions Google provides are for CUDA 8.0, and the latest version of TensorFlow requires CUDA 9.0.

To get everything running follow these steps:

  1. curl -O https://developer.download.nvidia.com/compute/cuda/repos/ubuntu1604/x86_64/cuda-repo-ubuntu1604_9.0.176-1_amd64.deb
  2. sudo dpkg -i cuda-repo-ubuntu1604_9.0.176-1_amd64.deb
  3. sudo apt-get update
  4. sudo apt-get install cuda-9-0
  5. sudo nvidia-smi -pm 1

These are the steps in the instructions with the proper repo to CUDA 9.0 inserted.

Then I had to install cudnn, which isn't mentioned at all in Google's instructions. I downloaded libcudnn7_7.0.4.31-1+cuda9.0_amd64.deb from the Nvidia cudnn site, and then uploaded it to the instance with scp. Then install it with:

sudo dpkg -i libcudnn7_7.0.4.31-1+cuda9.0_amd64.deb

Then you need to export the path with:

echo 'export CUDA_HOME=/usr/local/cuda' >> ~/.bashrc
echo 'export PATH=$PATH:$CUDA_HOME/bin' >> ~/.bashrc
echo 'export LD_LIBRARY_PATH=$CUDA_HOME/lib64' >> ~/.bashrc
echo 'export LD_LIBRARY_PATH=/usr/local/cuda/extras/CUPTI/lib64:$LD_LIBRARY_PATH' >> ~/.bashrc
source ~/.bashrc

And finally install TensorFlow:

sudo apt-get install python-dev python-pip libcupti-dev
sudo pip install tensorflow-gpu

I used pip3 and python3, but the rest is the same. 

Update: I thought it was working fine but I was still getting errors about locating libcupti.so.9.0. That was fixed by making symlinks as described here.

I ran these commands and now it seems to be working...

  1. # Put symlinks in /usr/local/cuda
  2. sudo mkdir /usr/local/cuda
  3. cd /usr/local/cuda
  4. sudo ln -s /usr/lib/x86_64-linux-gnu/ lib64
  5. sudo ln -s /usr/include/ include
  6. sudo ln -s /usr/bin/ bin
  7. sudo ln -s /usr/lib/x86_64-linux-gnu/ nvvm
  8. sudo mkdir -p extras/CUPTI
  9. cd extras/CUPTI
  10. sudo ln -s /usr/lib/x86_64-linux-gnu/ lib64
  11. sudo ln -s /usr/include/ include

Another Update: TensorFlow requires version 7.0.4 of cudnn. I had originally downloaded 7.1.2; the code has been updated accordingly.

Final Update: I set up another instance and followed this process and it almost worked. I needed to export another path, which I added here. The commands to export the path were temporary and had to be repeated every time the instance was booted, so I changed them to echo the path to .bashrc so it would be set automatically.

Labels: coding , machine_learning , tensorflow , google_cloud

Amazon EC2 Deep Learning Instances

March 28, 2018, 2:53 p.m.

To resolve the problems I was having yesterday I ended up paying for an Amazon EC2 instance with the Deep Learning Ubuntu AMI. The instance type is p2.xlarge which costs $0.90/hour, but seems to be well worth it so far. In the last ten minutes I've been training a relatively small model on Google Cloud, which has been able to get through 60 steps. In contrast, on the EC2 instance the much larger model, training on the same data, has gone through 375 steps, where each epoch is 687 steps.

I did have some trouble accessing TensorBoard on the EC2 instance, but was able to get it running by following the tutorial. I also got Jupyter Notebook running and accessible from the outside world, again by following the tutorial, although I had to comment out the lines about the SSL certificates in the jupyter conf file in order to be able to connect. I decided to not use Jupyter Notebook, but it's nice to have it as an option.

Since this is just a project I am working on for myself, I'd prefer to not have to pay for the compute, but $0.90 per hour is manageable, and well worth it for the 10x increase in training speed. 

Labels: machine_learning , tensorflow , google_cloud , ec2

Google CoLab and Google Cloud

March 23, 2018, 6:30 p.m.

While it was amazing for running smaller models, apparently CoLab has its limitations. I'm working on a ConvNet that takes 299x299 images as input, and trying to train it on Google CoLab kept crashing the runtime with no error messages provided. The training data totalled about 2.3 GB, and I guess CoLab just couldn't handle it for whatever reason.

I tried training on my laptop, but I estimated it would take about 6 hours per epoch, which is ridiculous, so then I tried to use Google Cloud's free trial to set up an instance with GPUs. Unfortunately the free trial no longer supports the ability to add GPUs, so that didn't work. I did set up an instance without GPUs which is training faster than my laptop right now, but not that much faster. My current estimate is about 2 hours per epoch.

My plan is to let this train overnight and see how it goes. If it is too slow I may try to use Google's TPUs, which are ostensibly optimized for TensorFlow. However they are very expensive at $6/hr. Amazon EC2 instances with GPUs are about the same price, which doesn't leave me many options. 

Labels: python , machine_learning , tensorflow , google , google_cloud

No comments

TensorFlow Queues and Validation

March 22, 2018, 1:36 p.m.

I am currently working with a dataset that is far too large to store in memory so I am using tfrecords and queues to feed the data in. This works great, except that I was not able to evaluate the model on the validation dataset every epoch like I usually do.

After spending quite a bit of time trying to figure out ways around this, none of which worked, I found an easy solution that does work.

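# Read a single (image, label) example from the training tfrecords file
image, label = read_and_decode_single_example([train_path])
# Default inputs: shuffled batches pulled from the queue
X_def, y_def = tf.train.shuffle_batch([image, label], batch_size=8, capacity=2000, min_after_dequeue=1000)
# Placeholders fall back to the queue unless a feed_dict overrides them
X = tf.placeholder_with_default(X_def, shape=[None, 299, 299, 1])
y = tf.placeholder_with_default(y_def, shape=[None])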

I have a function that reads the data in from the tfrecords file (read_and_decode_single_example()). I then create the default features and labels using tf.train.shuffle_batch(). Finally I create X and y with tf.placeholder_with_default(), using the shuffled batches as the defaults.

Then when I am training I don't pass a feed_dict, and it defaults to using the data from the tfrecords file. When it is time to evaluate, I pass the validation data in via a feed_dict and it uses that instead.

This is not a great solution: it is kind of ugly, and it does require loading the validation data into memory, but it works and is simple. I had also tried using tf.cond() to switch between reading the data from a train.tfrecords file and a test.tfrecords file, but was unable to get that to work.

The research I did indicates that the preferred way to handle this is to use different sessions, or different graphs with weight sharing, but that just seems ridiculous to me. It shouldn't be that complicated.

Labels: python , data_science , machine_learning , tensorflow

1 comment

Google CoLaboratory File Persistence

Feb. 25, 2018, 10:59 a.m.

It took me a while to figure out exactly what was going on with the files I was uploading and creating using Google's CoLaboratory. Each user has a VM where their notebooks run, and the VM only runs for 12 hours before it is spun down and recycled, taking with it any files you may have downloaded or created. The second day I used it I was surprised that the files I had spent time downloading, unzipping and importing were no longer there, and I had deleted the code to do that. So if you are using CoLab, make sure you keep the code to get your data files!
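One way to keep that setup code around is a single cell that can be re-run safely every time the VM is recycled. The following is only a minimal sketch using the Python standard library; the function name, the marker file, and the url, dest_dir and archive_name parameters are all placeholders and not anything from my actual notebook.

```python
import os
import urllib.request
import zipfile


def fetch_and_unzip(url, dest_dir, archive_name="data.zip"):
    """Download and extract a zip archive unless that was already done.

    url, dest_dir and archive_name are hypothetical placeholders.
    """
    marker = os.path.join(dest_dir, ".extracted")
    if os.path.exists(marker):
        return dest_dir  # data is already present on this VM

    os.makedirs(dest_dir, exist_ok=True)
    archive_path = os.path.join(dest_dir, archive_name)
    urllib.request.urlretrieve(url, archive_path)

    with zipfile.ZipFile(archive_path) as archive:
        archive.extractall(dest_dir)

    # Record success so re-running the cell skips the download.
    open(marker, "w").close()
    return dest_dir
```

On a fresh VM the call downloads and extracts everything; on later runs within the same VM lifetime it returns immediately.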

I also tried to have two notebooks running at the same time thinking it would speed up some work I was doing, but it seems as if all of a user's notebooks run in the same VM, so there really is no advantage to having multiple notebooks running.

There is an instruction notebook that explains how to save files to Google Drive, which works very well and is easy to use. To do that run:

from google.colab import auth
from googleapiclient.http import MediaFileUpload
from googleapiclient.discovery import build

auth.authenticate_user()

Then you have to enter a code to authenticate yourself. Then I use this function to save files:

drive_service = build('drive', 'v3')

def save_file_to_drive(name, path):
  file_metadata = {
    'name': name,
    'mimeType': 'application/octet-stream'
  }
  
  media = MediaFileUpload(path, 
                        mimetype='application/octet-stream',
                        resumable=True)
  
  created = drive_service.files().create(body=file_metadata,
                                       media_body=media,
                                       fields='id').execute()

  print('File ID: {}'.format(created.get('id')))
  return created

The function takes two arguments, the name of the file and the path to it, and writes the file to the root of your Google Drive.

Note - This post was updated because my original guess as to how the VMs work was completely wrong. Each VM instance exists for 12 hours; it is not tied to the runtime.

Labels: coding , machine_learning , tensorflow , google

No comments

Update on TensorFlow GPU Windows Errors

Feb. 16, 2018, 9:07 a.m.

After playing with TensorFlow GPU on Windows for a few days I have more information on the errors. I am running TensorFlow 1.6, currently the latest version, with Python 3.6 and Nvidia CUDA 9.0 on an Nvidia GeForce GT 750M.

When the Python Windows process crashes with an error that says CUDA_ERROR_LAUNCH_FAILED, the problem can be solved by reducing the fraction of the GPU memory available with:

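config = tf.ConfigProto()
config.gpu_options.per_process_gpu_memory_fraction = 0.7  # cap TensorFlow at 70% of GPU memory
sess = tf.Session(config=config)  # the config only takes effect when passed to the session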

If the Python script fails with an error about exhausted resources or being unable to allocate enough memory, then you need to use a smaller batch size. Unlike the first problem, this one does not crash the Python process; Python just throws an Exception.
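Because the second failure surfaces as an ordinary exception, the batch size can in principle be reduced automatically. Here is a rough sketch of that idea, not something I actually ran: run_with_batch_backoff, train_fn and oom_errors are hypothetical names, train_fn is assumed to build and run the model for a given batch size, and with TensorFlow 1.x the exception to pass in would presumably be tf.errors.ResourceExhaustedError.

```python
def run_with_batch_backoff(train_fn, batch_size, oom_errors, min_batch_size=1):
    """Call train_fn(batch_size), halving batch_size on out-of-memory errors.

    train_fn and oom_errors are supplied by the caller (both hypothetical).
    Returns the batch size that worked together with train_fn's result.
    """
    while True:
        try:
            return batch_size, train_fn(batch_size)
        except oom_errors:
            if batch_size <= min_batch_size:
                raise  # nothing smaller left to try
            batch_size = max(min_batch_size, batch_size // 2)
            print("Out of memory, retrying with batch_size={}".format(batch_size))
```

This only helps with the memory-allocation error; the CUDA_ERROR_LAUNCH_FAILED crash kills the process, so there is nothing to catch.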

Once I figured these out, I have had no problems running models on the GPU at all.

Labels: python , machine_learning , tensorflow

No comments

Archives