Amazon EC2 Deep Learning Instances

March 28, 2018, 2:53 p.m.

To resolve the problems I was having yesterday I ended up paying for an Amazon EC2 instance with the Deep Learning Ubuntu AMI. The instance type is p2.xlarge which costs $0.90/hour, but seems to be well worth it so far. In the last ten minutes I've been training a relatively small model on Google Cloud, which has been able to get through 60 steps. In contrast, on the EC2 instance the much larger model, training on the same data, has gone through 375 steps, where each epoch is 687 steps.

I did have some trouble accessing TensorBoard on the EC2 instance, but was able to get it running by following the tutorial. I also got Jupyter Notebook running and accessible from the outside world, again by following the tutorial, although I had to comment out the lines about the SSL certificates in the jupyter conf file in order to be able to connect. I decided to not use Jupyter Notebook, but it's nice to have it as an option.

Since this is just a project I am working on for myself, I'd prefer to not have to pay for the compute, but $0.90 per hour is manageable, and well worth it for the 10x increase in training speed. 

Labels: machine_learning , tensorflow , google_cloud , ec2

No comments

Google CoLab and Google Cloud

March 23, 2018, 6:30 p.m.

While it was amazing for running smaller models, apparently CoLab has it's limitations. I'm working on a ConvNet that takes 299x299 images as input and trying to train it on Google CoLab kept crashing the runtime with no error messages provided. The training data totalled about 2.3 GB, and I guess CoLab just couldn't handle it for whatever reason. 

I tried training on my laptop, but I estimated it would take about 6 hours per epoch, which is ridiculous, so then I tried to use Google Cloud's free trial to set up an instance with GPUs. Unfortunately the free trial no longer supports the ability to add GPUs, so that didn't work. I did set up an instance without GPUs which is training faster than my laptop right now, but not that much faster. My current estimate about about 2 hours per epoch.

My plan is to let this train overnight and see how it goes. If it is too slow I may try to use Google's TPUs, which are ostensibly optimized for TensorFlow. However they are very expensive at $6/hr. Amazon EC2 instances with GPUs are about the same price, which doesn't leave me many options. 

Labels: python , machine_learning , tensorflow , google , google_cloud

No comments

TensorFlow Queues and Validation

March 22, 2018, 1:36 p.m.

I am currently working with a dataset that is far too large to store in memory so I am using tfrecords and queues to feed the data in. This works great, except that I was not able to evaluate the model on the validation dataset every epoch like I usually do.

After spending quite a bit of time trying to figure out ways around this, none of which worked, I found an easy solution that does work.

batch, labels = read_and_decode_single_example([train_path])
X_def, y_def = tf.train.shuffle_batch([image, label], batch_size=8, capacity=2000, min_after_dequeue=1000)
X = tf.placeholder_with_default(X_def, shape=[None, 299, 299, 1])
y = tf.placeholder_with_default(y_def, shape=[None])

I have a function that reads that data in from the tfrecords file (read_and_decode_single_example()). I then create the default features and labels using shuffle batch. Finally I create X and y as placeholders with default, with the shuffled batches as the defaults.

Then when I am training I don't pass the feed dict, and it defaults to using the data from the tfrecords file. When it is time to evaluate, I pass the data in via a feed_dict and it uses that.

This is not a great solution, it is kind of ugly, and it does require loading the validation data into memory, but it works and is simple. I had also tried using tf.cond() to switch between reading the data from a train.tfrecords file and a test.tfrecords file but was unable to get that to work.

The research I did indicates that the preferred way to handle this is to use different sessions, or different graphs with weight sharing, but that just seems ridiculous to me. It shouldn't be that complicated.

Labels: python , data_science , machine_learning , tensorflow

No comments

Google CoLaboratory File Persistence

Feb. 25, 2018, 10:59 a.m.

It took me a while to figure out exactly what was going on with the files I was uploading and creating using Google's CoLaboratory. Each user has a VM where their notebooks run and the VM only runs for 12 hours before it is spun down and recycled, taking with it any files you may have downloaded or created. The second day I used it I was surpised that the files I had spent time downloading, unzipping and importing were no longer there, and I had deleted the code to do that, so if you are using CoLab make sure you keep the code to get your data files!

I also tried to have two notebooks running at the same time thinking it would speed up some work I was doing, but it seems as if all of a user's notebooks run in the same VM, so there really is no advantage to having multiple notebooks running.

There is an instruction notebook that explains how to save files to Google Drive, which works very well and is easy to use. To do that run:

from google.colab import auth
from googleapiclient.http import MediaFileUpload
from googleapiclient.discovery import build

auth.authenticate_user()

Then you have to enter a code to authenticate yourself. Then I use this function to save files:

drive_service = build('drive', 'v3')

def save_file_to_drive(name, path):
  file_metadata = {
    'name': name,
    'mimeType': 'application/octet-stream'
  }
  
  media = MediaFileUpload(path, 
                        mimetype='application/octet-stream',
                        resumable=True)
  
  created = drive_service.files().create(body=file_metadata,
                                       media_body=media,
                                       fields='id').execute()

  print('File ID: {}'.format(created.get('id')))
  return created

The function takes two arguments, the name of the file and the path to it, and write the file to the root of your Google drive.

Note - This post was updated because my original guess as to how the VMs work was completely wrong. The VM instance exists for 12 hours, they are not tied to the runtime.

Labels: coding , machine_learning , tensorflow , google

No comments

Archives