Eric Antoine Scuccimarra

Google CoLab and Google Cloud

March 23, 2018, 6:30 p.m.

While it was amazing for running smaller models, apparently CoLab has it's limitations. I'm working on a ConvNet that takes 299x299 images as input and trying to train it on Google CoLab kept crashing the runtime with no error messages provided. The training data totalled about 2.3 GB, and I guess CoLab just couldn't handle it for whatever reason.

I tried training on my laptop, but I estimated it would take about 6 hours per epoch, which is ridiculous, so then I tried to use Google Cloud's free trial to set up an instance with GPUs. Unfortunately the free trial no longer supports the ability to add GPUs, so that didn't work. I did set up an instance without GPUs which is training faster than my laptop right now, but not that much faster. My current estimate about about 2 hours per epoch.

My plan is to let this train overnight and see how it goes. If it is too slow I may try to use Google's TPUs, which are ostensibly optimized for TensorFlow. However they are very expensive at $6/hr. Amazon EC2 instances with GPUs are about the same price, which doesn't leave me many options.

Labels: python , machine_learning , tensorflow , google , google_cloud

No comments

TensorFlow Queues and Validation

March 22, 2018, 1:36 p.m.

I am currently working with a dataset that is far too large to store in memory so I am using tfrecords and queues to feed the data in. This works great, except that I was not able to evaluate the model on the validation dataset every epoch like I usually do.

After spending quite a bit of time trying to figure out ways around this, none of which worked, I found an easy solution that does work.

batch, labels = read_and_decode_single_example([train_path]) X_def, y_def = tf.train.shuffle_batch([image, label], batch_size=8, capacity=2000, min_after_dequeue=1000) X = tf.placeholder_with_default(X_def, shape=[None, 299, 299, 1]) y = tf.placeholder_with_default(y_def, shape=[None])

I have a function that reads that data in from the tfrecords file (read_and_decode_single_example()). I then create the default features and labels using shuffle batch. Finally I create X and y as placeholders with default, with the shuffled batches as the defaults.

Then when I am training I don't pass the feed dict, and it defaults to using the data from the tfrecords file. When it is time to evaluate, I pass the data in via a feed_dict and it uses that.

This is not a great solution, it is kind of ugly, and it does require loading the validation data into memory, but it works and is simple. I had also tried using tf.cond() to switch between reading the data from a train.tfrecords file and a test.tfrecords file but was unable to get that to work.

The research I did indicates that the preferred way to handle this is to use different sessions, or different graphs with weight sharing, but that just seems ridiculous to me. It shouldn't be that complicated.

Labels: python , data_science , machine_learning , tensorflow

1 comment

Google CoLaboratory File Persistence

Feb. 25, 2018, 10:59 a.m.

It took me a while to figure out exactly what was going on with the files I was uploading and creating using Google's CoLaboratory. Each user has a VM where their notebooks run and the VM only runs for 12 hours before it is spun down and recycled, taking with it any files you may have downloaded or created. The second day I used it I was surpised that the files I had spent time downloading, unzipping and importing were no longer there, and I had deleted the code to do that, so if you are using CoLab make sure you keep the code to get your data files!

I also tried to have two notebooks running at the same time thinking it would speed up some work I was doing, but it seems as if all of a user's notebooks run in the same VM, so there really is no advantage to having multiple notebooks running.

There is an instruction notebook that explains how to save files to Google Drive, which works very well and is easy to use. To do that run:

from google.colab import auth from googleapiclient.http import MediaFileUpload from googleapiclient.discovery import build

auth.authenticate_user()

Then you have to enter a code to authenticate yourself. Then I use this function to save files:

drive_service = build('drive', 'v3')

def save_file_to_drive(name, path): file_metadata = { 'name': name, 'mimeType': 'application/octet-stream' } media = MediaFileUpload(path, mimetype='application/octet-stream', resumable=True) created = drive_service.files().create(body=file_metadata, media_body=media, fields='id').execute()

print('File ID: {}'.format(created.get('id'))) return created

The function takes two arguments, the name of the file and the path to it, and write the file to the root of your Google drive.

Note - This post was updated because my original guess as to how the VMs work was completely wrong. The VM instance exists for 12 hours, they are not tied to the runtime.

Labels: coding , machine_learning , tensorflow , google

No comments

Google CoLab

Feb. 20, 2018, 6:43 p.m.

On my laptop it takes forever to train my TensorFlow models. I was looking for cheap online services where I could run the code and not having any luck finding anything, Google Cloud Computing does give you $300 worth of free processing time, but that's not really free. I did find Google Colab which is a Python notebook based environment where you can run code for free, and it includes GPU support!

It took me a little while to get everything set up, but it was relatively easy and it runs incredibly fast. The tricky part was getting my data into the notebook. While Colab saves the notebooks to your Google Drive, they do not run on your Google Drive so you can't just put the data on the Drive and then access it.

I used wget to download the data from a URL to wherever the notebook is running, then unzipped it with Python and then I was able to read the data, so it wasn't all that complicated. When I tried to follow the instructions on importing data from Google Drive via an API I was unable to get it to work - I kept getting errors about directories and files not existing despite the fact that they showed up when I did !ls.

They have Tesla K80 GPUs available and the code runs incredibly fast. I'm still training my first model, but it seems like it's going to finish in about 20 minutes whereas it would have taken 3+ hours to train it locally. This difference in speed makes it possible to do things like tune the learning rate and hyperparameters, which are not practical to do locally if it takes hours to train the model.

This is an amazing service from Google and I am already using it heavily, just hours after having discovered it.

Labels: coding , python , machine_learning , google

No comments

«
1
2
3
4
5
6
7
8
9
10
11
12
13
14 (current)
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
»