CoLab TPUs One Month Later

Oct. 31, 2018, 11:49 a.m.

After having used both CoLab GPUs and TPUs for almost a month I must significantly revise my previous opinion. Even for a Keras model not written or optimized for TPUs, with some minimal configuration changes TPUs perform much faster - minimum of twice the speed. In addition to making sure that all operations are TPU compatible, the only major configuration change required is increasing the batch size by 8. At first I was playing around with the batch size, but I realized that this was unnecessary. TPUs have 8 shards, so you simply multiple the GPU batch size by 8 and that should be a good baseline. 

The model I am currently training on a TPU and a GPU simultaneously is training 3-4x faster on the TPU than on the GPU and the code is exactly the same. I have this block of code:

use_tpu = True
# if we are using the tpu copy the keras model to a new var and assign the tpu model to model
if use_tpu:
    TPU_WORKER = 'grpc://' + os.environ['COLAB_TPU_ADDR']
    
    # create network and compiler
    tpu_model = tf.contrib.tpu.keras_to_tpu_model(
        model, strategy = tf.contrib.tpu.TPUDistributionStrategy(
            tf.contrib.cluster_resolver.TPUClusterResolver(TPU_WORKER)))
    
    BATCH_SIZE = BATCH_SIZE * 8

The model is created with Keras and the only change I make is setting use_tpu to True on the TPU instance. 

One other thing I thought I would mention is that CoLab creates separate instances for GPU, TPU and CPU, so you can run multiple notebooks without sharing RAM or processor if you give each one a different type.

Labels: machine_learning , tensorflow , google , google_cloud

4 comments

I have previously written about Google CoLab which is a way to access Nvidia K80 GPUs for free, but only for 12 hours at a time. After a few months of using Google Cloud instances with GPUs I have run up a substantial bill and have reverted to using CoLab whenever possible. The main problem with CoLab is that the instance is terminated after 12 hours taking all files with it, so in order to use them you need to save your files somewhere.

Until recently I had been saving my files to Google Drive with this method, but while it is easy to save files to Drive it is much more difficult to read them back. As far as I can tell, in order to do this with the API you need to get the file id from Drive and even then it is not so straightforward to upload the files to CoLab. To deal with this I had been uploading files that needed to be accessed often to an AWS S3 bucket and then downloading them to CoLab with wget, which works fine, but there is a much simpler way to do the same thing by using Google Cloud Storage instead of S3.

First you need to authenticate CoLab to your Google account with:

from google.colab import auth

auth.authenticate_user()

Once this is done you need to set your project and bucket name and then update the gcloud config.
project_id = [project_name]
bucket_name = [bucket_name]
!gcloud config set project {project_id}

After this has been done files can simply and quickly be upload or downloaded from the bucket with the following simple commands:

# download
!gsutil cp gs://{bucket_name}/foo.bar ./foo.bar

# upload
!gsutil cp  ./foo.bar gs://{bucket_name}/foo.bar

I actually have been adding the line to upload the weights to GCS to my training code so it is automatically uploaded every couple epochs, which removes the need for me to manually back them up periodically throughout the day.

Labels: coding , python , machine_learning , google , google_cloud

1 comment

Google CoLab and Google Cloud

March 23, 2018, 6:30 p.m.

While it was amazing for running smaller models, apparently CoLab has it's limitations. I'm working on a ConvNet that takes 299x299 images as input and trying to train it on Google CoLab kept crashing the runtime with no error messages provided. The training data totalled about 2.3 GB, and I guess CoLab just couldn't handle it for whatever reason. 

I tried training on my laptop, but I estimated it would take about 6 hours per epoch, which is ridiculous, so then I tried to use Google Cloud's free trial to set up an instance with GPUs. Unfortunately the free trial no longer supports the ability to add GPUs, so that didn't work. I did set up an instance without GPUs which is training faster than my laptop right now, but not that much faster. My current estimate about about 2 hours per epoch.

My plan is to let this train overnight and see how it goes. If it is too slow I may try to use Google's TPUs, which are ostensibly optimized for TensorFlow. However they are very expensive at $6/hr. Amazon EC2 instances with GPUs are about the same price, which doesn't leave me many options. 

Labels: python , machine_learning , tensorflow , google , google_cloud

No comments

Google CoLaboratory File Persistence

Feb. 25, 2018, 10:59 a.m.

It took me a while to figure out exactly what was going on with the files I was uploading and creating using Google's CoLaboratory. Each user has a VM where their notebooks run and the VM only runs for 12 hours before it is spun down and recycled, taking with it any files you may have downloaded or created. The second day I used it I was surpised that the files I had spent time downloading, unzipping and importing were no longer there, and I had deleted the code to do that, so if you are using CoLab make sure you keep the code to get your data files!

I also tried to have two notebooks running at the same time thinking it would speed up some work I was doing, but it seems as if all of a user's notebooks run in the same VM, so there really is no advantage to having multiple notebooks running.

There is an instruction notebook that explains how to save files to Google Drive, which works very well and is easy to use. To do that run:

from google.colab import auth
from googleapiclient.http import MediaFileUpload
from googleapiclient.discovery import build

auth.authenticate_user()

Then you have to enter a code to authenticate yourself. Then I use this function to save files:

drive_service = build('drive', 'v3')

def save_file_to_drive(name, path):
  file_metadata = {
    'name': name,
    'mimeType': 'application/octet-stream'
  }
  
  media = MediaFileUpload(path, 
                        mimetype='application/octet-stream',
                        resumable=True)
  
  created = drive_service.files().create(body=file_metadata,
                                       media_body=media,
                                       fields='id').execute()

  print('File ID: {}'.format(created.get('id')))
  return created

The function takes two arguments, the name of the file and the path to it, and write the file to the root of your Google drive.

Note - This post was updated because my original guess as to how the VMs work was completely wrong. The VM instance exists for 12 hours, they are not tied to the runtime.

Labels: coding , machine_learning , tensorflow , google

No comments

Google CoLab

Feb. 20, 2018, 6:43 p.m.

On my laptop it takes forever to train my TensorFlow models. I was looking for cheap online services where I could run the code and not having any luck finding anything, Google Cloud Computing does give you $300 worth of free processing time, but that's not really free. I did find Google Colab which is a Python notebook based environment where you can run code for free, and it includes GPU support!

It took me a little while to get everything set up, but it was relatively easy and it runs incredibly fast. The tricky part was getting my data into the notebook. While Colab saves the notebooks to your Google Drive, they do not run on your Google Drive so you can't just put the data on the Drive and then access it.

I used wget to download the data from a URL to wherever the notebook is running, then unzipped it with Python and then I was able to read the data, so it wasn't all that complicated. When I tried to follow the instructions on importing data from Google Drive via an API I was unable to get it to work - I kept getting errors about directories and files not existing despite the fact that they showed up when I did !ls.

They have Tesla K80 GPUs available and the code runs incredibly fast. I'm still training my first model, but it seems like it's going to finish in about 20 minutes whereas it would have taken 3+ hours to train it locally. This difference in speed makes it possible to do things like tune the learning rate and hyperparameters, which are not practical to do locally if it takes hours to train the model.

This is an amazing service from Google and I am already using it heavily, just hours after having discovered it.

Labels: coding , python , machine_learning , google

No comments

Archives