Nov. 13, 2020, 9:40 a.m.
I have been using Colab for quite a few years now and have always really appreciated the ability to get access to GPUs (and TPUs) for free. So when I recently found out about Colab Pro I was reluctant to pay $10 a month for something I had been getting for free. At the same time, however, I was paying hundreds of dollars a month for cloud GPU instances. Last week, after going well over my AWS budget for the previous month, I decided to try Colab Pro, and I am very glad I did.
Colab Pro gives you priority on high-end GPUs; so far I have gotten a V100 every time. This is the same GPU I was paying a $0.90/hour spot (preemptible) rate for on AWS. For me, the main disadvantage of Colab was that each instance usually lasted about 10 hours before shutting down, and it would time out if left unattended. Colab Pro instances will last up to 24 hours, and in my experience they do not time out when left unattended. I had one running at work the other day and when I got home I figured it had timed out, but when I went back the next morning it was still running!
Obviously, Colab Pro is better suited to running experiments than to long training runs, and it doesn't support multiple GPUs. If you are using TensorFlow you also have access to TPUs (I prefer PyTorch). In the past I have repeatedly kicked myself after spending hundreds of dollars training a model and then finding a small mistake. In the future I will be running my experiments on Colab Pro and only using VMs when I am sure everything is correct and I need to train models quickly.
Sept. 16, 2019, 10:57 a.m.
Discovering how much cheaper spot EC2 instances were than normal on-demand instances gave me the courage to try out a faster GPU. I had been using K80s which are painfully slow, but very cheap. The spot price for the V100 is about the same as the on-demand price of the K80s, so using those with spot instances won't be any cheaper, but it won't be more expensive either.
I didn't think the V100s were such great GPUs, so I wasn't expecting them to be worth the extra cost. How wrong I was. Training the network I am currently playing with on a K80 with a batch size of 48 took about 8-12 hours per epoch. Training it on a V100 with a batch size of 64 looks like it is going to take about 2 hours. With the V100s priced at about 4x the K80s, the cost per epoch works out to somewhere between about the same and a little cheaper, depending on exactly how long an epoch took on the K80.
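To make that arithmetic concrete, here is a rough sketch using only the figures quoted above. Prices are expressed relative to the K80's hourly price rather than in dollars, and the constant and function names are made up for illustration.

```python
# Relative cost per epoch, using the figures quoted in the post.
K80_PRICE = 1.0                # hourly price, with the K80 as the baseline unit
V100_PRICE = 4.0 * K80_PRICE   # "about 4x the K80s"

K80_HOURS_PER_EPOCH = (8.0, 12.0)  # observed range at batch size 48
V100_HOURS_PER_EPOCH = 2.0         # rough estimate at batch size 64


def cost_per_epoch(hourly_price: float, hours_per_epoch: float) -> float:
    """Cost of one epoch, measured in K80-hours."""
    return hourly_price * hours_per_epoch


k80_low, k80_high = (cost_per_epoch(K80_PRICE, h) for h in K80_HOURS_PER_EPOCH)
v100_cost = cost_per_epoch(V100_PRICE, V100_HOURS_PER_EPOCH)

print(f"K80:  {k80_low:.0f} to {k80_high:.0f} K80-hours per epoch")
print(f"V100: {v100_cost:.0f} K80-hours per epoch")
print(f"V100/K80 cost ratio: {v100_cost / k80_high:.2f} to {v100_cost / k80_low:.2f}")
```

So the V100 epoch costs the equivalent of 8 K80-hours, against 8 to 12 on the K80 itself: break-even at worst, and about a third cheaper at best.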
When you factor in the value of not having to wait an entire day to see the results of an epoch, this is a no-brainer as far as I'm concerned. Unfortunately, I'm sure my AWS bill is going to increase substantially. That's how they get you... Once you have a taste of HPC they know you'll be back for more...
Sept. 12, 2019, 4:35 p.m.
My major complaint about using EC2 GPU instances was the cost: it gets very expensive to run a GPU instance for more than a few hours. Last week I was wondering why I wasn't using spot instances, so I set up a request and I've been running it for a few days now. It is about 1/4 the price of a normal instance, so it's not much more expensive than renting a CPU-only on-demand instance. I was hoping to get a better GPU than the K80, but I ended up settling for the K80 because it was more readily available. Next time I may request a better one and see what happens.
The downside of spot instances is that they will be terminated if the capacity is needed for an on-demand instance, and my instance was terminated the other night. But then I spun up a new one in the morning and that one has been running for a few days now. I can't believe I haven't used these before.
Aug. 27, 2019, 6:44 a.m.
It is difficult to play around with the structure for the GAN I am working on in Colab since it trains so slowly. I can usually get maybe 2 or 3 epochs in a day, which means that I need to wait a day before evaluating each change I make. I decided to rent a GPU in the cloud for a few days so I could train it a bit more quickly and figure out what works and what doesn't work before going back to Colab.
I already have a Google Cloud GPU instance that I was using for my work with mammography, but it was running CUDA 9.0, which apparently is not supported by PyTorch out of the box. I tried to upgrade CUDA to 10, but I think I ended up just making things worse. Rather than spend a whole day trying to fix the Google Cloud instance, and since I have some AWS credits, I decided to try an AWS Deep Learning AMI instance, which already has everything configured.
It was incredibly easy to get set up. It comes pre-configured with virtual environments for different deep learning frameworks and packages, so there is no need to install CUDA or drivers or anything like that. This is a huge advantage: back when I was setting up the Google Cloud instance it took me a few days to get everything installed and working. One thing I quickly noticed was that the default disk size was not even close to big enough. After downloading a few data files I was already running out of disk space, but it was very easy to increase the disk size.
Then all I had to do was activate the pytorch environment, launch a notebook and everything was running smoothly. I did run into a few minor issues, none of which were difficult to resolve:
I used to prefer Google Cloud to AWS because it was more configurable and easier to use. While AWS does have a bit of a learning curve, they really have thought of and provided for just about every possible contingency. We use AWS at my work, and it really is very impressive. I still like the simplicity of Google Cloud, but even simple things like AMIs make such a huge difference in set-up time that I think I'll be using AWS more often now.
March 19, 2019, 2:20 p.m.
I was running an AWS Glue job where I was reading Parquet files from an S3 bucket. When I loaded the files individually there were no problems, but when I loaded the entire directory and tried to do any sort of transformation on the data I got this error:
org.apache.spark.SparkException: Job aborted due to stage failure: Task 4 in stage 1.0 failed 4 times, most recent failure: Lost task 4.3 in stage 1.0 (TID 23, ip-172-31-4-9.eu-west-1.compute.internal, executor 1): org.apache.parquet.io.ParquetDecodingException: Can not read value at 0 in block -1 in file s3://...
I couldn't find any useful information about this online, so thought I'd post the solution here in case anyone else has the same problem.
The problem was that some of the files had columns that were entirely null, and apparently Spark doesn't like that. My guess is that when reading a file on its own Spark took the schema from that file, whereas reading the whole directory meant applying one schema across files that disagreed about those columns. I solved the problem by dropping any all-null columns before writing the Parquet files.
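As a rough illustration of that fix, here is a minimal sketch of dropping all-null columns before writing. It works on plain lists of dicts so that it runs anywhere; the function name and sample rows are made up, and this is not the code from the actual job. If the data is in a pandas DataFrame, `df.dropna(axis=1, how="all")` does the same thing.

```python
from typing import Any, Dict, List

Record = Dict[str, Any]


def drop_all_null_columns(records: List[Record]) -> List[Record]:
    """Return copies of the records without columns that are None in every row."""
    columns = {key for record in records for key in record}
    # Keep a column only if at least one row has a non-null value for it.
    keep = {
        col for col in columns
        if any(record.get(col) is not None for record in records)
    }
    return [{k: v for k, v in record.items() if k in keep} for record in records]


rows = [
    {"id": 1, "name": "a", "notes": None},
    {"id": 2, "name": None, "notes": None},
]
cleaned = drop_all_null_columns(rows)
print(cleaned)
```

Here "notes" is dropped because it is null in every row, while "name" is kept because only some of its values are null.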
March 7, 2019, 4:52 p.m.
I've been working with AWS Lambda recently and I am very impressed. Usually if I need to set up a microservice or a recurring task or anything like that I'll just set up something on one of my virtual servers so I didn't think Lambda would be all that useful. But it makes it really, really easy to set up little tasks and it is much cheaper than having a whole virtual server.
You can create tasks in a number of different languages and set up a variety of triggers, ranging from HTTP requests to scheduled tasks. When the Lambda is triggered, AWS spins it up, executes it and then shuts it down. Since it is so ephemeral it is completely stateless, but you can load files from S3 buckets if you need data of any sort. I assume you can also connect to a variety of AWS databases, although I haven't done this yet. If you need additional libraries or packages that are not included by default, you can create a layer containing them.
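For a sense of what one of these looks like, here is a minimal sketch of a Python handler. It assumes an HTTP trigger that delivers an API Gateway proxy-style event; the "name" parameter and the greeting are made up for illustration. Nothing is stored between invocations, which is the statelessness described above.

```python
import json


def lambda_handler(event, context):
    """Entry point called once per trigger; no state is kept between calls."""
    # Assumed event shape: an API Gateway proxy request, where
    # queryStringParameters may be missing or None.
    params = event.get("queryStringParameters") or {}
    name = params.get("name", "world")
    return {
        "statusCode": 200,
        "headers": {"Content-Type": "application/json"},
        "body": json.dumps({"message": f"Hello, {name}!"}),
    }


# Local invocation with a hand-built event; the context object is unused here.
response = lambda_handler({"queryStringParameters": {"name": "Lambda"}}, None)
print(response["body"])
```

Because the handler is just a function, it can be called locally like this before being deployed, which makes small tasks easy to test.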
Lambda is not going to replace servers for most use cases, but I think serverless technology is going to make quite a dent in the near future.