Sept. 16, 2019, 10:57 a.m.
Discovering how much cheaper spot EC2 instances were than normal on-demand instances gave me the courage to try out a faster GPU. I had been using K80s which are painfully slow, but very cheap. The spot price for the V100 is about the same as the on-demand price of the K80s, so using those with spot instances won't be any cheaper, but it won't be more expensive either.
I didn't think the V100s were such great GPUs, so I wasn't expecting it to be worth the extra cost. How wrong I was. Training the network I am currently playing with on a K80 with a batch size of 48 took about 8-12 hours per epoch. Training it on a V100 with a batch size of 64 is looking like it's going to take about 2 hours. With the V100s priced at about 4x the K80s, that works out to about the same price per compute to a little bit cheaper, depending on exactly how long it took per epoch on the K80.
When you factor in the value of not having to wait an entire day to see the results of an epoch, this is a no-brainer as far as I'm concerned. Unfortunately, I'm sure my AWS bill is going to increase substantially. That's how they get you... Once you have a taste of HPC they know you'll be back for more...
March 28, 2018, 2:53 p.m.
To resolve the problems I was having yesterday I ended up paying for an Amazon EC2 instance with the Deep Learning Ubuntu AMI. The instance type is p2.xlarge which costs $0.90/hour, but seems to be well worth it so far. In the last ten minutes I've been training a relatively small model on Google Cloud, which has been able to get through 60 steps. In contrast, on the EC2 instance the much larger model, training on the same data, has gone through 375 steps, where each epoch is 687 steps.
I did have some trouble accessing TensorBoard on the EC2 instance, but was able to get it running by following the tutorial. I also got Jupyter Notebook running and accessible from the outside world, again by following the tutorial, although I had to comment out the lines about the SSL certificates in the jupyter conf file in order to be able to connect. I decided to not use Jupyter Notebook, but it's nice to have it as an option.
Since this is just a project I am working on for myself, I'd prefer to not have to pay for the compute, but $0.90 per hour is manageable, and well worth it for the 10x increase in training speed.