This summer, I trained quite a few AI models, and since I was on a massive time crunch trying to get research results in the span of a few weeks, every hour counted. I had modified a script to train these large vision models, and one thing it included was a loading bar showing how many epochs were left and an estimate of how long training would take.
Generally, you have to do multiple iterations of hyperparameters to find a strong model, so I was hoping a full training run would take maybe a few hours on the GPUs we had. But to my horror, I started my training script, aaaaand: 8 days 17 hours. At that rate, I'd get one try to train a strong model, write a paper draft, and create a research poster. This was not going to work.
I found smaller and more efficient pretrained models. I cleaned the data to the highest quality. I trained for fewer epochs. I trained on multiple GPUs. It was no use. The lowest I could eke it down to was 3 and a half days. After the time I'd burned optimizing every little setting and hunting for why my training was so slow, I gave up. I started a training run and finally left the lab after a week of fluorescent lighting and programming. Matrix algebra, convolution layers, and research be damned, I was going to enjoy my weekend in Toronto.
Except, I have no self control. I promised myself I wouldn't check how the training was going until I got back, but c'mon, one more check before I lost cell service wouldn't hurt. Bracing myself against the inevitable pain, I opened the logs and played peekaboo with my phone screen:
=== MODEL FINISHED TRAINING ===
Checking the progress files in confusion, I saw that the estimated time per epoch had dropped from 4 days to 3 hours within just a few epochs. Because I was compiling the model, doing some stochastic optimizations, and loading all the data onto the GPU, the first epoch was slower than the rest by an order of magnitude. I'd never trained models of this size on expensive GPUs before, so this wasn't a heuristic I had acquired yet.
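The misleading estimate comes from averaging the warmup epoch in with the steady-state ones. A minimal sketch of an ETA calculation that drops the first epoch before averaging (all names here are hypothetical, not from my actual training script):

```python
def eta_seconds(epoch_times, epochs_left, skip_warmup=1):
    """Estimate remaining training time from recorded epoch durations,
    ignoring the first `skip_warmup` epochs, which are dominated by
    one-time costs like model compilation and moving data onto the GPU."""
    # Fall back to all recorded epochs if we're still in warmup.
    steady = epoch_times[skip_warmup:] or epoch_times
    avg = sum(steady) / len(steady)
    return avg * epochs_left

# A ~4-day first epoch followed by 3-hour epochs, 10 epochs to go:
times = [4 * 24 * 3600, 3 * 3600, 3 * 3600]
naive = sum(times) / len(times) * 10   # warmup epoch inflates the average
better = eta_seconds(times, epochs_left=10)
```

With these numbers, the naive average predicts roughly two more weeks, while the warmup-aware estimate correctly predicts about a day and a half.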
I don’t really know what the lesson is. Maybe it’s to do some back-of-the-napkin math before stressfully optimizing everything. Maybe it’s that loading bars should account for the fact that the first epoch will always take a very long time for dumb programmers like myself. Maybe it’s that trips to Canada solve everything.
Or maybe it’s just a fun little story about how a loading bar wasted a week of my life.