Checkpointing Tutorial for TensorFlow, Keras, and PyTorch
This post will demonstrate how to checkpoint your training models on FloydHub so that you can resume your experiments from these saved states.
Wait, but why?
If you've ever played a video game, you might already understand why checkpoints are useful. For example, sometimes you'll want to save your game right before a big boss castle - just in case everything goes terribly wrong inside and you need to try again. Checkpoints in machine learning and deep learning experiments are essentially the same thing - a way to save the current state of your experiment so that you can pick up from where you left off.
Trust me, you're going to have a bad time if you lose one or more of your experiments due to a power outage, OS fault, job preemption, or any other type of unexpected error. Other times, even if you don't experience an unforeseen error, you might just want to resume from a particular state of the training for a new experiment - or try different things from a given state.
That's why you need checkpoints!
But, wait - there's one more reason, and it's a big one. If you don't checkpoint your training models at the end of a job, you'll have lost all of your results! Like, they're just gone. Simply put, if you'd like to make use of your trained models, you're going to need some checkpoints.
So what is a checkpoint really?
The Keras docs provide a great explanation of checkpoints (that I'm going to gratuitously leverage here):
- The architecture of the model, allowing you to re-create the model
- The weights of the model
- The training configuration (loss, optimizer, epochs, and other meta-information)
- The state of the optimizer, allowing you to resume training exactly where you left off.
Again, a checkpoint contains the information you need to save your current experiment state so that you can resume training from this point. Just like in that infernal Zelda II: The Adventure of Link game from my childhood.
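To make those four ingredients concrete, here's a framework-agnostic sketch in plain Python. The helper names (`make_checkpoint`, `write_checkpoint`, `read_checkpoint`), the toy values, and the choice of JSON are all my own assumptions for illustration - this is not how Keras stores a model on disk.

```python
import json
import os
import tempfile


def make_checkpoint(architecture, weights, training_config, optimizer_state):
    """Bundle the four ingredients listed above into one dictionary."""
    return {
        "architecture": architecture,        # enough to re-create the model
        "weights": weights,                  # learned parameters
        "training_config": training_config,  # loss, optimizer, epochs, ...
        "optimizer_state": optimizer_state,  # needed to resume exactly
    }


def write_checkpoint(checkpoint, path):
    """Write the experiment state to a single JSON file."""
    with open(path, "w") as f:
        json.dump(checkpoint, f)


def read_checkpoint(path):
    """Read the experiment state back so training can continue."""
    with open(path) as f:
        return json.load(f)


# Toy values, for illustration only
checkpoint = make_checkpoint(
    architecture={"layers": ["conv", "pool", "dense"]},
    weights={"dense": [0.1, -0.2, 0.3]},
    training_config={"loss": "categorical_crossentropy", "epochs_done": 3},
    optimizer_state={"learning_rate": 0.001},
)
path = os.path.join(tempfile.mkdtemp(), "checkpoint.json")
write_checkpoint(checkpoint, path)
restored = read_checkpoint(path)
print(restored["training_config"]["epochs_done"])  # 3
```

Real frameworks use binary formats for the weights, but the idea is the same: everything you need to continue lives in one saved state.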
Checkpoint Strategies
At this point, I'll assume I've convinced you that checkpoints need to be a vital part of your deep learning workflow. So, let's talk strategy.
You can employ different checkpoint strategies according to the type of training regime you're performing:
- Short Training Regime (minutes to hours)
- Normal Training Regime (hours to a day)
- Long Training Regime (days to weeks)
Short Training Regime
The typical practice is to save a checkpoint only at the end of the training, or at the end of every epoch.
Normal Training Regime
In this case, it's common to save multiple checkpoints every n_epochs and keep track of the best one with respect to some validation metric that we care about. Usually, there's a fixed maximum number of checkpoints so as to not take up too much disk space (for example, restricting your maximum number of checkpoints to 10, where the new ones will replace the earliest ones).
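Here's a rough sketch of that bookkeeping in plain Python. The `RollingCheckpoints` class and its parameter names are hypothetical, and it only tracks epoch numbers rather than writing files - treat it as an outline of the policy, not a finished tool.

```python
from collections import deque


class RollingCheckpoints:
    """Hypothetical helper: checkpoint every `every_n_epochs`, keep at most
    `max_to_keep` of them, and remember which one scored best."""

    def __init__(self, every_n_epochs=5, max_to_keep=10):
        self.every_n_epochs = every_n_epochs
        self.max_to_keep = max_to_keep
        self.kept = deque()              # epochs whose checkpoints are "on disk"
        self.best_epoch = None
        self.best_metric = float("-inf")

    def step(self, epoch, val_metric):
        """Call at the end of each epoch (1-based). Returns the evicted epochs."""
        if epoch % self.every_n_epochs != 0:
            return []                    # not a checkpoint epoch
        self.kept.append(epoch)          # a real version would save a file here
        if val_metric > self.best_metric:
            self.best_metric, self.best_epoch = val_metric, epoch
        evicted = []
        while len(self.kept) > self.max_to_keep:
            evicted.append(self.kept.popleft())  # the earliest one is replaced
        return evicted


manager = RollingCheckpoints(every_n_epochs=2, max_to_keep=2)
# Made-up validation accuracies for six epochs
for epoch, acc in enumerate([0.50, 0.62, 0.70, 0.68, 0.71, 0.69], start=1):
    manager.step(epoch, acc)
print(list(manager.kept), manager.best_epoch)  # [4, 6] 6
```

Note that this sketch only compares the epochs that were actually checkpointed, and a real version would probably store the best checkpoint separately so that eviction can never delete it.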
Long Training Regime
In this type of training regime, you'll likely want to employ a similar strategy to the Normal regime - where you're saving multiple checkpoints every n_epochs and keeping track of the best one with respect to the validation metric that you care about. In this case, since the training can be very long, it's common to save checkpoints less frequently but maintain a greater number of checkpoints.
Which regime is right for me?
The tradeoff among these various strategies is between the frequency and the number of checkpoint files to keep. Let's take a look at what happens when we vary these two parameters:
| Frequency | Checkpoints kept | Cons | Pros |
|---|---|---|---|
| High | High | You need a lot of space! | You can resume very quickly from almost all the interesting training states |
| High | Low | You may lose valuable earlier states | You minimize the storage space you need |
| Low | High | It will take time to get back to intermediate states | You can resume the experiment from a lot of interesting states |
| Low | Low | You may lose valuable states | You minimize the storage space you need |
Hopefully, now you have a good intuition about what might be the best checkpoint strategy for your training regime. It should go without saying that you can obviously develop your own custom checkpoint strategy based on your experiment needs! These are just tips and best practices that I take into consideration for my own projects.
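If you'd like to put rough numbers on that tradeoff, a back-of-the-envelope calculation helps. The function below is a sketch under simple assumptions of mine: every checkpoint has the same size, every epoch takes the same time, and the worst case is a crash just before the next checkpoint is written. The function name and the example figures are made up.

```python
def checkpoint_tradeoff(every_n_epochs, max_to_keep, checkpoint_mb, minutes_per_epoch):
    """Rough cost of a checkpoint strategy (all inputs are your own estimates)."""
    return {
        # Disk space once the maximum number of checkpoints has been reached
        "disk_mb": max_to_keep * checkpoint_mb,
        # Training time lost if the job dies right before the next checkpoint
        "worst_case_lost_minutes": every_n_epochs * minutes_per_epoch,
        # How far back (in epochs) the oldest kept checkpoint is from the newest
        "history_span_epochs": every_n_epochs * (max_to_keep - 1),
    }


# High frequency / high number vs. low frequency / low number,
# assuming 200 MB checkpoints and 10-minute epochs
print(checkpoint_tradeoff(1, 50, 200, 10))
print(checkpoint_tradeoff(10, 2, 200, 10))
```

With these made-up figures, the first strategy costs about 10 GB of disk but never loses more than one epoch of work, while the second needs only 400 MB and can lose up to 100 minutes.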
Save and Resume on FloydHub
Now, let's dive into some code on FloydHub. I'll show you how to save checkpoints in three popular deep learning frameworks available on FloydHub: TensorFlow, Keras, and PyTorch.
Before you start, log into the FloydHub command-line tool with the floyd login command, then clone and init the project:
$ git clone https://github.com/floydhub/save-and-resume.git
$ cd save-and-resume
$ floyd init save-and-resume
For our checkpointing examples, we'll be using the "Hello, World" of deep learning: the MNIST classification task using a Convolutional Neural Network model.
Because it's always important to be clear about our checkpointing strategy up-front, I'll state the approach we're going to be taking:
- Keep only one checkpoint
- Trigger the strategy at the end of every epoch
- Save the one with the best (maximum) validation accuracy
Considering this toy example, we can employ the Short Training Regime strategy. Feel free to adapt this for your own more complicated experiments!
The commands
Before we dive into specific working examples, let's outline the basic commands you'll need. When starting a new job, your first command will look something like this:
floyd run \
    [--gpu] \
    --env <env> \
    --data <your_dataset>:<mounting_point_dataset> \
    "python <script_and_parameters>"
Important note: within your Python script, you'll want to make sure that the checkpoint is being saved to the /output folder. FloydHub will automatically save the contents of the /output directory as a job's Output, which is how you'll be able to leverage these checkpoints to resume jobs.
Once your job has been completed, you'll then be able to mount that job's output as an input to your next job - allowing your script to leverage the checkpoint you created in the next run of this project.
floyd run \
    [--gpu] \
    --env <env> \
    --data <your_dataset>:<mounting_point_dataset> \
    --data <output_of_previous_job>:<mounting_point_model> \
    "python <script_and_parameters>"
Okay, enough of that. Let's see how to make this tangible using three of the most popular frameworks on FloydHub.
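One thing all three examples share: the same script has to work both for a first run (nothing to resume) and for a resumed run (a previous job's output is mounted). A minimal sketch of that check is below. The function name is hypothetical, and the defaults assume the conventions used in this post: checkpoints are written under /output, and the previous output is mounted at /model.

```python
import os


def checkpoint_locations(filename, output_dir="/output", resume_dir="/model"):
    """Return (save_path, resume_path); resume_path is None on a first run."""
    save_path = os.path.join(output_dir, filename)
    candidate = os.path.join(resume_dir, filename)
    resume_path = candidate if os.path.isfile(candidate) else None
    return save_path, resume_path


save_path, resume_path = checkpoint_locations("mnist-cnn-best.hdf5")
if resume_path is None:
    print("No previous checkpoint mounted, training from scratch")
else:
    print("Resuming from", resume_path)
print("New checkpoints will be written to", save_path)
```

Keeping the two paths separate matters here: you read the old checkpoint from the mounted directory and always write new ones to /output.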
TensorFlow
View full example on a FloydHub Jupyter Notebook
TensorFlow provides different ways to save and resume a checkpoint. In our example, we will use the tf.estimator.Estimator API, which uses tf.train.Saver, tf.train.CheckpointSaverHook and tf.saved_model.builder.SavedModelBuilder behind the scenes.
To be more clear, the Estimator API uses the first class to save the checkpoint, the second to act according to the adopted checkpointing strategy, and the last to export the model for serving with the export_savedmodel() method.
Let's dig in.
Saving a TensorFlow checkpoint
Before initializing an Estimator, we have to define the checkpoint strategy. To do so, we have to create a configuration for the Estimator using the tf.estimator.RunConfig API. Here's an example of how we might do this:
import tensorflow as tf

# Save the checkpoint in the /output folder
filepath = "/output/mnist_convnet_model"

# Checkpoint Strategy configuration
run_config = tf.contrib.learn.RunConfig(
    model_dir=filepath,
    keep_checkpoint_max=1)
In this way, we're telling the estimator which directory to save or resume a checkpoint from, and also how many checkpoints to keep.
Next, we have to provide this configuration at the initialization of the Estimator:
# Create the Estimator
mnist_classifier = tf.estimator.Estimator(
    model_fn=cnn_model_fn, config=run_config)
That's it. Seriously. We're now set up to save checkpoints in our TensorFlow code.
Resuming a TensorFlow checkpoint
Guess what? We're also already set up to resume from checkpoints in our next experiment run. If the Estimator finds a checkpoint inside the given model folder, it will load from the last checkpoint.
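There's one practical wrinkle: the Estimator looks in its own model folder (under /output), while a previous job's output gets mounted somewhere else (at /model in the commands used in this post). One way to bridge the two is to copy the old checkpoint folder into place before creating the Estimator. The sketch below is my own assumption about how you might do that, not necessarily what the example project does; `restore_model_dir` is a hypothetical helper, and I'm assuming the mounted folder keeps the `mnist_convnet_model` subfolder name.

```python
import os
import shutil


def restore_model_dir(mounted_dir, model_dir):
    """Copy a previous job's checkpoint folder into model_dir, if there is one.

    Returns True when files were copied, False when starting from scratch.
    """
    if not os.path.isdir(mounted_dir):
        return False                  # first run: nothing is mounted
    if os.path.isdir(model_dir) and os.listdir(model_dir):
        return False                  # model_dir already has files, leave it alone
    os.makedirs(os.path.dirname(model_dir) or ".", exist_ok=True)
    if os.path.isdir(model_dir):
        os.rmdir(model_dir)           # empty folder; copytree must create it itself
    shutil.copytree(mounted_dir, model_dir)
    return True


# Paths assumed from the commands and code in this post
resumed = restore_model_dir("/model/mnist_convnet_model",
                            "/output/mnist_convnet_model")
print("Resuming from a previous job" if resumed else "Starting from scratch")
```

You'd call this before building the RunConfig and the Estimator, so that the Estimator finds the copied checkpoint in its model folder.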
Okay, let me try
Don't take my word for it - try it out yourself. Here are the steps to run the TensorFlow checkpointing example on FloydHub.
Via FloydHub's Command Mode
First time training command:
floyd run \
    --gpu \
    --env tensorflow-1.3 \
    --data redeipirati/datasets/mnist/1:input \
    'python tf_mnist_cnn.py'
- The --env flag specifies the environment that this project should run on (TensorFlow 1.3.0 + Keras 2.0.6 on Python 3.6)
- The --data flag specifies that the mnist dataset should be available at the /input directory
- The --gpu flag is actually optional here - unless you want to start right away with running the code on a GPU machine
Resuming from your checkpoint:
floyd run \
    --gpu \
    --env tensorflow-1.3 \
    --data redeipirati/datasets/mnist/1:input \
    --data <your-username>/projects/save-and-resume/<jobs>/output:/model \
    'python tf_mnist_cnn.py'
- The --env flag specifies the environment that this project should run on (TensorFlow 1.3.0 + Keras 2.0.6 on Python 3.6)
- The first --data flag specifies that the mnist dataset should be available at the /input directory
- The second --data flag specifies that the output of a previous job should be available at the /model directory
- The --gpu flag is actually optional here - unless you want to start right away with running the code on a GPU machine
Via FloydHub's Jupyter Notebook Mode
floyd run \
    --gpu \
    --env tensorflow-1.3 \
    --data redeipirati/datasets/mnist/1:input \
    --mode jupyter
- The --env flag specifies the environment that this project should run on (TensorFlow 1.3.0 + Keras 2.0.6 on Python 3.6)
- The --data flag specifies that the mnist dataset should be available at the /input directory
- The --gpu flag is actually optional here - unless you want to start right away with running the code on a GPU machine
- The --mode flag specifies that this job should provide a Jupyter notebook instance
Resuming from your checkpoint:
Just add --data <your-username>/projects/save-and-resume/<jobs>/output:/model to the previous command if you want to load a checkpoint from a previous job in your Jupyter notebook.
Keras
View full example on a FloydHub Jupyter Notebook
Keras provides a great API for saving and loading checkpoints. Let's take a look:
Saving a Keras checkpoint
Keras provides a set of functions called callbacks: you can think of callbacks as events that will be triggered at certain training states. The callback we need for checkpointing is the ModelCheckpoint which provides all the features we need according to the checkpointing strategy we adopted in our example.
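Note: with its default setting (save_weights_only=False), this callback saves the full model; pass save_weights_only=True if you only want to store the weights. For more on saving the entire model or just some of its components, you can take a look at the Keras docs on saving a model.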
First up, we have to import the callback functions:
from keras.callbacks import ModelCheckpoint
Next, just before the call to model.fit(...), it's time to prepare the checkpoint strategy.
# Save the checkpoint in the /output folder
filepath = "/output/mnist-cnn-best.hdf5"

# Keep only a single checkpoint, the best over test accuracy.
checkpoint = ModelCheckpoint(filepath,
                             monitor='val_acc',
                             verbose=1,
                             save_best_only=True,
                             mode='max')
- filepath="/output/mnist-cnn-best.hdf5": Remember, FloydHub will save the contents of the /output folder! See more on job output in the FloydHub docs
- monitor='val_acc': This is the metric we care about - validation accuracy
- verbose=1: It will print more information
- save_best_only=True: Keep only the best checkpoint (in terms of maximum validation accuracy)
- mode='max': Save the checkpoint with max validation accuracy
By default, the period (or checkpointing frequency) is set to 1, which means at the end of every epoch.
For more information (such as filepath formatting options, checkpointing period, and more), you can explore the Keras ModelCheckpoint API.
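As a quick illustration of the filepath formatting option: the callback fills named placeholders in filepath with the epoch number and the logged metrics, using Python's standard str.format syntax. The template and the metric value below are made up, and the snippet only reproduces the string-formatting step in plain Python so you can see what the resulting filename would look like.

```python
# Hypothetical template; {epoch} and {val_acc} are filled in at save time
template = "/output/mnist-cnn-{epoch:02d}-{val_acc:.2f}.hdf5"

# Pretend metrics for epoch 3 (illustrative values, not real results)
logs = {"val_acc": 0.9876}
filename = template.format(epoch=3, **logs)
print(filename)  # /output/mnist-cnn-03-0.99.hdf5
```

Keep in mind that a template like this produces a new file for every saved epoch, so it doesn't fit the "keep only one checkpoint" strategy we're using in this example - that's why our filepath is a fixed name.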
Finally, we are ready to see this checkpointing strategy applied during model training. In order to do this, we need to pass the callback variable to the model.fit(...) call:
# Train
model.fit(x_train, y_train,
          batch_size=batch_size,
          epochs=epochs,
          verbose=1,
          validation_data=(x_test, y_test),
          callbacks=[checkpoint])  # <- Apply our checkpoint strategy
According to our chosen strategy, you will see:
# This line when the training reaches a new max
Epoch < n_epoch >: val_acc improved from < previous val_acc > to < new max val_acc >, saving model to /output/mnist-cnn-best.hdf5
# Or this line
Epoch < n_epoch >: val_acc did not improve
That's it - you're now set up to save your Keras checkpoints.
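If you're curious what decides between those two log lines, the logic boils down to "is this value better than the best one so far?". Here's a simplified plain-Python stand-in for the save_best_only and mode behaviour. It is not the Keras implementation, the `BestOnlyTracker` name is hypothetical, and the accuracies are invented.

```python
class BestOnlyTracker:
    """Simplified stand-in for the save_best_only / mode logic (not Keras code)."""

    def __init__(self, mode="max"):
        if mode not in ("max", "min"):
            raise ValueError("mode must be 'max' or 'min'")
        self.mode = mode
        # Start from the worst possible value so the first epoch always wins
        self.best = float("-inf") if mode == "max" else float("inf")

    def should_save(self, value):
        """Return True when `value` beats the best seen so far."""
        improved = value > self.best if self.mode == "max" else value < self.best
        if improved:
            self.best = value
        return improved


tracker = BestOnlyTracker(mode="max")
# Made-up validation accuracies for four epochs
for epoch, val_acc in enumerate([0.91, 0.95, 0.94, 0.97], start=1):
    if tracker.should_save(val_acc):
        print("Epoch %d: val_acc improved to %.2f, saving model" % (epoch, val_acc))
    else:
        print("Epoch %d: val_acc did not improve" % epoch)
```

With mode='min' the comparison flips, which is what you'd want when monitoring a loss instead of an accuracy.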
Resuming a Keras checkpoint
Keras models provide the load_weights() method, which loads the weights from an HDF5 file.
To load the model's weights, you just need to add this line after the model definition:
... # Model Definition
model.load_weights(resume_weights)
Okay, let me try
Here's how you can run this Keras example on FloydHub:
Via FloydHub's Command Mode
First time training command:
floyd run \
    --gpu \
    --env tensorflow-1.3 \
    'python keras_mnist_cnn.py'
- The --env flag specifies the environment that this project should run on (TensorFlow 1.3.0 + Keras 2.0.6 on Python 3.6)
- The --gpu flag is actually optional here - unless you want to start right away with running the code on a GPU machine
Keras provides an API to handle MNIST data, so we can skip the dataset mounting in this case.
Resuming from your checkpoint:
floyd run \
    --gpu \
    --env tensorflow-1.3 \
    --data <your-username>/projects/save-and-resume/<jobs>/output:/model \
    'python keras_mnist_cnn.py'
- The --env flag specifies the environment that this project should run on (TensorFlow 1.3.0 + Keras 2.0.6 on Python 3.6)
- The --data flag specifies that the output of a previous job should be available at the /model directory
- The --gpu flag is actually optional here - unless you want to start right away with running the code on a GPU machine
Via FloydHub's Jupyter Notebook Mode
floyd run \
    --gpu \
    --env tensorflow-1.3 \
    --mode jupyter
- The --env flag specifies the environment that this project should run on (TensorFlow 1.3.0 + Keras 2.0.6 on Python 3.6)
- The --gpu flag is actually optional here - unless you want to start right away with running the code on a GPU machine
- The --mode flag specifies that this job should provide us a Jupyter notebook.
Resuming from your checkpoint:
Just add --data <your-username>/projects/save-and-resume/<jobs>/output:/model if you want to load a checkpoint from a previous job.
PyTorch
View full example on a FloydHub Jupyter Notebook
Unfortunately, at the moment, PyTorch does not have as simple an API for checkpointing as Keras does. We'll need to write our own solution according to our chosen checkpointing strategy.
Saving a PyTorch checkpoint
PyTorch does not provide an all-in-one API to define a checkpointing strategy, but it does provide a simple way to save and resume a checkpoint. According to the official docs on serialization semantics, the best practice is to save only the weights, because pickling the entire model ties the saved file to the exact code layout and can break after a refactoring.
Therefore, let's take a look at how to save the model weights in PyTorch.
First up, let's define a save_checkpoint function which handles all the instructions about the number of checkpoints to keep and the serialization on file:
import torch

def save_checkpoint(state, is_best, filename='/output/checkpoint.pth.tar'):
    """Save checkpoint if a new best is achieved"""
    if is_best:
        print("=> Saving a new best")
        torch.save(state, filename)  # save checkpoint
    else:
        print("=> Validation Accuracy did not improve")
Then, inside the training (which is usually a for-loop of the number of epochs), we define the checkpoint frequency (in our case, at the end of every epoch) and the information we'd like to store (the epochs, model weights, and best accuracy achieved):
...
# Training the Model
for epoch in range(num_epochs):
    train(...)  # Train
    acc = eval(...)  # Evaluate after every epoch
    # Some stuff with acc (accuracy)
    ...
    # Get bool not ByteTensor
    is_best = bool(acc.numpy() > best_accuracy.numpy())
    # Get greater Tensor to keep track best acc
    best_accuracy = torch.FloatTensor(max(acc.numpy(), best_accuracy.numpy()))
    # Save checkpoint if is a new best
    save_checkpoint({
        'epoch': start_epoch + epoch + 1,
        'state_dict': model.state_dict(),
        'best_accuracy': best_accuracy
    }, is_best)
That's it! You can now save checkpoints in your PyTorch experiments.
Resuming a PyTorch checkpoint
To resume a PyTorch checkpoint, we have to load the weights and the meta information we need before the training:
# cuda = torch.cuda.is_available()
if cuda:
    checkpoint = torch.load(resume_weights)
else:
    # Load GPU model on CPU
    checkpoint = torch.load(resume_weights,
                            map_location=lambda storage, loc: storage)
start_epoch = checkpoint['epoch']
best_accuracy = checkpoint['best_accuracy']
model.load_state_dict(checkpoint['state_dict'])
print("=> loaded checkpoint '{}' (trained for {} epochs)".format(resume_weights, checkpoint['epoch']))
For more information on loading GPU-trained weights on a CPU instance, you can check out this PyTorch discussion.
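To see how the epoch and best_accuracy values carry over from one job to the next, here's a toy simulation in plain Python. There's no PyTorch in it: `run_job` is a hypothetical stand-in for a whole training script, the accuracy formula is fake, and the model weights are left out entirely. It only mirrors the bookkeeping.

```python
def run_job(num_epochs, checkpoint=None):
    """Simulate one job; `checkpoint` is the dict saved by a previous job, if any."""
    start_epoch = checkpoint["epoch"] if checkpoint else 0
    best_accuracy = checkpoint["best_accuracy"] if checkpoint else 0.0
    for epoch in range(num_epochs):
        total_epochs = start_epoch + epoch + 1     # epochs trained across all jobs
        acc = 1.0 - 1.0 / (total_epochs + 1)       # fake accuracy that slowly rises
        if acc > best_accuracy:                    # same "is_best" idea as above
            best_accuracy = acc
            checkpoint = {"epoch": total_epochs, "best_accuracy": best_accuracy}
    return checkpoint


first = run_job(num_epochs=3)                      # first job: 3 epochs from scratch
second = run_job(num_epochs=2, checkpoint=first)   # resumed job: 2 more epochs
print(first["epoch"], second["epoch"])             # 3 5
```

The point is that the resumed job doesn't start its bookkeeping at zero: the saved epoch count keeps growing, and a new checkpoint is only written when the accuracy beats the best value from the earlier job.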
Okay, let me try
Here's how you can run this PyTorch example on FloydHub:
Via FloydHub's Command Mode
First time training command:
floyd run \
    --gpu \
    --env pytorch-0.2 \
    --data redeipirati/datasets/pytorch-mnist/1:input \
    'python pytorch_mnist_cnn.py'
- The --env flag specifies the environment that this project should run on (PyTorch 0.2.0 on Python 3)
- The --data flag specifies that the pytorch-mnist dataset should be available at the /input directory
- The --gpu flag is actually optional here - unless you want to start right away with running the code on a GPU machine
Resuming from your checkpoint:
floyd run \
    --gpu \
    --env pytorch-0.2 \
    --data redeipirati/datasets/pytorch-mnist/1:input \
    --data <your-username>/projects/save-and-resume/<jobs>/output:/model \
    'python pytorch_mnist_cnn.py'
- The --env flag specifies the environment that this project should run on (PyTorch 0.2.0 on Python 3)
- The first --data flag specifies that the pytorch-mnist dataset should be available at the /input directory
- The second --data flag specifies that the output of a previous job should be available at the /model directory
- The --gpu flag is actually optional here - unless you want to start right away with running the code on a GPU machine
Via FloydHub's Jupyter Notebook Mode
floyd run \
    --gpu \
    --env pytorch-0.2 \
    --data redeipirati/datasets/pytorch-mnist/1:input \
    --mode jupyter
- The --env flag specifies the environment that this project should run on (PyTorch 0.2.0 on Python 3)
- The --data flag specifies that the pytorch-mnist dataset should be available at the /input directory
- The --gpu flag is actually optional here - unless you want to start right away with running the code on a GPU machine
- The --mode flag specifies that this job should provide us a Jupyter notebook.
Resuming from your checkpoint:
Just add --data <your-username>/projects/save-and-resume/<jobs>/output:/model if you want to load a checkpoint from a previous job.