Public Member Functions
def __init__(self, job, checkpoint_manager=None, resume_from_epoch=None)
def __call__(self, client)
def load_blobs_from_checkpoints(self, blob_names, epoch, session)
Public Attributes
resume_from_epoch
checkpoint
job
Implements the runtime logic for jobs with checkpointing at the epoch level. Can be used to run either single-host or distributed jobs.

The job runner is a callable that is called once from the client, with a Session passed as the argument. The call blocks until job execution is complete.

If a checkpoint_manager is passed, a checkpoint is taken after initialization and after each epoch. If, in addition, `resume_from_epoch` is an epoch number, the corresponding checkpoint is loaded and job execution continues from that epoch. In this case, the job's init_group is not run.

Refer to checkpoint_test.py for an example.
Definition at line 297 of file checkpoint.py.
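The checkpoint-and-resume control flow described above can be sketched in plain Python. This is an illustrative model only, not the Caffe2 implementation: `InMemoryCheckpoints`, `run_job`, `init`, and `run_epoch` are hypothetical names, and numbering the post-initialization checkpoint as epoch 0 is an assumption of the sketch.

```python
class InMemoryCheckpoints:
    """Hypothetical stand-in for a checkpoint manager: one snapshot per epoch."""

    def __init__(self):
        self.saved = {}

    def save(self, epoch, state):
        self.saved[epoch] = dict(state)  # copy so later epochs do not mutate it

    def load(self, epoch):
        return dict(self.saved[epoch])


def run_job(init, run_epoch, num_epochs, checkpoints=None, resume_from_epoch=None):
    """Run epochs up to num_epochs, checkpointing after init and after each epoch."""
    if checkpoints is not None and resume_from_epoch is not None:
        # Resuming: restore the saved state and skip initialization entirely.
        state = checkpoints.load(resume_from_epoch)
        epoch = resume_from_epoch
    else:
        state = init()
        epoch = 0  # assumption: the post-initialization checkpoint is epoch 0
        if checkpoints is not None:
            checkpoints.save(epoch, state)

    while epoch < num_epochs:
        epoch += 1
        run_epoch(state, epoch)
        if checkpoints is not None:
            checkpoints.save(epoch, state)
    return state


init_calls = []


def init():
    init_calls.append(1)  # record each initialization so resume can be checked
    return {"steps": 0}


def run_epoch(state, epoch):
    state["steps"] += 1


ckpt = InMemoryCheckpoints()
first = run_job(init, run_epoch, 3, ckpt)
resumed = run_job(init, run_epoch, 5, ckpt, resume_from_epoch=2)
```

In this sketch the second call restores the epoch 2 snapshot, runs only epochs 3 through 5, and never calls `init` again, which mirrors the statement that the init_group is skipped when resuming.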
def checkpoint.JobRunner.load_blobs_from_checkpoints(self, blob_names, epoch, session)
Loads the necessary blobs from the checkpoints.

Checkpoints store snapshots of the workspace in each node. Sometimes only a subset of the blobs needs to be loaded from the checkpoints. One common scenario is loading only the model blobs for evaluation. Given the names of the necessary blobs, this function goes over all the checkpoints of all the nodes, but loads only the blobs specified in blob_names into the current workspace.

Args:
    blob_names: A list of strings. Each string is the name of a blob.
    epoch: An integer. The checkpoint epoch to load from.
    session: A Session object to execute the load ops.

Raises:
    ValueError: When the checkpoint manager is invalid.
Definition at line 356 of file checkpoint.py.
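The selective-loading behavior described above can be illustrated with a small self-contained sketch. It models per-node checkpoints as nested dictionaries rather than using Caffe2 itself; `load_selected_blobs`, the dictionary layout, and the blob names are all hypothetical, and raising `ValueError` on a missing checkpoint store is only an analogue of the documented error condition.

```python
def load_selected_blobs(node_checkpoints, blob_names, epoch, workspace):
    """Copy only the requested blobs from every node's snapshot into workspace.

    node_checkpoints is a hypothetical layout: {node: {epoch: {blob: value}}}.
    """
    if not node_checkpoints:
        # Analogue of the documented ValueError for an invalid checkpoint manager.
        raise ValueError("no checkpoint store available")

    wanted = set(blob_names)
    for node, snapshots in node_checkpoints.items():
        snapshot = snapshots[epoch]
        for name, value in snapshot.items():
            if name in wanted:  # skip everything that was not requested
                workspace[name] = value
    return workspace


# Two nodes, each holding part of the model plus state that evaluation ignores.
node_checkpoints = {
    "node_0": {1: {"fc_w": [0.1, 0.2], "optimizer_state": [9.0]}},
    "node_1": {1: {"fc_b": [0.5], "reader_position": 7}},
}

workspace = load_selected_blobs(node_checkpoints, ["fc_w", "fc_b"], 1, {})
```

Every node's snapshot is visited, but only `fc_w` and `fc_b` end up in `workspace`; the optimizer and reader state stay behind, as in the evaluation scenario mentioned in the description.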