Stage-wise discrete decay strategies:
First, "stage-wise discrete" decay is not a formal term; it is only a description. The methods that fit this pattern are step-based, and step decay is the most commonly used strategy of all. Starting from the initial learning rate, the learning rate is multiplied by gamma at each stage, with gamma usually set to 0.1. Clearly the learning rate gets smaller and smaller as training iterates, but however it changes it only approaches 0 and never reaches it. The effect looks like this:
# lr = 0.05 if epoch < 30
# lr = 0.005 if 30 <= epoch < 60
# lr = 0.0005 if 60 <= epoch < 90
PyTorch defines two schedulers for this: fixed-interval adjustment (StepLR) and milestone-based adjustment (MultiStepLR). In practice their effects are equivalent.
Fixed-interval decay (StepLR)
The fixed-interval scheduler defines what the interval is, namely step_size: whenever the number of trained epochs reaches a multiple of step_size, the learning rate is adjusted once. In closed form, the learning rate is base_lr * gamma ** (last_epoch // step_size), i.e. last_epoch integer-divided by step_size.
class StepLR(_LRScheduler):
    def __init__(self, optimizer, step_size, gamma=0.1, last_epoch=-1, verbose=False):
        self.step_size = step_size
        self.gamma = gamma
        super(StepLR, self).__init__(optimizer, last_epoch, verbose)

    def get_lr(self):
        if not self._get_lr_called_within_step:
            warnings.warn("To get the last learning rate computed by the scheduler, "
                          "please use `get_last_lr()`.", UserWarning)
        if (self.last_epoch == 0) or (self.last_epoch % self.step_size != 0):
            return [group['lr'] for group in self.optimizer.param_groups]
        return [group['lr'] * self.gamma
                for group in self.optimizer.param_groups]

    def _get_closed_form_lr(self):
        return [base_lr * self.gamma ** (self.last_epoch // self.step_size)
                for base_lr in self.base_lrs]
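A minimal usage sketch of StepLR matching the schedule in the comment above; the model and optimizer here are just placeholders, not from the original text:

import torch
from torch import nn, optim
from torch.optim.lr_scheduler import StepLR

model = nn.Linear(10, 2)                                # placeholder model
optimizer = optim.SGD(model.parameters(), lr=0.05)
scheduler = StepLR(optimizer, step_size=30, gamma=0.1)  # drop lr by 10x every 30 epochs

for epoch in range(90):
    # ... train one epoch ...
    optimizer.step()
    scheduler.step()   # lr: 0.05 -> 0.005 at epoch 30 -> 0.0005 at epoch 60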
Milestone-based decay (MultiStepLR)
The milestone-based scheduler directly defines the targets: during training, when the current epoch reaches one of the targets, the learning rate is adjusted. milestones is the list of these target epochs. When last_epoch is not in milestones the learning rate stays unchanged; otherwise it is multiplied by a power of gamma. Since gamma is a fraction, the learning rate keeps getting smaller.
class MultiStepLR(_LRScheduler):
    def __init__(self, optimizer, milestones, gamma=0.1, last_epoch=-1, verbose=False):
        self.milestones = Counter(milestones)
        self.gamma = gamma
        super(MultiStepLR, self).__init__(optimizer, last_epoch, verbose)

    def get_lr(self):
        if not self._get_lr_called_within_step:
            warnings.warn("To get the last learning rate computed by the scheduler, "
                          "please use `get_last_lr()`.", UserWarning)
        if self.last_epoch not in self.milestones:
            return [group['lr'] for group in self.optimizer.param_groups]
        return [group['lr'] * self.gamma ** self.milestones[self.last_epoch]
                for group in self.optimizer.param_groups]

    def _get_closed_form_lr(self):
        milestones = list(sorted(self.milestones.elements()))
        return [base_lr * self.gamma ** bisect_right(milestones, self.last_epoch)
                for base_lr in self.base_lrs]
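And the MultiStepLR equivalent, a sketch with milestones of 30 and 60 assumed for illustration:

import torch
from torch import nn, optim
from torch.optim.lr_scheduler import MultiStepLR

model = nn.Linear(10, 2)                                # placeholder model
optimizer = optim.SGD(model.parameters(), lr=0.05)
scheduler = MultiStepLR(optimizer, milestones=[30, 60], gamma=0.1)

for epoch in range(90):
    # ... train one epoch ...
    optimizer.step()
    scheduler.step()   # lr is multiplied by 0.1 at epochs 30 and 60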
Continuous decay strategies:
"Continuous" decay is likewise not a formal term, only a description. Methods in this family include linear decay, cosine decay, and exponential decay; the first two are not implemented in PyTorch. Their behavior is that the learning rate decreases gradually as training iterates, without the staircase drops of the step approach, and for linear and cosine decay the learning rate reaches exactly 0 at the final iteration rather than just a very small value (exponential decay only approaches 0).
Linear decay
Linear decay is easy to understand: the learning rate is a linear function of the training iteration. Once the initial learning rate and the number of iterations are fixed, the slope of the decay is fixed as well.
lr = args.lr * (1 - (current_iter - warmup_iter) / (max_iter - warmup_iter))
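Since PyTorch has no built-in scheduler for this exact formula, here is a minimal sketch of how it might be applied by hand; base_lr, current_iter, warmup_iter and max_iter are assumed stand-ins for the training configuration:

# Hypothetical helper: linearly decay the learning rate to 0 after warmup_iter.
def linear_lr(base_lr, current_iter, warmup_iter, max_iter):
    progress = (current_iter - warmup_iter) / (max_iter - warmup_iter)
    return base_lr * (1 - progress)

# Typical manual application inside the training loop:
# for param_group in optimizer.param_groups:
#     param_group['lr'] = linear_lr(args.lr, current_iter, warmup_iter, max_iter)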
Cosine decay
With cosine decay, the coefficient at the final iteration is (1 + cos(pi)) / 2, i.e. exactly 0, and at the first iteration it is (1 + cos(0)) / 2, i.e. exactly 1; in between, the learning rate follows the cosine curve downwards.
lr = args.lr * (1 + cos(pi * (current_iter - warmup_iter) / (max_iter - warmup_iter))) / 2
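A corresponding sketch for the cosine formula, under the same assumptions about the variable names:

import math

# Hypothetical helper: cosine decay from base_lr down to 0 over the run.
def cosine_lr(base_lr, current_iter, warmup_iter, max_iter):
    progress = (current_iter - warmup_iter) / (max_iter - warmup_iter)
    return base_lr * (1 + math.cos(math.pi * progress)) / 2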
Exponential decay (ExponentialLR)
Exponential decay is computed the same way as step: the current epoch is the exponent and gamma is the base, i.e. gamma ** epoch. The difference is that exponential decay is applied at every epoch, so it can be regarded as continuous across epochs. More importantly, gamma should not be 0.1 here, otherwise the learning rate would be vanishingly close to 0 after just a few epochs; a value around 0.9 is typically used instead.
class ExponentialLR(_LRScheduler):
    def __init__(self, optimizer, gamma, last_epoch=-1, verbose=False):
        self.gamma = gamma
        super(ExponentialLR, self).__init__(optimizer, last_epoch, verbose)

    def get_lr(self):
        if not self._get_lr_called_within_step:
            warnings.warn("To get the last learning rate computed by the scheduler, "
                          "please use `get_last_lr()`.", UserWarning)
        if self.last_epoch == 0:
            return self.base_lrs
        return [group['lr'] * self.gamma
                for group in self.optimizer.param_groups]

    def _get_closed_form_lr(self):
        return [base_lr * self.gamma ** self.last_epoch
                for base_lr in self.base_lrs]
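A minimal usage sketch of ExponentialLR with gamma = 0.9 as suggested above (model and optimizer are placeholders):

import torch
from torch import nn, optim
from torch.optim.lr_scheduler import ExponentialLR

model = nn.Linear(10, 2)                            # placeholder model
optimizer = optim.SGD(model.parameters(), lr=0.1)
scheduler = ExponentialLR(optimizer, gamma=0.9)     # lr = 0.1 * 0.9 ** epoch

for epoch in range(5):
    # ... train one epoch ...
    optimizer.step()
    scheduler.step()
    print(epoch, scheduler.get_last_lr())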
Periodic adjustment strategies:
The defining feature of periodic adjustment is that the learning rate no longer decreases monotonically from start to finish; it also rises at times.
Cosine annealing (CosineAnnealingLR)
The learning rate follows a cosine curve with period 2 * T_max and is reset at the maximum of each period. The initial learning rate is the maximum; within one period the learning rate first decreases and then increases.
class CosineAnnealingLR(_LRScheduler):
    def __init__(self, optimizer, T_max, eta_min=0, last_epoch=-1, verbose=False):
        self.T_max = T_max
        self.eta_min = eta_min
        super(CosineAnnealingLR, self).__init__(optimizer, last_epoch, verbose)

    def get_lr(self):
        if not self._get_lr_called_within_step:
            warnings.warn("To get the last learning rate computed by the scheduler, "
                          "please use `get_last_lr()`.", UserWarning)
        if self.last_epoch == 0:
            return self.base_lrs
        elif (self.last_epoch - 1 - self.T_max) % (2 * self.T_max) == 0:
            return [group['lr'] + (base_lr - self.eta_min) *
                    (1 - math.cos(math.pi / self.T_max)) / 2
                    for base_lr, group in
                    zip(self.base_lrs, self.optimizer.param_groups)]
        return [(1 + math.cos(math.pi * self.last_epoch / self.T_max)) /
                (1 + math.cos(math.pi * (self.last_epoch - 1) / self.T_max)) *
                (group['lr'] - self.eta_min) + self.eta_min
                for group in self.optimizer.param_groups]

    def _get_closed_form_lr(self):
        return [self.eta_min + (base_lr - self.eta_min) *
                (1 + math.cos(math.pi * self.last_epoch / self.T_max)) / 2
                for base_lr in self.base_lrs]
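A minimal usage sketch; T_max = 50 is an arbitrary value chosen for illustration:

import torch
from torch import nn, optim
from torch.optim.lr_scheduler import CosineAnnealingLR

model = nn.Linear(10, 2)                            # placeholder model
optimizer = optim.SGD(model.parameters(), lr=0.1)
# Over T_max epochs the lr follows a half cosine from 0.1 down to eta_min,
# then rises back up over the next T_max epochs (period 2 * T_max).
scheduler = CosineAnnealingLR(optimizer, T_max=50, eta_min=0)

for epoch in range(100):
    # ... train one epoch ...
    optimizer.step()
    scheduler.step()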
Cyclical learning rates (CyclicLR)
As the name suggests, cyclical adjustment repeatedly varies the learning rate between a lower and an upper bound over a fixed cycle. The method comes from "Cyclical Learning Rates for Training Neural Networks". The motivation is to keep the model from getting stuck in poor regions of the loss surface such as saddle points. Cycling means every learning rate in the range (base_lr ~ max_lr) gets used during training, so whatever the best learning rate between the lower and upper bound happens to be, it will be applied at some point.
torch.optim.lr_scheduler.CyclicLR(optimizer, base_lr, max_lr,
                                  step_size_up=2000, step_size_down=None,
                                  mode='triangular', gamma=1.0, scale_fn=None,
                                  scale_mode='cycle', cycle_momentum=True,
                                  base_momentum=0.8, max_momentum=0.9,
                                  last_epoch=-1)
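A minimal usage sketch; the bounds and step sizes below are arbitrary illustrative values. Note that CyclicLR is usually stepped once per batch rather than once per epoch:

import torch
from torch import nn, optim

model = nn.Linear(10, 2)                                         # placeholder model
optimizer = optim.SGD(model.parameters(), lr=0.1, momentum=0.9)  # momentum needed when cycle_momentum=True
scheduler = optim.lr_scheduler.CyclicLR(optimizer, base_lr=0.001, max_lr=0.1,
                                        step_size_up=2000, mode='triangular')

for batch_idx in range(4000):
    # ... train one batch ...
    optimizer.step()
    scheduler.step()   # one scheduler step per batch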
Adaptive adjustment strategies:
As the name implies, all of the previous strategies are rules fixed in advance once the initial learning rate and training schedule are set; they do not react to the training state. Adaptive adjustment is different: it adjusts according to how training is going. It monitors some metric, and when that metric more or less stops changing, that is the moment to adjust the learning rate.
ReduceLROnPlateau
ReduceLROnPlateau's name is self-explanatory: it reduces the learning rate when training plateaus, i.e. when a monitored metric stops changing (stops decreasing or stops increasing). This is a very practical strategy. For example, adjust the learning rate when the validation loss stops decreasing, or monitor validation accuracy and adjust when the accuracy stops rising.
class ReduceLROnPlateau(object):
    def __init__(self, optimizer, mode='min', factor=0.1, patience=10,
                 threshold=1e-4, threshold_mode='rel', cooldown=0,
                 min_lr=0, eps=1e-8, verbose=False):
        if factor >= 1.0:
            raise ValueError('Factor should be < 1.0.')
        self.factor = factor

        # Attach optimizer
        if not isinstance(optimizer, Optimizer):
            raise TypeError('{} is not an Optimizer'.format(
                type(optimizer).__name__))
        self.optimizer = optimizer

        if isinstance(min_lr, list) or isinstance(min_lr, tuple):
            if len(min_lr) != len(optimizer.param_groups):
                raise ValueError("expected {} min_lrs, got {}".format(
                    len(optimizer.param_groups), len(min_lr)))
            self.min_lrs = list(min_lr)
        else:
            self.min_lrs = [min_lr] * len(optimizer.param_groups)

        self.patience = patience
        self.verbose = verbose
        self.cooldown = cooldown
        self.cooldown_counter = 0
        self.mode = mode
        self.threshold = threshold
        self.threshold_mode = threshold_mode
        self.best = None
        self.num_bad_epochs = None
        self.mode_worse = None  # the worse value for the chosen mode
        self.eps = eps
        self.last_epoch = 0
        self._init_is_better(mode=mode, threshold=threshold,
                             threshold_mode=threshold_mode)
        self._reset()

    def _reset(self):
        """Resets num_bad_epochs counter and cooldown counter."""
        self.best = self.mode_worse
        self.cooldown_counter = 0
        self.num_bad_epochs = 0

    def step(self, metrics, epoch=None):
        # convert `metrics` to float, in case it's a zero-dim Tensor
        current = float(metrics)
        if epoch is None:
            epoch = self.last_epoch + 1
        else:
            warnings.warn(EPOCH_DEPRECATION_WARNING, UserWarning)
        self.last_epoch = epoch

        if self.is_better(current, self.best):
            self.best = current
            self.num_bad_epochs = 0
        else:
            self.num_bad_epochs += 1

        if self.in_cooldown:
            self.cooldown_counter -= 1
            self.num_bad_epochs = 0  # ignore any bad epochs in cooldown

        if self.num_bad_epochs > self.patience:
            self._reduce_lr(epoch)
            self.cooldown_counter = self.cooldown
            self.num_bad_epochs = 0

        self._last_lr = [group['lr'] for group in self.optimizer.param_groups]

    def _reduce_lr(self, epoch):
        for i, param_group in enumerate(self.optimizer.param_groups):
            old_lr = float(param_group['lr'])
            new_lr = max(old_lr * self.factor, self.min_lrs[i])
            if old_lr - new_lr > self.eps:
                param_group['lr'] = new_lr
                if self.verbose:
                    print('Epoch {:5d}: reducing learning rate'
                          ' of group {} to {:.4e}.'.format(epoch, i, new_lr))

    @property
    def in_cooldown(self):
        return self.cooldown_counter > 0

    def is_better(self, a, best):
        if self.mode == 'min' and self.threshold_mode == 'rel':
            rel_epsilon = 1. - self.threshold
            return a < best * rel_epsilon
        elif self.mode == 'min' and self.threshold_mode == 'abs':
            return a < best - self.threshold
        elif self.mode == 'max' and self.threshold_mode == 'rel':
            rel_epsilon = self.threshold + 1.
            return a > best * rel_epsilon
        else:  # mode == 'max' and epsilon_mode == 'abs':
            return a > best + self.threshold

    def _init_is_better(self, mode, threshold, threshold_mode):
        if mode not in {'min', 'max'}:
            raise ValueError('mode ' + mode + ' is unknown!')
        if threshold_mode not in {'rel', 'abs'}:
            raise ValueError('threshold mode ' + threshold_mode + ' is unknown!')

        if mode == 'min':
            self.mode_worse = inf
        else:  # mode == 'max':
            self.mode_worse = -inf

        self.mode = mode
        self.threshold = threshold
        self.threshold_mode = threshold_mode

    def state_dict(self):
        return {key: value for key, value in self.__dict__.items() if key != 'optimizer'}

    def load_state_dict(self, state_dict):
        self.__dict__.update(state_dict)
        self._init_is_better(mode=self.mode, threshold=self.threshold, threshold_mode=self.threshold_mode)
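A minimal usage sketch; the fake validation loss here is only a stand-in for whatever metric the training script actually computes:

import torch
from torch import nn, optim
from torch.optim.lr_scheduler import ReduceLROnPlateau

model = nn.Linear(10, 2)                            # placeholder model
optimizer = optim.SGD(model.parameters(), lr=0.1)
# Cut the lr by 10x once the monitored metric has not improved for 5 epochs.
scheduler = ReduceLROnPlateau(optimizer, mode='min', factor=0.1, patience=5)

for epoch in range(30):
    # ... train one epoch, then evaluate ...
    val_loss = 1.0                                  # stand-in for the real validation loss
    scheduler.step(val_loss)                        # pass the monitored metric explicitly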
Custom adjustment strategies:
LambdaLR
LambdaLR can set a different learning-rate adjustment rule for each parameter group.
torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda, last_epoch=-1)
The difference from the other schedulers is that lr_lambda can be a single function or a list of functions, one per parameter group in the optimizer; once they are paired up, each group's learning rate is adjusted according to its own rule.
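A minimal sketch with two parameter groups, each getting its own rule; the backbone/head split and the specific lambdas are assumptions made purely for illustration:

import torch
from torch import nn, optim
from torch.optim.lr_scheduler import LambdaLR

backbone = nn.Linear(10, 10)                        # placeholder modules
head = nn.Linear(10, 2)
optimizer = optim.SGD([
    {'params': backbone.parameters(), 'lr': 0.01},
    {'params': head.parameters(), 'lr': 0.1},
])

# One lambda per parameter group: the backbone halves every 10 epochs,
# the head decays exponentially with base 0.95.
scheduler = LambdaLR(optimizer, lr_lambda=[
    lambda epoch: 0.5 ** (epoch // 10),
    lambda epoch: 0.95 ** epoch,
])

for epoch in range(3):
    # ... train one epoch ...
    optimizer.step()
    scheduler.step()
    print(scheduler.get_last_lr())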