

torch.optim.Adam implements the Adam algorithm:

\[
\begin{aligned}
&\textbf{input}: \gamma \text{ (lr)},\; \beta_1, \beta_2 \text{ (betas)},\; \theta_0 \text{ (params)},\; f(\theta) \text{ (objective)},\; \lambda \text{ (weight decay)},\; \textit{amsgrad},\; \textit{maximize} \\
&\textbf{initialize}: m_0 \leftarrow 0 \text{ (first moment)},\; v_0 \leftarrow 0 \text{ (second moment)},\; \widehat{v}_0^{\,max} \leftarrow 0 \\
&\textbf{for}\; t = 1 \;\textbf{to}\; \ldots \;\textbf{do} \\
&\quad \textbf{if}\; \textit{maximize}: \quad g_t \leftarrow -\nabla_\theta f_t(\theta_{t-1}) \\
&\quad \textbf{else}: \quad g_t \leftarrow \nabla_\theta f_t(\theta_{t-1}) \\
&\quad \textbf{if}\; \lambda \neq 0: \quad g_t \leftarrow g_t + \lambda \theta_{t-1} \\
&\quad m_t \leftarrow \beta_1 m_{t-1} + (1 - \beta_1)\, g_t \\
&\quad v_t \leftarrow \beta_2 v_{t-1} + (1 - \beta_2)\, g_t^2 \\
&\quad \widehat{m}_t \leftarrow m_t / (1 - \beta_1^t) \\
&\quad \widehat{v}_t \leftarrow v_t / (1 - \beta_2^t) \\
&\quad \textbf{if}\; \textit{amsgrad}: \\
&\qquad \widehat{v}_t^{\,max} \leftarrow \max(\widehat{v}_t^{\,max}, \widehat{v}_t) \\
&\qquad \theta_t \leftarrow \theta_{t-1} - \gamma\, \widehat{m}_t / (\sqrt{\widehat{v}_t^{\,max}} + \epsilon) \\
&\quad \textbf{else}: \\
&\qquad \theta_t \leftarrow \theta_{t-1} - \gamma\, \widehat{m}_t / (\sqrt{\widehat{v}_t} + \epsilon) \\
&\textbf{return}\; \theta_t
\end{aligned}
\]

For further details regarding the algorithm we refer to Adam: A Method for Stochastic Optimization.

Parameters

params (iterable) – iterable of parameters to optimize or dicts defining parameter groups
lr (float, optional) – learning rate (default: 1e-3)
betas (Tuple[float, float], optional) – coefficients used for computing running averages of gradient and its square (default: (0.9, 0.999))
eps (float, optional) – term added to the denominator to improve numerical stability (default: 1e-8)
weight_decay (float, optional) – weight decay (L2 penalty) (default: 0)
amsgrad (bool, optional) – whether to use the AMSGrad variant of this algorithm from the paper On the Convergence of Adam and Beyond (default: False)
foreach (bool, optional) – whether the foreach implementation of the optimizer is used (default: None)
maximize (bool, optional) – maximize the params based on the objective, instead of minimizing (default: False)
capturable (bool, optional) – whether this instance is safe to capture in a CUDA graph. Passing True can impair ungraphed performance, so if you don't intend to graph capture this instance, leave it False (default: False)

add_param_group(param_group)
Add a param group to the Optimizer's param_groups. This can be useful when fine tuning a pre-trained network, as frozen layers can be made trainable and added to the Optimizer as training progresses (see the sketch below).
param_group (dict) – Specifies what Tensors should be optimized along with group specific optimization options.

load_state_dict(state_dict)
Loads the optimizer state.
state_dict (dict) – optimizer state. Should be an object returned from a call to state_dict().

state_dict()
Returns the state of the optimizer as a dict.
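As an illustration of the constructor parameters above, here is a minimal sketch of building an Adam optimizer and taking one optimization step. The linear model, random data, and loss function are hypothetical placeholders, not part of the original documentation.

```python
import torch
import torch.nn as nn

# Hypothetical model and data, for illustration only.
model = nn.Linear(10, 1)
inputs = torch.randn(32, 10)
targets = torch.randn(32, 1)

# Construct Adam with explicit hyperparameters (lr, betas, and eps are shown
# at their defaults; weight_decay and amsgrad are non-default here).
optimizer = torch.optim.Adam(
    model.parameters(),
    lr=1e-3,
    betas=(0.9, 0.999),
    eps=1e-8,
    weight_decay=1e-4,
    amsgrad=True,
)

# One optimization step: clear old gradients, backpropagate, update parameters.
optimizer.zero_grad()
loss = nn.functional.mse_loss(model(inputs), targets)
loss.backward()
optimizer.step()
```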

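The add_param_group method described above is typically used while fine tuning. Below is a sketch under assumed names (a pre-trained "backbone" and a fresh "head" are placeholders): the backbone starts frozen, and is later unfrozen and registered as a new param group with its own learning rate.

```python
import torch
import torch.nn as nn

# Hypothetical pre-trained backbone plus a freshly initialized head.
backbone = nn.Linear(128, 64)
head = nn.Linear(64, 10)

# Freeze the backbone and optimize only the head at first.
for p in backbone.parameters():
    p.requires_grad = False
optimizer = torch.optim.Adam(head.parameters(), lr=1e-3)

# Later in training: unfreeze the backbone and add it as a new param group
# with a smaller learning rate than the head's.
for p in backbone.parameters():
    p.requires_grad = True
optimizer.add_param_group({"params": backbone.parameters(), "lr": 1e-4})
```

Each entry passed to add_param_group is a dict whose "params" key names the Tensors to optimize; any other keys (such as "lr" here) override the optimizer's defaults for that group.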

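The state_dict() and load_state_dict() methods described above are what make optimizer checkpointing possible. A minimal sketch, assuming an arbitrary checkpoint path and a placeholder model:

```python
import torch
import torch.nn as nn

model = nn.Linear(10, 1)  # hypothetical model
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

# Save both model and optimizer state; state_dict() returns a plain dict.
torch.save(
    {"model": model.state_dict(), "optimizer": optimizer.state_dict()},
    "checkpoint.pt",
)

# Restore: rebuild the objects, then load the saved states back in.
checkpoint = torch.load("checkpoint.pt")
model.load_state_dict(checkpoint["model"])
optimizer.load_state_dict(checkpoint["optimizer"])
```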
See also the CUDA Automatic Mixed Precision examples.
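For context on that reference, here is a sketch of driving Adam under CUDA automatic mixed precision with autocast and GradScaler; the model, data, and device assignment are assumptions for illustration, and a CUDA device is assumed to be available.

```python
import torch
import torch.nn as nn

device = "cuda"  # GradScaler-based mixed precision assumes a CUDA device
model = nn.Linear(10, 1).to(device)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
scaler = torch.cuda.amp.GradScaler()

inputs = torch.randn(32, 10, device=device)
targets = torch.randn(32, 1, device=device)

optimizer.zero_grad()
# Run the forward pass and loss computation in mixed precision.
with torch.cuda.amp.autocast():
    loss = nn.functional.mse_loss(model(inputs), targets)

# Scale the loss before backprop, then let the scaler unscale gradients,
# step the optimizer, and update its scale factor.
scaler.scale(loss).backward()
scaler.step(optimizer)
scaler.update()
```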
