Gauss-Newton Optimization

Gauss-Newton Optimization for Deep Neural Networks and LLM Training

This document provides a brief technical overview of modern optimization techniques for large-scale deep learning. It starts with classical optimization foundations, moves through advanced second-order methods, and ends with practical curvature approximations that make these techniques feasible for billion-parameter models.

What you’ll learn:

- How classical optimization methods scale (or don’t) to modern deep learning
- The mathematical foundation of second-order optimization
- Practical approximations like K-FAC, Shampoo, and Rank-1 methods
- When and why to use each optimization approach

📋 Table of Contents

- Classical Optimization Foundations
  - Problem Setup
  - Gradient Descent
  - Newton’s Method
- Gauss–Newton and Generalized Gauss–Newton
  - Setup: Network Outputs and Loss
  - Hessian Structure
  - Gauss–Newton (GN) Approximation
  - Generalized Gauss–Newton (GGN)
  - Why Use GGN?
  - Practical Challenge: Computing J and G
- Approximating GGN in Practice
  - Gradient Descent (First-Order Baseline)
  - Adam
  - K-FAC: Kronecker-Factored Approximate Curvature
  - Shampoo Optimizer
  - Rank-1 Curvature Approximation
  - Summary of GGN Approximations
- Intuitive Analogy: Mountain Navigation
- Practical Guidance for Optimizer Selection
- Key Takeaways

1. Classical Optimization Foundations

1.1 Problem Setup

The foundation of all optimization in deep learning is the empirical risk minimization problem. We consider the general problem of minimizing an objective function:

$$
L(\theta) = \mathbb{E}_{(x,y)\sim \mathcal{D}}[\ell(f(x; \theta), y)]
$$

...
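To make the objective concrete, here is a minimal sketch of how the expectation over $\mathcal{D}$ is estimated in practice as a mini-batch average. The toy model $f(x;\theta) = x^\top\theta$ with a squared-error loss, and all variable names, are illustrative assumptions rather than anything specified in the post.

```python
import numpy as np

# Illustrative assumption: a toy linear model f(x; theta) = x @ theta
# with squared-error loss ell(f, y) = (f - y)^2.
rng = np.random.default_rng(0)
X = rng.normal(size=(256, 4))                      # inputs x sampled from D
true_theta = np.array([1.0, -2.0, 0.5, 3.0])
y = X @ true_theta + 0.1 * rng.normal(size=256)    # noisy targets

def empirical_risk(theta, X, y):
    """Mini-batch estimate of L(theta) = E[ell(f(x; theta), y)]."""
    residual = X @ theta - y
    return np.mean(residual ** 2)

theta = np.zeros(4)
print(f"L(theta_0) = {empirical_risk(theta, X, y):.4f}")
```

In practice the full expectation is never computed; every optimizer discussed in the post (gradient descent, Adam, K-FAC, Shampoo) works from such stochastic mini-batch estimates of the loss and its derivatives.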

December 19, 2024 · 12 min · Xiaohui Xie