Lévy-Driven Stochastic Differential Equation formulation of SGD¶

Author: Aneesh Jonelagadda Date: Oct 2023 (edited 2025)

Stochastic Gradient Descent (SGD) is a common optimizer in deep learning. This project analyzes SGD through a stochastic differential equation (SDE) lens, treating it as a Lévy-driven SDE with an explicit noise process. We will therefore formulate and implement SGD as an exact gradient term plus a separate noise term. This 'manually noised' optimization is then simulated across varying loss landscape geometries and noise distribution properties, which helps us understand how escape behavior scales with these factors. These scaling behaviors are in turn useful for predicting which optimizers tend toward which types of basins, as we briefly discuss at the end.

It is worth noting that this project draws on the mathematical theory of several papers, chiefly Zhou et al., 2021 and Imkeller et al., 2010. Building on that theory, this project actually simulates basin escape in Python, implements Chambers-Mallows-Stuck simulation of Lévy processes, directly compares Lévy with Gaussian noise escape dynamics, simulates and analyzes basin-geometry scaling differences on a few pretty-looking 2D loss landscapes, and more!

We will first define SGD with no gradient noise plus a separate noise term, then add gradient noise from a Gaussian process to the SGD. Finally, drawing on the finding that gradient noise is in practice heavy-tailed, we will instead add a Lévy $\alpha$-stable noise process to SGD with varying levels of $\alpha$.

We will simulate the escape time $\Gamma$ from a local basin across varying loss landscape basin geometries for these differently noised optimizers. This will show how the dependence of $\Gamma$ on geometry changes when Gaussian noise is replaced with Lévy noise, as well as how $\Gamma$ depends on tail thickness as we modulate the tail index $\alpha$.

Now to build the SGD algorithms, let us first define the traditional form of Stochastic Gradient Descent.

$$\theta_{t+1} = \theta_t - \eta \nabla f_{\mathcal{S}_t}(\theta_t)$$ with $\nabla f_{\mathcal{S}_t}(\theta_t) = \frac{1}{S}\sum_{i \in \mathcal{S}_t}\nabla f_{i}(\theta_t)$ across mini-batches $\mathcal{S}_t$ of size $S$ sampled from the data.
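
For concreteness, here is a minimal NumPy sketch of this mini-batch update (the function `loss_grad` and the `data` array are hypothetical placeholders; the experiments below instead use the noise-decomposed form derived next):

In [ ]:
import numpy as np

def minibatch_sgd_step(theta, data, loss_grad, eta=0.1, batch_size=32, rng=np.random):
    #Sample a mini-batch S_t uniformly from the data
    idx = rng.choice(len(data), size=batch_size, replace=False)
    #Stochastic gradient: average of per-example gradients over the mini-batch
    grad_estimate = np.mean([loss_grad(theta, data[i]) for i in idx], axis=0)
    #Vanilla SGD update
    return theta - eta * grad_estimate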

We can now define the "gradient noise" $u_t$ as the difference between the true gradient at a particular location at time t and the computed/bootstrapped gradient. Thus we can reformulate the above SGD in terms of the exact gradient and a noise term:

$$u_t = \nabla F(\theta_t) - \nabla f_{\mathcal{S}_t}(\theta_t)$$ $$\theta_{t+1} = \theta_t - \eta \nabla F(\theta_t)+\eta u_t \tag{1}$$

Since we are defining the basins analytically, we will be able to compute the true gradient and then add a noise process to the algorithm as per (1). First let's describe our basins, then we will talk about noise.

In [2]:
import numpy as np
import torch
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D
from typing import Callable

2D Loss landscape formulation¶

Let's formulate the loss landscapes analytically in 2D so that we can directly compute the loss and gradient without any gradient noise. This is important because we will be directly adding gradient noise to the optimization later. We also want analytic landscapes so that we have direct control over the basin geometry. The high-level landscape type we will simulate is a two-basin model with one suboptimal and one optimal minimum. The optimization will start at the suboptimal local minimum, and 'escape time' will be quantified as the number of iterations until the iterate crosses into the basin of the global minimum. We can modulate the geometry of the starting basin independently of the other; this way escape time is easy to quantify for varying basin geometries. $$ \begin{align} F(x, y) &= \min\left\{ F_1(x, y), F_2(x, y) \right\} \\ F_i(x, y) &= h_i \left( \frac{r_i}{w_i} \right)^{p_i}, \qquad r_i = \sqrt{(x - x_i)^2 + (y - y_i)^2} \end{align} \tag{2} $$

where

  • $\mu_i = (x_i, y_i)$ is the center of basin $i$
  • $h_i > 0$ is the height/depth
  • $w_i > 0$ is the width
  • $p_i \geq 2$ is the sharpness (power)
  • $r_i$ is the distance from basin center

We can modulate the $h$, $w$, $p$ parameters across the two basins as follows:

$$ \begin{array}{|c|c|c|c|c|} \hline \text{Basin} & \boldsymbol{\mu} & h & w & p \\ \hline \text{Optimal} & (3, 3) & 1.0 & 2.0 & 2 \\ \hline \text{Suboptimal (Wide)} & (0, 0) & 1.0 & 1.0 & 2 \\ \hline \text{Suboptimal (Medium)} & (0, 0) & 1.0 & 0.75 & 4 \\ \hline \text{Suboptimal (Sharp)} & (0, 0) & 1.0 & 0.35 & 6 \\ \hline \end{array} $$

In [257]:
#Configuration for clear comparison
configs_sharp = {
    'suboptimal': {
        'center': (0, 0),
        'depth': 1.0,    
        'width': 0.35,    
        'power': 6      
    },
    'optimal': {
        'center': (3, 3),  
        'depth': 1.0,     
        'width': 2.0,  
        'power': 2         
    }
}

configs_medium = {
    'suboptimal': {
        'center': (0, 0),
        'depth': 1.0,   
        'width': 0.75,    
        'power': 4       
    },
    'optimal': {
        'center': (3, 3),  
        'depth': 1.0,      
        'width': 2.0,      
        'power': 2       
    }
}

configs_wide = {
    'suboptimal': {
        'center': (0, 0),
        'depth': 1.0,    
        'width': 1.0,   
        'power': 2       
    },
    'optimal': {
        'center': (3, 3),  
        'depth': 1.0,      
        'width': 2.0,     
        'power': 2        
    }
}

def F(x, y, configs=configs_sharp):
    #Suboptimal basin where SGD experiments will begin
    x1, y1 = configs['suboptimal']['center']
    r1 = np.sqrt((x - x1)**2 + (y - y1)**2)
    F1 = configs['suboptimal']['depth'] * \
         (r1 / configs['suboptimal']['width'])**configs['suboptimal']['power']
    
    #Optimal basin
    x2, y2 = configs['optimal']['center']
    r2 = np.sqrt((x - x2)**2 + (y - y2)**2)
    F2 = configs['optimal']['depth'] * \
         (r2 / configs['optimal']['width'])**configs['optimal']['power']
    
    #Pointwise minimum of the two basins
    return np.minimum(F1, F2)
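
As a quick sanity check (an aside added here, not part of the original run), both basin centers should sit at zero loss with a positive ridge between them:

In [ ]:
#Both basin centers should evaluate to 0; a point on the ridge between them should be positive
print(F(0, 0, configs_sharp), F(3, 3, configs_sharp))
print(F(1.5, 1.5, configs_sharp))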

Now let's plot this loss landscape to visualize:

In [20]:
xs = np.linspace(-4,5,1000)
ys = np.linspace(-4,5,1000)
x, y = np.meshgrid(xs, ys)

F_sharp = F(x, y, configs_sharp)
F_medium = F(x, y, configs_medium)
F_wide = F(x, y, configs_wide)

fig = plt.figure(figsize=[18,18])

ax = fig.add_subplot(121, projection='3d')

ax.set_title("Two-basin Loss Landscape for 3 different basin shapes")
ax.plot_surface(x , y, F_sharp, alpha=0.4, color='purple')
ax.plot_surface(x , y, F_medium, alpha=0.3, color='yellow')
ax.plot_surface(x , y, F_wide, alpha=0.2, cmap='viridis')


ax = fig.add_subplot(122, projection='3d')
ax.view_init(elev=0, azim=0)

ax.set_title("Two-basin Loss Landscape (Orthogonal View)")
#ax.plot_surface(x , y, F_elliptical_well(x,y,3,3), alpha=0.7, cmap='viridis')
ax.plot_surface(x , y, F_sharp, alpha=0.4, color='purple')

ax.plot_surface(x , y, F_medium, alpha=0.3, color='yellow')

ax.plot_surface(x , y, F_wide, alpha=0.2, cmap='viridis')


plt.show()
[Figure: 3D surface plots of the two-basin loss landscape for the sharp, medium, and wide suboptimal-basin configurations, shown in perspective and orthogonal views.]

Gradient Formulation¶

Now let's formulate the gradient $\nabla F$ for the loss landscape. For each basin $F_i$, it is simply: $$\nabla F = \begin{bmatrix} \frac{\partial F}{\partial x} \\ \frac{\partial F}{\partial y} \end{bmatrix} $$

$$ \begin{align} \nabla F(x, y) &= \nabla F_i(x, y) \quad \text{where } i = \arg\min_j F_j(x, y) \\[10pt] \frac{\partial F_i}{\partial x} &= \frac{h_i p_i}{w_i^2} \left( \frac{\sqrt{(x - x_i)^2 + (y - y_i)^2}}{w_i} \right)^{p_i - 2} (x - x_i) \\[10pt] \frac{\partial F_i}{\partial y} &= \frac{h_i p_i}{w_i^2} \left( \frac{\sqrt{(x - x_i)^2 + (y - y_i)^2}}{w_i} \right)^{p_i - 2} (y - y_i) \end{align} \tag{3} $$

In [52]:
def grad_F(x, y, configs):
    '''
    Inputs: point(s) (x, y) to evaluate and the loss surface configs
    Outputs:
    gradient (np.array of [dF/dx, dF/dy]),
    escaped (bool): whether the point lies in the optimal basin (basin 2)
    '''
    c_1 = configs['suboptimal']
    c_2 = configs['optimal']
    
    #Centers
    x1, y1 = c_1['center']
    x2, y2 = c_2['center']
    
    #Distances
    r1 = np.sqrt((x-x1)**2 + (y-y1)**2)
    r2 = np.sqrt((x-x2)**2 + (y-y2)**2)
    
    #Detects which basin to use
    F1 = c_1['depth'] * (r1/c_1['width'])**c_1['power']
    F2 = c_2['depth'] * (r2/c_2['width'])**c_2['power']
    use_basin1 = F1 < F2
    
    #Avoid division by zero
    r1 = np.maximum(r1, 1e-10)
    r2 = np.maximum(r2, 1e-10)
    
    #Gradient of basin 1
    coeff1 = (c_1['depth']*c_1['power']/c_1['width']**2)*(r1/c_1['width'])**(c_1['power']-2)
    grad1_x = coeff1 * (x - x1)
    grad1_y = coeff1 * (y - y1)
    
    #Gradient of basin 2
    coeff2 = (c_2['depth']*c_2['power']/c_2['width']**2)*(r2/c_2['width'])**(c_2['power']-2)
    grad2_x = coeff2*(x-x2)
    grad2_y = coeff2*(y-y2)
    
    #Return gradient from active basin and escaped bool
    return np.array([np.where(use_basin1, grad1_x, grad2_x), np.where(use_basin1, grad1_y, grad2_y)]), not use_basin1
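
A finite-difference spot check (again an added aside) is a cheap way to confirm the analytic gradient (3) matches the landscape (2) away from the basin boundary and centers:

In [ ]:
#Compare the analytic gradient to a central finite difference at a point inside basin 1
x0, y0, h = 0.2, 0.1, 1e-6
(analytic_gx, analytic_gy), _ = grad_F(x0, y0, configs_sharp)
numeric_gx = (F(x0 + h, y0, configs_sharp) - F(x0 - h, y0, configs_sharp)) / (2*h)
numeric_gy = (F(x0, y0 + h, configs_sharp) - F(x0, y0 - h, configs_sharp)) / (2*h)
print(analytic_gx - numeric_gx, analytic_gy - numeric_gy)  #Both differences should be ~0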

Noise Processes for $u_t$¶

As stated previously, one commonly used assumption to analyze SGD behavior is to assume gradient noise obeys a Gaussian distribution.

Instead, what is proposed in the literature is the Lévy $\alpha$-stable distribution, a class of distributions with tail parameter $\alpha \in (0,2]$. In the symmetric, zero-centered case ($S\alpha S$), it is characterized by its characteristic function:

$$ \mathbb{E}[e^{i \omega x}] = \exp(-\sigma^\alpha |\omega|^\alpha), $$

These distributions have infinite variance for $\alpha < 2$, and the mean is undefined for $\alpha \le 1$. The breakpoints at $\alpha=1,2$ are two very common distributions: $\alpha=1$ corresponds to the Cauchy (Lorentz) distribution, and $\alpha=2$ is the beloved Gaussian distribution! Thus as $\alpha$ (sometimes called the 'tail index') decreases, the heaviness of the tails increases.
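
To make the infinite-variance point concrete, here is a small illustrative aside: the running sample variance of Cauchy ($\alpha=1$) draws never settles, unlike the Gaussian ($\alpha=2$) case. (The Cauchy draws are generated with the same $\tan U$ transform used in the sampler below.)

In [ ]:
#Running sample variance: settles for Gaussian (alpha=2), keeps jumping for Cauchy (alpha=1)
rng = np.random.default_rng(0)
gauss_draws = rng.standard_normal(100000)
cauchy_draws = np.tan(rng.uniform(-np.pi/2, np.pi/2, 100000))
for n in [10**3, 10**4, 10**5]:
    print(n, np.var(gauss_draws[:n]), np.var(cauchy_draws[:n]))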

Chambers–Mallows–Stuck (CMS) Method for Simulating Stable α-Distributions¶

The Chambers–Mallows–Stuck (CMS) method is a widely used algorithm for simulating random variables from an $\alpha$-stable distribution.

The parametrizations of the class of distributions are:

  • $\alpha$ (tail index/thickness) $0<\alpha\le 2$
  • $\beta$ (skewness/asymmetry), $-1\le \beta \le 1$
  • $\gamma$ (scale) $\gamma>0$
  • $\delta$ (location, shift), $\delta \in \mathbb{R}$

The algorithm proceeds as follows:

  1. First we draw two random numbers $U$ and $W$ from $$ U \sim \mathrm{Uniform}\!\left(-\frac{\pi}{2},\,\frac{\pi}{2}\right), \qquad W \sim \mathrm{Exp}(1). $$

  2. If $\alpha \ne 1$, set $$ \phi \;=\; \frac{1}{\alpha}\arctan\!\big(\beta \tan(\tfrac{\pi \alpha}{2})\big), \qquad S \;=\; \Big(1+\beta^2\tan^2(\tfrac{\pi\alpha}{2})\Big)^{\tfrac{1}{2\alpha}}, $$ and compute $$ X \;=\; \delta \;+\; \gamma \, S \; \frac{\sin\!\big(\alpha\,(U+\phi)\big)}{\big(\cos U\big)^{1/\alpha}}\; \left(\frac{\cos\!\big(U-\alpha\,(U+\phi)\big)}{W}\right)^{\tfrac{1-\alpha}{\alpha}}. $$

    Note if $\beta=0$ then $\phi=0$, thus the above reduces to

    $$ X \;=\; \delta \;+\; \gamma \; \frac{\sin(\alpha U)}{\big(\cos U\big)^{1/\alpha}} \left(\frac{\cos\!\big((1-\alpha)U\big)}{W}\right)^{\tfrac{1-\alpha}{\alpha}}. \tag{4} $$

  3. If $\alpha = 1$, use the special form $$ X \;=\; \delta \;+\; \frac{2}{\pi}\,\gamma\left[ \left(\frac{\pi}{2} + \beta U\right)\tan U \;-\; \beta \,\ln\!\left(\frac{(\tfrac{\pi}{2})\,W\,\cos U}{\tfrac{\pi}{2}+\beta U}\right) \right]. $$

    If $\beta=0$, note this reduces to

    $$ X \;=\; \delta + \gamma \tan U, $$ i.e. a (shifted, scaled) Cauchy draw.

Now let's implement this in Python in a handy function that can draw an arbitrary number of samples (via the size parameter) from the Lévy stable distribution. Let's set the shift, scale, and skewness to convenient values and mainly concern ourselves with the tail index $\alpha$.

We will also use this function to plot the difference between Lévy stable (with $\alpha < 2$) and Gaussian distributions, focusing on the tail heaviness.

In [178]:
def levy_stable_sample(alpha, beta=0, size=1):
    #Handles Gaussian case (alpha=2); a standard normal stands in for the Gaussian endpoint
    if alpha == 2:
        return np.random.randn(size)
    
    #Chambers-Mallows-Stuck method
    U = np.random.uniform(-np.pi/2, np.pi/2, size)
    W = np.random.exponential(1.0, size)
    
    #Handles Cauchy case (alpha=1); this shortcut assumes beta=0
    if alpha == 1:
        return np.tan(U)
    
    const = beta * np.tan(np.pi * alpha / 2)
    B = np.arctan(const) / alpha
    S = (1 + const**2)**(1/(2*alpha))
    
    term1 = np.sin(alpha * (U + B))
    term2 = (np.cos(U))**(1/alpha)
    term3 = (np.cos(U - alpha * (U + B)) / W)**((1 - alpha)/alpha)
    
    samples = S*term1/term2*term3
    
    return samples[0] if size == 1 else samples
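
If SciPy is available, we can optionally cross-check the CMS sampler against scipy.stats.levy_stable, which for $\beta=0$ and unit scale targets the same symmetric $\alpha$-stable law; this check is an addition and its output is not shown here:

In [ ]:
from scipy.stats import levy_stable

#Compare empirical tail probabilities of our CMS sampler vs SciPy's reference sampler
alpha_test = 1.5
ours = levy_stable_sample(alpha=alpha_test, size=50000)
ref = levy_stable.rvs(alpha_test, 0, size=50000)
for t in [1, 5, 10, 20]:
    print(t, np.mean(np.abs(ours) > t), np.mean(np.abs(ref) > t))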
In [30]:
#Generate samples
np.random.seed(42)
gaussian = np.random.randn(10000)
levy = levy_stable_sample(alpha=1.5, size=10000)

#Clip extreme outliers for visualization
levy_clipped = np.clip(levy, -10, 10)

fig, axes = plt.subplots(1, 4, figsize=[18, 5])

#Plot 1: Gaussian
axes[0].hist(gaussian, bins=100, alpha=0.8, color='blue', density=True, edgecolor='black')
axes[0].set_xlim(-4, 4)
axes[0].set_title('Gaussian Distribution', fontsize=14)
axes[0].set_xlabel('Value')
axes[0].set_ylabel('Density')
axes[0].grid(alpha=0.3)

#Plot 2: Lévy (clipped for visualization)
axes[1].hist(levy_clipped, bins=100, alpha=0.8, color='red', density=True, edgecolor='black')
axes[1].set_xlim(-4, 4)
axes[1].set_title('Lévy α=1.5 (Heavy Tails)', fontsize=14)
axes[1].set_xlabel('Value')
axes[1].set_ylabel('Density')
axes[1].grid(alpha=0.3)


#Plot 3: Log scale to see tails
axes[2].hist(gaussian, bins=200, alpha=0.6, label='Gaussian', color='blue', 
             range=(-50, 50), log=True)
axes[2].hist(levy, bins=200, alpha=0.6, label='Lévy (α=1.5)', color='red', 
             range=(-50, 50), log=True)
axes[2].set_xlim(-50, 50)
axes[2].set_title('Heavy Tails (log scale)', fontsize=14)
axes[2].set_xlabel('Value')
axes[2].set_ylabel('Count (log)')
axes[2].legend()


#Plot 4: Comparison showing tail probabilities
tail_thresholds = np.arange(0, 10, 0.5)
gaussian_tail_prob = [np.mean(np.abs(gaussian) > t) for t in tail_thresholds]
levy_tail_prob = [np.mean(np.abs(levy) > t) for t in tail_thresholds]

axes[3].semilogy(tail_thresholds, gaussian_tail_prob, 'b-', linewidth=2, label='Gaussian')
axes[3].semilogy(tail_thresholds, levy_tail_prob, 'r-', linewidth=2, label='Lévy α=1.5')
axes[3].set_xlabel('Threshold', fontsize=12)
axes[3].set_ylabel('P(|X| > threshold)', fontsize=12)
axes[3].set_title('Tail Probability (Heavy Tails!)', fontsize=14)
axes[3].legend(fontsize=12)
axes[3].grid(alpha=0.3)

plt.tight_layout()
plt.show()
[Figure: histograms of Gaussian vs. Lévy (α=1.5) samples, log-scale tail histograms, and tail probability P(|X| > threshold) on a semilog axis.]

Lévy-driven Stochastic Differential Equation¶

Let's assume $u_t$ follows an $S\alpha S$ distribution with a time-dependent scaling matrix $\Sigma_t$ (playing the role of a covariance), which models the structure of the gradient noise. For simplicity we take $\Sigma_t$ to be diagonal (axis-aligned, possibly anisotropic), so each coordinate can have its own scaling but remains independent of the others.

For a small learning rate $\eta$ and $ \varepsilon = \eta^{(\alpha - 1)/\alpha}$, the dynamics of SGD can be approximated by the Lévy-driven stochastic differential equation

$$ d\theta_t = -\nabla F(\theta_t)\,dt + \varepsilon \Sigma_t\, dL_t. \tag{5} $$

The first term $-\nabla F(\theta_t)\,dt$ represents the deterministic drift toward minimizing the loss, while the second term $\varepsilon \Sigma_t\, dL_t$ injects random fluctuations driven by heavy-tailed Lévy noise. Here $L_t \in \mathbb{R}^d$ is a Lévy motion whose coordinates are independent symmetric $\alpha$-stable processes; by self-similarity, an increment over a time step $dt$ is distributed as $(dt)^{1/\alpha}$ times an $S\alpha S(1)$ draw.

When $\alpha = 2$, the $S\alpha S$ distribution reduces to the Gaussian and the SDE simplifies to a Brownian-driven form which corresponds to the typical 'diffusion' description of SGD. In this case, the Lévy motion $L_t$ becomes a standard Brownian motion $B_t$, and the SDE simplifies to

$$ d\theta_t = -\nabla F(\theta_t)dt + \sqrt{\eta}\,\Sigma_t dB_t \tag{6} $$

$B_t$ is Brownian motion, whose increments are Gaussian and have finite variance. The noise term $\sqrt{\eta}\,\Sigma_t\, dB_t$ corresponds to small Gaussian fluctuations. For $\alpha < 2$, the Lévy-driven formulation captures rare but large gradient fluctuations that cannot be modeled by Gaussian noise.
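
To connect (5) back to the discrete update we simulate, here is a minimal Euler–Maruyama-style sketch, assuming $\Sigma_t = I$ and drawing an independent stable increment per coordinate (it reuses grad_F and levy_stable_sample from above):

In [ ]:
#One Euler-Maruyama step of d(theta) = -grad F dt + eps * dL with time step dt = eta
def levy_sde_step(theta, configs, eta, alpha):
    eps = eta**((alpha - 1)/alpha)                  #Noise scale from the SDE formulation
    grad, _ = grad_F(theta[0], theta[1], configs)
    #alpha-stable increment over dt=eta scales like dt**(1/alpha)
    dL = eta**(1/alpha) * levy_stable_sample(alpha=alpha, size=2)
    return theta - eta*grad + eps*dL

Since $\varepsilon\,\eta^{1/\alpha} = \eta^{(\alpha-1)/\alpha}\,\eta^{1/\alpha} = \eta$, a single step of this scheme is exactly the noised update (1) with $u_t$ an $S\alpha S$ draw; the simulation below implements that form directly (with one shared noise value per step rather than one per coordinate).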

Escape Dynamics¶

Before we launch into the experiment of running differently noised SGD, let us first examine the theory of the escape dynamics as per https://arxiv.org/pdf/2010.05627. We will attempt to paraphrase the key theorem and contextualize it to our use-case for this experiment:

Theorem from Zhou et al., 2021: escaping time bounds¶

Let $$ \varepsilon = \eta^{(\alpha-1)/\alpha}, \qquad \Theta(\varepsilon^{-1}) = \tfrac{2}{\alpha}\,\varepsilon^{-\alpha}, $$ let $m(\mathcal{W})$ be the Radon measure of the escape window $\mathcal{W}$, which corresponds to the size of the exit set from a basin, and let $\Gamma$ be the escaping time.

For some small error parameter $\rho$ and any $u>-1$, $$ \frac{1-\rho}{1+u+\rho} \;\le\; \mathbb{E}\!\left[\exp\!\big(-u\,m(\mathcal{W})\,\Theta(\varepsilon^{-1})\,\Gamma\big)\right] \;\le\; \frac{1+\rho}{1+u-\rho}. $$

As a result, the escaping time $\Gamma$ can be related to the escape set Radon measure and the learning rate

$$ \Gamma \sim O\!\left(\frac{1}{m(\mathcal{W})\,\Theta(\varepsilon^{-1})}\right) \tag{7} $$

One main thing to note is the relationship between the escape set Radon measure $m(\mathcal{W})$ and the basin Radon measure $m(\Omega)$. If the basin $\Omega$ has a larger volume, it has a larger Radon measure $m(\Omega)$ as per the definition of Radon measure (Volume can actually be selected as a specific choice of Radon measure). This additionally means that the complementary set $\mathcal{W}^C$ of the exit set $\mathcal{W}$ (the part of the loss landscape that's the "in" set rather than the "exit" set) has a larger Radon measure $m(\mathcal{W}^C)$. Since $\mathcal{W} \cup \mathcal{W^C}$ is constant, a larger $m(\mathcal{W}^C)$ means a smaller $m(\mathcal{W})$.

As a result, the following relationship holds between three basins of increasing width (and thus volume) $\Omega_A$, $\Omega_B$, $\Omega_C$:

$$\begin{aligned} V(\Omega_A) < V(\Omega_B) < V(\Omega_C) \to m(\Omega_A) < m(\Omega_B) < m(\Omega_C)\to \\ m(\mathcal{W_A}^C) < m(\mathcal{W_B}^C) < m(\mathcal{W_C}^C)\to m(\mathcal{W_A}) > m(\mathcal{W_B}) > m(\mathcal{W_C})\to\\ \Gamma_A < \Gamma_B < \Gamma_C \end{aligned} \tag{8} $$

We can use this as a hypothesis for our experiments with the 3 differently shaped basins denoted 'sharp', 'medium', and 'wide'. The 'sharp' configuration should have a smaller basin volume than 'medium', which in turn has a smaller basin volume than 'wide', so we should expect faster escape from the sharper basins. This also helps explain why heavy-tailed Lévy noise dramatically accelerates escape compared to Gaussian noise: escape from wide basins is driven by large jumps, which are far more common under the Lévy process.

Escape Experiment: Lévy escape for different geometry basins and gradient noise tail index $\alpha$¶

Now let us perform SGD escape given the noise term, recalling the noised SGD formulation (1): $$\theta_{t+1} = \theta_t - \eta \nabla F(\theta_t)+\eta u_t$$

We will examine escape properties for $u_t$ obeying Lévy-stable and Gaussian processes, the latter of which corresponds to the Brownian-driven diffusion commonly assumed for SGD. Escape will be quantified simply as the point where the active basin switches from $F_1$ to $F_2$, i.e. the first iteration at which the gradient is no longer taken from the 'suboptimal' basin. Since Gaussian noise simply corresponds to $\alpha=2$, we can use the same Lévy noise function for both analyses.

The escape times will be averaged over 1000 runs per geometry to ensure reliability of results, and the optimization will be run for our 3 different basin geometries. Let's finally define the Python SGD function as per (1) and run our experiment:

In [240]:
def SGD(steps, configs, eta=0.001, noise_alpha=2):
    #Re-seed from system entropy so repeated runs are independent
    np.random.seed()
    #Start at the center of the suboptimal basin
    theta = np.array(configs['suboptimal']['center'], dtype=float)
    thetas = [theta]
    escaped = False
    step_count = 0
    #Pre-generate the whole noise sequence for this run
    levy_noise = levy_stable_sample(alpha=noise_alpha, size=steps)

    while not escaped and step_count < steps: 
        #Exact gradient of the analytic landscape (and whether we've left basin 1)
        grad, escaped = grad_F(x=theta[0], y=theta[1], configs=configs)
        grad_term = eta*grad
        #Noise term drawn from the pre-generated Lévy (or Gaussian) samples
        noise_term = eta*levy_noise[step_count]
        #SGD update as per (1): exact gradient step plus injected gradient noise
        theta = theta - grad_term + noise_term
        thetas.append(theta)
        step_count += 1
    return thetas, step_count
In [211]:
#Averages escape time over 1000 runs per geometry for more robust analysis
s_wide,s_medium,s_sharp=0,0,0
eta=0.05
alpha=1.5
for i in range(1000):
    s_wide += SGD(10000,configs_wide,eta=eta,noise_alpha=alpha)[1]
    s_medium += SGD(10000,configs_medium,eta=eta,noise_alpha=alpha)[1]
    s_sharp += SGD(10000,configs_sharp,eta=eta,noise_alpha=alpha)[1]
s_wide /= 1000
s_medium /= 1000
s_sharp /= 1000
print(f"Average Wide Escape Time: {s_wide}\nAverage Medium Escape Time: {s_medium}\nAverage Sharp Escape Time:{s_sharp}")
Average Wide Escape Time: 328.322
Average Medium Escape Time: 130.378
Average Sharp Escape Time:19.348
In [241]:
s_wide,s_medium,s_sharp=0,0,0
eta=0.2
alpha=2
for i in range(1000):
    s_wide += SGD(10000,configs_wide,eta=eta,noise_alpha=alpha)[1]
    s_medium += SGD(10000,configs_medium,eta=eta,noise_alpha=alpha)[1]
    s_sharp += SGD(10000,configs_sharp,eta=eta,noise_alpha=alpha)[1]
s_wide /= 1000
s_medium /= 1000
s_sharp /= 1000
print(f"Average Wide Escape Time: {s_wide}\nAverage Medium Escape Time: {s_medium}\nAverage Sharp Escape Time:{s_sharp}")
Average Wide Escape Time: 8581.439
Average Medium Escape Time: 143.298
Average Sharp Escape Time:4.331
In [288]:
#Basin widths and the averaged escape times printed above (y_1: Lévy alpha=1.5, y_2: Gaussian)
x=[configs_sharp['suboptimal']['width'],configs_medium['suboptimal']['width'],configs_wide['suboptimal']['width']]
y_1=[19.348,130.378,328.322]
y_2=[4.331,143.298,8581.439]

fig = plt.figure(figsize=[14,5])

#Escape time for Lévy
ax = fig.add_subplot(121)
ax.set_title("Lévy Escape Time vs Basin Width (Radon Measure Proxy)")
ax.set_xlabel("Starting Basin Width")
ax.set_ylabel(r"Escape Time $\Gamma$ (SGD steps)")
ax.scatter(x, y_1, color='blue', label='Lévy Noise')
ax.plot(x, y_1, color='blue')
ax.set_xscale("log")
ax.set_yscale("log")
plt.legend()

ax2 = fig.add_subplot(122)
ax2.set_title("Gaussian Escape Time vs Basin Width (Radon Measure Proxy)")
ax2.set_xlabel("Starting Basin Width")
ax2.set_ylabel(r"Escape Time $\Gamma$ (SGD steps)")
ax2.set_xscale("log")
ax2.set_yscale("log")
ax2.scatter(x, y_2, color='red', label='Gaussian Noise')
ax2.plot(x, y_2, color='red')

plt.legend()
plt.show()
[Figure: log-log plots of escape time Γ vs. starting basin width for Lévy (α=1.5) noise and Gaussian noise.]
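
To probe the tail-index dependence promised in the section title more finely than the two-point comparison $\alpha \in \{1.5, 2\}$ above, one could sweep $\alpha$ directly. A sketch of such a sweep (not run here, so no results are reported) would be:

In [ ]:
#Sketch: sweep the tail index alpha and record mean escape time per basin geometry
alphas = [1.2, 1.4, 1.6, 1.8, 2.0]
escape_vs_alpha = {name: [] for name in ['sharp', 'medium', 'wide']}
for a in alphas:
    for name, cfg in [('sharp', configs_sharp), ('medium', configs_medium), ('wide', configs_wide)]:
        times = [SGD(10000, cfg, eta=0.05, noise_alpha=a)[1] for _ in range(200)]
        escape_vs_alpha[name].append(np.mean(times))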

Analysis¶

First and most importantly, our hypothesis (8) of $\Gamma_A < \Gamma_B < \Gamma_C$ was validated; the wider basins (and thus the basins with larger Radon measure) had higher escape times $\Gamma$. Additionally, the heavier-tailed Lévy noise facilitated much faster exit than Gaussian noise.

In terms of scaling laws, we see that the Lévy escape time is approximately polynomial as a function of basin width, which agrees with the literature (Zhou et al., 2021). We also see that a polynomial fit of the Gaussian escape times is not fully accurate, probably because the curvature also varies across the three test geometries, which relates to the Eyring-Kramers law for Brownian-driven SDEs. To explain fully, the Eyring–Kramers law gives the expected escaping time: $$ \mathbb{E}[\Gamma] \;\approx\; \frac{2\pi}{|\lambda^-|} \sqrt{\frac{|\det \nabla^2 F(\theta^\dagger)|}{\det \nabla^2 F(\theta^\ast)}} \;\exp\!\left(\tfrac{2}{\eta}\,\big[F(\theta^\dagger)-F(\theta^\ast)\big]\right), $$ where $\theta^\ast$ is the minimum, $\theta^\dagger$ is the saddle, and $\lambda^-$ is the negative eigenvalue of the Hessian at the saddle. The prefactor $ \frac{2\pi}{|\lambda^-|} \sqrt{\frac{|\det \nabla^2 F(\theta^\dagger)|}{\det \nabla^2 F(\theta^\ast)}}$ depends on the basin curvature and width, while the exponential term $ \exp\!\left(\tfrac{2}{\eta}\,[F(\theta^\dagger)-F(\theta^\ast)]\right) $ depends on the barrier height.

Thus, as per the theory, the Gaussian escaping time grows exponentially with the barrier height $F(\theta^\dagger)-F(\theta^\ast)$ and polynomially through the Eyring-Kramers prefactor (a measure of curvature/width), while the Lévy escaping time scales only polynomially in $\varepsilon$ as per (7) (with $\varepsilon$ dependent on $\eta$ and $\alpha$) and in the escape-set Radon measure $m(\mathcal{W})$, which shrinks as the basin volume grows.

Conclusion¶

So to wrap things up, this theory for the most part matches what we observe in the Lévy escape figures, since we observe polynomial scaling. For Gaussian noise, as just discussed, we cannot directly compare escape time to basin width, since the curvature also varies across the three geometries; the Eyring-Kramers prefactor is therefore not isolated in a way that would reveal clean polynomial scaling. It is also worth noting that we hold the depth parameter of the suboptimal basin fixed, so barrier-height effects are not what this experiment probes, and we would not expect to observe the exponential scaling here. Some final remarks: since Gaussian noise produces small fluctuations, crossing barriers is exponentially rare throughout the stochastic process, whereas Lévy noise has heavy tails that yield occasional large jumps, making escape much faster and only polynomially limited.

As stated in the introduction, this type of analysis is useful for understanding the basin preferences of different optimizers. SGD has been observed empirically to generalize very well, which corresponds to a preference for wider basins. This makes sense given this analysis (which is also discussed in Zhou et al., 2021)! We assume the SGD process has a quite heavy-tailed gradient noise distribution that can be expressed as Lévy $\alpha$-stable, and increasing basin width (and thus decreasing the escape-set Radon measure) becomes the limiting factor for escape, causing wide loss basins to be quite stable under SGD.

In [1]:
import numpy as np
import copy
import math
import torch
import torch.nn as nn
import torch.optim as optim
from torchvision import datasets, transforms
from itertools import cycle
import matplotlib.pyplot as plt
device = torch.device("mps" if torch.backends.mps.is_available() else "cpu")

Follow Up Study: Empirical Lévy Index $\alpha$ Estimation on Gradient Noise during Training¶

To test one of the opening assumptions of our experiment, namely that the 'real' gradient noise is $\alpha$-stable rather than Gaussian, we will track the gradient noise distribution of a high-dimensional deep neural network during training. Then, using fancy math, we will estimate the Lévy index $\alpha$ of this distribution and track $\alpha$ as the model fits.

Lévy index $\alpha$ estimation - How do we even get $\alpha$ in practice??¶

We can use Mohammadi, Mohammadpour (2014)'s nifty method of estimating Lévy index $\alpha$ using samples generated from the distribution we want to estimate. Since we are estimating the gradient distribution, all we need is samples of gradients which we can directly get from the network during training.

Once we have the gradients, we use Equation (9) from Mohammadi (2014) (which is also, coincidentally, equation (9) in this project). Let $X_i \sim S_{\alpha}(\sigma, \beta, \mu)$, meaning $X_i$ is drawn from an arbitrary $\alpha$-stable distribution, with $X_i$ for $i=1,2,\dots,m$ an i.i.d. sequence: $$X_1 + \dots + X_m \overset{d}{=} m^{1/\alpha}X_1 + \mu(m-m^{1/\alpha}) \tag{9}$$

This implies that the sum of length-$m$ blocks of i.i.d samples from $S_\alpha$ scales like $m^{1/\alpha}$. If we let $\theta_m$ be the scaling parameter for the sum statistic $Y=\Sigma_i X_i$, and each $X_i$ has scale $\theta$, then (9) gives us $\theta_m = m^{1/\alpha}\theta$. Taking the logs of both sides, we get: $$\frac{1}{\alpha} = \frac{\log{\theta_m}- \log{\theta}}{\log{m}}$$
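
As a quick sanity check of this relation, take the Gaussian endpoint $\alpha = 2$: the sum of $m$ i.i.d. Gaussians has scale $\sqrt{m}\,\theta$, so $$\frac{1}{\alpha} = \frac{\log(\sqrt{m}\,\theta) - \log\theta}{\log m} = \frac{\tfrac{1}{2}\log m}{\log m} = \frac{1}{2},$$ recovering $\alpha = 2$ as expected.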

Now we can use the estimators for $\theta_m$ and $\theta$ to get an estimator for $\alpha$. The paper's suggestion is to use two order statistics to get a measure of the 'spread' of the distribution $X_{i:n}, X_{j:n}$ and $Y_{i:n}, Y_{j:n}$. These act as estimators for the scale parameters: $$ \frac{1}{\hat{\alpha}_{i,j}(m,n)} = \frac{ \log\!\left( \frac{Y_{j:n} - Y_{i:n}}{\mathbb{E}(Z_{j:n} - Z_{i:n})} \right) - \log\!\left( \frac{X_{j:n} - X_{i:n}}{\mathbb{E}(Z_{j:n} - Z_{i:n})} \right) }{ \log(m) } = \frac{ \log(Y_{j:n} - Y_{i:n}) - \log(X_{j:n} - X_{i:n}) }{ \log(m) }. $$

In practice, a simpler empirical estimator of the spread is the norm of samples from the raw distribution $X$ and from the block-sum distribution $Y$ (averaged in log space, as in the code below). Hence this becomes:

$$\frac{1}{\hat{\alpha}_{i,j}(m,n)} = \frac{ \log{(||Y||)} - \log{(||X||)}}{\log{(m)}} $$

where $Y$ is our distribution of $n$ binned sums with block length $m$ of $X$, which is the raw distribution of gradients.

In [2]:
def alpha_estimator(grads):
    '''
    Estimates alpha according to Mohammadi, Mohammadpour (2014) Section 3.2
    '''
    #First finds m=sum block size (closest to root N probably best). We pray here that N is not prime.
    N = len(grads)
    
    for i in range(1, int(N**0.5)+1):
        if N%i == 0:
            m=i
    n = int(N/m) #Gets num of blocks n (each with size m)

    #Shifts the µ of the alpha-stable distribution to 0 (since Section 3.2 assumes centered distribution)
    grads = torch.stack(grads)
    grads -= grads.mean()
    
    #Gets the real noise distr. and the binned sum noise distr. for comparison
    X = grads
    Y = torch.sum(X.view(n, m, -1), 1)
    
    #Calculates alpha using inverse-log-scaling of distr summing
    Y_log_norm = torch.log(Y.norm(dim=1) + 1e-15).mean()
    X_log_norm = torch.log(X.norm(dim=1) + 1e-15).mean()
    alpha = 1/((Y_log_norm - X_log_norm) / math.log(m))
    
    return(alpha)    
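
Before pointing this at real gradients, a quick synthetic check (added for illustration) on vectors with a known tail index: Gaussian vectors should give $\hat{\alpha} \approx 2$ and Cauchy vectors $\hat{\alpha} \approx 1$:

In [ ]:
#Synthetic check: feed the estimator i.i.d. vectors with a known tail index
torch.manual_seed(0)
N, d = 10000, 100
gauss_vecs = [torch.randn(d) for _ in range(N)]                                        #alpha = 2
cauchy_vecs = [torch.distributions.Cauchy(0.0, 1.0).sample((d,)) for _ in range(N)]    #alpha = 1
print(alpha_estimator(gauss_vecs))   #Expect roughly 2
print(alpha_estimator(cauchy_vecs))  #Expect roughly 1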

Network and Data Definition¶

For the network, we will use an InceptionNet-derived Convolutional Neural Network built from the Inception modules of Szegedy et al. (2015) and fit it on CIFAR10 to examine the gradient noise distribution properties. All that matters is that the network is 'sufficiently' expressive that the mini-batch approximation of the true gradient is not so poor that Lévy-stable behavior gets obfuscated, and that we give it the necessary tools, such as convolutions, so it does not get stuck in so suboptimal a basin that we cannot fairly gauge basin-escape behavior.

InceptionNet¶

The InceptionNet architecture attempts to address earlier CNNs' dependence on choosing a single filter size per layer by stacking modules called "Inception modules", each comprising 1x1, 3x3, and 5x5 convolutions plus a pooling layer. These branches have varying filter sizes and thus capture different scales of visual feature locality in the module's latent space through filter concatenation. This results in a 'parallel' structure of the inception module, where the 1x1, 3x3, and 5x5 convolutions are all performed on the previous module's output. A pooling branch is added to the parallel scheme as well, since pooling performs important regularization and feature-aggregation tasks.

It is also worth noting that in the optimized version of this network, and in the version we will implement, additional 1x1 convolutions are performed prior to the 3x3 and 5x5 layers as bottlenecks that reduce the channel count, and thus the cost, of the larger convolutions. A 1x1 convolution is additionally added after the pooling layer. An inception module thus looks like:

Previous layer--->(1x1)------------------>Filter Concatenation  
               ↓                      ↑
               |->(1x1)--(3x3)--------|
               |->(1x1)--(5x5)--------|
               |->(3x3m.pool)-(1x1)---|

Our implementation in this project will be slightly different from that in the paper; we simply wish to experiment with a model architecture that contains inception modules. We will thus build a fairly simple inception-based network with a traditional convolutional head, two stacked inception modules as a body, and an average-pooling, dropout, and softmax classification tail.

In terms of how these layers are organized, it is ideal from a memory perspective to start the network as a regular CNN would, perform some max pooling, and then introduce the Inception modules further downstream. In the full architecture additional max pooling follows groups of inception modules, but in our simplified version we go straight to the tail end of the model (the classification tail): global average pooling, which essentially does a more parameter-efficient flattening, followed by a classic softmax output layer.

In [3]:
class InceptionModule(nn.Module):
    def __init__(self, in_channels, sublayer_1, sublayer_2_input, sublayer_2_output, 
                 sublayer_3_input, sublayer_3_output, sublayer_maxpool_output):
        super(InceptionModule, self).__init__()
        
        #1x1 sublayer
        self.branch1 = nn.Sequential(
            nn.Conv2d(in_channels, sublayer_1, kernel_size=1, padding=0),
            nn.ReLU(inplace=True)
        )       
        #3x3 sublayer with 1x1 bottleneck)
        self.branch2 = nn.Sequential(
            nn.Conv2d(in_channels, sublayer_2_input, kernel_size=1, padding=0),
            nn.ReLU(inplace=True),
            nn.Conv2d(sublayer_2_input, sublayer_2_output, kernel_size=3, padding=1),
            nn.ReLU(inplace=True)
        )      
        #5x5 sublayer (with 1x1 bottleneck)
        self.branch3 = nn.Sequential(
            nn.Conv2d(in_channels, sublayer_3_input, kernel_size=1, padding=0),
            nn.ReLU(inplace=True),
            nn.Conv2d(sublayer_3_input, sublayer_3_output, kernel_size=5, padding=2),
            nn.ReLU(inplace=True)
        )       
        #MaxPooling sublayer
        self.branch4 = nn.Sequential(
            nn.MaxPool2d(kernel_size=3, stride=1, padding=1),
            nn.Conv2d(in_channels, sublayer_maxpool_output, kernel_size=1, padding=0),
            nn.ReLU(inplace=True)
        )
    
    def forward(self, x):
        branch1_out = self.branch1(x)
        branch2_out = self.branch2(x)
        branch3_out = self.branch3(x)
        branch4_out = self.branch4(x)      
        #Concatenate along channel dimension
        output = torch.cat([branch1_out, branch2_out, branch3_out, branch4_out], dim=1)
        return output


class InceptionCNN(nn.Module):
    def __init__(self, num_classes=6):
        super(InceptionCNN, self).__init__()
        
        #Initial conv layers
        self.conv1 = nn.Sequential(
            nn.Conv2d(3, 64, kernel_size=7, stride=2, padding=3),
            nn.ReLU(inplace=True),
            nn.MaxPool2d(kernel_size=3, stride=2, padding=1)
        )      
        self.conv2 = nn.Sequential(
            nn.Conv2d(64, 64, kernel_size=1, padding=0),
            nn.ReLU(inplace=True),
            nn.Conv2d(64, 192, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.MaxPool2d(kernel_size=3, stride=2, padding=1)
        )
        
        #Inception modules
        self.inception1 = InceptionModule(192, 64, 96, 128, 16, 32, 32)
        self.inception2 = InceptionModule(256, 128, 128, 192, 32, 96, 64)
        
        #Output layers
        self.global_avg_pool = nn.AdaptiveAvgPool2d((1, 1))
        self.dropout = nn.Dropout(0.1)
        self.fc = nn.Linear(480, num_classes)
    
    def forward(self, x):
        x = self.conv1(x)
        x = self.conv2(x)
        x = self.inception1(x)
        x = self.inception2(x)
        x = self.global_avg_pool(x)
        x = torch.flatten(x, 1)
        x = self.dropout(x)
        x = self.fc(x)
        return x

Fully-Connected Model Dinosaur Exhibit¶

I originally tried to do this with an FC network, following in the footsteps of literature such as Zhou et al. (2021), but the results didn't tell me much: the FC network was woefully insufficient for CIFAR10 (image classification), and the network itself probably didn't have the capacity to sensibly hop basins. In a way this made the network type and lack of convolutions a confounding variable in the basin-escape study. Introducing a proper CNN with Inception modules, as done above, removes this headache and lets us fairly assess basin escape, stochastic dynamics, and noise distributions. Anyway, I've included the FC model definition here for the fun of it.

In [4]:
class FullyConnectedNetwork(nn.Module):
    def __init__(self, input_dim=[3,32,32], width=30, depth=7, output_dim=10):
        super().__init__()
        self.input_dim = input_dim[0]*input_dim[1]*input_dim[2]
        self.width = width
        self.depth = depth
        self.output_dim = output_dim
        
        #Fresh Linear/ReLU instances per hidden layer (reusing one pre-built pair would share weights)
        fcblocks = []
        for i in range(self.depth - 2):
            fcblocks.append(nn.Linear(self.width, self.width, bias=False) if i % 2 == 0
                            else nn.ReLU(inplace=True))
        self.fc = nn.Sequential(nn.Linear(self.input_dim, self.width, bias=False),
                               nn.ReLU(inplace=True),
                               *fcblocks,
                               nn.Linear(self.width, self.output_dim, bias=False))
        
    def forward(self, x):
        x = x.view(x.size(0), self.input_dim)
        x = self.fc(x)
        return(x)

Data Loading¶

Now we define our data loaders etc for CIFAR10.

In [5]:
#Setup
seed = 0
torch.manual_seed(seed)

#Load CIFAR-10 data
batch_size, num_workers, num_classes = 125, 2, 10
normalize = transforms.Normalize((0.4914, 0.4822, 0.4465),
                                 (0.2470, 0.2435, 0.2616))

tf_train = transforms.Compose([
    transforms.RandomCrop(32, padding=4),
    transforms.RandomHorizontalFlip(),
    transforms.ToTensor(), normalize])

tf_test = transforms.Compose([transforms.ToTensor(), normalize])

train = datasets.CIFAR10('./data', train=True, download=True, transform=tf_train)
test  = datasets.CIFAR10('./data', train=False, download=True, transform=tf_test)

train_loader = torch.utils.data.DataLoader(train, batch_size, shuffle=True, num_workers=num_workers)
train_loader_eval = torch.utils.data.DataLoader(datasets.CIFAR10('./data', train=True, transform=tf_test),
                                                batch_size, shuffle=False, num_workers=num_workers)
test_loader_eval = torch.utils.data.DataLoader(test, batch_size, shuffle=False, num_workers=num_workers)
Files already downloaded and verified
Files already downloaded and verified

Model Training¶

Finally we will define the parameters for evaluation, initialize the model and training loop, and run (infinite) training. We use an infinite dataloader so that we can keep training by iterations. We train for 7000 iterations. Our output is scrubbed for brevity so that readers don't have to look through a long trace: that's what plots are for!

In [ ]:
# Setup params
iterations = 7000
lr = 0.1
mom = 0.9
wd = 5e-4
print_freq = 100
eval_freq = 100


def eval_model(eval_loader, model, crit, opt):
    """Evaluate model and estimate alpha"""
    model.eval()
    
    total_size = 0
    total_loss = 0
    total_correct = 0
    
    #Get loss and accuracy 
    with torch.no_grad():
        for x, y in eval_loader:
            x, y = x.to(device), y.to(device)
            out = model(x)
            
            # Get predictions
            pred = out.argmax(dim=1)
            correct = (pred == y).sum().item()
            
            # Calculate loss
            loss = crit(out, y)
            
            bs = x.size(0)
            total_size += bs
            total_loss += loss.item() * bs
            total_correct += correct
    
    #Fetch gradients for alpha estimation
    grads = []
    for x, y in eval_loader:
        x, y = x.to(device), y.to(device)
        opt.zero_grad()
        out = model(x)
        loss = crit(out, y)
        loss.backward()
        
        grad = nn.utils.parameters_to_vector(
            [p.grad for p in model.parameters() if p.grad is not None]
        )
        grads.append(grad.cpu())
    
    #Estimate alpha from gradients
    alpha = alpha_estimator(grads)
    
    avg_loss = total_loss / total_size
    avg_acc = 100.0 * total_correct / total_size
    
    return avg_loss, avg_acc, alpha


#Init Inception Conv. neural network model
model = InceptionCNN(num_classes=10).to(device)

#Init optimizer to be SGD and loss to be x-entropy
opt = optim.SGD(model.parameters(), lr=lr, momentum=mom, weight_decay=wd)
crit = nn.CrossEntropyLoss()

#Create infinite data loader so we can iterate indefinitely
circ_train_loader = cycle(train_loader)

training_history = []
evaluation_history_train = []
evaluation_history_val = []
alpha_history_train = []
alpha_history_val = []

#Training starts
print("Starting training")

for i, (x, y) in enumerate(circ_train_loader):
    #Periodic evaluation
    if i % eval_freq == 0:
        print(f"\nEvaluating at iteration {i}")
        val_loss, val_acc, val_alpha = eval_model(test_loader_eval, model, crit, opt)
        train_loss, train_acc, train_alpha = eval_model(train_loader_eval, model, crit, opt)
        
        evaluation_history_val.append([i, val_loss, val_acc])
        evaluation_history_train.append([i, train_loss, train_acc])
        alpha_history_train.append(train_alpha)
        alpha_history_val.append(val_alpha)
        
        print(f"Val   - Loss: {val_loss:.4f}, Acc: {val_acc:.2f}%, Alpha: {val_alpha:.4f}")
        print(f"Train - Loss: {train_loss:.4f}, Acc: {train_acc:.2f}%, Alpha: {train_alpha:.4f}")
    
    #Training step
    model.train()
    x, y = x.to(device), y.to(device)
    
    opt.zero_grad()
    out = model(x)
    loss = crit(out, y)
    loss.backward()
    opt.step()
    
    with torch.no_grad():
        pred = out.argmax(dim=1)
        acc = 100.0 * (pred == y).sum().item() / y.size(0)
    
    training_history.append([i, loss.item(), acc])
    
    if i % print_freq == 0:
        print(f"Iter {i}: Loss={loss.item():.4f}, Acc={acc:.2f}%")
        
    if i >= iterations:
        break

#Final eval
print("\nFinal evaluation:")
val_loss, val_acc, val_alpha = eval_model(test_loader_eval, model, crit, opt)
train_loss, train_acc, train_alpha = eval_model(train_loader_eval, model, crit, opt)

evaluation_history_val.append([i + 1, val_loss, val_acc])
evaluation_history_train.append([i + 1, train_loss, train_acc])
alpha_history_train.append(train_alpha)
alpha_history_val.append(val_alpha)

print(f"Final Val  - Loss: {val_loss:.4f}, Acc: {val_acc:.2f}%, Alpha: {val_alpha:.4f}")
print(f"Final Train - Loss: {train_loss:.4f}, Acc: {train_acc:.2f}%, Alpha: {train_alpha:.4f}")

Now we make a helper function for plotting metrics:

In [19]:
def plot_metrics(train_history, val_history, train_alpha, val_alpha):
    """Plot training and validation metrics."""
    fig, axes = plt.subplots(2, 2, figsize=(12, 10))
    
    train_data = list(zip(*train_history))
    val_data = list(zip(*val_history))
    
    plots = [
        (axes[0, 0], 'Loss', train_data[1], val_data[1]),
        (axes[0, 1], 'Accuracy (%)', train_data[2], val_data[2]),
        (axes[1, 0], r'Estimated $\alpha$ (Train Set)', None, train_alpha, 'blue'),
        (axes[1, 1], r'Estimated $\alpha$ (Val Set)', None, val_alpha, 'orange')
    ]
    
    for i, plot_info in enumerate(plots):
        ax = plot_info[0]
        title = plot_info[1]
        
        if i < 2:   #Loss plots
            ax.plot(train_data[0], plot_info[2], label='Train', alpha=0.7)
            ax.plot(val_data[0], plot_info[3], label='Val', marker='o', markersize=4)
            ax.legend()
        else:  #Alpha plots
            ax.plot(val_data[0], plot_info[3], marker='o', markersize=4, color=plot_info[4])
        
        ax.set_xlabel('Iteration')
        ax.set_ylabel(title.split('(')[0].strip())
        ax.set_title(f'Training and Validation {title}' if i < 2 else title)
        ax.grid(True, alpha=0.3)
    
    plt.tight_layout()
    plt.show()

Results: Gradient noise $\alpha$ and acc/loss during deep network training¶

In [20]:
# Plot results
plot_metrics(training_history, evaluation_history_val, alpha_history_train, alpha_history_val)
[Figure: training/validation loss and accuracy curves, and estimated gradient-noise tail index α on the train and validation sets over 7000 iterations.]

Conclusion:¶

We see that the estimated Lévy index of the gradient noise of a real network during training is quite different from Gaussian: the estimated $\alpha$ always falls below 2, mostly hanging around the 1.1-1.4 range. So our initial hypothesis from the beginning of this project, that Gaussian noise ($\alpha=2$) does not describe true network gradient noise, is supported by this evidence!

During training, $\alpha$ is actually quite dynamic due to the changing landscape and the model's changing ability to estimate the true gradient. As the model learns, it gets better at estimating the true gradient, reducing the gradient-noise spread and increasing the tail index $\alpha$ while settling into some basin. However, if it settles into a local basin, the tail-event-driven Lévy escape studied previously will eventually happen. Once an escape occurs, the model momentarily becomes slightly worse at predicting, causing a small dip in accuracy, and the gradient-noise index drops. We see this during training, particularly in the regime around 2000-3000 iterations, where the model appears to repeatedly settle into a local basin with a higher $\alpha$ and then escape, yielding downward spikes in $\alpha$ and small dips in accuracy. Towards the end of training, we observe the accuracy stabilizing at a relatively high value with a corresponding increase in the estimated tail index $\alpha$ over the last 500 iterations. This can indicate stabilization of the optimizer in a particularly optimal loss basin, which, given the dependence of our escape time $\Gamma$ on the escape set $m(\mathcal{W})$, implies a large basin Radon measure (and small escape-set measure) and thus a wide basin. We further postulate this is a particularly wide basin since previous basins were escaped from quite quickly (corresponding to the many $\alpha$ spikes from 3000-6500 iterations), whereas at the end the tail index keeps increasing and escape no longer occurs. The gradient tails become tighter and leaner as the model becomes more precise at estimating the true gradient within the basin.