"Random reshuffling: Simple analysis with vast improvements." (Mishchenko et al, NeurIPS 2020)
This is all nice...but can we do better than random reshuffling?
"Random Reshuffling is Not Always Better" (De Sa, NeurIPS 2020)
"GraB: Finding Provably Better Data Permutations than Random Reshuffling" (Yucheng Lu, Wentao Guo, and Christopher De Sa, NeurIPS 2022)
"Near-Optimal Herding" (Harvey and Samadi, COLT 2014); "Kernel Thinning" (Dwivedi and Mackey, 2021)
Memory overhead $\mathcal{O}(d)$. Compute overhead $\mathcal{O}(nd)$.
Note: this is not the exact version of GraB presented in our NeurIPS paper, but a later improvement/simplification we developed while parallelizing GraB.
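For intuition, here is a minimal Python sketch of the herding-style sign balancing at the heart of GraB (the class and method names are ours; this illustrates the idea rather than the paper's exact algorithm or the simplified variant mentioned above): each centered gradient greedily gets a $\pm 1$ sign so the signed running sum stays small, then $+1$ examples are replayed first (in order) and $-1$ examples last (in reverse) in the next epoch. Each step costs $\mathcal{O}(d)$ and only the running sum persists, matching the overheads above.

```python
import numpy as np

class GraBReorder:
    """Sketch of GraB-style sign balancing (illustrative, not the paper's exact code)."""

    def __init__(self, d):
        self.running = np.zeros(d)      # signed prefix sum: the O(d) persistent state
        self.front, self.back = [], []  # indices assigned +1 / -1 this epoch

    def observe(self, idx, grad, mean_grad):
        """Assign example `idx` a greedy sign based on its centered gradient."""
        c = grad - mean_grad            # center with a (possibly stale) mean estimate
        # Herding-style greedy choice: pick the sign that keeps the running sum smaller.
        if np.linalg.norm(self.running + c) <= np.linalg.norm(self.running - c):
            self.running += c
            self.front.append(idx)      # +1: run early in the next epoch
        else:
            self.running -= c
            self.back.append(idx)       # -1: run late in the next epoch, reversed

    def next_order(self):
        """Return the permutation for the next epoch and reset the state."""
        order = self.front + self.back[::-1]
        self.front, self.back = [], []
        self.running[:] = 0.0
        return order

# Toy usage: reorder 8 random "gradients" in R^4 for the next epoch.
rng = np.random.default_rng(0)
grads = rng.normal(size=(8, 4))
reorderer = GraBReorder(d=4)
for i, g in enumerate(grads):
    reorderer.observe(i, g, grads.mean(axis=0))
print(reorderer.next_order())
```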
GraB tends to beat random reshuffling when $L^2_{2, \infty} \ll L^2 n$, e.g. when $d \ll n$ or when gradients are sparse.
Problem: the orders GraB produces don't respect any fixed partition of the data, which makes them awkward to use in parallel/distributed training.
...and principled theory can give us insight into the "right way" to do it.