"Random reshuffling: Simple analysis with vast improvements." (Mishchenko et al, NeurIPS 2020)
This is all nice...but can we do better than random reshuffling?
"Random Reshuffling is Not Always Better" (De Sa, NeurIPS 2020)
"GraB: Finding Provably Better Data Permutations than Random Reshuffling" (Yucheng Lu, Wentao Guo, and Christopher De Sa, NeurIPS 2022)
"Near-Optimal Herding" (Harvey and Samadi, COLT 2014); "Kernel Thinning" (Dwivedi and Mackey, 2021)
Memory overhead $\mathcal{O}(d)$. Compute overhead $\mathcal{O}(nd)$.
Note: this is not the exact version of GraB presented in our NeurIPS paper, but a later improvement/simplification we developed while parallelizing GraB.
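For intuition, here is a minimal Python sketch of the herding-style sign balancing at the heart of GraB (the class and method names are ours; this illustrates the idea rather than the paper's exact algorithm or the simplified variant mentioned above): each centered gradient greedily gets a $\pm 1$ sign so the signed running sum stays small, then $+1$ examples are replayed first (in order) and $-1$ examples last (in reverse) in the next epoch. Each step costs $\mathcal{O}(d)$ and only the running sum persists, matching the overheads above.

```python
import numpy as np

class GraBReorder:
    """Sketch of GraB-style sign balancing (illustrative, not the paper's exact code)."""

    def __init__(self, d):
        self.running = np.zeros(d)      # signed prefix sum: the O(d) persistent state
        self.front, self.back = [], []  # indices assigned +1 / -1 this epoch

    def observe(self, idx, grad, mean_grad):
        """Assign example `idx` a greedy sign based on its centered gradient."""
        c = grad - mean_grad            # center with a (possibly stale) mean estimate
        # Herding-style greedy choice: pick the sign that keeps the running sum smaller.
        if np.linalg.norm(self.running + c) <= np.linalg.norm(self.running - c):
            self.running += c
            self.front.append(idx)      # +1: run early in the next epoch
        else:
            self.running -= c
            self.back.append(idx)       # -1: run late in the next epoch, reversed

    def next_order(self):
        """Return the permutation for the next epoch and reset the state."""
        order = self.front + self.back[::-1]
        self.front, self.back = [], []
        self.running[:] = 0.0
        return order

# Toy usage: reorder 8 random "gradients" in R^4 for the next epoch.
rng = np.random.default_rng(0)
grads = rng.normal(size=(8, 4))
reorderer = GraBReorder(d=4)
for i, g in enumerate(grads):
    reorderer.observe(i, g, grads.mean(axis=0))
print(reorderer.next_order())
```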
GraB tends to beat random reshuffling when $L^2_{2, \infty} \ll L^2 n$, e.g. when $d \ll n$ or when gradients are sparse.
Problem: the orders GraB produces don't respect any fixed partition of the data, which makes them awkward to use in parallel/distributed training.
...and principled theory can give us insight into the "right way" to do it.