using PyPlot
Let's look at using stochastic gradient descent with various methods to optimize logistic regression. First, we'll generate a training set at random from the generative model associated with logistic regression.
# generate the data
srand(424242)
d = 20;
N = 10000;
wtrue = randn(d);
wtrue = d^2 * wtrue / norm(wtrue);
X = randn(N, d);
X ./= sqrt(sum(X.^2, 2));
Y = (1 ./ (1 + exp(-X * wtrue)) .>= rand(N)) * 2 - 1;
gamma = 0.001;
Let's do logistic regression with regularization here. Our objective samples will be of the form
$$f_i(w) = -\log \left( 1 + \exp(-y_i x_i^T w) \right) + \frac{\gamma}{2} \| w \|^2$$and the SGD updates will look like
$$w_{t+1} = w_t + \alpha_t \left( \frac{y_i x_i}{1 + \exp(y_i x_i^T w_t)} - \gamma w_t \right).$$Let's look at the constants of strong convexity and Lipschitz continuity for this problem, to get a handle on the theory/optimal parameters. If we differentiate the objective twice, we get
$$\nabla^2 f_i(w) = x_i x_i^T \frac{1}{(1 + \exp(y_i x_i^T w_t)) (1 + \exp(-y_i x_i^T w_t))} + \gamma I.$$It's pretty easy to see that
$$0 < \frac{1}{(1 + \exp(u)) (1 + \exp(-u))} \le \frac{1}{4},$$and so since we initialized such that $\| x_i \|^2 = 1$, from the way we generated the examples, we can approximate
$$\gamma I \preceq \nabla^2 f_i(w) \preceq \left(\gamma + \frac{1}{4} \right) I.$$So we can set $\mu = \gamma$ and $L = \gamma + \frac{1}{4}$. What about bounding the variance of the gradient samples? (Again here I'm using the nonstandard definition of variance for vectors: $\mathbf{Var}(X) = \mathbf{E}[\| X \|^2] - \| \mathbf{E}[ X ] \|^2$.) Well,
\begin{align*} \mathbf{Var}(\nabla f_i(w)) &= \mathbf{Var}\left( \frac{y_i x_i}{1 + \exp(y_i x_i^T w)} - \gamma w \right) \\ &= \mathbf{Var}\left( \frac{y_i x_i}{1 + \exp(y_i x_i^T w)} \right) \\ &\le \mathbf{E}\left[ \left\| \frac{y_i x_i}{1 + \exp(y_i x_i^T w)} \right\|^2 \right] \\ &\le \mathbf{E}\left[ \left\| x_i \right\|^2 \right] \\ &\le 1 \end{align*}where this last line happens because we sampled $x_i$ uniformly from the unit ball. So we can set $\sigma = 1$.
mu = gamma;
L = gamma + 0.25;
sigma = 1;
What is the optimal step size for SGD under these conditions? Well, from Lecture 2, we had $$\alpha_t = \frac{2 \mu \| w_0 - w^* \|^2}{4 M + \mu^2 \| w_0 - w^* \|^2 t}$$ or $$\alpha_t = \frac{\alpha_0}{1 + \gamma t}$$ where $$\alpha_0 = \frac{2 \mu \| w_0 - w^* \|^2}{4 M}$$ and $$\gamma = \frac{\mu^2 \| w_0 - w^* \|^2}{4 M}.$$
w0 = randn(d);
function sgd_logreg(w0, alpha0, gamma, X, Y, gamma, niters, wopt)
w = w0
(N, d) = size(X)
dist_to_optimum = zeros(niters)
for k = 1:niters
alpha = alpha0 / (1 + gamma * (k-1));
i = rand(1:N)
xi = X[i,:];
yi = Y[i];
w = (1 - alpha * gamma) * w + alpha * xi * yi / (1 + exp(yi * dot(xi, w)));
dist_to_optimum[k] = norm(w - wopt);
end
return (w, dist_to_optimum);
end
# find the true minimum
function newton_logreg(w0, X, Y, gamma, niters)
N = size(X, 1);
d = size(X, 2);
w = w0;
for k = 1:niters
g = -X' * (Y ./ (1 + exp(Y .* (X * w)))) + N * gamma * w;
H = X' * ((1 ./ ((1 + exp(Y .* (X * w))) .* (1 + exp(-Y .* (X * w))))) .* X) + N * gamma * eye(d);
w = w - H \ g;
println("gradient norm: $(norm(g))")
end
return w
end
wopt = newton_logreg(wtrue, X, Y, gamma, 10);
alpha0 = 2 * mu * norm(w0 - wopt)^2 / (4 * M);
gamma = mu^2 * norm(w0 - wopt)^2 / (4 * M);
srand(123456);
(w, dto) = sgd_logreg(w0, alpha0, gamma, X, Y, gamma, 50000, wopt);
plot(dto)
xlabel("iteration");
ylabel("distance to optimum");
srand(123456);
(w2, dto2) = sgd_logreg(w0, 2*alpha0, 4*gamma, X, Y, gamma, 50000, wopt);
srand(123456);
(w2, dto3) = sgd_logreg(w0, 3*alpha0, 9*gamma, X, Y, gamma, 50000, wopt);
semilogy(dto; label = "optimal step size")
semilogy(dto2; label = "2x optimal")
semilogy(dto3; label = "3x optimal")
xlabel("iteration");
ylabel("distance to optimum");
legend();
What is the best assignment of the step size after 20000 iterations?
etas = exp(collect(-1:0.05:3));
dists = [ sgd_logreg(w0, eta*alpha0, eta^2*gamma, X, Y, gamma, 20000, wopt)[2][end] for eta in etas ];
loglog(etas, dists);
scatter(etas[21], dists[21]);
imin = indmin(dists);
scatter(etas[imin], dists[imin]; color="red");
xlabel("step size scaled by");
ylabel("distance to optimum after 20000 steps");
Takeaway: the theory gave us something good, but not the best.