In [1]:
using PyPlot
using LinearAlgebra
using Statistics
using Random
import Base.MathConstants.e

Hyperparameter Optimization

Let's look at using stochastic gradient descent with various methods to optimize logistic regression. First, we'll generate a training set at random from the generative model associated with logistic regression. This generative model is, for label $y \in \{-1, 1\}$, features $x \in \mathbb{R}^d$, and model $w \in \mathbb{R}^d$,

$$P_w(y \mid x) = \frac{1}{1 + \exp(-y x^T w)}.$$

This means that if we make a bunch of independent observations, the total probability is

$$p(w) = \prod_{i=1}^N \frac{1}{1 + \exp(-y_i x_i^T w)}$$

and so maximizing this is equivalent to maximizing the log likelihood

$$\log p(w) = -\sum_{i=1}^N \log\left(1 + \exp(-y_i x_i^T w)\right).$$

The gradient of this is

$$\nabla \log p(w) = \sum_{i=1}^N \frac{\exp(-y_i x_i^T w) \, (y_i x_i)}{1 + \exp(-y_i x_i^T w)}$$

which reduces to

$$\nabla \log p(w) = \sum_{i=1}^N \frac{y_i x_i}{1 + \exp(y_i x_i^T w)}.$$

Anyway, we can see that this corresponds to logistic regression.
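As a quick sanity check on this derivation, here is a small finite-difference test of one term of the gradient (this sketch is our own addition, not part of the lecture code; the helper `finite_diff_grad` and the test variables `xcheck`, `ycheck`, `wcheck` are made up for illustration, and it reuses the `LinearAlgebra` import from the first cell).

# finite-difference check of one term of the gradient derived above
# (illustrative sketch; not part of the original notebook)
function finite_diff_grad(f, w; h = 1e-6)
    g = zeros(length(w))
    for j = 1:length(w)
        ej = zeros(length(w)); ej[j] = 1.0;
        g[j] = (f(w + h * ej) - f(w - h * ej)) / (2h)
    end
    return g
end

xcheck = randn(5); xcheck /= norm(xcheck); ycheck = 1.0; wcheck = randn(5);
loglik(w) = -log(1 + exp(-ycheck * dot(xcheck, w)))                        # log P_w(y | x) for one example
analytic = ycheck * xcheck / (1 + exp(ycheck * dot(xcheck, wcheck)))       # one term of the sum above
println(norm(analytic - finite_diff_grad(loglik, wcheck)))                 # should be tiny (~1e-10)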

In [2]:
# generate the data
Random.seed!(424242)
d = 20;
N = 10000;
wtrue = randn(d);
wtrue = d^2 * wtrue / norm(wtrue);       # scale the true model so that norm(wtrue) = d^2
X = randn(N, d);
X ./= sqrt.(sum(X.^2; dims=2));          # normalize each example to unit norm
Y = (1 ./ (1 .+ exp.(-X * wtrue)) .>= rand(N)) .* 2 .- 1;    # sample labels from the logistic model
sigma = 0.001;                           # regularization parameter

Let's do logistic regression with regularization here. Our objective samples will be of the form

$$f_i(w) = \log\left(1 + \exp(-y_i x_i^T w)\right) + \frac{\sigma}{2} \|w\|^2$$

and the SGD updates will look like

$$w_{t+1} = w_t + \alpha_t \left( \frac{y_i x_i}{1 + \exp(y_i x_i^T w_t)} - \sigma w_t \right).$$

Let's look at the constants of strong convexity and Lipschitz continuity for this problem, to get a handle on the theory/optimal parameters. If we differentiate the objective twice, we get

$$\nabla^2 f_i(w) = x_i x_i^T \cdot \frac{1}{\left(1 + \exp(y_i x_i^T w)\right)\left(1 + \exp(-y_i x_i^T w)\right)} + \sigma I.$$

It's pretty easy to see that

$$0 < \frac{1}{(1 + \exp(u))(1 + \exp(-u))} \le \frac{1}{4},$$

and so since we initialized such that xi2=1, from the way we generated the examples, we can approximate

σI2fi(w)(σ+14)I.

So we can set $\mu = \sigma$ and $L = \sigma + \frac{1}{4}$. What about bounding the variance of the gradient samples? (Again, here I'm using the nonstandard definition of variance for vectors: $\mathbf{Var}(X) = \mathbf{E}[\|X\|^2] - \|\mathbf{E}[X]\|^2$.) Well,

$$\mathbf{Var}\left( \nabla f_i(w) \right) = \mathbf{Var}\left( \sigma w - \frac{y_i x_i}{1 + \exp(y_i x_i^T w)} \right) = \mathbf{Var}\left( \frac{y_i x_i}{1 + \exp(y_i x_i^T w)} \right) \le \mathbf{E}\left[ \left\| \frac{y_i x_i}{1 + \exp(y_i x_i^T w)} \right\|^2 \right] \le \mathbf{E}\left[ \|x_i\|^2 \right] = 1,$$

where the second equality holds because the regularization term $\sigma w$ does not depend on which example we sample, and the last step holds because we normalized each $x_i$ to have unit norm. So we can set $M = 1$.
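As an empirical spot-check of these bounds (a sketch of our own that reuses `d`, `N`, `X`, `Y`, and `sigma` from the cells above), we can look at the eigenvalues of one sample Hessian and at the norm of one sample gradient:

# spot-check the curvature bounds and the gradient-norm bound used for M
# (illustrative sketch; reuses d, N, X, Y, and sigma defined above)
let w = randn(d), i = rand(1:N)
    xi = X[i, :]; yi = Y[i];
    s = 1 / ((1 + exp(yi * dot(xi, w))) * (1 + exp(-yi * dot(xi, w))))     # in (0, 1/4]
    H = xi * xi' * s + sigma * I                                           # Hessian of f_i at w
    println("Hessian eigenvalue range: ", extrema(eigvals(Symmetric(H))))  # within [sigma, sigma + 1/4]
    g = yi * xi / (1 + exp(yi * dot(xi, w)))                               # data part of the gradient
    println("sample gradient norm: ", norm(g))                             # at most ||x_i|| = 1
end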

In [3]:
mu = sigma;
L = sigma + 0.25;
M = 1;

What is the optimal step size for SGD under these conditions? Well, from Lecture 2, we had $$\alpha_t = \frac{2 \mu \|w_0 - w^*\|^2}{4 M + \mu^2 \|w_0 - w^*\|^2 t}$$ or, equivalently, $\alpha_t = \frac{\alpha_0}{1 + \gamma t}$ where $\alpha_0 = \frac{2 \mu \|w_0 - w^*\|^2}{4 M}$ and $\gamma = \frac{\mu^2 \|w_0 - w^*\|^2}{4 M}$.
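These are just two ways of writing the same schedule; here is a quick numerical check (a sketch of our own, with a made-up placeholder `D2` standing in for $\|w_0 - w^*\|^2$, since we haven't computed $w^*$ yet; `mu` and `M` come from the cell above):

# check that the two step-size expressions agree
# (D2 is a made-up placeholder for ||w0 - w*||^2)
let D2 = 4.0, t = 100
    direct = 2 * mu * D2 / (4 * M + mu^2 * D2 * t)
    a0 = 2 * mu * D2 / (4 * M)
    g0 = mu^2 * D2 / (4 * M)
    println(direct ≈ a0 / (1 + g0 * t))   # prints true
end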

In [4]:
w0 = randn(d);
In [5]:
function sgd_logreg(w0, alpha0, gamma, X, Y, sigma, niters, wopt)
    w = w0
    (N, d) = size(X)
    dist_to_optimum = zeros(niters)
    for k = 1:niters
        alpha = alpha0 / (1 + gamma * (k-1));   # decreasing step-size schedule
        i = rand(1:N)                           # sample a training example uniformly at random
        xi = X[i,:];
        yi = Y[i];
        # SGD step on the regularized logistic loss f_i
        w = (1 - alpha * sigma) * w + alpha * xi * yi / (1 + exp(yi * dot(xi, w)));
        dist_to_optimum[k] = norm(w - wopt);
    end
    return (w, dist_to_optimum);
end
Out[5]:
sgd_logreg (generic function with 1 method)
In [6]:
# find the true minimum
function newton_logreg(w0, X, Y, sigma, niters)
    N = size(X, 1);
    d = size(X, 2);
    w = w0;
    for k = 1:niters
        # gradient and Hessian of the full regularized objective (summed over all N examples)
        g = -X' * (Y ./ (1 .+ exp.(Y .* (X * w)))) + N * sigma * w;
        H = X' * ((1 ./ ((1 .+ exp.(Y .* (X * w))) .* (1 .+ exp.(-Y .* (X * w))))) .* X) + N * sigma * I;
        w = w - H \ g;    # Newton step
        println("gradient norm: $(norm(g))")
    end
    return w
end
Out[6]:
newton_logreg (generic function with 1 method)
In [7]:
wopt = newton_logreg(wtrue, X, Y, sigma, 10);
gradient norm: 4000.02719781973
gradient norm: 902.6268327267067
gradient norm: 232.93930636601334
gradient norm: 58.80808542002503
gradient norm: 5.867449489793506
gradient norm: 0.06735352427339845
gradient norm: 9.014551783823106e-6
gradient norm: 1.8960945153068899e-13
gradient norm: 4.982514695141926e-14
gradient norm: 5.1349465193916404e-14
In [8]:
alpha0 = 2 * mu * norm(w0 - wopt)^2 / (4 * M);
gamma = mu^2 * norm(w0 - wopt)^2 / (4 * M);
In [9]:
Random.seed!(123456);
(w, dto) = sgd_logreg(w0, alpha0, gamma, X, Y, sigma, 50000, wopt);
In [10]:
plot(dto)
xlabel("iteration");
ylabel("distance to optimum");

Now let's try some different values of alpha and gamma.

In [11]:
Random.seed!(123456);
(w2, dto2) = sgd_logreg(w0, 2*alpha0, 4*gamma, X, Y, sigma, 50000, wopt);
Random.seed!(123456);
(w2, dto3) = sgd_logreg(w0, 3*alpha0, 9*gamma, X, Y, sigma, 50000, wopt);
In [12]:
semilogy(dto; label = "optimal step size")
semilogy(dto2; label = "2x optimal")
semilogy(dto3; label = "3x optimal")
xlabel("iteration");
ylabel("distance to optimum");
legend();

Which setting of the step size gives the best result after 20000 iterations?

In [13]:
## do not re-run; takes too long
etas = exp.(collect(-1:0.05:3));
dists = [ mean([sgd_logreg(w0, eta*alpha0, eta^2*gamma, X, Y, sigma, 20000, wopt)[2][end] for i=1:100]) for eta in etas ];
In [14]:
loglog(etas, dists);
scatter(etas[21], dists[21]);                    # etas[21] == 1: the theoretically optimal step size
imin = argmin(dists);
scatter(etas[imin], dists[imin]; color="red");   # best step size found by the sweep
xlabel("step size scaling factor");
ylabel("distance to optimum after 20000 steps");

Takeaway: the theory gave us something good, but not the best.
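To put a number on that, we can compare the distance reached with the theoretical step size (eta = 1) against the best scaling found by the sweep (a short snippet assuming `etas`, `dists`, and `imin` from the cells above are still in scope):

# compare the theoretical step size (eta = 1) against the best scaling from the sweep
println("theory (eta = 1.0): ", dists[21])
println("best   (eta = ", round(etas[imin]; digits=2), "): ", dists[imin])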

What about other algorithms? Maybe if we ran SVRG, the theoretically optimal parameters would be correct.

In [15]:
function svrg(w0, alpha, X, Y, sigma, niters, nepochs, wopt)
    w = w0
    (N, d) = size(X)
    dist_to_optimum = zeros(niters * nepochs)
    for epi = 1:nepochs
        # snapshot the iterate and compute the full (negative) gradient at the snapshot
        wtilde = w;
        gtilde = X' * (Y ./ (1 .+ exp.(Y .* (X * wtilde)))) / N - sigma * wtilde;
        for k = 1:niters
            i = rand(1:N)                        # sample a training example uniformly at random
            xi = X[i,:];
            yi = Y[i];
            # variance-reduced step: stochastic gradient at w, corrected by the snapshot gradient
            w = w + alpha * (xi * yi / (1 + exp(yi * dot(xi, w))) - sigma * w - xi * yi / (1 + exp(yi * dot(xi, wtilde))) + sigma * wtilde + gtilde);
            dist_to_optimum[k + (epi-1)*niters] = norm(w - wopt);
        end
    end
    return (w, dist_to_optimum);
end
Out[15]:
svrg (generic function with 1 method)

The optimal step size from the analysis in the SVRG paper (assuming a contraction factor of $e^{-1}$ at each outer epoch) is $\alpha = \frac{1}{4L(e+1)}$, and the optimal epoch length is $T = \frac{8 L e (e+1)}{\mu}$.
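We can sanity-check these constants against the rate bound they come from. Assuming the usual SVRG per-epoch bound $\rho = \frac{1}{\mu \alpha (1 - 2\alpha L) T} + \frac{2 \alpha L}{1 - 2 \alpha L}$ (this formula is our addition, stated here only for the check), plugging in these values gives exactly $e^{-1}$:

# plug the stated constants into the (assumed) SVRG rate bound and confirm rho = 1/e
let a = 1 / (4 * L * (e + 1)), T_ = 8 * L * e * (e + 1) / mu
    rho = 1 / (mu * a * (1 - 2 * a * L) * T_) + 2 * a * L / (1 - 2 * a * L)
    println("rho = $rho, 1/e = $(1/e)")   # these should match
end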

In [16]:
alpha = 1 / (4 * L * (e+1));
T = Int64(ceil(8 * L * e * (e+1) / mu));
K = 10;
In [17]:
time1 = @timed (w, dto_svrg) = svrg(w0, alpha, X, Y, sigma, T, K, wopt);
In [18]:
semilogy(dto_svrg);
xlabel("iterations");
ylabel("distance to optimum");
title("Convergence of SVRG with Standard Parameters");

Let's see what happens when we try a smaller step size.

In [19]:
time2 = @timed (w2, dto_svrg2) = svrg(w0, alpha / 5, X, Y, sigma, Int64(ceil(T)), K, wopt);
In [20]:
semilogy(dto_svrg; label = "optimal", color = "blue");
semilogy(dto_svrg2; label = "20% step size", color = "red");
xlabel("inner iteration")
ylabel("distance to optimum")
legend();
In [21]:
semilogy(collect(1:length(dto_svrg)) / length(dto_svrg) * time1[2], dto_svrg; label = "optimal", color = "blue");
semilogy(collect(1:length(dto_svrg2)) / length(dto_svrg2) * time2[2], dto_svrg2; label = "20% step size", color = "red");
xlabel("wall clock time (seconds)")
ylabel("distance to optimum")
legend();

Now, what if we also adjust the epoch length to be smaller?

In [22]:
Random.seed!(123456);
time3 = @timed (w3, dto_svrg3) = svrg(w0, alpha / 2, X, Y, sigma, Int64(ceil(T / 2)), K, wopt);
In [23]:
semilogy(dto_svrg; label = "optimal", color = "blue");
semilogy(dto_svrg2; label = "20% step size", color = "red");
semilogy(dto_svrg3; label = "50% epoch, 50% step size", color = "green");
xlabel("inner iteration")
ylabel("distance to optimum")
legend();
In [24]:
semilogy(collect(1:length(dto_svrg)) / length(dto_svrg) * time1[2], dto_svrg; label = "optimal", color = "blue");
semilogy(collect(1:length(dto_svrg2)) / length(dto_svrg2) * time2[2], dto_svrg2; label = "20% step size", color = "red");
semilogy(collect(1:length(dto_svrg3)) / length(dto_svrg3) * time3[2], dto_svrg3; label = "50% epoch, 50% step size", color = "green");
xlabel("wall clock time (seconds)")
ylabel("distance to optimum")
legend();

Take-away: we can often do better than the simple theoretical recipe!


Can you do better?

In [27]:
alpha_class = alpha / 2;
T_class = Int64(ceil(T / 2));
K_class = Int64(ceil(K));

Random.seed!(123456);
time_class = @timed (w_class, dto_svrg_class) = svrg(w0, alpha_class, X, Y, sigma, T_class, K_class, wopt);

semilogy(collect(1:length(dto_svrg)) / length(dto_svrg) * time1[2], dto_svrg; label = "theoretical baseline", color = "blue");
semilogy(collect(1:length(dto_svrg3)) / length(dto_svrg3) * time3[2], dto_svrg3; label = "starting point", color = "green");
semilogy(collect(1:length(dto_svrg_class)) / length(dto_svrg_class) * time_class[2], dto_svrg_class; label = "your suggestion", color = "purple");
xlabel("wall clock time (seconds)")
ylabel("distance to optimum")
legend();