using PyPlot
A simple linear model for multiclass classification. For $W \in \mathbb{R}^{c \times d}$,
$$h_W(x) = \arg \max_{i \in \{1, \ldots, c\}} e_i^T W x$$
where $e_i$ is the $i$th standard basis vector, so $e_i^T W x$ is the score the model assigns to class $i$.
# number of classes
c = 64;
# dimension of model
d = 256 * 1024;
# number of test examples
n = 1024;
# random weight matrix
w = randn(c, d);
# test examples
x = randn(d, n);
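function predict(w::Array{Float64,2}, x::Array{Float64,2})
    # scores for every class (rows) and every example (columns)
    wx = w*x;
    # index of the highest-scoring class for each example
    return [findmax(wx[:,i])[2] for i = 1:size(x,2)];
end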
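As a quick sanity check, the sketch below applies the same argmax rule to a tiny hand-built case with $c = 3$ and $d = 2$. The names `predict_small`, `w_small`, and `x_small` are illustrative only and are not used elsewhere in this notebook.

```julia
# Same decision rule as predict, restated so this cell runs on its own.
function predict_small(w::Matrix{Float64}, x::Matrix{Float64})
    wx = w * x                                   # c × n matrix of class scores
    return [argmax(wx[:, i]) for i in 1:size(x, 2)]
end

# Three classes, two features: class 1 favors the first feature,
# class 2 the second, class 3 favors negative values of both.
w_small = [ 1.0  0.0;
            0.0  1.0;
           -1.0 -1.0]

# Three examples, one per column.
x_small = [2.0 0.0 -1.0;
           0.0 3.0 -1.0]

labels = predict_small(w_small, x_small)
println(labels)
```

Each column of `x_small` was chosen so that a different row of `w_small` gives the largest score, so the three examples should land in three different classes.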
Suppose we want to make a prediction for all $n = 1024$ test examples. How long will this take us?
# first run of predictions to make sure everything is compiled
predict(w, x);
# now measure how long prediction takes
(predictions, elapsed) = @timed predict(w, x);
println("latency: took $elapsed seconds to return predictions");
println("throughput: on average, $(n/(elapsed)) predictions/second");
Now what if we make a prediction for just one isolated example?
(predictions, elapsed) = @timed predict(w, x[:,1:1]);
println("latency: took $elapsed seconds to return predictions");
println("throughput: on average, $(1/(elapsed)) predictions/second");
What happened?
The latency went down, but so did the throughput.
This exposes a tradeoff: if we can batch the examples to be inferred, we can usually raise the throughput, but this comes at the cost of higher latency, because no prediction is returned until the whole batch has been processed.
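One plausible way to see why batching helps is a back-of-the-envelope count of arithmetic versus memory traffic. The sketch below is a rough cost model, not a measurement: it assumes each of `w`, `x`, and the product `w*x` moves through memory exactly once, which ignores caching and whatever blocking the underlying matrix-multiply routine does. The function names `flops`, `bytes_moved`, and `intensity` are made up for this illustration.

```julia
# Rough cost model for one call to predict with batch size b (Float64 = 8 bytes).
# Assumption: w, x, and w*x are each moved through memory exactly once.
c, d = 64, 256 * 1024

flops(b) = 2 * c * d * b                        # one multiply and one add per (class, feature, example)
bytes_moved(b) = 8 * (c * d + d * b + c * b)    # read w, read x, write w*x
intensity(b) = flops(b) / bytes_moved(b)        # arithmetic per byte of traffic

println("weight matrix size: ", 8 * c * d / 2^20, " MiB")
for b in (1, 32, 1024)
    println("batch $b: ", round(intensity(b), digits = 2), " flops/byte")
end
```

Under these assumptions the 128 MiB weight matrix is read once per call regardless of batch size, so a batch of one does roughly 0.25 flops per byte moved while a batch of 1024 does about 15. The larger batch amortizes the cost of reading `w` over more examples, which is consistent with the higher throughput observed above.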
batch_size = [1,2,4,8,16,32,64,128,256,512,1024];
latencies = zeros(length(batch_size));
throughputs = zeros(length(batch_size));
for i = 1:length(batch_size)
    # take the first batch_size[i] examples as one batch
    xsi = copy(x[:,1:batch_size[i]]);
    # time a single prediction call on that batch
    (predictions, elapsed) = @timed predict(w, xsi);
    latencies[i] = elapsed;
    throughputs[i] = batch_size[i] / elapsed;
end
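The loop above times each batch size once, so individual points can be noisy. A minimal sketch of one way to reduce that noise is to repeat each measurement and keep the fastest run. The helper `best_time` and its `reps` keyword are hypothetical names introduced here; only `@elapsed` comes from Julia itself.

```julia
# Hypothetical helper: call f() `reps` times and keep the fastest wall-clock time.
function best_time(f; reps::Int = 5)
    best = Inf
    for _ in 1:reps
        t = @elapsed f()
        best = min(best, t)
    end
    return best
end

# Small stand-in workload so this cell runs on its own; in the sweep above
# one would pass () -> predict(w, xsi) instead.
a = randn(64, 512)
b = randn(512, 16)
t = best_time(() -> a * b, reps = 3)
println("fastest of 3 runs: $t seconds")
```

Taking the minimum rather than the mean is a design choice: it discards runs slowed down by unrelated activity on the machine, at the price of reporting a best case rather than typical behavior.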
loglog(batch_size, latencies, "-o");
title("Latency as Batch Size Changes");
xlabel("batch size");
ylabel("latency (seconds before first prediction)");
# start a new figure so the throughput curve is not drawn over the latency plot
figure();
loglog(batch_size, throughputs, "-^r");
title("Throughput as Batch Size Changes");
xlabel("batch size");
ylabel("throughput (predictions/second)");