So far, we've looked at how to train models efficiently at scale, how to use capabilities of the hardware such as parallelism to speed up training, and even how to run inference efficiently.
However, all of this has been in the context of so-called "batch learning" (also known as the traditional machine learning setup).
In ML theory, this corresponds to the setting of PAC learning (probably approximately correct learning).
This is the classic ML pipeline setup.
There is typically some fixed dataset $\mathcal{D}$ of labeled training examples $(x,y)$, which we assume is drawn from the same population that we will be using the model to make predictions on.
This dataset is cleaned and preprocessed, then split into a training set, a validation set, and a test set.
An ML model is trained on the training set using some large-scale optimization method (or other method), using the validation set to evaluate and set hyperparameters.
The trained model is then evaluated once on the test set to see if it performs well.
The trained model is possibly compressed for efficient deployment and inference.
Finally, the trained model is deployed and used for whatever task it is intended. Importantly, once deployed, the model does not change!
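To make this pipeline concrete, here is a minimal sketch in Python using scikit-learn; the synthetic dataset, logistic regression model, hyperparameter grid, and split sizes are all placeholder choices for illustration, not part of the setup described above.

```python
# A minimal sketch of the batch-learning pipeline described above.
# Dataset, model, and hyperparameter choices here are placeholders.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression

# Fixed dataset D of labeled examples (x, y), assumed i.i.d. from the population.
X, y = make_classification(n_samples=10_000, n_features=20, random_state=0)

# Split into train / validation / test sets.
X_train, X_tmp, y_train, y_tmp = train_test_split(X, y, test_size=0.4, random_state=0)
X_val, X_test, y_val, y_test = train_test_split(X_tmp, y_tmp, test_size=0.5, random_state=0)

# Train on the training set, using the validation set to pick a hyperparameter.
best_model, best_acc = None, -1.0
for C in [0.01, 0.1, 1.0, 10.0]:
    model = LogisticRegression(C=C, max_iter=1000).fit(X_train, y_train)
    acc = model.score(X_val, y_val)
    if acc > best_acc:
        best_model, best_acc = model, acc

# Evaluate once on the test set; after this, the deployed model never changes.
print("test accuracy:", best_model.score(X_test, y_test))
```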
Formally, in ML theory, PAC learning is defined in terms of whether a problem class/hypothesis class is PAC learnable, i.e. whether it can be learned by some learning algorithm under the PAC learning setup. For classification, this boils down to the claim that for any accuracy $\epsilon > 0$ and failure probability $\delta > 0$, there is a training set size $m(\epsilon, \delta)$ such that, when trained on at least $m(\epsilon, \delta)$ examples drawn i.i.d. from $\mathcal{D}$, the learning algorithm outputs a hypothesis $h$ satisfying $$\Pr\left( \operatorname{error}_{\mathcal{D}}(h) \le \epsilon \right) \ge 1 - \delta.$$
That is, by increasing the size of the training set, we can achieve arbitrarily low error with arbitrarily high probability.
Do we really need to separate learning and inference like we do in the batch learning/PAC learning setting?
(By "online learning" we don't mean taking classes over the internet; it's a technical term for a different learning setup.)
Online learning is a different setting, where learning and inference are interleaved rather than happening in two separate phases.
Online learning on a population with distribution $\mathcal{D}$ works by looping the following steps:
An example $x$ is drawn from the population $\mathcal{D}$ and presented to the learner.
The learner predicts a label $\hat y$ for $x$ using its current model parameters $w_t$.
The learner is given the true label $y$ and can now update its model parameters $w_t$ using $(x,y)$ to produce a new vector of parameters $w_{t+1}$ to be used at the next timestep.
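Here is a minimal sketch of this protocol in Python. The `predict`/`update` interface and the loss function argument are assumptions made just to show the shape of the loop; any concrete learner (such as the online gradient descent learner sketched later) can be plugged in.

```python
# A minimal sketch of the online learning protocol described above.
# `stream` yields (x, y) pairs drawn from the population; `learner` is any
# object with predict(x) and update(x, y) methods (this interface is an
# assumption made for illustration).

def online_learning_loop(learner, stream, loss):
    losses = []
    for x, y in stream:
        y_hat = learner.predict(x)     # the learner predicts using its current w_t
        losses.append(loss(y_hat, y))  # the true label y is revealed; a loss is incurred
        learner.update(x, y)           # the learner updates w_t -> w_{t+1} using (x, y)
    return losses
```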
With the batch learning setup, we often train by minimizing the training loss and evaluate by looking at the validation error/loss and test error/loss. All of these metrics are inherently functions of batches, i.e. bulk datasets. How does this translate to the online learning setting, where we do not have batches?
One way to state the goal of online learning is to minimize the regret.
For any fixed parameter vector $w$ and number of examples $T$, the regret relative to $h_w$ is defined to be the worst-case extra loss incurred by not consistently predicting using the parameters $w$, that is
$$R(w, T) = \sup_{(x_1,y_1), \cdots, (x_T, y_T)} \; \left( \sum_{t=1}^T \ell(\hat y_t, y_t) - \sum_{t=1}^T \ell(h_w(x_t), y_t) \right),$$where $\hat y_t$ is the prediction actually made by the learning algorithm on this dataset at iteration $t$. The regret relative to the entire hypothesis class is the worst-case regret relative to any assignment of the parameters, $$R(T) = \sup_w R(w, T) = \sup_{(x_1,y_1), \cdots, (x_T, y_T)} \; \left( \sum_{t=1}^T \ell(\hat y_t, y_t) - \inf_w \sum_{t=1}^T \ell(h_w(x_t), y_t) \right).$$ You may recognize this last infimum as looking a lot like empirical risk minimization! We can think about the regret as being the amount of extra loss we incur by virtue of learning in an online setting, compared to the training loss we would have incurred from solving the ERM problem exactly in the batch setting.
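As an illustration of this definition, the sketch below computes the regret on one particular sequence (i.e., without the supremum over sequences) for a least-squares problem: the loss the online learner actually incurred, minus the loss of the best fixed $w$ found by solving the ERM problem in hindsight. The linear model and squared loss are assumptions made purely for illustration.

```python
import numpy as np

# Regret on one particular sequence (no sup over sequences): the loss the
# online learner actually incurred, minus the loss of the best *fixed* w
# chosen in hindsight by solving the ERM problem on the whole stream.
# A linear model with squared loss is assumed just for illustration.

def empirical_regret(xs, ys, online_losses):
    X, y = np.asarray(xs), np.asarray(ys)
    # Best fixed parameters in hindsight (exact ERM for least squares).
    w_star, *_ = np.linalg.lstsq(X, y, rcond=None)
    erm_loss = 0.5 * np.sum((X @ w_star - y) ** 2)
    return np.sum(online_losses) - erm_loss
```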
What are some applications where we might want to use an online learning setup rather than a traditional batch learning approach?
One popular algorithm for online learning that is very similar to what we've discussed so far is online gradient descent.
It's exactly what it sounds like.
At each step of the online learning loop, it runs $$w_{t+1} = w_t - \alpha \nabla_{w_t} \ell(h_{w_t}(x_t), y_t).$$
This should be very recognizable as the same type of update loop as we used in SGD!
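Here is a minimal sketch of online gradient descent for a linear model with squared loss (both assumed just for illustration); it exposes the `predict`/`update` interface used in the generic loop sketched earlier.

```python
import numpy as np

# Online gradient descent, as in the update rule above, for a linear model
# with squared loss (both assumed here just for illustration).

class OnlineGradientDescent:
    def __init__(self, dim, alpha):
        self.w = np.zeros(dim)
        self.alpha = alpha

    def predict(self, x):
        return self.w @ x

    def update(self, x, y):
        # Gradient of 0.5 * (w @ x - y)^2 with respect to w.
        grad = (self.w @ x - y) * x
        # w_{t+1} = w_t - alpha * grad
        self.w = self.w - self.alpha * grad
```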
In fact, we can use pretty much the same analysis that we used for SGD to bound the regret.
Assume that each of the functions $f_t(w) = \ell(h_w(x_t), y_t)$ is convex, no matter what example $(x_t, y_t)$ is chosen. We can write this as $$f_t(w) - f_t(v) \le \nabla f_t(w)^T (w - v),$$ which is just a rearrangement of the first-order definition of a convex function, $$f_t(v) \ge f_t(w) + \nabla f_t(w)^T (v - w),$$ which says that the function lies above its tangent line at any point.
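As a quick numerical sanity check of this inequality (a sketch that uses logistic loss on a single example as the convex function $f$, chosen purely for illustration):

```python
import numpy as np

# Quick numerical check of the first-order convexity inequality
# f(v) >= f(w) + grad f(w)^T (v - w), using logistic loss for a single
# example (x, y) as the convex function f (chosen just for illustration).
rng = np.random.default_rng(0)
x, y = rng.normal(size=3), 1.0

def f(w):
    return np.log1p(np.exp(-y * (w @ x)))          # logistic loss at w

def grad_f(w):
    return -y * x / (1.0 + np.exp(y * (w @ x)))    # its gradient in w

for _ in range(5):
    w, v = rng.normal(size=3), rng.normal(size=3)
    assert f(v) >= f(w) + grad_f(w) @ (v - w) - 1e-9
```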
Suppose that we are at the $t$th iteration and the current parameter value is $w_t$. Then for any possible parameter value $\hat w$, the squared distance to $\hat w$ at the next timestep will be
\begin{align*}
\norm{ w_{t+1} - \hat w }^2 &= \norm{ w_t - \alpha \nabla f_t(w_t) - \hat w }^2 \\
&= \norm{ w_t - \hat w }^2 - 2 \alpha (w_t - \hat w)^T \nabla f_t(w_t) + \alpha^2 \norm{ \nabla f_t(w_t) }^2 \\
&\le \norm{ w_t - \hat w }^2 - 2 \alpha \left( f_t(w_t) - f_t(\hat w) \right) + \alpha^2 \norm{ \nabla f_t(w_t) }^2.
\end{align*}
Here, in the last line we applied our inequality from the fact that $f_t$ is convex.
Next, if we sum this up across $T$ total steps, we get
\begin{align*}
\sum_{t=1}^T \norm{ w_{t+1} - \hat w }^2 &\le \sum_{t=1}^T \norm{ w_t - \hat w }^2 - 2 \alpha \sum_{t=1}^T \left( f_t(w_t) - f_t(\hat w) \right) + \alpha^2 \sum_{t=1}^T \norm{ \nabla f_t(w_t) }^2.
\end{align*}
This is equivalent to
\begin{align*}
\norm{ w_{T+1} - \hat w }^2 + \sum_{t=2}^{T} \norm{ w_t - \hat w }^2 &\le \norm{ w_1 - \hat w }^2 + \sum_{t=2}^T \norm{ w_t - \hat w }^2 - 2 \alpha \sum_{t=1}^T \left( f_t(w_t) - f_t(\hat w) \right) + \alpha^2 \sum_{t=1}^T \norm{ \nabla f_t(w_t) }^2.
\end{align*}
Canceling out all but the non-overlapping terms from the first two sums, and rearranging, gives us
\begin{align*}
2 \alpha \sum_{t=1}^T \left( f_t(w_t) - f_t(\hat w) \right) &\le \norm{ w_1 - \hat w }^2 - \norm{ w_{T+1} - \hat w }^2 + \alpha^2 \sum_{t=1}^T \norm{ \nabla f_t(w_t) }^2 \\
&\le \norm{ w_1 - \hat w }^2 + \alpha^2 \sum_{t=1}^T \norm{ \nabla f_t(w_t) }^2.
\end{align*}
And dividing by $2 \alpha$,
$$\sum_{t=1}^T \left( f_t(w_t) - f_t(\hat w) \right) \le \frac{1}{2 \alpha} \norm{ w_1 - \hat w }^2 + \frac{\alpha}{2} \sum_{t=1}^T \norm{ \nabla f_t(w_t) }^2.$$
The sum on the left-hand side is essentially the regret relative to $\hat w$ (it just lacks the worst case over samples), since
\begin{align*}
R(\hat w, T) &= \sup_{(x_1,y_1), \cdots, (x_T, y_T)} \; \left( \sum_{t=1}^T \ell(\hat y_t, y_t) - \sum_{t=1}^T \ell(h_{\hat w}(x_t), y_t) \right) \\
&= \sup_{(x_1,y_1), \cdots, (x_T, y_T)} \; \left( \sum_{t=1}^T \ell(h_{w_t}(x_t), y_t) - \sum_{t=1}^T \ell(h_{\hat w}(x_t), y_t) \right) \\
&= \sup_{(x_1,y_1), \cdots, (x_T, y_T)} \; \sum_{t=1}^T \left( f_t(w_t) - f_t(\hat w) \right).
\end{align*}
And since the above analysis holds for any samples $(x_t,y_t)$, it follows that
$$ R(\hat w, T) \le \frac{\norm{ w_1 - \hat w }^2}{2 \alpha} + \frac{\alpha}{2} \sum_{t=1}^T \norm{ \nabla f_t(w_t) }^2.$$
If we add the additional assumptions that
(1) we are limiting the range of parameter vectors we are learning over to a region of diameter at most $D$, and
(2) the magnitude of the gradients is always at most $G$, i.e. $\norm{\nabla f_t(w)} \le G$,
then we can say something about the regret relative to the entire hypothesis class:
$$R(T) \le \sup_{\hat w} \; R(\hat w, T) \le \frac{D^2}{2 \alpha} + \frac{\alpha}{2} \cdot T \cdot G^2.$$And now if we pick the step size that minimizes this bound,
$$\alpha = \frac{D}{G \cdot \sqrt{T}} = \mathcal{O}\left(\frac{1}{\sqrt{T}}\right),$$then
$$R(T) \le D G \sqrt{T} = \mathcal{O}\left( \sqrt{T} \right).$$In the online setting, regret grows naturally with time in a way that is very different from loss in the batch setting. If we look at the definition of regret, at each timestep it's adding a new component $$ \ell(\hat y_t, y_t) - \ell(h_w(x_t), y_t) $$ which tends to be positive for an optimally-chosen $w$.
For this reason, we can't expect the regret to go to zero as time increases: the regret will be increasing with time, not decreasing.
Instead, for online learning we generally want to get what's called sublinear regret: a regret that grows sublinearly with time.
Equivalently, we can think about situations in which the average regret $$ \frac{R(T)}{T} = \frac{1}{T} \sum_{t=1}^T \ell(\hat y_t, y_t) - \inf_w \frac{1}{T} \sum_{t=1}^T \ell(h_w(x_t), y_t) $$ goes to zero. We can see that this happens in the case of online gradient descent, where $$ R(T) = O(\sqrt{T}) = o(T). $$
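As an empirical sanity check (the synthetic data and parameter choices below are assumptions, not part of the analysis above), the sketch runs online gradient descent with step size proportional to $1/\sqrt{T}$ for several horizons $T$ and prints the average regret against the best fixed $w$ in hindsight; it should shrink toward zero as $T$ grows.

```python
import numpy as np

# Empirical sanity check (synthetic data, assumptions throughout): the average
# regret of online gradient descent should shrink as T grows when alpha ~ 1/sqrt(T).
rng = np.random.default_rng(0)
dim, w_true = 5, np.array([1.0, -2.0, 0.5, 0.0, 3.0])

for T in [100, 1_000, 10_000]:
    X = rng.normal(size=(T, dim))
    y = X @ w_true + 0.1 * rng.normal(size=T)

    alpha = 1.0 / np.sqrt(T)              # step size proportional to 1/sqrt(T)
    w = np.zeros(dim)
    online_loss = 0.0
    for t in range(T):
        y_hat = w @ X[t]
        online_loss += 0.5 * (y_hat - y[t]) ** 2
        w -= alpha * (y_hat - y[t]) * X[t]

    # Best fixed w in hindsight (exact ERM for least squares).
    w_star, *_ = np.linalg.lstsq(X, y, rcond=None)
    erm_loss = 0.5 * np.sum((X @ w_star - y) ** 2)
    print(f"T={T:>6}  average regret ~ {(online_loss - erm_loss) / T:.4f}")
```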
Most of the techniques we've discussed in class are readily applicable to the online learning setting.
If you're interested in more details about how to do this, there are a lot of papers in the literature.
Online variants of SGD remain an area of active research, particularly the question of how we should build end-to-end systems and frameworks to support online learning.
A major application of online learning is dealing with real-time streams of data where we want to simultaneously make predictions in real time and update the model as new labeled examples arrive.
Online learning algorithms let us naturally update our models to "follow" small changes in the data distribution over time.
A classic example of this is spam detection.
Of course, for a spam detection system to be useful for spam filtering, it needs to make predictions about what emails are spam, and it needs to make those predictions in real time.
But it also needs to learn from new spam emails, so that it can quickly adapt to new patterns in spam.
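Here is a hedged sketch of what such a system might look like using scikit-learn's streaming-friendly pieces; the feature hashing, model choice, and labeling flow are illustrative assumptions, not a production design.

```python
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.linear_model import SGDClassifier

# Sketch of an online spam filter: predict on each incoming email in real time,
# then update the model once its true label (spam / not spam) becomes known.
# HashingVectorizer avoids a fixed vocabulary, so newly seen spam words can still
# be featurized without retraining from scratch.
vectorizer = HashingVectorizer(n_features=2**18)
model = SGDClassifier(loss="log_loss")
classes = [0, 1]     # 0 = ham, 1 = spam
seen_first = False   # the model can only predict after at least one update

def handle_email(text, true_label=None):
    global seen_first
    X = vectorizer.transform([text])
    # Predict in real time with the current model parameters.
    prediction = model.predict(X)[0] if seen_first else 0
    # Later, when the label arrives (e.g. the user marks the email as spam),
    # fold it into the model with a single online update.
    if true_label is not None:
        model.partial_fit(X, [true_label], classes=classes)
        seen_first = True
    return prediction
```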
What problems might we encounter when trying to do spam detection in an online learning setting?