Friday, October 4, 2013

First Algorithms in Online Learning

Considering the framework already discussed in a previous post, I am going to present some basic algorithms in the online learning field, namely Follow the Leader (or Follow the Best Expert), the Halving Algorithm and the Weighted Majority Algorithm.


Follow the Leader Algorithm

A first situation we may think of is a forecaster that has to predict a sequence of numbers inside an interval. One may define the loss $\ell(p_t)$ of the prediction $p_t$ at time step $t$ as the squared distance between the prediction and the real outcome, i.e., $\ell(p_t) = \|p_t - y_t\|^2$, where $y_t$ is the real outcome at time step $t$. Szepesvari et al. define this game as the Shooting Game, where the predictions lie in the unit ball of $\mathbb{R}^d$. I will use this game to present a bound for the first algorithm I am going to discuss, the Follow the Leader algorithm (FTL), or Follow the Best Expert.

FTL is an extremely simple algorithm: the prediction at time step $t$ is the point that minimizes the cumulative loss up to that moment. If we define the best fixed point in hindsight up to time $t$ as $u_t$ (for squared loss, this is the mean of the outcomes seen so far), we can say that the FTL algorithm always predicts, at time step $t$, $u_{t-1}$. As an example, suppose the first three elements of the sequence to be predicted are $\{0.6, 0.4, 0.8\}$. Having seen nothing, FTL predicts $0$ at the first time step, so the loss there is $\|0 - 0.6\|^2 = 0.36$. At the second time step, FTL predicts the mean of the past outcomes, $0.6$, and the loss is $\|0.6 - 0.4\|^2 = 0.04$; at the third, it predicts $\frac{0.6 + 0.4}{2} = 0.5$, and so on...
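Since, for squared loss, the cumulative-loss minimizer is just the running mean of the past outcomes, FTL can be sketched in a few lines (a minimal sketch; the default first prediction of $0$ is a choice, not part of the algorithm):

```python
# A minimal sketch of Follow the Leader for squared loss on an interval.
# For squared loss, the point minimizing the cumulative past loss is the
# mean of the outcomes seen so far; with no data yet, we default to 0.

def ftl_predictions(outcomes):
    """Yield the FTL prediction made before each outcome is revealed."""
    total, count = 0.0, 0
    for y in outcomes:
        pred = total / count if count > 0 else 0.0  # mean of past outcomes
        yield pred
        total += y
        count += 1

seq = [0.6, 0.4, 0.8]
preds = list(ftl_predictions(seq))          # [0.0, 0.6, 0.5]
losses = [(p - y) ** 2 for p, y in zip(preds, seq)]
```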

Despite being extremely simple, both Cesa-Bianchi and Lugosi, and Szepesvari et al. show that FTL has a regret upper bound proportional to $\log n$. This is very good performance, since $\lim_{n\rightarrow \infty} \frac{\log n}{n} = 0$: over time, the average regret per round vanishes.

Halving Algorithm

A different situation we can come up with is one in which you have a set of predictors (or forecasters) and you know, beforehand, that one of them is never wrong. The questions now are how to discover it and how many mistakes you may make along the way. Clearly, there must be something better than FTL here (besides, this problem can be seen as discrete classification). The reasoning resembles binary search; I will describe it for the binary case, but it is the same for more classes. At each time step, count the surviving forecasters that predicted class $1$ and those that predicted the other class, and predict as the majority voted. Two outcomes are possible: (1) your prediction was correct, and you eliminate all the predictors that gave you wrong information; or (2) your prediction was wrong, which is not so good, but since the majority was wrong you eliminated at least half of the predictors you had. This is the Halving algorithm.
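The procedure above can be sketched as follows (a minimal version for binary labels; the expert prediction lists and outcomes are illustrative, with expert $0$ playing the role of the perfect forecaster):

```python
# A minimal sketch of the Halving algorithm, assuming one expert is
# always correct. Each expert is just a precomputed list of 0/1 predictions.

def halving(expert_predictions, outcomes):
    """Run Halving and return the number of mistakes the forecaster makes."""
    alive = set(range(len(expert_predictions)))   # experts not yet eliminated
    mistakes = 0
    for t, y in enumerate(outcomes):
        votes_1 = sum(1 for i in alive if expert_predictions[i][t] == 1)
        pred = 1 if votes_1 > len(alive) - votes_1 else 0  # majority vote
        if pred != y:
            mistakes += 1
        # keep only the experts that were right at this step
        alive = {i for i in alive if expert_predictions[i][t] == y}
    return mistakes

experts = [
    [1, 0, 1, 1],   # expert 0: never wrong on this sequence
    [0, 0, 1, 1],
    [1, 1, 0, 0],
    [0, 1, 0, 1],
]
outcomes = [1, 0, 1, 1]
# halving(experts, outcomes) makes at most log2(4) = 2 mistakes
```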

Now, answering the second question: in the worst case, how many mistakes does the Halving algorithm make? Unlike other proofs of algorithms' bounds in this field, this one is quite intuitive. The worst case is to make a wrong prediction at every step until you discover the 'perfect forecaster', right? Otherwise you would be eliminating forecasters without making a mistake, and that is certainly not the worst case. Additionally, the worst case is not when only one predictor is right and all the others are wrong, but when, at each step, half of the surviving forecasters are wrong and the other half right. This way, as in binary search, half of the set is eliminated at each mistaken iteration, so in the worst case the algorithm makes $\log_{2}{N}$ mistakes, where $N$ is the number of experts.


Weighted Majority Algorithm

A more common situation, however, is not to have a perfect predictor. How to approach this task, then? The idea is very close to the Halving algorithm, but instead of eliminating the forecasters that made a wrong decision, you just reduce your confidence in them and predict based on a weighted majority. This algorithm is called the Weighted Majority Algorithm.

Its idea is to attribute a weight $w_{i,t}$ to each predictor $i$ at time step $t$; initially, all weights are $1$. You then sum the weights of the forecasters that predicted each possibility and choose the prediction backed by the largest total weight. For example, suppose three forecasters $f_1$, $f_2$ and $f_3$ have weights (a sort of confidence in them) equal to $0.6$, $0.3$ and $0.4$, respectively. $f_1$ predicted $A$ while $f_2$ and $f_3$ predicted $B$. Since $0.3 + 0.4 = 0.7 > 0.6$, you predict $B$.

Regarding the update rule, for each predictor we define $w_{i,t} = w_{i, t-1} \cdot \beta^{\mathbb{I}\{f_{i,t} \neq y_t\}}$. Here $\beta \in [0, 1)$ is a parameter, $y_t$ is the correct outcome once revealed, and, most importantly, $\mathbb{I}$ is the indicator function, which returns $1$ when the condition inside it holds and $0$ otherwise. So, in this case, $\beta^{\mathbb{I}\{f_{i,t} \neq y_t\}}$ is $\beta^1 = \beta$ if the predictor was mistaken on that time step, and $\beta^0 = 1$ (no change!) if it was right. What about the number of mistakes?
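Before answering, the weighted vote and the update rule can be sketched together (a minimal version for binary labels; the experts and outcomes below are illustrative):

```python
# A minimal sketch of the Weighted Majority Algorithm for binary labels.
# beta in [0, 1) controls how harshly a mistaken expert is penalized.

def weighted_majority(expert_predictions, outcomes, beta=0.5):
    """Return the number of mistakes of the weighted-majority forecaster."""
    weights = [1.0] * len(expert_predictions)   # w_{i,0} = 1 for every expert
    mistakes = 0
    for t, y in enumerate(outcomes):
        mass_1 = sum(w for w, preds in zip(weights, expert_predictions)
                     if preds[t] == 1)
        mass_0 = sum(weights) - mass_1
        pred = 1 if mass_1 > mass_0 else 0      # weighted majority vote
        if pred != y:
            mistakes += 1
        # multiplicative update: multiply by beta exactly when the expert erred
        weights = [w * (beta if preds[t] != y else 1.0)
                   for w, preds in zip(weights, expert_predictions)]
    return mistakes

experts = [[1, 0, 1, 1], [0, 0, 1, 1], [1, 1, 0, 0], [0, 1, 0, 1]]
outcomes = [1, 0, 1, 1]
```

Notice that with $\beta = 0$ a mistaken expert's weight drops to zero, i.e., it is eliminated, exactly as in Halving.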

Szepesvari, Gyorgy and Pal state that the Weighted Majority Algorithm makes at most
$$
\frac{\ln\left(\frac{1}{\beta}\right)L_n^* + \ln N}{\ln\left(\frac{2}{1+\beta}\right)}
$$
mistakes, where $L_n^*$ is the cumulative loss (number of mistakes) of the best expert and $N$ is the number of experts. I'm not going to present their proof here. Maybe later, in another post.
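To get a feel for the bound, here is a small numeric evaluation; the values of $\beta$, $N$ and $L_n^*$ are arbitrary, chosen only for illustration:

```python
# Evaluate the Weighted Majority mistake bound for illustrative values:
# beta = 0.5, N = 4 experts, best expert loss L_n^* = 2.
import math

def wm_mistake_bound(beta, n_experts, best_loss):
    return ((math.log(1.0 / beta) * best_loss + math.log(n_experts))
            / math.log(2.0 / (1.0 + beta)))

bound = wm_mistake_bound(0.5, 4, 2)   # about 9.64 mistakes at most
```

With a perfect expert ($L_n^* = 0$) and $\beta \to 0$, the bound tends to $\ln N / \ln 2 = \log_2 N$, recovering the Halving guarantee.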

To conclude, notice that if $\beta = 0$, the Weighted Majority Algorithm becomes the Halving Algorithm. In the next post I intend to talk about Exponentially Weighted Average Forecasters, used when we want to predict numerical values that are not necessarily discrete.

------------------
I'm sorry if this post was not so well written. I'm too busy with assignments and exams this week (and the next one).

This discussion is completely based on the lecture notes written by Prof. Csaba Szepesvari, Prof. András György and David Pál. The notation used is also the same. Of course, any mistakes are completely my fault.
