Sunday, April 27, 2014

Why Do Research on Games?

After a long time, here I go again... I gave up promising to write posts on a regular basis because there will always be something else to worry about. I hope not to give up posting altogether... To "re-start slow", I will not discuss anything specific to Machine Learning, with equations and so on. Today I will try to answer a simple question: why do research in games?

------

My Master's thesis was about player modelling in the game Civilization IV (I will write a post about it someday). Basically, one wants to learn human players' preferences in the game from their behaviour. The benefits of player modelling in a game are obvious, ranging from predicting when a player will quit to customizing the game to maximize players' entertainment. However, in the early days of my research, I was not sure whether this work was relevant in a greater sense. Is research in games "just" about making games better? When I started to look for an answer, I quickly realized it goes beyond that. Since this was not obvious to me at the time, and I know some researchers who think this line of work is "silly", I will try to explain why it is relevant to do research in games.

First, as I state on my homepage, multiagent systems have several applications, such as auctions, negotiation, security and surveillance. Such approaches deal with uncertainty about the true state of the world, uncertainty about future events and uncertainty in the decisions made by other agents. It is particularly difficult to evaluate proposed ideas in these settings, as full-scale deployment and testing is costly. Games, however, offer a self-contained domain with an easier path to testing and evaluation, where all the rules of the environment are known, success and failure are well defined, and so on.

Fairclough et al. [2001], for example, argue that "computer games offer an accessible platform upon which serious cognitive research can be engaged" [1], while Laird and van Lent [2001] suggest that computer games are the perfect platform to pursue research into human-level AI [2].

There are also several examples of techniques developed in the games domain that were later applied to different problems. Probably the most famous success of AI in games was Deep Blue, the first computer to beat the reigning Chess World Champion (Garry Kasparov). Later, "in 1999, IBM developed the Deep Computing Institute built on its experience with Deep Blue. The institute aimed to explore and solve large, complex technological problems through deep computing—using large-scale advanced computational methods applied to large data sets. Deep computing provides business decision makers with the ability to analyze and develop solutions to these very difficult problems." This approach has since been applied to data mining, financial modeling, molecular dynamics, etc. [3]. Another example in the realm of board games is the set of techniques developed to solve Checkers [4], which were later used by a bioinformatics company for biological computations.

We can say the same about techniques developed to build AI for Poker. Two examples are (1) Counterfactual Regret Minimization [5], a technique for computing a Nash equilibrium that was later used in the design of diabetes treatment policies [6]; and (2) DIVAT [7] and Imaginary Observations [8], evaluation metrics created for Poker bots that were recently used to study online gambling [9]. These techniques were used to model psychological profiles of gamblers, showing that those who play online have different perceptions of their own skill (they tend to be optimistic).

In summary, the take-home message is that doing research in games is relevant, cool and fun! One uses games to evaluate ideas that may later be used in other domains. It is not just about "making games better". There are certainly many other cases where techniques developed for games were later applied to "real-world problems"; I hope this small selection was enough to convince you of the relevance of such research. In fact, if you know of any other case like the ones listed here, please let me know. Post it in the comments and I will be very thankful!



References

[1] Fairclough, C., Fagan, M., Mac Namee, B., and Cunningham, P. (2001). Research Directions for AI in Computer Games. In Proceedings of the 12th Irish Conference on Artificial Intelligence and Cognitive Science, AICS, pages 333–344.

[2] Laird, J. E. and van Lent, M. (2001). Human-Level AI’s Killer Application. AI Magazine, 22(2):15–26.
 
[3] IBM. Deep Blue - Transforming the World. Available at: http://www-03.ibm.com/ibm/history/ibm100/us/en/icons/deepblue/transform/ Last access: April 27, 2014.

[4] Jonathan Schaeffer, Neil Burch, Yngvi Björnsson, Akihiro Kishimoto, Martin Müller, Robert Lake, Paul Lu, and Steven Sutphen. (2007). Checkers Is Solved. Science, 317(5844):1518–1522.

[5] Martin Zinkevich, Michael Johanson, Michael Bowling, and Carmelo Piccione. (2008). Regret Minimization in Games with Incomplete Information. In NIPS.

[6] Katherine Chen, and Michael Bowling. (2012). Tractable Objectives for Robust Policy Optimization, In NIPS.

[7] Martin Zinkevich, Michael H. Bowling, Nolan Bard, Morgan Kan, Darse Billings. (2006). Optimal Unbiased Estimators for Evaluating Agent Performance. In AAAI, pages 573-579.

[8] Michael H. Bowling, Michael Johanson, Neil Burch, Duane Szafron. (2008). Strategy evaluation in extensive games with importance sampling. In ICML, pages 72-79.

[9] T. L. MacKay, Nolan Bard, Michael Bowling, D. C. Hodgins. (2014). Do poker players know how good they are? Accuracy of poker skill estimation in online and offline players. Computers in Human Behavior, 31, pages 419-424.

Wednesday, November 13, 2013

Brazilian National League

Hi! 

I'm aware that I'm not posting as frequently as I should, but the end of the term is a little bit tricky! Nonetheless, even though this is not a soccer blog, I want to share that the soccer team I support in Brazil just won the National League \o/ We now hold three Brazilian National League titles: 1966, 2003 and 2013!


I swear that my next post will be about research!

Friday, November 8, 2013

Remarkable Brazilians

Trying not to post only ML-related material, today I am going to discuss a different topic: remarkable Brazilians. Why? Well, because Brazil is a country well known for its beaches, its soccer, its volleyball and its women, but little else. In fact, most people have only heard of the two largest cities in the country, Rio de Janeiro and São Paulo (neither of which is Brazil's capital!). This post goes in the opposite direction. I am not going to list athletes or Brazilian top models, but remarkable Brazilians in another sense, most of them related to science or the arts.

Before someone starts to complain: this is not a ranking. I listed them by date of birth. As you will notice, I also decided not to list people who are still alive; it is a safe way to ensure that nothing I wrote here will change. Most of the information presented here is from Wikipedia.

-- 
Dom Pedro II (7 April 1831 – 15 November 1889): This man was the last Brazilian Emperor (the monarchy ended while he was alive). Despite being bound to the monarchy, he was a man far ahead of his time. Leandro Narloch, in his book Guia Politicamente Incorreto da Historia do Brasil, says that Dom Pedro II once stated that his dream was to be the first president of Brazil, abolishing the monarchy. He didn't abolish it for fear of a military coup (which eventually happened and expelled him from Brazil), and also because he was trying to preserve the Brazilian territory (all the other South American countries that became democracies too soon are much smaller). In fact, his reign was much "lighter" than other monarchies: he was against press censorship, accepting criticism that would be considered treason in other countries. Another of his modern political views is related to religion. Narloch says that he was not worried about the different religions in Brazil, believing that everybody should be free to follow their own. While he was on the throne, during one of his absences, his daughter Princess Isabel abolished slavery in Brazil. Narloch also says that, when expelled, Dom Pedro II filled a pillow with Brazilian soil, so he could always be close to his country.

To finish this brief description of the last Brazilian emperor, I quote a paragraph from Wikipedia (I removed the references): Pedro II's erudition amazed Friedrich Nietzsche when the two met. Victor Hugo told the Emperor: "Sire, you are a great citizen, you are the grandson of Marcus Aurelius", and Alexandre Herculano called him: "A Prince whom the general opinion holds as the foremost of his era because of his gifted mind, and due to the constant application of that gift to the sciences and culture." He became a member of the Royal Society, the Russian Academy of Sciences, the Royal Academies for Science and the Arts of Belgium and the American Geographical Society. In 1875, he was elected to the French Academy of Sciences, an honor previously granted to only two other heads of state: Peter the Great and Napoleon Bonaparte. Pedro II exchanged letters with scientists, philosophers, musicians and other intellectuals. Many of his correspondents became his friends, including Richard Wagner, Louis Pasteur, Louis Agassiz, John Greenleaf Whittier, Michel Eugène Chevreul, Alexander Graham Bell, Henry Wadsworth Longfellow, Arthur de Gobineau, Frédéric Mistral, Alessandro Manzoni, Alexandre Herculano, Camilo Castelo Branco and James Cooley Fletcher.

--
Machado de Assis (21 June 1839 - 29 September 1908): To start with, he is widely regarded as the greatest Brazilian writer. He wrote in every genre (poetry, short stories and novels), but he is best known for some of his short stories and, mainly, for his novels. Probably the three most famous novels he wrote are: (1) Memórias Póstumas de Brás Cubas (The Posthumous Memoirs of Brás Cubas, also known in English as Epitaph of a Small Winner), (2) Quincas Borba (also known in English as Philosopher or Dog?) and (3) Dom Casmurro (Sir Dour). His work has been studied by critics all around the world, and he is definitely an author who deserves to be read. He is also the founder of the Brazilian Academy of Letters. One last piece of information, extracted from Wikipedia: Machado de Assis was included on American literary critic Harold Bloom's list of the 100 greatest geniuses of literature, alongside writers such as Dante, Shakespeare and Cervantes.

--
Oswaldo Cruz (5 August 1872 - 11 February 1917): To ease my job in this long post, I quote this link: "Dr Cruz is widely praised for his pioneering work on identifying and controlling yellow fever, bubonic plague, malaria and smallpox. Of particular importance is his use of vaccines and instituting the modern medical practices of notification, quarantine of cases, eradication of pests carrying the disease and improvement to public hygiene. Although these were not always popular public policies they were central in the eradication of epidemics of these illnesses. Dr Cruz consequently became a national hero." Additionally, in 1916 he helped create the Brazilian Academy of Sciences.

--
Santos Dumont (20 July 1873 - 23 July 1932) was an aviation pioneer. There is much discussion (at least in Brazil) about who created the airplane first: Santos Dumont or the Wright brothers. I don't think it really matters for this post; the discussion alone shows how important Santos Dumont was. Most of his career was focused on dirigibles, but his most famous achievement was made with the 14-bis, which made the first heavier-than-air flight to be certified by the Aéro Club de France and the Fédération Aéronautique Internationale (FAI).

--
Carlos Chagas (9 July 1879 - 8 November 1934): He was another sanitary physician and scientist, like Oswaldo Cruz. He discovered Chagas disease while working at the Oswaldo Cruz Institute (yes, named after the same Oswaldo Cruz I cited above). This is not my field, so I am just going to quote Wikipedia again: "Chagas’ work is unique in the history of medicine, because he was the only researcher so far to describe completely a new infectious disease: its pathogen, vector (Triatominae), host, clinical manifestations and epidemiology. Chagas was also the first to discover and illustrate the parasitic fungal genus Pneumocystis, later infamously to be linked to PCP (Pneumocystis pneumonia in AIDS victims)."

--
Gilberto Freyre (15 March 1900 - 18 July 1987): He is considered one of the most important sociologists of the 20th century, and he was one of the few Brazilians to receive a knighthood from the British Crown. Additionally, it is interesting to observe that people on television still mention him, something not so common for sociologists...

--
Cândido Portinari (29 December 1903 - 6 February 1962): Portinari is one of the most famous painters in Brazil. Interestingly, he also had a great impact outside Brazil, with works such as his panels Guerra e Paz (War and Peace) in the United Nations building in New York and four murals in the Hispanic Reading Room of the Library of Congress in Washington, DC. There is an interesting story about War and Peace, but I will not tell it today; this post is way too long.

--
Oscar Niemeyer (15 December 1907 - 5 December 2012): By far the greatest architect Brazil has ever had. He designed Brazil's capital, Brasília; the Pampulha Architectural Group in Belo Horizonte, the city I lived in before moving to Canada; and so on... He had important designs outside Brazil too, such as the United Nations Headquarters in New York (he was one of the architects on the project). According to Wikipedia, this last building led to invitations to teach at Yale University and the Harvard Graduate School of Design (!). If you haven't clicked on the links in each person's name, please do it for Niemeyer; there are lots of pictures of his main buildings...

--
Guimarães Rosa (27 June 1908 - 19 November 1967): Machado de Assis, as I already said, is considered the greatest Brazilian writer. If that is not unanimous, it is because of Guimarães Rosa. He was a fantastic writer. In my opinion, Machado de Assis was more consistent, delivering many more novels. Nonetheless, the best Brazilian book I have ever read is The Devil to Pay in the Backlands (Grande Sertão: Veredas). I would say it is the Brazilian equivalent of Ulysses, by James Joyce. Almost every Brazilian has heard of this writer, although almost none has actually read him, which is understandable: most Britons have never read James Joyce either.

--
Chico Xavier (2 April 1910 - 30 June 2002): This is the most controversial name on this list, at least I believe so, for a simple reason: Chico Xavier is related to religion. However, I'll leave that topic aside. He was a "medium" but, most importantly, a philanthropist. In 1981 and 1982, he was nominated for the Nobel Peace Prize. Of course, as a Brazilian, he did not win (more about this topic below). His biography has recently been turned into a Brazilian movie, as have some of his books. However, I'd like to highlight his philanthropy; that is why he is on this list. By the way, he was not nominated for the Nobel Peace Prize "for his extraordinary efforts to strengthen international diplomacy and cooperation between peoples". Sorry for that...

--
César Lattes (11 July 1924 - 8 March 2005): This last person was one of the most important physicists Brazil has ever had. To be brief, copying Wikipedia's text: "Although he was the main researcher and the first author of the historical Nature article describing the pion, Cecil Powell alone was awarded the Nobel Prize for Physics in 1950 for "his development of the photographic method of studying nuclear processes and his discoveries regarding mesons made with this method". The reason for this apparent neglect is that the Nobel Committee policy until 1960 was to give the award to the research group head, only. Niels Bohr is rumored to have left behind a letter titled "Why Cesar Lattes did not win the Nobel Prize - Open 50 years after my death", however inquiries at the Niels Bohr Archive in Copenhagen, Denmark has turned up no such document." I believe that is enough, and it sounds very impressive to me.

--


Well, this is just an extremely short list of remarkable Brazilians, at least in my opinion. I left out several extremely important writers, such as Carlos Drummond de Andrade (my favorite poet), Cecília Meireles and Manuel Bandeira, among others, as well as great painters such as Tarsila do Amaral and Di Cavalcanti. As I said at the beginning, I deliberately avoided listing important athletes such as Pelé (the greatest soccer player ever, athlete of the century, and so on...) and Ayrton Senna (arguably the greatest racing driver ever).

I hope this post was useful to show that Brazil does have remarkable people who made a great difference in the world. Brazilians are not good at promoting themselves; we do not have an international motion picture industry to make movies about these people, and so on. This was my attempt to contribute with a list of important Brazilians. Unfortunately, most Brazilians still do not know some of the people listed here... Maybe I forgot someone I shouldn't have! If I did, I'm sorry...

Disclaimer: if this post took too long to be published, or if it is not so well written, it is certainly due to the incredibly high amount of work I'm doing right now.

Friday, October 18, 2013

How an ML Algorithm is created

While I was an undergrad student, using several different fancy techniques to classify my data, I always wondered how anyone could come up with "fancy" algorithms such as Boosting or SVMs from nothing. Of course, I was extremely naive to think that someone would come up with a new algorithm from "nothing", and deep inside I knew that things do not work this way. However, a method full of "fancy" math techniques was difficult to trace back to its origins, at least for me some years ago.

Today I will try to present an example of how ML algorithms are created. Of course, the real process is more complex than a blog post can convey, so I'm going to present just the intuition behind a "toy example": the derivation of the algorithm called Prod [1]. It is an adaptation of the Exponentially Weighted Average forecaster (EWA), which I'm going to present first. While you are reading about EWA, I invite you to think about how you would improve it. To be honest? I wouldn't be able to do it while reading, but it may be a good exercise... I'm not trying to seem smart by asking you this...


Please recall the online learning framework that I described in this post, and the first algorithms I presented here. I'm too lazy to write everything again, and the post would be huge!


EWA

Among the first algorithms I presented, Follow the Leader was the only one able to make predictions in a continuous space; however, we can clearly see that always following the leader may not be a good approach.

EWA is an algorithm that blends the predictions of all experts, weighting them according to the "confidence" you have in them. Intuitively, if an expert makes too many mistakes, its weight becomes much smaller than the weight of a very good forecaster.

The algorithm works as follows: you receive the predictions of a set of forecasters (just the way I presented before) and then predict $p_t$. The way we mix the forecasters' predictions is:
$$
p_t = \frac{\sum_{i=1}^{N}w_{t-1, i}\, f_{t,i}}{\sum_{i=1}^{N}w_{t-1, i}}
$$
where $f_{t,i}$ is the prediction of expert $i$ at time $t$, $N$ is the total number of experts, and $w_{t-1,i}$ is the weight given to expert $i$ in the previous time step. We define $w_{0,i} = 1 \ \forall i \in \{1, \dots, N\}$.

After making the prediction, as always, we receive the loss function $\ell_t$ and incur the loss $\ell_t(p_t)$ (we assume the losses are convex and bounded in $[0, 1]$). Finally, we update all the weights as follows:
$$ w_{t,i} = w_{t-1, i} \cdot e^{-\eta \ell_t(f_{t,i})} $$
Notice that $e^{-\eta}$ plays the role of $\beta$ in the WMA. The update is exponential for reasons that do not need to be discussed here. In the end, the upper bound on the regret of EWA is [2]:
$$R_n \leq \frac{\ln N}{\eta} + \frac{\eta n}{8}$$
I'm not going to discuss the proof here; it is in [2]. This is all I wanted to present about EWA for the main goal of this post: giving an intuition of how we create new algorithms. Did you think of anything that could make the bound above smaller?
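
To make the algorithm more concrete, here is a minimal Python sketch of EWA for this setting. This is my own illustrative code, not something from the references: the input format, the squared loss and the value of `eta` are assumptions made just for the example.

```python
import numpy as np

def ewa(expert_predictions, outcomes, eta=0.5):
    """Exponentially Weighted Average forecaster (illustrative sketch).

    expert_predictions: array of shape (n, N) holding f_{t,i}.
    outcomes: array of shape (n,) holding the true values y_t.
    Uses the squared loss, assumed to lie in [0, 1].
    """
    n, N = expert_predictions.shape
    w = np.ones(N)                       # w_{0,i} = 1 for every expert
    total_loss = 0.0
    for t in range(n):
        f_t = expert_predictions[t]
        p_t = np.dot(w, f_t) / w.sum()   # weighted average of the experts
        total_loss += (p_t - outcomes[t]) ** 2
        losses = (f_t - outcomes[t]) ** 2
        w = w * np.exp(-eta * losses)    # w_{t,i} = w_{t-1,i} * exp(-eta * loss)
    return total_loss, w
```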

Prod

If you really thought about improving the algorithm above, you probably suspected that the update rule is where we have to focus. In fact, there are other approaches, such as using different techniques in the proof of the bound, but you had no way of guessing that.

What really happens is that the proof uses Hoeffding's lemma; this is the origin of the number $8$ in the bound, for example. This lemma assumes that the losses lie in a convex (bounded) set. Can we obtain a tighter inequality? One approach is to avoid Hoeffding's lemma and use a different inequality: $e^{-\eta}$ is clearly a convex function, but we could use a concave function instead, avoiding the use of Hoeffding's lemma.

What function should we use? What about $1- \eta$? How did I come up with that? Well, it is tangent to $e^{-\eta}$ (at $\eta = 0$). Look at the graph below, which I got from WolframAlpha.


Thus, we change EWA to have a different update rule,
$$ w_{t,i} = w_{t-1, i} (1 -\eta \ell_t(f_{t,i}))$$
and we are able to obtain the following bound [2]:
$$R_n \leq \frac{\ln N}{\eta} + \eta n\left(\frac{1}{n}\sum_{t=1}^{n}\ell_t^2(f_{t,i}) \right)$$
Despite looking scary at first, let's take a closer look at this bound. It scales with the average of the squared losses of the forecasters. This is good because the previous bound ignored the fact that good forecasters should lead to a smaller regret. Besides, assuming the losses are between $0$ and $1$, this bound is even tighter, and better than a bound that scales linearly with time.
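
In code, moving from EWA to Prod is just a swap of the weight-update line. Here is a minimal sketch under the same made-up assumptions as the EWA snippet above (losses in $[0,1]$ and $\eta < 1$, so the Prod weights stay positive):

```python
import numpy as np

def ewa_update(w, losses, eta):
    """EWA update: w_{t,i} = w_{t-1,i} * exp(-eta * loss_i)."""
    return w * np.exp(-eta * losses)

def prod_update(w, losses, eta):
    """Prod update: w_{t,i} = w_{t-1,i} * (1 - eta * loss_i).

    The exponential factor is replaced by its tangent line at zero;
    eta * loss_i < 1 is needed so that the weights remain positive.
    """
    return w * (1.0 - eta * losses)
```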



In summary, what I did here was to show a first algorithm, EWA. Then I changed its update rule to a "better" one, which gave us a tighter upper bound on the regret of the algorithm. In other words, I changed a part of an existing algorithm using some math knowledge about the proof of the first bound. I did not delve deeply into the proof because the post would become a book chapter, but I hope you trust me. Maybe one day I will present this proof here. The idea is: I know some math, I see that you used an inequality that is not so tight, but that was necessary because of your assumption of convexity; I replace this assumption with a new update function and then propose a new algorithm, avoiding the inequality that weakened the previous bound.

I hope I made it clearer how we use our math knowledge to propose new algorithms in ML. Obviously, this is only a simple example, with a superficial discussion of Prod. The original paper [1] is much denser and deeper (as expected). This discussion is completely based on the lecture notes written by Prof. Csaba Szepesvári, Prof. András György and Dávid Pál [2]. The notation used is also the same. Of course, any mistakes are completely my fault.

References:
[1] Cesa-Bianchi, N., Mansour, Y., and Stoltz, G. (2007). Improved second-order bounds for prediction with expert advice. Machine Learning Journal, 66(2-3):321-352.
[2] Gyorgy, A., Pal, D., Szepesvari, C. (Draft). Lecture Notes - Online learning.

-----------
I'm sorry for not posting anything last week. I really want to be able to post regularly here; however, some things are out of our control: a broken laptop, assignments and so on...

Friday, October 4, 2013

First Algorithms in Online Learning

Considering the whole framework already discussed in a previous post, I am going to present some basic algorithms in the online learning field, namely Follow the Leader (or Follow the Best Expert), the Halving Algorithm and the Weighted Majority Algorithm.


Follow the Leader Algorithm

A first situation we may think of is a forecaster that has to predict a sequence of numbers lying in an interval. One may define the loss $\ell(p_t)$ of the prediction $p_t$ at time step $t$ as the distance between the prediction and the real outcome, i.e., $\ell(p_t) = ||p_t - y_t||^2$, where $y_t$ is the real outcome at time step $t$. Szepesvári et al. define this game as the Shooting Game, where the predictions lie in the unit ball of $\mathbb{R}^d$. I will use this game to present a bound for the first algorithm I am going to discuss, the Follow the Leader algorithm (FTL), also known as Follow the Best Expert.

FTL is an extremely simple algorithm: the prediction at time step $t$ is the fixed point that minimizes the cumulative loss observed so far. If we define the best forecaster in hindsight as the one that chooses the best fixed point up to time $t$ ($u_t$), we can say that FTL always predicts, at time step $t$, $u_{t-1}$. As an example, suppose the first three elements of the sequence to be predicted are $\{0.6, 0.4, 0.8\}$. FTL predicts $0$ at the first time step (there is no history yet), so the loss at the first time step is $||0 - 0.6||^2 = 0.36$. At the second time step, FTL predicts the mean of the past outcomes, $0.6$, and the loss at time step $2$ is $||0.6-0.4||^2 = 0.04$. At the third time step it predicts $\frac{0.6 + 0.4}{2} = 0.5$, incurring the loss $||0.5-0.8||^2 = 0.09$, and so on...

Despite being extremely simple, both Cesa-Bianchi and Lugosi, and Szepesvári et al., show that FTL has a regret upper bound proportional to $\log n$ in this game. This is a very good performance, since $\lim_{n\rightarrow \infty} \frac{\log n}{n} = 0$; thus, over time, the average regret becomes negligible.
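
As a small illustration (my own toy code, not taken from the references), here is FTL for the one-dimensional version of this game with the squared loss; it reproduces the numbers from the example above.

```python
def follow_the_leader(outcomes):
    """FTL in one dimension with the squared loss.

    Predicts the mean of all past outcomes (0 on the first step,
    when there is no history yet) and returns the per-step losses.
    """
    history_sum, losses = 0.0, []
    for t, y in enumerate(outcomes):
        p = history_sum / t if t > 0 else 0.0   # best fixed point in hindsight
        losses.append((p - y) ** 2)
        history_sum += y
    return losses

print(follow_the_leader([0.6, 0.4, 0.8]))  # approximately [0.36, 0.04, 0.09]
```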

Halving Algorithm

A different situation we can come up with is one in which you have a set of predictors (or forecasters) and you know, beforehand, that one of them is never wrong. The questions now are how to discover it and how many mistakes you may make along the way. Clearly, there must be something better than FTL here (besides, this problem may be a discrete classification problem). We can reason as in binary search -- I will describe the binary case, but the reasoning is the same for more classes. You count the number of forecasters that predicted class $1$ and the number of forecasters that predicted the other class, and you predict, for the current time step, whatever the majority voted. Two outcomes are possible: (1) you predicted correctly, and then you eliminate all the predictors that gave you wrong information; or (2) you got the wrong answer for that time step, which is not so good, but at least you eliminated at least half of the predictors you had. This is the Halving algorithm.

Now, answering the second question: in the worst case, how many mistakes does the Halving algorithm make? Unlike other proofs of algorithm bounds in this field, this one is quite intuitive. The worst possible case is to always make the wrong prediction until you discover the "perfect forecaster", right? Otherwise, you would be eliminating forecasters without making mistakes -- certainly not the worst case. Additionally, the worst case is not when all predictors but one are wrong, but when (roughly) half of the forecasters are wrong and the other half right. This way, following the same reasoning as in binary search, at least half of the active set is eliminated on each mistake and, in the end, the worst case is to make $\log_{2}{N}$ mistakes, where $N$ is the number of experts.
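
Here is a minimal Python sketch of the Halving algorithm (again my own illustration; the binary input format is an assumption):

```python
def halving(expert_predictions, outcomes):
    """Halving algorithm for binary predictions.

    expert_predictions: one list per round, each with a 0/1 prediction
    per expert.  Assumes at least one expert is always right, so the
    active set never becomes empty.  Returns the number of mistakes.
    """
    active = set(range(len(expert_predictions[0])))
    mistakes = 0
    for preds, y in zip(expert_predictions, outcomes):
        votes_for_one = sum(preds[i] for i in active)
        p = 1 if 2 * votes_for_one > len(active) else 0   # majority vote
        if p != y:
            mistakes += 1
        # keep only the experts that were right on this round
        active = {i for i in active if preds[i] == y}
    return mistakes
```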


Weighted Majority Algorithm

However, a more common situation is not to have a perfect predictor. How can we approach this task? The idea is very close to the Halving algorithm, but instead of eliminating the forecasters that made a wrong prediction, you just reduce your confidence in them and predict based on a weighted majority. This algorithm is called the Weighted Majority Algorithm.

The idea is to assign a weight $w_{t, i}$ to each predictor $i$ at time step $t$, initially set to $1$. You then sum the weights of the forecasters that predicted each possibility and choose the prediction supported by the largest total weight. For example, suppose three forecasters $f_1$, $f_2$ and $f_3$ have weights (a sort of confidence you place in them) equal to $0.6$, $0.3$ and $0.4$, respectively. $f_1$ predicted $A$ while $f_2$ and $f_3$ predicted $B$. Since $0.7 > 0.6$ (sums of weights), you predict $B$.

Regarding the update rule, for each predictor we define $w_{t,i} = w_{t-1, i} \cdot \beta^{\mathbb{I}\{f_{t,i} \neq y_t\}}$, where $\beta \in [0, 1)$ is a parameter, $y_t$ is the correct outcome revealed at time $t$, and, most importantly, $\mathbb{I}$ is the indicator function, which returns $1$ if the condition inside it holds and $0$ otherwise. So $\beta^{\mathbb{I}\{f_{t,i} \neq y_t\}}$ is $\beta^1 = \beta$ if the predictor was mistaken at that time step, and $\beta^0 = 1$ (no change!) if it was right. What about the number of mistakes?

Szepesvári, György and Pál state that the Weighted Majority Algorithm makes at most
$$
\frac{\ln\left(\frac{1}{\beta}\right)L_n^* + \ln N}{\ln\left(\frac{2}{1+\beta}\right)}
$$
mistakes, where $L_n^*$ is the number of mistakes made by the best expert and $N$ is the number of experts. I'm not going to present their proof here; maybe later, in another post.
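
To make the algorithm concrete, here is a minimal Python sketch of the WMA for the binary case (my own illustration; the input format and the default $\beta$ are assumptions):

```python
def weighted_majority(expert_predictions, outcomes, beta=0.5):
    """Weighted Majority Algorithm for binary predictions.

    Every expert starts with weight 1; whenever an expert is wrong its
    weight is multiplied by beta (0 <= beta < 1).  With beta = 0 the
    algorithm behaves like Halving.  Returns the number of mistakes.
    """
    n_experts = len(expert_predictions[0])
    w = [1.0] * n_experts
    mistakes = 0
    for preds, y in zip(expert_predictions, outcomes):
        weight_for_one = sum(w[i] for i in range(n_experts) if preds[i] == 1)
        weight_for_zero = sum(w) - weight_for_one
        p = 1 if weight_for_one > weight_for_zero else 0
        if p != y:
            mistakes += 1
        # w_{t,i} = w_{t-1,i} * beta^{I{f_{t,i} != y_t}}
        w = [w[i] * (beta if preds[i] != y else 1.0) for i in range(n_experts)]
    return mistakes
```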

To conclude, notice that if $\beta = 0$, the Weighted Majority Algorithm becomes the Halving Algorithm. In the next post I intend to talk about Exponentially Weighted Average forecasters, for when we want to predict numerical values that are not necessarily discrete.

------------------
I'm sorry if this post is not so well written; I'm too busy with assignments and exams this week (and the next one).

This discussion is completely based on the lecture notes written by Prof. Csaba Szepesvári, Prof. András György and Dávid Pál. The notation used is also the same. Of course, any mistakes are completely my fault.

Friday, September 27, 2013

The Transformation during Grad School

I saw this today and decided to share it. It is a cut-and-paste from Professor H. T. Kung's homepage at Harvard University. It seems reasonable...

Growth of a star (the transformation process that some students go through to become a mature researcher)--which stage are you in?
  • Knowing everything stage
    • Student: "I have designed a supercomputer even before graduate school."
    • Faculty: speechless
  • Totally beaten up stage
    • Student: speechless
    • Faculty: smiling at the student's progress so communication is possible now.
  • Confidence buildup stage
    • Student: "I am not stupid after all." (student thinks)
    • Faculty: "Uh oh, she is ready to argue." (faculty think)
  • Calling the shot stage
    • Faculty: "I am going to design an n-processor supercomputer."
    • Student: "You are crazy, because ..."
I'm pretty sure of where I am now. Just hoping that I will reach the later stages.

Friday, September 20, 2013

Online Learning - Predicting Individual Sequences

This post aims at presenting the basic concepts of online learning, which is in fact a framework. Recently, this approach has been receiving a lot of attention in top-tier conferences such as ICML, NIPS, ALT and COLT. I'm not sure whether "online learning" is the name most commonly used by the ML community for this topic, but I'm going to stick with it. A famous book on the subject is Prediction, Learning and Games, written by Nicolò Cesa-Bianchi and Gábor Lugosi.

In the traditional statistical learning setting, some stochastic properties of the data are assumed. However, in practice, one cannot guarantee that these properties hold; consequently, it is extremely hard to provide theoretical guarantees for the performance of these algorithms. In this scenario, online learning is a framework that makes no assumptions about the environment (I'll discuss the concept of environment below). Besides, differently from traditional approaches, which generally have a universal performance measure, online learning is much more pessimistic, intending only to do as well as the best reference predictor. These characteristics allow researchers to provide theoretical guarantees on an algorithm's performance. In fact, some standard approaches such as SVMs have already been re-implemented in the online learning framework.

In this framework, the prediction problem is modeled as a game in which one player is the predictor (or forecaster) and the other is the environment (your adversary). One of the main questions is: how do we generate predictors that are good enough? Of course, we have to define what "good enough" means... To answer this question, we have to define some basic concepts of the field. These concepts are presented below, together with the description of a game round.

At each time step $t = 1, 2, ...$, the forecaster makes a prediction $p_t \in D$, where $D$ is the decision set (the set of allowed predictions). At the same time, the environment chooses a loss function $\ell_t \in \mathcal{L}$, where $\mathcal{L}$ is the set of allowed loss functions (each mapping $D$ to the reals). Notice that neither the forecaster nor the environment knows its adversary's choice. The forecaster only knows the loss functions revealed in the past, i.e., $\ell_{t-1}, \ell_{t-2}, ..., \ell_{1}$. Defining $\ell_t(p_t)$ as the loss incurred by the prediction $p_t$, we can define the total loss of the forecaster up to round $n$, $\hat{L}_n$:
$$\hat{L}_n = \sum_{t=1}^{n}\ell_t(p_t)$$
Once we have defined the total loss of the forecaster, we should also define the total loss $L_n(u)$ of a fixed predictor $u$, as if it had been chosen at every time step:
$$L_n(u) = \sum_{t=1}^{n}\ell_t(u)$$
Once these concepts are defined, we can present what we really want to minimize: the regret $R_n$. It is the difference between the cumulative loss of the forecaster and that of the best predictor in $D$. Formally,
$$R_n = \hat{L}_n - \inf_{u \in D}L_n(u)$$
*Notice that $\inf$ is different from $\min$. For further details, please check the definition of infimum on Wikipedia [link].
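
To make these quantities concrete, here is a tiny numeric example in Python (the numbers are made up) with the squared loss: it computes the forecaster's cumulative loss $\hat{L}_n$, the loss $L_n(u)$ of the best fixed prediction in hindsight, and the regret $R_n$.

```python
import numpy as np

outcomes    = np.array([0.6, 0.4, 0.8])   # y_1, ..., y_n (made up)
predictions = np.array([0.0, 0.6, 0.5])   # p_1, ..., p_n of some forecaster

forecaster_loss = np.sum((predictions - outcomes) ** 2)   # \hat{L}_n
best_u          = outcomes.mean()         # best fixed prediction for squared loss
best_fixed_loss = np.sum((best_u - outcomes) ** 2)        # L_n(u*)
regret          = forecaster_loss - best_fixed_loss       # R_n

print(forecaster_loss, best_fixed_loss, regret)   # approximately 0.49 0.08 0.41
```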

Basically, we want to keep the regret small while the environment wants to make it large. To increase the regret, the environment tries to make the best predictor hard for the forecaster to discover, while keeping that predictor's loss small, so that the subtracted term of the difference stays small. Additionally, the environment tries to choose loss functions that are high for predictions different from that best one, to ensure that the first term is high. The forecaster, on the other hand, has no control over the second term; it can only select the best possible prediction at each time step. This is, in a very simplistic statement, why the framework works: each prediction $p_t$ tries to be as good as possible in order to minimize the regret, which is done by looking for patterns and so on.

In general, what we want is to ensure that the forecaster (the first term) is, over time, as good as the best available predictor. Thus, we can say that, in the limit, when $n$ is arbitrarily large, we want the ratio $\frac{R_n}{n} \rightarrow 0$. In other words, we want the regret to grow sublinearly.

A final question to be answered is: why don't we just use the forecaster's loss to measure performance? Why do we use regret?

I have to confess that this was not clear to me at first, but the answer is that regret gives us something concrete to compare against. In other words, remember that the environment plays against you, so it wants to increase the forecaster's loss. It can do so in a pretty simple way, and we can easily show that it can make the forecaster's loss arbitrarily large: for example, if the loss can take the values $0$ and $1$, the environment can easily make the forecaster suffer a total loss of $n$ (the worst case). So how do we know whether our algorithm is good? We have to compare it to something else, in this case, the best predictor. If our regret is $0$, we know that we cannot do any better, even if the forecaster's loss is high; we are simply working on a problem that is too hard. It is all a matter of having a basis for comparison.

An analogy that may ease the understanding of this difference is grades in a class. If we state that each student scores a grade $p \in \mathbb{R} : 0 \leq p \leq 100$, we just define the range of possibilities (you can see them as losses); we do not say whether it is actually possible to score every one of these grades. Suppose that your professor is acting as your adversary (this is just an analogy!) and wants to make the course extremely challenging. At the end, you score $40$. It seems pretty bad; however, when you look at your classmates' grades, the highest one was $42$. Your loss is huge! But your regret is small... You shouldn't be so sad about your performance, and hopefully you will still receive an A+ ;)

------------------

This post was longer than I expected, so I decided to leave some examples of this framework for the next post. I hope you were able to grasp the main concepts of this framework for predicting individual sequences.

This discussion is completely based on the lecture notes written by Prof. Csaba Szepesvári, Prof. András György and Dávid Pál. The notation used is also the same. Of course, any mistakes are completely my fault.