Section: Research Program
Statistical analysis of time series
Many of the problems of machine learning can be seen as extensions of classical problems of mathematical statistics to their (extremely) non-parametric and model-free cases. Other machine learning problems are founded on such statistical problems. Statistical problems of sequential learning are mainly those that are concerned with the analysis of time series. These problems are as follows.
Prediction of Sequences of Structured and Unstructured Data
Given a series of observations it is required to give forecasts concerning the distribution of the future observations ; in the simplest case, that of the next outcome . Then is revealed and the process continues. Different goals can be formulated in this setting. One can either make some assumptions on the probability measure that generates the sequence , such as that the outcomes are independent and identically distributed (i.i.d.), or that the sequence is a Markov chain, that it is a stationary process, etc. More generally, one can assume that the data is generated by a probability measure that belongs to a certain set . In these cases the goal is to have the discrepancy between the predicted and the “true” probabilities to go to zero, if possible, with guarantees on the speed of convergence.
Alternatively, rather than making some assumptions on the data, one can change the goal: the predicted probabilities should be asymptotically as good as those given by the best reference predictor from a certain pre-defined set.
Another dimension of complexity in this problem concerns the nature of observations . In the simplest case, they come from a finite space, but already basic applications often require real-valued observations. Moreover, function or even graph-valued observations often arise in practice, in particular in applications concerning Web data. In these settings estimating even simple characteristics of probability distributions of the future outcomes becomes non-trivial, and new learning algorithms for solving these problems are in order.
Given a series of observations of generated by some unknown probability measure , the problem is to test a certain given hypothesis about , versus a given alternative hypothesis . There are many different examples of this problem. Perhaps the simplest one is testing a simple hypothesis “ is Bernoulli i.i.d. measure with probability of 0 equals 1/2” versus “ is Bernoulli i.i.d. with the parameter different from 1/2”. More interesting cases include the problems of model verification: for example, testing that is a Markov chain, versus that it is a stationary ergodic process but not a Markov chain. In the case when we have not one but several series of observations, we may wish to test the hypothesis that they are independent, or that they are generated by the same distribution. Applications of these problems to a more general class of machine learning tasks include the problem of feature selection, the problem of testing that a certain behaviour (such as pulling a certain arm of a bandit, or using a certain policy) is better (in terms of achieving some goal, or collecting some rewards) than another behaviour, or than a class of other behaviours.
The problem of hypothesis testing can also be studied in its general formulations: given two (abstract) hypothesis and about the unknown measure that generates the data, find out whether it is possible to test against (with confidence), and if yes then how can one do it.
Change Point Analysis
A stochastic process is generating the data. At some point, the process distribution changes. In the “offline” situation, the statistician observes the resulting sequence of outcomes and has to estimate the point or the points at which the change(s) occurred. In online setting, the goal is to detect the change as quickly as possible.
These are the classical problems in mathematical statistics, and probably among the last remaining statistical problems not adequately addressed by machine learning methods. The reason for the latter is perhaps in that the problem is rather challenging. Thus, most methods available so far are parametric methods concerning piece-wise constant distributions, and the change in distribution is associated with the change in the mean. However, many applications, including DNA analysis, the analysis of (user) behaviour data, etc., fail to comply with this kind of assumptions. Thus, our goal here is to provide completely non-parametric methods allowing for any kind of changes in the time-series distribution.
Clustering Time Series, Online and Offline
The problem of clustering, while being a classical problem of mathematical statistics, belongs to the realm of unsupervised learning. For time series, this problem can be formulated as follows: given several samples , we wish to group similar objects together. While this is of course not a precise formulation, it can be made precise if we assume that the samples were generated by different distributions.
The online version of the problem allows for the number of observed time series to grow with time, in general, in an arbitrary manner.
Online Semi-Supervised Learning
Semi-supervised learning (SSL) is a field of machine learning that studies learning from both labeled and unlabeled examples. This learning paradigm is extremely useful for solving real-world problems, where data is often abundant but the resources to label them are limited.
Furthermore, online SSL is suitable for adaptive machine learning systems. In the classification case, learning is viewed as a repeated game against a potentially adversarial nature. At each step of this game, we observe an example , and then predict its label .
The challenge of the game is that we only exceptionally observe the true label . In the extreme case, which we also study, only a handful of labeled examples are provided in advance and set the initial bias of the system while unlabeled examples are gathered online and update the bias continuously. Thus, if we want to adapt to changes in the environment, we have to rely on indirect forms of feedback, such as the structure of data.
Online Kernel and Graph-Based Methods
Large-scale kernel ridge regression is limited by the need to store a large kernel matrix. Similarly, large-scale graph-based learning is limited by storing the graph Laplacian. Furthermore, if the data come online, at some point no finite storage is sufficient and per step operations become slow.
Our challenge is to design sparsification methods that give guaranteed approximate solutions with a reduced storage requirements.