## Section: New Results

### Foundations of Machine Learning

Participants: Daniil Ryabko, Azadeh Khaleghi, Romaric Gaudel.

#### Sequence prediction in the most general form

The problem of sequence prediction consists in forecasting, at each time step $n$, the probabilities of the next outcome of the observed sequence of data $x_1, x_2, \dots, x_n, \dots$. In the most general formulation of the problem, we assume that we are given a set $\mathcal{C}$ of probability measures (on the space of infinite sequences). We can then assume either that the sequence is generated by an unknown measure $\mu$ that belongs to $\mathcal{C}$, or that the measure $\mu$ is arbitrary but we compare the performance of our predictor to that of the best predictor in $\mathcal{C}$.

##### Relation between the realizable and non-realizable cases of the sequence prediction problem

The realizable case of the sequence prediction problem is when the measure $\mu$ belongs to an arbitrary but known class $\mathcal{C}$ of process measures. The non-realizable case is when $\mu$ is completely arbitrary, but prediction performance is measured with respect to a given set $\mathcal{C}$ of process measures. We are interested in the relations between these problems and between their solutions, as well as in characterizing the cases when a solution exists and finding these solutions. In this work [13] we show that if the quality of prediction is measured by total variation distance, then these problems coincide, whereas if it is measured by expected average KL-divergence, then they are different. For some of the formalizations we also show that when a solution exists, it can be obtained as a Bayes mixture over a countable subset of $\mathcal{C}$. As an illustration of the general results obtained, we show that a solution to the non-realizable case of the sequence prediction problem exists for the set of all finite-memory processes, but does not exist for the set of all stationary processes.
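The Bayes-mixture idea can be sketched as follows. This is a minimal illustration, not the construction from [13]: the class $\mathcal{C}$ is taken, purely for illustration, to be a small finite set of i.i.d. Bernoulli measures with hypothetical parameters $p_j$ and prior weights $w_j$.

```python
# A minimal sketch (not the construction from [13]): a Bayesian mixture
# predictor over a class C, taken here, for illustration only, to be a
# small set of i.i.d. Bernoulli measures with parameters p_j and prior
# weights w_j.

def bayes_mixture_predictor(params, weights, sequence):
    """Predicted probability that the next outcome is 1, given a binary
    `sequence`; `params` are the Bernoulli parameters defining C and
    `weights` the prior over C."""
    ones = sum(sequence)
    zeros = len(sequence) - ones
    # Posterior weight of each measure: proportional to w_j * mu_j(x_1..x_n).
    likelihoods = [p ** ones * (1.0 - p) ** zeros for p in params]
    posterior = [w * l for w, l in zip(weights, likelihoods)]
    total = sum(posterior)
    # Mixture prediction: posterior-weighted average of each measure's
    # own prediction for the next symbol.
    return sum(q * p for q, p in zip(posterior, params)) / total

params = [0.1, 0.5, 0.9]
weights = [1 / 3, 1 / 3, 1 / 3]
seq = [1, 1, 1, 0, 1, 1, 1, 1]
print(bayes_mixture_predictor(params, weights, seq))
```

After observing a sequence dominated by ones, the posterior concentrates on the measure with the largest parameter, so the mixture's prediction moves towards it; with a countable class, one would simply sum over countably many terms with summable weights.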

#### Statistical inference

We continue to obtain new results using the theoretical framework developed recently for studying stationary ergodic time series. This year's new results include a topological characterization of the composite hypotheses for which consistent tests exist, as well as new results on clustering.

##### A criterion for the existence of consistent tests

The most general result that we have obtained [14] on hypothesis testing provides a complete characterization (necessary and sufficient conditions) for the existence of a consistent test for membership to an arbitrary family ${H}_{0}$ of stationary ergodic discrete-valued processes, against ${H}_{1}$ which is the complement of ${H}_{0}$ to this class of processes. The criterion is that ${H}_{0}$ has to be closed in the topology of distributional distance, and closed under taking ergodic decompositions of its elements.
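The distributional distance appearing in the criterion can be estimated empirically. The sketch below uses one common formulation, $d(x,y) = \sum_k 2^{-k} \sum_{|w|=k} |\mathrm{freq}_x(w) - \mathrm{freq}_y(w)|$, where $\mathrm{freq}_x(w)$ is the empirical frequency of the word $w$ in the sample $x$; the binary alphabet and the truncation at a maximum word length are simplifying assumptions.

```python
from itertools import product

# A hedged sketch of an empirical distributional distance:
#   d(x, y) = sum_k 2^{-k} * sum_{|w| = k} | freq_x(w) - freq_y(w) |,
# where freq_x(w) is the empirical frequency of word w in sample x.
# The binary alphabet and truncation at max_len are assumptions.

def word_freqs(seq, k):
    """Empirical frequencies of all words of length k in seq."""
    n = len(seq) - k + 1
    freqs = {}
    for i in range(n):
        w = tuple(seq[i:i + k])
        freqs[w] = freqs.get(w, 0.0) + 1.0 / n
    return freqs

def empirical_dist_distance(x, y, alphabet=(0, 1), max_len=4):
    d = 0.0
    for k in range(1, max_len + 1):
        fx, fy = word_freqs(x, k), word_freqs(y, k)
        for w in product(alphabet, repeat=k):
            d += 2.0 ** -k * abs(fx.get(w, 0.0) - fy.get(w, 0.0))
    return d

x = [0, 1] * 50          # alternating sample
y = [0] * 50 + [1] * 50  # two constant halves
print(empirical_dist_distance(x, y))
```

The two samples above have identical single-symbol frequencies but very different word frequencies at length two and beyond, which the weighted sum detects.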

#### Clustering

##### Online clustering of time series

An asymptotically consistent algorithm has been proposed for the problem of online clustering of time series. A growing body of time-series samples is observed, each of which itself grows with time. At each time step, it is required to group these time series into $k$ clusters. It is known that each time series is generated by one of $k$ *unknown* stationary ergodic distributions. The proposed algorithm, for each fixed portion of the samples, eventually (with probability 1) puts into the same group those and only those samples that were generated by the same distribution. The empirical performance of the algorithm is evaluated on synthetic and real data.
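A hedged sketch of the kind of scheme involved (not the paper's exact algorithm, and ignoring the online aspect of growing samples): estimate pairwise distances between the samples' empirical distributions, pick $k$ cluster centers by farthest-point selection, and assign every sample to its nearest center. The crude distance estimate (first- and second-order word frequencies) and the synthetic Bernoulli data are assumptions made for illustration.

```python
# A hedged sketch (not the paper's exact algorithm): cluster time-series
# samples by (1) estimating pairwise distances between their empirical
# distributions and (2) farthest-point selection of k centers, followed
# by nearest-center assignment. The distance estimate below, based on
# first- and second-order word frequencies, is a simplifying assumption.

def freq_vector(seq):
    n = len(seq)
    f1 = sum(seq) / n                                      # freq of symbol 1
    f11 = sum(a and b for a, b in zip(seq, seq[1:])) / (n - 1)  # freq of "11"
    return (f1, f11)

def dist(a, b):
    fa, fb = freq_vector(a), freq_vector(b)
    return sum(abs(u - v) for u, v in zip(fa, fb))

def cluster(samples, k):
    # Farthest-point initialization: the first center is sample 0; each
    # next center is the sample farthest from the centers chosen so far.
    centers = [0]
    while len(centers) < k:
        far = max(range(len(samples)),
                  key=lambda i: min(dist(samples[i], samples[c])
                                    for c in centers))
        centers.append(far)
    # Assign every sample to its nearest center.
    return [min(range(k), key=lambda j: dist(s, samples[centers[j]]))
            for s in samples]

import random
random.seed(0)
gen = lambda p, n: [int(random.random() < p) for _ in range(n)]
samples = [gen(0.2, 500), gen(0.8, 500), gen(0.2, 500), gen(0.8, 500)]
labels = cluster(samples, k=2)
print(labels)
```

On these synthetic i.i.d. samples, the two sequences drawn with parameter 0.2 end up in one group and the two drawn with parameter 0.8 in the other.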

##### Clustering of ranked data

We introduced [47] a novel approach to clustering rank data on a set of possibly large cardinality $n\in {\mathbb{N}}^{*}$, relying upon the Fourier representation of functions defined on the symmetric group $\mathfrak{S}_n$. In the proposed setup, which covers a wide variety of practical situations, rank data are viewed as distributions on $\mathfrak{S}_n$, and cluster analysis aims at segmenting the data into homogeneous subgroups that are, ideally, highly dissimilar from one another. Comparing distributions on the noncommutative group $\mathfrak{S}_n$ in a coordinate-wise manner, for instance by embedding them in the set ${[0,1]}^{n!}$, hardly yields interpretable results and raises obvious computational issues; evaluating the closeness of groups of permutations in the Fourier domain, in contrast, can be much easier. Indeed, in a wide variety of situations, a few well-chosen Fourier (matrix) coefficients suffice to approximate efficiently two distributions on $\mathfrak{S}_n$, as well as their degree of dissimilarity, while describing global properties in an interpretable fashion. Following in the footsteps of recent advances in automatic feature selection in the context of unsupervised learning, we propose to cast the task of clustering rankings as the optimization of a criterion that can be expressed in a simple manner in the Fourier domain.
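As a hedged illustration of why low-order Fourier information can suffice (and not the method of [47]): the lowest nontrivial Fourier component of a distribution on $\mathfrak{S}_n$ carries the same information as the first-order marginal matrix $M[i][j] = P(\text{item } i \text{ is placed at position } j)$. Representing each group of rankings by this $n \times n$ matrix and comparing matrices in Frobenius norm is a drastic simplification; the rankings below are synthetic.

```python
# A hedged illustration, not the method of [47]: the lowest nontrivial
# Fourier component of a distribution on the symmetric group carries the
# same information as the first-order marginal matrix
#   M[i][j] = P(item i is placed at position j).
# Summarizing each group of rankings by this n x n matrix and comparing
# matrices in Frobenius norm is a drastic simplification of comparing
# distributions in the Fourier domain; the rankings below are synthetic.

def marginal_matrix(rankings, n):
    """First-order marginals: each ranking r satisfies r[pos] = item."""
    M = [[0.0] * n for _ in range(n)]
    for r in rankings:
        for pos, item in enumerate(r):
            M[item][pos] += 1.0 / len(rankings)
    return M

def frob(A, B):
    """Frobenius distance between two marginal matrices."""
    return sum((a - b) ** 2
               for ra, rb in zip(A, B) for a, b in zip(ra, rb)) ** 0.5

n = 3
group_a = [(0, 1, 2), (0, 2, 1), (0, 1, 2)]   # item 0 always ranked first
group_b = [(2, 1, 0), (1, 2, 0), (2, 1, 0)]   # item 0 always ranked last
Ma = marginal_matrix(group_a, n)
Mb = marginal_matrix(group_b, n)
print(frob(Ma, Mb))   # dissimilarity of the two groups of rankings
```

The marginal matrices expose the global disagreement (where item 0 is placed) at a cost of $n^2$ numbers per group, instead of $n!$ coordinates for the full distributions.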