## Section: New Results

### Structured Penalties for Log-linear Language Models

Participants: Anil Nelakanti [correspondent], Cédric Archambeau, Francis Bach, Guillaume Bouchard.

Language models can be formalized as log-linear regression models where the input features represent previously observed contexts up to a certain length $m$. The complexity of existing algorithms to learn the parameters by maximum likelihood scales linearly in $nd$, where $n$ is the length of the training corpus and $d$ is the number of observed features. In [19] we present a model whose complexity grows logarithmically in $d$, making it possible to efficiently leverage longer contexts. We account for the sequential structure of natural language using tree-structured penalized objectives to avoid overfitting and achieve better generalization.
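To make the setup concrete, the following minimal sketch implements a generic log-linear language model over suffix features, as described above. It is purely illustrative (the function names, toy vocabulary, and weights are invented for this example and are not the implementation from [19]): the features active at a position are the suffixes of the context up to length $m$, and the next-symbol distribution is a softmax over per-symbol feature scores.

```python
import numpy as np

def suffix_features(context, m):
    """All suffixes of `context` up to length m, e.g. ('a','b') -> [('b',), ('a','b')]."""
    return [tuple(context[-k:]) for k in range(1, min(m, len(context)) + 1)]

def next_symbol_probs(context, vocab, weights, m):
    """Softmax over per-symbol scores; `weights` maps (feature, symbol) -> weight."""
    feats = suffix_features(context, m)
    scores = np.array([sum(weights.get((f, w), 0.0) for f in feats) for w in vocab])
    scores -= scores.max()                      # subtract max for numerical stability
    p = np.exp(scores)
    return p / p.sum()

# Toy vocabulary and weights, chosen only to illustrate the mechanics.
vocab = ["a", "b", "c"]
weights = {(("b",), "c"): 2.0, (("a", "b"), "c"): 1.0}
p = next_symbol_probs(["a", "b"], vocab, weights, m=2)  # distribution over next symbol
```

Note that the number of observed features $d$ grows quickly with $m$, which is why the scaling in $d$ discussed above matters.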

Language models are crucial components of advanced natural language processing pipelines, such as speech recognition [45], machine translation [47], or information retrieval [92]. When a sequence of symbols is observed, a language model predicts the probability of occurrence of the next symbol in the sequence. Models based on so-called back-off smoothing have shown good predictive power [60]. In particular, Kneser-Ney (KN) and its variants [66] have remained state of the art for more than a decade after they were originally proposed. Smoothing methods are in fact clever heuristics that require ad-hoc parameter tuning. Hence, more principled ways of learning language models have been proposed, based on maximum entropy [50] or conditional random fields [81], or by adopting a Bayesian approach [94].

We focus on penalized maximum likelihood estimation in log-linear models.
In contrast to language models based on *unstructured* norms such as ${\ell}_{2}$ (quadratic penalties) or ${\ell}_{1}$ (absolute discounting), we use *tree-structured* norms [96], [65]. Structured penalties have been successfully applied to various NLP tasks, including chunking and named entity recognition [74], but not language modeling. Such penalties are particularly well-suited to this problem as they mimic the nested nature of word contexts. However, existing optimization techniques do not scale to large context lengths $m$.
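The nesting alluded to above can be made explicit: each node of a suffix trie defines a group containing the weights of its entire subtree, and the tree-structured penalty sums the norms of these nested groups. The sketch below (with an invented toy trie and weight vector, for illustration only) computes such a penalty with either the ${\ell}_{2}$ or ${\ell}_{\infty}$ group norm.

```python
import numpy as np

# Toy suffix trie: node -> (weight index, children). Longer contexts
# ("ba") hang below their shorter suffixes ("a"), so groups are nested.
trie = {
    "": (0, ["a", "b"]),
    "a": (1, ["ba"]),
    "ba": (2, []),
    "b": (3, []),
}

def subtree_indices(node):
    """Weight indices of the subtree rooted at `node` (the node's group)."""
    idx, children = trie[node]
    out = [idx]
    for c in children:
        out.extend(subtree_indices(c))
    return out

def tree_norm(w, norm=2):
    """Tree-structured penalty: sum of group norms over all trie nodes."""
    return sum(np.linalg.norm(w[subtree_indices(n)], ord=norm) for n in trie)

w = np.array([1.0, -2.0, 0.0, 3.0])
penalty_l2 = tree_norm(w, norm=2)        # nested l2 group norms
penalty_linf = tree_norm(w, norm=np.inf) # nested l-inf group norms
```

Because a node's group contains all of its descendants, shrinking a short context's group to zero also zeroes out every longer context that extends it, which is exactly the behavior one wants for nested word contexts.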

We show that structured tree norms provide an efficient framework for language modeling.
Furthermore, we give the first algorithm for structured ${\ell}_{\infty}$ tree norms with a complexity nearly linear in the number of nodes. This leads to a memory-efficient *and* time-efficient learning algorithm for generalized linear language models.
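A key building block for optimizing with an ${\ell}_{\infty}$ norm is its proximal operator. The sketch below shows the standard single-group case via the Moreau decomposition, $\mathrm{prox}_{\lambda\|\cdot\|_{\infty}}(v) = v - \lambda\,\Pi_{\{\|u\|_{1}\le 1\}}(v/\lambda)$, using the well-known sort-based ${\ell}_{1}$-ball projection. This is a generic textbook routine, not the tree algorithm from [19], which composes such steps over nested groups in near-linear time.

```python
import numpy as np

def project_l1_ball(v, radius=1.0):
    """Euclidean projection of v onto the l1 ball of given radius (sort-based, O(d log d))."""
    if np.abs(v).sum() <= radius:
        return v.copy()
    u = np.sort(np.abs(v))[::-1]          # magnitudes in decreasing order
    css = np.cumsum(u)
    # Largest index rho with u[rho] * (rho+1) > css[rho] - radius
    rho = np.nonzero(u * np.arange(1, len(v) + 1) > css - radius)[0][-1]
    theta = (css[rho] - radius) / (rho + 1.0)
    return np.sign(v) * np.maximum(np.abs(v) - theta, 0.0)

def prox_linf(v, lam):
    """Proximal operator of lam * ||.||_inf via Moreau decomposition."""
    return v - lam * project_l1_ball(v / lam, 1.0)

v = np.array([3.0, -1.0, 0.5])
p = prox_linf(v, lam=1.0)   # clips the largest entries toward the rest
```

The effect is to clip the largest magnitudes at a common threshold, which, applied through the nested groups of the trie, produces the context-sharing shrinkage that the tree norm encodes.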