Section: New Results

Structured Penalties for Log-linear Language Models

Participants: Anil Nelakanti [correspondent], Cédric Archambeau, Francis Bach, Guillaume Bouchard.

Language models can be formalized as log-linear regression models in which the input features represent previously observed contexts up to a certain length m. The complexity of existing algorithms that learn the parameters by maximum likelihood scales linearly in nd, where n is the length of the training corpus and d is the number of observed features. In [19] we present a model whose learning complexity grows only logarithmically in d, making it possible to efficiently leverage longer contexts. We account for the sequential structure of natural language with tree-structured penalized objectives, which avoid overfitting and achieve better generalization.
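For concreteness, the estimation problem can be written schematically as follows; the notation (θ_w, φ, λ, Ω) is introduced here for exposition only and is not taken verbatim from [19]. Writing φ(x_t) ∈ {0,1}^d for the indicator features of the contexts (suffixes of length at most m) preceding position t, and V for the vocabulary, the model and its penalized maximum-likelihood training read

\[
p_\theta(w \mid x_t) \;=\; \frac{\exp\!\big(\theta_w^\top \phi(x_t)\big)}{\sum_{v \in V} \exp\!\big(\theta_v^\top \phi(x_t)\big)},
\qquad
\hat{\theta} \;=\; \arg\min_\theta \; -\frac{1}{n} \sum_{t=1}^{n} \log p_\theta(w_t \mid x_t) \;+\; \lambda\, \Omega(\theta),
\]

where Ω is the regularizer whose choice is discussed next.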

Language models are crucial components of advanced natural language processing pipelines, such as speech recognition [45], machine translation [47], or information retrieval [92]. Given an observed sequence of symbols, a language model predicts the probability of occurrence of the next symbol in the sequence. Models based on so-called back-off smoothing have shown good predictive power [60]. In particular, Kneser-Ney (KN) smoothing and its variants [66] have remained state of the art for more than a decade after they were originally proposed. Smoothing methods are in fact clever heuristics that require tuning parameters in an ad hoc fashion. Hence, more principled ways of learning language models have been proposed, based on maximum entropy [50], conditional random fields [81], or a Bayesian approach [94].

We focus on penalized maximum likelihood estimation in log-linear models. In contrast to language models based on unstructured norms such as the ℓ2 norm (quadratic penalties) or the ℓ1 norm (absolute discounting), we use tree-structured norms [96], [65]. Structured penalties have been successfully applied to various NLP tasks, including chunking and named entity recognition [74], but not to language modeling. Such penalties are particularly well suited to this problem, as they mimic the nested nature of word contexts. However, existing optimization techniques do not scale to large context lengths m.
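Schematically, and again in notation chosen here for exposition, the observed contexts of length 1 to m can be arranged in a suffix trie T whose root is the empty context and where a child extends its parent's context by one more symbol into the past. Up to details of weighting, the tree-structured penalties of [96], [65] then take the form

\[
\Omega_T(\theta) \;=\; \sum_{g \in \mathcal{G}} w_g \, \|\theta_g\|_p, \qquad p \in \{2, \infty\},
\]

where each group g ∈ 𝒢 collects the parameters attached to a node of T and to all of its descendants, and the w_g > 0 are weights. Shrinking a group to zero removes an entire subtree, so, in effect, a long context receives weight only if its shorter suffixes do, which is exactly the nested structure mentioned above.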

We show that tree-structured norms provide an efficient framework for language modeling. Furthermore, we give the first algorithm for these norms whose complexity is nearly linear in the number of tree nodes. This leads to a memory- and time-efficient learning algorithm for generalized linear language models.
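To make the operation concrete, below is a minimal sketch of one proximal-gradient step with a tree-structured ℓ2 penalty over such a suffix trie. The data structures and names (TrieNode, tree_prox, ...) are ours, each context node carries a single scalar parameter for simplicity, and the proximal operator is computed by the standard naive bottom-up composition of group soft-thresholdings, whose cost grows with the sum of subtree sizes. It is not the nearly-linear algorithm contributed in [19]; it only illustrates the operation that algorithm accelerates.

import numpy as np

class TrieNode:
    """Node of the suffix trie; `index` is the position of this
    context's (scalar) parameter in the vector theta."""
    def __init__(self, index, children=None):
        self.index = index
        self.children = children or []

def subtree_indices(node):
    # Collect the parameter indices of `node` and all of its descendants.
    stack, out = [node], []
    while stack:
        n = stack.pop()
        out.append(n.index)
        stack.extend(n.children)
    return out

def group_shrink(theta, idx, tau):
    # l2 group soft-thresholding: shrink the block theta[idx] by tau in norm.
    block = theta[idx]
    norm = np.linalg.norm(block)
    scale = max(0.0, 1.0 - tau / norm) if norm > 0 else 0.0
    theta[idx] = scale * block

def tree_prox(theta, node, tau):
    # Prox of the tree-structured l2 norm: compose the group shrinkages
    # bottom-up, children before parents (naive version, for illustration).
    for child in node.children:
        tree_prox(theta, child, tau)
    group_shrink(theta, subtree_indices(node), tau)

def prox_gradient_step(theta, grad, root, step, lam):
    # One step: gradient descent on the log-loss, then the penalty's prox.
    theta = theta - step * grad
    tree_prox(theta, root, step * lam)
    return theta

Applying the shrinkage children-first is what makes the composition of elementary proximal operators exact for nested groups; the result summarized above is that this whole computation can be carried out in time nearly linear in the number of trie nodes.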