Section: New Results
Structured Penalties for Log-linear Language Models
Participants: Anil Nelakanti [correspondent], Cédric Archambeau, Francis Bach, Guillaume Bouchard.
Language models can be formalized as log-linear regression models where the input features represent previously observed contexts up to a certain length m. The complexity of existing algorithms to learn the parameters by maximum likelihood scales linearly in nd, where n is the length of the training corpus and d is the number of observed features. In [19] we present a model that grows logarithmically in d, making it possible to efficiently leverage longer contexts. We account for the sequential structure of natural language using tree-structured penalized objectives to avoid overfitting and achieve better generalization.
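To make the log-linear parameterization concrete, the sketch below (ours, not the implementation of [19]) scores the next symbol with one weight per (context, symbol) pair, where the contexts are the nested suffixes of the history up to length m; the vocabulary and weight values are toy examples.

```python
# Minimal sketch of a log-linear language model over nested context features.
# Hypothetical names and toy values; not the implementation of [19].
import math
from collections import defaultdict

def context_features(history, m):
    """Return the nested suffix contexts of `history`, up to length m (including the empty context)."""
    return [tuple(history[len(history) - k:]) for k in range(min(m, len(history)) + 1)]

def next_symbol_probs(history, vocab, weights, m):
    """Softmax over next-symbol scores; weights[(context, symbol)] is a learned parameter."""
    feats = context_features(history, m)
    scores = {w: sum(weights[(c, w)] for c in feats) for w in vocab}
    z = sum(math.exp(s) for s in scores.values())
    return {w: math.exp(s) / z for w, s in scores.items()}

# Toy usage with a hypothetical weight vector.
vocab = ["a", "b", "c"]
weights = defaultdict(float)
weights[(("a",), "b")] = 1.0          # seeing "a" raises the score of "b"
print(next_symbol_probs(["c", "a"], vocab, weights, m=2))
```

Because every suffix of the history is a feature, the number of observed features d grows quickly with the context length m, which is what makes the scaling of the learning algorithm in d the central issue.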
Language models are crucial parts of advanced natural language processing pipelines, such as speech recognition [45], machine translation [47], or information retrieval [92]. When a sequence of symbols is observed, a language model predicts the probability of occurrence of the next symbol in the sequence. Models based on so-called back-off smoothing have shown good predictive power [60]. In particular, Kneser-Ney (KN) smoothing and its variants [66] still achieve state-of-the-art results more than a decade after they were originally proposed. Smoothing methods are in fact clever heuristics that require tuning parameters in an ad hoc fashion. Hence, more principled ways of learning language models have been proposed, based on maximum entropy [50] or conditional random fields [81], or by adopting a Bayesian approach [94].
We focus on penalized maximum likelihood estimation in log-linear models. In contrast to language models based on unstructured norms such as the ℓ2-norm (quadratic penalties) or the ℓ1-norm (absolute discounting), we use tree-structured norms [96], [65]. Structured penalties have been successfully applied to various NLP tasks, including chunking and named entity recognition [74], but not to language modeling. Such penalties are particularly well-suited to this problem, as they mimic the nested nature of word contexts. However, existing optimization techniques do not scale to large context sizes.
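For concreteness, a tree-structured penalized objective of this kind can be written as follows (notation ours, following the tree-norm literature [96], [65]):

\[
\min_{w}\;\; -\sum_{i=1}^{n} \log p_w\!\left(x_i \mid x_{i-m},\dots,x_{i-1}\right) \;+\; \lambda \sum_{g \in \mathcal{T}} \|w_g\|,
\]

where \(\mathcal{T}\) is the set of nodes of the suffix trie of observed contexts, \(w_g\) collects the parameters of the subtree rooted at node \(g\), \(\|\cdot\|\) is for instance the ℓ2- or ℓ∞-norm, and \(\lambda > 0\) controls the strength of the penalty. Because each group contains all the longer contexts that extend a given context, the penalty encourages long contexts to be discarded before the shorter contexts they refine.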
We show that tree-structured norms provide an efficient framework for language modeling. Furthermore, we give the first algorithm for these norms with a complexity nearly linear in the number of nodes. This leads to a memory- and time-efficient learning algorithm for generalized linear language models.
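To illustrate the kind of computation involved, the sketch below applies the standard hierarchical composition of group soft-thresholding for a tree-structured ℓ2 penalty, processing children before parents (in the spirit of [96], [65]). It recomputes subtree norms at every node for clarity, so it is an illustration of the proximal step rather than the nearly linear-time algorithm of [19]; class and function names are ours.

```python
# Sketch of a proximal step for a tree-structured l2 penalty on a suffix trie.
# Group soft-thresholding is applied at each node, children before parents.
# Subtree norms are recomputed at every node for clarity, so this is not the
# nearly linear-time algorithm of [19]; names and toy values are hypothetical.
import math

class Node:
    def __init__(self, weight, children=()):
        self.weight = weight            # parameter attached to this context
        self.children = list(children)  # longer contexts extending this one

def subtree_sq_norm(node):
    """Squared l2 norm of all weights in the subtree rooted at `node`."""
    return node.weight ** 2 + sum(subtree_sq_norm(c) for c in node.children)

def scale_subtree(node, factor):
    node.weight *= factor
    for c in node.children:
        scale_subtree(c, factor)

def prox_tree_l2(node, lam):
    """Group soft-thresholding of each subtree, applied bottom-up (children first)."""
    for c in node.children:
        prox_tree_l2(c, lam)
    norm = math.sqrt(subtree_sq_norm(node))
    factor = 0.0 if norm <= lam else 1.0 - lam / norm
    scale_subtree(node, factor)

# Toy trie: root (empty context) with two child contexts.
root = Node(0.5, [Node(1.0), Node(0.1)])
prox_tree_l2(root, lam=0.2)
print(root.weight, [c.weight for c in root.children])
```

In this toy run the weakly supported child context is zeroed out entirely while the other weights are only shrunk, which is the sparsity pattern that makes tree-structured penalties attractive for pruning long contexts.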