SIERRA - 2013 - Annual activity report

Project-Team Sierra

Section: New Results

Structured Penalties for Log-linear Language Models

Participants: Anil Nelakanti [correspondent], Cédric Archambeau, Francis Bach, Guillaume Bouchard.

Language models can be formalized as log-linear regression models where the input features represent previously observed contexts up to a certain length $m$. The complexity of existing algorithms to learn the parameters by maximum likelihood scales linearly in $nd$, where $n$ is the length of the training corpus and $d$ is the number of observed features. In [19] we present a model that grows logarithmically in $d$, making it possible to efficiently leverage longer contexts. We account for the sequential structure of natural language using tree-structured penalized objectives to avoid overfitting and achieve better generalization.
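As an illustration of this formalization, the following minimal sketch scores each candidate next symbol by summing the weights of all (context suffix, symbol) features for suffixes of length up to $m$, then normalizes with a softmax. The function names, the dictionary-based weight storage, and the toy vocabulary are assumptions made for this example; they are not taken from [19].

```python
import math


def suffix_contexts(history, m):
    """Return the suffixes of `history` of length 0..m, shortest first."""
    max_len = min(m, len(history))
    return [tuple(history[len(history) - k:]) for k in range(max_len + 1)]


def next_symbol_distribution(weights, history, vocab, m):
    """Log-linear model: softmax over summed (suffix, symbol) feature weights.

    `weights` maps (suffix_tuple, symbol) to a real number; missing
    features count as zero (hypothetical storage chosen for brevity).
    """
    contexts = suffix_contexts(history, m)
    scores = {
        y: sum(weights.get((ctx, y), 0.0) for ctx in contexts)
        for y in vocab
    }
    # Subtract the maximum score for numerical stability.
    max_score = max(scores.values())
    exp_scores = {y: math.exp(s - max_score) for y, s in scores.items()}
    z = sum(exp_scores.values())
    return {y: e / z for y, e in exp_scores.items()}


# Toy usage: a single feature that favours "b" after the context ("a",).
toy_weights = {(("a",), "b"): math.log(3.0)}
toy_probs = next_symbol_distribution(toy_weights, ["b", "a"], ["a", "b"], m=1)
```

With `m=0` only the empty context is active, so the same toy weights yield a uniform distribution; increasing $m$ adds longer suffixes and therefore more features, which is where the dependence on $d$ comes from.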

Language models are crucial parts of advanced natural language processing pipelines, such as speech recognition [45], machine translation [47], or information retrieval [92]. When a sequence of symbols is observed, a language model predicts the probability of occurrence of the next symbol in the sequence. Models based on so-called back-off smoothing have shown good predictive power [60]. In particular, Kneser-Ney (KN) and its variants [66] still achieve state-of-the-art results more than a decade after they were originally proposed. Smoothing methods are in fact clever heuristics that require tuning parameters in an ad hoc fashion. Hence, more principled ways of learning language models have been proposed based on maximum entropy [50] or conditional random fields [81], or by adopting a Bayesian approach [94].

We focus on penalized maximum likelihood estimation in log-linear models. In contrast to language models based on unstructured norms such as $ℓ_{2}$ (quadratic penalties) or $ℓ_{1}$ (absolute discounting), we use tree-structured norms [96], [65]. Structured penalties have been successfully applied to various NLP tasks, including chunking and named entity recognition [74], but not to language modeling. Such penalties are particularly well suited to this problem because they mimic the nested nature of word contexts. However, existing optimization techniques do not scale to large context lengths $m$.
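To make the idea of a tree-structured norm concrete, the sketch below evaluates a penalty of the form $\sum_{j} \|w_{\mathrm{subtree}(j)}\|_{p}$ over a small context tree, where each node is a context and its children are the contexts extended by one more word. This follows the usual definition of tree-structured norms with one group per subtree; the scalar weight per node, the dictionary-based tree, and the function names are simplifying assumptions for illustration, and the naive traversal is not meant to be efficient.

```python
import math


def subtree_values(children, weights, node):
    """Collect the weights of `node` and all of its descendants."""
    values = [weights[node]]
    for child in children.get(node, []):
        values.extend(subtree_values(children, weights, child))
    return values


def tree_norm(children, weights, root, p=2):
    """Sum over all nodes of the l_p norm of the weights in that node's subtree.

    Supports p = 2 and p = math.inf. Naive evaluation: each subtree is
    re-collected from scratch, so this is for illustration only.
    """
    total = 0.0
    stack = [root]
    while stack:
        node = stack.pop()
        vals = subtree_values(children, weights, node)
        if p == 2:
            total += math.sqrt(sum(v * v for v in vals))
        elif p == math.inf:
            total += max(abs(v) for v in vals)
        else:
            raise ValueError("only p = 2 and p = math.inf are supported")
        stack.extend(children.get(node, []))
    return total


# Toy chain of nested contexts: "" -> "a" -> "b a" (hypothetical labels).
toy_children = {"": ["a"], "a": ["b a"]}
toy_node_weights = {"": 3.0, "a": 4.0, "b a": 0.0}
l2_penalty = tree_norm(toy_children, toy_node_weights, "", p=2)
linf_penalty = tree_norm(toy_children, toy_node_weights, "", p=math.inf)
```

Because a deep node belongs to the groups of all its ancestors, its weight is penalized several times, which encourages longer contexts to be dropped before the shorter contexts they extend.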

We show that structured tree norms provide an efficient framework for language modeling. Furthermore, we give the first algorithm for structured $ℓ_{\infty}$ tree norms with a complexity nearly linear in the number of nodes. This leads to a memory-efficient and time-efficient learning algorithm for generalized linear language models.
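For orientation only, the sketch below shows a standard building block for penalties based on the $ℓ_{\infty}$ norm: the proximal operator of a flat (unstructured) $ℓ_{\infty}$ norm, computed through the Moreau decomposition as the residual of a Euclidean projection onto an $ℓ_{1}$ ball. It is not the tree algorithm of [19]; the use of a proximal step, the sort-based projection, and all function names are assumptions chosen for this example.

```python
import math


def project_l1_ball(v, radius):
    """Euclidean projection of `v` onto the l1 ball of the given radius.

    Sort-based method, O(k log k) for a vector of length k.
    """
    if radius <= 0:
        return [0.0] * len(v)
    abs_v = [abs(x) for x in v]
    if sum(abs_v) <= radius:
        return list(v)
    sorted_abs = sorted(abs_v, reverse=True)
    cumulative = 0.0
    theta = 0.0
    for k, u_k in enumerate(sorted_abs, start=1):
        cumulative += u_k
        candidate = (cumulative - radius) / k
        if u_k > candidate:
            theta = candidate  # threshold for the k largest entries
        else:
            break
    # Soft-threshold every entry by theta, keeping its sign.
    return [math.copysign(max(abs(x) - theta, 0.0), x) for x in v]


def prox_linf(v, lam):
    """Proximal operator of lam * ||.||_inf via the Moreau decomposition."""
    projection = project_l1_ball(v, lam)
    return [x - p for x, p in zip(v, projection)]


# Toy usage: the largest entry is clipped so that a total mass of lam is removed.
toy_prox = prox_linf([3.0, 1.0], 1.0)
```

In the flat case the sort dominates the cost; a structured variant would have to apply such a step to nested groups of the context tree, which is where the efficiency of a tree-specific algorithm matters.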
