Publications
Sampling from Discrete Energy-Based Models with Quality/Efficiency Trade-offs
Bryan Eikema, Germán Kruszewski, Hady Elsahar, Marc Dymetman in arXiv, 2021
Energy-Based Models (EBMs) allow for extremely flexible specifications of probability distributions. However, they do not provide a mechanism for obtaining exact samples from these distributions. Monte Carlo techniques can aid us in obtaining samples if some proposal distribution that we can easily sample from is available. For instance, rejection sampling can provide exact samples but is often difficult or impossible to apply due to the need to find a proposal distribution that upper-bounds the target distribution everywhere. Approximate Markov chain Monte Carlo sampling techniques like Metropolis-Hastings are usually easier to design, exploiting a local proposal distribution that performs local edits on an evolving sample. However, these techniques can be inefficient due to the local nature of the proposal distribution and do not provide an estimate of the quality of their samples. In this work, we propose a new approximate sampling technique, Quasi Rejection Sampling (QRS), that allows for a trade-off between sampling efficiency and sampling quality, while providing explicit convergence bounds and diagnostics. QRS capitalizes on the availability of high-quality global proposal distributions obtained from deep learning models. We demonstrate the effectiveness of QRS sampling for discrete EBMs over text for the tasks of controlled text generation with distributional constraints and paraphrase generation. We show that we can sample from such EBMs with arbitrary precision at the cost of sampling efficiency.
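To make the acceptance rule concrete, here is a minimal Python sketch of quasi rejection sampling on a toy discrete EBM; the toy energies, the hand-coded proposal q, and the constant beta are illustrative placeholders, whereas the paper works with text and uses global proposal distributions obtained from deep learning models.

import math
import random

# Toy discrete EBM over three "sentences": P(x) = exp(-energy(x)), left unnormalised.
energy = {"a": 0.0, "b": 1.0, "c": 3.0}
def score_P(x):
    return math.exp(-energy[x])

# Global proposal q; in the paper this role is played by a neural language model.
support, q_probs = ["a", "b", "c"], [0.5, 0.3, 0.2]
q = dict(zip(support, q_probs))

def quasi_rejection_sample(beta, n_accept):
    # Accept x ~ q with probability min(1, P(x) / (beta * q(x))).
    # Larger beta brings the accepted samples closer to the EBM but lowers the acceptance
    # rate; if beta upper-bounds P(x)/q(x) everywhere, this reduces to exact rejection sampling.
    accepted, proposed = [], 0
    while len(accepted) < n_accept:
        x = random.choices(support, weights=q_probs)[0]
        proposed += 1
        if random.random() < min(1.0, score_P(x) / (beta * q[x])):
            accepted.append(x)
    return accepted, len(accepted) / proposed  # samples and empirical acceptance rate

samples, acceptance_rate = quasi_rejection_sample(beta=2.0, n_accept=1000)

The acceptance rate returned alongside the samples illustrates the quality/efficiency trade-off that the paper quantifies with explicit convergence bounds.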
@article{eikema-et-al-2021-sampling,
  author     = {Bryan Eikema and Germ{\'{a}}n Kruszewski and Hady Elsahar and Marc Dymetman},
  title      = {Sampling from Discrete Energy-Based Models with Quality/Efficiency Trade-offs},
  journal    = {CoRR},
  volume     = {abs/2112.05702},
  year       = {2021},
  url        = {https://arxiv.org/abs/2112.05702},
  eprinttype = {arXiv},
  eprint     = {2112.05702}
}
Sampling-Based Minimum Bayes Risk Decoding for Neural Machine Translation
Bryan Eikema and Wilker Aziz in arXiv, 2021
In neural machine translation (NMT), we search for the mode of the model distribution to form predictions. The mode as well as other high probability translations found by beam search have been shown to often be inadequate in a number of ways. This prevents practitioners from improving translation quality through better search, as these idiosyncratic translations end up being selected by the decoding algorithm, a problem known as the beam search curse. Recently, a sampling-based approximation to minimum Bayes risk (MBR) decoding has been proposed as an alternative decision rule for NMT that would likely not suffer from the same problems. We analyse this approximation and establish that it has no equivalent to the beam search curse, i.e. better search always leads to better translations. We also design different approximations aimed at decoupling the cost of exploration from the cost of robust estimation of expected utility. This allows for exploration of much larger hypothesis spaces, which we show to be beneficial. We also show that it can be beneficial to make use of strategies like beam search and nucleus sampling to construct hypothesis spaces efficiently. We show on three language pairs (English into and from German, Romanian, and Nepali) that MBR can improve upon beam search with moderate computation.
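As a rough illustration of the decision rule, here is a minimal Python sketch of sampling-based MBR with the candidate set decoupled from the pseudo-references used to estimate expected utility; the unigram-F1 utility and the toy strings are stand-ins for the sentence-level utilities and model outputs used in the paper.

def mbr_decode(candidates, pseudo_references, utility):
    # Pick the candidate with the highest Monte Carlo estimate of expected utility.
    # candidates: hypothesis space (e.g. beam search or nucleus sampling output)
    # pseudo_references: unbiased samples from the model used for estimation
    # utility: utility(hypothesis, reference) -> float
    def expected_utility(h):
        return sum(utility(h, y) for y in pseudo_references) / len(pseudo_references)
    return max(candidates, key=expected_utility)

# Toy utility: unigram F1 between whitespace-tokenised strings (stand-in for chrF and similar metrics).
def unigram_f1(hyp, ref):
    hyp_tokens, ref_tokens = hyp.split(), ref.split()
    overlap = len(set(hyp_tokens) & set(ref_tokens))
    if overlap == 0:
        return 0.0
    precision, recall = overlap / len(hyp_tokens), overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)

best = mbr_decode(
    candidates=["the cat sat", "a cat sat down", "cat the sat"],
    pseudo_references=["the cat sat down", "the cat sat", "a cat sat"],
    utility=unigram_f1,
)

Because the candidate set and the pseudo-reference set are separate arguments, the cost of exploring a larger hypothesis space is decoupled from the cost of robustly estimating expected utility, which is the trade-off the paper analyses.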
@article{eikema-aziz-2021-sampling,
  author     = {Bryan Eikema and Wilker Aziz},
  title      = {Sampling-Based Minimum Bayes Risk Decoding for Neural Machine Translation},
  journal    = {CoRR},
  volume     = {abs/2108.04718},
  year       = {2021},
  url        = {https://arxiv.org/abs/2108.04718},
  eprinttype = {arXiv},
  eprint     = {2108.04718}
}
Is MAP Decoding All You Need? The Inadequacy of the Mode in Neural Machine Translation
Bryan Eikema and Wilker Aziz in Proceedings of the 28th International Conference on Computational Linguistics (COLING), 2020. Best Paper Award.
Recent studies have revealed a number of pathologies of neural machine translation (NMT) systems. Hypotheses explaining these mostly suggest there is something fundamentally wrong with NMT as a model or its training algorithm, maximum likelihood estimation (MLE). Most of this evidence was gathered using maximum a posteriori (MAP) decoding, a decision rule aimed at identifying the highest-scoring translation, i.e. the mode. We argue that the evidence corroborates the inadequacy of MAP decoding more than it casts doubt on the model and its training algorithm. In this work, we show that translation distributions do reproduce various statistics of the data well, but that beam search strays from such statistics. We show that some of the known pathologies and biases of NMT are due to MAP decoding and not to NMT's statistical assumptions or MLE. In particular, we show that the most likely translations under the model accumulate so little probability mass that the mode can be considered essentially arbitrary. We therefore advocate for the use of decision rules that take into account the translation distribution holistically. We show that an approximation to minimum Bayes risk decoding gives competitive results, confirming that NMT models do capture important aspects of translation well in expectation.
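In symbols, the two decision rules being contrasted can be sketched as follows, where u is a sentence-level utility, \mathcal{H}(x) is a tractable hypothesis space, and the y^{(s)} are unbiased samples from the model:

y^{\text{MAP}} = \arg\max_{y} \; p(y \mid x, \theta)

y^{\text{MBR}} = \arg\max_{h \in \mathcal{H}(x)} \; \mathbb{E}_{p(y \mid x, \theta)}\left[ u(h, y) \right] \approx \arg\max_{h \in \mathcal{H}(x)} \; \frac{1}{S} \sum_{s=1}^{S} u\left(h, y^{(s)}\right), \qquad y^{(s)} \sim p(y \mid x, \theta)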
@inproceedings{eikema-aziz-2020-is,
  title     = "Is MAP Decoding All You Need? The Inadequacy of the Mode in Neural Machine Translation",
  author    = "Eikema, Bryan and Aziz, Wilker",
  booktitle = "Proceedings of the 28th International Conference on Computational Linguistics",
  month     = dec,
  year      = "2020",
  address   = "Barcelona, Spain",
  publisher = "Association for Computational Linguistics",
}
Auto-Encoding Variational Neural Machine Translation
Bryan Eikema and Wilker Aziz in Proceedings of the 4th Workshop on Representation Learning for NLP (RepL4NLP), 2019
We present a deep generative model of bilingual sentence pairs for machine translation. The model generates source and target sentences jointly from a shared latent representation and is parameterised by neural networks. We perform efficient training using amortised variational inference and reparameterised gradients. Additionally, we discuss the statistical implications of joint modelling and propose an efficient approximation to maximum a posteriori decoding for fast test-time predictions. We demonstrate the effectiveness of our model in three machine translation scenarios: in-domain training, mixed-domain training, and learning from a mix of gold-standard and synthetic data. Our experiments show consistently that our joint formulation outperforms conditional modelling (i.e. standard neural machine translation) in all such scenarios.
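In symbols, a sketch of the joint formulation and its training objective (z is the shared sentence-level latent variable, x the source, y the target; the exact parameterisation and the conditioning of the inference network q follow the paper):

Z \sim \mathcal{N}(0, I), \qquad X \mid z \sim p(x \mid z, \theta), \qquad Y \mid x, z \sim p(y \mid x, z, \theta)

\log p(x, y \mid \theta) \geq \mathbb{E}_{q(z \mid x, \lambda)}\left[ \log p(x \mid z, \theta) + \log p(y \mid x, z, \theta) \right] - \mathrm{KL}\left( q(z \mid x, \lambda) \,\|\, p(z) \right)

The bound is estimated with a single reparameterised sample of z, so the generative and inference networks are trained jointly by stochastic gradient ascent on this ELBO.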
@inproceedings{eikema-aziz-2019-auto,
  title     = "Auto-Encoding Variational Neural Machine Translation",
  author    = "Eikema, Bryan and Aziz, Wilker",
  booktitle = "Proceedings of the 4th Workshop on Representation Learning for NLP (RepL4NLP-2019)",
  month     = aug,
  year      = "2019",
  address   = "Florence, Italy",
  publisher = "Association for Computational Linguistics",
  url       = "https://www.aclweb.org/anthology/W19-4315",
  doi       = "10.18653/v1/W19-4315",
  pages     = "124--141",
}
Bryan Eikema in UvA Scripties Online, 2018
Translation data is often a byproduct of mixing different sources of data. This can be intentional, such as mixing data from different domains or including back-translated monolingual data, but it is often also a result of how the bilingual dataset was constructed: a combination of different documents independently translated in different translation directions, by different translators, agencies, etc. Most neural machine translation models do not explicitly account for such variation in their probabilistic model. We attempt to model this by proposing a deep generative model that generates source and target sentences jointly from a shared sentence-level latent representation. The latent representation is designed to capture variation in the data distribution and allows the model to adjust its language model and translation model accordingly. We show that such a model outperforms a strong conditional neural machine translation baseline in three settings: in-domain training, where the training and test data come from the same domain; mixed-domain training, where we train on a mix of domains and test on each domain separately; and in-domain training where we also include synthetic (noisy) back-translated data. We furthermore extend the model to a semi-supervised setting in order to incorporate target-side monolingual data during training. In doing so, we derive the commonly employed back-translation heuristic as a variational approximation to the posterior over the missing source sentence. This allows the back-translation network to be trained jointly with the rest of the model on a shared objective designed for source-to-target translation, with minimal need for pre-processing. We find that this approach does not match the performance of the back-translation heuristic, but it does improve over a model trained on bilingual data alone.
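The semi-supervised derivation can be sketched as follows: for target-only monolingual data the source sentence x is treated as a latent variable and the back-translation network acts as an approximate posterior q(x | y, \phi); the sentence-level latent z is omitted here for readability:

\log p(y \mid \theta) \geq \mathbb{E}_{q(x \mid y, \phi)}\left[ \log p(x, y \mid \theta) - \log q(x \mid y, \phi) \right] = \mathbb{E}_{q(x \mid y, \phi)}\left[ \log p(x \mid \theta) + \log p(y \mid x, \theta) \right] + \mathbb{H}\left[ q(x \mid y, \phi) \right]

Maximising this bound trains the back-translation network q jointly with the generative model, which is how the usual back-translation heuristic emerges as a variational approximation.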