Statistical Machine Translation

Translation is increasingly important in our multilingual society. Using professional translators is expensive and often impractical. Machine translation (MT), automatic translation by a computer, can be used either as a draft for translators to post-edit or directly, for instance for gisting web documents or communicating on social media. Although machine translation systems have improved steadily in recent years, there is still plenty of room for improvement, especially for morphologically rich languages and non-standard domains.

Aims
In this project, we seek to improve statistical machine translation (SMT), the currently dominant MT paradigm, in which translation systems are trained on large parallel corpora of translated sentences. We focus specifically on translation into morphologically rich languages such as Finnish and German, the treatment of multi-word expressions and compounds, cross-sentence phenomena such as pronoun translation, and the directionality of training data.

Methods
Most SMT models are based on surface words. To handle compound words and translation into morphologically rich languages, we investigate how translation can instead be based on lemmas and subwords. To do this, we build upon state-of-the-art SMT tools and extend their models. To handle cross-sentence dependencies, we have also developed our own decoder, which allows us to integrate models that cross sentence boundaries, something that is hard to do with standard tools. Using these tools, we build large SMT systems trained on large corpora of human translations, and then evaluate them on unseen translations.
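As an illustration of what basing translation on subwords can involve, the sketch below splits a compound into known parts using corpus frequencies, choosing the split with the highest geometric mean of part frequencies. This is a common textbook heuristic and not necessarily the method used in the project. The function name `split_compound`, the `min_len` parameter and the frequency counts are hypothetical, and linking elements (such as the German -s-) are ignored for brevity.

```python
from math import prod


def split_compound(word, freq, min_len=3):
    """Split `word` into known parts, preferring the segmentation whose
    parts have the highest geometric mean of corpus frequencies.
    Returns [word] unchanged if no better segmentation is found."""
    word = word.lower()
    best_parts = [word]
    best_score = float(freq.get(word, 0))

    def candidates(rest):
        # Yield every segmentation of `rest` into known parts
        # that are each at least `min_len` characters long.
        if not rest:
            yield []
            return
        for i in range(min_len, len(rest) + 1):
            head = rest[:i]
            if head in freq:
                for tail in candidates(rest[i:]):
                    yield [head] + tail

    for parts in candidates(word):
        score = prod(freq[p] for p in parts) ** (1 / len(parts))
        if score > best_score:
            best_parts, best_score = parts, score
    return best_parts


# Made-up frequency counts, for illustration only.
toy_freq = {"apfel": 200, "baum": 400, "apfelbaum": 5}
print(split_compound("Apfelbaum", toy_freq))
```

With these toy counts the rare compound is split into its two frequent parts, which can then be translated separately and merged again on the target side. A word with no known parts is returned unsplit.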

Research group

PI: Dr. Sara Stymne
Dept of Linguistics and Philology, Uppsala University
Prof. Joakim Nivre
Dept of Linguistics and Philology, Uppsala University
Dr. Fabienne Cap
Dept of Linguistics and Philology, Uppsala University
Dr. Christian Hardmeier
Dept of Linguistics and Philology, Uppsala University