Machine Translation/Statistics

Statistical machine translation

Language models

Language models serve two purposes in MT: a) scoring arbitrary sequences of words (tokens), and b) predicting which token is likely to follow a given sequence of tokens. Formally, a language model is a probability distribution over sequences of tokens in a given language.
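Both uses can be illustrated with a minimal sketch of a bigram model with add-one smoothing. The toy corpus, the function names and the choice of smoothing are assumptions made for illustration only; they do not describe any particular MT system.

```python
from collections import Counter, defaultdict
import math

def train_bigram(corpus):
    """Collect bigram counts from a list of tokenised sentences."""
    context_counts = Counter()            # how often each token starts a bigram
    bigram_counts = defaultdict(Counter)  # bigram_counts[prev][cur]
    vocab = set()
    for sentence in corpus:
        tokens = ["<s>"] + sentence + ["</s>"]  # sentence boundary markers
        vocab.update(tokens)
        for prev, cur in zip(tokens, tokens[1:]):
            context_counts[prev] += 1
            bigram_counts[prev][cur] += 1
    return context_counts, bigram_counts, vocab

def prob(prev, cur, model):
    """P(cur | prev) with add-one smoothing (an illustrative choice)."""
    context_counts, bigram_counts, vocab = model
    count = bigram_counts.get(prev, {}).get(cur, 0)
    return (count + 1) / (context_counts[prev] + len(vocab))

def log_score(sentence, model):
    """Use a): log-probability of a whole token sequence."""
    tokens = ["<s>"] + sentence + ["</s>"]
    return sum(math.log(prob(p, c, model)) for p, c in zip(tokens, tokens[1:]))

def predict_next(prev, model):
    """Use b): the most probable token to follow `prev`."""
    vocab = model[2]
    candidates = sorted(vocab - {"<s>"})  # sorted for deterministic ties
    return max(candidates, key=lambda w: prob(prev, w, model))

# Hypothetical toy corpus, already tokenised.
corpus = [
    ["the", "cat", "sleeps"],
    ["the", "cat", "eats"],
    ["the", "dog", "sleeps"],
]
model = train_bigram(corpus)

fluent = log_score(["the", "cat", "sleeps"], model)
scrambled = log_score(["sleeps", "cat", "the"], model)
next_token = predict_next("the", model)
```

In this sketch the word order seen in the corpus receives a higher score than the scrambled one, and the predicted token after "the" is the one that followed it most often in the corpus.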

N-gram models

Character-based models

It has been shown that sub-words, characters, or even bytes can serve as the basic units for language modelling[citation needed]. Several events focus specifically on such models and, more generally, on processing language data at the sub-word level, e.g. SCLeM 2017.
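To make the choice of basic unit concrete, the following sketch splits one string into words, characters and UTF-8 bytes. The function name `segment` and the example string are hypothetical. Sub-word segmentation is left out because it requires a vocabulary learned from data.

```python
def segment(text):
    """Return the same text split into three kinds of basic units."""
    return {
        "words": text.split(),                # whitespace-separated tokens
        "characters": list(text),             # Unicode code points
        "bytes": list(text.encode("utf-8")),  # integers in the range 0..255
    }

# Hypothetical example: "naive cafe" written with two accented letters.
units = segment("na\u00efve caf\u00e9")
sizes = {name: len(seq) for name, seq in units.items()}
```

The byte sequence is longer than the character sequence here because each accented letter takes two bytes in UTF-8. In exchange, a byte-level model needs only 256 distinct symbols.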

Translation models

IBM models 1-5

Phrase-based models

Factored translation models

Syntax- and tree-based models

Synchronous phrase grammar

Parallel tree-banks

Syntactic rules extraction

Decoding

Beam search

Hybrid systems

Computer-aided translation

Translation memory