The success of deep learning in NLP is often narrated as making no assumptions about the language and letting the data speak for itself. Although this is debatable on many levels, one thing is particularly suspicious: most state-of-the-art NLP models assume the existence of discrete tokens and rely on subword segmentation that combines rules with simple statistical heuristics. Avoiding explicit input segmentation is more difficult than it seems.
The first part of the talk will present neural edit distance, a novel interpretable architecture based on the well-known Levenshtein distance that can be used for purely character-level tasks such as transliteration or cognate detection. In the second part of the talk, we will zoom out and look at character-level methods for neural machine translation. We will show how innovations in training and architecture design can improve translation quality. Despite this progress, character-level methods in machine translation still lag behind subword-based models in nearly every measurable respect.
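As background for the first part, the sketch below shows the classic (non-neural) Levenshtein dynamic program that the neural edit distance architecture builds on; roughly speaking, the neural variant learns the edit operation costs from data rather than fixing them to 1. The code is an illustrative sketch, not material from the talk, and the function and variable names are my own.

```python
def levenshtein(source: str, target: str) -> int:
    """Minimum number of insertions, deletions, and substitutions
    needed to turn `source` into `target`."""
    m, n = len(source), len(target)
    # dp[i][j] = edit distance between source[:i] and target[:j]
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        dp[i][0] = i  # delete all characters of source[:i]
    for j in range(n + 1):
        dp[0][j] = j  # insert all characters of target[:j]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            substitution_cost = 0 if source[i - 1] == target[j - 1] else 1
            dp[i][j] = min(
                dp[i - 1][j] + 1,                      # deletion
                dp[i][j - 1] + 1,                      # insertion
                dp[i - 1][j - 1] + substitution_cost,  # substitution / match
            )
    return dp[m][n]


print(levenshtein("kitten", "sitting"))  # 3
```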