Linguistics Research Seminar

This seminar hosts invited speakers specializing in various areas of linguistics. Members of the Department, students, and interested outside visitors are all cordially invited.

Seminar description

Title Interpretable word splits in language processing
Speaker Tanja Samardzic (Universität Zürich)
Date Tuesday, 4 February 2020
Time 12:15
Room L502 (Bâtiment Candolle), room change
Description Words are commonly regarded as basic linguistic units (dictionary entries, terminal nodes in syntactic trees, minimal translation units, etc.) in both language analysis and processing. The tempting idea that splitting words into smaller segments should improve automatic processing (e.g. machine translation) has been explored, but for a long time it proved hard to confirm empirically. When deep neural networks were introduced in language processing, the input started being encoded at two levels: character and word. Recently, an intermediate level, obtained by means of text compression with Byte Pair Encoding (BPE), has been adopted as a standard pre-processing technique. In this talk, I will present experiments performed to investigate the impact of morphological segmentation on language processing, focusing on the task of automatic inflection (converting lemmas into inflected forms given a morphosyntactic definition, e.g. 'hug' + past tense -> 'hugged'). Our analysis of the output of a neural model in various settings has two goals. First, we zoom in and target specific morphological phenomena to determine whether they are captured by the version of the model that allows intermediate segments. Second, we take advantage of the highly multilingual data set to zoom out and investigate the impact of the language type on the optimal sub-word processing level.
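For readers unfamiliar with the intermediate level mentioned in the abstract, the following is a minimal sketch of how BPE-style segmentation can be learned from a word list. It is a toy illustration, not the speaker's implementation: the function names (learn_bpe, merge_pair, segment), the "</w>" end-of-word marker, and the tie-breaking rule are assumptions made for this example.

```python
from collections import Counter


def merge_pair(symbols, pair):
    """Replace every adjacent occurrence of `pair` in `symbols` by one merged symbol."""
    out = []
    i = 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return tuple(out)


def learn_bpe(words, num_merges):
    """Learn up to `num_merges` merge operations from a list of word tokens."""
    # Each word starts as a sequence of characters plus an end-of-word marker
    # (the "</w>" marker is an assumption of this sketch).
    vocab = Counter(tuple(w) + ("</w>",) for w in words)
    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs, weighted by word frequency.
        pairs = Counter()
        for symbols, freq in vocab.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        # Pick the most frequent pair; ties are broken by string order
        # so that the result is deterministic (an arbitrary choice).
        best = max(pairs, key=lambda p: (pairs[p], p))
        merges.append(best)
        new_vocab = Counter()
        for symbols, freq in vocab.items():
            new_vocab[merge_pair(symbols, best)] += freq
        vocab = new_vocab
    return merges


def segment(word, merges):
    """Segment a word by applying the learned merges in order."""
    symbols = tuple(word) + ("</w>",)
    for pair in merges:
        symbols = merge_pair(symbols, pair)
    return list(symbols)
```

Calling segment("hugged", learn_bpe(corpus_words, 100)), where corpus_words is a hypothetical list of tokens, would return sub-word units that depend entirely on corpus frequencies. Such units need not coincide with morphemes such as 'hug' and '-ed', which is the gap between compression-based and morphological segmentation that the talk examines.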
