Linguistics Research Seminar
This seminar hosts invited speakers specializing in different areas of linguistics. Members of the Department, students, and interested external attendees are all cordially invited.
Seminar description
| Title | Probing linguistic knowledge in language models with natural and synthetic structured datasets |
| Speaker | Giuseppe Samo |
| Date | Tuesday, 21 April 2026 |
| Time | 12:15 |
| Room | L208 (Bâtiment Candolle) |
| Description | In this talk, I present recent results on the creation and evaluation of structured datasets to probe linguistic knowledge, with a special focus on verb alternation phenomena across languages (Samo & Merlo 2026b), explored through the Blackbird Language Matrices (BLMs) task (Merlo 2023, Merlo et al. 2026). I also discuss the quality of the instantiation of these structured datasets using both natural (Samo & Merlo 2026c) and synthetic data (Samo et al. 2023, Samo & Merlo 2026a). Finally, I reflect on how these types of results may inform linguistic theory (Merlo & Samo, to appear).
References
Merlo, P. (2023). Blackbird language matrices (BLM), a new task for rule-like generalization in neural networks: Motivations and Formal Specifications. https://arxiv.org/html/2306.11444
Merlo, P., & Samo, G. (to appear). Generative Computational Modelling. In The Cambridge Handbook of Minimalism and Its Applications (E. Leivada & K.K. Grohmann, eds.). Preprint available at: https://ling.auf.net/lingbuzz/007675
Merlo, P., Jiang, C., Samo, G., & Nastase, V. (2026). Blackbird Language Matrices: A Framework to Investigate the Linguistic Competence of Language Models. https://arxiv.org/abs/2602.20966
Samo, G., & Merlo, P. (2026a). Comparing Natural and Synthetic Structured Data: A Study of the Passive Verb Alternation in French and Italian. In Proceedings of the Workshop on Structured Linguistic Data and Evaluation (SLiDE). Preprint available at: https://doi.org/10.48550/arXiv.2603.25227
Samo, G., & Merlo, P. (2026b). Datasets for Verb Alternations across Languages: BLM Templates and Data Augmentation Strategies. In Proceedings of LREC 2026. Preprint available at: https://arxiv.org/pdf/2603.15295v1
Samo, G., & Merlo, P. (2026c). Modelling the Morphology of Verbal Paradigms: A Case Study in the Tokenization of Turkish and Hebrew. In Proceedings of the Second Workshop on Natural Language Processing for Turkic Languages (SIGTURK 2026), pages 82–94, Rabat, Morocco. Association for Computational Linguistics.
Samo, G., Nastase, V., Jiang, C., & Merlo, P. (2023). BLM-s/lE: A structured dataset of English spray-load verb alternations for testing generalization in LLMs. In Findings of the Association for Computational Linguistics: EMNLP 2023, pages 12276–12287, Singapore. Association for Computational Linguistics.
|
| Attachment(s) |
- |