E-BULLETIN

Interview

SPOTLIGHT ON… RAPHAËL RUBINO

Raphaël Rubino is an expert in machine learning and translation. He was awarded a PhD in Computer Science by the University of Avignon in 2011 and spent the next fifteen years or so at various institutions in Europe, Asia and North America, working on joint projects in Natural Language Processing and Artificial Intelligence. His research has been widely published in journals and he is a frequent speaker at conferences and workshops, reflecting his international collaborations. He joined the Department of Translation Technology in 2023 under the direction of Professor Pierrette Bouillon. In this interview, he presents the project he is currently working on: RCnum.


Can you tell us about the RCnum project?

The SNSF-funded research project’s full name is “Une édition sémantique et multilingue en ligne des registres du Conseil de Genève (1545-1550)” (A Semantic and Multilingual Online Edition of the Geneva Council Registers from 1545 to 1550), RCnum for short. It was launched in 2023 by Professors Pierrette Bouillon, Laurent Moccozet and Stéphane Marchand-Maillet and is due to be completed in 2027. The aim is to develop tools enabling scholars to access and study the Geneva Council registers from the time of Calvin. The registers, previously available in print for the period from 1536 to 1544, are a peerless source for political, legal, economic, social, and religious life in Geneva and, more broadly, for how Genevans thought about the world in the sixteenth century. They are also very interesting from a linguistic point of view. RCnum will put online an enriched, normalized, modernized version of the Geneva Council registers for the years 1545 to 1550, with translations into several languages, including English, German, and Italian.

What is the aim of the project?

Normalization will reduce variant spellings, while modernization will provide intralingual translation into modern-day French. The resource will be available in open-access format on a stable, ergonomic platform to meet the needs of a range of user profiles with their own specific requirements and expectations. RCnum will open up access to a huge number of documents crucial to both local and international history, in a variety of formats.

Who are the project partners?

The project partners are the University of Geneva’s Faculty of Translation and Interpreting (FTI) and the computing centre, the Centre universitaire d’informatique (CUI). At the FTI, Pierrette Bouillon, Mathilde Fontanet, Johanna Gerlach, and Jonathan Mutal are modernizing and translating the texts and developing the data interfaces for users to consult. At the CUI, Gilles Falquet, Stéphane Marchand-Maillet, Laurent Moccozet, Christophe Chazalon, Marco Sorbi and Hélène de Ribaupierre are working on data enrichment and data visualization. We have also joined forces with Sandra Coram-Mekkey, an expert paleographer and historian at the Fondation de l'Encyclopédie de Genève, who is on the transcription team, while Christophe Chazalon is also bringing his skills as a historian to the modernization team.

What initially drew you to the project?

Its multidisciplinary nature and the post-completion applications make it a very attractive project. Involving various partners expands the possibilities of AI by combining large language models (LLMs) with the expertise of historians, linguists, and translation specialists. It also foregrounds the limits of the current models: we need the expertise of project partners for tasks like modernization and translation. Applying AI to the Geneva Council registers is groundbreaking and adds to our sum of knowledge in digital humanities by establishing methods specific to data from registers, while at the same time developing approaches that can be applied regardless of the language and document type. I also think it’s very important to preserve our historical heritage.

What is your specific role on the project?

My speciality is natural language processing, especially machine translation. I use machine translation techniques to produce normalized versions of the Geneva Council registers for experts and modernized versions for a broader audience.

What does your work involve?

At the moment, I am exploring the potential of LLMs, which can be defined as artificial neural networks trained on vast quantities of data. We are studying how they adapt to the tasks of normalizing and modernizing the Geneva Council registers into modern-day French and translating them into various languages. The latter two tasks are based on contemporary versions of natural language. Normalization calls for in-depth research into data preparation and how the models work and also means generating synthetic data.

What have been the main hurdles you have had to overcome as a team?

The vast majority of current approaches are data-driven, i.e. a huge quantity of text is used to derive a model that meets a mathematically expressed objective. Normalizing the content of the Geneva Council registers to current publishing standards is well beyond the capacity of AI tools for a general audience. Adapting LLMs is one of the main planks of my work, using data generated by other members of the project team.

What future directions might the research take?

One might be combining several sources of data, such as the Council registers, their surface annotations and semantics via knowledge graphs, to make the most of the work of the CUI. Another might be based on continual learning, which improves language models via an iterative protocol alternating between refining the models and manual corrections. This is the method we are using at the moment to normalize the Council registers: Sandra Coram-Mekkey is manually post-editing the text and we are looking at applying the same process to the modernization stage, drawing on the expertise of Christophe Chazalon, Mathilde Fontanet and Pierrette Bouillon.