Master and Certificate Projects

Below I list current research topics in computational linguistics and NLP that would be appropriate for a Master's thesis (mémoire) or a certificate thesis.

They require roughly one semester's worth of work.

Please contact me if you are interested.

 

For students in linguistics (these projects require linguistic training)

 

  • Greenberg's word order universals

Current explanations of word order universals have moved away from
trying to define possible or impossible word orders and now
concentrate on the frequency distributions of attested word orders,
so that the theories can withstand the discovery of new, but
clearly infrequent, word orders (Cysouw, 2010). However, since these
probabilistic typological theories have concentrated on predicting
the dominant word order (which is often, although not always, the most
frequent one), important information about how dominant the order
really is gets lost. By looking at collections of annotated corpora, we
can work with complex frequency distributions without the simplifying
concept of a dominant word order.

Project: Verify Greenberg's universals using annotated corpora.
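
As a rough illustration of working with a full frequency distribution instead of a single dominant order, the sketch below computes relative frequencies and Shannon entropy over word-order observations. The use of entropy as the measure, the function names, and the counts are all assumptions made for illustration; they are not part of the project description and are not results from any corpus.

```python
import math
from collections import Counter


def order_distribution(observations):
    """Turn a list of observed order labels (e.g. 'VO', 'OV') into relative frequencies."""
    counts = Counter(observations)
    total = sum(counts.values())
    return {order: n / total for order, n in counts.items()}


def entropy(distribution):
    """Shannon entropy in bits; 0 means one order is used exclusively."""
    return -sum(p * math.log2(p) for p in distribution.values() if p > 0)


# Invented toy observations, NOT real corpus counts.
rigid = ["VO"] * 98 + ["OV"] * 2
flexible = ["VO"] * 55 + ["OV"] * 45

rigid_dist = order_distribution(rigid)
flexible_dist = order_distribution(flexible)

# Both toy languages have 'VO' as the most frequent order, but their entropies differ.
print(max(rigid_dist, key=rigid_dist.get), entropy(rigid_dist))
print(max(flexible_dist, key=flexible_dist.get), entropy(flexible_dist))
```

In this toy setting both languages would be labelled "VO-dominant", yet the entropy values separate a nearly rigid order from a nearly free one, which is the kind of information a dominant-order label discards.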

 

  • Movement theories of Universals

 

We develop quantitative models for some core claims about the cost of
operations in generative grammar, for example types of movement
operations, as in Cinque (2005, 2013), and verify their empirical
validity on large amounts of text and across languages. Here we will
extend the classification/regression method pioneered in Cysouw
(2010), Merlo (2015), and Merlo and Ouwayda (forthcoming), and apply
it to other explanations.

Project: Can different word order universals, such as Greenberg's, be
explained by the same movement costs?

 

  • Crowd-sourced/Web data collection on the notion of spontaneity

Some verbs in some languages participate in the causative alternation
while their counterparts in other languages do not. The results of a
corpus-based study suggest that the property which underlies this
variation is the spontaneity of the event described by the verb.
There are at least two ways in which spontaneity can be
quantitatively assessed: first, by observing the typological
distribution of causative and anticausative morphological marking
across a wide range of languages; second, by observing the frequency
distribution of causative and anticausative uses of the alternating
verbs in a corpus of a single language. Our study shows that these two
measures are correlated (Samardzic and Merlo 2014, to appear).

Project: The corpus-based measures need to be validated by a human study that
measures the spontaneity of the event as perceived by speakers.
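
One way such a validation could be analysed, once ratings are collected, is to rank-correlate the human judgements with the corpus-based measure. The sketch below implements Spearman's rank correlation in plain Python. The choice of Spearman's coefficient, the verb labels, and all numbers are assumptions for illustration only; they are not data or methods from the study cited above.

```python
def ranks(values):
    """Assign 1-based ranks, giving tied values the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with the one at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = mean_rank
        i = j + 1
    return result


def pearson(x, y):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)


def spearman(x, y):
    """Spearman's rho: Pearson correlation computed on the ranks."""
    return pearson(ranks(x), ranks(y))


# Invented placeholder values, NOT real measurements.
verbs = ["verb_a", "verb_b", "verb_c", "verb_d"]
corpus_measure = [0.10, 0.35, 0.60, 0.90]  # hypothetical corpus-based score per verb
human_rating = [1.5, 3.0, 2.5, 4.5]        # hypothetical mean speaker rating per verb

rho = spearman(corpus_measure, human_rating)
print(f"Spearman rho over {len(verbs)} verbs: {rho:.2f}")
```

A rank-based coefficient is used here because rating scales from speakers are ordinal; a real study would also need a significance test and far more verbs than this toy example.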

 

  • Other possible topics

 Prepositions

 Different measures of locality: different measures of interference and similarity

 

For students in computer science (these projects require good programming skills) 

  • Humour: implement a model of humour based on existing proposals

Artificial Intelligence and Natural Language Processing have made
great progress in recent years. IBM's Watson can answer quiz
questions as well as any person, Google Translate is one of Google's
most used applications, and Siri allows you to talk to your
phone. While fact-seeking and fact-providing in natural language, in
any language, appear to be finally within reach, the next step in
artificial intelligence has shifted to affective expressions, in
language or other means of communication. One of the most common, and
most enjoyable, affective expressions in language is humour. Many
Siri users indicate that a better sense of humour would be a very
desirable feature. While there is certainly no funny computer
programme at the moment, avenues to learn some simple elements of
humour from large amounts of text can be envisaged. Some approaches
have been proposed based on the notion of concomitant ambiguity and
incongruity as necessary ingredients of a joke.

Project: We will focus here on understanding simple jokes.

 

  • NLP/Parsing topics

 

Our group has developed some of the most accurate neural network
syntactic and semantic parsers available today, for many languages.
Use by more naive or simply occasional users, however, is hampered by
the complex and unfriendly interface and the unwieldy process needed to
retrain the parser. Each of the topics below is an interesting
exercise in software engineering and HCI.

 

Projects

 

  • build an interface for the SSN parser to easily train and test parsers for all languages

 

  • build an interface to train and use the parser in Universal Dependencies format for all existing languages

 

  • provide the ability to adjust parser parameters and add word embeddings