Machine learning

Automatic document classification & clustering

To extract valuable information from textual data, an essential first step is to determine automatically which category a document belongs to.

When subtle categories need to be defined a posteriori (that is, they were not available when the description or metadata of the document was generated), automatic classification algorithms save a tremendous amount of manual work and enable finer and quicker knowledge extraction.

Possible applications range from the classification of whole documents (e.g. retaining only radiology reports that describe a new scaphoid fracture) to the classification of shorter textual items (e.g. mapping medical concepts to international or local medical classifications such as ICD-10, CHOP, etc.).

    • State-of-the-art methods: Naïve Bayes, SVM, linear classifiers
    • More advanced hybrid approaches
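
As an illustration of the first family of methods listed above, here is a minimal sketch of a multinomial Naïve Bayes classifier with add-one smoothing, written with the Python standard library only. The four training reports, the labels "fracture" and "no_fracture", and the names tokenize and NaiveBayesClassifier are invented for this example. The tokenizer assumes unaccented English text, so it would need adapting for French narratives, and a real system would be trained on a much larger annotated corpus.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase the text and keep runs of ASCII letters (assumption: English input)."""
    return re.findall(r"[a-z]+", text.lower())


class NaiveBayesClassifier:
    """Multinomial Naive Bayes with add-one (Laplace) smoothing."""

    def __init__(self):
        self.doc_counts = Counter()              # number of documents per category
        self.word_counts = defaultdict(Counter)  # token counts per category
        self.vocabulary = set()

    def fit(self, documents, labels):
        for text, label in zip(documents, labels):
            tokens = tokenize(text)
            self.doc_counts[label] += 1
            self.word_counts[label].update(tokens)
            self.vocabulary.update(tokens)
        return self

    def predict(self, text):
        total_docs = sum(self.doc_counts.values())
        vocab_size = len(self.vocabulary)
        # Tokens never seen during training carry no information and are skipped.
        tokens = [t for t in tokenize(text) if t in self.vocabulary]
        scores = {}
        for label, n_docs in self.doc_counts.items():
            counts = self.word_counts[label]
            total_tokens = sum(counts.values())
            score = math.log(n_docs / total_docs)  # log prior
            for token in tokens:                   # smoothed log likelihoods
                score += math.log((counts[token] + 1) / (total_tokens + vocab_size))
            scores[label] = score
        return max(scores, key=scores.get)


# Invented toy reports, for illustration only.
reports = [
    "Acute fracture of the scaphoid waist, no displacement.",
    "New scaphoid fracture line visible on the oblique view.",
    "No fracture or dislocation. Normal alignment of the carpal bones.",
    "Normal wrist radiograph, no acute bone lesion.",
]
labels = ["fracture", "fracture", "no_fracture", "no_fracture"]

classifier = NaiveBayesClassifier().fit(reports, labels)
print(classifier.predict("Oblique view shows a new fracture line of the scaphoid."))
print(classifier.predict("Normal alignment, no acute lesion."))
```

Because the model treats a report as a bag of words, it cannot tell "no fracture" from "fracture" on its own. This is one reason why more advanced or hybrid approaches are often needed for clinical text.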

Language-specific word embeddings for clinical & medical data

With “You shall know a word by the company it keeps”, J. R. Firth laid the foundation as early as 1957 for modern word representations, also known as “word embeddings”.

The basic idea is to represent a unit of text (for example a word) as a vector rather than as an index in a vocabulary, so that the representation can be learned automatically and without supervision from co-occurrences in large text corpora.

This technique performs very well at handling words with several meanings (e.g. a “fish bank” versus a “bank account”) and at capturing similarities between concepts (“broken bones” will be represented similarly to “fractured bones” or “fracture of a bone”). French medical narratives are highly specific and call for advanced, specialized word embeddings.

    • Word2Vec
    • Pretrained models, fine-tuned on clinical notes
    • Pretrained models, fine-tuned on medical literature in French
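
The idea of learning word vectors from co-occurrences can be sketched with simple counts. The example below is not Word2Vec, which learns dense vectors by training a prediction model. It only builds sparse count vectors from a context window and compares them with cosine similarity. The five-sentence corpus, the window size of 2 and the function names are assumptions made for this illustration.

```python
import math
import re
from collections import Counter, defaultdict


def cooccurrence_vectors(sentences, window=2):
    """Map each word to a Counter of the words seen within `window` positions of it."""
    vectors = defaultdict(Counter)
    for sentence in sentences:
        tokens = re.findall(r"[a-z]+", sentence.lower())
        for i, word in enumerate(tokens):
            start = max(0, i - window)
            end = min(len(tokens), i + window + 1)
            for j in range(start, end):
                if j != i:
                    vectors[word][tokens[j]] += 1
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(u[k] * v[k] for k in u if k in v)
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


# Invented toy corpus, for illustration only.
corpus = [
    "the patient has a broken bone in the wrist",
    "the patient has a fractured bone in the wrist",
    "the patient opened a bank account last year",
    "the radiograph shows a broken bone",
    "the radiograph shows a fractured bone",
]

vectors = cooccurrence_vectors(corpus)
similar = cosine(vectors["broken"], vectors["fractured"])
unrelated = cosine(vectors["broken"], vectors["account"])
print(f"broken ~ fractured: {similar:.2f}")
print(f"broken ~ account:   {unrelated:.2f}")
```

In this toy corpus "broken" and "fractured" occur in identical contexts, so their vectors point in the same direction, while "account" shares only the article "a" with them. Pretrained and fine-tuned models apply the same principle to much larger corpora and produce dense, lower-dimensional vectors.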

Information retrieval

The volume of generated data is growing exponentially, and automatic tools for extracting meaningful information are becoming critical in many fields. In healthcare, up to 80% of valuable information is hidden in free text.

As an example, the automatic detection of adverse drug effects improves patient safety and makes it possible to find correlations across large patient populations.

Information retrieval tools aim to find specific pieces of information in large data collections, which can then be analyzed manually within a reasonable timeframe or used in various clinical decision support systems.

    • Side effect detection, named entity recognition, etc.
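
A very simple form of side effect detection can be sketched as a lexicon lookup: flag every sentence that mentions both a known drug and a known effect, so that a human can review a handful of candidate sentences instead of the whole record. The two lexicons, the sample note and the function name below are hypothetical. A production system would rely on curated terminologies and a trained named entity recognition model rather than exact word matching.

```python
import re

# Hypothetical lexicons for illustration; real systems use curated terminologies.
DRUGS = {"amoxicillin", "ibuprofen", "warfarin"}
EFFECTS = {"rash", "nausea", "bleeding"}


def find_drug_effect_mentions(text):
    """Return (drug, effect, sentence) triples for sentences that mention both."""
    hits = []
    # Naive sentence split on whitespace that follows '.', '!' or '?'.
    for sentence in re.split(r"(?<=[.!?])\s+", text.strip()):
        tokens = set(re.findall(r"[a-z]+", sentence.lower()))
        for drug in sorted(DRUGS & tokens):
            for effect in sorted(EFFECTS & tokens):
                hits.append((drug, effect, sentence))
    return hits


# Invented clinical note, for illustration only.
note = (
    "Patient started amoxicillin on Monday. "
    "Developed a rash after the second dose of amoxicillin. "
    "Ibuprofen was given for pain."
)

hits = find_drug_effect_mentions(note)
for drug, effect, sentence in hits:
    print(f"{drug} / {effect}: {sentence}")
```

A co-mention is only a candidate for review, not evidence of causality. This sketch also ignores negation ("no rash"), spelling variants and multi-word terms, which are the kinds of problems that named entity recognition and more advanced models are meant to address.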