Machine Learning

Automatic classification

To extract valuable information from textual data, an essential first step is to determine automatically which category a document belongs to.

When subtle categories need to be defined a posteriori (that is, they were not available when the description or metadata of the document was generated), automatic classification algorithms save a tremendous amount of manual work and enable finer and quicker knowledge extraction.

Possible applications range from the classification of whole documents (e.g. retaining only the radiology reports that describe a new scaphoid fracture) to the classification of shorter textual items (e.g. mapping medical concepts to international or local medical classifications such as ICD-10 or CHOP).
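As an illustration of document classification, here is a minimal sketch of a bag-of-words naive Bayes classifier written with the Python standard library only. It is not the method used in any particular project; the function names (train_naive_bayes, classify), the labels and the four toy "reports" are all invented for this example.

```python
import math
from collections import Counter, defaultdict


def train_naive_bayes(labelled_docs):
    """Count words per label from a list of (text, label) pairs."""
    word_counts = defaultdict(Counter)  # label -> word -> count
    label_counts = Counter()            # label -> number of documents
    vocabulary = set()
    for text, label in labelled_docs:
        tokens = text.lower().split()
        label_counts[label] += 1
        word_counts[label].update(tokens)
        vocabulary.update(tokens)
    return word_counts, label_counts, vocabulary


def classify(text, model):
    """Return the label with the highest log-probability for the text."""
    word_counts, label_counts, vocabulary = model
    total_docs = sum(label_counts.values())
    best_label, best_score = None, -math.inf
    for label in label_counts:
        # Log prior of the label.
        score = math.log(label_counts[label] / total_docs)
        total_words = sum(word_counts[label].values())
        for token in text.lower().split():
            if token not in vocabulary:
                continue  # ignore words never seen during training
            # Add-one (Laplace) smoothing avoids zero probabilities.
            score += math.log(
                (word_counts[label][token] + 1)
                / (total_words + len(vocabulary))
            )
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Toy training data, invented for illustration only.
training_data = [
    ("new fracture of the scaphoid", "fracture"),
    ("acute scaphoid fracture line visible", "fracture"),
    ("no fracture seen normal scaphoid", "no_fracture"),
    ("normal wrist no abnormality", "no_fracture"),
]
model = train_naive_bayes(training_data)
label_a = classify("acute fracture line of the scaphoid", model)
label_b = classify("normal wrist no abnormality seen", model)
```

A bag-of-words model of this kind ignores word order, so it has no real notion of negation ("no fracture"); a realistic system would need far more training data and a richer text representation.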

Word Embeddings

With “You shall know a word by the company it keeps”, J. R. Firth laid the foundation, as early as 1957, for modern word representations, also known as “word embeddings”.

The basic idea is to represent a unit of text (for example, a word) as a vector rather than as an index in a vocabulary, which enables automatic, unsupervised “learning” from co-occurrences in large textual corpora.

This technique performs very well at handling words with several meanings (e.g. “fish bank” versus “bank account”) and at capturing similarities between concepts (“broken bones” will be represented similarly to “fractured bones” or “fracture of a bone”). French medical narratives are very specific and call for advanced, specialized word embeddings.
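The idea of deriving word vectors from co-occurrences can be sketched with plain counting. The example below is a simplified, count-based stand-in for learned embeddings, not a trained model: each word is represented by the counts of the words that appear near it, and two words are compared with cosine similarity. The function names, the window size and the three-sentence corpus are assumptions made for this illustration.

```python
import math
from collections import Counter, defaultdict


def cooccurrence_vectors(sentences, window=2):
    """Represent each word by the counts of words within `window` positions."""
    vectors = defaultdict(Counter)
    for sentence in sentences:
        tokens = sentence.lower().split()
        for i, word in enumerate(tokens):
            start = max(0, i - window)
            end = min(len(tokens), i + window + 1)
            for j in range(start, end):
                if j != i:
                    vectors[word][tokens[j]] += 1
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(count * v[key] for key, count in u.items() if key in v)
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


# Toy corpus, invented for illustration only.
corpus = [
    "the patient has a broken wrist bone",
    "the patient has a fractured wrist bone",
    "the patient opened a bank account",
]
vectors = cooccurrence_vectors(corpus)

# "broken" and "fractured" occur in the same contexts, "account" does not.
sim_close = cosine(vectors["broken"], vectors["fractured"])
sim_far = cosine(vectors["broken"], vectors["account"])
```

In this toy corpus "broken" and "fractured" share exactly the same neighbours, so their vectors are far closer to each other than either is to "account". Embeddings used in practice are dense vectors learned from much larger corpora, but they rest on the same co-occurrence principle.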

Information Retrieval

The volume of data generated is growing exponentially, and automatic tools for extracting meaningful information are becoming critical in many fields. In health care, up to 80% of valuable information is hidden in free text.

As an example, the automatic detection of adverse drug effects improves patient safety and makes it possible to find correlations across large collections of patient data.

Information retrieval tools aim to find specific pieces of information in large data collections; the results can then be analyzed manually within a reasonable timeframe or used in clinical decision support systems.
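To make the retrieval step concrete, the following sketch ranks a handful of free-text notes against a query using a simple TF-IDF score (term frequency weighted by how rare the term is across the collection). It is one common baseline among many, shown under simplifying assumptions: whitespace tokenization, no stemming, and invented notes. The names build_index and search are hypothetical.

```python
import math
from collections import Counter


def build_index(documents):
    """Compute per-document term counts and an IDF weight for each term."""
    tokenized = [doc.lower().split() for doc in documents]
    doc_freq = Counter()
    for tokens in tokenized:
        doc_freq.update(set(tokens))  # count each term once per document
    n_docs = len(documents)
    # Rare terms get a higher weight than terms found in most documents.
    idf = {term: math.log(n_docs / df) + 1.0 for term, df in doc_freq.items()}
    term_counts = [Counter(tokens) for tokens in tokenized]
    return term_counts, idf


def search(query, index, top_k=3):
    """Return the ids of the best-matching documents, best first."""
    term_counts, idf = index
    terms = query.lower().split()
    scored = []
    for doc_id, counts in enumerate(term_counts):
        score = sum(counts[t] * idf.get(t, 0.0) for t in terms)
        if score > 0:
            scored.append((score, doc_id))
    # Highest score first; ties are broken by document id.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [doc_id for _, doc_id in scored[:top_k]]


# Toy notes, invented for illustration only.
notes = [
    "patient developed a rash after starting amoxicillin",
    "follow-up visit no complaints reported",
    "severe rash and itching possibly drug related",
    "amoxicillin prescribed for ear infection",
]
index = build_index(notes)
results = search("rash amoxicillin", index)
```

The note that mentions both query terms is ranked first, and the note that mentions neither is not returned. The short list of hits is what a clinician or a downstream decision support system would then review.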