Machine Learning - Biomedical Data Science Facility (BDSF)

Discovery of biomarkers and predictive models

We perform machine learning (ML) analyses to help you reveal patterns in complex biomedical data (e.g. bulk or single-cell transcriptomics, imaging, questionnaires) and extract scientifically relevant information. Our approach emphasises data processing quality, prevention of optimistic bias (by avoiding overfitting), robust validation (cross-validation, bootstrap) and biological interpretation.

What we offer

Predictive modelling (regularised regression, Random Forest, XGBoost gradient boosting, neural networks), with the method chosen according to the trade-off between performance and interpretability.
Signature extraction (gene lists that simplify biological interpretation).
Variable selection (initial filtering on magnitude of variation, followed by embedded methods such as glmnet or sPLS).
Multi-cohort data harmonisation (normalisation, batch effect correction, clinical annotation harmonisation).
Model validation and tuning with procedures that guard against overfitting (training/test split, cross-validation, bootstrapping).
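
As an illustration of why validation and variable selection have to be combined carefully, the following is a minimal sketch, not our production pipeline. It assumes synthetic data and a deliberately simple nearest-centroid classifier standing in for the models listed above. All function names (make_toy_data, top_variance_genes, fit_centroids, predict, cross_validate) are hypothetical. The point it shows is that the variation filter is recomputed inside each training fold, so the held-out samples never influence which genes are selected.

```python
import random
import statistics


def make_toy_data(n_samples=60, n_genes=30, n_informative=3, seed=0):
    """Synthetic expression matrix: only the first few genes differ by class."""
    rng = random.Random(seed)
    X, y = [], []
    for i in range(n_samples):
        label = i % 2
        row = [rng.gauss(0.0, 0.3) for _ in range(n_genes)]
        for g in range(n_informative):
            row[g] = rng.gauss(3.0 * label, 1.0)  # class-dependent shift
        X.append(row)
        y.append(label)
    return X, y


def top_variance_genes(X, k):
    """Indices of the k genes with the highest variance across the given samples."""
    n_genes = len(X[0])
    variances = [statistics.pvariance([row[g] for row in X]) for g in range(n_genes)]
    return sorted(range(n_genes), key=lambda g: variances[g], reverse=True)[:k]


def fit_centroids(X, y, genes):
    """Nearest-centroid model: mean expression per class on the selected genes."""
    centroids = {}
    for label in set(y):
        rows = [x for x, lab in zip(X, y) if lab == label]
        centroids[label] = [statistics.fmean([r[g] for r in rows]) for g in genes]
    return centroids


def predict(centroids, x, genes):
    """Return the label whose centroid is closest (squared Euclidean distance)."""
    def dist(centroid):
        return sum((x[g] - c) ** 2 for g, c in zip(genes, centroid))
    return min(centroids, key=lambda label: dist(centroids[label]))


def cross_validate(X, y, n_folds=5, k_genes=5, seed=0):
    """K-fold cross-validation with gene selection redone on every training fold."""
    idx = list(range(len(X)))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::n_folds] for i in range(n_folds)]
    accuracies = []
    for held_out in folds:
        held = set(held_out)
        train = [i for i in idx if i not in held]
        X_tr = [X[i] for i in train]
        y_tr = [y[i] for i in train]
        genes = top_variance_genes(X_tr, k_genes)  # training samples only
        model = fit_centroids(X_tr, y_tr, genes)
        correct = sum(predict(model, X[i], genes) == y[i] for i in held_out)
        accuracies.append(correct / len(held_out))
    return accuracies


X, y = make_toy_data()
scores = cross_validate(X, y)
print(f"mean cross-validated accuracy: {statistics.fmean(scores):.2f}")
```

Selecting genes on the full dataset before splitting would let information from the held-out samples leak into the model and tends to inflate the reported performance. In a real project the same principle would apply to normalisation and batch correction steps.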

What we need from you

The question and phenotype to be predicted, with a clear definition of the ‘gold standard’.
The data (expression, metadata/annotations) and the sharing and access constraints.
A scientific contact to validate the choices (metrics, interpretability vs. complexity trade-offs).

Example: predictive biomarkers with PAGEpy

PAGEpy (Predictive Analysis of Gene Expression in Python) is an open-source Python program for quickly testing whether a multi-layer neural network can predict a target variable from a gene expression dataset. It integrates a train/test split, selection of variable genes, and optimisation of that selection with Particle Swarm Optimisation (PSO).
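
For readers unfamiliar with PSO, the sketch below shows the general idea in plain Python. It is a generic, textbook-style continuous PSO and not PAGEpy's implementation or API. The function names (pso_minimise, sphere) and all parameter values are assumptions chosen for illustration, and a toy objective stands in for whatever score a gene-selection tool would actually optimise. Each particle moves under the combined pull of its own best position so far and the best position found by the whole swarm.

```python
import random


def pso_minimise(objective, dim, n_particles=20, n_iter=100,
                 inertia=0.7, c_personal=1.5, c_global=1.5,
                 bounds=(-5.0, 5.0), seed=0):
    """Minimise `objective` over a box using a basic particle swarm."""
    rng = random.Random(seed)
    lo, hi = bounds
    # Random starting positions, zero starting velocities.
    pos = [[rng.uniform(lo, hi) for _ in range(dim)] for _ in range(n_particles)]
    vel = [[0.0] * dim for _ in range(n_particles)]
    # Personal bests start at the initial positions.
    best_pos = [p[:] for p in pos]
    best_val = [objective(p) for p in pos]
    # Global best is the best of the personal bests.
    g = min(range(n_particles), key=lambda i: best_val[i])
    g_pos, g_val = best_pos[g][:], best_val[g]

    for _ in range(n_iter):
        for i in range(n_particles):
            for d in range(dim):
                r1, r2 = rng.random(), rng.random()
                vel[i][d] = (inertia * vel[i][d]
                             + c_personal * r1 * (best_pos[i][d] - pos[i][d])
                             + c_global * r2 * (g_pos[d] - pos[i][d]))
                # Move and clamp to the search box.
                pos[i][d] = min(hi, max(lo, pos[i][d] + vel[i][d]))
            val = objective(pos[i])
            if val < best_val[i]:
                best_val[i], best_pos[i] = val, pos[i][:]
                if val < g_val:
                    g_val, g_pos = val, pos[i][:]
    return g_pos, g_val


def sphere(x):
    """Toy objective with its minimum (value 0) at the origin."""
    return sum(v * v for v in x)


best_x, best_f = pso_minimise(sphere, dim=3)
print("best position:", best_x)
print("best value:", best_f)
```

In a feature-selection setting, the position of a particle would encode a candidate gene set and the objective would be a validation score for a model trained on it. How PAGEpy encodes and scores candidates is not specified here, so that mapping is left out of the sketch.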