From thousands of pages of illustrated printed material to the study of images in circulation.
Since January 2021, the team has been collecting as large a corpus as possible of illustrated print items available in digital form. The corpus now amounts to about 120,000 items (journals, magazines, complete collections, or simple posters), spread over 120 countries. Most of the sources retrieved are IIIF-compliant, and we are working to convert the non-interoperable images to IIIF.
The next step was to extract the illustrations of each page, and to identify the images that have been reproduced the most.
Thanks to Robin Champenois, we have developed a platform, VisualContagions/explore, which allows us to process our corpus. Starting from the IIIF URLs that give access to an image and its metadata, we can:
- Retrieve the images in the documents: all the pages concerned are thus isolated, without losing the metadata already associated with the medium (i.e. the date, the place of publication, and the title of the source if applicable).
- Separate each image from its support (this is called segmentation).
- Characterize each image by a vector computed by a comparison algorithm.
- Compare similar images and group them into clusters.
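The last two steps can be illustrated with a minimal sketch. It is not the Explore platform's code: the source does not name the comparison algorithm, so cosine similarity, the 0.9 threshold, the function names, and the toy vectors below are all assumptions chosen for illustration. Images whose vectors are similar enough are linked, and linked images end up in the same cluster.

```python
from math import sqrt


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = sqrt(sum(x * x for x in a))
    norm_b = sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def cluster_by_similarity(vectors, threshold=0.9):
    """Group image ids whose vectors are pairwise linked above `threshold`.

    `vectors` maps an image id to its feature vector (hypothetical format).
    Linked images are merged with a small union-find structure, so a
    cluster is a connected component of the similarity graph.
    """
    parent = {image_id: image_id for image_id in vectors}

    def find(x):
        # Follow parent links to the root, shortening the path on the way.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    ids = list(vectors)
    for i, first in enumerate(ids):
        for second in ids[i + 1:]:
            if cosine_similarity(vectors[first], vectors[second]) >= threshold:
                parent[find(first)] = find(second)

    clusters = {}
    for image_id in ids:
        clusters.setdefault(find(image_id), set()).add(image_id)
    return list(clusters.values())


# Toy vectors, invented for the example.
toy_vectors = {
    "img_a": [1.0, 0.0, 0.1],
    "img_b": [0.9, 0.1, 0.1],
    "img_c": [0.0, 1.0, 0.0],
}
clusters = cluster_by_similarity(toy_vectors, threshold=0.9)
```

Comparing every pair is quadratic in the number of images, which is fine for a demonstration but would presumably need an approximate nearest-neighbour index at the scale of millions of images.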
Thanks to a group of 10 students from the "Cours transversal sur le numérique", with the assistance of Dr. Anna Scius-Bertrand, the first tests were carried out during Spring Semester 2021 on a limited corpus of images already available in IIIF, published between 1920 and 1939. See the CTN2 report. The results of the CTN2 work were presented on May 27 and are available for replay.
In parallel, the team deployed several Jupyter notebooks for analysing the recovered clusters: classification of the clusters according to their number of images, the number of unique cities represented, and the date of appearance of the oldest image (developments carried out by Cédric Viaccoz, Anna Scius-Bertrand, and Béatrice Joyeux-Prunel).
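The three classification criteria mentioned above can be sketched in a few lines of Python. This is an illustration only, not the team's notebooks: the `ImageRecord` structure, the field names, and the sample data are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ImageRecord:
    """Minimal metadata kept for one extracted image (hypothetical fields)."""
    image_id: str
    city: str
    year: int


def summarize_cluster(records):
    """Compute the three criteria used to classify a cluster."""
    return {
        "size": len(records),
        "unique_cities": len({r.city for r in records}),
        "first_year": min(r.year for r in records),
    }


# Invented sample data: two clusters of reproduced images.
toy_clusters = {
    "cluster_1": [
        ImageRecord("a", "Paris", 1925),
        ImageRecord("b", "Berlin", 1927),
        ImageRecord("c", "Paris", 1931),
    ],
    "cluster_2": [
        ImageRecord("d", "Geneva", 1922),
        ImageRecord("e", "Geneva", 1923),
    ],
}

summaries = {cid: summarize_cluster(recs) for cid, recs in toy_clusters.items()}

# Rank clusters by each criterion in turn.
by_size = sorted(summaries, key=lambda c: summaries[c]["size"], reverse=True)
by_cities = sorted(summaries, key=lambda c: summaries[c]["unique_cities"], reverse=True)
by_first_year = sorted(summaries, key=lambda c: summaries[c]["first_year"])
```

Keeping the summary separate from the ranking makes it easy to add further criteria (for example, the number of countries) without touching the sorting code.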
July 2021: The team extended the methodology to the whole corpus. This implied:
- converting a large part of our corpus to IIIF format
- passing it into the Explore platform
- linking similar images retrieved from Explore, according to the RDF model of the project (CIDOC-CRM and CIDOC/VIR ontology).
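To make the benefit of the IIIF conversion concrete: once a page is served through the IIIF Image API, any rectangular area of it, such as a segmented illustration, can be requested through the URL alone. The sketch below builds such URLs following the Image API pattern `{base}/{region}/{size}/{rotation}/{quality}.{format}`. The base URL, the helper names, and the pixel coordinates are hypothetical and do not come from the project.

```python
def iiif_region_url(base_url, x, y, width, height, size="max", fmt="jpg"):
    """Build a IIIF Image API URL for a rectangular region of a page.

    `base_url` is the image's base URI (server + prefix + identifier).
    `size="max"` is valid in Image API 2.1 and 3.0; older services
    expect "full" instead.
    """
    region = f"{x},{y},{width},{height}"
    return f"{base_url.rstrip('/')}/{region}/{size}/0/default.{fmt}"


def iiif_info_url(base_url):
    """URL of the info.json document describing the image service."""
    return f"{base_url.rstrip('/')}/info.json"


# Hypothetical base URI and bounding box, for illustration only.
example_base = "https://iiif.example.org/image/page-0001"
crop_url = iiif_region_url(example_base, 120, 340, 800, 600)
info_url = iiif_info_url(example_base)
```

Because the crop is expressed in the URL, a link between two similar images can simply point at the two region URLs without duplicating any image files.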
Winter 2021 and Spring 2022 were dedicated to the processing and analysis of a larger corpus of 120,000 items (3 million images extracted).
Exposing Results and Questions