
Hervé Le Borgne
Researcher at CEA-List, Univ. Paris Saclay
- Saclay, France
GitHub
- Google Scholar
- ORCID
Publications
A fairly complete list can be found on my Google Scholar profile.
Here is a selected list (under construction):

2025
(4)
Lahlali, S.; Kara, S.; Ammar, H.; Chabot, F.; Granger, N.; Le Borgne, H.; and Pham, Q.
Cross-Modal Distillation for 2D/3D Multi-Object Discovery from 2D motion.
In Computer Vision and Pattern Recognition (CVPR), 2025.
pdf
code
bibtex
abstract
@inproceedings{lahlali2025xmod, author = {Lahlali, Saad and Kara, Sandra and Ammar, Hejer and Chabot, Florian and Granger, Nicolas and Le Borgne, Herv{\'e} and Pham, Quoc-Cuong}, booktitle = {Computer Vision and Pattern Recognition (CVPR)}, title = {Cross-Modal Distillation for 2D/3D Multi-Object Discovery from 2D motion}, year = {2025}, url_PDF = {https://arxiv.org/abs/2503.15022}, url_code = {https://github.com/CEA-LIST/xMOD}, abstract = {Object discovery, which refers to the task of localizing objects without human annotations, has gained significant attention in 2D image analysis. However, despite this growing interest, it remains under-explored in 3D data, where approaches rely exclusively on 3D motion, despite its several challenges. In this paper, we present a novel framework that leverages advances in 2D object discovery which are based on 2D motion to exploit the advantages of such motion cues being more flexible and generalizable and to bridge the gap between 2D and 3D modalities. Our primary contributions are twofold: (i) we introduce DIOD-3D, the first baseline for multi-object discovery in 3D data using 2D motion, incorporating scene completion as an auxiliary task to enable dense object localization from sparse input data; (ii) we develop xMOD, a cross-modal training framework that integrates 2D and 3D data while always using 2D motion cues. xMOD employs a teacher-student training paradigm across the two modalities to mitigate confirmation bias by leveraging the domain gap. During inference, the model supports both RGB-only and point cloud-only inputs. Additionally, we propose a late-fusion technique tailored to our pipeline that further enhances performance when both modalities are available at inference. We evaluate our approach extensively on synthetic (TRIP-PD) and challenging real-world datasets (KITTI and Waymo). 
Notably, our approach yields a substantial performance improvement compared with the 2D object discovery state-of-the-art on all datasets with gains ranging from +8.7 to +15.1 in F1@50 score.}, keywords = {frugal-learning} }
Object discovery, which refers to the task of localizing objects without human annotations, has gained significant attention in 2D image analysis. However, despite this growing interest, it remains under-explored in 3D data, where approaches rely exclusively on 3D motion, despite its several challenges. In this paper, we present a novel framework that leverages advances in 2D object discovery which are based on 2D motion to exploit the advantages of such motion cues being more flexible and generalizable and to bridge the gap between 2D and 3D modalities. Our primary contributions are twofold: (i) we introduce DIOD-3D, the first baseline for multi-object discovery in 3D data using 2D motion, incorporating scene completion as an auxiliary task to enable dense object localization from sparse input data; (ii) we develop xMOD, a cross-modal training framework that integrates 2D and 3D data while always using 2D motion cues. xMOD employs a teacher-student training paradigm across the two modalities to mitigate confirmation bias by leveraging the domain gap. During inference, the model supports both RGB-only and point cloud-only inputs. Additionally, we propose a late-fusion technique tailored to our pipeline that further enhances performance when both modalities are available at inference. We evaluate our approach extensively on synthetic (TRIP-PD) and challenging real-world datasets (KITTI and Waymo). Notably, our approach yields a substantial performance improvement compared with the 2D object discovery state-of-the-art on all datasets with gains ranging from +8.7 to +15.1 in F1@50 score.
Adjali, O.; Ferret, O.; Ghannay, S.; and Le Borgne, H.
Entity-Aware Cross-Modal Pretraining for Knowledge-based Visual Question Answering.
In European Conference on Information Retrieval (ECIR), 2025.
hal
bibtex
abstract
@inproceedings{adjali2025entitiy_aware, title = {Entity-Aware Cross-Modal Pretraining for Knowledge-based Visual Question Answering}, author = {Adjali, Omar and Ferret, Olivier and Ghannay, Sahar and Le Borgne, Herv{\'e}}, booktitle = {European Conference on Information Retrieval (ECIR)}, year = {2025}, url_HAL = {https://hal-lara.archives-ouvertes.fr/SHARP/cea-04910767}, abstract = {Knowledge-Aware Visual Question Answering about Entities (KVQAE) is a recent multimodal task aiming to answer visual questions about named entities from a multimodal knowledge base. In this context, we focus more particularly on cross-modal retrieval and propose to inject information about entities in the representations of both texts and images during their building through two pretraining auxiliary tasks, namely entity-level masked language modeling and entity type prediction. We show competitive results over existing approaches on 3 KVQAE standard benchmarks, revealing the benefit of raising entity awareness during cross-modal pretraining, specifically for the KVQAE task.}, keywords = {kvqae} }
Knowledge-Aware Visual Question Answering about Entities (KVQAE) is a recent multimodal task aiming to answer visual questions about named entities from a multimodal knowledge base. In this context, we focus more particularly on cross-modal retrieval and propose to inject information about entities in the representations of both texts and images during their building through two pretraining auxiliary tasks, namely entity-level masked language modeling and entity type prediction. We show competitive results over existing approaches on 3 KVQAE standard benchmarks, revealing the benefit of raising entity awareness during cross-modal pretraining, specifically for the KVQAE task.
Fournier-Montgieux, A.; Soumm, M.; Popescu, A.; Luvison, B.; and Le Borgne, H.
Fairer Analysis and Demographically Balanced Face Generation for Fairer Face Verification.
In Winter Conference on Applications of Computer Vision (WACV), pages 2788-2798, Tucson, Arizona, USA, February 2025.
pdf
supp
code1
code2
bibtex
abstract
@inproceedings{afm2025fairer_analysis, author = {Fournier-Montgieux, Alexandre and Soumm, Michael and Popescu, Adrian and Luvison, Bertrand and Le Borgne, Herv{\'e}}, title = {Fairer Analysis and Demographically Balanced Face Generation for Fairer Face Verification}, booktitle = {Winter Conference on Applications of Computer Vision (WACV)}, address = {Tucson, Arizona, USA}, month = {February}, year = {2025}, pages = {2788-2798}, url_PDF = {https://openaccess.thecvf.com/content/WACV2025/papers/Fournier-Montgieux_Fairer_Analysis_and_Demographically_Balanced_Face_Generation_for_Fairer_Face_WACV_2025_paper.pdf}, url_supp = {https://openaccess.thecvf.com/content/WACV2025/supplemental/Fournier-Montgieux_Fairer_Analysis_and_WACV_2025_supplemental.pdf}, url_code1 = {https://github.com/afm215/FaVGen}, url_code2 = {https://github.com/MSoumm/FaVFA}, abstract = {Face recognition and verification are two computer vision tasks whose performances have advanced with the introduction of deep representations. However, ethical, legal, and technical challenges due to the sensitive nature of face data and biases in real-world training datasets hinder their development. Generative AI addresses privacy by creating fictitious identities, but fairness problems remain. Using the existing DCFace SOTA framework, we introduce a new controlled generation pipeline that improves fairness. Through classical fairness metrics and a proposed in-depth statistical analysis based on logit models and ANOVA, we show that our generation pipeline improves fairness more than other bias mitigation approaches while slightly improving raw performance.}, keywords = {generative-models,trustworthy-AI} }
Face recognition and verification are two computer vision tasks whose performances have advanced with the introduction of deep representations. However, ethical, legal, and technical challenges due to the sensitive nature of face data and biases in real-world training datasets hinder their development. Generative AI addresses privacy by creating fictitious identities, but fairness problems remain. Using the existing DCFace SOTA framework, we introduce a new controlled generation pipeline that improves fairness. Through classical fairness metrics and a proposed in-depth statistical analysis based on logit models and ANOVA, we show that our generation pipeline improves fairness more than other bias mitigation approaches while slightly improving raw performance.
Lahlali, S.; Granger, N.; Le Borgne, H.; and Pham, Q.
ALPI: Auto-Labeller with Proxy Injection for 3D Object Detection using 2D Labels Only.
In Winter Conference on Applications of Computer Vision (WACV), pages 2185-2194, Tucson, Arizona, USA, February 2025.
pdf
supp
code
bibtex
abstract
@inproceedings{lahlali2025alpi, author = {Lahlali, Saad and Granger, Nicolas and Le Borgne, Herv{\'e} and Pham, Quoc-Cuong}, title = {ALPI: Auto-Labeller with Proxy Injection for 3D Object Detection using 2D Labels Only}, booktitle = {Winter Conference on Applications of Computer Vision (WACV)}, address = {Tucson, Arizona, USA}, month = {February}, year = {2025}, pages = {2185-2194}, url_PDF = {https://openaccess.thecvf.com/content/WACV2025/papers/Lahlali_ALPI_Auto-Labeller_with_Proxy_Injection_for_3D_Object_Detection_using_WACV_2025_paper.pdf}, url_supp = {https://openaccess.thecvf.com/content/WACV2025/supplemental/Lahlali_ALPI_Auto-Labeller_with_WACV_2025_supplemental.pdf}, url_code = {https://github.com/CEA-LIST/ALPI}, abstract = {3D object detection plays a crucial role in various applications such as autonomous vehicles, robotics and augmented reality. However, training 3D detectors requires a costly precise annotation, which is a hindrance to scaling annotation to large datasets. To address this challenge, we propose a weakly supervised 3D annotator that relies solely on 2D bounding box annotations from images, along with size priors. One major problem is that supervising a 3D detection model using only 2D boxes is not reliable due to ambiguities between different 3D poses and their identical 2D projection. We introduce a simple yet effective and generic solution: we build 3D proxy objects with annotations by construction and add them to the training dataset. Our method requires only size priors to adapt to new classes. To better align 2D supervision with 3D detection, our method ensures depth invariance with a novel expression of the 2D losses. Finally, to detect more challenging instances, our annotator follows an offline pseudo-labelling scheme which gradually improves its 3D pseudo-labels. 
Extensive experiments on the KITTI dataset demonstrate that our method not only performs on-par or above previous works on the 'Car' category, but also achieves performance close to fully supervised methods on more challenging classes. We further demonstrate the effectiveness and robustness of our method by being the first to experiment on the more challenging nuScenes dataset. We additionally propose a setting where weak labels are obtained from a 2D detector pre-trained on MS-COCO instead of human annotations.}, keywords = {frugal-learning} }
3D object detection plays a crucial role in various applications such as autonomous vehicles, robotics and augmented reality. However, training 3D detectors requires a costly precise annotation, which is a hindrance to scaling annotation to large datasets. To address this challenge, we propose a weakly supervised 3D annotator that relies solely on 2D bounding box annotations from images, along with size priors. One major problem is that supervising a 3D detection model using only 2D boxes is not reliable due to ambiguities between different 3D poses and their identical 2D projection. We introduce a simple yet effective and generic solution: we build 3D proxy objects with annotations by construction and add them to the training dataset. Our method requires only size priors to adapt to new classes. To better align 2D supervision with 3D detection, our method ensures depth invariance with a novel expression of the 2D losses. Finally, to detect more challenging instances, our annotator follows an offline pseudo-labelling scheme which gradually improves its 3D pseudo-labels. Extensive experiments on the KITTI dataset demonstrate that our method not only performs on-par or above previous works on the 'Car' category, but also achieves performance close to fully supervised methods on more challenging classes. We further demonstrate the effectiveness and robustness of our method by being the first to experiment on the more challenging nuScenes dataset. We additionally propose a setting where weak labels are obtained from a 2D detector pre-trained on MS-COCO instead of human annotations.
2024
(5)
Adjali, O.; Ferret, O.; Ghannay, S.; and Le Borgne, H.
Multi-Level Information Retrieval Augmented Generation for Knowledge-based Visual Question Answering.
In Empirical Methods in Natural Language Processing (EMNLP), 2024.
pdf
code
bibtex
abstract
@inproceedings{adjali2024emnlp, title = {Multi-Level Information Retrieval Augmented Generation for Knowledge-based Visual Question Answering}, author = {Adjali, Omar and Ferret, Olivier and Ghannay, Sahar and {Le Borgne}, Herv{\'e}}, booktitle = {Empirical Methods in Natural Language Processing (EMNLP)}, location = {Miami, Florida, USA}, year = {2024}, url_PDF = {https://aclanthology.org/2024.emnlp-main.922.pdf}, url_code = {https://github.com/OA256864/MiRAG}, abstract = {The Knowledge-Aware Visual Question Answering about Entity task aims to disambiguate entities using textual and visual information, as well as knowledge. It usually relies on two independent steps, information retrieval then reading comprehension, that do not benefit each other. Retrieval Augmented Generation (RAG) offers a solution by using generated answers as feedback for retrieval training. RAG usually relies solely on pseudo-relevant passages retrieved from external knowledge bases which can lead to ineffective answer generation. In this work, we propose a multi-level information RAG approach that enhances answer generation through entity retrieval and query expansion. We formulate a joint-training RAG loss such that answer generation is conditioned on both entity and passage retrievals. We show through experiments new state-of-the-art performance on the VIQuAE KB-VQA benchmark and demonstrate that our approach can help retrieve more actual relevant knowledge to generate accurate answers.}, keywords = {kvqae,vision-language} }
The Knowledge-Aware Visual Question Answering about Entity task aims to disambiguate entities using textual and visual information, as well as knowledge. It usually relies on two independent steps, information retrieval then reading comprehension, that do not benefit each other. Retrieval Augmented Generation (RAG) offers a solution by using generated answers as feedback for retrieval training. RAG usually relies solely on pseudo-relevant passages retrieved from external knowledge bases which can lead to ineffective answer generation. In this work, we propose a multi-level information RAG approach that enhances answer generation through entity retrieval and query expansion. We formulate a joint-training RAG loss such that answer generation is conditioned on both entity and passage retrievals. We show through experiments new state-of-the-art performance on the VIQuAE KB-VQA benchmark and demonstrate that our approach can help retrieve more actual relevant knowledge to generate accurate answers.
Cornet, C.; Aumaître, H.; Besançon, R.; Olivier, J.; Faucher, T.; and Le Borgne, H.
Automatic Die Studies for Ancient Numismatics.
In 3rd International Workshop on Artificial Intelligence for Digital Humanities (AI4DH@ECCV), 2024.
code
pdf
bibtex
abstract
@inproceedings{cornet2024automaticdiestudiesancient, title = {Automatic Die Studies for Ancient Numismatics}, author = {Clément Cornet and Héloïse Aumaître and Romaric Besançon and Julien Olivier and Thomas Faucher and Hervé Le Borgne}, year = {2024}, booktitle = {3rd International Workshop on Artificial Intelligence for Digital Humanities (AI4DH@ECCV)}, url_code = {https://github.com/ClementCornet/Auto-Die-Studies}, url_PDF = {https://arxiv.org/abs/2407.20876}, abstract = {Die studies are fundamental to quantifying ancient monetary production, providing insights into the relationship between coinage, politics, and history. The process requires tedious manual work, which limits the size of the corpora that can be studied. Few works have attempted to automate this task, and none have been properly released and evaluated from a computer vision perspective. We propose a fully automatic approach that introduces several innovations compared to previous methods. We rely on fast and robust local descriptors matching that is set automatically. Second, the core of our proposal is a clustering-based approach that uses an intrinsic metric (that does not need the ground truth labels) to determine its critical hyper-parameters. We validate the approach on two corpora of Greek coins, propose an automatic implementation and evaluation of previous baselines, and show that our approach significantly outperforms them.}, keywords = {ai4humanities} }
Die studies are fundamental to quantifying ancient monetary production, providing insights into the relationship between coinage, politics, and history. The process requires tedious manual work, which limits the size of the corpora that can be studied. Few works have attempted to automate this task, and none have been properly released and evaluated from a computer vision perspective. We propose a fully automatic approach that introduces several innovations compared to previous methods. We rely on fast and robust local descriptors matching that is set automatically. Second, the core of our proposal is a clustering-based approach that uses an intrinsic metric (that does not need the ground truth labels) to determine its critical hyper-parameters. We validate the approach on two corpora of Greek coins, propose an automatic implementation and evaluation of previous baselines, and show that our approach significantly outperforms them.
Grimal, P.; Le Borgne, H.; Ferret, O.; and Tourille, J.
TIAM – A Metric for Evaluating Alignment in Text-to-Image Generation.
In Winter Conference on Applications of Computer Vision (WACV), 2024.
Paper
pdf
code
video
bibtex
abstract
@inproceedings{grimal2024tiam, title = {TIAM -- A Metric for Evaluating Alignment in Text-to-Image Generation}, author = {Paul Grimal and Hervé {Le Borgne} and Olivier Ferret and Julien Tourille}, year = {2024}, booktitle = {Winter Conference on Applications of Computer Vision (WACV)}, url = {https://arxiv.org/abs/2307.05134}, url_PDF = {https://arxiv.org/pdf/2307.05134.pdf}, url_Code = {https://github.com/CEA-LIST/TIAMv2}, url_video = {https://www.youtube.com/watch?v=668sw3hlYDo}, abstract = {The progress in the generation of synthetic images has made it crucial to assess their quality. While several metrics have been proposed to assess the rendering of images, it is crucial for Text-to-Image (T2I) models, which generate images based on a prompt, to consider additional aspects such as to which extent the generated image matches the important content of the prompt. Moreover, although the generated images usually result from a random starting point, the influence of this one is generally not considered. In this article, we propose a new metric based on prompt templates to study the alignment between the content specified in the prompt and the corresponding generated images. It allows us to better characterize the alignment in terms of the type of the specified objects, their number, and their color. We conducted a study on several recent T2I models about various aspects. An additional interesting result we obtained with our approach is that image quality can vary drastically depending on the latent noise used as a seed for the images. We also quantify the influence of the number of concepts in the prompt, their order as well as their (color) attributes. Finally, our method allows us to identify some latent seeds that produce better images than others, opening novel directions of research on this understudied topic.}, keywords = {generative-models} }
The progress in the generation of synthetic images has made it crucial to assess their quality. While several metrics have been proposed to assess the rendering of images, it is crucial for Text-to-Image (T2I) models, which generate images based on a prompt, to consider additional aspects such as to which extent the generated image matches the important content of the prompt. Moreover, although the generated images usually result from a random starting point, the influence of this one is generally not considered. In this article, we propose a new metric based on prompt templates to study the alignment between the content specified in the prompt and the corresponding generated images. It allows us to better characterize the alignment in terms of the type of the specified objects, their number, and their color. We conducted a study on several recent T2I models about various aspects. An additional interesting result we obtained with our approach is that image quality can vary drastically depending on the latent noise used as a seed for the images. We also quantify the influence of the number of concepts in the prompt, their order as well as their (color) attributes. Finally, our method allows us to identify some latent seeds that produce better images than others, opening novel directions of research on this understudied topic.
Doubinsky, P.; Audebert, N.; Crucianu, M.; and Le Borgne, H.
Semantic Generative Augmentations for Few-Shot Counting.
In Winter Conference on Applications of Computer Vision (WACV), 2024.
pdf
supp
hal
code
video
bibtex
abstract
@inproceedings{doubinsky2024semantic_augmentation_fsc, title = {Semantic Generative Augmentations for Few-Shot Counting}, booktitle = {Winter Conference on Applications of Computer Vision (WACV)}, year = {2024}, author = {Perla Doubinsky and Nicolas Audebert and Michel Crucianu and Herv{\'{e}} {Le Borgne}}, url_PDF = {https://openaccess.thecvf.com/content/WACV2024/papers/Doubinsky_Semantic_Generative_Augmentations_for_Few-Shot_Counting_WACV_2024_paper.pdf}, url_supp = {https://openaccess.thecvf.com/content/WACV2024/supplemental/Doubinsky_Semantic_Generative_Augmentations_WACV_2024_supplemental.pdf}, url_HAL = {https://hal.science/hal-04259058}, url_Code = {https://github.com/perladoubinsky/SemAug}, url_video = {https://www.youtube.com/watch?v=6coxHpfxxDA}, abstract = {With the availability of powerful text-to-image diffusion models, recent works have explored the use of synthetic data to improve image classification performances. These works show that it can effectively augment or even replace real data. In this work, we investigate how synthetic data can benefit few-shot class-agnostic counting. This requires to generate images that correspond to a given input number of objects. However, text-to-image models struggle to grasp the notion of count. We propose to rely on a double conditioning of Stable Diffusion with both a prompt and a density map in order to augment a training dataset for few-shot counting. Due to the small dataset size, the fine-tuned model tends to generate images close to the training images. We propose to enhance the diversity of synthesized images by exchanging captions between images thus creating unseen configurations of object types and spatial layout. Our experiments show that our diversified generation strategy significantly improves the counting accuracy of two recent and performing few-shot counting models on FSC147 and CARPK.}, keywords = {generative-models} }
With the availability of powerful text-to-image diffusion models, recent works have explored the use of synthetic data to improve image classification performances. These works show that it can effectively augment or even replace real data. In this work, we investigate how synthetic data can benefit few-shot class-agnostic counting. This requires to generate images that correspond to a given input number of objects. However, text-to-image models struggle to grasp the notion of count. We propose to rely on a double conditioning of Stable Diffusion with both a prompt and a density map in order to augment a training dataset for few-shot counting. Due to the small dataset size, the fine-tuned model tends to generate images close to the training images. We propose to enhance the diversity of synthesized images by exchanging captions between images thus creating unseen configurations of object types and spatial layout. Our experiments show that our diversified generation strategy significantly improves the counting accuracy of two recent and performing few-shot counting models on FSC147 and CARPK.
Staron, C.; Le Borgne, H.; Mitteau, R.; Allezard, N.; and Grelier, E.
Detection of Thermal Events by Semi-Supervised Learning for Tokamak First Wall Safety.
IEEE Transactions on Instrumentation and Measurement, 73: 1-9. 2024.
Paper
hal
pdf
bibtex
abstract
@article{staron2024ssl, title = {{Detection of Thermal Events by Semi-Supervised Learning for Tokamak First Wall Safety}}, author = {Staron, Christian and Le Borgne, Herv{\'e} and Mitteau, Rapha{\"e}l and Allezard, Nicolas and Grelier, Erwan}, journal = {IEEE Transactions on Instrumentation and Measurement}, year = {2024}, volume = {73}, pages = {1-9}, url = {https://doi.org/10.1109/TIM.2024.3368486}, url_HAL = {https://cea.hal.science/cea-04099065/file/HoDetec_article.pdf}, url_PDF = {https://arxiv.org/pdf/2401.10958}, abstract = {This paper explores a semi-supervised object detection approach to detect thermal events on the internal wall of tokamaks. A huge amount of data is produced during an experimental campaign by the infrared (IR) viewing systems used to monitor the inner thermal shields during machine operation. The amount of data to be processed and analyzed is such that protecting the first wall is an overwhelming job. Automatizing this job with artificial intelligence (AI) is an attractive solution, but AI requires large labelled datasets that are not readily available for tokamak walls. Semi-supervised learning (SSL) is a possible solution to being able to train deep learning models with a small amount of labelled data and a large amount of unlabelled data. SSL is explored as a possible tool to rapidly adapt a model trained on an experimental campaign A of tokamak WEST to a new experimental campaign B by using labelled data from campaign A , a little labelled data from campaign B and a lot of unlabelled data from campaign B . Model performance is evaluated on two labelled datasets and two methods including semi-supervised learning. Semi-supervised learning increased the mAP metric by over six percentage points on the first smaller-scale database and over four percentage points on the second larger-scale dataset depending on the method employed.}, keywords = {frugal-learning} }
This paper explores a semi-supervised object detection approach to detect thermal events on the internal wall of tokamaks. A huge amount of data is produced during an experimental campaign by the infrared (IR) viewing systems used to monitor the inner thermal shields during machine operation. The amount of data to be processed and analyzed is such that protecting the first wall is an overwhelming job. Automatizing this job with artificial intelligence (AI) is an attractive solution, but AI requires large labelled datasets that are not readily available for tokamak walls. Semi-supervised learning (SSL) is a possible solution to being able to train deep learning models with a small amount of labelled data and a large amount of unlabelled data. SSL is explored as a possible tool to rapidly adapt a model trained on an experimental campaign A of tokamak WEST to a new experimental campaign B by using labelled data from campaign A , a little labelled data from campaign B and a lot of unlabelled data from campaign B . Model performance is evaluated on two labelled datasets and two methods including semi-supervised learning. Semi-supervised learning increased the mAP metric by over six percentage points on the first smaller-scale database and over four percentage points on the second larger-scale dataset depending on the method employed.
2023
(4)
Karaliolios, N.; Chabot, F.; Dupont, C.; Le Borgne, H.; Audigier, R.; and Pham, Q.
Generalized pseudo-labeling in consistency regularization for semi-supervised learning.
In International Conference on Image Processing (ICIP), 2023.
pdf
doi
bibtex
@inproceedings{karaliolios2023gpl, author = {Nikolaos Karaliolios and Florian Chabot and Camille Dupont and Herv{\'{e}} {Le Borgne} and Romaric Audigier and Quoc-Cuong Pham}, booktitle = {International Conference on Image Processing (ICIP)}, title = {Generalized pseudo-labeling in consistency regularization for semi-supervised learning}, year = {2023}, doi = {10.1109/ICIP49359.2023.10221965}, url_PDF = {https://cea.hal.science/cea-04503203/file/GPL_ICIP_NoteIEEE.pdf}, keywords = {frugal-learning} }
Doubinsky, P.; Audebert, N.; Crucianu, M.; and Le Borgne, H.
Wasserstein loss for Semantic Editing in the Latent Space of GANs.
In International Conference on Content-Based Multimedia Indexing, 2023.
pdf
code
bibtex
abstract
@inproceedings{doubinsky2023wasserstein, title = {Wasserstein loss for Semantic Editing in the Latent Space of GANs}, booktitle = {International Conference on Content-Based Multimedia Indexing}, year = {2023}, author = {Perla Doubinsky and Nicolas Audebert and Michel Crucianu and Herv{\'{e}} {Le Borgne}}, url_PDF = {https://arxiv.org/pdf/2304.10508.pdf}, url_Code = {https://github.com/perladoubinsky/latent-wasserstein}, abstract = {The latent space of GANs contains rich semantics reflecting the training data. Different methods propose to learn edits in latent space corresponding to semantic attributes, thus allowing to modify generated images. Most supervised methods rely on the guidance of classifiers to produce such edits. However, classifiers can lead to out-of-distribution regions and be fooled by adversarial samples. We propose an alternative formulation based on the Wasserstein loss that avoids such problems, while maintaining performance on-par with classifier-based approaches. We demonstrate the effectiveness of our method on two datasets (digits and faces) using StyleGAN2.}, keywords = {generative-models} }
The latent space of GANs contains rich semantics reflecting the training data. Different methods propose to learn edits in latent space corresponding to semantic attributes, thus making it possible to modify generated images. Most supervised methods rely on the guidance of classifiers to produce such edits. However, classifiers can lead to out-of-distribution regions and be fooled by adversarial samples. We propose an alternative formulation based on the Wasserstein loss that avoids such problems, while maintaining performance on par with classifier-based approaches. We demonstrate the effectiveness of our method on two datasets (digits and faces) using StyleGAN2.
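As background for the loss used here: for two 1D empirical distributions with the same number of samples, the 1-Wasserstein distance has a closed form, the mean absolute difference between sorted samples, since the optimal transport plan matches order statistics. A generic sketch of that building block (not the paper's exact objective):

```python
import numpy as np

def wasserstein_1d(a, b):
    """W1 distance between two 1D empirical distributions of equal size.

    Sorting both samples realizes the optimal matching, so W1 reduces to
    the mean absolute gap between order statistics.
    """
    a = np.sort(np.asarray(a, dtype=float))
    b = np.sort(np.asarray(b, dtype=float))
    assert a.shape == b.shape, "equal sample counts assumed"
    return float(np.abs(a - b).mean())
```

Matching the distribution of an attribute score over edited samples with such a loss, instead of pushing each sample toward a classifier extreme, is what avoids out-of-distribution edits.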
Hanouti, C.; and Le Borgne, H.
Learning Semantic Ambiguities for Zero-Shot Learning.
Multimedia Tools and Applications, 82: 40745–40759. 2023.
pdf
Paper
doi
bibtex
abstract
@article{hanouti2023lsa_zsl, author = {Hanouti, Celina and {Le Borgne}, Herv{\'{e}} }, title = {Learning Semantic Ambiguities for Zero-Shot Learning}, journal = {Multimedia Tools and Applications}, volume = {82}, pages = {40745–40759}, year = {2023}, doi = {10.1007/s11042-023-14877-1}, url_PDF = {https://arxiv.org/pdf/2201.01823.pdf}, url = {https://link.springer.com/article/10.1007/s11042-023-14877-1}, abstract = {Zero-shot learning (ZSL) aims at recognizing classes for which no visual sample is available at training time. To address this issue, one can rely on a semantic description of each class. A typical ZSL model learns a mapping between the visual samples of seen classes and the corresponding semantic descriptions, in order to do the same on unseen classes at test time. State of the art approaches rely on generative models that synthesize visual features from the prototype of a class, such that a classifier can then be learned in a supervised manner. However, these approaches are usually biased towards seen classes whose visual instances are the only one that can be matched to a given class prototype. We propose a regularization method that can be applied to any conditional generative-based ZSL method, by leveraging only the semantic class prototypes. It learns to synthesize discriminative features for possible semantic description that are not available at training time, that is the unseen ones. The approach is evaluated for ZSL and GZSL on four datasets commonly used in the literature, either in inductive and transductive settings, with results on-par or above state of the art approaches.}, keywords = {zero-shot-learning,generative-models} }
Zero-shot learning (ZSL) aims at recognizing classes for which no visual sample is available at training time. To address this issue, one can rely on a semantic description of each class. A typical ZSL model learns a mapping between the visual samples of seen classes and the corresponding semantic descriptions, in order to do the same on unseen classes at test time. State-of-the-art approaches rely on generative models that synthesize visual features from the prototype of a class, such that a classifier can then be learned in a supervised manner. However, these approaches are usually biased towards seen classes, whose visual instances are the only ones that can be matched to a given class prototype. We propose a regularization method that can be applied to any conditional generative-based ZSL method, by leveraging only the semantic class prototypes. It learns to synthesize discriminative features for possible semantic descriptions that are not available at training time, that is, the unseen ones. The approach is evaluated for ZSL and GZSL on four datasets commonly used in the literature, in both inductive and transductive settings, with results on par with or above state-of-the-art approaches.
Adjali, O.; Grimal, P.; Ferret, O.; Ghannay, S.; and Le Borgne, H.
Explicit Knowledge Integration for Knowledge-Aware Visual Question Answering about Named Entities.
In International Conference on Multimedia Retrieval (ICMR), 2023.
hal
Paper
doi
bibtex
abstract
@inproceedings{adjali2023icmr, title = {Explicit Knowledge Integration for Knowledge-Aware Visual Question Answering about Named Entities}, author = {Adjali, Omar and Grimal, Paul and Ferret, Olivier and Ghannay, Sahar and {Le Borgne}, Herv{\'e}}, booktitle = {International Conference on Multimedia Retrieval (ICMR)}, location = {Thessaloniki, Greece}, year = {2023}, url_HAL = {https://universite-paris-saclay.hal.science/cea-04172061/}, url = {https://dl.acm.org/doi/abs/10.1145/3591106.3592227}, doi = {10.1145/3591106.3592227}, abstract = {Recent years have shown an unprecedented growth of interest in Vision-Language related tasks, with the need to address the inherent challenges of integrating linguistic and visual information to solve real-world applications. Such a typical task is Visual Question Answering (VQA), which aims at answering questions about visual content. The limitations of the VQA task in terms of question redundancy and poor linguistic variability encouraged researchers to propose Knowledge-aware Visual Question Answering tasks as a natural extension of VQA. In this paper, we tackle the KVQAE (Knowledge-based Visual Question Answering about named Entities) task, which proposes to answer questions about named entities defined in a knowledge base and grounded in a visual content. In particular, beside the textual and visual information, we propose to leverage the structural information extracted from syntactic dependency trees and external knowledge graphs to help answer questions about a large spectrum of entities of various types. Thus, by combining contextual and graph-based representations using Graph Convolutional Networks (GCNs), we are able to learn meaningful embeddings for information retrieval tasks. Experiments on the KVQAE public dataset show how our approach improves the state-of-the art baselines while demonstrating the interest of injecting external knowledge to enhance multimodal information retrieval.}, keywords = {kvqae} }
Recent years have shown an unprecedented growth of interest in vision-language tasks, with the need to address the inherent challenges of integrating linguistic and visual information to solve real-world applications. A typical such task is Visual Question Answering (VQA), which aims at answering questions about visual content. The limitations of the VQA task in terms of question redundancy and poor linguistic variability encouraged researchers to propose Knowledge-aware Visual Question Answering tasks as a natural extension of VQA. In this paper, we tackle the KVQAE (Knowledge-based Visual Question Answering about named Entities) task, which proposes to answer questions about named entities defined in a knowledge base and grounded in visual content. In particular, besides the textual and visual information, we propose to leverage the structural information extracted from syntactic dependency trees and external knowledge graphs to help answer questions about a large spectrum of entities of various types. Thus, by combining contextual and graph-based representations using Graph Convolutional Networks (GCNs), we are able to learn meaningful embeddings for information retrieval tasks. Experiments on the KVQAE public dataset show how our approach improves on state-of-the-art baselines while demonstrating the benefit of injecting external knowledge to enhance multimodal information retrieval.
2022
(3)
Bojko, A.; Dupont, R.; Tamaazousti, M.; and Le Borgne, H.
Self-Improving SLAM in Dynamic Environments: Learning When to Mask.
In British Machine Vision Conference (BMVC), 2022.
pdf
poster
bibtex
@inproceedings{bojko22bmvc, title = {Self-Improving SLAM in Dynamic Environments: Learning When to Mask}, author = {Adrian Bojko and Romain Dupont and Mohamed Tamaazousti and Herv{\'e} {Le Borgne}}, booktitle = "British Machine Vision Conference (BMVC)", year = {2022}, url_PDF = {https://bmvc2022.mpi-inf.mpg.de/0654.pdf}, url_Poster= {https://bmvc2022.mpi-inf.mpg.de/0654_poster.pdf}, keywords = {slam} }
Lerner, P.; Ferret, O.; Guinaudeau, C.; Le Borgne, H.; Besançon, R.; Moreno, J. G.; and Lovón Melgarejo, J.
ViQuAE, a Dataset for Knowledge-Based Visual Question Answering about Named Entities.
In Proceedings of the 45th International ACM SIGIR Conference on Research and Development in Information Retrieval, of SIGIR '22, pages 3108–3120, New York, NY, USA, 2022. Association for Computing Machinery
Paper
hal
code
doi
bibtex
abstract
@inproceedings{lerner2022viquae, author = {Lerner, Paul and Ferret, Olivier and Guinaudeau, Camille and {Le Borgne}, Herv\'{e} and Besan\c{c}on, Romaric and Moreno, Jose G. and Lov\'{o}n Melgarejo, Jes\'{u}s}, title = {ViQuAE, a Dataset for Knowledge-Based Visual Question Answering about Named Entities}, year = {2022}, isbn = {9781450387323}, publisher = {Association for Computing Machinery}, address = {New York, NY, USA}, url = {https://doi.org/10.1145/3477495.3531753}, doi = {10.1145/3477495.3531753}, booktitle = {Proceedings of the 45th International ACM SIGIR Conference on Research and Development in Information Retrieval}, pages = {3108–3120}, numpages = {13}, location = {Madrid, Spain}, series = {SIGIR '22}, url_HAL = {https://universite-paris-saclay.hal.science/hal-03650618/document}, url_Code = {https://github.com/PaulLerner/ViQuAE}, abstract = {Whether to retrieve, answer, translate, or reason, multimodality opens up new challenges and perspectives. In this context, we are interested in answering questions about named entities grounded in a visual context using a Knowledge Base (KB). To benchmark this task, called KVQAE (Knowledge-based Visual Question Answering about named Entities), we provide ViQuAE, a dataset of 3.7K questions paired with images. This is the first KVQAE dataset to cover a wide range of entity types (e.g. persons, landmarks, and products). The dataset is annotated using a semi-automatic method. We also propose a KB composed of 1.5M Wikipedia articles paired with images. To set a baseline on the benchmark, we address KVQAE as a two-stage problem: Information Retrieval and Reading Comprehension, with both zero- and few-shot learning methods. The experiments empirically demonstrate the difficulty of the task, especially when questions are not about persons. This work paves the way for better multimodal entity representations and question answering. The dataset, KB, code, and semi-automatic annotation pipeline are freely available at https://github.com/PaulLerner/ViQuAE.}, keywords = {kvqae} }
Whether to retrieve, answer, translate, or reason, multimodality opens up new challenges and perspectives. In this context, we are interested in answering questions about named entities grounded in a visual context using a Knowledge Base (KB). To benchmark this task, called KVQAE (Knowledge-based Visual Question Answering about named Entities), we provide ViQuAE, a dataset of 3.7K questions paired with images. This is the first KVQAE dataset to cover a wide range of entity types (e.g. persons, landmarks, and products). The dataset is annotated using a semi-automatic method. We also propose a KB composed of 1.5M Wikipedia articles paired with images. To set a baseline on the benchmark, we address KVQAE as a two-stage problem: Information Retrieval and Reading Comprehension, with both zero- and few-shot learning methods. The experiments empirically demonstrate the difficulty of the task, especially when questions are not about persons. This work paves the way for better multimodal entity representations and question answering. The dataset, KB, code, and semi-automatic annotation pipeline are freely available at https://github.com/PaulLerner/ViQuAE.
Doubinsky, P.; Audebert, N.; Crucianu, M.; and Le Borgne, H.
Multi-Attribute Balanced Sampling for Disentangled GAN Controls.
Pattern Recognition Letters, 162: 56-62. 2022.
Paper
arxiv
pdf
code
doi
bibtex
abstract
@article{doubinsky2022prl, author = {Perla Doubinsky and Nicolas Audebert and Michel Crucianu and Herv{\'{e}} {Le Borgne}}, title = {Multi-Attribute Balanced Sampling for Disentangled {GAN} Controls}, journal = {Pattern Recognition Letters}, volume = {162}, pages = {56-62}, year = {2022}, doi = {https://doi.org/10.1016/j.patrec.2022.08.012}, url = "https://www.sciencedirect.com/science/article/abs/pii/S0167865522002501", url_arXiv = "https://arxiv.org/abs/2111.00909", url_PDF = "https://arxiv.org/pdf/2111.00909.pdf", url_Code = "https://github.com/perladoubinsky/balanced_sampling_gan_controls", abstract = "Various controls over the generated data can be extracted from the latent space of a pre-trained GAN, as it implicitly encodes the semantics of the training data. The discovered controls allow to vary semantic attributes in the generated images but usually lead to entangled edits that affect multiple attributes at the same time. Supervised approaches typically sample and annotate a collection of latent codes, then train classifiers in the latent space to identify the controls. Since the data generated by GANs reflects the biases of the original dataset, so do the resulting semantic controls. We propose to address disentanglement by subsampling the generated data to remove over-represented co-occuring attributes thus balancing the semantics of the dataset before training the classifiers. We demonstrate the effectiveness of this approach by extracting disentangled linear directions for face manipulation on two popular GAN architectures, PGGAN and StyleGAN, and two datasets, CelebAHQ and FFHQ. We show that this approach outperforms state-of-the-art classifier-based methods while avoiding the need for disentanglement-enforcing post-processing.", keywords = {generative-models} }
Various controls over the generated data can be extracted from the latent space of a pre-trained GAN, as it implicitly encodes the semantics of the training data. The discovered controls make it possible to vary semantic attributes in the generated images but usually lead to entangled edits that affect multiple attributes at the same time. Supervised approaches typically sample and annotate a collection of latent codes, then train classifiers in the latent space to identify the controls. The data generated by GANs reflects the biases of the original dataset, and so do the resulting semantic controls. We propose to address disentanglement by subsampling the generated data to remove over-represented co-occurring attributes, thus balancing the semantics of the dataset before training the classifiers. We demonstrate the effectiveness of this approach by extracting disentangled linear directions for face manipulation on two popular GAN architectures, PGGAN and StyleGAN, and two datasets, CelebAHQ and FFHQ. We show that this approach outperforms state-of-the-art classifier-based methods while avoiding the need for disentanglement-enforcing post-processing.
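The balancing idea can be illustrated in a few lines: given binary attribute annotations of sampled latent codes, keep only as many samples per attribute combination as the rarest combination provides, so that co-occurrences no longer dominate the training set of the latent classifiers. A simplified sketch (function name and interface are illustrative, not taken from the authors' code):

```python
import numpy as np

def balanced_subsample(attributes, rng=None):
    """Subsample rows so every attribute combination is equally represented.

    attributes: (N, K) binary matrix of K attribute annotations per sample.
    Returns the sorted indices of the retained samples.
    """
    rng = np.random.default_rng(rng)
    attributes = np.asarray(attributes)
    # Group samples by their exact attribute combination.
    combos, inverse = np.unique(attributes, axis=0, return_inverse=True)
    inverse = inverse.ravel()
    per_combo = [np.flatnonzero(inverse == c) for c in range(len(combos))]
    # Downsample every group to the size of the rarest one.
    n_keep = min(len(idx) for idx in per_combo)
    keep = np.concatenate(
        [rng.choice(idx, n_keep, replace=False) for idx in per_combo]
    )
    return np.sort(keep)
```

Downsampling to the rarest combination is the bluntest balancing scheme; it discards data aggressively when attributes are strongly correlated, which is precisely the regime the paper targets.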
2021
(3)
Le Cacheux, Y.; Le Borgne, H.; and Crucianu, M.
Zero-shot Learning with Deep Neural Networks for Object Recognition.
In Pineau, J. B.; and Zemmari, A., editor(s), Multi-faceted Deep Learning, 6, pages 273–288. Springer, 2021.
pdf
doi
bibtex
abstract
@incollection{lecacheux2021gzsl, title = {Zero-shot Learning with Deep Neural Networks for Object Recognition}, chapter = {6}, author = {Le Cacheux, Yannick and {Le Borgne}, Herv{\'e} and Crucianu, Michel}, booktitle = {Multi-faceted Deep Learning}, pages = {273--288}, year = {2021}, editor = {J. Benois Pineau and A. Zemmari}, publisher = {Springer}, doi = {10.1007/978-3-030-74478-6_6}, url_PDF = {https://arxiv.org/pdf/2102.03137.pdf}, abstract = {Zero-shot learning deals with the ability to recognize objects without any visual training sample. To counterbalance this lack of visual data, each class to recognize is associated with a semantic prototype that reflects the essential features of the object. The general approach is to learn a mapping from visual data to semantic prototypes, then use it at inference to classify visual samples from the class prototypes only. Different settings of this general configuration can be considered depending on the use case of interest, in particular whether one only wants to classify objects that have not been employed to learn the mapping or whether one can use unlabelled visual examples to learn the mapping. This chapter presents a review of the approaches based on deep neural networks to tackle the ZSL problem. We highlight findings that had a large impact on the evolution of this domain and list its current challenges.}, keywords = {zero-shot-learning} }
Zero-shot learning deals with the ability to recognize objects without any visual training sample. To counterbalance this lack of visual data, each class to recognize is associated with a semantic prototype that reflects the essential features of the object. The general approach is to learn a mapping from visual data to semantic prototypes, then use it at inference to classify visual samples from the class prototypes only. Different settings of this general configuration can be considered depending on the use case of interest, in particular whether one only wants to classify objects that have not been employed to learn the mapping or whether one can use unlabelled visual examples to learn the mapping. This chapter presents a review of the approaches based on deep neural networks to tackle the ZSL problem. We highlight findings that had a large impact on the evolution of this domain and list its current challenges.
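The general recipe described in this chapter — project visual features into the semantic space, then assign the nearest class prototype — can be sketched as follows, assuming a linear mapping W has already been learned on seen classes:

```python
import numpy as np

def zsl_predict(visual_feats, W, prototypes):
    """Nearest-prototype zero-shot classification.

    visual_feats: (N, D) visual features of test samples.
    W:            (D, S) learned mapping from visual to semantic space.
    prototypes:   (C, S) semantic prototypes of the (unseen) classes.
    Returns the index of the most similar prototype for each sample.
    """
    projected = visual_feats @ W                                  # (N, S)
    projected /= np.linalg.norm(projected, axis=1, keepdims=True)
    protos = prototypes / np.linalg.norm(prototypes, axis=1, keepdims=True)
    return (projected @ protos.T).argmax(axis=1)                  # cosine sim.
```

At test time only the prototypes of unseen classes are needed, which is what makes recognition possible without any visual training sample of those classes.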
Plumerault, A.; Le Borgne, H.; and Hudelot, C.
AVAE: Adversarial Variational Auto Encoder.
In International Conference on Pattern Recognition (ICPR), 2021.
pdf
bibtex
abstract
@inproceedings{plumerault2020icpr, title = {AVAE: Adversarial Variational Auto Encoder}, author = {Antoine Plumerault and Herv{\'e} {Le Borgne} and C{\'e}line Hudelot}, booktitle = {International Conference on Pattern Recognition (ICPR)}, year = {2021}, url_PDF = {https://arxiv.org/pdf/2012.11551.pdf}, abstract = {Among the wide variety of image generative models, two models stand out: Variational Auto Encoders (VAE) and Generative Adversarial Networks (GAN). GANs can produce realistic images, but they suffer from mode collapse and do not provide simple ways to get the latent representation of an image. On the other hand, VAEs do not have these problems, but they often generate images less realistic than GANs. In this article, we explain that this lack of realism is partially due to a common underestimation of the natural image manifold dimensionality. To solve this issue we introduce a new framework that combines VAE and GAN in a novel and complementary way to produce an auto-encoding model that keeps VAEs properties while generating images of GAN-quality. We evaluate our approach both qualitatively and quantitatively on five image datasets.}, keywords = {generative-models} }
Among the wide variety of image generative models, two models stand out: Variational Auto Encoders (VAE) and Generative Adversarial Networks (GAN). GANs can produce realistic images, but they suffer from mode collapse and do not provide simple ways to get the latent representation of an image. On the other hand, VAEs do not have these problems, but they often generate less realistic images than GANs. In this article, we explain that this lack of realism is partially due to a common underestimation of the natural image manifold dimensionality. To solve this issue we introduce a new framework that combines VAE and GAN in a novel and complementary way to produce an auto-encoding model that keeps the properties of VAEs while generating images of GAN quality. We evaluate our approach both qualitatively and quantitatively on five image datasets.
Bojko, A.; Dupont, R.; Tamaazousti, M.; and Le Borgne, H.
Learning to Segment Dynamic Objects Using SLAM Outliers.
In International Conference on Pattern Recognition (ICPR), pages 9780–9787, 2021.
pdf
doi
bibtex
abstract
@inproceedings{bojko2020icpr, title = {Learning to Segment Dynamic Objects Using SLAM Outliers}, author = {Adrian Bojko and Romain Dupont and Mohamed Tamaazousti and Herv{\'e} {Le Borgne}}, booktitle = "International Conference on Pattern Recognition (ICPR)", pages = {9780--9787}, doi = {10.1109/ICPR48806.2021.9412341}, url_PDF = {https://arxiv.org/pdf/2011.06259}, year = {2021}, abstract = {We present a method to automatically learn to segment dynamic objects using SLAM outliers. It requires only one monocular sequence per dynamic object for training and consists in localizing dynamic objects using SLAM outliers, creating their masks, and using these masks to train a semantic segmentation network. We integrate the trained network in ORB-SLAM 2 and LDSO. At runtime we remove features on dynamic objects, making the SLAM unaffected by them. We also propose a new stereo dataset and new metrics to evaluate SLAM robustness. Our dataset includes consensus inversions, i.e., situations where the SLAM uses more features on dynamic objects that on the static background. Consensus inversions are challenging for SLAM as they may cause major SLAM failures. Our approach performs better than the State-of-the-Art on the TUM RGB-D dataset in monocular mode and on our dataset in both monocular and stereo modes. }, keywords = {slam} }
We present a method to automatically learn to segment dynamic objects using SLAM outliers. It requires only one monocular sequence per dynamic object for training and consists in localizing dynamic objects using SLAM outliers, creating their masks, and using these masks to train a semantic segmentation network. We integrate the trained network in ORB-SLAM 2 and LDSO. At runtime we remove features on dynamic objects, making the SLAM unaffected by them. We also propose a new stereo dataset and new metrics to evaluate SLAM robustness. Our dataset includes consensus inversions, i.e., situations where the SLAM uses more features on dynamic objects than on the static background. Consensus inversions are challenging for SLAM as they may cause major SLAM failures. Our approach performs better than the state of the art on the TUM RGB-D dataset in monocular mode and on our dataset in both monocular and stereo modes.
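Integrating a predicted dynamic-object mask into a feature-based SLAM front end essentially amounts to discarding keypoints that fall inside the mask before pose estimation. A minimal sketch of that filtering step (illustrative; not taken from the authors' ORB-SLAM 2 / LDSO integration):

```python
import numpy as np

def filter_dynamic_keypoints(keypoints, dynamic_mask):
    """Discard keypoints lying on predicted dynamic objects.

    keypoints:    (N, 2) array of (x, y) pixel coordinates.
    dynamic_mask: (H, W) boolean array, True where a dynamic object is predicted.
    Returns the keypoints kept for pose estimation.
    """
    keypoints = np.asarray(keypoints)
    xs = keypoints[:, 0].round().astype(int)
    ys = keypoints[:, 1].round().astype(int)
    h, w = dynamic_mask.shape
    # Keep keypoints that are inside the image and off the dynamic mask.
    inside = (xs >= 0) & (xs < w) & (ys >= 0) & (ys < h)
    on_dynamic = dynamic_mask[ys.clip(0, h - 1), xs.clip(0, w - 1)]
    return keypoints[inside & ~on_dynamic]
```

When the mask is wrong in the other direction (too few static features survive), the consensus inversions described above can still occur, which is why the paper also measures SLAM robustness rather than mask quality alone.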
2020
(6)
Le Cacheux, Y.; Le Borgne, H.; and Crucianu, M.
Using Sentences as Semantic Representations in Large Scale Zero-Shot Learning.
In TASK-CV Workshop - ECCV 2020, Online, 2020.
pdf
doi
bibtex
abstract
@inproceedings{lecacheux2020taskcv, author = {Le Cacheux, Yannick and {Le Borgne}, Herv{\'e} and Crucianu, Michel}, title = {Using Sentences as Semantic Representations in Large Scale Zero-Shot Learning}, booktitle = {TASK-CV Workshop - ECCV 2020}, year = {2020}, address = {Online}, doi = {10.1007/978-3-030-66415-2_42}, url_PDF = {https://arxiv.org/pdf/2010.02959.pdf}, abstract = {Zero-shot learning aims to recognize instances of unseen classes, for which no visual instance is available during training, by learning multimodal relations between samples from seen classes and corresponding class semantic representations. These class representations usually consist of either attributes, which do not scale well to large datasets, or word embeddings, which lead to poorer performance. A good trade-off could be to employ short sentences in natural language as class descriptions. We explore different solutions to use such short descriptions in a ZSL setting and show that while simple methods cannot achieve very good results with sentences alone, a combination of usual word embeddings and sentences can significantly outperform current state-of-the-art. }, keywords = {zero-shot-learning} }
Zero-shot learning aims to recognize instances of unseen classes, for which no visual instance is available during training, by learning multimodal relations between samples from seen classes and corresponding class semantic representations. These class representations usually consist of either attributes, which do not scale well to large datasets, or word embeddings, which lead to poorer performance. A good trade-off could be to employ short sentences in natural language as class descriptions. We explore different solutions to use such short descriptions in a ZSL setting and show that while simple methods cannot achieve very good results with sentences alone, a combination of usual word embeddings and sentences can significantly outperform current state-of-the-art.
Le Cacheux, Y.; Popescu, A.; and Le Borgne, H.
Webly Supervised Semantic Embeddings for Large Scale Zero-Shot Learning.
In Asian Conference on Computer Vision (ACCV), 2020.
Paper
bibtex
@inproceedings{lecacheux2020accv, author = {Le Cacheux, Yannick and Popescu, Adrian and {Le Borgne}, Herv{\'e}}, title = {Webly Supervised Semantic Embeddings for Large Scale Zero-Shot Learning}, booktitle = "Asian Conference on Computer Vision (ACCV)", year = {2020}, url = {https://openaccess.thecvf.com/content/ACCV2020/papers/Le_Cacheux_Webly_Supervised_Semantic_Embeddings_for_Large_Scale_Zero-Shot_Learning_ACCV_2020_paper.pdf}, keywords = {zero-shot-learning} }
Plumerault, A.; Le Borgne, H.; and Hudelot, C.
Controlling generative models with continuous factors of variations.
In International Conference on Learning Representations (ICLR), 2020.
Paper
pdf
code
bibtex
abstract
@inproceedings{plumerault2020iclr, title = {Controlling generative models with continuous factors of variations}, author = {Antoine Plumerault and Herv{\'e} {Le Borgne} and C{\'e}line Hudelot}, booktitle = {International Conference on Learning Representations (ICLR)}, year = {2020}, url = {https://openreview.net/forum?id=H1laeJrKDB}, url_PDF = {https://arxiv.org/pdf/2001.10238.pdf}, url_Code = {https://github.com/AntoinePlumerault/Controlling-generative-models-with-continuous-factors-of-variations}, abstract = {Recent deep generative models are able to provide photo-realistic images as well as visual or textual content embeddings useful to address various tasks of computer vision and natural language processing. Their usefulness is nevertheless often limited by the lack of control over the generative process or the poor understanding of the learned representation. To overcome these major issues, very recent work has shown the interest of studying the semantics of the latent space of generative models. In this paper, we propose to advance on the interpretability of the latent space of generative models by introducing a new method to find meaningful directions in the latent space of any generative model along which we can move to control precisely specific properties of the generated image like the position or scale of the object in the image. Our method does not require human annotations and is particularly well suited for the search of directions encoding simple transformations of the generated image, such as translation, zoom or color variations. We demonstrate the effectiveness of our method qualitatively and quantitatively, both for GANs and variational auto-encoders.}, keywords = {generative-models} }
Recent deep generative models are able to provide photo-realistic images as well as visual or textual content embeddings useful to address various tasks of computer vision and natural language processing. Their usefulness is nevertheless often limited by the lack of control over the generative process or the poor understanding of the learned representation. To overcome these major issues, very recent work has shown the value of studying the semantics of the latent space of generative models. In this paper, we propose to advance the interpretability of the latent space of generative models by introducing a new method to find meaningful directions in the latent space of any generative model, along which one can move to precisely control specific properties of the generated image, such as the position or scale of the object in the image. Our method does not require human annotations and is particularly well suited for the search of directions encoding simple transformations of the generated image, such as translation, zoom or color variations. We demonstrate the effectiveness of our method qualitatively and quantitatively, both for GANs and variational auto-encoders.
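Once such a direction is found, controlling a factor of variation reduces to decoding latent codes sampled along a line z + t·d for increasing t. A trivial sketch of building that trajectory (the generator itself is omitted; decoding each row should vary only the targeted factor):

```python
import numpy as np

def latent_traversal(z, direction, t_values):
    """Latent codes obtained by moving a seed code along one direction.

    z:         (D,) seed latent code.
    direction: (D,) direction encoding a factor of variation (e.g. position);
               normalized internally so t is in consistent units.
    t_values:  iterable of scalar displacements along the direction.
    Returns a (T, D) array of latent codes to feed to the generator.
    """
    z = np.asarray(z, dtype=float)
    d = np.asarray(direction, dtype=float)
    d = d / np.linalg.norm(d)
    return np.stack([z + t * d for t in t_values])
```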
Adjali, O.; Besançon, R.; Ferret, O.; Le Borgne, H.; and Grau, B.
Multimodal Entity Linking for Tweets.
In European Conference on Information Retrieval (ECIR), Lisbon, Portugal, 4 2020.
pdf
code
doi
bibtex
abstract
@inproceedings{adjali2020ecir, title = {Multimodal Entity Linking for Tweets}, author = {Adjali, Omar and Besan\c{c}on, Romaric and Ferret, Olivier and {Le Borgne}, Herv{\'e} and Grau, Brigitte}, booktitle = {European Conference on Information Retrieval (ECIR)}, year = {2020}, month = {4}, day = {14--17}, address = {Lisbon, Portugal}, doi = {10.1007/978-3-030-45439-5_31}, url_PDF = {https://arxiv.org/pdf/2104.03236.pdf}, url_Code = {https://github.com/OA256864/MEL_Tweets}, abstract = {In many information extraction applications, entity linking (EL) has emerged as a crucial task that allows leveraging information about named entities from a knowledge base. In this paper, we address the task of multimodal entity linking (MEL), an emerging research field in which textual and visual information is used to map an ambiguous mention to an entity in a knowledge base (KB). First, we propose a method for building a fully annotated Twitter dataset for MEL, where entities are defined in a Twitter KB. Then, we propose a model for jointly learning a representation of both mentions and entities from their textual and visual contexts. We demonstrate the effectiveness of the proposed model by evaluating it on the proposed dataset and highlight the importance of leveraging visual information when it is available.}, keywords = {vision-language} }
In many information extraction applications, entity linking (EL) has emerged as a crucial task that allows leveraging information about named entities from a knowledge base. In this paper, we address the task of multimodal entity linking (MEL), an emerging research field in which textual and visual information is used to map an ambiguous mention to an entity in a knowledge base (KB). First, we propose a method for building a fully annotated Twitter dataset for MEL, where entities are defined in a Twitter KB. Then, we propose a model for jointly learning a representation of both mentions and entities from their textual and visual contexts. We demonstrate the effectiveness of the proposed model by evaluating it on the proposed dataset and highlight the importance of leveraging visual information when it is available.
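Schematically, multimodal entity linking scores each KB entity against a mention by combining textual and visual similarities. The sketch below uses a fixed interpolation weight purely for illustration, whereas the paper learns a joint representation of mentions and entities:

```python
import numpy as np

def cosine(u, v):
    """Cosine similarity between two embedding vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def link_mention(mention_text, mention_image, entities, alpha=0.5):
    """Pick the KB entity best matching a mention's text and image embeddings.

    mention_text / mention_image: embedding vectors of the ambiguous mention.
    entities: list of (name, text_emb, image_emb) knowledge-base entries.
    alpha: fusion weight between modalities (a hypothetical fixed choice).
    """
    scores = [
        alpha * cosine(mention_text, t) + (1 - alpha) * cosine(mention_image, v)
        for _, t, v in entities
    ]
    return entities[int(np.argmax(scores))][0]
```

With alpha = 1 this degenerates to text-only entity linking; the gap between the two settings is one way to quantify the contribution of the visual modality.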
Adjali, O.; Besançon, R.; Ferret, O.; Le Borgne, H.; and Grau, B.
Building a Multimodal Entity Linking Dataset From Tweets.
In International Conference on Language Resources and Evaluation (LREC), Marseille, France, 5 2020.
Paper
pdf
code
bibtex
abstract
@inproceedings{adjali2020lrec, title = {Building a Multimodal Entity Linking Dataset From Tweets}, author = {Adjali, Omar and Besan\c{c}on, Romaric and Ferret, Olivier and {Le Borgne}, Herv{\'e} and Grau, Brigitte}, booktitle = {International Conference on Language Resources and Evaluation (LREC)}, year = {2020}, month = {5}, day = {11--16}, address = {Marseille, France}, url = {https://aclanthology.org/2020.lrec-1.528/}, url_PDF = {https://aclanthology.org/2020.lrec-1.528.pdf}, url_Code = {https://github.com/OA256864/MEL_Tweets}, abstract = {The task of Entity linking, which aims at associating an entity mention with a unique entity in a knowledge base (KB), is useful for advanced Information Extraction tasks such as relation extraction or event detection. Most of the studies that address this problem rely only on textual documents while an increasing number of sources are multimedia, in particular in the context of social media where messages are often illustrated with images. In this article, we address the Multimodal Entity Linking (MEL) task, and more particularly the problem of its evaluation. To this end, we propose a novel method to quasi-automatically build annotated datasets to evaluate methods on the MEL task. The method collects text and images to jointly build a corpus of tweets with ambiguous mentions along with a Twitter KB defining the entities. We release a new annotated dataset of Twitter posts associated with images. We study the key characteristics of the proposed dataset and evaluate the performance of several MEL approaches on it.}, keywords = {vision-language} }
The task of Entity linking, which aims at associating an entity mention with a unique entity in a knowledge base (KB), is useful for advanced Information Extraction tasks such as relation extraction or event detection. Most of the studies that address this problem rely only on textual documents while an increasing number of sources are multimedia, in particular in the context of social media where messages are often illustrated with images. In this article, we address the Multimodal Entity Linking (MEL) task, and more particularly the problem of its evaluation. To this end, we propose a novel method to quasi-automatically build annotated datasets to evaluate methods on the MEL task. The method collects text and images to jointly build a corpus of tweets with ambiguous mentions along with a Twitter KB defining the entities. We release a new annotated dataset of Twitter posts associated with images. We study the key characteristics of the proposed dataset and evaluate the performance of several MEL approaches on it.
Tamaazousti, Y.; Le Borgne, H.; Hudelot, C.; Seddik, M. E.; and Tamaazousti, M.
Learning More Universal Representations for Transfer-Learning.
IEEE T. Pattern Analysis and Machine Intelligence (PAMI), 42(9): 2212-2224. 2020.
(online pre-print, 30 April 2019)
pdf
code
doi
bibtex
abstract
@article{tamaazousti2020pami, author = {Youssef Tamaazousti and Le Borgne, Herv{\'e} and C{\'e}line Hudelot and Mohammed El-Amine Seddik and Mohammed Tamaazousti}, title = {Learning More Universal Representations for Transfer-Learning}, journal = {IEEE T. Pattern Analysis and Machine Intelligence (PAMI)}, year = {2020}, volume = {42}, number = {9}, pages = {2212-2224}, publisher={IEEE}, doi = {10.1109/TPAMI.2019.2913857}, url_PDF = {https://arxiv.org/pdf/1712.09708.pdf}, url_code = {https://github.com/youssefTamaazousti/MuldipNet-tensorflow}, note = {(online pre-print, 30 april 2019)}, abstract = {A representation is supposed universal if it encodes any element of the visual world (e.g., objects, scenes) in any configuration (e.g., scale, context). While not expecting pure universal representations, the goal in the literature is to improve the universality level, starting from a representation with a certain level. To do so, the state-of-the-art consists in learning CNN-based representations on a diversified training problem (e.g., ImageNet modified by adding annotated data). While it effectively increases universality, such approach still requires a large amount of efforts to satisfy the needs in annotated data. In this work, we propose two methods to improve universality, but pay special attention to limit the need of annotated data. We also propose a unified framework of the methods based on the diversifying of the training problem. Finally, to better match Atkinson’s cognitive study about universal human representations, we proposed to rely on the transfer-learning scheme as well as a new metric to evaluate universality. This latter, aims us to demonstrates the interest of our methods on 10 target-problems, relating to the classification task and a variety of visual domains.}, keywords = {frugal-learning} }
A representation is said to be universal if it encodes any element of the visual world (e.g., objects, scenes) in any configuration (e.g., scale, context). While pure universal representations are not expected, the goal in the literature is to improve the universality level of a given representation. To do so, the state of the art consists in learning CNN-based representations on a diversified training problem (e.g., ImageNet modified by adding annotated data). While this effectively increases universality, such an approach still requires a large annotation effort. In this work, we propose two methods to improve universality while paying special attention to limiting the need for annotated data. We also propose a unified framework for these methods, based on diversifying the training problem. Finally, to better match Atkinson's cognitive study about universal human representations, we propose to rely on the transfer-learning scheme as well as a new metric to evaluate universality. The latter allows us to demonstrate the interest of our methods on 10 target problems, covering the classification task and a variety of visual domains.
2019
(2)
Le Cacheux, Y.; Le Borgne, H.; and Crucianu, M.
Modeling Inter and Intra-Class Relations in the Triplet Loss for Zero-Shot Learning.
In IEEE/CVF International Conference on Computer Vision (ICCV), Seoul, Korea, 2019.
pdf
supp
code
bibtex
abstract
@inproceedings{lecacheux2019iccv, author = {Le Cacheux, Yannick and Le Borgne, Herv{\'e} and Crucianu, Michel}, title = {Modeling Inter and Intra-Class Relations in the Triplet Loss for Zero-Shot Learning}, booktitle = {IEEE/CVF International Conference on Computer Vision (ICCV)}, year = {2019}, address = {Seoul, Korea}, url_PDF = {https://openaccess.thecvf.com/content_ICCV_2019/papers/Le_Cacheux_Modeling_Inter_and_Intra-Class_Relations_in_the_Triplet_Loss_for_ICCV_2019_paper.pdf}, url_supp = {https://openaccess.thecvf.com/content_ICCV_2019/supplemental/Le_Cacheux_Modeling_Inter_and_ICCV_2019_supplemental.pdf}, url_Code = {https://github.com/yannick-lc/iccv2019-triplet-loss}, abstract = {Recognizing visual unseen classes, i.e. for which no training data is available, is known as Zero Shot Learning (ZSL). Some of the best performing methods apply the triplet loss to seen classes to learn a mapping between visual representations of images and attribute vectors that constitute class prototypes. They nevertheless make several implicit assumptions that limit their performance on real use cases, particularly with fine-grained datasets comprising a large number of classes. We identify three of these assumptions and put forward corresponding novel contributions to address them. Our approach consists in taking into account both inter-class and intra-class relations, respectively by being more permissive with confusions between similar classes, and by penalizing visual samples which are atypical to their class. The approach is tested on four datasets, including the large-scale ImageNet, and exhibitsperformances significantly above recent methods, even generative methods based on more restrictive hypotheses.}, keywords = {zero-shot-learning} }
Recognizing visual unseen classes, i.e. for which no training data is available, is known as Zero-Shot Learning (ZSL). Some of the best performing methods apply the triplet loss to seen classes to learn a mapping between visual representations of images and attribute vectors that constitute class prototypes. They nevertheless make several implicit assumptions that limit their performance on real use cases, particularly with fine-grained datasets comprising a large number of classes. We identify three of these assumptions and put forward corresponding novel contributions to address them. Our approach consists in taking into account both inter-class and intra-class relations, respectively by being more permissive with confusions between similar classes, and by penalizing visual samples which are atypical to their class. The approach is tested on four datasets, including the large-scale ImageNet, and exhibits performances significantly above recent methods, even generative methods based on more restrictive hypotheses.
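The margin-based triplet objective underlying this line of work can be illustrated with a minimal sketch (a toy NumPy illustration of a standard triplet loss, not the authors' implementation; the embeddings, prototypes and margin value are made up):

```python
import numpy as np

def triplet_loss(image_emb, pos_proto, neg_proto, margin=0.1):
    """Hinge-style triplet loss: the compatibility (dot product) of an
    image with its own class prototype must exceed its compatibility
    with another class's prototype by at least `margin`."""
    pos_score = float(np.dot(image_emb, pos_proto))
    neg_score = float(np.dot(image_emb, neg_proto))
    return max(0.0, margin - pos_score + neg_score)

# Toy check: an image perfectly aligned with its class prototype and
# orthogonal to the other one incurs no loss.
img = np.array([1.0, 0.0])
loss = triplet_loss(img, pos_proto=np.array([1.0, 0.0]),
                    neg_proto=np.array([0.0, 1.0]))
# margin - 1.0 + 0.0 = -0.9, clamped to 0.0
```

The paper's contributions modulate this basic objective (per-pair margins, atypical-sample penalties); the sketch only shows the common starting point.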
Le Cacheux, Y.; Le Borgne, H.; and Crucianu, M.
From Classical to Generalized Zero-Shot Learning: a Simple Adaptation Process.
In International Conference on MultiMedia Modeling (MMM), Thessaloniki, Greece, 1 2019.
pdf
doi
bibtex
abstract
@inproceedings{lecacheux2019mmm, author = {Le Cacheux, Yannick and Le Borgne, Herv{\'e} and Crucianu, Michel}, title = {From Classical to Generalized Zero-Shot Learning: a Simple Adaptation Process}, booktitle = {International Conference on MultiMedia Modeling (MMM)}, year = {2019}, address = {Thessaloniki, Grece}, month = {1}, url_PDF = {https://arxiv.org/pdf/1809.10120.pdf}, doi = {10.1007/978-3-030-05716-9_38}, abstract = {Zero-shot learning (ZSL) is concerned with the recognition of previously unseen classes. It relies on additional semantic knowledge for which a mapping can be learned with training examples of seen classes. While classical ZSL considers the recognition performance on unseen classes only, generalized zero-shot learning (GZSL) aims at maximizing performance on both seen and unseen classes. In this paper, we propose a new process for training and evaluation in the GZSL setting; this process addresses the gap in performance between samples from unseen and seen classes by penalizing the latter, and enables to select hyper-parameters well-suited to the GZSL task. It can be applied to any existing ZSL approach and leads to a significant performance boost: the experimental evaluation shows that GZSL performance, averaged over eight state-of-the-art methods, is improved from 28.5 to 42.2 on CUB and from 28.2 to 57.1 on AwA2.}, keywords = {zero-shot-learning} }
Zero-shot learning (ZSL) is concerned with the recognition of previously unseen classes. It relies on additional semantic knowledge for which a mapping can be learned with training examples of seen classes. While classical ZSL considers the recognition performance on unseen classes only, generalized zero-shot learning (GZSL) aims at maximizing performance on both seen and unseen classes. In this paper, we propose a new process for training and evaluation in the GZSL setting; this process addresses the gap in performance between samples from unseen and seen classes by penalizing the latter, and enables to select hyper-parameters well-suited to the GZSL task. It can be applied to any existing ZSL approach and leads to a significant performance boost: the experimental evaluation shows that GZSL performance, averaged over eight state-of-the-art methods, is improved from 28.5 to 42.2 on CUB and from 28.2 to 57.1 on AwA2.
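The seen-class penalization described above can be sketched as subtracting a constant from seen-class scores before taking the argmax (a toy illustration; the score values and the penalty `gamma` are hypothetical, not the paper's settings):

```python
import numpy as np

def gzsl_predict(scores, seen_mask, gamma=0.5):
    """Predict a class over the union of seen and unseen classes after
    subtracting a constant penalty `gamma` from seen-class scores."""
    adjusted = scores - gamma * seen_mask.astype(float)
    return int(np.argmax(adjusted))

scores = np.array([0.9, 0.6])        # class 0 seen, class 1 unseen
seen = np.array([True, False])
# gamma = 0 keeps the usual prediction, biased toward seen classes;
# a positive gamma lets the slightly weaker unseen class win.
```

The hyper-parameter `gamma` is exactly the kind of quantity that, per the abstract, must be selected on a validation protocol suited to the GZSL task rather than to classical ZSL.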
2018
(1)
Girard, J.; Tamaazousti, Y.; Le Borgne, H.; and Hudelot, C.
Learning Finer-class Networks for Universal Representation.
In British Machine Vision Conference (BMVC), Newcastle Upon Tyne (UK), 2018.
pdf
supp
bibtex
abstract
@inproceedings{girard18bmvc, author = {Girard, Julien and Tamaazousti, Youssef and Le Borgne, Herv{\'e} and C{\'e}line Hudelot}, title = {Learning Finer-class Networks for Universal Representation}, booktitle = {British Machine Vision Conference (BMVC)}, year = {2018}, address = {Newcastle Upon Tyne (UK)}, url_PDF = {http://bmvc2018.org/contents/papers/1021.pdf}, url_supp = {http://bmvc2018.org/contents/supplementary/pdf/1021_supp.pdf}, abstract = {Many real-world visual recognition use-cases can not directly benefit from state-of-the-art CNN-based approaches because of the lack of many annotated data. The usual approach to deal with this is to transfer a representation pre-learned on a large annotated source-task onto a target-task of interest. This raises the question of how well the original representation is “universal”, that is to say directly adapted to many different target-tasks. To improve such universality, the state-of-the-art consists in training networks on a diversified source problem, that is modified either by adding generic or specific categories to the initial set of categories. In this vein, we proposed a method that exploits finer-classes than the most specific ones existing, for which no annotation is available. We rely on unsupervised learning and a bottom-up split and merge strategy. We show that our method learns more universal representations than state-of-the-art, leading to significantly better results on 10 target-tasks from multiple domains, using several network architectures, either alone or combined with networks learned at a coarser semantic level.}, keywords = {frugal-learning} }
Many real-world visual recognition use cases cannot directly benefit from state-of-the-art CNN-based approaches because of the lack of annotated data. The usual approach to deal with this is to transfer a representation pre-learned on a large annotated source task onto a target task of interest. This raises the question of how "universal" the original representation is, that is to say, how directly it adapts to many different target tasks. To improve such universality, the state of the art consists in training networks on a diversified source problem, modified either by adding generic or specific categories to the initial set of categories. In this vein, we propose a method that exploits finer classes than the most specific existing ones, for which no annotation is available. We rely on unsupervised learning and a bottom-up split-and-merge strategy. We show that our method learns more universal representations than the state of the art, leading to significantly better results on 10 target tasks from multiple domains, using several network architectures, either alone or combined with networks learned at a coarser semantic level.
2017
(7)
Tamaazousti, Y.; Le Borgne, H.; Popescu, A.; Gadeski, E.; Ginsca, A. L.; and Hudelot, C.
Vision-language integration using constrained local semantic features.
Computer Vision and Image Understanding, 163(Supplement C): 41 - 57. 2017.
Paper
doi
bibtex
abstract
@article{tamaazousti2017cviu, title = {Vision-language integration using constrained local semantic features}, journal = {Computer Vision and Image Understanding}, author = {Tamaazousti, Youssef and Le Borgne, Herv\'{e} and Popescu, Adrian and Gadeski, Etienne and Ginsca, Alexandru Lucian and Hudelot, C{\'e}line}, volume = {163}, number = {Supplement C}, pages = {41 - 57}, year = {2017}, issn = {1077-3142}, doi = {10.1016/j.cviu.2017.05.017}, url = {https://www.sciencedirect.com/science/article/abs/pii/S1077314217301121}, abstract = {This paper tackles two recent promising issues in the field of computer vision, namely “the integration of linguistic and visual information” and “the use of semantic features to represent the image content”. Semantic features represent images according to some visual concepts that are detected into the image by a set of base classifiers. Recent works exhibit competitive performances in image classification and retrieval using such features. We propose to rely on this type of image descriptions to facilitate its integration with linguistic data. More precisely, the contribution of this paper is threefold. First, we propose to automatically determine the most useful dimensions of a semantic representation according to the actual image content. Hence, it results into a level of sparsity for the semantic features that is adapted to each image independently. Our model takes into account both the confidence on each base classifier and the global amount of information of the semantic signature, defined in the Shannon sense. This contribution is further extended to better reflect the detection of a visual concept at a local scale. Second, we introduce a new strategy to learn an efficient mid-level representation by CNNs that boosts the performance of semantic signatures. 
Last, we propose several schemes to integrate a visual representation based on semantic features with some linguistic piece of information, leading to the nesting of linguistic information at two levels of the visual features. Experimental validation is conducted on four benchmarks (VOC 2007, VOC 2012, Nus-Wide and MIT Indoor) for classification, three of them for retrieval and two of them for bi-modal classification. The proposed semantic feature achieves state-of-the-art performances on three classification benchmarks and all retrieval ones. Regarding our vision-language integration method, it achieves state-of-the-art performances in bi-modal classification.}, keywords = {vision-language} }
This paper tackles two recent promising issues in the field of computer vision, namely "the integration of linguistic and visual information" and "the use of semantic features to represent the image content". Semantic features represent images according to some visual concepts that are detected in the image by a set of base classifiers. Recent works exhibit competitive performances in image classification and retrieval using such features. We propose to rely on this type of image description to facilitate its integration with linguistic data. More precisely, the contribution of this paper is threefold. First, we propose to automatically determine the most useful dimensions of a semantic representation according to the actual image content. Hence, it results in a level of sparsity for the semantic features that is adapted to each image independently. Our model takes into account both the confidence in each base classifier and the global amount of information of the semantic signature, defined in the Shannon sense. This contribution is further extended to better reflect the detection of a visual concept at a local scale. Second, we introduce a new strategy to learn an efficient mid-level representation by CNNs that boosts the performance of semantic signatures. Last, we propose several schemes to integrate a visual representation based on semantic features with some linguistic piece of information, leading to the nesting of linguistic information at two levels of the visual features. Experimental validation is conducted on four benchmarks (VOC 2007, VOC 2012, NUS-WIDE and MIT Indoor) for classification, three of them for retrieval and two of them for bi-modal classification. The proposed semantic feature achieves state-of-the-art performances on three classification benchmarks and all retrieval ones. Regarding our vision-language integration method, it achieves state-of-the-art performances in bi-modal classification.
Vo, P.; Ginsca, A. L.; Le Borgne, H.; and Popescu, A.
Harnessing Noisy Web Images for Deep Representation.
Computer Vision and Image Understanding. 1 2017.
(online January 2017)
doi
bibtex
@article{vo2017cviu, author = {Vo, Phong and Ginsca, Alexandru Lucian and Le Borgne, Herv\'{e} and Popescu, Adrian}, title = {Harnessing Noisy Web Images for Deep Representation}, journal = {Computer Vision and Image Understanding}, year = {2017}, month = {1}, note = {on line jan 2017}, doi = {10.1016/j.cviu.2017.01.009}, keywords = {frugal-learning} }
Gadeski, E.; Le Borgne, H.; and Popescu, A.
Fast and robust duplicate image detection on the web.
Multimedia Tools and Applications, 76: 11839-11858. 5 2017.
pdf
doi
bibtex
abstract
@article{gadeski2017mtap, author = {Gadeski, Etienne and Le Borgne, Herv{\'e} and Popescu, Adrian}, title = {Fast and robust duplicate image detection on the web}, year = {2017}, journal = {Multimedia Tools and Applications}, volume = {76}, pages = {11839-11858}, month = {5}, doi = {10.1007/s11042-016-3619-4}, url_PDF = {https://hal.science/hal-01845526v1/document}, abstract = {Social media intelligence is interested in detecting the massive propagation of similar visual content. It can be seen, under certain conditions, as a problem of detecting near duplicate images in a stream of web data. However, in the context considered, it requires not only an efficient indexing and searching algorithm but also to be fast to compute the image description, since the total time of description and searching must be short enough to satisfy the constraint induced by the web stream flow rate. While most of methods of the state of the art focus on the efficiency at searching time, we propose a new descriptor satisfying the aforementioned requirements. We evaluate our method on two different datasets with the use of different sets of distractor images, leading to large-scale image collections (up to 100 million images). We compare our method to the state of the art and show it exhibits among the best detection performances but is much faster (one to two orders of magnitude).}, keywords = {frugal-learning} }
Social media intelligence is interested in detecting the massive propagation of similar visual content. It can be seen, under certain conditions, as a problem of detecting near-duplicate images in a stream of web data. However, in the context considered, it requires not only an efficient indexing and searching algorithm but also a fast-to-compute image description, since the total time of description and searching must be short enough to satisfy the constraint induced by the web stream flow rate. While most methods of the state of the art focus on efficiency at searching time, we propose a new descriptor satisfying the aforementioned requirements. We evaluate our method on two different datasets with the use of different sets of distractor images, leading to large-scale image collections (up to 100 million images). We compare our method to the state of the art and show it exhibits among the best detection performances while being much faster (one to two orders of magnitude).
Popescu, A.; Ginsca, A. L.; and Le Borgne, H.
Scale-Free Content Based Image Retrieval (or Nearly So).
In IEEE International Conference on Computer Vision (ICCV) workshop on Web-Scale Vision and Social Media, pages 280-288, Venice, 2017.
pdf
doi
bibtex
abstract
@InProceedings{popescu2017iccv_w, author = {Popescu, Adrian and Ginsca, Alexandru Lucian and Le Borgne, Herv{\'e}}, title = {Scale-Free Content Based Image Retrieval (or Nearly So)}, booktitle = {IEEE International Conference on Computer Vision (ICCV) workshop on Web-Scale Vision and Social Media}, year = {2017}, pages = {280-288}, address = {Venice}, doi = {10.1109/ICCVW.2017.42}, url_PDF = {https://openaccess.thecvf.com/content_ICCV_2017_workshops/papers/w5/Popescu_Scale-Free_Content_Based_ICCV_2017_paper.pdf}, abstract = {When textual annotations of Web and social media images are poor or missing, content-based image retrieval is an interesting way to access them. Finding an optimal trade-off between accuracy and scalability for CBIR is challenging in practice. We propose a retrieval method whose complexity is nearly independent of the collection scale and does not degrade results quality. Images are represented with sparse semantic features that can be stored as an inverted index. Search complexity is drastically reduced by (1) considering the query feature dimensions independently and thus turning search into a concatenation operation and (2) pruning the index in function of a retrieval objective. To improve precision, the inverted index look-up is complemented with an exhaustive search over a fixed size list of intermediary results. We run experiments with three public collections and results show that our much faster method slightly outperforms an exhaustive search done with two competitive baselines.}, keywords = {} }
When textual annotations of Web and social media images are poor or missing, content-based image retrieval is an interesting way to access them. Finding an optimal trade-off between accuracy and scalability for CBIR is challenging in practice. We propose a retrieval method whose complexity is nearly independent of the collection scale and does not degrade result quality. Images are represented with sparse semantic features that can be stored as an inverted index. Search complexity is drastically reduced by (1) considering the query feature dimensions independently, thus turning search into a concatenation operation, and (2) pruning the index as a function of a retrieval objective. To improve precision, the inverted index look-up is complemented with an exhaustive search over a fixed-size list of intermediary results. We run experiments with three public collections and the results show that our much faster method slightly outperforms an exhaustive search done with two competitive baselines.
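The sparse-feature inverted index at the core of this approach can be sketched as follows (a toy illustration under assumed data structures; the image ids, dimensions and weights are made up, and the paper's pruning step is omitted):

```python
from collections import defaultdict

def build_index(features):
    """Invert sparse semantic features: each non-zero dimension maps to
    the list of (image_id, weight) pairs that activate it.
    `features` is {image_id: {dim: weight}}."""
    index = defaultdict(list)
    for img_id, dims in features.items():
        for dim, weight in dims.items():
            index[dim].append((img_id, weight))
    return index

def search(index, query_dims, top_k=2):
    """Score only the images sharing a non-zero dimension with the
    query, so cost depends on posting-list lengths, not collection size."""
    scores = defaultdict(float)
    for dim, q_weight in query_dims.items():
        for img_id, weight in index.get(dim, []):
            scores[img_id] += q_weight * weight
    return sorted(scores, key=scores.get, reverse=True)[:top_k]

feats = {"a": {0: 1.0, 2: 0.5}, "b": {1: 0.8}, "c": {0: 0.3}}
idx = build_index(feats)
# A query activating dimension 0 only ever touches images "a" and "c".
```

In the paper, this look-up is followed by an exhaustive re-ranking of a fixed-size candidate list; the sketch stops at the index stage.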
Chami, I.; Tamaazousti, Y.; and Le Borgne, H.
AMECON: Abstract Meta-Concept Features for Text-Illustration.
In ACM International Conference on Multimedia Retrieval (ICMR), Bucharest, 2017.
pdf
slides
doi
bibtex
abstract
@InProceedings{chami2017icmr, author = {Chami,Ines and Tamaazousti, Youssef and Le Borgne, Herv{\'e}}, title = {AMECON: Abstract Meta-Concept Features for Text-Illustration}, booktitle = {ACM International Conference on Multimedia Retrieval (ICMR)}, year = {2017}, address = {Bucharest}, doi = {10.1145/3078971.3078993}, url_PDF = {http://people.csail.mit.edu/ytamaaz/files/pdf/AMECON_Abstract_Meta_Concept_Features_for_Text_Illustration.pdf}, url_slides = {http://people.csail.mit.edu/ytamaaz/files/slides/Chami_Tamaazousti_LeBorgne_ICMR17.pdf}, abstract = {Cross-media retrieval is a problem of high interest that is at the frontier between computer vision and natural language processing. The state-of-the-art in the domain consists of learning a common space with regard to some constraints of correlation or similarity from two textual and visual modalities that are processed in parallel and possibly jointly. This paper proposes a different approach that considers the cross-modal problem as a supervised mapping of visual modalities to textual ones. Each modality is thus seen as a particular projection of an abstract meta-concept, each of its dimension subsuming several semantic concepts (``meta'' aspect) but may not correspond to an actual one (``abstract'' aspect). In practice, the textual modality is used to generate a multi-label representation, further used to map the visual modality through a simple shallow neural network. While being quite easy to implement, the experiments show that our approach significantly outperforms the state-of-the-art on Flickr-8K and Flickr-30K datasets for the text-illustration task}, keywords = {vision-language} }
Cross-media retrieval is a problem of high interest that is at the frontier between computer vision and natural language processing. The state of the art in the domain consists of learning a common space, with regard to some constraints of correlation or similarity, from textual and visual modalities that are processed in parallel and possibly jointly. This paper proposes a different approach that considers the cross-modal problem as a supervised mapping of visual modalities to textual ones. Each modality is thus seen as a particular projection of an abstract meta-concept, each of its dimensions subsuming several semantic concepts ("meta" aspect) but possibly not corresponding to an actual one ("abstract" aspect). In practice, the textual modality is used to generate a multi-label representation, further used to map the visual modality through a simple shallow neural network. While being quite easy to implement, the experiments show that our approach significantly outperforms the state of the art on the Flickr-8K and Flickr-30K datasets for the text-illustration task.
Tamaazousti, Y.; Le Borgne, H.; and Hudelot, C.
MuCaLe-Net: Multi Categorical-Level Networks to Generate More Discriminating Features.
In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, 2017.
pdf
supp
code
doi
bibtex
abstract
@InProceedings{tamaazousti2017cvpr, author = {Tamaazousti, Youssef and Le Borgne, Herv{\'e} and Hudelot,C{\'e}line}, title = {MuCaLe-Net: Multi Categorical-Level Networks to Generate More Discriminating Features}, booktitle = {IEEE Conference on Computer Vision and Pattern Recognition (CVPR)}, year = {2017}, address = {Honolulu}, url_PDF = {https://openaccess.thecvf.com/content_cvpr_2017/papers/Tamaazousti_MuCaLe-Net_Multi_Categorical-Level_CVPR_2017_paper.pdf}, url_supp = {https://openaccess.thecvf.com/content_cvpr_2017/supplemental/Tamaazousti_MuCaLe-Net_Multi_Categorical-Level_2017_CVPR_supplemental.pdf}, url_code = {https://github.com/youssefTamaazousti/MuldipNet-tensorflow}, doi = {10.1109/CVPR.2017.561}, abstract = {In a transfer-learning scheme, the intermediate layers of a pre-trained CNN are employed as universal image representation to tackle many visual classification problems. The current trend to generate such representation is to learn a CNN on a large set of images labeled among the most specific categories. Such processes ignore potential relations between categories, as well as the categorical-levels used by humans to classify. In this paper, we propose Multi Categorical-Level Networks (MuCaLe-Net) that include human-categorization knowledge into the CNN learning process. A MuCaLe-Net separates generic categories from each other while it independently distinguishes specific ones. It thereby generates different features in the intermediate layers that are complementary when combined together. Advantageously, our method does not require additive data nor annotation to train the network. The extensive experiments over four publicly available benchmarks of image classification exhibit state-of-the-art performances.}, keywords = {frugal-learning} }
In a transfer-learning scheme, the intermediate layers of a pre-trained CNN are employed as a universal image representation to tackle many visual classification problems. The current trend to generate such a representation is to learn a CNN on a large set of images labeled among the most specific categories. Such processes ignore potential relations between categories, as well as the categorical levels used by humans to classify. In this paper, we propose Multi Categorical-Level Networks (MuCaLe-Net) that include human-categorization knowledge in the CNN learning process. A MuCaLe-Net separates generic categories from each other while it independently distinguishes specific ones. It thereby generates different features in the intermediate layers that are complementary when combined together. Advantageously, our method does not require additional data or annotation to train the network. Extensive experiments over four publicly available benchmarks of image classification exhibit state-of-the-art performances.
Daher, H.; Besançon, R.; Ferret, O.; Le Borgne, H.; Daquo, A.; and Tamaazousti, Y.
Supervised Learning of Entity Disambiguation Models by Negative Sample Selection.
In 18th International Conference on Computational Linguistics and Intelligent Text Processing (CICLing), Budapest, Hungary, 2017.
April 17–23
pdf
doi
bibtex
abstract
@inproceedings{daher2017cicling, author = {Daher, Hani and Besan\c{c}on, Romaric and Ferret, Olivier and Le Borgne, Herv{\'e} and Daquo,Anne-Laure and Tamaazousti, Youssef}, title = {Supervised Learning of Entity Disambiguation Models by Negative Sample Selection}, booktitle = {18th International Conference on Computational Linguistics and Intelligent Text Processing (CICLing)}, year = {2017}, address = {Budapest, Hungary}, note = {17 -- 23 avril}, doi = {10.1007/978-3-319-77113-7_26}, url_PDF = {http://people.csail.mit.edu/ytamaaz/files/pdf/Supervised_Learning_of_Entity_Disambiguation_Models_by_Negative_Sample_Selection.pdf}, abstract = {The objective of Entity Linking is to connect an entity mention in a text to a known entity in a knowledge base. The general approach for this task is to generate, for a given mention, a set of candidate entities from the base and determine, in a second step, the best one. This paper focuses on this last step and proposes a method based on learning a function that discriminates an entity from its most ambiguous ones. Our contribution lies in the strategy to learn efficiently such a model while keeping it compatible with large knowledge bases. We propose three strategies with different efficiency/performance trade-off, that are experimentally validated on six datasets of the TAC evaluation campaigns by using Freebase and DBpedia as reference knowledge bases.}, keywords = {} }
The objective of Entity Linking is to connect an entity mention in a text to a known entity in a knowledge base. The general approach for this task is to generate, for a given mention, a set of candidate entities from the base and determine, in a second step, the best one. This paper focuses on this last step and proposes a method based on learning a function that discriminates an entity from its most ambiguous ones. Our contribution lies in the strategy to learn such a model efficiently while keeping it compatible with large knowledge bases. We propose three strategies with different efficiency/performance trade-offs, which are experimentally validated on six datasets of the TAC evaluation campaigns using Freebase and DBpedia as reference knowledge bases.
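The idea of keeping only the most ambiguous negatives can be sketched as ranking wrong candidates by their similarity to the correct entity (a toy illustration; `char_overlap` is a crude stand-in for a real similarity function, and the entity names are made up):

```python
def select_negatives(correct, candidates, similarity, k=2):
    """Keep the k wrong candidates most similar to the correct entity,
    i.e. the hardest / most ambiguous ones, as training negatives."""
    wrong = [c for c in candidates if c != correct]
    return sorted(wrong, key=lambda c: similarity(correct, c),
                  reverse=True)[:k]

def char_overlap(a, b):
    """Crude stand-in similarity: number of distinct shared characters."""
    return len(set(a) & set(b))

cands = ["paris_france", "paris_texas", "parish", "london"]
negs = select_negatives("paris_france", cands, char_overlap, k=2)
# "paris_texas" and "parish" are kept; "london" is too easy a negative.
```

The paper's three strategies differ in how this selection is made tractable at knowledge-base scale; the sketch only shows the selection principle itself.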
2016
(3)
Tran, T. Q. N.; Le Borgne, H.; and Crucianu, M.
Aggregating Image and Text Quantized Correlated Components.
In IEEE Computer Vision and Pattern Recognition (CVPR), Las Vegas, USA, June 2016.
pdf
bibtex
abstract
@inproceedings{tran16cvpr, author = {Tran, Thi Quynh Nhi and Le Borgne, Herv{\'e} and Crucianu, Michel}, title = {Aggregating Image and Text Quantized Correlated Components}, booktitle = {IEEE Computer Vision and Pattern Recognition (CVPR)}, year = {2016}, address = {Las Vegas, USA}, month = {6}, url_PDF = {https://openaccess.thecvf.com/content_cvpr_2016/papers/Tran_Aggregating_Image_and_CVPR_2016_paper.pdf}, abstract = {Cross-modal tasks occur naturally for multimedia content that can be described along two or more modalities like visual content and text. Such tasks require to "translate" information from one modality to another. Methods like kernelized canonical correlation analysis (KCCA) attempt to solve such tasks by finding aligned subspaces in the description spaces of different modalities. Since they favor correlations against modality-specific information, these methods have shown some success in both cross-modal and bi-modal tasks. However, we show that a direct use of the subspace alignment obtained by KCCA only leads to coarse translation abilities. To address this problem, we first put forward here a new representation method that aggregates information provided by the projections of both modalities on their aligned subspaces. We further suggest a method relying on neighborhoods in these subspaces to complete uni-modal information. Our proposal exhibits state-of-the-art results for bi-modal classification on Pascal VOC07 and for cross-modal retrieval on FlickR 8K and FlickR 30K.}, keywords = {vision-language} }
Cross-modal tasks occur naturally for multimedia content that can be described along two or more modalities like visual content and text. Such tasks require to "translate" information from one modality to another. Methods like kernelized canonical correlation analysis (KCCA) attempt to solve such tasks by finding aligned subspaces in the description spaces of different modalities. Since they favor correlations against modality-specific information, these methods have shown some success in both cross-modal and bi-modal tasks. However, we show that a direct use of the subspace alignment obtained by KCCA only leads to coarse translation abilities. To address this problem, we first put forward here a new representation method that aggregates information provided by the projections of both modalities on their aligned subspaces. We further suggest a method relying on neighborhoods in these subspaces to complete uni-modal information. Our proposal exhibits state-of-the-art results for bi-modal classification on Pascal VOC07 and for cross-modal retrieval on FlickR 8K and FlickR 30K.
Tamaazousti, Y.; Le Borgne, H.; and Popescu, A.
Constrained Local Enhancement of Semantic Features by Content-Based Sparsity.
In ACM International Conference on Multimedia Retrieval (ICMR), New York, USA, June 2016.
pdf
bibtex
@inproceedings{tamaazousti16icmr_cbs, author = {Tamaazousti, Youssef and Le Borgne, Herv{\'e} and Popescu, Adrian}, title = {Constrained Local Enhancement of Semantic Features by Content-Based Sparsity}, booktitle = {ACM International Conference on Multimedia Retrieval (ICMR)}, year = {2016}, address = {New York, USA}, month = {6}, url_PDF = {http://people.csail.mit.edu/ytamaaz/files/pdf/Constrained_Local_Enhancement_of_Semantic_Features_by_Content-Based-Sparsity.pdf}, keywords = {frugal-learning,vision-language} }
Tamaazousti, Y.; Le Borgne, H.; and Hudelot, C.
Diverse Concept-Level Features for Multi-Object Classification.
In ACM International Conference on Multimedia Retrieval (ICMR), New York, USA, June 2016.
pdf
bibtex
@inproceedings{tamaazousti16icmr_dcl, author = {Tamaazousti, Youssef and Le Borgne, Herv{\'e} and Hudelot, C{\'e}line}, title = {Diverse Concept-Level Features for Multi-Object Classification}, booktitle = {ACM International Conference on Multimedia Retrieval (ICMR)}, year = {2016}, address = {New York, USA}, month = {6}, url_PDF = {http://people.csail.mit.edu/ytamaaz/files/pdf/Diverse_Concept-Level_Features_for_Multi-Object_Classification.pdf}, keywords = {frugal-learning,vision-language} }
2015
(3)
Ginsca, A. L.; Popescu, A.; Le Borgne, H.; Ballas, N.; Vo, P.; and Kanellos, I.
Large-scale Image Mining with Flickr Groups.
In 21st International Conference on Multimedia Modelling (MMM), 2015.
Best paper award
pdf
bibtex
@inproceedings{ginsca15semfeat, author = {Ginsca, Alexandru Lucian and Popescu, Adrian and Le Borgne, Herv\'{e} and Ballas, Nicolas and Vo, Phong and Kanellos, Ioannis}, title = {Large-scale Image Mining with Flickr Groups}, booktitle = {$21^{st}$ international conference on Multimedia Modelling (MMM)}, year = {2015}, note = {Best paper award}, url_PDF = {https://hal.science/hal-01172319/document}, keywords = {frugal-learning,vision-language} }
Popescu, A.; Hildebrandt, M.; Breuer, J.; Claeys, L.; Papadopoulos, S.; Petkos, G.; Michalareas, T.; Lund, D.; Heyman, R.; van der Graaf, S.; Gadeski, E.; Le Borgne, H.; deVries, K.; Kastrinogiannis, T.; Kousaridas, A.; and Padyab, A.
Increasing Transparency and Privacy for Online Social Network Users – USEMP Value Model, Scoring Framework and Legal.
In Berendt, B.; Engel, T.; Ikonomou, D.; Le Métayer, D.; and Schiffner, S., editor(s), Privacy Technologies and Policy: Third Annual Privacy Forum, APF 2015, Luxembourg, Luxembourg, October 7-8, 2015, Revised Selected Papers, pages 38–59, Cham, 2015. Springer International Publishing
bibtex
@inproceedings{popescu15privacy, author = {Popescu, Adrian and Hildebrandt, Mireille and Breuer, Jurgen and Claeys, Laurence and Papadopoulos, Symeon and Petkos, Giorgos and Michalareas, T. and Lund, David and Heyman, Rob and van der Graaf, S. and Gadeski, Etienne and Le Borgne, Herv{\'e} and deVries, Katia and Kastrinogiannis, Timotheos and Kousaridas, A. and Padyab, A.}, editor = {Berendt, Bettina and Engel, Thomas and Ikonomou, Demosthenes and Le M{\'e}tayer, Daniel and Schiffner, Stefan}, title = {Increasing Transparency and Privacy for Online Social Network Users -- USEMP Value Model, Scoring Framework and Legal}, bookTitle = {Privacy Technologies and Policy: Third Annual Privacy Forum, APF 2015, Luxembourg, Luxembourg, October 7-8, 2015, Revised Selected Papers}, year = {2015}, publisher = {Springer International Publishing}, address = {Cham}, pages = {38--59}, isbn = {978-3-319-31456-3}, keywords = {trustworthy-AI} }
Vo, P. D.; Ginsca, A.; Le Borgne, H.; and Popescu, A.
Effective training of convolutional networks using noisy Web images.
In 13th International Workshop on Content-Based Multimedia Indexing (CBMI), pages 1-6, 2015.
pdf
doi
bibtex
abstract
@inproceedings{vo2015effective_training_conv, author = {Vo, Phong D. and Ginsca, Alexandru and Le Borgne, Hervé and Popescu, Adrian}, booktitle = {13th International Workshop on Content-Based Multimedia Indexing (CBMI)}, title = {Effective training of convolutional networks using noisy Web images}, year = {2015}, pages = {1-6}, url_PDF = {https://hleborgne.github.io/files/vo2015cbmi.pdf}, doi = {10.1109/CBMI.2015.7153607}, abstract = {Deep convolutional networks have recently shown very interesting performance in a variety of computer vision tasks. Besides network architecture optimization, a key contribution to their success is the availability of training data. Network training is usually done with manually validated data but this approach has a significant cost and poses a scalability problem. Here we introduce an innovative pipeline that combines weakly-supervised image reranking methods and network fine-tuning to effectively train convolutional networks from noisy Web collections. We evaluate the proposed training method versus the conventional supervised training on cross-domain classification tasks. Results show that our method outperforms the conventional method in all of the three datasets. Our findings open opportunities for researchers and practitioners to use convolutional networks with inexpensive training cost.}, keywords = {frugal-learning} }
Deep convolutional networks have recently shown very interesting performance in a variety of computer vision tasks. Besides network architecture optimization, a key contribution to their success is the availability of training data. Network training is usually done with manually validated data but this approach has a significant cost and poses a scalability problem. Here we introduce an innovative pipeline that combines weakly-supervised image reranking methods and network fine-tuning to effectively train convolutional networks from noisy Web collections. We evaluate the proposed training method versus the conventional supervised training on cross-domain classification tasks. Results show that our method outperforms the conventional method in all of the three datasets. Our findings open opportunities for researchers and practitioners to use convolutional networks with inexpensive training cost.
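The rerank-then-retrain pipeline can be caricatured on synthetic labels. The noise level, confidence threshold, and linear model below are arbitrary stand-ins for the paper's web-scale, CNN-based setting; this is an illustration of the principle, not the authors' method.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Toy stand-in for noisy web supervision: a linearly separable concept
# (sign of the first feature) whose labels are 20% corrupted.
X = rng.normal(size=(300, 10))
true_y = (X[:, 0] > 0).astype(int)
noisy_y = true_y.copy()
flip = rng.choice(300, size=60, replace=False)
noisy_y[flip] = 1 - noisy_y[flip]

# Step 1: fit a weak model directly on the noisy labels.
weak = LogisticRegression().fit(X, noisy_y)

# Step 2: rerank -- keep only samples whose noisy label the weak model
# supports with reasonable confidence, then retrain on that cleaner subset.
conf = weak.predict_proba(X)[np.arange(300), noisy_y]
keep = conf > 0.6
clean = LogisticRegression().fit(X[keep], noisy_y[keep])

# Evaluate the retrained model against the uncorrupted labels.
acc_clean = (clean.predict(X) == true_y).mean()
```

The key observation is that a model trained on noisy labels is still confident mostly on correctly labeled samples, so filtering by its confidence yields a purer training set at no annotation cost.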
2014
(1)
Gadeski, E.; Fard, H. O.; and Le Borgne, H.
GPU deformable part model for object recognition.
Journal of Real-Time Image Processing, 1-13. 2014.
bibtex
@Article{gadeski2014gpu_dpm, author = {Gadeski, Etienne and Fard, Hamidreza Odabai and Le Borgne, Herv{\'e}}, title = {GPU deformable part model for object recognition}, year = {2014}, journal = {Journal of Real-Time Image Processing}, publisher = {Springer Berlin Heidelberg}, pages = {1-13}, keywords = {} }
2013
(1)
Znaidia, A.; Le Borgne, H.; and Hudelot, C.
Tag completion based on belief theory and neighbor voting.
In Proceedings of the 3rd ACM conference on International Conference on Multimedia Retrieval (ICMR), pages 49–56, 2013.
pdf
slides
bibtex
@inproceedings{znaidia2013icmr, title = {Tag completion based on belief theory and neighbor voting}, author = {Znaidia, Amel and Le Borgne, Herv{\'e} and Hudelot, C{\'e}line}, booktitle = {Proceedings of the 3rd ACM conference on International Conference on Multimedia Retrieval (ICMR)}, pages = {49--56}, year = {2013}, url_PDF = {https://hleborgne.github.io/files/znaidia2013icmr.pdf}, url_Slides= {https://hleborgne.github.io/files/znaidia2013icmr_slides.pdf}, keywords = {} }
2012
(3)
Le Borgne, H.; and Honnorat, N.
Fast shared boosting for large-scale concept detection.
Multimedia Tools and Applications, 60(2): 389-402. 2012.
doi
bibtex
@article{leborgne2012mtap, author = {Le Borgne, Herv{\'e} and Honnorat, Nicolas}, title = {Fast shared boosting for large-scale concept detection}, journal = {Multimedia Tools and Applications}, year = {2012}, pages = {389-402}, volume = {60}, number = {2}, issn = {1380-7501}, doi = {10.1007/s11042-010-0607-y}, keywords = {} }
Shabou, A.; and Le Borgne, H.
Locality-constrained and spatially regularized coding for scene categorization.
In IEEE Computer Vision and Pattern Recognition (CVPR), pages 3618–3625, 2012.
pdf
bibtex
abstract
@inproceedings{shabou2012cvpr, title = {Locality-constrained and spatially regularized coding for scene categorization}, author = {Shabou, Aymen and Le Borgne, Herv{\'e}}, booktitle = {IEEE Computer Vision and Pattern Recognition (CVPR)}, pages = {3618--3625}, year = {2012}, location = {Providence, Rhode Island, USA}, url_PDF = {https://www.researchgate.net/profile/Herve-Le-Borgne/publication/229076368_Locality-constrained_and_spatially_regularized_coding_for_scene_categorization/links/0fcfd502d2f375470a000000/Locality-constrained-and-spatially-regularized-coding-for-scene-categorization.pdf}, abstract = {Improving coding and spatial pooling for bag-of-words based feature design has gained a lot of attention in recent works addressing object recognition and scene classification. Regarding the coding step in particular, properties such as sparsity, locality and saliency have been investigated. The main contribution of this work consists in taking into account the local spatial context of an image into the usual coding strategies proposed in the state-of-the-art. For this purpose, given an image, dense local features are extracted and structured in a lattice. The latter is endowed with a neighborhood system and pairwise interactions. We propose a new objective function to encode local features, which preserves locality constraints both in the feature space and the spatial domain of the image. In addition, an appropriate efficient optimization algorithm is provided, inspired from the graph-cut framework. In conjunction with the maximum-pooling operation and the spatial pyramid matching, that reflects a global spatial layout, the proposed method improves the performances of several state-of-the-art coding schemes for scene classification on three publicly available benchmarks (UIUC 8-sport, Scene-15 and Caltech-101).}, keywords = {} }
Improving coding and spatial pooling for bag-of-words based feature design has gained a lot of attention in recent works addressing object recognition and scene classification. Regarding the coding step in particular, properties such as sparsity, locality and saliency have been investigated. The main contribution of this work consists in taking into account the local spatial context of an image into the usual coding strategies proposed in the state-of-the-art. For this purpose, given an image, dense local features are extracted and structured in a lattice. The latter is endowed with a neighborhood system and pairwise interactions. We propose a new objective function to encode local features, which preserves locality constraints both in the feature space and the spatial domain of the image. In addition, an appropriate efficient optimization algorithm is provided, inspired from the graph-cut framework. In conjunction with the maximum-pooling operation and the spatial pyramid matching, that reflects a global spatial layout, the proposed method improves the performances of several state-of-the-art coding schemes for scene classification on three publicly available benchmarks (UIUC 8-sport, Scene-15 and Caltech-101).
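The locality-constrained coding baseline that this paper regularizes spatially can be sketched in a few lines: each descriptor is reconstructed from its k nearest codewords with weights summing to one (LLC-style). The codebook and descriptor below are random stand-ins, and the paper's own contribution, the graph-cut spatial regularization across the lattice of descriptors, is not shown.

```python
import numpy as np

rng = np.random.default_rng(0)

def llc_code(x, codebook, k=5, eps=1e-4):
    """Locality-constrained coding of one descriptor (LLC-style sketch):
    reconstruct x from its k nearest codewords, weights summing to 1."""
    d = np.linalg.norm(codebook - x, axis=1)
    idx = np.argsort(d)[:k]               # indices of k nearest codewords
    z = codebook[idx] - x                 # shift codewords to the descriptor
    C = z @ z.T + eps * np.eye(k)         # regularized local covariance
    w = np.linalg.solve(C, np.ones(k))    # solve for the local weights
    w /= w.sum()                          # enforce the sum-to-one constraint
    code = np.zeros(len(codebook))        # sparse code over the full codebook
    code[idx] = w
    return code

codebook = rng.normal(size=(64, 16))      # 64 codewords, 16-dim descriptors
x = rng.normal(size=16)
c = llc_code(x, codebook)
```

In a full pipeline these per-descriptor codes would then be max-pooled over spatial pyramid cells to form the image signature.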
Znaidia, A.; Shabou, A.; Le Borgne, H.; Hudelot, C.; and Paragios, N.
Bag-of-multimedia-words for image classification.
In 21st International Conference on Pattern Recognition (ICPR), pages 1509–1512, 2012.
pdf
bibtex
abstract
@inproceedings{znaidia2012icpr, title = {Bag-of-multimedia-words for image classification}, author = {Znaidia, Amel and Shabou, Aymen and Le Borgne, Herv{\'e} and Hudelot, C{\'e}line and Paragios, Nikos}, booktitle = {21st International Conference on Pattern Recognition (ICPR)}, pages = {1509--1512}, year = {2012}, url_PDF = {https://hleborgne.github.io/files/znaidia2012icpr.pdf}, abstract = {We introduce the bag-of-multimedia-words model that tightly combines the heterogeneous information coming from the text and the pixel-based information of a multimedia document. The proposed multimedia feature generation process is generic for any multi-modality and aims at enriching a multimedia document description with compact and discriminative signatures well appropriate to linear classifiers. It is evaluated on the Pascal VOC 2007 classification challenge, outperforming the state-of-the-art bag-of-visual-words or bag-of-tag-words based classification approaches.}, keywords = {vision-language} }
We introduce the bag-of-multimedia-words model that tightly combines the heterogeneous information coming from the text and the pixel-based information of a multimedia document. The proposed multimedia feature generation process is generic for any multi-modality and aims at enriching a multimedia document description with compact and discriminative signatures well appropriate to linear classifiers. It is evaluated on the Pascal VOC 2007 classification challenge, outperforming the state-of-the-art bag-of-visual-words or bag-of-tag-words based classification approaches.
2007
(1)
Le Borgne, H.; Guérin-Dugué, A.; and O'Connor, N. E.
Learning Midlevel Image Features for Natural Scene and Texture Classification.
IEEE Transactions on Circuits and Systems for Video Technology, 17(3): 286-297. June 2007.
pdf
doi
bibtex
@article{leborgne2007csvt, author = {Le Borgne, Herv{\'e} and Gu{\'e}rin-Dugu{\'e}, Anne and O'Connor, Noel E.}, journal = {IEEE Transactions on Circuits and Systems for Video Technology}, pages = {286-297}, title = {Learning Midlevel Image Features for Natural Scene and Texture Classification}, volume = {17}, number = {3}, month = {6}, year = {2007}, doi = {10.1109/TCSVT.2007.890635}, url_PDF = {https://hleborgne.github.io/files/leborgne2007mid_level_image_texture.pdf}, keywords = {ica-image} }
2004
(1)
Le Borgne, H.; Guérin-Dugué, A.; and Antoniadis, A.
Representation of images for classification with independent features.
Pattern Recognition Letters, 25(2): 141–154. 2004.
pdf
doi
bibtex
@article{leborgne2004prl, title = {Representation of images for classification with independent features}, author = {Le Borgne, Herv{\'e} and Gu{\'e}rin-Dugu{\'e}, Anne and Antoniadis, Anestis}, journal = {Pattern Recognition Letters}, volume = {25}, number = {2}, pages = {141--154}, year = {2004}, publisher = {Elsevier}, url_PDF = {https://hleborgne.github.io/files/hlb2004ica_image.pdf}, doi = {10.1016/j.patrec.2003.09.011}, keywords = {ica-image} }
2001
(1)
Le Borgne, H.; and Guérin-Dugué, A.
Sparse-dispersed coding and images discrimination with independent component analysis.
In International Conference on ICA and BSS, 2001.
pdf
bibtex
abstract
@inproceedings{leborgne2001ica, title = {Sparse-dispersed coding and images discrimination with independent component analysis}, author = {Le Borgne, Herv{\'e} and Gu{\'e}rin-Dugu{\'e}, Anne}, booktitle = {International Conference on ICA and BSS}, location = {San Diego, California, USA}, year = {2001}, url_PDF = {https://hleborgne.github.io/files/hlb2001ica.pdf}, abstract = {Independent Component Analysis applied to a set of natural images provides band-pass-oriented filters, similar to simple cells of the primary visual cortex. We applied two types of pre-processing to the images, a low-pass and a whitening one in a multiresolution grid, and examine the properties of the detectors extracted by ICA. These detectors composed a new basis function set in which images are encoded. On one hand, the properties (sparseness and dispersal) of the resulting coding are compared for both pre-processing strategies. On the other hand, this new coding by independent features is used for discriminating natural images, that is a very challenging domain in image analysis and retrieval. We show that a criterion based on the dispersal property enhances the efficiency of the discrimination by selecting the most dispersed detectors coding the image database. This behaviour is well enhanced with whitened images.}, keywords = {ica-image} }
Independent Component Analysis applied to a set of natural images provides band-pass-oriented filters, similar to simple cells of the primary visual cortex. We applied two types of pre-processing to the images, a low-pass and a whitening one in a multiresolution grid, and examine the properties of the detectors extracted by ICA. These detectors composed a new basis function set in which images are encoded. On one hand, the properties (sparseness and dispersal) of the resulting coding are compared for both pre-processing strategies. On the other hand, this new coding by independent features is used for discriminating natural images, that is a very challenging domain in image analysis and retrieval. We show that a criterion based on the dispersal property enhances the efficiency of the discrimination by selecting the most dispersed detectors coding the image database. This behaviour is well enhanced with whitened images.
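The basic experiment, learning an ICA basis from image patches, is easy to reproduce in spirit with scikit-learn's FastICA. Random sparse data stands in here for real whitened photograph patches, on which the learned filters would resemble the oriented band-pass detectors described above; sizes and parameters are illustrative only.

```python
import numpy as np
from sklearn.decomposition import FastICA

rng = np.random.default_rng(0)

# Stand-in for natural-image data: 1000 flattened 8x8 patches with
# heavy-tailed (sparse) coefficients, mimicking whitened image statistics.
patches = rng.laplace(size=(1000, 64))

# ICA learns a basis in which the per-patch coefficients are as
# statistically independent (and typically as sparse) as possible.
ica = FastICA(n_components=16, whiten="unit-variance",
              random_state=0, max_iter=500)
codes = ica.fit_transform(patches)   # independent coefficients per patch
filters = ica.components_            # one 64-dim filter per component
```

Per-patch sparsity of `codes` (e.g. kurtosis of each component) is the kind of dispersal criterion the paper uses to select discriminative detectors.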