Training-Free Personalization via Retrieval and Reasoning on Fingerprints

University of Trento · Fondazione Bruno Kessler
ICCV 2025

Key Contributions

🎯 Main Contributions

  • First work to explore training-free personalization for Vision Language Models
  • Novel Retrieval and Reasoning for Personalization (R2P) method using concept fingerprints
  • Introduction of the PerVA dataset (Personal Concepts with Visual Ambiguity)
  • Cross-modal verification and pairwise matching to reduce hallucinations
  • State-of-the-art performance across multiple benchmarks without any training

Abstract

Vision Language Models (VLMs) have led to major improvements in multimodal reasoning, yet they still struggle to understand user-specific concepts. Existing personalization methods address this limitation but rely heavily on training procedures that can be costly or burdensome for individual users.

We depart from existing work and, for the first time, explore the training-free setting in the context of personalization. We propose a novel method, Retrieval and Reasoning for Personalization (R2P), which leverages the internal knowledge of VLMs. First, we use a VLM to extract the concept fingerprint, i.e., the key attributes that uniquely define the concept within its semantic class.
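A minimal sketch of this fingerprint-extraction phase is shown below. The `vlm_answer` helper, the attribute slots, and the prompt wording are hypothetical placeholders for whatever VLM and prompts are actually used; the paper's exact prompts are not reproduced here.

from typing import Callable, Dict, Sequence

def extract_fingerprint(
    image_path: str,
    concept_class: str,
    vlm_answer: Callable[[str, str], str],  # (image, prompt) -> answer; hypothetical VLM call
    attribute_slots: Sequence[str] = ("color", "shape", "texture", "markings"),
) -> Dict[str, str]:
    """Query the VLM once per attribute slot to build the concept fingerprint:
    the key attributes distinguishing this instance within its semantic class."""
    fingerprint = {}
    for slot in attribute_slots:
        prompt = (
            f"Describe the {slot} of the {concept_class} in the image "
            f"with a short phrase that distinguishes it from other {concept_class}s."
        )
        fingerprint[slot] = vlm_answer(image_path, prompt).strip()
    return fingerprint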

When a query arrives, the most similar fingerprints are retrieved and scored via chain-of-thought reasoning. To reduce the risk of hallucinations, the scores are validated through cross-modal verification at the attribute level: in case of a discrepancy between the scores, R2P refines the concept association via pairwise multimodal matching, where the retrieved fingerprints and their images are directly compared with the query.
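A minimal sketch of this scoring-and-verification loop, under assumed interfaces: `cot_score`, `visual_score`, and `pairwise_match` are hypothetical stand-ins for the underlying VLM calls, and the agreement threshold is illustrative, not a value from the paper.

from typing import Callable, Dict, List, Tuple

Fingerprint = Dict[str, str]
Candidate = Tuple[str, Fingerprint, str]  # (concept name, fingerprint, reference image path)

def identify_concept(
    query_image: str,
    candidates: List[Candidate],
    cot_score: Callable[[str, Fingerprint], float],     # chain-of-thought text-side score
    visual_score: Callable[[str, Fingerprint], float],  # attribute-level visual check
    pairwise_match: Callable[[str, Candidate, Candidate], str],
    agreement_threshold: float = 0.2,  # illustrative, not from the paper
) -> str:
    """Score retrieved candidates, verify across modalities, and fall back to
    pairwise multimodal matching when the two scores disagree."""
    scored = sorted(
        ((name, fp, img, cot_score(query_image, fp), visual_score(query_image, fp))
         for name, fp, img in candidates),
        key=lambda c: c[3],
        reverse=True,
    )
    name, fp, img, text_s, visual_s = scored[0]

    # Cross-modal verification: accept only when text and visual scores agree.
    if len(scored) == 1 or abs(text_s - visual_s) <= agreement_threshold:
        return name

    # Discrepancy detected: refine by comparing the top candidates' fingerprints
    # and images directly against the query.
    runner_up = scored[1]
    return pairwise_match(query_image, (name, fp, img), runner_up[:3])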

We validate R2P on two publicly available benchmarks and a newly introduced dataset, Personal Concepts with Visual Ambiguity (PerVA), a concept-identification benchmark that highlights the challenges of visual ambiguity. R2P consistently outperforms state-of-the-art approaches on various downstream tasks across all benchmarks.

Method Overview

R2P: Retrieval and Reasoning for Personalization

Our approach operates in two main phases: (1) personal database creation, where we extract concept fingerprints using VLM-based VQA, and (2) concept inference with retrieval-reasoning, which uses attribute-focused chain-of-thought reasoning and cross-modal verification to accurately identify personalized concepts in query images.
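The retrieval step that connects the two phases can be as simple as a nearest-neighbor search over fingerprint embeddings. The numpy sketch below assumes embeddings from some off-the-shelf encoder and cosine similarity; both are our assumptions, not a specification of the paper's pipeline.

import numpy as np

def retrieve_top_k(query_emb: np.ndarray, db_embs: np.ndarray, k: int = 3) -> np.ndarray:
    """Return indices of the k database fingerprints most similar to the query."""
    q = query_emb / np.linalg.norm(query_emb)
    db = db_embs / np.linalg.norm(db_embs, axis=1, keepdims=True)
    sims = db @ q                 # cosine similarity against every stored fingerprint
    return np.argsort(-sims)[:k]  # indices of the top-k nearest candidates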

Figure 1: Overview of our R2P method showing the two-phase approach: personal database creation and concept inference with retrieval-reasoning.

PerVA Dataset

🎯 Personal Concepts with Visual Ambiguity (PerVA)

We introduce PerVA, a new benchmark specifically designed to evaluate personalization methods on visually ambiguous concepts. The dataset highlights the challenge of distinguishing between similar-looking personal items and provides an evaluation framework for training-free personalization approaches.

Figure 2: Examples from our PerVA dataset showing personal concepts with visual ambiguity across different categories.

Qualitative Results

Figure 3: Qualitative results showing our method's ability to correctly identify personalized concepts through retrieval and reasoning, with examples of concept inference and cross-modal verification.

Citation

@article{das2025training,
  title={Training-Free Personalization via Retrieval and Reasoning on Fingerprints},
  author={Das, Deepayan and Talon, Davide and Wang, Yiming and Mancini, Massimiliano and Ricci, Elisa},
  journal={arXiv preprint arXiv:2503.18623},
  year={2025}
}