Personalizing MLLMs via Reinforced Multimodal Reference Game

Key Contributions

Novel Reinforced Reference Game (RRG) framework that trains MLLMs to generate discriminative, invariant concept descriptions
MLLM plays both speaker (describing) and listener (identifying) roles in a contrastive game setting
Verifiable contrastive reward over hard positives (dissimilar views of the same concept) and hard negatives (visually similar but different concepts)
State-of-the-art results across multiple tasks on three personalization benchmarks
Generalizes to unseen domains and outperforms personalization-specific RL frameworks

Abstract

Personalizing Multimodal Large Language Models (MLLMs) aims to recognize users' unique concepts from visual data and provide personalized responses. Although prior work has shown the benefit of concept descriptions and reasoning for this task, MLLM descriptions often include information, such as state and context, that does not help and may in fact hinder the unique identification of the target concept among other visually similar items.

Effective descriptions of personal concepts should instead be accurate, discriminative, and free of distracting details. To achieve such descriptions, we introduce Reinforced Reference Game (RRG), a learning framework that promotes discriminative descriptions through a novel reinforced multimodal reference game.

The MLLM plays both the speaker and the listener in a contrastive game whose goal is to communicate discriminative information about a target concept. Our approach formulates a verifiable contrastive reward over hard positives (dissimilar views of the same concept) and hard negatives (visually similar but different concepts).

Empirically, RRG achieves state-of-the-art results across multiple tasks on three personalization benchmarks. RRG generalizes to unseen domains and outperforms existing methods based on concept descriptions as well as personalization-specific RL frameworks.

Method Overview

The Reinforced Reference Game

We frame personalization as a reference game. The MLLM acts as both the speaker, which generates a description of a target concept, and the listener, which identifies the target concept from a set of candidates using that description. This contrastive game encourages the speaker to produce descriptions that are distinctive and invariant, going beyond standard dense captioning toward attributes that uniquely identify the personal concept.
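One round of such a game can be sketched with toy stand-ins for the model. Everything below is a hypothetical illustration, not the RRG implementation: images are reduced to sets of attribute words, the two speakers and the word-overlap listener are hand-written stubs (in RRG a single MLLM plays both roles), and names such as `Image`, `play_round`, and `CONTEXT_WORDS` are assumptions introduced for the example.

```python
from dataclasses import dataclass
from typing import Callable, FrozenSet, Sequence


@dataclass(frozen=True)
class Image:
    image_id: str
    concept_id: str
    attributes: FrozenSet[str]  # toy stand-in for visual content


# Hypothetical list of state/context words that do not identify a concept.
CONTEXT_WORDS = frozenset({"kitchen", "desk", "full", "empty"})


def dense_speaker(target: Image) -> str:
    """Describe everything visible, including state and context."""
    return " ".join(sorted(target.attributes))


def discriminative_speaker(target: Image) -> str:
    """Describe only attributes that are not state or context."""
    return " ".join(sorted(target.attributes - CONTEXT_WORDS))


def overlap_listener(description: str, candidates: Sequence[Image]) -> int:
    """Pick the candidate sharing the most words with the description."""
    words = set(description.split())
    scores = [len(words & c.attributes) for c in candidates]
    return scores.index(max(scores))


def play_round(
    speaker: Callable[[Image], str],
    listener: Callable[[str, Sequence[Image]], int],
    target: Image,
    candidates: Sequence[Image],
) -> bool:
    """Return True if the listener selects a candidate of the target concept."""
    description = speaker(target)
    choice = listener(description, candidates)
    return candidates[choice].concept_id == target.concept_id


# Reference image of the personal concept.
target = Image(
    "mug_ref", "my_mug",
    frozenset({"blue", "chipped", "handle", "kitchen", "full"}),
)
candidates = [
    # Similar-looking but different concept, in the same context as the reference.
    Image("other_mug", "other_mug",
          frozenset({"blue", "handle", "kitchen", "full"})),
    # Same concept in a different view, state, and context.
    Image("mug_view2", "my_mug",
          frozenset({"blue", "chipped", "handle", "desk", "empty"})),
]

dense_ok = play_round(dense_speaker, overlap_listener, target, candidates)
discriminative_ok = play_round(
    discriminative_speaker, overlap_listener, target, candidates
)
print(dense_ok, discriminative_ok)
```

In this toy setup the dense description shares more words with the distractor than with the second view of the target, so the listener is misled, while the description stripped of context words lets it pick the right candidate.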

RRG: Reinforced Reference Game Framework

Our approach formulates a verifiable contrastive reward over two challenging settings: hard positives (dissimilar views of the same concept that must be correctly associated) and hard negatives (visually similar but distinct concepts that must be differentiated). This reward signal drives the MLLM to learn discriminative, context-free concept descriptions through reinforcement learning.
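The exact form of the reward is not given in this overview. As a minimal sketch, assume the listener makes a yes/no match decision per image and the reward is the balanced accuracy of those decisions over hard positives and hard negatives. The function `contrastive_reward`, the `matches` callback, and the toy attribute table are all hypothetical names introduced for illustration.

```python
from typing import Callable, Dict, FrozenSet, Sequence

# Hypothetical listener interface: True if the description matches the image.
Matcher = Callable[[str, str], bool]


def contrastive_reward(
    description: str,
    hard_positives: Sequence[str],
    hard_negatives: Sequence[str],
    matches: Matcher,
) -> float:
    """Balanced accuracy of listener decisions (an assumed reward form).

    A description scores 1.0 only if it is accepted for every hard positive
    and rejected for every hard negative.
    """
    if not hard_positives or not hard_negatives:
        raise ValueError("need at least one hard positive and one hard negative")
    pos_hits = sum(matches(description, img) for img in hard_positives)
    neg_hits = sum(not matches(description, img) for img in hard_negatives)
    return 0.5 * (pos_hits / len(hard_positives) + neg_hits / len(hard_negatives))


# Toy listener: an image "matches" if it shows every word in the description.
TOY_ATTRIBUTES: Dict[str, FrozenSet[str]] = {
    "mug_desk": frozenset({"blue", "chipped", "handle", "desk"}),
    "mug_shelf": frozenset({"blue", "chipped", "handle", "shelf"}),
    "other_mug": frozenset({"blue", "handle", "kitchen"}),
}


def toy_matches(description: str, image_id: str) -> bool:
    return set(description.split()) <= TOY_ATTRIBUTES[image_id]


POSITIVES = ["mug_desk", "mug_shelf"]  # dissimilar views of the same concept
NEGATIVES = ["other_mug"]              # visually similar, different concept

for text in ("blue chipped handle", "blue handle desk", "blue handle"):
    print(text, contrastive_reward(text, POSITIVES, NEGATIVES, toy_matches))
```

Under this assumed reward, a description that leaks context ("desk") loses credit on the hard positive taken elsewhere, and a description that is too generic ("blue handle") loses credit on the hard negative. Only the invariant, distinctive description receives the full reward, and because every decision is checked against known concept labels the signal is verifiable.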

Method

Figure 1: Overview of our Reinforced Reference Game (RRG) framework. The MLLM plays both the speaker (generating discriminative descriptions) and the listener (identifying the target concept among candidates), trained with a verifiable contrastive reward over hard positives and hard negatives.

Qualitative Results

Figure 2: Qualitative results showing RRG's ability to generate accurate, discriminative descriptions for personalized concepts and correctly identify them across diverse visual scenarios.

Citation

@inproceedings{das2026rrg,
  title={Personalizing {MLLMs} via Reinforced Multimodal Reference Game},
  author={Das, Deepayan and Mancini, Massimiliano and Ricci, Elisa},
  booktitle={European Conference on Computer Vision (ECCV)},
  year={2026}
}