TextME: Bridging Unseen Modalities
Through Text Descriptions

ICML 2026

Soyeon Hong¹, Jinchan Kim¹, Jaegook You¹, Seungtaek Choi², Suha Kwak³, Hyunsouk Cho⁴
¹Department of Artificial Intelligence, Ajou University · ²Hankuk University of Foreign Studies · ³POSTECH · ⁴Department of Software, Ajou University
TextME Overview

Overview of the TextME pipeline. (a) Offset computation estimates modality-specific centroids from unpaired samples. (b) During training, projection networks are learned by aligning centered text embeddings with a unified LLM anchor space, requiring only text descriptions. (c) At inference, centering modal embeddings with the precomputed offset enables zero-shot cross-modal transfer without paired data.

Abstract

Expanding multimodal representations to novel modalities is constrained by reliance on large-scale paired datasets (e.g., text–image, text–audio, text–3D, text–molecule), which are costly and often infeasible in domains requiring expert annotation, such as medical imaging and molecular analysis. We introduce TextME, to the best of our knowledge the first text-only modality expansion framework, which projects diverse modalities into an LLM embedding space that serves as a unified anchor. Our approach exploits the geometric structure of pretrained contrastive encoders to enable zero-shot cross-modal transfer using only text descriptions, without paired supervision. We empirically validate that consistent modality gaps exist across image, video, audio, 3D, X-ray, and molecular domains, and demonstrate that text-only training can preserve a substantial fraction of the pretrained encoders' performance. We further show that our framework enables emergent cross-modal retrieval between modality pairs that were never explicitly aligned during training (e.g., audio-to-image, 3D-to-image). These results establish text-only training as a practical alternative to paired supervision for modality expansion.

Key Highlights

Avg. Performance Preservation Ratio: 74.5%
Reduction in Data Requirements: >95%
Unpaired Samples for Offset Estimation: ~5K

Text-Only Training

Train projection networks using only ~100K text descriptions per modality — no paired cross-modal data needed.

6 Diverse Modalities

Image · Video · Audio · 3D · X-ray · Molecule

Emergent Cross-Modal Retrieval

Enables retrieval between unseen modality pairs (Audio→3D, Molecule→Image) without any paired training.

Method

TextME operates in three stages, leveraging the consistent modality gap, a systematic offset between text and modal embeddings in pretrained contrastive encoders:

1. Offset Computation

Estimate modality-specific centroids (μ_text and μ_modal) from a small set of unpaired samples (~5K). Centering with these offsets creates an interchangeable space where text and modal embeddings become functionally equivalent.
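
A minimal NumPy sketch of this step, assuming unit-normalized encoder outputs; the function names and the re-normalization after centering are our own illustration, not the paper's released code:

import numpy as np

def compute_offsets(text_embs: np.ndarray, modal_embs: np.ndarray):
    """Estimate modality-specific centroids from small unpaired sample sets.

    text_embs:  (N_t, d) unit-normalized text embeddings from the pretrained
                contrastive encoder's text tower.
    modal_embs: (N_m, d) unit-normalized embeddings of ~5K unpaired modal
                samples (images, audio clips, point clouds, ...).
    """
    mu_text = text_embs.mean(axis=0)    # centroid of the text side
    mu_modal = modal_embs.mean(axis=0)  # centroid of the modal side
    return mu_text, mu_modal

def center(embs: np.ndarray, mu: np.ndarray) -> np.ndarray:
    # Subtracting the centroid removes the modality gap; re-normalizing
    # (an implementation choice on our part) keeps embeddings on the sphere.
    centered = embs - mu
    return centered / np.linalg.norm(centered, axis=-1, keepdims=True)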

2. Text-to-Anchor Alignment

Train lightweight two-layer MLP projection networks using only text descriptions, mapping centered text embeddings into a shared LLM embedding anchor space via contrastive loss with hard negative mining.
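
A PyTorch sketch of this alignment step, under our own assumptions about hidden size, temperature, and the exact hard-negative scheme (the released code may differ); proj_text are projected centered text embeddings and anchor_llm are the frozen LLM embeddings of the same descriptions:

import torch
import torch.nn as nn
import torch.nn.functional as F

class Projector(nn.Module):
    """Lightweight two-layer MLP from the encoder's text space (d_enc)
    into the LLM anchor space (d_llm)."""
    def __init__(self, d_enc: int, d_llm: int, d_hidden: int = 1024):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(d_enc, d_hidden),
            nn.GELU(),
            nn.Linear(d_hidden, d_llm),
        )

    def forward(self, x):
        return F.normalize(self.net(x), dim=-1)

def alignment_loss(proj_text, anchor_llm, temperature=0.07, num_hard=16):
    """InfoNCE-style loss: each projected text embedding should match the
    LLM embedding of its own description; only the hardest in-batch
    non-matches are kept as negatives (hard negative mining)."""
    sims = proj_text @ anchor_llm.T / temperature            # (B, B) logits
    labels = torch.arange(sims.size(0), device=sims.device)
    diag = torch.eye(sims.size(0), dtype=torch.bool, device=sims.device)
    # Keep the num_hard most similar negatives per row, mask out the rest.
    neg_sims = sims.masked_fill(diag, float("-inf"))
    hard_idx = neg_sims.topk(min(num_hard, sims.size(0) - 1), dim=1).indices
    keep = torch.full_like(sims, float("-inf"))
    keep.scatter_(1, hard_idx, 0.0)
    keep.scatter_(1, labels.unsqueeze(1), 0.0)               # always keep the positive
    return F.cross_entropy(sims + keep, labels)

Training then iterates over batches of text descriptions only: embed them with the frozen contrastive text encoder, center with μ_text, project, embed the same descriptions with the frozen LLM, and minimize the loss.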

3. Zero-Shot Transfer

At inference, center modal embeddings with the precomputed offset and pass through the text-trained projector. This enables direct cross-modal retrieval and classification without ever seeing modal samples during training.
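
A matching inference sketch, reusing the pieces above; scoring by cosine similarity against LLM-embedded class names or captions is our assumption of a standard zero-shot setup:

import torch
import torch.nn.functional as F

@torch.no_grad()
def zero_shot_scores(modal_embs, mu_modal, projector, anchor_embs):
    """Score modal queries against text/class anchors in the LLM space.

    modal_embs:  (N, d_enc) embeddings from the pretrained modal encoder
    mu_modal:    (d_enc,)   centroid estimated in the offset step
    projector:   the two-layer MLP trained on text descriptions only
    anchor_embs: (C, d_llm) LLM embeddings of class names or captions
    """
    centered = F.normalize(modal_embs - mu_modal, dim=-1)   # same centering as text
    queries = projector(centered)                            # (N, d_llm), unit norm
    return queries @ F.normalize(anchor_embs, dim=-1).T      # cosine similarities

# Classification or retrieval is then an argmax / top-k over the scores:
# preds = zero_shot_scores(x, mu_modal, projector, anchors).argmax(dim=-1)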

Results

Zero-Shot Performance Across All Benchmarks

PPR (Performance Preservation Ratio) measures the percentage of pretrained encoder performance retained by each method. TextME requires zero paired data and zero labeled target data.

| Method | Image (COCO) | Image (Flkr.) | Video (MSR.) | Video (DiDe.) | Audio (ACaps) | Audio (Clo.) | Mol. (Drug.) | Audio Cls. (ASet) | Audio Cls. (ESC) | 3D Cls. (MN40) | 3D Cls. (Scan.) | X-ray (RSNA) | Emergent (A→I) | Emergent (3D→I) | Data Req. |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Pretrained | 48.29 | 77.70 | 37.00 | 51.06 | 31.27 | 22.47 | 16.90 | 79.19 | 9.32 | 85.20 | 67.75 | 42.21 | 52.64 | × | × |
| LanguageBind | 44.53 | 73.42 | 45.30 | 65.22 | 36.85 | 12.42 | 11.32 | × | 18.33 | 94.00 | × | × | × | × | 10M pairs |
| Ex-MCR | 40.24 | 71.89 | × | × | × | 19.07 | 7.01 | × | 6.67 | 71.20 | 66.53 | 40.31 | × | 1.57 | 1M pairs* |
| COX† | 0.02 | 0.20 | 5.10 | 0.00 | 0.10 | 0.08 | 0.11 | 7.63 | 1.26 | 2.00 | 4.05 | 2.84 | 22.53 | 0.02 | 10K labels |
| TextME (Ours) | 28.63 | 51.66 | 26.40 | 45.82 | 24.10 | 15.35 | 7.81 | 34.75 | 5.80 | 77.25 | 70.86 | 42.15 | 46.59 | 1.06 | 100K text |
| PPR (%) | 59.3 | 66.5 | 71.4 | 89.7 | 77.1 | 68.3 | 46.2 | 43.9 | 62.2 | 90.7 | 104.6 | 99.9 | 88.5 | × | |

* Indirect: uses overlapping modality from existing MCR spaces. † Our reproduction. Bold indicates best among unpaired methods.
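
Concretely, the PPR row follows from the definition above: each TextME score divided by the corresponding pretrained score (s denotes a benchmark score, our notation). A worked instance on the COCO column:

\[
\mathrm{PPR} = \frac{s_{\text{TextME}}}{s_{\text{pretrained}}} \times 100\%,
\qquad \text{e.g. COCO: } \frac{28.63}{48.29} \times 100\% \approx 59.3\%.
\]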

Emergent Cross-Modal Retrieval

TextME enables retrieval between modality pairs never seen during training. Audio queries retrieve semantically related 3D objects, and molecular structures retrieve contextually appropriate images — demonstrating that text-anchored alignment creates semantic bridges across arbitrary modalities.

Emergent cross-modal retrieval: Audio to 3D

Audio→3D retrieval without paired supervision. Audio queries retrieve semantically related 3D objects. TextME retrieves coherent results while the Naive baseline fails entirely.

Embedding Visualization

t-SNE visualizations show how TextME progressively aligns modal embeddings with text embeddings in the LLM anchor space across training stages.
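
A minimal sketch of how such a plot can be reproduced with scikit-learn, assuming the projected text and modal embeddings are available as NumPy arrays (perplexity and plotting choices are ours, not the paper's):

import numpy as np
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

def plot_alignment(text_proj: np.ndarray, modal_proj: np.ndarray, title: str):
    """Jointly embed projected text and modal embeddings with t-SNE and
    color them by modality to visualize how well the two clouds overlap."""
    joint = np.concatenate([text_proj, modal_proj], axis=0)
    coords = TSNE(n_components=2, perplexity=30, init="pca").fit_transform(joint)
    n = len(text_proj)
    plt.scatter(coords[:n, 0], coords[:n, 1], s=5, label="text")
    plt.scatter(coords[n:, 0], coords[n:, 1], s=5, label="modal")
    plt.legend(); plt.title(title); plt.show()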

BibTeX

@article{hong2026textme,
  title={TextME: Bridging Unseen Modalities Through Text Descriptions},
  author={Hong, Soyeon and Kim, Jinchan and You, Jaegook and Choi, Seungtaek and Kwak, Suha and Cho, Hyunsouk},
  journal={arXiv preprint arXiv:2602.03098},
  year={2026}
}