TextME: Bridging Unseen Modalities
Through Text Descriptions

ICML 2026

Soyeon Hong¹, Jinchan Kim¹, Jaegook You¹, Seungtaek Choi², Suha Kwak³, Hyunsouk Cho⁴
¹Department of Artificial Intelligence, Ajou University · ²Hankuk University of Foreign Studies · ³POSTECH · ⁴Department of Software, Ajou University
TextME Overview

Overview of the TextME pipeline. (a) Offset computation estimates modality-specific centroids from unpaired samples. (b) During training, projection networks are learned by aligning centered text embeddings with a unified LLM anchor space, requiring only text descriptions. (c) At inference, centering modal embeddings with the precomputed offset enables zero-shot cross-modal transfer without paired data.

Abstract

Expanding multimodal representations to novel modalities is constrained by reliance on large-scale paired datasets (e.g., text–image, text–audio, text–3D, text–molecule), which are costly and often infeasible in domains requiring expert annotation, such as medical imaging and molecular analysis. We introduce TextME, to the best of our knowledge the first text-only modality expansion framework, which projects diverse modalities into an LLM embedding space that serves as a unified anchor. Our approach exploits the geometric structure of pretrained contrastive encoders to enable zero-shot cross-modal transfer using only text descriptions, without paired supervision. We empirically validate that consistent modality gaps exist across image, video, audio, 3D, X-ray, and molecular domains, and demonstrate that text-only training can preserve a substantial fraction of the pretrained encoders' performance. We further show that our framework enables emergent cross-modal retrieval between modality pairs that were never explicitly aligned during training (e.g., audio-to-image, 3D-to-image). These results establish text-only training as a practical alternative to paired supervision for modality expansion.

Key Highlights

Avg. Performance Preservation Ratio: 74.5%
Reduction in Data Requirements: >95%
Unpaired Samples for Offset Estimation: ~5K

Text-Only Training

Train projection networks using only ~100K text descriptions per modality — no paired cross-modal data needed.

6 Diverse Modalities

Image · Video · Audio · 3D · X-ray · Molecule

Emergent Cross-Modal Retrieval

Enables retrieval between unseen modality pairs (Audio→3D, Molecule→Image) without any paired training.

Method

TextME operates in three stages, leveraging the consistent modality gap, a systematic offset between text and modal embeddings in pretrained contrastive encoders:

1. Offset Computation

Estimate modality-specific centroids (μ_text and μ_modal) from a small set of unpaired samples (~5K). Centering with these offsets creates an interchangeable space where text and modal embeddings become functionally equivalent.
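
A minimal NumPy sketch of this step, assuming unit-normalized encoder outputs; the function names and the re-normalization after centering are our own illustration, not the paper's released code:

import numpy as np

def compute_offsets(text_embs: np.ndarray, modal_embs: np.ndarray):
    """Estimate modality-specific centroids from small unpaired sample sets.

    text_embs:  (N_t, d) unit-normalized text embeddings from the pretrained
                contrastive encoder's text tower.
    modal_embs: (N_m, d) unit-normalized embeddings of ~5K unpaired modal
                samples (images, audio clips, point clouds, ...).
    """
    mu_text = text_embs.mean(axis=0)    # centroid of the text side
    mu_modal = modal_embs.mean(axis=0)  # centroid of the modal side
    return mu_text, mu_modal

def center(embs: np.ndarray, mu: np.ndarray) -> np.ndarray:
    # Subtracting the centroid removes the modality gap; re-normalizing
    # (an implementation choice on our part) keeps embeddings on the sphere.
    centered = embs - mu
    return centered / np.linalg.norm(centered, axis=-1, keepdims=True)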

2. Text-to-Anchor Alignment

Train lightweight two-layer MLP projection networks using only text descriptions, mapping centered text embeddings into a shared LLM embedding anchor space via contrastive loss with hard negative mining.
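
A PyTorch sketch of this alignment step, under our own assumptions about hidden size, temperature, and the exact hard-negative scheme (the released code may differ); proj_text are projected centered text embeddings and anchor_llm are the frozen LLM embeddings of the same descriptions:

import torch
import torch.nn as nn
import torch.nn.functional as F

class Projector(nn.Module):
    """Lightweight two-layer MLP from the encoder's text space (d_enc)
    into the LLM anchor space (d_llm)."""
    def __init__(self, d_enc: int, d_llm: int, d_hidden: int = 1024):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(d_enc, d_hidden),
            nn.GELU(),
            nn.Linear(d_hidden, d_llm),
        )

    def forward(self, x):
        return F.normalize(self.net(x), dim=-1)

def alignment_loss(proj_text, anchor_llm, temperature=0.07, num_hard=16):
    """InfoNCE-style loss: each projected text embedding should match the
    LLM embedding of its own description; only the hardest in-batch
    non-matches are kept as negatives (hard negative mining)."""
    sims = proj_text @ anchor_llm.T / temperature            # (B, B) logits
    labels = torch.arange(sims.size(0), device=sims.device)
    diag = torch.eye(sims.size(0), dtype=torch.bool, device=sims.device)
    # Keep the num_hard most similar negatives per row, mask out the rest.
    neg_sims = sims.masked_fill(diag, float("-inf"))
    hard_idx = neg_sims.topk(min(num_hard, sims.size(0) - 1), dim=1).indices
    keep = torch.full_like(sims, float("-inf"))
    keep.scatter_(1, hard_idx, 0.0)
    keep.scatter_(1, labels.unsqueeze(1), 0.0)               # always keep the positive
    return F.cross_entropy(sims + keep, labels)

Training then iterates over batches of text descriptions only: embed them with the frozen contrastive text encoder, center with μ_text, project, embed the same descriptions with the frozen LLM, and minimize the loss.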

3. Zero-Shot Transfer

At inference, center modal embeddings with the precomputed offset and pass through the text-trained projector. This enables direct cross-modal retrieval and classification without ever seeing modal samples during training.
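
A matching inference sketch, reusing the pieces above; scoring by cosine similarity against LLM-embedded class names or captions is our assumption of a standard zero-shot setup:

import torch
import torch.nn.functional as F

@torch.no_grad()
def zero_shot_scores(modal_embs, mu_modal, projector, anchor_embs):
    """Score modal queries against text/class anchors in the LLM space.

    modal_embs:  (N, d_enc) embeddings from the pretrained modal encoder
    mu_modal:    (d_enc,)   centroid estimated in the offset step
    projector:   the two-layer MLP trained on text descriptions only
    anchor_embs: (C, d_llm) LLM embeddings of class names or captions
    """
    centered = F.normalize(modal_embs - mu_modal, dim=-1)   # same centering as text
    queries = projector(centered)                            # (N, d_llm), unit norm
    return queries @ F.normalize(anchor_embs, dim=-1).T      # cosine similarities

# Classification or retrieval is then an argmax / top-k over the scores:
# preds = zero_shot_scores(x, mu_modal, projector, anchors).argmax(dim=-1)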

Results

Zero-Shot Performance Across All Benchmarks

PPR (Performance Preservation Ratio) measures the percentage of pretrained encoder performance retained by each method. TextME requires zero paired data and zero labeled target data.

| Method | Image (COCO) | Image (Flkr.) | Video (MSR.) | Video (DiDe.) | Audio (ACaps) | Audio (Clo.) | Mol. (Drug.) | Audio Cls. (ASet) | Audio Cls. (ESC) | 3D Cls. (MN40) | 3D Cls. (Scan.) | X-ray (RSNA) | Emergent (A→I) | Emergent (3D→I) | Data Req. |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Pretrained | 48.29 | 77.70 | 37.00 | 51.06 | 31.27 | 22.47 | 16.90 | 79.19 | 9.32 | 85.20 | 67.75 | 42.21 | 52.64 | × | × |
| LanguageBind | 44.53 | 73.42 | 45.30 | 65.22 | 36.85 | 12.42 | 11.32 | × | 18.33 | 94.00 | × | × | × | × | 10M pairs |
| Ex-MCR | 40.24 | 71.89 | × | × | × | 19.07 | 7.01 | × | 6.67 | 71.20 | 66.53 | 40.31 | × | 1.57 | 1M pairs* |
| COX† | 0.02 | 0.20 | 5.10 | 0.00 | 0.10 | 0.08 | 0.11 | 7.63 | 1.26 | 2.00 | 4.05 | 2.84 | 22.53 | 0.02 | 10K labels |
| TextME (Ours) | 28.63 | 51.66 | 26.40 | 45.82 | 24.10 | 15.35 | 7.81 | 34.75 | 5.80 | 77.25 | 70.86 | 42.15 | 46.59 | 1.06 | 100K text |
| PPR (%) | 59.3 | 66.5 | 71.4 | 89.7 | 77.1 | 68.3 | 46.2 | 43.9 | 62.2 | 90.7 | 104.6 | 99.9 | 88.5 | × | |

* Indirect: uses overlapping modality from existing MCR spaces. † Our reproduction. Bold indicates best among unpaired methods.
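
Concretely, the PPR row follows from the definition above: each TextME score divided by the corresponding pretrained score (s denotes a benchmark score, our notation). A worked instance on the COCO column:

\[
\mathrm{PPR} = \frac{s_{\text{TextME}}}{s_{\text{pretrained}}} \times 100\%,
\qquad \text{e.g. COCO: } \frac{28.63}{48.29} \times 100\% \approx 59.3\%.
\]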

Emergent Cross-Modal Retrieval

TextME enables retrieval between modality pairs never seen during training. Audio queries retrieve semantically related 3D objects, and molecular structures retrieve contextually appropriate images — demonstrating that text-anchored alignment creates semantic bridges across arbitrary modalities.

Emergent cross-modal retrieval: Audio to 3D

Audio→3D retrieval without paired supervision. Audio queries retrieve semantically related 3D objects. TextME retrieves coherent results while the Naive baseline fails entirely.

Embedding Visualization

t-SNE visualizations show how TextME progressively aligns modal embeddings with text embeddings in the LLM anchor space across training stages.
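
A minimal sketch of how such a plot can be reproduced with scikit-learn, assuming the projected text and modal embeddings are available as NumPy arrays (perplexity and plotting choices are ours, not the paper's):

import numpy as np
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

def plot_alignment(text_proj: np.ndarray, modal_proj: np.ndarray, title: str):
    """Jointly embed projected text and modal embeddings with t-SNE and
    color them by modality to visualize how well the two clouds overlap."""
    joint = np.concatenate([text_proj, modal_proj], axis=0)
    coords = TSNE(n_components=2, perplexity=30, init="pca").fit_transform(joint)
    n = len(text_proj)
    plt.scatter(coords[:n, 0], coords[:n, 1], s=5, label="text")
    plt.scatter(coords[n:, 0], coords[n:, 1], s=5, label="modal")
    plt.legend(); plt.title(title); plt.show()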

BibTeX

@article{hong2026textme,
  title={TextME: Bridging Unseen Modalities Through Text Descriptions},
  author={Hong, Soyeon and Kim, Jinchan and You, Jaegook and Choi, Seungtaek and Kwak, Suha and Cho, Hyunsouk},
  journal={arXiv preprint arXiv:2602.03098},
  year={2026}
}