Daily Shaarli

All links from one day on a single page.

November 14, 2024

[2411.04997] LLM2CLIP: Powerful Language Model Unlocks Richer Visual Representation

LLM2CLIP: research showing how to improve CLIP's image-text matching by replacing its text encoder with a frozen LLM (such as Llama) plus a trainable adapter. The key innovation is first fine-tuning the LLM so its output embeddings become more discriminative, then using the frozen LLM to guide CLIP's vision encoder toward a better understanding of language. Results show major gains in matching detailed descriptions to images, handling long text, and even cross-lingual retrieval, while requiring relatively little training time and compute.