How Sentence Transformers v5.4 Unlocks Multimodal AI: Text, Image, Audio & Video in One API


The AI landscape is rapidly evolving beyond text-only models, and the latest update to Sentence Transformers brings this multimodal future within reach for developers. With the release of v5.4, this popular Python library now enables you to encode and compare texts, images, audio, and videos using the same familiar API that revolutionized text embeddings. This isn’t just an incremental update—it’s a paradigm shift that opens up entirely new categories of AI applications.

For years, developers have relied on Sentence Transformers for semantic search, retrieval-augmented generation (RAG), and other text-based applications. The new multimodal capabilities mean you can now build systems that understand relationships between different types of content. Imagine searching your video library with text descriptions, finding images that match audio descriptions, or building RAG pipelines that work seamlessly across all media types.

What Are Multimodal Models and Why Do They Matter?

Traditional embedding models convert text into fixed-size vectors that capture semantic meaning. You’ve probably used these for tasks like finding similar documents or powering semantic search. Multimodal embedding models extend this concept by mapping inputs from different modalities—text, images, audio, or video—into a shared embedding space.

Think of it this way: instead of having separate systems for text search and image search, you now have a unified space where “a red sports car” (text) sits close to actual images of red sports cars, audio descriptions of engine sounds, and video clips of cars racing. This shared representation enables truly cross-modal applications.

Similarly, while traditional reranker models (Cross Encoders) compute relevance scores between pairs of texts, multimodal rerankers can score pairs where one or both elements are images, combined text-image documents, or other modalities. This is crucial for refining search results in complex multimodal systems.

Getting Started: Installation and Requirements

Before diving into code, you’ll need to install the appropriate dependencies. The base Sentence Transformers package now supports optional extras for different modalities:

# Install with image support
pip install -U "sentence-transformers[image]"

# Install with audio support
pip install -U "sentence-transformers"

# Install with video support
pip install -U "sentence-transformers[video]"

# Or install everything at once
pip install -U "sentence-transformers[image,video,train]"

Important hardware consideration: Vision-Language Models (VLMs) like Qwen3-VL-2B require significant GPU resources. The 2B parameter version needs approximately 8 GB of VRAM, while the 8B variants require around 20 GB. If you don’t have access to a local GPU, consider using cloud GPU services or Google Colab. For CPU-only environments, text-only models or smaller CLIP variants are more practical choices.
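To sanity-check those numbers yourself, a quick back-of-envelope estimate helps: half-precision weights take 2 bytes per parameter, and activations, the vision tower's image patches, and framework overhead roughly double the total in practice. The helper below is a rough rule of thumb, not a precise measurement:

```python
# Back-of-envelope VRAM estimate: weights in bf16/fp16 take 2 bytes per
# parameter; activations and framework overhead roughly double that in
# practice (a heuristic, not a guarantee).
def estimate_vram_gb(n_params_billions: float, bytes_per_param: int = 2,
                     overhead_factor: float = 2.0) -> float:
    weights_gb = n_params_billions * 1e9 * bytes_per_param / 1e9
    return weights_gb * overhead_factor

print(estimate_vram_gb(2))  # → 8.0, matching the ~8 GB figure for a 2B model
```

The exact overhead varies with image resolution, batch size, and attention implementation, so treat the result as a planning estimate rather than a hard number.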

Working with Multimodal Embedding Models

Loading Your First Multimodal Model

The beauty of the Sentence Transformers API is its consistency. Loading a multimodal model works exactly like loading a text-only model:

from sentence_transformers import SentenceTransformer

model = SentenceTransformer("Qwen/Qwen3-VL-Embedding-2B", revision="refs/pr/23")

Note: The revision argument is currently required because integration pull requests for these models are still pending. Once merged, you’ll be able to load them without specifying a revision.

The model automatically detects which modalities it supports, so there’s no extra configuration needed—though you can customize settings like image resolution or model precision if required.

Encoding Images and Computing Cross-Modal Similarity

Here’s where things get exciting. With a multimodal model loaded, model.encode() accepts images alongside text. You can provide images as URLs, local file paths, or PIL Image objects:

from sentence_transformers import SentenceTransformer

model = SentenceTransformer("Qwen/Qwen3-VL-Embedding-2B", revision="refs/pr/23")

img_embeddings = model.encode([
    "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/transformers/tasks/car.jpg",
    "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/bee.jpg",
])
print(img_embeddings.shape)

Now let’s compute similarities between text and images:

text_embeddings = model.encode([
    "A green car parked in front of a yellow building",
    "A red car driving on a highway",
    "A bee on a pink flower",
    "A wasp on a wooden table",
])

similarities = model.similarity(text_embeddings, img_embeddings)
print(similarities)

You’ll notice something interesting in the results: “A green car parked in front of a yellow building” shows highest similarity to the car image (around 0.51), while “A bee on a pink flower” matches best with the bee image (around 0.67). The mismatched pairs correctly receive lower scores.

Understanding the modality gap: You might wonder why these similarity scores aren’t closer to 1.0. This phenomenon, known as the “modality gap,” occurs because embeddings from different modalities tend to cluster in separate regions of the vector space. While cross-modal similarities are typically lower than within-modal similarities (like text-to-text), the relative ordering is preserved, making retrieval systems work effectively.
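You can see the effect with a small self-contained illustration. The vectors below are hand-picked toy numbers, not real model outputs: each embedding is a content direction plus a per-modality offset, which is roughly how the gap manifests geometrically:

```python
import math

def cos(a, b):
    # Plain cosine similarity over Python lists
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

# Toy vectors (illustration only): texts live in one region of the space,
# images share the same "content" directions plus an offset on the last axis.
text_car = [1.0, 0.0, 0.0]
text_bee = [0.0, 1.0, 0.0]
img_car  = [1.0, 0.0, 2.0]
img_bee  = [0.0, 1.0, 2.0]

# Cross-modal scores are modest in absolute terms...
print(round(cos(text_car, img_car), 3))  # 0.447
# ...while two *different* images still look similar to each other:
print(round(cos(img_car, img_bee), 3))   # 0.8
# But the ranking is preserved: the car text is closest to the car image.
assert cos(text_car, img_car) > cos(text_car, img_bee)
```

Retrieval only needs that last property to hold, which is why the modality gap rarely hurts in practice as long as you rank rather than threshold on absolute scores.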

Best Practices for Retrieval Tasks

For production retrieval systems, use the specialized methods encode_query() and encode_document(). Many advanced models apply different instruction prompts depending on whether the input is a query or a document—similar to how chat models use different system prompts. This optimization can significantly improve retrieval accuracy.
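Under the hood, the asymmetry often comes down to prompting: many retrieval models prepend an instruction to queries but embed documents as-is, which is what encode_query() and encode_document() select automatically from a model's configured prompts. Here is a minimal sketch of that mechanism; the prompt string is a made-up example, not any particular model's:

```python
# Sketch of the query/document asymmetry in retrieval models: an instruction
# is prepended to queries only, so the same text becomes a different model
# input depending on its role. QUERY_PROMPT below is hypothetical.
QUERY_PROMPT = "Instruct: Retrieve documents relevant to the query.\nQuery: "

def prepare(texts, role):
    if role == "query":
        return [QUERY_PROMPT + t for t in texts]
    return list(texts)  # documents are embedded as-is

queries = prepare(["red sports car"], role="query")
docs = prepare(["red sports car"], role="document")
print(queries[0] == docs[0])  # False: same text, different model input
```

Because the prompts are baked into the model's training, mixing up the two methods (or using plain encode() for both) can quietly degrade retrieval quality.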

Multimodal Reranker Models in Action

While embedding models create vector representations for individual items, reranker models evaluate pairs of items to determine their relevance. Multimodal rerankers extend this capability to mixed-modality pairs.

Ranking Mixed-Modality Documents

Consider a scenario where you have a text query and need to rank a collection of mixed documents (some text, some images, some combinations). Multimodal rerankers can handle this seamlessly:

from sentence_transformers import CrossEncoder

# Note: shown for API illustration; choose a reranker checkpoint that
# actually supports the modalities in your documents.
reranker = CrossEncoder("mixedbread-ai/mxbai-rerank-large-v1")

query = "Find images of historical architecture"
documents = [
    "A photo of the Colosseum in Rome",
    "https://example.com/colosseum.jpg",  # placeholder image URL
    "Roman architecture characteristics and history",
    "Modern skyscrapers in Dubai",
]

# Score each (query, document) pair; higher means more relevant
scores = reranker.predict([(query, doc) for doc in documents])
print(scores)

The reranker evaluates each query-document pair, returning relevance scores that help you identify the most appropriate results regardless of their modality.
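From there, ranking is just a sort over the returned scores. The scores below are made up for illustration:

```python
# Rank documents by reranker score (scores are hypothetical, for illustration).
documents = [
    "A photo of the Colosseum in Rome",
    "https://example.com/colosseum.jpg",
    "Roman architecture characteristics and history",
    "Modern skyscrapers in Dubai",
]
scores = [0.92, 0.88, 0.75, 0.11]

# Pair each document with its score, sort descending, keep the top 3
ranked = sorted(zip(documents, scores), key=lambda pair: pair[1], reverse=True)
top_k = [doc for doc, _ in ranked[:3]]
print(top_k)
```

In a typical two-stage pipeline, an embedding model first retrieves a candidate pool cheaply, and the reranker applies this scoring-and-sorting step only to those candidates.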

Practical Use Cases and Applications

The real power of multimodal Sentence Transformers becomes apparent when you consider practical applications:

  1. Visual Document Retrieval: Search through PDFs, slides, or documents containing both text and images using natural language queries.
  2. Cross-Modal E-commerce Search: Customers can describe products in text and find matching images, or upload product photos to find similar items.
  3. Multimodal RAG Pipelines: Build retrieval-augmented generation systems that can pull relevant information from text documents, images, audio transcripts, and video clips to generate comprehensive answers.
  4. Content Moderation at Scale: Automatically flag inappropriate content by analyzing relationships between images and their captions or surrounding text.
  5. Accessibility Tools: Create systems that can describe images for visually impaired users or generate alt-text automatically.

Input Formats and Configuration

Sentence Transformers v5.4 supports a wide range of input formats:

  • Text: Strings or lists of strings
  • Images: URLs, file paths, PIL Images, numpy arrays, or PyTorch tensors
  • Audio: File paths, URLs, or raw audio arrays
  • Video: File paths or pre-processed video frames
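Internally, mixed inputs have to be routed to the right preprocessor. The dispatcher below is a hypothetical sketch of that idea, not the library's actual implementation; real handling also inspects object types such as PIL Images, numpy arrays, and tensors:

```python
# Hypothetical input dispatcher (not the library's internals): classify a raw
# string input by its extension so the right preprocessor can be applied.
IMAGE_EXTS = (".jpg", ".jpeg", ".png", ".webp")
AUDIO_EXTS = (".wav", ".mp3", ".flac")
VIDEO_EXTS = (".mp4", ".avi", ".mov")

def classify_input(item: str) -> str:
    path = item.lower().split("?", 1)[0]  # strip any URL query string first
    if path.endswith(IMAGE_EXTS):
        return "image"
    if path.endswith(AUDIO_EXTS):
        return "audio"
    if path.endswith(VIDEO_EXTS):
        return "video"
    return "text"  # anything else is treated as plain text

print(classify_input("https://example.com/colosseum.jpg"))  # image
print(classify_input("a red sports car"))                   # text
```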

You can check which modalities a model supports:

print(model.supported_modalities)
# Output might be: {'text', 'image'}

For advanced configuration, you can pass processor and model kwargs to control aspects like image resolution, audio sampling rate, or model precision settings.

Current Model Landscape and Future Directions

Several cutting-edge models already support these multimodal capabilities:

  • Qwen3-VL-Embedding models: Strong performance on text-image tasks
  • mixedbread-ai rerankers: Excellent for mixed-modality ranking
  • CLIP variants: Established models with proven cross-modal capabilities
  • AudioCLIP and VideoCLIP: Extending the paradigm to temporal media

As the field evolves, expect to see models that better bridge the modality gap, more efficient architectures for real-time applications, and specialized models for industry-specific use cases.

Getting the Most from Multimodal Sentence Transformers

Here are my recommendations for successfully implementing these new capabilities:

  1. Start with a clear use case: Don’t add multimodality just because you can. Identify specific problems where cross-modal understanding provides real value.
  2. Understand the modality gap: Design your similarity thresholds and ranking logic with the understanding that cross-modal scores will be lower than within-modal scores.
  3. Consider hybrid approaches: Sometimes combining specialized single-modality models with multimodal models yields better results than relying on a single multimodal model for everything.
  4. Plan for computational resources: Multimodal models, especially those handling video or high-resolution images, can be computationally intensive. Factor this into your infrastructure planning.
  5. Evaluate thoroughly: Use metrics that reflect your actual use case. For retrieval systems, consider metrics like recall@k or mean reciprocal rank that account for ranking quality.
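Both of those metrics are straightforward to compute yourself on a small evaluation set. Here is a minimal implementation; the query and document IDs are toy examples:

```python
# Minimal recall@k and MRR (mean reciprocal rank) for retrieval evaluation.
# `rankings` maps each query to its ranked result IDs; `relevant` maps each
# query to the set of gold (relevant) IDs.
def recall_at_k(rankings, relevant, k):
    total = sum(
        len(set(ranked[:k]) & relevant[q]) / len(relevant[q])
        for q, ranked in rankings.items()
    )
    return total / len(rankings)

def mrr(rankings, relevant):
    total = 0.0
    for q, ranked in rankings.items():
        for rank, doc_id in enumerate(ranked, start=1):
            if doc_id in relevant[q]:
                total += 1.0 / rank  # reciprocal rank of the first hit
                break
    return total / len(rankings)

rankings = {"q1": ["d3", "d1", "d2"], "q2": ["d5", "d4"]}
relevant = {"q1": {"d1"}, "q2": {"d4"}}
print(recall_at_k(rankings, relevant, k=1))  # 0.0: no gold doc ranked first
print(mrr(rankings, relevant))               # 0.5: both gold docs at rank 2
```

For cross-modal systems, compute these per modality pair (text→image, text→video, and so on) as well as overall, since performance can differ sharply between pairs.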

The Future Is Multimodal

The v5.4 update to Sentence Transformers represents more than just new features—it signals a fundamental shift in how we approach AI applications. By providing a unified API for text, images, audio, and video, it lowers the barrier to building truly intelligent systems that understand our multimodal world.

As AI continues to evolve, the ability to process and relate information across different modalities will become increasingly crucial. Whether you’re building the next generation of search engines, creating more accessible technology, or developing innovative content recommendation systems, multimodal capabilities are no longer optional—they’re essential.

The tools are now in your hands. What will you build?
