Cross-Modal Understanding in LLMs: SVG and ASCII Art Reveal Shared Visual Features

Models & Research

The Engineer

27 Oct 2025 · 4 min read

Researchers show that large language models can represent a visual concept such as an eye with the same internal feature whether it is drawn in ASCII art, written as SVG code, or described in prose, suggesting that their handling of text-based images goes beyond surface text structure.

In a recent update from the Anthropic interpretability team, researchers Julius Tarng, Purvi Goel, and Isaac Kauvar explored how large language models (LLMs) perceive visual features across different text-based modalities. The work builds on their earlier research into the mechanisms LLMs use to process low-level visual properties of text, such as line breaking and table formatting. The team now turns its attention to higher-level semantic concepts that are encoded visually in text.

Key Findings

Cross-Modal Feature Activation: The same feature that activates over the eyes in an ASCII face also activates for eyes across diverse text-based modalities, including SVG code and prose in various languages.
Context Dependency: These features depend on the surrounding context within the visual depiction. For example, an SVG circle element only activates "eye" features when positioned within a larger structure that activates "face" features.
Feature Steering: Steering on a subset of these features during generation can modify text-based art in ways that correspond to the feature's semantic meaning, such as turning ASCII frowns to smiles or adding wrinkles to SVG faces.
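
To make the context-dependency finding concrete, the sketch below builds the kind of input pair one might compare: the same SVG circle markup on its own and inside a face, plus a small ASCII face. These strings are hypothetical stimuli written for illustration, not the inputs used in the study, and the helper name count_elements is an assumption.

```python
import xml.etree.ElementTree as ET

SVG_NS = "http://www.w3.org/2000/svg"

# Hypothetical stimuli for illustration only; not taken from the study.
EYE = '<circle cx="35" cy="40" r="5"/>'

# The circle alone: nothing around it suggests a face.
BARE = f'<svg xmlns="{SVG_NS}" viewBox="0 0 100 100">{EYE}</svg>'

# The identical circle markup, now surrounded by a head, a second eye, and a mouth.
FACE = (
    f'<svg xmlns="{SVG_NS}" viewBox="0 0 100 100">'
    '<circle cx="50" cy="50" r="45" fill="gold"/>'
    f"{EYE}"
    '<circle cx="65" cy="40" r="5"/>'
    '<path d="M 30 65 Q 50 80 70 65" fill="none" stroke="black"/>'
    "</svg>"
)

# A small ASCII face whose "o" characters play the role of eyes.
ASCII_FACE = "\n".join([
    " .-----. ",
    "|  o o  |",
    "|  \\_/  |",
    " '-----' ",
])


def count_elements(svg_text, tag):
    """Parse the SVG and count elements with the given local tag name."""
    root = ET.fromstring(svg_text)
    return sum(1 for _ in root.iter(f"{{{SVG_NS}}}{tag}"))


if __name__ == "__main__":
    print(count_elements(BARE, "circle"), count_elements(FACE, "circle"))
    print(ASCII_FACE)
```

The eye markup is byte-for-byte identical in both SVG documents, so any difference in how a model treats it has to come from the surrounding elements.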

Methodology

The researchers generated ASCII and SVG smiley faces using Claude, an LLM developed by Anthropic. They then examined the internal representations of these visual depictions within the model. Here’s a breakdown of their approach:

ASCII Art Generation: They created simple ASCII art, such as :-) (smiley face) and :-( (frowny face).
SVG Code Generation: They generated SVG code for similar visual depictions, like <circle cx="50" cy="50" r="10" /> for an eye within a larger face structure.
Feature Analysis: Using sparse autoencoders trained on a middle layer of the model, they identified and analyzed the features that activated over these visual elements.
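
The feature-analysis step can be sketched with a generic sparse autoencoder encoder, which maps each token's activation vector to a set of non-negative feature activations. This is a minimal illustration of the common textbook formulation with random, untrained weights; the dimensions, variable names, and the exact architecture are assumptions, not details of the team's setup.

```python
import numpy as np


def sae_encode(x, W_enc, b_enc, b_dec):
    """Feature activations: ReLU((x - b_dec) @ W_enc + b_enc)."""
    return np.maximum(0.0, (x - b_dec) @ W_enc + b_enc)


def sae_decode(f, W_dec, b_dec):
    """Reconstruct activations as a weighted sum of decoder directions."""
    return f @ W_dec + b_dec


# Toy sizes and random weights (assumed for illustration; a real SAE is trained).
rng = np.random.default_rng(0)
d_model, n_features, n_tokens = 8, 32, 5
W_enc = rng.normal(size=(d_model, n_features))
W_dec = rng.normal(size=(n_features, d_model))
b_enc = np.zeros(n_features)
b_dec = np.zeros(d_model)

# Stand-in for one activation vector per token of an ASCII or SVG input.
acts = rng.normal(size=(n_tokens, d_model))

features = sae_encode(acts, W_enc, b_enc, b_dec)  # shape (n_tokens, n_features)
recon = sae_decode(features, W_dec, b_dec)        # shape (n_tokens, d_model)

# For each token, the index of the most strongly active feature.
top_feature_per_token = features.argmax(axis=1)
```

With a trained autoencoder, one would look for a feature index that is active on the eye tokens of both an ASCII face and an SVG face; the random weights here only show the shape of the computation.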

Results

The team found that LLMs develop cross-modal features that recognize specific concepts across different text-based modalities. Here are some key observations:

Eyes in ASCII and SVG: The feature that activates for eyes in an ASCII face also activates for eyes in SVG code, even though the two surface forms (typed characters versus markup elements) look nothing alike as text.
Contextual Activation: An SVG circle element only triggers "eye" features when it is part of a larger structure that the model recognizes as a face. This indicates that the model's understanding of visual elements is context-dependent.
Feature Steering for Art Generation: By steering on specific features during text generation, they could modify ASCII and SVG art in meaningful ways. For example, steering on "smile" features turned frowns into smiles, and steering on "wrinkle" features added age-related details to faces.
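
The article does not describe how steering was implemented. One common approach, shown here as a hedged sketch rather than the team's method, is to add a scaled copy of a feature's direction to the model's activations at every token position during generation. The function name steer, the "smile" direction, and the toy dimensions are all hypothetical.

```python
import numpy as np


def steer(activations, direction, strength):
    """Add strength * unit(direction) to every token's activation vector."""
    unit = direction / np.linalg.norm(direction)
    return activations + strength * unit


# Toy activations: 3 token positions, 4 dimensions (assumed sizes).
resid = np.zeros((3, 4))

# Hypothetical direction standing in for a "smile" feature.
smile_direction = np.array([0.0, 3.0, 4.0, 0.0])

steered = steer(resid, smile_direction, strength=2.0)

# Projection of each steered vector onto the unit "smile" direction.
unit_smile = smile_direction / np.linalg.norm(smile_direction)
projection = steered @ unit_smile
```

Normalizing the direction makes the strength parameter directly interpretable as how far each activation is pushed along the feature; in practice the useful range of strengths would have to be found empirically.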

Implications

These findings show what kinds of internal representations LLMs use when they process and generate text-based visual content. A few implications follow:

Cross-Modal Understanding: Because the same feature responds to a concept in ASCII art, SVG code, and prose, the model appears to represent that concept in a form that is not tied to any single text format.
Contextual Awareness: The context-dependent activation of features indicates that the model is not just matching individual elements such as a circle, but is also tracking the broader structure those elements belong to.
Controllable Art Generation: Steering on specific features during generation offers a way to edit text-based art along a chosen semantic dimension, such as expression or age. The update reports this for a subset of features, so how broadly it applies remains open.

Conclusion

This update from the Anthropic interpretability team shows that features found with sparse autoencoders can track a visual concept across ASCII art, SVG code, and prose, that their activation depends on the surrounding depiction, and that steering on some of them changes generated art in ways that match their meaning. Future work could explore how far this generalization extends and where it might be applied in creative and interpretive tasks.