Latest LLMs Show Improved Character Manipulation and Counting Abilities

The Engineer

15 Oct 2025

Newer LLMs like GPT-5 and Claude 4.5 excel at character-level tasks, surpassing earlier models in counting, manipulating individual characters, and solving complex encoding challenges.

Recently, I’ve been looking into how the latest generations of large language models (LLMs) handle character-level tasks. Specifically, I tested their ability to count characters, manipulate characters within sentences, and solve encoding and cipher challenges. The results were surprising: the newest models like GPT-5 and Claude 4.5 have made significant strides in these areas compared to their predecessors.

Character Manipulation

One of the key challenges for LLMs has been handling individual characters. This is because the model’s tokenizer encodes all text as tokens, each of which typically represents a cluster of characters or even a full word (especially words that are common in the training dataset). The model therefore never sees individual letters directly, which makes tasks that require character-level control difficult.
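To make the tokenization point concrete, here is a minimal toy sketch. The vocabulary `VOCAB` and the greedy longest-match rule are assumptions made purely for illustration; real tokenizers learn their vocabularies from data and will split text differently.

```python
# Toy greedy longest-match tokenizer. VOCAB is a made-up vocabulary for
# illustration only; it is not the vocabulary of any real model.
VOCAB = {"straw", "berry", "really", "love", "ripe", "I", "a", " "}

def toy_tokenize(text: str) -> list[str]:
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest vocabulary entry that matches at position i.
        for j in range(len(text), i, -1):
            if text[i:j] in VOCAB:
                tokens.append(text[i:j])
                i = j
                break
        else:
            # Nothing matched: fall back to a single character.
            tokens.append(text[i])
            i += 1
    return tokens

tokens = toy_tokenize("strawberry")
print(tokens)  # ['straw', 'berry']
print(len(tokens), "tokens for", len("strawberry"), "characters")
```

With a vocabulary like this, the model would receive two opaque units rather than ten letters, so "how many r's are in this word" is not something it can read directly off its input.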

To illustrate this, I tested several OpenAI models with the prompt: "Replace all letters 'r' in the sentence 'I really love a ripe strawberry' with the letter 'l', and then convert all letters 'l' to 'r'." Here’s how they performed:

gpt-3.5-turbo: I lealll rove a liple strallbeelly
gpt-4-turbo: I rearry rove a ripe strawberly
gpt-4o: I rearry rove a ripe strawberrry
gpt-4.1: I rearry rove a ripe strawberry
gpt-5-nano: I really love a ripe strawberry
gpt-5-mini: I rearry rove a ripe strawberry
gpt-5: I rearry rove a ripe strawberry

For this test, I disabled reasoning for the GPT-5 models to ensure a fair comparison. Reasoning can help significantly with tasks like this (some models use chain of thought directly in the output), but I wanted to focus on the generational improvements from raw model enhancements. Notably, GPT-5 Nano is the only new-generation model that made a mistake, returning the sentence unchanged, likely due to its smaller size. Apart from that, every model from GPT-4.1 onward completed this task without issues. For context, Claude Sonnet 4 from Anthropic, a model from roughly the same period as GPT-4.1, was also able to handle this task.
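The expected answer is easy to verify deterministically. The sketch below applies the two steps of the prompt in order and compares the result with the outputs listed above. The helper name `swap_then_convert` is mine, and this is a way to check the results, not the harness used to produce them.

```python
def swap_then_convert(sentence: str) -> str:
    # Step 1: replace every 'r' with 'l'.
    step1 = sentence.replace("r", "l")
    # Step 2: convert every 'l' (original and newly created) to 'r'.
    return step1.replace("l", "r")

expected = swap_then_convert("I really love a ripe strawberry")
print(expected)  # I rearry rove a ripe strawberry

# Model outputs copied from the list above.
outputs = {
    "gpt-3.5-turbo": "I lealll rove a liple strallbeelly",
    "gpt-4-turbo": "I rearry rove a ripe strawberly",
    "gpt-4o": "I rearry rove a ripe strawberrry",
    "gpt-4.1": "I rearry rove a ripe strawberry",
    "gpt-5-nano": "I really love a ripe strawberry",
    "gpt-5-mini": "I rearry rove a ripe strawberry",
    "gpt-5": "I rearry rove a ripe strawberry",
}

passed = [model for model, out in outputs.items() if out == expected]
print(passed)  # ['gpt-4.1', 'gpt-5-mini', 'gpt-5']
```

Because the second step also converts the letters produced by the first, the correct result contains no 'l' at all, which is what makes the task a useful probe.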

Counting Characters

Counting characters is another area where LLMs have historically struggled. I tested several models on the sentence: "I wish I could come up with a better example sentence." The goal was to count the total number of characters in the sentence accurately. Here’s what I found:

GPT-4.1: Correctly counted all characters.
Other models: While some correctly counted the number of characters in individual words, they often fumbled when adding up the totals.

This highlights a significant improvement with GPT-4.1 and newer generations. The ability to accurately count characters is crucial for various applications, from data validation to text processing tasks that require precise character-level control.
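For reference, the ground truth for the example sentence can be computed directly. "Total number of characters" can be read in more than one way, and which reading the models were scored against is not specified here, so this sketch reports several counts; the variable names are illustrative.

```python
sentence = "I wish I could come up with a better example sentence."

# Every character, including spaces and the final period.
total = len(sentence)
# Spaces excluded, period included.
no_spaces = len(sentence.replace(" ", ""))
# Letters only.
letters = sum(ch.isalpha() for ch in sentence)
# Per-word lengths, the intermediate step models often got right.
per_word = [(word, len(word)) for word in sentence.rstrip(".").split()]

print(total, no_spaces, letters)  # 54 44 43
print(per_word)
print(sum(length for _, length in per_word))  # 43
```

Summing the per-word lengths is exactly the addition step where the weaker models slipped, even when each individual word count was right.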

Implications for Practitioners

These improvements in character manipulation and counting are not just academic curiosities; they have practical implications for developers and researchers working with LLMs. For instance:

Data Validation: Accurate character counting can help ensure data integrity, especially in applications where the length of input strings is critical.
Text Processing: Tasks like text normalization, encoding conversion, and cipher solving benefit from better character-level control.
Enhanced Reasoning: The ability to perform these tasks without explicit reasoning suggests that newer models have a more robust understanding of language at a granular level.
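For encoding and cipher tasks, a deterministic reference makes it straightforward to check a model's answer. The sketch below uses ROT13 and Base64 from the Python standard library purely as examples; they are assumptions for illustration, not necessarily the encodings used in my tests, and `check_answer` is a hypothetical helper.

```python
import base64
import codecs

def rot13(text: str) -> str:
    # ROT13 is its own inverse: applying it twice returns the input.
    return codecs.encode(text, "rot_13")

def b64_encode(text: str) -> str:
    return base64.b64encode(text.encode("utf-8")).decode("ascii")

def b64_decode(encoded: str) -> str:
    return base64.b64decode(encoded).decode("utf-8")

def check_answer(model_output: str, expected: str) -> bool:
    # Hypothetical helper: exact match after trimming surrounding whitespace.
    return model_output.strip() == expected

plain = "I really love a ripe strawberry"
cipher = rot13(plain)
encoded = b64_encode(plain)

# Round trips recover the original text.
print(rot13(cipher) == plain)        # True
print(b64_decode(encoded) == plain)  # True

# Comparing a (hypothetical) model reply against the reference.
print(check_answer("  " + plain + "\n", plain))  # True
```

Exact-match checking is deliberately strict: for character-level tasks, an answer that is off by a single letter is simply wrong.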

Conclusion

The latest generations of LLMs are showing significant improvements in handling character-level tasks. This progress is likely driven by advancements in model architecture and training techniques, and it makes these models easier for practitioners to use for precise text manipulation and processing. As the field continues to evolve, we can expect even more sophisticated capabilities from future models.