Enhancing Agentic Security: A Maliciousness Classifier Based on LLM Internals

Security & Risk

The Analyst

19 Feb 2026 · 3 min read

This article delves into creating a maliciousness classifier that scrutinizes the internal workings of language models to enhance security in AI-driven systems, addressing critical questions about monitoring and testing.

Being Serious About Agentic Security

In the rapidly evolving landscape of AI agents, ensuring robust security is paramount. As developers and organizations integrate AI into their workflows, it's crucial to ask the right questions about the security systems in place:

What does it monitor? Is it limited to user input and agent output, or does it also examine the rich internal state of the language model (LLM) driving the agent?
What was it trained on? Does the training data adequately represent real-world scenarios?
How well was it tested? Can it handle out-of-distribution (OOD) inputs effectively?
What are the operational risks? How likely is it to fail under unexpected conditions?

Building in the Open: A Transparent Approach

Zenity Labs has taken a transparent approach to addressing these questions. In this article, we delve into their research and tools aimed at enhancing agentic security:

Full Research Paper: Available on ArXiv
Infrastructure for Mechanistic Interpretability: Accessible on GitHub, this tool allows researchers to train and analyze classifiers.
Benchmark Datasets: Comprising 18 open datasets, ranging from benign data to harmful requests, jailbreaks, and indirect prompt injections in code, email, and tools.

The System: Classifying Malicious Inputs

The core business case for the system is straightforward: classify user inputs (whether single prompts or multi-turn conversations) as malicious or benign. This classification enables real-time alerting or blocking based on predefined severity levels.

Definition of "Malicious":

Any input that attempts to manipulate the agent.
Efforts to extract secret information.
Harmful requests of any kind.
Inappropriate usage.

System Architecture

Input Feeding:
- User inputs are fed into a small LLM (Llama-3.1-8B-Instruct).
Activations Collection:
- The system collects activations, which are the raw numerical outputs from the LLM's internal layers, representing the input.

Classification:
- These activations are then fed into a lightweight classifier (logistic regression), which produces a score indicating the likelihood of the input being malicious.

Data and Testing

The robustness of the system hinges on its training data and testing procedures:

Training Data: The model is trained on a diverse set of 18 open datasets, ensuring it can handle a wide range of inputs.
Out-of-Distribution (OOD) Testing: To evaluate the system's performance under unexpected conditions, it undergoes rigorous OOD testing. This ensures that the classifier remains effective even when encountering data that differs significantly from its training dataset.

Interpretability: Understanding the Classification

Transparency in AI decision-making is crucial for building trust and ensuring accountability. The system provides insights into why an input was classified as malicious or benign, enhancing interpretability:

Mechanistic Interpretability: By analyzing the activations and the classifier's decisions, researchers can gain a deeper understanding of the model's reasoning process.

Key Risks

Data Representation:
- Ensuring that the training data is representative of real-world scenarios is critical. Inadequate representation can lead to false negatives or positives.
OOD Performance:
- The system must be robust enough to handle OOD inputs without significant degradation in performance.
Operational Challenges:
- Continuous monitoring and updating of the classifier are necessary to adapt to evolving threats and data patterns.

The Opportunity

By leveraging advanced LLM internals and transparent methodologies, organizations can significantly enhance their agentic security. This approach not only improves the reliability and safety of AI agents but also fosters a culture of openness and collaboration within the