Researchers Unveil New Attack to Steal Parts of Production Language Models

Security & Risk

The Engineer

13 Mar 2024

Researchers have devised a method to extract part of the internal parameters of production language models such as ChatGPT and PaLM-2 using only API access, raising concerns about model security.

In a new study, researchers describe what they call the first model-stealing attack that extracts precise, nontrivial information from black-box production language models such as OpenAI's ChatGPT and Google's PaLM-2. The technique, detailed in a paper titled "Stealing Part of a Production Language Model," shows how an attacker can recover one component of these models, the embedding projection layer, using ordinary API access.

What Changed Technically?

The researchers introduced an attack that targets the embedding projection layer of transformer models. This is the final layer of the network: it maps the model's last hidden state, a vector whose length is the hidden dimension, to a score (logit) for every token in the vocabulary. Because the hidden dimension is much smaller than the vocabulary, every logit vector the model produces lies in a low-dimensional subspace, and an attacker who collects enough outputs can reconstruct that subspace. Recovering this layer reveals the model's hidden dimension and gives attackers a first foothold on internals that are otherwise confidential.
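The leak of the hidden dimension can be illustrated with a toy example. The sketch below is not the paper's code: it assumes the projection layer is a plain linear map from hidden state to vocabulary logits, uses random matrices in place of a real model, and assumes the attacker already has full logit vectors. The function name `estimate_hidden_dim` and all sizes are made up for illustration.

```python
import numpy as np


def estimate_hidden_dim(logits: np.ndarray, rel_tol: float = 1e-8) -> int:
    """Count the singular values of a (n_queries, vocab) logit matrix
    that are not numerically zero relative to the largest one."""
    s = np.linalg.svd(logits, compute_uv=False)
    return int(np.sum(s > rel_tol * s[0]))


rng = np.random.default_rng(0)
vocab, hidden, n_queries = 500, 32, 100    # toy sizes; n_queries must exceed hidden

W = rng.normal(size=(vocab, hidden))       # toy stand-in for the projection layer
H = rng.normal(size=(n_queries, hidden))   # toy final hidden states, one per query
logits = H @ W.T                           # what the attacker is assumed to observe

estimated = estimate_hidden_dim(logits)
print("estimated hidden dimension:", estimated)
```

Although each logit vector has 500 entries, the stacked matrix has only as many non-negligible singular values as the hidden dimension. Against a real API the outputs would be noisy and only partially visible, so a fixed threshold like the one above would likely need to be replaced by a more careful rank estimate.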

Key Findings

Embedding Projection Layer Recovery: The attack successfully recovers the embedding projection matrix of a transformer model up to symmetries (meaning the matrix is determined only up to an unknown invertible linear transformation of the hidden space, which output queries alone cannot distinguish).
Cost-Effectiveness: For under $20 USD, the researchers were able to extract the entire projection matrix of OpenAI's Ada and Babbage models. These models have hidden dimensions of 1024 and 2048, respectively.
GPT-3.5-Turbo Insights: The exact hidden dimension size of the gpt-3.5-turbo model was also recovered. The researchers estimate that it would cost under $2,000 to extract the entire projection matrix for this more complex model.

Technical Details

API Access: The attack leverages typical API access, which is often available to users of these language models. This means attackers do not need any special privileges or insider knowledge.
Query Efficiency: The researchers optimized their queries to minimize the number of API calls, and therefore the cost, of the attack. The attack needs output scores for more prompts than the model's hidden dimension, so that the collected outputs span the full output subspace.
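Symmetry Handling: Because the projection matrix can only be recovered up to symmetries, it cannot be compared entry by entry with the true matrix. The researchers rely on linear algebra, such as the singular value decomposition, to recover the matrix, and accuracy is judged after accounting for the unknown transformation.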
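What recovery "up to symmetries" means in practice can be shown with a toy example. This is a minimal sketch under stated assumptions, not the researchers' implementation: the model is replaced by random matrices, the attacker is assumed to hold full logit vectors, and the true matrix `W` is used only to check the result, which a real attacker could not do. The names `W_hat` and `G` are illustrative.

```python
import numpy as np

rng = np.random.default_rng(1)
vocab, hidden, n_queries = 300, 16, 64

W = rng.normal(size=(vocab, hidden))       # toy "secret" projection matrix
H = rng.normal(size=(n_queries, hidden))   # toy final hidden states
logits = H @ W.T                           # observed outputs, shape (n_queries, vocab)

# The rows of `logits` span the same subspace as the columns of W, so the
# top right-singular vectors give a basis for the column space of W.
_, s, Vt = np.linalg.svd(logits, full_matrices=False)
W_hat = Vt[:hidden].T * s[:hidden]         # recovered matrix, shape (vocab, hidden)

# W_hat is not W itself. It matches W only after an unknown linear map G,
# found here by least squares purely to verify the toy example.
G, *_ = np.linalg.lstsq(W_hat, W, rcond=None)
aligned_err = np.linalg.norm(W_hat @ G - W) / np.linalg.norm(W)
direct_err = np.linalg.norm(W_hat - W) / np.linalg.norm(W)

print(f"error after alignment: {aligned_err:.2e}")
print(f"error without alignment: {direct_err:.2e}")
```

The aligned error is at the level of floating-point round-off, while the direct comparison is far off. That gap is why the accuracy of a recovered matrix is only meaningful once the unknown transformation has been accounted for.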

Implications

This attack has significant implications for the security and integrity of production language models. It recovers only one layer rather than the whole model, but by exposing part of the internal structure, it could help attackers:

Reproduce Models: Use the extracted information to build cheaper or more efficient versions of the original models.
Data Theft: Gain insights into the training data used by the model, potentially leading to data leakage.
Malicious Usage: Develop targeted attacks that exploit specific vulnerabilities in the model.

Potential Defenses and Mitigations

The researchers suggest several potential defenses:

Rate Limiting: Implement rate limiting on API access to prevent attackers from making a large number of queries in a short period.
Query Anomaly Detection: Use machine learning techniques to detect and block anomalous query patterns that might indicate an attack.
Model Obfuscation: Obscure the model's outputs or internal structure, for example by adding noise to the scores the API returns, to make it harder for attackers to extract meaningful information.
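The rate-limiting idea can be sketched as a token bucket applied per API key. This is an illustrative example only, not a mechanism described by the researchers or used by any provider: the class `TokenBucket`, its method `allow`, and the chosen limits are all hypothetical. Timestamps are passed in explicitly so the behaviour is deterministic; a real service would more likely read a monotonic clock.

```python
class TokenBucket:
    """Hypothetical per-API-key rate limiter (illustrative only)."""

    def __init__(self, capacity: float, refill_per_sec: float, start: float = 0.0):
        self.capacity = capacity
        self.refill_per_sec = refill_per_sec
        self.tokens = capacity   # start with a full bucket
        self.last = start        # timestamp of the previous call, in seconds

    def allow(self, now: float, cost: float = 1.0) -> bool:
        """Return True and spend `cost` tokens if the request fits the budget."""
        elapsed = max(0.0, now - self.last)
        self.tokens = min(self.capacity, self.tokens + elapsed * self.refill_per_sec)
        self.last = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False


bucket = TokenBucket(capacity=3, refill_per_sec=1.0)
burst = [bucket.allow(now=0.0) for _ in range(5)]   # five requests at the same instant
later = bucket.allow(now=2.0)                       # two seconds later, after a refill
print(burst, later)
```

A limiter like this raises the time an attack takes rather than preventing it outright, since a patient attacker can spread queries over a longer period or across several accounts.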

Conclusion

This research highlights the ongoing challenges in securing AI models, particularly those deployed as black-box services. As language models continue to play a crucial role in various applications, understanding and mitigating these security risks is essential for maintaining their integrity and trustworthiness.