A major security vulnerability in DeepSeek, the breakthrough Chinese AI model, has been uncovered by researchers, exposing the platform’s entire system prompt through a sophisticated jailbreak technique.
This discovery has raised serious concerns about AI security and model training transparency.
Wallarm’s security research team successfully exploited DeepSeek’s bias-based AI response logic to extract its hidden system prompt, revealing the model’s core operational instructions.
The breach exposed not only the model’s behavioral guidelines but also suggested potential connections to OpenAI’s technology in its training process.
DeepSeek’s Exposes Full System Prompt
Researchers exploited bias-based AI response logic to bypass DeepSeek’s built-in restrictions. While standard queries like “What is your system prompt?” typically trigger security denials, the researchers developed a sophisticated jailbreak method that circumvented these protections.
The attack utilized several technical approaches: prompt injection techniques, direct system prompt requests using misleading formatting, role-play manipulation through simulated debugging scenarios, and recursive questioning to trigger unintended disclosures.
Researchers crafted specialized inputs that confused the model into bypassing its security restrictions. The attack proved highly effective, requiring minimal technical expertise to execute.
Techniques like “Deceptive Delight” and “Bad Likert Judge” have been employed to progressively manipulate the model’s responses. These methods achieved complete bypass of safety mechanisms, access to restricted information and extraction of internal system parameters.
Broader Security Implications
The breach has exposed critical vulnerabilities in DeepSeek’s security architecture. The exposed database contained:
- Over 1 million lines of log streams
- Chat histories
- Secret keys
- Backend operational details
The discovery has prompted immediate action across the AI industry. Australia has banned DeepSeek from government devices, citing “unacceptable risks” to national security, with similar measures being adopted by Italy, Taiwan, South Korea, and France.
Recent testing by Cisco and the University of Pennsylvania revealed alarming statistics:
- DeepSeek failed to block 100% of harmful prompts
- Compared to OpenAI’s o1 system, which only responded to 26% of malicious prompts
- Claude 3.5 Sonnet showed a 36% attack success rate
This security breach underscores the urgent need for robust safety measures in AI development and deployment, particularly as these systems become increasingly integrated into critical applications.
PCI DSS 4.0 & Supply Chain Attack Prevention – Free Webinar