
DeepSeek Data Leak – 12,000 Hardcoded Live API Keys and Passwords Exposed

A recent analysis uncovered 11,908 live API keys, passwords, and authentication tokens embedded in the publicly scraped web data used to train DeepSeek's model.

According to cybersecurity firm Truffle Security, the study highlights how AI models trained on unfiltered internet snapshots risk internalizing and potentially reproducing insecure coding patterns.

The findings follow earlier revelations that LLMs frequently suggest hardcoding credentials in codebases, raising questions about the role of training data in reinforcing these behaviors.

Root AWS key (Source: Truffle Security)

DeepSeek Data Exposed

Truffle Security scanned 400 terabytes of Common Crawl’s December 2024 dataset, comprising 2.67 billion web pages from 47.5 million hosts. Using their open-source tool TruffleHog, researchers identified:

  • 11,908 verified live secrets that authenticate to services like AWS, Slack, and Mailchimp.
  • 2.76 million web pages containing exposed credentials, with 63% of keys reused across multiple domains.
  • A single WalkScore API key recurring 57,029 times across 1,871 subdomains, illustrating widespread credential reuse.
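To make the methodology concrete, here is a minimal sketch of how a verified-secrets scan might be driven from Python, assuming TruffleHog v3 is installed locally and a WARC segment has already been extracted to disk. The paths, and the exact flag names, are illustrative rather than Truffle Security's actual pipeline (recent TruffleHog releases may spell the verified-only option differently):

```python
import json
import subprocess

def scan_segment(path: str) -> list[dict]:
    """Run TruffleHog's filesystem scanner over an extracted WARC segment,
    keeping only secrets that verified against their live service."""
    # --only-verified asks TruffleHog to test each candidate credential
    # against its provider and report only the ones that still work.
    proc = subprocess.run(
        ["trufflehog", "filesystem", path, "--only-verified", "--json"],
        capture_output=True, text=True, check=False,
    )
    # TruffleHog emits one JSON object per finding, one per line.
    return [json.loads(line) for line in proc.stdout.splitlines() if line.strip()]

if __name__ == "__main__":
    for finding in scan_segment("./warc-segments/segment-000"):
        print(finding.get("DetectorName"), finding.get("SourceMetadata"))
```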

Notably, the dataset included high-risk exposures like AWS root keys in front-end HTML and 17 unique Slack webhooks hardcoded into a single webpage’s chat feature.

Mailchimp API keys dominated the leaks (1,500+ instances). They were often embedded directly in client-side JavaScript, a practice that enabled phishing campaigns and data exfiltration.
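The safer alternative is to keep such keys server-side, where they never appear in page source. A minimal sketch of that pattern, assuming a hypothetical MAILCHIMP_API_KEY environment variable and Mailchimp's documented Basic-auth scheme (the data-center suffix after the key's final dash selects the API host):

```python
import os
import requests

# The key never leaves the server: it is read from the environment,
# not embedded in HTML or client-side JavaScript.
API_KEY = os.environ["MAILCHIMP_API_KEY"]   # e.g. "xxxxxxxx-us21"
DATA_CENTER = API_KEY.rsplit("-", 1)[-1]    # "us21" selects the API host

def mailchimp_ping() -> dict:
    """Call Mailchimp's health-check endpoint server-side on behalf of the client."""
    resp = requests.get(
        f"https://{DATA_CENTER}.api.mailchimp.com/3.0/ping",
        auth=("anystring", API_KEY),  # Basic auth: any username, key as password
        timeout=10,
    )
    resp.raise_for_status()
    return resp.json()
```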

Common Crawl’s dataset, stored in 90,000 WARC files, preserves raw HTML, JavaScript, and server responses from crawled sites.

Truffle Security deployed a 20-node AWS cluster to process the archive, splitting files using awk and scanning each segment with TruffleHog’s verification engine.
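The write-up does not publish the cluster code, but iterating a WARC archive and handing HTML responses off to a scanner can be sketched with the open-source warcio library. The file name and the final scan step below are placeholders:

```python
from warcio.archiveiterator import ArchiveIterator

def iter_html_responses(warc_path: str):
    """Yield (url, body) pairs for response records in one WARC file."""
    with open(warc_path, "rb") as stream:
        for record in ArchiveIterator(stream):
            if record.rec_type != "response":
                continue
            url = record.rec_headers.get_header("WARC-Target-URI")
            body = record.content_stream().read()
            yield url, body

if __name__ == "__main__":
    for url, body in iter_html_responses("CC-MAIN-example.warc.gz"):
        # Placeholder for the real step: write the body to disk (or pipe it)
        # and let TruffleHog's verification engine scan it for live secrets.
        print(url, len(body))
```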

The tool differentiated live secrets (authenticated against their services) from inert strings—a critical step given that LLMs cannot discern valid credentials during training.
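Verification is what separates a live secret from noise. As an illustration (not Truffle Security's code), a leaked AWS key pair can be checked by asking AWS which identity the credentials belong to; a successful response means the key is live, an authentication error means it is inert:

```python
import boto3
from botocore.exceptions import ClientError

def is_live_aws_key(access_key_id: str, secret_access_key: str) -> bool:
    """Return True if the key pair still authenticates against AWS."""
    sts = boto3.client(
        "sts",
        aws_access_key_id=access_key_id,
        aws_secret_access_key=secret_access_key,
    )
    try:
        identity = sts.get_caller_identity()   # cheap, read-only call
        print("Live key belongs to:", identity["Arn"])
        return True
    except ClientError:
        # Invalid or revoked credentials raise an auth error instead.
        return False
```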

Researchers faced infrastructural hurdles: WARC’s streaming inefficiencies initially slowed processing, while AWS optimizations reduced download times by 5–6x.

WARC File (Source: Truffle Security)

Despite these challenges, the team prioritized ethical disclosure by collaborating with vendors like Mailchimp to revoke thousands of keys, avoiding spam-like outreach to individual website owners.

The study underscores a growing dilemma: LLMs trained on publicly accessible data inherit its security flaws. While models like DeepSeek apply additional safeguards such as fine-tuning, alignment techniques, and prompt constraints, the prevalence of hardcoded secrets in training corpora risks normalizing unsafe practices.

Non-functional credentials (e.g., placeholder tokens) contribute to this issue, as LLMs cannot contextually evaluate their validity during code generation.

Truffle Security warns that developers who reuse API keys across client projects face heightened risks. In one case, a software firm’s shared Mailchimp key exposed all client domains linked to its account, a goldmine for attackers.

Mitigations

To curb AI-generated vulnerabilities, Truffle Security recommends:

  1. Integrating security guardrails into AI coding tools via platforms like GitHub Copilot’s Custom Instructions, which can enforce policies against hardcoding secrets.
  2. Expanding secret-scanning programs to include archived web data as historical leaks resurface in training datasets.
  3. Adopting Constitutional AI techniques to align models with security best practices, reducing inadvertent exposure of sensitive patterns.
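As one concrete, if simplified, illustration of the second recommendation: even a lightweight pre-commit or pipeline check can catch the most recognizable secret formats, such as AWS access key IDs, before they ever reach a public page or a training corpus. The patterns below are a small, non-exhaustive sample; production scanners use hundreds of detectors plus live verification:

```python
import re
import sys

# A small, non-exhaustive sample of high-signal secret patterns.
PATTERNS = {
    "AWS access key ID": re.compile(r"\bAKIA[0-9A-Z]{16}\b"),
    "Slack webhook": re.compile(r"https://hooks\.slack\.com/services/[A-Za-z0-9/_-]+"),
}

def find_secrets(text: str) -> list[tuple[str, str]]:
    """Return (pattern name, match) pairs found in a blob of text."""
    hits = []
    for name, pattern in PATTERNS.items():
        hits.extend((name, m) for m in pattern.findall(text))
    return hits

if __name__ == "__main__":
    for path in sys.argv[1:]:
        with open(path, encoding="utf-8", errors="ignore") as fh:
            for name, match in find_secrets(fh.read()):
                print(f"{path}: possible {name}: {match[:12]}...")
```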

With LLMs increasingly shaping software development, securing their training data is no longer optional—it’s foundational to building a safer digital future.


Guru Baran

Gurubaran is a co-founder of Cyber Security News and GBHackers On Security. He has 10+ years of experience as a Security Consultant, Editor, and Analyst in cybersecurity, technology, and communications.
