A recent analysis uncovered 11,908 live API keys, passwords, and authentication tokens embedded in publicly scraped web data of the kind used to train models such as DeepSeek.
According to cybersecurity firm Truffle Security, which conducted the study, the findings highlight how AI models trained on unfiltered internet snapshots risk internalizing and reproducing insecure coding patterns.
The findings follow earlier revelations that LLMs frequently suggest hardcoding credentials in codebases, raising questions about the role of training data in reinforcing these behaviors.
Truffle Security scanned 400 terabytes of Common Crawl’s December 2024 dataset, comprising 2.67 billion web pages from 47.5 million hosts. Using their open-source secret scanner TruffleHog, researchers identified the nearly 12,000 live keys, passwords, and tokens noted above.
Notably, the dataset included high-risk exposures like AWS root keys in front-end HTML and 17 unique Slack webhooks hardcoded into a single webpage’s chat feature.
Mailchimp API keys dominated the leaks (1,500+ instances). They were often embedded directly in client-side JavaScript, a practice that enabled phishing campaigns and data exfiltration.
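Mailchimp keys follow a recognizable shape (32 hexadecimal characters followed by a "-usN" data-center suffix), which is part of why they are so easy to spot in page source. The Python sketch below illustrates that kind of naive pattern match; it is only a rough stand-in for TruffleHog’s detectors, which also verify each candidate against the live service.

```python
import re

# Illustrative sketch: flag strings shaped like Mailchimp API keys
# (32 hex characters plus a "-usN" data-center suffix) in page source.
# Real detectors are more thorough and verify candidates against the API.
MAILCHIMP_KEY_RE = re.compile(r"\b[0-9a-f]{32}-us\d{1,2}\b")

def find_candidate_keys(page_source: str) -> list[str]:
    """Return substrings that look like Mailchimp API keys."""
    return MAILCHIMP_KEY_RE.findall(page_source)

if __name__ == "__main__":
    html = '<script>var apiKey = "0123456789abcdef0123456789abcdef-us12";</script>'
    print(find_candidate_keys(html))  # ['0123456789abcdef0123456789abcdef-us12']
```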
Common Crawl’s dataset, stored in 90,000 WARC files, preserves raw HTML, JavaScript, and server responses from crawled sites.
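For a sense of how those files are enumerated, the sketch below pulls a crawl’s warc.paths index from data.commoncrawl.org; the CC-MAIN-2024-51 crawl ID is an assumption for the December 2024 snapshot and should be checked against Common Crawl’s published listings.

```python
import gzip
import io

import requests  # pip install requests

# Assumed crawl ID for the December 2024 snapshot; confirm against
# Common Crawl's own index of published crawls.
CRAWL_ID = "CC-MAIN-2024-51"
BASE = "https://data.commoncrawl.org"

def list_warc_files(crawl_id: str = CRAWL_ID) -> list[str]:
    """Return the relative paths of every WARC file in a Common Crawl snapshot.

    Each crawl ships a gzipped 'warc.paths' index; every line is one WARC
    file path relative to the data.commoncrawl.org root.
    """
    resp = requests.get(f"{BASE}/crawl-data/{crawl_id}/warc.paths.gz", timeout=60)
    resp.raise_for_status()
    with gzip.open(io.BytesIO(resp.content), "rt") as fh:
        return [line.strip() for line in fh if line.strip()]

if __name__ == "__main__":
    paths = list_warc_files()
    print(len(paths), "WARC files, e.g.", paths[0])
```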
Truffle Security deployed a 20-node AWS cluster to process the archive, splitting files with awk and scanning each segment with TruffleHog’s verification engine.
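In spirit, the per-file work resembles the single-machine Python sketch below, which uses the warcio library to pull response bodies out of a WARC and shells out to a TruffleHog v3 binary assumed to be on PATH (flag names vary between TruffleHog releases); the production pipeline instead split files with awk and fanned the segments out across the cluster.

```python
import subprocess
import tempfile
from pathlib import Path

from warcio.archiveiterator import ArchiveIterator  # pip install warcio

def scan_warc(warc_path: str) -> str:
    """Extract HTTP response bodies from a WARC file and scan them with TruffleHog.

    Simplified, single-machine sketch of the idea; not the distributed
    pipeline described in the article.
    """
    with tempfile.TemporaryDirectory() as tmp:
        out_dir = Path(tmp)
        with open(warc_path, "rb") as stream:
            for i, record in enumerate(ArchiveIterator(stream)):
                if record.rec_type != "response":
                    continue  # skip request/metadata records
                body = record.content_stream().read()
                (out_dir / f"record_{i}.html").write_bytes(body)

        # Assumes a TruffleHog v3 binary on PATH; --only-verified keeps only
        # secrets that authenticate successfully against their service.
        result = subprocess.run(
            ["trufflehog", "filesystem", str(out_dir), "--only-verified", "--json"],
            capture_output=True, text=True, check=False,
        )
        return result.stdout

if __name__ == "__main__":
    print(scan_warc("CC-MAIN-sample.warc.gz"))
```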
The tool differentiated live secrets (authenticated against their services) from inert strings—a critical step given that LLMs cannot discern valid credentials during training.
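As a rough illustration of what verification means in practice, the sketch below checks whether a Mailchimp-style key still authenticates, using the service’s documented /ping health check and its Basic-auth convention (any username, key as password); TruffleHog bundles purpose-built verifiers of this kind for hundreds of credential types, and such checks should only ever be run against keys you own or are authorized to test.

```python
import requests  # pip install requests

def mailchimp_key_is_live(api_key: str, timeout: float = 10.0) -> bool:
    """Check whether a Mailchimp-style key authenticates against the API.

    The API host's data-center prefix comes from the key's "-usN" suffix;
    Mailchimp accepts HTTP Basic auth with any username and the key as the
    password. A 200 from the /ping health check indicates a live key.
    """
    try:
        dc = api_key.rsplit("-", 1)[1]   # e.g. "us12"
    except IndexError:
        return False                      # not even key-shaped
    resp = requests.get(
        f"https://{dc}.api.mailchimp.com/3.0/ping",
        auth=("anystring", api_key),
        timeout=timeout,
    )
    return resp.status_code == 200
```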
Researchers faced infrastructural hurdles: inefficiencies in streaming WARC data initially slowed processing, though subsequent AWS optimizations cut download times by 5–6x.
Despite these challenges, the team prioritized ethical disclosure by collaborating with vendors like Mailchimp to revoke thousands of keys, avoiding spam-like outreach to individual website owners.
The study underscores a growing dilemma: LLMs trained on publicly accessible data inherit its security flaws. While models like DeepSeek employ additional safeguards such as fine-tuning, alignment techniques, and prompt constraints, the prevalence of hardcoded secrets in training corpora risks normalizing unsafe practices.
Non-functional credentials (e.g., placeholder tokens) contribute to this issue, as LLMs cannot contextually evaluate their validity during code generation.
Truffle Security warns that developers who reuse API keys across client projects face heightened risks. In one case, a software firm’s shared Mailchimp key exposed all client domains linked to its account, a goldmine for attackers.
To curb AI-generated vulnerabilities, Truffle Security recommends keeping credentials out of client-side code, avoiding key reuse across client projects, and scanning code for exposed secrets before it reaches the public web.
With LLMs increasingly shaping software development, securing their training data is no longer optional—it’s foundational to building a safer digital future.