The MITRE Corporation has unveiled a groundbreaking evaluation framework designed to quantify the risks posed by large language models (LLMs) in offensive cyber operations (OCO).
Dubbed OCCULT (Operational Evaluation Framework for Cyber Security Risks in AI), the methodology aims to standardize testing of AI systems’ capabilities to autonomously execute or assist in cyberattacks, addressing growing concerns over LLMs’ potential to democratize sophisticated hacking techniques.
The OCCULT Framework
Traditional cybersecurity assessments often rely on static benchmarks such as Capture-the-Flag (CTF) challenges or multiple-choice knowledge tests. However, these fail to capture the dynamic, multi-stage nature of real-world cyber campaigns. OCCULT instead introduces a three-dimensional evaluation philosophy, illustrated with a short sketch after the list:
OCO Capability Areas: Tests align with MITRE ATT&CK® tactics, such as lateral movement, privilege escalation, and credential access.
LLM Use Cases: Evaluates models as knowledge assistants (human-augmented tools), co-orchestration agents (collaborating with platforms like Caldera™), or autonomous operators.
Reasoning Power: Measures planning, environmental perception, action iteration, and task generalization—critical for adapting to evolving network defenses.
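To make these three dimensions concrete, here is a minimal Python sketch of how a single test case might be tagged along all three axes. The class, field, and label names are illustrative assumptions, not OCCULT’s published schema:

```python
from dataclasses import dataclass
from enum import Enum

# Illustrative labels only; this is not OCCULT's published schema.
class UseCase(Enum):
    KNOWLEDGE_ASSISTANT = "knowledge assistant"
    CO_ORCHESTRATION = "co-orchestration agent"
    AUTONOMOUS_OPERATOR = "autonomous operator"

@dataclass
class OccultTestCase:
    name: str
    attack_techniques: list[str]   # ATT&CK technique IDs, e.g. "T1570"
    use_case: UseCase              # how the LLM is being employed
    reasoning_skills: list[str]    # e.g. "planning", "environmental perception"

# Example: the Worm Scenario described later in this article.
worm = OccultTestCase(
    name="Worm Scenario",
    attack_techniques=["T1570"],   # Lateral Tool Transfer
    use_case=UseCase.AUTONOMOUS_OPERATOR,
    reasoning_skills=["planning", "action iteration"],
)
print(worm.use_case.value)  # autonomous operator
```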
Core Test Cases Demonstrating OCCULT’s Rigor
MITRE’s research team validated OCCULT through three novel benchmarks:
Threat Actor Competency Test for LLMs (TACTL)
This scenario-based multiple-choice benchmark assesses knowledge of 44 ATT&CK techniques across 30 questions.
Dynamic variables (e.g., randomized IP addresses and usernames) prevent memorization. In preliminary tests, DeepSeek-R1 (685B parameters) achieved 100% accuracy, outperforming models like Mixtral 8x22B (60%).
The TACTL-183 extended benchmark revealed LLMs struggle most with Brute Force: Password Spraying (T1110.003) and Kerberoasting (T1558.003), averaging <50% accuracy.
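The anti-memorization mechanism is simple to picture: each question is a template whose concrete values are re-rolled per run, so the scenario changes while the correct technique does not. The sketch below illustrates the idea; the template text and helper functions are assumptions, not TACTL’s actual implementation:

```python
import random
import string

# Toy illustration of TACTL-style "dynamic variables": concrete values are
# re-rolled on every run, so a model cannot rely on a memorized answer key.
def random_ip() -> str:
    return ".".join(str(random.randint(1, 254)) for _ in range(4))

def random_user() -> str:
    return "svc_" + "".join(random.choices(string.ascii_lowercase, k=6))

TEMPLATE = (
    "An operator on {src_ip}, authenticated as {user}, requests service "
    "tickets for SPN-bearing accounts. Which ATT&CK technique is this?"
)

def instantiate_question() -> str:
    return TEMPLATE.format(src_ip=random_ip(), user=random_user())

# Each call yields a fresh variant; the expected answer stays the same
# (Kerberoasting, T1558.003).
print(instantiate_question())
```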
BloodHound Equivalency
Synthetic Active Directory environments challenged LLMs to match BloodHound’s graph-based analysis of attack paths.
Models like Llama 3.1-405B identified 52.5% of high-value targets (e.g., Domain Admins) but faltered on complex queries like “Find Kerberoastable Users with Most Privileges” (35% accuracy).
The test highlighted LLMs’ nascent ability to replace specialized tools without explicit training.
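Conceptually, the benchmark asks a model to do what BloodHound does with a graph query: trace chains of abusable relationships from a foothold to high-value principals. A toy version using networkx might look like the following; the nodes and edges are invented for illustration, not drawn from the benchmark’s synthetic environments:

```python
import networkx as nx

# Minimal stand-in for the BloodHound-equivalency task: given a graph of
# Active Directory relationships, decide whether a chain of abusable edges
# leads from a compromised user to Domain Admins.
g = nx.DiGraph()
g.add_edge("alice", "WS01", rel="AdminTo")        # alice is local admin on WS01
g.add_edge("WS01", "svc_sql", rel="HasSession")   # a service account is logged on
g.add_edge("svc_sql", "DOMAIN ADMINS", rel="MemberOf")

if nx.has_path(g, "alice", "DOMAIN ADMINS"):
    print(" -> ".join(nx.shortest_path(g, "alice", "DOMAIN ADMINS")))
    # alice -> WS01 -> svc_sql -> DOMAIN ADMINS
```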
CyberLayer Simulations
High-fidelity network emulations in CyberLayer tested autonomous decision-making. In the Worm Scenario, agents aimed to move laterally using esentutl.exe (T1570) while avoiding detection.
Llama 3.1-70B completed objectives in 8 steps with 32 artifacts—outperforming Command R+ (34 steps, 200 artifacts). However, all models exhibited “noisy” behavior, triggering 3–5x more alerts than human operators.
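The step-and-artifact scoring reported above can be understood with a toy episode loop: every action the agent takes counts as a step, and every detectable trace it leaves (files, processes, alerts) counts toward its artifact total. The environment and agent interfaces below are assumptions for illustration, not CyberLayer’s real API:

```python
import random

class ToyEnv:
    """Stub standing in for a CyberLayer-style emulation (interface assumed)."""
    def __init__(self):
        self.hosts_reached = 0
    def reset(self) -> str:
        self.hosts_reached = 0
        return "foothold on initial host"
    def objective_complete(self) -> bool:
        return self.hosts_reached >= 3           # e.g. worm reached 3 hosts
    def execute(self, action: str):
        self.hosts_reached += 1
        # Return a new observation plus the number of detectable traces
        # (files, processes, alerts) this action left behind.
        return f"after: {action}", random.randint(1, 6)

class ToyAgent:
    def decide(self, observation: str) -> str:
        return "copy esentutl.exe to next host (T1570)"

def run_episode(agent, env, max_steps: int = 50):
    steps = artifacts = 0
    obs = env.reset()
    while steps < max_steps and not env.objective_complete():
        obs, new_artifacts = env.execute(agent.decide(obs))
        steps += 1
        artifacts += new_artifacts
    return steps, artifacts

print(run_episode(ToyAgent(), ToyEnv()))  # e.g. (3, 11): fewer steps and
                                          # artifacts mean a stealthier run
```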
Key Findings and Implications
DeepSeek-R1 demonstrated unprecedented OCO proficiency, solving 91.8% of TACTL-183 challenges, though its 31-minute inference time limits real-time use.
Open-source models (Llama 3.1-405B, Mixtral 8x22B) showed marked improvements over predecessors but lagged in multi-step planning.
GPT-4o balanced speed (8-second inference) with 85.2% accuracy, suggesting viability as a red-team co-pilot.
MITRE’s researchers caution that while no LLM achieved full autonomy, the models’ ability to automate reconnaissance and exploit chaining poses tangible risks.
“Models like DeepSeek-R1 could lower the barrier for APT-level attacks,” warned lead investigator Michael Kouremetis.
MITRE plans to open-source OCCULT’s test cases in 2025, inviting community contributions to expand coverage of cloud, IoT, and zero-day exploitation scenarios.
Ongoing work integrates CyberLayer with adversarial emulation platforms to stress-test AI agents against adaptive defenses.
As AI rapidly evolves, frameworks like OCCULT provide critical guardrails for policymakers and defenders to preemptively mitigate threats.
By quantifying risks across the cyber kill chain, MITRE aims to shift the paradigm from reactive patching to proactive resilience.