The industry has been reporting on Large Language Models (LLMs) and their potential, but how do various LLMs stack up in the demanding world of a Security Operations Center (SOC)? SOC analysts and vendors building tools for the SOC are rapidly embracing LLMs to scale their operations, increase accuracy, and reduce costs, yet with hundreds to choose from, determining the best LLM for your particular SOC isn’t an easy task.
Simbian recently introduced what the company calls the first benchmark to comprehensively measure LLM performance in SOCs, evaluating models against a diverse range of real alerts and fundamental SOC tools across all phases of alert investigation, from alert ingestion to disposition and reporting. The benchmark compares LLMs across a variety of attacks, threats, and alerts in a realistic SOC environment, and includes a public leaderboard to help professionals choose the best LLM for their SOC needs.
Simbian says its benchmark differs from existing ones in several ways:
- It goes beyond existing benchmarks, which compare LLMs on broad criteria such as language understanding, math, and reasoning, or cover only broad security tasks and very basic SOC tasks like alert summarization.
- Simbian’s benchmark is built on the autonomous investigation of 100 full kill-chain scenarios that mirror what human SOC analysts face every day. To achieve this, the company created diverse, real-world-based attack scenarios with known ground truth about the malicious activity, allowing AI agents to investigate and be assessed against a clear baseline. The scenarios are based on the historical behavior of well-known APT groups and cybercriminal organizations, with a focus on prevalent threats like ransomware and phishing.
- Simbian leveraged its AI SOC agent. The evaluation process is evidence-grounded and data-driven, always kicking off with a triggered alert, mirroring the way SOC analysts operate. The AI agent then had to determine whether the alert was a True or False Positive, find evidence of malicious activity (think “CTF flags” in a red-teaming exercise), correctly interpret that evidence by answering security-contextual questions, provide a high-level overview of the attacker’s activity, and respond to the threat in an appropriate manner, all autonomously. This evidence-based approach is critical for managing hallucinations; a minimal sketch of how such evidence-grounded scoring could work follows this list.
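To make the evidence-grounded grading concrete, here is a minimal sketch of how such scoring could work, assuming each scenario carries a ground-truth verdict, planted evidence “flags,” and contextual questions. The names (Scenario, AgentReport, score_scenario) are illustrative and not taken from Simbian’s actual harness.

```python
# Hypothetical sketch of evidence-grounded scoring for one benchmark scenario.
# Class and function names are illustrative, not Simbian's actual harness.
from dataclasses import dataclass, field


@dataclass
class Scenario:
    alert_id: str
    is_true_positive: bool                                    # ground-truth verdict for the triggering alert
    evidence_flags: set[str] = field(default_factory=set)     # "CTF-style" artifacts planted in the telemetry
    questions: dict[str, str] = field(default_factory=dict)   # security-context question -> expected answer


@dataclass
class AgentReport:
    verdict_true_positive: bool                                # agent's True/False Positive call
    evidence_found: set[str] = field(default_factory=set)
    answers: dict[str, str] = field(default_factory=dict)


def score_scenario(scenario: Scenario, report: AgentReport) -> float:
    """Return the fraction of investigation tasks the agent completed correctly."""
    total = 1 + len(scenario.evidence_flags) + len(scenario.questions)
    done = int(report.verdict_true_positive == scenario.is_true_positive)
    # Only evidence matching a planted flag counts, so hallucinated findings earn nothing.
    done += len(scenario.evidence_flags & report.evidence_found)
    done += sum(
        1
        for question, expected in scenario.questions.items()
        if report.answers.get(question, "").strip().lower() == expected.strip().lower()
    )
    return done / total
```

Scoring only against concretely planted evidence is one way such a harness can keep hallucinations in check: a fabricated finding that matches no flag simply earns no credit.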
Findings and Key Takeaways
Simbian benchmarked some of the most well-known and high-performing models available as of May 2025, including models from Anthropic, OpenAI, Google, and DeepSeek.
The results showed that all high-end models were able to complete over half of the investigation tasks, with performance ranging from 61 to 67 percent. For reference, during the first AI SOC Championship the best human analysts, powered by AI SOC, scored in the range of 73 to 85 percent, while Simbian’s AI agent at extra-effort settings reached 72 percent. This suggests that LLMs are capable of much more than summarizing and retrieving data, and that their capabilities could extend to robust alert triage and tool use via API interactions.
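To illustrate what “tool use via API interactions” can look like during triage, here is a minimal sketch; the SIEM endpoint, request schema, and example query are assumptions for illustration and not part of the benchmark.

```python
# Minimal sketch of triage-time tool use via an API: the harness executes a tool
# call that the model requested against a SOC tool, here a hypothetical SIEM
# search endpoint. Endpoint, schema, and query are illustrative assumptions.
import json
import urllib.request

SIEM_SEARCH_ENDPOINT = "https://siem.example.internal/api/v1/search"  # placeholder URL


def siem_search(query: str, earliest: str = "-24h") -> dict:
    """Run a SIEM search on behalf of the agent and return raw JSON events as evidence."""
    payload = json.dumps({"query": query, "earliest": earliest}).encode()
    request = urllib.request.Request(
        SIEM_SEARCH_ENDPOINT,
        data=payload,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)


# In an agent loop, the model would emit a structured call such as
# {"tool": "siem_search", "arguments": {"query": "host:web-01 AND powershell"}};
# the harness runs siem_search(**arguments) and feeds the returned events back
# into the model's context for the next reasoning step.
```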
Another noteworthy finding is the importance of thorough prompt engineering and agentic flow engineering, which involves feedback loops and continuous monitoring. In initial runs, some models struggled, requiring improved prompts and fallback mechanisms for the coding agents used to analyze retrieved data. This points to a second key finding: AI SOC applications lean heavily on the software engineering capabilities of LLMs.
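As one way to picture the fallback mechanisms described above, here is a minimal sketch of a retry-then-degrade loop for a coding agent, assuming injected callables for code generation, sandboxed execution, and a code-free summary; none of these names reflect Simbian’s actual implementation.

```python
# Minimal sketch of a retry-then-degrade fallback loop for a coding agent that
# analyzes retrieved data. The injected callables (generate_code, run_sandboxed,
# summarize) are placeholders and do not reflect Simbian's implementation.
from dataclasses import dataclass
from typing import Callable


@dataclass
class RunResult:
    ok: bool
    output: str = ""
    error: str = ""


def analyze_with_fallback(
    generate_code: Callable[[str, str], str],   # (evidence, feedback) -> analysis script
    run_sandboxed: Callable[[str], RunResult],  # executes the script in isolation
    summarize: Callable[[str], str],            # code-free fallback summary
    evidence: str,
    max_attempts: int = 3,
) -> str:
    feedback = ""
    for attempt in range(1, max_attempts + 1):
        script = generate_code(evidence, feedback)
        result = run_sandboxed(script)
        if result.ok:
            return result.output
        # Feedback loop: surface the failure so the next attempt can repair the script.
        feedback = f"Attempt {attempt} failed:\n{result.error}"
    # Fallback: degrade gracefully to a direct summary without generated code.
    return summarize(evidence)
```

The feedback loop and the graceful degradation path are where the software engineering ability of the underlying model shows up most clearly in practice.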
Perhaps most surprisingly, Sonnet 3.5 sometimes outperformed newer versions like Sonnet 3.7 and 4.0. Simbian speculates this could be due to “catastrophic forgetting,” where further domain specialization for software engineering or instruction following might detrimentally affect cybersecurity knowledge and investigation planning. This underscores the critical need for benchmarking to evaluate the fine-tuning of LLMs for specific domains.
Finally, Simbian found that “thinking models” (those that use post-training techniques and often engage in internal self-talk) didn’t show a considerable advantage in AI SOC applications, with all tested models demonstrating comparable performance.
This echoes findings from studies on software bug fixing and red-team CTF applications, which suggest that once LLMs hit a certain capability ceiling, additional inference yields only marginal improvements, often at higher cost. It also points to the necessity of human-validated LLM applications in the AI SOC and to the continued development of fine-grained, specialized benchmarks for improving cybersecurity domain-focused reasoning.
For the full details and results of the benchmarking, click here.