
What running 25 real malware samples through an automated analysis pipeline taught me about detection strategy, behavioral patterns, and why signature-based detection alone will always lose.
Published on April 02, 2026 by Kyle S
malware-analysis detection-engineering mitre-attack yara behavioral-analysis threat-intelligence cape soc
In my previous post, I walked through how I built an automated malware analysis pipeline using CAPEv2, Proxmox, and n8n. Infrastructure is one thing. What actually matters is what you learn from the data it produces.
I’ve now run 25 real-world malware samples through the pipeline. The results confirmed some things I expected, surprised me in a few areas, and fundamentally changed how I think about detection engineering. This post breaks down those findings.
The single most important number from this analysis: YARA signature detection caught 28% of samples. Behavioral detection caught 80%.
That’s not a slight edge. Behavioral analysis detected nearly 3x more threats than static signatures alone. And the 20% that behavioral analysis missed? Those were heavily packed samples that crashed during detonation or employed aggressive anti-sandbox techniques that prevented meaningful execution.
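To make those percentages concrete, here is the arithmetic as a small sketch. The only inputs are the figures already quoted (25 samples, 28% and 80%); the helper name `detected_count` is just for illustration.

```python
TOTAL_SAMPLES = 25  # size of the dataset described in this post


def detected_count(rate_percent: int, total: int = TOTAL_SAMPLES) -> int:
    """Convert a percentage detection rate into a whole number of samples."""
    return round(rate_percent * total / 100)


yara_hits = detected_count(28)        # samples flagged by YARA signatures
behavioral_hits = detected_count(80)  # samples flagged by behavioral detection
ratio = behavioral_hits / yara_hits   # how many times more behavioral caught

print(f"YARA: {yara_hits}/{TOTAL_SAMPLES}, behavioral: {behavioral_hits}/{TOTAL_SAMPLES}")
print(f"Behavioral caught {ratio:.2f}x as many samples")
```

In other words, 7 samples versus 20, which is where the "nearly 3x" figure comes from.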
This isn’t a criticism of YARA. Signatures are fast, cheap to run at scale, and critical for known-threat triage. But if your detection strategy relies primarily on signatures, the numbers here point to a structural blind spot: in this dataset, signatures alone would have missed 72% of samples.
The takeaway isn’t “stop using signatures.” It’s to layer your detections and assume signatures will fail for anything novel.
Across all 25 samples, five behavioral patterns dominated:
| Behavior | Prevalence | MITRE Technique |
|---|---|---|
| System discovery commands | 88% | T1082, T1083 |
| Code obfuscation / packing | 80% | T1027 |
| VM/sandbox evasion checks | 72% | T1497 |
| Process injection | 60% | T1055 |
| Browser credential theft | 48% | T1555.003 |
22 out of 25 samples ran some form of system reconnaissance before doing anything else. The pattern was consistent: whoami, systeminfo, ipconfig, tasklist, wmic calls - mapping out the environment before deciding whether to deploy the real payload.
This is useful for defenders. System discovery commands from unexpected parent processes are a reliable early-warning signal. A cmd.exe spawned by outlook.exe running systeminfo is almost never legitimate. These detections are low-noise and high-fidelity.
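A rough sketch of that parent-process check might look like the following. Everything about the event shape is an assumption: `ProcessEvent` is a hypothetical, simplified record rather than a real log schema, and both name lists are illustrative starting points, not a vetted rule set.

```python
from dataclasses import dataclass

# Assumed, illustrative lists - tune both for your own environment.
DISCOVERY_BINARIES = {"whoami.exe", "systeminfo.exe", "ipconfig.exe",
                      "tasklist.exe", "wmic.exe"}
UNEXPECTED_ANCESTORS = {"outlook.exe", "winword.exe", "excel.exe", "powerpnt.exe"}


@dataclass
class ProcessEvent:
    """Hypothetical process-creation record (not a real telemetry schema)."""
    image: str                   # path or name of the new process
    parent_image: str            # path or name of its parent
    grandparent_image: str = ""  # optional, e.g. outlook.exe -> cmd.exe -> systeminfo.exe


def basename(path: str) -> str:
    """Lower-cased file name from a Windows or POSIX style path."""
    return path.replace("/", "\\").rsplit("\\", 1)[-1].lower()


def is_suspicious_discovery(event: ProcessEvent) -> bool:
    """Flag discovery commands whose parent or grandparent is an unexpected app."""
    if basename(event.image) not in DISCOVERY_BINARIES:
        return False
    ancestors = {basename(event.parent_image), basename(event.grandparent_image)}
    return bool(ancestors & UNEXPECTED_ANCESTORS)


event = ProcessEvent(
    image=r"C:\Windows\System32\systeminfo.exe",
    parent_image=r"C:\Windows\System32\cmd.exe",
    grandparent_image=r"C:\Program Files\Microsoft Office\root\Office16\OUTLOOK.EXE",
)
print(is_suspicious_discovery(event))
```

The grandparent field is there because the discovery binary is usually launched through a shell, so the interesting ancestor is often one level further up.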
72% of samples actively checked whether they were running in a sandbox. Techniques ranged from simple (checking for VMware tools processes, querying MAC address OUI prefixes) to sophisticated (timing-based checks using GetTickCount, checking for human interaction artifacts like recent documents and browser history).
This has direct implications for sandbox design. My pipeline uses INetSim to simulate internet services and the CAPEv2 agent with anti-evasion patches, but the arms race is real. Samples that detected the sandbox simply exited cleanly - no malicious behavior to analyze.
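The MAC address check is one of the easier evasion techniques to audit for on your own sandbox guests. The sketch below is an assumption about how such an audit could look, not part of the pipeline described here: it maps a few publicly known virtualization prefixes to a vendor so you can see what a sample would see. The prefix table is deliberately short and not exhaustive.

```python
from typing import Optional

# Prefixes commonly associated with virtual NICs (not an exhaustive list).
VIRTUAL_MAC_PREFIXES = {
    "00:05:69": "VMware",
    "00:0C:29": "VMware",
    "00:1C:14": "VMware",
    "00:50:56": "VMware",
    "08:00:27": "VirtualBox",
    "52:54:00": "QEMU/KVM default",  # locally administered default, not a registered OUI
}


def virtualization_vendor(mac: str) -> Optional[str]:
    """Return the vendor label if the MAC's first three octets look virtual."""
    normalized = mac.strip().upper().replace("-", ":")
    return VIRTUAL_MAC_PREFIXES.get(normalized[:8])


for mac in ("00-0c-29-ab-cd-ef", "52:54:00:12:34:56", "3C:52:82:00:00:01"):
    print(mac, "->", virtualization_vendor(mac))
```

If a guest NIC returns anything other than `None` here, a sample doing the same lookup has an easy reason to exit early.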
60% of samples used some form of process injection. The breakdown:
The favorite injection targets? explorer.exe, svchost.exe, and RegAsm.exe. All blend into normal system activity, which is exactly the point.
For detection, monitor for process creation anomalies: svchost.exe without -k parameters, processes with unusual parent-child relationships, and high-entropy memory regions in legitimate processes.
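Two of those checks are simple enough to sketch. The code below is a minimal illustration under stated assumptions: the `services.exe` parent expectation reflects normal Windows behavior, the 7.2 bits-per-byte threshold is an assumed starting point rather than a tuned value, and both function names are hypothetical. Reading memory out of a live process is out of scope, so the entropy function simply takes bytes.

```python
import math
from collections import Counter

ENTROPY_THRESHOLD = 7.2  # assumed cutoff in bits per byte; tune against your own baseline


def shannon_entropy(data: bytes) -> float:
    """Shannon entropy in bits per byte, from 0.0 (uniform) to 8.0 (random-looking)."""
    if not data:
        return 0.0
    total = len(data)
    counts = Counter(data)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


def svchost_is_anomalous(command_line: str, parent_name: str) -> bool:
    """Flag svchost.exe launched without -k or by something other than services.exe."""
    tokens = command_line.lower().split()
    has_service_group = "-k" in tokens
    return (not has_service_group) or parent_name.lower() != "services.exe"


print(svchost_is_anomalous(r"C:\Windows\System32\svchost.exe -k netsvcs -p", "services.exe"))
print(svchost_is_anomalous(r"C:\Windows\System32\svchost.exe", "explorer.exe"))
print(shannon_entropy(bytes(range(256))) > ENTROPY_THRESHOLD)
```

Packed or encrypted payloads tend to sit close to 8 bits per byte, while ordinary code and data regions usually score noticeably lower, which is why entropy works as a cheap first-pass filter.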
XWorm was the most feature-rich sample in the dataset. Keylogging, screen capture, credential theft, persistence via scheduled tasks AND registry run keys (redundant persistence is a red flag), plus encrypted C2 over a non-standard port. Detection surface was wide - almost too many indicators to miss.
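The "redundant persistence" signal is easy to express as a correlation. This is a sketch only: the `(source_id, mechanism)` event tuples and the mechanism labels are hypothetical, and real telemetry would need to be mapped onto them first.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def find_redundant_persistence(
    events: Iterable[Tuple[str, str]],
) -> Dict[str, List[str]]:
    """Return sources that installed two or more distinct persistence mechanisms."""
    seen = defaultdict(set)
    for source_id, mechanism in events:
        seen[source_id].add(mechanism)
    return {sid: sorted(mechs) for sid, mechs in seen.items() if len(mechs) >= 2}


# Hypothetical events: (process or sample identifier, persistence mechanism).
events = [
    ("proc-1", "scheduled_task"),
    ("proc-1", "registry_run_key"),
    ("proc-2", "registry_run_key"),
]
flagged = find_redundant_persistence(events)
print(flagged)
```

Only `proc-1` is flagged, because it is the only source that set up more than one mechanism.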
Unlike XWorm’s sprawling approach, WhiteSnake was focused. It hit browser credential stores, cryptocurrency wallets, and VPN configs, exfiltrated over HTTPS, and self-deleted. Total execution time: under 90 seconds. The lesson here is that speed kills - if your detection pipeline has even a 2-minute delay, this sample completes its entire kill chain before you see the first alert.
RisePro was the only sample that successfully evaded most behavioral detections. Heavy packing, delayed execution, environment fingerprinting, and incremental payload delivery meant the sandbox saw minimal activity during the analysis window. This is what the 20% that behavioral detection misses looks like - and why extending detonation times and running multiple analysis passes matter.
Mapping all observed behaviors to ATT&CK tactics across the dataset:
| ATT&CK Tactic | Prevalence |
|---|---|
| Discovery | 88% |
| Defense Evasion | 84% |
| Execution | 80% |
| Credential Access | 60% |
| Collection | 56% |
| Persistence | 52% |
| Command and Control | 48% |
| Exfiltration | 44% |
| Privilege Escalation | 28% |
| Lateral Movement | 8% |
Two observations:
Discovery and Defense Evasion dominate. Nearly every sample’s first priority is understanding the environment and avoiding detection. If you’re building detections, these tactics are your highest-ROI targets because they’re nearly universal across malware families.
Lateral Movement is rare in commodity malware. Only 8% showed lateral movement capability. This makes sense - most of these samples are stealers and RATs designed to compromise individual endpoints, not move through networks. Lateral movement detection is still critical for targeted attacks, but your commodity malware detections should focus elsewhere.
PCAP analysis from the isolated network produced 42 unique network IOCs across the dataset: C2 domains, callback IPs, and exfiltration endpoints. Several patterns emerged:
These aren’t traditional IOCs that age out quickly. The patterns - DNS-over-HTTPS for resolution, messaging platform APIs for exfil - are stable detection opportunities that work across malware families.
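As a sketch of how those two patterns could become a detection: flag DNS-over-HTTPS resolvers and messaging platform APIs when the connecting process is not a browser. The host lists, the browser allowlist, the labels, and `classify_connection` itself are all illustrative assumptions; the specific platforms are examples, not a claim about which ones these samples used.

```python
from typing import Optional

# Illustrative examples only - extend with what you actually see in your PCAPs.
DOH_RESOLVERS = {"dns.google", "cloudflare-dns.com", "dns.quad9.net"}
MESSAGING_API_HOSTS = {"api.telegram.org", "discord.com"}
BROWSERS = {"chrome.exe", "msedge.exe", "firefox.exe"}


def classify_connection(process_name: str, dest_host: str) -> Optional[str]:
    """Label outbound connections that match the DoH or messaging-API patterns."""
    process = process_name.lower()
    host = dest_host.lower().rstrip(".")
    if process in BROWSERS:
        return None  # browsers legitimately use DoH and web messaging
    if host in DOH_RESOLVERS:
        return "doh-from-non-browser"
    if host in MESSAGING_API_HOSTS:
        return "messaging-api-from-non-browser"
    return None


print(classify_connection("RegAsm.exe", "api.telegram.org"))
print(classify_connection("svchost.exe", "dns.google"))
print(classify_connection("chrome.exe", "dns.google"))
```

Legitimate desktop clients for those messaging platforms would need their own allowlist entries, so expect some tuning before this is low-noise.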
Every sample in this dataset will eventually get new hashes, new C2 infrastructure, and new packing. The behaviors are what stay consistent. System discovery commands from unexpected parents, process injection patterns, and credential store access are durable detections that survive malware updates.
WhiteSnake completed its entire kill chain in under 90 seconds. If your SIEM ingestion pipeline has a 5-minute delay, that sample has already exfiltrated credentials and self-destructed before your first alert fires. Real-time or near-real-time detection isn’t a luxury - it’s the difference between catching a breach and investigating one.
No single detection approach caught everything. YARA missed 72% of samples. Behavioral analysis missed 20%. Network IOCs missed samples that successfully evaded detonation. But layered together, the combined detection rate approaches 95%. Defense in depth isn’t a buzzword - it’s a mathematical necessity.
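Combined coverage is just the union of what each layer catches, which is why overlap matters as much as the individual rates. In the sketch below the layer sizes match the figures above (7 YARA hits, 20 behavioral hits out of 25), but the sample IDs and the overlap between layers are invented purely to show the mechanics, so the printed number is not a result from this dataset.

```python
from typing import Set


def combined_coverage(total_samples: int, *layers: Set[int]) -> float:
    """Fraction of samples detected by at least one layer."""
    detected = set().union(*layers)
    return len(detected) / total_samples


# Placeholder sample IDs 0-24; the overlap between layers is an assumption.
yara_detected = set(range(0, 7))         # 7 samples (28%)
behavioral_detected = set(range(2, 22))  # 20 samples (80%)

coverage = combined_coverage(25, yara_detected, behavioral_detected)
print(f"Combined coverage: {coverage:.0%}")
```

Adding a third layer only raises the total if it catches samples the first two missed, which is the practical argument for choosing layers that fail in different ways.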
This analysis is feeding directly into detection content for my Unified SOC Lab. Specifically:
The pipeline produced 12 published analysis reports (available in the Analysis Catalog), 42 network IOCs, and 200+ behavioral signature matches. That’s not a toy dataset - it’s a working threat intelligence feed built entirely from homelab infrastructure.
Running your own malware analysis pipeline isn’t just a resume project. It fundamentally changes how you think about detection. When you watch 25 samples execute in real time and see the patterns emerge, you stop thinking in terms of IOCs and start thinking in terms of behaviors. That shift - from reactive signature matching to proactive behavioral detection - is what separates a SOC analyst from a detection engineer.
The infrastructure post covered the “how.” This post covered the “so what.” The data speaks for itself.