Event Logging to EL3: Practical Telemetry, Retention & Cost Controls

Harshil Shah
Sep 1, 2025

A federal CISO playbook for turning policy into outcomes: reach EL3 (the Advanced event logging tier) under OMB Memorandum M-21-31 with the minimum viable telemetry, a tiered retention model, defensible cost controls, and evidence an Authorizing Official will accept.

Audience: Federal CISOs, Deputies, SOC leaders, PMO • Time to implement: 60–90 days • Dependencies: SOC, privacy, legal, finance, procurement

“If your logging strategy can’t survive budget review, it won’t survive an incident review. Engineer for both.”

— Harshil Shah

What you’ll achieve

Clear scope: which systems and log types must reach EL3 now vs. next quarter.
Telemetry blueprint: identity, endpoint, network, cloud, SaaS, and application logs that actually drive detection and investigations.
Retention you can fund: hot/warm/cold tiers mapped to mission criticality.
Spend control: routing, filtering, and storage tactics that cut cost without losing signal.
Evidence: artifacts to satisfy ATO reviewers and internal audit.

EL3 in one minute

Goal: Logging requirements met across all criticality levels, with centralized access for investigations and response.

Core capabilities: event coverage by category, integrity (tamper-resistant), timely availability, queryability, retention, and cross-system correlation.
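One common way to make a log stream tamper-evident is a hash chain, where each entry's hash covers the previous entry's hash. The sketch below is illustrative only: the function names, the seed value, and the record fields are assumptions, not part of M-21-31 or any specific product. A chain like this only detects edits if the latest hash is also stored somewhere the log writer cannot modify.

```python
import hashlib
import json

def _seed_hash(seed):
    return hashlib.sha256(seed).hexdigest()

def chain(records, seed=b"genesis"):
    """Wrap each record with the previous hash and its own hash (hypothetical format)."""
    prev = _seed_hash(seed)
    chained = []
    for record in records:
        payload = json.dumps(record, sort_keys=True)
        digest = hashlib.sha256((prev + payload).encode("utf-8")).hexdigest()
        chained.append({"record": record, "prev": prev, "hash": digest})
        prev = digest
    return chained

def verify(chained, seed=b"genesis"):
    """Recompute every hash in order; any edit, insert, or delete breaks the chain."""
    prev = _seed_hash(seed)
    for entry in chained:
        payload = json.dumps(entry["record"], sort_keys=True)
        expected = hashlib.sha256((prev + payload).encode("utf-8")).hexdigest()
        if entry["prev"] != prev or entry["hash"] != expected:
            return False
        prev = expected
    return True

# Example records are made up for illustration.
log = chain([
    {"event_type": "role_grant", "user": "alice"},
    {"event_type": "break_glass", "user": "bob"},
])
print(verify(log))
```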

Outcome Metric: % of in-scope systems at EL3 × % of required event types collected × % of events searchable within target time.
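Because the metric is a product of three fractions, a shortfall in any one factor pulls the whole score down. A minimal sketch of the arithmetic, using made-up counts and a hypothetical function name:

```python
def el3_outcome_score(systems_at_el3, systems_in_scope,
                      event_types_collected, event_types_required,
                      events_searchable_in_time, events_total):
    """Multiply the three coverage fractions into one score between 0 and 1."""
    def fraction(part, whole):
        if whole <= 0:
            raise ValueError("denominator must be positive")
        return part / whole

    return (fraction(systems_at_el3, systems_in_scope)
            * fraction(event_types_collected, event_types_required)
            * fraction(events_searchable_in_time, events_total))

# Illustrative numbers only: 45 of 50 systems, 38 of 40 event types,
# 970 of 1000 sampled events searchable within the target time.
score = el3_outcome_score(45, 50, 38, 40, 970, 1000)
print(f"{score:.1%}")  # prints 82.9%
```

Here three individually healthy numbers (90%, 95%, 97%) combine to roughly 83%, which is why the composite is a stricter measure than any single factor.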

Minimum viable telemetry (by domain)

Identity & Access

AuthN/AuthZ events: success/fail, MFA state, risk score, device binding.
Privilege changes: role grants, elevation, break-glass usage, dormant accounts becoming active.
Federation: token issuance, unusual claims, session anomalies.

Endpoints & Servers

EDR detections, process starts, driver loads, script-engine activity, and lateral-movement indicators.
Patch status changes; kernel and audit logs for high-value assets.

Network & Edge

DNS queries (resolver and egress), HTTP(S) metadata, TLS fingerprint changes.
Zero Trust gateways: policy decisions, deny reasons, inline malware verdicts.

Cloud, SaaS & Apps

Cloud control-plane: IAM changes, key/secret use, policy updates, resource create/delete.
SaaS admin: sharing/permission changes, external app grants, bulk downloads.
Application logs: auth flows, business-critical transactions, error rates, admin actions.

Tiered retention that balances cost & readiness

Tier	Purpose	Typical Window	Storage	Notes
Hot	Active detection & investigations	30–90 days	Primary SIEM/search	Fast query SLA; index high-value fields only
Warm	Case expansion, threat hunting	6–12 months	Lower-cost searchable store	Columnar/object storage with late-binding schema
Cold	Compliance & rare look-backs	12–24 months+	Object/archive with on-demand restore	Use lifecycle policies; encrypt & verify integrity

Tip: Define retention by mission criticality and investigation value, not by product default. Document exceptions up front.
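The tier table and the tip can be expressed as a small lookup that picks a tier from an event's age and the system's criticality. This is a sketch under assumptions: the criticality labels and the day counts below are hypothetical values chosen from within the windows in the table, not mandated figures.

```python
from datetime import datetime, timedelta, timezone

# Hypothetical retention windows in days, per criticality: (hot, warm, cold).
RETENTION_DAYS = {
    "high": (90, 365, 730),
    "standard": (30, 180, 365),
}

def tier_for(event_time, criticality, now):
    """Return the storage tier an event belongs in, given its age in days."""
    hot, warm, cold = RETENTION_DAYS[criticality]
    age_days = (now - event_time).days
    if age_days < 0:
        raise ValueError("event_time is in the future")
    if age_days <= hot:
        return "hot"
    if age_days <= warm:
        return "warm"
    if age_days <= cold:
        return "cold"
    return "expired"

now = datetime(2025, 9, 1, tzinfo=timezone.utc)
print(tier_for(now - timedelta(days=45), "high", now))
print(tier_for(now - timedelta(days=45), "standard", now))
```

An "expired" result should feed a deletion schedule only after any incident hold or legal preservation check has passed.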

Cost controls that don’t break investigations

Reduce ingest safely

Filter noise at source: drop known-benign health checks and verbose DEBUG output unless an active case needs it.
De-duplicate: collapse repetitive events with counters and first/last timestamps.
Field hygiene: exclude high-cardinality payloads (e.g., full request bodies) from hot paths.
Sampling with guardrails: full capture for security-critical events; sample low-risk telemetry.
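Two of the tactics above, de-duplication and sampling with guardrails, can be sketched in a few lines. Everything named here is an assumption for illustration: the event field names, the set of security-critical event types, and the 10% default sample rate. Timestamps are assumed to be ISO 8601 strings in a single consistent format, so plain string comparison orders them correctly.

```python
import hashlib
from dataclasses import dataclass

@dataclass
class CollapsedEvent:
    key: tuple
    count: int
    first_seen: str
    last_seen: str

def collapse(events, key_fields=("host", "event_type", "message")):
    """Collapse events that share key_fields into one record with a counter."""
    collapsed = {}
    for event in events:
        key = tuple(event[field] for field in key_fields)
        ts = event["timestamp"]
        record = collapsed.get(key)
        if record is None:
            collapsed[key] = CollapsedEvent(key, 1, ts, ts)
        else:
            record.count += 1
            record.first_seen = min(record.first_seen, ts)
            record.last_seen = max(record.last_seen, ts)
    return list(collapsed.values())

# Hypothetical list of event types that must never be sampled.
SECURITY_CRITICAL = {"auth_failure", "privilege_change", "edr_detection"}

def keep(event, sample_rate=0.1):
    """Always keep security-critical events; sample the rest deterministically."""
    if event["event_type"] in SECURITY_CRITICAL:
        return True
    digest = hashlib.sha256(event["event_id"].encode("utf-8")).digest()
    # Map the event ID to a stable bucket so re-runs make the same decision.
    return int.from_bytes(digest[:8], "big") < sample_rate * 2**64
```

Hash-based sampling is used here instead of a random draw so that the same event is kept or dropped consistently across collectors and replays.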

Store smarter

Triage routing: route only high-value events to SIEM; send the rest to object storage.
Lifecycle policies: automatic down-tiering (hot → warm → cold) and deletion on schedule.
Compression & partitioning: time- and tenant-based partitions; enforce gzip/zstd.
Schema-on-read: keep raw + minimal enriched versions to avoid re-ingest.
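A rough sketch of how triage routing, partitioning, and compression might fit together. The high-value event types, the destination names, and the path layout are hypothetical, and timestamps are assumed to be ISO 8601 strings that carry a UTC offset.

```python
import gzip
import json
from datetime import datetime, timezone

# Hypothetical set of event types worth the cost of SIEM indexing.
HIGH_VALUE_TYPES = {"auth_failure", "privilege_change", "iam_policy_update"}

def route(event):
    """Send high-value events to the SIEM and everything else to object storage."""
    return "siem" if event["event_type"] in HIGH_VALUE_TYPES else "object_store"

def partition_path(event):
    """Build a tenant- and time-based partition prefix, normalized to UTC."""
    ts = datetime.fromisoformat(event["timestamp"]).astimezone(timezone.utc)
    return f"tenant={event['tenant']}/dt={ts:%Y-%m-%d}/hour={ts:%H}"

def pack(events):
    """Serialize events as newline-delimited JSON and gzip the batch."""
    lines = "\n".join(json.dumps(event, sort_keys=True) for event in events)
    return gzip.compress(lines.encode("utf-8"))

event = {
    "tenant": "agency-a",
    "event_type": "dns_query",
    "timestamp": "2025-09-01T20:30:00-04:00",
}
print(route(event), partition_path(event))
```

Keeping the raw batch in object storage under a predictable prefix is what allows later schema-on-read queries without re-ingesting from the source.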

Evidence pack for ATO & audit

Inventory: authoritative list of log sources, category, owner, system boundary, data sensitivity.
Coverage map: required vs. collected event types per system, with gaps and target dates.
Data path diagram: source → collector → broker → storage tiers → SIEM/SOAR, including integrity controls.
Retention matrix: system × tier × duration × encryption × access control.
Operational runbooks: onboarding, schema changes, incident hold/legal preservation.
Quarterly attestation: % systems at EL3, % events searchable within target time, spot-check queries.
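The coverage map in particular lends itself to being generated rather than hand-maintained. A minimal sketch, assuming the inventory can be exported as two mappings from system name to event types (required and actually collected); the system and event names are made up:

```python
def coverage_map(required, collected):
    """Compare required vs. collected event types per system and list the gaps."""
    rows = []
    for system in sorted(required):
        needed = required[system]
        have = collected.get(system, set()) & needed
        pct = 100.0 * len(have) / len(needed) if needed else 100.0
        rows.append({
            "system": system,
            "coverage_pct": round(pct, 1),
            "gaps": sorted(needed - have),
        })
    return rows

# Illustrative inventory only.
required = {
    "hr-portal": {"auth", "admin_action", "privilege_change", "error"},
    "vpn-gw": {"auth", "policy_decision"},
}
collected = {
    "hr-portal": {"auth", "error", "debug"},
}
for row in coverage_map(required, collected):
    print(row)
```

Target dates and owners for each gap would still need to come from the inventory; this only computes the required-versus-collected difference.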

KPIs that matter

KPI	Target Pattern	How it’s used
% in-scope systems at EL3	>90% within 2 quarters	Program health
Time to search new events	<5 minutes to SIEM/searchable store	Investigation readiness
Coverage of required event types	>95% for high-value assets (HVAs), >85% portfolio-wide	Signal quality
Cost per ingested GB (hot)	Quarter-over-quarter ↓ with stable MTTD	Financial control
MTTD / MTTI (mean time to detect / investigate)	Month-over-month ↓	Outcome metric

90-day implementation plan

Phase	Weeks	Deliverables
Scope & governance	1–2	In-scope systems, required events per category, owners, initial coverage map, exception process.
Data plumbing	3–6	Collectors/brokers deployed, routing rules, hot/warm/cold stores, integrity controls, schema catalog.
Use-cases & tuning	7–9	Detection use-cases, saved queries, dashboards, noise filters, investigation runbooks.
Attest & operate	10–12	Quarterly metrics, gap closure plan, cost report, evidence pack for AO/internal audit.

Control economics: fund what reduces risk

Control	Primary Benefit	Cost Signal	Decision
Centralized log routing/broker	One policy plane; reliable delivery; integrity	↓ Duplicate ingest, cheaper down-tiering	Fund
Phishing-resistant MFA logs & admin trails	Identity attack visibility	Modest ingest; high investigation value	Fund
Verbose DEBUG in hot storage	Occasional deep dives	↑ Cost; low daily value	Defer to warm/cold
High-cardinality payload capture	For rare forensic needs	↑↑ Cost; privacy risk	Scope tightly

FAQ

What about “EL4”?

Some programs define an internal “EL4/Optimized” tier to drive continuous improvement beyond EL3 (for example, stronger integrity controls or longer searchable retention). It’s optional—treat it as an agency standard, not a new policy requirement.

How do I avoid runaway costs?

Filter at the source, route by value, down-tier aggressively with lifecycle policies, and publish a quarterly cost per GB vs. MTTD trend. If MTTD stays flat while cost drops, you’re cutting waste—not signal.
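The waste-versus-signal test can be written as a simple quarter-over-quarter comparison. In this sketch the field names and the 5% tolerance for treating MTTD as "flat" are assumptions to be tuned per program:

```python
def classify_quarter(prev, curr, mttd_tolerance=0.05):
    """Label a quarter by comparing unit cost and MTTD against the prior quarter."""
    cost_change = (curr["cost_per_gb"] - prev["cost_per_gb"]) / prev["cost_per_gb"]
    mttd_change = (curr["mttd_hours"] - prev["mttd_hours"]) / prev["mttd_hours"]
    if cost_change >= 0:
        return "no cost reduction"
    if mttd_change <= mttd_tolerance:
        return "cutting waste"
    return "possibly cutting signal"

# Made-up figures for illustration.
q1 = {"cost_per_gb": 0.50, "mttd_hours": 4.0}
q2 = {"cost_per_gb": 0.40, "mttd_hours": 4.1}
print(classify_quarter(q1, q2))  # prints cutting waste
```

A "possibly cutting signal" result is a prompt to review recent filter and routing changes, not proof that they caused the slower detection.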

What’s the quickest win?

Close identity and admin-action logging first. It unlocks the largest set of high-value detections and accelerates investigations across apps, cloud, and SaaS.

How do I show progress to leadership?

Use a simple scorecard: % systems at EL3, coverage by required event type, search latency, and unit cost. Pair the scorecard with two case studies of investigations that improved as a result.