★ Server Agent

The Server Agent. On call, with discipline.

An on-call engineer that lives next to your kernel.

Server Agent ships as a daemon, container, or Kubernetes sidecar and reads everything your fleet emits. It collapses noisy alerts into one triaged incident, proposes a root-cause hypothesis with a confidence score, and pages a human the moment that score crosses your threshold. It does not act without permission, and the permission ladder is yours to set.


Role in the fleet

Govern

The guardian seed. It watches production systems today, and the same discipline extends to watching the fleet itself.


★ Why not a chat wrapper

Generic LLMs can't do this.

A general chatbot can read a stack trace. It cannot be on-call. It has no host identity, no kernel hooks, no signed audit trail, and no way to enforce a blast-radius limit when somebody asks it to restart a node. Server Agent runs inside the trust boundary you already audit. It carries a SOC 2-scoped service account, masks PII before signals leave the host, and refuses to execute anything outside its allow-list. Generic LLMs treat each prompt as fresh. Server Agent carries deploy history, change-window state, and 30 days of correlated signal. That is the difference between a useful assistant and a teammate your CISO will sign off on.

Daemon, container, or K8s sidecar. Read-only by default.

Collapses 100+ alerts into one incident with one owner

Root-cause hypotheses ranked by Bayesian confidence, with no action recommended below 70%

Three permission tiers, global kill-switch within 5 seconds

★ Use cases

What it actually does.

Real work the agent does end-to-end. Every step is auditable, every consent gate logged, every escalation routed to a named human.

01
Alert-storm collapse
100+ alerts from one root failure become one incident with one owner.
- Ingest alerts from Datadog, Prometheus, CloudWatch, and Sentry on a 30-second window.
- Cluster by service-graph proximity, time correlation, and shared error fingerprint.
- Suppress duplicates. Escalate the highest-signal alert as the parent.
- Attach child alerts as evidence. Post a single incident to PagerDuty.
- Notify the on-call via Slack with a one-screen summary.
Permission tier: Read-only.
Human in the loop: Suppression is reversible. Raw alerts stay in their source systems and surface on demand.
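
The clustering steps above can be sketched roughly as follows. This is a minimal illustration, not the agent's implementation: the `Alert` fields, the `severity` scale, and the rule "highest severity becomes the parent" are assumptions, and service-graph proximity is left out to keep the example short. Only the 30-second window comes from the text.

```python
from dataclasses import dataclass, field

WINDOW_SECONDS = 30  # correlation window from the text


@dataclass
class Alert:
    source: str        # e.g. "datadog", "prometheus"
    service: str
    fingerprint: str   # hypothetical normalised error signature
    severity: int      # higher means louder; the scale is an assumption
    timestamp: float   # seconds since epoch


@dataclass
class Incident:
    parent: Alert
    children: list = field(default_factory=list)


def collapse(alerts):
    """Group alerts that share a fingerprint and arrive within
    WINDOW_SECONDS of the first alert in their group."""
    groups = []
    open_groups = {}  # fingerprint -> (first_timestamp, members)
    for alert in sorted(alerts, key=lambda a: a.timestamp):
        group = open_groups.get(alert.fingerprint)
        if group is None or alert.timestamp - group[0] > WINDOW_SECONDS:
            group = (alert.timestamp, [])
            open_groups[alert.fingerprint] = group
            groups.append(group[1])
        group[1].append(alert)

    incidents = []
    for members in groups:
        # Assumed rule: the highest-severity alert becomes the parent.
        parent = max(members, key=lambda a: a.severity)
        children = [a for a in members if a is not parent]
        incidents.append(Incident(parent=parent, children=children))
    return incidents
```

Nothing is deleted in this sketch. Child alerts stay attached to the incident as evidence, which matches the reversible-suppression promise above.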
02
Root-cause hypothesis
A ranked list of likely causes with confidence scores, citing logs, metrics, and the last deploy.
- Pull traces, logs, and metrics for the affected service across a 60-minute window.
- Cross-reference the last three deploys, recent config changes, and infrastructure events.
- Generate three to five hypotheses ranked by Bayesian confidence.
- Cite the specific log lines, span IDs, and commit SHAs supporting each.
- Refuse to recommend action below 70% confidence. Flag for human triage.
Permission tier: Read-only.
Human in the loop: Hypotheses are advisory text. No write path is reachable from this surface.
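
One way to read "Bayesian confidence" is a per-hypothesis update of a prior by how well the evidence fits. The sketch below assumes exactly that. The priors, likelihoods, and cause names are made-up illustrative inputs, and the page does not say how the agent actually computes its scores. Only the 70% bar is taken from the text.

```python
ACTION_THRESHOLD = 0.70  # from the text: no recommendation below 70%


def posterior(prior, p_evidence_if_true, p_evidence_if_false):
    """Bayes' rule for one hypothesis H given evidence E:
    P(H|E) = P(E|H)P(H) / (P(E|H)P(H) + P(E|not H)P(not H))."""
    numerator = p_evidence_if_true * prior
    denominator = numerator + p_evidence_if_false * (1.0 - prior)
    return numerator / denominator


def rank(candidates):
    """candidates: dict of cause -> (prior, P(E|H), P(E|not H)).
    Returns (cause, confidence) pairs, highest confidence first."""
    scored = [(cause, posterior(*inputs)) for cause, inputs in candidates.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Illustrative inputs only.
ranked = rank({
    "recent deploy": (0.30, 0.90, 0.10),
    "connection pool saturation": (0.20, 0.50, 0.30),
    "upstream rate limit": (0.10, 0.40, 0.30),
})
actionable = [cause for cause, score in ranked if score >= ACTION_THRESHOLD]
```

Each hypothesis is scored on its own here, so the scores need not sum to 100%.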
03
Runbook execution
Runs a documented runbook step by step, pausing for approval at every state-changing action.
- Match incident signature against the runbook library.
- Surface the proposed runbook with diff preview of changes.
- Execute read-only diagnostic steps automatically.
- Block on Slack approval for any state-changing step (restart, scale, failover).
- Log every step to the audit trail with operator identity and timestamp.
Permission tier: Approval required per step. Change windows enforced. Hard cap on blast radius.
Human in the loop: A named operator approves every state-changing step.
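
The approval gate described above might look something like this. It is a sketch under stated assumptions: `request_approval` is a hypothetical callback standing in for the Slack approval flow, and the audit entry layout is invented for illustration.

```python
import time
from dataclasses import dataclass
from typing import Callable


@dataclass
class Step:
    name: str
    state_changing: bool      # restart, scale, failover, and similar
    run: Callable[[], str]    # returns a short outcome string


def execute_runbook(steps, request_approval, audit_log):
    """Run steps in order, pausing for approval before each state-changing step.

    request_approval(step_name) is a hypothetical callback that blocks until
    a named operator answers and returns (approved, operator).
    """
    for step in steps:
        operator = "agent"  # read-only diagnostics run without a human
        if step.state_changing:
            approved, operator = request_approval(step.name)
            if not approved:
                audit_log.append({"step": step.name, "operator": operator,
                                  "outcome": "denied", "at": time.time()})
                return False  # stop here; no later step runs
        outcome = step.run()
        audit_log.append({"step": step.name, "operator": operator,
                          "outcome": outcome, "at": time.time()})
    return True
```

A denial halts the runbook rather than skipping the step, on the assumption that later steps depend on earlier ones.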
04
Incident timeline auto-generation
A minute-by-minute timeline of what failed, what fired, who acted, and what changed.
- Capture every alert, deploy, config change, and human action from t-60 to t+resolved.
- Stitch into a chronological narrative with span and log citations.
- Annotate with the on-call's Slack messages and approval decisions.
- Render as markdown for ingestion into Jira, Linear, or Notion.
- Hand off to the postmortem first-draft generator.
Permission tier: Read-only.
Human in the loop: Output is text. No retroactive editing of source data.
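
Stitching events into a markdown timeline can be sketched in a few lines. The `Event` shape and the `kind` labels are assumptions for illustration, and timestamps are expected to be timezone-aware.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass
class Event:
    at: datetime   # timezone-aware timestamp
    kind: str      # "alert", "deploy", "config", "human" (labels are assumptions)
    summary: str


def render_timeline(events):
    """Sort events chronologically and emit one markdown bullet per event."""
    lines = ["## Incident timeline", ""]
    for event in sorted(events, key=lambda e: e.at):
        stamp = event.at.astimezone(timezone.utc).strftime("%H:%M:%S UTC")
        lines.append(f"- **{stamp}** [{event.kind}] {event.summary}")
    return "\n".join(lines)
```

The function only reads its input and returns text, in keeping with the read-only tier.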
05
Deploy correlation
When error rates spike, the agent checks the last deploy first and shows you the diff.
- Detect anomaly in error rate or latency on a service.
- Query the last 24 hours of deploys touching that service or its dependencies.
- Surface the suspect deploy with author, time, and changed files.
- Compare current and previous version metrics side by side.
- Propose rollback as an approval-gated action.
Permission tier: Approval required for rollback. Read-only for analysis.
Human in the loop: Rollback requires explicit operator approval and runs the team's standard rollback path.
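
The deploy lookup above reduces to a filter over recent deploys. The sketch below assumes a simple `Deploy` record and a known dependency list. Both are hypothetical shapes, and the 24-hour lookback is the only value taken from the text.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone

LOOKBACK = timedelta(hours=24)  # from the text


@dataclass
class Deploy:
    service: str
    author: str
    at: datetime
    changed_files: tuple


def suspect_deploys(deploys, service, dependencies, anomaly_at):
    """Return deploys to the service or its dependencies made in the
    24 hours before the anomaly, newest first."""
    in_scope = {service, *dependencies}
    recent = [
        d for d in deploys
        if d.service in in_scope and anomaly_at - LOOKBACK <= d.at <= anomaly_at
    ]
    return sorted(recent, key=lambda d: d.at, reverse=True)
```

The newest in-scope deploy comes first because it is the cheapest suspect to check. The rollback itself stays behind the approval gate.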
06
Intrusion-attempt triage
SSH brute-force, anomalous API patterns, and suspicious lateral movement classified against MITRE ATT&CK and paged.
- Watch auth logs, network flow, and process lineage for known intrusion patterns.
- Classify against MITRE ATT&CK techniques (T1110, T1078, T1021, T1059).
- Score severity using source reputation, target sensitivity, and pattern frequency.
- Page security on-call via the security channel, not the SRE channel.
- Forward enriched event to the SIEM (Splunk, Sentinel, Chronicle).
Permission tier: Read-only by default. Block-IP available as approval-required.
Human in the loop: IP blocks scoped to the affected segment, time-boxed to 60 minutes, security approval required.
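
As an illustration of the classification and the time-boxed block, here is a small sketch for the SSH brute-force case. The failed-login threshold and the proposal format are assumptions. The technique IDs and the 60-minute limit come from the text, and the technique names are the ATT&CK titles for those IDs.

```python
from collections import Counter
from datetime import datetime, timedelta, timezone

# MITRE ATT&CK technique IDs named in the text, with their ATT&CK titles.
TECHNIQUES = {
    "T1110": "Brute Force",
    "T1078": "Valid Accounts",
    "T1021": "Remote Services",
    "T1059": "Command and Scripting Interpreter",
}

FAILED_LOGIN_THRESHOLD = 20               # assumption: tune per environment
BLOCK_DURATION = timedelta(minutes=60)    # time box from the text


def brute_force_sources(failed_login_ips):
    """Return source IPs whose failed-login count meets the threshold."""
    counts = Counter(failed_login_ips)
    return sorted(ip for ip, n in counts.items() if n >= FAILED_LOGIN_THRESHOLD)


def propose_block(ip, segment, now):
    """Build an approval-required block proposal. Nothing is applied here."""
    return {
        "technique": "T1110",
        "technique_name": TECHNIQUES["T1110"],
        "ip": ip,
        "segment": segment,
        "expires_at": now + BLOCK_DURATION,
        "requires_approval": True,
    }
```

The proposal carries its own expiry, so a forgotten block lapses without anyone having to remember to remove it.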
07
On-call handoff brief
The outgoing on-call's last 12 hours summarised in one screen for the incoming on-call.
- Collect open incidents, recent alerts, deploy state, and active change windows.
- Summarise unresolved issues with current hypotheses and owner.
- Flag flaky services, recurring alerts, and the most-paged signals this week.
- Highlight scheduled maintenance and embargo periods.
- Deliver as a Slack DM and a printable PDF at shift change.
Permission tier: Read-only by construction.
Human in the loop: The on-call still runs the shift. The agent makes the first 10 minutes easier.

★ The journey

Phase by phase.

The end-to-end path the agent runs. Every phase logs what it captured, what consent applies, and where a human gates the next step.

01
Detection
Latency on the checkout service crosses 2x baseline. Three downstream alerts fire within 90 seconds.
02
Correlation
Server Agent collapses 47 alerts into one incident, identifies checkout as the parent service, attaches the rest as evidence.
03
Hypothesis
Three candidates ranked: deploy at 14:02 UTC 84%, DB connection pool saturation 41%, upstream rate limit 22%. Cites the suspect commit.
04
Page
On-call gets a Slack ping and a PagerDuty page with the one-screen summary and the top hypothesis.
05
Triage
Operator opens the diff. Agent runs read-only diagnostics: pool stats, slow query log, recent migration check.
06
Action
Operator approves rollback. Agent executes the team's standard rollback runbook, pausing for the production confirmation.
07
Verification
Error rate returns to baseline within 8 minutes. Agent confirms by watching the same signals that fired, then declares the incident mitigated.
08
Postmortem
First draft posted in the incident channel within 10 minutes of resolution, with timeline, blast radius, contributing factors, and proposed action items.

★ Posture

The non-negotiables, spelled out.

Consent, security, accuracy, and residency. Explicit, auditable, and the same line items your CISO or GC already asks about.

Permission tiers

Three tiers, configured per-action, per-environment, per-service. Read-only covers all observation, correlation, hypothesis generation, and notification. Approval-required covers any state change: restarts, scale events, rollbacks, IP blocks, config pushes. Every approval-required action posts a diff preview in Slack and waits for a named operator to approve. Auto is opt-in per-action and only available for narrow, well-bounded responses (clear a wedged log buffer, restart a sidecar) inside defined change windows. Every tier respects a global kill-switch that any on-call can trigger from Slack, PagerDuty, or the agent's own admin port. The kill-switch disables write actions across the fleet within 5 seconds.
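
To make the tier logic concrete, here is a minimal sketch of a policy lookup keyed by action, environment, and service. The class names, the default tier, and the behaviour of an auto action outside a change window are assumptions, not the product's configuration format.

```python
from enum import Enum


class Tier(Enum):
    READ_ONLY = "read-only"
    APPROVAL_REQUIRED = "approval-required"
    AUTO = "auto"


class Policy:
    def __init__(self, default=Tier.APPROVAL_REQUIRED):
        self.default = default   # assumption: unlisted write actions need approval
        self.rules = {}          # (action, environment, service) -> Tier
        self.kill_switch = False

    def set(self, action, environment, service, tier):
        self.rules[(action, environment, service)] = tier

    def tier_for(self, action, environment, service):
        return self.rules.get((action, environment, service), self.default)

    def may_write(self, action, environment, service, approved, in_change_window):
        """Decide whether a state-changing action may run right now."""
        if self.kill_switch:
            return False  # the kill-switch overrides every tier
        tier = self.tier_for(action, environment, service)
        if tier is Tier.AUTO:
            # Assumption: outside a change window an auto action does not run.
            return in_change_window
        if tier is Tier.APPROVAL_REQUIRED:
            return approved
        return False  # read-only never writes
```

Checking the kill-switch first means one flag is enough to stop every write path, whatever the per-action rules say.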

Incident discipline

Correlation runs on three axes: time (30-second window), service-graph proximity (one hop default, configurable), and shared error fingerprint. Hypotheses carry a confidence score derived from signal coverage, deploy proximity, and historical incident similarity. The agent refuses to recommend a write action below 70% confidence and explicitly flags the gap. Below 50%, it does not page; it queues the incident for human review at the next handoff. Postmortem first-drafts include timeline, contributing factors with evidence, blast radius, action items, and the hypotheses that were considered and rejected. The draft is editable, never auto-published, and never closes an incident on its own.
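
The two thresholds in this paragraph imply three outcomes. A sketch of that routing, with hypothetical outcome labels:

```python
PAGE_THRESHOLD = 0.50     # below this the agent does not page
ACTION_THRESHOLD = 0.70   # below this it recommends no write action


def route(top_confidence):
    """Map the top hypothesis's confidence to what the agent does next.
    The returned labels are illustrative, not product terminology."""
    if top_confidence < PAGE_THRESHOLD:
        return "queue-for-handoff"
    if top_confidence < ACTION_THRESHOLD:
        return "page-and-flag-gap"
    return "page-with-recommendation"
```

For example, `route(0.84)` returns `"page-with-recommendation"`, while `route(0.41)` returns `"queue-for-handoff"`.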

Security posture

SOC 2 Type II controls cover access, audit, change management, and incident response. PII is masked at the agent before signals leave the host, using a configurable redaction policy (email, phone, payment, custom regex). All write actions are signed, logged to an append-only audit trail, and forwarded to the customer's SIEM in CEF or OCSF. Intrusion-attempt classification aligns to MITRE ATT&CK technique IDs. Transport is mTLS. At-rest data uses customer-managed keys when deployed in-VPC. GDPR and India DPDP requirements are covered for data residency and right-to-erasure. The agent never exfiltrates raw logs. It summarises locally and ships only what the policy permits.
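
A redaction policy of the kind described can be sketched with plain regular expressions. The patterns below are deliberately simple illustrations and are assumptions, not the agent's shipped rules. A real policy needs stricter, tested patterns, since a loose phone pattern can also catch other digit runs.

```python
import re

# Illustrative patterns only. Order matters: payment runs before phone so
# that a card number is not partially consumed by the looser phone pattern.
DEFAULT_PATTERNS = {
    "payment": re.compile(r"\b\d(?:[ -]?\d){12,15}\b"),
    "email": re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"),
    "phone": re.compile(r"\+?\d[\d\s().-]{7,}\d"),
}


def redact(text, extra_patterns=None):
    """Replace each match with a bracketed label, e.g. "[email]".

    extra_patterns maps a label to a compiled regex and stands in for the
    "custom regex" part of the policy.
    """
    patterns = dict(DEFAULT_PATTERNS)
    patterns.update(extra_patterns or {})
    for label, pattern in patterns.items():
        text = pattern.sub(f"[{label}]", text)
    return text
```

Masking runs on the host before anything is shipped, so downstream systems only ever see the labels.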

★ Outcomes

What the numbers look like.

Figures we can point at. Every one carries a source you can verify.

65% of incidents involve more than one team

Justifies cross-signal correlation as a primary value driver.

Source: Atlassian State of Incident Management 2023

SREs spend ~25% of their time on toil

Alert triage and runbook execution are the two largest toil categories.

Source: Google SRE Workbook, chapter on eliminating toil

28-minute median time-to-acknowledge for Sev-1

The window Server Agent compresses with single-incident pages and a pre-built hypothesis.

Source: PagerDuty State of Digital Operations 2024

70%+ of P1 incidents correlate with a deploy in the prior 24 hours

Deploy-correlation is the highest-yield first check.

Source: Google DORA State of DevOps 2023

56% of orgs report alert fatigue causing missed incidents

Collapse-to-one-incident is the direct counter.

Source: IDC Observability Trends 2024

★ Where it lands

What it replaces. What it augments.

Replaces

Tier-1 alert-triage rotation.
Manual incident-timeline assembly.
Postmortem first-draft writing.
The spreadsheet of "which alert maps to which runbook".

Augments

PagerDuty, Datadog, Splunk, your existing SIEM, your existing runbook library, your existing on-call rotation.
Server Agent does not replace the on-call. It removes the work the on-call should not have been doing.

★ Inside Server Agent

What the Server Agent does

It watches your production fleet, correlates the signals, and pages you with the incident already triaged.

What it watches

Ships as a tiny daemon, container, or Kubernetes sidecar. Read-only by default; action permissions are explicit.
Watches alerts, errors, logs, network traffic, and attempted intrusions in real time.

How it triages

Cross-correlates signals so a 5xx spike, a CPU climb, and unusual traffic become one triaged incident, not three separate alerts.
Writes plain-English incident notes with the suspected cause, the affected services, and a suggested rollback or mitigation.

When something breaks

Crisis paging across Slack, PagerDuty, SMS, WhatsApp, and email, with quiet hours and severity routing per team.
Your on-call starts a minute ahead of the alert storm, already briefed.

★ Pricing

The numbers, up front.

Pricing scopes to fleet because the work scopes to fleet. A 200-host startup and a 40,000-host bank don't pay the same, and they shouldn't. We size by host count, signal volume, and the permission tiers you enable, then quote once. No per-seat tax, no per-incident gouging, no surprise overage at quarter-end.

What you get

Agent deployment across your fleet: included

Crisis-paging integrations: included

Auto-remediation playbooks: opt-in

SOC 2 / GDPR / DPDP controls: standard

Quote scope: your fleet

★ Server Agent

Put the Server Agent to work.
