Spellbook

Pre-built EvalOps recipes for instant telemetry magic

Clone proven evaluation workflows, connect telemetry, and start reviewing results in minutes. Every spell packages scorecards, monitors, and governance guardrails so you can ship quickly without sacrificing trust.

How it works

Spells are the fastest way to stand up governed evaluation loops. Each one bundles the telemetry inputs, scorecards, and automation the workflow needs.

Step 01

Clone the recipe

Each spell includes scorecards, monitors, and scenario packs. Import them into EvalOps with one click and tailor metrics as you go.

Step 02

Connect telemetry

Use Community Edition, CI runners, or production connectors to stream traces. Spells call out the exact inputs each workflow needs.

Step 03

Review & iterate

Dashboards, alerts, and governance flows are ready to ship. Bring in stakeholders, capture evidence, and evolve the playbook together.

Featured spells

Operational recipes the EvalOps team runs every week

Start with these battle-tested workflows, then customize metrics, alerts, and governance flows to match your organization.

Spell 01

Regression Sentinel

Catch model regressions before they leave staging.

Daily and pre-release scorecards that diff evaluation traces across commits, ensuring the latest prompt or fine-tune behaves as expected.

Telemetry

  • Source control metadata
  • Scenario snapshots
  • Guardrail feedback

Best for

  • Platform teams
  • Release engineering

Run the spell

  1. Connect Community Edition or CI runners to EvalOps telemetry ingestion.
  2. Import the Regression Sentinel scorecard template with precision, hallucination, and guardrail coverage checks.
  3. Wire the EvalOps CI Gate action into deployment so risky builds auto-block.
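
The gating idea in step 3 can be sketched in plain Python. This is a minimal illustration, not the EvalOps CI Gate action or any EvalOps API: the metric names, the `max_drop` tolerance, and the `find_regressions` helper are all assumptions, and every metric is assumed to be "higher is better".

```python
from typing import Dict, List


def find_regressions(baseline: Dict[str, float],
                     candidate: Dict[str, float],
                     max_drop: float = 0.02) -> List[str]:
    """Return metrics whose candidate score fell more than max_drop below baseline.

    A metric that is missing from the candidate run counts as a regression.
    """
    regressed = []
    for name, base_score in sorted(baseline.items()):
        cand_score = candidate.get(name)
        if cand_score is None or base_score - cand_score > max_drop:
            regressed.append(name)
    return regressed


# Hypothetical scorecard values for the last good commit and the new build.
baseline = {"precision": 0.91, "guardrail_coverage": 0.98, "groundedness": 0.95}
candidate = {"precision": 0.92, "guardrail_coverage": 0.93, "groundedness": 0.95}

blocked = find_regressions(baseline, candidate)
print("block deploy" if blocked else "allow deploy", blocked)
```

In a CI job, a non-empty list would typically translate into a non-zero exit code so the pipeline stops before deployment.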

Spell 02

Provider Bake-off

Benchmark multiple LLM providers with identical telemetry.

Run the same scenarios across OpenAI, Anthropic, Azure OpenAI, and Groq with unified scoring so you can pick the right provider for each workload.

Telemetry

  • Provider responses
  • Latency & token metrics
  • Cost annotations

Best for

  • Product teams
  • Procurement

Run the spell

  1. Enable the provider connectors you want to compare and configure routing weights.
  2. Clone the Provider Bake-off kit for prompts, evaluation criteria, and dashboards.
  3. Review latency, quality, and cost side-by-side and export a procurement-ready report.
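
The side-by-side review in step 3 amounts to grouping results by provider and aggregating quality, latency, and cost. A rough sketch, assuming a hypothetical `RunResult` record and a 0..1 quality score; the provider labels and numbers are placeholders, not benchmark results, and nothing here calls a real provider or EvalOps connector.

```python
from dataclasses import dataclass
from statistics import mean
from typing import Dict, List


@dataclass
class RunResult:
    provider: str       # placeholder label, not a real benchmark entry
    quality: float      # 0..1 score from a shared rubric (assumed scale)
    latency_ms: float
    cost_usd: float


def summarize(results: List[RunResult]) -> Dict[str, Dict[str, float]]:
    """Group results by provider: mean quality, mean latency, total cost."""
    grouped: Dict[str, List[RunResult]] = {}
    for result in results:
        grouped.setdefault(result.provider, []).append(result)
    return {
        provider: {
            "quality": mean(r.quality for r in runs),
            "latency_ms": mean(r.latency_ms for r in runs),
            "cost_usd": sum(r.cost_usd for r in runs),
        }
        for provider, runs in grouped.items()
    }


# Made-up numbers purely to exercise the function.
results = [
    RunResult("provider_a", 0.8, 400.0, 0.002),
    RunResult("provider_a", 0.9, 600.0, 0.004),
    RunResult("provider_b", 0.7, 250.0, 0.001),
]
for provider, row in summarize(results).items():
    print(provider, row)
```

Because every provider is scored by the same rubric on the same scenarios, the rows are directly comparable.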

Spell 03

Red Team Sandbox

Stress test prompts with adversarial scenarios and log everything.

Spin up adversarial evaluations that try to elicit jailbreaks, policy violations, or insecure behaviors. They are a good fit for safety and security reviews.

Telemetry

  • Adversarial prompts
  • Policy violation scores
  • Trace replay logs

Best for

  • Security
  • Trust & safety

Run the spell

  1. Import the Red Team Sandbox scenario pack with seeded adversarial prompts.
  2. Enable policy classifiers and guardrail scoring inside EvalOps.
  3. Route findings into PagerDuty or Slack so red teamers and engineers collaborate in real time.
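
One way to think about step 3 is as a severity-based routing rule. The sketch below only decides where each finding should go; it does not call PagerDuty, Slack, or EvalOps. The `Finding` record, the 0..1 score scale, and both thresholds are assumptions for illustration.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Finding:
    prompt_id: str
    category: str            # e.g. "jailbreak" or "policy_violation"
    violation_score: float   # 0..1 from a policy classifier (assumed scale)


def route_findings(findings: List[Finding],
                   page_at: float = 0.8,
                   notify_at: float = 0.5) -> Dict[str, List[str]]:
    """Bucket finding IDs into page / notify / log_only by violation score."""
    routes: Dict[str, List[str]] = {"page": [], "notify": [], "log_only": []}
    for finding in findings:
        if finding.violation_score >= page_at:
            routes["page"].append(finding.prompt_id)
        elif finding.violation_score >= notify_at:
            routes["notify"].append(finding.prompt_id)
        else:
            routes["log_only"].append(finding.prompt_id)
    return routes


# Hypothetical findings from one adversarial run.
findings = [
    Finding("p-001", "jailbreak", 0.93),
    Finding("p-002", "policy_violation", 0.61),
    Finding("p-003", "insecure_behavior", 0.12),
]
print(route_findings(findings))
```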

Spell 04

Drift Watchtower

Monitor production traces for silent quality drift.

Continuously sample production traffic, score it against historical baselines, and notify the right people when performance slips.

Telemetry

  • Production traces
  • Baseline metrics
  • Alert thresholds

Best for

  • Observability
  • ML Ops

Run the spell

  1. Set up live telemetry ingestion from your applications or message buses.
  2. Configure the Drift Watchtower monitor with baseline snapshots and thresholds.
  3. Deliver alerts into your incident tooling and attach remediation runbooks.
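
The baseline-and-threshold check in step 2 can be illustrated with a simple relative-drop rule. This is a sketch under stated assumptions, not how the Drift Watchtower monitor is implemented: the `drift_alert` helper, the 5% default threshold, and the sample scores are hypothetical, and scores are assumed to be "higher is better".

```python
from statistics import mean
from typing import Sequence


def drift_alert(baseline_scores: Sequence[float],
                recent_scores: Sequence[float],
                max_relative_drop: float = 0.05) -> bool:
    """Return True when the recent mean fell more than max_relative_drop below baseline."""
    base = mean(baseline_scores)
    if base <= 0:
        raise ValueError("baseline mean must be positive")
    recent = mean(recent_scores)
    return (base - recent) / base > max_relative_drop


# Hypothetical quality scores: a stored baseline snapshot vs. a recent sample.
baseline_scores = [0.90, 0.90, 0.90]
print(drift_alert(baseline_scores, [0.80, 0.80]))  # roughly an 11% drop
print(drift_alert(baseline_scores, [0.89, 0.90]))  # well under 1%
```

A production monitor would usually also require a minimum sample size before alerting, so that a handful of bad traces does not page anyone.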

Spell 05

Agent Shadow Run

Run autonomous agents in shadow mode before production rollout.

Execute agents in parallel with human workflows, capture telemetry, and promote only when confidence surpasses defined thresholds.

Telemetry

  • Agent decisions
  • Human comparison data
  • Success criteria

Best for

  • Autonomy teams
  • Operations

Run the spell

  1. Integrate agent execution logs into EvalOps via the shadow-run connector.
  2. Apply the Shadow Run scorecard to compare agent vs. human outcomes.
  3. Graduate agents once they consistently exceed human benchmarks.
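
The graduation rule in step 3 could look something like the check below. It is a minimal sketch, not the Shadow Run scorecard: the `ready_to_promote` helper, the minimum case count, and the margin are assumptions, and each case is reduced to a simple success or failure for both agent and human.

```python
from typing import List, Tuple


def ready_to_promote(pairs: List[Tuple[bool, bool]],
                     min_cases: int = 50,
                     margin: float = 0.0) -> bool:
    """Decide promotion from (agent_succeeded, human_succeeded) pairs.

    Requires at least min_cases shadowed cases, then checks that the agent's
    success rate beats the human rate by more than margin.
    """
    if len(pairs) < min_cases:
        return False
    agent_rate = sum(agent for agent, _ in pairs) / len(pairs)
    human_rate = sum(human for _, human in pairs) / len(pairs)
    return agent_rate > human_rate + margin


# Hypothetical shadow-run outcomes: 50 cases, humans miss 10 of them.
pairs = [(True, True)] * 40 + [(True, False)] * 10
print(ready_to_promote(pairs))        # enough cases, agent ahead
print(ready_to_promote(pairs[:10]))   # too few cases to decide
```

Requiring a minimum number of cases is one simple way to approximate "consistently" and avoid promoting on a lucky streak.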

Spell 06

Support QA Scorecard

Evaluate customer support answers across CX, accuracy, and compliance.

Pull tickets or chat logs, run automated grading, and feed insights back into your CX and compliance programs.

Telemetry

  • Support transcripts
  • CX quality metrics
  • Compliance tags

Best for

  • Customer experience
  • Compliance

Run the spell

  1. Connect your helpdesk or knowledge base to stream transcripts into EvalOps.
  2. Apply the Support QA scorecard with customer satisfaction and policy adherence metrics.
  3. Share dashboards with CX leadership and trigger retraining workflows when thresholds dip.
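
Steps 2 and 3 can be pictured as a weighted scorecard with a floor that triggers follow-up. The sketch assumes hypothetical metric names, weights, and a 0.85 threshold; it is not the Support QA scorecard itself and makes no EvalOps calls.

```python
from typing import Dict


def weighted_score(metrics: Dict[str, float], weights: Dict[str, float]) -> float:
    """Weighted average of 0..1 metric scores; every weighted metric must be present."""
    total_weight = sum(weights.values())
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive value")
    return sum(metrics[name] * weight for name, weight in weights.items()) / total_weight


# Hypothetical graded transcript batch; accuracy is weighted double.
metrics = {"cx_quality": 0.9, "accuracy": 0.7, "compliance": 1.0}
weights = {"cx_quality": 1.0, "accuracy": 2.0, "compliance": 1.0}

score = weighted_score(metrics, weights)
needs_retraining = score < 0.85  # assumed threshold
print(round(score, 3), needs_retraining)
```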

Bring spells to your stack

Pair the spellbook with integrations and governance guardrails

Tell us which spell you need, the telemetry you’re wrangling, and who needs to sign off. We’ll send the configuration bundle and rollout plan, and connect you with a solutions engineer.