Local-first prompt regression diffs

Catch prompt regressions before they ship.

Point redline at the prompt-response logs you already have, change your prompt, then catch broken JSON, missing URLs, lost tables, and silent refusals before your users do.

$ python -m pip install redline-ai

$ redline demo --public --compact


100 public rows dogfooded | 0 runtime dependencies | 1 local app workflow


redline - demo

$ redline demo --public --compact
redline public dogfood: cases=10 regression=10 changed=0 missing=0 neutral=0
Confidence: HIGH | fix blocking cases before shipping
Scope: structural checks only; review factual correctness, tone, hallucinations, and subtle reasoning separately

REGRESSION case_001: candidate lost valid JSON format
REGRESSION case_002: candidate lost markdown table structure
REGRESSION case_006: candidate newly refuses
REGRESSION case_010: candidate lost bullet list structure

10 blocking regressions - fix before you ship.

$ redline diff dolly-suite-100.json dolly-candidate-100.jsonl --compact
redline diff: cases=100 regression=51 changed=27 missing=0 neutral=22
Diagnosis: candidate got shorter, lost required structure, dropped concrete details, returned empty outputs, and changed content substantially.
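The summary lines above are easy to consume in a script. The following is a minimal sketch, assuming the `key=value` layout shown in the terminal output stays stable and that the four outcome buckets add up to `cases` (true for both lines shown here); `parse_summary` is a hypothetical helper, not part of redline.

```python
import re


def parse_summary(line: str) -> dict[str, int]:
    """Extract every key=value integer counter from a summary line."""
    return {key: int(value) for key, value in re.findall(r"(\w+)=(\d+)", line)}


line = "redline diff: cases=100 regression=51 changed=27 missing=0 neutral=22"
counts = parse_summary(line)

# Assumption: these four buckets partition the evaluated cases.
buckets = ("regression", "changed", "missing", "neutral")
total = sum(counts[name] for name in buckets)

print(counts["regression"], "blocking of", counts["cases"])
print("buckets add up:", total == counts["cases"])
```

A wrapper like this could fail a build when `counts["regression"]` is non-zero, though the tool's own exit code is likely the better signal if it provides one.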

First-run proof

One command for the demo. One public dataset for proof.

The public demo is synthetic and safe to share, but it behaves like a real prompt change: the candidate gets shorter and silently drops required structure, dates, owners, URLs, and refusal behavior. The larger public dogfood run imports 100 rows from Databricks Dolly 15k and pushes them through the same suite, diff, history, benchmark, app, and dashboard surfaces.

case_001 JSON validity REGRESSION

baseline

{
  "owner": "ana@acme.dev",
  "due": "2025-07-01",
  "links": ["acme.dev/x"],
  "status": "open"
}

candidate

owner: ana, due: soon,
links: omitted,
status: open

not valid JSON

reason: candidate lost valid JSON format; required keys are no longer parseable.
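The case_001 comparison can be reproduced with the standard library. This is an illustrative sketch only: `is_json_object` and `lost_json` are hypothetical names, and redline's actual check may be stricter or look at more than parseability.

```python
import json


def is_json_object(text: str) -> bool:
    """Return True when text parses as a JSON object."""
    try:
        return isinstance(json.loads(text), dict)
    except json.JSONDecodeError:
        return False


def lost_json(baseline: str, candidate: str) -> bool:
    """Flag the case where the baseline was valid JSON and the candidate is not."""
    return is_json_object(baseline) and not is_json_object(candidate)


baseline = (
    '{"owner": "ana@acme.dev", "due": "2025-07-01", '
    '"links": ["acme.dev/x"], "status": "open"}'
)
candidate = "owner: ana, due: soon,\nlinks: omitted,\nstatus: open"

print(lost_json(baseline, candidate))
```

The check only fires when the baseline itself was valid JSON, so free-text cases do not produce false alarms.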

100 public prompt-response rows

Imported from Databricks Dolly 15k with redline's JSONL importer and kept as local-only dogfood evidence.

20 behavior groups covered

The generated suite covered 100/100 cases and 20/20 deterministic behavior groups.

51 blocking regressions found

The report diagnosed shorter answers, lost structure, dropped concrete details, empty outputs, and substantial content drift.

0 dashboard warnings

The local app loaded 1 report, 1 benchmark, and 1 history entry with no sidecar-artifact noise.

Figure: animated redline product demo showing a prompt regression report.

Figure: local app from the 100-row Dolly dogfood run, showing reports, benchmark evidence, history, and ship readiness.

Figure: HTML report from the same run, with the diagnosis, coverage, methodology, concrete reasons, and side-by-side baseline and candidate outputs.

Product promise

You already have the eval data. redline turns it into a gate.

Zero manual cases to start

Build suites from JSONL prompt-response logs instead of inventing eval cases by hand.

No cloud calls by default

The core checks run locally with zero runtime dependencies, no API keys, and reproducible CI output.

Clear failure reasons

Reports name the broken behavior: lost JSON, missing keys, dropped URLs, new refusals, empty output, or structure drift.

Review loop included

Mark intentional changes, accept reviewed baselines, and keep the suite evolving with your prompt.
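To make "clear failure reasons" concrete, here is a rough sketch of deterministic structural checks that name what broke. The regular expressions, the refusal phrases, and the `structural_reasons` function are all assumptions for illustration; they are not redline's rules.

```python
import re

# Assumed heuristics, chosen for illustration only.
URL_RE = re.compile(r"https?://[^\s)\]>\"']+")
TABLE_ROW_RE = re.compile(r"^\s*\|.*\|\s*$", re.MULTILINE)
REFUSAL_MARKERS = ("i can't", "i cannot", "i'm unable", "i am unable")


def refuses(text: str) -> bool:
    """Very rough refusal detector based on a few fixed phrases."""
    lowered = text.lower()
    return any(marker in lowered for marker in REFUSAL_MARKERS)


def structural_reasons(baseline: str, candidate: str) -> list[str]:
    """List the structural behaviors the candidate lost relative to the baseline."""
    reasons = []
    if baseline.strip() and not candidate.strip():
        reasons.append("candidate returned empty output")
    dropped = set(URL_RE.findall(baseline)) - set(URL_RE.findall(candidate))
    if dropped:
        reasons.append("candidate dropped URLs: " + ", ".join(sorted(dropped)))
    if TABLE_ROW_RE.search(baseline) and not TABLE_ROW_RE.search(candidate):
        reasons.append("candidate lost markdown table structure")
    if refuses(candidate) and not refuses(baseline):
        reasons.append("candidate newly refuses")
    return reasons


baseline = "| name | url |\n|---|---|\n| docs | https://acme.dev/docs |"
candidate = "I can't help with that."
for reason in structural_reasons(baseline, candidate):
    print("REGRESSION:", reason)
```

Each check compares the candidate against the baseline, so a case without URLs or tables in its baseline never triggers those reasons.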

Core loop

From logs to a shipping gate in four steps.

01

Collect behavior

Watch an app, import JSONL, or adapt exports from your logging stack.
redline watch --log logs/prompts.jsonl
02

Generate the suite

Select representative behavior from real prompt-output data.
redline suite logs/prompts.jsonl --out redline-suite.json
03

Evaluate a change

Replay the suite against a new prompt file or candidate log.
redline eval --prompt prompts/v2.txt
04

Review and accept

Pin requirements, mark expected changes, and promote reviewed outputs.
redline mark ...
redline accept ...
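The loop starts from a JSONL prompt-response log: one JSON object per line. The sketch below shows what such a log might look like and how to sanity-check it before building a suite. The `prompt` and `response` field names are assumptions, not redline's documented schema, and `load_rows` is a hypothetical helper.

```python
import io
import json

# Hypothetical rows; the field names are assumed for illustration.
records = [
    {"prompt": "Summarize the ticket as JSON", "response": '{"status": "open"}'},
    {"prompt": "List the owners", "response": "- ana\n- ben"},
]
sample_log = "\n".join(json.dumps(record) for record in records) + "\n"


def load_rows(stream) -> list[dict]:
    """Parse one JSON object per non-blank line and require both fields."""
    rows = []
    for number, line in enumerate(stream, start=1):
        line = line.strip()
        if not line:
            continue  # tolerate blank lines between records
        row = json.loads(line)
        if "prompt" not in row or "response" not in row:
            raise ValueError(f"line {number}: expected 'prompt' and 'response' keys")
        rows.append(row)
    return rows


rows = load_rows(io.StringIO(sample_log))
print(len(rows), "rows loaded")
```

Swap `io.StringIO(sample_log)` for an open file handle to check a real export.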

Trust model

Calibrated signals, not magic approval.

What redline catches

JSON validity, missing keys, URLs, numbers, entities, tables, lists, code blocks, empty outputs, refusals, and obvious policy wording flips are surfaced with reproducible reasons.

What redline does not pretend

Tone, factual accuracy, hallucination risk, and subtle reasoning quality need pinned requirements or an optional judge. redline tells you when a run has no structural blockers; it does not pretend that means the answer is perfect.

How to close the gap

Pin edge cases with requirements, add product-specific judge rubrics for ambiguous changes, then accept reviewed candidate outputs so the baseline evolves deliberately.

Bring your stack

AI-agnostic first, adapters when you need them.

stdio command, OpenAI, Anthropic, HTTP APIs, LangChain, LlamaIndex, LiteLLM, JSONL exports, optional judges, CI dashboards

Replay any app

Use a command that reads stdin and prints stdout, or copy a runner for OpenAI, Anthropic, LiteLLM, HTTP APIs, LangChain, or LlamaIndex.

redline init --runner stdio --copy-runner
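A stdio runner can be very small. The sketch below assumes the simplest possible contract, with the prompt arriving on stdin and the response going to stdout; the exact payload redline sends is not specified here, and `respond` is a hypothetical placeholder for your app's real model call.

```python
import sys


def respond(prompt: str) -> str:
    """Placeholder for the real model call in your app (hypothetical)."""
    return f"echo: {prompt.strip()}"


def main() -> None:
    # Read the whole prompt from stdin, write exactly one response to stdout.
    prompt = sys.stdin.read()
    sys.stdout.write(respond(prompt) + "\n")


if __name__ == "__main__":
    main()
```

Keeping logs and debug output on stderr leaves stdout clean for the response being compared.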

Local dashboard

Publish self-contained HTML reports as CI artifacts or browse the ship-readiness dashboard locally.

redline app --reports-dir .redline/demo/reports

AI assistant loop

Run redline where prompt work happens.

The MCP server lets Claude, Codex, Cursor, Kiro, and other MCP clients inspect redline setup, run evals, summarize suites, render dashboards, and list cases without switching context.

Install from the registry

Use the published MCP entry backed by the PyPI package.

uvx --from redline-ai redline-mcp

Ask the natural question

Your assistant can run the local check and explain the behavioral diff inline.

Did my prompt change introduce regressions?

Keep writes guarded

Read-only tools are the default; marking cases requires explicit write approval.

redline_mark requires allow_write


Release confidence

Not just a demo. A certified local product loop.

redline ships with repeatable checks for the paths that matter: clean install, external project CI behavior, report artifacts, history trends, dashboard output, and release packaging.

01 Clean package install

The release gate builds a wheel, installs it in a fresh virtualenv, and runs the first-user command path from outside the repo.

bash scripts/release_check.sh

02 External CI smoke

A temporary external project verifies strict doctor, strict validate, eval failure, GitHub annotations, history, and dashboard artifacts.

bash scripts/action_smoke.sh

03 Trend pressure

History shows whether blocking regressions are getting better or worse; summary and doctor surface thin coverage and semantic gaps.

redline history --fail-on worse

04 Publishable artifacts

Build scripts produce wheel and source distributions and run package metadata checks before anything is uploaded.

bash scripts/certify_release.sh
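The trend gate from step 03 can be pictured as a comparison between the two most recent runs. This is a sketch under stated assumptions: the `trend` function, the sample counts, and the rule "worse means more blocking regressions than the previous run" are illustrative and may not match how redline computes history trends.

```python
def trend(blocking_counts: list[int]) -> str:
    """Compare the latest run's blocking regressions with the previous run's."""
    if len(blocking_counts) < 2:
        return "unknown"  # not enough history to call a direction
    previous, latest = blocking_counts[-2], blocking_counts[-1]
    if latest > previous:
        return "worse"
    if latest < previous:
        return "better"
    return "flat"


# Hypothetical history: blocking regressions per run, oldest first.
history = [12, 9, 11]
status = trend(history)

# Mirror a fail-on-worse gate: non-zero exit code only when the trend worsens.
exit_code = 1 if status == "worse" else 0
print(status, exit_code)
```

Gating on direction rather than an absolute count lets a team adopt the check on a suite that already has known regressions.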

Public alpha path

Clone it, run the proof, then point it at one real prompt log.

The fastest way to understand redline is to make it catch one regression in your own AI workflow.


Open source surface

Review the checks before you trust the gate.

CI status | CONTRIBUTING.md | SECURITY.md | MIT license