Specpunk
v1.0 / march 2026

spec-driven development toolkit

punk check: FAIL — 3 never_touch violations in src/billing/

AI agents drift. punk doesn't let them.

scope violation contract enforced

punk init → punk plan → punk check → punk receipt

contract scope
4 files
actual touched
7 files
never_touch violations
3 files
approval hash
a3f9c2d1
without punk 7 files changed

Raw agent output

review confidence: low / contract drift: invisible

  • src/auth/login.rs in scope
  • src/auth/session.rs in scope
  • src/auth/policy.rs in scope
  • +3 files outside contract scope collapsed
  • src/billing/subscription.rs never_touch
  • src/billing/invoice.rs never_touch
  • src/billing/plan.rs never_touch

risk markers

  • never_touch violations are indistinguishable from valid edits
  • no contract hash — impossible to verify what was approved
  • acceptance criteria exist, but no proof they were tested
with punk check 4 files contained

Check receipt

contained enough to reason about

contract ACs

  • enforce session timeout after inactivity
  • preserve refresh-token behavior
  • never touch billing or payment modules

approved scope

  • src/auth/login.rs
  • src/auth/session.rs
  • src/auth/policy.rs
  • src/auth/timeout.rs

proofpack

  • holdout tests: 4/4 pass
  • AC coverage: verified
  • approval hash: a3f9c2d1
latest notebook delta
holdout blind testing is the moat — agents can't game what they can't see.
benchmark note
scope containment matters more than confidence scores alone.
open question
how much signal does punk scan add over a manually authored contract.json?
next build
proofpack drawer + AC contradiction checks + multi-agent receipt bundle.

public lab notebook

A tool that shows its work.

Not a blog. Not founder theater. Just the narrow layer where decisions, failures, and unresolved questions stay visible enough to keep the rest honest.

thesis delta / v1.0

holdout testing / moat confirmed

v1.0: holdout blind testing is the moat.

Agents optimize for what they can observe. Holdout tests are invisible to the agent during generation — they can only be passed by code that actually satisfies the contract, not by code that learned the test shape.

milestone

shipped / zero warnings

25 commands. 214 tests. Zero clippy warnings.

The full command surface — init, plan, check, receipt, scan, proofpack, and more — ships clean. 214 tests and no warnings means the tool enforces on itself what it enforces on agents.

architecture decision

resolved / unified tool

signum absorbed. One tool to rule them all.

Signum's receipt chain and boundary verification moved into punk. The approval hash, deterministic boundary seal, and receipt chain are now first-class punk primitives, not a separate project.

control room

Open the artifact pack.

Closer to opening a module than reading copy. The artifacts are the object. The page only arranges them.

Required view: contract, check receipt, and proofpack define the minimal boundary.

artifact

contract.json

The approved contract: scope.touch, dont_touch, ACs, and holdout tests.

{
  "version": "2",
  "goal": "add session timeout",
  "scope": {
    "touch": ["src/auth/"],
    "dont_touch": ["src/billing/", "src/payments/"]
  },
  "acceptance_criteria": [
    {"id": "AC-01", "description": "timeout enforced after inactivity"},
    {"id": "AC-02", "description": "refresh token flow unchanged"}
  ],
  "risk_level": "medium",
  "holdout_scenarios": [
    {"id": "HO-1", "steps": [{"exec": {"argv": ["test", "-f", "..."]}}]}
  ],
  "approval_hash": "a3f9c2d1e7b4..."
}

artifact

check.json

Receipt from punk check — scope violations, PASS/FAIL, approval hash.

{
  "schema_version": "0.1.0",
  "type": "check",
  "status": "FAIL",
  "contract_hash": "a3f9c2d1e7b4...",
  "scope": {
    "declared_files": ["src/auth/"],
    "actual_files": ["src/auth/login.rs", "src/billing/subscription.rs"],
    "violations": [
      {"file": "src/billing/subscription.rs", "violation_type": "never_touch"},
      {"file": "src/billing/invoice.rs", "violation_type": "never_touch"}
    ]
  },
  "duration_ms": 42
}

artifact

AGENTS.md

Generated by punk scan — repo conventions surfaced for the agent.

## Scope
- auth module: src/auth/
- never touch: src/billing/, src/payments/

## Conventions
- error handling: Result<T, PunkError>
- test location: tests/ (unit) + tests/holdout/ (blind)

## Approval
- run punk check before submitting any diff
- attach punk receipt to every PR

artifact

proofpack.json

Holdout test results with confidence scores — proof the contract was satisfied.

{
  "schema_version": "1.0",
  "decision": "AUTO_OK",
  "release_verdict": "PROMOTE",
  "confidence": {
    "execution_health": 95,
    "baseline_stability": 100,
    "behavioral_evidence": 100,
    "review_alignment": 90,
    "overall": 96
  },
  "evidence_coverage": 0.92,
  "contract_full_sha256": "a3f9c2d1..."
}

artifact

receipt.json

The minimal reviewable bundle: scope clean, ACs verified, hash sealed.

{
  "schema_version": "0.1.0",
  "type": "task",
  "status": "COMPLETED",
  "contract_hash": "a3f9c2d1e7b4...",
  "check_receipt_hash": "b7e2f9a1c3d5...",
  "summary": {
    "files_modified": ["src/auth/login.rs", "src/auth/session.rs"],
    "files_created": ["src/auth/timeout.rs"],
    "scope_violations": 0
  }
}

artifact

review.md

A review posture driven by the contract, not just a list of files touched.

decision: inspect and approve
reason:
  - contract scope respected
  - never_touch rules clean
  - holdout tests passed
  - approval hash attached
remaining_risk:
  - timeout threshold may need product tuning

prompt surface

Run punk on the last AI PR you merged without checking scope.

A real prompt, not a lead form. It should stay useful even before there is any proper contact surface behind it.

quick start

open artifact pack

why punk exists

  • AI agents routinely touch files outside their stated task — punk makes that a hard failure, not a review suggestion
  • holdout blind testing means agents can't optimize for the test; they have to satisfy the contract
  • every approved change gets a sealed receipt — scope, AC coverage, confidence score, approval hash