The Spec-to-Ship Pipeline
From product spec to shipped feature in the AI era.
Three months ago, I shipped an AI feature that rewrites user-submitted bug reports into structured, actionable tickets. The feature took two weeks from spec to production. It works reliably on about 94% of inputs. Users love it.
The version before that took six weeks, worked on about 60% of inputs, and users hated it. Same model. Same API. Same team. Same budget.
The difference was the spec. The first attempt had a spec that said “use AI to improve bug reports.” The second attempt had a spec that defined the exact input format, the exact output schema, seven categories of expected failure, a 200-case eval suite, and fallback behavior for every failure mode.
The spec isn’t a nice-to-have in AI engineering. It’s the product. Everything else (the model, the prompt, the pipeline) follows from the spec. Get the spec wrong and you’ll spend weeks debugging a system that was doomed from the design phase.
The old way is broken
Traditional software engineering has a well-understood pipeline: requirements, design, implementation, testing, deployment. Each phase feeds into the next. The requirements doc says what to build. The design doc says how. The implementation does it. The tests verify it. The deployment ships it.
This pipeline breaks for AI features in a specific way: the “implementation” step is no longer deterministic. In traditional software, if the design is correct and the implementation matches the design, the system works. In AI features, the implementation can perfectly match the design and still produce garbage outputs because the model interpreted your prompt differently than you intended, or the distribution of real-world inputs doesn’t match your test cases, or the model was updated and subtle behavioral changes cascaded through your pipeline.
The failure mode isn’t “the code is buggy.” It’s “the system does something reasonable-looking that’s wrong in ways we didn’t anticipate.”
This means the traditional pipeline needs to be rebuilt around a different assumption: the implementation will behave unpredictably, and the spec must account for that unpredictability.
Spec-driven development for AI features
There’s a movement in AI engineering that’s been gaining traction through 2025 and into 2026: spec-driven development. The core idea is that specifications are the source of truth and code is a generated or verified secondary artifact. This inverts the traditional workflow where code is primary and specs are documentation you write after the fact (or never).
For AI features specifically, spec-driven development means writing three things before you write any code:
- The behavioral contract: what the feature must do, what it must not do, and what it’s allowed to do ambiguously
- The eval suite: concrete test cases that verify the behavioral contract
- The failure spec: what happens when the AI component fails, times out, produces garbage, or behaves unexpectedly
Let me walk through each of these.
The behavioral contract
A behavioral contract for an AI feature is different from a traditional requirements doc. It needs to account for the fact that the AI’s output is probabilistic. Here’s what a good one looks like, using my bug report feature as an example:
Feature: Bug Report Structurer
INPUT:
- Free-text bug report from user (1-5000 chars)
- Metadata: product area, user tier, submission timestamp
OUTPUT (structured JSON):
{
  "title": string (max 120 chars, imperative mood),
  "severity": "critical" | "major" | "minor" | "cosmetic",
  "steps_to_reproduce": string[] (ordered list),
  "expected_behavior": string,
  "actual_behavior": string,
  "affected_component": string (must match known component list),
  "confidence": float (0-1)
}
MUST:
- Preserve all factual claims from the original report
- Assign severity based on user impact, not reporter tone
- Extract steps to reproduce even if not explicitly listed
- Map to a known component (or return "unknown")
- Include confidence score
MUST NOT:
- Invent steps not implied by the report
- Upgrade severity because the reporter used exclamation marks
- Discard information from the original report
- Return malformed JSON under any circumstances
MAY:
- Infer reasonable defaults for missing fields
- Split a single report into multiple tickets if distinct issues
- Request clarification (via confidence < 0.5 flag) when input
is genuinely ambiguous
The MUST/MUST NOT/MAY structure is borrowed from RFC 2119, and it’s perfect for AI behavioral contracts because it explicitly separates hard requirements from soft ones. This matters because LLMs will always have a gray area of behavior. The contract makes that gray area explicit and bounded.
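The contract can also be made executable by mirroring the output schema in code. Here’s a minimal sketch in Python (the class and field names are mine, not from any real library; a production version would likely use something like Pydantic for richer validation):

```python
from dataclasses import dataclass
from enum import Enum

class Severity(str, Enum):
    CRITICAL = "critical"
    MAJOR = "major"
    MINOR = "minor"
    COSMETIC = "cosmetic"

@dataclass
class StructuredBugReport:
    title: str                   # max 120 chars, imperative mood
    severity: Severity
    steps_to_reproduce: list[str]
    expected_behavior: str
    actual_behavior: str
    affected_component: str      # must be in the known component list, or "unknown"
    confidence: float            # 0.0-1.0; below 0.5 signals "needs clarification"

    def __post_init__(self):
        # Enforce the hard constraints from the contract at construction time
        if len(self.title) > 120:
            raise ValueError("title exceeds 120 chars")
        if not 0.0 <= self.confidence <= 1.0:
            raise ValueError("confidence out of range")
```

The point isn’t the class itself; it’s that every MUST in the contract should have a corresponding check that fails loudly when violated.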
The eval suite
The eval suite is the most important artifact in the entire pipeline. More important than the prompt. More important than the model choice. More important than the code.
Here’s why: without an eval suite, you’re making changes blind. You update a prompt and have no idea if it made things better or worse. You switch models and can’t tell if the new model handles edge cases differently. You add a pre-processing step and don’t know if it broke something downstream.
Anthropic’s engineering team published guidance on this in early 2026: start with 20-50 simple tasks drawn from real failures. That’s it. You don’t need a thousand test cases to start. You need enough to catch the obvious regressions.
For my bug report feature, the eval suite has 200 cases organized into categories:
| Category | Count | What it tests |
|---|---|---|
| Happy path | 40 | Well-formed reports with clear steps, severity, components |
| Minimal input | 25 | One-sentence reports, missing context |
| Adversarial | 20 | Prompt injection attempts, markdown abuse, nonsense input |
| Ambiguous severity | 30 | Reports where severity isn’t obvious |
| Multi-issue | 20 | Single reports describing multiple bugs |
| Non-English | 15 | Reports in other languages (product is English-only) |
| Edge cases | 25 | Extremely long reports, reports with code blocks, reports with URLs |
| Regressions | 25 | Cases that broke in previous versions (grows over time) |
Each case has:
- The input (the raw bug report)
- The expected output (the structured JSON)
- Acceptance criteria (which fields must match exactly, which can vary)
- A human-assigned quality score (1-5) for the expected output
The eval runs automatically on every prompt change, every model update, and every pipeline modification. It takes about 3 minutes and costs roughly $4. That $4 buys more confidence than any amount of manual testing.
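The harness itself doesn’t need to be fancy. Here’s a sketch of the core loop, assuming cases are stored as dicts and `run_pipeline` is the function under test (both names are illustrative, not from a specific framework):

```python
def run_eval(cases, run_pipeline):
    """Run every eval case and report pass counts per category.

    Each case is a dict with "input", "expected", "category", and
    "exact_fields" (the fields that must match exactly; other fields
    are allowed to vary).
    """
    results = {}
    for case in cases:
        output = run_pipeline(case["input"])
        passed = all(
            output.get(field) == case["expected"].get(field)
            for field in case["exact_fields"]
        )
        bucket = results.setdefault(case["category"], {"pass": 0, "total": 0})
        bucket["total"] += 1
        bucket["pass"] += int(passed)
    return results
```

Wire this into CI so it runs on every prompt change, and the per-category breakdown tells you exactly which part of the behavioral contract a change broke.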
The failure spec
This is the part most teams skip, and it’s the part that kills them in production.
The failure spec defines what happens when things go wrong. Not “the system shows an error.” Specific, actionable failure handling for every category of failure.
FAILURE MODES:
1. Model returns malformed JSON
→ Retry once with simplified prompt
→ If retry fails: return original report unstructured
with flag for human review
2. Model returns JSON but confidence < 0.3
→ Accept the output but flag for human review
→ Track frequency; if > 10% of reports, alert engineering
3. Model timeout (> 10s)
→ Return original report with "processing" status
→ Retry async; update when ready
4. Model returns hallucinated steps
→ Detected by: steps contain information not in original report
→ Fallback: return only steps that can be traced to original text
5. Model assigns wrong component
→ Detected by: component not in known list
→ Fallback: set component to "unknown", flag for triage
6. Rate limit / API error
→ Queue the report for batch processing
→ User sees: "Your report is being processed" (not an error)
7. Input contains prompt injection
→ Detected by: input/output divergence heuristic
→ Reject and return original report unchanged
Every single one of these failure modes happened in production during the first week. Every single one was handled gracefully because the spec anticipated it. The version without the failure spec? Users saw raw error messages, broken UI states, and mysteriously empty ticket fields.
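Failure mode 1 translates almost directly into code. A sketch, with `call_model` and `simplified_prompt` as hypothetical stand-ins for the real pipeline pieces:

```python
import json

def structure_report(raw_report, call_model, simplified_prompt):
    """Failure mode 1 from the spec: retry once with a simplified
    prompt on malformed JSON, then fall back to the unstructured
    original flagged for human review."""
    for prompt_variant in (None, simplified_prompt):
        try:
            ticket = json.loads(call_model(raw_report, prompt_variant))
            return {"status": "structured", "ticket": ticket}
        except json.JSONDecodeError:
            continue  # first failure triggers the simplified-prompt retry
    return {"status": "needs_human_review", "original": raw_report}
```

Every failure mode in the spec should reduce to a small, testable function like this. If you can’t write the function, the failure spec entry is too vague.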
Prompt engineering is software engineering
There’s a persistent misconception that prompt engineering is a creative discipline, like copywriting. You craft artful prompts that coax the model into good behavior. You iterate based on intuition. You know a good prompt when you see one.
This is wrong, and teams that treat it this way ship unreliable features.
Prompt engineering in 2026 is software engineering. Prompts are code. They have inputs, outputs, and behavioral contracts. They need version control, testing, and monitoring. They need to be reviewed in pull requests. They need to be documented.
Here’s what a production prompt looks like for my bug report feature. It’s not pretty. It’s not clever. It’s engineered.
You are a bug report structuring system. Your task is to convert
a free-text bug report into a structured JSON object.
## Input
You will receive a bug report submitted by a user, along with
metadata about the product area and user tier.
## Output Format
Return ONLY valid JSON matching this schema:
{
  "title": "string, max 120 chars, imperative mood",
  "severity": "critical|major|minor|cosmetic",
  "steps_to_reproduce": ["step1", "step2", ...],
  "expected_behavior": "string",
  "actual_behavior": "string",
  "affected_component": "string from allowed list",
  "confidence": 0.0-1.0
}
## Severity Rules
- critical: data loss, security vulnerability, complete feature failure
- major: feature partially broken, significant user impact
- minor: feature works but with inconvenience
- cosmetic: visual or text issues with no functional impact
## Allowed Components
[dashboard, api, auth, billing, notifications, search, settings,
integrations, mobile-app, unknown]
## Rules
1. Preserve ALL factual claims from the original report
2. Do NOT invent steps or details not present in the report
3. Base severity on USER IMPACT, not reporter tone or punctuation
4. If you cannot determine a field, use reasonable defaults and
set confidence below 0.5
5. If the report describes multiple bugs, structure the most
severe one and note others in the title
6. Return ONLY the JSON object. No explanation, no markdown.
Notice what’s in this prompt:
- Explicit output schema with types and constraints
- Enumerated allowed values for categorical fields
- Clear decision rules for ambiguous cases (severity)
- A bounded list of allowed component values
- Explicit prohibitions on common failure modes
- A clear signal mechanism (confidence score) for uncertainty
And notice what’s not in this prompt: creativity, personality, tone instructions, or filler. The prompt is a specification. It tells the model exactly what to do, exactly how to format the output, and exactly what not to do. Every word earns its place.
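Treating the prompt as code also means it gets assembled by code, not pasted inline. A sketch, assuming templates are stored as versioned files with `$`-style placeholders (the placeholder names here are illustrative):

```python
from pathlib import Path
from string import Template

def build_prompt(version, context, user_input,
                 prompt_dir=Path("prompts/bug-report-structurer")):
    """Assemble the final prompt from a versioned template file.

    Templates use string.Template placeholders ($context, $report),
    so the template file and the filling logic can change independently.
    """
    template = Template((prompt_dir / f"v{version}.md").read_text())
    return template.substitute(context=context, report=user_input)
```

Because the template is a plain file, a prompt change is an ordinary diff in an ordinary pull request.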
Version control for prompts
Prompts need version control for the same reason code needs version control: you need to know what changed, when, and why. And you need to roll back when a change breaks something.
My team stores prompts in the repo alongside the code that uses them. Each prompt is a separate file. Changes go through pull requests. The CI pipeline runs the eval suite against the new prompt before the PR can merge.
prompts/
├── bug-report-structurer/
│ ├── v1.0.md (initial version)
│ ├── v1.1.md (added severity rules)
│ ├── v1.2.md (added component list constraint)
│ ├── v2.0.md (restructured for Claude 4 Sonnet)
│ ├── eval-results/
│ │ ├── v1.0-results.json
│ │ ├── v1.1-results.json
│ │ └── ...
│ └── CHANGELOG.md
Every version has eval results stored alongside it. You can see exactly how each change affected quality. This isn’t overhead. It’s the only way to make prompt changes confidently.
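With eval results stored per version, the regression check in CI becomes a diff. A sketch, assuming each results file maps case IDs to pass/fail booleans (the file layout is illustrative, not a standard format):

```python
import json
from pathlib import Path

def find_regressions(old_results_path, new_results_path):
    """List eval cases that passed under the old prompt version
    but fail under the new one."""
    old = json.loads(Path(old_results_path).read_text())
    new = json.loads(Path(new_results_path).read_text())
    return sorted(case_id for case_id, passed in old.items()
                  if passed and not new.get(case_id, False))
```

A non-empty return value blocks the merge. That one rule is what makes the “Regression check?” row in the review checklist enforceable rather than aspirational.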
The prompt review checklist
When reviewing a prompt change in a PR, I check:
| Check | What I’m looking for |
|---|---|
| Output schema defined? | The model knows exactly what format to return |
| Failure handling specified? | The prompt says what to do when uncertain |
| Constraints bounded? | Categorical values have explicit allowed lists |
| Prohibitions stated? | Common failure modes are explicitly forbidden |
| Examples included? | At least 2-3 few-shot examples for ambiguous cases |
| Eval results attached? | The change was tested against the eval suite |
| Regression check? | No previously-passing cases now fail |
| Cost impact estimated? | Longer prompts = more tokens = more cost |
If any of these are missing, the PR doesn’t merge. This sounds strict. It prevents roughly 80% of the production issues we used to see with ad-hoc prompt changes.
Structured outputs: the reliability multiplier
The single biggest improvement to our pipeline’s reliability came from enforcing structured outputs. Not “ask the model to return JSON.” Actually enforced structured outputs using the model API’s native features.
Most modern model APIs (OpenAI, Anthropic, Google) now support structured output modes where the model is constrained to produce valid JSON matching a provided schema. Under the hood this is typically some combination of constrained decoding and training the model on structured tool calls; either way, the output is guaranteed to parse, so JSON syntax errors stop being your problem.
Before structured outputs:
Model returns: "Here's the structured bug report:\n\n```json\n{...}\n```"
Parser: fails because there's markdown around the JSON
Fix: regex to extract JSON from markdown
Model returns: "{title: 'Bug...'}" (invalid JSON, single quotes)
Parser: fails again
Fix: more lenient JSON parser
Model returns: valid JSON but missing "confidence" field
Downstream: null reference error
After structured outputs:
Model returns: {"title": "...", "severity": "...", ...}
Parser: succeeds every time because the API guarantees valid JSON
Downstream: works because the schema guarantees all required fields
The difference in reliability is dramatic. Our malformed-output rate went from about 8% to literally 0%. Not “close to zero.” Zero. The API guarantees valid JSON matching the schema. There is no parsing failure mode.
If you’re building any AI feature that needs structured output (which is most of them), use native structured output support. Do not rely on the model to format correctly on its own. Do not write regex parsers to extract JSON from prose. Use the API’s structured output mode.
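Even when the API guarantees the schema, a cheap structural check at your side of the trust boundary is worth keeping: it catches integration mistakes (wrong schema version, a field renamed in one place but not another) before they reach downstream code. A sketch mirroring the schema above:

```python
# Required fields and their expected Python types, mirroring the
# output schema from the behavioral contract.
REQUIRED_FIELDS = {
    "title": str,
    "severity": str,
    "steps_to_reproduce": list,
    "expected_behavior": str,
    "actual_behavior": str,
    "affected_component": str,
    "confidence": float,
}

def is_structurally_valid(output: dict) -> bool:
    """Defensive check: every required field present with the expected type."""
    return all(
        field in output and isinstance(output[field], expected_type)
        for field, expected_type in REQUIRED_FIELDS.items()
    )
```

This runs in microseconds, so there’s no reason to trust even a guaranteed schema blindly.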
The pipeline architecture
Let me show you the full pipeline for an AI feature, from user input to production output.
User Input
│
▼
┌─────────────┐
│ Input │ Validate length, sanitize, detect language,
│ Preprocessing│ check for injection patterns
└─────┬───────┘
│
▼
┌─────────────┐
│ Context │ Retrieve relevant data: user history,
│ Assembly │ product area metadata, similar past reports
└─────┬───────┘
│
▼
┌─────────────┐
│ Prompt │ Template + context + input → final prompt
│ Construction │ Select prompt version based on model
└─────┬───────┘
│
▼
┌─────────────┐
│ Model │ API call with structured output schema
│ Invocation │ Timeout: 10s, retry: 1x on failure
└─────┬───────┘
│
▼
┌─────────────┐
│ Output │ Validate against schema, check constraints,
│ Validation │ compute confidence, detect hallucination
└─────┬───────┘
│
▼
┌─────────────┐
│ Fallback │ If validation fails: retry, degrade gracefully,
│ Handling │ or return raw input with flag
└─────┬───────┘
│
▼
┌─────────────┐
│ Logging & │ Log input, output, latency, cost, quality score
│ Monitoring │ Feed failures back into eval suite
└─────┬───────┘
│
▼
Production Output
Every box in this diagram is a separate concern with its own tests, its own failure modes, and its own monitoring. The model invocation is just one box. It’s not the product. It’s a component.
Teams that treat the model as the entire pipeline (send input, get output, ship it) are the ones who end up with 60% success rates. Teams that build every box in this diagram are the ones who get to 94%.
Input preprocessing
This step is unglamorous and absolutely essential. Real user input is messy. It contains HTML from copy-paste. It has Unicode characters your system doesn’t expect. It sometimes contains the user’s entire email thread pasted in. It occasionally contains deliberate prompt injection attempts.
My preprocessing step:
- Strip HTML tags (users paste from browsers)
- Normalize Unicode (some “invisible” characters break tokenization)
- Truncate to max length (5000 chars) with notification
- Detect language (route non-English to different handling)
- Run injection detection heuristic (flag but don’t block)
This catches about 15% of inputs that would have produced bad outputs if sent to the model raw.
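A sketch of those steps, minus the language detection and injection heuristic (the regex-based HTML stripping is deliberately naive and fine for copy-paste debris; a real pipeline might use a proper HTML parser):

```python
import re
import unicodedata

MAX_LEN = 5000

def preprocess(raw: str) -> dict:
    """Clean a raw bug report before it reaches the model.

    Returns the cleaned text plus a flag so the UI can tell the
    user their report was truncated.
    """
    text = re.sub(r"<[^>]+>", "", raw)              # strip HTML tags
    text = unicodedata.normalize("NFKC", text)      # normalize Unicode
    text = "".join(ch for ch in text
                   if ch.isprintable() or ch in "\n\t")  # drop invisible chars
    truncated = len(text) > MAX_LEN
    return {"text": text[:MAX_LEN], "truncated": truncated}
```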
Context assembly
The model produces better output when it has relevant context. For bug reports, context means: what product area is this about? What components exist in that area? What similar bugs have been filed recently? What’s the user’s history with the product?
This step retrieves that context and includes it in the prompt. The retrieval itself is deterministic (database queries, search index lookups) which means this step is testable and debuggable in the traditional software sense.
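A sketch of what that assembly step can look like, with `db` standing in for whatever deterministic lookups you have (the method names are illustrative):

```python
def assemble_context(report_meta, db):
    """Gather deterministic context for the prompt.

    `db` is any object exposing the lookup methods used below;
    because every call here is a plain query, this step can be
    unit-tested with a stub.
    """
    area = report_meta["product_area"]
    return {
        "product_area": area,
        "known_components": db.components_for_area(area),
        "similar_reports": db.recent_similar_reports(area, limit=3),
    }
```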
Output validation
Even with structured outputs, the model can produce valid JSON that’s semantically wrong. A severity of “critical” for a cosmetic issue. Steps to reproduce that don’t match the original report. A component that’s not in the allowed list (the structured output schema enforces the type but not necessarily the enum, depending on implementation).
Output validation is a traditional code layer that checks semantic constraints the schema can’t enforce. It’s fast (< 1ms), deterministic, and catches about 3% of outputs that are structurally valid but semantically wrong.
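A sketch of the semantic checks the schema can’t express (the component list comes from the prompt; step-to-source traceability is handled separately by the divergence check described below):

```python
KNOWN_COMPONENTS = {
    "dashboard", "api", "auth", "billing", "notifications",
    "search", "settings", "integrations", "mobile-app", "unknown",
}

def semantic_errors(output: dict) -> list[str]:
    """Return a list of semantic constraint violations (empty = valid)."""
    errors = []
    if output["affected_component"] not in KNOWN_COMPONENTS:
        errors.append("component not in known list")
    if not 0.0 <= output["confidence"] <= 1.0:
        errors.append("confidence out of range")
    if not output["steps_to_reproduce"]:
        errors.append("empty steps_to_reproduce")
    return errors
```

Returning a list rather than a boolean matters: the failure-handling layer can log exactly which constraint failed, and those logs feed new eval cases.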
Error handling patterns for AI features
Error handling in AI systems is fundamentally different from error handling in traditional systems because the failure modes are different. Traditional systems fail clearly: exception thrown, null reference, timeout. AI systems fail subtly: the output looks right but isn’t.
Here are the patterns I’ve found most useful.
Pattern 1: The confidence gate
Every AI output should have a confidence score. When confidence is below a threshold, route to a different code path.
result = model.generate(prompt, schema=BugReport)

if result.confidence >= 0.7:
    # High confidence: auto-process
    create_ticket(result)
elif result.confidence >= 0.3:
    # Medium confidence: create draft, flag for review
    create_draft_ticket(result, needs_review=True)
else:
    # Low confidence: preserve original, notify human
    queue_for_manual_processing(original_input)
This three-tier pattern handles the full spectrum of model certainty without either blindly trusting the model or rejecting everything that isn’t perfect.
Pattern 2: The output divergence check
Compare the model’s output to the input and flag cases where the output contains information not present in the input. This catches hallucination.
input_entities = extract_entities(original_input)
output_entities = extract_entities(result.steps_to_reproduce)

hallucinated = output_entities - input_entities
if hallucinated:
    result.confidence *= 0.5  # Reduce confidence
    result.flags.append(f"Possible hallucination: {hallucinated}")
Pattern 3: The graceful degradation stack
Define a stack of increasingly degraded responses, and fall through them as failures accumulate.
try:
    # Tier 1: Full AI processing with best model
    result = process_with_model(input, model="claude-sonnet-4-5")
    validate(result)
    return result
except (ValidationError, TimeoutError):
    try:
        # Tier 2: Simplified processing with faster model
        result = process_with_model(input, model="claude-haiku")
        validate_basic(result)
        return result
    except Exception:
        # Tier 3: Return input with minimal structuring
        return basic_structure(input)
The user always gets something useful. The quality degrades gracefully rather than catastrophically. And the monitoring system tracks how often each tier is hit, so you know when the primary path is degrading.
Pattern 4: The async retry with notification
For non-urgent AI features, process asynchronously and notify when done.
try:
    result = process_with_model(input, timeout=5)
    return result  # Immediate response
except TimeoutError:
    job_id = queue_async_processing(input)
    return {
        "status": "processing",
        "job_id": job_id,
        "message": "Your report is being processed. We'll notify you."
    }
This is better than blocking the user for 30 seconds while you retry. And it’s much better than showing an error. The user’s intent is captured, and they’ll get the result when it’s ready.
The shipping checklist
When I’m about to ship an AI feature, I run through this checklist. Every item is a potential production incident if skipped.
| Check | Question | Threshold |
|---|---|---|
| Eval coverage | Does the eval suite cover all categories in the behavioral contract? | 100% of categories, 20+ cases each |
| Success rate | What’s the end-to-end success rate on the eval suite? | > 90% (varies by domain) |
| Failure handling | Is every failure mode in the failure spec implemented and tested? | 100% |
| Latency | What’s p50 and p99 latency? | p50 < 3s, p99 < 10s |
| Cost | What’s the per-invocation cost? Does it work at 10x scale? | Positive unit economics at projected volume |
| Monitoring | Can we see success rate, latency, cost, and quality in real-time? | Dashboard with alerts |
| Rollback | Can we disable the AI feature instantly without deploying? | Feature flag |
| Fallback | What happens when the feature is disabled? | Users can still complete the workflow |
| Privacy | Does the pipeline log PII? Is it compliant? | Reviewed by security |
| Injection | Has the feature been tested against prompt injection? | Adversarial test suite passes |
The last two are easy to forget and expensive to fix after launch. Privacy is especially tricky because AI pipelines often log inputs and outputs for debugging, and those logs can contain sensitive user data. Build the privacy controls before you ship, not after.
What this looks like in practice
Let me walk through a real timeline for shipping the bug report structurer, from spec to production.
Day 1-2: Spec. Wrote the behavioral contract, failure spec, and initial eval suite (50 cases). No code.
Day 3: Baseline. Built the minimal pipeline (input → prompt → model → output) with no preprocessing, validation, or error handling. Ran the eval suite. Success rate: 62%.
Day 4-5: Prompt iteration. Iterated on the prompt using eval results. Added the severity rules, component list, and explicit prohibitions. Success rate: 78%.
Day 6: Structured outputs. Switched to native structured output mode. Malformed output rate dropped from 8% to 0%. Overall success rate: 83%.
Day 7-8: Pipeline hardening. Added input preprocessing, context assembly, output validation, and the graceful degradation stack. Success rate: 91%.
Day 9: Eval expansion. Added 150 more eval cases, focusing on the failure categories from days 3-8. Found three new failure modes. Fixed them. Success rate: 94%.
Day 10: Production prep. Added monitoring, alerting, feature flag, privacy review, and the injection test suite. Deployed behind a feature flag to 5% of users.
Day 11-12: Soft launch. Monitored the 5% rollout. No issues. Expanded to 25%, then 100%.
Day 13-14: Feedback loop. Reviewed production failures from the first 48 hours at full rollout. Added 25 new eval cases from real failures (the “regressions” category). Fixed two prompt issues. Success rate stable at 94%.
Total time: 14 days. Total cost of eval runs during development: about $120. Cost of the production incidents that the eval suite prevented: incalculable but large.
The spec is the product
I keep coming back to this phrase because it encodes the most important lesson I’ve learned building AI features.
In traditional software, the code is the product. The spec is documentation. In AI engineering, the spec is the product. The code (including the prompt) is the implementation of the spec.
When an AI feature fails in production, the root cause is almost never “the model is bad.” It’s “the spec didn’t account for this case.” The model is doing its best with the instructions it was given. If those instructions are vague, ambiguous, or incomplete, the model’s output will be vague, ambiguous, or incomplete.
Write the spec first. Write the eval suite second. Write the failure spec third. Then, and only then, start building the pipeline. The spec tells you when you’re done. The eval suite tells you how close you are. The failure spec tells you what happens when reality doesn’t match your assumptions.
Everything else is implementation details. Important implementation details, but details nonetheless.
The spec is the product.