
Why we’re turning Claude Code into an operating system (on purpose)

Claude Code as a team OS: reusable skills, progressive context, verification and governance, so AI output becomes shippable, safe and repeatable.

Our design chapter has been writing about system fidelity: moving fast without quietly breaking the system you’ll need to maintain tomorrow.

This post is the engineering sibling: what happens when you apply the same “Generate → Consolidate” logic to code, workflows, and shipping.

If you haven’t read the design piece yet, start here: Claude Code vs V0 for Design Engineering: how to prototype in code without breaking your design system.

The why (before the how)

We’re investing in a more opinionated Claude Code way of working for one simple reason:

AI makes it easy to create output. It does not automatically make that output shippable.

In practice, teams hit three predictable bottlenecks:

  • Consistency: changes drift from patterns (style, architecture, security, accessibility, naming, conventions).
  • Verification: “looks good” is not a test suite, and green CI isn’t the same as “safe in production”.
  • Coordination: the hardest part is not writing code, it’s keeping shared context, decisions, and standards coherent across a team.

So our goal isn’t “let Claude write more code”.

It’s: shorten loops while keeping quality and risk under control.

This is also what we admire in how Intercom approaches it: they treat the agent setup as engineering infrastructure, not a personal productivity hack.

A quick primer (for non-engineers, and for engineers who like nouns)

When we say “Claude Code OS”, we mean a few concrete building blocks:

  • Skills: packaged instructions + references + scripts. Think “mini playbooks” Claude can invoke.
  • Hooks: rules that run before/after actions (e.g., block dangerous commands, auto-run checks).
  • Progressive disclosure: Claude doesn’t load the whole universe; it pulls the right files at the right time.
  • Verification: automated ways to prove a change works (tests, smoke flows, assertions, observability checks).
  • Governance: ownership, versioning, and discoverability so the system scales beyond one power user.

If that sounds like process: correct.

The alternative is accidentally scaling chaos.
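Concretely, most of these blocks live in plain files. A hedged sketch of what a team-level `.claude/settings.json` might contain (field names follow Claude Code's settings format at the time of writing, so check the current docs; the hook script path is our hypothetical convention):

```json
{
  "permissions": {
    "deny": ["Bash(rm -rf:*)", "Bash(git push --force:*)"]
  },
  "hooks": {
    "PostToolUse": [
      {
        "matcher": "Edit|Write",
        "hooks": [{ "type": "command", "command": ".claude/hooks/run-checks.sh" }]
      }
    ]
  }
}
```

Prevention lives in `permissions.deny`; verification lives in a `PostToolUse` hook that runs after every edit.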

1) From prompts to institutional knowledge: skills as culture, not snippets

A good skill is not “a markdown file”.

It’s a folder with:

  • a clear intent (“when to use this”)
  • examples and gotchas
  • optional scripts for repeatable work
  • a verify step (“how we know it worked”)
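On disk, that can be as simple as a folder with a `SKILL.md` entry point. A sketch (the frontmatter convention follows the Agent Skills format; the other file names are our convention, not a requirement):

```
skills/fe-component-review/
├── SKILL.md        # intent + "when to use this" in YAML frontmatter
├── gotchas.md      # known traps and edge cases
├── examples/       # before/after snippets
└── scripts/
    └── verify.sh   # the "how we know it worked" step
```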

At ~50–100 skills you’re no longer “configuring Claude”.

You’re codifying your engineering culture.

Our current buckets (where ROI actually shows up)

  1. Repo / library reference: “how we do X here” + gotchas
  2. Product verification: smoke flows, e2e drivers, assertions
  3. Data & observability: dashboards, query patterns, incident triage
  4. Automation: repeatable routines (release notes, dependency bumps, changelog)
  5. Review & quality: adversarial review, security checks, style guardrails

2) Context is expensive: progressive disclosure beats “load everything”

The anti-pattern is “stuff the prompt with everything we know”.

It feels safe, but it’s slow, costly, and increases instruction conflicts.

Pattern: thin start, thick when needed

  • Start with minimal boot context (goal + constraints + Definition of Done)
  • Use skills as entry points to deeper context
  • Keep heavy reference docs in files Claude can open on demand (api.md, gotchas.md, examples/)
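A thin entry point might look like this (a sketch; the skill name and referenced files are hypothetical, and the frontmatter fields follow the Agent Skills format):

```markdown
---
name: fe-component-review
description: Review frontend components against our design system. Use when touching src/components/.
---

Start from the checklist below. Only open `api.md`, `gotchas.md`,
or `examples/` when the change actually touches that area.
```

The description is what gets loaded up front; everything else stays on disk until the task needs it.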

3) Versioning & governance: skill sprawl is real (and boringly fatal)

If skills become institutional knowledge, they need to be treated like code:

  • ownership
  • reviews
  • changelog
  • deprecations
  • discoverability

What we’re experimenting with right now

  • taxonomy prefixes per domain (fe/, be/, ops/, data/)
  • a skill contract: intent, inputs/outputs, gotchas, verify step
  • a weekly “skill release” cadence
  • telemetry: usage logs to prune low-signal skills
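The telemetry piece can start embarrassingly simple. A minimal sketch, assuming a usage log that records one skill name per invocation (the log format and threshold are our assumptions, not a Claude Code feature):

```python
from collections import Counter

def prune_candidates(usage_log: list[str], min_uses: int = 3) -> list[str]:
    """Given a list of skill names (one entry per invocation),
    return skills used fewer than `min_uses` times, least-used first."""
    counts = Counter(usage_log)
    return sorted(
        (skill for skill, n in counts.items() if n < min_uses),
        key=lambda s: counts[s],
    )

# 'ops/old-deploy' was invoked once, so it's a prune candidate.
log = ["fe/review"] * 5 + ["be/migrations"] * 3 + ["ops/old-deploy"]
print(prune_candidates(log))  # → ['ops/old-deploy']
```

Run it at the weekly skill release, and low-signal skills become a deliberate deprecation decision instead of silent clutter.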

4) Hooks are the real accelerator (and your safety net)

Hooks are one way to turn “best practices” into default behavior, but in practice many of the guardrails live in your Claude Code config and in skill-level allowlists.

In our workflow, this can show up as:

  1. Claude settings/config (e.g. settings files) that define defaults and constraints.
  2. Skill “allowed” config that limits which commands/tools a given skill can run (so safety is scoped by intent).
  3. Optional hooks where you want enforcement around actions (before/after), like blocking dangerous commands or automatically running checks.

The underlying jobs stay the same:

  • Prevention: block risky actions (rm -rf, force push, destructive migrations)
  • Verification: run checks after edits (lint/test/smoke)

We strongly prefer on-demand enforcement for highly restrictive modes (so it doesn’t annoy you all day), and default-on checks for always-useful verification.
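The prevention job can be a small script. A sketch of a `PreToolUse` hook, assuming the documented hook contract (the pending tool call arrives as JSON on stdin, and exit code 2 blocks the action; the pattern list is illustrative, check the current hook docs before relying on it):

```python
#!/usr/bin/env python3
"""PreToolUse hook sketch: block obviously destructive shell commands."""
import json
import re
import sys

DANGEROUS = [
    r"\brm\s+-rf\b",
    r"\bgit\s+push\s+--force\b",
    r"\bdrop\s+table\b",
]

def is_dangerous(command: str) -> bool:
    """True if the command matches any known-destructive pattern."""
    return any(re.search(p, command, re.IGNORECASE) for p in DANGEROUS)

def main() -> None:
    raw = sys.stdin.read()
    if not raw:
        return  # no payload (e.g. run outside Claude Code)
    payload = json.loads(raw)
    command = payload.get("tool_input", {}).get("command", "")
    if is_dangerous(command):
        print(f"Blocked dangerous command: {command}", file=sys.stderr)
        sys.exit(2)  # exit 2 = block; stderr is fed back to Claude

if __name__ == "__main__":
    main()
```

Because the block reason goes back to Claude, it can self-correct instead of just failing.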

5) “Fast lane” reviews: auto-approving safe PRs (with strict rules)

We’re exploring faster review loops, but only with an explicit risk model. And it’s worth saying clearly: this is not fully “auto” yet in our current R&D.

Right now, the flow is closer to:

  • a PM/designer/engineer manually triages what’s safe to pick up (often by tagging Claude in Slack)
  • Claude drafts a PR
  • the PR is still reviewed manually

“Fast lane” (auto-approval for truly low-risk changes) is a next phase once the workflow is no longer in beta.

Our proposal: 3 lanes

  1. Safe lane (auto-approve possible)
    • docs, copy, non-prod config, refactors with no behavior change
    • required: lint + unit tests + static analysis
  2. Normal lane
    • feature work with limited blast radius
    • required: reviewer + e2e smoke
  3. Critical lane
    • auth, payments, permissions, data migrations, infra
    • required: 2 reviewers + staged rollout + runbook
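To make the lanes machine-checkable, triage can start as a path-based heuristic. A sketch (the paths and rules are illustrative, not our actual repo layout):

```python
CRITICAL_PATHS = ("auth/", "payments/", "permissions/", "migrations/", "infra/")
SAFE_PATHS = ("docs/", "README", "copy/", "config/dev/")

def classify_lane(changed_files: list[str]) -> str:
    """Route a PR to the strictest lane any touched file requires."""
    if any(f.startswith(CRITICAL_PATHS) for f in changed_files):
        return "critical"   # 2 reviewers + staged rollout + runbook
    if all(f.startswith(SAFE_PATHS) for f in changed_files):
        return "safe"       # lint + unit tests + static analysis
    return "normal"         # reviewer + e2e smoke

print(classify_lane(["docs/setup.md"]))                      # → safe
print(classify_lane(["src/ui/button.tsx"]))                  # → normal
print(classify_lane(["payments/charge.py", "docs/x.md"]))    # → critical
```

Note the asymmetry: one critical file makes the whole PR critical, while “safe” requires every file to be safe.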

Feature flags are the bridge between speed and safety: ship small, validate under prod conditions, roll back without drama.

6) Read-only production introspection (soon): functional teams asking real questions safely

This is the most strategic unlock for us.

Imagine a world where PM/ops/CS can ask:

  • “Did signups drop after yesterday’s release?”
  • “Which cohort is failing onboarding?”
  • “Is this issue limited to one tenant?”

…without raw production access.

What it requires (non-negotiables)

  • allowlisted queries/commands
  • rate limits
  • PII redaction
  • structured outputs (summaries, not dumps)
  • audit trail
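A sketch of what that boundary can look like in code (the query names, redaction rule, and rate limit are all hypothetical; `execute` stands in for whatever talks to your read replica):

```python
import re
import time

ALLOWED_QUERIES = {
    # functional users pick a name; they never write SQL themselves
    "signups_last_24h": (
        "SELECT count(*) FROM signups "
        "WHERE created_at > now() - interval '24 hours'"
    ),
    "onboarding_failures_by_cohort": (
        "SELECT cohort, count(*) FROM onboarding_events "
        "WHERE status = 'failed' GROUP BY cohort"
    ),
}
EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+")
_last_call: dict[str, float] = {}

def run_readonly(name: str, execute, min_interval: float = 5.0) -> str:
    """Run only allowlisted queries, rate-limited, with PII redaction."""
    if name not in ALLOWED_QUERIES:
        raise PermissionError(f"query '{name}' is not allowlisted")
    now = time.monotonic()
    if now - _last_call.get(name, 0.0) < min_interval:
        raise RuntimeError("rate limit: try again in a few seconds")
    _last_call[name] = now
    raw = str(execute(ALLOWED_QUERIES[name]))
    return EMAIL.sub("[redacted-email]", raw)  # summaries out, PII never
```

An audit trail would wrap every call with structured logging; we left it out here for brevity, but it's on the non-negotiable list for a reason.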

7) Failure modes (and how we prevent a haunted prompt mansion)

If we do this wrong, we’ll get:

  • contradicting skills
  • a bloated catalog nobody understands
  • token waste
  • false confidence (“green CI” ≠ safe behavior)

Our mitigations:

  • one canonical skill per domain (single source of truth)
  • deprecations that redirect
  • progressive disclosure by default
  • verification skills as first-class citizens

Conclusion

AI can accelerate delivery.

But only if you build the surrounding system: constraints, verification, and governance.

That’s what we’re building: a Claude Code OS that makes speed safe, and makes quality repeatable.

TBC ... 👀


Stef Nimmegeers, Co-Founder Nimble