CI speed is one of those things that feels like a quality-of-life improvement until it stops being one and starts being a constraint on what you can ship in a day. We hit that line earlier this year, swapped our GitHub-hosted runners for Blacksmith on every job in our pipeline, and watched the effect ripple through the whole engineering workflow — including a part of it we hadn't predicted: the throughput of our agent-driven implementation pipeline.

This post is a quick walk through what we changed, what it bought us in raw numbers, and the part we didn't see coming: the speed-up isn't really a speed-up, it's permission to work in a different way.

What we changed

Every CI job in our pipeline — the main pull-request workflow (lint, format check, type-check, unit tests, app build, Storybook compile, and a Cloudflare Workers deploy dry-run), the end-to-end smoke-test workflow, and the staging-deploy workflow — now uses runs-on: blacksmith-2vcpu-ubuntu-2404. We didn't adopt Blacksmith's caching action or their pnpm-installer action — the change is just the runner. Everything else (actions/checkout@v4, actions/setup-node@v4, pnpm/action-setup@v4, actions/cache@v4) stayed identical, including our cache keys and Turbo remote-cache configuration.

The whole switch was a one-line runs-on: change per job. Six jobs across three workflows. The diff was about as boring as a CI-infrastructure change can be. That's deliberate — we wanted to be able to rip Blacksmith back out if it broke something subtle, and we wanted the difference to be cleanly attributable.
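For anyone who would rather script the swap than edit each job by hand, here is a minimal sketch of the label rewrite. It assumes the old jobs used the unquoted label ubuntu-latest (the post only says "GitHub-hosted"), and swap_runner is a hypothetical helper, not something from our repository.

```python
import re
from typing import Tuple

# Assumption: the jobs previously ran on "ubuntu-latest".
OLD_LABEL = "ubuntu-latest"
# Target label quoted in the post.
NEW_LABEL = "blacksmith-2vcpu-ubuntu-2404"

# Matches an unquoted, single-label `runs-on:` line and keeps its indentation.
RUNS_ON = re.compile(r"^(\s*runs-on:\s*)(\S+)\s*$")


def swap_runner(workflow_text: str, old: str = OLD_LABEL, new: str = NEW_LABEL) -> Tuple[str, int]:
    """Return the rewritten workflow text and the number of jobs changed."""
    changed = 0
    out = []
    for line in workflow_text.splitlines():
        match = RUNS_ON.match(line)
        if match and match.group(2) == old:
            line = match.group(1) + new
            changed += 1
        out.append(line)
    return "\n".join(out) + "\n", changed


# Illustrative workflow fragment, not our real file.
sample = """jobs:
  lint:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
  test:
    runs-on: ubuntu-latest
"""
rewritten, count = swap_runner(sample)
```

Because the function reports how many jobs it touched, a wrapper can refuse to commit unless the count matches the number of jobs you expected to move. Running it with the labels reversed gives the rollback.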

What we got back

The end-to-end PR cycle went from roughly 9 minutes to roughly 4 minutes for a typical small change. That's the time from "push" to "all six checks green, ready to merge" — Lint & Format, Test, Build, Storybook Build, Wrangler Deploy Dry Run, and E2E Smoke Tests. Some of that is the runners themselves being faster on cold start; some of it is the cache layer behaving better under Blacksmith's storage; some of it is just less queueing.

Five minutes a PR doesn't sound like much. For a single human pushing one PR, it isn't. But CI cost is non-linear in the way it interacts with focus. A 9-minute build is just long enough that you context-switch to something else, then come back, then have to re-load the change in your head. A 4-minute build is short enough that you wait, see green, and merge in the same flow. The difference shows up not in saved seconds but in how many things you can keep in your head at once.

The part we didn't see coming

Sojournii's engineering workflow is unusual in one specific way: a lot of our day-to-day implementation runs through an agent-driven pipeline that picks up a planned epic, walks the stories in dependency order, and ships each one as its own PR with a CodeRabbit + Claude review pass before merging into a trunk branch. The trunk eventually opens a PR to main for human review. We use the Anthropic SDK and a small set of routing rules — a coordinator model orchestrates, a smaller model handles cheap mechanical work, and a larger model handles risk-adjacent decisions and any code that smells like security or money.
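To make the routing idea concrete, here is a rough sketch of what a rule set of that shape could look like. The tier names, the keyword list, and the fallback to the coordinator are all illustrative assumptions; our real rules are not reproduced here.

```python
# Hypothetical keywords standing in for "smells like security or money".
RISK_KEYWORDS = ("auth", "secret", "token", "payment", "billing")


def pick_tier(description: str, mechanical: bool = False) -> str:
    """Pick a model tier for a task; the risk check always wins."""
    text = description.lower()
    if any(keyword in text for keyword in RISK_KEYWORDS):
        return "large"  # risk-adjacent work goes to the larger model
    if mechanical:
        return "small"  # cheap mechanical work goes to the smaller model
    # Assumption: anything else stays with the coordinator.
    return "coordinator"
```

The risk check runs first on purpose: a task flagged as mechanical that also touches payment code still goes to the larger model.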

The pipeline's loop on each story looks like this:

1. Implement the story (delegate to the right model tier)
2. Open a PR
3. Wait for CI
4. Wait for CodeRabbit's review
5. Triage feedback, push fixes, repeat the wait
6. Merge

Steps 3 and 4 are the bottleneck, and it's one the agent can't shrink just by being a better agent. The wait is structural, so its length directly determines how much the agent can do in a single session before its context window starts to hurt.
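The loop can be sketched as a small function. This is an illustration of its shape, not our pipeline's code: every callable is a hypothetical stand-in supplied by the caller, and max_rounds is an assumed safety limit.

```python
from typing import Callable, List


def run_story(
    implement: Callable[[], None],
    open_pr: Callable[[], int],
    ci_passed: Callable[[int], bool],
    review_findings: Callable[[int], List[str]],
    push_fixes: Callable[[int, List[str]], None],
    merge: Callable[[int], None],
    max_rounds: int = 3,
) -> int:
    """Run one story through the loop and return how many wait rounds it took."""
    implement()
    pr = open_pr()
    for round_number in range(1, max_rounds + 1):
        green = ci_passed(pr)           # blocks until CI finishes
        findings = review_findings(pr)  # blocks until the review lands
        if green and not findings:
            merge(pr)
            return round_number
        # A red build or any finding sends the story around the loop again.
        push_fixes(pr, findings)
    raise RuntimeError(f"PR {pr} not mergeable after {max_rounds} rounds")


# Stubbed usage: one round of review feedback, then a clean pass.
review_rounds = iter([["missing shape guard"], []])
merged: List[int] = []
rounds = run_story(
    implement=lambda: None,
    open_pr=lambda: 1,
    ci_passed=lambda pr: True,
    review_findings=lambda pr: next(review_rounds),
    push_fixes=lambda pr, findings: None,
    merge=merged.append,
)
```

Every round of feedback pays the full CI wait again, which is why the build time multiplies rather than adds.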

With 9-minute builds, a six-story epic (six PRs, plus a trunk PR, plus retries on any round of CodeRabbit feedback) couldn't fit in one session. The pipeline had to schedule itself into the future, save state, and resume later. Resuming wakes the conversation back up in a cold cache state, which costs both time (re-reading the conversation) and money (re-priming the prompt cache for the next inference). It works, and that's what we built, but it's expensive in a way that's hard to measure because the cost lands in a different budget than the human-engineer one.

With 4-minute builds, six stories fit. The pipeline pushes a story PR, the build comes back inside the prompt-cache TTL, it merges and pushes the next story without ever leaving the session. The whole epic — design refinement, RFC drift checks, six PRs with full review, a trunk PR with another review pass, populated Confluence Summary and Test Plan — completes in one continuous run.
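The arithmetic behind that is simple enough to write down. The sketch below counts only time spent waiting on CI, and it assumes a 5-minute prompt-cache TTL. That is the default for Anthropic's prompt caching as far as we know; the post does not state which TTL the pipeline actually uses.

```python
def ci_wait_minutes(prs: int, build_minutes: int, retry_rounds: int = 0) -> int:
    """Total minutes spent waiting on CI across an epic."""
    return (prs + retry_rounds) * build_minutes


def stays_warm(build_minutes: float, cache_ttl_minutes: float = 5.0) -> bool:
    """True if a build returns before the assumed prompt-cache TTL expires."""
    return build_minutes < cache_ttl_minutes


PRS = 6 + 1  # six story PRs plus the trunk PR

before = ci_wait_minutes(PRS, build_minutes=9)                 # 63 minutes
after = ci_wait_minutes(PRS, build_minutes=4)                  # 28 minutes
after_with_retry = ci_wait_minutes(PRS, 4, retry_rounds=1)     # 32 minutes
```

Under that assumed TTL, a 4-minute build returns while the cache is still warm and a 9-minute build does not, so the saving is more than the 35 minutes of raw wait.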

That's not a speed-up. That's a different operating mode.

Concrete numbers

The engineering blog you're reading was built end-to-end by the agent pipeline in about 90 minutes wall-clock on Blacksmith. Six stories — refactor, frontmatter contract, content directory + generator + sitemap, redirect + listing page, footer link, inaugural post + hero. Six PRs. One CodeRabbit Major finding caught and addressed (a missing shape guard on the new optional frontmatter fields). All review gates ran.

On our previous CI cadence, that would have been a multi-day epic — not because the work was harder, but because the pipeline would have had to checkpoint after every two or three stories and resume cold. The marginal cost of extending an epic across two sessions is real: cache misses, re-reads, the human cost of remembering where you stopped. The marginal cost of running it in one continuous session is essentially zero.

What this changes about how we plan

We're not yet at the point where a six-story epic in 90 minutes is the right granularity for every change. Some epics — anything that touches concurrency-sensitive backend logic — should still be split across days because the human review at trunk-PR time benefits from a cool-down period. The agent can move faster than the reviewer should.

But for content additions, refactors, type-system changes, and most UI work, the new operating mode is: brief the agent on the epic, walk away for an hour, come back to a trunk PR to review. The work is done. The Summary doc is written. The Test Plan is populated. The drift report explains every place the implementation deviated from the RFC, which on this epic was three positive deviations and two cosmetic ones.

The CI runner is the thing that changed. Everything else followed.

Things to be careful about

A few honest caveats for anyone considering the same swap:

Blacksmith's runners are pinned to specific labels (blacksmith-2vcpu-ubuntu-2404, blacksmith-4vcpu-ubuntu-2404, etc.). If you have many workflows, do a single bulk migration so you can roll back atomically. We didn't, and we briefly had a mixed pipeline where one job ran on GitHub-hosted and the rest on Blacksmith. The cache keys interacted in confusing ways for an afternoon.
The cache layer behaves differently. Our Turbo remote-cache hit rate went up under Blacksmith, which we weren't expecting. We didn't change anything about Turbo's configuration; the storage backend is just faster. Take the win, but don't assume your cache-hit metrics from the previous platform translate one-to-one.
The agent-throughput effect is real but only if your bottleneck was actually CI. If the agent was waiting on CodeRabbit (which sometimes takes 10–15 minutes for larger PRs even on a fast pipeline), faster CI doesn't help. Measure where your wait is before assuming the runner is the answer.
You still need every other piece. Faster CI doesn't replace good test coverage, doesn't replace review discipline, and doesn't replace the willingness to revert a PR that broke something subtle. It just makes the cycle short enough that those disciplines can keep up with the agent's pace.
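One cheap way to check where the wait is before swapping runners: for each recent PR, record how long CI took and how long the review took, then see how often CI was the one finishing last. This is a minimal sketch that assumes both start when the PR opens; the sample numbers are invented for illustration.

```python
from typing import List, Tuple


def ci_bound_share(waits: List[Tuple[float, float]]) -> float:
    """Fraction of PRs where CI finished at or after the review.

    Each entry is (ci_minutes, review_minutes) for one PR.
    """
    if not waits:
        return 0.0
    ci_bound = sum(1 for ci, review in waits if ci >= review)
    return ci_bound / len(waits)


# Invented sample: CI is the last thing to finish on three of four PRs.
sample_waits = [(9, 6), (9, 12), (9, 5), (9, 7)]
share = ci_bound_share(sample_waits)
```

A share near 1.0 suggests a faster runner will shorten the loop; a share near 0.0 suggests the review is the real wait and the runner swap won't help much.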

If you're building anything where the engineering workflow runs through an LLM-driven pipeline at any point — agentic implementation, agentic review, even just agentic dependency-update PRs — your CI runner is now part of the substrate the agent is reasoning over. Treat it accordingly.

How Blacksmith CI changed what our agentic workflow can do in one session
