Ten weeks is the engagement length we have settled on for first-time AI work with a new customer. Shorter than that and there is not enough time to ship anything that survives contact with production. Longer than that and the customer's organisation loses the political momentum that funded the work in the first place. Ten weeks is enough time to ship a real system, document it, and hand it over with an operating runbook that the in-house team can defend.
The shape of those ten weeks is, by now, well-rehearsed. We have run more than a dozen of these engagements and the rhythm has converged on a structure that is durable enough to write down without too many caveats. This piece is that structure — what we lock in week one, what we ship by week ten, and the trade-offs that make the cadence work.
Week one: three artifacts, no code
The first week produces three documents and no production code. The first is the architecture brief — a 6-to-10-page document that describes the system we are about to build, the choices we are making, and the choices we are explicitly not making. The second is the eval contract, described elsewhere on this site, which specifies the properties the system must satisfy. The third is the engagement plan, which lays out the ten-week schedule, the demo cadence, and the named owners on both sides.
Customers occasionally object to spending a week of an expensive engagement on documents. The objection evaporates around week six, when the project hits its first scope-creep moment and we resolve it in twenty minutes by reading the architecture brief together rather than re-litigating the design from scratch. The documents are not bureaucracy — they are insurance against the conversations that would otherwise eat the engagement alive.
Weeks two through four: the core loop
The next three weeks are spent building the core loop of the system end to end, in the simplest defensible form. The model is hard-coded to a single provider. The prompt is the simplest version that can pass the most basic eval. The retrieval layer, if there is one, is hard-coded to a single index. The point is to have a working pipeline from input to output as quickly as possible, with full observability and a working eval suite running on every commit.
We do not optimise anything during these three weeks. We do not negotiate with the customer about prompt wording. We do not benchmark against alternative models. The goal is to remove the question 'is this technically possible' from the project as quickly as possible, so that the rest of the engagement can be spent on the questions that actually matter — accuracy, cost, latency, and operational risk.
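In spirit, the week-two-to-four deliverable is small enough to sketch. The following is illustrative only — the names, the prompt, and the stubbed model call are all hypothetical stand-ins, not the real pipeline — but it shows the shape: a hard-coded single provider, the simplest prompt that can pass the most basic eval, logging on every call, and a tiny eval suite that runs on every commit.

```python
import logging

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("core_loop")

# The simplest prompt that can pass the most basic eval.
PROMPT_TEMPLATE = "Classify the sentiment of this review as positive or negative:\n{text}"

def call_model(prompt: str) -> str:
    """Hard-coded to a single provider. Stubbed here so the sketch runs
    offline; in the real loop this is one SDK call to the chosen vendor,
    with nothing configurable."""
    return "positive" if "great" in prompt.lower() else "negative"

def run_pipeline(text: str) -> str:
    prompt = PROMPT_TEMPLATE.format(text=text)
    output = call_model(prompt)
    # Full observability from day one, even while the logic is trivial.
    log.info("input=%r output=%r", text[:40], output)
    return output

# The most basic eval: a handful of labelled cases, run on every commit.
EVAL_CASES = [
    ("The service was great", "positive"),
    ("Slow and rude staff", "negative"),
]

def eval_pass_rate() -> float:
    passed = sum(run_pipeline(text) == expected for text, expected in EVAL_CASES)
    return passed / len(EVAL_CASES)
```

Nothing in this sketch is optimised, and that is the point — it exists so the question 'is this technically possible' can be answered and retired.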
Weeks five through seven: hardening
The middle three weeks are the unglamorous core of the engagement. The eval suite is expanded to cover the long tail of inputs the system will see in production. The prompt is iterated against the eval suite, not against the customer's intuition. The retrieval layer is replaced with the production data source. The cost ceiling is set, the rate limits are set, the failure modes are catalogued and the recovery paths are written down. By the end of week seven, the system passes the eval suite at a rate the customer is willing to commit to in writing.
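The written pass-rate commitment only means something if it is enforced mechanically. A minimal sketch of that enforcement, assuming a hypothetical `EvalResult` record and an illustrative 95% threshold (the real figure is whatever the customer signs), wired into CI so a regression below the commitment fails the build:

```python
from dataclasses import dataclass

@dataclass
class EvalResult:
    case_id: str
    passed: bool

# The rate the customer commits to in writing. 0.95 is illustrative.
COMMITTED_PASS_RATE = 0.95

def gate(results: list[EvalResult], threshold: float = COMMITTED_PASS_RATE) -> bool:
    """True iff the suite passes at or above the committed rate.
    Run on every commit; a False return fails the build."""
    if not results:
        return False  # an empty suite proves nothing
    rate = sum(r.passed for r in results) / len(results)
    return rate >= threshold
```

The design choice worth noting is that the threshold lives in one named constant next to the gate, so the number the customer signed and the number CI enforces cannot quietly drift apart.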
It is during this phase that customers most often try to add new requirements. We have learned to defer them, on the record, with a written note about which artifact they would change and what the cost of accepting them would be. The deferral is not a refusal — it is an accounting move. Most of the deferred requirements turn out, by week ten, to have been unnecessary, and the customer is glad we did not act on them in the moment.
Weeks eight and nine: production rollout
The last two engineering weeks are spent on the rollout itself. The system is deployed to production behind a feature flag, with a 1% traffic ramp, and instrumented heavily enough that we can detect a regression within minutes. Each ramp step is gated on the eval suite continuing to pass and on a small set of production-only metrics — latency, cost per call, refusal rate — staying inside the agreed envelope. By the end of week nine, the system is at 100% of the agreed traffic, the on-call rotation has been handed over, and the runbook has been walked through with the team that will operate it.
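The gating logic behind each ramp step can be sketched as a single pure function. The envelope figures, step values, and function name below are hypothetical illustrations, not the real contract: the step advances only if the eval suite is green and every production-only metric is inside the agreed envelope, and otherwise the ramp holds where it is.

```python
from dataclasses import dataclass

@dataclass
class Envelope:
    # Agreed operating envelope. Figures are illustrative placeholders.
    max_p95_latency_ms: float = 2000.0
    max_cost_per_call_usd: float = 0.02
    max_refusal_rate: float = 0.05

RAMP_STEPS = [0.01, 0.10, 0.25, 1.00]

def next_traffic_share(current: float, evals_green: bool,
                       p95_latency_ms: float, cost_per_call_usd: float,
                       refusal_rate: float, env: Envelope = Envelope()) -> float:
    """Advance one ramp step only if every gate holds; otherwise hold."""
    inside_envelope = (
        p95_latency_ms <= env.max_p95_latency_ms
        and cost_per_call_usd <= env.max_cost_per_call_usd
        and refusal_rate <= env.max_refusal_rate
    )
    if not (evals_green and inside_envelope):
        return current  # hold position; never ramp past a red gate
    later = [step for step in RAMP_STEPS if step > current]
    return later[0] if later else current
```

Keeping the decision in one function makes the ramp auditable: every traffic change in the deployment log corresponds to one call whose inputs can be replayed.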
# A representative ramp schedule
# day   traffic   gate
#   0      1%     eval suite green for 24h
#   2     10%     no production regressions
#   5     25%     cost per call inside envelope
#   8    100%     on-call rotation rehearsed

Week ten: handover, on the record
The final week is dedicated to handover. The runbook is reviewed line by line with the in-house team. The on-call rotation runs for the full week with us shadowing. The architecture brief is updated to reflect every decision that changed during the build, so the document the customer keeps is an accurate description of the system that exists, not the system that was originally proposed. The eval suite, which has grown by a factor of five during the engagement, is committed to the customer's repository with documentation on how to extend it.
We do not believe in long warranty periods. After week ten, the system is the customer's. We are reachable for the first month at a documented response time, but the operating responsibility transfers in full. This is the test that determines whether the engagement was successful — if the customer can run the system without us in month two, the engagement worked. If they cannot, the engagement was a pilot.
What we honestly promise by week ten
By the end of week ten, the customer has a system that is in production at 100% of the agreed traffic, an eval suite that passes at a rate that is committed in writing, a runbook that has been rehearsed by the people who will use it, and an architecture brief that accurately describes what was built. They do not have a system that handles every edge case anybody can imagine, because that is not a ten-week deliverable for any non-trivial AI system. They do have a foundation that can be extended by their own team, in their own time, against their own backlog. That is what 'production AI' looks like when it works.