Production AI, not pilot theatre.
We design, build and operate AI-native systems that survive contact with real users — agents, RAG, evaluation harnesses, and the infrastructure underneath.
Five capabilities. All shipped, all maintained.
- 01
Agent and workflow systems
LangGraph, Temporal, custom orchestrators. Tool use, planning, long-running workflows. Stateful, observable, restartable.
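The shape, in a minimal sketch. The step names and the JSON-file checkpoint below are stand-ins for what LangGraph checkpoints or Temporal histories do in production:

import json
from pathlib import Path

# Minimal restartable workflow: each step's output is checkpointed,
# so a crashed or redeployed worker resumes instead of starting over.
# Step names and the JSON-file store are illustrative; in production
# this role is played by LangGraph checkpoints or Temporal histories.

STATE = Path("run-001.json")

def plan(ctx):      return {**ctx, "plan": ["fetch", "summarize"]}
def fetch(ctx):     return {**ctx, "docs": ["..."]}        # tool call here
def summarize(ctx): return {**ctx, "answer": "..."}        # model call here

STEPS = [("plan", plan), ("fetch", fetch), ("summarize", summarize)]

def run():
    ctx = json.loads(STATE.read_text()) if STATE.exists() else {"done": []}
    for name, step in STEPS:
        if name in ctx["done"]:
            continue                       # already checkpointed: skip on resume
        ctx = step(ctx)
        ctx["done"].append(name)
        STATE.write_text(json.dumps(ctx))  # observable, restartable state
    return ctx

if __name__ == "__main__":
    print(run())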
- 02
Retrieval (RAG)
Hybrid search (semantic + lexical), document chunking strategies, reranking, citation. Postgres + pgvector or Weaviate / Qdrant based on scale.
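The fusion step, sketched. Reciprocal rank fusion is one common way to merge the two halves; the doc ids and the k=60 constant here are illustrative, not tuned values:

# Reciprocal rank fusion: merge the semantic and lexical halves of
# hybrid search into one ranking before reranking.

def rrf(ranked_lists, k=60):
    scores = {}
    for results in ranked_lists:
        for rank, doc_id in enumerate(results):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank + 1)
    return sorted(scores, key=scores.get, reverse=True)

semantic = ["doc7", "doc2", "doc9"]   # pgvector / Weaviate / Qdrant
lexical  = ["doc2", "doc4", "doc7"]   # Tantivy / Meilisearch (BM25)
print(rrf([semantic, lexical]))       # ['doc2', 'doc7', 'doc4', 'doc9']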
- 03
Model integration
OpenAI, Anthropic, open-source via vLLM. Vendor-neutral by default. Routing logic for cost, latency, and capability.
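Routing, reduced to its core: capability filter first, then price. The rates and capability tags below are placeholders, not quoted prices:

# Routing sketch: pick a model per request by capability, then cost.
# Prices and capability sets here are made up for illustration.

MODELS = [
    {"name": "small-open-model", "caps": {"chat"},                    "usd_per_mtok": 0.2},
    {"name": "claude-sonnet",    "caps": {"chat", "tools"},           "usd_per_mtok": 3.0},
    {"name": "frontier-model",   "caps": {"chat", "tools", "vision"}, "usd_per_mtok": 10.0},
]

def route(required_caps):
    viable = [m for m in MODELS if required_caps <= m["caps"]]
    if not viable:
        raise ValueError(f"no model satisfies {required_caps}")
    return min(viable, key=lambda m: m["usd_per_mtok"])  # cheapest capable model

print(route({"chat"})["name"])            # small-open-model
print(route({"chat", "tools"})["name"])   # claude-sonnet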
- 04
Evaluation, observability and guardrails
Eval datasets and harnesses (Braintrust, Langfuse, custom). Tracing, regression detection, prompt and model versioning.
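The gate itself is small. A sketch with made-up case ids and scores, mirroring the terminal demo further down; Braintrust or Langfuse hold the real data:

# Eval gate sketch: compare each case against the last accepted
# baseline and block the deploy if the aggregate score drops by more
# than the regression threshold.

THRESHOLD = 0.02

def gate(baseline: dict, candidate: dict) -> bool:
    for case in baseline:
        old, new = baseline[case], candidate[case]
        if new < old:
            print(f"  case {case} {old:.2f} -> {new:.2f} ({new - old:+.2f})")
    mean_base = sum(baseline.values()) / len(baseline)
    mean_cand = sum(candidate.values()) / len(candidate)
    if mean_base - mean_cand > THRESHOLD:
        print(f"blocked: regression > threshold ({THRESHOLD})")
        return False
    return True

gate({"087": 0.71, "104": 0.69, "121": 1.00},
     {"087": 0.62, "104": 0.61, "121": 0.83})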
- 05
Deployment, monitoring and cost control
CI/CD with eval gates, real-time cost dashboards, rate-limiting, fallback strategies. SOC2-aligned where required.
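Fallback strategies, sketched. The provider functions here are stubs; each real one wraps a client with its own timeout, retry budget, and cost meter:

import time

# Fallback chain: try providers in order, degrade instead of failing.

def call_primary(prompt):   raise TimeoutError("primary overloaded")
def call_secondary(prompt): return f"answer from secondary: {prompt!r}"

CHAIN = [call_primary, call_secondary]

def complete(prompt, attempts_per_provider=2):
    last_err = None
    for provider in CHAIN:
        for attempt in range(attempts_per_provider):
            try:
                return provider(prompt)
            except Exception as err:              # narrow this in real code
                last_err = err
                time.sleep(0.1 * (attempt + 1))   # simple backoff
    raise RuntimeError("all providers failed") from last_err

print(complete("ping"))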
Honest about boundaries.
Foundation model R&D
Not our edge. We integrate, evaluate and operate — we don't pretrain.
Pure prompt-engineering no-code agents
Not engineering. Use Make, Zapier or n8n directly — they're better at it.
One-off chatbot widgets without evals
We won't ship something we can't measure. If there's no eval surface, there's no engagement.
Discovery, build, operate, hand off.
Discovery — 2 to 6 weeks, flat fee
Architecture, scope, risks, costs. Output is a plan you could build with anyone, including not us.
Build — 8 to 20 weeks, weekly milestones
Senior engineers, weekly demos from real branches, merges into main from week one.
Operate — ongoing, monthly retainer
Eval gates in CI, production traces sampled and scored, on-call coverage when agreed.
Hand off — optional
When your team is ready to own it, we leave runbooks, ADRs, and a handover plan that holds.
The tools we reach for first.
- Anthropic Claude · Default reasoning + agent tool use
- OpenAI GPT-4 / GPT-5 · Multimodal, strong tool routing
- Open-source via vLLM · When latency or data residency demands it
- Voyage / Cohere · Embeddings, reranking
- Together AI · Routing across open weights
- Postgres + pgvector · First choice up to ~10M chunks
- Weaviate · When schema and scale grow together
- Qdrant · When sub-100ms latency is the constraint
- Tantivy / Meilisearch · Lexical half of hybrid search
- BM25 hybrid · Default reranking layer
- Vercel / AWS / GCP · Per workload — not religious
- Temporal · Long-running, restartable workflows
- Inngest / SQS · Event-driven background work
- Langfuse / Braintrust · Tracing + eval surface
- Sentry / Grafana · Errors, latency, cost panels
Evals first. Always.
- Evaluation datasets built with your engineering team
- Pre-deploy eval gates in CI
- Production traces sampled, scored, fed back
$ ai-pipeline ship \
--eval-gate=v0.4 \
--regression-threshold=0.02 \
--trace=production
[eval] running 145 cases against gate v0.4
[eval] 142 / 145 passed (97.93%)
[eval] 3 regressions on long-context retrieval (set: rag.long-ctx.v3)
· case 087 rouge-l 0.71 → 0.62 (-0.09)
· case 104 rouge-l 0.69 → 0.61 (-0.08)
· case 121 citation 1.00 → 0.83 (-0.17)
✗ blocked. regression > threshold (0.02)
→ inspect: https://app.braintrust.dev/auralink/rag/runs/8b4f...
Operations is the work.
Uptime SLO target, enforced via fallback chains.
Per-tenant cost dashboards — default deliverable, not a paid add-on.
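The accounting behind those dashboards, sketched with placeholder rates; real prices come from the provider's price sheet and are versioned with the routing config:

from collections import defaultdict

# Per-tenant cost meter: the core of the dashboard's data model.

PRICE_PER_MTOK = {"claude-sonnet": 3.00, "small-open-model": 0.20}

spend = defaultdict(float)  # tenant -> USD this billing window

def record(tenant: str, model: str, input_toks: int, output_toks: int):
    tokens = input_toks + output_toks  # single blended rate for brevity
    spend[tenant] += tokens / 1_000_000 * PRICE_PER_MTOK[model]

record("acme", "claude-sonnet", 12_000, 1_500)
record("acme", "small-open-model", 300_000, 40_000)
print(f"acme: ${spend['acme']:.4f}")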
Three engagement models. Honest ranges.
Discovery
- Architecture and ADRs
- Risk register
- Cost bands
Build
- Senior engineers
- Weekly demos
- Eval-gated deploys
Operate
- On-call coverage
- Eval + regression
- Cost ops
All engagements scoped on a call. We don't list standard packages because there aren't any.
Six questions we get every week.
How small is too small?
For anything under four weeks of build, we recommend Discovery only. We won't sell you a multi-month engagement we can't justify.
Do you sign DPAs / SOC2 / NDAs?
Yes. NDAs by default; DPAs and SOC2-aligned controls available on request, with our standard policies on file.
Can you work with our existing engineering team?
Yes. We pair with in-house engineers, share branches, do code review, and write the kind of documentation your team can own after we leave.
We already use a specific vendor (OpenAI, Anthropic, etc.). Is that fine?
Yes. We're vendor-neutral on principle. We'll work inside the constraints you've committed to and tell you honestly when a different choice would change the outcome.
What if the AI part doesn't pan out?
Discovery is structured to find that out before you spend a build budget. Half the value of Discovery is sometimes deciding not to build the thing.
Where are you?
Tel Aviv. Remote across EU, Israel and US time zones. We overlap with London/Berlin in the morning and East Coast US in the afternoon.
Have a system that needs to work, not just demo?
Tell us what you're trying to ship. We'll reply with concrete next steps — usually within two business days.
