Engineering, on contract.

Hardcore Engineering Services

Hands-on engineering for teams whose AI agents have to work in production, not just in the demo. Audits, cost rescue, evals, builds, red-teaming. Indicative price bands below. Every engagement is scoped before anything is invoiced.

Availability: Open · taking bookings
Remote: Worldwide · all TZs
In-house: London · +20% premium
Insured: Professional Indemnity insurance to £1M
Included in every engagement

Tokenmaxxing: spend less on models, not less on quality.

Most LLM bills run 3–10× larger than they need to be: the wrong model for the job, no caching, no observability, lock-in to a single vendor, bloated context. Every engagement below includes the same cost-and-context pass I run daily on my own gateway.

Context & tool-use discipline · No vendor lock-in · Open-source model routing · Spend observability

#1 Agent Audit

from £6,000

1–2 weeks · fixed fee · remote or on-site

Your agent behaves differently on identical inputs and nobody can explain why. In two weeks you get the architecture mapped, the failure modes catalogued, a cost and latency baseline, and a written 90-day roadmap your team can act on.

  • Architecture & failure-mode review
  • Cost, latency & token-spend baseline
  • Eval-readiness assessment
  • Written 90-day roadmap
The pattern

A support-triage agent files the same ticket under three different labels in one afternoon, and the debugging thread is all screenshots, no repro. The usual culprits: an unpinned model version drifting silently, and no eval baseline to catch it. Pinning the version and building a golden set from real traffic turns "nobody can say why" into a ranked failure-mode list your team can start fixing on Monday.
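Here is a minimal sketch of that golden-set check in Python. The pinned model ID, the ticket labels and the `classify` callable are hypothetical stand-ins. In practice `classify` would call your model through whatever client you already use, with the pinned version passed explicitly.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Callable, Iterable, List, Tuple

# Hypothetical dated snapshot ID: pin an exact version, never a floating "latest" alias.
PINNED_MODEL = "example-model-2024-08-06"


@dataclass(frozen=True)
class GoldenCase:
    ticket: str          # a real ticket sampled from production traffic
    expected_label: str  # the label a human reviewer agreed on


def rank_failure_modes(
    cases: Iterable[GoldenCase],
    classify: Callable[[str, str], str],
    model: str = PINNED_MODEL,
) -> List[Tuple[Tuple[str, str], int]]:
    """Run every golden case and count mislabels by (expected, got), most frequent first."""
    failures: Counter = Counter()
    for case in cases:
        got = classify(model, case.ticket)
        if got != case.expected_label:
            failures[(case.expected_label, got)] += 1
    return failures.most_common()
```

Run it on every model or prompt change. The top of the returned list is where Monday's fixing starts.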

Why me

I run multi-model routing, evals and OpenTelemetry tracing on my own production gateway every day. Your report draws on that same daily practice, not on textbook theory.

30-min scoping call · bring your weirdest trace

#2 LLM Cost Rescue

from £4,000

1–2 weeks · fixed fee · savings estimate before you commit

Your inference bill scales faster than your user base. Right-sized models per task, caching, prompt diet and open-source routing: the tokenmaxxing pass as a stand-alone rescue.

Stack: LiteLLM · OmniRoute · Qwen / Llama · semantic cache
  • Per-route model right-sizing (incl. OSS candidates)
  • Caching & context-window diet
  • Spend observability dashboard
  • Vendor lock-in exit options, costed
The pattern

The bill doubles month on month while usage stays flat, because every query, password resets and FAQ lookups included, goes through a frontier model at premium per-token rates. Deflecting cheap queries to a small open-source model behind a routing layer, plus a semantic cache for repeats, commonly halves the bill before any deeper work starts.
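Below is a stdlib-only sketch of that shape. It assumes hypothetical model names and uses a keyword list in place of a real intent classifier. The `embed` function is a toy bag-of-words vector that exists only so the example runs. A real semantic cache would use an embedding model and a vector index. The 0.8 similarity threshold is an assumption you would tune on your own traffic.

```python
import math
import re
from collections import Counter
from typing import Callable, List, Optional, Tuple

# Hypothetical model names; use whatever your routing layer exposes.
SMALL_MODEL = "small-oss-model"
FRONTIER_MODEL = "frontier-model"

# Assumption: a keyword list standing in for a proper intent classifier.
CHEAP_INTENTS = ("password", "reset", "opening hours", "refund policy")


def embed(text: str) -> Counter:
    """Toy bag-of-words 'embedding' so the sketch runs without a model."""
    return Counter(re.findall(r"[a-z']+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(count * b[token] for token, count in a.items())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


class SemanticCache:
    def __init__(self, threshold: float = 0.8) -> None:
        self.threshold = threshold
        self.entries: List[Tuple[Counter, str]] = []

    def get(self, query: str) -> Optional[str]:
        """Return the cached reply for the most similar past query, if it is similar enough."""
        vec = embed(query)
        scored = [(cosine(vec, stored), reply) for stored, reply in self.entries]
        if scored:
            score, reply = max(scored, key=lambda pair: pair[0])
            if score >= self.threshold:
                return reply
        return None

    def put(self, query: str, reply: str) -> None:
        self.entries.append((embed(query), reply))


def route(query: str) -> str:
    """Send recognised cheap intents to the small model, everything else to the frontier model."""
    q = query.lower()
    return SMALL_MODEL if any(keyword in q for keyword in CHEAP_INTENTS) else FRONTIER_MODEL


def answer(query: str, cache: SemanticCache, call_model: Callable[[str, str], str]) -> Tuple[str, str]:
    """Return (reply, source), where source is 'cache' or the name of the model that was called."""
    cached = cache.get(query)
    if cached is not None:
        return cached, "cache"
    model = route(query)
    reply = call_model(model, query)
    cache.put(query, reply)
    return reply, model
```

The cache is checked before routing, so a repeated question costs nothing at all. Only a genuinely new, non-trivial query reaches the frontier model.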

Why me

Tokenmaxxing is my daily practice. I run OSS model routing with per-team budgets and spend observability on my own stack, not just recommend it.

Send one month's bill · get a savings estimate

#3 Agent Evals & Reliability

from £8,000

2–3 weeks · fixed scope · golden sets from your traffic

You change a prompt and have no idea if you made it better or worse. I build the eval harness, regression gates and guardrails that let you ship changes weekly with evidence, not vibes.

Stack: promptfoo · Arize Phoenix · CI eval gates
  • Eval harness wired into CI
  • Golden datasets from your real traffic
  • Regression gates & release checklist
  • Guardrails for the failure modes that matter
The pattern

A one-line prompt tweak ships on Tuesday. The regression surfaces on Friday, in a customer complaint, and the rollback debate takes longer than the change did. A CI eval harness with golden sets drawn from real traffic moves that discovery to before merge: the regression fails a gate instead of paging support. Releases go from monthly-and-nervous to weekly.
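This is a minimal sketch of the gate itself, written in plain Python rather than promptfoo or Phoenix syntax. It scores the candidate prompt on the golden set and fails the build if the pass rate drops below the last released baseline. Exact-match scoring, the 2-point tolerance and the `run_candidate` callable are assumptions. Real harnesses often mix exact checks with rubric or model-graded scores.

```python
from typing import Callable, Sequence, Tuple

Case = Tuple[str, str]  # (input, expected output) drawn from real traffic


def pass_rate(cases: Sequence[Case], run_candidate: Callable[[str], str]) -> float:
    """Fraction of golden cases the candidate answers exactly as expected."""
    if not cases:
        raise ValueError("golden set is empty")
    passed = sum(1 for prompt, expected in cases if run_candidate(prompt) == expected)
    return passed / len(cases)


def eval_gate(
    cases: Sequence[Case],
    run_candidate: Callable[[str], str],
    baseline: float,
    tolerance: float = 0.02,
) -> int:
    """Return a CI exit code: 0 if the candidate holds the baseline, 1 if it regresses."""
    rate = pass_rate(cases, run_candidate)
    ok = rate >= baseline - tolerance
    print(f"pass rate {rate:.1%} vs baseline {baseline:.1%}: {'PASS' if ok else 'FAIL'}")
    return 0 if ok else 1
```

A CI step would call `sys.exit(eval_gate(...))`. A regression then blocks the merge instead of reaching a customer.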

Why me

Eval-first is how I build my own agents. Harnesses, guardrails and failure-mode catalogues are standing tooling in my stack, not a research project.

Bring one flaky agent · leave with a test plan

#4 Agentic Workflow Build

from £15,000

4–6 weeks · fixed scope · written change-control

You prototyped an agent in a notebook but nobody is comfortable putting it in front of customers. I take it to production: tracing, evals and human-in-the-loop fallback baked in. You own the code; scope changes are quoted, never silently absorbed.

Stack: your framework or mine · OpenTelemetry · MCP tools
  • End-to-end production agent or workflow
  • Observability, evals & guardrails from day one
  • Deploy + handover documentation
  • 2-week warranty: bug fixes & stability; new features scoped separately
The pattern

A contract-summarisation prototype wins every internal demo and ships to zero customers, because nobody will sign off on running it unsupervised. Productionising it means structured outputs with schema validation, tracing on every tool call, eval-gated deploys, and a human-in-the-loop fallback for low-confidence output. That is the difference between a demo and a system you can put in front of customers.
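This stdlib-only sketch covers the last two pieces. It validates the model's JSON against a schema, and it sends anything malformed or low-confidence to a human queue instead of the customer. The field names, the `confidence` field (which the prompt would have to ask the model to return) and the 0.75 floor are all hypothetical. A production build would more likely use a schema library such as Pydantic or JSON Schema.

```python
import json
from dataclasses import dataclass
from typing import Any, Dict, Optional, Tuple

# Hypothetical output schema: field -> accepted Python type(s) after json.loads.
SUMMARY_SCHEMA: Dict[str, Any] = {
    "parties": list,
    "term_months": int,
    "termination_clause": str,
    "confidence": (int, float),
}
CONFIDENCE_FLOOR = 0.75  # assumption: set from eval data, not by feel


@dataclass
class Decision:
    route: str  # "auto" or "human_review"
    reason: str
    summary: Optional[dict] = None


def validate(raw: str) -> Tuple[Optional[dict], Optional[str]]:
    """Parse model output and check it against SUMMARY_SCHEMA; return (data, error)."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        return None, f"invalid JSON: {exc.msg}"
    if not isinstance(data, dict):
        return None, "output is not a JSON object"
    for name, expected in SUMMARY_SCHEMA.items():
        if name not in data:
            return None, f"missing field: {name}"
        value = data[name]
        # bool is a subclass of int in Python, so reject it explicitly.
        if isinstance(value, bool) or not isinstance(value, expected):
            return None, f"wrong type for field: {name}"
    return data, None


def triage(raw: str) -> Decision:
    """Auto-publish only valid, confident summaries; everything else goes to a human."""
    data, error = validate(raw)
    if error is not None:
        return Decision("human_review", error)
    if data["confidence"] < CONFIDENCE_FLOOR:
        return Decision("human_review", "low confidence", data)
    return Decision("auto", "ok", data)
```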

Why me

17 years shipping production systems: apprentice, then Staff Platform Engineer at Tractable AI, then founding engineer at Intropy. Taking prototypes to production is the job I have done for a decade.

Describe the prototype and where it is stuck · written quote follows

#5 Red-Team & Injection-Proofing

from £7,000

1–2 weeks · fixed scope · report + fixes

Your agent has tools that touch real systems (databases, email, payments), and you have never tested what a malicious input makes it do. I attack it the way an adversary would, then close the paths I find.

Stack: OWASP LLM Top 10 · MCP sandboxing · tool-call approval
  • Indirect prompt-injection & tool-abuse testing
  • Attack-vector map across your tools & MCP surface
  • Provenance, allow-listing & approval-gating fixes
  • Written findings + severity ranking
The pattern

An agent with database and email tools obeys whatever it reads, and one poisoned document in the knowledge base is enough to steer it into mailing out what it can see. A structured red-team pass against the OWASP LLM Top 10 reliably surfaces exploit paths like these; the fixes are tool-call approval gates, input provenance tagging, and an action allow-list. Close the paths before launch, not after the incident.
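The sketch below shows those three fixes working together, using hypothetical tool names. There is an allow-list of tools. A provenance flag is set whenever untrusted content (a retrieved document, an inbound email) fed the context that produced the call. Side-effecting calls go through a human approval step. It illustrates the pattern only; it is not an MCP API.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, Iterable

# Hypothetical tool names for illustration.
ALLOWED_TOOLS = {"kb.search", "db.read", "db.write", "email.send"}
READ_ONLY = {"kb.search", "db.read"}
ALWAYS_APPROVE = {"email.send"}  # anything that can move data out of the perimeter


@dataclass
class ToolCall:
    tool: str
    args: Dict[str, str] = field(default_factory=dict)
    tainted: bool = False  # provenance: untrusted content shaped this call


def is_tainted(sources: Iterable[str]) -> bool:
    """Treat anything not typed by the authenticated user as untrusted."""
    return any(source != "user" for source in sources)


def gate(call: ToolCall, approve: Callable[[ToolCall], bool]) -> str:
    """Return 'allowed', 'denied' or 'blocked' for a tool call the agent proposes."""
    if call.tool not in ALLOWED_TOOLS:
        return "blocked"
    needs_human = call.tool in ALWAYS_APPROVE or (call.tainted and call.tool not in READ_ONLY)
    if needs_human and not approve(call):
        return "denied"
    return "allowed"
```

With a gate like this in place, the poisoned-document scenario ends at an `email.send` that a human has to approve, rather than at a sent email.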

Why me

SRE and reverse-engineering background. I think about attack surface and blast radius by instinct, which is exactly the lens agent tooling needs and rarely gets.

Before you give an agent write-access · not after

AI Platform Setups

Repeatable, fixed-price installs that drop into your stack. Tooling only: if you also need the diagnosis and the golden datasets, that is one of the engagements above. DevOps and platform engineering is my home turf, and these are the foundations that make everything above cheaper and safer to run.

Vector DB + Agentic Memory · from £3k
Semantic retrieval + cross-session memory: agents that find by meaning and remember.

Team Agent Harness · from £2.5k
One coding-agent setup for the whole team: shared config, shared skills, guardrails. No more ten divergent setups.
opencode · shared skills · agent profiles

Model Gateway · from £3k
One endpoint, many models. Failover, budgets, spend caps: swap providers without touching app code.
LiteLLM · OpenRouter · self-hosted routing

Central MCP Gateway · from £3k
One governed surface for every tool your agents can call: auth, audit and allow-listing.
MCP · gateway pattern · audit log

LLM Observability · from £4k
Every token, span and pound visible. Traces, spend metrics and dashboards your on-call will actually open.

Private / Local LLM · from £5k
Local or own-cloud model serving. Sensitive data never leaves your perimeter, at fixed cost.
vLLM · Ollama · own-cloud GPU

Evals & Test Harness · from £3k
Prompt and agent regression testing in CI: every change scored against golden sets.

AI Workflow Automation · from £3k
Self-hosted automation with LLM steps where they earn their keep: approvals, retries, audit.
n8n · Temporal · human-in-the-loop

RAG & Document Ingestion · from £4k
Contracts, wikis and PDFs become answerable knowledge: parsing, chunking and retrieval that hold up in production.
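For the Model Gateway setup, here is a minimal sketch of its two core behaviours, failover and per-team spend caps, in plain Python. LiteLLM, for example, ships its own fallback and budget features. The provider callables, prices and `ProviderError` here are hypothetical stand-ins.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple

# A provider client returns (text, tokens_used); names and prices below are hypothetical.
ProviderCall = Callable[[str], Tuple[str, int]]


class ProviderError(Exception):
    """Raised by a provider client on timeout, rate limit or outage."""


class BudgetExceeded(Exception):
    """Raised when a team has hit its spend cap."""


@dataclass
class Gateway:
    providers: List[Tuple[str, ProviderCall, float]]  # (name, call, GBP per 1k tokens), preferred first
    budgets: Dict[str, float]                          # team -> spend cap in GBP
    spent: Dict[str, float] = field(default_factory=dict)

    def complete(self, team: str, prompt: str) -> Tuple[str, str]:
        """Try providers in order and bill the team for the one that answers; returns (provider, text)."""
        if self.spent.get(team, 0.0) >= self.budgets[team]:
            raise BudgetExceeded(team)
        errors: List[str] = []
        for name, call, gbp_per_1k in self.providers:
            try:
                text, tokens = call(prompt)
            except ProviderError as exc:
                errors.append(f"{name}: {exc}")
                continue  # fail over to the next provider
            self.spent[team] = self.spent.get(team, 0.0) + tokens / 1000 * gbp_per_1k
            return name, text
        raise ProviderError("all providers failed: " + "; ".join(errors))
```

The cap is checked before each call, so a single request can overshoot it slightly. A stricter version would reserve an estimated cost up front.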

Recruiter? I'm not taking permanent roles. But your client with the stuck AI project? That contract Hardcore Engineering will take.

Hardcore Engineering supplies services company-to-company: statement-of-work, deliverable-based engagements structured for outside-IR35 working (status determination sits with the client). If you have a client with a stuck AI project, a build to ship, or a team that needs senior AI capacity, that is a contract for the company. Refer one that closes and there's a 10–15% referral fee in it for you.

Here's a pitch you can copy-paste to a client:

Hardcore Engineering (founder: Stephan Schielke, 17+ yrs) delivers senior AI and agentic-workflow engineering: fixes agents that break in production, builds them fixed-scope, secures them before launch. Remote worldwide (all TZs) or in-house London. B2B, statement-of-work. Bands: audit from £6k, cost rescue from £4k, evals from £8k, build from £15k, red-team from £7k, platform setups from £2.5k. Details and contact: hardcore.engineer/services

Not sure which of these fits?

Describe what is stuck in three sentences. Worst case you get a pointer back; best case it becomes a scoped plan.

Three sentences is enough · worst case a pointer, best case a plan