Engineering, on contract.

Hardcore Engineering Services

Hands-on engineering for teams whose AI agents have to work in production, not just in the demo. Audits, cost rescue, evals, builds, red-teaming. Indicative price bands below. Every engagement is scoped before invoice.

Availability: Open · taking bookings

Remote: Worldwide · all TZs

In-house: London · +20% premium

Insured: Professional Indemnity insurance to £1M

Included in every engagement

Tokenmaxxing: spend less on models, not less on quality.

Most LLM bills run 3–10× larger than they need to be: the wrong model for the job, no caching, no observability, single-vendor lock-in, bloated context. Every engagement here ships the same cost-and-context pass I run on my own gateway daily.

Context & tool-use discipline · No vendor lock-in · Open-source model routing · Spend observability

#1 Agent Audit

from £6,000

1–2 weeks · fixed fee · remote or on-site

Your agent behaves differently on identical inputs and nobody can explain why. In two weeks you get the architecture mapped, the failure modes catalogued, a cost and latency baseline, and a written 90-day roadmap your team can act on.

Stack: promptfoo · Arize Phoenix · OpenTelemetry

Architecture & failure-mode review
Cost, latency & token-spend baseline
Eval-readiness assessment
Written 90-day roadmap

The pattern

A support-triage agent files the same ticket under three different labels in one afternoon, and the debugging thread is all screenshots, no repro. The usual culprits: an unpinned model version drifting silently, and no eval baseline to catch it. Pinning the version and building a golden set from real traffic turns "nobody can say why" into a ranked failure-mode list your team can start fixing on Monday.
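
A minimal sketch of the golden-set check described above, not a finished harness. The ticket texts, labels and the `stub_classify` function are invented for illustration; in practice `classify` would wrap a call to a model pinned to one exact version, and the golden set would come from real traffic.

```python
from collections import Counter
from typing import Callable

# Hypothetical golden set: (ticket text, expected label) pairs.
GOLDEN_SET = [
    ("I can't log in after the password reset", "account-access"),
    ("I was charged twice this month", "billing"),
    ("The export button does nothing", "bug"),
]

def rank_failures(classify: Callable[[str], str], runs: int = 3) -> list[tuple[str, int]]:
    """Run each golden ticket several times and rank 'expected -> got' mismatches."""
    failures: Counter[str] = Counter()
    for text, expected in GOLDEN_SET:
        for _ in range(runs):
            got = classify(text)
            if got != expected:
                failures[f"{expected} -> {got}"] += 1
    # Most frequent failure mode first.
    return failures.most_common()

def stub_classify(text: str) -> str:
    # Stand-in for the pinned-model call; it deliberately mislabels billing tickets.
    if "charged" in text:
        return "bug"
    return "account-access" if "log in" in text else "bug"

ranked = rank_failures(stub_classify)
```

Repeating each input several times is what exposes non-determinism: a label that flips between runs shows up as a partial failure count rather than a clean pass or fail.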

Why me

I run multi-model routing, evals and OpenTelemetry tracing on my own production gateway every day. The same discipline lands in your report, not textbook theory.

Email about an audit

30-min scoping call · bring your weirdest trace

#2 LLM Cost Rescue

from £4,000

1–2 weeks · fixed fee · savings estimate before you commit

Your inference bill scales faster than your user base. Right-sized models per task, caching, prompt diet and open-source routing: the tokenmaxxing pass as a stand-alone rescue.

Stack: LiteLLM · OmniRoute · Qwen / Llama · semantic cache

Per-route model right-sizing (incl. OSS candidates)
Caching & context-window diet
Spend observability dashboard
Vendor lock-in exit options, costed

The pattern

The bill climbs faster than the user base, because every query (password resets and FAQ lookups included) goes through a frontier model at premium per-token rates. Deflecting cheap queries to a small open-source model behind a routing layer, plus a semantic cache for repeats, commonly halves the bill before any deeper work starts.
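
A rough sketch of the routing-plus-cache idea, under loud assumptions: the model names, the keyword list and the `call_model` callable are all hypothetical, and the cache here is a normalised exact-match cache standing in for a real semantic cache, which would compare embeddings instead of strings.

```python
# Hypothetical keyword list marking queries a small model can handle.
CHEAP_INTENTS = ("password reset", "reset my password", "opening hours", "faq")

def pick_model(query: str) -> str:
    """Route routine queries to a small model; everything else to the frontier model."""
    q = query.lower()
    return "small-oss-model" if any(k in q for k in CHEAP_INTENTS) else "frontier-model"

class CachedRouter:
    def __init__(self, call_model):
        self.call_model = call_model      # hypothetical: (model, query) -> answer
        self.cache: dict[str, str] = {}
        self.calls: list[str] = []        # which model each uncached query was sent to

    def answer(self, query: str) -> str:
        # Normalise case and whitespace so trivial variants share one cache entry.
        key = " ".join(query.lower().split())
        if key in self.cache:
            return self.cache[key]
        model = pick_model(query)
        self.calls.append(model)
        result = self.call_model(model, query)
        self.cache[key] = result
        return result

router = CachedRouter(lambda model, query: f"[{model}] reply")
first = router.answer("How do I reset my password?")
repeat = router.answer("how do I  reset my password?")   # served from cache
hard = router.answer("Compare these two contracts clause by clause")
```

The `calls` list is the seed of spend observability: every uncached query is attributed to a model, so the split between cheap and premium traffic can be measured rather than guessed.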

Why me

Tokenmaxxing is my daily practice. I don't just recommend OSS model routing with per-team budgets and spend observability; I run it on my own stack.

Get a spend review

Send one month's bill · get a savings estimate

#3 Agent Evals & Reliability

from £8,000

2–3 weeks · fixed scope · golden sets from your traffic

You change a prompt and have no idea if you made it better or worse. I build the eval harness, regression gates and guardrails that let you ship changes weekly with evidence, not vibes.

Stack: promptfoo · Arize Phoenix · CI eval gates

Eval harness wired into CI
Golden datasets from your real traffic
Regression gates & release checklist
Guardrails for the failure modes that matter

The pattern

A one-line prompt tweak ships on Tuesday. The regression surfaces on Friday, in a customer complaint, and the rollback debate takes longer than the change did. A CI eval harness with golden sets drawn from real traffic moves that discovery to before merge: the regression fails a gate instead of paging support. Releases go from monthly-and-nervous to weekly.
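
A minimal sketch of a regression gate of the kind described above. The golden examples, the `candidate` stand-in and the 0.02 tolerance are assumptions for illustration; a real harness would call the agent with the changed prompt and load the baseline from the last released run.

```python
def pass_rate(predict, golden):
    """Fraction of golden examples the candidate gets right."""
    hits = sum(1 for text, expected in golden if predict(text) == expected)
    return hits / len(golden)

def gate(candidate_rate, baseline_rate, tolerance=0.02):
    """Pass only if the candidate is no worse than the baseline minus a tolerance."""
    return candidate_rate >= baseline_rate - tolerance

# Hypothetical golden set drawn from real traffic.
GOLDEN = [
    ("refund please", "billing"),
    ("app crashes", "bug"),
    ("change my email", "account"),
    ("invoice missing", "billing"),
]

def candidate(text):
    # Stand-in for the agent running the tweaked prompt; it misses invoice tickets.
    if "refund" in text:
        return "billing"
    return "bug" if "crash" in text else "account"

rate = pass_rate(candidate, GOLDEN)
ok = gate(rate, baseline_rate=1.0)

# In CI this value would be passed to sys.exit() so a regression fails the job.
exit_code = 0 if ok else 1
```

Here the tweak drops the pass rate from the assumed baseline of 1.0 to 0.75, so the gate fails before merge instead of on Friday.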

Why me

Eval-first is how I build my own agents. Harnesses, guardrails and failure-mode catalogues are standing tooling in my stack, not a research project.

Scope an eval harness

Bring one flaky agent · leave with a test plan

#4 Agentic Workflow Build

from £15,000

4–6 weeks · fixed scope · written change-control

You prototyped an agent in a notebook but nobody is comfortable putting it in front of customers. I take it to production: tracing, evals and human-in-the-loop fallback baked in. You own the code; scope changes are quoted, never silently absorbed.

Stack: your framework or mine · OpenTelemetry · MCP tools

End-to-end production agent or workflow
Observability, evals & guardrails from day one
Deploy + handover documentation
2-week warranty: bug fixes & stability; new features scoped separately

The pattern

A contract-summarisation prototype wins every internal demo and ships to zero customers, because nobody will sign off on running it unsupervised. Productionising it means structured outputs with schema validation, tracing on every tool call, eval-gated deploys, and a human-in-the-loop fallback for low-confidence output. That is the difference between a demo and a system you can put in front of customers.
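
A small sketch of schema validation with a human-in-the-loop fallback, using only the standard library. The field names, the 0.8 threshold and the idea that the model reports its own `confidence` are assumptions for illustration, not a description of any particular build.

```python
import json

# Hypothetical schema for a contract summary: field name -> required type.
REQUIRED = {"parties": list, "term_months": int, "confidence": float}

def validate(raw: str) -> dict:
    """Parse model output and check it against the expected schema."""
    data = json.loads(raw)
    if not isinstance(data, dict):
        raise ValueError("top-level JSON value must be an object")
    for field, kind in REQUIRED.items():
        if not isinstance(data.get(field), kind):
            raise ValueError(f"field {field!r} missing or not {kind.__name__}")
    return data

def route(raw: str, threshold: float = 0.8) -> str:
    """Return 'auto' for valid, confident output; otherwise 'human_review'."""
    try:
        summary = validate(raw)
    except ValueError:  # json.JSONDecodeError is a subclass of ValueError
        return "human_review"
    return "auto" if summary["confidence"] >= threshold else "human_review"

good = route('{"parties": ["Acme", "Beta"], "term_months": 12, "confidence": 0.93}')
unsure = route('{"parties": ["Acme", "Beta"], "term_months": 12, "confidence": 0.41}')
broken = route("Sure! Here is the summary you asked for...")
```

Malformed output and low-confidence output take the same path to a reviewer, which is what makes unsupervised sign-off unnecessary in the first place.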

Why me

17 years shipping production systems: apprentice, then Staff Platform Engineer at Tractable AI, then founding engineer at Intropy. Taking prototypes to production is the job I have done for a decade.

Get a scoped quote

Describe the prototype and where it is stuck · written quote follows

#5 Red-Team & Injection-Proofing

from £7,000

1–2 weeks · fixed scope · report + fixes

Your agent has tools that touch real systems (databases, email, payments), and you have never tested what a malicious input makes it do. I attack it the way an adversary would, then close the paths I find.

Stack: OWASP LLM Top 10 · MCP sandboxing · tool-call approval

Indirect prompt-injection & tool-abuse testing
Attack-vector map across your tools & MCP surface
Provenance, allow-listing & approval-gating fixes
Written findings + severity ranking

The pattern

An agent with database and email tools obeys whatever it reads, and one poisoned document in the knowledge base is enough to steer it into mailing out what it can see. A structured red-team pass against the OWASP LLM Top 10 reliably surfaces exploit paths like these; the fixes are tool-call approval gates, input provenance tagging, and an action allow-list. Close the paths before launch, not after the incident.
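
One way the three fixes named above could fit together, as a sketch only. The tool names, the two provenance values and the policy itself are hypothetical; a real deployment would derive provenance from where the triggering text entered the context, not from a field the model fills in.

```python
from dataclasses import dataclass

ALLOWED_TOOLS = {"search_kb", "read_ticket", "send_email"}  # action allow-list
NEEDS_APPROVAL = {"send_email"}                             # tools with side effects

@dataclass
class ToolCall:
    tool: str
    args: dict
    provenance: str  # "user" or "retrieved": where the triggering instruction came from

def decide(call: ToolCall) -> str:
    """Return 'allow', 'needs_approval' or 'deny' for a proposed tool call."""
    if call.tool not in ALLOWED_TOOLS:
        return "deny"
    if call.tool in NEEDS_APPROVAL:
        # Text found in retrieved documents is data, not a command:
        # it may never trigger a side-effecting tool.
        if call.provenance != "user":
            return "deny"
        return "needs_approval"
    return "allow"

poisoned = decide(ToolCall("send_email", {"to": "attacker@example.com"}, "retrieved"))
legit = decide(ToolCall("send_email", {"to": "customer@example.com"}, "user"))
lookup = decide(ToolCall("search_kb", {"q": "refund policy"}, "retrieved"))
```

With this policy the poisoned-document path ends in a denial, a genuine user request to send mail still waits for a human, and read-only lookups stay frictionless.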

Why me

SRE and reverse-engineering background. I think about attack surface and blast radius by instinct, which is exactly the lens agent tooling needs and rarely gets.

Book a red-team pass

Before you give an agent write-access · not after

AI Platform Setups

Repeatable, fixed-price installs that drop into your stack. Tooling only: if you need the diagnosis and the golden datasets too, that is one of the engagements above. DevOps and platform engineering is my home turf: these are the foundations that make everything above cheaper and safer to run.

Vector DB + Agentic Memory

from £3k

Semantic retrieval + cross-session memory: agents that find by meaning and remember.

Qdrant · Weaviate · pgvector

Add to your stack

Team Agent Harness

from £2.5k

One coding-agent setup for the whole team: shared config, shared skills, guardrails. No more ten divergent setups.

opencode · shared skills · agent profiles

Equip your team

Model Gateway

from £3k

One endpoint, many models. Failover, budgets, spend caps: swap providers without touching app code.

LiteLLM · OpenRouter · self-hosted routing

Kill the lock-in

Central MCP Gateway

from £3k

One governed surface for every tool your agents can call: auth, audit and allow-listing.

MCP · gateway pattern · audit log

Govern your tools

LLM Observability

from £4k

Every token, span and pound visible. Traces, spend metrics and dashboards your on-call will actually open.

Grafana · Prometheus · Langfuse

See your spend

Private / Local LLM

from £5k

Local or own-cloud model serving. Sensitive data never leaves your perimeter, at fixed cost.

vLLM · Ollama · own-cloud GPU

Own your models

Evals & Test Harness

from £3k

Prompt and agent regression testing in CI: every change scored against golden sets.

promptfoo · Arize Phoenix · CI gates

Test your prompts

AI Workflow Automation

from £3k

Self-hosted automation with LLM steps where they earn their keep: approvals, retries, audit.

n8n · Temporal · human-in-the-loop

Automate a process

RAG & Document Ingestion

from £4k

Contracts, wikis and PDFs become answerable knowledge: parsing, chunking and retrieval that hold up in production.

LlamaIndex · Unstructured · Firecrawl

Feed your agents

Recruiter? I'm not taking permanent roles. But your client with the stuck AI project? That contract Hardcore Engineering will take.

Hardcore Engineering supplies services company-to-company: statement-of-work, deliverable-based engagements structured for outside-IR35 working (status determination sits with the client). If you have a client with a stuck AI project, a build to ship, or a team that needs senior AI capacity, that is a contract for the company. Refer one that closes and there's a 10–15% referral fee in it for you.

Here's a pitch you can copy-paste to a client:

Hardcore Engineering (founder: Stephan Schielke, 17+ yrs) delivers senior AI and agentic-workflow engineering: fixes agents that break in production, builds them fixed-scope, secures them before launch. Remote worldwide (all TZs) or in-house London. B2B, statement-of-work. Bands: audit from £6k, cost rescue from £4k, evals from £8k, build from £15k, red-team from £7k, platform setups £2.5k–£5k. Details and contact: hardcore.engineer/services

Email · LinkedIn · GitHub · Stack Overflow · About + contact

Not sure which of these fits?

Describe what is stuck in three sentences. Worst case you get a pointer back; best case it becomes a scoped plan.

Email what's stuck

Three sentences is enough · worst case a pointer, best case a plan