What's Changing in AI
20 minutes · Software developers · Skepticism-first · Demo-led
Engineers are a hostile audience for AI talks. They've sat through too many demos, are allergic to hype, and are deeply protective of their craft. This page is the talk script — exact talking points, time allocations, and Q&A prep. The companion Live Demo page runs a real LLM in the browser for the show-don't-tell segment.
Opening Line (memorise this)
“METR ran a randomised controlled trial in 2025. Sixteen senior open-source maintainers. Their own repos, their own tasks. With AI, they got 19% slower. They believed they got 20% faster. That's a 39-point self-perception gap — and that gap is the talk. For the next twenty minutes I'm going to show you what's actually changed under the hood, run a real model on this laptop with no internet, and hand you a two-week experiment you can run on Monday. If by minute twenty you don't have one thing you're going to try this sprint, I've failed.”
— Concede the priors before the room raises them. Cite METR (arXiv:2507.09089) on screen — engineers verify citations.
Preparation Checklist
Do these before the talk. Each one earns 5–10 minutes of credibility on stage.
The Update Flow · 6 beats
Six beats for delivering this content to a senior engineering audience without losing them in the first ten minutes.
Time Allocation
Recommended pacing for a tight 20-minute talk plus 5-minute Q&A buffer. The live demo does the heavy lifting — keep the inflection-point overview deliberately brief.
The 5 Inflection Points · talking-point scripts
Five structurally different things that didn't exist 12 months ago. Each card has the verified evidence, the implication, and the exact line to land it.
Open the Live Demo page now
Don't read the next bullets — switch tabs. The Live Demo page boots a real Llama or Qwen model 100% inside the browser (WebGPU, no API calls, works offline). Walk through the six tabs in this order: Chat → Reasoning Knob → Persona → Tools → Orchestration → Evals.
Tee up each tab with a falsifiable prediction — engineers trust speakers who let the demo prove (or break) them. If a prediction misses, name it on stage; that's the credibility move.
- Chat — ~30 sec to show first-token latency and tokens/sec on local hardware. Predict: “TTFT under 2 seconds, ~25–40 tok/s on this MacBook.”
- Reasoning Knob — same prompt, three system prompts. Predict: “Watch output tokens grow roughly 3× from low to high, and watch the answer get measurably better.”
- Persona — same weights, same prompt; system prompt changes the output completely. Predict: “Two personas will give near-opposite recommendations on the same code snippet.”
- Tools — model emits JSON, browser executes a calculator/HTTP fetch, result fed back. Predict: “Malformed JSON ~10% of calls — you'll see a retry. That's why structured-output mode exists.”
- Orchestration — Planner → Researcher → Critic, all the same model, different roles. The “Received from upstream agent” chip on each pane shows how data flows. Predict: “The Critic will catch at least one error the Planner missed. Same weights — the architecture is what raised the floor.”
- Evals — pick 2–3 models, pick a test set, click Run. The harness loads each model, runs every question, scores deterministically, and shows a comparison bar chart. Predict: “Bigger model wins on code-fix; smaller model is within 1 point on basic math but 3–4× faster. This is exactly the artefact I'll ask you to email me at the end.”
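The Tools beat above (model emits JSON, the host executes the tool, the result is fed back, with a retry on malformed JSON) can be sketched in a few lines of Python. This is a minimal illustration, not the Live Demo page's code: the "model" is a stand-in function that fails once on purpose, and every name here (`calculator`, `TOOLS`, `parse_tool_call`, `run_with_retry`, `flaky_model`) is hypothetical.

```python
import json
from typing import Callable, Dict, List


def calculator(args: dict) -> float:
    """Toy tool: apply one arithmetic operator to two numbers."""
    ops = {"add": lambda a, b: a + b, "mul": lambda a, b: a * b}
    return ops[args["op"]](args["a"], args["b"])


TOOLS: Dict[str, Callable[[dict], object]] = {"calculator": calculator}


def parse_tool_call(raw: str) -> dict:
    """Parse and validate a tool call; raise ValueError on anything malformed."""
    call = json.loads(raw)  # json.JSONDecodeError is a subclass of ValueError
    if not isinstance(call, dict) or call.get("tool") not in TOOLS:
        raise ValueError("unknown or missing tool name")
    if not isinstance(call.get("args"), dict):
        raise ValueError("'args' must be a JSON object")
    return call


def run_with_retry(model: Callable[[List[str]], str], prompt: str, max_attempts: int = 3) -> dict:
    """Ask the model for a tool call, retrying with feedback when the JSON is bad."""
    messages = [prompt]
    for attempt in range(1, max_attempts + 1):
        raw = model(messages)
        try:
            call = parse_tool_call(raw)
        except ValueError as err:
            # Feed the error back so the next attempt can correct itself.
            messages.append(f"Invalid tool call ({err}). Reply with JSON only.")
            continue
        result = TOOLS[call["tool"]](call["args"])  # tool errors propagate in this sketch
        return {"attempts": attempt, "result": result}
    raise RuntimeError(f"no valid tool call after {max_attempts} attempts")


def flaky_model(messages: List[str]) -> str:
    """Stand-in model: truncated JSON on the first try, valid JSON after feedback."""
    if len(messages) == 1:
        return '{"tool": "calculator", "args": {"op": "mul", "a": 6, "b": 7'
    return '{"tool": "calculator", "args": {"op": "mul", "a": 6, "b": 7}}'


outcome = run_with_retry(flaky_model, "What is 6 * 7?")
print(outcome)  # {'attempts': 2, 'result': 42}
```

The design point to land on stage: the retry loop is ordinary defensive parsing. Structured-output mode moves that validation into the decoder so the loop rarely fires.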
If the demo fails on stage · recovery script
Say this verbatim, smile, and keep moving:
“Good. This is the honest part. The model is ~880 MB on a hostile conference network, running on WebGPU, which two-year-old laptops don't support. Here's the cached transcript I ran at 7am this morning: same model, same hardware class. The point still lands: this ran fully offline, with no API call, in a browser tab.”
Pre-cache the model on this laptop before the talk (open live-demo.html on the venue Wi-Fi, click Load model, wait for the green pill). Have a 30-second screen recording of the six tabs ready in a second tab as a hard fallback. Never apologise: engineers respect graceful failure more than perfect demos.
The New Disciplines · what to learn this year
Four skills that appreciated fastest in 2026. Land each in one sentence — the audience needs to know what to do on Monday morning.
Reality Check · honest tradeoffs
Engineers smell hype instantly. These are the inconvenient findings worth leading with. If you must cut for time, keep the METR slowdown and the slop problem.
What Would Change My Mind
Steelman the skeptics in the room before they ask. Three concrete falsifiers — if any of these land in the next 18 months, this talk's thesis is wrong and I'll say so on stage at the next one.
- Falsifier 01: METR's task-doubling curve stalls past 14 months. The current 7-month doubling is the load-bearing claim for “agents are real now.” If the 2026 follow-up shows the trend bending, the agent thesis weakens, and so does the case for redesigning your workflow around them.
- Falsifier 02: Prompt-injection mitigations ship and hold for 12 consecutive months with no high-severity bypasses. If that happens, the security objection in the Reality Check evaporates, and the “agents in production” adoption curve gets steeper than I'm forecasting.
- Falsifier 03: The METR 19% slowdown inverts in the 2026 replication. If experienced devs on real codebases now measure faster and believe they're faster, the perception gap was a tooling-maturity artefact, not a structural finding. Reality Check section gets rewritten.
Naming the falsifiers earns the right to make the rest of the claims. Tetlock-style forecasting hygiene is rare in tech talks — that's exactly why it lands.
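The arithmetic behind Falsifier 01 is worth having in your head before Q&A. Here is a minimal Python sketch, assuming clean exponential growth (a simplification; the real trend is noisy) and using the 7-month and 14-month doubling times quoted in that falsifier. The function name `growth_factor` is hypothetical.

```python
def growth_factor(horizon_months: float, doubling_months: float) -> float:
    """How many times larger a quantity gets over the horizon, given its doubling time."""
    if doubling_months <= 0:
        raise ValueError("doubling time must be positive")
    return 2.0 ** (horizon_months / doubling_months)


HORIZON = 18  # months: the falsifier window named in this section

for doubling in (7, 14):
    factor = growth_factor(HORIZON, doubling)
    print(f"doubling every {doubling} months -> about {factor:.1f}x in {HORIZON} months")
```

Under these assumptions, a 7-month doubling compounds to roughly 6x over 18 months, while a 14-month doubling gives roughly 2.4x. That spread is why the doubling time is the number to watch.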
Q&A Prep
Six anticipated questions and suggested answers. Read once before the talk; you'll get at least four of these.
Closing Line (memorise this too)
“The METR doubling is seven months. Your next perf cycle is six. So here's the deal: pick one of the three experiments on the slide, give it two weeks, and email me the result — pass, fail, or weird. I'll publish the aggregate at the end of the quarter. That's our own eval, our own data, our own answer. Not a vendor's benchmark. Not a Twitter thread. Yours.”
— Promise an artefact. Recruit the audience as data. Concrete > philosophical.
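If someone asks what “our own eval” could look like in practice, one hedged answer is an exact-match harness in the spirit of the Evals tab. The Python sketch below is not the Live Demo page's implementation: the models are stand-in functions with canned answers, the test set is a toy, and every name (`Case`, `normalise`, `score`, `compare`, `CASES`, `MODELS`) is hypothetical. The one idea it shows is deterministic scoring: the same inputs always produce the same number, with no model acting as judge.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

Model = Callable[[str], str]  # assumed interface: prompt in, answer out


@dataclass(frozen=True)
class Case:
    prompt: str
    expected: str


def normalise(text: str) -> str:
    """Lowercase and collapse whitespace so trivial formatting is not scored as a miss."""
    return " ".join(text.lower().split())


def score(model: Model, cases: List[Case]) -> float:
    """Exact-match accuracy after normalisation."""
    if not cases:
        raise ValueError("test set is empty")
    hits = sum(normalise(model(c.prompt)) == normalise(c.expected) for c in cases)
    return hits / len(cases)


def compare(models: Dict[str, Model], cases: List[Case]) -> Dict[str, float]:
    """Run every model over every case and return accuracy per model."""
    return {name: score(model, cases) for name, model in models.items()}


# Toy test set; swap in questions from your own codebase.
CASES = [Case("2 + 2", "4"), Case("3 * 3", "9"), Case("10 - 7", "3")]

# Stand-in "models" with canned answers, so the sketch runs without any LLM.
BIG_ANSWERS = {"2 + 2": "4", "3 * 3": "9", "10 - 7": "3"}
SMALL_ANSWERS = {"2 + 2": "4", "3 * 3": " 9 ", "10 - 7": "4"}  # last answer is wrong

MODELS: Dict[str, Model] = {
    "big": lambda prompt: BIG_ANSWERS.get(prompt, ""),
    "small": lambda prompt: SMALL_ANSWERS.get(prompt, ""),
}

results = compare(MODELS, CASES)
for name, accuracy in results.items():
    bar = "#" * round(accuracy * 20)  # crude text bar chart
    print(f"{name:<6} {accuracy:5.1%} {bar}")
```

Exact match is deliberately strict. It suits short factual answers; free-form outputs such as code fixes usually need a different deterministic check, for example running a test suite.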