A real model. Right here. Right now.
Five demos for software developers. Same model, five lenses on what's actually new.
Where do model files come from? · 💾 local · 📦 cache · 🌐 internet
Loading order: 1) local folder ./models/<model-id>/ if present, 2) browser Cache API (already-downloaded weights), 3) HuggingFace CDN. The progress text above tells you which path the loader took.
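The loading order above can be sketched as a small resolver. This is a minimal sketch, not the demo's actual loader: `resolveModelSource` and the two probe callbacks are hypothetical names, and a real page would presumably back them with a `fetch` against `./models/<model-id>/` and a lookup in the browser Cache API.

```javascript
// Hypothetical sketch of the three-step loading order; not the demo's real loader.
// hasLocal and hasCached are injected probes so the logic can run outside a browser.
function resolveModelSource(modelId, hasLocal, hasCached) {
  if (hasLocal(modelId)) {
    return { source: "local", location: `./models/${modelId}/` };
  }
  if (hasCached(modelId)) {
    return { source: "cache", location: "browser Cache API" };
  }
  // Nothing on disk or in the cache: fall back to downloading.
  return { source: "internet", location: "HuggingFace CDN" };
}

// Pretend one model was pre-downloaded into the local folder.
const onDisk = new Set(["Qwen3-1.7B-q4f16_1-MLC"]);
const picked = resolveModelSource(
  "Qwen3-1.7B-q4f16_1-MLC",
  (id) => onDisk.has(id),
  () => false
);
console.log(picked.source);
```

Keeping the probes as parameters means the same function can drive the progress text and be exercised without a network.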
To pre-download into the local folder (works fully offline, no browser cache dependency):
./download-models.sh # default: Llama-3.2-1B + Qwen3-0.6B
./download-models.sh all # all 8 models
./download-models.sh Qwen3-1.7B-q4f16_1-MLC # specific id
# Then serve over http (file:// blocks fetch in Chrome):
python3 -m http.server 8000
# Open: http://localhost:8000/live-demo.html
Streaming chat: first-token latency & tok/s
A real model, generating one token at a time, on this laptop. Watch first-token latency and tokens/sec. Both are real numbers, not animations.
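Both numbers fall out of token arrival timestamps. The sketch below is one way to compute them, assuming tok/s is measured over the decode phase only (after the first token arrives); the demo may define it differently. `streamMetrics` is a hypothetical helper, and the timestamps are synthetic.

```javascript
// Compute first-token latency and throughput from token arrival timestamps.
// requestStartMs: when the prompt was sent. tokenTimesMs: arrival time of each token.
function streamMetrics(requestStartMs, tokenTimesMs) {
  if (tokenTimesMs.length === 0) {
    return { firstTokenMs: null, tokensPerSec: 0 };
  }
  const firstTokenMs = tokenTimesMs[0] - requestStartMs;
  const lastMs = tokenTimesMs[tokenTimesMs.length - 1];
  // Assumption: throughput covers the decode phase only, so slow prompt
  // processing shows up in firstTokenMs rather than in tok/s.
  const decodeMs = lastMs - tokenTimesMs[0];
  const decodedTokens = tokenTimesMs.length - 1;
  const tokensPerSec = decodeMs > 0 ? (decodedTokens * 1000) / decodeMs : 0;
  return { firstTokenMs, tokensPerSec };
}

// Synthetic run: first token after 400 ms, then one token every 50 ms.
const metrics = streamMetrics(1000, [1400, 1450, 1500, 1550, 1600]);
console.log(metrics);
```

In a browser, `performance.now()` would be a natural source for these timestamps.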
Reasoning effort: same prompt, three system prompts
Frontier labs ship a single reasoning_effort knob. Small open models don't have it natively, but you can fake it with system prompts. Same model, three settings. Watch tokens out and time go up; watch quality go up too.
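A minimal sketch of the trick, assuming an OpenAI-style chat message array. The three prompt texts and the names `EFFORT_PROMPTS` and `buildMessages` are illustrative, not the demo's actual wording.

```javascript
// Hypothetical system prompts that emulate a reasoning-effort setting.
// The wording is illustrative; tune it for the model you load.
const EFFORT_PROMPTS = new Map([
  ["low", "Answer in one short sentence. Do not explain your reasoning."],
  ["medium", "Think briefly, then answer in a few sentences."],
  ["high", "Reason step by step, check your work, then give the final answer."],
]);

// Build a chat message array for a given effort level.
function buildMessages(effort, userPrompt) {
  const system = EFFORT_PROMPTS.get(effort);
  if (system === undefined) {
    throw new Error(`Unknown effort level: ${effort}`);
  }
  return [
    { role: "system", content: system },
    { role: "user", content: userPrompt },
  ];
}

// Same user prompt, three different system prompts.
const variants = ["low", "medium", "high"].map((effort) =>
  buildMessages(effort, "Is 91 a prime number?")
);
console.log(variants);
```

Only the system message changes between runs, which is what makes the comparison fair.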
Same model, same prompt, different role
A reminder to the audience: the model is one weights file. The behaviour they see in production is mostly the system prompt. Pick a persona; type a prompt; same weights answer like a Senior SRE, a Skeptical Reviewer, or a 10-year-old.
Tool calling: model emits JSON, browser executes, result fed back
This is the single most important architectural shift. The model decides: “I need to call calculator with these args.” The browser runs the JS. The result is fed back. The loop continues until the model returns a final answer. These presets are intentionally simple so 1–3B models can copy the JSON pattern reliably.
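The loop can be sketched in a few lines. Everything here is an assumption made for illustration: the `{ "tool": ..., "args": ... }` JSON shape, the `tool` message role, and the names `TOOLS`, `parseToolCall` and `runToolLoop`. A scripted `fakeModel` stands in for the real model so the sketch runs on its own.

```javascript
// Hypothetical tool registry: plain functions keyed by name.
const TOOLS = new Map([
  ["calculator", ({ a, b, op }) => {
    switch (op) {
      case "add": return a + b;
      case "sub": return a - b;
      case "mul": return a * b;
      case "div": return a / b;
      default: throw new Error(`Unsupported op: ${op}`);
    }
  }],
]);

// Returns the parsed tool call, or null when the text is a plain final answer.
function parseToolCall(text) {
  try {
    const parsed = JSON.parse(text);
    return parsed && typeof parsed.tool === "string" ? parsed : null;
  } catch {
    return null;
  }
}

// generate(messages) stands in for the model call and returns a string.
function runToolLoop(generate, userPrompt, maxSteps = 5) {
  const messages = [{ role: "user", content: userPrompt }];
  for (let step = 0; step < maxSteps; step++) {
    const reply = generate(messages);
    messages.push({ role: "assistant", content: reply });
    const call = parseToolCall(reply);
    if (call === null) return reply; // no tool requested: final answer
    const tool = TOOLS.get(call.tool);
    const result = tool ? tool(call.args) : `Unknown tool: ${call.tool}`;
    // Feed the result back so the next model turn can use it.
    messages.push({ role: "tool", content: JSON.stringify({ result }) });
  }
  throw new Error("Tool loop did not finish within maxSteps");
}

// Scripted stand-in for a model: asks for the calculator once, then answers.
function fakeModel(messages) {
  const last = messages[messages.length - 1];
  if (last.role === "tool") {
    return `The answer is ${JSON.parse(last.content).result}.`;
  }
  return JSON.stringify({ tool: "calculator", args: { a: 6, b: 7, op: "mul" } });
}

const finalAnswer = runToolLoop(fakeModel, "What is 6 times 7?");
console.log(finalAnswer);
```

The `maxSteps` cap is a design choice: a small model that keeps emitting tool calls should fail loudly rather than spin forever.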
Multi-agent orchestration: Planner → Researcher → Critic
One model. Three roles. Sequential calls. Each agent's output is the next agent's input. This is the basic pattern behind tools like Claude Code, Devin, and Cursor Composer, minus a thousand engineering details. The architecture is simple; that is the point.
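The hand-off between roles can be sketched as a plain loop. The role prompts and the names `ROLES`, `runPipeline` and `echoModel` are hypothetical; `callModel` stands in for one call to the loaded model, and the stub used here only tags its input so the chaining is visible.

```javascript
// Hypothetical role definitions: each agent is just a system prompt.
const ROLES = [
  { name: "Planner", system: "Break the task into numbered steps." },
  { name: "Researcher", system: "Carry out the plan and report findings." },
  { name: "Critic", system: "Point out gaps or errors in the findings." },
];

// callModel(system, input) stands in for one model call and returns a string.
function runPipeline(roles, callModel, task) {
  const transcript = [];
  let input = task;
  for (const role of roles) {
    const output = callModel(role.system, input);
    transcript.push({ agent: role.name, output });
    input = output; // the next agent sees only the previous agent's output
  }
  return transcript;
}

// Stub model that prefixes its input with the first word of the system prompt.
const echoModel = (system, input) => `[${system.split(" ")[0]}] ${input}`;

const transcript = runPipeline(ROLES, echoModel, "Compare SQLite and Postgres");
console.log(transcript.map((t) => `${t.agent}: ${t.output}`).join("\n"));
```

Swapping `echoModel` for a real model call is the only change needed to turn the stub into the pipeline described above.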
Evals: score multiple models on your own test set
The closing line of the talk says it: vendor benchmarks are marketing. Your eval is leverage. This tab lets you pick 2-3 models, pick a test set, run all questions sequentially against each model, score with a deterministic check, and compare. Same browser. No API. No spreadsheet. Click Run eval.
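A deterministic eval fits in a short function. The sketch below assumes the check is a case-insensitive substring match, which may differ from the check this tab actually uses. The test set, `passes`, `runEval` and the stub models are all hypothetical, and the stubs return canned answers so nothing needs to be downloaded.

```javascript
// Hypothetical test set: a prompt plus a substring the answer must contain.
const TEST_SET = [
  { prompt: "What is 2 + 2?", expected: "4" },
  { prompt: "Capital of France?", expected: "Paris" },
  { prompt: "Opposite of hot?", expected: "cold" },
];

// Deterministic check: case-insensitive substring match, no judge model involved.
function passes(answer, expected) {
  return answer.toLowerCase().includes(expected.toLowerCase());
}

// models maps a model name to an answer function (a stand-in for a model call).
function runEval(models, testSet) {
  const scores = {};
  for (const [name, answer] of Object.entries(models)) {
    let correct = 0;
    for (const testCase of testSet) {
      if (passes(answer(testCase.prompt), testCase.expected)) correct++;
    }
    scores[name] = { correct, total: testSet.length };
  }
  return scores;
}

// Stub "models" with canned answers so the sketch runs without weights.
const cannedA = new Map([
  ["What is 2 + 2?", "It is 4."],
  ["Capital of France?", "paris"],
]);
const stubModels = {
  "model-a": (prompt) => cannedA.get(prompt) ?? "unsure",
  "model-b": () => "I think the answer is 4.",
};

const scores = runEval(stubModels, TEST_SET);
console.log(scores);
```

A substring check is crude (an answer of "14" would pass the first case), but it is repeatable, and that is what makes scores comparable across models.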