Cloudflare has a feature most people overlook: you can run AI models directly inside a Worker. No GPU rental, no server setup, no cold start headaches.
They call it Workers AI.
How it works
Workers AI is a binding — just like KV or R2 — declared in wrangler.toml and used as an object in your code. Cloudflare handles the rest: inference runs on their network, as close to your users as possible.
# wrangler.toml
[ai]
binding = "AI"
[vars]
AI_MODEL = "@cf/google/gemma-4-26b-a4b-it"
That’s it. Nothing else to install.
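Entries under [vars] show up as plain string properties on env, right next to the binding. Here is a minimal sketch of reading the model name with a fallback. The helper resolveModel and the EnvVars type are hypothetical names made up for this example, not part of any Cloudflare API, and the fallback model is just the Llama entry from the table later in this post.

```typescript
// Hypothetical shape of the [vars] portion of env; only AI_MODEL is declared above.
interface EnvVars {
  AI_MODEL?: string;
}

// Fallback chosen for illustration only.
const DEFAULT_MODEL = "@cf/meta/llama-3.1-8b-instruct";

// Return the configured model name, or the fallback when the var is missing or blank.
function resolveModel(env: EnvVars): string {
  const configured = env.AI_MODEL?.trim();
  return configured ? configured : DEFAULT_MODEL;
}
```

Swapping models then becomes a config change rather than a code change.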
Calling the model
export default {
  async fetch(request: Request, env: Env): Promise<Response> {
    const response = await env.AI.run(env.AI_MODEL, {
      messages: [
        { role: "system", content: "You are a helpful assistant." },
        { role: "user", content: "Explain Cloudflare Workers in one paragraph." },
      ],
      max_tokens: 1024,
    });

    // Accept both response shapes: OpenAI-style choices or a flat response field.
    const text =
      response.choices?.[0]?.message?.content ||
      response.response ||
      "";

    return new Response(text);
  },
};
One thing to watch: the response shape depends on the model. Some return an OpenAI-compatible object with the text at choices[0].message.content, while others return a flat response field. Handle both to be safe.
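If you call more than one model, it can help to pull that fallback logic into a single function. This is a sketch: ChatResult and extractText are names invented here, and the type only covers the two fields mentioned above rather than the full response any particular model sends back.

```typescript
// Structural type covering the two shapes described above; not an official Workers AI type.
interface ChatResult {
  choices?: { message?: { content?: string | null } }[];
  response?: string;
}

// Return the generated text from either shape, or "" when neither field is present.
function extractText(result: ChatResult): string {
  const fromChoices = result.choices?.[0]?.message?.content;
  if (typeof fromChoices === "string" && fromChoices.length > 0) {
    return fromChoices;
  }
  return typeof result.response === "string" ? result.response : "";
}
```

Inside a handler you would pass it whatever env.AI.run resolved to and use the returned string as the reply.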
Available models
Cloudflare supports a solid range — text generation, image, speech, embeddings. The ones I reach for most:
| Model | Good for |
|---|---|
| @cf/google/gemma-4-26b-a4b-it | Chat, content generation |
| @cf/meta/llama-3.1-8b-instruct | Instruction following |
| @cf/baai/bge-base-en-v1.5 | Embeddings, semantic search |
Full list at developers.cloudflare.com/workers-ai/models.
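For the embeddings row, "semantic search" usually comes down to comparing vectors by cosine similarity. Here is a small sketch of that comparison, assuming you already have the vectors in hand. The function names and the short toy vectors in the usage note are made up for illustration; real embeddings from bge-base-en-v1.5 have 768 dimensions.

```typescript
interface EmbeddedDoc {
  id: string;
  vector: number[];
}

// Cosine similarity between two equal-length vectors: dot(a, b) / (|a| * |b|).
function cosineSimilarity(a: number[], b: number[]): number {
  if (a.length !== b.length) {
    throw new Error("vectors must have the same length");
  }
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    const x = a[i] ?? 0;
    const y = b[i] ?? 0;
    dot += x * y;
    normA += x * x;
    normB += y * y;
  }
  // A zero vector has no direction, so treat it as unrelated to everything.
  if (normA === 0 || normB === 0) return 0;
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// Score every document against the query vector and sort best match first.
function rankBySimilarity(query: number[], docs: EmbeddedDoc[]): { id: string; score: number }[] {
  return docs
    .map((d) => ({ id: d.id, score: cosineSimilarity(query, d.vector) }))
    .sort((x, y) => y.score - x.score);
}
```

For example, ranking the query [1, 0] against documents with vectors [0, 1] and [1, 0.1] puts the second document first, because it points in nearly the same direction as the query.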
Adding memory with KV
Workers are stateless by default — every request starts fresh. If you want the AI to remember context across calls, store it in KV:
// Load chat history from KV
const raw = await env.STATE.get(`chat:${userId}`);
const history = raw ? JSON.parse(raw) : [];

// Append the new message
history.push({ role: "user", content: userMessage });

// Call AI with full history
const response = await env.AI.run(env.AI_MODEL, {
  messages: [{ role: "system", content: SYSTEM_PROMPT }, ...history],
});
// Handle both response shapes, as in the first example
const reply =
  response.choices?.[0]?.message?.content || response.response || "";

// Persist for 24 hours
history.push({ role: "assistant", content: reply });
await env.STATE.put(`chat:${userId}`, JSON.stringify(history), {
  expirationTtl: 60 * 60 * 24,
});
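No database, no Redis. KV is enough for conversational state, with one caveat: it is eventually consistent, so two messages sent in quick succession may not see each other's writes.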
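The snippet above sends the entire history on every call, so both the stored value and the prompt keep growing. One way to bound that, sketched here with hypothetical helpers (parseHistory, trimHistory, and buildMessages are names invented for this example), is to keep only the most recent messages and to tolerate a missing or corrupted KV value.

```typescript
type Role = "system" | "user" | "assistant";

interface ChatMessage {
  role: Role;
  content: string;
}

// Parse the raw KV value, falling back to an empty history when it is missing or malformed.
function parseHistory(raw: string | null): ChatMessage[] {
  if (!raw) return [];
  try {
    const parsed: unknown = JSON.parse(raw);
    return Array.isArray(parsed) ? (parsed as ChatMessage[]) : [];
  } catch {
    return [];
  }
}

// Keep only the most recent maxMessages entries.
function trimHistory(history: ChatMessage[], maxMessages: number): ChatMessage[] {
  // slice(-0) would return the whole array, so handle non-positive limits explicitly.
  return maxMessages > 0 ? history.slice(-maxMessages) : [];
}

// Build the messages array for the model: system prompt first, then the trimmed history.
function buildMessages(
  systemPrompt: string,
  history: ChatMessage[],
  maxMessages: number,
): ChatMessage[] {
  return [
    { role: "system", content: systemPrompt },
    ...trimHistory(history, maxMessages),
  ];
}
```

In the flow above, parseHistory would replace the inline JSON.parse, and buildMessages would produce the messages array passed to env.AI.run. The right value for maxMessages depends on the model's context window and how long your messages tend to be.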
Limits to know
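- CPU time: The Free plan allows 10ms of CPU time per request; Workers Paid ($5/month) raises that to 30s. Time spent awaiting env.AI.run is not CPU time, since inference runs on Cloudflare's side, but parsing large histories and building responses is.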
- Not fully free: Workers AI bills by neurons (their compute unit). Free tier gives 10,000 neurons/day — fine for testing, tight for production.
- Latency: Slightly higher than a direct external API call since inference is distributed. For a chatbot, it’s fine.
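To get a feel for how far the free allocation stretches, here is a back-of-the-envelope sketch. The per-request cost is a made-up placeholder, since actual neuron usage depends on the model and on token counts, so plug in a figure measured from your own usage.

```typescript
// Free daily allocation stated in the list above.
const FREE_NEURONS_PER_DAY = 10000;

// How many requests fit in a daily neuron budget, given an average cost per request.
function requestsPerDay(
  neuronsPerRequest: number,
  dailyBudget: number = FREE_NEURONS_PER_DAY,
): number {
  if (neuronsPerRequest <= 0) {
    throw new Error("neuronsPerRequest must be positive");
  }
  return Math.floor(dailyBudget / neuronsPerRequest);
}

// Hypothetical figure for illustration only; it is not a published price.
const ASSUMED_NEURONS_PER_REQUEST = 25;
const estimate = requestsPerDay(ASSUMED_NEURONS_PER_REQUEST); // 10000 / 25 = 400
```

Under that assumed cost the free tier covers about 400 requests a day, which fits the "fine for testing, tight for production" framing.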
When does this make sense?
Workers AI is a good fit when you want to:
- Build a small personal bot without spinning up a server
- Plug AI into an existing Cloudflare pipeline (Pages, KV, R2)
- Prototype fast without touching infra
I’m using it to power a Telegram bot — it receives a message, calls Gemma 4 to generate a blog post draft, then automatically opens a PR on GitHub. The entire pipeline runs on Cloudflare. No servers anywhere.
That’s serverless working the way it’s supposed to.