Running AI on Cloudflare Workers — no server, no GPU

Cloudflare has a feature most people overlook: you can run AI models directly inside a Worker. No GPU rental, no server setup, no cold start headaches.

They call it Workers AI.

How it works

Workers AI is a binding — just like KV or R2 — declared in wrangler.toml and used as an object in your code. Cloudflare handles the rest: inference runs on their network, as close to your users as possible.

# wrangler.toml
[ai]
binding = "AI"

[vars]
AI_MODEL = "@cf/google/gemma-4-26b-a4b-it"

That’s it. Nothing else to install.
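In TypeScript, the binding and the variable above surface as fields on the Env object passed to the handler. Here is a minimal sketch of that type, assuming hand-written interfaces rather than the generated ones (the real binding type comes from @cloudflare/workers-types or from running wrangler types). AiBinding and assertEnv are hypothetical names used only for illustration.

```typescript
// Loose, hand-written shape of the AI binding (an assumption for this sketch;
// the official `Ai` type is richer and model-aware).
interface AiBinding {
  run(model: string, inputs: Record<string, unknown>): Promise<unknown>;
}

interface Env {
  AI: AiBinding;    // from `[ai] binding = "AI"` in wrangler.toml
  AI_MODEL: string; // from `[vars] AI_MODEL = "..."`
}

// Hypothetical helper: fail early with a readable message when config is missing.
function assertEnv(env: Partial<Env>): Env {
  if (!env.AI) {
    throw new Error('Missing AI binding: add [ai] binding = "AI" to wrangler.toml');
  }
  if (!env.AI_MODEL) {
    throw new Error("Missing AI_MODEL: add it under [vars] in wrangler.toml");
  }
  return env as Env;
}
```

Calling assertEnv at the top of the handler turns a misconfigured deploy into an explicit error instead of an undefined property access.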

Calling the model

export default {
  async fetch(request: Request, env: Env): Promise<Response> {
    const response = await env.AI.run(env.AI_MODEL, {
      messages: [
        { role: "system", content: "You are a helpful assistant." },
        { role: "user", content: "Explain Cloudflare Workers in one paragraph." },
      ],
      max_tokens: 1024,
    });

    const text =
      response.choices?.[0]?.message?.content ||
      response.response ||
      "";

    return new Response(text);
  },
};

One thing to watch: the response shape depends on the model. Some return an OpenAI-compatible object with the text at choices[0].message.content; others, including some older models, return a flat response field. Handle both to be safe.
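One way to keep that dual handling in a single place is a small helper. This is a sketch: extractText and ChatResult are hypothetical names, and the type covers only the two shapes mentioned above.

```typescript
// The two result shapes discussed above, typed loosely on purpose.
type ChatResult = {
  choices?: Array<{ message?: { content?: string | null } }>;
  response?: string;
};

// Returns the generated text from either shape, or "" if neither is present.
function extractText(result: unknown): string {
  if (typeof result !== "object" || result === null) return "";
  const r = result as ChatResult;

  // OpenAI-compatible shape: choices[0].message.content
  const fromChoices = r.choices?.[0]?.message?.content;
  if (typeof fromChoices === "string" && fromChoices.length > 0) {
    return fromChoices;
  }

  // Flat shape: { response: "..." }
  return typeof r.response === "string" ? r.response : "";
}
```

In the handler, this would replace the inline fallback chain: const text = extractText(response).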

Available models

Cloudflare supports a solid range — text generation, image, speech, embeddings. The ones I reach for most:

Model	Good for
`@cf/google/gemma-4-26b-a4b-it`	Chat, content generation
`@cf/meta/llama-3.1-8b-instruct`	Instruction following
`@cf/baai/bge-base-en-v1.5`	Embeddings, semantic search

Full list at developers.cloudflare.com/workers-ai/models.
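For the embeddings use case, semantic search usually comes down to comparing vectors by cosine similarity. The sketch below shows that comparison on plain number arrays. It assumes you have already obtained one vector per text from the embedding model; my understanding is that the bge models return them as arrays of numbers under a data field, but treat that shape as an assumption and check the model page. The function names are hypothetical.

```typescript
// Cosine similarity between two equal-length vectors; 0 if either is all zeros.
function cosineSimilarity(a: number[], b: number[]): number {
  if (a.length !== b.length) {
    throw new Error("Vectors must have the same length");
  }
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  if (normA === 0 || normB === 0) return 0;
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// Returns document indices ordered from most to least similar to the query.
function rankBySimilarity(query: number[], docs: number[][]): number[] {
  return docs
    .map((vec, index) => ({ index, score: cosineSimilarity(query, vec) }))
    .sort((x, y) => y.score - x.score)
    .map((d) => d.index);
}
```

For a handful of documents this in-Worker ranking is enough; a larger corpus would call for a vector index.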

Adding memory with KV

Workers are stateless by default — every request starts fresh. If you want the AI to remember context across calls, store it in KV:

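// Assumes userId, userMessage and SYSTEM_PROMPT are defined earlier in the handler,
// and that STATE is a KV namespace binding declared in wrangler.toml.
// Load chat history from KV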
const raw = await env.STATE.get(`chat:${userId}`);
const history = raw ? JSON.parse(raw) : [];

// Append the new message
history.push({ role: "user", content: userMessage });

// Call AI with full history
const response = await env.AI.run(env.AI_MODEL, {
  messages: [{ role: "system", content: SYSTEM_PROMPT }, ...history],
});

// Handle both response shapes here too
const reply =
  response.choices?.[0]?.message?.content || response.response || "";

// Persist
history.push({ role: "assistant", content: reply });
await env.STATE.put(`chat:${userId}`, JSON.stringify(history), {
  expirationTtl: 60 * 60 * 24,
});

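No database, no Redis. For a single-user chat, KV is enough to hold conversational state. Keep in mind that KV is eventually consistent, so two messages arriving almost simultaneously can each read the old history, and one write will overwrite the other.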
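The snippet above appends to the history on every turn, so the prompt keeps growing until the key expires. One simple way to cap it is to keep only the most recent messages before calling the model and before writing back to KV. This is a sketch; trimHistory and the limit you pass in are hypothetical choices, not something Workers AI prescribes.

```typescript
type ChatMessage = { role: "user" | "assistant"; content: string };

// Keeps at most `maxMessages` of the newest messages.
function trimHistory(history: ChatMessage[], maxMessages: number): ChatMessage[] {
  if (maxMessages <= 0) return [];
  const recent = history.slice(-maxMessages);
  // Drop a leading assistant turn so the window starts with a user message.
  return recent[0]?.role === "assistant" ? recent.slice(1) : recent;
}
```

With this in place, the call would pass ...trimHistory(history, 20) instead of ...history, where 20 is an arbitrary example value to tune against the model’s context window.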

Limits to know

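CPU time: The Free plan allows 10 ms of CPU time per request; Workers Paid ($5/month) raises that to 30 s. Time spent awaiting the model should not count against it, since waiting on a binding is I/O rather than CPU work, but assembling prompts and parsing long histories does, and 10 ms leaves little headroom.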
Not fully free: Workers AI bills by neurons (their compute unit). Free tier gives 10,000 neurons/day — fine for testing, tight for production.
Latency: Can be slightly higher than calling an external API directly, since inference is distributed across Cloudflare’s network. For a chatbot, it’s fine.
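To get a feel for what the free allocation means in practice, you can turn the daily neuron figure into a rough request count. The sketch below is plain arithmetic. The per-request cost depends on the model and on how many tokens go in and out, so the number you pass is a placeholder to be replaced with a value measured from your own usage.

```typescript
const FREE_NEURONS_PER_DAY = 10000; // daily free allocation quoted above

// Rough number of requests a daily neuron budget covers.
// `neuronsPerRequest` is something you measure or estimate yourself.
function requestsPerDay(
  neuronsPerRequest: number,
  dailyNeurons: number = FREE_NEURONS_PER_DAY,
): number {
  if (!(neuronsPerRequest > 0)) {
    throw new Error("neuronsPerRequest must be a positive number");
  }
  return Math.floor(dailyNeurons / neuronsPerRequest);
}
```

For example, if a typical request cost 25 neurons (a made-up figure), requestsPerDay(25) would give 400 requests before billing starts.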

When does this make sense?

Workers AI is a good fit when you want to:

Build a small personal bot without spinning up a server
Plug AI into an existing Cloudflare pipeline (Pages, KV, R2)
Prototype fast without touching infra

I’m using it to power a Telegram bot — it receives a message, calls Gemma 4 to generate a blog post draft, then automatically opens a PR on GitHub. The entire pipeline runs on Cloudflare. No servers anywhere.
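The Telegram end of a pipeline like this can be sketched in a few lines. This is an illustration, not the actual bot: it covers only reading an incoming webhook update and preparing a reply through the Bot API’s sendMessage method, and leaves out the GitHub step entirely. The function names are hypothetical, and the token would normally come from a Worker secret.

```typescript
// Only the fields of a Telegram update that this sketch needs.
type TelegramUpdate = {
  message?: { chat?: { id?: number }; text?: string };
};

// Extracts the chat id and text from a webhook payload, or null if it is not a text message.
function parseUpdate(update: unknown): { chatId: number; text: string } | null {
  if (typeof update !== "object" || update === null) return null;
  const msg = (update as TelegramUpdate).message;
  const chatId = msg?.chat?.id;
  const text = msg?.text;
  if (typeof chatId !== "number" || typeof text !== "string" || text.trim() === "") {
    return null;
  }
  return { chatId, text };
}

// Builds the URL and fetch options for a sendMessage call.
function buildSendMessage(token: string, chatId: number, text: string) {
  return {
    url: `https://api.telegram.org/bot${token}/sendMessage`,
    init: {
      method: "POST",
      headers: { "content-type": "application/json" },
      body: JSON.stringify({ chat_id: chatId, text }),
    },
  };
}
```

Inside the Worker, the handler would call parseUpdate(await request.json()), pass the text to the model, then send the reply with const { url, init } = buildSendMessage(token, chatId, reply) followed by await fetch(url, init).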

That’s serverless working the way it’s supposed to.