Z.ai GLM API Pricing: Full Breakdown of Costs (Jun 2026)

Reynaldi Chernando

Updated: June 19, 2026


This guide covers what the Z.ai (GLM) API costs, what drives the bill up, how to bring it down, and how to use GLM models for free. If you want a setup walkthrough first, see our free GLM API tutorial.

How much does the Z.ai (GLM) API cost?

GLM-5.2, the current flagship, costs $1.40 per 1M input tokens and $4.40 per 1M output tokens. The cheapest paid text model is GLM-4.7-FlashX at $0.07 input and $0.40 output, and two text models (GLM-4.7-Flash and GLM-4.5-Flash) are free.

Model	Input / 1M	Output / 1M
GLM-5.2 (flagship)	$1.40	$4.40
GLM-5	$1.00	$3.20
GLM-4.7	$0.60	$2.20
GLM-4.5-Air	$0.20	$1.10
GLM-4.7-FlashX	$0.07	$0.40
GLM-4.7-Flash	Free	Free

Two caveats apply before you treat these rates as final. Prices are in USD and apply to the international z.ai platform. Z.ai runs its infrastructure primarily in China, which can affect latency and data residency.

Z.ai chat and the GLM Coding Plan are not the API

Three Z.ai products get mixed up, and only one of them is the pay-per-token API this article covers.

The Z.ai chat interface at chat.z.ai is a consumer product. It is separate from the API and is not billed per token.

The GLM Coding Plan is a flat monthly subscription (Lite, Pro, Max, and Team tiers). It gives you a usage quota that refreshes on a fixed cycle, and it works only inside officially supported coding tools such as Claude Code, Cline, and OpenCode. Z.ai states that the plan does not cover SDK-based access or custom integrations, and that usage through unsupported tools may be restricted. Public listings put Lite at roughly $3 to $19 per month depending on promotions, with higher tiers above that, and billing is often quarterly. These prices change frequently, so check z.ai before subscribing. The Coding Plan can be much cheaper than the API if your usage fits inside a supported coding tool, because you pay a flat rate instead of metering every token.

The pay-per-token API is the metered service. You get an API key, call the OpenAI-compatible or Anthropic-compatible endpoint, and pay for the tokens you use at the rates above. Use the API for production apps, agents, SDK access, and anything outside the supported coding tools.

How Z.ai API pricing works

GLM billing is per token. You pay for input tokens (everything you send: the prompt, system instructions, conversation history, and any documents) and for output tokens (everything the model generates). A token is a chunk of text, roughly 4 characters or about three-quarters of a word in English.
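A minimal sketch of the per-token formula, assuming the rough 4-characters-per-token rule above (a real tokenizer will give different counts) and the GLM-5.2 rates from the table. The function names are illustrative, not part of any Z.ai SDK.

```javascript
// Rough token estimate: ~4 characters per token in English (rule of thumb only).
function estimateTokens(text) {
  return Math.ceil(text.length / 4);
}

// Cost in USD for one request, given rates in USD per 1M tokens.
function requestCost(inputTokens, outputTokens, rates) {
  return (inputTokens * rates.input + outputTokens * rates.output) / 1_000_000;
}

// GLM-5.2 rates from the pricing table.
const GLM_5_2 = { input: 1.40, output: 4.40 };

// A 4,000-character prompt (~1,000 tokens) and an assumed 300-token answer.
const prompt = "x".repeat(4000);
const inputTokens = estimateTokens(prompt);
const cost = requestCost(inputTokens, 300, GLM_5_2);
console.log(inputTokens, cost.toFixed(5));
```

Multiply the per-request figure by your monthly request count to get a first-pass budget.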

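Output costs more than input on nearly every GLM model; GLM-4-32B-0414-128K and GLM-OCR are the exceptions, pricing both sides the same. On GLM-5.2, output costs about three times as much as input. Beyond that split, the lineup has a few billing mechanics worth understanding before you commit.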

Reasoning tokens are billed as output

GLM models can produce reasoning before the final answer. Those reasoning tokens are billed at the standard output rate, and the official pricing page lists no separate reasoning surcharge. A model that thinks longer produces more output tokens, so a verbose reasoning trace costs the same per token as the visible answer.
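To see what a reasoning trace does to the bill, here is a small sketch that adds reasoning tokens to the visible answer and prices both at the GLM-5.2 output rate. The 2,000-token trace is a made-up figure for illustration; how you read reasoning token counts from an API response is not covered here.

```javascript
// Reasoning tokens are billed at the standard output rate, so they are
// simply added to the visible answer's token count before pricing.
function outputCost(visibleTokens, reasoningTokens, outputRatePer1M) {
  return ((visibleTokens + reasoningTokens) * outputRatePer1M) / 1_000_000;
}

const GLM_5_2_OUTPUT = 4.40; // USD per 1M output tokens

// The same 300-token answer, without and with a hypothetical 2,000-token trace.
const noReasoning = outputCost(300, 0, GLM_5_2_OUTPUT);
const withReasoning = outputCost(300, 2000, GLM_5_2_OUTPUT);
console.log(noReasoning.toFixed(5), withReasoning.toFixed(5));
```

With the trace included, output cost for the request grows by a factor of 2,300 / 300, or about 7.7.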

Caching: a read discount plus a storage meter

Z.ai prices cached input separately and lower than fresh input. On GLM-5.2, cached input is $0.26 per 1M tokens against $1.40 for uncached, so reused context costs roughly a fifth of the normal rate. There is a second line most providers do not expose: "cached input storage," which is what you pay to keep cached content available between calls. Z.ai currently lists storage as free for a limited time. Build around the read discount, but treat the free storage as temporary.
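A sketch of the read discount on a single request, using the GLM-5.2 rates ($1.40 fresh, $0.26 cached). The 8,000-of-10,000 cached split is an assumption; how much of a prompt actually hits the cache depends on Z.ai's caching behavior. Storage is left out because it is currently listed as free.

```javascript
// Input cost when part of the prompt is served from cache.
// Rates are USD per 1M tokens; cache storage is ignored (free for now).
function inputCost(totalInputTokens, cachedTokens, rates) {
  const fresh = totalInputTokens - cachedTokens;
  return (fresh * rates.input + cachedTokens * rates.cachedInput) / 1_000_000;
}

const GLM_5_2 = { input: 1.40, cachedInput: 0.26 };

// A 10,000-token request, with and without an 8,000-token cached prefix.
const uncached = inputCost(10_000, 0, GLM_5_2);
const cached = inputCost(10_000, 8_000, GLM_5_2);
console.log(uncached.toFixed(5), cached.toFixed(5));
```

Here the cached request's input cost drops from $0.014 to about $0.0049, roughly 65% lower.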

The model suffix system

The catalog uses suffixes that map to size and speed:

No suffix (GLM-5.2, GLM-4.7): the standard model at that generation.
-X (GLM-4.5-X): a premium, higher-cost variant. GLM-4.5-X is $2.20 / $8.90.
-Air (GLM-4.5-Air): a smaller, cheaper model. $0.20 / $1.10.
-AirX (GLM-4.5-AirX): a faster Air variant at higher cost. $1.10 / $4.50.
-Turbo (GLM-5-Turbo): tuned for speed. $1.20 / $4.00.
-Flash (GLM-4.7-Flash, GLM-4.5-Flash): free, lightweight models.
-FlashX (GLM-4.7-FlashX): a cheap paid model that sits just above the free tier. $0.07 / $0.40.

What makes your bill higher than expected

Output tokens dominate the cost

Output is priced at three to nearly six times the input rate across the paid text lineup ($4.40 vs $1.40 on GLM-5.2, $1.10 vs $0.20 on GLM-4.5-Air). Workloads that generate long responses cost more than the input volume suggests, so the length of what the model writes matters more than the length of what you send.
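A quick way to check this for your own traffic is to split the bill into its input and output parts. This sketch uses GLM-5.2 rates; the token volumes are illustrative.

```javascript
// Split a workload's bill into input and output cost (rates in USD per 1M tokens).
function costBreakdown(inputTokens, outputTokens, rates) {
  const input = (inputTokens * rates.input) / 1_000_000;
  const output = (outputTokens * rates.output) / 1_000_000;
  return { input, output, outputShare: output / (input + output) };
}

const GLM_5_2 = { input: 1.40, output: 4.40 };

// Equal volumes: output is about three quarters of the bill.
const even = costBreakdown(1_000_000, 1_000_000, GLM_5_2);
// Sending three times as many tokens as you get back: output is still over half.
const inputHeavy = costBreakdown(3_000_000, 1_000_000, GLM_5_2);
console.log(even.outputShare.toFixed(2), inputHeavy.outputShare.toFixed(2));
```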

Web Search is billed per use

The built-in Web Search tool costs $0.01 per use, charged on top of the tokens for the request. An agent that searches on most turns adds a per-call fee that token math alone will not show.
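A sketch of how the per-use fee stacks on top of token cost for an agent. The agent's volume and its 80% search rate are hypothetical; the $0.01 fee and GLM-4.7 rates come from this article's tables.

```javascript
const WEB_SEARCH_FEE = 0.01; // USD per Web Search use

// Monthly cost of an agent: token cost plus a flat fee for each search.
function agentMonthlyCost({ turns, inputPerTurn, outputPerTurn, searchRate }, rates) {
  const tokenCost =
    (turns * inputPerTurn * rates.input + turns * outputPerTurn * rates.output) / 1_000_000;
  const searchCost = turns * searchRate * WEB_SEARCH_FEE;
  return { tokenCost, searchCost, total: tokenCost + searchCost };
}

// Hypothetical agent: 10,000 turns a month, 1,000 in / 300 out per turn,
// searching on 80% of turns, running on GLM-4.7 ($0.60 / $2.20 per 1M).
const bill = agentMonthlyCost(
  { turns: 10_000, inputPerTurn: 1_000, outputPerTurn: 300, searchRate: 0.8 },
  { input: 0.60, output: 2.20 }
);
console.log(bill.tokenCost.toFixed(2), bill.searchCost.toFixed(2), bill.total.toFixed(2));
```

In this example the $80 of search fees is more than six times the $12.60 spent on tokens.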

Premium tiers jump sharply

The -X variants cost several times the base model. GLM-4.5-X at $2.20 / $8.90 is roughly four times the price of GLM-4.5 at $0.60 / $2.20. Selecting a premium variant by default, rather than for the requests that need it, multiplies the bill.
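A sketch of the difference between defaulting to the premium variant and routing only some requests to it. The monthly volume and the 10% premium share are assumptions; the rates are GLM-4.5 and GLM-4.5-X from the table.

```javascript
// Blended monthly cost when a fraction of traffic goes to the premium model.
function blendedCost(inputTokens, outputTokens, premiumShare, base, premium) {
  const cost = (r, share) =>
    (share * (inputTokens * r.input + outputTokens * r.output)) / 1_000_000;
  return cost(premium, premiumShare) + cost(base, 1 - premiumShare);
}

const GLM_4_5 = { input: 0.60, output: 2.20 };
const GLM_4_5_X = { input: 2.20, output: 8.90 };

// 10M input and 2M output tokens a month.
const allBase = blendedCost(10e6, 2e6, 0, GLM_4_5, GLM_4_5_X);
const allPremium = blendedCost(10e6, 2e6, 1, GLM_4_5, GLM_4_5_X);
const tenPercent = blendedCost(10e6, 2e6, 0.1, GLM_4_5, GLM_4_5_X);
console.log(allBase.toFixed(2), allPremium.toFixed(2), tenPercent.toFixed(2));
```

Sending only the 10% of requests that need it to GLM-4.5-X comes to about $13.34, against $39.80 when everything goes there.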

The cache storage promo can end

Cached input storage is free now, but it is labeled limited-time. If you build a workload that depends on keeping large cached contexts available, a future storage charge would change the math.

Vision and media are metered on their own rates

Vision models are priced per token at different rates from the text models (GLM-4.5V is $0.60 / $1.80). Image, video, and audio generation are billed per item or per minute, not on the text token rates. A pipeline that mixes text with media bills each part on its own meter.
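A sketch of a mixed pipeline billed on two meters, using GLM-4.5V token rates and the GLM-Image per-image price listed in the tables further down. The monthly volumes are hypothetical.

```javascript
// Each part of a mixed pipeline is billed on its own meter.
function pipelineCost({ visionIn, visionOut, images }) {
  const GLM_4_5V = { input: 0.60, output: 1.80 }; // USD per 1M tokens
  const GLM_IMAGE_PER_IMAGE = 0.015;              // USD per generated image
  const vision = (visionIn * GLM_4_5V.input + visionOut * GLM_4_5V.output) / 1_000_000;
  const imageGen = images * GLM_IMAGE_PER_IMAGE;
  return { vision, imageGen, total: vision + imageGen };
}

// Hypothetical month: 5M vision input tokens, 1M output tokens, 2,000 images.
const month = pipelineCost({ visionIn: 5e6, visionOut: 1e6, images: 2000 });
console.log(month.vision.toFixed(2), month.imageGen.toFixed(2), month.total.toFixed(2));
```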

How to reduce Z.ai API costs

1. Match the model to the task

Model choice moves the bill more than any other lever. The free Flash models handle routine completion, formatting, and quick lookups at zero cost. GLM-4.5-Air and GLM-4.7 cover most mid-complexity work. Reserve GLM-5.2 for the requests that need the strongest reasoning. The same chatbot can cost $288 a month on GLM-5.2 or $60 on GLM-4.5-Air at the same volume (see the examples below).
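One way to apply this in code is a small router that maps a task tier to a model. This is a hypothetical sketch: the tier names are invented, and the lowercase model IDs are assumed API identifiers to check against Z.ai's model list.

```javascript
// Hypothetical router: map a task tier to the cheapest model that covers it.
// Tiers and model IDs are illustrative, not a Z.ai feature.
const MODEL_BY_TIER = {
  routine: "glm-4.7-flash", // free: completion, formatting, quick lookups
  standard: "glm-4.5-air",  // cheap paid model for mid-complexity work
  hard: "glm-5.2",          // flagship, reserved for the strongest reasoning
};

function pickModel(tier) {
  // hasOwnProperty guards against inherited keys such as "toString".
  if (!Object.prototype.hasOwnProperty.call(MODEL_BY_TIER, tier)) {
    throw new Error(`Unknown task tier: ${tier}`);
  }
  return MODEL_BY_TIER[tier];
}

console.log(pickModel("routine"), pickModel("hard"));
```

Keeping the mapping in one table makes it easy to move a tier to a different model when prices change.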

2. Use prompt caching for repeated context

If you send the same system prompt, instructions, or document context across many requests, cached input is billed at roughly a fifth of the standard rate. For an assistant that reuses a large fixed prompt on every call, the read discount applies to the bulk of each request.

3. Control output length

Because output is the expensive side, capping response length and asking for concise answers cuts cost directly. Set a max output token limit and request only the format you need rather than letting the model pad its responses.
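A sketch of an OpenAI-compatible chat request with an output cap. It builds the request without sending it. The endpoint URL and the `glm-4.7` model ID are assumptions to confirm in Z.ai's API documentation, and whether reasoning tokens count toward `max_tokens` is not covered here.

```javascript
// Assumed OpenAI-compatible endpoint; verify against Z.ai's docs.
const ZAI_CHAT_URL = "https://api.z.ai/api/paas/v4/chat/completions";

// Build a chat request that caps generated tokens and asks for brevity.
function buildChatRequest(apiKey, prompt, { model = "glm-4.7", maxTokens = 300 } = {}) {
  return {
    url: ZAI_CHAT_URL,
    init: {
      method: "POST",
      headers: {
        "Content-Type": "application/json",
        Authorization: `Bearer ${apiKey}`,
      },
      body: JSON.stringify({
        model,
        max_tokens: maxTokens, // upper bound on generated tokens
        messages: [
          { role: "system", content: "Answer in at most three sentences." },
          { role: "user", content: prompt },
        ],
      }),
    },
  };
}

// Usage (Node 18+ or a browser):
// const { url, init } = buildChatRequest(process.env.ZAI_API_KEY, "Summarize this");
// const res = await fetch(url, init);
```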

4. Consolidate prompts

One well-formed request with full context is cheaper than several back-and-forth follow-ups, each of which resends the growing conversation history as input. Sending complete context once avoids paying for the same history repeatedly.
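A sketch of why follow-ups get expensive: each turn's input is the whole history so far plus the new prompt. The turn count and token sizes are illustrative, and the comparison assumes the consolidated request needs the same 2,000 tokens of context.

```javascript
// Total input tokens billed across a conversation where every turn
// resends all prior prompts and replies.
function conversationInputTokens(turns, promptTokens, replyTokens) {
  let history = 0;
  let billed = 0;
  for (let i = 0; i < turns; i++) {
    billed += history + promptTokens;       // this turn's input
    history += promptTokens + replyTokens;  // grows for the next turn
  }
  return billed;
}

// Four 500-token follow-ups with 300-token replies, vs one 2,000-token request.
const backAndForth = conversationInputTokens(4, 500, 300);
const consolidated = conversationInputTokens(1, 2000, 300);
console.log(backAndForth, consolidated);
```

The four-turn version bills 6,800 input tokens against 2,000 for the single request, before counting the extra replies.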

5. Consider the Coding Plan or self-hosting

If your usage runs through a supported coding tool, the flat-rate GLM Coding Plan can cost less than metered tokens. GLM weights are released under the MIT license, so for high, steady volume you can self-host instead of paying per token, trading API fees for your own compute.
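A break-even sketch comparing metered cost with a flat monthly price. The $15 figure is a placeholder, not a quoted plan price, and the comparison only holds if your usage fits the plan's quota and supported tools. The same function works for self-hosting if you put your monthly compute cost in as the flat price.

```javascript
// Metered API cost for a month (rates in USD per 1M tokens).
function monthlyApiCost(inputTokens, outputTokens, rates) {
  return (inputTokens * rates.input + outputTokens * rates.output) / 1_000_000;
}

// Compare metered cost with a flat monthly price (plan or self-hosted compute).
function cheaperOption(inputTokens, outputTokens, rates, flatMonthlyPrice) {
  const api = monthlyApiCost(inputTokens, outputTokens, rates);
  return { api, flat: flatMonthlyPrice, pick: api <= flatMonthlyPrice ? "api" : "flat" };
}

const GLM_4_7 = { input: 0.60, output: 2.20 };

// Hypothetical: 20M input / 5M output tokens vs a placeholder $15/month price.
const result = cheaperOption(20e6, 5e6, GLM_4_7, 15);
console.log(result.api.toFixed(2), result.pick);
```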

Can you use the Z.ai (GLM) API for free?

Puter.js: the User-Pays model

Puter.js is a JavaScript library that lets you add GLM models to your app with no API key, no backend, and no bill to you as the developer. It works on the User-Pays model: each user of your app covers their own AI usage through their Puter account, so your costs stay at zero no matter how many users you have.

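<!DOCTYPE html>
<html>
<body>
  <!-- Load Puter.js; no API key or backend is needed -->
  <script src="https://js.puter.com/v2/"></script>
  <script>
    // Ask GLM-5.2 a question; usage is covered by the user's Puter account
    puter.ai.chat("Explain quantum computing in simple terms", {
      model: "glm-5.2"
    }).then(response => {
      // textContent shows the reply as plain text instead of parsing it as HTML
      document.body.textContent = response.message.content;
    }).catch(error => {
      console.error(error);
      document.body.textContent = "Request failed.";
    });
  </script>
</body>
</html>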

We modeled the same workload we use across our pricing guides: 500 monthly users sending 30 messages each, averaging 1,000 input and 300 output tokens per message, for 15M input and 4.5M output tokens a month. On GLM-5.2 through the API, our calculation puts that at $21.00 for input and $19.80 for output, about $40.80 a month, growing linearly with your user base.
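The arithmetic behind these figures, as a small sketch using the GLM-5.2 and GLM-4.7 rates from the tables:

```javascript
// Benchmark workload: 500 users x 30 messages, 1,000 in / 300 out per message.
const users = 500;
const messagesPerUser = 30;
const messages = users * messagesPerUser; // 15,000 messages
const inputTokens = messages * 1_000;     // 15M
const outputTokens = messages * 300;      // 4.5M

// Monthly cost for a model (rates in USD per 1M tokens).
function monthlyCost(rates) {
  const input = (inputTokens * rates.input) / 1_000_000;
  const output = (outputTokens * rates.output) / 1_000_000;
  return { input, output, total: input + output };
}

const glm52 = monthlyCost({ input: 1.40, output: 4.40 });
const glm47 = monthlyCost({ input: 0.60, output: 2.20 });
console.log(glm52.total.toFixed(2), glm47.total.toFixed(2));
```

Because every term scales with `users`, doubling the user count doubles the API bill.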

Through Puter.js the same app costs you $0 at any scale, because each user carries their own usage. On a cheaper model the dollar gap is smaller (GLM-4.7 would be about $18.90 a month at that volume, and the free Flash models are $0 either way), but Puter.js still removes API key and billing management and stays free as you grow.

Free GLM models on the API

Three models are free on the API: GLM-4.7-Flash and GLM-4.5-Flash for text, and GLM-4.6V-Flash for vision. These are genuinely free rather than trial credits, though the free models are rate-limited and tuned for speed over depth. They cover routine tasks well; complex reasoning and multi-step agent work need the paid models.

Self-host the open weights

GLM weights are published under the MIT license, so you can run them on your own hardware at no API cost. This suits high, predictable volume where owning the compute beats per-token billing, and it keeps data on your own infrastructure.

OpenRouter

GLM models are available through OpenRouter, which can route to whichever host is cheapest or fastest. Rates vary by host and sit close to the direct API prices.
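A sketch of an OpenRouter request for a GLM model, built but not sent. The model slug `z-ai/glm-4.5-air` and the `provider.sort` option are assumptions to verify against OpenRouter's model list and provider-routing docs.

```javascript
const OPENROUTER_URL = "https://openrouter.ai/api/v1/chat/completions";

// Build an OpenRouter chat request that asks for hosts sorted by a preference.
function buildOpenRouterRequest(apiKey, prompt, sortBy = "price") {
  return {
    url: OPENROUTER_URL,
    init: {
      method: "POST",
      headers: {
        "Content-Type": "application/json",
        Authorization: `Bearer ${apiKey}`,
      },
      body: JSON.stringify({
        model: "z-ai/glm-4.5-air",  // assumed OpenRouter slug
        provider: { sort: sortBy }, // assumed values: "price", "throughput", "latency"
        messages: [{ role: "user", content: prompt }],
      }),
    },
  };
}

// Usage: const { url, init } = buildOpenRouterRequest(key, "Hello"); await fetch(url, init);
```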

Real-world cost examples

We modeled three common workloads on current GLM rates.

Customer support chatbot. We calculated a chatbot handling 100,000 messages a month, averaging 800 input and 400 output tokens per message, for 80M input and 40M output tokens. The model you pick sets the bill:

Model	Monthly cost
GLM-5.2	$288.00
GLM-4.7	$136.00
GLM-4.5-Air	$60.00
GLM-4.7-FlashX	$21.60
GLM-4.7-Flash	$0.00
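The table can be reproduced from the volumes and the list rates with a few lines:

```javascript
// Chatbot workload: 100,000 messages x (800 input, 400 output) tokens.
const MESSAGES = 100_000;
const INPUT = MESSAGES * 800;  // 80M
const OUTPUT = MESSAGES * 400; // 40M

// USD per 1M tokens, from the pricing table.
const RATES = {
  "GLM-5.2": { input: 1.40, output: 4.40 },
  "GLM-4.7": { input: 0.60, output: 2.20 },
  "GLM-4.5-Air": { input: 0.20, output: 1.10 },
  "GLM-4.7-FlashX": { input: 0.07, output: 0.40 },
  "GLM-4.7-Flash": { input: 0, output: 0 },
};

const chatbotCosts = {};
for (const [model, r] of Object.entries(RATES)) {
  chatbotCosts[model] = (INPUT * r.input + OUTPUT * r.output) / 1_000_000;
  console.log(`${model}\t$${chatbotCosts[model].toFixed(2)}`);
}
```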

Summarizing 100 PDFs. We modeled 100 documents at about 10,000 input tokens each with a 500-token summary, for 1M input and 50K output tokens. On GLM-4.7 that is about $0.71 total ($0.60 input, $0.11 output). On GLM-5.2 it is about $1.62; on GLM-4.5-Air, about $0.26.

Daily content generation. We modeled 50 pieces a day at 500 input and 1,500 output tokens each over a 30-day month, for about 750K input and 2.25M output tokens. On GLM-4.7 that is about $5.40 a month. On GLM-5.2 it is about $10.95; on GLM-4.5-Air, about $2.63. In output-heavy work like this, the output rate drives the bill ($9.90 of the $10.95 on GLM-5.2 is output), so compare models on their output price first.
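Both document workloads above can be checked with one small function, assuming a 30-day month for the content example:

```javascript
// Cost in USD for a workload (rates in USD per 1M tokens).
function cost(inputTokens, outputTokens, r) {
  return (inputTokens * r.input + outputTokens * r.output) / 1_000_000;
}

const MODELS = {
  "GLM-5.2": { input: 1.40, output: 4.40 },
  "GLM-4.7": { input: 0.60, output: 2.20 },
  "GLM-4.5-Air": { input: 0.20, output: 1.10 },
};

// 100 PDFs: 10,000 input + 500 output tokens each.
const pdfIn = 100 * 10_000;
const pdfOut = 100 * 500;
// Content: 50 pieces a day for 30 days, 500 input + 1,500 output tokens each.
const pieces = 50 * 30;
const contentIn = pieces * 500;
const contentOut = pieces * 1_500;

const results = {};
for (const [name, r] of Object.entries(MODELS)) {
  results[name] = { pdfs: cost(pdfIn, pdfOut, r), content: cost(contentIn, contentOut, r) };
  console.log(name, results[name].pdfs.toFixed(2), results[name].content.toFixed(2));
}
```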

Complete Z.ai (GLM) API pricing table

Text models, per 1M tokens:

Model	Input	Cached Input	Output
GLM-5.2	$1.40	$0.26	$4.40
GLM-5.1	$1.40	$0.26	$4.40
GLM-5	$1.00	$0.20	$3.20
GLM-5-Turbo	$1.20	$0.24	$4.00
GLM-4.7	$0.60	$0.11	$2.20
GLM-4.7-FlashX	$0.07	$0.01	$0.40
GLM-4.6	$0.60	$0.11	$2.20
GLM-4.5	$0.60	$0.11	$2.20
GLM-4.5-X	$2.20	$0.45	$8.90
GLM-4.5-Air	$0.20	$0.03	$1.10
GLM-4.5-AirX	$1.10	$0.22	$4.50
GLM-4-32B-0414-128K	$0.10	—	$0.10
GLM-4.7-Flash	Free	Free	Free
GLM-4.5-Flash	Free	Free	Free

Vision models, per 1M tokens:

Model	Input	Cached Input	Output
GLM-5V-Turbo	$1.20	$0.24	$4.00
GLM-4.6V	$0.30	$0.05	$0.90
GLM-4.5V	$0.60	$0.11	$1.80
GLM-4.6V-FlashX	$0.04	$0.004	$0.40
GLM-OCR	$0.03	—	$0.03
GLM-4.6V-Flash	Free	Free	Free

Built-in tools and media (selected):

Item	Price
Web Search (built-in tool)	$0.01 / use
GLM-Image (image generation)	$0.015 / image
CogView-4 (image generation)	$0.01 / image
CogVideoX-3 (video generation)	$0.20 / video
GLM-ASR-2512 (audio)	$0.03 / 1M tokens

The full catalog includes additional video, audio, and agent models. See the official Z.ai pricing page for the complete list.

Conclusion

GLM-5.2, the flagship, costs $1.40 per 1M input tokens and $4.40 per 1M output tokens. Below it sits a wide range of cheaper models, down to GLM-4.7-FlashX at $0.07 / $0.40, plus two free Flash text models and a free Flash vision model.

The main cost levers:

Match the model to the task; this moves the bill most.
Cache repeated context for the read discount.
Control output length, since output is the costly side.
Consolidate prompts to avoid resending history.
Use the Coding Plan or self-hosting where the usage pattern fits.
