Z.AI

Z.AI API

Access Z.AI instantly with Puter.js, and add AI to any app in a few lines of code, with no backend and no API keys.

// npm install @heyputer/puter.js
import { puter } from '@heyputer/puter.js';

puter.ai.chat("Explain AI like I'm five!", {
    model: "z-ai/glm-5"
}).then(response => {
    console.log(response);
});
<html>
<body>
    <script src="https://js.puter.com/v2/"></script>
    <script>
        puter.ai.chat("Explain AI like I'm five!", {
            model: "z-ai/glm-5"
        }).then(response => {
            console.log(response);
        });
    </script>
</body>
</html>

List of Z.AI Models

Chat

GLM 5.1

z-ai/glm-5.1

GLM-5.1 is a frontier-class reasoning model from Z.ai (formerly Zhipu AI), built as a post-training refinement of GLM-5 with a focus on coding and agentic tasks. It uses a 744B-parameter Mixture-of-Experts architecture with 40B active parameters per token and supports a 200K context window. GLM-5.1 scored 58.4 on SWE-Bench Pro, surpassing GPT-5.4 (57.7) and Claude Opus 4.6 (57.3), and reached 95.3 on AIME 2026. It excels at long-horizon agentic workflows, multi-step tool use, and complex software engineering tasks. The model is text-only — no image or audio input.
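Long reasoning outputs like GLM-5.1's are easier to consume as a stream than as one blocking response. A minimal sketch, assuming Puter's documented streaming shape where `stream: true` yields parts carrying a `text` field (verify against the current Puter.js docs); `collectStream` is our own helper, not part of Puter.js:

```javascript
// Accumulate streamed text parts as they arrive, invoking a callback per chunk.
async function collectStream(parts, onText) {
  let full = "";
  for await (const part of parts) {
    if (part?.text) {        // skip keep-alive or empty parts
      full += part.text;
      onText(part.text);
    }
  }
  return full;
}

// Usage in the browser (assumes <script src="https://js.puter.com/v2/"> is loaded):
// const parts = await puter.ai.chat("Plan a refactor of this module step by step.", {
//   model: "z-ai/glm-5.1",
//   stream: true
// });
// const answer = await collectStream(parts, chunk => render(chunk));
```

Rendering chunks as they arrive keeps the UI responsive during long multi-step reasoning.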

Chat

GLM 5V Turbo

z-ai/glm-5v-turbo

GLM-5V-Turbo is Z.ai's (formerly Zhipu AI) native multimodal coding model, designed to bridge visual perception and code generation in a single architecture. It processes images, video, and text natively and is optimized for agentic workflows — turning design mockups, screenshots, and UI layouts into runnable code. The model scores 94.8 on the Design2Code benchmark (vs. Claude Opus 4.6's 77.3) and leads on GUI agent benchmarks like AndroidWorld and WebVoyager. It also outperforms Claude Opus 4.5 on BrowseComp for agentic browsing tasks. Built on a 744B-parameter MoE architecture (40B active per token) with a ~200K context window. Trained with reinforcement learning across 30+ task types to maintain strong text-only coding alongside its vision strengths. Best suited for design-to-code generation, GUI automation, and vision-grounded agentic development.
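To send a mockup or screenshot alongside a text prompt, you can build a full message object. A sketch assuming the OpenAI-style `image_url` content-part format (an assumption — confirm the exact shape against the current Puter.js docs); `buildVisionMessage` is our own helper:

```javascript
// Build a single user message pairing a text prompt with one or more image URLs.
function buildVisionMessage(prompt, imageUrls) {
  return {
    role: "user",
    content: [
      { type: "text", text: prompt },
      ...imageUrls.map(url => ({ type: "image_url", image_url: { url } }))
    ]
  };
}

// Usage (browser, with https://js.puter.com/v2/ loaded):
// const msg = buildVisionMessage("Turn this mockup into HTML/CSS.", [
//   "https://example.com/mockup.png"
// ]);
// puter.ai.chat([msg], { model: "z-ai/glm-5v-turbo" }).then(console.log);
```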

Chat

GLM 5 Turbo

z-ai/glm-5-turbo

GLM-5 Turbo is a foundation model by Z.ai optimized for fast inference and agent-driven workflows, excelling at tool invocation, complex instruction decomposition, and long-chain task execution in OpenClaw scenarios. It is built on top of the GLM-5 architecture (744B parameters, 40B active) with DeepSeek Sparse Attention for reduced deployment cost and up to 205K token context. GLM-5 Turbo supports reasoning/thinking mode and is designed for real-world multi-step agentic tasks including scheduled, persistent, and high-throughput operations.

Chat

GLM 5

z-ai/glm-5

GLM-5 is Zhipu AI's (Z.ai) fifth-generation flagship open-weight foundation model with 744B total parameters (40B active) in a Mixture of Experts architecture, designed for agentic engineering, complex systems coding, and long-horizon agent tasks. It achieves state-of-the-art performance among open-weight models on coding and agentic benchmarks like SWE-bench Verified and Terminal Bench 2.0, approaching Claude Opus 4.5-level capability.

Chat

GLM 4.7 Flash

z-ai/glm-4.7-flash

GLM-4.7-Flash is designed for speed and efficiency while maintaining strong performance. It features a 200K-token context window, making it suitable for processing long documents and generating extended responses.

Chat

GLM 4.7 FlashX

z-ai/glm-4.7-flashx

GLM-4.7-FlashX is the fastest inference tier in Z.ai's GLM-4.7 generation, offering the lowest latency in the lineup. It shares the 200K-token context window and core improvements of the 4.7 generation — stronger coding, tool usage, multi-step reasoning, and natural conversational tone — while trading peak capability for maximum speed. The full GLM-4.7 model scores 73.8% on SWE-bench Verified, 84.9% on LiveCodeBench, and 95.7% on AIME 2025. FlashX inherits the same foundational training but is the right pick when response time matters more than squeezing out every point of accuracy. Targets high-throughput coding assistance, real-time agent orchestration, and latency-sensitive chat where the standard GLM-4.7 or GLM-4.7-Flash would be too slow for the concurrency requirements.
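Choosing among the GLM-4.7 tiers comes down to a latency/capability trade-off. A sketch of encoding that choice in one place; the model ids are the ones listed on this page, while the helper and its preference labels are purely illustrative:

```javascript
// Map a latency preference onto the GLM-4.7 tiers described above.
function pickGlm47Tier(preference) {
  switch (preference) {
    case "lowest-latency": return "z-ai/glm-4.7-flashx"; // fastest tier
    case "balanced":       return "z-ai/glm-4.7-flash";  // speed/quality middle ground
    case "max-quality":    return "z-ai/glm-4.7";        // full model
    default: throw new Error(`unknown preference: ${preference}`);
  }
}

// Usage:
// puter.ai.chat(prompt, { model: pickGlm47Tier("lowest-latency") });
```

Centralizing the choice makes it easy to promote traffic to a higher tier later without touching call sites.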

Chat

AutoGLM Phone Multilingual

z-ai/autoglm-phone-multilingual

AutoGLM Phone Multilingual is a 9B-parameter vision-language model from Z.ai purpose-built for autonomous smartphone control. It takes a screenshot of a phone screen, interprets the UI through multimodal perception, and outputs precise actions — taps, swipes, text input — to complete multi-step tasks described in natural language. The multilingual variant extends coverage beyond Chinese-optimized apps to English and other languages, making it suitable for international mobile automation workflows. Its architecture is based on GLM-4.1V-9B-Thinking, and it supports a 66K-token context window. Ideal for developers building mobile testing pipelines, phone-based AI assistants, or cross-app automation agents. Devices are controlled via ADB (Android) or HDC (HarmonyOS), with the model callable through a standard chat completions API.

Chat

GLM 4.6V

z-ai/glm-4.6v

GLM-4.6V is a 106B vision-language model featuring native multimodal function calling — the first to pass images directly as tool inputs. It supports a 128K context window, enough to process 150+ page documents or 1-hour videos in a single pass.

Chat

GLM 4.6V Flash

z-ai/glm-4.6v-flash

GLM-4.6V-Flash is a 9B-parameter vision-language model from Z.ai, the lightweight variant of the GLM-4.6V series. It supports a 128K-token context window and processes images, documents, charts, video frames, and text within a single request. Its key differentiator is native multimodal function calling — images and screenshots can be passed directly as tool parameters, and visual tool outputs are consumed in the same reasoning chain. This bridges the gap between visual perception and executable action for multimodal agent workflows. Best for latency-sensitive and cost-conscious applications that need vision-language capabilities: document understanding pipelines, UI-to-code conversion, visual QA, and multimodal agent loops. For maximum accuracy on complex visual reasoning, the full 106B GLM-4.6V model is available.

Chat

GLM 4.6V FlashX

z-ai/glm-4.6v-flashx

GLM-4.6V-FlashX is the fastest inference tier in Z.ai's GLM-4.6V vision-language model series. Built on the same 9B-parameter architecture as GLM-4.6V-Flash, it shares the 128K-token context window and native multimodal function calling capabilities but is further optimized for throughput and minimal latency. It supports vision input, reasoning, tool use, and structured JSON output — the same feature set as GLM-4.6V-Flash with higher concurrency limits and faster response times. Ideal for high-volume visual processing pipelines where per-request latency is critical: real-time document scanning, automated UI testing at scale, or multimodal chat applications that need vision understanding without waiting on a larger model.

Chat

GLM 4.7

z-ai/glm-4.7

GLM-4.7 is Zhipu AI's latest ~400B flagship, released in December 2025 and optimized for coding, with a 200K context window and up to 128K output tokens. It scores 73.8% on SWE-bench Verified and 95.7% on AIME 2025.

Chat

GLM 4.6

z-ai/glm-4.6

GLM-4.6 is Zhipu AI's 355B-parameter (32B active) flagship text model with 200K context, excelling at coding, agentic workflows, and search tasks. It's 15% more token-efficient than GLM-4.5 and ranks as the #1 domestic model in China.

Chat

GLM 4.5V

z-ai/glm-4.5v

GLM-4.5V is a 106B-parameter vision-language model achieving SOTA on 42 multimodal benchmarks, capable of image/video reasoning, GUI agent tasks, document parsing, and visual grounding. It features a thinking mode toggle and 64K multimodal context under MIT license.

Chat

GLM 4.5

z-ai/glm-4.5

GLM-4.5 is Zhipu AI's flagship 355B-parameter open-source model (32B active) designed for agentic AI applications with dual thinking/non-thinking modes. It excels at reasoning, coding, and tool use, ranking 3rd globally among all models on combined benchmarks under MIT license.

Chat

GLM 4.5 Air

z-ai/glm-4.5-air

GLM-4.5-Air is a compact 106B-parameter variant (12B active) of GLM-4.5, offering competitive agentic performance with significantly lower resource requirements. It supports the same dual reasoning modes and 128K context window as its larger sibling.

Chat

GLM 4.5 AirX

z-ai/glm-4.5-airx

GLM-4.5-AirX is the ultra-fast inference variant of Z.ai's GLM-4.5-Air, a 106B-parameter Mixture-of-Experts model with 12B active parameters per forward pass. It shares the same architecture and 128K-token context window as GLM-4.5-Air but is optimized for maximum throughput and minimal latency. GLM-4.5-Air itself delivers strong results — scoring 59.8 across 12 industry benchmarks and outperforming models like Gemini 2.5 Flash and Qwen3-235B on reasoning evaluations. AirX preserves that capability while targeting low-latency, high-concurrency production scenarios. Best suited for real-time agent pipelines, high-volume chat, and latency-sensitive coding assistance where the full GLM-4.5's throughput is insufficient but you still need competitive reasoning and tool-use performance.

Chat

GLM 4.5 Flash

z-ai/glm-4.5-flash

GLM-4.5-Flash is the free tier in Z.ai's GLM-4.5 model family, optimized for coding, reasoning, and agent tasks. It shares the hybrid reasoning architecture of the broader GLM-4.5 series, supporting both a thinking mode for complex multi-step problems and a non-thinking mode for instant responses. With a 128K-token context window and native support for function calling, structured output, and streaming, it provides a capable baseline for developers prototyping agent workflows or building cost-sensitive applications. It integrates with coding agent frameworks like Claude Code and Roo Code. An excellent starting point for teams evaluating the GLM-4.5 ecosystem — no cost to experiment, with a clear upgrade path to GLM-4.5 or GLM-4.5-X for heavier workloads.
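Function calling with GLM-4.5-Flash follows the OpenAI-style `tools` format that Puter.js accepts. A sketch of defining a tool and dispatching the model's tool calls locally; the `get_weather` tool, its stubbed result, and the `runToolCalls` helper are all hypothetical examples of ours, not part of Puter.js:

```javascript
// A hypothetical weather tool in the OpenAI-style function-calling schema.
const tools = [{
  type: "function",
  function: {
    name: "get_weather",
    description: "Get the current weather for a city",
    parameters: {
      type: "object",
      properties: { city: { type: "string" } },
      required: ["city"]
    }
  }
}];

// Local implementations, keyed by tool name (stubbed for illustration).
const impl = {
  get_weather: ({ city }) => `22°C and sunny in ${city}`
};

// Execute any tool calls the model requested and collect their results.
function runToolCalls(toolCalls = []) {
  return toolCalls.map(call => ({
    tool_call_id: call.id,
    content: impl[call.function.name](JSON.parse(call.function.arguments))
  }));
}

// Usage (browser, with https://js.puter.com/v2/ loaded):
// const res = await puter.ai.chat("What's the weather in Paris?", {
//   model: "z-ai/glm-4.5-flash",
//   tools
// });
// const results = runToolCalls(res.message?.tool_calls);
```

The tool results would then be appended to the conversation and sent back so the model can compose its final answer.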

Chat

GLM 4.5 X

z-ai/glm-4.5-x

GLM-4.5-X is the high-performance, ultra-fast inference variant of Z.ai's flagship GLM-4.5 model. It retains the full 355B-parameter MoE architecture (32B active) and 128K-token context window while being tuned for significantly faster response times — exceeding 100 tokens per second in real-world tests. GLM-4.5 itself ranks among the top models globally across 12 benchmarks spanning reasoning, coding, and agentic tasks, with an aggregate score of 63.2. The X variant delivers that same capability ceiling with latency suitable for interactive applications. Designed for production workloads where both quality and speed matter — real-time coding agents, interactive tool-use pipelines, and high-concurrency deployments that can't afford the response time of the standard GLM-4.5 endpoint.

Chat

GLM 4 32B 0414 128K

z-ai/glm-4-32b-0414-128k

GLM-4-32B-0414-128K is a 32B-parameter dense language model from Z.ai with an extended 128K-token context window. Pre-trained on 15 trillion tokens of high-quality data — including substantial reasoning-focused synthetic data — it was further refined with rejection sampling and reinforcement learning for instruction following, code generation, and function calling. It supports bilingual Chinese-English usage and is optimized for tasks like tool use, search-grounded Q&A, and structured output generation. Performance is competitive with models in the GPT and DeepSeek V3/R1 class at a fraction of the parameter count. A strong choice for cost-sensitive workloads that need long-context reasoning, multi-file code editing, or reliable JSON output without stepping up to the larger MoE models in the GLM family.
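When relying on GLM-4-32B-0414-128K for structured JSON output, replies sometimes arrive wrapped in a Markdown code fence. A small defensive parser (our own helper, not part of Puter.js; the `res.message.content` path in the usage comment assumes Puter's chat response shape — verify against the docs):

```javascript
// Parse a JSON reply, tolerating an optional ```json fence around it.
function parseJsonReply(text) {
  const stripped = text
    .trim()
    .replace(/^```(?:json)?\s*/i, "")  // strip a leading fence, if any
    .replace(/\s*```$/, "")            // strip a trailing fence, if any
    .trim();
  return JSON.parse(stripped);
}

// Usage:
// const res = await puter.ai.chat('Return {"ok": true} as JSON.', {
//   model: "z-ai/glm-4-32b-0414-128k"
// });
// const data = parseJsonReply(res.message.content);
```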

Chat

GLM 4 32B

z-ai/glm-4-32b

GLM-4-32B is a 32-billion-parameter bilingual (Chinese-English) foundation model by Zhipu AI, pre-trained on 15 trillion tokens of reasoning-focused data. It delivers performance comparable to GPT-4o on code generation, function calling, and Q&A tasks while remaining deployable on accessible hardware.

Frequently Asked Questions

What is this Z.AI API about?

The Z.AI API gives you access to models for AI chat. Through Puter.js, you can start using Z.AI models instantly with zero setup or configuration.

Which Z.AI models can I use?

Puter.js supports a variety of Z.AI models, including GLM 5.1, GLM 5V Turbo, GLM 5 Turbo, and more. Find all AI models supported by Puter.js in the AI model list.

How much does it cost?

With the User-Pays model, users cover their own AI costs through their Puter account. This means you can build apps without worrying about infrastructure expenses.

What is Puter.js?

Puter.js is a JavaScript library that provides access to AI, storage, and other cloud services directly from a single API. It handles authentication, infrastructure, and scaling so you can focus on building your app.

Does this work with React / Vue / Vanilla JS / Node / etc.?

Yes — the Z.AI API through Puter.js works with any JavaScript framework, Node.js, or plain HTML. Just include the library and start building. See the documentation for more details.