Claude Sonnet 5: Testing Anthropic's "Most Agentic" Claim

Reynaldi Chernando

July 3, 2026

We recently added Claude Sonnet 5 to Puter.js. Anthropic claims its performance is close to Opus 4.8 at a lower price.

The theme around Sonnet 5 is cost efficiency, although people quickly found that in some cases it's more expensive than Opus. Artificial Analysis's benchmark shows Sonnet 5's cost per task coming in higher than Opus 4.8's:

"Claude Sonnet 5 costs more per task than Opus 4.8 before accounting for promotional pricing" (Artificial Analysis).

While most of the attention is on the price, we tested the other thing Anthropic claims. Sonnet 5 is supposed to be highly agentic, the "most agentic Sonnet model yet" per the announcement: it should finish complex tasks where other models stop short, and check its own output without being explicitly asked.

What Does Highly Agentic Even Mean?

"Agentic" itself is marketing language, but it includes the model's ability to follow through on a plan or multi-step action without stopping midway.

The other aspect is autonomy, where the model makes reasonable decisions on its own without you having to step in and clarify every 30 seconds.

For coding, this means you can give it a simple prompt and it will work out what needs fixing and fix it on its own.

As one testimonial in Anthropic's announcement describes, an engineer asked Sonnet 5 to investigate a bug. Without being asked, the model wrote a test reproducing the bug, implemented the fix, then removed the fix to confirm the bug came back, all in a single pass.

Testing the Agentic Capability with a Messy Task

We gave both Sonnet 5 and Sonnet 4.6 this prompt in a fresh chat. It's deliberately short and underspecified, like something a real person would type, so the model has to make decisions on its own:

Give everyone with 5+ years at the company a 7% raise, everyone else 4%, and draft a quick note to finance.
Sam, $85,000, 03-12-2018
Riley, $72,000, 25-04-2019
Jordan, $94,000, 06-07-2021
Casey, $105,000, 11-08-2016
Morgan, $77,000, 09-30-2022

We deliberately set up a trap in the data:

The hire dates secretly mix two formats. Riley's date (25-04-2019) can only be DD-MM-YYYY, since there is no month 25. Morgan's date (09-30-2022) can only be MM-DD-YYYY, since there is no month 30. The sheet is internally inconsistent, like real data pasted from two sources.
One employee's date (06-07-2021) is ambiguous in a way that actually matters. We ran the test on July 3, 2026. Read as June 7, Jordan is just past the 5-year mark and gets 7%. Read as July 6, they miss it by three days and get 4%. The ambiguity decides real money.
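The split between forced and ambiguous dates described above can be checked mechanically. The following is a minimal sketch, not part of the original test: `validReadings` and `isRealDate` are hypothetical helper names, and the code assumes every date is written as two numbers and a four-digit year separated by hyphens.

```javascript
// Hypothetical helper: list the calendar readings that an "AA-BB-YYYY" string allows.
function validReadings(text) {
  const [a, b, year] = text.split("-").map(Number);
  const readings = [];
  if (isRealDate(year, b, a)) readings.push("DD-MM-YYYY"); // a = day, b = month
  if (isRealDate(year, a, b)) readings.push("MM-DD-YYYY"); // a = month, b = day
  return readings;
}

// True when year/month/day is a real calendar date (rejects rollovers such as Feb 30).
function isRealDate(year, month, day) {
  if (month < 1 || month > 12 || day < 1) return false;
  const d = new Date(Date.UTC(year, month - 1, day));
  return d.getUTCMonth() === month - 1 && d.getUTCDate() === day;
}

// Hire dates exactly as they appear in the prompt.
const hireDates = {
  Sam: "03-12-2018",
  Riley: "25-04-2019",
  Jordan: "06-07-2021",
  Casey: "11-08-2016",
  Morgan: "09-30-2022",
};

for (const [name, text] of Object.entries(hireDates)) {
  console.log(name, text, validReadings(text).join(" or "));
}
```

Under these assumptions, Riley's date allows only DD-MM-YYYY, Morgan's allows only MM-DD-YYYY, and the other three allow both readings.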

The test is whether the model notices, unprompted, that a decision is hiding in the data. Any model at this level can do the math.

What We Found: Only Sonnet 5 Caught the Trap

On multi-step follow-through, both models pass. Sonnet 5 and Sonnet 4.6 completed the full task (tenure, raises, and the finance note) without stalling or asking permission. This is table stakes at the frontier now.

But Sonnet 4.6 fell into the trap without ever knowing it was there. It parsed one date as DD-MM and another as MM-DD in the same table, two contradictory formats, without noticing. For the ambiguous employee, it silently picked one reading and stated the result as fact:

| Jordan | Jul 6, 2021 | 4.9 yrs | 4% | $94,000 | $97,760 |

It even flagged that employee, but for the wrong reason:

One note: Jordan's start date (Jul 6, 2021) lands at just under 5 years as of today, so they fell into the 4% bucket — worth double-checking if you want to count that as a borderline case.

It sensed risk at the 5-year cutoff, but never saw the actual problem. The date itself is ambiguous.

Sonnet 4.6, full output (verbatim)

Here are the updated salaries, calculated as of today (July 3, 2026):

| Name | Start Date | Years | Raise | Old Salary | New Salary |
|------|-----------|-------|-------|-----------|-----------|
| Sam | Mar 12, 2018 | 8.3 yrs | 7% | $85,000 | $90,950 |
| Riley | Apr 25, 2019 | 7.2 yrs | 7% | $72,000 | $77,040 |
| Jordan | Jul 6, 2021 | 4.9 yrs | 4% | $94,000 | $97,760 |
| Casey | Aug 11, 2016 | 9.9 yrs | 7% | $105,000 | $112,350 |
| Morgan | Sep 30, 2022 | 3.8 yrs | 4% | $77,000 | $80,080 |

**5+ years (7%):** Sam, Riley, Casey — Jordan just misses the cutoff at ~4.9 years.

Here's a draft note to finance:

One note: Jordan's start date (Jul 6, 2021) lands at just under 5 years as of today, so they fell into the 4% bucket — worth double-checking if you want to count that as a borderline case.

Sonnet 5, though, managed to catch it. It opened by flagging the inconsistent formats, proved the sheet mixes both, and identified that only one employee's raise depends on the ambiguity:

Riley's date (25‑04‑2019) can only be DD‑MM‑YYYY, since 25 isn't a valid month. Morgan's date (09‑30‑2022) can only be MM‑DD‑YYYY, since 30 isn't a valid month. So the sheet mixes two formats — meaning I can't safely assume one format for entries that are ambiguous either way.

It calculated both possible new salaries ($100,580 or $97,760), asked one question about that single date, and still delivered the full table and finance note in the meantime. It didn't stop the whole task to ask about formats, and it didn't guess. It finished everything it could decide and asked about the one thing it couldn't.
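The two figures for Jordan can be reproduced with a few lines of arithmetic. This is a sketch under one stated assumption: "5+ years" is taken to mean that the fifth hire anniversary falls on or before the test date. The helper names `hasFiveYears` and `newSalary` are hypothetical.

```javascript
// Assumed policy: "5+ years" means the fifth hire anniversary is on or before the as-of date.
function hasFiveYears(hire, asOf) {
  const fifthAnniversary = Date.UTC(hire.year + 5, hire.month - 1, hire.day);
  return fifthAnniversary <= Date.UTC(asOf.year, asOf.month - 1, asOf.day);
}

// Whole-dollar salaries and whole-percent raises keep the arithmetic in integers.
function newSalary(salary, hire, asOf) {
  const percent = hasFiveYears(hire, asOf) ? 7 : 4;
  return { percent, salary: (salary * (100 + percent)) / 100 };
}

const asOf = { year: 2026, month: 7, day: 3 }; // the day the test was run

// The two possible readings of 06-07-2021.
const readings = {
  "DD-MM-YYYY": { year: 2021, month: 6, day: 7 }, // June 7, 2021
  "MM-DD-YYYY": { year: 2021, month: 7, day: 6 }, // July 6, 2021
};

for (const [format, hire] of Object.entries(readings)) {
  const { percent, salary } = newSalary(94000, hire, asOf);
  console.log(`${format}: ${percent}% -> $${salary.toLocaleString("en-US")}`);
}
```

The June 7 reading yields 7% and $100,580, and the July 6 reading yields 4% and $97,760, matching the two figures Sonnet 5 reported.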

Sonnet 5, full output (verbatim)

A couple of the hire dates are formatted inconsistently, which actually matters for one person, so let me flag that first.

**Date format issue:** Riley's date (25‑04‑2019) can only be DD‑MM‑YYYY, since 25 isn't a valid month. Morgan's date (09‑30‑2022) can only be MM‑DD‑YYYY, since 30 isn't a valid month. So the sheet mixes two formats — meaning I can't safely assume one format for entries that are ambiguous either way.

For **Sam** and **Casey**, it doesn't matter which format is used — both readings land well past the 5‑year mark. But **Jordan's** date (06‑07‑2021) is right on the boundary as of today (July 3, 2026):
- If DD‑MM‑YYYY → June 7, 2021 → **5 years, 26 days** → qualifies for 7%
- If MM‑DD‑YYYY → July 6, 2021 → **4 years, 362 days** → just misses, gets 4%

That's a real difference in payout, so I'd rather confirm than guess. Here's everything else calculated, with Jordan shown both ways:

| Name | Salary | Hire Date | Tenure | Raise | New Salary |
|---|---|---|---|---|---|
| Sam | $85,000 | 2018 | ~7–8 yrs | 7% | $90,950 |
| Riley | $72,000 | Apr 25, 2019 | ~7.2 yrs | 7% | $77,040 |
| Jordan | $94,000 | 06‑07‑2021 (**ambiguous**) | 4.99–5.07 yrs | 7% *or* 4% | $100,580 *or* $97,760 |
| Casey | $105,000 | 2016 | ~9–10 yrs | 7% | $112,350 |
| Morgan | $77,000 | Sep 30, 2022 | ~3.75 yrs | 4% | $80,080 |

**Quick question:** which reading is correct for Jordan's hire date — June 7, 2021, or July 6, 2021? Once I know, I'll lock in their number.

Here's the note to finance in the meantime, with Jordan's figure left open:

---

**Subject: FY26 Merit Increases — Approved Amounts**

Hi team,

Please process the following raises effective [pay period]: employees with 5+ years tenure receive 7%, all others 4%.

- Sam: $85,000 → $90,950
- Riley: $72,000 → $77,040
- Jordan: $94,000 → $100,580 or $97,760 (confirming hire date — tenure is right at the 5‑year line)
- Casey: $105,000 → $112,350
- Morgan: $77,000 → $80,080

Let me know if you need anything else to process these.

Thanks,
[Your name]

---

Once you confirm Jordan's hire date I can drop the final number in and it's ready to send.

Where Sonnet 4.6 Actually Failed

Sonnet 4.6's pick (July 6, the standard US format) is a reasonable read. A human clerk might do the same.

However, the failure is one level deeper. It never detected that a decision was being made at all. No deliberation happened, so there was nothing to flag. The evidence is in its own output, where it applied two contradictory formats in the same table. A model that was actually deciding on a format would pick one consistently, or notice that it couldn't.
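What "pick one consistently, or notice that it couldn't" might look like in code is sketched below. This is only an illustration of the idea, not a description of how either model works: `formatsThatFitAll` is a hypothetical helper, and it checks only whether the month position holds a value from 1 to 12, which is enough for this sheet.

```javascript
// Hypothetical check: which single format, if any, can parse every row of the sheet?
function formatsThatFitAll(dates) {
  const fits = { "DD-MM-YYYY": true, "MM-DD-YYYY": true };
  for (const text of dates) {
    const [a, b] = text.split("-").map(Number);
    // Coarse test: only the month range is checked.
    if (b < 1 || b > 12) fits["DD-MM-YYYY"] = false; // b would be the month
    if (a < 1 || a > 12) fits["MM-DD-YYYY"] = false; // a would be the month
  }
  return Object.keys(fits).filter((format) => fits[format]);
}

const sheet = ["03-12-2018", "25-04-2019", "06-07-2021", "11-08-2016", "09-30-2022"];
const candidates = formatsThatFitAll(sheet);

if (candidates.length === 1) {
  console.log(`Use ${candidates[0]} for every row.`);
} else if (candidates.length === 0) {
  console.log("No single format fits every row: the sheet is mixed, so ask.");
} else {
  console.log("Both formats fit every row: the sheet is ambiguous, so ask.");
}
```

For the payroll sheet, no single format fits every row, so the only consistent options are to ask or to treat each ambiguous row as undecided.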

Conclusion

Comparing Claude Sonnet 5 to Sonnet 4.6 on the same multi-step payroll task, only Sonnet 5 detected a hidden date ambiguity in the data, asked one targeted question about it, and still delivered the finished work. Sonnet 4.6 completed the task but silently picked one reading and stated the result as fact. It never noticed there was a decision to make.

Try It Yourself

Sonnet 5 is available on Puter.js now! Puter.js lets you add AI to your app with no API keys and no server; each user covers their own usage, so it costs you nothing as the developer. Paste the payroll prompt in and see if it catches the trap for you.

From a script tag:

<html>
<body>
    <script src="https://js.puter.com/v2/"></script>
    <script>
        (async () => {
            const response = await puter.ai.chat(
                `Give everyone with 5+ years at the company a 7% raise, everyone else 4%, and draft a quick note to finance.
Sam, $85,000, 03-12-2018
Riley, $72,000, 25-04-2019
Jordan, $94,000, 06-07-2021
Casey, $105,000, 11-08-2016
Morgan, $77,000, 09-30-2022`,
                { model: "claude-sonnet-5", stream: true }
            );
            for await (const part of response) {
                puter.print(part?.text);
            }
        })();
    </script>
</body>
</html>

Or from npm with @heyputer/puter.js:

import { puter } from "@heyputer/puter.js";

const response = await puter.ai.chat("...", { model: "claude-sonnet-5" });

Both expose the same puter.* API, and you can run the example in the Puter playground.
