Hacker Newsnew | past | comments | ask | show | jobs | submitlogin

I've been talking to friends about this extensively, and read all sorts of different social media posts on X where people deep dove things (I'm at work so I don't have any links handy - though I did submit one on HN, grain of salt, unsure how valid it is but it was interesting: https://news.ycombinator.com/item?id=47752049 ).

I think the real issue stems from the 1 Million token context window change. They did not anticipate the amount of load it would give you. That first few days after they released the new token window, I was making amazing things in one single session from nothing, to something (a new .NET based programming language inspired by Python, and a Virtual Actor framework in Rust). I think since then they've been trying too many things to tweak things, whilst irritating their users.

They even added a new "Max" thinking mode, and made "High" the old medium, which is ridiculous because you think you're using "High" but really you're not. There's a hidden config file to change their terrible defaults to let Claude be smarter still, and apparently you can toggle off the 1M tokens.

I think the real fix, and I'm surprised nobody there has done this yet, is to let the user trim down their context window.

Think about it, you used to have what? 350k tokens or so? Now Claude will keep sending your prompt from 30 minutes ago that's completely irrelevant to the back-end, whereas 3 months ago it would have been compacted by now.

Others have noted that similar prompting for some ungodly reason adds tens of thousands of extra garbage tokens (not sure why).

Edit looks like someone figured out that if you downgrade your version of Claude Code and change one single setting it unruins Claude:

https://news.ycombinator.com/item?id=47769879



Yea, I've realized that if I stay under 200k tokens I basically don't have usage issues any more.

A bit annoying, but not the end of the world.


super-edit: Sorry, this is not a usage related question, I have move it to: https://news.ycombinator.com/item?id=47772971

Here is the question for which I cannot find an answer, and cannot yet afford to answer myself:

In Claude Code, I use Opus 4.6 1M, but stay under 250k via careful session management to avoid known NoLiMa [0] / context rot [1] crap. The question I keep wanting answered though: at ~165k tokens used, does Opus 1M actually deliver higher quality than Opus 200k?

NoLiMa would indicate that with a ~165k request, Opus 200k would suck, and Opus 1M would be better (as a lower percentage of the context window was used)... but they are the same model. However, there are practical inference deployment differences that could change the whole paradigm, right? I am so confused.

Anthropic says it's the same model [2]. But, Claude Code's own source treats them as distinct variants with separate routing [3]. Closest test I found [4] asserts they're identical below 200K but it never actually A/B tests, correct?

Inside Claude Code it's probably not testable, right? According to this issue [5], the CLI is non-deterministic for identical inputs, and agent sessions branch on tool-use. Would need a clean API-level test.

The API level test is what I really want to know for the Claude based features in my own apps. Is there a real benchmark for this?

I have reached the limits of my understanding on this problem. If what I am trying to say makes any sense, any help would be greatly appreciated.

If anyone could help me ask the question better, that would also be appreciated.

[0] https://arxiv.org/abs/2502.05167

[1] https://research.trychroma.com/context-rot

[2] https://claude.com/blog/1m-context-ga

[3] https://github.com/anthropics/claude-code/issues/35545

[4] https://www.claudecodecamp.com/p/claude-code-1m-context-wind...

[5] https://github.com/anthropics/claude-code/issues/3370


2 parent comments above say that you can use older version of claude code with opus 200k to compare. my guess is that eventually you’ll be able to set it in model settings yourself


Yeah, I have been seeing lots of comments, tweets, etc, but given everything I have learned about these models - i do not think the change to 1M was innocuous. I'm not sure what they've claimed publicly, but I'm fairly certain they must be doing additional quantization, or at minimum additional quantization of the KV cache. Plus, sequence length can change things even when not fully utilized. I had to manually re-enable the "clear context and continue" feature as well.


I used the heck out of it when it was announced, and it felt like I was using one of the best models I've ever used, but then so were all of their other customers, I don't think they accounted for such heavy load, or maybe follow up changes goofed something up, not sure. Like I said, the 1M token, for the first few days allowed me to bust out some interesting projects in one session from nothing to "oh my" in no time.

I'm thinking they should go back to all their old settings and as a user cap you at their old token limit, and ask you if you want to compact at your "soft" limit or burst for a little longer, to finish a task.


How do you re-enable that feature?


The future of harnesses cannot be „resend the whole history on every step“ or whatever this terrible compaction is.

Most of the context is unstructured fluff, much of it is distracting or even plain wrong. Especially the „thinking“ tokens are often completely disjoint halucinations that don’t make any sense.

I think what will have to happen is that context looks less like a long chat and action log and more like a structured, short, schema validated state description, plus a short log trace that only grows until a checkpoint is reached, which produces a new state.


You’re going to loose a lot of natural language nuances then. Plus git is essentially your structured, validated state description.




Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact

Search: