Unless we do our own benchmarks, we have to take all the marketing fluff from the frontier labs at face value, and all public benchmarks degrade eventually as labs optimize towards them. OP's approach is wasteful because it is brute force, but the post says that an Elo rating is kept, so this is also an experiment, and I don't see what's wrong with that. You learn which model performs well in which settings, which may save resources later. It's also wasteful to keep working with the wrong model/harness/tools for too long.
In an interactive session, adding "Fine, but make the button red" after the model generated a first solution more than doubles the tokens used, because the model now gets not only the original code and the feature request as input tokens, but also the updated code plus the change request.
Sending a feature request to an LLM and then sending the feature request again with "The button shall be red" only doubles the tokens used.
I wrote my own agent, and it sends data to LLMs in this order: "General Prompts (How to write good code)" + "The Code" + "The Feature Request". This means the KV cache will be used even when the feature request changes.
And output tokens are usually way less than the input tokens.
So I think that my approach is very lightweight on token usage compared to an interactive session.
It would be interesting to measure it for the other agents out there. Sending a feature request two times vs an interactive session.
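A back-of-the-envelope sketch of that comparison, in case anyone wants to plug in their own numbers. Everything here is an assumption for illustration: the `Sizes` fields, the function names, and the token counts are made up, and the caching part assumes a provider that reuses the KV cache for an identical prompt prefix, which not every provider or time gap will give you.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Sizes:
    """Hypothetical token counts for one coding task."""
    prompt: int   # general prompts ("how to write good code")
    code: int     # the code sent as context
    request: int  # the original feature request
    output: int   # the updated code the model generated in turn 1
    change: int   # the follow-up, e.g. "make the button red"


def interactive_input_tokens(s: Sizes) -> tuple[int, int]:
    """Input tokens per turn when the follow-up is added to the same session."""
    first = s.prompt + s.code + s.request
    # Turn 2 replays the whole conversation, including the generated code.
    second = first + s.output + s.change
    return first, second


def resend_input_tokens(s: Sizes) -> tuple[int, int]:
    """Input tokens per turn when the amended feature request is sent fresh."""
    first = s.prompt + s.code + s.request
    second = s.prompt + s.code + s.request + s.change
    return first, second


def uncached_second_turn(s: Sizes) -> tuple[int, int]:
    """Second-turn input tokens NOT covered by a prefix cache (assumed model).

    Interactive: assumes the entire earlier conversation is a cached prefix.
    Resend: assumes only prompt + code is shared, since the request changed.
    """
    interactive = s.change
    resend = s.request + s.change
    return interactive, resend


example = Sizes(prompt=500, code=8000, request=100, output=2000, change=10)
print("interactive:", sum(interactive_input_tokens(example)))  # 19210
print("resend:     ", sum(resend_input_tokens(example)))       # 17210
print("uncached (interactive, resend):", uncached_second_turn(example))  # (10, 110)
```

With these made-up numbers the interactive session sends about 2.2x the input tokens of a single request, while resending sends about 2x. Once prefix caching is assumed, the uncached part of the second turn is small either way, so what you actually pay depends mostly on how the provider prices cached tokens.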
That’s usually not true due to caching. It may be true if you leave a large gap in between, but if you send “make it red” right after, then it’s purely incremental.
I've been wondering for a while if ignoring most of that bubble and whatever it cooks up might be a wrong move on my part.
Glad to see that it's just noise.
I suppose the biggest effects these skills have is to prime the user to expect something positive.
Actually kinda like what we do with LLMs. Just put a word in their context window and they suddenly start behaving differently because probabilities changed.
People have every right to be angry if basic responsible adult things like "quarantine the server spreading large amounts of malware" do not happen within the more than reasonable timespan that has passed.
Not even a news item. A hint. Nothing. Radio silence.
___
There is a house. It is currently on fire (for over 24h now).
So far, people have talked about how, conceptually, house fires are bad.
You can still enter the house just fine.
People saying "hey what about locking the door to not trap more people in it" are being shunned for the crime of breaking someone's workflow.
The owner of said house is nowhere to be seen.
Passersby stating "oh my god that house is on fire! get water!" are either ignored or reminded that there is no problem and they should move along.
___
Idk man. I don't think any of this is real.
And I don't even use arch, lol. And after this thing exposed the institutional rot, neither should you or really anyone.
Unless you like ending up locked inside a house fire. I guess they provide warmth in the cold harsh reality of the 2026 internet.
The server actually hosting the rootkit executable is npmjs.com, run by a for-profit company, and they still take about 24h to act on our reports, while reported AUR packages have been processed in about 1-2h by people that work unrelated dayjobs on top of this, to self-subsidize their open source work.
Sorry you're displeased with us not writing blogposts faster on top of all this. The situation is already exhausting enough without people like you.
Look, man, I understand all that, but pulling the plug is something that takes at most 90s. Let's say 300s to add the "Warning: There is an attack. We're working on it. Systems are down for now" box.
After that, you have all the time in the world to prioritize dayjobs etc.
It's not about dropping everything and fixing the root cause. It's just about taking stuff offline so that the immediate danger is mitigated.
That is not too much to ask.
It's not "people like me" having weird opinions there.
Shut it down. Then fix whenever there is time to do so.
___
But hey. Finally a statement from someone with some amount of position in the org I guess?
I wouldn't want to be in your shoes for sure, but that's beside the point. Nothing here is unreasonable other than the ostrich-style incident response lack-of-process.
And I don't mean stupid corporate process. I mean "common sense adults are in the room" process. Throw waterbucket at burning server reflex.
___
I mean I can see that your userbase absolutely sucks and could imagine that one would be scared of getting roasted for "interrupting their workflow", but this is not the way.
Their workflow is irrelevant.
As said, I'm all here for maintainer empathy, but only after the fire is put out first.
___
Anyway, "institutional rot" is not an insult but a diagnosis. I'd love to be proven wrong on that, but I don't see it.
And trust me, I do know first hand how thankless this non-job is and what hell one goes through.
I have skin in the game. I just don't have a horse in the arch race.
"Hey, let's take down all of npm, because there's a package that installs something malicious, and some people may install it without reviewing it first. The thousands of other people relying on this service can wait."
Do you not realize how crazy of a request that is?
You do realize that the people relying on the service also get served wormable malware, right?
The service is already disrupted.
It is not that a disruption could be _avoided_. The discussion makes no sense.
___
Hell, even if I were completely wrong in that assessment (not sure how, but let's assume that's the case):
You can still put up a banner. "Hey, FYI: We're under attack".
If not right away, then at the very least the moment media reports on it. And if media reported wrong, the banner says "Don't worry people. Media got it wrong."
> You do realize that the people relying on the service also get served malware, right? The service is already disrupted.
Huh? No they don't. I'm not sure what part of the attack you misunderstood, but most people are going to be completely unaffected by this. None of the infrastructure or anything like that got compromised. I updated my AUR packages 2 hours ago, and didn't get served any malware.
Again, there's probably some kind of malware on npmjs at any given time. You don't just shut down the entire server because of that. That's madness.
As said, I don't think discussing this makes sense, as our perceptions of reality seem to be fundamentally incompatible.
But regardless, let's try a different perspective: PR/Public perception
The moment multiple well-known media outlets start publishing a story stating that "stuff is happening", the situation changes.
At that point, regardless of how you personally feel about this, the narrative is "people are affected".
This forces your hand.
Which is not(!) to say that it would mean that you would have to accept what the media says. The media could be full of shit talking nonsense. *But* at that point, you need to either correct them, or do the correct action as per their narrative.
____
I don't think that PR/Public perception is the main relevant perspective here - in fact I'm just mentioning it, because all the much stronger much more technical arguments seem to be lost on you.
But there you go.
Your argument makes no sense, because "ackschually I'm unaffected" is just Russian roulette survivorship bias, but even if it _did_ make sense, the system logic of the next outer layer cans that take.
____
Anyway. The fact that people (not just you, mind you) are so busy playing "well ackschually" while there is an active wormable attack going on is precisely why I said "institutional rot". Although, I think I need to correct that to "cultural rot".
Priorities are broken. The wrong metrics are being optimized here.
I would love to hear more about this from the actual Arch maintainers instead of random users with opinions, but.. not sure where that communication would be. I didn't find any. And I did go looking!
Why are you still misunderstanding when other replies already explained?
AUR has always been AT YOUR OWN RISK.
To use your analogy, the house is an underwater cave with a big scary sign warning you that you will die, you go in without training, and blame the cave for not being safe.
Even if there's a lot of noise there's clearly something real there. People are shipping more working products than was previously possible, they're debugging faster than was previously possible, and various other things. I mean you can go fishing for things to confirm your skepticism if you want but it's pretty clear to me.
Sure, but that doesn't mean that you can't filter signal from noise.
So the actual problem statement is not "how do I keep up" but "how do I correctly tune my filter", which is solvable.
The biggest challenge there I think is that many people are not prepared for just how sharp and uncompromising that filter needs to be, but that too is solvable.
If you're not going to experiment at all you're not going to be able to do that. Agentic coding was basically a joke the first time I tried it. Now it isn't.
This no longer works when bad faith actors will push code straight from LLMs with little review, and respond to your comments with LLM responses. They will constantly leave you with the responsibility of verifying the output. You are the human in their loop. This is a brutal asymmetry. In the past, at least you knew a person probably spent more time handwriting code than you would spend reviewing it. This no longer applies: now the reviewer can easily spend more time than the author.
The thing that makes it scale is to default to "no" and require the other party to convince you of "yes". Just put the burden of proof where it belongs.
If they don't manage, then that's their problem.
Communicating this in a way that is viable for a business scenario certainly comes with its own difficulties, but that is a solvable problem.
In fact, you can use AI to stress test your communication there.
Just throw what you want to say at the AI but don't tell it that it is you who wrote it. Then tune the input until it stops saying that you're the problem and starts agreeing with you.
Highly recommend. It's a perfect emotion-driven cargo-culting normie simulator that never calls HR on you.
Did you not read what I said? They will use LLMs to spam proof onto the human reviewer. Just endless replies with LLM generated answers until you yield and approve the PR.
> I also don't want to be an asshole to my community & reject all their changes.
Do they pay you to triage their noise?
Remember that you owe no one anything at all. Neither legally nor morally.
Your chosen license likely even states the former in plain English.
___
Personally, I've adopted the "you annoy me, you're out" stance and have been quite happy with it. You do need a tough shell to do that though as you will be facing all the social exploits people can throw at you.
It also leaves "growth potential" on the table, the same way that limiting your exposure to ionizing radiation does.
That all said, it depends on what your goals are + where in the lifecycle of your project you are.
So don't take this as "this is the way" but "this can be one way".
Either way, you're not an asshole for not reading slop. Don't let anyone gaslight you into that.
> while it started to look off after a while, all the replies were still like this - a bit weird, but still plausible
I believe that we will be seeing the death of "assume good faith", which is not a bad thing, given that this was an exploit vector that has been actively abused for many years now.
"Assume bad faith and work backwards from that, rule out any possible exploits and only then clear the input for processing" will be the new normal.
Which is good. We need friction. Friction makes stuff slow down and work at the speed of humans.
It is a bad thing. The good response to bad actors abusing good faith is to make sure there are consequences that disincentivize that behavior in the future. Sliding further towards a low trust society means the bad actors winning in the same way that terrorists win when we subject everyone to restrictions as a result.
Quite the opposite. You just add a Wall with a Gate.
Inside those walls, you suddenly have a high trust society again.
The issue that is currently breaking reality was that we thought that everywhere could be a "high trust" space. This was proven countless times to be wrong.
Tearing down all walls - as it happened with the assault on friction (thanks hyperscaling) - did not lead to the "high trust" spilling out, but the "low trust" spilling in, essentially.
It's a question of where you build that wall. If you build it around the home of your immediate family and keep almost everyone else out then you can hardly be said to have a high trust society. The goal should be to put only those bad actors behind a wall, preferably a physical one.
Yeah, gated communities like that are usually a clear sign that something bad is happening with the given society - or in minor cases with the community, if it needs to gate itself from a society that is not failing.
Sure, but that's a completely different discussion.
Plus, even with such a small scale of the "inside", the thing fails gracefully.
It is arguably a failure mode, yes, but it is one that leaves a functioning system (albeit one that stays below its potential).
This is not true for the inversion of the scenario. That does _not_ fail safe but just leaves rubble behind.
> But if you hold that position, you also have to be fine with companies not offering products and services in your country.
Well.. I mean.. yeah?
I don't think this is as bad as you think it is.
Have you looked at SV and its product offerings recently?
It's mostly just enshittified gamified value extraction that doesn't respect the user at all.
"If you do not let us do all this the way we want, we will take away your ability to use our shit" hits different when the "shit" in that sentence is actually just "shit".
I'm half-remembering a now-old satire along the same lines that has Germans wondering why having Google Street View work in their area also requires internal photos of their apartments.