This has been my approach and of course what you lose is the "random and surprising" (maybe good) but also the "evolutionary" aspect.
So, if you write strong tooling (even with AI) around the connection points, you can create black boxes that are secure and only allow the agent to perform certain actions. The black-box email service calls out to a secure store (for keys, etc.) and accesses your emails in a read-only way, for example.
Everything is then much more intentional. You're writing tools for your agent, but you also can't do the fun or evolutionary things that are most of the appeal of OpenClaw. That, and many people seem to genuinely see them as 'pets' or 'strange AI friends', but that's a different problem, and it's due to the interesting methods OpenClaw uses to give the illusion of intelligence, always-on presence, and memories. These are all well known (variations on RAG, markdown files, etc.).
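To make the "black box" idea concrete, here's a minimal sketch of that kind of tool boundary. Everything here is illustrative: `get_imap_credentials` stands in for whatever secret manager you'd actually use, and the point is simply that the agent only ever sees a narrow, read-only surface.

```python
# Hypothetical sketch of a black-box tool boundary for an agent:
# the agent gets a narrow, read-only interface; credentials come
# from a secret store the agent never touches.
import email
import imaplib


def get_imap_credentials() -> tuple[str, str]:
    """Stand-in for a call to a secure secret store (e.g. Vault)."""
    raise NotImplementedError("fetch from your secret manager, not the agent")


class ReadOnlyMailbox:
    """The only email surface the agent is allowed to call."""

    def __init__(self, host: str):
        user, password = get_imap_credentials()
        self._conn = imaplib.IMAP4_SSL(host)
        self._conn.login(user, password)
        # readonly=True: the IMAP session itself refuses writes.
        self._conn.select("INBOX", readonly=True)

    def recent_subjects(self, n: int = 10) -> list[str]:
        """Return subject lines of the n most recent messages."""
        _, data = self._conn.search(None, "ALL")
        ids = data[0].split()[-n:]
        subjects = []
        for msg_id in ids:
            _, msg_data = self._conn.fetch(msg_id, "(RFC822.HEADER)")
            msg = email.message_from_bytes(msg_data[0][1])
            subjects.append(msg["Subject"] or "")
        return subjects

    # Deliberately no send(), delete(), or move(): the agent cannot
    # perform actions this class does not expose.
```

The security comes from what the class leaves out, not from prompting: the agent physically has no code path to send, delete, or see credentials.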
Why would I want non-deterministic behavior here though?
If I want to max uptime, I write a tool to track/monitor. Then I write a small agent (non-AI) that monitors those outputs and performs remediation actions (reset something, clear something, etc., depending on the service).
Do I want Claude re-writing and breaking subscription flow because it detected an issue? No.
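A deterministic watchdog like that can be sketched in a few lines. The endpoint, threshold, and remediation command below are all placeholders; the design point is that the remediation is a single fixed, pre-approved action, not something an LLM decides.

```python
# Minimal sketch of a non-AI watchdog, per the comment above.
# HEALTH_URL, MAX_FAILURES, and the systemctl call are hypothetical.
import subprocess
import time
import urllib.request

HEALTH_URL = "http://localhost:8080/health"  # placeholder endpoint
MAX_FAILURES = 3  # consecutive failures before remediation


def is_healthy(url: str, timeout: float = 5.0) -> bool:
    """One health probe; any network error counts as unhealthy."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except OSError:
        return False


def update_failures(failures: int, healthy: bool) -> tuple[int, bool]:
    """Pure state update: returns (new_failure_count, should_remediate)."""
    if healthy:
        return 0, False
    failures += 1
    if failures >= MAX_FAILURES:
        return 0, True
    return failures, False


def remediate() -> None:
    # One fixed, pre-approved action -- not an LLM rewriting your flow.
    subprocess.run(["systemctl", "restart", "myservice"], check=False)


def watchdog(poll_seconds: int = 30) -> None:
    failures = 0
    while True:
        failures, should_fix = update_failures(failures, is_healthy(HEALTH_URL))
        if should_fix:
            remediate()
        time.sleep(poll_seconds)
```

Because `update_failures` is a pure function, the remediation trigger is testable and fully predictable, which is exactly the property you lose with an AI in the loop.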
It's not, hence the "don't post AI slop as your comment" posting a few days back that had 1000+ comments.
Currently an unsolved problem - just stealthier on some platforms than others. Trigger the right topic on HN and the bots come out in-force together with humans sloppily copy/pasting LLM content.
I don't see what you're seeing, in any dimension. But here's a fair take.
I've written several very specialized benchmarks that I've used over time, which surface "model personalities" and their effects on decision-making (as well as measuring the outcomes).
Grok 4.1 Fast Reasoning is/was a solid model. It's also fundamentally different from the pack.
I call it a smart, aggressive Claude Haiku. That is, its "thinking" is quite chaotic and sometimes shorthand, and its output can be as well (relative to other models).
Its aggressiveness can allow it to punch above its weight in the competitive scenarios I have in some of my benchmarks. Its write-ups and documentation are often replete with "dominate", "relentless", and a general high energy that skirts the limits of 'cringe bro'. That said, it has generally performed just beneath the SOTA (at the time: GPT-5.2, Gemini-3-Flash, Claude Opus 4.5). Angry Sonnet, perhaps.
The latest release feels quite similar but also underperforms the same older crowd (so far) so it hasn't quite made the leap that Claude's 4.6 and GPT's 5.3/5.4 series made. It's also now priced the same as its peers but does not deliver SOTA capabilities (at least not consistently in my opinion).
I don't see why we can't have AI powered reviews as a verification of truth and trust score modifier. Let me explain.
1. You lay out a policy stating that all code, especially AI-generated code, has to be written to a high quality level and reviewed for issues prior to submission.
2. Given that even the fastest AI models do a great job of code review, you set up an agent using Codex-Spark or Sonnet, etc. to scan submissions along a few different dimensions (maintainability, security, etc.).
3. If a submission comes through that fails review, that's a strong indication that the submitter hasn't put even the lowest effort into reviewing their own code. Especially since most AI models will flag similar issues. Knock their trust score down and supply feedback.
3a. If the submitter never acts on the feedback - close the submission and knock the trust score down even more.
3b. If the submitter acts on the feedback - boost trust score slightly. We now have a self-reinforcing loop that pushes thoughtful submitters to screen their own code. (Or AI models to iterate and improve their own code.)
4. Submission passes and trust score of submitter meets some minimal threshold. Queued for human review pending prioritization.
I haven't put much thought into this but it seems like you could design a system such that "clout chasing" or "bot submissions" would be forced to either deliver something useful or give up _and_ lose enough trust score that you can safely shadowban them.
This is marketing. The same way Apple cares about your privacy so long as they can wall you in their garden.
Not a value judgment, just saying that the CEO of a company making a statement isn't worth anything. See Google's "don't be evil" ethos, which lasted as long as it was corporately useful.
If Anthropic can lure engineers with virtue signaling, good on them. They were also the same ones to say "don't accelerate" and "who would give these models access to the internet", etc etc.
"Our models will take everyone's jobs tomorrow and they're so dangerous they shouldn't be exported". Again all investor speak.
It usually refers to situations without access to the source code.
I've always taken "clean room" to be the kind of manufacturing clean room (sealed/etc). You're given a device and told "make our version". You're allowed to look, poke, etc but you don't get the detailed plans/schematics/etc.
In software, you get the app or API and you can choose how to re-implement.
In open source, yes, it seems like a silly thing and hard to prove.
True, but I think the implication (as I read it) is that AI may be providing more complex solutions than were needed for the problem and perhaps more complex than a human engineer would have provided.
Somehow this article explains perfectly, visually, how AI generated code differs from human generated code as well.
You see the exact same patterns. AI uses more code to accomplish the same thing, less efficiently.
I'm not even an AI hater. It's just a fact.
The human then has to go through and clean up that code if you want to deliver a high-quality product.
Similarly, you can slap that AI generated 3D model right into your game engine, with its terrible topology and have it perform "ok". As you add more of these terrible models, you end up with crap performance but who cares, you delivered the game on-time right? A human can then go and slave away fixing the terrible topology and textures and take longer than they would have if the object had been modeled correctly to begin with.
The comparison of edge-loops to "high quality code" is also one that I mentally draw. High quality code can be a joy to extend and build upon.
Low quality code is like the dense mesh pictured. You have a million cross interactions and side-effects. Half the time it's easier to gut the whole thing and build a better system.
Again, I use AI models daily but AI for tools is different from AI for large products. The large products will demand the bulk of your time constantly refactoring and cleaning the code (with AI as well) -- such that you lose nearly all of the perceived speed enhancements.
That is, if you care about a high quality codebase and product...
"High-quality code can be a joy to extend and build upon." I love the analogy here. It is a perfect parallel to how a good 3D model is a delight to extend. Some of the better modelers we've worked with return a model that is so incredibly lightweight, easily modifiable, and looks like the real thing that I am amazed each time.
The good thing about 3D slop vs. code slop is that it is so much easier to spot at first glance. A sloppy model immediately looks sloppy to nearly any untrained eye. But on closer look at the mesh, UVs, and texture, a trained eye is able to spot just how sloppy it truly is. Whereas with code, the untrained eye will have no idea how bad that code truly is. And as we all know now, this is creating an insane amount of security vulnerabilities in production.
We will get an interesting effect if AI plateaus around where it is now, which is that AI code generation will bring "the long run" right down to "the medium run", if not the longer side of the short run. AI can take out technical debt an order of magnitude faster than human developers, easily, and I'm still waiting for it to recognize that an abstraction is necessary and invest in putting one into the code rather than spending down the ones already present.
Of course if AI continues to proceed forward and we get to the point where the AIs can do that then they really will be able to craft vast code bases at speeds we could never keep up with on our own. However, I'm not particularly convinced LLMs are going to advance past this particular point, to a large degree because their training data contains so much of this slop approach to coding. Someone's going to have to come up with the next iteration of AI tech, I think.
I wonder about heavy curation of data sets, and then using only senior-level developers in the alignment/RLHF phases, so that the expertise of a senior-level developer forms the training signal. The psychology of those senior-level developers would be interesting, because they would knowingly be putting huge numbers of their peers, globally, out of work. I wonder if it would happen, then realize of course it will, and then I question whether we're really that desperate.
Debt doesn't harm you until the carrying costs become too high versus profits. You just have to hit that point (if it exists; maybe growth accelerates forever, if you're optimistic).
If you only knew how the enterprise space does stuff you'd realize how little a priority maintainability is.
I'm grateful we had Java when this stuff was taking off; if any enterprise applications were written in anything else available at the time (like C/C++) we'd all suffer even more memory leaks, security vulnerabilities, and data breaches than we do now.
Now that's interesting, because I come from a world where enterprise-level stuff was all done in C/C++ until quite recently, and with the shift to "web technologies" the quality of virtually everything has dropped through the floor, including the knowledge and skill level of the developers working on the tech. It is rare that I see people who have been working in excess of 10 years post-graduation, if they went to college at all. The college grads have been pushed out by lower-quality, lower-skilled React developers who really do not belong in the industry at all. It's really a crime how low things have gotten in such a short time: 10 to 15 years ago there were people with 2-3 decades of experience all over the place. Not anymore.
After 2 days of giving it a go, I find that Gemini CLI is still considerably worse than both Codex and Claude Code.
The model itself also has strange behaviors that make it seem like it gets randomly swapped out for Gemini-3-Flash or something else. I'll explain.
Once agentic coding was a bust, I gave it a run as a daily-driver AI assistant. It performed fairly well but then began behaving strangely. It would lose context mid-conversation. For instance, I said "In San Francisco I'm looking for XYZ". Two turns later I'm asking about food and it gives me suggestions all over the world.
Another time, I asked it about the likelihood of the pending east coast winter storm affecting my flight. I gave it all the details (flight, stops, times, cities).
Both GPT-5.2 and Claude crunched the details and came back with high-quality estimations and rationale. Gemini 3.1 Pro... five times, returned a weather forecast widget for either the layover or the final destination. This was on "Pro" reasoning, the highest exposed in the Gemini app/web app. I've always suspected Google swaps out models randomly, so this... wasn't surprising.
I then asked Gemini 3.1 Pro via the API and it returned a response similar to Claude and GPT-5.2 -- carefully considering all factors.
This tells me that a Google AI Ultra subscription gives me a sub-par coding agent which often swaps in Flash models, a sub-par web/app AI experience that also isn't using the advertised SOTA models, and a bunch of preview apps for video gen, audio gen (crashed every time I attempted), and world gen (Genie was interesting but a toy).
This will be a quick cancel as soon as the intro rate is done.
It's like Google doesn't ACTUALLY want to be the leader in AI or serve people their best models. They want to generate hype around benchmarks and then nerf the model and go silent.
Gemini 3 Pro Preview went from exceptional in the first month to mediocre and then out of my rotation within a month.