> APIs are contracts. Not the pinky promise of "I'll do my best guess"
You have never had to work with PHP backends, have you?
JSON in PHP is a flustercluck. Undefined, null, "" or "null", that is always the question.
If you use a typed Go/Rust client and schemas, you usually end up with "look ahead schemas" that try to detect the actual types behind the scenes, either with custom marshallers or with some v1/v2/v3 etc schema structs.
It's so painful to deal with duck-typed languages... that's something I wouldn't wish on anyone.
I mean, there are still people who think that a UFO was sighted in Roswell at the radar testing site of Area 51.
Imagine that: 70-ish years later, there are people who cannot grasp how modern the A-12 prototype was. [1]
In my opinion the US has a real scientific-education problem. So much so that people still believe in alien life that built machines advanced enough to bridge light-years of travel time... and the belief that such beings would remotely resemble our appearance is statistically so close to zero that I have no words to express how unlikely it is. You have a greater chance of being hit by a lightning strike every millisecond of your life than of this being the case.
We have control flow. It's called requirements specifications and test-driven development. You just have to enforce it, so the agents cannot cheat their way around it.
I decided to build my agentic environment differently: local only, sandboxed, and enforced with Go-specific requirement definitions that different agent roles cannot break, as a contract.
That alone is far better than any hyped markdown-storage-sold-as-memory project I've seen in the last weeks.
Currently I am experimenting with skills tailored to other languages, because agentskills are actually kinda useless as-is: they're not enforced, nor can any of their metadata be used to predictably verify their behavior.
My recommendation to others is: treat LLM output as malware. Analyse its behavior, not its code. Never let LLMs work outside your sandbox, and make sure they cannot escape it. That includes removing the Bash tool, for example, because that's not a reproducible sandbox.
Also, choose a language that comes with a strong unit-testing methodology. I chose Go because it allows me to write unit tests for my tools, and even for agent-to-agent communication down the line (with some limitations due to TestMain, but at least it's possible).
If you write your agent environment or harness in TypeScript, you already failed before you started. The compiled code isn't type-safe, because the compiler erases the types and doesn't generate runtime type checks in the resulting JS.
Anyways, my two cents from the purple-teaming perspective that tries to make LLMs as deterministic as possible.
Fun fact: you still can't build the vllm container with updated dependencies since llmlite got pwned, either due to regression bugs or due to unresolvable transitive dependencies in the dependency tree. There is just too much slopcode down the line, and too many dependencies pinned to outdated (and unpublished) dependencies.
I switched to llama.cpp because of that.
To me it feels more and more like the slopcode world is the opposite of the reproducible-builds philosophy. It's the anti-methodology of how to work in that regard.
Before, everyone was publishing breaking changes in patch releases because nobody adhered to any API-versioning standard. Now it's every commit that can break things. That is not an improvement.
Write-only code is such a bad, bad idea. No one is reviewing 20k-LOC PRs with 15 new dependencies in an afternoon. Sorry, it's just not happening, I don't care how many years you have been a software engineer. Yet that's the new thing and how we are all supposed to work, or else we are Luddites.
I'm personally waiting to be downgraded to simply being called "lazy".
When I see pages of obviously generated prose being submitted as any kind of documentation, my eyes just glaze over. I feel guilty sharing similar stuff too, though to my credit, at least I always lead with a self-written TL;DR; the slop is just for reference. But it's so bad, like genuinely distressing-tier bad. I don't want to read all that junk, and more and more of it gets produced.
Prose type docs have always been my Achilles heel, and this is like the worst possible evolution of that.
For a brief period in the past few weeks, they somehow managed to make a change to ChatGPT Thinking that made it succinct. The tone was super fact-oriented too. It was honestly like waking up from a fever dream.
Can you elaborate on why those bugs weren't found by e.g. fuzzing in the past?
I'm genuinely curious what "types" of implementation mistakes these were, e.g. whether they were library-usage bugs, state-management bugs, control-flow bugs, etc.
Would love to see a writeup about these findings. Maybe Mythos hinted that better fuzzing tools are needed?
If I had to guess, I'd say that AI is better at finding TOCTOU bugs than fuzzing because it starts by looking at the code and trying to find problems with it, which naturally leads it to experiment with questions like "is there any way to make this assumption false?", whereas fuzzing is more brute force. Fuzzing can explore way more possible states, but AI is better at picking good ones.
In this particular sense, AI tends to find bugs that are closer to what we'd see from a human researcher reading the code. Fuzz bugs are often more "here's a seemingly innocuous sequence of statements that randomly happen to collide three corner cases in an unexpected way".
Outside of SpiderMonkey, my understanding is that many of the best vulnerabilities were in code that is difficult to fuzz effectively for whatever reason.
Fuzzing isn't good at things like dealing with code behind a CRC check, whereas the audit-based approach using an LLM can see the sketchy code, then calculate the CRC itself to come up with a test case. I think you end up having to write custom fuzzing harnesses to get at the vulnerable parts of the code. (This is an example from a talk by somebody at Anthropic.)
That being said, I think there's a lot of potential for synergy here: if LLMs make writing code easier, that includes fuzzers, so maybe fuzzers will also end up finding a lot more bugs. I saw somebody on Twitter say they used an LLM to write a fuzzer for Chrome and found a number of security bugs that they reported.
I never understood why there is no interactive Help program like there was in the "old days", when CHM files on Windows 95/98/XP were a thing. These CHM files and their interactivity are heavily underrated; some of them were really good documentation, especially the ones shipped with IDEs and compiler suites.
Today I wish there was something like this, but made for tutorials and wizards. If someone presses "Help", they should not have to go online to your website just to never find any help for their problem.
We are in the golden age of LLMs, yet nobody uses LLMs to explore and discover locally hosted knowledge bases... which is, in my opinion, their single most useful use case. You could build such a great UX with it.
For example, I'm self-hosting a lot of archived wikis via a Kiwix server: DevDocs, Wikipedia, dev- and cyber-related wikis. Having an LLM assistant running on those locally was probably the best improvement to my learning experience. And the workflow is integrated into my custom New Tab page; it's literally a search field on my browser's homepage, so it's always accessible.
The real question is why Anthropic was able to use DMCA takedown requests "in good faith" against the Claude leaks, when their own CTO claimed it is a 100% slopcoded codebase, and they themselves argue that all LLM-generated code is transformed enough not to be copyrightable. Which they have to maintain, without being able to walk it back, because they violated millions of book and software licenses during training.
You can profit from getting away with lying to judges.
A judge isn't involved, anyway. The leaker would have to take you to court and then prove that your request was in bad faith and that they didn't infringe copyright.
Competent programmers understand how to tell the computer what needs to happen. Really good programmers understand how the computer executes the code, and take advantage of it: they know about speculative execution and cache prefetching. Competent lawyers know what the law says. Really good lawyers understand how the law is executed, and take advantage of it: they know when it won't be enforced.