It makes a lot of sense to use an MCP for git and everything else if you want observability across many users. It gives you a place to shim security controls, monitoring, and alerting into the tool call pipeline.
I get what you're saying, but I think this is still missing something pretty critical.
The smaller models can recognize the bug when they're looking right at it; that seems to be verified. And with AISLE's approach you can iteratively feed the models one segment at a time cheaply. But if a bug spans multiple segments, the small model doesn't have the breadth of context to understand those segments taken together.
The advantage of the larger model is that it can retain more context and potentially find bugs that require more code context than one segment at a time.
That said, the bugs showcased in the Mythos paper all seemed to be shallow bugs that start and end in a single input segment, which is why AISLE was able to find them. But having more context in the window theoretically puts deeper bugs within range for the model.
I think the point they are making, that the model doesn't matter as much as the harness, stands for shallow bugs but not for vulnerability discovery in general.
OK, consider a for loop that goes through your repo, then goes through each file, and then goes through each common vulnerability...
Is Mythos somehow more powerful than just a recursive for loop, a.k.a. "agentic" review? You can run `opencode run --command` with a tailored command for whatever vulnerabilities you're looking for.
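The nested loop above can be sketched in a few lines of Python. Everything specific here is a placeholder of my own, not anything from the thread: the vulnerability list, the `*.py` glob, and the commented-out `ollama` invocation are illustrative assumptions, and each prompt stays small enough for a local model's window.

```python
from pathlib import Path

# Hypothetical list of vulnerability classes to sweep for (placeholder).
VULN_CLASSES = [
    "sql-injection",
    "path-traversal",
    "command-injection",
    "use-after-free",
]

def review_prompts(repo_root):
    """Yield one (file, vulnerability class, prompt) triple per check.

    The model only ever sees one file at a time, so a small local
    model's context window is never the bottleneck.
    """
    for path in sorted(Path(repo_root).rglob("*.py")):
        source = path.read_text(errors="replace")
        for vuln in VULN_CLASSES:
            prompt = (
                f"Review the following file for {vuln} bugs.\n"
                f"--- {path.name} ---\n{source}"
            )
            yield path, vuln, prompt

# In a real harness each prompt would go to a local model, e.g.:
#   subprocess.run(["ollama", "run", "some-local-model", prompt], ...)
# (that CLI and model name are placeholders, not the thread's setup)
```

The point being: the "agentic" part is just this iteration order plus whatever context the harness chooses to feed, which any senior programmer could wire up in an afternoon.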
Newer models have larger context windows, and more stable reasoning across those larger windows.
If you point your model directly at the thing you want it to assess, and it doesn't have to gather any additional context, you're not really testing those things at all.
Say you point Kimi and Opus at some code and give them an agentic looping harness with code review tools. They're going to start digging into the code, gathering context by mapping out references and following leads.
If the bug is really shallow, the model gets everything it needs to find it right away, and neither of them will have any advantage.
If the bug is deeper and requires a lot more code context, Opus is going to be able to hold onto a lot more information, and it's going to be a lot better at reasoning across all of it. That's a test that would actually compare the models directly.
Mythos is just a bigger model with a larger context window and, presumably, better prioritization and stronger attention mechanisms.
Harnesses are basically doing this better than just adding more context. Every time you add context, regardless of model size, you increase the odds the model will get confused about any given chain of thought. So context size is no longer some magic you just sprinkle on these things so they suddenly stop imagining things.
So, it's the old ML joke: it's just a bunch of if statements. As others are pointing out, it's quite probable that the model isn't the thing doing the heavy lifting; it's the harness feeding the context. And this link shows that small models are just as capable.
Which means: given an appropriately informed senior programmer and a day or two, I posit this is nothing more spectacular than a for loop invoking a smaller, free, local LLM to find the same issues. It doesn't matter what you think about the complexity, because the "agentic" format can create a DAG that a small model can follow. All that context you're taking in makes one-shot inspections more probable, but much like how CPUs went from 0 to 5 GHz and then stalled, so too has the value of added context.
Agent loops are going to do much the same with small models, mostly because of context poisoning: every token you add raises the chance of false positives.
I know you're right that there's a saturation point for context size, but it's not just context size that the larger models have, it's better grounding within that as a result of stronger, more discriminative attention patterns.
I'm not saying you won't drive confusion by overloading context, but the number of tokens required to trigger that failure mode in Opus is going to be a lot higher than the number for gpt-oss-20b.
I'm pretty sure a model that can run on a cellphone is going to cap out its context window long before Opus or Mythos would hit the point of diminishing returns on context overload. I think using a lower-quality model with far fewer / noisier weights and less precise attention is going to drive false positives way before adding context to a SOTA model will.
You can even see it here: AISLE had to issue a retraction because someone checked their work and found that just pointing gpt-oss-20b at the patched version consistently generated false positives: https://x.com/ChaseBrowe32432/status/2041953028027379806
I first mirrored these in the early 2000s because I was worried they would eventually vanish. My mirror has been gone for decades, and the original survives. :)
> "I think we will be there in three to six months, where AI is writing 90% of the code. And then, in 12 months, we may be in a world where AI is writing essentially all of the code"
It's the same old trick: "in two years we'll have fully self-driving cars", "in two years we'll have humans on Mars", "in two years AI will do everything", "in two years Bitcoin will replace Visa and Mastercard", "in two years everyone will use AR at least 5 hours a day", ...
Now his new prediction is supposed to materialize "by the end of 2027". What happens when it doesn't? Nothing; he'll pull another one out of his ass for "2030" or some other date in the future, close enough to raise money, far enough that by the time it's invalidated nobody will ask him about it.
How are people falling for these grifters over and over and over again? Are we getting our collective minds wiped out every 6 months?
Your quote supports hype but does not support your claim that Anthropic is telling customers they need more money to deliver the hype.
Of course Anthropic is saying that to investors. Every company does that, from SpaceX to Crumbl. “If you give us $X we will achieve Y” isn’t some terrible behavior, it’s how raising funds works.
Elizabeth Holmes is serving time for promising investors something her company couldn't deliver, so there is a line beyond which hype becomes fraud. Probably AGI, ASI, and fully automated societies aren't something well enough defined for courts to rule on, unlike making unfounded medical diagnoses from a pinprick of blood.
I work at a non-tech Fortune 500 and this is looking nearly spot-on from here. Nobody on my team touches the code directly anymore as of about 2 months ago. They're rolling it out to the entire software department by June. I can't speak to the economy at large, but this doesn't look like baseless hype to me. My understanding is that Claude Code reached this level late last year, ie. Amodei was just wrong about uptake rates.
The real problem here is that this is now the only way the maintainer/reporter can reasonably work.
Proving out a security vulnerability from beginning to end is often very difficult for someone who isn't a domain expert or hasn't seen the code. Many times I've been reasonably confident that an issue was exploitable but unable to prove it, and a 10s interaction with the maintainer was enough to uncover something serious.
Exhausting these report channels is making this unfeasible. But the number of issues that will go undetected, that would have been detected with minimal collaboration between the reporter and the maintainer, is going to be high.
I think the point is: why is OpenAI wasting its time on this? If it's just another channel for billing tokens then OK I guess, but it's not like it's a huge breakthrough.
OpenAI should be the roads, not the trucks. Let other product teams sort out the AI browsers. OpenAI has lots of problems to solve related to models, and that's where they should focus. This is a side quest.
Yes it does. He's refuting that in this part of the post:
> When they finally did reply, they seem to have developed some sort of theory that I was interested in “access to PII”, which is entirely false. I have no interest in any PII, commercially or otherwise. As my private email published by Ruby Central demonstrates, my entire proposal was based solely on company-level information, with no information about individuals included in any way. Here’s their response, over three days later.
A very specific denial: "I didn't propose this specific type of monetization". It would be better if he had followed up with "Yes, I proposed monetization, but what I had in mind was this more specific, benign form of monetization:"