
onlyrealcuzzo · 2026-06-12T13:09:25 1781269765

I'm shocked they don't come with a way to run them in a sandbox.

Shouldn't this be relatively easy for a $1T company to set up?

Isn't this trivial compared to the entire harness?

fr3dx · 2026-06-12T16:16:34 1781280994

There is a builtin sandbox and various third-party options https://code.claude.com/docs/en/sandbox-environments

eqmvii · 2026-06-12T14:17:19 1781273839

That's more or less what Claude Cowork is.

Every serious engineer I've seen try to use it ran away screaming, because of limitations in the sandbox.

I've also seen people set their coding agents up entirely within containers -- that may be the better way going forward, but it's an extra step and a lot of extra plumbing to maintain.

not_kurt_godel · 2026-06-12T13:18:29 1781270309

Doing so would be an effective admission that LLM guardrails are inherently probabilistic, unpredictable, and insecure. Plus the only truly robust sandbox approach would be clunky setup of a local VM.

simonw · 2026-06-12T14:04:07 1781273047

That clunky VM setup is what Claude Cowork does, which is Claude Code with extra safety features for non-programmers.

There was a big thread about that here the other day: https://news.ycombinator.com/item?id=48479452

onlyrealcuzzo · 2026-06-12T13:06:58 1781269618

In my experience, there's little difference between frontier models and SotA ~30B param models when it comes to implementing individual functions.

Once you have a coherent design (the hard part), you can feed it to a pretty small model and get basically the same quality.

They'll not one-shot, but they're faster and cheaper, so it still works out in your favor.

Plus you can do it locally...

jdw64 · 2026-06-12T13:14:07 1781270047

I have a similar experience. However, when including code review, I think the GPT model is the most impressive.

onlyrealcuzzo · 2026-06-12T01:37:03 1781228223

It'll just rewrite tailwind badly...

onlyrealcuzzo · 2026-06-11T20:44:34 1781210674

> I have the feeling that the introduction of automatic QA may raise the bar of quality for new releases of software, and maybe partially compensate for the lower quality of the code produced at high speed with the use of automatic programming.

I've been building a compiler with LLMs for a memory safe language like Rust with near zero cost abstractions (no GC), but with WAY less cognitive overhead.

I can tell you right now:

1) It's 100x more than I could have achieved with zero compiler design experience.

2) I'm HIGHLY skeptical that LLMs can build something of this complexity (in some ways it's more difficult than implementing a Rust compiler) - so the testing is quite robust: 3 different systems (unit, integration, fuzz tests), each with mutation testing, each with between ~65-90% line coverage and ~50-80% branch coverage, for a combined ~99% line coverage and ~86% branch coverage.
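A minimal sketch of how suites that each cover 65-90% of lines can combine to roughly 99%: combined coverage is the union of the covered lines, not an average. The line sets and the `combined_coverage` helper below are hypothetical, chosen only to illustrate the arithmetic; they are not the commenter's data or tooling.

```python
def combined_coverage(total_lines: int, *covered: set[int]) -> float:
    """Fraction of lines hit by at least one suite (union of covered sets)."""
    return len(set().union(*covered)) / total_lines

# Hypothetical covered-line sets for a 100-line file.
unit = set(range(1, 81))          # lines 1..80  -> 80% on its own
integration = set(range(15, 91))  # lines 15..90 -> 76% on its own
fuzz = set(range(30, 100))        # lines 30..99 -> 70% on its own

print(combined_coverage(100, unit, integration, fuzz))
```

The combined number is high only because the suites miss different lines; three suites that all miss the same code would combine to no more than the best single suite.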

There is ZERO chance I could get something even close to this level of "working" by myself ever - let alone with minimal effort.

The test is kind of simple - if LLMs can do this... They should be able to do just about anything... Compilers are notoriously difficult to verify they actually work, rather than just kind of work sometimes...

People can say I'm wasting my time all they want.

But, one, it's been enlightening. I'm literally in awe of what they can do and have done.

Two, I've developed a bunch of tooling / metrics necessary to get them to be able to do something at this level of complexity without falling over themselves. And I think it can work at scale pretty easily.

Nearly all of the research comes from the 80s or farther back for the complexity metrics.

wavemode · 2026-06-12T20:07:33 1781294853

You're not wasting your time; LLMs have written plenty of compilers. Compilers are easy for LLMs to work on, because their level of verifiability is very high. That is, an LLM can easily determine whether what a compiler is doing is correct or incorrect.

Automated verifiability goes down once a software project incorporates things like:

- Concurrency

- Networking / distributed systems

- Visuals / animations

- Domain knowledge (e.g. banking, finance)

achierius · 2026-06-11T21:20:22 1781212822

Hate to be a pedant, but that's really not what "zero cost abstractions" means. The idea behind those is that you get a cleaner interface to some gross machine functionality/OS API/etc. layer, but don't pay a performance cost vs. using the gross lower-level layer. E.g. Rust's Option, unlike C++'s std::optional.

What you're thinking of is "no runtime" or "lightweight runtime", which does often mean "no garbage collector".

onlyrealcuzzo · 2026-06-11T21:27:35 1781213255

Rust's zero cost abstractions mainly stem from its affine ownership model managing memory lifetimes safely and correctly with zero cost - as that is the killer feature... That's what I do.

When people think of "zero cost" they don't think about std::optional. They think about not having to manage memory lifetimes AND NOT having to pay for a Garbage Collector to do it for you. That was always the trade you made until Rust.

I add on some cost to locks to prevent deadlock, and some cost to loops to insert co-operative yields in concurrent contexts unless you turn it off.

8note · 2026-06-11T22:56:19 1781218579

> affine ownership

huh? you can rotate and scale the ownership?

AlotOfReading · 2026-06-12T07:25:30 1781249130

Affine as in substructural linear types. They correspond to linear logic [0], and affine logic is named such because the way it's defined corresponds to affine functions. You don't literally need to scale your pointers though.

[0] https://en.wikipedia.org/wiki/Linear_logic

onlyrealcuzzo · 2026-06-11T18:34:08 1781202848

> while simultaneously making themselves look good to investors by showing they're embracing the hip new technologies to become a more streamlined and cost-efficient operation than ever.

This is not anything new... It just has a new name...

onlyrealcuzzo · 2026-06-11T17:48:14 1781200094

Contrary to popular belief, solar panels don't generate zero power on cloudy days.

They typically generate 10-25% of their maximum output on the cloudiest of days. Most cloudy days are not maximally cloudy.

We don't need solar panels everywhere to get even close to ~100% renewables (with nuclear, wind, new geothermal, and hydro). The areas where you put them are distributed enough that it would be exceptionally rare to ever encounter a meaningful need to ration.

So, storage is an issue, but not as big of an issue as most people think, and we do not generate anywhere near enough solar energy for it to be a reasonable concern yet...

There are also more solutions than just conventional batteries. There's pumped hydro, etc...

Marsymars · 2026-06-11T18:02:22 1781200942

> They typically generate 10-25% of their maximum output on the cloudiest of days. Most cloudy days are not maximally cloudy.

If you're at higher latitudes, this is notably less of a drop-off than you see between high/low season.

My friends with residential solar see <10% overall output in January vs July. (~60% drop from fewer sunshine hours, ~80% drop from decreased solar irradiance.)
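A quick check of that arithmetic, assuming the two drops are independent and therefore multiply (that independence is an assumption here, not something the figures establish). The helper name `remaining_fraction` is made up for illustration.

```python
# Figures quoted in the comment: fractional drops from July to January.
drop_sunshine_hours = 0.60  # ~60% fewer sunshine hours
drop_irradiance = 0.80      # ~80% lower solar irradiance

def remaining_fraction(*drops: float) -> float:
    """Multiply the fractions that survive each independent drop."""
    remaining = 1.0
    for drop in drops:
        remaining *= 1.0 - drop
    return remaining

january_vs_july = remaining_fraction(drop_sunshine_hours, drop_irradiance)
print(f"January output is about {january_vs_july:.0%} of July")
```

With these inputs, 0.4 x 0.2 leaves about 8% of July output, which is consistent with the "<10%" observation.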

jwr · 2026-06-11T19:10:39 1781205039

This gets complex quickly, because temperature matters too: cells are more efficient when they are cold. These effects interact and the results are sometimes surprising.

Many pure-numbers theoretical comparisons also make the assumption that you can consume all the power that the cells generate, which is not always the case. In an off-grid installation with a battery, for example, you might not be able to consume everything, depending on the month of the year. Practical example: my installation gets some of its peak usage numbers in March/April, because that's when it's still cold and I use the power for heating. The cells are cold, I need the power, and there is some sunshine; all of this combines. It's not obvious.

Marsymars · 2026-06-11T23:49:41 1781221781

Yeah, I mean these aren't entirely theoretical, like observationally, people I know locally are getting <10% January vs July generation - I'm working backwards to get the relative proportion of the drops due to solar hours vs irradiance.

They all have a relatively generous (I think - I'm not especially familiar with policies anywhere else) grid policy where they sell back any over-production in the summer. (They switch between summer/winter rates, so in the summer they buy/sell at ~35c/kWh and in the winter they buy/sell at ~8c/kWh. These rates are only effective as long as you don't have a net-surplus of generation in the year, so it doesn't make sense economically to oversize the system for more winter generation, as then you'll be generating more in the summer than you can use or sell back.)

jaggederest · 2026-06-11T19:56:53 1781207813

Curtailment and dump loads are pretty straightforward, though, so using all the power isn't as critical as people might imagine either.

It's better to overbuild the dc-to-ac ratio moderately and just accept that on a summer noon you'll be dumping or curtailing, and still get useful percentages in the winter. I'm in the fortunate position of having an essentially infinite dump load (water pumping and heating) that would effectively turn most of my solar into real usage, but even most people can preheat a hot water tank and things like that. With electric cars it's even better.
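A rough sketch of that trade-off, ignoring inverter losses. The function name `ac_output`, the 1.3 ratio, and the summer/winter array outputs are illustrative assumptions, not measurements from anyone's system.

```python
def ac_output(dc_fraction: float, dc_ac_ratio: float) -> tuple[float, float]:
    """Return (delivered, curtailed) power as fractions of the inverter AC rating.

    dc_fraction: array output as a fraction of DC nameplate (0..1).
    dc_ac_ratio: DC nameplate divided by inverter AC rating.
    """
    available = dc_fraction * dc_ac_ratio  # in units of the inverter rating
    delivered = min(available, 1.0)        # the inverter clips at its rating
    return delivered, available - delivered

# Hypothetical operating points: a bright summer noon and a dull winter noon.
for label, dc_fraction in (("summer noon", 0.90), ("winter noon", 0.15)):
    for ratio in (1.0, 1.3):
        delivered, curtailed = ac_output(dc_fraction, ratio)
        print(f"{label}, DC/AC {ratio}: delivered {delivered:.3f}, curtailed {curtailed:.3f}")
```

Under these assumptions the oversized array clips some power at summer noon but delivers 30% more in winter, when nothing is clipped.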

pfdietz · 2026-06-11T21:35:33 1781213733

One of Standard Thermal's use cases is excess DC power from existing solar farms that would otherwise be curtailed because of inverter/interconnect limits.

int_19h · 2026-06-12T02:48:38 1781232518

There's also the angle of the sun to consider, it changes quite a bit in higher latitudes between summer and winter so if you want maximum efficiency you need to tilt the cells accordingly. But I don't think most residential solar does that.

Marsymars · 2026-06-12T18:05:32 1781287532

The way the math works for grid-connected residential here, if you're not adjusting the angle between seasons, you'd be best off leaving it at the optimal winter angle all year, which would minimize the difference between peak/trough generation.

flumes_whims_ · 2026-06-11T17:58:44 1781200724

But they do generate zero power at night.

oblio · 2026-06-11T18:08:08 1781201288

And people use less energy at night. Yes, they do need heating/cooling and a few other things at night, but the peak is during the day and in the evening.

This argument is almost closed at this point, with PV + batteries being quite price competitive. We're no longer in 2018.

fragmede · 2026-06-11T18:30:58 1781202658

Solution? Send large mirrors into space so it never stops shining.

https://www.reflectorbital.com/

magicalhippo · 2026-06-11T19:06:28 1781204788

That surely won't interfere with the ecosystem at all! /s

onlyrealcuzzo · 2026-06-11T15:58:53 1781193533

I've worked in many places on many teams and never met anyone that essentially does nothing besides write code...

I question the obsession engineers have with their "code writing" being replaced by a machine.

Do you really think that's the value you bring to the table?

Non-engineers don't want to sit down and think about anything, they don't want to sit down and test that things actually work, they don't want to think about all the failure cases that could go wrong besides a few shallow tests, and they definitely don't want to have to pick up the mess if something does go wrong...

This is what you get paid to do. Coding is a small part of that.

tossandthrow · 2026-06-11T17:53:37 1781200417

You are on point: The developers of the future need to hold much more of the domain that is being developed for. It is not a job to write JSX and tailwind classes anymore, so you need to move up in abstraction - and complexity.

Not all can do that.

onlyrealcuzzo · 2026-06-10T18:05:27 1781114727

If you can run your tests fast and cheaply, and have metrics that show what bad/sloppy code is that are cheap & fast to generate, a worse fast model can outperform a far better far slower model if you value time...

I've had pretty good success with LLMs after putting in place metrics to measure true complexity (not cyclomatic), and automatically pushing back everything until the added complexity is within reason for the feature.

bee_rider · 2026-06-10T20:10:23 1781122223

How do you measure “true” complexity? Cyclomatic seems a bit… I dunno, artificial? Blunt? But it has the benefit of being defined.

onlyrealcuzzo · 2026-06-10T20:30:47 1781123447

There's a ton of research on this in the 80s... and interestingly, I haven't seen a lot of recent research.

Surprisingly, it seems most languages don't have a standard package to do a lot of these detections.

Ruby has Flay to detect structural similarity (duplication is something LLMs are prone to). They'll basically re-write a huge function with only a couple of minor differences that should probably be params...
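A minimal sketch of the general idea of structural similarity detection, written for Python source using only the standard library. This is not Flay's algorithm and not the commenter's tooling; `shape`, `similar_functions`, and the 0.9 threshold are assumptions made for illustration. It compares the sequence of AST node types in each function body, so names and literal values are ignored.

```python
import ast
import difflib
from itertools import combinations

def shape(func: ast.FunctionDef) -> list[str]:
    """Node-type sequence of a function body; names and literal values are dropped."""
    return [type(node).__name__ for stmt in func.body for node in ast.walk(stmt)]

def similar_functions(source: str, threshold: float = 0.9) -> list[tuple[str, str, float]]:
    """Return (name_a, name_b, ratio) for function pairs with near-identical structure."""
    funcs = [n for n in ast.walk(ast.parse(source)) if isinstance(n, ast.FunctionDef)]
    pairs = []
    for a, b in combinations(funcs, 2):
        ratio = difflib.SequenceMatcher(None, shape(a), shape(b)).ratio()
        if ratio >= threshold:
            pairs.append((a.name, b.name, round(ratio, 2)))
    return pairs

SAMPLE = '''
def total_with_tax(items):
    total = 0
    for item in items:
        total += item.price * 1.2
    return total

def total_with_discount(items):
    total = 0
    for item in items:
        total += item.price * 0.9
    return total

def greet(name):
    return "hello " + name
'''

print(similar_functions(SAMPLE))
```

The two `total_with_*` functions differ only in a constant, so they are flagged as a pair; that constant is the kind of thing that "should probably be a param".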

One of the things I rely on most is "pressure" -> which conditions are causing the most checks throughout the code-base. Those are things you should Type away.

Dynamically typed languages like Ruby create a huge surface area for type slop for LLMs, which is why I would not recommend using a dynamically typed language for vibe coding.

You can have type "pressure" and nil "pressure" -> where you set a value to nil somewhere (that you probably shouldn't have) -> and that has ripple effects all throughout your codebase. Similarly, you can do this for values -> one place it's a string (where it shouldn't be), everywhere else a symbol (what it should be) -> but now you've got hundreds of casts to_sym or to_s in your codebase.
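One guess at what a "nil pressure" count could look like, using Python's `None` as a stand-in for Ruby's `nil`. The commenter's tooling is unreleased, so this is only an illustration: `none_pressure` is a hypothetical name, and counting `is None` / `is not None` comparisons per expression is an assumed definition of the metric.

```python
import ast
from collections import Counter

def none_pressure(source: str) -> Counter:
    """Count `x is None` / `x is not None` checks, keyed by the checked expression."""
    counts: Counter = Counter()
    for node in ast.walk(ast.parse(source)):
        if not isinstance(node, ast.Compare):
            continue
        operands = [node.left, *node.comparators]
        for op, lhs, rhs in zip(node.ops, operands, operands[1:]):
            if not isinstance(op, (ast.Is, ast.IsNot)):
                continue
            # Whichever side is compared against the literal None gets the count.
            for side, other in ((lhs, rhs), (rhs, lhs)):
                if isinstance(other, ast.Constant) and other.value is None:
                    counts[ast.unparse(side)] += 1
    return counts

SAMPLE = '''
def describe(user):
    if user is None:
        return "nobody"
    if user.name is not None and user is not None:
        return user.name
    return "anonymous"
'''

print(none_pressure(SAMPLE).most_common())
```

Expressions at the top of the list are the candidates to "Type away": if one value is being None-checked all over the codebase, the fix probably belongs where that value is first allowed to be None.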

There's also state drift & reification misses. State drift -> you constantly update two states (that should probably just be one new value or a function) and sometimes you forget to update one (more of a bug possibility than complexity). Reification misses -> you constantly check for multiple conditions that should probably be one value or a function, and it's similarly bug-prone: you may sometimes miss one.

Complexity comes down to state and control flow -> so you want to check what's causing you to make the most decisions (especially state/time based), and where it's coming from. Where do you have the most state and why...

I'm hoping to release everything in the next few weeks, but it takes a while to polish things, especially when it's a side-quest of a side project...

aleksiy123 · 2026-06-10T22:23:22 1781130202

Interesting, I do think blending the fuzziness of models with the determinism of hard checks/conformance is the way to go.

But using some kind of metrics as guardrails/steering seems interesting.

epiccoleman · 2026-06-10T20:50:27 1781124627

> Dynamically typed languages like Ruby create a huge surface area for type slop for LLMs, and why I would not recommend using a dynamically typed language for vibe coding.

I totally understand this, and have seen the problems firsthand. But Elixir / Phoenix / LiveView, along with Tidewave, have become my favorite "vibe slop stack." Just so quick and easy, and the LLM seems to get things right quite often.

fridder · 2026-06-11T17:47:50 1781200070

I wonder if a dedicated client or mode in a client would provide some benefits. Might also be interesting to do adversarial stuff too where it argues with itself or another model

Daishiman · 2026-06-10T19:22:59 1781119379

What metrics have you found useful?

onlyrealcuzzo · 2026-06-10T16:23:47 1781108627

Typically, you need a little more to make up for the difference in how much more taxes you pay at the marginal end vs the average for your total income...

The median earner with a standard deduction would need a ~4.7% raise to stay even...
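A sketch of the reasoning behind a figure like that, under two loud assumptions: the whole raise is taxed at the marginal rate, and tax brackets do not move with inflation. The 12% average and 22% marginal rates are hypothetical values picked for illustration, not the median earner's actual rates, and `required_raise` is a made-up helper.

```python
def required_raise(inflation: float, average_rate: float, marginal_rate: float) -> float:
    """Gross raise (as a fraction of pay) needed for after-tax income to grow by `inflation`.

    With gross pay G, after-tax income is G * (1 - average_rate). A raise of r * G
    taxed at the marginal rate adds r * G * (1 - marginal_rate). Setting that equal
    to inflation * G * (1 - average_rate) and solving for r gives the formula below.
    """
    return inflation * (1.0 - average_rate) / (1.0 - marginal_rate)

# Hypothetical rates, for illustration only.
needed = required_raise(inflation=0.042, average_rate=0.12, marginal_rate=0.22)
print(f"{needed:.2%}")
```

If brackets and deductions are indexed to inflation, the gap between the required raise and the inflation rate shrinks, so the result depends heavily on that second assumption.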

"Inflation" is also increasingly distributed unevenly. The top 10% continues to make up a larger and larger portion of spending. It is entirely possible for ~4.2% inflation to be substantially higher (or lower) for the median household than the overall reported number.

madcaptenor · 2026-06-10T16:38:47 1781109527

Tax brackets are also inflation-adjusted, so shouldn't that cancel out?

khuey · 2026-06-10T16:55:39 1781110539

Most of the relevant numbers in the American tax code are inflation adjusted, but not all of them. The biggest ones for people on this website are probably the value of the Child Tax Credit and the thresholds at which the Net Investment Income Tax/Additional Medicare Tax kick in.

asdff · 2026-06-10T17:52:40 1781113960

Don't forget prop 13 in california, probably many beneficiaries of that policy on this forum.

khuey · 2026-06-11T02:07:54 1781143674

Prop 13 is the other way around, inflation is to the taxpayer's advantage.

onlyrealcuzzo · 2026-06-11T18:51:34 1781203894

Important distinction: to the "HOME OWNER"s advantage.

nothercastle · 2026-06-10T16:54:09 1781110449

No, it pushes you into a higher tax bracket earlier, so it also acts as a tax increase.

kennywinker · 2026-06-10T17:06:22 1781111182

I think the point is the tax brackets are supposed to be inflation-adjusted. So all the brackets go up 4.2% too. Idk if the implementation details make this actually work out 1:1 but that’s the idea.

madcaptenor · 2026-06-10T17:58:57 1781114337

That's what I was thinking.

onlyrealcuzzo · 2026-06-09T15:00:30 1781017230

Most devs will say less, but the reality is most of the time you build the wrong thing first, and the price you quote almost never gets what the client wants/expects and includes many overruns.

Every single contractor will say: that never happens when I do it...

This is a better solution for many businesses.

One, they already have a steaming pile of crap that kind of works. Forking that over to someone to "fix" for €10-50k is a steal - even if there's a decent chance they deliver nothing valuable.

You're talking about on the low end a few weeks of a dev's salary. You get next to nothing for that most of the time...

You could easily spend 4-5x that, take 3-4x longer and get something you can't use at all.

Happens all the time.