Hacker News | rob_c's comments

Very nice writeup; glad to see someone with access to the hardware actually playing with this.

Hopefully the cost per GPU will drop soon and we'll see people properly experiment, but frankly the "middle section" layers, roughly 2 to n-1, of a model can be shuffled up/down and left/right and still perform well.
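To make the shuffling mechanics concrete, here is a toy sketch of my own (not from the writeup): a "model" as a stack of residual-style layer functions, with only the middle section permuted. It shows the mechanics of the shuffle, not the quality claim, and every name here is illustrative.

```python
# Toy sketch: permute only the "middle section" of a layer stack.
# Residual-style layers (x + small perturbation) are deliberately chosen so
# reordering them barely changes the output; real models are not this benign.
import random

def make_layer(scale):
    # A residual-style layer: identity plus a small additive perturbation.
    return lambda x: x + scale * 0.01

def run(layers, x):
    for layer in layers:
        x = layer(x)
    return x

layers = [make_layer(s) for s in range(10)]   # layers 0..9
middle = layers[2:-1]                         # the "2 to n-1" section
random.shuffle(middle)                        # shuffle only the middle
shuffled = layers[:2] + middle + layers[-1:]

print(run(layers, 1.0), run(shuffled, 1.0))   # near-identical: addition commutes
```

With purely additive layers the outputs match exactly (up to float ordering); the interesting empirical claim is that real transformer blocks tolerate this far better than one would expect.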

The fun one will be an LLM router for LLM layers, applying the best reasoning to the best input, but frankly that would need the years and years of training that the author hints at.

The one that's still out of grasp is how to combine/manipulate per-layer K,V caches into a globally coherent state. I.e., if layers can be moved up/down, why can't the cached K,V be swapped/combined with different projections? Global K,V caches work, but they have to be _huge_ to prevent model collapse, even on something as simple as owt.
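A minimal sketch of what "swapped/combined with different projections" might look like, purely as my own illustration: the shapes, the projection matrices, and the whole setup are hypothetical, and in practice the projections would have to be learned rather than random.

```python
# Hypothetical sketch: move a per-layer K/V cache into another layer's
# representation space via linear projections. Nothing here is a real API.
import numpy as np

rng = np.random.default_rng(0)
T, d = 16, 32                          # cached sequence length, head dim

K_src = rng.standard_normal((T, d))    # K cache from the source layer
V_src = rng.standard_normal((T, d))    # V cache from the source layer

# One projection per (source, target) layer pair; random here, but the whole
# open question is whether such maps can be learned to keep coherence.
W_k = rng.standard_normal((d, d)) / np.sqrt(d)
W_v = rng.standard_normal((d, d)) / np.sqrt(d)

K_tgt = K_src @ W_k                    # cache as seen by the target layer
V_tgt = V_src @ W_v

print(K_tgt.shape, V_tgt.shape)
```

The hard part the comment points at is exactly what this sketch glosses over: making such projected caches globally coherent across the whole stack instead of collapsing the model.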


> It’s sad to see Ars Technica at this level.

This was from a journalist _who_is_hired_as_an_expert_ in tooling that hallucinates (LLM/AI chatbots), who then decided to implicitly trust said technology to write a "hit piece" (let's be honest, that's what it was).

In several jurisdictions that would fall under libel, and if it is untrue it's a major journalistic misstep and a career-ending faux pas.

Why, in any situation, would their position now be defensible?

This is akin to a journalist covering ironmongery writing a "truth" piece on how "jet fuel can't melt steel beams" (if you don't get the reference, lucky you). It's outright unprofessional.

Blaming it on illness allows everyone to save face, but they were compos mentis enough to hit publish at the time. That itself carries a certain "I'm well enough to agree this is a good article" from the author.


I wouldn't characterize the story as a hit piece; the misquotes didn't distort his position. What happened is that the LLM accurately summarized his position, but the quotes weren't actually his words.

The article author and the uploader should _BOTH_ be sentient enough to engage their brains and not just ignore the issue because they feel "it's an abstract concept I'd not get in trouble for when not working in the US or EU".


I... There are parts of the world where certain developers don't understand the way the west tends to work with regard to copyright, or the idea that you shouldn't blindly copy anything that's out there.

This, however, is a very, VERY poor situation: you end up placing your employer at risk because you think copyright doesn't matter and everything on the internet is fair game.

This is probably the most polite way I would describe this to most people, UG. For the rest: just stop acting like cheating your way through a situation to get a step up is the norm; it's just dirty behaviour.


> I... There are parts of the world where certain developers don't understand the way the west tends to work with regard to copyright

Yes, like the USA. Copyright, and laws in general, are for you but not for me.


Just for reference, Teams is not an astounding success; it's forced on the workforce by management who want to pay less. It's a classic case of management hammering a square peg into the workforce's round hole.

Yes, I understand that sometimes something is better than nothing, but Teams is _so_ bad it causes user communities to fracture where they would previously have congregated on the same platform.

Sure, if deployed correctly, and not by sysadmins whose whole approach to security is "deny everything", I'm sure Teams is a reasonable product. But in the real world, no, it's a nightmare.


So you just discovered PCA in some other form?
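For context (my own illustration, not from the thread): many "new" decompositions reduce to PCA, which itself reduces to an SVD of the centered data matrix.

```python
# Minimal reminder of what PCA is: principal directions are the right
# singular vectors of the centered data; singular values give the variances.
import numpy as np

rng = np.random.default_rng(0)
# Synthetic data with most variance along the first axis.
X = rng.standard_normal((100, 3)) @ np.diag([3.0, 1.0, 0.1])

Xc = X - X.mean(axis=0)                  # center the data
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)

components = Vt                          # rows: principal directions
explained_var = S**2 / (len(X) - 1)      # eigenvalues of the covariance

print(explained_var)                     # dominated by the first component
```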


And finally we reach the point where you're not shot for explaining that if you invest in ownership, then after everything is over you have something left with intrinsic value, regardless of what you were doing with it.

Otherwise, well just like that gym membership, you get out what you put into it...


I think this is akin to x% of the worker ants doing all the work. Once you get to a big enough scale and have to delegate, I'm sure every company hits this.

I just wish we didn't have to hire 100 on-paper workers to get 5 excellent people committed to the company...


When it leads to abuse, it's saving face; and when it's incompetence, it's also saving face.

For a competent doctor, it's used to let a patient know they're doing their job, and as an acknowledgement of the symptoms.

Unfortunately, to a _lot_ of the field these are "catch-all" "diagnoses" (I'm intentionally separating those labels). It's the same as diagnosing someone with chronic fatigue: it's diagnosis by exclusion.

The difference between chronic fatigue and brain disorders is that you're more likely to get someone looking to make a "name for themselves" diagnosing or curing the latter than the former...


This is basically just a rehash of the fact that a "trained" DNN is a function strongly dependent on its initialization parameters (easily demonstrable).

It would be awesome to have a way of finding good initializations in advance, but this is also just a case for avoiding pure DNNs, given their strong reliance on initialization.

Looking at transformers by comparison, you see a much, much weaker dependence of the model on the initial parameters. Does this mean the model is better or worse at learning, or just more stable?
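The "easily demonstrable" claim above can be sketched in a few lines of my own (hyperparameters and architecture are arbitrary choices): the same tiny network, trained on the same data, lands on very different parameters depending only on the random seed.

```python
# Sketch: two identical MLPs trained on XOR, differing only in the random
# initialization seed, end up at distant points in parameter space.
import numpy as np

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], float)
y = np.array([[0], [1], [1], [0]], float)      # XOR targets

def train(seed, steps=5000, lr=0.5, h=4):
    rng = np.random.default_rng(seed)
    W1 = rng.standard_normal((2, h)); b1 = np.zeros(h)
    W2 = rng.standard_normal((h, 1)); b2 = np.zeros(1)
    for _ in range(steps):
        a = np.tanh(X @ W1 + b1)               # hidden layer
        p = 1 / (1 + np.exp(-(a @ W2 + b2)))   # sigmoid output
        # Gradient of binary cross-entropy w.r.t. the pre-sigmoid logit is p - y.
        dz = (p - y) / len(X)
        dW2 = a.T @ dz; db2 = dz.sum(0)
        da = dz @ W2.T * (1 - a**2)            # backprop through tanh
        dW1 = X.T @ da; db1 = da.sum(0)
        W1 -= lr * dW1; b1 -= lr * db1
        W2 -= lr * dW2; b2 -= lr * db2
    mse = float(((p - y) ** 2).mean())         # readout metric only
    flat = np.concatenate([W1.ravel(), b1, W2.ravel(), b2])
    return mse, flat

loss_a, params_a = train(seed=0)
loss_b, params_b = train(seed=1)
print(loss_a, loss_b)                          # similar fit quality...
print(np.linalg.norm(params_a - params_b))     # ...very different parameters
```

Both runs fit the data comparably, yet the learned weight vectors sit far apart, which is the initialization dependence the comment refers to.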


This is an interesting insight I hadn’t thought much about before. Reminds me a bit of some of the mechanistic interpretability work that looked at branch specialization in CNNs and found that architectures with built-in branches tended to have those branches specialize in a way that was consistent across multiple training runs [1]. Maybe the multi-headed and branching nature of transformers adds an inductive bias that is useful for stable training at larger scales.

[1] https://distill.pub/2020/circuits/branch-specialization/

