numeri · 2026-06-11T20:36:59 1781210219

To be fair, it is good to know that it disobeys simple instructions like "don't examine my git history" far more than other models. (It should of course be a different benchmark, so as not to conflate things.)

It's not a great sign for alignment.

bensyverson · 2026-06-11T21:02:47 1781211767

Agreed, alignment is a separate issue that a vuln-fixing benchmark doesn't need to test.

numeri · 2026-06-05T05:28:09 1780637289

I would just warn that you may not be able to recognize what is worth learning at your stage.

Intuition for library design and the architecture of software packages/external APIs is something you can only learn by doing.

numeri · 2026-04-15T20:51:40 1776286300

I have DSPD as well, and was pleasantly surprised to see how much of the article discussed DSPD.

That being said, I do think a lot of what the author is saying flies right in the face of traditional advice, esp. the suggestion that we should all just free-sleep and rotate around the clock. I personally find myself happiest when I'm entrained to the 24-hour cycle, but at my own natural offset. Whenever I've been cycling the day it's felt miserable, uncontrollable and exhausting.

To be fair, the author did claim that you can fully solve this by completely cutting out after-dark electronics, but I've tried pretty intensely to do exactly that for extended periods in the past, and didn't see any progress. I do sleep amazingly when camping, though, and the delay is smaller than normal (still definitely there).

numeri · 2026-04-06T23:36:53 1775518613

11/20 for qwen/qwen3.5-flash-02-23 in Claude Code, with effort set to low.

numeri · 2026-03-09T13:53:22 1773064402

No, that's what the headline implies, and the body of the article doesn't support it at all. They're (currently, and with no indication of intent to change this) two separate branches of their business.

numeri · 2026-02-25T15:24:01 1772033041

but Taalas had to quantize Llama 3.1 8B to death to get it to fit. It can't produce coherent non-English text at all.

numeri · 2026-02-16T17:08:20 1771261700

and if I were to guess, the latest generation of models (Claude Opus 4.6, GPT-5.3-codex, etc.) differs from Opus 4.5 and GPT 5.2 primarily in the addition of deeper, more difficult tasks (most likely agentic and coding-based, like Terminal Bench) to their RLVR training.

I could be completely off, as my intuition here is fully based on public research papers, but it seems to explain the current state of things fairly well.

numeri · 2026-02-01T23:10:55 1769987455

No, Python or units[1] is always a better choice if I'm near a computer (and I nearly always am these days, unfortunately, I suppose). I do have three wonderful slide rules, though.

[1]: https://www.gnu.org/software/units/

numeri · 2026-02-01T23:00:13 1769986813

Introducing a solid zero-knowledge age verification option is a step in the opposite direction from ending anonymity on the Internet, which other parts of the same governments are also working on.

So yeah, I'll gladly trust and cheer on the part working in the right direction.

numeri · 2025-10-11T00:15:40 1760141740

I'll just throw in support for gaming on Linux: it feels pretty nice these days! I still have the occasional (once every 5–8 months?) update cause a short-lived bug, but it's a very justifiable trade-off to avoid Windows.