I don’t see how that helps, unless you actually mean open source, rather than open weights like most people do. Without everything that goes into the model, including training data, these things are opaque.
The raw training data is so large that very few parties could host it for free even if there weren't copyright barriers.
But I think you could have a full open source training software pipeline that's set up to work with Wikipedia, Common Crawl, Books3, Library Genesis, Anna's Archive, and whatever other useful data sets people can name. There would just be a step where you have to provide your own copy of Library Genesis (or whatever subset of it you have managed to obtain).
That may very well be the case. In fact, I'm nearly certain that you're right. But it doesn't change the fact that open weight models are altogether insufficient on a number of important dimensions regarding freedom and transparency. And so often (such as the comment I replied to, I think), even technical people seem to just ignore the difference. Open weights are just weights. No amount of open-washing changes that.
Honest question, I wonder why that is? Surely we have smart humans that did not read and learn "all the books". Can AI not be trained by re-reading material multiple times to reinforce?
Start up a SETI@home-style open source LLM training project! Assuming there is a way to merge the sub-models trained on each user's home PC into a larger model...
That's not something anyone knows how to do reliably, right? It sounds quite like the problem where transformers are unable to be updated/taught over time.
Outside of interface identifiers, what is so complex about them? I think they end up being purely simpler than IPv4 addresses since they can’t be mistaken for DNS names.
Beyond the :: stuff, I can only think of the mixed notation best known from IPv4-mapped IPv6 addresses, where you can write the trailing 32 bits as dotted decimal (e.g. 2001:db8::192.0.2.1). And the :: stuff also exists in IPv4 in the same way, just using dots instead of colons.
ULA is equivalent to RFC1918.
LL also exists in IPv4 (APIPA), but I take your point that it's not common in most environments.
Yes, privacy addressing is different.
But, the context that you were commenting in was about the representation of addresses, not the semantics themselves ("what is a valid IPv6 string"). And there doesn't seem to be any greater complexity other than the IPv4-mapped IPv6 addresses thing. Which doesn't seem all that complex, especially if you see it as a tradeoff to escape the DNS name ambiguity of IPv4.
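To make the representation point concrete, here is a small sketch using Python's standard `ipaddress` module. It only shows that the dotted-decimal tail is another spelling of the same 128 bits, and that an actual IPv4-mapped address (under ::ffff:0:0/96) is a separate thing from the notation. The helper name `is_ipv6_literal` is made up for illustration.

```python
import ipaddress

# Mixed notation: the trailing 32 bits written as dotted decimal.
mixed = ipaddress.IPv6Address("2001:db8::192.0.2.1")
plain = ipaddress.IPv6Address("2001:db8::c000:201")

# Both strings name the same 128-bit address.
same_address = mixed == plain

# An actual IPv4-mapped address lives under ::ffff:0:0/96.
mapped = ipaddress.IPv6Address("::ffff:192.0.2.1")
embedded_v4 = mapped.ipv4_mapped  # IPv4Address for mapped addresses, else None


def is_ipv6_literal(text: str) -> bool:
    """Hypothetical helper: does the string parse as an IPv6 literal?"""
    try:
        ipaddress.IPv6Address(text)
        return True
    except ValueError:
        return False
```

So a parser only needs to handle the :: compression and the optional dotted tail; everything else is plain hex groups.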
> Zero days for chrome will cost more than zero days for Firefox because Chrome takes security more seriously
They may cost more for Chrome, but it needn’t be because Chrome takes security more seriously; Chrome’s greater market share alone would be enough to account for this.
Not that I’m denying the overall conclusion. Just this bit of reasoning.
That wouldn't help in that case as exfiltrated data is committed to public GitHub repositories. Unless you have to accept every time an app posts or requests data from known hosts?
Personally I don't allow outbound connections from almost any app, except web browsers to port 80/443. So nodejs, pip, ruby, curl, wget, etc, opening unexpected outbound connections is a big red flag for me.
In some cases, maybe you need to permanently allow git to open outbound requests to github.com (or gitlab, etc), but at least in my case, I'm okay allowing these connections manually.
> preinstall script: bun run index.js
> Dual exfiltration:
> stolen data is committed as Git objects to public GitHub repositories (api.github.com)
> and sent as RSA+AES encrypted HTTPS POSTs to hxxps://t.m-kosche[.]com/api/public/otel/v1/traces (disguised as OpenTelemetry traces)
> The Bun installer command (command -v bun >/dev/null 2>&1 || (curl -fsSL https://bun.sh/install | bash && export PATH=$HOME/.bun/bin:$PATH)) prepends every injected hook to guarantee Bun availability
> A separate gh-token-monitor daemon (decrypted from J7, deployed by class so) installs to ~/.local/bin/gh-token-monitor.sh with its own systemd service and LaunchAgent. It polls stolen GitHub tokens at 60-second intervals with a 24-hour TTL
This attack in particular would have caused OpenSnitch to go crazy, giving you the opportunity to review what's going on.
1) write a well crafted exfil payload to mozilla or chrome directory (there are sqlite databases and files that store eg. indexeddb content)
2) trigger a tab open to attacker's website, website takes the exfil data from indexeddb and posts it to the server (have something innocuous looking on that website - like a fake npm homepage or whatever, so you don't close it fast enough)
from a one-step process, this becomes a universally usable two-step process
be sure not to use extra cli parameters like "firefox --new-tab <url>", because if the rule is filtering by process path + cmdline it'll trigger a pop-up to allow the outbound request.
> Personally I don't allow outbound connections from almost any app, except web browsers to port 80/443. So nodejs, pip, ruby, curl, wget, etc, opening unexpected outbound connections is a big red flag for me.
Yep, exactly. Reject by default, with reasonably judicious always-allow rules.
> That wouldn't help in that case as exfiltrated data is committed to public GitHub repositories
Correct in general that it doesn't protect against stuff like that. But this whitelisting is done per-command (in this case, the whitelisting is scoped to the node executable). I've had no need to allow node access to Git in the first place, so no problem there.
> Unless you have to accept every time an app posts or requests data from known hosts?
OpenSnitch doesn't have access to application-level information, so it has no concept of "post" or "request." It's got DNS names, layer 3 info, layer 4 info, and other such things that are visible to the kernel. Your rules get matched to network traffic based on these various properties.
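A rough sketch of what "reject by default, with per-executable allow rules" looks like as logic. This is not OpenSnitch's rule format or code; the `Connection` and `AllowRule` types, the executable paths, and the "prompt" verdict are all assumptions made up for illustration. The point is only that matching happens on process path, host, port, and protocol, with no notion of "POST" or "request".

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class Connection:
    """Properties visible at the kernel/DNS level (hypothetical model)."""
    process_path: str
    dst_host: str
    dst_port: int
    protocol: str


@dataclass(frozen=True)
class AllowRule:
    """Allow rule scoped to one executable; None means 'any value'."""
    process_path: str
    dst_host: Optional[str] = None
    dst_port: Optional[int] = None
    protocol: Optional[str] = None

    def matches(self, conn: Connection) -> bool:
        if conn.process_path != self.process_path:
            return False
        if self.dst_host is not None and conn.dst_host != self.dst_host:
            return False
        if self.dst_port is not None and conn.dst_port != self.dst_port:
            return False
        if self.protocol is not None and conn.protocol != self.protocol:
            return False
        return True


def verdict(conn: Connection, rules: Sequence[AllowRule]) -> str:
    """Default-deny: anything without a matching allow rule triggers a prompt."""
    return "allow" if any(rule.matches(conn) for rule in rules) else "prompt"


RULES = [
    AllowRule(process_path="/usr/bin/firefox", dst_port=443, protocol="tcp"),
    AllowRule(process_path="/usr/bin/firefox", dst_port=80, protocol="tcp"),
    AllowRule(process_path="/usr/bin/git", dst_host="github.com",
              dst_port=443, protocol="tcp"),
]

# node talking to api.github.com has no rule, so it gets flagged,
# even though git talking to github.com is permanently allowed.
node_upload = Connection("/usr/bin/node", "api.github.com", 443, "tcp")
git_push = Connection("/usr/bin/git", "github.com", 443, "tcp")
```

Because the rules are keyed on the executable, allowing git to reach GitHub says nothing about node reaching GitHub.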
btw, this analysis of a node linux malware with OpenSnitch and other tools was published on reddit a year ago (a malicious linkedin interview targeting web3/crypto devs that resulted in a system compromise):
Excellent example, thank you. This is the kind of stuff that skeeves me out and is entirely within the model of threats that I want to guard against. Sandboxing + OpenSnitch is good stuff. And, ofc, npm bad.
I originally bought the touchpad for my UHK. But, much to my surprise, I have gravitated towards the keyboard's built-in mouse layer over time! Now I scarcely plug in the touchpad (or even key cluster) modules at all.
As a sidenote, I love my UHK. Just a joy to use, and it's so easy to customize. I don't have any experience with competitors like the ZSA Voyager, but the UHK's configuration software and macro language do make it quite pleasant to bend to your will. For instance, I do some funky stuff with macros and lighting here: https://www.cgl.sh/blog/posts/wnl.html
I had a UHK for a few months before refunding it because the shielding wasn't good enough to stop it becoming unreliable when my mobile phone was within about 30cm of my keyboard. I contacted support and the solution was to move the phone away from the keyboard, which is kind of irritating for such an expensive piece of kit.
But, just wanted to share that I was similarly surprised to land on mouse keys as a preference. I tried most of the UHK modules which were also pretty good and have since tried various other trackballs and pads, but since trying UHK mouse keys, they're what I keep coming back to most, even since switching to new keyboards.
One issue I have with mouse keys is fear of using them in front of others though: every so often, if I need to click something particularly small and don't have a keyboard shortcut memorised (vscode panel resizing is one) it can sometimes take me a fair few embarrassing seconds drawing small squares around my target before I resort to actual mouse hardware.
For the amount of time and thought and effort people have put into alternative mice, I feel mouse keys are massively overlooked and probably have a lot of room for software/firmware innovation without hardware costs.
You bring up a good point -- I have the same issue with mouse keys. I wonder how the trackpoint gets around this. Is the trackpoint "progressive" in that it allows various speeds depending on deflection from center?
That's a good question and made me think. I always thought the trackpoint nubs were binary too, basically just a stick in the middle of up/down/left/right mousekey buttons, but it turns out they're not!
For original trackpoints, it's basically a stick in the middle of an up/down/left/right resistive strain gauge.
For the ploopy beans here, they use hall effect sensors instead of resistive strain to get a bit more movement.
As soon as you have non-binary up/down/left/right values, the mouse direction and speed can be interpolated to so many values that mousekey accidental squares become impossible.
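A minimal sketch of the difference, assuming a normalized analog deflection in the range [-1, 1] per axis. This is not how TrackPoint or Ploopy firmware actually computes motion; the function names, `max_speed`, and `dead_zone` values are made up for illustration.

```python
import math


def mouse_key_velocity(up: bool, down: bool, left: bool, right: bool,
                       speed: float = 8.0) -> tuple[float, float]:
    """Binary keys: each axis is either -speed, 0, or +speed (8 directions)."""
    dx = (right - left) * speed
    dy = (down - up) * speed
    return dx, dy


def analog_velocity(x: float, y: float, max_speed: float = 8.0,
                    dead_zone: float = 0.05) -> tuple[float, float]:
    """Analog deflection: direction follows the stick, speed follows how far
    it is pushed, clamped at max_speed."""
    magnitude = math.hypot(x, y)
    if magnitude < dead_zone:
        return 0.0, 0.0  # ignore sensor noise near the center
    clamped = min(magnitude, 1.0)
    scale = max_speed * clamped / magnitude
    return x * scale, y * scale
```

With binary keys you only ever get eight headings at one speed, which is where the small squares come from. With a continuous deflection, a light push toward the target gives a slow move in exactly that direction.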
Yes, but that’s not the threat model I was alluding to. The threat model was, you get tricked into executing malware, that will steal your passkey (and your entire password database in fact), and log your master password as soon as you use it.
When the passkey is protected behind an HSM (TPM, Yubikey, Tkey…), even a compromise of your main computer can’t steal it. Attackers can still temporarily log in on your behalf, but they can’t do anything with your passkey as long as your computer is turned off. Which means you can un-pwn yourself out of this situation by reinstalling everything (but do keep your HSM!).
Overall, we have several levels of security here:
- Weak password (potentially reused everywhere). Phished once, pwned everywhere. Not to mention password database leaks.
- Very strong unique password from your password vault (KeepassXC). Note that with automatic login, password managers may provide good phishing resistance. Manual copy pasta is still vulnerable, but at least you only compromise that one account.
- Passkey stored in your password database. Phishing proof as you say, but falls to a keylogger.
- Passkey stored in a hardware security module. Can’t be stolen ever, save for a vulnerability in the HSM itself, or, if you haven’t set up a password for your HSM, theft.
Clearly that last option is the most secure. Clearly it would be nice if everyone could do that, though we do need a way to recover from the loss or destruction of the HSM (which in the case of the TPM may mean something as mundane as changing your graphics card). Yet often, other ways are more convenient.
Still, I strongly believe companies should not force people into one method or another. Okay, I could maybe tolerate passkeys being forced on me, but not the remote attestation part. Let me manage my own security, with my own tools (preferably open source), thank you very much. There is one use case for which I may approve of remote attestation: work accounts. Because at this point it’s not about the safety of the customer, it’s about the safety of the company itself. It makes sense then that the company (or government agency) impose whatever stringent restrictions on how to access their network. They do have to provide any required tool (company laptop, company palmtop, company dongle…), same way many companies are required to provide individual safety equipment to any of their employees working in hazardous environments.
Yes, I agree that device-bound credentials (DBC?) are a really big deal here. Just wanted to get the story straight.
When it comes to the notion of requiring DBCs without also requiring remote attestation, how do you deal with solving the problem of virtualized credential devices, e.g. swtpm? If some application wants to leverage DBCs, it will make some DBC API call, e.g. call out to a TPM. However, without some sort of attestation scheme, there's no way to verify who/what is on the other end of that API call.
Maybe it's not important for applications to be able to require DBCs without attestation. But at first blush it seems like a valid thing to want.
> Maybe it's not important for applications to be able to require DBCs without attestation. But at first blush it seems like a valid thing to want.
It’s definitely something I would want, but as you hinted at yourself, if there’s no remote attestation, the user can just use a software TPM. So, a company using passkeys has two choices:
- Enforce DBC with remote attestation. This raises the security floor, but enforces device vendor lock-in, and prevents users from selecting unapproved, but potentially even more secure, devices.
- Do not enforce DBC. This lets users use less secure virtualised devices, but there’s no vendor lock-in, and those who want may use the latest most secure device ever.
Which alternative is appropriate is now a social & political problem. My opinion is that for general computers released to the general public, remote attestation is never legitimate. Even with the best of intentions it is fundamentally anticompetitive, and it makes it way too easy to go full Evil Corp. Specialised appliances and employees however are different stories.
---
Anecdotally, I have worked on TPM provisioning a couple years back, and I had to warn my hierarchy that doing it the way they specified, the TPM could be impersonated: we checked the signature of the certificate, but failed to compare the certificate root with the manufacturer’s keys. My boss didn’t believe me, until I showed the production code happily provisioned a software TPM, without detecting the impersonation. (Actually, he didn’t believe me even then, I had to go over him to the security specialist.)
This was totally a case of remote attestation. But I believe this particular case was legitimate, because it was a specialised appliance (electric car charging station), that was meant to process payments, similar to a gas station terminal.
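The bug in that anecdote can be sketched as a toy model. There is no real cryptography here: the `Cert` type is made up, and "signature verifies" is modelled as a plain key-ID comparison, purely to show the missing step. This is not the production code from the anecdote or any TPM library's API.

```python
from dataclasses import dataclass
from typing import Sequence, Set


@dataclass(frozen=True)
class Cert:
    subject: str
    key_id: str     # stand-in for the subject's public key
    signed_by: str  # stand-in for "the signature verifies under this key"


def chain_is_self_consistent(chain: Sequence[Cert]) -> bool:
    """The flawed check: every signature verifies, root is self-signed.
    chain[0] is the device certificate, chain[-1] is the root."""
    if not chain:
        return False
    for cert, issuer in zip(chain, chain[1:]):
        if cert.signed_by != issuer.key_id:
            return False
    root = chain[-1]
    return root.signed_by == root.key_id


def chain_is_trusted(chain: Sequence[Cert], manufacturer_roots: Set[str]) -> bool:
    """The missing step: the root must be one of the manufacturer's known keys."""
    return chain_is_self_consistent(chain) and chain[-1].key_id in manufacturer_roots


MANUFACTURER_ROOTS = {"vendor-root-key"}  # hypothetical pinned root

genuine = [
    Cert("tpm-device", "device-key", signed_by="vendor-root-key"),
    Cert("vendor-root", "vendor-root-key", signed_by="vendor-root-key"),
]
# A software TPM can mint its own root and sign its own device certificate.
software_tpm = [
    Cert("tpm-device", "fake-device-key", signed_by="fake-root-key"),
    Cert("fake-root", "fake-root-key", signed_by="fake-root-key"),
]
```

Both chains pass the self-consistency check, since anyone can build a chain whose signatures verify against a root they created themselves. Only the comparison against the pinned manufacturer root tells them apart.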
One man's bloat is another man's batteries-included, I guess?
My argument would be that if a more featureful standard library could get Rust closer to the superior dependency culture of Go, it'd be worth it. As-is, Rust dependency trees are just wild.
The Rust team is already stretched pretty thin. A larger library is going to put more pressure on them. These libraries are already maintained and used. The Rust project should just directly fund, shepherd, and guarantee a level of quality for the packages. The foundation has started some of this with the maintainers fund. No need to force it all into the std lib. Go has experienced breaking issues with changes in the crypto library causing churn in the ecosystem.
Point taken about the core team being stretched thin. But I don't see how the "increase stability of some core crates" is enough to change the packaging practices/culture. Maybe I'm wrong, but you really don't get those ecosystem benefits unless the ~entire ecosystem buys into that set of packages. Which really doesn't happen without stdlib.
Also, I think that your example of Go's breaking crypto changes misses the forest for the trees--the stdlib has been incredibly stable through its history, and the vast majority of packages just never have to worry about it. I'm honestly not aware of a language out there with similar adoption, featureset, and robustness. More to the point, I'm not aware of a language out there with a more reliable stdlib that permits the ecosystem to have small dependency graphs.
> Even for sites that don't offer granular feeds, every major feed reader offers filtering options, a lot of them offer fairly complex regex filtering.
This is true to some degree, but regex filtering or really any strictly logical filtering is often too coarse or too fragile to work well in practice for all but the most dedicated RSS gardeners.
What I’d really like to see is some sort of fuzzier logic that gives the user a more semantic interface to filtering and/or ranking feed items.
Take, for example, the feed for a medium-sized newspaper where you want to filter for/prioritize articles about your local area and particular topics of interest. Those news feeds are often very high volume and don’t make good use of tags or other metadata that can be consumed by strictly logical filtering systems. So a fragile, badly-behaving filtered list is what you’re likely to get. Whereas a fuzzier, more semantic interface (local LLMs?) would be far more reliable and easier to use.