Maybe I'm naive about this, but I didn't expect AI scrapers to be that big of a load? I mean, it's not as if they need to scrape the same site at 1000+ QPS, and even then I wouldn't expect them to download all the media and images either?
What am I missing that explains the gap between this and “constant DDoS” of the site?
You can't really cache the dynamic content produced by forges like GitLab or, say, web forums like phpBB, so every request goes through the slow path. Media/JS is of course cached on the edge, so that's not an issue.
Even though the volume of AI requests isn't that high - generally hundreds per second at most across our services combined - it's still a load that causes issues for legitimate users/developers. We've seen it grow from somewhat reasonable to pretty much 99% of the responses we serve.
Can it be solved by throwing more hardware at the problem? Sure. But that's not sustainable, and the reasonable approach in our case is to filter out the parasitic traffic.
Thanks, appreciate the details. 99% is far above what I expected, and if it specifically hits hard-to-cache data then I can see how that brings a system to its knees.
You kind of can, though. You serve cached assets and then use JavaScript to modify them for the individual user. The specific user actions can't be cached, but the rest can be.
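A rough sketch of that split, assuming a CDN in front that honors s-maxage (paths and payloads are made up):

    package main

    import (
        "encoding/json"
        "net/http"
    )

    func main() {
        // The shell is identical for every visitor, so the edge can cache it.
        http.HandleFunc("/forum/thread", func(w http.ResponseWriter, r *http.Request) {
            w.Header().Set("Cache-Control", "public, s-maxage=60")
            w.Write([]byte(`<html>... thread HTML + <script src="/personalize.js"></script> ...</html>`))
        })
        // Only this tiny per-user endpoint bypasses the cache; the page's JS
        // fetches it and splices in username, unread counts, etc.
        http.HandleFunc("/api/session", func(w http.ResponseWriter, r *http.Request) {
            w.Header().Set("Cache-Control", "private, no-store")
            json.NewEncoder(w).Encode(map[string]any{"user": "alice", "unread": 3})
        })
        http.ListenAndServe(":8080", nil)
    }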
Totally. Remember that Slashdot in the 1990s served a dynamic page off a handful of servers, with horsepower dwarfed by a Nintendo Switch, to a user base capable of bringing major properties down.
The "can't" comes from the fact that VLC is not going to rewrite their forum software or software forge.
Software written in PHP is in most cases frankly still abysmally slow and inefficient. WordPress runs something like 40% of the web and you can really feel it in the 1500ms+ TTFB most sites have. phpBB is not much better. Pathetic throughput at best, and it hasn't gotten better in decades.
I don't know how GitLab became so disgustingly slow. But yeah, I'm not surprised bots can easily bring it to its knees.
> WordPress runs something like 40% of the web and you can really feel it in the 1500ms+ TTFB most sites have. phpBB is not much better.
At least phpBB died 15 years ago, with most communities migrating to XenForo. I'm not quite sure how or why WP is still around with so many SSGs and SaaS site builders floating around these days.
The funniest part about WordPress is that you can usually get a 50% speed boost or more just by adding a plugin that minifies and caches the ridiculous number of dynamic CSS and JS files most themes and plugins add to every page. Set those up with HTTP 103 Early Hints preload headers (so the browser can start sending subresource requests in the background before the HTML is even sent out, exactly the kind of thing HTTP/2 and /3 were designed to make possible), then throw Cloudflare or another decent CDN on top, and you're suddenly getting TTFBs much closer to a more "modern" stack.
The bizarre thing is that pretty much no CMS, even the "new" ones, automates any of that by default. None of those steps are that difficult to implement, and in my experience they provide a serious speed boost to everything from WordPress to MediaWiki, yet the only service that comes close to offering it out of the box is Cloudflare.
Even then, Cloudflare's tooling works best if you're already emitting minified, compressed files and custom preload headers on the origin side: decompressing all the origin traffic to make those adjustments and analyses costs more than just forwarding your compressed responses directly. That's why they removed Auto Minify[1] and encourage sending pre-compressed Brotli level 11 responses from the origin[2], so people on recent browsers get pass-through compression without extra cycles being spent on Cloudflare's servers.
The solution seems pretty clear: serve as much as you can statically, preferably pre-compressed. But it's still weird that implementing that is a manual process on most CMSes, when it shouldn't be that hard to make it a standard feature.
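The Early Hints part at least is cheap to do yourself these days; a minimal sketch in Go 1.19+ (asset paths and the renderSlowPage helper are made up):

    package main

    import "net/http"

    func main() {
        http.HandleFunc("/", func(w http.ResponseWriter, r *http.Request) {
            // Tell the browser what to preload before the body is ready.
            w.Header().Add("Link", "</static/site.min.css>; rel=preload; as=style")
            w.Header().Add("Link", "</static/site.min.js>; rel=preload; as=script")
            w.WriteHeader(http.StatusEarlyHints) // "103 Early Hints" goes out immediately

            html := renderSlowPage() // stand-in for the expensive CMS rendering
            w.WriteHeader(http.StatusOK)
            w.Write([]byte(html))
        })
        http.ListenAndServe(":8080", nil)
    }

    // renderSlowPage is a hypothetical stand-in for template rendering.
    func renderSlowPage() string {
        return `<html><head><link rel="stylesheet" href="/static/site.min.css"></head><body>...</body></html>`
    }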
And as for Git web interfaces, the correct solution is to require logins to view complete history. Nobody likes saying it, nobody likes hearing it. But Git is not efficient enough on its own to handle the constant bombardment of random history paginations and diffs that AI crawlers seem to love. It wasn't an issue before, because old crawlers for things like search engines were smart enough to ignore those types of pages, or at least to accept it when the sysadmin said they should be ignored. AI crawlers have no limits, ignore signals from site operators, make no attempt to skip redundant content, and are in general very dumb about how they send requests. This is a large part of why Anubis works so well: it's not a particularly complex or hard-to-bypass proof-of-work system[3], but AI bots genuinely don't care about anything except consuming as many HTTP 200s as a server can return, and they give up at the slightest hint of pushback (though they do at least try randomizing IPs and User-Agents, since those are effectively zero-cost to attempt).
[3]: https://lock.cmpxchg8b.com/anubis.html but see also https://news.ycombinator.com/item?id=45787775 and then https://news.ycombinator.com/item?id=43668433 and https://news.ycombinator.com/item?id=43864108 for how it's working in the real world. Clearly Anubis does work, given testimonials from admins and wide deployment numbers, but that can only mean AI scrapers aren't actually implementing effective bypass measures. That seems in line with what I've heard about AI scrapers, summarized well in https://news.ycombinator.com/item?id=43397361: they are making basically no attempt to optimize how they crawl. The general consensus seems to be that if they were crawling optimally, they'd just pull down a copy of Common Crawl like every other major data analysis project has done for the last two decades, but the AI companies are so desperate to get slightly more training data than their competitors that they're repeatedly crawling near-identical Git diffs on the off chance they reveal some slightly different permutation of text to use. This is also why open source models have been able to almost keep pace with the state-of-the-art models coming out of the big firms: they're designing much more efficient training processes, while the big guys throw hardware and crawlers at the problem in the desperate hope that they can will it into an Amazon model instead of a Ben and Jerry's model[4].
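For a sense of how little machinery is involved in that style of proof of work: the client brute-forces a nonce until the hash of challenge+nonce clears a difficulty bar, and the server verifies it with a single hash. A toy sketch of the idea (not Anubis's actual code):

    package main

    import (
        "crypto/sha256"
        "encoding/hex"
        "fmt"
        "strconv"
        "strings"
    )

    // verify: does sha256(challenge + nonce) start with `difficulty` zero hex digits?
    func verify(challenge, nonce string, difficulty int) bool {
        sum := sha256.Sum256([]byte(challenge + nonce))
        return strings.HasPrefix(hex.EncodeToString(sum[:]), strings.Repeat("0", difficulty))
    }

    // solve plays the client's role: grind nonces until one passes.
    func solve(challenge string, difficulty int) string {
        for i := 0; ; i++ {
            if nonce := strconv.Itoa(i); verify(challenge, nonce, difficulty) {
                return nonce
            }
        }
    }

    func main() {
        const challenge, difficulty = "per-visitor-challenge", 4 // ~16^4 hashes on average
        nonce := solve(challenge, difficulty)
        fmt.Println(nonce, verify(challenge, nonce, difficulty))
    }

The asymmetry is the whole point: the client burns CPU once per challenge, the server pays one hash per check, and a bot that refuses to execute anything never gets through.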
> And as for Git web interfaces, the correct solution is to require logins to view complete history.
Why logins, exactly? Who would have such logins; developers only, or anyone who signs up? I'm not sure if this is an effective long-term mitigation, or simply a “wall of minimal height” like you point out that Anubis is.
- AI scrapers will pull a bunch of docs from many sites in parallel (so instead of a human picking a single Google result, it hits a bunch of sites)
- AI will crawl the site looking for the correct answer, which may hit a handful of pages
- AI sends requests in quick succession (big bursts instead of a small trickle over a longer time)
- Personal assistants may crawl the site repeatedly, scraping everything (we saw a fair bit of this at work; they announced themselves with user agents)
- At work (B2B SaaS webapp) we also found that the personal-assistant variety tended to hammer really computationally expensive data export and reporting endpoints, generally without filters. While our app technically supported it, it was very inorganic traffic
That said, I don't think the solution is blanket blocks. Really, it's exposing that sites are poorly optimized for emerging technology.
Also, relevant for forges: AI doesn't understand what it's clicking on. Git forges tend to have a lot of links like "download a tarball at this revision", which are super expensive as far as resources go, and AI crawlers click on them much, much more often than humans do, because they click on every link that looks shiny. (And there are a lot of revisions in a project like VLC!)
This is also irrelevant to the original comment, which is complaining about bot checks for looking at the root of the repository - probably the most-requested resource, which should be 100% served from cache at a cost much lower than running the bot checks.
It's simply bad, inefficient software and we shouldn't keep making excuses for it.
Agree. I did some basic searching, and it looks like GitLab is particularly bad. It ships with built-in rate limiting, but the backend marks all pages as uncacheable on top of them being somewhat dynamically generated (I guess it caches "page fragments").
The only issues I found amounted to "here's how to use Anubis to block everything"
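One edge-side workaround that doesn't involve patching GitLab: a small proxy that rewrites Cache-Control on anonymous GETs, so a CDN or Varnish in front can actually absorb the repeat hits. A sketch (the header policy is an assumption; you'd have to verify which pages really are identical for logged-out visitors):

    package main

    import (
        "net/http"
        "net/http/httputil"
        "net/url"
    )

    func main() {
        origin, _ := url.Parse("http://gitlab.internal:8081")
        proxy := httputil.NewSingleHostReverseProxy(origin)

        // Loosen caching only for anonymous GETs; logged-in users
        // (session cookie present) keep GitLab's own no-cache headers.
        proxy.ModifyResponse = func(resp *http.Response) error {
            req := resp.Request
            if req.Method == http.MethodGet {
                if _, err := req.Cookie("_gitlab_session"); err == http.ErrNoCookie {
                    resp.Header.Set("Cache-Control", "public, s-maxage=300")
                }
            }
            return nil
        }
        http.ListenAndServe(":8080", proxy)
    }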
They are a scourge: they never rate-limit themselves, there are a hundred of them, and a significant number don't respect robots.txt. Many of them also end up on our meta noindex,nofollow search pages, leading to cost overruns on our Algolia usage. We spend far more time adjusting WAF rules and other bot controls than we should.
Thanks. I imagine there is (a) a lot of interest in scraping source code, and (b) many requests to forges hitting expensive paths. 99% of volume, though... wow, much more than expected.
You've gotten several comprehensive responses so far, and I want to add a niche corner that people might assume doesn't have the bot problem but still does.
I run a website that hosts tools for my family: games and a TV interface for the kids, remote access to our family cloud and cameras, etc. Sensitive things require a login and have additional parameters required for access, of course.
I specifically blocked search engine bots so my site is never indexed, as I'm not selling anything and don't want any attention. To be safe I also blocked some other public, non-malicious bots in case they feed data to Google, and my robots.txt doesn't allow anything.
I assume, then, that the only way a bot could even find my site is to do what the indexers do: brute-force every possible IPv4 address hoping to hear something back, since my domain should not be known (and isn't simple enough to be quickly guessed). So most of my traffic must be malicious or indexing (AI overviews and other scrapers won't find it via web search).
Since it isn't indexed, and keeping everything in simple black-and-white buckets, my remaining traffic is either family or malicious bots, and 99.9% isn't family.
I currently have the strictest bot-blocking setup I could come up with, which cut down on quite a bit of traffic, but I still receive ~2k attempts per day, which, as you can imagine, is still around 99% non-family traffic, as I have fewer than 20 kids and they aren't using the site nonstop.
Conveniently, my setup has never accidentally blocked a family member, so I'm pleased with it.
> I assume, then, that the only way a bot could even find my site is to do what the indexers do: brute-force every possible IPv4 address hoping to hear something back, since my domain should not be known
If your site uses https, they could also get your domain from the certificate transparency logs for the certificate you use.
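You can check exactly what they see: crt.sh exposes the CT logs over a JSON endpoint, so enumerating every hostname a certificate was issued for takes a few lines. A sketch (example.com stands in for your domain; name_value is the field crt.sh returns hostnames in):

    package main

    import (
        "encoding/json"
        "fmt"
        "net/http"
    )

    // Each crt.sh row carries the hostnames the certificate covers,
    // newline-separated, in name_value.
    type entry struct {
        NameValue string `json:"name_value"`
    }

    func main() {
        // %25 is a URL-encoded % wildcard: every cert logged for *.example.com.
        resp, err := http.Get("https://crt.sh/?q=%25.example.com&output=json")
        if err != nil {
            panic(err)
        }
        defer resp.Body.Close()

        var entries []entry
        if err := json.NewDecoder(resp.Body).Decode(&entries); err != nil {
            panic(err)
        }
        for _, e := range entries {
            fmt.Println(e.NameValue) // hostnames, straight out of the public logs
        }
    }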
I didn't think of that, but it makes complete sense, as it is HTTPS. I think my info was sold by my registrar as well, because solicitors call or email me on occasion because they "accidentally came across my site" and want to offer design/JS/etc. help.
Stellar | Amsterdam, the Netherlands | Onsite (2 days remote OK) | €70-100k + equity | https://stellarcs.ai
Hey, I'm one of Stellar's founders. We're building AI for large contact centers; our primary product is voice.
Contact centers are the last place where companies actually talk to their customers. We listen to calls every week and it's fascinating. AI here helps real people: shorter wait times, 24/7 support, lower cost. We've helped companies go from a 60% to a 98% pickup rate in weeks. We've had (human) agents ask us to accelerate our rollout plan because the days with Stellar are so much better than the days without.
We’re all builders. Everybody in our company still writes code. We have one role open:
We are looking for implementation leads who love working with clients. FDEs welcome as well.
This role is perfect if you:
* Want massive ownership in a small team
* Actually enjoy solving hard problems (real-time audio at enterprise scale)
* Think making AI sound human in niche dialects is a fun challenge
Stellar | Amsterdam, the Netherlands | Onsite (2 days remote OK) | €70-100k + equity | https://stellarcs.ai
Hey, I'm one of Stellar's founders. We're building AI for large contact centers; our primary product is voice.
Everyone thinks contact centers are boring. They're wrong. It's the last place where companies actually talk to their customers. We listen to calls every week and it's fascinating. AI here helps real people: shorter wait times, 24/7 support, lower cost. We've helped companies go from a 60% to a 98% pickup rate in weeks.
We're bootstrapped, 10 FTE, cash-flow positive, and growing fast with household names in the Netherlands and Belgium. Doubling our volume every month for the last 4 months.
We’re all builders. Everybody in our company still writes code. We have 2 roles open:
1. We need product-minded engineers who can jump between our Go and TS backends and React frontend.
2. We need implementation leads who love working with clients. FDEs welcome as well.
These roles are perfect if you:
* Want massive ownership in a small team
* Actually enjoy solving hard problems (real-time audio at enterprise scale)
* Think making AI sound human in niche dialects is a fun challenge
Indeed. A similar accident (USAir 1493 / SkyWest 5569) shows exactly that thinking.[1] It would have been easy to pin on the controller, but they went far beyond that in their analysis. I'm almost always impressed by the professionalism of those organizations. I sometimes wonder how software would look if we had such investigations for major incidents.
For some work, similar to GP's philosophy example, LLMs can help with depth/quality. It's additive to your own thinking. -> quality approach
For other things, I take a quantity approach: having 8 subagents research, implement, review, improve, review (etc.) a feature in a non-critical part of our code, or investigate a bug together with some traces. It's displacing my own thinking, but that's OK; it makes up for it with the speed and amount of work it can do. -> quantity approach
It's become mostly a matter of picking the right approach for the problem I'm trying to solve.
Datapoint: I'm running nightly now because the latest 1.2 release has a known crash-on-wake bug when disconnecting an external monitor. According to the issue tracker it was fixed months ago, but the fix isn't in a release. The nightly is stable, though.
Didn't realize how many projects use libghostty; I'll try cmux one of these days.
We're doing a lot with the realtime models. Happy to see a new release.
Initial feel from a few calls is that it performs better with alphanumeric inputs. Voice seems consistent. Recognition in a few tests seems somewhat better; in particular, it did much better on the two 8-bit 8 kHz mu-law calls I tried.
It does still struggle a bit with some specifics in other languages (e.g., that the Dutch/German pronunciation of 53 'fifty-three' is effectively 'three-and-fifty').
sqlc has, for me, replaced most of the cases I'd use an ORM for. It makes most of the boilerplate of using plain SQL go away, I get type-safe responses, and it forces me to be more mindful of the queries I write.
In an app where we do use an ORM (Prisma), we sometimes have weird database spikes and it’s almost always an unintended heavy ORM query.
The only two things I miss in solutions like sqlc are dynamic queries (filters, partial inserts) and a way to add something to every query by default (e.g., always filtering by tenant_id).
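For anyone who hasn't tried it, the workflow is: write plain SQL with a name annotation, and sqlc generates a typed Go method for it. A sketch of the shape (table, names, and the db package are invented; this is a fragment rather than a runnable program, since Queries comes from generated code):

    // query.sql (sqlc input):
    //
    //     -- name: GetOrdersByTenant :many
    //     SELECT id, total FROM orders WHERE tenant_id = $1;
    //
    // sqlc emits a Queries type with a typed method, so call sites read:
    func listOrders(ctx context.Context, q *db.Queries, tenantID int64) error {
        orders, err := q.GetOrdersByTenant(ctx, tenantID) // args and rows fully typed
        if err != nil {
            return err
        }
        for _, o := range orders {
            fmt.Println(o.ID, o.Total) // struct fields derived from the schema
        }
        return nil
    }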
Ideally, write the hiring manager and not HR. And write something that makes it hard not to want to talk to you.
1: Minimal hygiene is writing something that shows you read the vacancy (if any). Don't: "I'm interested in the role, CV attached." Do: "You want onsite in Amsterdam; I'm living in Milan but already planning to move to Amsterdam for reason X."
2: Stand out from the average applicant. Someone recently applied with a personal website that was a kinda-functioning OS (with some apps). Someone else applied with a YouTube channel hacking an ESP32 into their coffee machine. Someone applied with a tool on their GitHub profile, super well written, in our target language, doing interesting things with the database we're working with, etc. How could I _not_ talk to these applicants? All of these are soft signals that show affinity for their work as engineers. Don't: a generic application letter combined with a 3+ page resume full of too much detail.
3: If invited, get curious (but not overly opinionated/combative) about their stack. The candidates we've been most excited about have come in asking how we're set up and why we've made certain choices. Don't: expect the interviewer to ask all the questions, or bring only a prepared question that misses the mark.
4: It's a people process; if that's your challenge, work on that. Maybe you share a hobby with the interviewer, maybe you've both solved similar problems in earlier jobs, maybe you both like Haskell, maybe there's something else to connect over. Connection matters to most hiring managers.
Context: I'm in the Netherlands. With taxes, power costs around €0.25/kWh for me. For reference: Amsterdam sits at about 52°N, far enough north that the same latitude only crosses Alaska, not the US mainland.
I installed 2800 Wp of solar for about €2800 ($3000; payback in 4-5 years) and a 5 kWh battery for €1200 ($1300), all in. The battery has an expected payback time of just over 5 years, and I get some backup power if I need it.
I'm pretty sure about the battery payback, because I have a few years of per-second consumption data in ClickHouse and (very conservatively) simulated the battery. A few years ago any business case for storage was completely impossible, and now suddenly we're here.
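The simulation itself is conceptually tiny; a stripped-down sketch of the loop (loadSamples stands in for the ClickHouse query; the real version needs round-trip efficiency, charge-rate limits, and actual tariffs, and this one conservatively values exported surplus at zero):

    package main

    import (
        "fmt"
        "math"
    )

    // One per-second reading: net grid power in watts
    // (negative = solar surplus that would otherwise be exported).
    type sample struct{ netW float64 }

    // loadSamples is a stand-in for the real ClickHouse query.
    func loadSamples() []sample { return nil }

    func main() {
        const (
            capWh    = 5000.0 // 5 kWh battery
            priceEUR = 0.25   // grid price per kWh
            costEUR  = 1200.0 // battery cost
        )
        var stateWh, savedKWh float64

        for _, s := range loadSamples() { // assume one year of per-second data
            wh := s.netW / 3600.0 // one second of power -> watt-hours
            if wh < 0 {
                // Surplus: charge the battery instead of exporting.
                stateWh += math.Min(-wh, capWh-stateWh)
            } else {
                // Demand: discharge instead of importing from the grid.
                d := math.Min(wh, stateWh)
                stateWh -= d
                savedKWh += d / 1000.0 // every Wh discharged is a Wh not bought
            }
        }
        saved := savedKWh * priceEUR
        fmt.Printf("saved ~EUR %.0f/year -> payback in %.1f years\n", saved, costEUR/saved)
    }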
I could totally see this happening in the US as prices improve further, even if it's not feasible today.