If I asserted: "It's faster to put 10,000 rows of CSV data in a single file instead of 10,000 individual files", even the most junior programmer would likely say "Well, duh, it's 1 file instead of 10,000".
Yet this benchmark is at the top of HN for some reason.
"Measure before optimizing" has limited application. There are many things that need not be measured. Appending to a file is definitely not slower than doing memory management inside the file to allocate a new chunk, seeking to that position and then writing out the chunk. Period. (And the serialization overhead is negligible compared to the disk I/O).
It's not only "writing" though. It's "writing" AND "reading".
Can you read the correct line from a CSV as fast? Can you find a row matching specific criteria as fast as SQLite can? Can you handle concurrency correctly the way SQLite does? Can you make sure your CSV is always in a consistent state the way SQLite does?
The point of the article is that SQLite is still fast even with all the benefits of a database.
The idea of SQLite is that you can default to using it and move away when it hits its limits. Nobody should default to using CSV; you use CSV when you have to, despite all its limitations.
I'm not attempting to compare the usefulness of CSV to an RDBMS. I'm simply pointing out that it should be inherently obvious that writing to 1 file is faster than writing to N files, which is the only thing the benchmark in question is concerned with.
Opening a file means locating it physically on the disk by following references (usually at least one per directory level). This is a costly operation.
Compare to the typical case of adding another row by appending to an already allocated disk block. No seek operation needed.
This analysis is incorrect. SQLite must allocate database pages for rows just like the kernel must allocate filesystem blocks for files. Reading a row in SQLite means traversing a B-tree to find the row. Reading a file from the filesystem means traversing a B-tree to find the file metadata. Very similar. The kernel doesn't need to "seek" any more than SQLite does; the article says all the drives are SSDs, which have no heads and therefore don't seek.
The major differences here come from the different levels of isolation and authentication that the kernel and SQLite provide. Another difference is that when you find a row in SQLite by primary key, you don't need a second lookup to get from the metadata to the data. That's a good optimization when storing many small pieces of data, but a bad one for large pieces of data.
This was just about opening a file and appending a few bytes vs appending to an open file. Obviously the latter can't be slower (and is typically faster).
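The same point as a hypothetical sketch — appending N lines through one open handle vs. paying open/close for every append (the file name and N are illustrative):

```python
import os
import tempfile
import time

N = 2000  # appends; small enough to run quickly

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "log.csv")

    # Case 1: keep one handle open and append N lines.
    start = time.perf_counter()
    with open(path, "a") as f:
        for i in range(N):
            f.write(f"{i},data\n")
    kept_open = time.perf_counter() - start

    # Case 2: re-open the file for every single append.
    start = time.perf_counter()
    for i in range(N):
        with open(path, "a") as f:
            f.write(f"{i},data\n")
    reopened = time.perf_counter() - start

print(f"kept open: {kept_open:.4f}s, reopened each time: {reopened:.4f}s")
```

The kept-open case also benefits from userspace buffering: most of those writes never become individual system calls.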
Because SQLite isn't meant as a general purpose filesystem. It can do some things better (like, apparently, small blob storage), but not everything.
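A minimal sketch of that small-blob pattern, assuming a hypothetical `blobs` table — the schema, file names, and payloads are made up, not taken from the benchmark:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE blobs (name TEXT PRIMARY KEY, data BLOB)")

# Insert many small blobs in one transaction, keyed by a file-like name.
with conn:
    conn.executemany(
        "INSERT INTO blobs VALUES (?, ?)",
        [(f"img{i}.png", b"\x89PNG" + bytes([i])) for i in range(100)],
    )

# Fetch one blob by primary key; a single B-tree lookup, no open/close.
data = conn.execute(
    "SELECT data FROM blobs WHERE name = ?", ("img7.png",)
).fetchone()[0]
```

Batching the inserts into one transaction matters: without it, SQLite would sync the journal per insert.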
For example, SQLite vacuuming to free up deleted data can be slow; on large SQLite databases, we've found it much faster to rewrite the entire file than to vacuum it.
It also has some scalability limits: it uses locking to allow only a single concurrent writer (short-duration locks), which only scales up to a point.
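For what it's worth, the single-writer lock can be made less painful with WAL mode, which lets readers proceed concurrently with one writer. A minimal sketch (the file name is illustrative):

```python
import os
import sqlite3
import tempfile

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "app.db")  # WAL requires a file-backed database
    conn = sqlite3.connect(path)

    # Switch from the default rollback journal to write-ahead logging.
    mode = conn.execute("PRAGMA journal_mode=WAL").fetchone()[0]

    conn.execute("CREATE TABLE jobs (id INTEGER PRIMARY KEY, payload TEXT)")
    conn.commit()
    conn.close()
```

Writers still serialize against each other in WAL mode; it only removes the reader/writer blocking.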
There are a lot of filesystem characteristics that differ from a database file format. SQLite could probably be made more filesystem-like, but then it would diverge from being a fast and lightweight database format.
The parent comment is a bit misleading, or completely wrong. SQLite also has to allocate blocks in the database for anything you store. Some of the structures and techniques that SQLite uses for doing this are very similar to the way a filesystem does it.
Instead, think about it this way. With a filesystem, the database is managed by the kernel. Every time you want to read a file, you might do four system calls: open, fstat, read, close. Or you might do mmap instead of read, but you're still doing four context switches, at least in typical cases. Switching to the kernel and back has cost. Normally this cost is small, but if you make a lot of system calls you'll notice the costs piling up. The kernel also has to check permissions to make sure that you have permission to read the file.
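That open/fstat/read/close sequence, sketched with Python's low-level `os` wrappers, which map roughly one-to-one onto the system calls (the path and contents are made up):

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "img1.png")
    with open(path, "wb") as f:
        f.write(b"fake png bytes")

    fd = os.open(path, os.O_RDONLY)   # syscall 1: open(2)
    size = os.fstat(fd).st_size       # syscall 2: fstat(2)
    data = os.read(fd, size)          # syscall 3: read(2)
    os.close(fd)                      # syscall 4: close(2)
```

Four kernel round trips for one small read is exactly the overhead the comment is describing; SQLite amortizes it by serving hot pages from its own in-process cache.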
With SQLite, the database is inside your application. When you read a row, there's a chance that the row is already in your application's memory. This means no context switches back and forth between application and kernel.
Additionally, when you read a row, the entire database page is read into memory, which includes other rows too. The kernel won't do anything like that with your application--it won't give you img1.png and img2.png if you just ask for img1.png. Maybe they'll both be in the kernel's page cache, but you still have to open and read the file.
Microsoft tried doing this and there are file systems that have tried or are trying. It just opens up multiple cans of worms but it's a pretty interesting idea.
Now I'm confused myself: the parent's context was why appending to a CSV is faster than opening a file and appending to it. There was no mention of a database.