Ask YC: How to write a database

stonemetal · on Oct 26, 2010

Most algorithm and data structure texts deal with in-memory algorithms

Nope, they deal with Algorithms in general without regard to storage. Usually when an algorithm has been developed with some sort of memory in mind it is titled as such. As an example cache oblivious algorithms are algorithms which are cache aware but oblivious to cache size (aka supposed to be good no matter the cache size).

Sqlite is the only database I can think of that is small enough to be read. If all you are interested in is algorithms applied to disk storage I don't think I would recommend writing your own database since that is going to involve writing a large chunk of code for SQL parsing, query planing, etc. Why not just write some data processing tools for data mining.

tumba · on Oct 26, 2010

I agree and should have been more precise. Most algorithm and data structure texts fail to deal with the issues surrounding secondary-storage application, including optimizing disk accesses, buffering, etc.

_3u10 · on Oct 26, 2010

Optimization is a black art. It generally only applies to very specific cases. This is why most RDBM systems expose a lot of knobs to tune. The DB must be tuned both to hardware and work load, what is a good idea for one workload may be a disaster for another.

Here is an example, you can usually increase DB performance by about 15% by not allocating the first 8 MB of your hard drive. (Hint: the partition table causes misaligned reads/writes on raid systems if your partition starts at sector 1)

If you want to see the tricks of the trade, go read the TPC results, find the auditor's report and look through the source, the hardware configuration, and the software configuration.

_3u10 · on Oct 26, 2010

Theoretically, B-trees/ISAM. That's pretty much all you need to know. Most algorithms aren't optimized for their storage layer(s), they exist primarily in big-O notation that ignores latencies, and especially non-uniform latencies.

On most systems RAM is the 5th level of storage (registers,l1 cache,l2 cache,l3 cache, ram). Just because L1-3 cache isn't addressable doesn't mean it won't have a serious effect on the performance of your code.

bockris · on Oct 26, 2010

Check out the source of SQLite.

gdl · on Oct 26, 2010

I'll second this. I haven't looked through it myself, but have heard it praised many times as an example of exceptionally clean and efficient code.

1331 · on Oct 27, 2010

If you would like to learn database theory, then I recommend the following book:

_An Introduction to Database Systems_ (8th Ed.) By C. J. Date http://www.amazon.com/Introduction-Database-Systems-8th/dp/0...