Hacker Newsnew | past | comments | ask | show | jobs | submitlogin
Ask YC: How to write a database
2 points by tumba on Oct 26, 2010 | hide | past | favorite | 7 comments
Most algorithm and data structure texts deal with in-memory algorithms and I'm having difficulty making the leap to secondary-storage equivalents. I know that one should think twice before writing one's own database, but I want to learn how they work. What books and/or source code would you recommend I read?


Most algorithm and data structure texts deal with in-memory algorithms

Nope, they deal with Algorithms in general without regard to storage. Usually when an algorithm has been developed with some sort of memory in mind it is titled as such. As an example cache oblivious algorithms are algorithms which are cache aware but oblivious to cache size (aka supposed to be good no matter the cache size).

Sqlite is the only database I can think of that is small enough to be read. If all you are interested in is algorithms applied to disk storage I don't think I would recommend writing your own database since that is going to involve writing a large chunk of code for SQL parsing, query planing, etc. Why not just write some data processing tools for data mining.


I agree and should have been more precise. Most algorithm and data structure texts fail to deal with the issues surrounding secondary-storage application, including optimizing disk accesses, buffering, etc.


Optimization is a black art. It generally only applies to very specific cases. This is why most RDBM systems expose a lot of knobs to tune. The DB must be tuned both to hardware and work load, what is a good idea for one workload may be a disaster for another.

Here is an example, you can usually increase DB performance by about 15% by not allocating the first 8 MB of your hard drive. (Hint: the partition table causes misaligned reads/writes on raid systems if your partition starts at sector 1)

If you want to see the tricks of the trade, go read the TPC results, find the auditor's report and look through the source, the hardware configuration, and the software configuration.


Theoretically, B-trees/ISAM. That's pretty much all you need to know. Most algorithms aren't optimized for their storage layer(s), they exist primarily in big-O notation that ignores latencies, and especially non-uniform latencies.

On most systems RAM is the 5th level of storage (registers,l1 cache,l2 cache,l3 cache, ram). Just because L1-3 cache isn't addressable doesn't mean it won't have a serious effect on the performance of your code.


Check out the source of SQLite.


I'll second this. I haven't looked through it myself, but have heard it praised many times as an example of exceptionally clean and efficient code.


If you would like to learn database theory, then I recommend the following book:

_An Introduction to Database Systems_ (8th Ed.) By C. J. Date http://www.amazon.com/Introduction-Database-Systems-8th/dp/0...




Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact

Search: