Compilers Are Databases [video] (youtube.com)
84 points by oelang on Aug 11, 2015 | hide | past | favorite | 35 comments


I think we can go further:

- every non-trivial piece of software is a compiler

- every non-trivial piece of software is a database

Every non-trivial program has to read/parse/manipulate/store a lot of data. It ultimately simplifies things to admit that we are building a specialized database/compiler. Whether the input data comes from a file, a database, the network, the command line, or the code itself, and whether it goes to files, a database, or the network, is just an implementation detail.


'Any sufficiently complicated C or Fortran program contains an ad hoc, informally-specified, bug-ridden, slow implementation of half of Common Lisp.' - Greenspun's tenth rule [1]

[1] https://en.wikipedia.org/wiki/Greenspun%27s_tenth_rule


The real failure of this law is that even if you use Common Lisp (or any other language) to implement your software, you will still end up with an ad hoc, informally-specified, bug-ridden, slow implementation of half of Common Lisp inside it.

The solution is to design your software as a compiler from the start.


It seems you've only read the title of this post.

Too bad. It's an interesting talk on the architecture of an ambitious compiler redesign. The title is a simplification for the sake of being a keynote.


The fact that I find the title inspiring and think we can go further doesn't mean I haven't watched the video. Please stop criticizing people based on imaginary facts.

There were a lot of interesting comments yesterday on a post called "Presumption of stupidity" (https://news.ycombinator.com/item?id=10034883). It applies to comments too: do not assume the worst you can imagine by default.

And yes, it's indeed a very interesting talk.


> every non-trivial software is a compiler

Actually this is even more true than you probably think it is.


I watched the first part of this, and this is apparently the inventor of Scala talking about the Scala compiler.

He was talking about how deep the transformation pipeline in scalac is, compared to javac.

I would say that something close to the opposite statement is justified: databases are interpreters, e.g. sqlite is certainly an interpreter with a storage layer.
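To make that concrete: sqlite compiles each SQL statement into bytecode for its virtual machine (the VDBE) and then interprets that bytecode against the storage layer. You can see the compiled program from Python's sqlite3 module with EXPLAIN (a sketch; the exact opcode listing varies by SQLite version):

```python
import sqlite3

# sqlite compiles SQL into VDBE bytecode, then interprets it.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE t (x INTEGER)")

# EXPLAIN returns the compiled program instead of running it;
# column 1 of each row is the opcode name.
program = con.execute("EXPLAIN SELECT x FROM t WHERE x > 1").fetchall()
opcodes = [row[1] for row in program]
print(opcodes)
```

The table scan shows up as opcodes like OpenRead/Rewind/Column, with a Halt at the end — a classic interpreter loop sitting on top of a B-tree.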

I didn't watch enough of it to see why he thinks that compilers are databases. Where's the storage layer? I'm interested in compilers and databases, but TBH I am not that interested in Scala, so I stopped watching.


The storage isn't the important part. I think his point was more about _functional_ databases than your typical RDBMS in the sense that the experimental Scala compiler needed to model time (in terms of compiler passes and phases) in order to be able to answer "What is the value/type of this thing _at a given time_?". TBH, the talk was very little about Scala and mostly about compiler architecture.
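A sketch of that "value at a given time" idea (all names here are hypothetical, not the actual dotty API): store each symbol's meaning as a list of (phase, value) pairs and answer as-of queries over it, much like a functional database answers "what was this row's value at time t":

```python
import bisect

class PhasedDenotation:
    """Toy model: the meaning of a symbol as a function of compiler phase."""

    def __init__(self):
        self._phases = []  # phase ids, in increasing order of installation
        self._values = []  # value installed at the corresponding phase

    def install(self, phase: int, value):
        self._phases.append(phase)
        self._values.append(value)

    def at(self, phase: int):
        # As-of query: the latest value installed at or before `phase`.
        i = bisect.bisect_right(self._phases, phase) - 1
        if i < 0:
            raise KeyError("no denotation yet at this phase")
        return self._values[i]
```

Asking `at(2)` after installing a value at phase 1 and another at phase 4 returns the phase-1 value; nothing is ever overwritten, only superseded.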


What's an example of a "functional database"? Datomic? What's the most deployed functional database?

It's misleading to say that compilers are databases if they're not anything like databases people use and know. Probably a more accurate title would have been "Compilers are like functional databases".

Storage is obviously an important part of database architecture. If a compiler doesn't deal with storage, there's only so much resemblance that a compiler and a database can have.


I think "functional database" can be seen as a wider data pattern than just a datastore that exclusively implements it. It's a design principle where a piece of data, once created, is never changed. That leads to event sourcing or Twitter's "lambda architecture".

The compiler is a database in the sense that it has a memory across incremental invocations. So the 'storage' happens to be in memory but could be on disk.
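A minimal sketch of that memory across invocations (names are made up for illustration; real incremental compilers such as Zinc for Scala track much finer-grained dependencies): key expensive per-unit work by a content hash, so a re-run only redoes what actually changed:

```python
import hashlib

class IncrementalAnalyzer:
    """Toy sketch: analysis results survive across runs, like rows in a DB."""

    def __init__(self):
        # content-hash -> analysis result; held in memory here, but it
        # could just as well be persisted to disk between compiler runs.
        self.cache = {}
        self.misses = 0

    def analyze(self, source: str) -> dict:
        key = hashlib.sha256(source.encode()).hexdigest()
        if key not in self.cache:
            self.misses += 1  # stands in for an expensive pass (parse, type, ...)
            self.cache[key] = {"tokens": source.split()}
        return self.cache[key]
```

Analyzing the same source twice costs one miss; only changed sources trigger new work.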


I'm definitely not a DB guy, but Datomic was the first thing that came to mind. You could also include databases that require linearizable consistency, like etcd, if you consider "functional databases" to be any database that represents state as a function of time.

There was an early slide that explicitly said FRP and functional databases were inspiration for the compiler design, but I do agree that the title should have reflected that fact.


He means that compilers create an in-memory database, representing your code, which could be put in an actual database.

Not that compilers can be used as a general DBMS. I think.


Thanks for the TL;DR! I see the point; it's not that new at all. For example, Coverity's tools use sqlite to store their huge pile of IR.


Hm, I always thought that the SQL layer is a communication layer on top of the actual DB. So sqlite as a database doesn't interpret much. On the other hand, the sqlite shell tool is of course an interpreter for SQL.


I would say both statements are justified, the transitive property being what it is.


I think you mean "symmetric" rather than "transitive", and the "is a" relation isn't symmetric. (We know this from OOP -- i.e. the Liskov substitution principle.)

A database is an interpreter because it has a lexer, parser, byte code compiler, optimizer, interpreter loop, etc. It also has a storage layer (i.e. the B-tree implementation, paging, etc.).

A compiler (like javac) or interpreter isn't a database because it doesn't have a storage layer.

Even a "functional database" has to have a storage layer!


Once you add IDE and REPL support (which Scala does), you need a storage layer as an optimization. Odersky spends most of the lecture detailing the resulting performance goals and their effects on the internals needed to support these features. This leads him towards a normal-form topology that minimizes data duplication, which is why he gave the talk the title "Compilers are Databases".


For those that care about language development on the JVM, the JVM Language Summit 2015 agenda (this is one of the talks) is here:

http://openjdk.java.net/projects/mlvm/jvmlangsummit/agenda.h...

And there is already a YouTube playlist for the talks:

https://www.youtube.com/playlist?list=PLX8CzqL3ArzUo2dtMurvp...


Martin Odersky is a clever guy, there's no doubt about that. But at the beginning of the talk he introduces the Scala compiler as having 40 steps. And he says that translating from Scala to the JVM is a translation from a complex input to a complex output. And now, databases. So... I just don't understand why he would even want to take on such a difficult task.

It strikes me that bringing functional programming to the JVM can be, and has been, done more simply: Clojure being the obvious one, but F# on the CLR is another good template.

Third, Frege is a Haskell implementation for the JVM which to my mind is a much better long-term bet than Scala: it's already a tried-and-tested language, and the work involved in the translation has taken an order of magnitude less time.

It's all very well having some new great language on the JVM, but simplicity is a huge virtue if you want people to build an ecosystem around it.


The number of compiler passes does not reflect its "complexity". It can in fact be the opposite: the more distinct passes there are, the simpler each pass is, together making the whole thing simpler.

> Clojure being the obvious one

Clojure, to be honest, is an example of a very convoluted, not very well thought out compiler design.

Disclaimer: I know nothing about the actual Scala compiler passes, just commenting on the number of passes as a proxy for complexity.


> It strikes me that bringing functional programming to the JVM can / has been done simpler: Clojure being the obvious one, but F# on the CLR is another good template.

Scala's FP is more expressive than all of them, by a large margin. Additionally, Scala also supports a useful OO system, unlike, for instance, what F#/OCaml/... ship.

> It's all very well having some new great language on the JVM, but simplicity is a huge virtue if you want people to build an ecosystem around it.

You realize that Scala has probably the largest and most meaningful ecosystem of all (non-Java) JVM languages, and is driving a lot of the changes in the Java and JVM space? (Reactive streams, value types, specialized generics, ...)


Really good points.

I expect you're right about the expressiveness, although I don't really know exactly what you mean by this. The "buffet of abstractions" available in Scala actually put me off it. For me, there are too many options available to expect to have readable, maintainable code at the end of it all, especially if you work in a team which regularly creates almost wilfully hard-to-comprehend java.

This video is 2 years old now, but I think it marked the point when I stopped investigating scala. https://www.youtube.com/watch?v=TS1lpKBMkgg&feature=youtube_...


Before I clicked the link I knew which video it would be. For some reason it's largely the single fact people on HN seem to remember about Scala.

If you watch the presentation by Odersky you'll find that the new compiler addresses most if not all the things that Paul Philips mentioned.

For instance the new compiler has 5 phases. Just like javac.


While I agree with many points in the video, it feels kind of disheartening to me how many people just wave it around as if it could be used as an attack against the language and its community.

(A general observation, what you say here is totally fair.)

Of course the video is about Scala, but the problems mentioned are widespread throughout the industry.

What most of those people miss is the fact that Scala is sometimes light years ahead in figuring out hard problems. The things considered issues in Scala are often things that haven't even been considered elsewhere.

The video is now 2 years old, and in terms of Scala, that's a lot of time. The strength of the language and the community is that it can adapt pretty fast, and many issues have been/are being addressed.

There are plenty of libraries out there which experiment with different approaches to collections, there is a new, faster and cleaner backend, the new compiler is shaping up nicely, the new IR will solve a lot of binary compatibility/cross-compilation/Scala<->Scala.js issues, and scala.meta looks set to become one of the best metaprogramming abstractions in any language.

Things are working really well in Scala, so if you have any question or concerns, feel free to voice them, and I'll give you my 2 cents about what's happening in that space! :-)


I guess everything becomes a database if you work at Oracle long enough.


I don't understand the part about the variance of the tree type. Couldn't he just use the unit/top type as a type parameter for untyped trees, rather than the nothing/bottom type? Not only does that seem more intuitively correct, but it would also give the right subtype relationship.


Well, some things do have the type unit or Any (top). How would you express that if that means 'not known' in your compiler?


I think I don't understand your question. My remark was about his type Tree[T] of syntax trees annotated with values of type T. For syntax trees annotated with types he uses Tree[Type], which makes sense. Then for untyped syntax trees (i.e. trees without any annotation) he uses Tree[Nothing]. Now he needs unsafe hacks to get the desired subtype relationship Tree[Type] <: Tree[Nothing]. The compiler understands that Tree is covariant, so he overrides this with an unsafe annotation to make Tree contravariant. Using Nothing rather than the top/unit type (not sure what that's called in Scala, Any?) is what doesn't make sense to me. Intuitively an un-annotated tree is a Tree annotated with values that carry no information, not a tree annotated with values of the uninhabited type. If you use that you also get the desired subtyping relationship Tree[Type] <: Tree[Top] without any need for unsafe annotations.


Yes, Tree[Top] would have been another way to model this. I wanted Tree[Nothing] to make clear that getting the type of an untyped Tree gives a Nothing, i.e. a run-time error. Getting a Top is not quite the same, but it is almost as good, since only a few things can be done with type Any in Scala.

EDIT: There's one thing I did not stress in the talk that makes Nothing a more attractive choice than Top: we want to do copy on write on these trees. I.e. if we transform an untyped tree to a typed one using "withType", we want to be able to reuse the untyped tree and just set the type (unless somebody else has already added another type). If the tree is of type Tree[Nothing], we can do that safely, even if there are still shared references to the same node as an untyped Tree. Getting the type of the shared untyped tree will give a Nothing, and nothing can be done with it. But if the type were Top, we would be able to apply some operations to it, so the copy on write would be unsafe.
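A toy Python rendering of that copy-on-write idea (hypothetical names; the real dotty Tree is of course Scala): an untyped node's type slot is empty and reading it is an error, so filling it in place is safe even while the node is shared as an untyped tree, and a copy is made only when a different type was already installed:

```python
class Tree:
    """Toy sketch of the copy-on-write 'withType' idea from the talk."""

    def __init__(self, kind, tpe=None):
        self.kind = kind
        self._tpe = tpe  # None plays the role of 'untyped' (Tree[Nothing])

    @property
    def tpe(self):
        if self._tpe is None:
            # The Nothing analogue: asking an untyped tree for its type fails,
            # so no one can observe the slot being filled in later.
            raise TypeError("untyped tree has no type")
        return self._tpe

    def with_type(self, tpe):
        if self._tpe is None:
            self._tpe = tpe       # still untyped: set in place, reuse the node
            return self
        if self._tpe == tpe:
            return self           # same type already installed: nothing to do
        return Tree(self.kind, tpe)  # typed differently: copy on write
```

Setting a type on an untyped node reuses it; setting a conflicting type on an already-typed node yields a fresh copy, leaving the original untouched.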


Looking at how often this compiler must allocate (and garbage collect), no wonder Scala compilers are slow.


About to watch the video. In the past I've thought about trying to model LLVM code inside Datomic and doing transforms with the help of Datalog queries.


People have done something like this for pointer analysis (statically determining to which memory locations a pointer can point, which is a prerequisite for many optimizations), e.g. http://yanniss.github.io/doop-datalog2.0.pdf


I never watch videos (what's wrong with you, post-literates? Forgot how to read and write?), so commenting on the title only.

Yes, compilers indeed have a lot in common with databases. Datalog (and therefore a relational algebra) is a very powerful tool for implementing compiler passes, for asking questions about the AST or various IRs.

EDIT: I'm talking about things like this: http://people.csail.mit.edu/mcarbin/papers/aplas05.pdf
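As a tiny illustration of that style (an illustrative toy, not the tooling from the paper): treat facts extracted from the AST/IR as relations and evaluate the Datalog rules `reaches(X,Y) :- calls(X,Y).` and `reaches(X,Z) :- reaches(X,Y), calls(Y,Z).` by naive fixpoint iteration:

```python
# Facts extracted from an AST/IR: calls(caller, callee).
calls = {("main", "parse"), ("parse", "lex"), ("main", "emit")}

def reaches(calls):
    """Naive Datalog-style fixpoint: transitive closure of `calls`."""
    derived = set(calls)
    while True:
        # Join derived reaches(X,Y) with calls(Y,Z) to infer reaches(X,Z).
        new = {(x, z)
               for (x, y) in derived
               for (y2, z) in calls
               if y == y2} - derived
        if not new:
            return derived  # no new facts derivable: fixpoint reached
        derived |= new

print(reaches(calls))
```

Queries like "which functions can `main` ever reach?" then fall out of the derived relation, which is exactly the flavor of question compiler analyses ask.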


It's really sad that Prolog/Datalog-based languages never took off. Our industry has no problem embedding SQL strings and mini relational engines in our imperative programs. If you're going to encode your problems in terms of relations, why not embed something more expressive and powerful?

Good paper BTW, though I wish you hadn't prefaced the post by calling people post-literates. This paper deserves more reads.


Currently Prolog-derived languages are making a sort of comeback - Kanren is widely used; just look at the number of ports and variations. My favourite recent example of cKanren used inside a compiler is Harlan [1] (see the region inference pass).

P.S. I'm really getting tired of the ever growing proportion of video vs. text in the otherwise valuable content. I'm genuinely interested in the topic but there is no way I can watch it. Dreaming of a service which would transcribe arbitrary videos for a tiny fee.

[1] https://github.com/eholk/harlan



