Well, that's quite the subtle question. Yes, in my plan for world domination, the representation of display lists is also immutable trees (or, more precisely, DAGs) hosted on the GPU, and the incremental nature of the pipeline is preserved, so small diffs to the UI result in small uploads of graph data to the GPU. But it also includes damage region tracking, which is not really a tree, and the very end of the pipeline is pixels, which is definitely not a tree.
Side question: How do you make DAGs immutable? Unlike trees, where changing a node only requires a path copy of all its ancestors, changing a node in a DAG has the potential to affect all other nodes. There are techniques such as using fat nodes[1], but I don't know of anyone using them in the real world.
You do need to trace all predecessors in this case, but in practice I don't think it's a serious problem. The kinds of graphs you'll get in UI are probably best thought of as trees with the possibility of shared resources (for example, all buttons of the same size might share the same background). On something like a theme change, I'd just rebuild the tree, although in theory you could be smarter and apply changes incrementally at a fine grain.
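To make the path-copying point concrete, here is a minimal sketch in Python (the `Node` type and `replace_child` helper are my own invention for illustration, not from any real toolkit): updating one leaf copies only the nodes on the path to it, while untouched subtrees, including shared resources, are reused by reference.

```python
from dataclasses import dataclass
from typing import Tuple

@dataclass(frozen=True)
class Node:
    label: str
    children: Tuple["Node", ...] = ()

def replace_child(root: Node, path: Tuple[int, ...], new_node: Node) -> Node:
    """Persistent update: copy nodes along `path`, share everything else."""
    if not path:
        return new_node
    i = path[0]
    new_children = (
        root.children[:i]
        + (replace_child(root.children[i], path[1:], new_node),)
        + root.children[i + 1:]
    )
    return Node(root.label, new_children)

# A shared background resource referenced by two buttons (a DAG, not a tree).
bg = Node("button-bg")
old = Node("root", (Node("button", (bg,)), Node("button", (bg,))))

# Change the first button: only that button and the root are copied.
new = replace_child(old, (0,), Node("button", (Node("button-bg-hover"),)))

assert new.children[1] is old.children[1]      # untouched subtree is shared
assert new.children[0] is not old.children[0]  # path to the change is copied
```

The diff between `old` and `new` is just the copied path, which is what keeps the incremental upload to the GPU small.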
There seems to be a misunderstanding. On all mobile phones and increasingly desktops, if your UI involves the GPU, you have failed; the goal is to have large layers of a scene handled entirely by the display compositor that produces the final pixels. Using the GPU as an enhanced blit engine is a colossal waste of memory bandwidth and power that no one can afford.
This will be the tl;dr of my upcoming blog post "the compositor is evil", referenced in the post.
I hear what you are saying, but I think the situation is a lot more complicated than that. Certainly, these days, if you're not using the GPU then you have failed; with high dpi and high frame rate displays, combined with the end of Moore's Law, CPU rendering is no longer viable. So the question is how you use the GPU. I agree with you that power is the primary metric of how we should measure this (it certainly correlates strongly with memory bandwidth, and is the metric users will actually care about; if there were a magical way to achieve high memory bandwidth at low power, I don't see any problem with that).
One way certainly is to use the compositor, and this is very seductive. The major platforms all expose an interface to the compositor, and it's generally pretty well optimized for both performance and power usage. Since animation is part of the API, it's possible to do quite a bit (including, for example, cursor blink in an editor) without even waking up the app process. Scrolling is another classic example where a compositor can do very well.
However, the compositor comes with downsides. For one, it forces the UI into patterns that fit into the compositor's data model. This is one reason the aesthetic of mobile UI design is so focused on sliding alpha-blended panes of mostly static content.
But even from a pure performance standpoint, heavy reliance on the compositor has risks. It's tempting to think of the composition stage as "free," but it is not: the compositor still has to blit the intermediate surfaces to the final composited desktop. On mobile devices, and increasingly on desktops (Windows 10 has some support on Kaby Lake and newer integrated graphics), hardware overlays replace that blit of the active window to an in-memory surface. When an overlay is available, it's both a lower-latency path and saves the GPU memory bandwidth of the blit. And the general heuristic is that the application window is a single surface, in other words that it does not rely on the compositor.
In order to justify doing the final scene assembly under GPU control in the app, it has to be at least as efficient as the compositor. I have some evidence that, for 2D scenes typical of UI, this is possible. Regions of flat color, for example (which occupy a nontrivial fraction of total area), can be rendered with no memory traffic at all, save the final output stage. And most of the other elements can be computed cheaply in a compute kernel on the GPU.
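As a rough illustration of why flat regions are cheap, here's a toy tile classifier in Python (not real GPU code; the scene model and names are mine): if the renderer decides per tile what work is needed, tiles fully covered by a topmost solid rect can be filled directly, with no texture reads and no intermediate surfaces.

```python
# Toy per-tile renderer: a scene is a list of axis-aligned rects, each either
# a solid color or an "image" requiring texture reads. Rects later in the
# list are drawn on top.

def classify_tile(tile, rects):
    """Return the kind of the topmost rect that fully covers `tile`."""
    x0, y0, x1, y1 = tile
    for rx0, ry0, rx1, ry1, kind in reversed(rects):  # topmost first
        if rx0 <= x0 and ry0 <= y0 and rx1 >= x1 and ry1 >= y1:
            return kind
    return "blend"  # no full cover: fall back to full per-pixel work

rects = [
    (0, 0, 64, 64, "solid"),    # flat background
    (16, 16, 32, 32, "image"),  # one textured element on top
]
tiles = [(x, y, x + 16, y + 16) for x in range(0, 64, 16)
                                for y in range(0, 64, 16)]
kinds = [classify_tile(t, rects) for t in tiles]
# Most tiles come out "solid": no memory traffic beyond the final write.
assert kinds.count("solid") > kinds.count("image")
```

A real renderer would do this binning on the GPU and handle partial coverage, clips, and blending, but the asymmetry is the point: the flat-color majority of a typical UI costs almost nothing.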
The tradeoffs are complicated, and depend a lot on the details of the application and the platform it's running on. In the 2010s, a strong case can be made that a compositor-centric approach is ideal. But in the 2020s, I think, it's increasingly not. The compositor is evil because it adds latency, and because its seductive promise of power and performance gains holds us back from the future where we can use the GPU more directly to achieve the application's rendering goals.
> Regions that are flat color, for example (which occupy a nontrivial fraction of total area) can be rendered with no memory traffic at all, save the final output stage.
GPU rendering and compositing amount to the exact same thing most of the time. A "region of flat color", at least in principle, is just a 1x1 pixel texture that's "mapped" onto some sort of GPU-implemented surface that in turn is rendered onto the screen.
Hardware overlays merely accelerate the final rendering step; one can implement the exact same process either in hardware, or as a software-based "blitting" step.
This is good feedback that I need to be clear and avoid terminological confusion when I write that blog post.
Of course the compositor is using the GPU. The difference is entirely in how the capabilities of the GPU are exposed (or not) to the application. My thesis is that doing 2D rendering in a compute kernel is ideal for 2D workloads, because it lets the application express its scene graph in the most natural way, then computes it efficiently (in particular, avoiding global GPU memory traffic for intermediate textures) using the GPU's compute resources.
Of course you could in theory have a compositor API that lets you do region tracking for flat color areas, and this would save some power, but no system I know of works that way.