I'm sure there are plenty of cases where model designers are at least attempting to pursue new capabilities in a targeted way (though how often to this degree of complexity?), while at the same time recognizing that new model and dataset combinations will also have unanticipated capabilities.
Before LLMs (whether transformer-based or not), most NNs were built to perform a single task with a single objective, so having multiple higher-level capabilities was essentially out of the question. Of course, LLMs nominally have a single objective too - predict the next word - but in effect they are targeting language as a whole.
In the GOFAI era of rule-based symbolic AI there were also some systems and approaches with multiple skills (e.g. expert systems like CYC, or cognitive architectures like SOAR), so maybe there are forgotten lessons there about the decomposability of skills.