Hacker News

Current LLMs fail if what you're coding is not the most common of tasks. And a simple web app is about as basic as it gets.

I've tried using LLMs for some libraries I'm working on, and they failed miserably. Trying to make an LLM implement a trait with a generic type in Rust is a game of luck with very poor chances.
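To make the kind of task concrete: a minimal sketch of "implement a trait with a generic type" in Rust, with entirely hypothetical names (Codec, RleCodec, not from any real library). The part that trips generated code up is that the bounds on T have to match every operation used in the body:

```rust
// Hypothetical trait generic over the element type T.
trait Codec<T> {
    fn encode(&self, input: &[T]) -> Vec<u8>;
}

// A run-length encoder as the implementing type.
struct RleCodec;

// The bounds must line up exactly with the body:
// PartialEq for the comparison, Copy for iter().copied(),
// Into<u8> for the final conversion. Miss one and it won't compile.
impl<T: PartialEq + Copy + Into<u8>> Codec<T> for RleCodec {
    fn encode(&self, input: &[T]) -> Vec<u8> {
        let mut out = Vec::new();
        let mut iter = input.iter().copied();
        if let Some(mut cur) = iter.next() {
            let mut count: u8 = 1;
            for item in iter {
                if item == cur && count < u8::MAX {
                    count += 1;
                } else {
                    // Emit (count, value) pair and start a new run.
                    out.push(count);
                    out.push(cur.into());
                    cur = item;
                    count = 1;
                }
            }
            out.push(count);
            out.push(cur.into());
        }
        out
    }
}

fn main() {
    let encoded = RleCodec.encode(&[1u8, 1, 1, 2]);
    assert_eq!(encoded, vec![3, 1, 1, 2]);
    println!("{:?}", encoded);
}
```

Nothing here is individually hard, but the trait bounds, the generic parameter, and the body all constrain each other at once, which is exactly the kind of non-local consistency the comment is describing.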

I'm sure LLMs can massively speed up tasks like front-end JavaScript development, simple Python scripts, or writing SQL queries (which have been written a million times before).

But for anything even mildly complex, LLMs are still not suited.



I don't think complexity is the right metric.

Front-end JS can easily become very complex, too.

I think a better metric is how close you are to reinventing a wheel for the thousandth time. Because that is what LLMs are good at: helping you write code that has already been written, in nearly the same way, thousands of times.

But that is something you find in backend code, too.

But that is also something where we as an industry have kind of failed to produce good tooling. Worse, if you are in the industry, it's hard to spot without very carefully taking a hundred (mental) steps back from what you are used to and from whatever biases you might have.


LLM code assistants have succeeded at facilitating reusable code: the grail of OOP and many other paradigms.

We should not have an entire industry of 10,000,000 devs reinventing the JS/React/Spring/FastCGI wheel. I'm sure those humans can contribute in much better ways to society and progress.


> LLM Code Assistants have succeeded at facilitating reusable code.

I'd have said the opposite. I think LLMs facilitate disposable code. It might use the same paradigms and patterns, but my bet is that most LLM written code is written specifically for the app under development. Are there LLM written libraries that are eating the world?


I believe you're both saying the same thing. LLMs write "re-usable code" at the meta level.

The code itself is not clean and reusable across implementations, but you don't even need that clean packaged library. You just have an LLM regenerate the same code for every project you need it in.

The LLM itself, combined with your prompts, is effectively the reusable code.

Now, this generates a lot of slop, so we also need better AI tools to help humans interpret the code, and better tools to autotest the code to make sure it's working.

I've definitely replaced instances where I'd reach for a utility library, instead just generating the code with AI.

I think we also have an opportunity to merge the old and the new. We can have AI that can find and integrate existing packages, or it could generate code, and after it's tested enough, help extract and package it up as a battle tested library.


Agreed. But this terrifies me. The goal of reusable code (to my mind) is that with everybody building from the same foundations we can enable more functional and secure software. Library users contributing back (even just bug reports) is the whole point! With LLMs creating everything from scratch, I think we're setting ourselves on a path towards less secure and less maintainable software.


I (a programmer with 20+ years of experience) find it leads to much higher quality output, as I can now afford to do all the mundane, time-consuming housekeeping (refactors, more tests, making things testable).

E.g. let's say I'm working on a production thing and features/bugfixes accumulate and some file in the codebase starts to resemble spaghetti. The LLM can help me unfuck that way faster and get to a state of very clean code, across many files at once.


What LLM do you use? I've not gotten a lot of use out of Copilot, except for filling in generic algorithms or setting up boilerplate. Sometimes I use it for documentation but it often overlooks important details, or provides a description so generic as to be pointless. I've heard about Cursor but haven't tried it yet.


Cursor is much better than Copilot. Also, change it to use Claude, and then use the Inspector with ctrl-I


This is the thing: it works both ways. It's really good at interpreting existing codebases too.

Could potentially mean just a change in time allocation/priority. As it's easier and faster to locate and potentially resolve issues later, it is less important for code to be consistent and perfectly documented.

Not foolproof, and who knows how that could evolve, but just an alternative view. One of the big names in the industry said we'll have AGI when it speaks its own language. :P.


I had similar experiences:

1. Asked ChatGPT to write a simple echo server in C, but with a twist: use io_uring rather than the classic sendmsg/recvmsg. The code it spat out wouldn't compile, let alone work. It was wrong on many points, clearly pieces of who-knows-what cut and pasted together. After banging my head on the docs for a while, I could determine which sources the io_uring code segments were coming from. The code barely made any sense and was incorrect both syntactically and semantically.

2. Asked another LLM to write an AWS IAM policy according to some specifications. It hallucinated, using predicates that do not exist at all. I mean, I could have done it myself if I'd been allowed to just make up predicates.

> But for anything even mildly complex, LLMs are still not suited.

Agreed, and I'm not sure we are anywhere close to them being suited.


Yep. LLMs don’t really reason about code, which turns out to not be a problem for a lot of programming nowadays. I think devs don’t even realize that the substrate they build on requires this sort of reasoning.

This is probably why there’s such a divide when you try to talk about software dev online. One camp believes that it boils down to duct taping as many ready made components together all in pursuit of impact and business value. Another wants to really understand all the moving parts to ensure it doesn’t fall apart.


My test is to take a sized chunk of memory containing a TrueType/OpenType font and output a map of glyphs to curves. Bot is nowhere close.


Roughly, LLMs are great at things that involve a series of (near) 1-1 correspondences, like “translate 同时采访了一些参与其中的活跃用户 to English” (roughly, “also interviewed some of the active users involved”) or “How do I move something up 5px in CSS without changing the rest of the layout?” But if the relationship of several parts is complex (those Rust traits, or anything involving a fight with the borrow checker), or things have to go in some particular order it hasn’t seen (say, US states in order of percent water area), they struggle.
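To sketch the borrow-checker case with a deliberately trivial, hypothetical example (the names are made up): the "obvious" line-by-line translation of "append a doubled copy of every large element" pushes to the vector while iterating over it, which doesn't compile, because iteration holds a shared borrow while push needs a mutable one. Getting this right requires tracking how two parts of the function relate, not just local token patterns:

```rust
// Hypothetical helper: append v * 2 for every element above `threshold`.
fn double_large(values: &mut Vec<i32>, threshold: i32) {
    // The naive version is rejected by the borrow checker (E0502):
    // for &v in values.iter() { if v > threshold { values.push(v * 2); } }

    // The working idiom: finish the read (collect) before the write (extend).
    let doubled: Vec<i32> = values
        .iter()
        .filter(|&&v| v > threshold)
        .map(|&v| v * 2)
        .collect();
    values.extend(doubled);
}

fn main() {
    let mut v = vec![1, 5, 10];
    double_large(&mut v, 4);
    assert_eq!(v, vec![1, 5, 10, 10, 20]);
    println!("{:?}", v);
}
```

The fix isn't hard for a human who understands the aliasing rule, but it is exactly the kind of non-linear relationship between two spans of code that pattern-completion tends to fumble.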

SQL is a good target language because the translation from ideas (or written description) is more or less linear, the SQL engine uses entirely different techniques to turn that query into a set of relational operators which can be rewritten for efficiency and compiled or interpreted. The LLM and the SQL engine make a good team.


I’d bet that about 90% of software engineers today are just rewriting variations of what’s already been done. Most problems can be reduced to similar patterns. Of course, the quality of a model depends on its training data—if a library is new or the language isn’t widely used, the output may struggle. However, this is a challenge people are actively working on, and I believe it’s solvable.

LLMs are definitely suited for tasks of varying complexity, but like any tool, their effectiveness depends on knowing when and how to use them.


> Current LLMs fail if what you're coding is not the most common of tasks

Succeeding on the most common tasks (which isn't exactly what you said) is identical to "they're useful".


And I would go further… these “common tasks” cover 80% of the work in even the most demanding engineering or research positions.


That’s absolutely not my experience. I struggle to find tasks in my day to day work where LLMs are saving me time. One reason is that the systems and domains I work with are hardly represented at all on the internet.


I have the same experience. I'm in gamesdev and we've been encouraged to test out LLM tooling. Most of us at/above the senior level report the same experience: it sucks, it doesn't grasp the broader context of the systems that these problems exist inside of, even when you prompt it as best as you can, and it makes a lot of wild assed, incorrect assumptions about what it doesn't know and which are often hard to detect.

But it's also utterly failed to handle mundane tasks, like porting legacy code from one language and ecosystem to another, which is frankly surprising to me because I'd have assumed it would be perfectly suited for that task.


In my experience, AI for coding is like having a rather stupid, very junior dev at your beck and call, but one who can produce results instantly. It's just often very mediocre, and getting it fixed often takes longer than writing it on your own.


My experience is that it varies a lot by model, dev, and field — I've seen juniors (and indeed people with a decade of experience) keeping thousands of lines of unused code around for reference, or not understanding how optionals work, or leaving the FAQ full of placeholder values in English when the app is only on the German market, and so on. Good LLMs don't make those mistakes.

But the worst LLMs? One of my personal tests is "write Tetris as a web app", and the worst local LLM I've tried, started bad and then half way through switched to "write a toy ML project in python".


I think this illustrates the biggest failure mode when people start using LLMs: asking it to do too much in one step.

It’s a very useful tool, not magic.


> Not once did he need to ask me a question. When I asked him "how long did this take" and expected him to say "a few weeks" (it would have taken me - a far more experienced engineer - 2 months minimum).

> Current LLMs fail if what you're coding is not the most common of tasks. And a simple web app is about as basic as it gets.

These two complexity estimates don’t seem to line up.


That's still valuable though: For problem validation. It lowers the table stakes for building any sort of useful software, which all start simple.

Personally, I just use the hell out of Django for that. And since tools like that are already ridiculously productive, I don't see much upside from coding assistants. But by and large, so many of our tools are so surprisingly _bad_ at this, that I expect the LLM hype to have a lasting impact here. Even _if_ the solutions aren't actually LLMs, but just better tools, since we reconfigured how long something _should_ take.


The problem Django solves is popular, which is why we have so many great frameworks that shorten the implementation time (I use Laravel for that). Just like game engines or GUI libraries, assuming you understand the core concepts of the domain. And if the tool was very popular and the LLMs have loads of data to train on, there may be a small productivity tick by finding common patterns (small because if the patterns are common enough, you ought to find a library/plugin for it).

Bad tools often fall into three categories: too simple, too complex, or unsuitable. For the last two, you'd better switch, but there's the human element of sunk costs.


I work in video games, I've tried several AI assistants for C++ coding and they are all borderline useless for anything beyond writing some simple for loops. Not enough training data to be useful I bet, but I guess that's where the disparity is - web apps, python....that has tonnes of publicly available code that it can train on. Writing code that manages GPU calls on a PS5? Yeah, good luck with that.


Presumably Sony is sitting on decades worth of code for each of the PlayStation architectures. How long before they're training their own models and making those available to their studios' developers?


I don't think Sony has this code; more likely just the finished builds. And all the major studios have game engines for their core product (or they license one). The most difficult part is writing new game mechanics or supporting a new platform.


So you are basically saying "it failed on some of my Rust tasks, and those other languages aren't even real programming languages, so it's useless".

I've used LLMs to generate quite a lot of Rust code. It can definitely run into issues sometimes. But it's not really about complexity determining whether it will succeed or not. It's the stability of features or lack thereof and the number of examples in the training dataset.


I realize my comment seems dismissive in a manner I didn't intend. I'm sorry for that, I didn't mean to belittle these programming tasks.

What I meant by complexity is not "a task that's difficult for a human to solve" but rather "a task for which the output can't be 90% copied from the training data".

Since frontend development, small scripts and SQL queries tend to be very repetitive, LLMs are useful in these environments.

As other comments in this thread suggested: If you're reinventing the wheel (but this time the wheel is yellow instead of blue), the LLM can help you get there much faster.

But if you're working with something which hasn't been done many times before, LLMs start struggling. A lot.

This doesn't mean LLMs aren't useful. (And I never suggested that.) The most common tasks are, by definition, the most common tasks. Therefore LLMs can help in many areas and are helpful to a lot of people.

But LLMs are very specialized in that regard, and once you work on a task that doesn't fit this specialization, their usefulness drops, down to being useless.


Which model exactly? You understand that every few months we are getting dramatically better models? Did you try the one that came out within the last week or so (o1-preview).


I did use o1-preview.



