It's a little premature, fine, but I want to start liquidating my rhetorical swaps here. I've been saying since last summer (sometimes on HN, sometimes elsewhere) that "prompt engineering" is BS: in a world where AI keeps getting better, expecting to develop lasting competency in an area of AI-adjacent performance (i.e. telling an AI what to do in exactly the right way to get the right result) is like expecting to build a long-lasting business around hand-cranking people's cars for them when they fail to start.
Like, come on. We're now seeing AIs take on tasks many people thought would never be doable by machine. And granted, many people (myself included to some extent) have adjusted their priors properly. And yet so many people act like AI is going to stall in its current lane and leave room for human work, as opposed to developing orders of magnitude better intelligence and obliterating all of its current flaws.
Been doing a lot with prompts lately. What people are calling "prompt engineering" I'd call "knowing what to even ask for and also asking for it in a clear manner". That was a valuable skill before computers and will continue to be one as AI progresses.
I've been pretty disappointed to introduce ChatGPT to people in jobs where it would be a game changer, only to find they just don't know what to do with it. They ask it for not-useful things, or useful things in a non-productive way: "here is some ad copy I wrote, write it better". Whether you're instructing a human, ChatGPT, or an AI god... those instructions are just too vague.
> I'd call "knowing what to even ask for and also asking for it in a clear manner".
It was a very important skill for searching. Nowadays, with Google's "I know what you want better than you" search, it's not that useful anymore (not useless: I get better search results by not using Google and knowing what I want, it's just less required).
IMHO it stems from lack of imagination. Impressive as the results may sometimes be, the user interfaces for AI are still extremely crude.
Soon we will see AI being used to define semantic operations on images that are hard to define exactly (imagine a knob to make an image more or less "cyberpunk", for example).
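One plausible way such a "cyberpunk knob" could work is embedding arithmetic: take the direction between an embedding of the target style and a neutral one, and shift the image's latent along it by a user-controlled amount. A toy sketch with made-up 3-dimensional vectors (the function name and all vectors here are illustrative, not any real model's API):

```python
import numpy as np

def style_knob(latent, neutral_emb, style_emb, strength):
    """Shift a latent along the (style - neutral) direction.

    strength = 0.0 leaves the latent unchanged; larger values
    push the decoded image further toward the style.
    """
    direction = style_emb - neutral_emb
    direction = direction / np.linalg.norm(direction)  # unit-length direction
    return latent + strength * direction

# Toy 3-D vectors standing in for real text/image embeddings.
latent = np.array([0.2, 0.5, -0.1])
neutral = np.array([1.0, 0.0, 0.0])
cyberpunk = np.array([0.0, 1.0, 0.0])

more_cyberpunk = style_knob(latent, neutral, cyberpunk, strength=0.8)
```

In a real system the knob would just be a slider wired to `strength`, with the decoder re-run on the shifted latent.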
I also expect AI-powered inpainting to become a ubiquitous piece of functionality in drawing and editing tools (there are already Photoshop plugins).
Furthermore, my hunch is that many of the use cases around image creation will gradually move towards direct manipulation. Somewhat like painting, but without a physical model. AI components will probably be applied to interpreting the user's touch input, similar to how they are currently deployed to understand text input.
LLMs and image AIs are the opposite of self-driving cars. "Everybody" has had concrete expectations for at least half a decade now that the moment when self-driving cars would surpass human ability was imminent, yet the tech hasn't lived up to it (yet). Meanwhile, practically nobody was expecting AI to do the jobs of artists, programmers or poets anywhere near human level anytime soon, yet here we are.
Great work, congratulations! One question: if I understood it right, you based your demo on GPT-2. What has your experience been working with those open-source language AIs, in terms of computational requirements and performance?
I'm really fascinated by all the tools the OSS community is building on top of Stable Diffusion (like OP's), which compares favourably with the latest closed-source models like DALL-E or Midjourney and can run reasonably well on a high-end home computer or a very reasonably sized cloud instance. For language models, the requirements seem substantially higher, and it's hard to match the latest GPT versions in terms of quality.
If LLMs (etc.) had the same requirements and business models as AV cars, they'd still be considered a failure. Nobody expects Stable Diffusion to have a six-sigma accuracy rate*, nor do we expect ChatGPT to seamlessly integrate into a human community. The AV business model discourages individual or small-scale participation, so we wouldn't even have SD (would anyone allow a single OSS developer to drive or develop an AV car? OK, there's Comma, but that's all there is on the OSS side).
* The number of times I've seen an 'impressive' selection of AI images that I consider a critical failure deserves its own word. The AIs are impressive for even getting that far; it's just that some people have bad taste and pick the bad outputs.
Yes, it's true that not all technology evolves as fast as predicted (by some, at some point), but first of all I still believe we will see self driving cars in the future and secondly, it's one anti-example in a forest of examples of tech that evolves beyond anyone's expectations. I don't find it very convincing.
Autonomous driving as it currently exists came unexpectedly to most people. Many now look at it with the power of hindsight, but back in the day the majority never thought we'd have cars (partly) driving on their own within a few years. The case of AI art seems exactly the same to me: now that many are working on it there's lots of progress, but it's still nowhere near what an experienced human can do. And that seems to be the general rule, not an exception.
We might need to create real intelligence for that to become true. A machine that can think and is aware of its purpose.
Large language models are stateless. The apps and fine-tuned models are doing prompt engineering on users' behalf. It's very much a thing for developers, with the goal of making it invisible for end users.
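A concrete way to see "stateless": a chat front-end has to resend the entire conversation on every call, because the model itself remembers nothing between requests. A minimal sketch of that client-side bookkeeping (the `generate` function is a hypothetical stand-in for a real model API, and the transcript format is made up):

```python
def generate(prompt: str) -> str:
    """Stand-in for a stateless model call: its output depends
    only on the prompt it receives in this one request."""
    return f"[reply to {prompt.count('User:')} user message(s)]"

class ChatSession:
    """The 'memory' lives entirely in the client, not the model."""

    def __init__(self, system: str):
        self.history = [f"System: {system}"]

    def send(self, user_msg: str) -> str:
        self.history.append(f"User: {user_msg}")
        # Every turn, the FULL transcript is packed into one prompt.
        reply = generate("\n".join(self.history) + "\nAssistant:")
        self.history.append(f"Assistant: {reply}")
        return reply

chat = ChatSession("You are a helpful assistant.")
chat.send("Hello")
chat.send("What did I just say?")  # only answerable because the client resent turn 1
```

The "prompt engineering on the user's behalf" is exactly what `send` does: deciding how to pack the system message and history into the single string the model actually sees.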
Has it? I mean, maybe the idea of people doing this as a long-time career has, but practically, I still find it a challenge to get those AIs to do exactly what I want. I've played around with Dreambooth-style extensions now, and that goes some way for some applications, and I'm excited to try OP's solution, but in my experience, it is still a bit of a limitation for working with those AIs right now.
Oh yeah it's definitely still an issue right now! But I think the power of ChatGPT's ability to understand and execute instructions has convinced most people that "prompt engineering" isn't going to be a career path in the future.
Absolutely. I briefly thought about asking ChatGPT to write a prompt, but then I remembered that the training corpus is probably older than those tools (I've heard that if you ask it the right way, it will tell you that its corpus ended in 2021; whether that's true or not, it sounds plausible). But that's a truly temporary issue: the respective subreddits probably have enough information to train an AI for prompt engineering already (if you start from a strong foundation like the latest GPT versions).
In the near future you can totally imagine a dialogue like the one you'd have with a real designer: "can you make it pop a bit more?" or "can you move that logo to the right side?". It might take some trial and error, but it's only going to improve.
Making the AI truly creative (which means going beyond what the client asks for, towards things the client doesn't even know they want) would be a much larger leap and potentially take a lot longer.
I don't get it. Pre-ChatGPT prompt engineering was a BS exercise in guessing how a given model's front-end tokenizes and processes the prompt. ChatGPT only made it more BS. But I saw a paper the other day implementing a more structured, formal prompt language, with working logic operators implemented one layer below: instead of adding more cleverly structured English, they were stepping the language model with variations of the prompt (as determined by the operators) and doing math on the probability distributions of next tokens the model returned. That, to me, sounds like a valid, non-BS approach, and strictly better than doubling down on natural language.
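The idea, as I understood it (this sketch is my own reconstruction, not the paper's code), is that the operators act on distributions rather than on English. E.g. an AND over two prompt variants can be a renormalized elementwise product of their next-token distributions, and an OR a mixture; the toy distributions below are made up:

```python
def normalize(p):
    """Rescale a token->probability dict so it sums to 1."""
    total = sum(p.values())
    return {tok: v / total for tok, v in p.items()}

def op_and(p, q):
    """AND: a token must be likely under BOTH prompt variants
    (renormalized elementwise product of the distributions)."""
    return normalize({tok: p.get(tok, 0.0) * q.get(tok, 0.0) for tok in p})

def op_or(p, q, w=0.5):
    """OR: a weighted mixture of the two distributions."""
    return normalize({tok: w * p.get(tok, 0.0) + (1 - w) * q.get(tok, 0.0)
                      for tok in set(p) | set(q)})

# Toy next-token distributions from two phrasings of the same prompt.
p = {"cat": 0.6, "dog": 0.3, "car": 0.1}
q = {"cat": 0.5, "dog": 0.1, "car": 0.4}

both = op_and(p, q)   # "cat" dominates: it is likely under both variants
either = op_or(p, q)
```

The point is that the combination happens in probability space, where the math is well defined, instead of hoping the model parses an English "and" correctly.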
Think about the problem in an end-to-end fashion: the user has an idea of what sort of image they want, they just need an interface to tell the machine. A combination of natural language plus optional image/video input is probably the most intuitive interface we can provide (at least until we've made far more progress on reading brain signals more directly).
How exactly we get there, whether by adding layers on top (like language models) or layers below (like what you described), doesn't seem like such a fundamental difference. It's engineering: you try different approaches, vary your parameters and see what works best. And from the outset, natural language does seem like a good candidate for encoding nuances like "make it pink, but not cheesy" or "has the vibes of a 50s Soviet propaganda poster, but with friendlier colors".