The Coming of Local LLMs

yacine_ · on April 11, 2023

I was able to run a LLaMa on my personal machine to run some labeling on my documents, as a test of its capabilities. It was instruct tune. 30b parameters

4 example labels, and I had a binary classifier in seconds. Sure, semantic text classifiers were possible for a while, but making it accessible changes everything. Giving anyone who can use a spreadsheet the power of a local LLM (or, basically free LLMs) can make them much, much more productive. A lot of office work is clicking through sheets and doing manual labeling.

It's truly wild what is becoming accessible! Really excited to see the next gen software that the open community comes up with :)

rcme · on April 11, 2023

LLMs as general purpose classifiers is a really big deal, especially because you can give them fuzzy instructions. I know people are worried about LLMs and spam, but I think LLMs may provide an opportunity to elevate online discourse by being more efficient at filtering out spam and low quality commentary.

gamegoblin · on April 11, 2023

I already have a custom browser plugin that calls out to GPT (gpt-3.5-turbo is cheap and good enough for this) to classify and filter out low-effort, overly negative, or intellectually dishonest HN comments. It significantly improves the experience on this site.

Bonus points: I had never written a browser plugin, but GPT4 helped me do it in under half an hour.

darkgreene · on April 11, 2023

Do you have a repo you can share? I'd be very interested in running my own local copy

tough · on April 11, 2023

jpe90 · on April 11, 2023

This sounds super cool, I’m very curious how you implemented it.

I’m only vaguely familiar with the API.. if I had to guess I would say you send

- a system instruction that its job is to filter unwanted content

- examples of unwanted content

- an instruction like “filter the following html:”

For every web request you want to filter, you would re-send all of those messages followed by the page HTML as the final message. Is that close?

gamegoblin · on April 11, 2023

Close, but it's a bit more specialized to just work on HN:

- Examples of unwanted content

- Then I give a large numbered list of comments and ask which numbers should be filtered

- The plugin then just deletes those comment nodes from the DOM. If HN ever updates their HTML I will have to tweak this code.

The reason to send a large list of comments is just to save on costs. It's cheaper to do it this way than one comment at a time.

So the main difference from what you've proposed is GPT never sees the HTML. My code enumerates the comments in the HTML and splices them in to the prompt in a nice numbered list, then does the reverse translation from list number to DOM element in the other direction.

jpe90 · on April 11, 2023

Oh that’s clever! Thanks for answering!

ChatGTP · on April 11, 2023

Some negative, even overly negative comments contain good ideas though so maybe it’s not a great idea?

heyoni · on April 12, 2023

Your comment has been detected to contain oppositional tones and has thus been filtered out.

You are hereby removed from the discourse.

/s

ChatGTP · on April 12, 2023

Love it!

duckmysick · on April 12, 2023

That's ok. If it's a good idea, somebody else will talk about in a positive way.

Also, unless you are reading every comment in every thread, you are going to miss a few interesting ideas anyway. That's ok too.

ChatGTP · on April 13, 2023

Stay calm and keep summarising!

metalliqaz · on April 11, 2023

How do you define the cost function for 'low effort'?

gamegoblin · on April 11, 2023

I give GPT some examples of what I consider low-effort comments and non-low-effort comments, and then I just ask it to do classification. It's quite good at few-shot classification of fuzzy stuff like this.

artificial · on April 11, 2023

This sounds great, would love to see some more about this endeavor.

xk_id · on April 11, 2023

I don’t know how anyone can trust to run code on their machine that they don’t understand and hasn’t been reviewed by a third party.

sliken · on April 11, 2023

How many line of codes are in the kernel, drivers, and apps on your laptop/desktop?

How sure are you that all that code has been reviewed by a 3rd party? How many CVEs a year impact your laptop/desktop?

Do you have any reason to think that increased productivity with LLM assistance will result in lower quality code? Personally I find LLM assistance increases productivity, decreases the penalty of using a more difficult language like rust, and makes it more palatable to spend more (LLM assisted) time writing tests.

xk_id · on April 12, 2023

I meant reviewed by a (competent) human; not, audited by a security team/whatever. Generative computer code is in a whole new category of risk, because it can hallucinate a `rm -rf *` that you might overlook if you don't understand the code yourself.

sliken · on April 12, 2023

Well at least for now, any non-trivial code is going to be written by a human+AI, likely one function at a time. It's pretty rare for the AI to give me exactly what I want, often I have to tweak my prompts multiple times, and even just give up and write a function based on the useful bits from previous queries.

I just don't see it chatbot assisted programming any worse than what we have today.

gamegoblin · on April 11, 2023

I understand the code now, I just didn't know how to write a browser plugin and having GPT4 walk me through it was 10x faster than reading a tutorial.

rcme · on April 11, 2023

Isn't that pretty much all code?

nathan_compton · on April 11, 2023

Like almost all machine learning stuff, I expect these things to only be useful in places where it doesn't really matter if the results are correct. When you apply a classifier in real life its critically important to understand its statistical characteristics which is typically done via model characterization, which involves cross validation or boot strapping or whatever. I think the idea that you can just zero-shot or few-shot deploy these things as classifiers and forget about it is incredibly naive unless (as I said earlier) the results just don't really matter.

I've already used LLMs in my work as a data scientist but it requires a ton of work to just make the results tractable (and I have been using GPT4, which behaves pretty well). These smaller language models ain't so regular. Ok, like consider a basic thing you want to do with a classifier: understand its behavior on a held out data set. Since no one knows what is really in the training data (since its so large), its quite hard to understand what the model can generalize about and what it has just accidentally memorized. CF reports that GPT4 doesn't perform nearly as well on even simple programming exercises that are chosen in such a way as to be sure they weren't in the training data.

There is enormous potential for statistical fuck ups here. Prompt engineering, for instance, is an easy place for over-fitting to happen as a prompt is fine tuned on data the prompt engineer has and thus fails to generalize to new data.

I do think there is a lot of value here, but I'm also sure that sloppy use of large language models is going to cause a bunch of trouble in the short to medium term, generate a lot of garbage, pollute a lot of databases, etc, while we figure all this stuff out.

rcme · on April 11, 2023

Nothing about a zero or few-shot classifier precludes you from validating on a data set.

nathan_compton · on April 11, 2023

You can certainly hold out a validation set while you are writing your prompt, but you can't know whether the model is over fitting for your data set since you don't know what data was in the training set.

rcme · on April 11, 2023

You'll probably find out pretty quick in production :)

But I posit that most text classification tasks don't have such strict accuracy requirements. For one, no text classifier is 100% accurate. For instance, I have genuine mail in my spam folder frequently. I see spam on social networks, etc. I struggle to think of cases that aren't at least somewhat tolerant to some amount of incorrect classification.

jprete · on April 12, 2023

I think most users are not going to do this; they have no reason to even know such a thing might be done, let alone that it’s useful.

muyuu · on April 11, 2023

spam and especially phishing will become much, much better

it will be hard to trust anything at all

SV_BubbleTime · on April 11, 2023

Which will force drive signed communications. So... win.

muyuu · on April 11, 2023

Yea in a sense it will be a win, but it will up the stakes. A lot of people with get scammed. A lot more than right now.

SV_BubbleTime · on April 11, 2023

Disagree. Adding signed communications as the default to replace email's completely nonsense zero encryption model is not upping the stakes. Email is the most successful federated software in the universe, but it's fucking time to get some encryption and authn.

muyuu · on April 11, 2023

you are having a good laugh if you think the average granny is going to do that anytime soon

SV_BubbleTime · on April 12, 2023

Yea, lol, what was I thinking!? It’s not like my granny uses TLS ECDH SHA1 right!? Oh, wait, everyday she does? The tools adapt; email has never had to.

muyuu · on April 12, 2023

the tools will only adapt when a large number of people have got burnt

as always happens

MichaelZuo · on April 12, 2023

Why would they? If actually implemented it will be a few checkboxes behind a prompt next time they login.

yacine_ · on April 11, 2023

very simple, just don't read the email

muyuu · on April 11, 2023

I'm half way there, I already have no mates and I ignore most of the email from work.

akiselev · on April 11, 2023

That’s great. My coworkers are already a step ahead on security!

tyingq · on April 11, 2023

Agree, though plain old Bayesian classifiers have been able to handle some significant portion of that office work for a long time. And not much ever came from it for everyday stuff outside of spam filters.

Maybe both the buzz factor and broader applicability means it's more likely to happen this go around?

yacine_ · on April 11, 2023

More like: the accessibility is what will make it go around. The accessibility is what changes everything! Getting an easy to use interface (instructions) over python changes the accessibility from the denizens of this website to ~anyone with a computer

If you're interested, see this paper that argues that point: https://arxiv.org/abs/2302.06541

Essentially, being label efficient is more important than being compute efficient, because the biggest computing constraint we have is enough humans doing the labeling (and knowing how to work a jupyter notebook), not tensor smashing nvidia cards

lisasays · on April 12, 2023

Re: https://arxiv.org/abs/2302.06541 - how are we supposed to take the phrase "agile text classification" seriously?

They could have just said "efficient". But no - they had to go for "agile".

xbmcuser · on April 11, 2023

Ease of use is what is going to change everything. Using natural language to ask something and getting an answer is different from what we had before. I knew I could automate a lot of my paperwork with scripts but as I am not a programer I never gave it more than a cursory look. Last Dec while playing around with ChatGPT I was able to get it to write some python scripts that resulted in my spending less than 20-25 min on tasks that I was spending 20-25 hours on. Now could I have written the scripts myself probably but it would have taken me months whereas with chat gpt it took me 2-3 hours to get a working scripts and another 1-2 hours to optimise them.

travisjungroth · on April 11, 2023

Ease of use is huge, so is deployment. Even as a software engineer the overhead for a random classification is so much lower.

londons_explore · on April 11, 2023

I think the real benefits will be for those with an ad-hoc task and no programming/scripting ability.

Sure, you and I know how to write a little script to sort a directory of documents into "schoolwork" and "other stuff".

But most people don't have that ability, so giving them that would really help accessibility.

bob1029 · on April 11, 2023

> Sure, semantic text classifiers were possible for a while, but making it accessible changes everything.

Binary classification can actually take you all the way in terms of classification if you are clever with set theory. It's also one of the most traceable & deterministic ways to understand how the natural language is being interpreted at each step.

The amount of performance required to run something like an SVM is laughable compared to what is required to run even baby-tier LLMs. If you can reduce the cost of running models to a <1ms invocation over a few megabytes of black box, you can easily test thousands of these per-user-query. Re-training and iterating is much more enjoyable for these reasons. You also don't need any GPUs for this.

At the end of the day, the quality of your data will be the biggest issue with older techniques. LLMs can bandaid all sorts of weird things that crop up in the real world and aren't present in the training data. SVMs cannot tolerate requests delivered in the format of Shakespeare (if unexpected). In a well-controlled domain, you would probably be able to get away with much cheaper options that are also more flexible.

czbond · on April 11, 2023

Can you explain what you did set up wise for your test? I'm following this "space" but the exact, simple pipelines are eluding me.

yacine_ · on April 11, 2023

fsreadfilesync and json

ticviking · on April 11, 2023

The big thing for me and many others is the ability to use the tool without sending NDA data to a 3rd party.

The potential amplifying power of that is enormous.

alden5 · on April 12, 2023

What makes it so much better than normal text classification for me is it doesn't require tons of training data to accurately classify text. using it to parse craigslist posts which i might find interesting showed very promising results although it's fairly slow on my base m1 machine.

syntaxing · on April 11, 2023

Super curious how you did this! Doesn’t 30B model require a hefty computer to run locally (assuming you’re tuning a non-quantized version)

drdaeman · on April 11, 2023

Not at all. Even a Raspberry Pi would do - you only need ~6GiB RAM for a 4-bit quantized LLaMA model (though it's gonna be quite slow). A decent modern desktop machine would do just fine, no need for anything extra fancy.

What I'm wondering is how they fed the documents, as all those LLMs have limitations on the input sizes.

jstarfish · on April 11, 2023

I've seen reports that it wrecks RPi SD cards in short order though, so beware...

> What I'm wondering is how they fed the documents, as all those LLMs have limitations on the input sizes.

It's like file hashing at scale, you don't have to read the whole stream for every file, just the first 1024/2048 bytes (or first few paragraphs).

(This works for classification and sorting, less so for summarization.)

yyyk · on April 11, 2023

>~6GiB RAM

That's for the 7B model. The 30B model needs 24GB quantized (or 64GB for the unquantized model).

yacine_ · on April 11, 2023

I'm running 30b quant on my 3090. that many 4bits fortunately fit into my precious vram

hospitalJail · on April 11, 2023

What model are you using? What program are you using?

Curious how you run the model then interface with it.

cs702 · on April 11, 2023

I expect we will see the biggest jump in performance if (when) consumer-grade coprocessors like mobile GPUs start incorporating attention layers as a primitive building block at the hardware level, e.g., with instructions and memory layouts engineered specifically to make ultra-low-precision (say, 4-bit) transformer layers as compute- and memory-efficient as possible on consumer devices. That seems almost inevitable to me.

jjoonathan · on April 11, 2023

Low precision: agreed

Attention: Isn't it quadratic in context length? I dunno, this feels like the crude first iteration of something that will get inevitably passed by something that scales better.

dpflan · on April 11, 2023

Sharing a comment from a similar line of discussion:

"""

Complexity is quadratic in sequence length. For 512 tokens it is 262K, but for 4000 tokens it becomes 16M and goes OOM on a single GPU. We need about 100K-1M tokens to load whole books at once.

Since 2017 there have been hundreds of attempts to bring O(N^2) to O(N), but none of them replaced the vanilla attention yet in large models. They lose on accuracy. Maybe Flash attention has a shot (https://arxiv.org/abs/2205.14135).

"""

- source: https://news.ycombinator.com/item?id=34171503

whimsicalism · on April 11, 2023

There is literally no way that GPT-4 is using classic O(N^2) attention exclusively. They haven't released the results, but I promise you this is not what they are using exclusively.

anthlax · on April 12, 2023

Interesting! How do you know this?

cs702 · on April 11, 2023

Memory use is already linear in practice, thanks to FlashAttention. It's an open question whether computation can be made sub-quadratic without impacting model performance, although there are ongoing efforts seeking to do exactly that.[a]

Keep in mind: Once you go into precision as low as 4 bits (or lower?), all sorts of optimizations can become practical. Off the top of my head, maybe you could cache and reuse common attention sub-matrices (e.g., a 16×16 sub-matrix with 4-bit elements occupies only 16×16×4÷8=128 bytes of space)?

My sense is there's so much money at stake here, that whoever does this first will win big even if they end up having to replace or augment it with something better down the road. Hypothetical example: Imagine Intel or AMD coming out with a $1K or $2K card that has "built-in 4-bit attention," enabling you to run transformers of much greater scale on a run-of-the-mill desktop PC. I'd buy that in a heartbeat.

[a] Here's a recent post about a new approach from a group at Stanford that looks promising to me, although I don't fully understand all its details yet: https://news.ycombinator.com/item?id=35502187

sebzim4500 · on April 11, 2023

Probably once context lengths get really long it might be better to use ANN rather than exact attention at inference time. I would imagine this would only pay off with ~100,000 token contexts, and even then it would only work if only a few tokens meaningfully contribute to each attention head.

whimsicalism · on April 11, 2023

I agree - unfortunately we have gone so far down the rabbit hole optimizing transformer models.

Alternative models like S4 have been able to get transformer level performance with O(N) sequence length scaling.

throwawaymaths · on April 11, 2023

In theory one could use the nysromformer. Don't know if anyone does in practice

whimsicalism · on April 11, 2023

They don't

throwawaymaths · on April 11, 2023

Do you know why/why not?

whimsicalism · on April 11, 2023

They've just gone with a different approach for fast attention. [0] Not sure it was due to any particular merits/lack of the nystroformer.

[0]: https://twitter.com/typedfemale/status/1609867110695735296

seydor · on April 11, 2023

maybe one could use physics to do that in analog. Or even better, in biology. I think with a clump of ~1.5kg of neurons we can have a pretty efficient coprocessor that is fed with pizza.

bick_nyers · on April 11, 2023

On the other hand, securing the biological coprocessor in your homelab and running queries on it 24/7 appears to have legal implications.

cs702 · on April 11, 2023

Moral implications too. The first Matrix movie is about the moral implications.

mcculley · on April 11, 2023

In polite company it is considered rude to suggest that there was more than one Matrix movie.

liuliu · on April 11, 2023

Chip's capability planning seems need about ~2yr lead time. So we are expecting fastest would be somewhere around end of 2024. (Transformers probably earlier than that (2023?), 4-bit would be later).

cs702 · on April 11, 2023

Unless there are groups that have been thinking about and working on this for a while...

zamalek · on April 11, 2023

There are those Google accelerators that plug into an M.2 slot. You could plausibly do this today, although I am not sure what sort of memory constraints those accelerators have.

BrutalCoding · on April 11, 2023

These ones can be plugged in with USB type-c, see: https://coral.ai/products/accelerator/

It’s used for boosting interference (offline) on Linux, Mac and Windows.

Haven’t bought or used them but I’ve had my eyes on these for a little while!

beauzero · on April 11, 2023

https://www.electronicsweekly.com/blogs/gadget-master/boards...

1827162 · on April 11, 2023

I found this to be very liberating, that I can finally type whatever I want into the LLM, without the possibility of the government knowing what I am writing. Just being able to do that, and have the watchful eye of the state not being able to monitor you is amazing.

seydor · on April 11, 2023

you have to check your screen's firmware for that

ticviking · on April 11, 2023

I mean I have moderately high certainty that I'm escaping casual surveillance. It's not perfect but it doesn't need to be perfect to be good enough.

anentropic · on April 11, 2023

Apple should get working on a version of the Neural Engine that is useful for these models, and remove the 3GB size limit [1] to take full advantage of the 'unified' memory architecture. Game changer.

Waste of die space currently (on Macbook at least, I'm sure they find uses for it in the iPhone)

[1] https://github.com/smpanaro/more-ane-transformers/blob/main/...

leetharris · on April 12, 2023

It's not a waste on Mac, it will dynamically switch between GPU and NPU whenever CoreML is called. There are a decent amount of applications that use CoreML.

But I do agree it should be improved!

Torkel · on April 11, 2023

Things are moving so fast in this space I felt like I needed a [March] in the title on this one :)

turnsout · on April 11, 2023

Too real–early March even felt different from late March. :D

whimsicalism · on April 11, 2023

It appears there is this genre of articles pretending that LLAMA or its RL-HF tuned variants are somehow even close to an alternative to ChatGPT. Spending more than a few moments interacting even with the larger instruct-tuned variants of these models quickly dispels that idea. Why do these takes around open-source AI remain so popular? What is the driving force?

I've posted this before, but it seems like this genre is just getting more and more popular - and more and more untethered from any actual metrics of how good these models are.

ingenieroariel · on April 11, 2023

They are comparable to the first ChatGPT proof of concept from say early 2022. The reason many of us are excited is because we may be a year or two away from being able to run a ChatGPT if the open source models follow a similar curve.

whimsicalism · on April 11, 2023

I have been using ChatGPT from the beginning :) ChatGPT did not have a PoC in early 2022.

They are not comparable with late 2022 ChatGPT.

simonw · on April 11, 2023

I assume you mean GPT-3 - ChatGPT didn't launch until November 30th 2022.

gitfan86 · on April 11, 2023

Because OpenAI or Google being the gate keepers is a show stopper for most serious people in the space.

They will be able to shutdown your startup on a whim. Even if they didn't politicians and regulators would be a huge risk. Without democratization we get blade runner . Not that democratization has no problems, just that it is the way a lot of us are wanting it to go.

whimsicalism · on April 11, 2023

And basing your startup on a model pirated from Meta isn't risky?

I get it, I understand why people like decentralization - but the open source community doesn't even have close to the capability to train an actually open-source LLaMA equivalent.

xyzzy123 · on April 11, 2023

Mozilla or Wikimedia foundation have budgets with the right number of zeroes to support an effort like this.

While those organizations might not be the right fit, I think they serve as an "existence proof" that a large scale open / nonprofit project is not completely inconceivable.

Could possibly see an industry consortium (of players too small to compete on their own) funding an open effort.

Last idea sounds crazy but hear me out: how much would Nvidia spending $100M on "open" models boost spending on graphics cards? I hope someone's running the numbers on that...

whimsicalism · on April 11, 2023

> how much would Nvidia spending $100M on "open" models boost spending on graphics cards?

Probably not as much as in a zero-sum game where everyone is trying to train their own model. Every leap in CPU inference makes this an increasingly less appealing option for them.

But I agree, some industry consortium might try to do it. I think they would first have to be relying heavily on LLMs before they were willing to do that, and its possible by then that the lead will have gotten too large to easily surmount, especially now that everyone has stopped publishing.

seydor · on April 11, 2023

You make it sound as if ChatGPT itself isnt severly limited. Vicuna and newer 13B models are quite close . And the uncensored models have a capability that ChatGPT will never have.

whimsicalism · on April 11, 2023

They just aren't quite close. Maybe you guys are asking shallow questions, but I'm asking subject matter questions and this is just not true.

It would be great if we could have detailed QA evaluation to show this, but of course then the open source people would train their models on it as a fine-tuning datasaet.

UncleEntity · on April 11, 2023

Train it on specific subject matter and not waste space on things like “what year did Jack Nicholson beat on that dude’s car with a golf club?”

I mean… the horror.

whimsicalism · on April 11, 2023

That's just not how these models work.

taneq · on April 11, 2023

You’ll have to forgive them, they’re probably trained mostly on HN posts. ;)

causi · on April 11, 2023

The way OpenAI has set their models up for prompting and follow-up is where their real advantage is. It's very hard to take a model you can run locally and just go "Hey, write me a powershell script to convert all these files into this format. Split them up so they're no more than an hour each. Ok, change it to also denoise them."

thewataccount · on April 11, 2023

Are you talking about the RLHF? That's where Vicuna/Alpaca and other finetuning help a lot.

There's also already tools for conversation flows (which just means you prepend the conversation history to the prompt).

I'm not saying the performance is nearly as good, but the actual workflow does already exist and is massively improving. The interesting part to me is that this finetuning can be done in a couple few hours on a consumer gpu (4090).

avereveard · on April 11, 2023

What are the newer models? I am testing them in batteries across complex tasks, so far vicuña is the most flexible but they all choke on reflective instructions (I. E. Knowledge to extract is not in the model but in the user text)

ExxKA · on April 11, 2023

I very interested in this usecase - where should I go on the interwebs?

avereveard · on April 11, 2023

original paper: https://arxiv.org/abs/2205.00445

langchain agents are a good starting implementation.

you can build your own prompt and get the ai to work by iself hallucinating tools, which may be cheaper to test out than going back and forth with an agent manager. not as accurate, but you can still extract useful work, i.e. https://i.imgur.com/AE4R3dR.png (gpt-35-turbo is traditionally failing this task completely, prompt get it to work at it)

these prompt all require the model to work off data within the prompt within the first shoot. model require a degree to introspection for that to work.

simion314 · on April 11, 2023

People are enthusiastic about the possibilities, imagine a black box on your desk contains only RAM and matrix multiplication chips, you install on it your favorite AI assistant and you train it with your private data/code, you remove all prudish restrictions and get productive on your work and on your off times.

Llama has the potential to reach ChatGPT it needs tunning to get better at responding to questions, llama if I am not worng is mostly attempting to predict what is next.

I can see it similar like Midjourney and Stable diffusion, midjorny can make any stupid prompt look like a digiatal art in the style of Midjorney but look how many stable Diffusion innovation happens, a competent person that is on top with all the new stuff can produce absolute anything in any style they want.

alchemist1e9 · on April 11, 2023

My hunch is model weights will be commercialized as a purchased object. They can be watermarked so easy to trace any leak.

Then hardware will be a separate business. I think Apple might be caught off-guard by Nvidia on hardware. The latest NVlink and 400Gbps interconnects when combined with H100 next iterations and also rumors of advanced PCIe motherboards with high lane Nvidia CPUs and it looks to me that next year they can be selling $100-300K physical systems optimized for LLM inference that physically remind me of mainframes.

simion314 · on April 11, 2023

It makes no sense, the science is free and open, the companies just put their money into throwing data into the model. Once you have the big llama model filled with all the humanity information as an open source thing at bast a company could sell you some small stuff to add on top, like maybe Disney would sell you a "license" and a lora to generate Disney crap, their model would probably will be lower in quality then the open ones but the license would be the important part.

It is kind of idiotic that some scientist can spend years and a lot of public money to create some technology and then bilionairs are miliking all the profits.

alchemist1e9 · on April 11, 2023

I definitely hope you are correct.

I’m really hoping there are viable distributed and somewhat decentralized eventually consistent training algorithms we could all run in a P2P system. That would be super cool.

However I can easily see that now the framework has been established if a company builds a proprietary curated dataset for specific skills and then pays to spend resources on specialized reinforcement training.

Then they can commercialize that I would think. As people would pay for an LLM that does XYZ the best. Kinda like your Disney example but I was thinking engineering tasks in my head.

whimsicalism · on April 11, 2023

I am similarly enthusiastic and super, super impressed by Llama.cpp

But I just don't see the need to mislead and suggest that these models are "on par" with ChatGPT or something like that. They just aren't.

simion314 · on April 11, 2023

I did not said that are the same, the llama has a lot of information in it , the issue seems to be the Q&A chat part. This can be added on top by the community without having to start from scratch IMO. But I might be wrong, and OpenAI put some magic shit in , some super secret unpublished stuff, in that case educate me. In my mind I compare with Stable Diffusion, the proprietary ones produce more artistic effects because they put some more stuff into the prompts and they are imposing their styles. With SD you have the control and you need to wait for some improvements, plugins, new loras or embedings with new styles etc. In SD I could train my face in 1- minutes, with proprietary shit it will never happen...

Imagine what a math community could train, they just need access to the model and soem GUI software that can help them train.

So llama based Chat stuff is not yet comparable with ChatGPT but there are already lot of progress made. At this moment coding and math is bad in llama based but other stuff is great, like story creation, also I only could test 3-b 4bit and it is good enough to for example provide me a complex response in valid JSON format.

eikenberry · on April 11, 2023

I'm interested as LLMs are interesting tech but ChatGPT is proprietary and useless for free software.

d4rkp4ttern · on April 12, 2023

Agreed. I’ve said it elsewhere: how do we know that ClosedAI has fully published all their tricks used to build GPT3.* or later? The main “clues” people have are from the InstructGPT paper, but it should shock nobody if it turns out that paper reveals less than 10% of their techniques, which may have taken years to come up with. Competitors would need to rediscover those on their own or come up with new ideas. Simply repeating the ideas from that paper is likely not going to get them to a competitive model performance.

emrah · on April 11, 2023

It's great they got LLMs running on resource constrained devices but are they any good? Or I should ask, with the limited resources they get, what good are they for?

kbrkbr · on April 11, 2023

From my experience with llama.cpp and oobaboogas webui I can say they are amazing, at least on my gaming pc. I’m absolutely astonished at the speed and quality of llama, alpaca, galactica and vicuna (the >10B parameters ones).

Make no mistake, it’s for tinkerers that do not expect each prompt to be answered human like.

I see them as creativity and thought testing tools, also knowledge exploratory.

whimsicalism · on April 11, 2023

If they are amazing, then ChatGPT must be God-like in your view.

I've been underwhelmed by Alpaca and Alpaca-LORA and LLaMA all at 13B but I have not tried higher params.

kbrkbr · on April 11, 2023

In my opinion the problem with these is engineering a good prompt. I read of lots of people only getting nonsense or repetitions, and learned a bit from what they shared. These models are not chat bots.

Vicuna is more friendly in that regard.

But I’m well aware of their limitations also, and I can see how one can be underwhelmed. They are not jacks of all trades

whimsicalism · on April 11, 2023

Alpaca & Alpaca-LoRA are literally trained to be chat bots. Their goal is to be instruct-tuned.

MacsHeadroom · on April 12, 2023

Alpaca-LoRA, and all LoRas, are garbage. Alpaca is horrible compared to newer finetunes. Even cleaning the Alpaca dataset and retraining a cleaned Alpaca improves its performance greatly.

But newer finetunes like Vicuna go well beyond that, including hundreds of thousands of real human conversations with GPT-4 ChatGPT in the dataset (unlike Alpaca's fully synthetic dataset).

Vicuna-13B in 16bit is easily comparable to ChatGPT-3.5 in capability. Newer finetunes coming out nearly every day are going beyond chatGPT-3.5 and getting closer and closer to GPT-4 performance.

You don't even have to install anything to validate this for yourself. There's a live web demo of Vicuna-13B right here: https://chat.lmsys.org/ (disable ad blocker if it does not load)

MacsHeadroom · on April 12, 2023

Also, InstructGPT was released in January 2021 and Vicuna-13B blows it out of the water.

yyyk · on April 11, 2023

30B Alpaca is better. Not anywhere near ChatGPT level, but better than the 13B models.

wing-_-nuts · on April 11, 2023

is there a difference in the quality of llm one would be able to train or run on a gpu with 8, 12, 16, all the way up to 24gb?

I'm trying to decide whether it's worth while to splurge on a more expensive 4090 vs a 4070 or whatever.

whimsicalism · on April 11, 2023

It makes 0 economical sense to buy a GPU to train a model. If you want to train a model, train it on the cloud.

thewataccount · on April 11, 2023

This isn't necessarily true with LoRAs - a 4090 can train/compute the alpaca dataset with LoRA in under 6 hours (it might be 3, I forget what it was).

So finetuning with LoRAs and a few other methods is fine on higher end consumer hardware like a 4090 and finishes in a reasonable amount of time - IMO definitely worth it if you're experimenting with this especially for the inference.

The base training though yeah I totally agree with you - train in the cloud, don't buy hardware when you need a month of 8x A100's or whatnot.

whimsicalism · on April 11, 2023

Even with LoRA the economics are not in your favor to buy a 4090 instead of cloud training.

thewataccount · on April 11, 2023

Actually yeah you're definitely right.

My perspective was for people who have other uses for them e.g. gaming or local inference. From a pure finance standpoint you're definitely right - you should rent and not buy a dedicated card. I think you'd need a few thousand hours to break even which is a few months 24/7.

wing-_-nuts · on April 11, 2023

It's more that I'm building a gaming pc this summer, and I can either target 1440p (4070) for 2k or 4k for 5k (4090). If I can do a lot more with a 4090 over a 4070 it might make sense, but I know a lot of cs students use google colab these days, so I may just rely on that.

thewataccount · on April 11, 2023

I'd seriously recommend the 4090 over the 4070 if you want to do finetuning/inference locally. And I highly recommend 64GB of ram.

The 24GB of VRAM is 100% worth it alone. If you want to do local ML stuff you _need_ that 24gb of VRAM.

64GB of ram + 24GB of vram lets you run a lot of the medium size models at decent speeds. I don't use Colab personally but AFAIK it should work fine for you if you don't want to do it locally.

Also worth noting is the newer ray tracing rendering that cyberpunk is doing. You should checkout the demos IMO it looks sick. It only runs at 18fps on a 4090 so it's only playable on a 4090 + dlss, and I'm not sure if the newer rending tech will be super achievable on any of the other cards - if that's of interest to you.

boppo1 · on April 12, 2023

Get a 13900 or 7950X and 64 gb of ram. You can run llama 30B and 65B, slowly but surely. Play with that before buying a gpu. If you really, really see yourself getting into this, then go ahead and get a 3090 or 4090. But otherwise get a cheaper nvidia card and wait for things to develop a little more. You can still play with ML and CUDA but you'll have cash left for when 50X0s drop, and that will probably be right around when this stuff will really be getting hot (if the current plateau doesn't hold).

Llama is basically an auto-complete right now. We're celebrating baby's first steps. It's not really worth the $600-1000 jump up from cards that can run all current games 4k60.

whimsicalism · on April 11, 2023

Fair enough.

anonzzzies · on April 11, 2023

Can you calculate that for me on a napkin? Every calculation I make for training, but certainly for inference, makes me break even after well under a year and then it’s vastly less if I buy the hardware myself.

whimsicalism · on April 11, 2023

You can infer on your local machine.

I doubt it will take you a year to train a model.

Vicuna-13B cost $300 to train/fine-tune [0]. They trained on an A100 which costs $10k [1].

[0]: https://vicuna.lmsys.org/ [1]: https://github.com/lm-sys/FastChat/blob/main/scripts/train-v...

turmeric_root · on April 11, 2023

More VRAM => larger models. IME it is absolutely worth maxing out VRAM for the significant improvement in quality, especially with LLaMA (though even with a 4090, you won't be able to run the largest 65-billion parameter model even with 4-bit quantization).

That said, I recommend renting a cloud GPU for a few hours and trying the larger models on them before buying a GPU of your own, just to see if the models meet your requirements.

sliken · on April 11, 2023

But should fit easily on a Apple MBP or Studio with 96GB or 128GB of unified memory.

Garrrrrr · on April 11, 2023

llama.cpp runs on your processor and uses a lot of RAM, so splurging on a GPU isn't going to help the performance

rubidium · on April 11, 2023

Is there a "getting started" guide you'd recommend to a newbie to the space?

kbrkbr · on April 11, 2023

I started with ggerganov’s llama.cpp GitHub repo, and went from there. But then again, I know some programming, statistics and machine learning, so it may not be for you, I cannot judge that.

Models can be found on huggingface.co, and I’d start with eachadea/ggml-vicuna-13b-4bit, but it needs 10G of cpu-ram. It is very friendly to any prompt though.

I read on my way (on reddit, when I recall correctly), that there must be some really good intro videos on YouTube.

boppo1 · on April 12, 2023

additional point of reference I don't know shit about ML or stats, and am a very weak programmer. I just know how to install programs on linux and use CLI in a basic fashion. I have had no problem getting llama.cpp going by following ggerganov's readme.

synergy20 · on April 11, 2023

are we talking about training or inference for local LLM here? it's hard to do any meaningful training on the edge unless we all carry a heavy gaming pc, even that, the training quality will be subpar?

qumpis · on April 11, 2023

Inference, even fine-tuning a few layers would be difficult since one needs to use non-quantized model, I'd imagine

ctoth · on April 11, 2023

Checkout LoRA and Alpaca LoRA and the whole huge group of people who have already figured this out. I think there was another breakthrough (yesterday?) which is a further adaption of LoRA to touch even less parameters at runtime.

homarp · on April 11, 2023

DyLoRA https://news.ycombinator.com/item?id=35514228 is what you meant

whimsicalism · on April 11, 2023

Disagree that this is a major breakthrough :) but it is probably a marginal improvement.

whimsicalism · on April 11, 2023

They're not training on edge :)

BulgarianIdiot · on April 11, 2023

They’re crude but will be getting better quickly.

cgearhart · on April 11, 2023

Even with smaller models & more optimized hardware, I think edge compute is going to be power-limited first. Batteries today just won’t support constantly running LLMs. But I joked recently that as long as they prove useful then consumers would be willing to swap their iPhone for the old car battery with a phone handle attached.

BulgarianIdiot · on April 11, 2023

The first step is for it to be viable for smaller models on desktops. The rest will follow, as hardware catches up. Hybrid analog-hybrid NN hardware is on the horizon, maybe in 3-4 years. This would allow GPT-4 level performance on an iPhone with plausible battery life.

The current hardware of course can't pull anything like this yet. But iPhone supports on-device facial recognition, object recognition, dictation and translation, so small steps...

boppo1 · on April 12, 2023

Who is making said hardware?

BulgarianIdiot · on April 13, 2023

There are few efforts under way.

https://research.ibm.com/blog/why-we-need-analog-AI-hardware

https://news.mit.edu/2022/analog-deep-learning-ai-computing-...

tudorw · on April 11, 2023

I guess if the hardware is cheap for some uses speed is not so important, you can just walk away and let it grind.

turmeric_root · on April 11, 2023

I like using them for memeing

jrm4 · on April 11, 2023

As I see these things come out, it feels like there's not a lot of discussion on which hardware (that isn't one of the fancy new Macs?) As in, there might be a lot of graphics cards out there that could be used here? Is it only Nvidia still, is AMD a possibility? Maybe I'm missing something on how the tech works?

seydor · on April 11, 2023

This has a list of models and their VRAM requirements

https://www.reddit.com/r/LocalLLaMA/comments/11o6o3f/how_to_...

boppo1 · on April 12, 2023

30B llama needs a 3090 or 4090. 13B I think you can get away with a 3/4080. If you have 64 gigs of ram and a beefy CPU you can run even 65B, but boy it's slow.

13B is pretty meh, but 30B is great, if not quite Chatgpt. But I can ask it why my highschool geometry teacher was such a cunt and it will happily discuss the matter without reservation. Very therapeutic.

layer8 · on April 11, 2023

It would be nice to be able to run an LLM-driven spamassassin on a VPS for acceptable cost.

londons_explore · on April 11, 2023

I don't think it will help. Actual friends occasionally send me mail that says "test" from a random account. And spammers do too... There is no way to seperate them.

layer8 · on April 11, 2023

Personally I haven’t had that problem. I can readily distinguish spam from non-spam by manual review ~99% of the time. If I could train or instruct an LLM to do the same as I do now, I would be happy. My current false-negative rate with Bayesian spamassassin is more like 50%.

autoexec · on April 12, 2023

Don't you do it somehow? Plus a filter doesn't have to be 100% right all the time. Filtering out what is 99% certain to be or not be spam and leaving the human to cover the tiny number of messages that fall into a grey area would still save a ton of time.

Zuiii · on April 12, 2023

Yes, but I rely on my own personal experiences (online and offline) to determine whether an email is real or a scam. Unless that AI filter taps into my memory, it will likely lead to too many false positives and false negatives.

imaurer · on April 11, 2023

Tracking repos and resources for running LLMs locally here:

https://github.com/imaurer/awesome-decentralized-llm

turnsout · on April 11, 2023

Also, this continuously-updated spreadsheet is extremely helpful: https://docs.google.com/spreadsheets/d/1O5KVQW1Hx5ZAkcg8AIRj...

gigel82 · on April 11, 2023

I don't understand why people are so excited to build this big thing on top of Llama, which is closed source, severely license restricted and we now know for a fact that Meta is going after users with the legal hammer.

I'm sure if we'd pool resources together we could build a truly open alternative worthy of building on top of.

whimsicalism · on April 11, 2023

> I'm sure if we'd pool resources together we could build a truly open alternative worthy of building on top of.

This would require like ConstitutionDAO level of resource pooling without direct monetary payoff.

I mean, good luck.

FL33TW00D · on April 11, 2023

I wrote a similar blog post in Nov 22: https://fleetwood.dev/posts/a-case-for-client-side-machine-l...

la64710 · on April 11, 2023

One simple thing these LLM models cannot do yet .. that is to simply point a LLM to a URL and it will start scraping - ie follow the hyperlinks and start consuming the content. I am not an AI guy but I guess this has to do with the context limitations of most model? How did they train OpenAI with all internet data till 2021? This I think will be a most popular feature for LLM models and I seriously hope it is OSS whenever it comes out.

marijnz · on April 11, 2023

This already works with ChatGPT plugins: https://openai.com/blog/chatgpt-plugins

waffletower · on April 11, 2023

While many NLP related Apple ML job listings have been added since this article was written, there were several recent listings at the time of its writing. While I feel that Apple does not focus well on intangible technologies, products that can't be readily carried, worn and given their boutique product development fetish focus, I have some hope that they can overcome this bias somewhat, and see how behind they are.

hackernewds · on April 11, 2023

I disagree with the assessment that Apple is behind. Apple is known for executing well and putting their weight behind the things they launch.

Like Jack Dorsey would often say "it's not important to be first to market, you can just be best to market". And the world got CashApp.

I'm sure however Apple enters the space, it will be fleshed out (vs Bard).

rootusrootus · on April 11, 2023

I want to agree, but it's pretty easy to find instances where Apple has dabbled but not delivered best-in-class solutions. Siri. iCloud. Home automation.

pwinnski · on April 11, 2023

Apple has two values in conflict with each other, I think. On the one hand, they want to deliver best-in-class solutions. On the other hand, they have a commitment to user privacy[0] as perhaps only a gay man growing up in the south might value.

Siri should be better! It lost features post-acquisition by Apple, and it seems like user privacy is why.

Home automation is arguable. If you consider a single point of failure on a server somewhere to be bad, Apple's solution is pretty great. Their commitment to zigging where others zagged put them behind, since hardware vendors didn't want to put in powerful (expensive) enough chips to handle the cryptography, but while other companies go out of business, or transmit images and video to external parties, Apple's works reliably and securely.

Still, as with most of my complaints about Apple, it's a trade-off between privacy and functionality, and Apple will seemingly always choose privacy over functionality, even as Google consistently chooses functionality over privacy.

[0] Yes, there are examples of edge cases that suggest a less-than-perfect record. Contrast that with their competitors, for which invading privacy is foundational to the business model.

sliken · on April 11, 2023

Up until recently I'd have agreed on Apple's.

Increasingly Apple seem to be blocking tracking, noteably from facebook, to make the most profit of that tracking. I've read claims that apple made between $5B and $20B on advertising in 2022. It's far from clear that Apple's view on privacy is going to stay the same.

pwinnski · on April 12, 2023

That they're blocking tracking is still privacy-focused. There are unsubstantiated claims that they exempt themselves from the same tracking, but reports of their advertising revenue doesn't contribute anything to those claims.

At Apple's scale, it's relatively easy for them to deliver $20B in ad revenue without any privacy-invading means.

People seem to have forgotten, but ads used to be based on context, so people looking at apps related to fitness might see ads related to fitness, but that wouldn't follow them around when they looked at other things. Apple still seems to be doing that; I haven't seen fitness ads on games, or game ads on fitness apps.

whimsicalism · on April 11, 2023

Apple has very little ML expertise and very few cloud resources.

They are absolutely behind in the space and anyone who works in the industry will tell you that. Only feasible way IMO would have to be a very big budget acquisition of one of the major LLM startups, but most of those already have big tech backers.

simonw · on April 11, 2023

"I think we’re going to eventually see a demo showing an open source model running on an iPhone as well"

I have Kevin Kwok's SheepyT running on my iPhone right now - it uses GPT-J, which is an openly licensed LLM by EleutherAI.

https://twitter.com/antimatter15/status/1644456371121954817

monkeydust · on April 11, 2023

Wait till these are embedded into soft toys.

'You are a koala who plays with 5-7 year olds, you are friendly natured and curious and like to ask questions'

jjtheblunt · on April 11, 2023

Like Teddy from A.I. the movie?

https://www.youtube.com/watch?v=YRsICbxDEiI

txomon · on April 11, 2023

Out of doubt, which seems to be spreading around the internet. The LLaMa model weights weren't "leaked" AFAIK but rather explicitly given access to to researchers, isn't it right?

I know the article goes on to speak about something else, but I'm not sure why this claim that the LLaMa model weights were leaked, as in unintendenly made available is being done.

ftxbro · on April 11, 2023

My understanding is that researchers could ask for access to weights, but then also they were leaked so that anyone could get them without asking. There is another layer, where Facebook seems to accept it on some level (I mean they don't have a choice anymore anyway); they put a cheeky comment in the open pull request instead of closing it.

turmeric_root · on April 11, 2023

The model weights were only shared by FB to people who applied for research access. Github repos containing links to the model weights have been taken down by FB.

causi · on April 11, 2023

LLAMA isn't there and probably never will be, but the possibility of running something equivalent to ChatGPT has certainly made me reconsider my GPU purchases. I wonder if in the end will it be Nvidia's CUDA advantage or AMD's larger amount of memory that will end up being more important when we do get it.

croes · on April 11, 2023

Local LLMs remove the data protection problem but open the door for malicious use on a larger scale.

colonwqbang · on April 11, 2023

It was never possible to keep this technology secret for any length of time.

croes · on April 11, 2023

I thought the possibility to run these LLMs on everyday hardware would be further on the future and in the beginning more limited to big servers.

But this could be the equivalent of the Low Orbit Ion Canon for phishers and scammers

Root_Denied · on April 11, 2023

It also makes it easier for the defense/blue team side to build counters or monitor and defend the attack surface of a given system against these attacks.

Digital arms races are nothing new, this is just the latest battlefield.

autoexec · on April 12, 2023

My local LLM will automatically detect scams and phishing attempts so it'll all balance out.

binkHN · on April 11, 2023

This is wonderful. As hardware and software continues to improve, everything seems to find a way to run on ever smaller devices. Guess your own pocket-AGI is not too far away after all.

1827162 · on April 11, 2023

By the way, I was thinking of something along the lines of a powerful FPGA with direct access to large quantities of very fast NAND flash, likely many chips in parallel, which will save having to load the model into RAM..... So it will be able to directly run from NAND flash, which opens up the possibility of using very large models???

Power consumption would not be an issue if it's used sporadically throughout the day, it's not like it needs to run continuously?

There is still the issue of NAND flash read disturb, which I haven't fully looked into yet.

yieldcrv · on April 11, 2023

So interesting how you can already tell that is a 3 week old epiphany

sendfoods · on April 11, 2023

How realistic is CPU-only inference in the near future?

travisjungroth · on April 11, 2023

It’s in the near past. https://github.com/ggerganov/llama.cpp

abetlen · on April 11, 2023

Also worth checking out https://github.com/saharNooby/rwkv.cpp which is based on Georgi's library and offers support for the RWKV family of models which are Apache-2.0 licensed.

BrutalCoding · on April 11, 2023

I’ve got some of their smaller Raven models running locally on my M1 (only 16GB of RAM).

I’m also in the middle of making it user friendly to run these models on all platforms (built with Flutter). First MacOS release will be out before this weekend: https://github.com/BrutalCoding/shady.ai

abetlen · on April 11, 2023

You can see for yourself (assuming you have the model weights) https://github.com/abetlen/llama-cpp-python

I get around ~140 ms per token running a 13B parameter model on a thinkpad laptop with a 14 core Intel i7-9750 processor. Because it's CPU inference the initial prompt processing takes longer than on GPU so total latency is still higher than I'd like. I'm working on some caching solutions that should make this bareable for things like chat.

0xDEF · on April 11, 2023

I know it's secret how GPT-4 actually works but are there any of these local LLMs that are also multimodal?

Oranguru · on April 12, 2023

Yes, there are. For example: https://laion.ai/blog/open-flamingo/

whimsicalism · on April 11, 2023

sharemywin · on April 11, 2023

I could certainly see them being trained/used as a front end for a database(file system, api calls).

agumonkey · on April 11, 2023

Were transformers used in other contexts ? biochemistry ? geometry .. whatever.

skinner_ · on April 13, 2023

Alphafold uses something they call Evoformer, it is an attention mechanism. Our group has tried, and so far failed to utilize transformers for a very very specific search problem in geometry (https://bit.ly/unit-distances).

agumonkey · on April 15, 2023

thanks a lot man, very nice

transfire · on April 11, 2023

We really need optical computing to take this to the next level.

rambojohnson · on April 12, 2023

This service has been suspended by its owner.

Sparkyte · on April 11, 2023

Everyone be like ChatGPT is everything. I am like it won't be the last or the first.

ingenieroariel · on April 11, 2023

In order to run large language models, we should all be buying a fully loaded Mac Studio (128GB of ram, 20 CPU cores, a lot of GPU and Neural cores.) and putting Linux on it to remove the artificial restrictions.

Yes, we will be running them soon in low end hardware, but we need to get at least to GPT-3.5-turbo level of inference speed and quality before we try to make it small.

I already started.

    $ neofetch
                   -`                    x@decpti 
                  .o+`                   ------- 
                 `ooo/                   OS: Arch Linux ARM aarch64 
                `+oooo:                  Host: Apple Mac Studio (M1 Ultra, 2022) 
               `+oooooo:                 Kernel: 6.1.0-rc6-asahi-4-1-ARCH 
               -+oooooo+:                Uptime: 4 hours, 23 mins 
             `/:-:++oooo+:               Packages: 177 (pacman) 
            `/++++/+++++++:              Shell: bash 5.1.16 
           `/++++++++++++++:             Resolution: 1920x1080 
          `/+++ooooooooooooo/`           Terminal: /dev/pts/0 
         ./ooosssso++osssssso+`          CPU: (20) @ 2.064GHz 
        .oossssso-````/ossssss+`         Memory: 717MiB / 129540MiB 
       -osssssso.      :ssssssso.
      :osssssss/        osssso+++.                               
     /ossssssss/        +ssssooo/-