I was able to run a LLaMA model on my personal machine to label some of my documents, as a test of its capabilities. It was an instruct-tuned model with 30B parameters.
4 example labels, and I had a binary classifier in seconds. Sure, semantic text classifiers were possible for a while, but making it accessible changes everything. Giving anyone who can use a spreadsheet the power of a local LLM (or, basically free LLMs) can make them much, much more productive. A lot of office work is clicking through sheets and doing manual labeling.
It's truly wild what is becoming accessible! Really excited to see the next gen software that the open community comes up with :)
LLMs as general purpose classifiers is a really big deal, especially because you can give them fuzzy instructions. I know people are worried about LLMs and spam, but I think LLMs may provide an opportunity to elevate online discourse by being more efficient at filtering out spam and low quality commentary.
I already have a custom browser plugin that calls out to GPT (gpt-3.5-turbo is cheap and good enough for this) to classify and filter out low-effort, overly negative, or intellectually dishonest HN comments. It significantly improves the experience on this site.
Bonus points: I had never written a browser plugin, but GPT4 helped me do it in under half an hour.
Close, but it's a bit more specialized to just work on HN:
- Examples of unwanted content
- Then I give a large numbered list of comments and ask which numbers should be filtered
- The plugin then just deletes those comment nodes from the DOM. If HN ever updates their HTML I will have to tweak this code.
The reason to send a large list of comments is just to save on costs. It's cheaper to do it this way than one comment at a time.
So the main difference from what you've proposed is GPT never sees the HTML. My code enumerates the comments in the HTML and splices them into the prompt in a nice numbered list, then does the reverse translation from list number to DOM element in the other direction.
I give GPT some examples of what I consider low-effort comments and non-low-effort comments, and then I just ask it to do classification. It's quite good at few-shot classification of fuzzy stuff like this.
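For anyone curious, the batching and the number-to-comment mapping described above can be sketched roughly like this. It is a minimal sketch in Python rather than the plugin's actual code; the function names, the prompt wording, and the reply format are all assumptions for illustration, and the call to the model itself is left out.

```python
import re

def build_prompt(comments, bad_examples, good_examples):
    """Splice comments into one numbered list so a single request covers them all.

    The prompt wording is an assumption, not the plugin's real prompt.
    """
    lines = ["Examples of comments to filter:"]
    lines += [f"- {c}" for c in bad_examples]
    lines.append("Examples of comments to keep:")
    lines += [f"- {c}" for c in good_examples]
    lines.append("Which of the following numbered comments should be filtered?")
    lines.append("Answer with a comma-separated list of numbers, or 'none'.")
    for number, text in enumerate(comments, start=1):
        # Collapse whitespace so each comment stays on one numbered line.
        lines.append(f"{number}. {' '.join(text.split())}")
    return "\n".join(lines)

def parse_reply(reply, n_comments):
    """Map the model's reply back to zero-based comment indices.

    Numbers outside 1..n_comments are ignored, since the model can
    hallucinate list numbers that were never in the prompt.
    """
    numbers = {int(match) for match in re.findall(r"\d+", reply)}
    return sorted(n - 1 for n in numbers if 1 <= n <= n_comments)
```

The indices returned by `parse_reply` are what you would use to look up the matching DOM nodes and remove them.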
How many lines of code are in the kernel, drivers, and apps on your laptop/desktop?
How sure are you that all that code has been reviewed by a 3rd party? How many CVEs a year impact your laptop/desktop?
Do you have any reason to think that increased productivity with LLM assistance will result in lower quality code? Personally I find LLM assistance increases productivity, decreases the penalty of using a more difficult language like rust, and makes it more palatable to spend more (LLM assisted) time writing tests.
I meant reviewed by a (competent) human; not, audited by a security team/whatever. Generative computer code is in a whole new category of risk, because it can hallucinate a `rm -rf *` that you might overlook if you don't understand the code yourself.
Well at least for now, any non-trivial code is going to be written by a human+AI, likely one function at a time. It's pretty rare for the AI to give me exactly what I want, often I have to tweak my prompts multiple times, and even just give up and write a function based on the useful bits from previous queries.
I just don't see chatbot-assisted programming as any worse than what we have today.
Like almost all machine learning stuff, I expect these things to only be useful in places where it doesn't really matter if the results are correct. When you apply a classifier in real life, it's critically important to understand its statistical characteristics, which is typically done via model characterization, which involves cross-validation or bootstrapping or whatever. I think the idea that you can just zero-shot or few-shot deploy these things as classifiers and forget about it is incredibly naive unless (as I said earlier) the results just don't really matter.
I've already used LLMs in my work as a data scientist, but it requires a ton of work to just make the results tractable (and I have been using GPT-4, which behaves pretty well). These smaller language models ain't so regular. OK, consider a basic thing you want to do with a classifier: understand its behavior on a held-out data set. Since no one knows what is really in the training data (since it's so large), it's quite hard to understand what the model can generalize about and what it has just accidentally memorized. Cf. reports that GPT-4 doesn't perform nearly as well on even simple programming exercises that are chosen in such a way as to be sure they weren't in the training data.
There is enormous potential for statistical fuck ups here. Prompt engineering, for instance, is an easy place for over-fitting to happen as a prompt is fine tuned on data the prompt engineer has and thus fails to generalize to new data.
I do think there is a lot of value here, but I'm also sure that sloppy use of large language models is going to cause a bunch of trouble in the short to medium term, generate a lot of garbage, pollute a lot of databases, etc, while we figure all this stuff out.
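To make the "model characterization" point concrete, here is a minimal sketch of a percentile bootstrap for accuracy on a held-out labeled set. The function name and defaults are assumptions for illustration. It only quantifies sampling noise in the held-out set; it cannot tell you whether those examples leaked into the LLM's training data.

```python
import random

def bootstrap_accuracy_ci(y_true, y_pred, n_resamples=2000, alpha=0.05, seed=0):
    """Return (accuracy, lower, upper) using a percentile bootstrap.

    y_true and y_pred are equal-length lists of labels from a held-out set.
    """
    if len(y_true) != len(y_pred) or len(y_true) == 0:
        raise ValueError("need two non-empty label lists of equal length")
    rng = random.Random(seed)
    correct = [int(t == p) for t, p in zip(y_true, y_pred)]
    n = len(correct)
    accuracies = []
    for _ in range(n_resamples):
        # Resample the per-example outcomes with replacement.
        sample = rng.choices(correct, k=n)
        accuracies.append(sum(sample) / n)
    accuracies.sort()
    lower = accuracies[int((alpha / 2) * n_resamples)]
    upper = accuracies[min(n_resamples - 1, int((1 - alpha / 2) * n_resamples))]
    return sum(correct) / n, lower, upper
```

With only a few dozen labeled examples the interval comes out wide, which is exactly the thing a "few-shot and forget" deployment never looks at.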
You can certainly hold out a validation set while you are writing your prompt, but you can't know whether the model is overfitting to your data set, since you don't know what data was in the training set.
You'll probably find out pretty quick in production :)
But I posit that most text classification tasks don't have such strict accuracy requirements. For one, no text classifier is 100% accurate. For instance, I have genuine mail in my spam folder frequently. I see spam on social networks, etc. I struggle to think of cases that aren't at least somewhat tolerant to some amount of incorrect classification.
Disagree. Adding signed communications as the default to replace email's completely nonsense zero encryption model is not upping the stakes. Email is the most successful federated software in the universe, but it's fucking time to get some encryption and authn.
Yea, lol, what was I thinking!? It’s not like my granny uses TLS ECDH SHA1 right!? Oh, wait, everyday she does? The tools adapt; email has never had to.
Agree, though plain old Bayesian classifiers have been able to handle some significant portion of that office work for a long time. And not much ever came from it for everyday stuff outside of spam filters.
Maybe both the buzz factor and broader applicability means it's more likely to happen this go around?
More like: the accessibility is what will make it go around. The accessibility is what changes everything! Getting an easy to use interface (instructions) over python changes the accessibility from the denizens of this website to ~anyone with a computer
Essentially, being label efficient is more important than being compute efficient, because the biggest computing constraint we have is enough humans doing the labeling (and knowing how to work a jupyter notebook), not tensor smashing nvidia cards
Ease of use is what is going to change everything. Using natural language to ask something and getting an answer is different from what we had before. I knew I could automate a lot of my paperwork with scripts, but as I am not a programmer I never gave it more than a cursory look. Last December, while playing around with ChatGPT, I was able to get it to write some Python scripts that resulted in my spending less than 20-25 minutes on tasks that I had been spending 20-25 hours on. Now, could I have written the scripts myself? Probably, but it would have taken me months, whereas with ChatGPT it took me 2-3 hours to get working scripts and another 1-2 hours to optimise them.
> Sure, semantic text classifiers were possible for a while, but making it accessible changes everything.
Binary classification can actually take you all the way in terms of classification if you are clever with set theory. It's also one of the most traceable & deterministic ways to understand how the natural language is being interpreted at each step.
The amount of performance required to run something like an SVM is laughable compared to what is required to run even baby-tier LLMs. If you can reduce the cost of running models to a <1ms invocation over a few megabytes of black box, you can easily test thousands of these per-user-query. Re-training and iterating is much more enjoyable for these reasons. You also don't need any GPUs for this.
At the end of the day, the quality of your data will be the biggest issue with older techniques. LLMs can bandaid all sorts of weird things that crop up in the real world and aren't present in the training data. SVMs cannot tolerate requests delivered in the format of Shakespeare (if unexpected). In a well-controlled domain, you would probably be able to get away with much cheaper options that are also more flexible.
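As a rough illustration of the set-theory idea: treat each binary classifier as defining a set of documents, then build richer categories with ordinary set operations. This is only a sketch. The three predicates below are hypothetical keyword checks standing in for trained binary models such as SVMs, and the documents are made up.

```python
def members(predicate, docs):
    """Return the set of document ids that a binary classifier accepts."""
    return {doc_id for doc_id, text in docs.items() if predicate(text)}

# Hypothetical stand-ins for trained binary classifiers (e.g. SVMs).
def is_invoice(text):
    return "invoice" in text.lower()

def is_overdue(text):
    return "overdue" in text.lower()

def is_internal(text):
    return "internal" in text.lower()

# Made-up example documents.
docs = {
    "d1": "Invoice 1042 is overdue",
    "d2": "Internal invoice, overdue since May",
    "d3": "Invoice 1043 paid in full",
    "d4": "Lunch menu",
}

invoices = members(is_invoice, docs)
overdue = members(is_overdue, docs)
internal = members(is_internal, docs)

# Intersection and difference compose the binary decisions into a
# finer-grained category, and every step can be inspected on its own.
overdue_external_invoices = (invoices & overdue) - internal
```

Because each intermediate set is just a set of ids, you can see exactly which binary decision put a document in or out of the final category.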
What makes it so much better than normal text classification for me is that it doesn't require tons of training data to accurately classify text. Using it to parse Craigslist posts that I might find interesting showed very promising results, although it's fairly slow on my base M1 machine.
Not at all. Even a Raspberry Pi would do - you only need ~6GiB RAM for a 4-bit quantized LLaMA model (though it's gonna be quite slow). A decent modern desktop machine would do just fine, no need for anything extra fancy.
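A back-of-the-napkin way to sanity-check RAM figures like this is parameters times bits per weight, divided by 8. A minimal sketch follows; it counts the weights only and ignores the KV cache, activations, and quantization metadata, and the parameter counts passed in are nominal model sizes, which is an assumption on my part.

```python
def weight_memory_gib(n_parameters, bits_per_weight):
    """Approximate memory for the model weights alone, in GiB.

    Ignores KV cache, activations, and per-block quantization overhead,
    so real usage will be somewhat higher.
    """
    total_bytes = n_parameters * bits_per_weight / 8
    return total_bytes / 2**30

# Nominal parameter counts (assumed round numbers, not exact model sizes).
for name, params in [("7B", 7e9), ("13B", 13e9), ("65B", 65e9)]:
    print(f"{name} at 4-bit: ~{weight_memory_gib(params, 4):.1f} GiB of weights")
```

By this estimate a nominal 13B model at 4 bits lands at about 6 GiB of weights, and a nominal 65B model at 4 bits is well over 24 GiB.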
What I'm wondering is how they fed the documents, as all those LLMs have limitations on the input sizes.
I expect we will see the biggest jump in performance if (when) consumer-grade coprocessors like mobile GPUs start incorporating attention layers as a primitive building block at the hardware level, e.g., with instructions and memory layouts engineered specifically to make ultra-low-precision (say, 4-bit) transformer layers as compute- and memory-efficient as possible on consumer devices. That seems almost inevitable to me.
Attention: Isn't it quadratic in context length? I dunno, this feels like the crude first iteration of something that will get inevitably passed by something that scales better.
Sharing a comment from a similar line of discussion:
"""
Complexity is quadratic in sequence length. For 512 tokens it is 262K, but for 4000 tokens it becomes 16M and goes OOM on a single GPU. We need about 100K-1M tokens to load whole books at once.
Since 2017 there have been hundreds of attempts to bring O(N^2) to O(N), but none of them replaced the vanilla attention yet in large models. They lose on accuracy. Maybe Flash attention has a shot (https://arxiv.org/abs/2205.14135).
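The numbers in the quote are just the sequence length squared, i.e. the number of entries in one attention score matrix. A minimal sketch of that arithmetic; the 2 bytes per element is an assumption (fp16) and the figures are per head, per layer, for a naive implementation that materializes the full matrix.

```python
def score_matrix_entries(seq_len):
    """Entries in one N x N attention score matrix."""
    return seq_len * seq_len

def score_matrix_bytes(seq_len, bytes_per_element=2):
    """Bytes needed to materialize one score matrix (fp16 assumed by default)."""
    return score_matrix_entries(seq_len) * bytes_per_element

for n in (512, 4000, 100_000):
    print(n, score_matrix_entries(n), score_matrix_bytes(n))
```

For 512 tokens that is 262,144 entries and for 4000 tokens it is 16,000,000, matching the 262K and 16M above.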
There is literally no way that GPT-4 is using classic O(N^2) attention exclusively. They haven't released the results, but I promise you this is not what they are using exclusively.
Memory use is already linear in practice, thanks to FlashAttention. It's an open question whether computation can be made sub-quadratic without impacting model performance, although there are ongoing efforts seeking to do exactly that.[a]
Keep in mind: Once you go into precision as low as 4 bits (or lower?), all sorts of optimizations can become practical. Off the top of my head, maybe you could cache and reuse common attention sub-matrices (e.g., a 16×16 sub-matrix with 4-bit elements occupies only 16×16×4÷8=128 bytes of space)?
My sense is there's so much money at stake here, that whoever does this first will win big even if they end up having to replace or augment it with something better down the road. Hypothetical example: Imagine Intel or AMD coming out with a $1K or $2K card that has "built-in 4-bit attention," enabling you to run transformers of much greater scale on a run-of-the-mill desktop PC. I'd buy that in a heartbeat.
[a] Here's a recent post about a new approach from a group at Stanford that looks promising to me, although I don't fully understand all its details yet: https://news.ycombinator.com/item?id=35502187
Probably once context lengths get really long it might be better to use ANN rather than exact attention at inference time. I would imagine this would only pay off with ~100,000 token contexts, and even then it would only work if only a few tokens meaningfully contribute to each attention head.
maybe one could use physics to do that in analog. Or even better, in biology. I think with a clump of ~1.5kg of neurons we can have a pretty efficient coprocessor that is fed with pizza.
Chip capability planning seems to need about a 2-year lead time, so the earliest we could expect this would be somewhere around the end of 2024. (Transformers probably earlier than that (2023?), 4-bit would be later.)
There are those Google accelerators that plug into an M.2 slot. You could plausibly do this today, although I am not sure what sort of memory constraints those accelerators have.
I found this to be very liberating, that I can finally type whatever I want into the LLM, without the possibility of the government knowing what I am writing. Just being able to do that, and have the watchful eye of the state not being able to monitor you is amazing.
Apple should get working on a version of the Neural Engine that is useful for these models, and remove the 3GB size limit [1] to take full advantage of the 'unified' memory architecture. Game changer.
Waste of die space currently (on Macbook at least, I'm sure they find uses for it in the iPhone)
It's not a waste on Mac, it will dynamically switch between GPU and NPU whenever CoreML is called. There are a decent amount of applications that use CoreML.
It appears there is this genre of articles pretending that LLaMA or its RLHF-tuned variants are somehow even close to an alternative to ChatGPT.
Spending more than a few moments interacting even with the larger instruct-tuned variants of these models quickly dispels that idea. Why do these takes around open-source AI remain so popular? What is the driving force?
I've posted this before, but it seems like this genre is just getting more and more popular - and more and more untethered from any actual metrics of how good these models are.
They are comparable to the first ChatGPT proof of concept from say early 2022. The reason many of us are excited is because we may be a year or two away from being able to run a ChatGPT if the open source models follow a similar curve.
Because OpenAI or Google being the gate keepers is a show stopper for most serious people in the space.
They will be able to shut down your startup on a whim. Even if they didn't, politicians and regulators would be a huge risk. Without democratization we get Blade Runner. Not that democratization has no problems, just that it is the way a lot of us are wanting it to go.
And basing your startup on a model pirated from Meta isn't risky?
I get it, I understand why people like decentralization - but the open source community doesn't even have close to the capability to train an actually open-source LLaMA equivalent.
Mozilla or Wikimedia foundation have budgets with the right number of zeroes to support an effort like this.
While those organizations might not be the right fit, I think they serve as an "existence proof" that a large scale open / nonprofit project is not completely inconceivable.
Could possibly see an industry consortium (of players too small to compete on their own) funding an open effort.
Last idea sounds crazy but hear me out: how much would Nvidia spending $100M on "open" models boost spending on graphics cards? I hope someone's running the numbers on that...
> how much would Nvidia spending $100M on "open" models boost spending on graphics cards?
Probably not as much as in a zero-sum game where everyone is trying to train their own model. Every leap in CPU inference makes this an increasingly less appealing option for them.
But I agree, some industry consortium might try to do it. I think they would first have to be relying heavily on LLMs before they were willing to do that, and it's possible by then that the lead will have gotten too large to easily surmount, especially now that everyone has stopped publishing.
You make it sound as if ChatGPT itself isn't severely limited. Vicuna and newer 13B models are quite close. And the uncensored models have a capability that ChatGPT will never have.
They just aren't quite close. Maybe you guys are asking shallow questions, but I'm asking subject matter questions and this is just not true.
It would be great if we could have detailed QA evaluation to show this, but of course then the open source people would train their models on it as a fine-tuning dataset.
The way OpenAI has set their models up for prompting and follow-up is where their real advantage is. It's very hard to take a model you can run locally and just go "Hey, write me a powershell script to convert all these files into this format. Split them up so they're no more than an hour each. Ok, change it to also denoise them."
Are you talking about the RLHF? That's where Vicuna/Alpaca and other finetuning help a lot.
There are also already tools for conversation flows (which just means you prepend the conversation history to the prompt).
I'm not saying the performance is nearly as good, but the actual workflow does already exist and is massively improving. The interesting part to me is that this finetuning can be done in a few hours on a consumer GPU (4090).
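To show how little is involved in "prepend the conversation history to the prompt", here is a minimal sketch. The role labels, the system line, and the function name are assumptions; each finetune expects its own template, so check the format your model was trained with.

```python
def build_chat_prompt(history, user_message,
                      system="A chat between a user and an assistant."):
    """Flatten prior turns plus the new message into one prompt string.

    history is a list of (role, text) tuples, oldest first. The "USER" and
    "ASSISTANT" labels are illustrative, not any specific model's template.
    """
    lines = [system]
    for role, text in history:
        lines.append(f"{role}: {text}")
    lines.append(f"USER: {user_message}")
    # End with the assistant label so the model continues from there.
    lines.append("ASSISTANT:")
    return "\n".join(lines)
```

Each follow-up ("OK, change it to also denoise them") is just another call with the previous exchange appended to `history`. In practice you also have to drop or summarize old turns once the prompt approaches the model's context limit.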
What are the newer models? I am testing them in batteries across complex tasks. So far Vicuna is the most flexible, but they all choke on reflective instructions (i.e., where the knowledge to extract is not in the model but in the user's text).
langchain agents are a good starting implementation.
you can build your own prompt and get the AI to work by itself, hallucinating tools, which may be cheaper to test out than going back and forth with an agent manager. it's not as accurate, but you can still extract useful work, e.g. https://i.imgur.com/AE4R3dR.png (gpt-35-turbo traditionally fails this task completely; the prompt gets it to work at it)
these prompts all require the model to work off data within the prompt on the first shot. the model requires a degree of introspection for that to work.
People are enthusiastic about the possibilities, imagine a black box on your desk contains only RAM and matrix multiplication chips, you install on it your favorite AI assistant and you train it with your private data/code, you remove all prudish restrictions and get productive on your work and on your off times.
Llama has the potential to reach ChatGPT; it needs tuning to get better at responding to questions. Llama, if I am not wrong, is mostly attempting to predict what comes next.
I see it as similar to Midjourney and Stable Diffusion: Midjourney can make any stupid prompt look like digital art in the Midjourney style, but look how much Stable Diffusion innovation happens. A competent person who keeps on top of all the new stuff can produce absolutely anything in any style they want.
My hunch is model weights will be commercialized as a purchased object. They can be watermarked so easy to trace any leak.
Then hardware will be a separate business. I think Apple might be caught off-guard by Nvidia on hardware. The latest NVlink and 400Gbps interconnects when combined with H100 next iterations and also rumors of advanced PCIe motherboards with high lane Nvidia CPUs and it looks to me that next year they can be selling $100-300K physical systems optimized for LLM inference that physically remind me of mainframes.
It makes no sense. The science is free and open; the companies just put their money into throwing data into the model. Once you have the big llama model filled with all of humanity's information as an open source thing, at best a company could sell you some small stuff to add on top. Maybe Disney would sell you a "license" and a LoRA to generate Disney crap; their model would probably be lower in quality than the open ones, but the license would be the important part.
It is kind of idiotic that some scientists can spend years and a lot of public money to create some technology and then billionaires milk all the profits.
I’m really hoping there are viable distributed and somewhat decentralized eventually consistent training algorithms we could all run in a P2P system. That would be super cool.
However, now that the framework has been established, I can easily see a company building a proprietary curated dataset for specific skills and then paying to spend resources on specialized reinforcement training.
Then they can commercialize that, I would think, as people would pay for an LLM that does XYZ the best. Kinda like your Disney example, but I was thinking of engineering tasks.
I did not say that they are the same. Llama has a lot of information in it; the issue seems to be the Q&A chat part. This can be added on top by the community without having to start from scratch, IMO. But I might be wrong, and OpenAI put some magic shit in, some super secret unpublished stuff; in that case, educate me. In my mind I compare it with Stable Diffusion: the proprietary ones produce more artistic effects because they put some more stuff into the prompts and they are imposing their styles. With SD you have the control, and you need to wait for some improvements, plugins, new LoRAs or embeddings with new styles, etc.
In SD I could train my face in 1- minutes, with proprietary shit it will never happen...
Imagine what a math community could train; they just need access to the model and some GUI software that can help them train.
So llama-based chat stuff is not yet comparable with ChatGPT, but a lot of progress has already been made. At this moment coding and math are bad in llama-based models, but other stuff is great, like story creation. Also, I could only test 3-b 4bit, and it is good enough to, for example, provide me a complex response in valid JSON format.
Agreed. I’ve said it elsewhere: how do we know that ClosedAI has fully published all their tricks used to build GPT3.* or later? The main “clues” people have are from the InstructGPT paper, but it should shock nobody if it turns out that paper reveals less than 10% of their techniques, which may have taken years to come up with. Competitors would need to rediscover those on their own or come up with new ideas. Simply repeating the ideas from that paper is likely not going to get them to a competitive model performance.
It's great they got LLMs running on resource constrained devices but are they any good? Or I should ask, with the limited resources they get, what good are they for?
From my experience with llama.cpp and oobabooga’s webui, I can say they are amazing, at least on my gaming PC. I’m absolutely astonished at the speed and quality of llama, alpaca, galactica and vicuna (the >10B parameter ones).
Make no mistake, it’s for tinkerers that do not expect each prompt to be answered human like.
I see them as creativity and thought testing tools, also knowledge exploratory.
In my opinion the problem with these is engineering a good prompt. I read of lots of people only getting nonsense or repetitions, and learned a bit from what they shared. These models are not chat bots.
Vicuna is more friendly in that regard.
But I’m well aware of their limitations also, and I can see how one can be underwhelmed. They are not jacks of all trades
Alpaca-LoRA, and all LoRAs, are garbage. Alpaca is horrible compared to newer finetunes. Even cleaning the Alpaca dataset and retraining a cleaned Alpaca improves its performance greatly.
But newer finetunes like Vicuna go well beyond that, including hundreds of thousands of real human conversations with GPT-4 ChatGPT in the dataset (unlike Alpaca's fully synthetic dataset).
Vicuna-13B in 16bit is easily comparable to ChatGPT-3.5 in capability. Newer finetunes coming out nearly every day are going beyond chatGPT-3.5 and getting closer and closer to GPT-4 performance.
You don't even have to install anything to validate this for yourself. There's a live web demo of Vicuna-13B right here: https://chat.lmsys.org/ (disable ad blocker if it does not load)
This isn't necessarily true with LoRAs - a 4090 can train/compute the alpaca dataset with LoRA in under 6 hours (it might be 3, I forget what it was).
So finetuning with LoRAs and a few other methods is fine on higher end consumer hardware like a 4090 and finishes in a reasonable amount of time - IMO definitely worth it if you're experimenting with this especially for the inference.
The base training though yeah I totally agree with you - train in the cloud, don't buy hardware when you need a month of 8x A100's or whatnot.
My perspective was for people who have other uses for them e.g. gaming or local inference.
From a pure finance standpoint you're definitely right - you should rent and not buy a dedicated card. I think you'd need a few thousand hours to break even which is a few months 24/7.
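The napkin math for rent versus buy is just the purchase price divided by what you save per hour by not renting. A minimal sketch; the dollar figures in the example are placeholders I made up, not quotes for any real card or cloud provider, and resale value is ignored.

```python
def break_even_hours(purchase_price, rental_per_hour, power_cost_per_hour=0.0):
    """Hours of use after which buying is cheaper than renting.

    Ignores resale value and everything else the card might be used for.
    """
    saving_per_hour = rental_per_hour - power_cost_per_hour
    if saving_per_hour <= 0:
        raise ValueError("renting is never more expensive under these inputs")
    return purchase_price / saving_per_hour

# Placeholder numbers: a $1600 card versus a $0.50/hour cloud GPU.
hours = break_even_hours(1600, 0.50)
print(f"{hours:.0f} hours, or about {hours / 24:.0f} days running 24/7")
```

With those placeholder inputs the break-even is 3200 hours, which is in line with "a few thousand hours" and "a few months 24/7". Plug in your own prices.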
It's more that I'm building a gaming PC this summer, and I can either target 1440p (4070) for $2k or 4K (4090) for $5k. If I can do a lot more with a 4090 over a 4070 it might make sense, but I know a lot of CS students use Google Colab these days, so I may just rely on that.
I'd seriously recommend the 4090 over the 4070 if you want to do finetuning/inference locally. And I highly recommend 64GB of ram.
The 24GB of VRAM is 100% worth it alone. If you want to do local ML stuff you _need_ that 24gb of VRAM.
64GB of ram + 24GB of vram lets you run a lot of the medium size models at decent speeds. I don't use Colab personally but AFAIK it should work fine for you if you don't want to do it locally.
Also worth noting is the newer ray tracing rendering that Cyberpunk is doing. You should check out the demos; IMO it looks sick. It only runs at 18fps on a 4090, so it's only playable on a 4090 + DLSS, and I'm not sure if the newer rendering tech will be super achievable on any of the other cards, if that's of interest to you.
Get a 13900 or 7950X and 64 gb of ram. You can run llama 30B and 65B, slowly but surely. Play with that before buying a gpu. If you really, really see yourself getting into this, then go ahead and get a 3090 or 4090. But otherwise get a cheaper nvidia card and wait for things to develop a little more. You can still play with ML and CUDA but you'll have cash left for when 50X0s drop, and that will probably be right around when this stuff will really be getting hot (if the current plateau doesn't hold).
Llama is basically an auto-complete right now. We're celebrating baby's first steps. It's not really worth the $600-1000 jump up from cards that can run all current games 4k60.
Can you calculate that for me on a napkin? Every calculation I make for training, but certainly for inference, makes me break even after well under a year and then it’s vastly less if I buy the hardware myself.
More VRAM => larger models. IME it is absolutely worth maxing out VRAM for the significant improvement in quality, especially with LLaMA (though even with a 4090, you won't be able to run the largest 65-billion parameter model even with 4-bit quantization).
That said, I recommend renting a cloud GPU for a few hours and trying the larger models on them before buying a GPU of your own, just to see if the models meet your requirements.
I started with ggerganov’s llama.cpp GitHub repo, and went from there. But then again, I know some programming, statistics and machine learning, so it may not be for you, I cannot judge that.
Models can be found on huggingface.co, and I’d start with eachadea/ggml-vicuna-13b-4bit, but it needs 10G of cpu-ram. It is very friendly to any prompt though.
I read along the way (on Reddit, if I recall correctly) that there are supposed to be some really good intro videos on YouTube.
Additional point of reference: I don't know shit about ML or stats, and am a very weak programmer. I just know how to install programs on Linux and use the CLI in a basic fashion. I have had no problem getting llama.cpp going by following ggerganov's readme.
Are we talking about training or inference for local LLMs here? It's hard to do any meaningful training on the edge unless we all carry a heavy gaming PC, and even then, won't the training quality be subpar?
Check out LoRA and Alpaca-LoRA and the whole huge group of people who have already figured this out. I think there was another breakthrough (yesterday?) which is a further adaptation of LoRA to touch even fewer parameters at runtime.
Even with smaller models & more optimized hardware, I think edge compute is going to be power-limited first. Batteries today just won’t support constantly running LLMs. But I joked recently that as long as they prove useful then consumers would be willing to swap their iPhone for the old car battery with a phone handle attached.
The first step is for it to be viable for smaller models on desktops. The rest will follow, as hardware catches up. Hybrid analog-hybrid NN hardware is on the horizon, maybe in 3-4 years. This would allow GPT-4 level performance on an iPhone with plausible battery life.
The current hardware of course can't pull anything like this yet. But iPhone supports on-device facial recognition, object recognition, dictation and translation, so small steps...
As I see these things come out, it feels like there's not a lot of discussion on which hardware (that isn't one of the fancy new Macs?) As in, there might be a lot of graphics cards out there that could be used here? Is it only Nvidia still, is AMD a possibility? Maybe I'm missing something on how the tech works?
30B llama needs a 3090 or 4090. 13B I think you can get away with a 3/4080. If you have 64 gigs of ram and a beefy CPU you can run even 65B, but boy it's slow.
13B is pretty meh, but 30B is great, if not quite Chatgpt. But I can ask it why my highschool geometry teacher was such a cunt and it will happily discuss the matter without reservation. Very therapeutic.
I don't think it will help. Actual friends occasionally send me mail that says "test" from a random account. And spammers do too... There is no way to separate them.
Personally I haven’t had that problem. I can readily distinguish spam from non-spam by manual review ~99% of the time. If I could train or instruct an LLM to do the same as I do now, I would be happy. My current false-negative rate with Bayesian spamassassin is more like 50%.
Don't you do it somehow? Plus a filter doesn't have to be 100% right all the time. Filtering out what is 99% certain to be or not be spam and leaving the human to cover the tiny number of messages that fall into a grey area would still save a ton of time.
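The "handle the confident cases, leave the grey area to the human" idea is just two thresholds on whatever spam score the classifier gives you. A minimal sketch, assuming the classifier (LLM or otherwise) can produce a probability; the function name and the 1%/99% cutoffs are illustrative.

```python
def triage(spam_probability, low=0.01, high=0.99):
    """Route a message based on how confident the classifier is.

    Returns "spam", "inbox", or "review" (the grey area left for a human).
    """
    if not 0.0 <= spam_probability <= 1.0:
        raise ValueError("spam_probability must be between 0 and 1")
    if spam_probability >= high:
        return "spam"
    if spam_probability <= low:
        return "inbox"
    return "review"
```

A one-word "test" mail from an unknown account would presumably score somewhere in the middle and end up in "review", which is the point: the filter does not have to decide those.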
Yes, but I rely on my own personal experiences (online and offline) to determine whether an email is real or a scam. Unless that AI filter taps into my memory, it will likely lead to too many false positives and false negatives.
I don't understand why people are so excited to build this big thing on top of Llama, which is closed source, severely license restricted and we now know for a fact that Meta is going after users with the legal hammer.
I'm sure if we'd pool resources together we could build a truly open alternative worthy of building on top of.
One simple thing these LLMs cannot do yet is to simply be pointed at a URL and start scraping, i.e. follow the hyperlinks and start consuming the content. I am not an AI guy, but I guess this has to do with the context limitations of most models? How did OpenAI train on all the internet data up to 2021? I think this will be a most popular feature for LLMs, and I seriously hope it is OSS whenever it comes out.
While many NLP-related Apple ML job listings have been added since this article was written, there were already several recent listings at the time of its writing. I feel that Apple does not focus well on intangible technologies (products that can't be readily carried or worn), given their fetish for boutique product development, but I have some hope that they can overcome this bias somewhat and see how far behind they are.
I want to agree, but it's pretty easy to find instances where Apple has dabbled but not delivered best-in-class solutions. Siri. iCloud. Home automation.
Apple has two values in conflict with each other, I think. On the one hand, they want to deliver best-in-class solutions. On the other hand, they have a commitment to user privacy[0] as perhaps only a gay man growing up in the south might value.
Siri should be better! It lost features post-acquisition by Apple, and it seems like user privacy is why.
Home automation is arguable. If you consider a single point of failure on a server somewhere to be bad, Apple's solution is pretty great. Their commitment to zigging where others zagged put them behind, since hardware vendors didn't want to put in powerful (expensive) enough chips to handle the cryptography, but while other companies go out of business, or transmit images and video to external parties, Apple's works reliably and securely.
Still, as with most of my complaints about Apple, it's a trade-off between privacy and functionality, and Apple will seemingly always choose privacy over functionality, even as Google consistently chooses functionality over privacy.
[0] Yes, there are examples of edge cases that suggest a less-than-perfect record. Contrast that with their competitors, for which invading privacy is foundational to the business model.
Increasingly, Apple seems to be blocking tracking, notably from Facebook, in order to make the most profit from that tracking itself. I've read claims that Apple made between $5B and $20B on advertising in 2022. It's far from clear that Apple's view on privacy is going to stay the same.
That they're blocking tracking is still privacy-focused. There are unsubstantiated claims that they exempt themselves from the same blocking, but reports of their advertising revenue don't contribute anything to those claims.
At Apple's scale, it's relatively easy for them to deliver $20B in ad revenue without any privacy-invading means.
People seem to have forgotten, but ads used to be based on context, so people looking at apps related to fitness might see ads related to fitness, but that wouldn't follow them around when they looked at other things. Apple still seems to be doing that; I haven't seen fitness ads on games, or game ads on fitness apps.
Apple has very little ML expertise and very few cloud resources.
They are absolutely behind in the space and anyone who works in the industry will tell you that. The only feasible way forward, IMO, would be a very big budget acquisition of one of the major LLM startups, but most of those already have big tech backers.
A doubt about something that seems to be spreading around the internet: AFAIK the LLaMa model weights weren't "leaked"; rather, researchers were explicitly given access to them. Isn't that right?
I know the article goes on to speak about something else, but I'm not sure why the claim is being made that the LLaMa model weights were leaked, as in unintentionally made available.
My understanding is that researchers could ask for access to weights, but then also they were leaked so that anyone could get them without asking. There is another layer, where Facebook seems to accept it on some level (I mean they don't have a choice anymore anyway); they put a cheeky comment in the open pull request instead of closing it.
The model weights were only shared by FB with people who applied for research access. GitHub repos containing links to the model weights have been taken down by FB.
LLaMa isn't there and probably never will be, but the possibility of running something equivalent to ChatGPT has certainly made me reconsider my GPU purchases. I wonder whether, in the end, it will be Nvidia's CUDA advantage or AMD's larger amount of memory that ends up being more important when we do get it.
It also makes it easier for the defense/blue team side to build counters or monitor and defend the attack surface of a given system against these attacks.
Digital arms races are nothing new, this is just the latest battlefield.
This is wonderful. As hardware and software continue to improve, everything seems to find a way to run on ever smaller devices. Guess your own pocket-AGI is not too far away after all.
By the way, I was thinking of something along the lines of a powerful FPGA with direct access to large quantities of very fast NAND flash, likely many chips in parallel, which would save having to load the model into RAM. It would be able to run directly from NAND flash, which opens up the possibility of using very large models?
Power consumption would not be an issue if it's used sporadically throughout the day; it's not like it needs to run continuously.
There is still the issue of NAND flash read disturb, which I haven't fully looked into yet.
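A rough back-of-envelope for the run-from-flash idea: for a dense model, generating each token reads essentially every weight once, so read bandwidth divided by model size gives an approximate upper bound on tokens per second. The numbers below (30B parameters, 4-bit weights, 8 GB/s aggregate flash read bandwidth) are assumptions for illustration, not measurements of any real device.

```python
def model_bytes(n_params: float, bits_per_weight: float) -> float:
    """Approximate size of the weights in bytes."""
    return n_params * bits_per_weight / 8


def max_tokens_per_second(
    read_bandwidth_bytes_per_s: float,
    n_params: float,
    bits_per_weight: float,
) -> float:
    """Bandwidth-bound ceiling on generation speed.

    Assumes every weight is streamed from storage once per generated token
    and ignores compute time, caching, and any other overhead.
    """
    return read_bandwidth_bytes_per_s / model_bytes(n_params, bits_per_weight)


# Assumed example: 30B parameters at 4 bits per weight is 15 GB of weights.
# With an assumed 8 GB/s of aggregate read bandwidth across parallel chips:
print(f"{max_tokens_per_second(8e9, 30e9, 4):.2f} tokens/s")  # prints 0.53 tokens/s
```

So under these assumptions the design is limited by how many flash chips can be read in parallel rather than by capacity, which is why the "many chips in parallel" part matters.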
Also worth checking out https://github.com/saharNooby/rwkv.cpp which is based on Georgi's library and offers support for the RWKV family of models which are Apache-2.0 licensed.
I’ve got some of their smaller Raven models running locally on my M1 (only 16GB of RAM).
I’m also in the middle of making it user friendly to run these models on all platforms (built with Flutter). First macOS release will be out before this weekend: https://github.com/BrutalCoding/shady.ai
I get around ~140 ms per token running a 13B parameter model on a thinkpad laptop with a 14 core Intel i7-9750 processor. Because it's CPU inference, the initial prompt processing takes longer than on GPU, so total latency is still higher than I'd like. I'm working on some caching solutions that should make this bearable for things like chat.
AlphaFold uses something they call Evoformer, which is an attention mechanism. Our group has tried, and so far failed, to utilize transformers for a very specific search problem in geometry (https://bit.ly/unit-distances).
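For what it's worth, one common way to cut prompt-processing latency in chat is to reuse the already-evaluated prefix of the conversation and evaluate only the newly appended tokens. This is a guess at the kind of caching that could help here, not the commenter's actual solution. `PrefixCache` and `tokens_to_evaluate` are hypothetical names, and a real implementation would keep the model's attention key/value state alongside the token list.

```python
from typing import List, Sequence


def common_prefix_len(a: Sequence[int], b: Sequence[int]) -> int:
    """Number of leading tokens shared by a and b."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


class PrefixCache:
    """Tracks the last evaluated prompt so shared leading tokens can be skipped."""

    def __init__(self) -> None:
        self.tokens: List[int] = []

    def tokens_to_evaluate(self, prompt: Sequence[int]) -> List[int]:
        # Everything before `keep` was already processed on the previous turn.
        keep = common_prefix_len(self.tokens, prompt)
        self.tokens = list(prompt)
        return list(prompt[keep:])


cache = PrefixCache()
turn_1 = [101, 7, 8, 9]          # made-up token ids for the first prompt
turn_2 = turn_1 + [42, 43]       # same conversation with two new tokens
print(len(cache.tokens_to_evaluate(turn_1)))  # all 4 tokens are new
print(len(cache.tokens_to_evaluate(turn_2)))  # only the 2 appended tokens
```

In a chat, each turn mostly appends to the previous prompt, so the expensive prompt-processing step shrinks to the size of the latest message.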
In order to run large language models, we should all be buying a fully loaded Mac Studio (128GB of RAM, 20 CPU cores, a lot of GPU and Neural cores) and putting Linux on it to remove the artificial restrictions.
Yes, we will be running them soon on low-end hardware, but we need to get at least to GPT-3.5-turbo level of inference speed and quality before we try to make it small.
I already started.
$ neofetch
-` x@decpti
.o+` -------
`ooo/ OS: Arch Linux ARM aarch64
`+oooo: Host: Apple Mac Studio (M1 Ultra, 2022)
`+oooooo: Kernel: 6.1.0-rc6-asahi-4-1-ARCH
-+oooooo+: Uptime: 4 hours, 23 mins
`/:-:++oooo+: Packages: 177 (pacman)
`/++++/+++++++: Shell: bash 5.1.16
`/++++++++++++++: Resolution: 1920x1080
`/+++ooooooooooooo/` Terminal: /dev/pts/0
./ooosssso++osssssso+` CPU: (20) @ 2.064GHz
.oossssso-````/ossssss+` Memory: 717MiB / 129540MiB
-osssssso. :ssssssso.
:osssssss/ osssso+++.
/ossssssss/ +ssssooo/-
Which LLM can run on Apple's neural cores / GPU cores? I can only run on plain ol' CPU cores (llama), and it runs fine on my Ryzen CPU for less than half the price of that system. That being said, I'm switching from Ubuntu to Arch 'cause I'm sick of all my packages being way out of date!
Is this a joke or an ad? Paying Apple hardware premiums (for hardware that isn't well supported by PyTorch, et al.) just to load Linux onto it and run AI models in a closet somewhere, all seems like an incredibly wasteful way to go about it.