> I don’t like running background services locally if I don’t need to. Why do these implementations all seem to operate that way?
Because it's now a simple REST-like query to interact with that server.
The default model of running the binary and capturing its output would mean reloading everything on each run. Of course, you could write a master process that actually performs the queries, plus a separate executable for querying that master process... wait, you just invented a server.
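That "master process plus query tool" shape takes only a few lines to sketch. To be clear, this is a toy illustration and not llama.cpp's API: `ResidentModel`, `serve`, and `query` are invented names, and `generate` just echoes instead of running inference. The only point is that the expensive load happens once, in the long-lived process:

```python
import socket
import socketserver
import threading

class ResidentModel:
    """Stand-in for an expensively loaded model (hypothetical; no real inference)."""
    def generate(self, prompt: str) -> str:
        return f"echo: {prompt}"  # a real master process would run the model here

class Handler(socketserver.StreamRequestHandler):
    def handle(self):
        prompt = self.rfile.readline().decode().strip()
        # The model stays resident on self.server across requests: no reload per query.
        self.wfile.write((self.server.model.generate(prompt) + "\n").encode())

def serve() -> socketserver.TCPServer:
    srv = socketserver.TCPServer(("127.0.0.1", 0), Handler)
    srv.model = ResidentModel()  # loaded once, when the master process starts
    threading.Thread(target=srv.serve_forever, daemon=True).start()
    return srv

def query(port: int, prompt: str) -> str:
    # The "separate executable for querying": connect, send a line, read a line back.
    with socket.create_connection(("127.0.0.1", port)) as s:
        s.sendall((prompt + "\n").encode())
        return s.makefile().readline().strip()
```

Squint at query() and it is already an RPC client; pick a wire format and you have the REST-ish server everyone ships.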
I’m not sure what this ‘default model of running a binary and capturing its output’ is that you’re talking about.
Aren’t people mostly running browser frontends in front of these to provide a persistent UI - a chat interface or an image workspace or something?
Sure, if you’re running a lot of little command-line tools that need access to an LLM, a server makes sense. What I don’t understand is why that isn’t a niche way of distributing these things; instead it seems to be the default.
If you just check out https://github.com/ggerganov/llama.cpp and run make, you’ll wind up with an executable called ‘main’ that lets you run any gguf language model you choose. Then:
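For example (the model path below is a placeholder; substitute whatever gguf file you actually downloaded):

```shell
# One-shot completion straight from the command line.
# models/7B/model.gguf is a placeholder path, not a file the repo ships.
./main -m models/7B/model.gguf -p "Why is the sky blue?" -n 128
```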
On my M2 MacBook, the first run takes a few seconds before it produces anything, but after that subsequent runs start outputting tokens immediately.
You can run an LLM right inside a short-lived process.
But the majority of humans don’t want a single execution of a command-line tool to access LLM completions. They want to run a program that lets them interact with an LLM. And to do that they will likely start, and leave running, a long-lived process with UI state, which can also serve as a host for a longer-lived LLM context.
Neither use case particularly seems to need a server to function. My curiosity about why people are packaging these things up like that is completely genuine.
Last run of llama.cpp main off my command line:
llama_print_timings: load time = 871.43 ms
llama_print_timings: sample time = 20.39 ms / 259 runs ( 0.08 ms per token, 12702.31 tokens per second)
llama_print_timings: prompt eval time = 397.77 ms / 3 tokens ( 132.59 ms per token, 7.54 tokens per second)
llama_print_timings: eval time = 20079.05 ms / 258 runs ( 77.83 ms per token, 12.85 tokens per second)
llama_print_timings: total time = 20534.77 ms / 261 tokens