If you just check out https://github.com/ggerganov/llama.cpp and run make, you'll wind up with an executable called 'main' that lets you run any gguf language model you choose. Then:

./main -m ./models/30B/llama-30b.Q4_K_M.gguf --prompt "say hello"

On my M2 MacBook, the first run takes a few seconds before it produces anything, but subsequent runs start outputting tokens immediately.

You can run LLM models right inside a short-lived process.

But the majority of humans don't want to use a single execution of a command-line tool to access LLM completions. They want to run a program that lets them interact with an LLM. To do that, they will likely start, and leave running, a long-lived process with UI state, which can also serve as a host for a longer-lived LLM context.

Neither use case particularly seems to need a server to function. My curiosity about why people are packaging these things up like that is completely genuine.

Last run of llama.cpp main off my command line:

llama_print_timings:        load time =   871.43 ms
llama_print_timings:      sample time =    20.39 ms /   259 runs   (    0.08 ms per token, 12702.31 tokens per second)
llama_print_timings: prompt eval time =   397.77 ms /     3 tokens (  132.59 ms per token,     7.54 tokens per second)
llama_print_timings:        eval time = 20079.05 ms /   258 runs   (   77.83 ms per token,    12.85 tokens per second)
llama_print_timings:       total time = 20534.77 ms /   261 tokens
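To make the short-lived-process point concrete, here is a minimal sketch of how a program could shell out to the `main` binary for a one-shot completion, no server involved. This is illustrative, not a definitive integration: the binary and model paths are taken from the command above and assumed to exist on your machine.

```python
import subprocess

# Illustrative paths: adjust to wherever you built llama.cpp
# and downloaded a .gguf model.
MAIN_BIN = "./main"
MODEL = "./models/30B/llama-30b.Q4_K_M.gguf"

def build_cmd(prompt, n_predict=256):
    """Assemble the argv for a one-shot llama.cpp invocation.

    -m selects the model file, -n caps the number of tokens
    to generate, --prompt supplies the prompt text.
    """
    return [MAIN_BIN, "-m", MODEL, "-n", str(n_predict), "--prompt", prompt]

def complete(prompt):
    """Spawn a short-lived process, wait for it, and return its stdout."""
    result = subprocess.run(
        build_cmd(prompt), capture_output=True, text=True, check=True
    )
    return result.stdout
```

A long-lived UI program could call complete() on demand and keep its own conversation state, at the cost of paying the model-load time on every call; keeping the process (and its context) alive is exactly the trade-off the paragraph above describes.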