Hacker News

If you just check out https://github.com/ggerganov/llama.cpp and run make, you’ll wind up with an executable called ‘main’ that lets you run any gguf language model you choose. Then:

./main -m ./models/30B/llama-30b.Q4_K_M.gguf --prompt "say hello"

On my M2 MacBook, the first run takes a few seconds before it produces anything, but after that subsequent runs start outputting tokens immediately.

You can run LLMs right inside a short-lived process.
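That one-shot, short-lived pattern can be sketched as a plain subprocess call. This is only an illustration: the binary and model paths are the ones from the command above, and the helper name `complete` is my own invention, not anything llama.cpp ships.

```python
import subprocess

def complete(prompt: str,
             binary: str = "./main",  # llama.cpp executable (assumed location)
             model: str = "./models/30B/llama-30b.Q4_K_M.gguf"  # assumed model path
             ) -> str:
    """Spawn one short-lived llama.cpp process for a single completion
    and return whatever it wrote to stdout."""
    result = subprocess.run(
        [binary, "-m", model, "--prompt", prompt],
        capture_output=True, text=True, check=True,
    )
    return result.stdout
```

The process starts, loads the model, emits tokens, and exits; there is no server anywhere in the loop.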

But most people don’t want to access LLM completions through a single execution of a command line. They want a program they can interact with, and that usually means starting a long-lived process with UI state and leaving it running; that same process can also host a longer-lived LLM context.
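A minimal sketch of that long-lived pattern: keep one child process open and feed it input over a pipe, paying startup cost once. The class name and argv are assumptions for illustration; llama.cpp does have an interactive mode, but the exact flags would depend on your build.

```python
import subprocess

class LongLivedCompleter:
    """Hold one child process open across many prompts, so process
    startup (and, for an LLM, model load) is paid only once."""

    def __init__(self, argv):
        # argv might be something like ["./main", "-m", model_path, "-i"]
        # for llama.cpp interactive mode; any line-oriented command works.
        self.proc = subprocess.Popen(
            argv, stdin=subprocess.PIPE, stdout=subprocess.PIPE, text=True
        )

    def send(self, line: str) -> str:
        # Write one line of input, read one line of output.
        self.proc.stdin.write(line + "\n")
        self.proc.stdin.flush()
        return self.proc.stdout.readline()

    def close(self):
        self.proc.stdin.close()
        self.proc.wait()
```

Note this is still just a long-lived local process owned by the UI, not a network server.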

Neither use case particularly seems to need a server to function. My curiosity about why people are packaging these things up like that is completely genuine.

Last run of llama.cpp main off my command line:

   llama_print_timings:        load time =     871.43 ms
   llama_print_timings:      sample time =      20.39 ms /   259 runs   (    0.08 ms per token, 12702.31 tokens per second)
   llama_print_timings: prompt eval time =     397.77 ms /     3 tokens (  132.59 ms per token,     7.54 tokens per second)
   llama_print_timings:        eval time =   20079.05 ms /   258 runs   (   77.83 ms per token,    12.85 tokens per second)
   llama_print_timings:       total time =   20534.77 ms /   261 tokens
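The per-token numbers in that log follow directly from the raw timings; for example, the eval throughput is just the number of runs divided by elapsed seconds:

```python
# Recompute the eval-time figures from the printed timings.
eval_ms, eval_runs = 20079.05, 258

ms_per_token = eval_ms / eval_runs              # matches the ~77.83 ms/token line
tokens_per_sec = eval_runs / (eval_ms / 1000)   # matches the ~12.85 tokens/s line
```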

