
> I don’t like running background services locally if I don’t need to. Why do these implementations all seem to operate that way?

Because it's now a simple REST-like query to interact with that server.

The default model of running the binary and capturing its output would mean you reload everything each time. Of course, you can write a master process that actually performs the queries and have a separate executable for querying that master process... wait, you just invented a server.
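
With something like Ollama serving locally, any tool can get a completion with one HTTP call instead of paying the model load every time - roughly this, assuming the default port and the /api/generate endpoint:

    curl http://localhost:11434/api/generate -d '{
      "model": "llama2:7b",
      "prompt": "say hello",
      "stream": false
    }'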



I’m not sure what this ‘default model of running a binary and capturing its output’ is that you’re talking about.

Aren’t people mostly running browser frontends in front of these to provide a persistent UI - a chat interface or an image workspace or something?

Sure, if you're running a lot of little command-line tools that need access to an LLM, a server makes sense. But what I don't understand is why that isn't a niche way of distributing these things - instead it seems to be the default.


> I’m not sure what this ‘default model of running a binary and capturing its output’ is that you’re talking about.

Have you ever used a computer?

    PS C:\Users\Administrator\AppData\Local\Programs\Ollama> ./ollama.exe run llama2:7b "say hello" --verbose
    Hello! How can I help you today?

    total duration:       35.9150092s
    load duration:        1.7888ms
    prompt eval duration: 1.941793s
    prompt eval rate:     0.00 tokens/s
    eval count:           10 token(s)
    eval duration:        16.988289s
    eval rate:            0.59 tokens/s

But I feel like you are here just to troll, without any real point.


If you just check out https://github.com/ggerganov/llama.cpp and run make, you’ll wind up with an executable called ‘main’ that lets you run any gguf language model you choose. Then:

    ./main -m ./models/30B/llama-30b.Q4_K_M.gguf --prompt "say hello"

On my M2 MacBook, the first run takes a few seconds before it produces anything, but subsequent runs start outputting tokens almost immediately - presumably because the model file is still sitting in the OS page cache.

You can run LLM models right inside a short-lived process.

But the majority of humans don't want to access LLM completions through a single execution of a command-line tool. They want to run a program that lets them interact with an LLM. And to do that they will likely start, and leave running, a long-lived process with UI state - one that can also serve as a host for a longer-lived LLM context.
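
llama.cpp's interactive mode is roughly that shape already, as far as I can tell - one foreground process holding the weights and the conversation context for as long as you leave it open, with no server in sight (flag name from memory):

    ./main -m ./models/30B/llama-30b.Q4_K_M.gguf --interactive-first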

Neither use case particularly seems to need a server to function. My curiosity about why people are packaging these things up like that is completely genuine.

Last run of llama.cpp main off my command line:

   llama_print_timings:        load time =     871.43 ms
   llama_print_timings:      sample time =      20.39 ms /   259 runs   (    0.08 ms per token, 12702.31 tokens per second)
   llama_print_timings: prompt eval time =     397.77 ms /     3 tokens (  132.59 ms per token,     7.54 tokens per second)
   llama_print_timings:        eval time =   20079.05 ms /   258 runs   (   77.83 ms per token,    12.85 tokens per second)
   llama_print_timings:       total time =   20534.77 ms /   261 tokens



