Just an FYI, this is writeup from August 2023 and a lot has changed (for the bet...

Just an FYI, this is writeup from August 2023 and a lot has changed (for the better!) for RDNA3 AI/ML support.

That being said, I did some very recent inference testing on an W7900 (using the same testing methodology used by Embedded LLM's recent post to compare to vLLM's recently added Radeon GGUF support [1]) and MLC continues to perform quite well. On Llama 3.1 8B, MLC's q4f16_1 (4.21MB weights) performed +35% faster than llama.cpp w/ Q4_K_M w/ their ROCm/HIP backend (4.30MB weights, 2% size difference).

That makes MLC still the generally fastest standalone inference engine for RDNA3 by a country mile. However, you have much less flexibility with quants and by and large have to compile your own for every model, so llama.cpp is probably still more flexible for general use. Also llama.cpp's (recently added to llama-server) speculative decoding can also give some pretty sizable performance gains. Using a 70B Q4_K_M + 1B Q8_0 draft model improves output token throughput by 59% on the same ShareGPT testing. I've also been running tests with Qwen2.5-Coder and using a 0.5-3B draft model for speculative decoding gives even bigger gains on average (depends highly on acceptance rate).

Note, I think for local use, vLLM GGUF is still not suitable at all. When testing w/ a 70B Q4_K_M model (only 40GB), loading, engine warmup, and graph compilation took on avg 40 minutes. llama.cpp takes 7-8s to load the same model.

At this point for RDNA3, basically everything I need works/runs for my use cases (primarily LLM development and local inferencing), but almost always slower than an RTX 3090/A6000 Ampere (a new 24GB 7900 XTX is $850 atm, used or refurbished 24 GB RTX 3090s are in in the same ballpark, about $800 atm; a new 48GB W7900 goes for $3600 while an 48GB A6000 (Ampere) goes for $4600). The efficiency gains can be sizable. Eg, on my standard llama-bench test w/ llama2-7b-q4_0, the RTX 3090 gets a tg128 of 168 t/s while the 7900 XTX only gets 118 t/s even though both have similar memory bandwidth (936.2 GB/s vs 960 GB/s). It's also worth noting that since the beginning of the year, the llama.cpp CUDA implementation has gotten almost 25% faster, while the ROCm version's performance has stayed static.

There is an actively (solo dev) maintained fork of llama.cpp that sticks close to HEAD but basically applies a rocWMMA patch that can improve performance if you use the llama.cpp FA (still performs worse than w/ FA disabled) and in certain long-context inference generations (on llama-bench and w/ this ShareGPT serving test you won't see much difference) here: https://github.com/hjc4869/llama.cpp - The fact that no one from AMD has shown any interest in helping improve llama.cpp performance (despite often citing llama.cpp-based apps in marketing/blog posts, etc is disappointing ... but sadly on brand for AMD GPUs).

Anyway, for those interested in more information and testing for AI/ML setup for RDNA3 (and AMD ROCm in general), I keep a doc with lots of details here: https://llm-tracker.info/howto/AMD-GPUs

[1] https://embeddedllm.com/blog/vllm-now-supports-running-gguf-...