I am not sure that would work well. 8 cores is really small, it cannot scale well with respect to the attention-head algorithm, and more importantly, this particular EPYC part can only achieve ~50% of the theoretical memory bandwidth, i.e. ~240 GB/s. Other EPYC parts, such as the one OP is using, run closer to ~70%, at ~400 GB/s.
I think the point of the two-socket solution is the doubled memory bandwidth. Are you proposing using just a single one of the same CPU, or am I missing something?
llama.cpp’s token generation speed does not scale across multiple CPU sockets, just as it does not scale across multiple GPUs. Matthew Carrigan wrote:
> Also, an important tip: Go into the BIOS and set the number of NUMA groups to 0. This will ensure that every layer of the model is interleaved across all RAM chips, doubling our throughput. Don't forget!
This does not actually make sense. It is well known that there is a penalty for accessing memory attached to a different CPU. You don’t get more bandwidth by disabling the NUMA node information, and his token generation performance reflects that: if there were a doubling effect from using two CPU sockets, he would be getting twice the performance, and he is not.
Additionally, llama.cpp’s NUMA support is suboptimal, so he is likely taking a performance hit on top of that.
When llama.cpp fixes its NUMA support, using two sockets should be no worse than using one, but it will not become better unless some new way of doing the calculations is devised that actually benefits from NUMA. That might be possible (particularly if you can get GEMV to run faster using NUMA), but it is not how things are implemented right now.
Also, how much would stuffing in a GPU or three (3090/4090) improve speeds, even with heavy CPU layer offloading, or would the penalty be too big? I know in some cases you're swapping data into the GPU, but in others you're just doing parts of the computation on the CPU. I'm curious what the speed comparison would be.
I would suspect the Infinity Fabric links are already saturated by the local RAM’s memory bandwidth, such that you would not get more by accessing another socket’s RAM.
Chips and Cheese suggests things are even worse than this, as the per-CCD bandwidth is limited to around 120 GB/s. That probably ruins the idea of using the 9015, as it only has 2 CCDs, capping it at roughly 240 GB/s regardless of what the memory controllers can do.
Anyway, leveraging both sockets’ memory bandwidth would require splitting the layers into partitions, one per NUMA node, and doing each partition’s share of every GEMV calculation on that node’s local CPU cores. PBLAS might be useful for implementing something like that.
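For illustration, a minimal sketch of that partitioning scheme in Python (purely conceptual: real code would allocate each slice in node-local memory and pin each worker to its node with numactl or libnuma, which plain numpy/multiprocessing will not do for you):

    import numpy as np
    from concurrent.futures import ProcessPoolExecutor

    N_NODES = 2  # one partition per NUMA node, i.e. per CPU socket here

    def local_gemv(args):
        # In a real implementation this slice would live in the node's
        # local memory and run only on that node's cores.
        A_slice, x = args
        return A_slice @ x

    def numa_gemv(A, x):
        # Row-partition the weight matrix, one slice per NUMA node,
        # compute each node's share of y = A @ x, then stitch together.
        slices = np.array_split(A, N_NODES, axis=0)
        with ProcessPoolExecutor(max_workers=N_NODES) as pool:
            parts = list(pool.map(local_gemv, [(s, x) for s in slices]))
        return np.concatenate(parts)

    if __name__ == "__main__":
        A = np.random.rand(1024, 512).astype(np.float32)
        x = np.random.rand(512).astype(np.float32)
        assert np.allclose(numa_gemv(A, x), A @ x, atol=1e-3)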
As for a speedup from using 3090/4090 cards, that is a bit involved to estimate. DeepSeek R1 has 61 layers (although I think llama.cpp will report 62, since it counts the embedding layer for Llama 3 and presumably does the same for DeepSeek). The way llama.cpp works is that it offloads whole layers, and the computation moves from device to device depending on where each layer resides in memory. So you would calculate roughly how long each device takes to process one layer, multiply by the number of layers assigned to that device, sum across the devices, and take the reciprocal to get tokens per second. The model has 37 GB of activated weights, so the time per layer on a given device is (37 GB / 61) divided by that device’s memory bandwidth. You probably want to multiply by a fudge factor of 1.25 to account for the fact that these things never run at the full speed such calculations predict. Plug those numbers into the procedure above and you have your answer.
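As a sketch, here is that calculation in Python. The 37 GB, 61 layers, 422.4 GB/s, and 1.25 fudge factor come from this thread; the 3090’s ~936 GB/s bandwidth and the 3-layer split are my own placeholder assumptions (a 24 GB card can only hold a few of R1’s layers):

    # Napkin-math estimate for a hybrid CPU + GPU layer split.
    ACTIVE_WEIGHTS_GB = 37.0  # activated weights per token (from above)
    N_LAYERS = 61
    FUDGE = 1.25              # nothing runs at full theoretical speed

    gb_per_layer = ACTIVE_WEIGHTS_GB / N_LAYERS

    # (memory bandwidth in GB/s, layers resident on that device)
    devices = [
        (936.0, 3),   # one RTX 3090 holding a few layers (assumed split)
        (422.4, 58),  # the 12-channel DDR5-4400 CPU from this build
    ]
    assert sum(n for _, n in devices) == N_LAYERS

    seconds_per_token = sum(FUDGE * n * gb_per_layer / bw
                            for bw, n in devices)
    print(f"~{1 / seconds_per_token:.1f} tokens/sec")

With these numbers the hybrid split lands around 9.4 tokens/sec versus about 9.1 for CPU-only (both with the fudge factor applied), so a single card buys little, which is what you would expect when token generation is memory-bandwidth bound.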
https://www.newegg.com/p/N82E16819113866
https://www.newegg.com/supermicro-h13ssl-nt-amd-epyc-9004-se...
As for memory, these two kits should work (both are needed for the full 12 DIMMs):
https://www.newegg.com/owc-256gb/p/1X5-005D-001G0
https://www.newegg.com/owc-512gb/p/1X5-005D-001G4
Since it would be a 2DPC configuration, the memory would be limited to 4400 MT/s unless you overclock it. That works out to 12 channels × 8 bytes × 4400 MT/s = 422.4 GB/s, which should be enough to run the full model at about 11 tokens per second by a simple napkin-math calculation (422.4 GB/s divided by the 37 GB of activated weights). In practice, it might not run that fast. If the memory is overclocked, getting to 16 tokens per second might be possible (again, per napkin math).
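Spelled out (the 6000 MT/s overclock is a hypothetical figure, chosen to show where roughly 16 tokens per second would come from):

    # Napkin math for the 2DPC memory configuration.
    CHANNELS = 12            # DDR5 channels on EPYC 9004
    BYTES_PER_XFER = 8       # 64 data bits per channel
    ACTIVE_WEIGHTS_GB = 37.0

    for mt_s in (4400, 6000):  # 2DPC stock limit vs. assumed overclock
        gb_per_s = mt_s * 1e6 * BYTES_PER_XFER * CHANNELS / 1e9
        print(f"{mt_s} MT/s -> {gb_per_s:.1f} GB/s -> "
              f"~{gb_per_s / ACTIVE_WEIGHTS_GB:.1f} tokens/sec")
    # 4400 MT/s -> 422.4 GB/s -> ~11.4 tokens/sec
    # 6000 MT/s -> 576.0 GB/s -> ~15.6 tokens/sec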
The subtotal for the linked parts alone is $5,139.98. It should stay below $6,000 even after adding the other parts needed, although tax might push it past that.
Note that I have not actually built this to know how it works in practice. My description here is purely hypothetical.