I am not sure that would work well. 8 cores is really small, it cannot scale well with respect to the attention-head algorithm, and more importantly, this particular EPYC part can only achieve ~50% of the theoretical memory bandwidth, i.e. ~240 GB/s. Other EPYC parts, such as the one OP is using, run closer to ~70%, at ~400 GB/s.
I think the point of the two-socket solution is the doubled memory bandwidth. Are you proposing using just a single one of the same CPU, or am I missing something?
llama.cpp’s token generation speed does not scale across multiple CPU sockets, just as it does not scale across multiple GPUs. Matthew Carrigan wrote:
> Also, an important tip: Go into the BIOS and set the number of NUMA groups to 0. This will ensure that every layer of the model is interleaved across all RAM chips, doubling our throughput. Don't forget!
This does not actually make sense. It is well known that there is a penalty for accessing memory attached to a different CPU. You don’t get more bandwidth by disabling the NUMA node information, and his token generation performance reflects that: if there were a doubling effect from using two CPU sockets, he would be getting twice the performance, and he is not.
Additionally, llama.cpp’s NUMA support is suboptimal, so he is likely taking a performance hit on top of that.
When llama.cpp fixes its NUMA support, using two sockets should be no worse than using one, but it will not become better unless some new way of doing the calculations is devised that actually benefits from NUMA. That might be possible (particularly if you can get GEMV to run faster using NUMA), but it is not how things are implemented right now.
Also, how much would stuffing in a GPU or three (3090/4090) improve speeds, even with heavy CPU layer offloading, or would the penalty be too big? I know in some cases you're swapping data into the GPU, but in others you're just doing parts of the computation on the CPU. I'm curious what the speed comparison would be.
I would suspect the Infinity Fabric links are already saturated by the local RAM’s memory bandwidth, such that you would not get more by accessing another socket’s RAM.
Chips and Cheese suggests things are even worse than this, as the per-CCD bandwidth is limited to around 120 GB/s. That probably ruins the idea of using the 9015, as it only has 2 CCDs, capping it at roughly 240 GB/s regardless of what the memory controllers can do.
Anyway, leveraging both sockets’ memory bandwidth would require splitting the layers into partitions, one per NUMA node, and doing each partition’s share of every GEMV calculation on that node’s local CPU cores. PBLAS might be useful for implementing something like that.
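For illustration, a minimal sketch of that partitioning scheme in Python (purely conceptual: real code would allocate each slice in node-local memory and pin each worker to its node with numactl or libnuma, which plain numpy/multiprocessing will not do for you):

    import numpy as np
    from concurrent.futures import ProcessPoolExecutor

    N_NODES = 2  # one partition per NUMA node, i.e. per CPU socket here

    def local_gemv(args):
        # In a real implementation this slice would live in the node's
        # local memory and run only on that node's cores.
        A_slice, x = args
        return A_slice @ x

    def numa_gemv(A, x):
        # Row-partition the weight matrix, one slice per NUMA node,
        # compute each node's share of y = A @ x, then stitch together.
        slices = np.array_split(A, N_NODES, axis=0)
        with ProcessPoolExecutor(max_workers=N_NODES) as pool:
            parts = list(pool.map(local_gemv, [(s, x) for s in slices]))
        return np.concatenate(parts)

    if __name__ == "__main__":
        A = np.random.rand(1024, 512).astype(np.float32)
        x = np.random.rand(512).astype(np.float32)
        assert np.allclose(numa_gemv(A, x), A @ x, atol=1e-3)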
As for a speedup from using 3090/4090 cards, that is a bit involved to estimate. DeepSeek R1 has 61 layers (although I think llama.cpp will report 62, since it counts the embedding layer for Llama 3 and presumably does the same for DeepSeek). The way llama.cpp works is that it offloads whole layers, and the computation moves from device to device depending on where each layer resides in memory. So you would calculate roughly how long each device takes to process one layer, multiply by the number of layers assigned to that device, sum across the devices, and take the reciprocal to get tokens per second. The model has 37 GB of activated weights, so the time per layer on a given device is (37 GB / 61) divided by that device’s memory bandwidth. You probably want to multiply by a fudge factor of 1.25 to account for the fact that these things never run at the full speed such calculations predict. Plug those numbers into the procedure above and you have your answer.
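As a sketch, here is that calculation in Python. The 37 GB, 61 layers, 422.4 GB/s, and 1.25 fudge factor come from this thread; the 3090’s ~936 GB/s bandwidth and the 3-layer split are my own placeholder assumptions (a 24 GB card can only hold a few of R1’s layers):

    # Napkin-math estimate for a hybrid CPU + GPU layer split.
    ACTIVE_WEIGHTS_GB = 37.0  # activated weights per token (from above)
    N_LAYERS = 61
    FUDGE = 1.25              # nothing runs at full theoretical speed

    gb_per_layer = ACTIVE_WEIGHTS_GB / N_LAYERS

    # (memory bandwidth in GB/s, layers resident on that device)
    devices = [
        (936.0, 3),   # one RTX 3090 holding a few layers (assumed split)
        (422.4, 58),  # the 12-channel DDR5-4400 CPU from this build
    ]
    assert sum(n for _, n in devices) == N_LAYERS

    seconds_per_token = sum(FUDGE * n * gb_per_layer / bw
                            for bw, n in devices)
    print(f"~{1 / seconds_per_token:.1f} tokens/sec")

With these numbers the hybrid split lands around 9.4 tokens/sec versus about 9.1 for CPU-only (both with the fudge factor applied), so a single card buys little, which is what you would expect when token generation is memory-bandwidth bound.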
https://www.newegg.com/p/N82E16819113866
https://www.newegg.com/supermicro-h13ssl-nt-amd-epyc-9004-se...
As for memory, these two kits should work (both are needed for the full 12 DIMMs):
https://www.newegg.com/owc-256gb/p/1X5-005D-001G0
https://www.newegg.com/owc-512gb/p/1X5-005D-001G4
Since it would be a 2DPC configuration, the memory would be limited to 4400 MT/s unless you overclock it. That works out to 12 channels × 8 bytes × 4400 MT/s = 422.4 GB/s, which should be enough to run the full model at about 11 tokens per second by a simple napkin-math calculation (422.4 GB/s divided by the 37 GB of activated weights). In practice, it might not run that fast. If the memory is overclocked, getting to 16 tokens per second might be possible (again, per napkin math).
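Spelled out (the 6000 MT/s overclock is a hypothetical figure, chosen to show where roughly 16 tokens per second would come from):

    # Napkin math for the 2DPC memory configuration.
    CHANNELS = 12            # DDR5 channels on EPYC 9004
    BYTES_PER_XFER = 8       # 64 data bits per channel
    ACTIVE_WEIGHTS_GB = 37.0

    for mt_s in (4400, 6000):  # 2DPC stock limit vs. assumed overclock
        gb_per_s = mt_s * 1e6 * BYTES_PER_XFER * CHANNELS / 1e9
        print(f"{mt_s} MT/s -> {gb_per_s:.1f} GB/s -> "
              f"~{gb_per_s / ACTIVE_WEIGHTS_GB:.1f} tokens/sec")
    # 4400 MT/s -> 422.4 GB/s -> ~11.4 tokens/sec
    # 6000 MT/s -> 576.0 GB/s -> ~15.6 tokens/sec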
The subtotal for the linked parts alone is $5,139.98. It should stay below $6,000 even after adding the other parts needed, although tax might push it past that.
Note that I have not actually built this to know how it works in practice. My description here is purely hypothetical.