(answer for 1 inference)
Al depends on the context length you want to support as the activation memory will dominate the requirements. For 4096 tokens you will get away with 24GB (or even 16GB), but if you want to go for the full 131072 tokens you are not going to get there with a 32GB consumer GPU like the 5090. You'll need to spring for at the minimum an A6000 (48GB) or preferably an RTX 6000 Pro (96GB).
Also keep in mind this model does use 4-bit layers for the MoE parts. Unfortunately native accelerated 4-bit support only started with Blackwell on NVIDIA. So your 3090/4090/A6000/A100's are not going to be fast. An RTX 5090 will be your best starting point in the traditional card space. Maybe the unified memory minipc's like the Spark systems or the Mac mini could be an alternative, but I do not know them enough.
Also keep in mind this model does use 4-bit layers for the MoE parts. Unfortunately native accelerated 4-bit support only started with Blackwell on NVIDIA. So your 3090/4090/A6000/A100's are not going to be fast. An RTX 5090 will be your best starting point in the traditional card space. Maybe the unified memory minipc's like the Spark systems or the Mac mini could be an alternative, but I do not know them enough.