Hacker News
[dupe] DeepSeek-V4: Towards Highly Efficient Million-Token Context Intelligence (huggingface.co)
159 points by cmrdporcupine 1 day ago | 19 comments



From this thread [0] I take it that, while the model is 1.6T parameters total, only 49B are active (A49B), so it could (theoretically, maybe very slowly) run locally on consumer hardware. Or is that wrong?

[0] https://news.ycombinator.com/item?id=47864835


A 5090 has 32GB. Even if somehow a 1-bit quantization were possible and you needed no VRAM for anything else (forget KV cache etc.), that would only fit a 256B model at 1 bit per weight. Just to picture, in the extreme, how unlikely this is.
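Quick back-of-the-envelope in Python (weights only, treating a GB as 1e9 bytes):

    def params_that_fit(vram_gb, bits_per_weight):
        # weights only: ignores KV cache, activations and runtime overhead
        return vram_gb * 1e9 * 8 / bits_per_weight

    for bits in (16, 8, 4, 1):
        print(f"{bits}-bit: {params_that_fit(32, bits) / 1e9:.0f}B params in 32 GB")
    # -> 16-bit: 16B, 8-bit: 32B, 4-bit: 64B, 1-bit: 256B

So even the absurd 1-bit case tops out at 256B, nowhere near 1.6T.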

And the active parameters come from the experts: for each token the model picks a few experts to run the forward pass (usually 2 to 4, though I haven't read V4's paper), and it's not always the same experts.
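A toy sketch of top-k routing (not DeepSeek's actual router; the sizes here are made up) just to show what "picking experts" means:

    import numpy as np

    rng = np.random.default_rng(0)
    n_experts, top_k, d_model = 64, 2, 16
    experts = [rng.normal(size=(d_model, d_model)) * 0.02 for _ in range(n_experts)]
    router = rng.normal(size=(d_model, n_experts)) * 0.02

    def moe_forward(x):                          # x: hidden state for one token
        logits = x @ router                      # one score per expert
        chosen = np.argsort(logits)[-top_k:]     # keep only the top-k experts
        weights = np.exp(logits[chosen] - logits[chosen].max())
        weights /= weights.sum()                 # softmax over the chosen experts
        # only these k experts' weights are touched for this token,
        # which is why the active params are so much smaller than the total
        return sum(w * (x @ experts[i]) for w, i in zip(weights, chosen))

    out = moe_forward(rng.normal(size=d_model))

Different tokens pick different experts, so over a long run you still end up touching most of the 1.6T.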

OTOH, this being DeepSeek, I foresee a bunch of distilled FP8 V4 models that fit on a 5090 with tiny batches, at maybe 75 to 85% of V4's performance. And that might be good enough for many everyday tasks.

Today is a good day for open models. Thank god for DeepSeek.


Theoretically, with weight streaming, any model that fits on disk can run on consumer hardware, just terribly slowly.

It will be Seconds Per Token instead of Tokens Per Second.

Maybe 2 s/t on two NVMes.
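Rough math behind that, assuming you only stream the ~49B active parameters per token (the quant and drive numbers below are my own guesses):

    active_params = 49e9        # A49B: active parameters per token
    bytes_per_param = 0.5       # assume a 4-bit quant
    nvme_bw = 2 * 7e9           # two PCIe 4.0 NVMe drives at ~7 GB/s each
    print(active_params * bytes_per_param / nvme_bw, "s/token")   # ~1.75 s/token

In practice the chosen experts differ per layer and per token, so you can't prefetch perfectly and it will be slower than this.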

The issue is the KV cache.


No, because you don't know which 49B are needed until the moment they're needed.

The flash version is smaller, I think around 200B parameters, and is cheap to run.

Hmm. Looks like DeepSeek is just about 2 months behind the leaders now.

If that is really so, it would now be good enough to replace Claude for us. We use Sonnet only; with our setup, use cases, and tooling it has worked as well as Opus 4.6 and 4.7 so far. We won't replace Sonnet as long as they offer subscriptions, but it is good to have alternatives for when they eventually force pay-per-use.

Yep, it should be better and more efficient than Sonnet.


I used the flash version on a tricky Common Lisp coding problem this morning. The first cut of the new library had a runtime error. I was running in a simple REPL using:

    ollama run deepseek-v4-flash:cloud

so I had to feed the generated code and the error back into the REPL manually, but it nailed it the second time, and the Common Lisp code was very good.


The quality of this model for the price is an insane deal.

Models like DeepSeek are the only reason we are able to categorize and measure the quality of thousands of MCP servers (https://glama.ai/blog/2026-04-03-tool-definition-quality-sco...). That's billions of tokens, an expense that would otherwise be very hard to swallow.

Pricing: https://api-docs.deepseek.com/quick_start/pricing

"Pro" $3.48 / 1M output tokens vs $4.40 for GLM 5.1 or $4.00 for Kimi K2.6

"Flash" is only $0.28 / 1M and seems quite competent

(EDIT: Note that the endpoints opencode etc. hit on the DeepSeek API (deepseek-chat / deepseek-reasoner) appear to be "flash".)


I estimated that even with heavy usage, around 40M tokens, it would cost you around $30-70 depending on caching. That would give you around double the usage compared to GPT-5.5 on the $200 sub.
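Rough sketch of where a number like that comes from. Only the $3.48/1M output price is from this thread; the input prices, input/output split, and cache rates below are guesses:

    total_tokens = 40e6
    output_share = 0.15                       # assume ~15% of tokens are output
    price_out = 3.48                          # $/1M output tokens ("Pro", from the thread)
    price_in_miss, price_in_hit = 0.60, 0.06  # hypothetical $/1M input, cache miss vs hit

    for hit_rate in (0.0, 0.5, 0.9):
        inp, out = total_tokens * (1 - output_share), total_tokens * output_share
        cost = (inp * ((1 - hit_rate) * price_in_miss + hit_rate * price_in_hit)
                + out * price_out) / 1e6
        print(f"cache hit rate {hit_rate:.0%}: ~${cost:.0f}")
    # -> ~$41, ~$32, ~$25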

This is refreshing right after GPT-5.5's $30

So the R line (R2) is discontinued, or folded back into V4, right?

I believe the R stood for reasoning, just like OpenAI had their own dedicated o1/o3 family, but now every model just has it built-in.


