Hacker News | awestroke's comments

Is your experience with this new quantization approach from Intel? Otherwise your comment is a bit off-topic at best, misleading at worst.

Well, you see, they just can't find a checkbox for ipv6 support in the IIS GUI on their ingress servers.

It's only open because nobody who's interested in this model would send their data to OpenAI to be stripped of PII. If they thought otherwise, it would be closed-weights and API-only for "safety" reasons.

Well let's talk again when the problems have been solved, then. Until then, manually curated skills and documentation will beat this

Meta is not in the AI game any more

LLM Arena has them at #3 on the overview, behind Anthropic and Google, ahead of Grok and OpenAI.

Didn't they just announce they were going to be surveilling all their employees' screens and keystrokes for AI training? Is that just for the love of the game rather than as part of a product?

That's probably just for internal metrics, automating dev work, and facilitating stack ranking. Not necessarily to release a product.

Just saw a Zuckerberg post from July 2025 saying they are going to be "careful" with what they release.

1) In the AI world, that's a very long time ago

2) That still equates to "Meta is not in the AI game any more" in meta-corporate speak


Yes, point two is what I meant.

27 (billion parameters) multiplied by the quant's bytes per parameter, plus context.
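A rough sketch of that back-of-the-envelope sizing (the quant and context numbers here are illustrative assumptions, not figures from the thread; real usage also depends on runtime overhead and KV-cache layout):

```python
# Rough VRAM estimate for a 27B-parameter model.
params_billions = 27
bytes_per_param = 0.5        # assumption: a 4-bit quant ~ 0.5 bytes/weight
context_overhead_gb = 2      # hypothetical KV-cache allowance

weights_gb = params_billions * bytes_per_param
total_gb = weights_gb + context_overhead_gb
print(f"~{total_gb:.1f} GB")  # ~15.5 GB
```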

The better way to defend against these types of issues is to avoid Vercel and similar providers

Nailed it.


I’m referring to how it can use your computer in the background.


Me too


So are you offering api keys or...


I don't understand how you can compare against the base model output without generating with the base model, in which case what's the point?


Because the nature of transformers is that running a bunch of pregenerated tokens through them is a parallel operation, not autoregressive. That's how it works at training time, but speculative decoding uses it at inference time. So if you just want to check whether a set of known tokens is "likely" given the base model, you can run them all through and get probability distributions, no need to sample.

It's the same reason there's a difference in speed between "prompt processing" and "generation". The former is just taking the pre-generated prompt and building the KV cache, which is parallel, not autoregressive and therefore way faster.
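A toy sketch of that scoring idea: with the tokens known ahead of time, you get per-position probabilities "all at once" and just look up the known token at each position. The `logits` matrix here is a made-up stand-in for what a single parallel forward pass would produce:

```python
import math

def sequence_logprob(logits, tokens):
    """Sum of log P(token_i | prefix) for a pre-generated sequence,
    given per-position logits (one row per position, as a transformer
    emits for all positions in one pass when the tokens are known)."""
    total = 0.0
    for position_logits, tok in zip(logits, tokens):
        z = max(position_logits)  # subtract max for numerical stability
        log_norm = z + math.log(sum(math.exp(l - z) for l in position_logits))
        total += position_logits[tok] - log_norm
    return total

# Tiny 3-token vocabulary, 2 positions (fabricated numbers):
logits = [[2.0, 0.5, 0.1], [0.2, 3.0, 0.4]]
print(sequence_logprob(logits, [0, 1]))  # the "likely" sequence scores highest
```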


I haven't read TFA yet but a common technique is speculative decoding where a fast draft model will generate X tokens, which are then verified by the larger target model. The target model may accept some Y <= X tokens but the speedup comes from the fact that this can be done in parallel as a prefill operation due to the nature of transformers.

So let's say a draft model generates 5 tokens, all 5 of these can be verified in parallel with a single forward pass of the target model. The target model may only accept the first 4 tokens (or whatever) but as long as the 5 forward passes of the draft model + 1 prefill of the target model is faster than 4 forward passes of the target, you will have a speedup while maintaining the exact output distribution as the target.
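A minimal greedy-acceptance sketch of that verification step, with toy token IDs and no real models (production speculative decoding samples from adjusted distributions rather than comparing top-1 picks, but the accept-a-prefix shape is the same):

```python
def verify_draft(draft_tokens, target_argmax):
    """Accept the longest prefix of the draft that matches what the
    target model would have picked at each position.

    draft_tokens:  tokens proposed by the fast draft model
    target_argmax: the target model's top choice at each position,
                   computed in ONE parallel forward pass over the draft
    """
    accepted = []
    for drafted, target_choice in zip(draft_tokens, target_argmax):
        if drafted == target_choice:
            accepted.append(drafted)
        else:
            # First mismatch: keep the target's token instead and stop.
            accepted.append(target_choice)
            break
    return accepted

# Draft guessed 5 tokens; target agrees on the first 3, so we keep
# those plus the target's own 4th token:
print(verify_draft([1, 2, 3, 9, 5], [1, 2, 3, 4, 5]))  # [1, 2, 3, 4]
```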


Same reason why prompt processing is faster than text generation.

When you already know the tokens ahead of time, you can calculate the probabilities of all tokens batched together, incurring significant bandwidth savings. This won't work if you're already compute bound, so people with Macs etc. won't get as much benefit from this.


Are Macs/etc compute bound with their 'it fits in unified memory' language models? Certainly by the time you're streaming weights from SSD you must be back in a bandwidth-bound regime.


From what I understood, if we're talking about a single user on a Mac (not batching), you're rarely compute bound in the first place. More rows per pass are nearly free when the cores were sitting idle anyway.

If that’s wrong I would certainly appreciate being corrected, though. But if it’s right, a 2.9x speed-up after rejected tokens, nearly for free, sounds amazing.


That will depend on the model, but they'll hit compute limits before a typical GPU in almost all cases. Macs will still see a speedup from this, just not one as big as the one reported.


Isn't that exactly how draft models speed up inference, though? Validating a batch of tokens is significantly faster than generating them.


Presumably that happens at training time?

Then once successfully trained, you get faster inference from just the diffusion model.


You would only use the base model during training. This is a distillation technique.

