Hacker News | awestroke's comments

Is your experience with this new quantization approach from Intel? Otherwise your comment is a bit off-topic at best, misleading at worst.

Well, you see, they just can't find a checkbox for ipv6 support in the IIS GUI on their ingress servers.

It's only open because nobody who's interested in this model would send their data to OpenAI to be stripped of PII. If they thought otherwise, it would be closed-weights and API-only for "safety" reasons.

Well let's talk again when the problems have been solved, then. Until then, manually curated skills and documentation will beat this

Meta is not in the AI game any more

LLM Arena has them at #3 on the overview, behind Anthropic and Google, ahead of Grok and OpenAI.

Didn't they just announce they were going to be surveilling all their employees' screens and keystrokes for AI training? Is that just for the love of the game rather than as part of a product?

That's probably just for internal metrics, automating dev work, and facilitating stack ranking. Not necessarily to release a product.

Just saw a Zuckerberg post from July 2025 saying they are going to be "careful" with what they release.

1) In the AI world, that's a very long time ago

2) That still equates to "Meta is not in the AI game any more" in meta-corporate speak


Yes, point two is what I meant.

27 (billion parameters) multiplied by the quant's bytes per parameter, plus context.
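A rough sketch of that back-of-the-envelope sizing (the quant and context numbers here are illustrative assumptions, not figures from the thread; real usage also depends on runtime overhead and KV-cache layout):

```python
# Rough VRAM estimate for a 27B-parameter model.
params_billions = 27
bytes_per_param = 0.5        # assumption: a 4-bit quant ~ 0.5 bytes/weight
context_overhead_gb = 2      # hypothetical KV-cache allowance

weights_gb = params_billions * bytes_per_param
total_gb = weights_gb + context_overhead_gb
print(f"~{total_gb:.1f} GB")  # ~15.5 GB
```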

The better way to defend against these types of issues is to avoid Vercel and similar providers

Nailed it.


I’m referring to how it can use your computer in the background.


Me too


So are you offering api keys or...


I don't understand how you can compare against the base model output without generating with the base model, in which case what's the point?


Because the nature of transformers is that running a bunch of pregenerated tokens through them is a parallel operation, not autoregressive. That's how it works at training time, but speculative decoding uses it at inference time. So if you just want to check whether a set of known tokens is "likely" given the base model, you can run them all through and get probability distributions, no need to sample.

It's the same reason there's a difference in speed between "prompt processing" and "generation". The former is just taking the pre-generated prompt and building the KV cache, which is parallel, not autoregressive and therefore way faster.
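A toy sketch of that scoring idea: with the tokens known ahead of time, you get per-position probabilities "all at once" and just look up the known token at each position. The `logits` matrix here is a made-up stand-in for what a single parallel forward pass would produce:

```python
import math

def sequence_logprob(logits, tokens):
    """Sum of log P(token_i | prefix) for a pre-generated sequence,
    given per-position logits (one row per position, as a transformer
    emits for all positions in one pass when the tokens are known)."""
    total = 0.0
    for position_logits, tok in zip(logits, tokens):
        z = max(position_logits)  # subtract max for numerical stability
        log_norm = z + math.log(sum(math.exp(l - z) for l in position_logits))
        total += position_logits[tok] - log_norm
    return total

# Tiny 3-token vocabulary, 2 positions (fabricated numbers):
logits = [[2.0, 0.5, 0.1], [0.2, 3.0, 0.4]]
print(sequence_logprob(logits, [0, 1]))  # the "likely" sequence scores highest
```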


I haven't read TFA yet but a common technique is speculative decoding where a fast draft model will generate X tokens, which are then verified by the larger target model. The target model may accept some Y <= X tokens but the speedup comes from the fact that this can be done in parallel as a prefill operation due to the nature of transformers.

So let's say a draft model generates 5 tokens, all 5 of these can be verified in parallel with a single forward pass of the target model. The target model may only accept the first 4 tokens (or whatever) but as long as the 5 forward passes of the draft model + 1 prefill of the target model is faster than 4 forward passes of the target, you will have a speedup while maintaining the exact output distribution as the target.
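A minimal greedy-acceptance sketch of that verification step, with toy token IDs and no real models (production speculative decoding samples from adjusted distributions rather than comparing top-1 picks, but the accept-a-prefix shape is the same):

```python
def verify_draft(draft_tokens, target_argmax):
    """Accept the longest prefix of the draft that matches what the
    target model would have picked at each position.

    draft_tokens:  tokens proposed by the fast draft model
    target_argmax: the target model's top choice at each position,
                   computed in ONE parallel forward pass over the draft
    """
    accepted = []
    for drafted, target_choice in zip(draft_tokens, target_argmax):
        if drafted == target_choice:
            accepted.append(drafted)
        else:
            # First mismatch: keep the target's token instead and stop.
            accepted.append(target_choice)
            break
    return accepted

# Draft guessed 5 tokens; target agrees on the first 3, so we keep
# those plus the target's own 4th token:
print(verify_draft([1, 2, 3, 9, 5], [1, 2, 3, 4, 5]))  # [1, 2, 3, 4]
```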


Same reason why prompt processing is faster than text generation.

When you already know the tokens ahead of time, you can calculate the probabilities of all tokens batched together, incurring significant bandwidth savings. This won't work if you're already compute bound, so people with Macs etc. won't get as much benefit from this.


Are Macs/etc compute bound with their 'it fits in unified memory' language models? Certainly by the time you're streaming weights from SSD you must be back in a bandwidth-bound regime.


From what I understood, if we're talking about a single user on a Mac (not batching), you're rarely compute bound in the first place. More rows per pass are nearly free when the cores were sitting idle anyway.

If that’s wrong I would certainly appreciate being corrected, though. But if it’s right, a 2.9x speed-up after rejected tokens, nearly for free, sounds amazing.


That will depend on the model, but they'll hit compute limits before a typical GPU in almost all cases. Macs will still see a speedup from this, just not one as big as the one reported.


Isn't that exactly how draft models speed up inference, though? Validating a batch of tokens is significantly faster than generating them.


Presumably that happens at training time?

Then once successfully trained, you get faster inference from just the diffusion model.


You would only use the base model during training. This is a distillation technique.

