I’m surprised Michael Levin’s research hasn’t expanded much beyond a certain YouTube media bubble. They’re able to start and stop cancer growth with only voltage changes between cells, likewise they can also trigger regeneration or anatomical changes using voltage changes. His research seems to suggest a lot of important anatomical plans are stored in an electric field around the body, not in the DNA. This model’s explanation for cancer is that some cells become disconnected from this field and start growing independently of the overall body plan.
I love his work (even though I know little more than what he says in interviews). I am also surprised it's not more widely known / applied. I am very skeptical of conspiracy-minded thinking, so I'd much rather assume his and his team's work hasn't reached escape velocity from obscurity. Especially with larger industries, it takes time and significant breakthroughs to become "a household name", so to speak.
One basic example is not counting bugs as points in your ticket tracker. At my last job I had coworkers whose velocity was almost double everyone else’s but it was because they kept deploying and then fixing their own bugs.
Devil’s advocate: what’s the point? Is it important to have reading and writing skills if everything can be transcribed through AI? Or maybe it’s not directly important, but the ability to hold your attention on something for 30-60 minutes is? Is reading the best medium for education, or something more like Khan Academy videos?
I also wonder how the Montessori schools are doing, since I believe they focus less on rote skill acquisition and more on creativity.
Reading/writing is a much more dense and navigable way of taking in and recording information than speech. Efficient use of AI requires being very good at reading quickly and having the comprehension skills to pick up on nuances that suggest a hole in the AI's work.
In a world where AI is empowering existing experts while risking junior hiring, the young should be aiming to be competitive with those experts, not aiming below even current juniors. If, as a human, you're just acting as a glorified harness around an LLM, you're more replaceable.
In my opinion it is. Reading can convey information faster than even sped up videos, is easier to skim, and has high precision.
I'm not saying it is the best for everyone, but it has been proven repeatedly to beat out any other method in the majority of the population. Plus, its stability over time is better, and storing it is much easier and more reliable.
It also could have other side benefits like focus or perhaps something like visual acuity, much like how writing by hand can develop good hand-eye coordination. If someone struggled to write with a pencil for example I would be very wary about handing them sharp tools or knives.
When I used to read for pleasure, I did it because it was pleasurable. Not because it would be the hard thing. It was fun and easy.
What this particular chain of thoughts shows is that adults don't read for pleasure either; they associate it with an uncomfortable hard thing one should do to "build character".
This is conflating hard with unpleasant. A child just learning to read is going to find it hard to do, yet through adults pushing them to do the hard thing, they learn to read and sometimes begin to find it to be pleasurable. Building most skills is hard, yet that doesn't exclude taking pleasure in it. Many of us taught ourselves to code, the fact that we enjoyed it doesn't mean it wasn't also hard.
We've all learned the lesson that sometimes you have to struggle through something hard, to be able to access better pleasure.
I don't think this is accurate. In basically every single game or sport with measurable outcomes, endlessly doing things that are relatively simple to you (and often enjoyable) drives improvement.
If you want to make the argument that it's muscle memory in e.g. shooting free throws in basketball, then you can see the exact same thing in doing chess tactical puzzles. There's even one successful learning method called the Woodpecker Method where you endlessly repeat the same series of tactics, working to get the time it takes you to do them down to essentially instantaneous. And it works excellently for improvement, and I obviously don't just mean improvement at doing that set of tactics.
The original donkey kong is pretty difficult compared to some of the wide-audience games that have been coming out. As far as I can tell, if the audience of a particular franchise includes younger generations as a majority or near-majority, the difficulty plummets. I don't think "plummets" is even that sensational. See pokemon, kingdom hearts, mario games, final fantasy games. Some franchises and genres have survived but not all of them.
I might be missing some other reasons why this could be happening, like increases in game balance and coordination.
Play Mario Odyssey for an hour or two then play Super Mario Bros 1, 2, or 3 as one startling example.
Mario games have reduced the difficulty a lot, although you should probably compare Mario Wonder with SMB 1-3. Odyssey is more comparable with Mario 64.
One of the things though is that when most people play SMB 1-3 today, they're playing with input lag. Mario Wonder was designed with input lag in mind; SMB 1 was not, and that increases the difficulty.
Mario Wonder lets you choose to use invulnerable characters, etc. There was only one level I remember needing to try many times to beat. OTOH, there's lots of difficult levels in smb 1...
That's mostly the MBAification of games that I think is completely disconnected from what most kids want. But the MBA logic is about maximizing market reach. Relatively few people will choose not to play a game because it's too easy, but ostensibly the same isn't true of games that are seen as difficult. Of course Elden Ring, Dark Souls, et al completely proved this to be nonsense (to say nothing of pvp games), but who's gonna let a bit of reality get in the way of pie charts, bar graphs, and powerpoints?
In the world of games outside the big money AAA MBA stuff, there's plenty of highly challenging franchises that remain true to themselves and thrive, even with plenty of kids playing. E.g. - I suspect the median age for Binding of Isaac is well below the age of consent.
Shifting to a position below where it previously was? I don't get your point... Is there a new bar? What would you say the new expectation is that doesn't build on previous core skills?
Reading, writing, and math are foundational skills that, aside from having enormous utility in their own right, are also crucial for developing sharp, creative, and analytical minds.
Take writing as an example: it challenges you to organize your thoughts, patch up the weaknesses of your arguments, and find effective means of connecting with your audience. In so doing, you restructure your own understanding of the world, deepening your expertise and mental schemas. That's something an LLM can't do.
I am not sure what to make of this devil's advocate comment. Are you just throwing opposites at the wall? I genuinely don't know how to interpret the query about attention span.
Are you suggesting that a lower attention span has no impact? I don't know how I would learn things if my attention span was shit, or even sit with difficult problems or emotions and resolve them. Even just general productivity, which, sure there are some arguments about good vs bad productivity, but in general, any form of productivity will benefit from better attention span I think?
I think a world where people are just automatons who ask AIs to do things for them is a pretty bleak world.
Reading, writing, being able to focus on things... these are healthy things that healthy brains do. And I don't think that's just a case of it always being that way in the past, so any change to that is bad. I think humans as a species will die if we give this sort of thing up, and I don't think I'm exaggerating or engaging in AI doomerism here.
Not to mention that AI isn't that good. Maybe it will be, but I'm skeptical. Human progress will basically stop if we lose a generation of kids to this brainrot, with barely-capable AIs that can't even design their successors and move humanity forward. Who else will push humanity forward, if the next several generations of kids are intellectually incapable of doing so?
What kind of world are we building where advanced technology is just used to stop living as full human beings? Where’s the desire to self-actualize? Did it start dying when people got glued to the idiot box?
At the Montessori school my kid goes to, children read a lot. They have access to the library and can ask the teacher to go there when they need/want to. The school has a no phone policy; children are not allowed to bring them to school. Computers are only in the computer room for children to access when they want to do research, and there are no laptops in the classroom.
It works well, and the children are both happy and do well academically.
My pen weighs in at 10g in my backpack and is capable of durably recording information for thousands of years. No battery to charge, cheap, and plentiful.
I use my fingers to interact with computers, and they don't have any extra weight at all, as they are already attached to me. You need to also count the weight of the paper.
And, no, your pen and paper are not able to durably record information for thousands of years. Unless you have some really bespoke setup.
Acid-free paper and a carbon-black ink, or a modern neutral pH iron-gall ink, should last 1000 years if stored correctly. 2000 might be pushing it, but under a controlled atmosphere it should be possible.
These posts are never written by software engineers, it’s always some tech exec, retired engineer, or VC. This author is apparently a professor at the Wharton School of Management?
None of these people have to ship or maintain real products, they’re just making side projects.
The only decent software engineering perspective I’ve seen has been from Mitchell Hashimoto.
I don’t think that’s true, I think these authors are making a much stronger claim that AI is proficient or even an expert at software engineering. This author describes how complex and sophisticated their software is, and the only value he’ll concede to “coders” is that there might be a few bugs they’d need to fix.
Imagine not being an architect and using Claude to put together a building plan, then concluding it’s basically done but we might need a real architect to double check the measurements. It may even be true but I’d be skeptical if it’s always non-architects saying this.
And - we kind of have been here before. The "proto"-type is almost complete. It's just a little slow, a little spaghettificated, just written in excel-vb, clicked together in node-graphs, or the next hot thing that makes coding unnecessary.
Why do they even need coders to fix these bugs? It would be an order of magnitude (at least) faster to ask Claude to find and fix them, and it will likely be successful.
Building in the physical world has physical and time constraints that cannot be overcome, which is one of the reasons architecture (and engineering) are so important in this domain. In software development these constraints were only inherent when people were writing the majority of the software. I feel like I’m seeing what I thought were fundamental constraints being eroded by the increasing speed and correctness of these tools and it’s making me reconsider the importance of some of the values that are held by software engineering.
It’s obviously dependent on the domain and solution, but if your software can be extremely rapidly rearranged, bugs found and fixed with little effort, and features added with only a minimum prompt, I think the entire definition of technical debt has changed. I’ve been sceptical of these tools and still approach their output with caution. I also worry that, as a software developer, if more can be accomplished in less time there will be less room on this planet for software developers.
> I think the entire definition of technical debt has changed. I’ve been sceptical of these tools and still approach their output with caution.
This very well summarizes my current thinking on the subject as well. And most of my career has been playing the role of technical debt nazi. Much to the detriment of my earning potential.
Does AI make incredibly inefficient code most of the time? Yup. But it does it at lightspeed with minimal effort.
I think many software engineers forget they exist to get real things done (in many cases at least) and they are a cost center for most businesses. If your end product is not selling software, very few people actually Doing the Thing(tm) will give a single solitary care about code quality or maintainability when they can just spend 30 minutes and $15 worth of tokens to fix it.
It won't take over everything, but I've already seen otherwise very intelligent go-getter type folks who are not technical and don't know how to code make extremely useful things for themselves and their small little enterprises. And this will seemingly only get better and more efficient.
For someone who really does love the idea of well architected and future-proof code this is just icky to even say or consider. But I'm coming around to the idea that this is the future for the majority of software for most places. And it may have the ability to seriously even the playing field for small enterprises in some industries.
I'm currently using it to implement a zillion side projects at home I've been "meaning to get to" for years. It makes incredibly silly unmaintainable code most of the time - but I learned to not care, and just tell the AI bot to fix it/add to it as I go along. Worst-case I spend a single night deleting it all and starting from zero to "refactor" an entire thing.
> I think many software engineers forget they exist to get real things done (in many cases at least) and they are a cost center for most businesses. If your end product is not selling software, very few people actually Doing the Thing(tm) will give a single solitary care about code quality or maintainability when they can just spend 30 minutes and $15 worth of tokens to fix it.
I am surprised to hear people so naive that they expect their token usage to stay flat if code quality and maintainability start falling exponentially.
What if to fix 2 bugs your LLM starts adding 50 new ones? Will you tell your customers in the support channel "sorry, software is finished; if we try fixing anything, everything else might break, not worth it"? Or "we can probably fix it, but our AI usage will rise so much we need to up the subscription 3 fold, you choose"?
The speed at which LLMs code is only comparable to the speed at which they add garbage to your repo. If you stop caring about maintainability, you also stop caring about your AI/LLM related bills and the viability of your project past the PoC stage.
The GP explicitly mentioned "end product is not selling software". But even then, bugfixes introducing new bugs were not unheard of before. Most code used to be mediocre quality, so there's no sea change with AI. Perhaps it even becomes better on average.
Another thing though is that selling software in the first place will soon become a tough proposition outside of a few niches.
> I am surprised to hear people so naive they expect their token usage to stay flat if code quality and maintainability starts falling exponentially?
There's no reason to think that quality and maintainability will start falling exponentially. On the contrary, these models get better every couple months, and 99% of software isn't actually that complicated. There's just no reason for the fear-mongering that fixing 2 bugs will cause the LLM to add 50 new ones.
> I think many software engineers forget they exist to get real things done
One billion percent. I think the vast majority of the anti-AI sentiment I hear from software engineers comes down to them caring more about playing with their tools than actually solving the problem.
> Does AI make incredibly inefficient code most of the time? Yup. But it does it at lightspeed with minimal effort.
This hits the nail on the head.
Detractors often hang on to examples of coding assistants making mistakes or outputting subpar code, but they somehow miss the fact that coding assistants can also be prompted again to refactor whole swaths of code just as fast as they introduce oopsies. This means that the worst case scenario implies fast convergence to an acceptable outcome, and from there also fast iteration to improve upon that.
The problem is that this approach is not sustainable. Errors compound. The cost to fix one issue might seem small at first, but over a stretch of time all these "oopsies" become architectural spaghetti that can only be fixed with a complete rewrite, which will certainly become more expensive than getting the code "organically" developed.
The only way I see AI coding working in the long run is if we go back to a Waterfall/BDUF process and have actual engineering. Let engineers really own the architecture. Require any new feature, no matter how small, to be specced out with complete sequence diagrams. Ensure that every new software package is put on a UML component diagram for the team to review and see how each addition interacts with the whole system, etc.
If we do that, then we can just give all the documents to a coding agent and say "go ahead and implement this" with a minimal amount of confidence. But in doing this, I bet we will realize the following:
- the "effort" has never been about writing code itself. The code is just the material manifestation of all the thought that went into working out a solution to the problems that the product is attempting to solve.
- we will likely be better off using code generation tools (i.e., UML-to-code) and a "weak" LLM (that can run locally) than playing the token lottery at the Anthropic Casino.
I mirror your thoughts. I think we'll end up with "perfect map" paradox = you cannot be vague or indecisive on what you want (and if you are then these decisions don't matter) and you're creating a 1:1 representation of what the code needs to be.
I'd substitute "owner" for the team and in that sense the owner will not need to be human.
We're at this state where Claude is great at doing the "middle" part of work, but it's crap at gathering requirements and verification of what it has done. I also don't see people caring about these aspects of software development as shown in the article
> The problem is that this approach is not sustainable. Errors compound. The cost to fix one issue might seem small at first, but over a stretch of time all these "oopsies" become architectural spaghetti that can only be fixed with a complete rewrite, which will certainly become more expensive than getting the code "organically" developed.
That's so far been called software development.
All software developed by people suffers from this issue.
Where exactly is the novelty?
> The only way I see AI coding working in the long run is if we go back to a Waterfall/BDUF process and having actual engineering.
Nonsense. The problem is exactly the same.
With agents, iterations are much faster. That can mean things get messier faster, but they can get back in shape just as fast.
Ironically, agents improve the quality of the deliverable as well. Approaches such as spec-driven development do a far better job delivering features up to spec than manual coding by flesh and blood developers.
There's an awful lot of baseless scaremongering in your post. You make it sound like with AI assisted coding developers stopped paying any attention to quality.
> All software developed by people suffers from this issue.
And that’s pretty much where you are wrong. Take any long running open source project and you can see the craftsmanship that goes into it. It may not be perfect, but hacks are clearly marked as such.
> And that’s pretty much where you are wrong. Take any long running open source project and (...)
I think you are demonstrating a clear lack of insight and experience in software development settings, including FLOSS projects. I can name you a dozen fairly well-known FLOSS projects which are a big ball of mud. Just go to the likes of GitHub, check out the list of popular projects, and peek at their code. You will get a very mixed set of results.
> The compounding speed. Your devs might reach a point where they have to rewrite and refactor, in a decade.
I think that this is exactly why this scaremongering breaks down. If you believe the compounding speed is that much greater, wouldn't you be compelled to accept that refactoring and cleaning things up is just as fast and effortless?
I mean, you have a tool that writes software for you following your commands. If you are that concerned with maintainability then what can possibly compel you to not invest any effort in it?
> wouldn't you be compelled to accept that refactoring and cleaning things up is just as fast and effortless?
No, not at all. Imagine that each unit of work (a new PR for a feature, a bugfix) builds something that is 99% close to optimal, and you can only bring it to 100% if you spend time to really review and rewrite the "not good" part. Also, for the sake of argument, let's just say that the overall quality of the system is the product of the quality scores of each unit of work. The only way to get an "ideal" system is by ensuring that work done on it follows the "ideal" architecture - for whatever "ideal" means for the developers/maintainers.
You are arguing that you are saving time because you only have to write the 1% that the AI got wrong, so you'd be getting a 100x speed up. My argument is that there is not so much time saved, because if you want 100% quality, you will have to review 100% of the code. Understanding the produced code is the time-consuming part, not typing it out.
So, the only way to have these time savings by working with coding agents is if you accept that the code generated is good enough to not need careful review. But if you do that, then each unit of work where you tell yourself "not ideal but good enough. Ship it and we refactor later" ends up bringing down the overall system quality. If you have 10 of these "99% good enough" PRs, your overall system score is already at about 90% (0.99^10). With 50 of these, the score dives down to about 60% (0.99^50).
This is what OP and I are talking about "compounding" issues: unless we get to a point where generated code does not need review at all, your development speed will always be bottle-necked by the human in the loop. The only way to get speed benefits from the code generation is if we remove the human in the loop, but in doing so quality will drop faster than you can fix it.
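To make the arithmetic behind that compounding argument concrete: the 90% and 60% figures match what you get by multiplying the per-change scores together. A minimal sketch, assuming every change has the same 99% score and that system quality is simply their product (both are simplifications for the sake of argument, and `system_quality` is a made-up helper name):

```python
def system_quality(per_change_quality: float, num_changes: int) -> float:
    """Toy model: overall quality is the product of identical per-change scores."""
    if not 0.0 <= per_change_quality <= 1.0:
        raise ValueError("per_change_quality must be between 0 and 1")
    if num_changes < 0:
        raise ValueError("num_changes must be non-negative")
    return per_change_quality ** num_changes


if __name__ == "__main__":
    # 99% per change is the figure from the comment; the change counts are illustrative.
    for n in (10, 50, 100):
        print(f"{n:>3} changes at 99% each -> {system_quality(0.99, n):.1%}")
```

Under this toy model 10 changes land at roughly 90% and 50 changes at roughly 60%. Real codebases obviously don't have a single quality score, so treat this only as an illustration of how small per-change shortfalls multiply.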
I haven’t used Fable/Mythos yet, but my experience with recent versions of Opus, GPT 5.5 and recent Chinese models is that prompting again isn’t guaranteed to fix the underlying issues, nor is it guaranteed to not introduce more issues. I’ve seen SOTA models make ridiculously stupid architectural decisions that they were then unable to back out of without being prompted very specifically, instead adding a patchwork of “fixes” on top.
I’m not saying that you can’t use AI to do it, because I believe that with carefully controlled workflows and context management you can, but it’s not a simple prompt away. It requires guidance and understanding, and isn’t the speed demon that raw prompting is.
> I haven’t used Fable/Mythos yet, but my experience with recent versions of Opus, GPT 5.5 and recent Chinese models is that prompting again isn’t guaranteed to fix the underlying issues, nor is it guaranteed to not introduce more issues.
That's not really the point though. That presumes models are only useful if they are one-shot models. That is false.
I mean, what if your prompt successfully changes 20 source files and makes a mess in one? How much work did it save?
And the elephant in the room is when models actually outperform whatever the prompter is able to deliver, and faster. That is somehow left out.
> That presumes models are only useful if they are one-shot models
That’s not at all what I’m saying.
I’m saying that in my experience across multiple models, the follow up prompts don’t fix prior underlying issues. They usually patch on top instead, unless you give them significant and time consuming guidance.
I want them to be more useful outside of one-shot uses, but I find that they currently miss the mark.
> I’m saying that in my experience across multiple models, the follow up prompts don’t fix prior underlying issues. They usually patch on top instead, unless you give them significant and time consuming guidance.
That's not my experience at all, and I have been using models that are far from being cutting edge. Even in the cases where a model generates utter nonsense, a couple of clarifying questions is all it takes to get it back on track.
But that might be a factor of the project being worked on, and the extent of the changes being asked for.
I think this is overlooking the fact that assigning a coding assistant to fix the bugs it re-introduces for all eternity just leads to spiraling token costs, which might cost more than just hiring a competent engineer in the first place.
I think computers are incredibly cheap compared to humans. These models and infrastructure to run them are going to only get more efficient in time. Right now we are still using (for the most part) entire hardware architectures mostly shoehorned from one purpose (graphics) into another. As purpose-built hardware becomes more prevalent and the SOTA starts to slow down I can't imagine a $100k hardware box not being able to handle a small team of developers' needs for many things.
I do think there will be a place for the top 20% of software engineers forever. But most people are not in that top 20%, and the quality when you get below average is not a linear progression. It will not be that difficult for AI generated code to beat the "bottom end" of the industry since tbh it's hard for me to tell the difference between LLM generated code and some of the shit I've seen over the years. I've run across code written by folks who don't know what an array is more than once.
Most software is not built by MIT and Stanford grads making $500k/yr in the Valley. It's built by work-a-day programmers in the middle of nowhere making $80k/yr to keep some niche small business going with hyper-specific software that was first designed for Windows 95. Or stuff like making horribly designed Wordpress plugins. Or Shopify integrations. etc. etc.
I've also seen these small businesses totally held back by incompetent programmers, and despite their best efforts and huge amounts (for them!) of investment they can never seem to fix it. These types of enterprises are having AI run circles around their current engineering practices, even if it would make most FAANG engineers gasp in horror.
Either way it will certainly be interesting to watch! I just wish I was closer to retirement.
In my experience, the refactors are just as bad, just in different ways. All you end up doing is treading water with different iterations of shitty code. By the time you get somewhere acceptable, you could've just fixed it up yourself.
My preferred workflow these days is to pair program with an LLM until it gets close-ish and then manually touch it up. Without that, it just produces junk in different forms.
Don't forget that you can adjust your requirements (either via plan or skill) to ensure the mistakes do not happen. The problem is that neither LLMs, nor humans (that don't work with the domain) will know they made these mistakes. Even coders don't think about everything all the time
That assumes you can write automated tests that reliably identify the mistakes over an entire codebase. Nice idea in theory. If it were actually possible, we would long since have generalized libraries of tests to catch every significant security and performance gotcha. What we have are static code analysis tools, fuzzers, etc. None of which have come close to eliminating security and performance problems. I don't see how AI somehow changes that.
Ah, I see what you mean now. Yes, my mind went straight to static analysis and testing (unit, feature, uat, mutation). Thanks for expanding on your point!
It's quick to build a hut in a green field, but slow to remodel the expanded building after. I think that will remain true regardless of if a team of sw developers are doing it, or an AI with a product manager or somewhere in between.
Technical debt remains the same. LLMs are found not to work as well when editing messy codebases - exactly the kind you get after using an LLM for a while. After a few weeks or months you have to either throw it away and start over, or involve a human at exorbitant prices.
> I think these authors are making a much stronger claim that AI is proficient or even an expert at software engineering.
The author specifically says:
> I am sure it is not perfect (I only spent an hour working with the results), but a software engineer would iron out the remaining potential bugs that I could not find quickly (which is one reason we may need more, not less, coders in the future, to help with the explosion of new uses for software)
which acknowledges pretty clearly that engineers bring a level of insight and experience still missing from Mythos. Saying that, I totally disagree with his contention that this will always be true. It's pretty weird that the author of an article stressing the steep improvements in a model's capability can't seem to imagine further improvements in that capability. As if Mythos is where development ends or whatever gap remains between models and experts won't steadily narrow or eventually widen in reverse.
It is, and it's cool that it is, but the calibration is important. Statements like this:
> With Fable the spell has gotten powerful enough that I am no longer sure I am the wizard. I am closer to a patron. I describe what I want, I pay for it, and I judge the result. The conjuring happens somewhere I cannot watch, in hundreds of small choices I never get a vote on. The work has shifted from process to outcome. I no longer steer; I commission.
have a very different meaning coming from a non-technical researcher than they would from someone who builds software for a living.
Making side projects isn't a trillion dollar industry tho, adding to the fact that we are facing another global supply chain crisis due to the Iran War; the US is about to commit the biggest self-own ever in the history of empire.
The US has been on a course of self-owns ever since Trump got into office. That they still are a dominant power on the globe shows how much they were one before Trump, but it seems to be changing. At every self-own they commit, China laughs and inches up a little closer. I think we will see the day, when they are evenly matched in our lifetimes.
But which self-own exactly do you mean, of the many there are?
Well, right, but if the real use case for LLMs is "making software that wasn't economical to make before" that's bearish for the labs because it means they're only going to be chasing the low end of the market.
> I am sure it is not perfect (I only spent an hour working with the results), but a software engineer would iron out the remaining potential bugs that I could not find quickly [...]
People have said things like this many times in the past, and, in the past (perhaps not now), it's always been a misunderstanding of what is good and bad, what's difficult and easy.
For example, someone would draw a UI in a GUI painter that generates code (or a resource file), and a manager would see it and think the majority of the work towards the product is done. (Incidentally, then there seemed to be a reaction, towards making your UI mockups look abstract or otherwise different from runnable code, helping the nontechnical to understand that this isn't 90% of the finished product.)
Or a student intern hacks out a homework-grade demo, and a manager who understands neither software engineering nor product domain says "we just need some engineers to polish it up for production", and thinks the student is a star and why can't their engineers be as brilliant and productive. (I might have once been that energetic intern, who was happy for the encouragement, but then learned more, and saw it was a thing.)
This common misunderstanding was sometimes self-correcting -- when trying to ship became a disaster of misery and regretted-attrition, or the product was poorly received by the market because it wasn't thought through nor implemented well, or building subsequent functionality atop it was a nightmare. (But adverse effects of bad approaches are one of the reasons for management and ICs to job-hop, before the unwanted effects affect them personally.)
What might be different now is that some of these AI tools are outputting better-engineered work than some software engineers, and much faster.
At the back of my mind, I'm wondering how the really great software engineers will continue to stand out, as the discipline is being devalued in the minds of most leadership, and anyone can prompt an AI to generate something that superficially appears to them like what they assume a great software engineer would produce. (Even if the great engineer would do much better quality of implementation, have innovative ideas that ML from open source code would not, and maybe arrive at better product concepts as they worked through the problems.)
This is why the AI companies are rushing to IPO. By the end of next year you’ll be running most of your AI on device. They have no moat, they’ve reached the limits of scaling, most of the magic can be distilled into smaller models, and they know it.
Qwen's ~30B-class models are genuinely good enough for use if you can find a machine with enough memory bandwidth to run them at 30-90 tokens/second. It's been extremely telling that Qwen stopped releasing 120b class models. At some point in the next 10 years (maybe 3?) someone is going to release an Opus 4.5 class 256B model you can run locally. Right now our engineers use about $800/mo worth of opus tokens; at that rate the ROI for local LLM is ~10 months
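The break-even arithmetic behind that "~10 months" figure can be sketched as below. This is only a rough illustration: the $800/month spend comes from the comment above, while the $8,000 hardware price and the `monthly_power_cost` parameter are assumptions, chosen because an $8,000 purchase is what reproduces the quoted payback period.

```python
def break_even_months(hardware_cost: float,
                      monthly_cloud_spend: float,
                      monthly_power_cost: float = 0.0) -> float:
    """Months until a one-time hardware purchase beats ongoing cloud spend."""
    monthly_savings = monthly_cloud_spend - monthly_power_cost
    if monthly_savings <= 0:
        raise ValueError("local running costs meet or exceed cloud spend; no break-even")
    return hardware_cost / monthly_savings


# $800/mo is from the comment; $8,000 of hardware is an assumed figure.
print(break_even_months(8000, 800))      # 10.0
# Same purchase with a hypothetical $50/mo electricity bill.
print(round(break_even_months(8000, 800, 50), 1))
```

The sketch ignores resale value, maintenance, and any quality gap between local and hosted models, all of which shift the real number.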
I've been on claude's opus 4.5/6/7 for work for a couple months, and I finally got back to running Qwen A3B 35B... it's incredibly performant and quite capable on semi-reasonable local hardware.
I get ~150 tokens/s on dual nvidia RTX 3090s and can fit the whole 300k context into gpu on a UD-Q4-K-XL quant gguf.
Combined with Pi as a harness, I'm surprised to find that it feels about as capable as claude did 8 months ago (their 3.x models).
It's not Opus 4.5 levels yet, but it's good enough for a LOT of basic work. I actually downgraded my personal anthropic subscription because Qwen is absolutely fine for implementation work. I still let a better model write a plan, but then I can just switch over to Qwen to implement.
I don't think we're 10 years away from opus 4.5 levels running on cheap consumer hardware. I think we're probably closer to 18 months away, and I suspect it'll be in the 30-60b range, not the 256b range.
PC manufacturers also seem to be betting on local, with a LOT of focus on 64 to 128gb unified RAM machines.
I have come at this at a slightly different angle.
I am a fully-burned-out freelancer (in the last couple of years so severely and totally that I thought I had early onset dementia, and I am still not sure I don't). I don't really have an off-ramp to anything else yet, but the sea-change in the industry has been contributing to my feeling that I should knock it on the head.
I must get past broad understanding of AI to deep understanding, but I have to find a way to do this which sits well with freelancer ethics (sustainability, stability, control of destiny).
So I decided I would start out with that operating principle that ultimately this stuff is just going to be local: models will eventually hit some level of practicality for most tasks and technological progress guarantees that they will eventually run on desktops.
I decided to learn how to run models locally properly, see how far I get with opencode (and Pi and Zed experiments), and grow outwards from there to metered models (opencode go, openrouter etc.)
Knowledge first; what can I do that meaningfully changes my outcomes and confidence with no cost and no exposure to sudden change?
I have a secondhand M1 Max (excellent GPU bandwidth), and I am really shocked to find that arguably that level of practicality is already here.
Qwen 3.6 35B can really do a lot. And — not sure if you have tested it — but in some ways I think the Gemma 4 26B is better. Particularly for more commonplace dev tech — it is very knowledgeable about the sort of low-end web dev stack that is most common (Wordpress, PHP, MySQL).
I have been getting 75 tokens/sec with (GGUF) Gemma-4 26B QAT and MTP. (Can't get anywhere close with MLX, for some reason.)
A similar sort of speed with an MLX Qwen 3.6 35B. I have a sneaking suspicion that maybe llama.cpp is now faster than MLX on this older kit so I might try seeing what llama.cpp can do there, too.
Not blazing fast, but fast enough that there are plenty of experiments and small jobs I can do before I even get to using Big Pickle!
How are you running that GGUF, and how many tokens/sec are you getting without MTP? My M1 Max gives me 65 t/s for non-MTP unsloth/gemma-4-26B-A4B-it-qat-GGUF (UD-Q4_K_XL), but with MTP that actually goes down to 56 t/s (at 63% accepted drafts).
I hadn't done any really radical testing so I've just had another look.
Without the MTP drafter, it is pretty consistently 75 tokens per second anyway, which is interesting.
With the MTP drafter it reaches well above 95 tokens per second handling the prompt and it will slowly drop to 65 or so with the output tokens as the prediction success rate slowly drops.
But with generated output it seems to me that the predictions are always going to drop dramatically over time.
I think my results here are broadly consistent with what people say about success rates with smaller and sparse models. I am going to test with n-max 4 in agentic situations at some point, and I may see whether it has much impact on the 31B model which is too slow to be practical otherwise.
I have a very unqualified feeling that MTP will matter more in agentic coding because of the larger prompts.
But my biggest issue since I installed it, I think, is that the combination is occasionally messing with markdown generation during thinking, and sometimes possibly losing the </think> at the end. I've seen it enough now to be fairly sure it is the Gemma MTP causing it. There is an open bug in the vLLM project about this and I wonder if something similar is going on in llama.cpp.
The speed without the MTP drafter is pretty solid so I am content to let more experienced people than me handle things while I learn other stuff, but I might go looking for some testing code that can prove it sometime.
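One way to reason about why a drafter can either help or hurt is the usual back-of-envelope model for speculative decoding, sketched below. It is not a description of how llama.cpp or vLLM implement MTP. It assumes each drafted token is accepted independently with a fixed probability, and the `draft_cost` and `verify_cost` parameters (both in units of one normal single-token pass) are hypothetical knobs, not measured values.

```python
def expected_tokens_per_cycle(accept_rate: float, draft_len: int) -> float:
    """Expected tokens emitted per verification pass.

    Assumes every drafted token is accepted independently with probability
    accept_rate, and that the verifier always contributes one token itself.
    """
    if not 0.0 <= accept_rate <= 1.0:
        raise ValueError("accept_rate must be between 0 and 1")
    if accept_rate == 1.0:
        return float(draft_len + 1)
    return (1.0 - accept_rate ** (draft_len + 1)) / (1.0 - accept_rate)


def speculative_speedup(accept_rate: float, draft_len: int,
                        draft_cost: float, verify_cost: float = 1.0) -> float:
    """Throughput relative to plain decoding (values below 1.0 are a slowdown)."""
    cycle_cost = verify_cost + draft_len * draft_cost
    return expected_tokens_per_cycle(accept_rate, draft_len) / cycle_cost


# 63% acceptance is the figure quoted in the thread; the draft length of 4
# and the per-token draft costs are illustrative assumptions.
for cost in (0.05, 0.2, 0.4):
    print(cost, round(speculative_speedup(0.63, 4, cost), 2))
```

Under these assumptions the same acceptance rate flips from a speedup to a slowdown once drafting gets expensive enough, which would be consistent with two people on similar hardware seeing opposite results.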
The majority of my agentic setup is Pi / Claude Code, where none of the Chinese models are as good, except the commercial 1T models.
Local is a pipe dream. If you can run it cheaply on occasion, why can’t commercial companies run it cheaper 24/7 and lower the costs? The answer is simple: use cases are getting more demanding, and hence you need more from the model, not less.
Sure, if your task is a narrow labeling job on 1M records, a small optimized model is good. If you want to do complex things, the bar shifts as models advance.
This sounds like something someone at IBM in 1986 would say trying to sell their mainframes. "PCs will never be a thing. No one's gonna want a computer."
I'm seeing some impressive results from folks that can afford 10k+ GPUs right now. But those GPUs will all be hand me downs in 10 years. So pipe dream? Hmmm...... that's not how this industry works.
Those are not GPUs available on iPhones. Will we get there eventually? Maybe! Maybe we end up with GPU clusters built on the edge (e.g. cell towers) for offloading, maybe it’s never economical, maybe a different model architecture makes it simpler, who knows.
But it doesn’t seem anywhere imminent with our current world state.
My computer is 15,000 times faster and costs in inflation adjusted dollars half that of my computer in 1995. There's zero reason to think that won't happen over the next 30 years again.
For whatever reason every generation thinks they are the peak. Naw man. You're just a blip at the bottom of the logarithmic chart.
- was the pause in model scaling a result of the benefits of RL & SFT being easier to access and quicker than scaling, or was it genuinely the result of scaling being low ROI now?
- are power densities necessary to provide high quality on device inference possible? Can the best, technically feasible, architectures accommodate T scale models and run them off batteries that fit in your hand?
- will things slow down enough to allow edge deployments to realise value vs. centralised deployments?
- do edge use cases drive enough revenue to get this to happen?
- can local inference make up for model scale? Does that make sense in a latency/power race with the central infrastructure? Is there a sweet spot here?
It has slowed down massively for CPUs at least, e.g. modern CPUs are hardly more than 3-5x faster than those from 10 years ago. There is zero reason to think that won’t happen over the next 10 years again.
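To put the two figures quoted in this exchange on the same scale, they can be converted to an implied compound annual growth rate. This is a minimal sketch: the 30-year span for the 15,000x claim is taken from the "next 30 years" phrasing and is an assumption about the exact period meant.

```python
def annual_growth(total_factor: float, years: float) -> float:
    """Compound annual growth rate implied by a total speedup over a period."""
    if total_factor <= 0 or years <= 0:
        raise ValueError("total_factor and years must be positive")
    return total_factor ** (1.0 / years) - 1.0


claims = {
    "15,000x over 30 years": (15_000, 30),
    "3x over 10 years": (3, 10),
    "5x over 10 years": (5, 10),
}
for label, (factor, years) in claims.items():
    print(f"{label}: {annual_growth(factor, years):.1%} per year")
```

The long-run claim works out to roughly 38% per year, against roughly 12% to 17% per year for the recent CPU decade, so both comments can be accurate at once: the historical average was steep and the recent rate is much flatter.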
This isn't a crazy statement (cpu performance metrics have mostly stalled their meteoric rise from prior to the 2000s).
But it also doesn't capture the entire picture.
CPU metrics mostly stalled for two reasons.
1. There wasn't much demand for the extra capacity. Even low end cpus from a decade ago are plenty capable for just browsing the web and typing up documents. It takes a novel use-case to drive demand again (or a desire to do things like play new games).
2. The interest in CPU development shifted in response to mobile. Given point #1 and the state of battery development.... the blocker wasn't "performance". It was "performance per watt". And on that metric you couldn't be more wrong.
Since ~2005, MIPS per watt has improved 15x to 30x.
Also - fun news is that the traditional CPU pipeline really isn't the bottleneck for AI workloads. So we're going to see incredible interest in things like memory bandwidth and other inference related hardware bottlenecks, which haven't already been optimized.
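A rough way to see why memory bandwidth matters so much for local inference: during token-by-token decoding, the active weights have to be read from memory for every generated token, so bandwidth divided by bytes read per token gives an upper bound on tokens per second. The sketch below assumes 400 GB/s (Apple's quoted figure for the M1 Max) and about 4.5 bits per weight for a 4-bit-class quant; both the function name and those numbers are illustrative assumptions, and KV-cache reads and compute limits are ignored.

```python
def decode_ceiling_tokens_per_sec(bandwidth_gb_s: float,
                                  active_params_billion: float,
                                  bits_per_weight: float) -> float:
    """Bandwidth-bound upper limit on decode speed (ignores KV cache and compute)."""
    bytes_per_token = active_params_billion * 1e9 * bits_per_weight / 8
    return bandwidth_gb_s * 1e9 / bytes_per_token


# Sparse model with ~3B active parameters vs. a dense 27B model,
# both at an assumed ~4.5 bits per weight on a 400 GB/s machine.
print(round(decode_ceiling_tokens_per_sec(400, 3, 4.5)))   # 237
print(round(decode_ceiling_tokens_per_sec(400, 27, 4.5)))  # 26
```

Real throughput lands well below these ceilings, but the ratio helps explain why sparse mixture-of-experts models with few active parameters feel so much faster on the same hardware than dense models of similar total size.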
> There wasn't much demand for the extra capacity. Even low end cpus from a decade ago are plenty capable for just browsing the web and typing up documents.
It stalled before the rise of PC-as-Internet-portal.
I bought a high end PC in 2003, and 5 years later the PCs were not much faster - probably not even 2x. Around 2008-2010 was when most people started using PCs as a way to connect to the Internet.
It stalled because scaling got a lot more challenging. Not because of lack of demand.
Yes, but it only stalled along a single dimension - Single core clock speed.
I was building gaming machines in the early 2000s, I absolutely remember the 4ghz wall that cpus hit.
But it wasn't a real wall... because we then got one of the arguably most influential processors ever in the Core 2 duo. Which... blew the limit away by giving you two processors clocked at 2.93 GHz each.
And honestly, even then - it was lack of demand (we could go to 4+ghz, but we didn't want to pay the power bill for the rest of the system - the planned pentium 5 was 7-10ghz on paper, but they canceled the project because keeping it fed and cool was too hard for personal desktop machines).
Of note - we did reach these speeds on consumer hardware (ex - in 2012, Andre Yang hit 8.794 GHz on an AMD FX-8350).
So it was never "impossible" to keep scaling. It just wasn't worth it compared to going multi-core.
---
And maybe it's because I was in my formative years at this time, but you're off by 5+ years with this:
> Around 2008-2010 was when most people started using PCs as a way to connect to the Internet.
Gmail was a web only email client released in 2004. Wikipedia was released in 2001. Web browsing was very much one of the "killer" apps for computers by the 2000s. What do you think the damn 2000s dot-com bubble crash was?
at the risk of aging myself - I was born in '89, and I literally do not remember a time where we didn't have DSL speeds and above (friends houses often still had dial-up until ~2005, though).
> Gmail was a web only email client released in 2004
Well, Gmail was actually one of the last web based email clients people used :-) Yahoo mail, Hotmail, and so many others predate Gmail by years.
> Web browsing was very much one of the "killer" apps for computers by the 2000s.
One of them. People still used non-browser apps for all kinds of things: Media consumption (people didn't watch movies on Youtube), Office (Google Docs was very much a niche thing for many years), photo-editing (lots of pirated versions of Photoshop/Lightroom years after the iPhone release), etc.
Most non-mail, non-social media, non-shopping stuff people do on the web these days was a dedicated SW from the vendor in those days. Want to make a photobook? Download this Windows binary and set it up there. It will then communicate with the server for the order (no browser utilized).
> at the risk of aging myself - I was born in '89, and I literally do not remember a time where we didn't have DSL speeds and above (friends houses often still had dial-up until ~2005, though).
Spring chicken! My first online experience was on a 340 baud modem :-)
Because I have a fixed expenditure on my local machine, and I can be absolutely sure of the costs over a long horizon (5+ years, for low end hardware life, 10+ years with moderate care). Not something that's true for cloud costs.
Your argument is actually really similar to an argument around the time Uber started kicking into gear and expanding.
It went:
---
"Why should I own a car when it's actually cheaper to just Uber for all my rides, compared to the cost of buying, maintaining, and insuring a car?"
---
And that wasn't an insane argument at that exact moment. Uber was pricing itself in the range of $5-$7 a ride, was novel and high quality.
Except take a look around today... Uber in my area went from ~$5 a ride to ~$27 a ride for the same trip. Uber's quality has also degraded quite a bit. It went from primarily high end, new cars with immaculately clean interiors to "average".
So want to make a wager on what's going to happen with cloud costs over the next decade for inference?
Because my strong hunch is they're going to follow exactly the same trend. They will stop being subsidized, providers WILL downgrade model quality to improve operating costs (and you'll have no control over this outside of enterprise contracts), and companies will start exploring "additional revenue options"... which means they'll shove ads and sponsored content into your results.
Is it worth being ~10-18 months behind the latest and greatest to avoid that entire set of shenanigans? I'd vote yes... I pay one time up front, and get usage limited by my hardware for the cost of electricity over a 10 year timeline. That's a decent deal with no surprises.
You're welcome to rent, but renting makes you subject to the whims of the owners. They're being very nice right now to attract all the flies. That's not a mistake, and it's absolutely a trap.
---
Side note - if you're only able to do labeling tasks with a local model... you're holding something very, very wrong.
Keep working on your agentfu because there is a sweet spot with subagents and parallelizable plans. It’s not about better, it’s about efficiency and picking the right model for the job. You can achieve the same results as frontier models with the right type of planning and context management on local Chinese models.
I was freaked out about being stuck with OpenAI and Anthropic. I set up qwen3.6:35b-mlx on my Mac Studio M1 Ultra and was really blown away. I am no longer afraid that Anthropic or OpenAI will be able to control the market.
This does not include any particularly large models. But the models it contains (Qwen3.6 27B and Qwen3.6 35B-A3B) are the local models people have been very excited about lately. So they didn't release any larger models, and the models people praise so much are from this most recent release.
If they stop releasing their larger models because they want to monetize, would we expect them to release better small models that can outcompete those?
there's pros and cons to it for them. Clearly, they get good branding (at least in enthusiast circles). Perhaps more important is they get community work on optimization. There have been significant performance uplifts on the Qwen3.6 models from the open-source community since they were launched (at a minimum, multi-token prediction is now working with them. It is almost a 2x token generation speedup)
I just want a tiny tiny model that runs on device that knows for autocomplete that, for example, I want to say "I'll be right back" instead of "I'll be right Brian". That's my #1 AI ask right now. Please, Apple.
I want Siri to let me “add to my calendar, dinner Peter’s house Sunday at 5pm” and not assume the location is the restaurant called Peter’s House in another state. It’s astounding how poor Siri is at using the data I’ve given it access to
Well, let's not forget that text models are not the only models! Video models are much slower and need comparatively more resources, and all they can do even at that size is generate videos a few seconds long. Clearly a ton more work is going to go into those, and demand for them will probably increase as more creative tools get authored using them as a central part of the workflow. Low-res local rendering for preview might be a thing, but the lion's share of the work for high-res, near-realtime rendering is going to be done on huge clusters for a long time yet.
This is definitely a good point. I imagine the max capacity for video models is significantly lower than for text models (there just aren't as many professionals in video as there are people who write text or code) but I could be wrong.
I think there’s still an open question around whether the ultra-large next-gen models are worth it. For those of us without early access to Mythos, it’s hard to verify whether it’s been held back from the public due to actually being “too dangerously powerful to release yet” as implied or because the gains aren’t outpacing the costs.
I think GPT 4.5 showed that there is indeed a practical limit we're close to. That was supposedly a high-trillions of parameter model that was deprecated almost immediately because it was slow, insanely expensive, and had questionable benefits over the smaller models. Though apparently the new Mythos and whatever GPT Spud is (if it wasn't 5.5) are back up in the high trillions.
Actually having used it a bit, I'm quite excited to see a modern model of similar size.
I think what people didn't realize was, just because the GPT-4.5 model didn't get better on the benchmarks, didn't mean the model wasn't different than the earlier models. It was being compared to thinking models that were being developed at the same time.
The GPT 4.5 model still has some of the most "human" like abilities in communication even though it isn't particularly good at problem solving. It hadn't undergone the same type of reinforcement training.
I still use GPT 4.5 sometimes, in creative exercises it can be surprisingly effective. The model is still available.
Yes and no. We've reached the point where larger models are higher quality, but they're also too expensive and slow to be used broadly. The giant models, however, are still useful for training smaller models that are actually deployable.
I use small models exclusively. They aren't a replacement for large models. You need decent hardware to run those models efficiently, as smaller parameter models plain suck and are still slow on macbooks. And affordability of higher end hardware is very limited.
Even at non VC subsidized $/token prices, it's still much cheaper to run cloud based models.
> Even at non VC subsidized $/token prices, it's still much cheaper to run cloud based models.
On a price-per-wattage level, this is not true, people have done the math on /r/LocalLLaMA many times over[1]. Local models, while not as good as premier models (GPT 5.5, etc.), are like ~80%+ of the way there, and often converge to a similar solution after a few dead ends.
Maybe not per watt, but unless you already happen to own the 3090 cited by that post, you'd have to buy that as well, and it is currently selling for around $1400 used.
3090s are running $1400 now? Wowsers. I thought I was overspending when I bought 6x of them for around $800 a pop.
Might be time to sell, to be honest. It's fun to have that at home, but I can't justify having $10k (with memory, mobo, cpu, etc) sitting in my basement without being fully utilized.
I do have a 3090 Ti on my gaming PC, but even my old M1 MBP (with a mere 32gb of RAM) is quite competent and can run a quantized `Gemma4-26B-A4B` in the background while I do other stuff.
When you are developing software, it's significantly faster to use Google Gemini and copy-paste code back and forth than to have Gemini edit files for you.
well to be fair that's right now, I think the question is what about in 6 months, 12 months, 2 years?
Where do these improvement curves go? Does the gap close, do they intersect for practical purposes (factoring in cost etc)? Or is the local curve always just a translation of the hosted, lagging behind, or indeed does hosted just pull ahead?
Nobody knows, but it's a very open question I feel, and it certainly appears like the answer might quite reasonably be that yes they intersect on that kind of short-ish term time horizon.
Large models haven't seen that much improvement, just better performance on small, specific tasks, all of it special-cased RL to game metrics.
For local models, it's the same story. You can download Gemma 3 QAT from last year, and it will be just as good as Gemma:31b on average. Qwen also boasts that it's better because, again, they RLed it to game some metrics. It's better at coding than Gemma, but Gemma is better at more creative thinking (again, all RL).
Fundamentally, you need detail in the gradients for the models to pick up on the smaller details. If you don't have those, your output is gonna suck. No amount of clever architecture is going to fix this.
The only way to improve local models is by training them to fetch context, and then their job becomes much simpler because all they need to do is reinterpret the fetched content and provide an answer. But fundamentally, if you are trying to keep things in house for advertising purposes, like all companies do with search, you want users to go to your service, which means running on your servers. And it's not really that much extra per invocation (i.e. excluding initial hardware costs) to instead just offer a large model as a service, which will be way better than any small model.
In the coding realm, I think we'll be seeing 35, 70, and 150B models sold where you pay a few hundred to a few thousand dollars up front and get a year of monthly/bi-monthly updates where they've trained it on new coding documentation and repos.
Right now there is no reason, since tokens are heavily subsidized. However, when OpenAI/Anthropic drop the $200/month pricing, since it will most likely become unsustainable eventually, you'd rather get a MacBook Pro M6 Ultra with 128GB RAM and go local than pay thousands every month for tokens.
I have his blog in my RSS app and I click every pelican test because it's fun. I think criticizing it for lack of scientific or technical rigor kind of misses its point. It's a fun curiosity.
Here it is on the latest Opus release 11 days ago, it’s the 5th highest voted comment on the post and the most critical comment is “should you at least try like 10 times or something to average the random effects”:
Interesting that Simon declared the pelican dead when qwen 27B overtook opus 4.7. That seems a strange criterion for deciding the utility of a benchmark, without more proof. I think it stems from the assumption that opus must be much larger. But I suspect that active parameters are more important than total parameters, and it is possible that the new opus is a very sparse MoE with close to 27B active params.
"there has been a direct correlation between the quality of the pelicans produced and the general usefulness of the models ...
Today, even that loose connection to utility has been broken..."