
On a slightly off-topic note: Codex is absolutely fantastic right now. I'm constantly in awe since switching from Claude a week ago.


I'm currently "working" on a toy 3D Vulkan PhysX thingy. It has a simple raycast vehicle and I'm trying to replace it with the PhysX 5 built-in one (https://nvidia-omniverse.github.io/PhysX/physx/5.6.1/docs/Ve...)

I point it to example snippets and web documentation, but the code it generates won't work at all, not even close.

Opus4.6 is a tiny bit less wrong than Codex 5.4 xhigh, but still pretty useless.

So, after reading all the success stories here and everywhere, I'm wondering if I'm holding it wrong or if it just can't solve everything yet.


LLMs can still mostly only do trivial things. They're always going to do very bad work outside of what your average web developer does day-to-day, and even those things aren't a slam dunk in many cases.


That fits with my experience. I used Claude Code to put together a pretty complex CRUD app and it worked quite well. I prompted it to write the code for the analysis worker, and it produced some quite awful code with subtle race conditions which would periodically crash the worker and hang the job.

On the plus side, I got to see first-hand how Postgres handles deadlocks and read up on how to avoid them.


I don't know about "only doing trivial things". I've built a fully threaded webmail replacement for Gmail using IMAP. It indexes mail to Postgres, and a local Django web app renders everything in a Gmail/Outlook-style threaded view with text/HTML bodies, attachments, and a better local search than Gmail. It all runs locally. It started as a "could I?" and ended up exceeding all my expectations.


That would be considered a trivial thing, why wouldn't it be? It's just basic CRUD you're doing. Nothing unique, and nothing that hasn't been written about tens of thousands of times across millions of books/blogs/comments before.


I used it to analyze a single-player game binary from Steam, hook into all the relevant game-state-modifying functions, add full multiplayer state sync, and then also hook all the relevant portions of the UI to add multiplayer to the game.

I'm not saying it's the hardest thing but I also wouldn't consider it trivial.


While I’ve had tremendous success with Golang projects and TypeScript web apps, when I tried to use Metal Mesh Shaders in January, both Codex and Claude had issues getting it right.

That sort of GPU code has a lot of concepts and machinery; it’s not just a syntax to express, and everything has to be just right or you will get a blank screen. I also use mesh shaders differently than most examples: I use them for data viz (turning data into meshes), and most samples are about level of detail. So a double whammy.

But once I pointed either LLM at my own previous work (the code from months of my prior personal exploration and battles for understanding), they both worked much better. Not great, but we could make progress.

I also needed to make more mini-harnesses / scaffolds for it to work through; in other words isolating its focus, kind of like test-driven development.


It works somewhat well with trivial things. That's where most of these success stories are coming from.


Exactly this. The SNR is polluted by this anecdata because someone was able to implement a CRUD backend they couldn’t before.


My impression is that it always comes down to how well what you’re trying to do pattern-matches the training set.


When it comes to agents like Codex and CC, it seems to come down to how well you can describe what you want to do, and how well you can steer it to create its own harness to troubleshoot/design properly. Once you have that down, I haven't found a lot of things you cannot do.


Breaking down and describing things in sufficient detail can be one way to ensure that the LLM can match it to its implicit knowledge. How much detail you have to spell out to the LLM still depends on what you’re trying to do. It’s almost a tautology that there’s always some level of description that the LLM will be able to take up.


Well, not just breaking down the task at hand, but also how you instruct it to do any work. Just saying "Do X" will give you very different results from "Do X, ensure Y, then verify with Z", regardless of what tasks you're asking it to do.

That's also how you can get the LLM to do stuff outside of the training data in a reasonably good way, by not just including the _what_ in the prompt, but also the _how_.
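
To make the "Do X, ensure Y, then verify with Z" shape concrete, here is a minimal sketch of assembling such a prompt. The function name `build_prompt` and the exact wording of the labels are hypothetical, chosen only for illustration; no particular agent or API is assumed.

```python
def build_prompt(task, constraints, verification):
    """Compose a prompt that states the what, the how, and the check.

    task: the thing to do (the "X").
    constraints: conditions the result must satisfy (the "Y" items).
    verification: steps the agent should run to check its work (the "Z" items).
    """
    lines = [f"Do: {task}"]
    lines += [f"Ensure: {c}" for c in constraints]
    lines += [f"Then verify with: {v}" for v in verification]
    return "\n".join(lines)


if __name__ == "__main__":
    # Hypothetical example values, not taken from any real project.
    print(build_prompt(
        "replace the raycast vehicle with the built-in vehicle",
        ["the existing input handling keeps working"],
        ["a build and a short simulation run"],
    ))
```

The point is only that the constraints and the verification step travel with the task instead of being left implicit.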


Instead of "pointing it" at docs, you need to paste the docs into context. Otherwise it will skim small parts by searching. Of course if you're using an obscure tool you need to supply more context.

Xhigh can also perform worse than High - more frequent compaction, and "overthinking".


I’ve noticed the models still can’t complete complex tasks

Such as:

Adding fine curl noise to a volumetric smoke shader

Fixing an issue with entity interpolation in an entity/snapshot netcode

Finding some rendering bugs related to lightmaps not loading in particular cases (it actually introduced this bug).

Just basic stuff.


They are definitely behind in 3D graphics in my experience. But surprisingly decent at HPC/low-level programming. I think they are definitely training on ML stuff to perhaps kick off recursive self-improvement.


Nah, it only lives up to the hype for CRUD apps and web UI. As soon as you stop doing webshit it becomes way less useful.

(Don’t get mad at me, I’m a webshit developer)


Most of the folks are building CRUD apps with AI and that works fine.

What you're doing is more specialized and these models are useless there. It's not intelligence.

Another NFT/Crypto era is upon us, so no, you're not holding it wrong.


This is pretty wrong. Anyone who thinks this stuff is similar to NFTs and crypto hasn’t been paying attention.


Indeed this time it's different


" or if it just can't solve everything yet."

Obviously it cannot. But if you give the AI enough hints, a clear spec, and clear documentation, and remove all distracting information, it can solve most problems.


Most simple problems with plenty of prior art, sure


Codex/GPT5.4 is just superior to Opus4.6 for coding. I swear it costs me 1/2 of the tokens to achieve the same results, and it always follows the plan through to completion, compared to Opus, which takes shortcuts and sweeps things under the rug until I discover them through testing.

I'm not accusing anyone of foul play and I don't have financial interests in either company, but it feels like "something" within Claude Code/Anthropic models is optimizing to make you spend more tokens instead of helping you complete the task.


I also switched from Claude to Codex a few weeks ago. After deciding to let agents only do focused work, I needed less context, and the work was easier to review. Then I realized Codex can deliver the same quality, and it's paid through my subscription instead of per token.


Codex has been good quality-wise, but I hit limits on the Codex team subscription so quickly it's almost more hassle than it is worth.


I made this switch months ago, ChatGPT 5.4 being a smarter model, but I’ve had subjective feelings of degradation even on 5.4 lately. There’s a lot of growth in usage right now, so I'm not sure what kind of optimizations they're doing at both companies.


Agreed. Watching the intermediate "Thinking about X ... Now I'll do Y" text on GPT 5.4 lately has been like watching a hypothetical smart drug wear off.

All of the major models have been getting worse lately, not just Opus.


Makes me wonder if the output is starting to get back into the training input and we're seeing the first signs of model collapse.


Business model collapse, maybe.

Can't wait, I need to buy some RAM for my local model server.


I use Codex at home and Opus at work. They're both brilliant.


I would switch to Codex, but Altman is such a naked sociopath and OpenAI so devoid of ethical business practices that I can't in good conscience. I'm not under any illusion that Anthropic is ethical, but it is so far a step up from OpenAI.


Enemy-centered decision-making.


I'm with you on the ethical part, but everything is a spectrum. All the AI leadership are some shade of evil. There's no way the product would be effective if they weren't. I don't like that Sam Altman is a lunatic, but frankly they all are. I also recognize that these are massive companies filled with non-shitty engineers who are actually responsible for a lot of the magic. Conflating one charlatan with the rest of it is a tragedy of nuance.


Yeah, but there's a distinct difference between "risks their company because they refuse to help with killing little kids" and "happily helping with genocide".

One of these is better.


Can't you use Codex (which is open source, unlike Claude Code) with Claude, even via Amazon Bedrock?


Codex with Anthropic's models is not as good as using the models with the harness they were trained for. The same goes vice versa.


[dead]


There's not one thing that stands out, but he abandoned all of OpenAI's core principles (did a 180), constantly lies to people, and doesn't plan to stop.

https://www.newyorker.com/magazine/2026/04/13/sam-altman-may...



