
Great website, 2 things:

1 - gpt-image-2 seems to pass the Flat Earth test? (If not, I'm sure the paid thinking 2k version passes it.)

2 - Since NB2 came out earlier, many gold medals are assigned to it, even though GI2 now passes those tests too. For example, on the Octopus test NB2 took 14 attempts but GI2 just 2. (BTW, the number of attempts should affect the score, I guess?)



So if you zoom in (click the zoom button on the actual gpt-image-2 image for the Flat Earth test), you’ll see that a lot of the people are anatomical impossibilities, which is one of the disqualifying criteria on the list. The faces also look like melted candles.

This is one of those areas where even state-of-the-art models still struggle. You’re asking for a high level of detail at a per-person level, which means you end up with lots and lots of very small objects that all need to be rendered with convincing detail.

I should probably explain the scoring rubric better - it's in the (i) info icon. If you click the pass/fail button towards the top, it switches from a simple pass/fail view to a weighted score. That weighted score is based on three things: level of adherence to the prompt, visual fidelity, and the number of attempts.
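As a rough sketch of how a weighted score like this could combine the three factors: the weights, the 0 to 1 scales, the attempt cap, and every name below are made up for illustration and are not the site's actual formula.

```python
from dataclasses import dataclass


@dataclass
class Result:
    adherence: float  # hypothetical 0..1 rating of prompt adherence
    fidelity: float   # hypothetical 0..1 rating of visual fidelity
    attempts: int     # number of generations needed, at least 1


def weighted_score(r: Result,
                   w_adherence: float = 0.5,
                   w_fidelity: float = 0.3,
                   w_attempts: float = 0.2,
                   max_attempts: int = 20) -> float:
    """Combine the three factors into one score in 0..1.

    All weights and the linear attempt penalty are assumptions.
    """
    if r.attempts < 1:
        raise ValueError("attempts must be at least 1")
    # First-try success scores 1.0; the score falls linearly to 0.0 at the cap.
    capped = min(r.attempts, max_attempts)
    attempt_score = 1.0 - (capped - 1) / (max_attempts - 1)
    return (w_adherence * r.adherence
            + w_fidelity * r.fidelity
            + w_attempts * attempt_score)


# Same image quality, different attempt counts (2 vs 14, as in the octopus example).
few = weighted_score(Result(adherence=0.9, fidelity=0.9, attempts=2))
many = weighted_score(Result(adherence=0.9, fidelity=0.9, attempts=14))
print(few > many)
```

Under these assumed weights, two images of equal quality are separated only by the attempt term, so the one that needed fewer tries ends up with the higher score.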

I've tried to keep my criteria as objective as possible, but there's just a certain level of unavoidable subjectivity to it.

For example, with the octopus image: even though the minimum criterion might be five tentacles covered, having all eight is much closer to the ideal of “an octopus,” so it usually gets bumped up to a higher rating (bronze, silver, gold).

Honestly, I think I agree that gpt-image-2 probably should be upgraded to a gold medal. Thanks for pointing that out!



