
It feels absurd.

A Chat-GPT4 model will generate unit tests? Hallelujah! I hate writing tests! Yay!

Except how do you know it's generating the right tests? Can it explain its reasoning? Unit tests are a weak form of automated specification. Why are we inferring our specifications from examples to begin with? Who is going to walk through the reasoning and verify these are the right tests and make sense and specify the correct properties? Can Chat-GPT discover properties and prove theorems?
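To make the "specification from examples" point concrete, here is a minimal sketch contrasting an example-based test with a property check. The function `sort_ints` and both test helpers are hypothetical names chosen for illustration, and the random-input loop is a stdlib stand-in for a real property-testing library.

```python
import random
from collections import Counter


def sort_ints(xs):
    """Hypothetical function under test; stands in for any implementation."""
    return sorted(xs)


def example_test():
    # Example-based: pins down a single input/output pair and nothing else.
    assert sort_ints([3, 1, 2]) == [1, 2, 3]


def property_test(trials=200, seed=0):
    # Property-based: states what must hold for every generated input.
    rng = random.Random(seed)
    for _ in range(trials):
        xs = [rng.randint(-50, 50) for _ in range(rng.randint(0, 20))]
        out = sort_ints(xs)
        # Property 1: the output is in non-decreasing order.
        assert all(a <= b for a, b in zip(out, out[1:]))
        # Property 2: the output is a permutation of the input.
        assert Counter(out) == Counter(xs)
```

The example test would still pass for an implementation that special-cases `[3, 1, 2]`; the two properties would not. Someone still has to decide that these are the right properties, which is the part in question.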

We used to do this in code review where humans could explain their reasoning. Now we have Chat-GPT4 which will give you a plausible-sounding answer that is completely wrong and makes no sense. We have to read every line it generates and make sure it contains no errors, is properly specified, makes sense, etc... something we're extremely ill-equipped to do.

The problem of programming, for me, hasn't been about how much code I write or how quickly I write it. It has always been about solving the right problems with elegant solutions. The code itself is an artifact of the real work.

CoPilot just doesn't really help here. It doesn't understand specifications and doesn't do any reasoning. It can't take a specification, generate a program, discover new abstractions that make the solution more elegant, and explain its reasoning. It can generate a heck of a lot of code though! Wow! Is it the right code? Maybe!

But that's what we get with humans, right? No!

Humans can explain their reasoning.



Copilot user here.

Copilot (the existing gpt-3 one) definitely helps at writing unit tests. Yeah, sometimes it doesn't nail it, but one thing it can do reliably is to repeat a pattern, and I don't know about you, but my unit tests tend to repeat the same pattern (with some tweaks to test this-or-that-case). Quite often it infers the correct change from the name I gave the test method, but even if it doesn't it'll write a 90% correct case for me. I imagine the GPT-4 version will do more of the same with better results.

It cannot replace reasoning, but it can augment it (by suggesting patterns and implementations from its latent space that I hadn't thought of), and worst case it can replace quite a bit of typing.

Long-term, it remains to be seen how far bigger/better/stronger LLMs can push the illusion of rationality. In many fields, they may be able to simply build their ability to pattern-match to beyond some threshold of usefulness. Perhaps (some subset of) programming will be one of them.


> one thing it can do reliably is repeat a pattern

Isn't this something we've built into every modern language (and arguably the entire point of languages)? If you have multiple pieces of code that share code with tweaks (to test this or that case, for example), shouldn't you parameterize the bulk of the code instead of getting autocomplete to parameterize it for you and dump it into your source file multiple times?
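A rough sketch of what that parameterization can look like, using the standard library's `unittest` and `subTest`. The `clamp` function and the case table are made up for illustration; they are not from anyone's actual project.

```python
import unittest


def clamp(value, low, high):
    """Hypothetical function under test."""
    return max(low, min(value, high))


# One row per case: (description, value, low, high, expected).
CASES = [
    ("below range", -5, 0, 10, 0),
    ("inside range", 5, 0, 10, 5),
    ("above range", 15, 0, 10, 10),
]


class ClampTest(unittest.TestCase):
    def test_clamp(self):
        for name, value, low, high, expected in CASES:
            # subTest reports each failing row separately instead of
            # stopping at the first failure.
            with self.subTest(name):
                self.assertEqual(clamp(value, low, high), expected)
```

Adding a case is then one more row in `CASES` rather than another copy of the test body.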


Testing best practices have the opposite philosophy for the most part. Avoid abstraction as much as possible. Do repeat yourself. Because a bug in tests is insidious, so you want to minimize that. One of the best ways to minimize bugs is to explicitly avoid abstraction.


Copilot helps me write parametrized unit tests. You might find this example useful: https://til.simonwillison.net/gpt3/writing-test-with-copilot


Oh shit, you're right, I forgot about loops. Guess I'll go uninstall copilot now.


I've had a "Generate... -> Test for function" action in my JetBrains IDE out of the box for several years, and it takes care of boilerplate pretty well.


That just generated an empty test function in a convenient place for me. I'm not just talking about boilerplate, it's definitely a more... organic-feeling sort of pattern matching. In fact, one of the things I find most interesting about it is the sort of mistakes it makes, like generating wrong field names (as if it simply took a guess). This is the sort of thing that I've grown to expect the deterministic tooling of IDEs to get right, so it always surprises me a bit.

By the same token, often it takes a stab at generating code based on something's name (plus whatever context it's looking at) and does a better job than the IDE could, because the IDE just sees datatypes and code structure. It really does feel like a complementary tool.


You haven't tested how powerful copilot is, have you?


Why is it absurd? If it writes readable code, you can check it. Writing the first version often takes a long time, so this is clearly a breakthrough.


Then there's a vicious circle. You need to have technical expertise to evaluate whether the code from these models is fit for purpose. Until these models get sufficiently reliable that you can use them without worrying if the results are correct then you still need developers. This may be much better than Stack Overflow, but I imagine it will still suffer from the same problems with regard to copying "answer" code.

As an example, I got an answer from ChatGPT where it confidently told me that I should evaluate an object detection model by ranking matches using negative IoU. It generated code to do it and gave a confident explanation of how this was normal in computer vision, but it was completely backwards.
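For reference, a minimal sketch of the conventional direction, assuming axis-aligned boxes given as `(x1, y1, x2, y2)` and assuming that a higher IoU means a better match. The function names are illustrative, and this is not the code ChatGPT produced.

```python
def iou(box_a, box_b):
    """Intersection over union for boxes given as (x1, y1, x2, y2)."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    # Width and height of the overlap, clipped at zero when boxes are disjoint.
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0


def rank_matches(prediction, ground_truths):
    """Return ground-truth boxes ordered best match first (highest IoU)."""
    return sorted(ground_truths, key=lambda gt: iou(prediction, gt), reverse=True)
```

Identical boxes score 1.0 and disjoint boxes score 0.0, so a ranking that puts the lowest IoU first puts the worst match on top.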


I would much rather use it as a code review tool than become the code review tool. I suspect the latter will happen at a lot of companies, though.


Why? When the latter is just as effective but also much quicker?


It isn't just as effective for me. No matter how much I'd like to, I can't review code with the same thoughtfulness and thoroughness that I apply when I write it. I know the same is true for the people who have reviewed my PRs as well, but maybe it's different for others. I do use Copilot, but mostly it only generates one-liners for me that save a little time.


Empirical studies on large-scale projects employing informal code review (the study I'm talking about monitored the Qt project repositories) suggest that human review has a very low impact on error rates. Reviewing more than ~200 LOC every couple of hours makes the effect disappear.

So you're not alone. You can even point to the plethora of "find the undefined behaviour" tests: humans are really bad at finding errors in code.


Based on your statement, I feel you haven't used GPT-based code much. The code it generates is generally beautifully organized and commented.

Sure, it can be flat out wrong, but it is always eminently readable. Self-documenting code with clean comments, far above the standard I see in average human code. That's about as good as "explaining its reasoning" gets.

Also, you can literally ask ChatGPT to explain the code, line by line, and it will. So there's that.

Is it ready now? Certainly not if it will be expected to do the entire job of a SWE. But it is already extremely useful, especially to less experienced devs, as both a production tool for specific tasks and a learning tool. And it will only get better.


I'm not sure this works out in the long run. We're currently using extremely generalized tools here; it's difficult for them to establish any "reasoning" when they do not have a "history" to rely on, so to speak. That is where our reasoning stems from: history. We opt for Solution B because we previously tested Solution A in another similar project.

I just don't see this as being a barrier for too long. Not when more companies opt into training data internally.


That won't happen using only text. Language does not capture the full spectrum of thought, which is what is at work in the case of an expert. Companies training their custom models or increasing the model size, etc. won't change this fundamentally, but it might help in some aspects. We need models that can observe/capture the "history" from everything else (i.e. the physical world) in addition to text. Or perhaps mix LLMs with other models in order for our models to possess something we call "common sense". This common sense is needed when you transition from "Solution B" to "because we previously tested Solution A in another similar project". There is a LOT going on in this transition between A and B that might not be apparent to you.

But in general I agree with the basic premise of what you say, in that it will eventually happen.


this all sounds very plausible and convincing until this part:

> But that's what we get with humans, right? No!

> Humans can explain their reasoning.

can you explain your reasoning as to why a language model would never be able to match a human's ability to explain its reasoning?

i've met a lot of humans that are quite bad at this, and i will likely never know for sure why they wrote the unit tests that they wrote unless i rewrite those tests myself.

but if you have them explain their reasoning enough, and if that reasoning is plausible enough, and if the relationship between that reasoning and what they did is strong enough, consistently enough — you start to trust them.

you don't trust gpt4 to write code for you. which makes sense. but that doesn't mean as much as you think it means, i think.


> can you explain your reasoning as to why a language model would never be able to match a human's ability to explain its reasoning?

It’s a language model, not a reasoning model. A lot of its training data happens to be logical, so it sounds logical, but it’s still just acting on probability. Thinking it’s “explaining” anything it produces is a mistake.


> Except how do you know it's generating the right tests? Can it explain its reasoning?

In my own project I've got it generating solutions to errors, and if possible, it also generates simple unit tests to validate the fix. What I haven't implemented yet (and likely won't, because Copilot does it now), but have tested, is generating a pull request describing the fix and why it works, and the same for the tests.

> Now we have Chat-GPT4 which will give you a plausible-sounding answer that is completely wrong and makes no sense.

With limited scopes, it works quite well. For example, something fails because a DOM reference is undefined in a React component. GPT will add a condition to assert that the reference is present, then generate a simple test which mounts the component with stubbed references that are present or undefined using jsdom. The tests make sense. A quick scan shows they're sensible, and upon running them, they do work.

I began adding a recursive feature which would automatically debug issues with its own solutions, but it can get a little weird in some cases. Likely due to bad prompting – I haven't dedicated enough time to it. But it can also make it so tests with errors are revised and corrected so they will at least run.

All of that with a coherent explanation of what was changed, why, tests, and why they assert the fix is valid.

Is it perfect? No. Could it be useful? Absolutely. I'm a little sad Copilot makes my project redundant because it was actually very exciting to build. There is real potential here. I started the project in order to learn and validate GPT, and I'm very convinced it has genuine utility and massive potential.

> CoPilot just doesn't really help here. It doesn't understand specifications and doesn't do any reasoning. It can't take a specification, generate a program, discover new abstractions that make the solution more elegant, and explain its reasoning. It can generate a heck of a lot of code though! Wow! Is it the right code? Maybe!

I think the key is limited scopes. Like with the React component example, the solution is small, easy to reason about, and tedious to resolve yourself. I understand why it doesn't work, I get the error, and spinning up an entire branch and PR to clean up the mistake is a bad use of my time. I don't want Copilot/AI to work magic, but I'm okay with it resolving minor mistakes and misuses of languages and libraries here and there.

I do think it will grow from here to do more and actually be good at it, though.


Just a note: I think it's 3.5 for the code work. 4 would probably be prohibitively expensive to run, and they carefully mention that they use 4 for the PRs and a few other bits but not the code gen; there they just talk about ChatGPT. I'd love to be wrong about this.


> Can it explain its reasoning?

Yes, actually, it can.


That's interesting, considering the fact that it doesn't do any reasoning. Sure, it can generate plausible sounding explanations, but it generates those just like it generates everything: one token at a time, based on the expected probability of that token appearing next if this text were found in the wild.
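A toy sketch of that "one token at a time" loop, for anyone who hasn't seen it written down. Everything here is an assumption for illustration: the probability table is made up, and it conditions only on the previous token, whereas a real LLM computes the distribution from the whole preceding context with a neural network.

```python
import random

# Toy table of P(next token | current token). Values are invented.
NEXT_TOKEN_PROBS = {
    "<s>": {"the": 0.6, "a": 0.4},
    "the": {"test": 0.7, "code": 0.3},
    "a": {"test": 0.5, "bug": 0.5},
    "test": {"passes": 0.8, "fails": 0.2},
    "code": {"works": 1.0},
    "bug": {"appears": 1.0},
}


def generate(max_tokens=5, seed=0):
    """Sample a sequence by repeatedly drawing the next token."""
    rng = random.Random(seed)
    tokens = ["<s>"]
    for _ in range(max_tokens):
        dist = NEXT_TOKEN_PROBS.get(tokens[-1])
        if dist is None:
            # No known continuation for this token: stop generating.
            break
        choices, weights = zip(*dist.items())
        # Draw one token in proportion to its probability.
        tokens.append(rng.choices(choices, weights=weights, k=1)[0])
    return tokens[1:]  # drop the start marker
```

Nothing in the loop checks whether the output is true or consistent; it only follows the probabilities, which is the point being made above.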



