3 Months of Vibe Coding

TL;DR I use a lot of AI-assisted coding now. It is a huge productivity boost, but the agents still do a lot of things wrong and need to be guided by humans.

Background

Genesis Computing, my new employer, encourages the use of AI tools, specifically the Cursor IDE. We have virtually unlimited access to the most powerful “MAX” models. The last several weeks have been VERY different from everything I’ve done before. I often start by telling AI what features I need implemented or what bugs I need fixed in plain English, and the agent does 80% of the work. My job is to supervise it, make sure it does not go wild, review code, maintain proper design, and provide direction. However, it is still far from “vibe coding” in the pure sense. Agents don’t see the big picture, and if given full freedom, they will ruin your project quite quickly.

I’ve accumulated a list of battle stories that could fill several long posts, so I’ll focus on the most important things.

What is an agent

An “agent” is a tool like Cursor or Claude Code that calls an underlying LLM such as ChatGPT or Sonnet and uses it to work with your code base. It provides a set of tools to the model—things like “grep” or “edit file”—that the model can use to examine and manipulate your code. The agent also breaks up big tasks into smaller chunks and makes sure the model actually finishes the job. By themselves, models don’t handle long lists of TODO items very well; they tend to stop in the middle and declare victory too early.
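
To make the division of labor concrete, here is a minimal sketch of such a loop. Everything in it is hypothetical: call_llm() stands in for whatever API the agent uses to talk to the model, and real agents like Cursor or Claude Code are far more elaborate.

import subprocess
from pathlib import Path

# Tools the agent exposes to the model.
TOOLS = {
    "grep": lambda pattern, path=".": subprocess.run(
        ["grep", "-rn", pattern, path], capture_output=True, text=True
    ).stdout,
    "read_file": lambda path: Path(path).read_text(),
}

def call_llm(messages):
    """Hypothetical stand-in for the real model API."""
    raise NotImplementedError

def run_agent(task):
    messages = [{"role": "user", "content": task}]
    while True:
        reply = call_llm(messages)          # the model decides the next step
        if reply.get("tool"):               # e.g. {"tool": "grep", "args": {"pattern": "TODO"}}
            result = TOOLS[reply["tool"]](**reply["args"])
            messages.append({"role": "tool", "content": result})
        else:
            return reply["content"]         # the model says the task is done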

Things agents do well

Boilerplate code and prototypes

Agents are very good at building boilerplate code from scratch. They know every technology under the sun. Want a FastAPI server that implements the OpenAI protocol? Done in 5 minutes. Want a NextJS application that sends messages over WebSockets? No problem. Simple gRPC client/server pair? Right away!
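
To give a flavor of that boilerplate, here is roughly what the first example looks like: a sketch of a FastAPI server with an OpenAI-style chat completions endpoint that returns a canned reply. It is illustrative only; a real implementation of the protocol has many more fields and behaviors.

from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()

class Message(BaseModel):
    role: str
    content: str

class ChatRequest(BaseModel):
    model: str
    messages: list[Message]

@app.post("/v1/chat/completions")
def chat_completions(req: ChatRequest):
    # Return a canned answer in the OpenAI response shape (trimmed to the essentials).
    return {
        "object": "chat.completion",
        "model": req.model,
        "choices": [{
            "index": 0,
            "message": {"role": "assistant", "content": "Hello from the prototype!"},
            "finish_reason": "stop",
        }],
    }

# Run with: uvicorn server:app --reload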

This opens the door to prototyping at astronomical speeds. You can have a skeleton application in minutes and a decent proof-of-concept in hours. But don’t confuse a prototype with production code. Agents are not yet capable of building reliable production applications of any meaningful size on their own.

Code analysis

Agents can reason about code and answer questions like “describe the sequence of events when the user presses the STOP button,” or “how can I navigate to the Questions dialog in the UI,” or “do we ever read the GARBAGE_PILE table from the database.” The nice thing is that the agent does not judge: you can ask stupid questions, repeat questions, and it will patiently answer everything.

Model quality matters a lot here. Cheaper/faster models tend to be less precise. Even expensive ones like Opus 4.1 or Sonnet 4.5 may make incorrect assumptions without fact-checking. If you have a component named DetailsView and a tab named “Details,” they may assume the tab shows the DetailsView component without checking. If your code follows some non-trivial convention—e.g., “color” is always RGB and “colour” is always CMYK—they may treat it as a spelling variation. My favorite example is when Sonnet and Opus told me that the frontend distinguishes two types of messages using a field I had added to the backend 30 seconds earlier, which the frontend did not even know about.

Log analysis

In complex systems, logs quickly become too large for humans to analyze efficiently. Agents are better than people at searching for relevant lines, finding patterns, and connecting dots in a wall of text. Most of this is not magical—they use the same techniques a person would—but they do it faster and with fewer mistakes.

Keep in mind, however, that agents use LLMs that emulate human reasoning. If your log message would confuse a person, it will likely confuse the model as well. If your log says "Thread 12345 stopped", the LLM will assume the thread actually stopped and will base further reasoning on that. If your team decided that “thread stopped” really means “thread finished processing one line and will then continue,” the LLM will not know that and may arrive at the wrong conclusions.

Things agents do OK

Fixing bugs

The best results come when the agent can check on its own whether the bug is fixed, so the human is out of the loop entirely. It’s fun to watch the agent make hypotheses, write a fix, test it, fail, try again, and so on until it finds the solution. Sometimes it takes 3–4 iterations or more. More expensive models do better. Don’t try to fix nasty bugs with composer-1.

If human interaction is required, things slow down dramatically. The agent will apply a fix and declare victory (“the button should be right-aligned now”), only for you to run the code and find that the button hasn’t moved a pixel. Then it asks you to report what’s in the console, paste logs, etc. The more of this you can automate, the better.

Like humans, agents often jump to conclusions without verification. My favorite was: “Oh, you don’t see file creation messages? That must be because we’re not showing enough fields. Let me show the data length and timestamp in addition to the file name.” Of course, this had nothing to do with the issue.

Unit tests

Agents tend to build OK unit tests, but they need specific instructions. The best approach is to create an .md file explaining how you want your tests structured (e.g., given-when-then), how to name them, not to test private functions, not to mock private functions, not to touch real databases or networks, how to properly clean up mocks, etc. Once you’ve done this groundwork, the agent will write decent unit tests. Even then, it tends to give them awkward names like test_adder_handles_rational instead of test_adder_can_add_two_rational_numbers.
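
For illustration, this is the kind of test those instructions aim for: given-when-then structure, a descriptive name, no private functions, no real I/O. Adder is a made-up example, not code from the post.

from fractions import Fraction

class Adder:
    def add(self, a, b):
        return a + b

def test_adder_can_add_two_rational_numbers():
    # given
    adder = Adder()
    # when
    result = adder.add(Fraction(1, 3), Fraction(1, 6))
    # then
    assert result == Fraction(1, 2)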

Things agents do badly

Seeing the big picture

Every time you invoke an agent, it starts from a clean slate. It has no long-term memory. It sees your code through the peephole of the context window and is laser-focused on the current task. It excels at tactics and struggles with strategy. You’re dealing with a very well-educated junior developer who knows everything about everything and speaks 150 languages but cannot teach itself not to copy-paste code or modify items in place after they’ve been put into an asynchronous queue.

Whenever there is a difficult problem, the LLM will prefer a local quick fix and will not recognize systemic failure. For example, it will introduce fallbacks, artificial delays, or repeated messages, but it will never tell you “dude, you can’t do this with polling reliably—switch to a push-based model.”

Keeping the code clean

LLMs strongly prefer local modifications and dislike refactoring. They may produce an excellent first version, but after a few iterations, spaghettification is inevitable. Here’s an actual AI-produced piece of code with only slight editing:

if function is not None:
    try:
        s_arguments = json.loads(arguments)
    except json.JSONDecodeError as e:
        if fixed_args:
            s_arguments = json.loads(fixed_args)
            arguments = fixed_args
        else:
            err_msg = f"Malformed JSON in tool arguments for {func_name}: {e!s}"
            safe_completion_callback({"success": False, "error": err_msg})
            return
    if getattr(function, "is_mcp", None):
        safe_completion_callback(function(s_arguments, status_update_callback))
        try:
            elapsed = time.time() - start_time
            logger.info(...)
        except Exception:
            pass
        try:
            from mycompany.core.metrics import metrics
            metrics.tool_call_finished(thread_id, func_name)
        except Exception:
            pass
        return

Automatic linters help but give no guarantees. If your linter warns about deep nesting, long functions, swallowed exceptions, etc., the agent will generally try to minimize those—but not always.
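
For contrast, here is one way the same flow could be kept flatter, with parsing and bookkeeping pulled out of the happy path. This is only a sketch of the is_mcp branch above: the helper names are mine, and safe_completion_callback, metrics, and the rest are taken from the excerpt and treated as given.

import json
import logging
import time

from mycompany.core.metrics import metrics  # module-level import instead of a local one

logger = logging.getLogger(__name__)

def _parse_tool_arguments(arguments, fixed_args, func_name):
    """Parse tool-call arguments, falling back to the repaired JSON if available."""
    try:
        return json.loads(arguments)
    except json.JSONDecodeError as e:
        if fixed_args:
            return json.loads(fixed_args)
        raise ValueError(f"Malformed JSON in tool arguments for {func_name}: {e!s}") from e

def _record_tool_call(thread_id, func_name, start_time):
    """Best-effort logging and metrics; failures here must not break the tool call."""
    try:
        logger.info("Tool %s finished in %.2fs", func_name, time.time() - start_time)
        metrics.tool_call_finished(thread_id, func_name)
    except Exception:
        logger.debug("Failed to record metrics for %s", func_name, exc_info=True)

def invoke_mcp_tool(function, arguments, fixed_args, func_name, thread_id,
                    start_time, safe_completion_callback, status_update_callback):
    try:
        s_arguments = _parse_tool_arguments(arguments, fixed_args, func_name)
    except ValueError as err:
        safe_completion_callback({"success": False, "error": str(err)})
        return
    safe_completion_callback(function(s_arguments, status_update_callback))
    _record_tool_call(thread_id, func_name, start_time)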

Following software design principles

Agents struggle with adhering to design principles that require global understanding. No matter how much you instruct them, they will:

  • Ignore DRY and repeat the same code in multiple places.
  • Ignore separation of concerns and create huge functions, classes, and files.
  • Complicate code with unnecessary checks or edge cases that can’t possibly happen. Using getattr even when the attribute is guaranteed to exist is my favorite (see the snippet after this list).
  • Introduce bugs by skipping necessary checks or not handling edge cases that can happen.
  • Reuse poorly written half-baked features (“Perfect! I see you already have a shoot_yourself_in_the_foot() function!”).
  • Alter code unnecessarily when moving it.
  • Create race conditions.
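
The getattr habit, for example, looks like this. The names are hypothetical; assume every tool_call in the codebase is constructed with a name attribute.

# What the agent tends to write: a defensive getattr with a fallback.
def describe_tool_call_defensively(tool_call):
    name = getattr(tool_call, "name", None)
    if name is None:
        return "unknown tool"
    return f"calling {name}"

# What the code actually warrants:
def describe_tool_call(tool_call):
    return f"calling {tool_call.name}"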

You can vibe-code a 5K-line prototype without looking at the implementation and it might work, but anything larger requires code review.

Verifying assumptions

LLMs are language models. They are not good at strict logical reasoning. They will often come up with a hypothesis (“the thread must be hanging because we don’t have enough file descriptors”), not verify it, and continue as if it were true. Sometimes the hypothesis is concrete, sometimes it’s vague (“the client must somehow be getting the stop message via a different channel”).

They also get confused by unusual conventions or meanings of words. E.g., we have legacy artifacts spelled with a “c” and newer ones with a “q,” similar but not identical entities. The LLM readily assumes they’re the same and treats “artifaqt” as a misspelling.

Following instructions

LLMs are probabilistic machines. No matter how crisp and unambiguous your instructions are, they WILL be violated sooner or later. Give an LLM a red button labeled “DO NOT PRESS, THIS BUTTON DESTROYS THE WORLD,” and given enough time, it WILL press it. And then apologize for the mistake.

Some LLM habits are very hard to fight:

  • long explanations full of bullet points
  • using local imports in Python
  • inserting non-ASCII emojis into user-facing messages: ❌ Access denied, the operation cannot continue.

You can reduce these habits, but you can’t eliminate them.

Consistency

I once asked the LLM to review my code and report problems. It pointed out some unhandled edge cases and missing None checks. We fixed them. In the same session, I asked the LLM to generate more code in the same class. It produced code with the EXACT same errors we had just fixed. Same model, same session, same context window. When I asked it to review THAT code, it found the errors and fixed them. Why didn’t it write correct code in the first place? It emulates humans too well, I guess.

Some models are better at this than others. For example, if you ask composer-1 to write unit tests, it will, but it typically neglects to run them after changes unless you remind it. Claude Opus 4.5, on the other hand, will usually run the unit tests on its own and fix any problems.

Perseverance

A lot of work has clearly gone into making agents avoid infinite loops and not frustrate users.

The result: agents are very quick to declare victory (“I found the issue!”) but also quick to give up. The most infuriating part is that they rarely admit defeat; they present giving up as another victory. Tests that can’t be fixed get commented out (“all done, all tests pass”). Bugs that can’t be analyzed are “resolved” with useless statements like “Critical finding: thread A somehow fails to get the stop message.” Hard-to-implement features get quietly omitted.

Sometimes asking the agent to try harder helps. Sometimes switching to a more expensive model helps. Sometimes nothing helps and the agent runs in circles.

Refusing bad orders

Often, if you ask the LLM to shoot itself in the foot, it will happily do so. Examples:

  • You tell it to use a method or class that does not exist. It won’t object—it will just generate invalid code.
  • You tell it to call server-side code from the client. It will do it, and the code blows up at runtime.
  • You tell it to use a tool incorrectly (e.g., async with on a synchronous generator, as in the sketch after this list). It will do it, and the code blows up at runtime.
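
That last case looks something like this. The exact exception text varies by Python version, but the failure only shows up at runtime, never at generation time.

import asyncio

def rows():
    # a plain synchronous generator
    yield 1
    yield 2

async def main():
    # The agent will happily generate this if asked to; it fails at runtime with a
    # TypeError/AttributeError about the missing async context manager protocol.
    async with rows() as r:
        print(r)

asyncio.run(main())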

Conclusion

Agents are amazing and a huge productivity boost. You feel this most when there’s an outage and you have to do things the old way. You suddenly realize how much you rely on agents and how slow everything feels without them. But at the current stage, rumors that agents will replace humans are greatly exaggerated. Vibe-coding anything beyond a small prototype is not yet possible: the program becomes an unreliable mess faster than you can say “spaghetti race condition.” Code review and directional oversight are required.

But with proper usage, agents are REALLY helpful, and I can hardly imagine going back. Over time we develop a growing bag of tricks to make the agents do what we want and avoid pitfalls. The models also keep getting better. I don’t see agents taking our jobs anytime soon, but knowing how to use them will become, or maybe already has become, an essential skill.
