
{"id":5376,"date":"2025-12-08T03:00:54","date_gmt":"2025-12-08T08:00:54","guid":{"rendered":"https:\/\/ikriv.com\/blog\/?p=5376"},"modified":"2025-12-08T03:00:54","modified_gmt":"2025-12-08T08:00:54","slug":"3-months-of-vibe-coding","status":"publish","type":"post","link":"https:\/\/ikriv.com\/blog\/?p=5376","title":{"rendered":"3 Months of Vibe Coding"},"content":{"rendered":"<p><b>TL;DR<\/b> I use <b>a lot<\/b> of AI-assisted coding now. It is a huge productivity boost, but the agents still do a lot of things wrong and need to be guided by humans.<\/p>\n<h2>Background<\/h2>\n<p><a href=\"https:\/\/www.genesiscomputing.ai\/\">Genesis Computing<\/a>, my new employer, encourages the use of AI tools, specifically the Cursor IDE. We have virtually unlimited access to the most powerful \u201cMAX\u201d models. The last several weeks have been VERY different from everything I\u2019ve done before. I often start by telling AI what features I need implemented or what bugs I need fixed in plain English, and the agent does 80% of the work. My job is to supervise it, make sure it does not go wild, review code, maintain proper design, and provide direction. However, it is still far from \u201cvibe coding\u201d in the pure sense. Agents don\u2019t see the big picture, and if given full freedom, they will ruin your project quite quickly.<\/p>\n<p>I\u2019ve accumulated a list of battle stories that could fill several long posts, so I\u2019ll focus on the most important things.<\/p>\n<h2>What is an agent<\/h2>\n<p>An \u201cagent\u201d is a tool like Cursor or Claude Code that calls an underlying LLM model such as ChatGPT or Sonnet and uses it to work with your code base. It provides a set of tools to the model\u2014things like \u201cgrep\u201d or \u201cedit file\u201d\u2014using which it can examine and manipulate your code. The agent also breaks up big tasks into smaller chunks and makes sure the model actually finishes the job. 
By themselves, models don\u2019t handle long lists of TODO items very well; they tend to stop in the middle and declare victory too early.<\/p>\n<h2>Things agents do well<\/h2>\n<h3>Boilerplate code and prototypes<\/h3>\n<p>Agents are <b>very<\/b> good at building boilerplate code from scratch. They know every technology under the sun. Want a FastAPI server that implements the OpenAI protocol? Done in 5 minutes. Want a NextJS application that sends messages over WebSockets? No problem. Simple gRPC client\/server pair? Right away!<\/p>\n<p>This opens the door to prototyping at astronomical speeds. You can have a skeleton application in minutes and a decent proof-of-concept in hours. But don\u2019t confuse a prototype with production code. Agents are not yet capable of building reliable production applications of any meaningful size on their own.<\/p>\n<h3>Code analysis<\/h3>\n<p>Agents can reason about code and answer questions like \u201cdescribe the sequence of events when the user presses the STOP button,\u201d or \u201chow can I navigate to the Questions dialog in the UI,\u201d or \u201cdo we ever read the GARBAGE_PILE table from the database.\u201d The nice thing is that the agent does not judge: you can ask stupid questions, repeat questions, and it will patiently answer everything.<\/p>\n<p>Model quality matters <b>a lot<\/b> here. Cheaper\/faster models tend to be less precise. Even expensive ones like Opus 4.1 or Sonnet 4.5 may make incorrect assumptions without fact-checking. If you have a component named <code>DetailsView<\/code> and a tab named \u201cDetails,\u201d they may assume the tab shows the <code>DetailsView<\/code> component without checking. If your code follows some non-trivial convention\u2014e.g., \u201ccolor\u201d is always RGB and \u201ccolour\u201d is always CMYK\u2014they may treat it as a spelling variation. 
My favorite example is when Sonnet and Opus told me that the frontend distinguishes two types of messages using a field I had added to the backend 30 seconds earlier, which the frontend did not even know about.<\/p>\n<h3>Log analysis<\/h3>\n<p>In complex systems, logs quickly become too large for humans to analyze efficiently. Agents are better than people at searching for relevant lines, finding patterns, and connecting dots in a wall of text. Most of this is not magical\u2014they use the same techniques a person would\u2014but they do it faster and with fewer mistakes.<\/p>\n<p>Keep in mind, however, that agents use LLMs that emulate human reasoning. If your log message would confuse a person, it will likely confuse the model as well. If your log says <code>\"Thread 12345 stopped\"<\/code>, the LLM will assume the thread actually stopped and will base further reasoning on that. If your team decided that \u201cthread stopped\u201d really means \u201cthread finished processing one line and will then continue,\u201d the LLM will not know that and may arrive at the wrong conclusions.<\/p>\n<h2>Things agents do OK<\/h2>\n<h3>Fixing bugs<\/h3>\n<p>The best results come when the agent can check on its own whether the bug is fixed, so the human is out of the loop entirely. It\u2019s fun to watch the agent make hypotheses, write a fix, test it, fail, try again, and so on until it finds the solution. Sometimes it takes 3\u20134 iterations or more. More expensive models do better. Don\u2019t try to fix nasty bugs with <code>composer-1<\/code>.<\/p>\n<p>If human interaction is required, things slow down dramatically. The agent will apply a fix and declare victory (\u201cthe button should be right-aligned now\u201d), only for you to run the code and find that the button hasn\u2019t moved a pixel. Then it asks you to report what\u2019s in the console, paste logs, etc. 
The more of this you can automate, the better.<\/p>\n<p>Like humans, agents often jump to conclusions without verification. My favorite was: \u201cOh, you don\u2019t see file creation messages? That must be because we\u2019re not showing enough fields. Let me show the data length and timestamp in addition to the file name.\u201d Of course, this had nothing to do with the issue.<\/p>\n<h3>Unit tests<\/h3>\n<p>Agents tend to build OK unit tests, but they need specific instructions. The best approach is to create an <code>.md<\/code> file explaining how you want your tests structured (e.g., given-when-then), how to name them, not to test private functions, not to mock private functions, not to touch real databases or networks, how to properly clean up mocks, etc. Once you\u2019ve done this groundwork, the agent will write decent unit tests. Even then, it tends to give them awkward names like <code>test_adder_handles_rational<\/code> instead of <code>test_adder_can_add_two_rational_numbers<\/code>.<\/p>\n<h2>Things agents do badly<\/h2>\n<h3>Seeing the big picture<\/h3>\n<p>Every time you invoke an agent, it starts from a clean slate. It has no long-term memory. It sees your code through the peephole of the context window and is laser-focused on the current task. It excels at tactics and struggles with strategy. You\u2019re dealing with a very well-educated junior developer who knows everything about everything and speaks 150 languages but cannot teach itself not to copy-paste code or modify items in place after they\u2019ve been put into an asynchronous queue.<\/p>\n<p>Whenever there is a difficult problem, the LLM will prefer a local quick fix and will not recognize systemic failure. 
For example, it will introduce fallbacks, artificial delays, or repeated messages, but it will never tell you \u201cdude, you can\u2019t do this with polling reliably\u2014switch to a push-based model.\u201d<\/p>\n<h3>Keeping the code clean<\/h3>\n<p>LLMs strongly prefer local modifications and dislike refactoring. They may produce an excellent first version, but after a few iterations, spaghettification is inevitable. Here&#8217;s an actual AI-produced piece of code with only slight editing:<\/p>\n<pre class=\"brush: python; title: ; notranslate\" title=\"\">\r\nif function is not None:\r\n    try:\r\n        s_arguments = json.loads(arguments)\r\n    except json.JSONDecodeError as e:\r\n        if fixed_args:\r\n            s_arguments = json.loads(fixed_args)\r\n            arguments = fixed_args\r\n        else:\r\n            err_msg = f&quot;Malformed JSON in tool arguments for {func_name}: {e!s}&quot;\r\n            safe_completion_callback({&quot;success&quot;: False, &quot;error&quot;: err_msg})\r\n            return\r\n    if getattr(function, &quot;is_mcp&quot;, None):\r\n        safe_completion_callback(function(s_arguments, status_update_callback))\r\n        try:\r\n            elapsed = time.time() - start_time\r\n            logger.info(...)\r\n        except Exception:\r\n            pass\r\n        try:\r\n            from mycompany.core.metrics import metrics\r\n            metrics.tool_call_finished(thread_id, func_name)\r\n        except Exception:\r\n            pass\r\n        return\r\n<\/pre>\n<p>Automatic linters help but give no guarantees. If your linter warns about deep nesting, long functions, swallowed exceptions, etc., the agent will generally try to minimize those\u2014but not always.<\/p>\n<h3>Following software design principles<\/h3>\n<p>Agents struggle with adhering to design principles that require global understanding. 
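<\/p>
<p>A tiny illustration, with hypothetical names, of a pattern I see constantly: the attribute is guaranteed to exist, yet the agent wraps the access in defensive noise, and the bogus fallback silently turns a legitimate timeout of 0 into 30:<\/p>

```python
class Config:
    """Every Config has a timeout; the attribute always exists."""
    def __init__(self, timeout):
        self.timeout = timeout

# What an agent tends to write: a getattr that can never need its default,
# plus an "or" fallback that introduces a real bug (a timeout of 0 becomes 30).
def get_timeout_agent_style(cfg):
    return getattr(cfg, "timeout", None) or 30

# What the code should say.
def get_timeout(cfg):
    return cfg.timeout
```

<p>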
No matter how much you instruct them, they will:<\/p>\n<ul>\n<li>Ignore DRY and repeat the same code in multiple places.<\/li>\n<li>Ignore separation of concerns and create huge functions, classes, and files.<\/li>\n<li>Complicate code with unnecessary checks or edge cases that can&#8217;t possibly happen. Using <code>getattr<\/code> even when the attribute is guaranteed to exist is my favorite.<\/li>\n<li>Introduce bugs by skipping necessary checks or not handling edge cases that can happen.<\/li>\n<li>Reuse poorly written, half-baked features (\u201cPerfect! I see you already have a <code>shoot_yourself_in_the_foot()<\/code> function!\u201d).<\/li>\n<li>Alter code unnecessarily when moving it.<\/li>\n<li>Create race conditions.<\/li>\n<\/ul>\n<p>You <i>can<\/i> vibe-code a 5K-line prototype without looking at the implementation and it <i>might<\/i> work, but anything larger requires code review.<\/p>\n<h3>Verifying assumptions<\/h3>\n<p>LLMs are <strong>language<\/strong> models. They are not good at strict logical reasoning. They will often come up with a hypothesis (\u201cthe thread must be hanging because we don\u2019t have enough file descriptors\u201d), not verify it, and continue as if it were true. Sometimes the hypothesis is concrete, sometimes it&#8217;s vague (\u201cthe client must somehow be getting the stop message via a different channel\u201d).<\/p>\n<p>They also get confused by unusual conventions or meanings of words. E.g., we have legacy artifacts spelled with a \u201cc\u201d and newer ones spelled with a \u201cq\u201d: similar but not identical entities. The LLM readily assumes they\u2019re the same and treats \u201cartifaqt\u201d as a misspelling.<\/p>\n<h3>Following instructions<\/h3>\n<p>LLMs are probabilistic machines. No matter how crisp and unambiguous your instructions are, they WILL be violated sooner or later. Give an LLM a red button labeled \u201cDO NOT PRESS, THIS BUTTON DESTROYS THE WORLD,\u201d and given enough time, it WILL press it. 
And then apologize for the mistake.<\/p>\n<p>Some LLM habits are very hard to fight:<\/p>\n<ul>\n<li>long explanations full of bullet points<\/li>\n<li>using local imports in Python<\/li>\n<li>inserting non-ASCII emojis into user-facing messages: <code>\u274c Access denied, the operation cannot continue.<\/code><\/li>\n<\/ul>\n<p>You can reduce these habits, but you can\u2019t eliminate them.<\/p>\n<h3>Consistency<\/h3>\n<p>I once asked the LLM to review my code and report problems. It pointed out some unhandled edge cases and missing <code>None<\/code> checks. We fixed them. In the same session, I asked the LLM to generate more code in the same class. It produced code with the EXACT same errors we had just fixed. Same model, same session, same context window. When I asked to review THAT code, it found the errors and fixed them. Why didn&#8217;t it write correct code in the first place? It emulates humans too well, I guess.<\/p>\n<p>Some models are better at this than others. For example, if you ask <code>composer-1<\/code> to write unit tests, it will, but it typically neglects to run them after changes unless you remind it. Claude Opus 4.5, on the other hand, usually <strong>will<\/strong> run the unit tests on its own and fix any problems.<\/p>\n<h3>Perseverance<\/h3>\n<p>A lot of work has clearly gone into making agents avoid infinite loops and not frustrate users.<\/p>\n<p>The result: agents are very quick to declare victory (\u201cI found the issue!\u201d) but also quick to give up. The most infuriating thing, though, is that they rarely admit defeat: they present giving up as another victory. Tests that can&#8217;t be fixed get commented out (\u201call done, all tests pass\u201d). Bugs that can\u2019t be analyzed are \u201cresolved\u201d with useless statements like \u201cCritical finding: thread A somehow fails to get the stop message.\u201d Hard-to-implement features get quietly omitted.<\/p>\n<p>Sometimes asking the agent to try harder helps. 
Sometimes switching to a more expensive model helps. Sometimes nothing helps and the agent runs in circles.<\/p>\n<h3>Refusing bad orders<\/h3>\n<p>Often, if you ask the LLM to shoot itself in the foot, it will happily do so. Examples:<\/p>\n<ul>\n<li>You tell it to use a method or class that does not exist. It won\u2019t object\u2014it will just generate invalid code.<\/li>\n<li>You tell it to call server-side code from the client. It will do it, and the code blows up at runtime.<\/li>\n<li>You tell it to use a tool incorrectly (e.g., <code>async with<\/code> on a synchronous generator). It will do it, and the code blows up at runtime.<\/li>\n<\/ul>\n<h2>Conclusion<\/h2>\n<p>Agents are amazing and a huge productivity boost. You feel this most when there\u2019s an outage and you have to do things the old way. You suddenly realize how much you rely on agents and how slow everything feels without them. But at the current stage, rumors that agents will replace humans are greatly exaggerated. Vibe-coding anything beyond a small prototype is not yet possible: the program becomes an unreliable mess faster than you can say \u201cspaghetti race condition.\u201d Code review and directional oversight <i>are<\/i> required.<\/p>\n<p>But with proper usage, agents are REALLY helpful, and I can hardly imagine going back. Over time, we develop a growing bag of tricks to make the agents do what we want and avoid pitfalls. The models also get better. I don&#8217;t see agents taking our jobs anytime soon, but knowing how to use them will become, or maybe already has become, an essential skill.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>TL;DR I use a lot of AI-assisted coding now. 
It is a huge productivity boost, but the agents still do a lot of things wrong and need to be guided <a href=\"https:\/\/ikriv.com\/blog\/?p=5376\" class=\"more-link\">[&hellip;]<\/a><\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"Layout":"","footnotes":""},"categories":[32],"tags":[],"class_list":["entry","author-ikriv","post-5376","post","type-post","status-publish","format-standard","category-ai"],"_links":{"self":[{"href":"https:\/\/ikriv.com\/blog\/index.php?rest_route=\/wp\/v2\/posts\/5376","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/ikriv.com\/blog\/index.php?rest_route=\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/ikriv.com\/blog\/index.php?rest_route=\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/ikriv.com\/blog\/index.php?rest_route=\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/ikriv.com\/blog\/index.php?rest_route=%2Fwp%2Fv2%2Fcomments&post=5376"}],"version-history":[{"count":9,"href":"https:\/\/ikriv.com\/blog\/index.php?rest_route=\/wp\/v2\/posts\/5376\/revisions"}],"predecessor-version":[{"id":5421,"href":"https:\/\/ikriv.com\/blog\/index.php?rest_route=\/wp\/v2\/posts\/5376\/revisions\/5421"}],"wp:attachment":[{"href":"https:\/\/ikriv.com\/blog\/index.php?rest_route=%2Fwp%2Fv2%2Fmedia&parent=5376"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/ikriv.com\/blog\/index.php?rest_route=%2Fwp%2Fv2%2Fcategories&post=5376"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/ikriv.com\/blog\/index.php?rest_route=%2Fwp%2Fv2%2Ftags&post=5376"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}