I don’t doubt that jobs associated with software engineering are undergoing a massive sea change. While I was far less certain of this a year ago, tools like Codex, Claude Code, and OpenCode are effective and – to me – provide real value. These tools are getting better at an extraordinary rate, but I don’t see that continuing forever. Like any other advancement, the rate of improvement follows an S-curve. We’re in the steepest part of that curve right now.
Like others, I believe this is a sea change for software development – something profound, with enough value and impact to change quite a bit of how software development is done. Within that, I think the job of doing software development is changing and will continue to change, but contrary to the hyperbolic assertions of some, software engineers will still be needed – critically so.
I spend a lot of time writing, teaching, and explaining various technical topics to people. And from where I stand, good documentation is more important than ever for getting the best results out of agentic systems. I’ve been bouncing these ideas around for several months, and I want to share some of those thoughts: how I’ve been thinking about the topic, and what’s worked for me. If you’re inclined to try it out, I hope this provides a path for you.
Before I get into the specifics of what’s worked for me, and what hasn’t, I want to share some foundations. These are my hypotheses on why I’m seeing the results I do, and they hold up for me. I’ll leave it to you to judge for yourself.
LLMs don’t reason. They encode a tremendous amount of data/knowledge, but the fundamental nature of how an LLM works is “given prior words, predict the next word”.
I am confident in this – backed up by some really great research from Apple’s machine learning researchers: The Illusion of Thinking. It’s a formal research paper, well worth a read for the abstract and conclusion, even if you can’t easily follow the specifics.
Model capability jumps
LLM models can be loosely grouped by the number of “parameters” in the model – how “large” it is. There’s a whole thing here with “scaling laws”: performance gets better as models get bigger and are trained longer, with diminishing returns at the far ends of those curves.
Some surprising characteristics come out of this as you step up in model size. At the “smaller” end of the scale – 3 billion parameters or so – you’ve got good basic prediction. Enough that it’s sufficient for effectively predicting fixes to spelling mistakes, or completing a written sentence.
At 7 to 8 billion parameters, there’s enough encoded knowledge in the model that predicting the next word starts to provide new emergent capabilities: instruction following and the ability to consistently structure output— such as writing correct JSON, or the simplest of code.
At 40 billion parameters, there’s another jump – this is the size where, if it’s been in the training data, it can predict with some semblance of reasoning and improves notably when using “chains of thought”. This is also the point where if you generate content, then feed it back to the same LLM with prompts effectively asking “does this seem right”, it’ll consistently start to correct to better outputs – kind of the leading edge of where it can start to “hill climb” by following instructions.
At 100 to 120 billion parameters, the models start to usefully generalize. This is where GPT-3 was when it came out, and where the multilingual aspects of models start to get consistently good. World knowledge is solid, and (relatedly) there’s enough fine-grained detail in the model that predictions on niche topics are far more solid.
I get a bit fuzzier after this, but I think somewhere around 400 billion parameters is where the models have enough detail in them, enough trained paths of “how to do things”, that it generalizes the predictions so that you can get effects such as multiple steps of reasoning, or basic planning. With recursive use of the model, it exhibits both planning and instruction following, which makes it the ideal starting point for something like agentic computing.
The latest “frontier” models are north of 1.2 trillion parameters today, maybe bigger – I don’t know where the latest ones are at. OpenAI, Anthropic, and Google don’t exactly share all the details at that leading edge. Even with the breadth of data, it can *only* predict patterns that it’s seen before. The generalization jump at 100 billion parameters means that it can start to apply patterns it’s seen in one place to another, but it still has to have seen it somewhere.
Memory and Probability
The other thing to know about LLMs is that they’re the ultimate “goldfish brain” – unless you pass it in, they intrinsically have no memory of earlier conversations. They have a lot of knowledge embedded in that training, some of which is exposed easily. That is why “The capital of France is…” predicts Paris. And with more good training data, it gets better. But it still doesn’t reason – it doesn’t ask “is this logically correct”, or apply anything akin to deductive or inductive reasoning today. Even the planning – you’ll see it in agents as a “thinking” mode – is done with recursive calls back to the same LLM: first generating, then iterating on what it generated to improve the results. And that actually seems to work.
But for the love of all, please don’t think that it’s “reasoning through” anything – using any logical thought process. That space is at the cutting edge of research – “neuro-symbolic” computing. What comes out as reasoning and planning today reflects what others have reasoned about, trained into the model. It probabilistically reflects those patterns back out, with generalization.
That last part – the probabilistic bit – is important. It means the story can (and likely will) change every time you ask. Some short predictions are solid enough that you’ll get consistent answers, but as you step “down the road of these tokens” – get to the finer-grained patterns – it’ll follow its model tracks, but fork at different places. The more context you give it – those up-front instructions, examples, etc. – the more you constrain that prediction. This is where the phrase “It’s all context engineering” started really popping up in this space about a year ago.
Agents
I mentioned this whole “chain of thought” thing earlier, and how you could recursively feed back generated content into an LLM to get improved output. So let’s use that – if you stand outside the model, control what context you feed it each time, and do that in a loop? Congratulations, that’s exactly what an agent is and does.
That ability to reliably act as an agent comes from the instruction-following tendencies of those larger models. Meaning that, if you give them instructions, the predicted output text tends to follow the pattern of looking like it’s following the instructions. The context of those instructions constrains the LLM prediction so that you get something really interesting.
The next thing agents added was an idea of “tool calling” – and that’s tightly fitted into how a model works. If you train the model to emit some specific text phrasing in a structured output, the agent code can look for that pattern and interpret that as “I should use this tool” – using the structured content it generated as arguments. The tools typically provide back natural language output, sometimes structured, that the agent uses as additional context and continues on. This pattern was standardized and generalized into what’s now MCP. It solidified the mix of exposing deterministic APIs and tools to these LLMs – which _aren’t_ deterministic.
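The loop-plus-tools pattern described above can be sketched in a few lines. This is a toy illustration, not any real agent’s implementation – `fake_llm`, `calculator`, and the JSON tool-call convention are all hypothetical stand-ins for a real model API and tool protocol:

```python
import json

def fake_llm(context: str) -> str:
    """Stand-in for a real LLM call. A real agent would send `context` to a
    model API and get back predicted text, which may be a structured
    tool-call request rather than a final answer."""
    if "TOOL_RESULT" not in context:
        # The model "decides" it needs a tool and emits structured output.
        return json.dumps({"tool": "calculator", "args": {"expr": "2 + 2"}})
    return "The answer is 4."

def calculator(expr: str) -> str:
    # A deterministic tool: "2 + 2" always comes back as "4".
    return str(eval(expr, {"__builtins__": {}}))

TOOLS = {"calculator": calculator}

def agent(task: str, max_turns: int = 5) -> str:
    context = task
    output = ""
    for _ in range(max_turns):
        output = fake_llm(context)
        try:
            # If the output parses as a tool call, run the tool and feed
            # the result back into the context for the next turn.
            call = json.loads(output)
            result = TOOLS[call["tool"]](**call["args"])
            context += f"\nTOOL_RESULT: {result}"
        except (json.JSONDecodeError, KeyError):
            # Plain text: treat it as the final answer and stop looping.
            return output
    return output

print(agent("What is 2 + 2?"))  # → The answer is 4.
```

The interesting part is that the loop, the context accumulation, and the tool dispatch all live in ordinary deterministic code – only the text prediction itself is probabilistic.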
The folks writing agents have refined what it means to call tools, tried out a bunch of different patterns, made mistakes, and found success; sometimes all at once. It’s a fair bit of work to create and maintain an MCP “server”, or those tools. But we’ll see more of both, because they give the agents a superpower – a way to deterministically get something done. With tool calling, asking for something like “2 + 2” will always equal 4 – where if you ran that through probabilistic generation, well… I wouldn’t expect that to work 100% of the time. Combine tool calling with the instruction following and a breadth of world knowledge to pull from, and you’re seeing genuinely useful results. This is where we are now – February 2026 – as I’m writing this.
The point at which I felt it was genuinely useful was when Claude Code added the notion of “skills” for agents. The idea is a variation on tool calling: including in the context that there is additional knowledge to access, and allowing the agent to decide, based on its predicted planning output, when it should use that knowledge. Skills are structured as sets of deeper knowledge, listed with short summaries of what’s contained within each, or why an agent would load it. This gives agents a sort of “progressive disclosure” that helps guide the predictive outputs. Skills are primarily instructions or descriptions, although they can also include scripts that an agent can write and run, to achieve any number of goals.
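To make the “short summary, deeper knowledge behind it” shape concrete, here’s a made-up example of what a skill file might look like – the name, description, and contents are all hypothetical, and the exact format varies by agent:

```markdown
---
name: swiftui-conventions
description: Project conventions for SwiftUI views. Use when creating or
  modifying any SwiftUI view in this codebase.
---

# SwiftUI conventions

- Prefer small, composable views; extract subviews over ~40 lines.
- State lives in an observable model object, not in the view.
- See patterns/navigation.md in this folder for the navigation approach.
```

The agent only ever scans the short `description` up front; the body (and any extra files alongside it) is loaded when the agent predicts that it’s relevant – that’s the progressive disclosure.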
Okay – that’s a ton of background and my views/speculation/hypothesis on how this all works. Let me touch on what’s worked for me.
Suggestions from what’s working for me
So if you’re trying this agentic coding thing out, there are a couple of key pieces of advice that made a huge difference for me.
(1) Ask me questions for anything ambiguous
Make sure that phrase is *always* in your instructions. If you don’t have it, and leave something vague, it’ll pick something. Maybe you’ll like it, maybe you won’t. What you get each time you try it can be wildly different.
When you provide this up front, the agents do a much better job of asking for clarifications where they would have guessed and chosen a path randomly otherwise.
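If it helps, here’s one hedged example of how this might look in a project instructions file (a CLAUDE.md, AGENTS.md, or similar – the exact filename depends on your agent, and the wording is just what’s worked for me):

```markdown
# Working agreements

- Ask me questions for anything ambiguous before choosing an approach.
- If two interpretations of a request are possible, stop and ask rather
  than guessing.
```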
(2) Make a plan, then implement it
If you want it to help you code, build a plan that constrains what it’ll generate. That whole instruction-following / apparent-reasoning thing works to your benefit here. First create a plan for what you’re after, work out the specifics of how it’ll be achieved up front, and when the plan has evolved to your satisfaction, then let it implement.
I recommend providing instructions that it shouldn’t implement anything until you’ve approved the plan. Some of the earlier agent systems wouldn’t always anticipate that in their predictions, instead predicting that it should “just do it already”. They seem to be better about that today, but it’s worth keeping in mind.
Especially when working with an agent and a larger model (such as Codex or Claude), you can use the world knowledge of the model to help you create the plan. “Ask” the model for options, ask it to explain the pros and cons of choices, and with the “ask me questions for anything ambiguous” instruction already loaded, it’ll help to refine a plan to something pretty good.
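For illustration, a planning prompt in this style might read something like the following – the task and wording are entirely made up, and this is a sketch of the shape, not a recipe:

```text
Before writing any code, create a plan for adding CSV export to the
reports screen. List the files you expect to touch, two or three options
for structuring the exporter, and the pros and cons of each. Ask me
questions for anything ambiguous. Do not implement anything until I
approve the plan.
```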
Note: making a plan with an agent, and knowing how that agent will respond to and run with a plan, is a skill of your own. It’s something you’ll need to learn – so plan to try it out, plan on making mistakes, and learn from those mistakes. Making a better plan is the single best way to get better results out of an agentic coding assistant.
(3) Use deterministic feedback loops
Set up some structure so that, while you’re having the agent work on code, it can verify that what’s coming out is functional. The most obvious things here, to me, are unit tests, but also linters – especially for flexible, interpreted languages such as Python or TypeScript. In your instructions, make sure you have a line like “ensure the tests pass”, and maybe even give it the context of how to run the tests, or compile the code and run the tests. The world knowledge is good enough that if you’re using a standard project setup, it’ll often try the right thing – but if you tell it, it’s way more successful on that front.
The looping structure of agents will see the output from a failed compilation or failed test, recognize there’s a problem, predict that it should understand the problem and fix it, and then proceed to try and do that.
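The deterministic side of that loop is just “run commands, check exit codes”. Here’s a minimal sketch – the `CHECKS` list is a placeholder, with trivial commands standing in for your real test suite and linter (e.g. `pytest -q`, `ruff check .`):

```python
import subprocess
import sys

# Hypothetical checks for a project – swap in real test/lint invocations.
CHECKS = [
    [sys.executable, "-c", "assert 2 + 2 == 4"],    # stands in for the test suite
    [sys.executable, "-c", "print('lint clean')"],  # stands in for a linter
]

def run_checks(checks) -> list[tuple[str, bool, str]]:
    """Run each check and collect (command, passed, output) – the kind of
    clear good/bad signal an agent can iterate against."""
    results = []
    for cmd in checks:
        proc = subprocess.run(cmd, capture_output=True, text=True)
        results.append((" ".join(cmd), proc.returncode == 0,
                        proc.stdout + proc.stderr))
    return results

for cmd, passed, output in run_checks(CHECKS):
    print(f"{'PASS' if passed else 'FAIL'}: {cmd}")
```

The agent doesn’t need anything this formal – a bare “run the tests” instruction gets a similar loop – but the principle is the same: the pass/fail signal comes from outside the model.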
The agentic systems today – even more so than 6 months ago – are surprisingly good (surprising to me) at what AI researchers call “hill climbing”. The more deterministic things you can have it check against, with clear (AND CONSISTENT) “good/bad” scores, the better it’ll be able to use all the other systems it has to iterate and refine into what you’re after.
This is a space where the language you’re using can have a notable impact on how effective the agent is. TypeScript can get better results than JavaScript when it’s doing type checking and using that as a constraint. Likewise, a compiled language like Swift has even more benefits, with its layers of safety and guarantees that it checks – and provides warnings and errors when it can’t.
(4) Constrain what you’re doing
I’ve gotten much better results when I’m really specific and keep what I’m asking for as simple as possible, or – using that plan concept above – break down the problem before letting the instruction-following actions of an agent do their thing. I really wish I could tell you “how much” to constrain – but I can’t. For one, it’s changing pretty rapidly. The models can do a lot more today than they could 6 months ago, and WAY more than they could a year ago.
But the core of this breaks down to a simple idea – if it has to “reason” about what to do, and it’s more than a couple of steps or something obvious, there’s a higher likelihood that it’ll go awry. The agents I’ve used can appear to be pretty good at reasoning, but that probabilistic nature can catch up with you unless you’re keeping track of what’s happening with some deterministic thing outside the LLM’s predictive output. Claude Code, for example, has a “to-do” tool where it creates to-dos for itself and then checks them off as it completes them, with the simple instructions: check the next to-do, do it, mark it complete when done.
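The reason a to-do tool works is that the list is deterministic state living outside the model. A toy sketch of the idea (this is my illustration, not Claude Code’s actual implementation):

```python
class TodoList:
    """Toy version of an agent's to-do tool: the list lives outside the
    LLM, so progress survives even though the model has no memory."""

    def __init__(self, items):
        self.items = [{"task": t, "done": False} for t in items]

    def next_todo(self):
        # The agent asks the *tool* – not the model – what's next.
        for item in self.items:
            if not item["done"]:
                return item["task"]
        return None

    def mark_done(self, task):
        for item in self.items:
            if item["task"] == task:
                item["done"] = True

todos = TodoList(["write failing test", "implement fix", "run test suite"])
while (task := todos.next_todo()) is not None:
    # In a real agent, the model would be prompted to carry out `task` here.
    todos.mark_done(task)
print(all(item["done"] for item in todos.items))  # → True
```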
That comment about the goldfish brain earlier? Yeah – this is where it comes in. Agents do great with lists when they’re using tools, not so much when they aren’t.
(5) Find, use, and create skills
The libraries and frameworks you want to use, how to use them effectively, the patterns of software architecture you want – that’s all ambiguous to an LLM when you start. These are the perfect places to provide it with additional context. Sometimes it’s as simple as “use SwiftUI”, but the more detail that you can give it about what makes “good” use of your framework, the better.
I don’t honestly know all the boundaries around this particular advice, but it’s been incredibly useful for me. I’ve written a couple of skills for myself, and felt like the results have been better. That I need to rely on “feel” to know if it’s better or not is driving me a bit nuts. There are paths to “testing” whether it’s better – but that rabbit hole is a space called “evals” in the agentic/LLM world. And it’s a really, really deep rabbit hole. Maybe more on that in another post. Right now my subjective results are the evaluator, as messy and time-consuming as that is.
The real takeaway here is this – if you find yourself providing the same set of instructions more than once across different agent sessions, capture it. That collection will grow. Some of those instructions will be specific to a technology or task. Group those together into a file or set of files, and make a skill from it. The patterns and tools at OpenSkills provide some structure that you can use for any agent. And there are growing collections of skills that are pretty easy to find on GitHub.
Never use a skill without reading it!
That should go without saying, but a lot of folks just want something to work. I hate to be the bearer of bad news, but there’s a lot of malicious intent out there in shared skills. Always read a skill before you plop it into your collection and use it. Your paying attention here is huge. You’re the auditor and guide for all of this – it should be what you want the agent to help you do, and how. If you don’t agree, don’t use the skill. Or edit it, and try out your own variant with how you think it should work.
(6) You’re responsible for memory (for now)
As I’m writing this (February 2026), the agents are starting to step into the space where they’re retrieving data on their own – in effect, using memories that you provide. But they aren’t (yet?) self-sufficient enough to know _when_ to store things like skills as memories.
You can provide too many skills, or too many tools, to an agent. It can get “confused” (meaning it doesn’t consistently pick how it uses all those tools), and the number of tools it takes to confuse these models is surprisingly low.
Keep the context that it always needs to load and use concise, accurate, and confined to the project at hand. Change it if you’re changing what you’re doing, and you’ll get improved results. Tools and skills, where it wouldn’t be clear to you which you might choose, shouldn’t be offered together to an agent. That’s pretty much a guaranteed recipe for inconsistent use (aka “disaster”).
Discussions, research, a multitude of ideas, and implementations are on the leading edges of the development of agents. Because of that, I expect this space to evolve even in the next couple of iterations of software releases.
(7) Checkpoint and reset
Agents working with code use patterns that are self-reinforcing – far more than humans writing the same code. Speaking for myself, I’m used to multiple different patterns and efforts existing in the same codebase. It’s not great, and I don’t love it – but it doesn’t screw me up all that much. It will tend to screw up an agent. The more consistent your codebase is, the more consistent the agent can be in generating code that follows the same patterns.
These agents can generate code a lot faster than most of us can type. So the magical feeling is that you’ve got a bit of a superpower at your fingertips to try something out, see if it works. Do it. Try it. That’s the benefit, but know how to back out of it if it doesn’t work.
You can do some instructions about refactoring or re-adjusting how the code works to new patterns, but that’s one of the harder tasks for agents to do well. If you’re adding something to your codebase and trying something new, use git (or whatever source control) to be explicit about making a starting point and see where it goes with your plans and instructions. If you don’t like it, rather than trying to evolve it “back and to the left”, reset back to an earlier commit and try again with updated instructions and the lessons learned in your head.
I keep a notepad (okay, a text editor really) and copy/paste my instructions, tweaking them and trying them out. Sometimes I’ll do this on a branch, run it with the instructions, then reset to the head of the branch and give it another go with modified instructions. A different technology choice, a different pattern, and so on.
When I’m really interested in the problem space, I get excited about exploring, about poking it and seeing how different patterns work. Just make sure you do one at a time, and from a consistent starting point, as you’re exploring the possibilities.
I would not be surprised if this is also a space that changes as agents evolve. This whole area is where groups like Google’s DeepMind team really excel – mixing exploration (search) and planning in with expectations of results and memory. Those ideas are a bit beyond the bleeding edge of what’s available for most agents today, but are being actively explored.
(8) Use the LLM where it’s good, and scripts or code where it isn’t
This is probably the flimsiest advice – I don’t have it pinned down to anything more concrete. I’ve run into this far more using agents to help me with data analysis than with code, but it applies just the same. If what you’re asking the LLM to do is read and write a ton of data, consistently and mechanistically – without translating it, summarizing it, or such – then you’ll be WAY better off having the LLM write a script or code to process the data than running all the content into the LLM and back out as predicted text. In small pieces that can work fine and be pretty seamless, but it’s fundamentally noisy and unpredictable. A deterministic script isn’t – you’ll get consistent results from a script that you won’t get from the output of an LLM.
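This is the kind of small, deterministic script I’d have the agent write instead of streaming the data through the model. The data, column names, and task here are all made up for illustration:

```python
import csv
import io

# Made-up sample data standing in for a real file on disk.
raw = """region,amount
north,10
south,5
north,7
"""

def totals_by_region(csv_text: str) -> dict[str, int]:
    """Sum amounts per region – the same answer every run, unlike asking
    an LLM to 'add these up' across megabytes of rows."""
    totals: dict[str, int] = {}
    for row in csv.DictReader(io.StringIO(csv_text)):
        totals[row["region"]] = totals.get(row["region"], 0) + int(row["amount"])
    return totals

print(totals_by_region(raw))  # → {'north': 17, 'south': 5}
```

Run it on ten rows or ten million, the totals come out the same – and none of that data ever has to pass through the model.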
You also tend to pay for “tokens” (words) that you send in and get out from an LLM. Not funneling MB (or more) of data through it when you don’t need to will save you bucks.
If you need to do something like this – summarizing content, for example – you get far better results when you take advantage of (4) above – keep it constrained. While you *can* dump way more content to these latest agent models, you get better results when you can process it in constrained, consistent chunks.
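One simple way to keep those pieces constrained is to split the input into fixed-size, word-bounded chunks before handing each to the model. A sketch – the chunk size is arbitrary, and real pipelines often split on paragraph or section boundaries instead:

```python
def chunk_text(text: str, max_words: int = 500) -> list[str]:
    """Split text on word boundaries into chunks of at most max_words, so
    each piece fed to the model stays a constrained, consistent size."""
    words = text.split()
    return [
        " ".join(words[i:i + max_words])
        for i in range(0, len(words), max_words)
    ]

chunks = chunk_text("one two three four five six seven", max_words=3)
print(chunks)  # → ['one two three', 'four five six', 'seven']
```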
Keep this in mind when you’re making your plans or choosing how to use the agentic system. Choosing up front if something is better with a probabilistic path or a deterministic one ends up being both a skill, and sort of a fundamental one in using the agents effectively. (And yes, by calling this a skill, I mean I’m still barking my shins on this and learning myself.)
(9) Small, sharp tools
The phrase, as it relates to the philosophy behind Unix, goes back to the 70s at Bell Labs. In a too-short summary:
- text is the primary interface
- write programs that do one thing, and do it well
- composable tools that read, process, and output text
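The same composition idea can be sketched in Python rather than a shell pipeline – each function takes text in, transforms it, and passes text along, roughly `grep | sort | uniq -c` in spirit (the log lines are made up):

```python
from collections import Counter

def grep(lines, needle):
    # Keep only lines containing `needle` – like grep.
    return [line for line in lines if needle in line]

def uniq_count(lines):
    # Count occurrences of each distinct line – like sort | uniq -c.
    return sorted(Counter(lines).items())

log = [
    "ERROR disk full",
    "INFO started",
    "ERROR disk full",
    "ERROR timeout",
]

print(uniq_count(grep(log, "ERROR")))
# → [('ERROR disk full', 2), ('ERROR timeout', 1)]
```

Each stage does one thing, and they compose – which is exactly the property that makes these tools so natural for a text-driven agent to chain together.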
For most coding agents, text is the “primary interface” that you use to communicate with the agent, and not surprisingly, the agents are being trained to (and are getting) tremendously good at using the classic “Unix tools” that focus on taking in text and sending it back out.
I think there’s a growing space here (reinforced by examples such as Playwright) where providing custom CLI tools to agents is another powerful way to expand how effective agents can be. Microsoft just released a CLI to control and run Playwright – a visible or headless browser used for testing or interacting with sites on the internet like any other browser, just controlled by code in addition to any interactions you might make yourself.
What stands out to me is that there are patterns in what and how you output things that work better with agents – concise, “token-efficient” output, clear and meaningful errors with instructions for alternative usage, output patterns that expect to have the most relevant information about progress within the last 5 lines of the output, and so on.
For now, text is the universal language – including some crazy emoji emphasis (for better or worse) – for communicating instructions to agents. There are some early paths for capturing and iterating based on screenshots with some agents, but I haven’t trodden down those roads as yet.
Note: If you started into computing well after graphical user interfaces were big – this may feel like an exceedingly awkward shift. I have the benefit of being old enough that I grew up with command-line interfaces and shells as the first point of contact with a computer. Even learning command line tools like you’ll find in a terminal/shell is its own skill, and takes some time and understanding. I’ve always found them extremely powerful and recommend learning how to use them. These are yet another thing to learn, which can be overwhelming if you’re already feeling underwater with learning the rest of this.
Sometimes just being aware that it is a skill in its own right makes approaching it a bit easier.
It’s a set of skills – give yourself grace to learn them
Wrapping up, I’ll just reiterate that using agents to help solve coding problems is a new skill in itself. How to use these tools is something we’re all learning. The rate of change for anything in computing has always been high, with software on the fastest iteration cycle. How to get effective results from agents, and how to arrange all the parts they need, is at the forward, bleeding edge of that rate of change.
I favor taking a view of exploration and even play where I can, and encourage you to do so as well. I recognize that I’m in a supremely privileged position to be able to think of it all this way. Give yourself leeway to try things, to screw it up, to get the feedback, and try again.
Like just about anything in technology engineering, it’s a big darn puzzle. And like most of the puzzles in our industry, there are facets that are easier to solve, and some that take a lot longer.
I think there’s more here, but I haven’t advanced enough in my own learning journey to suggest further steps or advice. The space that I’m looking forward to is what I’d generally term as “systems thinking” – working towards thoughtfully (and intentionally) composing systems, effectively using abstraction and encapsulation, understanding the interfaces and implications of assembling these pieces into new structures and how that’ll work – or not.
As a last note, let me tip my hat to Simon Willison, who provided a link to How Generative and Agentic AI Shift Concern from Technical Debt to Cognitive Debt – this is right in line with what I’m seeing.