What does AI actually produce well, and where does it fall apart?
It's an understatement to say that LLMs have transformed the practice of programming. Where they will take software engineering, on the other hand, is an open question.
The distinction does more than just preserve the identities of experienced engineers. The former claim is objectively true. Given a task and a chosen programming language, any modern LLM will produce code that adheres perfectly to the grammar of that language. There may be bugs, or the model may have implemented a solution to the wrong problem, but we're well past LLMs making syntax errors. The idea generalizes: the more constrained a language's grammar, the more reliably an LLM produces well-formed output.
Contrary to what the hype cycle would have you believe, this has nothing to do with any inherent reasoning (read: thinking) capability of LLMs and everything to do with what LLMs are and how they are trained. Put another way: producing code that compiles is one pattern-matching problem. Producing code that does what you described is another, layered on top of it. Producing code that does what you actually meant — distinct from what you described — is layered on top of that. Each layer is self-similar to the layer below, but LLMs are most capable near the bottom of that ladder.
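To make those layers concrete, here is a contrived Go sketch (my own, not model output): the request is "a function that returns the average of a slice," and all three versions compile, but only the last does what was actually meant.

package main

import (
	"errors"
	"fmt"
)

// Layer 1: compiles, but solves the wrong problem (a sum, not an average).
func averageWrong(xs []float64) float64 {
	var total float64
	for _, x := range xs {
		total += x
	}
	return total
}

// Layer 2: does what was described ("return the average"), but quietly
// returns NaN on an empty slice, a case the description never mentioned.
func averageLiteral(xs []float64) float64 {
	var total float64
	for _, x := range xs {
		total += x
	}
	return total / float64(len(xs))
}

// Layer 3: does what was actually meant, including the case left unsaid.
func average(xs []float64) (float64, error) {
	if len(xs) == 0 {
		return 0, errors.New("average of empty slice is undefined")
	}
	var total float64
	for _, x := range xs {
		total += x
	}
	return total / float64(len(xs)), nil
}

func main() {
	fmt.Println(averageWrong([]float64{1, 2, 3})) // 6: syntactically fine, semantically off
	fmt.Println(averageLiteral([]float64{}))      // NaN: matches the words, not the intent
	fmt.Println(average([]float64{1, 2, 3}))      // 2 <nil>: the thing you meant
}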
The transformer architecture underpinning modern LLMs was built to identify as many relationships between spans of text as is computationally feasible. For programming languages, this is a constrained optimization space. There is great variety in what those languages are used for, but the set of allowable characters and sequences that comprise a given language is quite small. Compare that with natural language, with its effectively unbounded lexicon and grammatical flexibility.
As an example, every main function ever written in the Go programming language goes as follows:
func main() {
// stuff
}

Not once does it appear as:
{ func
// stuff
main)( }

Now take a typical English sentence, "I went to the store yesterday", and all of its valid permutations:

I went to the store yesterday.
Yesterday I went to the store.
I went yesterday to the store.
To the store I went yesterday.
Yesterday to the store I went.
To the store yesterday I went.

The same six words can be arranged in (at least) six different ways. While any of these would trivially be generated by an LLM describing its day off, problems arise when this variance is amplified across every linguistic dimension. Generating source code has the structural advantage that one of its largest subproblems, the grammar, is effectively fixed before inference occurs. Grammar for natural language is an open problem, and one that may never be closed. This is largely because natural language grammar is descriptive, dependent upon the way that language is spoken between people. Rules we're taught as axioms are really just a best attempt to capture a common parlance. Within a single programming language, there is no variance along this dimension. The complexity lies in how a small, fixed set of symbols is arranged.
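To put a rough number on that variance: even this six-word sentence has 6! = 720 possible orderings, of which only a handful read as English. A small Go sketch, purely to make the count concrete:

package main

import "fmt"

// permute returns every ordering of words, purely to show how quickly
// word-order variance grows even for a short sentence.
func permute(words []string) [][]string {
	if len(words) <= 1 {
		return [][]string{append([]string{}, words...)}
	}
	var out [][]string
	for i := range words {
		rest := make([]string, 0, len(words)-1)
		rest = append(rest, words[:i]...)
		rest = append(rest, words[i+1:]...)
		for _, tail := range permute(rest) {
			out = append(out, append([]string{words[i]}, tail...))
		}
	}
	return out
}

func main() {
	words := []string{"I", "went", "to", "the", "store", "yesterday"}
	// 6! = 720 orderings in total; only a handful read as grammatical English.
	fmt.Println(len(permute(words))) // 720
}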
All of this hints at an answer to the question posed at the top: LLMs are great at solving constrained-grammar problems. This is less esoteric than it appears. A recipe has a constrained grammar, as do spreadsheets, calendars, to-do lists, and myriad everyday tools. When an LLM is trained on corpora of constrained-grammar text, it can literally pay more attention to how those symbols are arranged than to discovering the valid structure from scratch. The catch is that you need enough training data to adequately cover the variance of the languages you intend to generate.
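As a toy illustration of how small such a grammar can be (the field names and states here are invented, not drawn from any particular tool), a to-do list reduces to a handful of fields and a fixed set of states, leaving a model almost no structural decisions to make:

package main

import (
	"encoding/json"
	"fmt"
)

// Status is the complete set of states a to-do item can be in.
type Status string

const (
	Todo  Status = "todo"
	Doing Status = "doing"
	Done  Status = "done"
)

// Item is the whole "grammar" of a to-do entry: three fields, one of them
// drawn from a fixed set. A model emitting this shape spends its capacity
// on the content, not the structure.
type Item struct {
	Title  string `json:"title"`
	Due    string `json:"due,omitempty"` // e.g. "2025-03-01"
	Status Status `json:"status"`
}

func main() {
	day := []Item{
		{Title: "book dentist appointment", Due: "2025-03-01", Status: Todo},
		{Title: "reply to landlord", Status: Done},
	}
	out, _ := json.MarshalIndent(day, "", "  ")
	fmt.Println(string(out))
}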
It is hopefully now clearer why, until very recently, it was difficult for an LLM to count the r’s in strawberry, or why image-generation models routinely confuse left for right. The same architecture that produces flawless code fails at these tasks for the same reason: these problems sit at a level very far from the symbols. The concept of counting, of letters as letters, of the word strawberry as a sequence of characters distinct from the fruit it names — none of this appears in the training data the same way Go syntax does. The concepts themselves need to be inferred by their usage, as represented in the data. Inference from usage is a much weaker signal than training on explicit examples of a fixed grammar.
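For contrast, the same task is trivial the moment the input really is a sequence of characters; the failure is representational, not computational:

package main

import (
	"fmt"
	"strings"
)

func main() {
	// Trivial once the word is actually a sequence of characters; the model's
	// difficulty is that its training data rarely represents the word this way.
	fmt.Println(strings.Count("strawberry", "r")) // 3
}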
To be clear, this isn’t the “LLMs are dumb, humans are smart” argument. On the contrary — we now have a computational system that can maintain reference to itself and to another agent across multiple turns of interaction. This is new and weird.
I also don't claim that the limitations described here are permanent. AGI might well be possible, but I don't think "make computer bigger" is the path that gets us there. Representability has a better shot, in my opinion: figuring out what can be expressed in a form the architecture actually privileges, and where the boundaries of that form sit.
What I do think is overstated is the imminent capability and utility of LLMs beyond a few domains in their current default usage. The places where these tools work reliably today are the places where the underlying representations are constrained enough that the architecture's strengths line up with the problem. This is why, on our quest to discover effective AI workflows, we've climbed the ladder from vibe coding to prompt engineering, to context engineering, to… whatever the new buzzword is. At each refinement, we tell the model less in hopes that it can do more. We constrain and specialize the language used in our prompts so as not to pollute the context window with symbols that are irrelevant for actually solving the problem at hand. At the same time we arrange those symbols into more and more structured, predictable representations like specification documents and code itself. That work, applied to the everyday, is where the largest potential gains lie for the average AI user.
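What such a structured representation looks like in practice varies; here is one hypothetical shape, with field names invented purely for illustration, in which the task is stated in a small fixed vocabulary rather than free-form prose:

package main

import (
	"encoding/json"
	"fmt"
)

// Spec is a hypothetical shape for a task specification. The field names are
// invented for illustration; the point is the small, fixed vocabulary.
type Spec struct {
	Goal        string   `json:"goal"`
	Inputs      []string `json:"inputs"`
	Constraints []string `json:"constraints"`
	Acceptance  []string `json:"acceptance"` // what "done" means, stated up front
}

func main() {
	spec := Spec{
		Goal:        "summarize weekly expenses into a single table",
		Inputs:      []string{"expenses.csv"},
		Constraints: []string{"group by category", "round to whole currency units"},
		Acceptance:  []string{"one row per category", "totals match the source file"},
	}
	b, _ := json.MarshalIndent(spec, "", "  ")
	fmt.Println(string(b)) // this, not loose prose, becomes the core of the prompt
}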
And that work isn't being taught uniformly. The countless certifications, courses, and explainers are almost entirely scoped to technical domains. There are reasons for that. First, the tools already work well on tech-shaped problems, so the momentum carries the resources there. Second, tech is where the money is. Third, corralling human messiness into a form a model can work with isn't really an AI skill at all — it's what critical thinking and the ability to faithfully externalize one's thoughts have always been. That's never been easy, and the people who do it well have been doing it forever, calling it something else.
Most of the ideas explored in this series are about what this work could look like, why the current default interfaces actively make it harder, and where these tools, used differently, could actually help.