> It’s one of those things that crackpots keep trying to do, no matter how much you tell them it could never work. If the spec defines precisely what a program will do, with enough detail that it can be used to generate the program itself, this just begs the question: how do you write the spec? Such a complete spec is just as hard to write as the underlying computer program, because just as many details have to be answered by spec writer as the programmer.
It actually makes sense that code is becoming amorphous and we will no longer scale in terms of building out new features (which has become cheap), but by defining stricter and stricter behavior constraints.
Program generation from a spec meant something vastly different in 2007 than it does now. People can and are generating programs from underspecified prompts. Trying to be systematic about how prompts work is a worthwhile area to explore.
What he misses is that it's much easier to change the spec than the code. And if the cost of regenerating the code is low enough, then the code is not worth talking about.
Is it? If the spec is as detailed as the code would be? If you make a change to one part of the spec do you now have inconsistencies that the LLM is going to have to resolve in some way? Are we going to have a compiler, or type checker type tools for the spec to catch these errors sooner?
As far as I can tell it's not a new language, but rather an alternative workflow for LLM-based development along with a tool that implements it.
The idea, IIUC, seems to be that instead of directly telling an LLM agent how to change the code, you keep markdown "spec" files describing what the code does and then the "codespeak" tool runs a diff on the spec files and tells the agent to make those changes; then you check the code and commit both updated specs and code.
It has the advantage that the prompts are all saved along with the source rather than lost, and in a format that lets you also look at the whole current specification.
The limitation seems to be that you can't modify the code yourself if you want the spec to reflect it (and also can't do LLM-driven changes that refer to the actual code), and also that in general it's not guaranteed that the spec actually reflects all important things about the program, so the code does also potentially contain "source" information (for example, maybe your want the background of a GUI to be white and it is so because the LLM happened to choose that, but it's not written in the spec).
The latter can maybe be mitigated by doing multiple generations and checking them all, but that multiplies LLM and verification costs.
Also it seems that the tool severely limits the configurability of the agentic generation process, although that's just a limitation of the specific tool.
> The limitation seems to be that you can't modify the code yourself if you want the spec to reflect it
Eventually, we'll end up in a world where humans don't need to touch code, but we are not there yet. We are looking into ways to "catch up" the specs with whatever changes happen in the code not through CodeSpeak (agents or manual changes or whatever). It's an interesting exercise. In the case of agents, it's very helpful to look at the prompts users gave them (we are experimenting with inspecting the sessions from ~/.claude).
More generally, `codespeak takeover` [1] is a tool to convert code into specs, and we are teaching it to take prompts from agent sessions into account. Seems very helpful, actually.
I think it's a valid use case to start something in vibe coding mode and then switch to CodeSpeak if you want long-term maintainability. From "sprint mode" to "marathon mode", so to speak
1. You are right that we can redefine what is code. If code is the central artefact that humans are dealing with to tell machines and other humans how the system works, then CodeSpeak specs will become code, and CodeSpeak will be a compiler. This is why I often refer to CodeSpeak as a next-level programming language.
2. I don't think being deterministic per se is what matters. Being predictable certainly does. Human engineers are not deterministic yet people pay them a lot of money and use their work all the time.
Compiler is not 100% deterministic. Its output can change when you upgrade its version, its output can change when you change optimization options. Using profile-guided optimization can also change between runs.
If you change inputs then obviously you will get a different output. Crucially using the same inputs, however, produces the same output. So compilers are actually deterministic.
Also they seem to want to run this as a business, which seems absurd to me since I don't see how they can possibly charge money, and anyway the idea is so simple that it can be reimplemented in less than a week (less than a day for a basic version) and those alternative implementations may turn out to be better.
It also seems to be closed-source, which means that unless they open the source very soon it will very likely be immediately replaced in popularity by an open source version if it turns out to gain traction.
Also a bit formal. Maybe something like this will be the output of the prompt to let me know what the AI is going to generate in the binary, but I doubt I will be writing code like this in 5 years, English will probably be fine at my level.
> Also it seems that the tool severely limits the configurability of the agentic generation process, although that's just a limitation of the specific tool.
Working on that as well. We need to be a lot more flexible and configurable
* This isn't a language, it's some tooling to map specs to code and re-generate
* Models aren't deterministic - every time you would try to re-apply you'd likely get different output (without feeding the current code into the re-apply and let it just recommend changes)
* Models are evolving rapidly, this months flavour of Codex/Sonnet/etc would very likely generate different code from last months
* Text specifications are always under-specified, lossy and tend to gloss over a huge amount of details that the code has to make concrete - this is fine in a small example, but in a larger code base?
* Every non-trivial codebase would be made up of of hundreds of specs that interact and influence each other - very hard (and context - heavy) to read all specs that impact functionality and keep it coherent
I do think there are opportunities in this space, but what I'd like to see is:
* write text specifications
* model transforms text into a *formal* specification
* then the formal spec is translated into code which can be verified against the spec
2 and three could be merged into one if there were practical/popular languages that also support verification, in the vain of ADA/Spark.
But you can also get there by generating tests from the formal specification that validate the implementation.
Models aren't deterministic - every time you would try to re-apply you'd likely get different output (without feeding the current code into the re-apply and let it just recommend changes)
If the result is always provably correct it doesn't matter whether or not it's different at the code level. People interested in systems like this believe that the outcome of what the code does is infinity more important than the code itself.
That if at the beginning of your sentence is doing a whole lot of work. Indeed, if we could formally and provably (another extremely loaded word) generate good code that'd be one thing, but proving correctness is one of those basically impossible tasks.
> but proving correctness is one of those basically impossible tasks.
To aim for a meeting of the minds... Would you help me out and unpack what you mean so there is less ambiguity? This might be minor terminological confusion. It is possible we have different takes, though -- that's what I'm trying to figure out.
There are at least two senses of 'correctness' that people sometimes mean: (a) correctness relative to a formal spec: this is expensive but doable*; (b) confidence that a spec matches human intent: IMO, usually a messy decision involving governance, organizational priorities, and resource constraints.
Sometimes people refer to software correctness problems in a very general sense, but I find it hard to parse those. I'm familiar with particular theoretical results such as Rice's theorem and the halting problem that pertain to arbitrary programs.
* With tools like {Lean, Dafny, Verus, Coq} and in projects like {CompCert, sel4}.
You got it completely backwards. The claim is that if the code does exactly what the spec says (which generated tests are supposed to "prove") then the actual code does not matter, even if it's different each time.
The point they are making is the tests are neither necessary nor sufficient alone to prove the code does exactly what the spec says. Looking at the tests isn't enough to prove anything; as an extreme example, if no one involved looks at the code, then the tests can just be static always passing and you wouldn't know either way whether or not the code matches the spec or not.
If anyone cared enough they could look at the code and see the problem immediately and with little effort, but we're encouraging a world where no one cares enough to put even that baseline effort because *gestures at* the tests are passing. Who cares how wrong the code is and in what ways if all the lights are green?
> If the result is always provably correct it doesn't matter whether or not it's different at the code level. People interested in systems like this believe that the outcome of what the code does is infinity more important than the code itself.
If the spec is so complete that it covers everything, you might as well write the code.
The benefit of writing a spec and having the LLM code it, is that the LLM will fill in a lot of blanks. And it is this filling in of blanks that is non-deterministic.
Except one shoe is made by children in a fire-trap sweatshop with no breaks, and the other was made by a well paid adult in good working conditions.
The ends don’t justify the means. The process of making impacts the output in ways that are subtle and important, but even holding the output as a fixed thing - the process of making still matters, at least to the people making it.
If you are a “programmer” you are going to be the kids in the sweatshop. On the enterprise dev side where most developers work, it’s been headed in that direction for at least a decade where it was easy enough to become a “good enough” generic full stack/mobile/web etc dev.
Even on the BigTech side being able to reverse a btree on the whiteboard and having on your resume that you were a mid level developer isn’t enough either anymore
If you look at the comp on that side, it’s also stagnated for decade. AI has just accelerated that trend.
While my job has been at various percentages to produce code for 30 years, it’s been well over a decade since I had to sell myself on “I codez real gud”. I sell myself as a “software engineer” who can go from ambiguous business and technical requirements, deal with politics, XYProblems, etc
Exactly. I work in a consulting company as a customer facing staff consultant - highest level - specializing in cloud + app dev. We don’t hire anyone less than staff in the US. Anything lower is hired out of the country.
That’s exactly my point. “Programming” was clearly becoming commoditized a decade ago.
Out of bounds behavior is sometimes a known unknown, but in the era of generated code is exclusively unknown unknowns.
Good luck speccing out all the unanticipated side effects and undefined behaviors. Perhaps you can prompt the agent in a loop a bnumber of times but it's hard to believe that the brute-force throw-more-tokens-at-it approach has the same level of return as a more attentive audit by human eyeballs.
Are you as a developer 100% able to trust that you didn’t miss anything? Your team if you are a team lead who delegates tasks to other developers? If you outsource non business things like Salesforce integrations etc do you know all of the code they wrote? Your library dependencies? Your infrastructure providers?
I don’t know. I’m making a point that the only people whose sole responsibility is code that they personally write are mid level ticket takers.
I don’t review every line of code by everyone whose output I’m responsible for, I ask them to explain how they did things and care about their testing, the functional and non functional requirements and hotspots like concurrency, data access patterns, architectural issues etc.
For instance, I haven’t done web development since 2002 except for a little copy and paste work. I completely vibe coded three internal web admin sites for separate projects and used Amazon Cognito for authentication. I didn’t look at a line of code that AI generated any more than I would have looked at a line of code for a website I delegated to the web developer. I cared about functionality and UX.
If it's wrong then it's not provably correct (for any value of 'proof').
How you define your proof is up to you. It might be a simple test, or an exhaustive suite of tests, or a formal proof. It doesn't matter. If the output of the code is correct by your definition, then it doesn't matter what the underlying code actually is.
> * model transforms text into a formal specification
formal specification is no different from code: it will have bugs :)
There's no free lunch here: the informal-to-formal transition (be it words-to-code or words-to-formal-spec) comes through the non-deterministic models, period.
If we want to use the immense power of LLMs, we need to figure out a way to make this transition good enough
If what you're after is determinism, then your solution doesn't offer it. Both the formal specification and the code generated from it would be different each time. Formal specifications are useful when they're succinct, which is possible when they specify at a higher level of abstraction than code, which admits many different implemementations.
The point would presumably be to formalise it, then verify that the formal version matches what you actually meant. At which point you can't/shouldn't regenerate it, but you can request changes (which you'd need to verify and approve).
But the code produced from the formal spec would still be nondeterministic. And I believe CodeSpeak doesn't wish to regenerate the entire program with each spec change, but apply code changes based on the changes to the spec. Maybe there could be other benefits to formalisation in this case, but determinism isn't one of them.
Validating programs against a formal spec is very, very hard for foundational computational complexity reasons. There's a reason why the largest programs whose code was fully verified against a formal spec, and at an enormous cost, were ~10KLOC. If you want to do it using proofs, then lines of proof outnumber lines of code 10-1000 to 1, and the work is far harder than for proofs in mathematics (that are typically much shorter). There are less absolute ways of checking spec conformance at some useful level of confidence, and they can be worthwhile, but they require expertise and care (I'm very much in favour of using them, but the thought that AI can "just" prove conformance to a formal spec ignores the computational complexity results in that field).
For most cases we don't need nearly that comprehensive verification. This is expecting more off AI written code than we ever bother to subject most human written code to. There's a vast chasm there we only need to even slightly start to bridge to get to far higher confidence levels than the typical human dev team achieves.
I use Kiro IDE (≠ Kiro CLI) primarily as a spec generator.
In my experience, it's high-quality for creating and iterating on specs. Tools like Cursor are optimized for human-driven vibing -- they have great autocomplete, etc. Kiro, by contrast, is optimized around spec, which ironically has been the most effective approach I've found for driving agents.
I'd argue that Cursor, Antigravity, and similar tools are optimized for human steering, which explains their popularity, while Kiro is optimized for agent harnesses. That's also why it’s underused: it's quite opinionated, but very effective. Vibe-coding culture isn't sold on spec driven development (they think it's waterfall and summarily dismiss it -- even Yegge has this bias), so people tend to underrate it.
Kiro writes specs using structured formats like EARS and INCOSE (which is the spc format used in places like Boeing for engineering reqs). It performs automated reasoning to check for consistency, then generates a design document and task list from the spec -- similar to what Beads does. I usually spend a significant amount of time pressure-testing the spec before implementing (often hours to days), and it pays off. Writing a good, consistent spec is essentially the computer equivalent of "writing as a tool of thought" in practice.
Once the spec is tight, implementation tends to follow it closely. Kiro also generates property-based tests (PBTs) using Hypothesis in Python, inspired by Haskell's QuickCheck. These tests sweep the input domain and, when combined with traditional scenario-based unit tests, tend to produce code that adheres closely to the spec. I also add a small instruction "do red/green TDD" (I learned this from Simon Willison) and that one line alone improved the quality of all my tests.
Kiro can technically implement the task list itself, but this is where agents come in. With the spec in hand, I use multiple headless CLI agents in tmux (e.g., Kiro CLI, Claude Code) for implementation. The results have been very good. With a solid Kiro spec and task list, agents usually implement everything end-to-end without stopping -- I haven’t found a need for Ralph loops. (agents sometimes tend to stop mid way on Claude plans, but I've never had that happen with Kiro, not sure why, maybe it's the checklist, which includes PBT tests as gates).
didn't have the strongest start, but the Kiro IDE is one of the best spec generators I've used, and it integrates extremely well with agent-driven workflows.
My process has organically evolved towards something similar but less strictly defined:
- I bootstrap AGENTS.md with my basic way of working and occasionally one or two project specific pieces
- I then write a DESIGN.md. How detailed or well specified it is varies from project to project: the other day I wrote a very complete DESIGN.md for a time tracking, invoice management and accounting system I wanted for my freelance biz. Because it was quite complete, the agent almost one-shot the whole thing
- I often also write a TECHNICAL-SPEC.md of some kind. Again how detailed varies.
- Finally I link to those two from the AGENTS. I also usually put in AGENTS that the agent should maintain the docs and keep them in sync with newer decisions I make along the way.
This system works well for me, but it's still very ad hoc and definitely doesn't follow any kind of formally defined spec standard. And I don't think it should, really? IMO, technically strict specs should be in your automated tests not your design docs.
I think many have adopted "spec driven development" in the way you describe.
I found it works very well in once-off scenarios, but the specs often drift from the implementation.
Even if you let the model update the spec at the end, the next few work items will make parts of it obsolete.
Maybe that's exactly the goal that "codespeak" is trying to solve, but I'm skeptical this will work well without more formal specifications in the mix.
> specs often drift from the implementation
> Maybe that's exactly the goal that "codespeak" is trying to solve
Yes and yes. I think it's an important direction in software engineering. It's something that people were trying to do a couple decades ago but agentic implementation of the spec makes it much more practical.
I have the same basic workflow as you outlined, then I feed the docs into blackbird, which generates a structured plan with task and sub tasks. Then you can have it execute tasks in dependency order, with options to pause for review after each task or an automated review when all child task for a given parents are complete.
It’s definitely still got some rough edges but it has been working pretty well for me.
There should be a setting to include specific files in every prompt/context. I’m using zed and when you fire up an agent / chat it explicitly states that the file(s) are included.
Is that really true? I haven’t tried to do my own inference since the first Llama models came out years ago, but I am pretty sure it was deterministic: if you fixed the seed and the input was the same, the output of the inference was always exactly the same.
1.) There is typically a temperature setting (even when not exposed, most major providers have stopped exposing it [esp in the TUIs]).
2.) Then, even with the temperature set to 0, it will be almost deterministic but you'll still observe small variations due to the limited precision of float numbers.
> but you'll still observe small variations due to the limited precision of float numbers
No. Floating number arithmetic is deterministic. You don't get different answers for the same operations on the same machine just because of limited precision. There are reasons why it can be difficult to make sure that floating point operations agree across machines, but that is more of a (very annoying and difficult to make consistent) configuration thing than determinism.
(In general it is mildly frustrating to me to see software developers treat floating point as some sort of magic and ascribe all sorts of non-deterministic qualities to it. Yes floating point configuration for consistent results across machines can be absurdly annoying and nigh-impossible if you use transcendental functions and different binaries. No this does not mean if your program is giving different results for the same input on the same machine that this is a floating point issue).
In theory parallel execution combined with non-associativity can cause LLM inference to be non-deterministic. In practice that is not the case. LLM forward passes rarely use non-deterministic kernels (and these are usually explicitly marked as such e.g. in PyTorch).
You may be thinking of non-determinism caused by batching where different batch sizes can cause variations in output. This is not strictly speaking non-determinism from the perspective of the LLM, but is effectively non-determinism from the perspective of the end user, because generally the end user has no control over how a request is slotted into a batch.
Limited precision of float numbers is deterministic. But there's whole parallelism and how things are wired together, your generation may end up on a different hardware etc.
And models I work with (claude,gemini etc) have the temperature parameter when you are using API.
In reality you give the same programmer an update to the existing spec, and they change the code to implement the difference. Which is exactly what the thing in OP is doing, and exactly what should be done. There's simply no reason to regenerate the result.
The entire thing about determinism is a red herring, because 1) it's not determinism but prompt instability, and 2) prompt instability doesn't matter because of the above. Intelligence (both human and machine) is not a formal domain, your inputs lack formal syntax, and that's fine. For some reason this basic concept creates endless confusion everywhere.
I think your objections miss the point. My informal specs to a program are user-focused. I want to dictate what benefits the program will give to the person who is using it, which may include requirements for a transport layer, a philosophy of user interaction, or any number of things. When I know what I want out of a program, I go through the agony of translating that into a spec with database schemas, menu options, specific encryption schemes, etc., then finally I turn that into a formal spec within which whether I use an underscore or a dash somewhere becomes a thing that has to be consistent throughout the document.
You're telling me that I should be doing the agonizing parts in order for the LLM to do the routine part (transforming a description of a program into a formal description of a program.) Your list of things that "make no sense" are exactly the things that I want the LLMs to do. I want to be able to run the same spec again and see the LLM add a feature that I never expected (and wasn't in the last version run from the same spec) or modify tactics to accomplish user goals based on changes in technology or availability of new standards/vendors.
I want to see specs that move away from describing the specific functionality of programs altogether, and more into describing a usefulness or the convenience of a program that doesn't exist. I want to be able to feed the LLM requirements of what I want a program to be able to accomplish, and let the LLM research and implement the how. I only want to have to describe constraints i.e. it must enable me to be able to do A, B, and C, it must prevent X,Y, and Z; I want it to feel free to solve those constraints in the way it sees fit; and when I find myself unsatisfied with the output, I'll deliver it more constraints and ask it to regenerate.
This resonates. I built an AI career consultant (CareerCraft AI) with exactly this philosophy — describe what the user needs to accomplish, not how to implement it.
The spec was essentially: "A user shares their professional background and a target job posting. The AI has a conversation to understand their experience, then generates a tailored resume and cover letter as downloadable PDFs. It remembers their profile within the session so they can generate documents for multiple roles."
That's it. No formal grammar, no structured templates for the conversation flow. The LLM handles the "how" — when to ask clarifying questions, how to reframe experience for different roles, what to emphasize based on the job description.
The interesting finding: natural language specs work remarkably well for applications where the output is also natural language (documents, advice, analysis). The formal spec approach makes more sense when the output needs to be deterministic code.
> I want to be able to run the same spec again and see the LLM add a feature that I never expected (and wasn't in the last version run from the same spec) or modify tactics to accomplish user goals based on changes in technology or availability of new standards/vendors.
Be careful what you wish for. This sounds great in theory but in practice it will probably mean a migration path for the users (UX changes, small details changed, cost dynamics and a large etc.)
I tried this recently with what I thought was a simple layout, but probably uncommon for CSS. It took an extremely long back and forth to nail it down. It seemingly had no understanding how to achieve what I wanted. A couple sentences would have been clear to a person. Sometimes LLMs are fantastic and sometimes they are brain dead.
Definitely won't use it for prod ofc but may try it out for a side-project.
It seems that this is more or less:
- instead of modules, write specs for your modules
- on the first go it generates the code (which you review)
- later, diffs in the spec are translated into diffs in the code (the code is *not* fully regenerated)
this actually sounds pretty usable, esp. if someone likes writing. And wherever you want to dive deep, you can delve down into the code and do "microoptimizations" by rolling something on your own (with what seems to be called here "mixed projects").
That said, not sure if I need a separate tool for this, tbh. Instead of just having markdown files and telling cause to see the md diff and adjust the code accordingly.
Instead of imperatively letting the agents hammer your codebase into shape through a series of prompts, you declare your intent, observe the outcome and refine the spec.
The agents then serve as a control plane, carrying out the intent.
When you translate spec to tests (if those are traditional unit tests or any automated tests that call the rest of the code), that fixes the API of the code, i.e. the code gets designed implicitly in the test generation step. Is this working well in your experience?
This concept is assuming a formalized language would make things easier somehow for an llm. That’s making some big assumptions about the neuro anatomy if llms. This [1] from the other day suggests surprising things about how llms are internally structured; specifically that encoding and decoding are distinct phases with other stuff in between. Suggesting language once trained isn’t that important.
We are not trying to make things easier for LLMs. LLMs will be fine. CodeSpeak is built for humans, because we benefit from some structure, knowing how to express what we want, etc.
The problem with formal prompting languages is they assume the bottleneck is ambiguity in the prompt. In my experience building agents, the bottleneck is actually the model's context understanding. Same precise prompt, wildly different results depending on what else is in the context window. Formalizing the prompt doesn't help if the model builds the wrong internal representation of your codebase. That said curious to see where this goes.
Two pieces of advice I keep seeing over & over in these discussions-- 1) start with a fresh/baseline context regularly, and 2) give agents unix-like tools and files which can be interacted with via simple pseudo-English commands such as bash, where they can invoke e.g. "--help" to learn how to use them.
I'm not sure adding a more formal language interface makes sense, as these models are optimized for conversational fluency. It makes more sense to me for them to be given instructions for using more formal interfaces as needed.
The title writer might be doing the project a disservice by using the term "formal" to describe it, given that the project talks a lot about "specs". I mistook it to imply something about formal specification.
My quick understanding is that isn't really trying to utilize any formal specification but is instead trying to more-clearly map the relationship between, say, an individual human-language requirement you have of your application, and the code which implements that requirement.
I've done something similar for queries. Comments:
* Yes, this is a language, no its not a programming language you are used to, but a restricted/embellished natural language that (might) make things easier to express to an LLM, and provides a framework for humans who want to write specifications to get the AI to write code.
* Models aren't deterministic, but they are persistent (never gonna give up!). If you generate tests from your specification as well as code, you can use differential testing to get some measure (although not perfect) of correctness. Never delete the code that was generated before, if you change the spec, have your model fix the existing code rather than generate new code.
* Specifications can actually be analyzed by models to determine if they are fully grounded or not. An ungrounded specification is going to not be a good experience, so ask the model if it thinks your specification is grounded.
* Use something like a build system if you have many specs in your code repository and you need to keep them in sync. Spec changes -> update the tests and code (for example).
We tend to obsess over abstractions, frameworks, and standards, which is a good thing. But we already have BDD and TDD, and now, with english as the new high-level programming language, it is easier than ever to build. Focusing on other critical problem spaces like context/memory is more useful at this point. If the whole purpose of this is token compression, I don't see myself using it.
I'm writing the tool as proof of the spec. Still very much a pre-alpha phase, but I do have a working POC in that I can specify a series of prompts in my YAML language and execute the chain of commands in a local agent.
One of the "key steps" that I plan on designing is specifically an invocation interceptor. My underlying theory is that we would take whatever random series of prose that our human minds come up with and pass it through a prompt refinement engine:
> Clean up the following prompt in order to convert the user's intent
> into a structured prompt optimized for working with an LLM
> Be sure to follow appropriate modern standards based on current
> prompt engineering reasech. For example, limit the use of persona
> assignment in order to reduce hallucinations.
> If the user is asking for multiple actions, break the prompt
> into appropriate steps (**etc...)
That interceptor would then forward the well structured intent-parsed prompt to the LLM. I could really see a step where we say "take the crap I just said and
turn it into CodeSpeak"
What a fantastic tool. I'll definitely do a deep dive into this.
From what I was able to understand during the interview there, it's not actually a language, more like an orchestrator + pinning of individual generated chunks.
The demo I've briefly seen was very very far from being impressive.
Got rejected, perhaps for some excessive scepticism/overly sharp questions.
My scepticism remains - so far it looks like an orchestrator to me and does not add enough formalism to actually call it a language.
I think that the idea of more formal approach to assisted coding is viable (think: you define data structures and interfaces but don't write function bodies, they are generated, pinned and covered by tests automatically, LLMs can even write TLA+/formal proofs), but I'm kinda sceptical about this particular thing. I think it can be made viable but I have a strong feeling that it won't be hard to reproduce that - I was able to bake something similar in a day with Claude.
You can basically condense this entire "language" into a set of markdown rules and use it as a skill in your planning pipeline.
And whatever codespeak offers is like a weird VCS wrapper around this. I can already version and diff my skills, plans properly and following that my LLM generated features should be scoped properly and be worked on in their own branches. This imo will just give rise to a reason for people to make huge 8k-10k line changes in a commit.
> And whatever codespeak offers is like a weird VCS wrapper around this.
I'm still getting used to the idea that modern programs are 30 lines of Markdown that get the magic LLM incantation loop just right. Seems like you're in the same boat.
I am trying a similar spec driven development idea in a project I am working on. One big difference is that my specifications are not formalized that much. Tney are in plain language and are read directly by the LLM to convert to code. That seems like the kind of thing the LLM is good at. One other feature of this is that it allows me to nudge the implmentation a little with text in the spec outside of the formal requirements. I view it two ways, as spec-to-code but also as a saved prompt. I haven't spent enough time with it to say how successfuly it is, yet.
Do you save these "prompts" so you can improve, and in turn improve the code. to me Spec Driven Development is more than a spec to generate code, structured or not.
Interesting project, but I think it's solving the wrong bottleneck. The gap between what I want and what the model produces isn't primarily a language problem — it's a knowledge problem. You can write the most precise spec imaginable, but if the model doesn't have domain-specific knowledge about your product's edge cases, undocumented behaviors, or the tribal knowledge your team has accumulated, the output will be confidently wrong regardless of how formally you specified it.
I've been working on this from the other direction — instead of formalizing how you talk to the model, structure the knowledge the model has access to. When you actually measure what proportion of your domain knowledge frontier models can produce on their own (we call this the "esoteric knowledge ratio"), it's often only 40-55% for well-documented open source projects. For proprietary products it's even lower. No amount of spec formalism fixes that gap — you need to get the missing knowledge into context.
Isn't that the point though? In the development loop, you'd diagnose why it's not building what you expect, so you flush out those previous implicit or even subconscious edge cases, undocumented behaviors, and tribal knowledge and codify them into the spec.
It would actually end up being a lot easier to maintain than a bunch of undocumented spaghetti.
This seems like a step backwards. Programming Languages for LLMs need a lot of built in guarantees and restrictions. Code should be dense. I don't really know what to make of this project. This looks like it would make everything way worse.
I've had good success getting LLMs to write complicated stuff in haskell, because at the end of the day I am less worried about a few errant LLM lines of code passing both the type checking and the test suite and causing damage.
It is both amazing and I guess also not surprising that most vibe coding is focused on python and javascript, where my experience has been that the models need so much oversight and handholding that it makes them a simple liability.
The ideal programming language is one where a program is nothing but a set of concise, extremely precise, yet composable specifications that the _compiler_ turns into efficient machine code. I don't think English is that programming language.
The pattern we keep converging on is to treat model calls like a budgeted distributed system, not like a magical API. The expensive failures usually come from retries, fan-out, and verbose context growth rather than from a single bad prompt. Once we started logging token use per task step and putting hard ceilings on planner depth, costs became much more predictable.
This doesn't seem particularly formal. I still remain unconvinced reducing is really going to be valuable. Code obviously is as formal as it gets but as you trend away from that you quickly introduce problems that arise from lack of formality. I could see a world in which we're all just writing tests in the form of something like Gherkin though.
People seem weirdly eager to talk to LLMs in proto-code instead of fixing the base problem that LLMs are just unreliable interpreters. If your tool needs a new human-friendly DSL to avoid the ambiguity of plain English, maybe what you really want is to be writing actual code or specs with a type system and feedback loop. Any halfway formalism gives a false sense of precision, and you still get blindsided by the same model quirks, just dressed up differently.
> I could see a world in which we're all just writing tests in the form of something like Gherkin though.
Yes, and the implementation... no one actually cares about that. This would be a good outcome in my view. What I see is people letting LLMs "fill in the tests", whereas I'd rather tests be the only thing humans write.
While I'm also a bit skeptical, I think some formalism could really simplify everything. The programming world has lots of words that mean close to the same thing (subroutine, method, function, etc. ). Why not choose one and stick to it for interactions with the LLM? It should save plenty of complexity.
Under "Prerequisites"[0] I see: "Get an Anthropic API key".
I presume this is temporary since the project is still in alpha, but I'm curious why this requires use of an API at all and what's special about it that it can't leverage injecting the prompt into a Claude Code or other LLM coding tool session.
One requirement for a programming language to be “good” is that doing this, with sufficient specificity to get all the behavior you want, will be more verbose than the code itself.
The other piece that has always struck me as a huge inefficiency with current usage of LLMs is the hoops they have to jump through to make sense of existing file formats - especially making sense of (or writing) complicated semi-proprietary formats like PDF, DOC(X), PPT(X), etc.
Long-term prediction: for text, we'll move away from these formats and towards alternatives that are designed to be optimal for LLMs to interact with. (This could look like variants of markdown or JSON, but could also be Base64 [0] or something we've not even imagined yet.)
If LLMs can't deal with those legacy file formats, I don't trust them to be able to deal with anything. The idea that LLMs are so sophisticated that we have a need to dumb down inputs in order to interact with them is self-contradictory.
While I agree, the parent also talks about efficiency. If a different format increases efficiency, that could be reason enough to switch to it, even if understanding doesn’t improve and already was good before.
I'm gonna be honest here, I opened this website excited thinking this was a sort of new paradigm or programming language, and I ended up extremely confused at what this actually is and I still don't understand.
Is it a code generator tool from specs? Ugh. Why not push for the development of the protocol itself then?
A few days ago I released https://github.com/b4rtaz/incrmd
, which is similar to Codespeak. The main difference is that the specification is defined at the *project* level. I'm not sure if having the specification at the *file* level is a good choice, because the file structure does not necessarily align with the class structure, etc.
i’ve been doing this for a while, you create an extra file for every code file, sketch the code as you currently understand it (mostly function signatures and comments to fill in details), ask the LLM to help identify discrepancies. i call it “overcoding”.
i guess you can build a cli toolchain for it, but as a technique it’s a bit early to crystallize into a product imo, i fully expect overcoding to be a standard technique in a few years, it’s the only way i’ve been able to keep up with AI-coded files longer than 1500 lines
I tried looking through some of the spec samples, and it was not clear what the "language" was or that there was any syntax. It just looks like a terse spec.
In my building and research of Simplex, specs designed for LLM consumption don't need a formalized syntax as much as they just need an enforced structure, ideally paired with a linter. An effective spec for LLMs will bridge the gap between natural language and a formal language. It's about reducing ambiguity of intent because of the weaknesses and inconsistencies of natural language and the human operator.
When we understand that AI allows the spec to be in English (or any natural language), we might stop attempting to build "structured english" for spec.
So, instead of making LLMs smarter let’s make everything abstract again? Because everyone wants to learn another tool? Or is this supposed to be something I tell Claude, “Hey make some code to make some code!” I’m struggling to see the benefit of this vs. just telling Claude to save its plan for re-use.
Yep, you're right, I read this too fast - it's also breaking long lines into many and I read this in reverse. I just imagined how much I could reduce my own LOC by adjusting the print width on my prettier settings..
Getting so close to the idea. We will only have Englishscripts and don’t need code anymore. No compiling. No vibe coding. No coding. Https://jperla.com/blog/claude-electron-not-claudevm
it's not a new question if the as-is programming languages are optimal for LLMs: a language for LLM use would have to strongly typed. But that's about it for obvious requirements.
I would just like to point out the fun fact that instead of the brave new MD speak, there is still a `codespeak.json` to configure the build system itself...
...which seems to suggest that the authors themselves don't dogfood their own software. Please tell me that Codespeak was written entirely with Codespeak!
Instead of that json, which is so last year, why not use an agent to create an MD file to setup another agent, that will compile another MD file and feed it to the third agent, that... It is turtles, I mean agents, all the way down!
We created programming languages to direct programs. Then created LLM's to use English to direct programs. Now we've create programming languages to direct LLM's. What is old is new again!
The tweet I saw a few weeks ago about LLMs enabling building stupid ideas that would have never been built otherwise particularly resonates with this one.
Another great way to shrink your codebase 10x? Rewrite it in APL. If less code means less information, what are we gonna do when missing information was important?
As someone who hates writing (and thus coding) this might be a good tool, but how’s is it different from doing the same in claude? And I only see python, what about other languages, are they also production grade?
The intent of the idea is there, and I agree that there should be more precise syntax instead of colloquial English. However, it's difficult to take CodeSpeak seriously as it looks AI generated and misses key background knowledge.
I'm hoping for a framework that expands upon Behavior Driven Development (BDD) or a similar project-management concept. Here's a promising example that is ripe for an Agentic AI implementation, https://behave.readthedocs.io/en/stable/philosophy/#the-gher...
THe HN title seems very misleading to me. How is this, in any sense of the word, "formal?" I don't see that particular word used to describe this tool on the web page itself.
The site does describe it as a "programming language," which feels like a novel use of the term to me. The borders around a term like "programming language" are inherently fuzzy, but something like "code generation tool" better describes CodeSpeak IMHO.
This is pretty lame. I WANT to write code, something that has a formal definition and express my ideas in THAT, not some adhoc pseudo english an LLM then puts the cowboy hat on and does what the hotness of the week is.
Programming is in the end math, the model is defined and, when done correctly follows common laws.
I cannot read light on black. I don't know, maybe it's a condition, or simply just part of getting old. But my eyes physically hurt, and when I look up from reading a light-on-black screen, even when I looked at only for a short moment, my eyes need seconds to adjust again.
I know dark mode is really popular with the youngens but I regularly have to reach for reader mode for dark web pages, or else I simply cannot stand reading the contents.
Unfortunately, this site does not have an obvious way of reading it black-on-white, short of looking at the HTML source (CTRL+U), which - in fact - I sometimes do.
I find dark mode much easier to read and far less eye strain. I guess it just shows that users should be the ones to set the preference. There are studies on monkeys showing light mode leading to myopia. Although lately I have come to learn there are lots of poorly done studies.
Same for me, has been my whole life. I complain about it all the time. It's well documented that people can read black on light far better and with less eye strain than light on black; yet there seems to be a whole generation of developers determined to force us all to try and read it. Even the media sites like Netflix, Prime, etc. force it. At least Tubi's is somewhat more readable.
Sometimes a site will include a button or other UI element to choose a light theme but I find it odd that so many sites which are presumed to be designed by technically competent people, completely ignore accessibility concerns.
The most common mistake I see (on this website at least) is the assumption that one's programming competence is equal to their competence in other things.
Do you sit in a bright room? Right now, during the night, I see your comment like this: https://i.imgur.com/c7fmBns.png, but during the day when the room is bright, I also see everything with light themes/background colors, otherwise it is indeed hard to see properly.
When it’s dark (I can’t stand bright rooms at night), I lower the brightness of my screens instead of going for dark mode. I have astigmatism and any tiny bright spot is hard to focus on. It’s easier when the bright part is large and the dark parts are small (black on white is best).
We built LLMs so that you can express your ideas in English and no longer need to code.
Also, English is really too verbose and imprecise for coding, so we developed a programming language you can use instead.
Now, this gives me a business idea: are you tired of using CodeSpeak? Just explain your idea to our product in English and we'll generate CodeSpeak for you.
Yeah. It's hard to express and understand nested structures in a natural language yet they are easy in high-level programming languages. E.g. "the dog of first son of my neighbour" vs "me.neighbour.sons[0].dog", "sunny and hot, or rainy but not cold" vs "(sunny && hot) || (rainy && !cold)".
In the past maths were expressed using natural language, the math language exists because natural language isn't clear enough.
That seems like it could lead to imprecise outcomes, so I've started a business that defines a spec to output the correct English to input to your product.
"In order to make machines significantly easier to use, it has been proposed (to try) to design machines that we could instruct in our native tongues. this would, admittedly, make the machines much more complicated, but, it was argued, by letting the machine carry a larger share of the burden, life would become easier for us. It sounds sensible provided you blame the obligation to use a formal symbolism as the source of your difficulties. But is the argument valid? I doubt."
My gut says Kotlin is great for individual developer experience. But I never heard or saw credible reports on the Total Cost of Ownership, e.g., Kotlin engineers hiring, swapping out on a team.
> It’s one of those things that crackpots keep trying to do, no matter how much you tell them it could never work. If the spec defines precisely what a program will do, with enough detail that it can be used to generate the program itself, this just begs the question: how do you write the spec? Such a complete spec is just as hard to write as the underlying computer program, because just as many details have to be answered by spec writer as the programmer.
Joel Spolsky, stackoverflow.com founder, Talk at Yale: Part 1 of 3 https://www.joelonsoftware.com/2007/12/03/talk-at-yale-part-...
It actually makes sense that code is becoming amorphous and we will no longer scale in terms of building out new features (which has become cheap), but by defining stricter and stricter behavior constraints.
Program generation from a spec meant something vastly different in 2007 than it does now. People can and are generating programs from underspecified prompts. Trying to be systematic about how prompts work is a worthwhile area to explore.
What he misses is that it's much easier to change the spec than the code. And if the cost of regenerating the code is low enough, then the code is not worth talking about.
Is it? If the spec is as detailed as the code would be? If you make a change to one part of the spec do you now have inconsistencies that the LLM is going to have to resolve in some way? Are we going to have a compiler, or type checker type tools for the spec to catch these errors sooner?
As far as I can tell it's not a new language, but rather an alternative workflow for LLM-based development along with a tool that implements it.
The idea, IIUC, seems to be that instead of directly telling an LLM agent how to change the code, you keep markdown "spec" files describing what the code does and then the "codespeak" tool runs a diff on the spec files and tells the agent to make those changes; then you check the code and commit both updated specs and code.
It has the advantage that the prompts are all saved along with the source rather than lost, and in a format that lets you also look at the whole current specification.
The limitation seems to be that you can't modify the code yourself if you want the spec to reflect it (and also can't do LLM-driven changes that refer to the actual code), and also that in general it's not guaranteed that the spec actually reflects all important things about the program, so the code does also potentially contain "source" information (for example, maybe your want the background of a GUI to be white and it is so because the LLM happened to choose that, but it's not written in the spec).
The latter can maybe be mitigated by doing multiple generations and checking them all, but that multiplies LLM and verification costs.
Also it seems that the tool severely limits the configurability of the agentic generation process, although that's just a limitation of the specific tool.
As far as I can tell C is not a new language, but rather an alternative workflow for assembly development along with a tool that implements it.
I second that :)
> The limitation seems to be that you can't modify the code yourself if you want the spec to reflect it
Eventually, we'll end up in a world where humans don't need to touch code, but we are not there yet. We are looking into ways to "catch up" the specs with whatever changes happen in the code not through CodeSpeak (agents or manual changes or whatever). It's an interesting exercise. In the case of agents, it's very helpful to look at the prompts users gave them (we are experimenting with inspecting the sessions from ~/.claude).
More generally, `codespeak takeover` [1] is a tool to convert code into specs, and we are teaching it to take prompts from agent sessions into account. Seems very helpful, actually.
I think it's a valid use case to start something in vibe coding mode and then switch to CodeSpeak if you want long-term maintainability. From "sprint mode" to "marathon mode", so to speak
[1] https://codespeak.dev/blog/codespeak-takeover-20260223
> Eventually, we'll end up in a world where humans don't need to touch code, but we are not there yet.
Will we though? Wouldn't AI need to reach a stage where it is a tool, like a compiler, which is 100% deterministic?
Two things to mention here:
1. You are right that we can redefine what is code. If code is the central artefact that humans are dealing with to tell machines and other humans how the system works, then CodeSpeak specs will become code, and CodeSpeak will be a compiler. This is why I often refer to CodeSpeak as a next-level programming language.
2. I don't think being deterministic per se is what matters. Being predictable certainly does. Human engineers are not deterministic yet people pay them a lot of money and use their work all the time.
We will and soon because it does not have to be deterministic like a compiler. It only has to pass all tests.
Who is writing the tests?
There are different kinds of tests:
* regression tests – can be generated
* conformance tests – often can be generated
* acceptance tests – are another form of specification and should come from humans.
Human intent can be expressed as
* documents (specs, etc)
* review comments, etc
* tests with clear yes/no feedback (data for automated tests, or just manual testing)
And this is basically all that matters, see more here: https://www.linkedin.com/posts/abreslav_so-what-would-you-sa...
In the future users will write the tests
Compiler is not 100% deterministic. Its output can change when you upgrade its version, its output can change when you change optimization options. Using profile-guided optimization can also change between runs.
If you change inputs then obviously you will get a different output. Crucially using the same inputs, however, produces the same output. So compilers are actually deterministic.
Why are we eliminating our own job and maybe hobby so eagerly? Whatever. It is done.
Also they seem to want to run this as a business, which seems absurd to me since I don't see how they can possibly charge money, and anyway the idea is so simple that it can be reimplemented in less than a week (less than a day for a basic version) and those alternative implementations may turn out to be better.
It also seems to be closed-source, which means that unless they open the source very soon it will very likely be immediately replaced in popularity by an open source version if it turns out to gain traction.
Also a bit formal. Maybe something like this will be the output of the prompt to let me know what the AI is going to generate in the binary, but I doubt I will be writing code like this in 5 years, English will probably be fine at my level.
> Also it seems that the tool severely limits the configurability of the agentic generation process, although that's just a limitation of the specific tool.
Working on that as well. We need to be a lot more flexible and configurable
This doesn't make too much sense to me.
* This isn't a language, it's some tooling to map specs to code and re-generate
* Models aren't deterministic - every time you would try to re-apply you'd likely get different output (without feeding the current code into the re-apply and let it just recommend changes)
* Models are evolving rapidly, this months flavour of Codex/Sonnet/etc would very likely generate different code from last months
* Text specifications are always under-specified, lossy and tend to gloss over a huge amount of details that the code has to make concrete - this is fine in a small example, but in a larger code base?
* Every non-trivial codebase would be made up of of hundreds of specs that interact and influence each other - very hard (and context - heavy) to read all specs that impact functionality and keep it coherent
I do think there are opportunities in this space, but what I'd like to see is:
* write text specifications
* model transforms text into a *formal* specification
* then the formal spec is translated into code which can be verified against the spec
2 and three could be merged into one if there were practical/popular languages that also support verification, in the vain of ADA/Spark.
But you can also get there by generating tests from the formal specification that validate the implementation.
Models aren't deterministic - every time you would try to re-apply you'd likely get different output (without feeding the current code into the re-apply and let it just recommend changes)
If the result is always provably correct it doesn't matter whether or not it's different at the code level. People interested in systems like this believe that the outcome of what the code does is infinity more important than the code itself.
That if at the beginning of your sentence is doing a whole lot of work. Indeed, if we could formally and provably (another extremely loaded word) generate good code that'd be one thing, but proving correctness is one of those basically impossible tasks.
> but proving correctness is one of those basically impossible tasks.
To aim for a meeting of the minds... Would you help me out and unpack what you mean so there is less ambiguity? This might be minor terminological confusion. It is possible we have different takes, though -- that's what I'm trying to figure out.
There are at least two senses of 'correctness' that people sometimes mean: (a) correctness relative to a formal spec: this is expensive but doable*; (b) confidence that a spec matches human intent: IMO, usually a messy decision involving governance, organizational priorities, and resource constraints.
Sometimes people refer to software correctness problems in a very general sense, but I find it hard to parse those. I'm familiar with particular theoretical results such as Rice's theorem and the halting problem that pertain to arbitrary programs.
* With tools like {Lean, Dafny, Verus, Coq} and in projects like {CompCert, sel4}.
Let's rephrase:
Since nobody involved actually cares whether the code works or not, it doesn't matter whether it's a different wrong thing each time.
You got it completely backwards. The claim is that if the code does exactly what the spec says (which generated tests are supposed to "prove") then the actual code does not matter, even if it's different each time.
The point they are making is the tests are neither necessary nor sufficient alone to prove the code does exactly what the spec says. Looking at the tests isn't enough to prove anything; as an extreme example, if no one involved looks at the code, then the tests can just be static always passing and you wouldn't know either way whether or not the code matches the spec or not.
If anyone cared enough they could look at the code and see the problem immediately and with little effort, but we're encouraging a world where no one cares enough to put even that baseline effort because *gestures at* the tests are passing. Who cares how wrong the code is and in what ways if all the lights are green?
> If the result is always provably correct it doesn't matter whether or not it's different at the code level. People interested in systems like this believe that the outcome of what the code does is infinity more important than the code itself.
If the spec is so complete that it covers everything, you might as well write the code.
The benefit of writing a spec and having the LLM code it, is that the LLM will fill in a lot of blanks. And it is this filling in of blanks that is non-deterministic.
> If the spec is so complete that it covers everything, you might as well write the code.
Welcome to the usual offshoring experience.
That's a huge "if."
I usually invert those to reduce nesting
Sure, but where are the formal acceptance tests to validate against?
Besides, you can deterministically generate bad code, and not deterministically generate good code.
The code is what the code does.
The shoe is what the shoe does.
Except one shoe is made by children in a fire-trap sweatshop with no breaks, and the other was made by a well paid adult in good working conditions.
The ends don’t justify the means. The process of making impacts the output in ways that are subtle and important, but even holding the output as a fixed thing - the process of making still matters, at least to the people making it.
The end is whether the code meets the functional and non functional requirements.
And guess how much shoe companies make who manufacture shoes in sweatshop conditions versus the ones who make artisanal handcrafted shoes?
Ah yes - we should all strive to maximize shareholder value - triangle shirtwaist be damnned.
Btw in my metaphor, we - the programmers - are the kids in the sweatshop.
If you are a “programmer” you are going to be the kids in the sweatshop. On the enterprise dev side where most developers work, it’s been headed in that direction for at least a decade where it was easy enough to become a “good enough” generic full stack/mobile/web etc dev.
Even on the BigTech side being able to reverse a btree on the whiteboard and having on your resume that you were a mid level developer isn’t enough either anymore
If you look at the comp on that side, it’s also stagnated for decade. AI has just accelerated that trend.
While my job has been at various percentages to produce code for 30 years, it’s been well over a decade since I had to sell myself on “I codez real gud”. I sell myself as a “software engineer” who can go from ambiguous business and technical requirements, deal with politics, XYProblems, etc
What do you think programmers in offshoring consulting shops are? Sadly.
Exactly. I work in a consulting company as a customer facing staff consultant - highest level - specializing in cloud + app dev. We don’t hire anyone less than staff in the US. Anything lower is hired out of the country.
That’s exactly my point. “Programming” was clearly becoming commoditized a decade ago.
Functional requirements are known knowns.
Out of bounds behavior is sometimes a known unknown, but in the era of generated code is exclusively unknown unknowns.
Good luck speccing out all the unanticipated side effects and undefined behaviors. Perhaps you can prompt the agent in a loop a bnumber of times but it's hard to believe that the brute-force throw-more-tokens-at-it approach has the same level of return as a more attentive audit by human eyeballs.
Are you as a developer 100% able to trust that you didn’t miss anything? Your team if you are a team lead who delegates tasks to other developers? If you outsource non business things like Salesforce integrations etc do you know all of the code they wrote? Your library dependencies? Your infrastructure providers?
It seems like ^ and ^^ agree to me. Am I missing something?
I don’t know. I’m making a point that the only people whose sole responsibility is code that they personally write are mid level ticket takers.
I don’t review every line of code by everyone whose output I’m responsible for, I ask them to explain how they did things and care about their testing, the functional and non functional requirements and hotspots like concurrency, data access patterns, architectural issues etc.
For instance, I haven’t done web development since 2002 except for a little copy and paste work. I completely vibe coded three internal web admin sites for separate projects and used Amazon Cognito for authentication. I didn’t look at a line of code that AI generated any more than I would have looked at a line of code for a website I delegated to the web developer. I cared about functionality and UX.
Yet the people voting with their wallets seem to go with cheaper option, regardless of what hides behind it.
Being shoes, offshoring, Webwidgets or AI generated code.
Sure. People go for the cheapest option that fits their requirements, mostly.
But we’re the shoemakers, not the consumers. It’s actually our job to preserve our own and our peers quality of life.
Cheapest good option possible doesn’t have to be the sweatshop - tho the shareholders of nike or zara would have you believe that.
I would be very comfortable with - re-run 100 times with different seeds. If the outcome is the same every time, you're reliably good to go.
Even when it's wrong each time?
If it's wrong then it's not provably correct (for any value of 'proof').
How you define your proof is up to you. It might be a simple test, or an exhaustive suite of tests, or a formal proof. It doesn't matter. If the output of the code is correct by your definition, then it doesn't matter what the underlying code actually is.
> * model transforms text into a formal specification
formal specification is no different from code: it will have bugs :)
There's no free lunch here: the informal-to-formal transition (be it words-to-code or words-to-formal-spec) comes through the non-deterministic models, period.
If we want to use the immense power of LLMs, we need to figure out a way to make this transition good enough
If what you're after is determinism, then your solution doesn't offer it. Both the formal specification and the code generated from it would be different each time. Formal specifications are useful when they're succinct, which is possible when they specify at a higher level of abstraction than code, which admits many different implemementations.
The point would presumably be to formalise it, then verify that the formal version matches what you actually meant. At which point you can't/shouldn't regenerate it, but you can request changes (which you'd need to verify and approve).
But the code produced from the formal spec would still be nondeterministic. And I believe CodeSpeak doesn't wish to regenerate the entire program with each spec change, but apply code changes based on the changes to the spec. Maybe there could be other benefits to formalisation in this case, but determinism isn't one of them.
Even with classic compilation, it is only the semantic behavior that is preserved.
What the Church–Rosser property/confluence is in term rewriting in lambda calculus is a possible lens.
To have a formally verified spec, one has to use some decidable fragment of FO.
If you try to replace code generation with rewriting things can get complicated fast.[2]
Rust uses affine types as an example and people try to add petri-nets[0] but in general petri-net reachability is Ackerman-complete [1]
It is just the trade off of using a context free like system like an LLM with natural language.
HoTT and how dependent types tend to break isomorphic ≃ equal Is another possible lens.
[0] https://arxiv.org/abs/2212.02754v3
[1] https://arxiv.org/abs/2212.02754v3
[2] https://arxiv.org/abs/2407.20822
It doesn't matter if the code is different if the spec is formal enough to validate the software against it.
I have no idea about codespeak - I was responding to the comments above, not about codespeak.
Validating programs against a formal spec is very, very hard for foundational computational complexity reasons. There's a reason why the largest programs whose code was fully verified against a formal spec, and at an enormous cost, were ~10KLOC. If you want to do it using proofs, then lines of proof outnumber lines of code 10-1000 to 1, and the work is far harder than for proofs in mathematics (that are typically much shorter). There are less absolute ways of checking spec conformance at some useful level of confidence, and they can be worthwhile, but they require expertise and care (I'm very much in favour of using them, but the thought that AI can "just" prove conformance to a formal spec ignores the computational complexity results in that field).
For most cases we don't need nearly that comprehensive verification. This is expecting more off AI written code than we ever bother to subject most human written code to. There's a vast chasm there we only need to even slightly start to bridge to get to far higher confidence levels than the typical human dev team achieves.
Rehashing my comment from before:
I use Kiro IDE (≠ Kiro CLI) primarily as a spec generator. In my experience, it's high-quality for creating and iterating on specs. Tools like Cursor are optimized for human-driven vibing -- they have great autocomplete, etc. Kiro, by contrast, is optimized around spec, which ironically has been the most effective approach I've found for driving agents.
I'd argue that Cursor, Antigravity, and similar tools are optimized for human steering, which explains their popularity, while Kiro is optimized for agent harnesses. That's also why it’s underused: it's quite opinionated, but very effective. Vibe-coding culture isn't sold on spec driven development (they think it's waterfall and summarily dismiss it -- even Yegge has this bias), so people tend to underrate it.
Kiro writes specs using structured formats like EARS and INCOSE (which is the spc format used in places like Boeing for engineering reqs). It performs automated reasoning to check for consistency, then generates a design document and task list from the spec -- similar to what Beads does. I usually spend a significant amount of time pressure-testing the spec before implementing (often hours to days), and it pays off. Writing a good, consistent spec is essentially the computer equivalent of "writing as a tool of thought" in practice.
Once the spec is tight, implementation tends to follow it closely. Kiro also generates property-based tests (PBTs) using Hypothesis in Python, inspired by Haskell's QuickCheck. These tests sweep the input domain and, when combined with traditional scenario-based unit tests, tend to produce code that adheres closely to the spec. I also add a small instruction "do red/green TDD" (I learned this from Simon Willison) and that one line alone improved the quality of all my tests. Kiro can technically implement the task list itself, but this is where agents come in. With the spec in hand, I use multiple headless CLI agents in tmux (e.g., Kiro CLI, Claude Code) for implementation. The results have been very good. With a solid Kiro spec and task list, agents usually implement everything end-to-end without stopping -- I haven’t found a need for Ralph loops. (agents sometimes tend to stop mid way on Claude plans, but I've never had that happen with Kiro, not sure why, maybe it's the checklist, which includes PBT tests as gates).
didn't have the strongest start, but the Kiro IDE is one of the best spec generators I've used, and it integrates extremely well with agent-driven workflows.
My process has organically evolved towards something similar but less strictly defined:
- I bootstrap AGENTS.md with my basic way of working and occasionally one or two project specific pieces
- I then write a DESIGN.md. How detailed or well specified it is varies from project to project: the other day I wrote a very complete DESIGN.md for a time tracking, invoice management and accounting system I wanted for my freelance biz. Because it was quite complete, the agent almost one-shot the whole thing
- I often also write a TECHNICAL-SPEC.md of some kind. Again how detailed varies.
- Finally I link to those two from the AGENTS. I also usually put in AGENTS that the agent should maintain the docs and keep them in sync with newer decisions I make along the way.
This system works well for me, but it's still very ad hoc and definitely doesn't follow any kind of formally defined spec standard. And I don't think it should, really? IMO, technically strict specs should be in your automated tests not your design docs.
I think many have adopted "spec driven development" in the way you describe.
I found it works very well in once-off scenarios, but the specs often drift from the implementation. Even if you let the model update the spec at the end, the next few work items will make parts of it obsolete.
Maybe that's exactly the goal that "codespeak" is trying to solve, but I'm skeptical this will work well without more formal specifications in the mix.
> specs often drift from the implementation > Maybe that's exactly the goal that "codespeak" is trying to solve
Yes and yes. I think it's an important direction in software engineering. It's something that people were trying to do a couple decades ago but agentic implementation of the spec makes it much more practical.
I have been building this in my free time and it might be relevant to you: https://github.com/jbonatakis/blackbird
I have the same basic workflow as you outlined, then I feed the docs into blackbird, which generates a structured plan with task and sub tasks. Then you can have it execute tasks in dependency order, with options to pause for review after each task or an automated review when all child task for a given parents are complete.
It’s definitely still got some rough edges but it has been working pretty well for me.
AGENTS.md is nice but I still need to remind models that it exists and they should read it and not reinvent the wheel every time.
There should be a setting to include specific files in every prompt/context. I’m using zed and when you fire up an agent / chat it explicitly states that the file(s) are included.
> Models aren't deterministic
Is that really true? I haven’t tried to do my own inference since the first Llama models came out years ago, but I am pretty sure it was deterministic: if you fixed the seed and the input was the same, the output of the inference was always exactly the same.
LLMs are not deterministic:
1.) There is typically a temperature setting (even when not exposed, most major providers have stopped exposing it [esp in the TUIs]).
2.) Then, even with the temperature set to 0, it will be almost deterministic but you'll still observe small variations due to the limited precision of float numbers.
Edit: thanks for the corrections
> but you'll still observe small variations due to the limited precision of float numbers
No. Floating number arithmetic is deterministic. You don't get different answers for the same operations on the same machine just because of limited precision. There are reasons why it can be difficult to make sure that floating point operations agree across machines, but that is more of a (very annoying and difficult to make consistent) configuration thing than determinism.
(In general it is mildly frustrating to me to see software developers treat floating point as some sort of magic and ascribe all sorts of non-deterministic qualities to it. Yes floating point configuration for consistent results across machines can be absurdly annoying and nigh-impossible if you use transcendental functions and different binaries. No this does not mean if your program is giving different results for the same input on the same machine that this is a floating point issue).
In theory parallel execution combined with non-associativity can cause LLM inference to be non-deterministic. In practice that is not the case. LLM forward passes rarely use non-deterministic kernels (and these are usually explicitly marked as such e.g. in PyTorch).
You may be thinking of non-determinism caused by batching where different batch sizes can cause variations in output. This is not strictly speaking non-determinism from the perspective of the LLM, but is effectively non-determinism from the perspective of the end user, because generally the end user has no control over how a request is slotted into a batch.
Limited precision of float numbers is deterministic. But there's whole parallelism and how things are wired together, your generation may end up on a different hardware etc.
And models I work with (claude,gemini etc) have the temperature parameter when you are using API.
You shouldn't be downvoted - LLMs could in theory be deterministic, but they currently are not, due to how models are implemented.
All my self-hosted inference has temperature zero and no randomness.
It is absolutely workable, current inference engines are just lazy and dumb.
(I use a Zobrist hash to track and prune loops.)
Maybe we're entering the non-deterministic applications too. No more mechanical predictable thing.. more like 90% regular and then weird.
Slightly sarcastic but not sure this couldn't become a thing.
How is your 2 step process not susceptible to all the exact same pitfalls you listed above?
> Models aren't deterministic - every time you would try to re-apply you'd likely get different output
So like when you give the same spec to 2 different programmers.
Yes, if you had each programmer rewrite the code from scratch each time you updated the spec.
In reality you give the same programmer an update to the existing spec, and they change the code to implement the difference. Which is exactly what the thing in OP is doing, and exactly what should be done. There's simply no reason to regenerate the result.
The entire thing about determinism is a red herring, because 1) it's not determinism but prompt instability, and 2) prompt instability doesn't matter because of the above. Intelligence (both human and machine) is not a formal domain, your inputs lack formal syntax, and that's fine. For some reason this basic concept creates endless confusion everywhere.
Also like this: https://codeassociates.github.io/conversations-with-claude/c...
Except each time you compile your spec you’re re-writing it from scratch with a different programmer.
I think your objections miss the point. My informal specs to a program are user-focused. I want to dictate what benefits the program will give to the person who is using it, which may include requirements for a transport layer, a philosophy of user interaction, or any number of things. When I know what I want out of a program, I go through the agony of translating that into a spec with database schemas, menu options, specific encryption schemes, etc., then finally I turn that into a formal spec within which whether I use an underscore or a dash somewhere becomes a thing that has to be consistent throughout the document.
You're telling me that I should be doing the agonizing parts in order for the LLM to do the routine part (transforming a description of a program into a formal description of a program.) Your list of things that "make no sense" are exactly the things that I want the LLMs to do. I want to be able to run the same spec again and see the LLM add a feature that I never expected (and wasn't in the last version run from the same spec) or modify tactics to accomplish user goals based on changes in technology or availability of new standards/vendors.
I want to see specs that move away from describing the specific functionality of programs altogether, and more into describing a usefulness or the convenience of a program that doesn't exist. I want to be able to feed the LLM requirements of what I want a program to be able to accomplish, and let the LLM research and implement the how. I only want to have to describe constraints i.e. it must enable me to be able to do A, B, and C, it must prevent X,Y, and Z; I want it to feel free to solve those constraints in the way it sees fit; and when I find myself unsatisfied with the output, I'll deliver it more constraints and ask it to regenerate.
This resonates. I built an AI career consultant (CareerCraft AI) with exactly this philosophy — describe what the user needs to accomplish, not how to implement it.
The spec was essentially: "A user shares their professional background and a target job posting. The AI has a conversation to understand their experience, then generates a tailored resume and cover letter as downloadable PDFs. It remembers their profile within the session so they can generate documents for multiple roles."
That's it. No formal grammar, no structured templates for the conversation flow. The LLM handles the "how" — when to ask clarifying questions, how to reframe experience for different roles, what to emphasize based on the job description.
The interesting finding: natural language specs work remarkably well for applications where the output is also natural language (documents, advice, analysis). The formal spec approach makes more sense when the output needs to be deterministic code.
Try it if you're curious: https://super.myninja.ai/apps/6de082c7-a05f-4fc5-a7d3-ab56cc...
> I want to be able to run the same spec again and see the LLM add a feature that I never expected (and wasn't in the last version run from the same spec) or modify tactics to accomplish user goals based on changes in technology or availability of new standards/vendors.
Be careful what you wish for. This sounds great in theory but in practice it will probably mean a migration path for the users (UX changes, small details changed, cost dynamics and a large etc.)
I tried this recently with what I thought was a simple layout, but probably uncommon for CSS. It took an extremely long back and forth to nail it down. It seemingly had no understanding how to achieve what I wanted. A couple sentences would have been clear to a person. Sometimes LLMs are fantastic and sometimes they are brain dead.
[delete]
It isn't a formal language, look at the goose example:
https://codespeak.dev/blog/greenfield-project-tutorial-20260...
It is a formal "way" aka like using json or xml like tons of people are already doing.
Software products specifications are written in real language, not in first order logic.
Ugh, I just wish there was a deterministic and formal way to tell a computer what I want...
This is actually... pretty cool?
Definitely won't use it for prod ofc but may try it out for a side-project.
It seems that this is more or less:
this actually sounds pretty usable, esp. if someone likes writing. And wherever you want to dive deep, you can delve down into the code and do "microoptimizations" by rolling something on your own (with what seems to be called here "mixed projects").That said, not sure if I need a separate tool for this, tbh. Instead of just having markdown files and telling cause to see the md diff and adjust the code accordingly.
We'd love to hear your feedback! Feel free to come to our discord to ask questions/share experience: https://l.codespeak.dev/discord
I think this is 100% the right direction:
Instead of imperatively letting the agents hammer your codebase into shape through a series of prompts, you declare your intent, observe the outcome and refine the spec.
The agents then serve as a control plane, carrying out the intent.
Very much agree. I like the imperative vs declarative angle you take here. Thank you!
What I found more useful is an extra step. Spec to tests, and then red tests to code and green tests.
LLMs works on both translation steps. But you end up with an healthy amount of tests.
I tagged each tests with the id of the spec so I do get spec to test coverage as well.
Beside standard code coverage given by the tests.
Very much agree on coverage. We're actually doing something in that area: https://codespeak.dev/blog/coverage-20260302
For now, it's only about test coverage of the code, but the spec coverage is coming too.
When you translate spec to tests (if those are traditional unit tests or any automated tests that call the rest of the code), that fixes the API of the code, i.e. the code gets designed implicitly in the test generation step. Is this working well in your experience?
Yes it is passable.
Good enough that I don't review it.
Granted, it is a personal project that I care only to the point that I want it to work. There are no money on the line. Nothing professional.
I believe that part of the secret is that I force CC to run the whole est suites after it change ANY file. Using hooks.
It makes iteration slower because it kinda forces it to go from green to green. Or better from red to less red (since we start in red).
But overall I am definitely happy with the results.
Again, personal projects. Not really professional code.
Another trick that I use.
I force the code to be almost 100% dependency injection-able.
It simplifies a lot writing tests and getting the coverage. And I see the LLM being able to handle it very very well.
This concept is assuming a formalized language would make things easier somehow for an llm. That’s making some big assumptions about the neuro anatomy if llms. This [1] from the other day suggests surprising things about how llms are internally structured; specifically that encoding and decoding are distinct phases with other stuff in between. Suggesting language once trained isn’t that important.
[1] https://news.ycombinator.com/item?id=47322887
We are not trying to make things easier for LLMs. LLMs will be fine. CodeSpeak is built for humans, because we benefit from some structure, knowing how to express what we want, etc.
The problem with formal prompting languages is they assume the bottleneck is ambiguity in the prompt. In my experience building agents, the bottleneck is actually the model's context understanding. Same precise prompt, wildly different results depending on what else is in the context window. Formalizing the prompt doesn't help if the model builds the wrong internal representation of your codebase. That said curious to see where this goes.
Two pieces of advice I keep seeing over & over in these discussions-- 1) start with a fresh/baseline context regularly, and 2) give agents unix-like tools and files which can be interacted with via simple pseudo-English commands such as bash, where they can invoke e.g. "--help" to learn how to use them.
I'm not sure adding a more formal language interface makes sense, as these models are optimized for conversational fluency. It makes more sense to me for them to be given instructions for using more formal interfaces as needed.
The title writer might be doing the project a disservice by using the term "formal" to describe it, given that the project talks a lot about "specs". I mistook it to imply something about formal specification.
My quick understanding is that isn't really trying to utilize any formal specification but is instead trying to more-clearly map the relationship between, say, an individual human-language requirement you have of your application, and the code which implements that requirement.
I've done something similar for queries. Comments:
* Yes, this is a language, no its not a programming language you are used to, but a restricted/embellished natural language that (might) make things easier to express to an LLM, and provides a framework for humans who want to write specifications to get the AI to write code.
* Models aren't deterministic, but they are persistent (never gonna give up!). If you generate tests from your specification as well as code, you can use differential testing to get some measure (although not perfect) of correctness. Never delete the code that was generated before, if you change the spec, have your model fix the existing code rather than generate new code.
* Specifications can actually be analyzed by models to determine if they are fully grounded or not. An ungrounded specification is going to not be a good experience, so ask the model if it thinks your specification is grounded.
* Use something like a build system if you have many specs in your code repository and you need to keep them in sync. Spec changes -> update the tests and code (for example).
We tend to obsess over abstractions, frameworks, and standards, which is a good thing. But we already have BDD and TDD, and now, with english as the new high-level programming language, it is easier than ever to build. Focusing on other critical problem spaces like context/memory is more useful at this point. If the whole purpose of this is token compression, I don't see myself using it.
Agreed. There is definitely a similarity to BDD.
this is really exciting and dovetails really closely with the project I'm working on.
I'm writing a language spec for an LLM runner that has the ability to chain prompts and hooks into workflows.
https://github.com/AlexChesser/ail
I'm writing the tool as proof of the spec. Still very much a pre-alpha phase, but I do have a working POC in that I can specify a series of prompts in my YAML language and execute the chain of commands in a local agent.
One of the "key steps" that I plan on designing is specifically an invocation interceptor. My underlying theory is that we would take whatever random series of prose that our human minds come up with and pass it through a prompt refinement engine:
> Clean up the following prompt in order to convert the user's intent > into a structured prompt optimized for working with an LLM > Be sure to follow appropriate modern standards based on current > prompt engineering reasech. For example, limit the use of persona > assignment in order to reduce hallucinations. > If the user is asking for multiple actions, break the prompt > into appropriate steps (**etc...)
That interceptor would then forward the well structured intent-parsed prompt to the LLM. I could really see a step where we say "take the crap I just said and turn it into CodeSpeak"
What a fantastic tool. I'll definitely do a deep dive into this.
Feels like writing tests before writing code, but with LLMs :)
From what I was able to understand during the interview there, it's not actually a language, more like an orchestrator + pinning of individual generated chunks.
The demo I've briefly seen was very very far from being impressive.
Got rejected, perhaps for some excessive scepticism/overly sharp questions.
My scepticism remains - so far it looks like an orchestrator to me and does not add enough formalism to actually call it a language.
I think that the idea of more formal approach to assisted coding is viable (think: you define data structures and interfaces but don't write function bodies, they are generated, pinned and covered by tests automatically, LLMs can even write TLA+/formal proofs), but I'm kinda sceptical about this particular thing. I think it can be made viable but I have a strong feeling that it won't be hard to reproduce that - I was able to bake something similar in a day with Claude.
I find it weird that this comment is gray but it's the only one in the thread so far that mentions TLA+ which is highly relevant here.
You can basically condense this entire "language" into a set of markdown rules and use it as a skill in your planning pipeline.
And whatever codespeak offers is like a weird VCS wrapper around this. I can already version and diff my skills, plans properly and following that my LLM generated features should be scoped properly and be worked on in their own branches. This imo will just give rise to a reason for people to make huge 8k-10k line changes in a commit.
> And whatever codespeak offers is like a weird VCS wrapper around this.
I'm still getting used to the idea that modern programs are 30 lines of Markdown that get the magic LLM incantation loop just right. Seems like you're in the same boat.
I am trying a similar spec driven development idea in a project I am working on. One big difference is that my specifications are not formalized that much. Tney are in plain language and are read directly by the LLM to convert to code. That seems like the kind of thing the LLM is good at. One other feature of this is that it allows me to nudge the implmentation a little with text in the spec outside of the formal requirements. I view it two ways, as spec-to-code but also as a saved prompt. I haven't spent enough time with it to say how successfuly it is, yet.
Do you save these "prompts" so you can improve, and in turn improve the code. to me Spec Driven Development is more than a spec to generate code, structured or not.
Interesting project, but I think it's solving the wrong bottleneck. The gap between what I want and what the model produces isn't primarily a language problem — it's a knowledge problem. You can write the most precise spec imaginable, but if the model doesn't have domain-specific knowledge about your product's edge cases, undocumented behaviors, or the tribal knowledge your team has accumulated, the output will be confidently wrong regardless of how formally you specified it.
I've been working on this from the other direction — instead of formalizing how you talk to the model, structure the knowledge the model has access to. When you actually measure what proportion of your domain knowledge frontier models can produce on their own (we call this the "esoteric knowledge ratio"), it's often only 40-55% for well-documented open source projects. For proprietary products it's even lower. No amount of spec formalism fixes that gap — you need to get the missing knowledge into context.
Isn't that the point though? In the development loop, you'd diagnose why it's not building what you expect, so you flush out those previous implicit or even subconscious edge cases, undocumented behaviors, and tribal knowledge and codify them into the spec.
It would actually end up being a lot easier to maintain than a bunch of undocumented spaghetti.
This seems like a step backwards. Programming Languages for LLMs need a lot of built in guarantees and restrictions. Code should be dense. I don't really know what to make of this project. This looks like it would make everything way worse.
I've had good success getting LLMs to write complicated stuff in haskell, because at the end of the day I am less worried about a few errant LLM lines of code passing both the type checking and the test suite and causing damage.
It is both amazing and I guess also not surprising that most vibe coding is focused on python and javascript, where my experience has been that the models need so much oversight and handholding that it makes them a simple liability.
The ideal programming language is one where a program is nothing but a set of concise, extremely precise, yet composable specifications that the _compiler_ turns into efficient machine code. I don't think English is that programming language.
The pattern we keep converging on is to treat model calls like a budgeted distributed system, not like a magical API. The expensive failures usually come from retries, fan-out, and verbose context growth rather than from a single bad prompt. Once we started logging token use per task step and putting hard ceilings on planner depth, costs became much more predictable.
This doesn't seem particularly formal. I still remain unconvinced reducing is really going to be valuable. Code obviously is as formal as it gets but as you trend away from that you quickly introduce problems that arise from lack of formality. I could see a world in which we're all just writing tests in the form of something like Gherkin though.
> I could see a world in which we're all just writing tests in the form of something like Gherkin though.
That works great in practice, Gherkin even has a markdown dialect [1].
If you combine it with a tool like aico [2] you can have a really effective development workflow.
[1] https://github.com/cucumber/gherkin/blob/main/MARKDOWN_WITH_...
[2] https://github.com/jurriaan/aico
People seem weirdly eager to talk to LLMs in proto-code instead of fixing the base problem that LLMs are just unreliable interpreters. If your tool needs a new human-friendly DSL to avoid the ambiguity of plain English, maybe what you really want is to be writing actual code or specs with a type system and feedback loop. Any halfway formalism gives a false sense of precision, and you still get blindsided by the same model quirks, just dressed up differently.
> I could see a world in which we're all just writing tests in the form of something like Gherkin though.
Yes, and the implementation... no one actually cares about that. This would be a good outcome in my view. What I see is people letting LLMs "fill in the tests", whereas I'd rather tests be the only thing humans write.
> Yes, and the implementation... no one actually cares about that.
There has been a profession in place for many decades that specifically addresses that...Software Engineering.
While I'm also a bit skeptical, I think some formalism could really simplify everything. The programming world has lots of words that mean close to the same thing (subroutine, method, function, etc. ). Why not choose one and stick to it for interactions with the LLM? It should save plenty of complexity.
Under "Prerequisites"[0] I see: "Get an Anthropic API key".
I presume this is temporary since the project is still in alpha, but I'm curious why this requires use of an API at all and what's special about it that it can't leverage injecting the prompt into a Claude Code or other LLM coding tool session.
[0]: https://codespeak.dev/blog/greenfield-project-tutorial-20260...
Literally the first example on the main page declared as code.py would result in an indentation error :)
One requirement for a programming language to be “good” is that doing this, with sufficient specificity to get all the behavior you want, will be more verbose than the code itself.
Conceptually, this seems a good direction.
The other piece that has always struck me as a huge inefficiency with current usage of LLMs is the hoops they have to jump through to make sense of existing file formats - especially making sense of (or writing) complicated semi-proprietary formats like PDF, DOC(X), PPT(X), etc.
Long-term prediction: for text, we'll move away from these formats and towards alternatives that are designed to be optimal for LLMs to interact with. (This could look like variants of markdown or JSON, but could also be Base64 [0] or something we've not even imagined yet.)
[0] https://dnhkng.github.io/posts/rys/
If LLMs can't deal with those legacy file formats, I don't trust them to be able to deal with anything. The idea that LLMs are so sophisticated that we have a need to dumb down inputs in order to interact with them is self-contradictory.
While I agree, the parent also talks about efficiency. If a different format increases efficiency, that could be reason enough to switch to it, even if understanding doesn’t improve and already was good before.
We already have a language for talking to LLMs: Polish
https://www.zmescience.com/science/news-science/polish-effec...
I'm gonna be honest here, I opened this website excited thinking this was a sort of new paradigm or programming language, and I ended up extremely confused at what this actually is and I still don't understand.
Is it a code generator tool from specs? Ugh. Why not push for the development of the protocol itself then?
A few days ago I released https://github.com/b4rtaz/incrmd , which is similar to Codespeak. The main difference is that the specification is defined at the *project* level. I'm not sure if having the specification at the *file* level is a good choice, because the file structure does not necessarily align with the class structure, etc.
A formal way for a senior to tell AI (clueless junior) to do a senior's job? Once again, who checks and fixes the output code?
Of course an expert would throw it out and design/write it properly so they know it works.
i’ve been doing this for a while, you create an extra file for every code file, sketch the code as you currently understand it (mostly function signatures and comments to fill in details), ask the LLM to help identify discrepancies. i call it “overcoding”.
i guess you can build a cli toolchain for it, but as a technique it’s a bit early to crystallize into a product imo, i fully expect overcoding to be a standard technique in a few years, it’s the only way i’ve been able to keep up with AI-coded files longer than 1500 lines
So is it basically Markdown? The landing does not articulate, unfortunately, what the key contribution is.
I tried looking through some of the spec samples, and it was not clear what the "language" was or that there was any syntax. It just looks like a terse spec.
In my building and research of Simplex, specs designed for LLM consumption don't need a formalized syntax as much as they just need an enforced structure, ideally paired with a linter. An effective spec for LLMs will bridge the gap between natural language and a formal language. It's about reducing ambiguity of intent because of the weaknesses and inconsistencies of natural language and the human operator.
I read through the thing and don't quite understand what this adds that the dozens of LLM coding wrappers don't already do.
You write a markdown spec.
The script takes it and feeds it to an LLM API.
The API generates code.
Okay? Where is this "next-generation programming language" they talk about?
When we understand that AI allows the spec to be in English (or any natural language), we might stop attempting to build "structured english" for spec.
So, instead of making LLMs smarter let’s make everything abstract again? Because everyone wants to learn another tool? Or is this supposed to be something I tell Claude, “Hey make some code to make some code!” I’m struggling to see the benefit of this vs. just telling Claude to save its plan for re-use.
This raises a question --- how well do LLMs understand Loglan?
https://www.loglan.org/
Or Lojban?
https://mw.lojban.org/
Isn't the case study.... too contrived and trivial? The largest code change is 800 lines so it can readily fit in a model's context.
However, there is no case for more complicated, multi-file changes or architecture stuff.
I think I prefer Tracey https://github.com/bearcove/tracey
"Shrink your codebase 5-10x"
"[1] When computing LOC, we strip blank lines and break long lines into many"
I don't think this is the gotcha you think it is...
I imagine this is before and after- not just after.
As in, they aren't just making lines long and removing whitespace (something models love to do when you ask it to remove lines of code)
Yep, you're right, I read this too fast - it's also breaking long lines into many and I read this in reverse. I just imagined how much I could reduce my own LOC by adjusting the print width on my prettier settings..
> The spec is the source of truth
This feels wrong, as the spec doesn't consistently generate the same output.
But upon reflection, "source of truth" already refers to knowledge and intent, not machine code.
> not machine code
Actually, computers, being machines, do equate machine code and source of truth.
The next step will be to formalize all the instructions possible to give to a processor and use that language!
Getting so close to the idea. We will only have Englishscripts and don’t need code anymore. No compiling. No vibe coding. No coding. Https://jperla.com/blog/claude-electron-not-claudevm
this will have to compile to something tho? So there will always be code
Instead of using tabs, it would be much better to show the comparison side by side.
Also, the examples feel forced, as if you use external libraries, you don't have to write your own "Decode RFC 2047"
I want to see an LLM combined with correctness preserving transforms.
So for example, if you refactor a program, make the LLM do anything but keep the logic of the program intact.
Looks like JSON like YAML. It is still English. Was hoping for something like Lojban.
> codespeak login
Instant tab close!
Huh? I don't see a login. All the code examples and set up instructions are available to anyone visiting the page.
It's in the Quickstart article here:
https://codespeak.dev/blog/greenfield-project-tutorial-20260...
it's not a new question if the as-is programming languages are optimal for LLMs: a language for LLM use would have to strongly typed. But that's about it for obvious requirements.
"Coming soon: Turning Code into Specs"
There you have it: Code laundering as a service. I guess we have to avoid Kotlin, too.
I avoid Kotlin as a principal, any language that can't get the type and variable name in the correct order; I avoid them completely.
Yes I'm also one of those LLM skeptics but actually this looks interesting.
This is basically what I talked about maybe a year ago. Glad to see someone is taking it on.
Then of course we are going to ask LLMs to generate specifications in this new language
Exactly as necessary as Kotlin itself.
Good code is the specification.
So, back to a programming language, albeit “simplified.”
Is this more like a programming language, or more like a specification system akin to UML?
Same, saw the code and thought are we trying to generate things like we did with UML. That also promised some English to UML to Code.
What's old is new.
I would just like to point out the fun fact that instead of the brave new MD speak, there is still a `codespeak.json` to configure the build system itself...
...which seems to suggest that the authors themselves don't dogfood their own software. Please tell me that Codespeak was written entirely with Codespeak!
Instead of that json, which is so last year, why not use an agent to create an MD file to setup another agent, that will compile another MD file and feed it to the third agent, that... It is turtles, I mean agents, all the way down!
https://thinkwright.ai/simplex
I think stuff like Langflow and n8n are more likely to be adopted, alongside with some more formal specifications.
We created programming languages to direct programs. Then created LLM's to use English to direct programs. Now we've create programming languages to direct LLM's. What is old is new again!
The tweet I saw a few weeks ago about LLMs enabling building stupid ideas that would have never been built otherwise particularly resonates with this one.
Another great way to shrink your codebase 10x? Rewrite it in APL. If less code means less information, what are we gonna do when missing information was important?
Alas, I thought I invented this.
https://news.ycombinator.com/item?id=47284030
As someone who hates writing (and thus coding) this might be a good tool, but how’s is it different from doing the same in claude? And I only see python, what about other languages, are they also production grade?
The intent of the idea is there, and I agree that there should be more precise syntax instead of colloquial English. However, it's difficult to take CodeSpeak seriously as it looks AI generated and misses key background knowledge.
I'm hoping for a framework that expands upon Behavior Driven Development (BDD) or a similar project-management concept. Here's a promising example that is ripe for an Agentic AI implementation, https://behave.readthedocs.io/en/stable/philosophy/#the-gher...
https://en.wikipedia.org/wiki/Literate_programming
Buddy invented RobotFramework, great job.
I for one can't wait to be a confident CodeSpeak programmer /sarc
Does this make it a 6th generation language?
Sounds like crap.
So, just a markdown file?
THe HN title seems very misleading to me. How is this, in any sense of the word, "formal?" I don't see that particular word used to describe this tool on the web page itself.
The site does describe it as a "programming language," which feels like a novel use of the term to me. The borders around a term like "programming language" are inherently fuzzy, but something like "code generation tool" better describes CodeSpeak IMHO.
Ok we've deformalized the title above.
This is pretty lame. I WANT to write code, something that has a formal definition and express my ideas in THAT, not some adhoc pseudo english an LLM then puts the cowboy hat on and does what the hotness of the week is.
Programming is in the end math, the model is defined and, when done correctly follows common laws.
I cannot read light on black. I don't know, maybe it's a condition, or simply just part of getting old. But my eyes physically hurt, and when I look up from reading a light-on-black screen, even when I looked at only for a short moment, my eyes need seconds to adjust again.
I know dark mode is really popular with the youngens but I regularly have to reach for reader mode for dark web pages, or else I simply cannot stand reading the contents.
Unfortunately, this site does not have an obvious way of reading it black-on-white, short of looking at the HTML source (CTRL+U), which - in fact - I sometimes do.
I find dark mode much easier to read and far less eye strain. I guess it just shows that users should be the ones to set the preference. There are studies on monkeys showing light mode leading to myopia. Although lately I have come to learn there are lots of poorly done studies.
Same for me, has been my whole life. I complain about it all the time. It's well documented that people can read black on light far better and with less eye strain than light on black; yet there seems to be a whole generation of developers determined to force us all to try and read it. Even the media sites like Netflix, Prime, etc. force it. At least Tubi's is somewhat more readable.
Sometimes a site will include a button or other UI element to choose a light theme but I find it odd that so many sites which are presumed to be designed by technically competent people, completely ignore accessibility concerns.
The most common mistake I see (on this website at least) is the assumption that one's programming competence is equal to their competence in other things.
Do you sit in a bright room? Right now, during the night, I see your comment like this: https://i.imgur.com/c7fmBns.png, but during the day when the room is bright, I also see everything with light themes/background colors, otherwise it is indeed hard to see properly.
Unfortunately, in my case, it's not a matter of lighting conditions.
When it’s dark (I can’t stand bright rooms at night), I lower the brightness of my screens instead of going for dark mode. I have astigmatism and any tiny bright spot is hard to focus on. It’s easier when the bright part is large and the dark parts are small (black on white is best).
Yeah I am the same.
Definitely in the minority on this one as dark mode is really popular these days.
Really hard to describe how it is literally physically painful for my eyes. Very strange.
I feel your pain. For me it is the opposite: I get headaches from bright backgrounds because I'm light-sensitive.
I'll stick to Polish
We built LLMs so that you can express your ideas in English and no longer need to code.
Also, English is really too verbose and imprecise for coding, so we developed a programming language you can use instead.
Now, this gives me a business idea: are you tired of using CodeSpeak? Just explain your idea to our product in English and we'll generate CodeSpeak for you.
I'm sure that this time the language will be simple and English-like enough that execs can use it directly, similarly to COBOL and SQL.
The idea is this would be a kind of IL for natural language queries. Then the main LLM isn't dependent on quirks of English.
No joke. I'm 100% sure that if it's successful, we will find CC's skill to write specs for CodeSpeak.
Yeah. It's hard to express and understand nested structures in a natural language yet they are easy in high-level programming languages. E.g. "the dog of first son of my neighbour" vs "me.neighbour.sons[0].dog", "sunny and hot, or rainy but not cold" vs "(sunny && hot) || (rainy && !cold)".
In the past maths were expressed using natural language, the math language exists because natural language isn't clear enough.
Did you mean AbstractNeighborDispatcherFactory?
Damn, I am the product A-GAIN?
COBOL?
sssssh! if this catches on we can keep our jobs! (j/k, mostly)
That seems like it could lead to imprecise outcomes, so I've started a business that defines a spec to output the correct English to input to your product.
relevant Dijkstra https://www.cs.utexas.edu/~EWD/transcriptions/EWD06xx/EWD667...
"In order to make machines significantly easier to use, it has been proposed (to try) to design machines that we could instruct in our native tongues. this would, admittedly, make the machines much more complicated, but, it was argued, by letting the machine carry a larger share of the burden, life would become easier for us. It sounds sensible provided you blame the obligation to use a formal symbolism as the source of your difficulties. But is the argument valid? I doubt."
I'm really glad random HN commenters know it better than someone that built a language that has been used in thousands of products.
Standard appeal to accomplishment, past success does not guarantee future success... especially on this joke comment
Kotlin is generally considered a bit of a dud in the modern programming language space.
I reckon this comment from 6 years ago predicts Kotlin's fate https://news.ycombinator.com/item?id=24197817 I consider it prophetic.
My gut says Kotlin is great for individual developer experience. But I never heard or saw credible reports on the Total Cost of Ownership, e.g., Kotlin engineers hiring, swapping out on a team.
It's a blessing when you're in the native Android / React Native / Flutter space.
Even Noble Prize owners made huge mistakes after the prize.
Somewhere Dijkstra is laughing his ass off.
The next step is to use AI to edit the spec... /s
Its early for April fools
"Don't be snarky."
"Please don't post shallow dismissals, especially of other people's work. A good critical comment teaches us something."
https://news.ycombinator.com/newsguidelines.html