Part I. SCM as a database for the code
Although L. Torvalds initially described git as the “information manager from hell”, it was a very innovative revision control system. That was 20 years ago, and the architecture shows its age: its core limitations cause heaps of accidental complexity. There is a jungle of commands, options and syntaxes with overlapping concerns, combinatorial maneuvers between worktree, stash, staging, commits, local and remote branches, and finally the abstraction of an eternal, cryptographically protected record that we all rewrite daily to do the most basic stuff.
git is a filesystem, it says so on the box and it stores
blobs. That makes it rather blunt when it comes to merging the
changes. That is the reason behind lots of git ceremony and
limitations. With LLMs, generating code is cheap; sorting it all
out is more work though. Here, git distracts and limits developers.
Beagle is a database for your code. It stores a coarse AST (abstract syntax tree), so it can do merges deterministically, non-intrusively, and fully aware of the syntax. It can address and describe changes in terms of symbols (functions, classes), not just files and hunks. Blob-level versioning remains as a fallback only. For Beagle, there is a file tree and AST trees of individual files with named nodes. “How did this function look a month ago?” “What new calls did my function accrete in a week?” This model allows us to build much sharper tools. Remember that the git codebase is 310 KLoC of C code and about the same amount of sh/perl/tcl. That is 15x more than LevelDB, a third of PostgreSQL, about three SQLites, and generally in the ballpark of a general-purpose database. But the underlying model greatly limits what we can query git about. Beagle’s model is less limiting.
Technically, Beagle is able to version and merge AST trees as CRDTs. That differentiates it from other stack-of-patches VCSes, such as Pijul, Darcs and others. Fundamentally, CRDTs add some metadata to avoid lots of guessing later on. (Diff is algorithmic guessing, 3-way merge too.) Data units get ids and/or timestamps, become addressable, so merges become deterministic and non-intrusive. One can attach and detach branches by a checkbox. That does not ensure semantic correctness, but at least one can iterate on it faster.
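To make the "ids and timestamps" point concrete, here is a minimal sketch of a last-writer-wins map, one of the simplest CRDTs. This is not Beagle's actual data format: the node ids, the `(time, origin)` stamps and the `merge` helper are all assumptions made for illustration. The only point is that once every unit carries an id and a stamp, a merge becomes a pure function that gives the same result in any order.

```python
from typing import Dict, Tuple

# A stamp is (logical_time, origin); tuples compare lexicographically,
# so ties on time are broken by the origin name. Hypothetical layout.
Stamp = Tuple[int, str]
# Node id -> (stamp, text of that node)
State = Dict[str, Tuple[Stamp, str]]


def merge(a: State, b: State) -> State:
    """Merge two replicas: for every node id keep the entry with the larger stamp."""
    out = dict(a)
    for node_id, entry in b.items():
        if node_id not in out or entry[0] > out[node_id][0]:
            out[node_id] = entry
    return out


base: State = {"fn:getX": ((1, "alice"), "int getX() { return x; }")}
alice: State = {**base, "fn:getY": ((2, "alice"), "int getY() { return y; }")}
bob: State = {**base, "fn:getX": ((3, "bob"), "int getX() const { return x; }")}

# The order of merging does not matter, and merging twice changes nothing.
assert merge(alice, bob) == merge(bob, alice)
assert merge(merge(alice, bob), bob) == merge(alice, bob)
```

No diffing or guessing happens here: the metadata alone decides the outcome, which is what makes the merge deterministic.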
CRDT’s non-intrusive merges give lots of freedom in slicing and dicing larger repos into branches, overlays, submodules, and so on. If you can merge, you can split. Want to keep a subdir both here and as a separate project? No problem. Want to keep LLM .md files in a separate overlay, only make them visible on request? Also easy. CRDT gives the freedom in splitting and joining along all the axes.
The classic approach to complexity: minimize primitives, but make them composable. Fundamentally, Beagle moves changes between worktrees and the repo, with 2x2=4 potential data maneuvers (repo to repo, repo to worktree, worktree to repo, within worktree). That is way lower than git’s 6x6=36 (local and remote branches, commits, stash, staging and worktree). At the plumbing layer, the 3 maneuvers involving Beagle are served by commands PUT, GET, POST respectively (yes, HTTP vocabulary on the command line).
The smallest unit of repo change is a waypoint, which is a nameless Ctrl+S type of event. The most basic workflow is just a trail of Ctrl+S/Ctrl+Z events creating a set of AST changes. Post factum, changes can be selected, recombined and declared a new commit or branch, or added to an existing one. Any porcelain command is a mosaic of GET/POST/PUT over some subset/union of changes. The difference between snapshot, branch, staging, stash or overlay is all but non-existent. These are named groupings of changes. An overlay is a “permanent editable changeset” that can be attached/detached to/from a branch. A branch is a short-lived stack of changes forked off the head. Technically, either of them is just a set of delta files (SSTs) producing some state of the worktree. git-style and Beagle-style porcelain can be used in parallel.
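The "named groupings of changes" idea can be sketched with plain sets. The change ids and box names below are hypothetical, and real Beagle groupings are sets of delta files rather than Python sets; the sketch only shows why attaching or detaching a grouping is cheap once merges are order-independent.

```python
from typing import Dict, FrozenSet, Set

# Hypothetical change ids; in Beagle these would be waypoints / deltas.
Change = str

# Every "box" (branch, stash, overlay, snapshot) is a named set of changes.
boxes: Dict[str, Set[Change]] = {
    "head": {"c1", "c2"},
    "featureA": {"c1", "c2", "c3"},
    "mystash": {"c4"},
    "ai-overlay": {"c5", "c6"},
}


def blend(*names: str) -> FrozenSet[Change]:
    """The worktree state is determined by the union of the selected boxes."""
    selected: Set[Change] = set()
    for name in names:
        selected |= boxes[name]
    return frozenset(selected)


with_overlay = blend("featureA", "ai-overlay")
without_overlay = blend("featureA")

# Detaching an overlay removes exactly its changes and nothing else.
assert with_overlay - without_overlay == {"c5", "c6"}

# Declaring a new branch post factum is just naming a selection of changes.
boxes["featureA+stash"] = set(blend("featureA", "mystash"))
```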
The next section talks about Beagle’s porcelain project/ repo/ branch/ overlay model which is slightly different from git’s: repos are closer to git repos (but one level up), branches are like git branches but lighter, and overlays have no parallel in git at all. CRDT merges are deterministic and non-intrusive, so one can merge left and right, using worktree as a palette for blending. In fact, the entire porcelain story is a way to sort changes into orderly boxes of different colors (repos, branches, overlays, snapshots) with different labels on them.
The section after that talks about Beagle’s core/plumbing commands: GET,
POST, PUT and DELETE.
Skip the next two sections if you want to see the resulting UX first. Long story short: mainly the same four commands plus URI-based syntax for everything.
How to make a command/ referral language flexible enough to express all the use cases by composing a minimal number of plain intuitive primitives? This problem is essentially a language problem.
With respect to addressing, Beagle bets on URIs. What worked for the World Wide Web in all its vastness should also work for intra/inter repo referencing.
Encouraged by that idea, Beagle sets the scope of the system to global. One key feature of git was to only version an entire project as a whole. Let’s think: what can we do to version an entire working system, all sources and configs, so each repo is a small GitHub hosting a number of projects?
If we want to limit ourselves to 4 basic kinds of maneuvers, those are:
1. worktree to repo,
2. repo to worktree,
3. within the worktree,
4. repo to repo.
We assume the current worktree is linked to one fixed place in the repo. Things look a bit too primitive so far. Then, we chalk the repo into squares:
repos, named like main.team.company.com or release.product.entity.org;
projects, named like @gritzko/librdx (like a GitHub path).
So a full URI is like http://main.replicated.live/@gritzko/librdx
Here the maneuver #4 gets subdivided into submaneuvers, the most frequent case being changeset exchange between repos. Note that repos are not scoped to a project. When we create a repo, we “fork the world”. That mostly makes sense because projects form their own dependency graphs anyway, so version alpha of project A needs version beta of project B and so on. Once we create a repo, we may put in all the relevant code. With syntax-aware CRDT merge, we can be a bit bolder in forking things, as we retain enough metadata to ease merges.
On top of that, the mapping between file system paths and projects is
not 1:1. First of all, one project can have several worktrees, that is
normal. Second, one worktree can contain several blended branches or
projects. Merging the branches is nothing special, let’s talk about
the other case. Suppose we want to split one project into the base and
its overlays. For example, prompts, plans and TODOs live in the same
dirs in the worktree, but belong to a different overlay project in the
repo, @gritzko/librdx vs @gritzko/librdx.ai. We can work with the
source, we can add the AI work docs, or we can deal with prompts and
logs separately from sources.
The last caveat for those familiar with git (all of us) is branches.
Apart from the head, a project can have multiple marked branches,
which are supposed to merge in the near future. The distinction here is that
branches are scoped to a project/repo, and have no public identity.
When each developer teams up with AI, cheaper transient branching is
necessary, locally and within a team. Beagle branches are somewhat
lighter than git branches. While a branch is essentially a sticky note
on a hash, CRDT merges are deterministic and non-intrusive, so merging
(blending) branches involves much less work and ceremony than merging
git branches.
GET POST PUT DELETE
Back to the original question, let’s see whether a URI-based referencing
language and 4 HTTP verbs are sufficient to express the operations we want.
GET, POST, PUT and DELETE correspond to maneuvers #2, #1, #4, #4 resp.
Maneuver #3 is cp, rm, vim, etc.
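The verb assignment above fits in a tiny lookup table. The sketch below is only a restatement of the text; the `Place` enum and the `VERBS` table are names invented for illustration.

```python
from enum import Enum
from typing import Dict, Tuple


class Place(Enum):
    WORKTREE = "worktree"
    REPO = "repo"


# (source, destination) -> plumbing verbs, following the text:
# POST is worktree to repo, GET is repo to worktree,
# PUT and DELETE act within/between repos, and moves inside
# the worktree need no Beagle verb at all (cp, rm, vim, ...).
VERBS: Dict[Tuple[Place, Place], Tuple[str, ...]] = {
    (Place.WORKTREE, Place.REPO): ("POST",),
    (Place.REPO, Place.WORKTREE): ("GET",),
    (Place.WORKTREE, Place.WORKTREE): (),
    (Place.REPO, Place.REPO): ("PUT", "DELETE"),
}

# Two places give 2x2 = 4 maneuvers, three of which involve Beagle.
assert len(VERBS) == 4
assert sum(1 for verbs in VERBS.values() if verbs) == 3
```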
GET http://repo.team.entity.org/project?branchA   simple checkout of a particular branch version (may need to clone first);
GET //repo2                                       switching the repo;
GET /project/dir/file.txt                         checkout one file;
POST ./file.txt                                   stage one file (it gets imported into the repo, but the branch does not move yet);
DELETE somefile.txt                               delete;
PUT ./file.txt?branchB                            merge in file changes from other branch;
GET ?branchB                                      switch the branch;
GET ?timestamp-origin                             checkout a version by its timestamp;
GET ?4d2130                                       checkout a version by its hash;
GET ?branchA#has(x)                               list all uses of symbol x in branchA;
POST /project?branchA                             commit all changes to a branch;
PUT //repo2?branchC                               merge a branch of another repo;
POST ?stash; GET ?branchA                         stash the changes;
POST ?branchA                                     commit changes (import, move the branch);
GET ?branchA#has(int,getX)                        from the branch, list all AST* nodes that have children int and getX (likely declaration and definition of int getX());
GET //repo/project/dir#has(int,getX)              same but fancier;
PUT http://remote.repo.team.entity.org            big time pull;
In fact, most everyday commands would break down into several
GET, POST, PUT, DELETE calls as, for example, refreshing
the work tree also requires a temporary stash of worktree changes
and their merge back into the refreshed version. Similarly, a push
to a remote repo is first a POST to a local copy and then a PUT
to a remote server.
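Since the references in the commands above follow the generic URI grammar, a stock URI parser is enough to take them apart. The sketch below uses Python's `urllib.parse.urlsplit`; the `Ref` record and the reading of its fields (host as repo, path as project or file, query as branches, fragment as a symbol query) are assumptions inferred from the examples, not a specification of Beagle's parser.

```python
from dataclasses import dataclass
from typing import List
from urllib.parse import urlsplit


@dataclass
class Ref:
    repo: str            # host part, e.g. "repo.team.entity.org"; "" = current repo
    path: str            # project, directory or file; "" = current project
    branches: List[str]  # query part, split on "&"; [] = current branch
    query: str           # fragment part, e.g. "has(int,getX)"; "" = none


def parse_ref(uri: str) -> Ref:
    """Split a Beagle-style reference using the generic URI grammar."""
    parts = urlsplit(uri)
    branches = parts.query.split("&") if parts.query else []
    return Ref(parts.netloc, parts.path, branches, parts.fragment)


assert parse_ref("http://repo.team.entity.org/project?branchA") == Ref(
    "repo.team.entity.org", "/project", ["branchA"], "")
assert parse_ref("//repo2") == Ref("repo2", "", [], "")
assert parse_ref("./file.txt?branchB") == Ref("", "./file.txt", ["branchB"], "")
assert parse_ref("?branchA#has(int,getX)") == Ref("", "", ["branchA"], "has(int,getX)")
```

Every missing component simply means "the current one", which is why short forms like `?branchB` or `//repo2` can stay short.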
While it is handy that the plumbing layer of CLI is virtually identical to the HTTP interface, for user convenience we need the “porcelain” layer doing all the everyday combos in one go.
The mission of the porcelain command layer is to let users rely on the power of the technology while keeping them safe and sane.
Both plumbing and porcelain layers turn out to be quite compact so far, and most of the nuance is coded into URIs while CLI verbs only define the general maneuver. One tradeoff here is that the user must have some intuition of URI syntax. LLMs certainly have it, so no worry if you don’t.
Code is hypertext, IDE is a browser. Beagle is your curl/wget, a simple reliable everyday tool.
Same as plumbing, porcelain commands implement three maneuvers:
get    data from repo to worktree,
post   data from worktree to repo,
put    moves data laterally in a repo.
There are some shortcuts for combos, but most of the work is get, post, put. The most straightforward linear workflow looks like:
be get //repo/project   clone/checkout a worktree
be come ?branch         fork off a branch (combo of be post + be get ?branch)
be post                 commit/stage all branch changes
be put                  merge in the head (or be get ?head ?branch ... be post, a subtly more delicate way to achieve the same result)
be post ?head           merge into the head
Mixing branches is done by the same get verb but with multiple
arguments. Use the worktree as a palette where you mix and blend
colors. Once satisfied, lay the paint on the canvas (post it back
to the repo).
be get ?branchA ?branchB ?head
be post ?branchABH
CRDT merge never fails, technically. That does not guarantee that your worktree would build or run correctly. Semantics is entirely your (or your LLM’s) responsibility. Beagle lets you merge, undo and juggle changes quickly. That is the best thing an SCM can do.
There are aliases/combos for typical cases, e.g.
be come ?branch                make the worktree version into a branch
be diff                        diff to the head (default, 3way)
be lay                         make a waypoint commit
be mark "Comment" "Story..."   make a “classic” verbose commit
be moan                        rollback one post
be rate                        mark the current commit
be fit                         merge into the head (be post ?head)
be                             overview of the current state (more than status)
Some shells treat ? as a special symbol, so we may skip it most of
the time. There is a risk of a URI ?query being confused with a file
name and other things, so in this doc ? is never skipped. Still,
be get ?featureA ?tweakB should be OK (most of the time).
Beagle is balanced differently than git. There is one Beagle
repo per system, Beagle repos are between git branches and
git repos, while Beagle branches are lighter than git branches (may
see them as patch stacks). Approximate command equivalents:
git init dir/           be post dir/
git stash push          there is no difference between stash and any other commit, so be post ?mystash is enough
git add a.txt b.txt     same, be post a.txt b.txt
git clone http://uri    be get http://repo.team.entity.org
git push origin a:b     be post http://repo (repo names are FQDNs)
git pull origin b:a     be get http://repo ? && be post, where ? is the expression for the current worktree’s repo/branch formula
git merge xxx           be get ?branchA ?branchB
git status              be
Beagle (will) implement combos for key git commands.
We started with a claim that Beagle is a database, not a filesystem. It stores a basic AST tree of the source code, which allows for basic code manipulation and search. That is a great opportunity to minimize busywork both for users and LLMs. That is especially valuable when digging through code written by somebody else (which is what happens most of the time, as individual contributors rely on LLMs more and more). Here are some examples of less trivial Beagle commands.
be get /project /project.ai     blend project and its prompt overlay (technically, a separate project)
be get ?head ?branch            blend head and branch (no repo changes)
be put ?branch#DoThing          cherry pick a symbol from a branch (will extract a patch based on the AST* tree)
be get ?branch#DoThing          same, but no commit, worktree only
be put ./file.txt?branch        cherry pick a file from a branch
be get ./file.txt?branch        get a file from a branch (no commit)
be get ./file.txt?branch#Some   cherry pick a symbol in a file
be put ?featureA&featureB       merge in two branches
be post ?newbranch              fork
be post //newrepo               big time fork
be diff ?head#SomeClass         find any changes to SomeClass since head (prints out patches)
be diff ./file.txt?v1.2         find all changes to file.txt since v1.2
be diff #has(DoThing,int)       diff int DoThing() specifically
be get ?#todo(asan)             find things to sanitize, any branch
In fact, the semantic load on the verbs of be CLI is to give
the direction data moves in. We may also use a convention with
no verbs at all: be uri_dest uri_src1 uri_src2...
That way, be - ?branchA //repoB is a merge into a working
tree, while be //release ?head ?tweaks is a merge into the
release repo head bypassing the working tree (reckless).
Overall, verbless use allows non-standard/advanced use patterns.
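The verbless convention is easy to picture as argument handling: the first argument is the destination and the rest are sources. The sketch below assumes, following the example in the text, that `-` stands for the working tree; the function names are hypothetical and this is not the actual `be` argument parser.

```python
from typing import List, Tuple

WORKTREE = "-"  # the text uses "-" for the working tree in verbless form


def split_verbless(args: List[str]) -> Tuple[str, List[str]]:
    """Interpret `be uri_dest uri_src1 uri_src2...`: the first argument is
    the destination, the remaining ones are sources blended into it."""
    if len(args) < 2:
        raise ValueError("need a destination and at least one source")
    return args[0], args[1:]


def bypasses_worktree(args: List[str]) -> bool:
    """True when the data moves repo to repo without touching the worktree."""
    dest, sources = split_verbless(args)
    return dest != WORKTREE and WORKTREE not in sources


# be - ?branchA //repoB : merge into the working tree
assert split_verbless(["-", "?branchA", "//repoB"]) == ("-", ["?branchA", "//repoB"])
# be //release ?head ?tweaks : merge into the release repo head (reckless)
assert bypasses_worktree(["//release", "?head", "?tweaks"])
```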
Beagle’s commit model is supposed to resolve git’s common pain
points and ossified workarounds. git’s ideal commit model sees
a commit as something eternal; an unbroken Merkle chain of
commits goes back to version 0, blockchain-like. But real life
is messy, so we have a number of workarounds to avoid enshrining
everyday hacks in project’s “blockchain”. Those are rebase,
squash or (my favorite, -m fix) learning to live with disorderly
histories. History rewriting in git is an advanced and extensive
topic, a yardstick of developer’s expertise. As it often happens,
workarounds may need workarounds of their own, and so on.
Beagle unifies in-repo “buckets”: staging or stashing is done by
the same kind of a data container, no different from commit,
branch, or tag, except for the labelling. Beagle makes unnamed
inter-commit states (waypoints) shareable, and all commits in
general aggregatable, so rebasing and squashing become part of
the vanilla model, not an override. CRDTs ease that a lot.
The idea of Beagle commits is to be “undo-redo, but persistent”.
If Ctrl+S triggers a commit, there is nothing wrong about it.
On the technical side, Beagle’s branches are very much like git’s
tags, just labels for hashes pointing at system states. Here,
things do not differ from git that much. The head branch is the
main public version. get ?head or get ?feature
switches the worktree to a different branch.
Beagle’s Merkle structure is aligned with its LSM structure. A project’s state is technically a stack of SST files in a repo. Each (newer, smaller) file references the hash of the previous (older, larger) file. When pulling changes from another replica, we can verify that this Tower of Hanoi is mostly unchanged, except for some smaller files on the top that are easy to inspect. A full chain-of-commits history is inspectable, in theory, if all the historical commit files are preserved somewhere (likely S3).
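The "each file references the hash of the previous one" structure can be sketched as a hash chain over a stack. The entry layout below (a previous-hash field plus a payload) is an assumption made for illustration and not Beagle's on-disk SST format; SHA-256 is used because the text mentions sha256 hashes.

```python
import hashlib
from typing import List, Optional, Tuple

# One stack entry: (hash of the older file below, or None for the base, payload).
Entry = Tuple[Optional[str], bytes]


def file_hash(entry: Entry) -> str:
    prev, payload = entry
    h = hashlib.sha256()
    h.update((prev or "").encode())
    h.update(b"\n")
    h.update(payload)
    return h.hexdigest()


def push(stack: List[Entry], payload: bytes) -> None:
    """Add a newer file on top that references the file below it."""
    prev = file_hash(stack[-1]) if stack else None
    stack.append((prev, payload))


def verify(stack: List[Entry]) -> bool:
    """Check that every file references the hash of the one below it."""
    for below, above in zip(stack, stack[1:]):
        if above[0] != file_hash(below):
            return False
    return not stack or stack[0][0] is None


def shared_prefix(mine: List[Entry], theirs: List[Entry]) -> int:
    """Number of bottom files that are identical in both stacks."""
    n = 0
    for a, b in zip(mine, theirs):
        if file_hash(a) != file_hash(b):
            break
        n += 1
    return n


mine: List[Entry] = []
push(mine, b"large old base")
push(mine, b"delta 1")
theirs = list(mine)
push(theirs, b"delta 2 from a peer")

assert verify(theirs)
# Only the files above the shared prefix need inspection after a pull.
assert shared_prefix(mine, theirs) == 2
```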
Waypoint commits are write transactions with no tag attached. One can address them by time or hash, but their replication to other replicas is not guaranteed. Those are of local interest and might be compacted into larger files and garbage collected. That is the standard LSM way of things. Important commits are marked with “sticky notes” (branches and tags). Those are preserved.
Finally, a CHANGELOG document lists all the regular commits
and their attributes: times, dates, comments, authors, hashes,
signatures. These are produced by be mark (changelog insert +
be post combo). The mission of a changelog is to explain the
rationale behind the changes, as the changes themselves can be
cheaply calculated. Then, changelog changes would serve as
commit descriptions.
Beagle’s model is not exactly real-time, but more like “continuous”, especially if compared to git’s commit chains. Want to send the current uncommitted version to CI? Go ahead.
One performance bonus of this architecture is that old deleted data is not tied in the repo forever. It gradually fades away, unless intentionally preserved. No replica is required to maintain full history.
All versions are recoverable (if files can still be found), so
be get ?26219b4L5j would recover worktree to a historical
version by a timestamp (base64 coded in this case) or a
hash(let), e.g. be get ?4d2130 (hex sha256 prefix) or be get
?4d213077e2bdd7d83e101a82ed070934cd8e2af6d8ded3dc64905736f8a820cb.
The system will do its best to interpret informal inputs, like
be get ?head,15:50 or be get ?head,"skiplist" based on the
CHANGELOG document and the existing waypoint commits.
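Resolving a short hash(let) such as `?4d2130` amounts to a unique-prefix lookup over the known version hashes. A minimal sketch follows; the first hash is the one quoted in the text, the other two are made-up placeholders, and the `resolve_prefix` helper is hypothetical.

```python
from typing import Iterable


def resolve_prefix(prefix: str, known_hashes: Iterable[str]) -> str:
    """Return the single known hash that starts with `prefix`, else raise."""
    wanted = prefix.lower()
    matches = [h for h in known_hashes if h.startswith(wanted)]
    if not matches:
        raise KeyError(f"no version matches {prefix!r}")
    if len(matches) > 1:
        raise KeyError(f"{prefix!r} is ambiguous: {len(matches)} versions match")
    return matches[0]


versions = [
    "4d213077e2bdd7d83e101a82ed070934cd8e2af6d8ded3dc64905736f8a820cb",
    "4d99" + "0" * 60,  # made-up placeholder
    "a1" + "b" * 62,    # made-up placeholder
]

assert resolve_prefix("4d2130", versions) == versions[0]
assert resolve_prefix("A1", versions) == versions[2]
```

A prefix like `4d` would be rejected as ambiguous here, which is the safe behavior for a command that rewrites the worktree.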
Overall, Beagle reworks the model for the LLM age commit volume and frequency, mainly by borrowing tricks from the most scalable databases.
What Beagle internally processes is not exactly AST but BASON, a
binary JSON format (a budget variant of RDX). Beagle employs
codecs to import and export files into/from BASON. Hence, most
queries have to rely on generic document tree structure.
The exact codec machinery may vary, e.g. a *.c file may be
im/exported with: general text codec, tree-sitter based codec,
clang AST based codec or “git mode” fallback. Changing the codec
resets the file’s history. Apart from the tree structure per se,
codecs may tag nodes (the bit budget is rather tight there).
That way, queries may distinguish a function from a class,
invocation from declaration, and so on.
Based on that rather generic information, we can have 80/20 of
your typical code navigation: callers/callees, definitions,
todos, and so on. That is way more accurate than grep (how
do you grep for a function body?). Still, this may fall short
of full IDE capabilities. For an inquiring agent, that might be
just right though.
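A query like `has(int,getX)` from the earlier examples needs nothing more than a generic tree with named, optionally tagged nodes. The sketch below is a toy model: the `Node` class, the tag names and the exact semantics of `has` (all listed names appear among a node's direct children) are assumptions inferred from the one-line descriptions in this text, not Beagle's BASON model.

```python
from dataclasses import dataclass, field
from typing import Iterator, List


@dataclass
class Node:
    name: str                      # token text or symbol name
    tag: str = ""                  # optional codec tag, e.g. "fn", "call"
    children: List["Node"] = field(default_factory=list)


def walk(node: Node) -> Iterator[Node]:
    """Depth-first traversal of the whole tree."""
    yield node
    for child in node.children:
        yield from walk(child)


def has(root: Node, *names: str) -> List[Node]:
    """Nodes whose direct children include every one of `names`."""
    return [n for n in walk(root)
            if set(names) <= {c.name for c in n.children}]


def tagged(root: Node, tag: str, *names: str) -> List[Node]:
    """Same as `has`, restricted to nodes carrying a given codec tag."""
    return [n for n in has(root, *names) if n.tag == tag]


# A toy tree for:  int getLen() { ... }   and a call site:  n = getLen();
tree = Node("file.c", children=[
    Node("", tag="fn", children=[Node("int"), Node("getLen"), Node("{}")]),
    Node("", tag="call", children=[Node("n"), Node("="), Node("getLen")]),
])

assert len(has(tree, "int", "getLen")) == 1   # the definition only
assert len(has(tree, "getLen")) == 2          # definition and call site
assert [n.tag for n in tagged(tree, "fn", "getLen")] == ["fn"]
```

Without tags the query is purely structural; with tags it can tell a definition from a use, which is the distinction the codecs are meant to provide.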
be grep "search"      trigram-accelerated substring search: the trigram index narrows candidates to ~1/4096 of files before reading any content, making search nearly instant even for large repos
mdp(worktree)         grep for markdown paragraphs (not lines)
has(int,getLen)       find nodes having children int and getLen (e.g. a typical C function definition)
fn(int,getLen)        find specifically tagged function definitions
use(getLen)           find uses of a symbol
funcs(use(getLen))    find functions using a symbol
files(use(getLen))    find files using a symbol
todo(fuzz)            search for TODOs mentioning fuzzing
Query notation is RDX, although that hardly matters as it is generic enough. Each query produces a set of document elements (AST* nodes) that a command can be scoped to (diff, get, post, etc). So, for example, we can change the signature of a function and commit specifically those hunks by a one-liner.
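The trigram trick behind the substring search can be shown in a few lines. This is a generic in-memory sketch of the technique, not Beagle's index: the function names are hypothetical, and the ~1/4096 figure quoted in the list is not something this toy reproduces.

```python
from collections import defaultdict
from typing import Dict, List, Set


def trigrams(text: str) -> Set[str]:
    """All 3-character substrings of the text."""
    return {text[i:i + 3] for i in range(len(text) - 2)}


def build_index(files: Dict[str, str]) -> Dict[str, Set[str]]:
    """Map every trigram to the set of file names that contain it."""
    index: Dict[str, Set[str]] = defaultdict(set)
    for name, content in files.items():
        for t in trigrams(content):
            index[t].add(name)
    return index


def grep(files: Dict[str, str], index: Dict[str, Set[str]], needle: str) -> List[str]:
    """Substring search: narrow by trigrams first, then confirm by reading."""
    candidates = set(files)
    for t in trigrams(needle):  # empty for needles shorter than 3 chars
        candidates &= index.get(t, set())
    return sorted(name for name in candidates if needle in files[name])


files = {
    "a.c": "int getLen(void);",
    "b.c": "size_t length;",
    "c.c": "return getLen();",
}
index = build_index(files)

assert grep(files, index, "getLen") == ["a.c", "c.c"]
assert grep(files, index, "length") == ["b.c"]
```

Only the files that survive the index intersection are actually read, which is where the speedup comes from on large repos.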
If your next question is how to make this work efficiently, wait for Part III.
Acknowledgements. A.Borzilov, N. Prokopov (aka tonsky), J.Syrowiecki contributed feedback and ideas for this draft.
Part I. SCM as a database for the code
Part III. Inner workings of CRDT revision control.
Part IV. Experiments.
Part V. The Vision.