Part I. SCM as a database for the code
The outer interface of a revision control system is often complex. In particular, git’s CLI is easy to pick on because of the way it grew unconstrained into a jungle of commands, options and syntaxes with overlapping concerns. Git UI ceremony distracts and limits velocity to a very noticeable degree.
Fundamentally, SCM moves changes between worktrees and the
repo, but git’s multilevel system makes things look rather
complex. With k types of buckets, we have k*k kinds of
bucket-to-bucket moves. With remote branches different from
local branches, staging and stash different from commits, plus
the worktree, we get 6x6=36 kinds of potential data maneuvers.
Ideally, this should be 2x2=4, worktree and repo. Is that
realistic? In short, yes.
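The arithmetic above can be checked in a few lines of Python. The bucket names are one reading of the paragraph (remote and local branches, staging, stash, commits, worktree), so treat them as illustrative rather than as an official taxonomy.

```python
from itertools import product

# Bucket names are illustrative, taken from the paragraph above.
git_buckets = ["remote branch", "local branch", "staging",
               "stash", "commit", "worktree"]
ideal_buckets = ["worktree", "repo"]

def moves(buckets):
    """All ordered (source, destination) pairs, same-bucket moves included."""
    return list(product(buckets, repeat=2))

print(len(moves(git_buckets)))    # 36
print(len(moves(ideal_buckets)))  # 4
```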
We can also look at this from the other side: what useful functions do we get at the cost of that complexity? The complexity is far from trivial: the git codebase is 310KLoC of C code and about the same amount of sh/perl/tcl. That is roughly 15 times the size of LevelDB, a third the size of PostgreSQL, and generally in the ballpark of a general-purpose database.
Still, we cannot query the branches and the trees in any way more advanced than grep. Author’s personal experience is that with no issue tracker, branches get stale and forgotten even in solo development mode, sadly. LLMs add to that, as now we have no solo mode, and LLMs just love to reimplement things, each time very imperfectly. The awareness is lacking.
Finally, the ability to split and join content is critical in managing the mass of code. Apart from the submodule/ monorepo aspect, there is the method of overlays, where we split a worktree into distinct layers (e.g. code, prompts and configs) and work with them jointly or separately, depending on circumstances. That is like Photoshop layers. This idea has circulated in the CRDT community for quite some years.
Overall, things better be more structured, but less complicated. As AIs are piling up the code, we have to keep track of it and maintain the structure.
git is a filesystem, it says so on the box and it stores
blobs. Beagle is a database for your code, it stores AST*.
That allows addressing not only specific files for diffing/
querying/ merging/ cherry-picking, but also specific symbols
and AST* subtrees. That allows for complex querying of
versioned sources and text. Beagle is useful for users and
LLMs alike when one has to juggle a dozen branches at a time.
The next section talks about Beagle’s project/ branch/ twig/ overlay model, which is slightly different from git’s: branches are closer to git repos, twigs are like git branches but lighter, and overlays have no parallel in git at all. CRDT merges are deterministic and non-intrusive, so one can merge left and right, using the worktree as a palette for blending.
The section after that talks about Beagle’s core/plumbing
commands: GET, POST, PUT and DELETE. Yes, like HTTP.
Skip the next two sections if you want to see the resulting UX first. Long story short: mainly the same four commands plus URI-based syntax for everything.
How to make a command/ referral language flexible enough to express all the use cases by composing a minimal number of plain intuitive primitives? This problem is essentially a language problem.
With respect to addressing, Beagle bets on URIs. What worked for the World Wide Web in all its vastness should also work for intra/inter-repo referencing.
Encouraged by that idea, Beagle sets the scope of the system to global. One key feature of git was to only version an entire project as a whole. Let’s think: what can we do to version an entire working system, all sources and configs, so each repo is a small GitHub hosting a number of projects?
If we want to limit ourselves to 4 basic kinds of maneuvers, those are:
1. worktree to repo,
2. repo to worktree,
3. worktree to worktree,
4. repo to repo.
We assume the current worktree is linked to one fixed place in the repo. Things look a bit too primitive so far. Then, we chalk the repo into squares:
branches, named like domains: branch.team.company.com or release.product.entity.org;
projects: @gritzko/librdx (like a GitHub path).
So a full URI is like http://main.replicated.live/@gritzko/librdx
Here the maneuver #4 gets subdivided into submaneuvers, the most frequent case being changeset exchange between branches. Note that branches are not scoped to a repo or even to a project. When we create a branch, we “fork the world”. That mostly makes sense because projects form their own dependency graphs anyway, so version alpha of project A needs version beta of project B and so on. Once we create a branch, we may put in all the relevant code. With syntax-aware CRDT merge, we can be a bit bolder in forking things, as we retain enough metadata to ease merges.
On top of that, the mapping between file system paths and projects is
not 1:1. First of all, one project can have several worktrees, that is
normal. Second, one worktree can contain several blended branches or
projects. Merging the branches is nothing special, let’s talk about
the other case. Suppose we want to split one project into the base and
its overlays. For example, prompts, plans and TODOs live in the same
dirs in the worktree, but belong to a different overlay project in the
repo, @gritzko/librdx vs @gritzko/librdx.ai. We can work with the
source, we can add the AI work docs, or we can deal with prompts and
logs separately from sources.
The last caveat for those familiar with git (all of us) is twigs.
Apart from the head, a branch can have multiple marked twigs, which
are supposed to merge in the near future. The distinction here is that twigs
are scoped to a project/branch/repo, and have no public identity.
When each developer teams up with AI, cheaper transient branching is
necessary, locally and within a team. So public branches are heavier
than git branches and twigs are somewhat lighter. While a twig is
essentially a sticky note on a hash, CRDT merges are deterministic and
non-intrusive, so merging (blending) twigs invokes much less work and
ceremony than merging git branches.
GET POST PUT DELETE
Back to the original question, let’s see whether a URI-based referencing
language and 4 HTTP verbs are sufficient to express the operations we want.
GET, POST, PUT and DELETE correspond to maneuvers #2, #1, #4, #4 resp.
Maneuver #3 is cp, rm, vim, etc.
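As a rough sketch, this correspondence can be written down as a table of (source, destination) pairs. The numbering follows the sentence above; reading #1 as worktree-to-repo and #2 as repo-to-worktree is an assumption, based on how get and post are described for the porcelain layer later in the text.

```python
WORKTREE, REPO = "worktree", "repo"

# Maneuver numbers as used in the text; the (source, destination)
# reading of #1 and #2 is an assumption, see the note above.
MANEUVERS = {
    1: (WORKTREE, REPO),
    2: (REPO, WORKTREE),
    3: (WORKTREE, WORKTREE),  # plain cp, rm, vim, etc.
    4: (REPO, REPO),
}

VERB_TO_MANEUVER = {"GET": 2, "POST": 1, "PUT": 4, "DELETE": 4}

def direction(verb):
    """Return (source, destination) for a plumbing verb."""
    return MANEUVERS[VERB_TO_MANEUVER[verb.upper()]]

for verb in VERB_TO_MANEUVER:
    src, dst = direction(verb)
    print(f"{verb:6} {src} -> {dst}")
```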
GET http://branch.team.entity.org/project?twigA
                                          simple checkout of a particular twig version
                                          (may need to clone first);
GET //branch2                             switching the branch;
GET /project/dir/file.txt                 checkout one file;
POST ./file.txt                           stage one file (it gets imported into the repo,
                                          but the twig does not move yet);
DELETE somefile.txt                       delete;
PUT ./file.txt?twigB                      merge in file changes from other twig;
GET ?twigB                                switch the twig;
GET ?timestamp-origin                     checkout a version by its timestamp;
GET ?4d2130                               checkout a version by its hash;
GET ?twigA#has(x)                         list all uses of symbol x in twigA;
POST /project?twigA                       commit all changes to a twig;
PUT //branch2?twigC                       merge a twig of another branch;
POST ?stash; GET ?twigA                   stash the changes;
POST ?twigA                               commit changes (import, move the twig);
GET ?twigA#has(int,getX)                  from the twig, list all AST* nodes that
                                          have children int and getX (likely declaration
                                          and definition of int getX());
GET //branch/project/dir#has(int,getX)    same but fancier;
PUT http://remote.branch.team.entity.org  big time pull.
In fact, most everyday commands would break down into several
GET, POST, PUT, DELETE calls as, for example, refreshing
the work tree also requires a temporary stash of worktree changes
and their merge back into the refreshed version. Similarly, a push
to a remote branch is first a POST to a local copy and then
a PUT to a remote server.
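The reference forms used in this section fit the generic URI syntax, so a standard parser can take them apart. The sketch below uses Python's urllib.parse; the mapping of URI parts to Beagle concepts (host as branch, path as project and file, query string as twig or version, fragment as AST query) is an assumption read off the examples, not a documented grammar.

```python
from urllib.parse import urlsplit

def parse_ref(ref):
    """Split a reference into its parts.

    Assumed mapping (not confirmed by the text): host = branch,
    path = project plus optional dir/file, query string = twig or
    version, fragment = AST query. An empty part is read as
    "same as the current worktree".
    """
    parts = urlsplit(ref)
    return {
        "branch": parts.netloc,
        "path": parts.path,
        "version": parts.query,
        "ast_query": parts.fragment,
    }

for ref in [
    "http://branch.team.entity.org/project?twigA",
    "//branch2",
    "./file.txt?twigB",
    "?twigA#has(int,getX)",
]:
    print(ref, "->", parse_ref(ref))
```

Relative forms such as //branch2 or ?twigB come out with most parts empty, which matches the idea that the worktree supplies the missing context.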
While it is handy that the plumbing layer of CLI is virtually identical to the HTTP interface, for user convenience we need the “porcelain” layer doing all the everyday combos in one go.
The mission of the porcelain command layer is to let the user rely on the power of the technology while keeping him/her safe and sane.
Both plumbing and porcelain layers turn out to be quite compact so far, and most of the nuance is coded into URIs while CLI verbs only define the general maneuver. One tradeoff here is that the user must have some intuition of URI syntax. LLMs certainly have it, so no worry if you don’t.
Code is hypertext, IDE is a browser. Beagle is your curl/wget, a simple reliable everyday tool.
Same as plumbing, porcelain commands implement three maneuvers:
get data from repo to worktree,
post data from worktree to repo,
put moves data laterally in a repo.
There are some shortcuts for combos, but most of the work is get, post, put. The most straightforward linear workflow looks like:
be get //branch/project   clone/checkout a worktree
be come ?twig             fork off a twig (combo of be post ?twig + be get ?twig)
be post                   commit/stage all twig changes
be put                    merge in the branch head (or be get ?head ?twig ... be post,
                          a subtly more delicate way to achieve the same result)
be post ?head             merge into the head
Mixing branches or twigs is done by the same get verb but with
multiple arguments. Use worktree as a palette where you mix and
blend colors. Once satisfied, lay the paint on the canvas (post
it back to the repo).
be get ?twigA ?twigB ?head
be post ?twigABH
CRDT merge never fails, technically. That does not guarantee that your worktree would build or run correctly. Semantics is entirely your (or your LLM’s) responsibility. Beagle lets you merge/ undo/ juggle changes quickly. That is the best thing an SCM can do.
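Why a CRDT merge cannot fail can be shown on a toy example. This is not Beagle's actual RDX merge: it is a last-writer-wins map where every value carries a (time, origin) stamp, assumed unique per write. The merge is a per-key maximum, so it is commutative, associative and idempotent, and the order of blending does not matter.

```python
def merge(a, b):
    """Merge two last-writer-wins maps: key -> ((time, origin), value).

    For each key the entry with the greater stamp wins. Stamps are
    assumed to be unique per write, so ties carry identical values.
    """
    out = dict(a)
    for key, entry in b.items():
        if key not in out or entry[0] > out[key][0]:
            out[key] = entry
    return out

# Hypothetical states of three twigs.
head  = {"getX": ((1, "alice"), "int getX();")}
twigA = {"getX": ((2, "bob"), "long getX();")}
twigB = {"getY": ((2, "carol"), "int getY();")}

ab = merge(merge(head, twigA), twigB)
ba = merge(twigB, merge(twigA, head))
print(ab == ba)             # True: the order of blending does not matter
print(merge(ab, ab) == ab)  # True: merging twice changes nothing
```

Note that the merged result is only structurally well defined; whether `long getX();` is what the project needs remains a semantic question, exactly as the paragraph above says.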
There are aliases/combos for typical cases, e.g.
be come ?twig                   make the worktree version into a twig
be diff                         diff to the head (default, 3way)
be lay                          make a waypoint commit
be mark "Comment" "Story..."    make a “classic” verbose commit
be moan                         rollback one post
be rate                         mark the current commit
be fit                          merge into the head (be post ?head)
be                              overview of the current state (more than status)
Some shells treat ? as a special symbol, we may skip it most of
the time. There is risk of URI ?query being confused for a file
name and other things, so in this doc ? is never skipped. Still,
be get featureA tweakB should be OK (most of the time).
Beagle is balanced differently than git. There is one Beagle
repo per system, Beagle branches are between git branches and
git repos, while Beagle twigs are lighter than git branches (may
see them as patch stacks). Approximate command equivalents:
git init dir/            be post dir/
git stash push           there is no difference between stash and any other commit,
                         so be post ?mystash is enough
git add a.txt b.txt      same, be post a.txt b.txt
git clone http://uri     be get http://branch.team.entity.org
git push origin a:b      be post http://branch (branch names are FQDNs)
git pull origin b:a      be get http://branch ? && be post, where ? is the expression
                         for the current worktree’s branch/twig formula
git merge xxx            be get ?twigA ?twigB
git status               be
Beagle (will) implement combos for key git commands.
We started with a claim that Beagle is a database, not a filesystem. It stores a basic AST tree of the source code, which allows for basic code manipulation and search. That is a great opportunity to minimize busywork both for users and LLMs. That is especially valuable when digging through code written by somebody else (which is the overwhelming majority of cases, as individual contributors rely on LLMs more and more). Here are some examples of less trivial Beagle commands.
be get /project /project.ai     blend project and its prompt overlay
                                (technically, a separate project)
be get ?head ?twig              blend head and twig (no repo changes)
be put ?twig#DoThing            cherry pick a symbol from a twig
                                (will extract a patch based on the AST* tree)
be get ?twig#DoThing            same, but no commit, worktree only
be put ./file.txt?twig          cherry pick a file from a twig
be get ./file.txt?twig          get a file from a twig (no commit)
be get ./file.txt?twig#Some     cherry pick a symbol in a file
be put ?featureA&featureB       merge in two twigs
be post ?newtwig                fork
be post //newbranch             big time fork
be diff ?head#SomeClass         find any changes to SomeClass since head
                                (prints out patches)
be diff ./file.txt?v1.2         find all changes to file.txt since v1.2
be diff #has(DoThing,int)       diff int DoThing() specifically
be get ?#todo(asan)             find things to sanitize, any twig
In fact, the semantic load on the verbs of be CLI is to give
the direction data moves in. We may also use a convention with
no verbs at all: be uri_dest uri_src1 uri_src2...
That way, be - ?twigA //branchB is a merge into a working
tree, while be //release ?head ?tweaks is a merge into the
release branch head bypassing the working tree (reckless).
Overall, verbless use allows non-standard/advanced use patterns.
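One way to read a verbless invocation, sketched in Python: the first argument is the destination, the rest are sources, and `-` stands for the working tree. The function name and the returned fields are hypothetical; only the argument order and the two sample invocations come from the text.

```python
def read_verbless(args):
    """Interpret `be dest src1 src2...`: the first URI is the destination.

    Assumption: "-" denotes the working tree, anything else is a
    location in the repo.
    """
    if len(args) < 2:
        raise ValueError("need a destination and at least one source")
    dest, *sources = args
    return {
        "dest": "worktree" if dest == "-" else dest,
        "sources": sources,
        "touches_worktree": "-" in args,
    }

print(read_verbless(["-", "?twigA", "//branchB"]))        # merge into the worktree
print(read_verbless(["//release", "?head", "?tweaks"]))   # repo-only merge
```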
Beagle’s commit model is supposed to resolve git’s common pain
points and ossified workarounds. git’s ideal commit model sees
a commit as something eternal; an unbroken Merkle chain of
commits goes back to version 0, blockchain-like. But real life
is messy, so we have a number of workarounds to avoid enshrining
everyday hacks in project’s “blockchain”. Those are rebase,
squash or (my favorite, -m fix) learning to live with disorderly
histories. History rewriting in git is an advanced and extensive
topic, a yardstick of developer’s expertise. As it often happens,
workarounds may need workarounds of their own, and so on.
Beagle unifies in-repo “buckets”: staging or stashing is done by the same kind of an RDX container, no different from commit, branch, or tag, except for the labelling. Beagle makes unnamed inter-commit states (waypoints) shareable, and all commits in general aggregatable, so rebasing and squashing become part of the vanilla model, not an override. CRDTs ease that a lot. The idea of Beagle commits is to be “undo-redo, but persistent”. If Ctrl+S triggers a commit, there is nothing wrong about it.
On the technical side, Beagle’s twigs are very much like git’s
tags, just labels for hashes pointing at system states. Here,
things do not differ from git that much. The head twig is the
public version of the branch. get ?head or get ?feature
switches the worktree to a different twig.
Beagle’s Merkle structure is aligned with its LSM structure. A project’s state is technically a stack of RDX SST files in a repo. Each (newer, smaller) file references the hash of the previous (older, larger) file. When pulling changes from another replica, we can verify that this Tower of Hanoi is mostly unchanged, except for some smaller files on the top that are easy to inspect. A full chain-of-commits history is inspectable, in theory, if all the historical commit files are preserved somewhere (likely S3).
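A minimal sketch of that check, under simplifying assumptions: each file is just bytes, a file's hash is the SHA-256 of the previous file's hash plus its own content, and "mostly unchanged" means a long shared prefix of the chain, counted from the oldest file. The real RDX SST layout is not described in this part, so every name below is hypothetical.

```python
import hashlib

def chain_hashes(files):
    """Hash a stack of files, oldest first; each hash covers the previous one."""
    prev, hashes = b"", []
    for content in files:
        prev = hashlib.sha256(prev + content).digest()
        hashes.append(prev.hex())
    return hashes

def shared_base(local, remote):
    """Number of files, counted from the oldest, that both stacks agree on."""
    n = 0
    for a, b in zip(chain_hashes(local), chain_hashes(remote)):
        if a != b:
            break
        n += 1
    return n

# Toy stacks: same base, different small files on top.
local  = [b"big old base", b"medium update", b"small local tweak"]
remote = [b"big old base", b"medium update", b"small remote fix", b"tiny fix"]

n = shared_base(local, remote)
print(n)           # 2: the base of the tower is the same
print(remote[n:])  # only these small top files need inspection
```

Because every hash covers its predecessor, a change anywhere in the base would alter all hashes above it, so agreeing on the hash at level n is enough to trust everything below.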
Waypoint commits are RDX SST files with no tag attached. One can address them by time or hash, but their replication to other replicas is not guaranteed. Those are of local interest and might be compacted into larger files and garbage collected. That is the standard LSM way of things. Important commits are marked with “sticky notes” (twigs and tags). Those are preserved.
Finally, a CHANGELOG RDX document lists all the regular commits
and their attributes: times, dates, comments, authors, hashes,
signatures. These are produced by be mark (changelog insert +
be post combo). The mission of a changelog is to explain the
rationale behind the changes, as the changes themselves can be
cheaply calculated. Then, changelog changes would serve as
commit descriptions.
Beagle’s model is not exactly real-time, but more like “continuous”, especially if compared to git’s commit chains. Want to send the current uncommitted version to CI? Go ahead.
One performance bonus of this architecture is that old deleted data is not tied in the repo forever. It gradually fades away, unless intentionally preserved. No replica is required to maintain full history.
All versions are (obviously) recoverable, so be get ?26219b4L5j
would restore the worktree to a historical version by a timestamp
(base64 coded in this case), while be get ?4d2130 or be get
?4d213077e2bdd7d83e101a82ed070934cd8e2af6d8ded3dc64905736f8a820cb
would do the same by a hash(let).
The system will do its best to interpret informal inputs, like
be get ?head,15:50 or be get ?head,"skiplist" based on the
CHANGELOG document and the existing waypoint commits.
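Resolving a hashlet could be as simple as a prefix match. A hedged sketch, assuming the repo can list the full hashes it knows; the function name and the rule for ambiguous prefixes are assumptions, not Beagle's documented behavior. Only the first hash below comes from the text.

```python
def resolve_hashlet(hashlet, known_hashes):
    """Return the single full hash that starts with `hashlet`.

    Raises LookupError if the prefix matches nothing or is ambiguous.
    """
    prefix = hashlet.lower()
    matches = [h for h in known_hashes if h.startswith(prefix)]
    if len(matches) != 1:
        raise LookupError(f"{hashlet!r} matches {len(matches)} versions")
    return matches[0]

known = [
    "4d213077e2bdd7d83e101a82ed070934cd8e2af6d8ded3dc64905736f8a820cb",
    # The second hash is made up for the example.
    "4d99aa00" + "0" * 56,
]
print(resolve_hashlet("4d2130", known))
```

A too-short prefix such as `4d` matches both entries here and is rejected, which is one conservative way to handle informal input.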
What Beagle internally processes is not exactly AST but RDX, a
CRDT JSON superset, tree-ish document format. Beagle employs
codecs to import and export files into/from RDX. Hence, most
queries have to rely on generic document tree structure.
The exact codec machinery may vary, e.g. a *.c file may be
im/exported with: general text codec, tree-sitter based codec,
clang AST based codec or “git mode” fallback. Changing the codec
resets file’s history. Apart from the tree structure per se,
codecs may tag nodes (the bit budget is rather tight there).
That way, queries may distinguish a function from a class,
invocation from declaration, and so on.
Based on that rather generic information, we can have 80/20 of
your typical code navigation: callers/callees, definitions,
todos, and so on. That is way more accurate than grep (how
do you grep for a function body?). Still, this may fall short
of full IDE capabilities. For an inquiring agent, that might be
just right though.
mdp(worktree)         grep for markdown paragraphs (not lines)
grep("search")        grep-like generic search
has(int,getLen)       find nodes having children int and getLen
                      (e.g. a typical C function definition)
fn(int,getLen)        find specifically tagged function definitions
use(getLen)           find uses of a symbol
funcs(use(getLen))    find functions using a symbol
files(use(getLen))    find files using a symbol
todo(fuzz)            search for TODOs mentioning fuzzing
Query notation is RDX, although that hardly matters as it is generic enough. Each query produces a set of document elements (AST* nodes) that a command can be scoped to (diff, get, post, etc). So, for example, we can change signature of a function and commit specifically those hunks by a one-liner.
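To make the has(...) query concrete, here is a toy version over a generic tree. The node shape (a label plus a list of children) and the sample tree are assumptions made for illustration; Beagle's RDX documents, codec tags and query engine are not specified in this part.

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    label: str
    children: list = field(default_factory=list)

def has(root, *labels):
    """Yield every node whose direct children include all given labels."""
    wanted = set(labels)
    stack = [root]
    while stack:
        node = stack.pop()
        if wanted <= {c.label for c in node.children}:
            yield node
        stack.extend(node.children)

# A toy stand-in for a parsed C file: `int getLen() {...}` and a caller.
tree = Node("file", [
    Node("function", [Node("int"), Node("getLen"), Node("body")]),
    Node("function", [Node("void"), Node("main"),
                      Node("body", [Node("call", [Node("getLen")])])]),
])

for node in has(tree, "int", "getLen"):
    print(node.label, [c.label for c in node.children])
```

Asking for `has(tree, "getLen")` alone would return both the definition and the call site, which is why tagged queries such as fn(...) and use(...) are useful on top of the generic structure.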
If your next question is how to make this work efficiently, wait for Part III.
Acknowledgements. A. Borzilov, N. Prokopov (aka tonsky), J. Syrowiecki contributed feedback and ideas for this draft.
Part III. Inner workings of CRDT revision control.
Part IV. Experiments.
Part V. The Vision.