Eternal project - LLVM Backend Documentation
Posted on Вт 23 июня 2026 in misc
Eternal Software Initiative LLVM Backend Documentation
This document provides a complete, self-contained reference for the ESI LLVM backend-a custom LLVM target that compiles C (and C++) code to the ESI architecture, a One Instruction Set Computer based on the Subleq architecture where the only operation is subtract and branch if less than or equal to zero.
For the VM, memory layout, boot sequence, and inline-assembly conventions, see docs/machine_architecture.md. For the runtime library generators, see runtime/gen_runtime.py.
Table of Contents
- High-Level Architecture
- Directory Layout
- Register Map
- Calling Convention
- Type Legalization
- Instruction Selection (ISel)
- MIR Instruction Set
- The AsmPrinter: From MIR to Subleq Triples
- The Two-Pass (DryRun) Architecture
- Constant Pool
- Peephole Optimizations
- Branch and Comparison Lowering
- Sub-word Memory Access (CHAR_BIT = 8)
- Bitwise and Shift Operations
- Multiplication and Division
- Function Calls and Returns
- Frame Lowering
- Register Allocation Hints
- ELF Object Support and Relocations
- Target Transform Info (Cost Model)
- Exception Handling (SJLJ)
- Soft-Float and Libcall Configuration
- Atomics and Concurrency
- Inline Assembly Support
- Assembly Parser
- Compilation Pipeline Summary
1. High-Level Architecture
The Subleq backend follows the standard LLVM target architecture, but with a critical twist: because the underlying machine has no native instructions other than subleq(A, B, C), every pseudo-instruction defined in the backend's TableGen files is expanded to sequences of subleq triples during the AsmPrinter phase-not during instruction selection or MC lowering.
C source
│ clang frontend
▼
LLVM IR
│ SubleqTargetLowering (ISel lowering)
▼
SelectionDAG
│ SubleqDAGToDAGISel (pattern matching)
▼
Machine IR (MIR) - pseudo-instructions like ADD, LW, CALL, BR_SLT
│ Register allocation, frame lowering, branch analysis
▼
Machine IR (post-RA)
│ SubleqAsmPrinter (two-pass: DryRun + emit)
▼
Raw subleq triples (.word A, B, C) → ELF object file
The SubleqAsmPrinter is by far the largest file (~7,700 lines, 153 functions) and is the heart of codegen. It contains an expand* function for every MIR opcode, translating each to a hand-optimized sequence of subleq triples.
2. Directory Layout
llvm/lib/Target/Subleq/
├── Subleq.td # Top-level TableGen: includes, processor, target
├── SubleqRegisterInfo.td # Register definitions (GPR, AllRegs)
├── SubleqCallingConv.td # CC_Subleq, RetCC_Subleq, CSR_Subleq
├── SubleqInstrInfo.td # All MIR instruction definitions (~910 lines)
├── Subleq.h # SubleqISD namespace (custom ISD nodes)
├── SubleqAsmPrinter.cpp # MIR → subleq triples (7,714 lines)
├── SubleqISelLowering.cpp/.h # DAG lowering (BR_CC, SELECT_CC, calls, etc.)
├── SubleqISelDAGToDAG.cpp/.h # Pattern selection (Select, AddrFI matching)
├── SubleqFrameLowering.cpp/.h # Prologue/epilogue, CSR spill/restore
├── SubleqRegisterInfo.cpp/.h # Reserved regs, frame index elimination, hints
├── SubleqInstrInfo.cpp/.h # copyPhysReg, analyzeBranch, insertBranch
├── SubleqTargetMachine.cpp/.h # TargetMachine, pass pipeline
├── SubleqSubtarget.cpp/.h # Subtarget (generic CPU, no features)
├── SubleqTargetTransformInfo.cpp/.h # Instruction cost model
├── SubleqMachineFunctionInfo.h # Per-function state (VarArgsFrameIndex)
├── SubleqSelectionDAGInfo.cpp/.h # (minimal)
├── MCTargetDesc/
│ ├── SubleqMCAsmInfo.cpp/.h # Assembly syntax, ELF config, SJLJ
│ ├── SubleqMCTargetDesc.cpp/.h # MC component registration
│ ├── SubleqMCCodeEmitter.cpp # Instruction encoding (word-oriented)
│ ├── SubleqAsmBackend.cpp # Fixup application, NOP data, reloc eval
│ ├── SubleqELFObjectWriter.cpp # ELF relocation types (R_386_32, R_SUBLEQ_NEG32)
│ ├── SubleqELFStreamer.cpp/.h # Streamer customization
│ ├── SubleqFixupKinds.h # fixup_subleq_neg32
│ └── SubleqInstPrinter.cpp/.h # Textual instruction printing
├── AsmParser/
│ └── SubleqAsmParser.cpp # Assembly parsing (.word directives, labels)
└── TargetInfo/
└── SubleqTargetInfo.cpp/.h # Target registration (getTheSubleqTarget)
3. Register Map
Subleq has no hardware registers. All "registers" are fixed memory locations in low memory (words 0–67, bytes 0x00–0x10C). The backend defines them in SubleqRegisterInfo.td:
| Register | Word | Byte Addr | Role | Saved By |
|---|---|---|---|---|
| Z | 3 | 0x0C | Scratch zero (cleared before use) | - (reserved) |
| SP | 4 | 0x10 | Stack pointer | - (reserved) |
| RA | 5 | 0x14 | Return address scratch | - (reserved) |
| R3–R19 | 7–23 | 0x1C–0x5C | General purpose | Callee |
| R20 | 24 | 0x60 | Return value | Caller |
| R21–R24 | 25–28 | 0x64–0x70 | Function arguments 1–4 | Caller |
| R25–R31 | 29–35 | 0x74–0x8C | General purpose | Callee |
| ZERO | 36 | 0x90 | Constant 0 | - (constant) |
| FP | 37 | 0x94 | Frame pointer | Callee |
| MINUS_ONE | 38 | 0x98 | Constant −1 | - (constant) |
| ONE | 39 | 0x9C | Constant 1 | - (constant) |
| T0–T15 | 40–55 | 0xA0–0xDC | Compiler temporaries | - (reserved) |
Register classes:
GPR- All allocatable registers: FP, RA, SP (for type compat, marked reserved), R3–R31. Onlyi32type.AllRegs- GPR + ZERO, Z, T0–T15. Used for instructions that reference non-allocatable locations.
Reserved registers (cannot be allocated): Z, SP, RA, FP (when frame pointer is needed), ZERO, T0–T15. The getReservedRegs() function in SubleqRegisterInfo.cpp marks these.
4. Calling Convention
Defined in SubleqCallingConv.td:
Arguments
| Convention | Behavior |
|---|---|
CC_Subleq |
i1/i8/i16 promoted to i32. First 4 i32 args in R21–R24; rest on stack (4-byte aligned). |
CC_Subleq_VarArg |
All arguments on stack (enables va_arg iteration). |
Return Values
| Type | Location |
|---|---|
| i32 (or smaller) | R20 |
| i64 / struct | Hidden sret pointer in R21 |
Register Categories
- Caller-saved (clobbered by calls): R20–R24, T0–T15
- Callee-saved (preserved): R3–R19, R25–R31
- CSR_NoRegs: Used for SJLJ dispatch-forces spilling of all live registers before
invokecalls.
Stack Frame
The stack grows downward. Caller cleans up stack arguments. The frame layout:
High addresses
+---------------------------+
| Argument N (if > 4 args) |
| ... |
| Argument 5 |
+---------------------------+ ← SP on entry
| Return address (RA) |
| Saved FP (if used) |
| Saved callee-saved regs |
| Local variables |
| RuntimeCallStackPadding | ← 16 bytes reserved for nested runtime calls
+---------------------------+ ← SP during function
Low addresses
5. Type Legalization
Configured in the SubleqTargetLowering constructor:
- Legal type:
i32only. LLVM automatically expandsi64to pairs ofi32. i8/i16: All arithmetic is promoted toi32. Loads and stores ofi8/i16are Legal (handled by patterns that emitLB/SB/LH/SHpseudo-instructions).i64operations: Shifts expand to libcalls (__ashldi3,__lshrdi3,__ashrdi3). Division/remainder uses__divdi3,__moddi3,__udivdi3,__umoddi3.- Floating-point: Not legal. All
f32/f64operations are lowered to soft-float libcalls (compiler-rt compatible:__addsf3,__muldf3, etc.) implemented insubleq_runtime_softfloat.c. - Atomics: 32-bit atomics are "supported" via
AtomicExpandPass-it strips the atomic ordering and emits plain loads/stores (safe on single-core Subleq). Fences are no-ops.
6. Instruction Selection (ISel)
DAG Lowering (SubleqISelLowering.cpp)
LowerOperation() dispatches to custom handlers for:
| ISD Node | Handler | Strategy |
|---|---|---|
GlobalAddress |
LowerGlobalAddress |
Wrap in SubleqISD::WRAPPER |
BlockAddress |
LowerBlockAddress |
Wrap in SubleqISD::WRAPPER |
ConstantPool |
LowerConstantPool |
Wrap in SubleqISD::WRAPPER |
JumpTable |
LowerJumpTable |
Wrap in SubleqISD::WRAPPER |
BR_CC |
LowerBR_CC |
Rich comparison lowering (see §12) |
SETCC |
LowerSETCC |
Route through SELECT_CC with 0/1 |
SELECT_CC |
LowerSELECT_CC |
12 SELECT variants (see §12) |
SDIV/SREM/etc. |
LowerSDIVREM |
Combined SDIVREM/UDIVREM pseudo |
VASTART |
LowerVASTART |
Store varargs frame pointer |
RETURNADDR/FRAMEADDR |
Custom | Walk frame chain |
EH_SJLJ_* |
Custom | SJLJ exception handling |
INTRINSIC_WO_CHAIN |
Custom | eh_sjlj_lsda intrinsic |
ATOMIC_FENCE |
Custom | No-op (return chain) |
Multiply-by-constant strength reduction is enabled via decomposeMulByConstant(), which returns true so LLVM transforms x * 3 → (x << 1) + x, x * 5 → (x << 2) + x, etc.
Power-of-2 division/remainder: BuildSDIVPow2() and BuildSREMPow2() prevent LLVM's default strength reduction (which would use ADD+SRA+SRL sequences) and instead let the operations go through the normal SDIV/SREM path, which the backend handles more efficiently.
DAG-to-DAG Selection (SubleqISelDAGToDAG.cpp)
The Select() method handles:
ISD::Constant→Subleq::LI(load immediate)ISD::LOAD→Subleq::LWorSubleq::LWO(with offset folding for frame indices)ISD::STORE→Subleq::SWorSubleq::SWO(with offset folding)SubleqISD::CALL→Subleq::CALLorSubleq::CALLRSubleqISD::WRAPPER→Subleq::LISym(load symbol address)
ComplexPattern matchers:
SelectAddrFI- Matches bareFrameIndexnodes forLEA_FI.SelectAddrFI_Byte- Matches(add frameindex, const)or(or frameindex, const)for byte-offset sub-word operations (LBO/SBO/LHO/SHO). The OR variant is needed because LLVM's DAGCombiner converts(add X, C)to(or X, C)when alignment guarantees no bit overlap.
7. MIR Instruction Set
All instructions are defined in SubleqInstrInfo.td (~910 lines). None have real encodings-they are pseudo-instructions expanded in the AsmPrinter. Key categories:
Data Movement
| Instruction | Operands | Description |
|---|---|---|
MOV |
dst, src |
Register-to-register move |
LI |
dst, imm |
Load 32-bit immediate |
LISym |
dst, sym |
Load symbol address |
LW |
dst, addr |
Load word from memory |
LWO |
dst, base, offset |
Load word with offset |
SW |
src, addr |
Store word to memory |
SWO |
src, base, offset |
Store word with offset |
CLR |
dst |
Clear register to zero |
Sub-word Memory (with Defs = [R20, R21, R22])
| Instruction | Description |
|---|---|
LB / LBs |
Load byte (zero-ext / sign-ext) |
SB |
Store byte (read-modify-write) |
LBO / LBOs / SBO |
Byte ops with frame-relative offset |
LH / LHs / SH |
Halfword ops |
LHO / LHOs / SHO |
Halfword ops with offset |
Arithmetic
| Instruction | Pattern | Description |
|---|---|---|
ADD |
(add r, r) |
dst = src1 + src2 |
ADDI |
(add r, imm) |
dst = src + imm |
SUB |
(sub r, r) |
dst = src1 − src2 |
SUBI |
(sub r, imm) |
dst = src − imm |
NEG |
(sub 0, r) |
dst = −src |
LEA_FI |
frameindex |
Materialize frame address |
Bitwise (with Defs = [R20, R21, R22] - runtime calls)
| Instruction | Pattern | Description |
|---|---|---|
AND / ANDI |
(and r, r/imm) |
Bitwise AND |
OR / ORI |
(or r, r/imm) |
Bitwise OR |
XOR / XORI |
(xor r, r/imm) |
Bitwise XOR |
NOT |
(not r) |
Bitwise complement (inline, no runtime) |
Shifts (with Defs = [R20, R21, R22] - runtime calls)
| Instruction | Pattern |
|---|---|
SHL / SHLI |
(shl r, r/imm) |
SRL / SRLI |
(srl r, r/imm) |
SRA / SRAI |
(sra r, r/imm) |
Multiplication and Division (with Defs = [R20, R21, R22])
| Instruction | Description |
|---|---|
MUL |
32-bit multiply |
SDIV / UDIV |
Signed/unsigned division |
SREM / UREM |
Signed/unsigned remainder |
SDIVREM / UDIVREM |
Combined quotient+remainder (single pass) |
Control Flow
| Instruction | Description |
|---|---|
JMP |
Unconditional jump |
JMPR |
Indirect jump (computed goto) |
BLEZ |
Branch if ≤ 0 (native subleq semantics) |
BGTZ |
Branch if > 0 (also matches brcond) |
BLTZ |
Branch if < 0 |
BGEZ |
Branch if ≥ 0 |
BREQ0 |
Branch if == 0 (two-check) |
BRNE0 |
Branch if ≠ 0 |
BR_SLT/SGT/SGE/SLE |
Signed comparisons (two-operand, overflow-safe) |
BR_ULT/UGT/UGE/ULE |
Unsigned comparisons (two-operand, MSB-aware) |
BEQ / BNE |
Two-operand equal/not-equal |
Conditional Selects
12 SELECT_* pseudo-instructions implement SELECT_CC:
- Single-operand (condition vs 0):
SELECT_EQ0,SELECT_NE0,SELECT_LE0,SELECT_GT0,SELECT_LT0,SELECT_GE0 - Two-operand unsigned:
SELECT_UGT,SELECT_ULT,SELECT_UGE,SELECT_ULE - Two-operand signed:
SELECT_SLT,SELECT_SGT,SELECT_SLE,SELECT_SGE
Function Calls
| Instruction | Description |
|---|---|
CALL |
Direct call (Defs R20–R24, Uses R21–R24) |
CALLR |
Indirect call |
RET |
Return (Uses RA, R20) |
ADJCALLSTACKDOWN/UP |
Call sequence pseudos for stack adjustment |
Batch Callee-Save/Restore
| Instruction | Description |
|---|---|
CSR_SPILL_START |
Compute T4 = base + offset, save first reg |
CSR_SPILL_NEXT |
Decrement T4 by 4, save next reg |
CSR_RELOAD_START |
Compute T4 = base + offset, load first reg |
CSR_RELOAD_NEXT |
Decrement T4 by 4, load next reg |
These reduce per-register spill cost from ~30 words to ~15 words for subsequent registers by batching the address computation.
I/O and Special
| Instruction | Description |
|---|---|
PUTCHAR |
Output character (VM sentinel −4 in B) |
GETCHAR |
Input character (VM sentinel −4 in A) |
HALT |
Exit program with return code |
NOP |
No-operation (subleq Z, Z, next) |
SJLJ Exception Handling
| Instruction | Description |
|---|---|
EH_SjLj_SetJmp |
Save context, return 0 / non-zero on longjmp |
EH_SjLj_LongJmp |
Restore context and jump back |
EH_SjLj_Setup_Dispatch |
Set up landing pad dispatch |
SJLJ_DISPATCH_MARKER |
No-op marker with register mask |
8. The AsmPrinter: From MIR to Subleq Triples
SubleqAsmPrinter.cpp is the core of the backend. Every MIR instruction dispatches to an expand* function that emits raw .word triples. Key primitives:
Emission Helpers
| Helper | Words | Description |
|---|---|---|
emitSubleq(A, B, C) |
3 | Raw subleq triple |
emitSubleqNext(A, B) |
3 | subleq with fall-through (C = PC + 12) |
emitClear(dst) |
3 | dst = 0 via subleq(dst, dst, next) |
emitMove(dst, src) |
12 | dst = src (clear dst, Z = −src, dst = −Z, clear Z) |
emitNegate(dst, src) |
6–18 | dst = −src |
emitAdd(dst, s1, s2) |
9–15 | dst = s1 + s2 (optimized for dst == s1 or dst == s2) |
emitSub(dst, s1, s2) |
3–21 | dst = s1 − s2 (optimized for dst == s1) |
emitLoadImm(dst, val) |
0–6 | Load constant via constant pool |
emitJumpReloc(target) |
3 | Unconditional jump (relocatable) |
emitJumpToSymbol(sym) |
3 | Jump to MCSymbol |
Key Expansion Sizes
Some representative expansion sizes (in words / 3 = subleq instructions):
| MIR Instruction | Words (typical) |
|---|---|
MOV |
12 |
LI (imm ≠ 0) |
6 (or 3 with delta, or 0 if known) |
ADD (general) |
15 |
ADD (dst == src1) |
9 |
ADDI (in-place) |
3 |
SUB (general) |
21 |
LW (general) |
30 |
LW (direct reg) |
12 |
SW (general) |
30 |
BLEZ |
3 |
BGTZ |
18–21 |
BR_SLT (signed) |
~90 |
BR_ULT (unsigned) |
~90 |
CALL |
~66 |
RET |
9 |
Register Address Translation
getRegAddr(unsigned Reg) maps LLVM register enums to byte addresses. For example, Subleq::R3 → ADDR_R3 (= 7 × 4 = 28). The SubleqMemLayout enum encodes all fixed memory addresses.
9. The Two-Pass (DryRun) Architecture
Because subleq has no conditional branch with an arbitrary target encoding-the branch target is a literal address word-the AsmPrinter must know the exact byte address of every basic block before emitting code. This creates a chicken-and-egg problem: you need block addresses to emit branches, but you need to emit code to know block addresses.
Solution: Two-pass emission.
- DryRun pass (
calculateBlockAddresses): SetsDryRun = true. Iterates over all basic blocks and instructions, calling the sameexpand*functions. These updatePCwithout emitting any bytes. After each block, the resultingPCis recorded as that block's address inLabelAddresses. Peephole state (KnownRegValues,PeepholeR21Valid,PeepholeT0Valid) andReturnLabelIDare saved and restored to ensure bit-exact consistency. - Real pass: Sets
DryRun = false. Iterates again, now emitting real.worddirectives. Branch targets are resolved fromLabelAddresses.
Critical invariant: The DryRun and real passes must produce exactly the same code size for every instruction. If they diverge, branch targets will be wrong, causing silent miscompilation or crashes. A historical bug (Heisenbug in cred.c) was traced to INLINEASM instructions incorrectly invalidating peephole states between passes.
Per-section PCs: When a translation unit has functions in multiple sections (e.g., .text and .sched.text), each section maintains its own PC counter via the SectionPCs map.
10. Constant Pool
Subleq cannot encode immediate values in instructions-everything is a memory address. To load an immediate value V, the backend:
- Stores
−Vat a constant pool location (a labeled word at the end of the section). - Emits
subleq(const_addr, dst, next)which computesdst = 0 − (−V) = V.
Per-Section Constant Pools
To avoid cross-section references (which cause Linux kernel modpost warnings), constants are duplicated per-section. The pools are:
| Pool | Key | Purpose |
|---|---|---|
SectionConstantPool |
(section, int32_t) | Integer constants |
SectionSymbolConstantPool |
(section, string) | Symbol addresses |
SectionSymbolOffsetConstantPool |
(section, name, offset) | Symbol + offset (folds GV+offset into R_SUBLEQ_NEG32) |
SectionMCSymbolConstantPool |
(section, MCSymbol*) | Block address symbols |
CodeAddressConstantPool |
int32_t | Code addresses needing relocation |
The emitImmediateConstantPool() function is called at the end of each section to emit all accumulated constants.
11. Peephole Optimizations
The AsmPrinter implements several peephole optimizations that operate during emission:
P1: Sub-word Address Reuse (PeepholeR21Valid)
When a sub-word store (SHO/SBO) computes a word-aligned byte address into R21, the next sub-word load/store to the same address can skip the address computation. This saves ~44 subleq instructions per reuse. The state tracks:
- PeepholeR21BaseReg - the base register used
- PeepholeR21WordOffset - the word-aligned byte offset
- PeepholeR21HalfPos / PeepholeR21BytePos - which sub-word position
Reset at: basic block boundaries, any instruction that clobbers R21.
P2: Store-Load Forwarding (PeepholeT0Valid)
When expandSW stores a value, it leaves −src in T0 as a side effect. If the next instruction loads from the same address, it can reuse T0 instead of re-reading memory. Tracks PeepholeT0NegatedSrc.
H3: LI Delta (KnownRegValues)
Tracks known integer values loaded into registers via emitLoadImm. When loading a new constant into a register that already holds a known value, emits a 3-word delta subtraction instead of a 6-word clear+load:
; Instead of 6 words (clear + load from pool):
; Just 3 words: dst -= (oldValue - newValue)
subleq(delta_pool_addr, dst, next)
This optimization also detects when a register already holds the desired value (0-word no-op).
Dead Operand Analysis (getCmpOperandDeadFlags)
For branch instructions, analyzes liveness (via LivePhysRegs) to determine if comparison operands are dead after use. Dead operands can be used as temporaries directly, avoiding emitMove copies (saving 12 words each).
Instruction Fusion (SkipMIs)
The expandLBO/expandSBO and related functions can look ahead at the next instruction and fuse operations (e.g., an LBO followed by an SBO to the same byte can share address computation).
12. Branch and Comparison Lowering
This is the most complex part of the backend, because subleq only has a ≤ 0 branch, and signed/unsigned comparisons can overflow.
BR_CC Lowering Strategy
LowerBR_CC() in SubleqISelLowering.cpp implements a multi-tier strategy:
Tier 1: Comparison against zero (3–21 words)
When RHS is constant zero, uses cheap single-operand branch nodes:
- SETLE → SubleqISD::BLEZ (3 words - maps directly to subleq)
- SETLT → SubleqISD::BLTZ (15 words)
- SETGT → SubleqISD::BGTZ (18–21 words)
- SETGE → SubleqISD::BGEZ (15 words)
Also handles constant-zero LHS by swapping the comparison.
Tier 2: Equality/inequality (diff-based)
SETEQ→SubleqISD::BEQ0on(LHS − RHS)(27 words)SETNE→SubleqISD::BNE0on(LHS − RHS)(21 words)
Tier 3: Two-operand comparisons (overflow-safe, ~90 words)
For signed comparisons like SETLT, computing LHS − RHS can overflow (e.g., INT_MIN − 1 wraps). The backend uses two-operand branch nodes (BR_SLT, BR_SGT, BR_SGE, etc.) that check operand signs before subtracting:
; BR_SLT: branch if LHS < RHS (signed)
; 1. Check signs: if LHS ≥ 0 and RHS < 0 → LHS > RHS, fall through
; 2. If LHS < 0 and RHS ≥ 0 → LHS < RHS, branch
; 3. Same signs → safe to subtract: branch if (LHS − RHS) < 0
Unsigned comparisons (BR_ULT, BR_UGT, BR_UGE, BR_ULE) use MSB (sign-bit) analysis to determine the unsigned magnitude comparison: if MSBs differ, the one with MSB=1 is unsigned-greater; if MSBs are the same, the lower 31 bits can be compared with signed subtraction.
SELECT_CC Lowering
LowerSELECT_CC() follows a similar strategy: for comparisons against zero, emits single-operand selects (SELECT_EQ0, SELECT_GT0, etc.); for signed/unsigned comparisons, emits two-operand selects (SELECT_SLT, SELECT_UGT, etc.) that implement overflow-safe control flow inline.
13. Sub-word Memory Access (CHAR_BIT = 8)
With CHAR_BIT = 8, the Subleq backend packs 4 bytes per 32-bit word (little-endian):
Byte Address 4N+0 → bits [7:0] (least significant byte)
Byte Address 4N+1 → bits [15:8]
Byte Address 4N+2 → bits [23:16]
Byte Address 4N+3 → bits [31:24] (most significant byte)
Byte Operations
The general-case LB/SB instructions call runtime functions (__subleq_lb, __subleq_sb) that:
1. Compute the word address: word_addr = byte_addr & ~3
2. Compute the byte position: pos = byte_addr & 3
3. Use a bit-lattice to extract/insert the relevant byte
Optimized Frame-Relative Operations (LBO/SBO/LHO/SHO)
When the byte offset is known at compile time (frame-relative accesses), the backend uses split entry points (__subleq_lb_b0, __subleq_lb_b1, __subleq_lb_b2, __subleq_lb_b3) that skip the modulo lattice, saving ~44 subleq instructions per call. These are selected via the AddrFI_Byte ComplexPattern in SubleqISelDAGToDAG.cpp.
The traceAddrToGlobal() helper traces address definitions backward through MOV/COPY/ADDI/LISym chains to identify compile-time-known byte offsets for global variables.
14. Bitwise and Shift Operations
Since subleq can only subtract, bitwise operations require creative implementations:
Bitwise AND/OR/XOR
Register-register operations call runtime functions (__subleq_and, __subleq_or, __subleq_xor) that use a 32-iteration bit-by-bit lattice: doubling the input values each iteration, testing the sign bit via BLEZ, and accumulating the result.
Immediate operations (ANDI/ORI/XORI) use split entry points (__subleq_and_bN) that skip iterations for bits that are known 0 or 1 in the immediate, reducing iteration count.
NOT
NOT is the only bitwise op expanded inline (no runtime call): dst = −1 − src. This is just subleq(src, MINUS_ONE_copy, ...).
Shifts
Shifts also call runtime functions (__subleq_shl, __subleq_srl, __subleq_sra). For immediate shift amounts (SHLI/SRLI/SRAI), the backend uses split entry points to skip to the correct lattice position.
15. Multiplication and Division
Multiplication
expandMUL moves operands to R21/R22, calls __subleq_mul, and moves R20 to the destination. The runtime uses shift-and-add: for each bit of the multiplier, conditionally add the multiplicand.
Division and Remainder
The backend coalesces SDIV+SREM pairs into single SDIVREM pseudo-instructions (and UDIV+UREM into UDIVREM). These call __subleq_sdivrem / __subleq_udivrem, which compute both quotient (in R20) and remainder (in R21) in one pass.
For 64-bit division, standard libcalls (__divdi3, __moddi3, __udivdi3, __umoddi3) are used.
16. Function Calls and Returns
CALL Expansion
expandCALL emits:
1. Push return address to stack (via emitPushRALabel)
2. Jump to target function
The return address is an MCSymbol label emitted immediately after the jump. emitPushRALabel computes the label address, decrements SP, and stores it to the stack via indirect addressing ([SP|I]).
CALLR Expansion
Indirect calls work similarly but the jump target comes from a register (using Subleq+ indirect addressing for the jump: subleq(Z, Z, target|I)).
RET Expansion
expandRET emits:
1. Pop return address from stack into RA (emitPopRA)
2. Jump to RA (indirect: subleq(Z, Z, RA|I))
Runtime Calls
Many MIR instructions (AND, SHL, MUL, LB, etc.) expand to runtime library calls. emitRuntimeCall(funcName) handles the push-RA / jump-to-symbol / emit-return-label sequence. emitSetupRuntimeArgs handles the register shuffling to place operands in R21/R22 without clobbering, considering all four possible conflict cases.
17. Frame Lowering
SubleqFrameLowering.cpp handles prologue/epilogue generation:
Prologue (emitPrologue)
- Compute total frame size (locals + spills +
RuntimeCallStackPadding) - Emit
SUBI SP, SP, frameSize(decrement stack pointer) - If frame pointer needed:
MOV FP, SPthenADDI FP, FP, adjustment
Epilogue (emitEpilogue)
- If frame pointer used:
MOV SP, FPthenADDI SP, SP, adjustment - Else:
ADDI SP, SP, frameSize
RuntimeCallStackPadding
A fixed 16-byte padding is added to every frame. This prevents runtime function return addresses (which are pushed onto the stack by CALL sequences) from overwriting spill slots. Runtime functions can nest up to 4 levels deep.
Batch CSR Spill/Restore
spillCalleeSavedRegisters() groups contiguous callee-saved registers and emits CSR_SPILL_START + CSR_SPILL_NEXT sequences. The first register in a group costs ~30 words (address computation + store); subsequent registers cost only ~15 words (decrement T4 by 4 + store). The restore path uses CSR_RELOAD_START / CSR_RELOAD_NEXT symmetrically.
Frame Index Elimination
SubleqRegisterInfo::eliminateFrameIndex() is a substantial function (~290 lines) that replaces FrameIndex references in MIR instructions with SP + offset or FP + offset. It handles special cases for:
- Direct register loads/stores from stack slots
- Sub-word operations (LBO/SBO/LHO/SHO) where the byte offset must be preserved
- LEA_FI → ADDI base, offset
18. Register Allocation Hints
SubleqRegisterInfo::getRegAllocationHints() provides non-binding hints to the register allocator:
Runtime-Call Operand Hints
For instructions that expand to runtime calls (AND, SHL, MUL, LB, SB, etc.), hints suggest:
- Output → R20: Avoids a 12-word emitMove from R20 to the destination
- Input 1 → R21: Avoids copying to R21 before the call
- Input 2 → R22: Avoids copying to R22 before the call
In-Place Hints
For instructions like ADDI, ADD, SUB, SUBI, SHLI, SRLI, SRAI, the first source operand is hinted to match the destination register. When dst == src1, the expansion is dramatically cheaper (e.g., ADDI in-place is 3 words vs. 15 general).
Identification
isRuntimeCallPseudo() identifies the ~30 opcodes that benefit from R20/R21/R22 hints. benefitsFromInPlace() identifies opcodes where dst == src1 saves significant code. Notably, NEG and NOT are excluded from in-place hints because their in-place forms are actually worse.
19. ELF Object Support and Relocations
The backend produces standard ELF object files with:
- Machine type:
EM_SUBLEQ(custom ELF machine ID) - Endianness: Little-endian
- Word size: 32-bit
Relocation Types
| Type | Description |
|---|---|
R_386_32 |
Standard 32-bit absolute relocation |
R_SUBLEQ_NEG32 |
Negated 32-bit: resolves to −(S + A) |
The R_SUBLEQ_NEG32 relocation is essential for the constant pool mechanism. When loading an immediate value V, the backend stores −V in the constant pool. At link time, the relocation resolves to −(symbol_value + addend), producing the negated address needed by the subleq copy idiom.
Fixup Handling
SubleqAsmBackend::applyFixup() applies fixups to the data section. For fixup_subleq_neg32, it negates the value only when fully resolved (no relocation emitted). When a relocation is emitted, the addend is written normally; the linker's relocate() handles the negation.
evaluateFixup() overrides the base class to force relocations for undefined symbols, ensuring proper ELF relocations for external references like __text_start.
MCAsmInfo Configuration
Key settings in SubleqMCAsmInfo.cpp:
- CommentString = "#"
- Data32bitsDirective = "\t.word\t"
- ExceptionsType = ExceptionHandling::SjLj
- SupportsDebugInformation = true
- UseIntegratedAssembler = true
- ParseInlineAsmUsingAsmParser = true
20. Target Transform Info (Cost Model)
SubleqTargetTransformInfo.cpp provides LLVM's optimization passes with realistic instruction costs based on measured step counts:
| Operation | Measured Steps | Cost |
|---|---|---|
| Add / Sub | ~85 | 1 |
| Shifts | 143–172 | 2 |
| AND / OR | 193–194 | 2 |
| XOR | ~236 | 3 |
| Multiply | 188–286 | 3 |
| Division / Remainder | 354–947 | 8 |
| ICmp | - | 2 |
| Select | - | 3 |
These costs guide LLVM's decisions on loop unrolling, inlining, and instruction selection trade-offs.
21. Exception Handling (SJLJ)
The backend uses SJLJ (SetJmp/LongJmp) exception handling, the only EH model feasible without hardware stack unwinding support.
Implementation Status
SJLJ exception handling is fully implemented across the backend:
- ISel lowering (
SubleqISelLowering.cpp): Custom lowering forISD::EH_SJLJ_SETJMP,ISD::EH_SJLJ_LONGJMP,ISD::EH_SJLJ_SETUP_DISPATCH, andeh_sjlj_lsdaintrinsic. - Dispatch block (
emitSjLjDispatchBlock, ~300 lines): Creates dispatch MBBs, a dense jump table (one entry per call site index), a trap BB (callsabortfor out-of-range indices), and stores the dispatch address intojbuf[1]of the function context. - Two-pass compatibility:
calculateBlockAddresses()explicitly callsemitJumpTableInfo()during the DryRun pass, ensuring jump table symbols are registered before they are referenced. - Register handling:
CSR_NoRegscallee-saved set forces spilling of all live values beforeinvokecalls.SJLJ_DISPATCH_MARKERcarries the register mask for the dispatch block. - Unwind resume:
RTLIB::UNWIND_RESUMEis configured to call_Unwind_SjLj_Resumefor correct nested exception propagation.
[!NOTE] The stale comments in
SubleqMCAsmInfo.cppdescribing SJLJ as "BLOCKED" or "PARTIALLY implemented" are outdated and refer to problems that have since been fixed.
22. Soft-Float and Libcall Configuration
Since Subleq has no floating-point hardware, the constructor configures comprehensive libcall mappings:
| Category | Example Functions |
|---|---|
| f32 arithmetic | __addsf3, __subsf3, __mulsf3, __divsf3 |
| f64 arithmetic | __adddf3, __subdf3, __muldf3, __divdf3 |
| Float ↔ int | __fixsfsi, __fixdfdi, __floatsisf, __floatdidf, etc. |
| Precision conversion | __extendsfdf2, __truncdfsf2 |
| Comparisons | __eqsf2, __ltdf2, __unordsf2, etc. |
| Rounding | floor, ceil, trunc, round, nearbyint, rint |
| Trig / exp / log | sin, cos, exp, log, pow, sqrt, etc. |
Implementations come from two sources:
- Compiler-rt builtins (subleq_runtime_softfloat.c) for basic operations
- uClibc-ng (libc.a) for math library functions
Memory operations use __subleq_memcpy, __subleq_memset, __subleq_memmove (prefixed to avoid conflicts with user-defined versions). Inline expansion of memcpy/memset/memmove is completely disabled (MaxStoresPerMemcpy = 0) because byte-level operations are ~30× slower than word-level runtime functions.
23. Atomics and Concurrency
Subleq is a single-core, single-threaded machine. The backend handles atomics as follows:
setMaxAtomicSizeInBitsSupported(32): Tells LLVM that 32-bit atomics are "supported."AtomicExpandPass: Added inaddIRPasses(), converts atomic operations to non-atomic equivalents (theshouldExpandAtomicLoadInIR/StoreInIRoverrides returnNotAtomic).- Fences:
ATOMIC_FENCEis custom-lowered to a no-op (just returns the chain). No store buffers or cache coherence to worry about.
24. Inline Assembly Support
The backend supports GCC-style inline assembly with:
Constraints
| Constraint | Type | Description |
|---|---|---|
r |
Register | Any GPR |
m |
Memory | Memory operand (byte address of variable) |
i |
Immediate | Integer constant |
Operand Printing
PrintAsmOperand() handles:
- Register operands: prints the byte address (e.g., 28 for R3)
- Extra code n: prints negated immediate values
- Extra code b: byte number (addr & 3) for sub-word access
PrintAsmMemoryOperand() prints the byte address of memory operands.
Inline Assembly Memory Operand
SelectInlineAsmMemoryOperand() accepts any address as a valid memory operand for m, o, and other memory constraints.
Since Subleq has no named instructions, inline assembly uses .word A, B, C directives to emit raw subleq triples.
25. Assembly Parser
SubleqAsmParser.cpp (~300 lines) parses Subleq assembly:
.worddirectives: The primary instruction format (three 32-bit integers per subleq triple)- Labels: Standard label definitions and references
- Expressions: Arithmetic on labels and constants (e.g.,
.+4for fall-through) - Registers: Parsed by name (
r3,sp,zero, etc.) and mapped to byte addresses - Negated symbols: Supports
-.LsymbolforR_SUBLEQ_NEG32relocations
26. Compilation Pipeline Summary
The complete LLVM pass pipeline for Subleq:
1. AtomicExpandPass - Lower atomics to plain loads/stores
2. Standard IR passes - LLVM's optimization pipeline
3. SubleqDAGToDAGISel - Instruction selection (patterns + custom Select())
4. Register allocation - Greedy RA with custom hints
5. SubleqFrameLowering - Prologue/epilogue insertion, CSR batch spill
6. Branch analysis - analyzeBranch / insertBranch / removeBranch
7. SubleqAsmPrinter - Two-pass: DryRun (size calculation) + emit
├── calculateBlockAddresses() - DryRun pass
├── emitFunctionBodyStart() - Function label, peephole reset
├── emitBasicBlockStart() - Block label
├── emitInstruction() - Dispatch to expand* functions
├── emitBasicBlockEnd() - Pending branch resolution
├── emitImmediateConstantPool() - Emit section constants
└── emitEndOfAsmFile() - Final cleanup
8. MC layer - ELF object emission with R_386_32 / R_SUBLEQ_NEG32
Quick Build Reference
Baremetal (standalone binary for the VM)
# Compile C to object
clang -target subleq -c -O2 -fno-builtin -ffreestanding -o prog.o prog.c
# Assemble runtime
llvm-mc -triple=subleq -filetype=obj -o runtime.o runtime/subleq_runtime.s
# Link
ld.lld -T subleq.ld -o prog.elf prog.o runtime.o
# Extract binary + add boot sequence
llvm-objcopy -O binary --only-section=.text prog.elf prog_code.bin
python3 tools/add_boot.py prog_code.bin prog.bin --text-start 4096 --stack-size 8388608
# Run
./vm/vm prog.bin
Linux ELF (userspace application for the Subleq kernel)
# Compile C to ELF executable
# No -target, -fno-builtin, or -ffreestanding needed - use the cross-compiler directly.
# The core Subleq runtime and soft-float libcalls are part of the kernel;
# fixups are handled by the kernel's ELF loader. No explicit runtime link required.
clang --sysroot=/path/to/subleq-sysroot -O2 -o prog prog.c
# Or compile to object and link separately
clang --sysroot=/path/to/subleq-sysroot -c -O2 -o prog.o prog.c
ld.lld --sysroot=/path/to/subleq-sysroot -o prog prog.o -lc