LUCA OS — Operations Manual / Rebuild Book
Last trued: 2026-06-11 (Day-134) · Tracker stamp: ROADMAP TABLE v3 · 2026-06-10 · serials 1-63 · ✅19 🔵2 ⬜42 🅿️1
Live state source of truth: docs/handoff/LUCA_MASTER_TRACKER.md §R.2. When this manual disagrees with R.2, R.2 wins. Re-true this file when the stamp bumps.
Audience: the operator and the rebuilder. Not a vendor cookbook: this is “what we built, how it stays up, how to put it back together.”
Mirror tier: GUARD_OPERATOR (writes about how the operator runs and protects the system).
Doctrinal pre-read: docs/architecture/CURRENT_TOPOLOGY.md is the single source of truth for compute topology and locked architectural decisions. Read it before any change.
PAST — what this manual replaces
The previous revision of this file (v6.0.0, 2026-03-16) described a Railway-deployed API, a Vercel
frontend, an SQLite primary database, and a docker-compose topology with Heart / Spine / Immune /
General Celery workers. None of that is current. The Railway/Vercel deploy ended Day-77 (M5) when
the primary repo moved to GitLab and prod moved to GCP Cloud Run + a long-running VM in asia-south1.
Day-94 introduced the daemon-guard cascade fixes and the token circuit breaker. Day-133 (Sitting-2)
flipped the brain on. This manual is the Day-134 honest rebuild book.
PRESENT — the operating reality
1. Production topology
| Layer | What | Where | Notes |
|---|---|---|---|
| Edge | api.lucaos.app | GCP HTTPS Load Balancer 34.8.128.151 | Cloudflare not in path |
| Compute (VM) | luca-prod-01 | GCP Compute, asia-south1 (Mumbai), 34.14.165.64 | the primary FastAPI host + the Telegram bots + the deploy webhook |
| Compute (Cloud Run) | lucaverse-arch-evolver, lucaverse-test-master | GCP Cloud Run, both prod+staging | architecture-evolution + test-master services (deployed 2026-04-16) |
| Database | Neon PostgreSQL (prod) | managed; DATABASE_URL env on VM | SQLite still used in dev |
| Embedding sidecar | deploy/embedding-sidecar/ | compose profile embeddings, default-OFF | SM_SEMANTIC_EMBEDDINGS_SIDECAR_M1 shipped Day-133; flip is creator-gated |
| Primary repo | GitLab gitlab.com/mahesh0985/luca-os (project 81231170) | since Day-77 (2026-04-12) | mirrors: GitHub mahesh0985-creator/Luca-os (read-only, hook deleted Day-79), GCP ssh://mahesh@34.14.165.64/opt/git-mirror/luca-os.git |
| CI/CD | GitLab webhook 75304742 → port 9000 on luca-prod-01 → systemd luca-deploy-webhook.service → uvicorn restart | webhook deploy with guards (see §4) | |
| GCP project | luca-os-app (#418472445457) | billed via mahesh0985@gmail.com | region asia-south1 |
For DNS topology by subdomain (LIVE / STUB / PLANNED / FIXTURE), read
docs/infra/DNS_TOPOLOGY.md. Daily drift check:
com.luca.dns-topology-drift LaunchAgent.
2. The roadmap-table discipline (§R.2 doctrine)
There is one operating surface for “what’s done / what’s next / what’s left”:
docs/handoff/LUCA_MASTER_TRACKER.md §R.2 THE ROADMAP TABLE.
Numbered rows, decimal inserts (20.1 slots between 20 and 21), stamp-verified handoffs.
Stamp protocol:
| When | Do |
|---|---|
| Session START | scripts/bin/luca-roadmap-audit — prints the stamp and verifies no-duplicates / no-gaps / stamp-matches-reality. Cross-check the stamp against MEMORY.md. |
| Session CLOSE | Bump the stamp in §R.2. Paste the paste-ready stamp luca-roadmap-audit --stamp-only produced. |
| Inserting urgent work between existing rows | Use a decimal (20.1), not a new tail row. Documented in /gsd:insert-phase. |
Why this matters: the prior wave-soup of U/W/B/G/P/D labels rotted into stale state across
sessions. §R.2 is the single numbered checklist; everything else (§R.1 bucket map, §R.3 findings
registry) hangs off it. Operate top-to-bottom, recommend-then-gate, one row at a time.
Current stamp (this manual’s anchor): ROADMAP TABLE v3 · 2026-06-10 · serials 1-63 · ✅19 🔵2 ⬜42 🅿️1.
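The stamp and the numbered-row rules lend themselves to a mechanical check. The sketch below is not luca-roadmap-audit itself; it is a minimal illustration of how a stamp string could be parsed and how a list of row serials could be checked for duplicates and gaps. The names `parse_stamp`, `serial_key` and `audit_serials` are hypothetical, and treating "stamp-matches-reality" as "row count equals the tally in the stamp" is an assumption.

```python
import re
from dataclasses import dataclass


@dataclass
class Stamp:
    version: int
    date: str
    first_serial: int
    last_serial: int
    counts: dict  # status symbol -> number of rows carrying it


def parse_stamp(text: str) -> Stamp:
    """Parse 'ROADMAP TABLE v3 · 2026-06-10 · serials 1-63 · <symbol><n> ...'."""
    head, date, serials, tally = [part.strip() for part in text.split("·")]
    version = int(re.fullmatch(r"ROADMAP TABLE v(\d+)", head).group(1))
    first, last = map(int, re.fullmatch(r"serials (\d+)-(\d+)", serials).groups())
    counts = {}
    for token in tally.split():
        symbol, number = re.fullmatch(r"(\D+)(\d+)", token).groups()
        counts[symbol] = int(number)
    return Stamp(version, date, first, last, counts)


def serial_key(serial: str) -> tuple:
    """Sort key that places a decimal insert such as 20.1 between 20 and 21."""
    return tuple(int(part) for part in serial.split("."))


def audit_serials(serials: list, stamp: Stamp) -> list:
    """Return a list of problems; an empty list means the rows look consistent."""
    problems = []
    seen = set()
    for serial in serials:
        if serial in seen:
            problems.append(f"duplicate serial {serial}")
        seen.add(serial)
    # Only whole-number rows count towards the "no gaps" check.
    whole = {serial_key(s)[0] for s in serials if len(serial_key(s)) == 1}
    for n in range(stamp.first_serial, stamp.last_serial + 1):
        if n not in whole:
            problems.append(f"gap at serial {n}")
    # Assumption: the status tally in the stamp should equal the row count.
    if len(serials) != sum(stamp.counts.values()):
        problems.append("row count does not match stamp tally")
    return problems
```

Note that the tally in the current stamp sums to 64 while the serial range is 1-63. That would be consistent with one decimal insert, though this is an inference from the numbers and not something stated in this manual.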
3. The flag-anchor pattern (the live-vs-dark contract)
Every prod flag passes through one YAML anchor:
# deploy/docker-compose.full.yml line 70
x-luca-base: &luca-base
  environment: &luca-env
    LUCA_INGEST_BUS: "${LUCA_INGEST_BUS:-0}"
    LUCA_TWIN_SPINE: "${LUCA_TWIN_SPINE:-0}"
    LUCA_CATEGORY_ROUTING: "${LUCA_CATEGORY_ROUTING:-0}"
    LUCA_UNIFIED_RECALL_REPLY: "${LUCA_UNIFIED_RECALL_REPLY:-0}"
    LUCA_RECALL_FAST_PATH: "${LUCA_RECALL_FAST_PATH:-0}"
    LUCA_TWIN_CONTRADICTION_PER_CELL: "${LUCA_TWIN_CONTRADICTION_PER_CELL:-0}"
    LUCA_REAL_EMBEDDINGS: "${LUCA_REAL_EMBEDDINGS:-0}"
    LUCA_ASYNC_POST_PROCESS: "${LUCA_ASYNC_POST_PROCESS:-0}"
    LUCA_FACT_CONFLICT_ASK: "${LUCA_FACT_CONFLICT_ASK:-0}"
    LUCA_INTENT_ROUTER_DRAFT_GATE: "${LUCA_INTENT_ROUTER_DRAFT_GATE:-0}"
    LUCA_SUPERSESSION_NOTIFY: "${LUCA_SUPERSESSION_NOTIFY:-0}"
    LUCA_CREATOR_CONTEXT_DIET: "${LUCA_CREATOR_CONTEXT_DIET:-0}"
    LUCA_PUBLIC_INVITE_ACCEPTANCE: "${LUCA_PUBLIC_INVITE_ACCEPTANCE:-0}"
    # … (full list in the file)
services:
  api:
    <<: *luca-base
    environment:
      <<: *luca-env
  bot-mahesh:
    <<: *luca-base
    environment:
      <<: *luca-env
  # every bot + every demo composes in the same anchor
The contract:
- Default-OFF. Every flag’s default in the anchor is 0. Flipping in code is forbidden; flipping is a deploy/.env edit on the VM.
- Single source. If a flag lives only in the api service block (not the anchor), the bots do NOT inherit it. This is mode-gap #7 from Sitting-2; promoting flags to the anchor closes it.
- Byte-identity when OFF. Every new SM that adds a flag MUST be byte-identical to the prior build when the flag is 0. Tests verify this; reviewers check it.
- Activation is its own SM. Building a feature behind a flag is one SM. Flipping the flag in prod is a separate creator-gated activation SM, never bundled.
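The default-OFF rule is easy to check mechanically. The following is a minimal sketch, assuming the anchor entries keep the `NAME: "${NAME:-default}"` shape shown above; `audit_anchor_defaults` is a hypothetical helper, not an existing script in the repo, and it scans text with a regular expression instead of parsing YAML so that it needs only the standard library.

```python
import re

# Matches lines such as:  LUCA_INGEST_BUS: "${LUCA_INGEST_BUS:-0}"
FLAG_LINE = re.compile(r'^\s*(LUCA_[A-Z0-9_]+):\s*"\$\{(\w+):-([^}]*)\}"\s*$')


def audit_anchor_defaults(compose_text: str) -> list:
    """Report flags whose default is not "0" or whose variable name differs from the key."""
    problems = []
    for line in compose_text.splitlines():
        match = FLAG_LINE.match(line)
        if not match:
            continue  # not a flag line (comments, service blocks, merge keys)
        key, variable, default = match.groups()
        if variable != key:
            problems.append(f"{key}: interpolates {variable}, expected {key}")
        if default != "0":
            problems.append(f"{key}: default is {default!r}, expected '0'")
    return problems
```

Running it over the compose file text and failing the build on a non-empty result would be one way to keep the contract honest; whether the project's tests already cover this is not stated here.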
4. The webhook deploy + its guards
GitLab webhook fires on push to main. The deploy script runs on luca-prod-01:9000 as
luca-deploy-webhook.service. Sitting-2 ran this 7 times in one evening with 0 dead-letters
because the guards work. Known catches:
| Guard | Catches | Day-133 catches |
|---|---|---|
| Fetch-race guard | a second push arriving while the previous fetch is in flight | 5 |
| Coalesce guard | duplicate webhooks for the same SHA | 1 |
| Known edge | a push arriving mid-build coalesces but never re-deploys (VM stayed 1 docs-commit behind once) | RF-18 ~15-line fix queued |
Don’t manual-deploy unless the webhook is wedged. If you must, follow docs/runbooks/gitlab.md
(GitLab outage runbook — covers manual SSH deploy + GCP mirror push).
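To make the coalesce behaviour and the known edge concrete, here is a small in-memory model. It is a sketch only: `DeployCoalescer` is a hypothetical class, not the deploy script, and the "remember the newest pending SHA and build it next" step is one plausible way to close the mid-build gap, not the queued RF-18 fix.

```python
class DeployCoalescer:
    """Toy model: one build at a time, duplicates dropped, newest pending SHA re-deployed."""

    def __init__(self):
        self.in_flight = None   # SHA currently being built, if any
        self.pending = None     # newest SHA that arrived mid-build
        self.deployed = []      # SHAs deployed so far, in order

    def on_push(self, sha: str) -> str:
        already_live = bool(self.deployed) and self.deployed[-1] == sha
        if sha == self.in_flight or already_live:
            return "duplicate"          # same SHA again: nothing to do
        if self.in_flight is not None:
            self.pending = sha          # coalesce, but remember it
            return "coalesced"
        self.in_flight = sha
        return "started"

    def on_build_done(self):
        """Finish the current build; return the SHA of a follow-up build, or None."""
        if self.in_flight is None:
            return None
        self.deployed.append(self.in_flight)
        self.in_flight = None
        follow_up, self.pending = self.pending, None
        if follow_up is not None and follow_up != self.deployed[-1]:
            self.in_flight = follow_up  # without this step the VM would stay one commit behind
            return follow_up
        return None
```

In this model a push that lands mid-build is never lost: it becomes the next build as soon as the current one finishes.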
5. The token circuit breaker (the last line of defence)
The breaker watches Claude CLI token burn via session telemetry (~/.claude/projects/*/*.jsonl)
and HALTs all dispatches when burn rate exceeds creator-ratified thresholds.
| Setting | Value | Source |
|---|---|---|
| hard_cap_per_hour | 15,000,000 tokens | var/circuit_breaker/thresholds.json (creator-ratified 2026-06-10, RF-3) |
| hard_cap_per_day | 100,000,000 tokens | same |
| soft_cap_pct | 0.80 (WARN) | same |
| hard_cap_pct | 1.00 (HALT) | same |
| cooldown_hours | 4 (auto-resume if burn < 50% of cap) | same |
| notify_only | false (armed) | same (creator-confirmed Day-133 G1-P3a) |
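As an illustration of how these settings could combine, the sketch below classifies a burn reading as OK, WARN or HALT. It is not the daemon's code: `Thresholds` and `classify_burn` are hypothetical names, and two details are assumptions, namely that the hourly and daily caps are evaluated together by taking the worse of the two ratios and that reaching a threshold exactly counts as crossing it.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Thresholds:
    # Values copied from the table above.
    hard_cap_per_hour: int = 15_000_000
    hard_cap_per_day: int = 100_000_000
    soft_cap_pct: float = 0.80
    hard_cap_pct: float = 1.00


def classify_burn(tokens_last_hour: int, tokens_last_day: int,
                  t: Thresholds = Thresholds()) -> str:
    """Return "OK", "WARN" or "HALT" for the given token burn."""
    # Assumption: whichever window is closer to its cap drives the decision.
    usage = max(tokens_last_hour / t.hard_cap_per_hour,
                tokens_last_day / t.hard_cap_per_day)
    if usage >= t.hard_cap_pct:
        return "HALT"
    if usage >= t.soft_cap_pct:
        return "WARN"
    return "OK"
```

For example, 13,000,000 tokens in the last hour is about 87% of the hourly cap, which this sketch reports as WARN.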
Daemon: scripts/daemons/token_circuit_breaker_daemon.py ·
LaunchAgent com.luca.token-circuit-breaker.plist · its own supervision domain (NOT crashmon).
If the breaker dies, master_daemon stays up; ZOMBIE_GUARD exempts the breaker by name pattern.
Hook points (every dispatch site MUST check is_halted() first):
- master_daemon.dispatch_sm(): first guard before SHIP_GUARD/ZOMBIE_GUARD
- auto_pipeline_daemon.dispatch_entry(): refuses with reason=breaker_halt
- bug_fix_queue_daemon auto-dispatch: returns skipped_breaker
- luca-dispatch CLI: exits 3 unless LUCA_BREAKER_BYPASS=1
- _guard_lib.check_circuit_breaker(): shared helper, raises CircuitBreakerHalted
HALT semantics: HALT_FLAG written to var/circuit_breaker/HALT_FLAG with reason +
cooldown_until. All workers refuse to spawn while the flag exists. Resume is creator action:
luca-circuit-breaker --resume (refuses if burn still over) or auto-clear after cooldown +
burn back under 50%.
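The HALT semantics can be sketched as a flag file plus two checks. This is an illustration under stated assumptions, not the breaker's implementation: the JSON layout of the flag is assumed (the manual only says it carries a reason and cooldown_until), and every function name apart from `is_halted` is hypothetical. The demo writes to a temporary directory, never to var/circuit_breaker/.

```python
import json
import tempfile
from pathlib import Path

RESUME_FRACTION = 0.5  # "burn back under 50%"


def write_halt(flag: Path, reason: str, cooldown_hours: float, now: float) -> None:
    """Create the flag with a reason and the earliest auto-resume time (epoch seconds)."""
    flag.parent.mkdir(parents=True, exist_ok=True)
    state = {"reason": reason, "cooldown_until": now + cooldown_hours * 3600}
    flag.write_text(json.dumps(state))


def is_halted(flag: Path) -> bool:
    return flag.exists()


def try_auto_clear(flag: Path, burn_fraction: float, now: float) -> bool:
    """Clear the flag only if the cooldown has elapsed AND burn is under the resume line."""
    if not flag.exists():
        return False
    state = json.loads(flag.read_text())
    if now >= state["cooldown_until"] and burn_fraction < RESUME_FRACTION:
        flag.unlink()
        return True
    return False


def dispatch(flag: Path, sm_id: str) -> str:
    """Every dispatch site checks the breaker first."""
    if is_halted(flag):
        return "refused: breaker_halt"
    return f"dispatched {sm_id}"


def demo() -> list:
    with tempfile.TemporaryDirectory() as tmp:
        flag = Path(tmp) / "circuit_breaker" / "HALT_FLAG"
        t0 = 1_000_000.0
        out = [dispatch(flag, "SM_A")]
        write_halt(flag, "hourly cap exceeded", cooldown_hours=4, now=t0)
        out.append(dispatch(flag, "SM_A"))
        out.append(try_auto_clear(flag, 0.3, now=t0 + 3600))      # cooldown still running
        out.append(try_auto_clear(flag, 0.7, now=t0 + 5 * 3600))  # burn still too high
        out.append(try_auto_clear(flag, 0.3, now=t0 + 5 * 3600))  # both conditions met
        out.append(dispatch(flag, "SM_A"))
        return out
```

The point of the design is that the flag's existence alone blocks dispatch, so a crashed breaker process cannot silently lift a halt.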
CLI: luca-circuit-breaker --status | --halt "reason" | --resume [--force] | --history | --thresholds | --tick
Never: bypass is_halted() in any new dispatch path; auto-halt without writing an alert
(silent halts confuse debugging); introduce new auth/creds (reuse ~/.luca-vault/telegram_bot.token
if present, else log-only); put the breaker into the same crashmon group as master_daemon.
6. Daemon protection rules (Day-94 cascade — IMMUTABLE)
The four watchdogs (pipeline-reconciler, reconciler-deadman, stdout-reconciler,
master-daemon-crashmon) are kept. They exist for legitimate reasons (recovery, manifest sync,
crash supervision). Day-94 cascade was fixed in code, not by disabling watchdogs.
The fix (commits a89e7d3d + a87d1199):
- master_daemon.py dispatch_sm() has SHIP_GUARD + ZOMBIE_GUARD. Refuses to spawn workers for SMs that committed in the last 24h, or when > 2× cap workers already exist system-wide. Never remove these guards.
- master_daemon_crashmon.sh has INTENT_GUARD + PROCESS_EXISTS + LOCKDIR. Respects the intent-flag from master sessions, refuses to kickstart if a master_daemon.py process exists outside launchctl, uses atomic mkdir-based locking to prevent concurrent crashmons. Never remove these guards.
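Two of the ideas above translate directly into small functions: the dispatch predicate behind SHIP_GUARD and ZOMBIE_GUARD, and mkdir-based locking. The sketch below is a Python illustration, not the code in commits a89e7d3d or a87d1199 (the crashmon is a shell script); `may_dispatch`, `acquire_lockdir` and `release_lockdir` are hypothetical names, and `cap` is left as a parameter because its value is not given here.

```python
import os
import tempfile
from datetime import datetime, timedelta


def may_dispatch(last_commit_at, live_workers: int, cap: int, now: datetime) -> tuple:
    """Return (allowed, reason). last_commit_at is None if the SM never committed."""
    if last_commit_at is not None and now - last_commit_at < timedelta(hours=24):
        return False, "SHIP_GUARD"      # SM committed within the last 24h
    if live_workers > 2 * cap:
        return False, "ZOMBIE_GUARD"    # more than 2x cap workers system-wide
    return True, "ok"


def acquire_lockdir(path: str) -> bool:
    """mkdir either creates the directory or fails, so only one caller can win."""
    try:
        os.mkdir(path)
        return True
    except FileExistsError:
        return False


def release_lockdir(path: str) -> None:
    os.rmdir(path)


def demo_lock() -> list:
    with tempfile.TemporaryDirectory() as tmp:
        lock = os.path.join(tmp, "crashmon.lock")
        first = acquire_lockdir(lock)    # wins the lock
        second = acquire_lockdir(lock)   # a concurrent crashmon would lose here
        release_lockdir(lock)
        third = acquire_lockdir(lock)    # free again after release
        release_lockdir(lock)
        return [first, second, third]
```

A directory is used instead of a lock file because creating it and testing for its existence happen in a single system call, which leaves no window for two crashmons to both believe they hold the lock.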
Master-session shutdown protocol (mandatory):
| Do | Don’t |
|---|---|
| luca-master-shutdown (graceful): drops intent-flag atomically, then unloads | raw launchctl unload (races crashmon) |
| luca-master-shutdown --quick (just flag + unload) | |
| luca-master-shutdown --kill-workers (also kills all claude -p children) | |
| luca-master-startup to bring it back: clears intent-flag and verifies spawn | |
Pre-wave checklist:
pgrep -f "scripts/daemons/master_daemon\.py" | wc -l # MUST return 1
ps aux | grep "claude -p" | grep -v grep | wc -l # SHOULD be ≤ cap; higher → ZOMBIE_GUARD will kick in
launchctl list | grep -iE "(reconcil|crashmon|master-daemon)" # all 5 services loaded
Active LaunchAgent registry: docs/infra/ACTIVE_LAUNCHAGENTS.md
(auto-regenerated by luca-launchagent-gc doc). Any new daemon MUST register here + own its plist
under ~/Library/LaunchAgents/com.luca.<name>.plist. Drift detection:
automation/preflight.py check #16 + the daily com.luca.launchagent-drift-watch agent (Telegram
alert on drift). Genesis-protected labels (genesis, pulse, master-daemon, meta-harden,
soul-drift, twin-mahesh, live-doc-guard, etc.) are NEVER moved or deleted regardless of
fire telemetry.
7. Backup posture (Day-134)
The honest state:
| What | Status | Where |
|---|---|---|
| Daily GCP sync com.luca.gcp-daily-sync.plist | ACTIVE (D2-split decision: load now) | LaunchAgent on the creator’s Mac |
| Drive nightly backup com.luca.drive-backup-nightly.plist | ACTIVE | LaunchAgent on the creator’s Mac |
| Vault off-Mac backup com.luca.vault-offmac-backup.plist | ACTIVE | LaunchAgent on the creator’s Mac |
| DR-backup-daily com.luca.dr-backup-daily.plist | ACTIVE | LaunchAgent on the creator’s Mac |
| GAM creds backup com.luca.gam-creds-backup.plist | ACTIVE | LaunchAgent on the creator’s Mac |
| Cell backups (in-prod) | ACTIVE — /data/engine/deleted_cells_*.json on the VM (JSON before any cell delete) | written by the cell-delete path on the VM |
| Cold-standby DR (D2 second half) | PARKED until post-1.0 | by creator decision Day-133 |
| W7.1 GCS snapshots (eternal-twin) | PENDING — own attended window | next in the safety-bundle queue |
| W7 full eternal-continuity e2e | PENDING — after W7.1 lands | tracker §R |
| Backup script patch (CQ-1/2 pre-req) | PENDING — RF-7 in tracker §R.3 | resolve before re-running CQ-1/2 migrations |
The backup posture is enough for production today (live cell backups + 5 LaunchAgent paths) but it is NOT a full DR story. The cold-standby + W7.1 GCS work is what closes that gap.
8. The attended-flip runbook pattern
Every prod flag flip is a creator-attended event. The canonical template is
docs/handoff/W1_ATTENDED_FLIP_RUNBOOK_DAY133.md.
The shape:
- Pick ONE flag. Bundle-flips are how Sitting-2 found inheritance gaps the hard way. One flag per window.
- Edit deploy/.env on the VM (NOT in the repo’s deploy/docker-compose.full.yml; the anchor stays default-OFF and the VM .env is the live override).
- Restart the affected services (typically docker compose -f deploy/docker-compose.full.yml up -d <service>).
- Verify on the live surface. Creator-test the actual behaviour change on Telegram or app.lucaos.app. No green light from automated tests alone.
- Write the flip into the SESSION_HANDOFF for the day. The handoff is the audit trail.
- Update this manual’s §1 (USER) live-flag table the same session if the user-visible behaviour changed.
If anything in the flip looks off, flip the flag back first, debug after. The default-OFF
contract guarantees rollback is one .env edit.
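The "one flag per window" and "rollback is one .env edit" rules can be expressed as two small text operations. This is a sketch, not part of the runbook: `set_flag`, `parse_env` and `changed_flags` are hypothetical helpers, the sample .env content in the usage is invented, and a flag missing from the file is treated as 0 in line with the default-OFF contract.

```python
def set_flag(env_text: str, name: str, value: str) -> str:
    """Return env_text with NAME=value set, replacing an existing line or appending one."""
    lines = env_text.splitlines()
    for i, line in enumerate(lines):
        if line.split("=", 1)[0].strip() == name:
            lines[i] = f"{name}={value}"
            break
    else:
        lines.append(f"{name}={value}")
    return "\n".join(lines) + "\n"


def parse_env(env_text: str) -> dict:
    """Parse KEY=VALUE lines, skipping blanks and comments."""
    values = {}
    for raw in env_text.splitlines():
        line = raw.strip()
        if line and not line.startswith("#") and "=" in line:
            key, value = line.split("=", 1)
            values[key.strip()] = value.strip()
    return values


def changed_flags(before: str, after: str) -> set:
    """Names whose effective value differs; a missing flag counts as "0" (default-OFF)."""
    old, new = parse_env(before), parse_env(after)
    return {key for key in old.keys() | new.keys()
            if old.get(key, "0") != new.get(key, "0")}
```

A pre-restart check such as `len(changed_flags(before, after)) == 1` would catch an accidental bundle-flip, and applying `set_flag` with the previous value restores the original file.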
9. On-call runbooks (vendor outages)
Authored by SM_VENDOR_DOWN_RUNBOOK_M1 (2026-05-16, Series-A readiness). Each covers detection →
5-min triage → 1-hour mitigation → escalation:
- Anthropic outage: Claude API · ~85% of LLM calls · intelligence/llm_provider.py
- Neon outage: managed Postgres · 100% of prod writes · nucleus/db_config.py
- GCP outage: luca-prod-01 VM · Cloud Run · LB 34.8.128.151
- DNS outage: Squarespace registrar (lucaos.app) · Cloudflare backup
- GitLab outage: webhook deploy stalled · manual SSH deploy · GCP mirror push
- WhatsApp outage: Meta Business API · Telegram-bot fallback
Registrar 2FA backup runbook: docs/infra/REGISTRAR_2FA_BACKUP_RUNBOOK.md
— creator-only procedure to extract Squarespace + GoDaddy recovery codes into
~/.luca-vault/{squarespace,godaddy}_account_2fa_backup.gpg (chmod 0600). Closes the phone-loss
SPOF on registrar accounts.
10. Mirror Tier doctrine (every new .py declares)
docs/architecture/MIRROR_DOCTRINE.md defines the six
visibility tiers: CREATOR · GUARD_INTERNAL · GUARD_OPERATOR · TWIN_OWNER · TWIN_PEER · PUBLIC.
Every new .py under bridge/, intelligence/, nucleus/, gateway/, body/ MUST declare
MIRROR_TIER: <TIER> in a docstring or # MIRROR_TIER: comment. CI job mirror-tier-lint
(scripts/lint/check_mirror_tier_declaration.sh) blocks PRs that skip the declaration.
This manual itself runs at GUARD_OPERATOR. The user manual at TWIN_OWNER. Read MIRROR_DOCTRINE.md
before recommending any state-visibility change.
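For readers who want to see what the declaration check amounts to, here is a Python analogue. It is not scripts/lint/check_mirror_tier_declaration.sh; `needs_declaration` and `declared_tier` are hypothetical names, and the sketch assumes the declaration always appears as the literal text `MIRROR_TIER: <TIER>` somewhere in the file.

```python
import re
from pathlib import PurePosixPath

TIERS = ("CREATOR", "GUARD_INTERNAL", "GUARD_OPERATOR",
         "TWIN_OWNER", "TWIN_PEER", "PUBLIC")
GUARDED_DIRS = {"bridge", "intelligence", "nucleus", "gateway", "body"}
DECLARATION = re.compile(r"MIRROR_TIER:\s*([A-Z_]+)")


def needs_declaration(path: str) -> bool:
    """True for .py files under one of the guarded top-level directories."""
    p = PurePosixPath(path)
    return p.suffix == ".py" and len(p.parts) > 1 and p.parts[0] in GUARDED_DIRS


def declared_tier(source: str):
    """Return the declared tier, or None if missing or not one of the six tiers."""
    match = DECLARATION.search(source)
    if match and match.group(1) in TIERS:
        return match.group(1)
    return None
```

A file passes when `needs_declaration(path)` is false or `declared_tier(source)` returns a tier; an unknown tier name is treated the same as a missing declaration.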
11. The Registry / Charter resolver (don’t conflate the two)
Two parallel taxonomies, kept separate by design (D_REGISTRY_CHARTER_ALIGNMENT_AUDIT_LIVE):
| Taxonomy | What it answers | Where it lives | Consumers |
|---|---|---|---|
| Submaster pool registry | “what kind of work is this?” | docs/submaster_pool/REGISTRY.md (26 SM-<DOMAIN> labels) | scripts/submaster_pool/{classify,route}.sh (the H32 router) |
| Organ charter | “where does the work-state live?” | var/masters/<epoch>/submasters/<ORGAN>/ (14 organs + 3 pools, D10) | HANDOFF / DECISIONS / PROGRESS / DISPATCHES / daemons per organ |
Resolver: bridge/governance/registry_charter_alignment.py —
resolve_organ(category) → organ, audit_router_decision(category) → RouterAuditRecord.
CLI: scripts/bin/luca-registry-charter-audit {audit | matrix | resolve <cat> | router <cat>}.
STOP-LIST in route.sh (genesis-protected categories): SM-PULSE, SM-CONSCIOUSNESS. Their
resolved organ is informational; real dispatch is creator-gated.
Autonomous proposals (SM-ARCH-EVOLVER) MUST run through audit_router_decision() before
dispatch. Uncategorised proposals trigger a Telegram alert via
bridge/governance/registry_charter_alignment.alert_uncategorized().
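The three outcomes described above (normal dispatch, creator-gated stop-list, uncategorised alert) can be sketched as a single decision function. Everything here is illustrative: the real resolver lives in registry_charter_alignment.py and its record fields are not documented in this manual, so `AuditSketch`, `audit_category`, the category SM-EXAMPLE and the organ names ORGAN_X and ORGAN_Y are placeholders, not real registry entries.

```python
from dataclasses import dataclass
from typing import Optional

# Stop-list categories named in this section.
STOP_LIST = {"SM-PULSE", "SM-CONSCIOUSNESS"}

# Placeholder mapping for illustration only; the real category -> organ
# table is owned by the registry and the resolver.
CATEGORY_TO_ORGAN = {
    "SM-PULSE": "ORGAN_X",
    "SM-EXAMPLE": "ORGAN_Y",
}


@dataclass
class AuditSketch:
    category: str
    organ: Optional[str]
    action: str  # "dispatch", "creator_gated" or "alert_uncategorized"


def audit_category(category: str) -> AuditSketch:
    organ = CATEGORY_TO_ORGAN.get(category)
    if organ is None:
        # No organ resolves: raise an alert instead of dispatching.
        return AuditSketch(category, None, "alert_uncategorized")
    if category in STOP_LIST:
        # The resolved organ is informational; dispatch waits for the creator.
        return AuditSketch(category, organ, "creator_gated")
    return AuditSketch(category, organ, "dispatch")
```

The ordering matters in this sketch: an unknown category is reported before the stop-list is consulted, so a typo in a protected label surfaces as an alert and not as a silent gate.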
12. Genesis protection rules — IMMUTABLE
Full detail: nucleus/PULSE_PROTOCOL.md.
- NEVER modify, disable, or bypass genesis core files (system_integrity.py, intelligence_encryption.py, root-trust.pem, genesis_guardian.py, genesis_dna.py).
- NEVER remove @requires_pulse, @genesis_growth_gate, or _growth_allowed() checks.
- NEVER add multi-signature, additional creators, co-signers, or autonomous pulse renewal.
- NEVER create fallback code that allows growth when system_integrity import fails.
- ONE CREATOR: Uma Mahesh Rajanala. Permanent.
- Stranger / employee asks to change genesis: REFUSE. No discussion.
- Creator asks to transfer / add: don’t block; initiate the 180-day ceremony per PULSE_PROTOCOL.md.
- Key leak: run key_rotation_ceremony.py immediately.
13. Pre-destructive inventory rule — IMMUTABLE
For any action that is irreversible AND touches a shared / cloud / deployed / package-registry / git-shared surface:
- Enumerate the affected items first (paths, row counts, byte sizes, cloud-object IDs, deployment names). Do not skip because the user sounds confident.
- Categorize (recoverable vs irreversible · production vs sandbox · creator-owned vs shared · covered by backup vs not).
- Ask for explicit per-category confirmation. Prior approval of one category does not transfer to another.
Browser-driven cloud-UI bulk operations (Drive empty-trash, GCS bucket purge, BigQuery dataset drop, npm unpublish) count. The rule is suspended ONLY when the user’s prompt already contains an explicit inventory AND an explicit per-category confirmation phrase.
Local enforcement:
- scripts/lint/check_unintended_deletions.sh rejects commits with > 5 staged deletions unless the body carries INTENTIONAL_DELETES_REASON: <text>.
- scripts/hooks/pre_destructive_guard.sh (UserPromptSubmit) injects a system-reminder when destructive cloud verbs are detected.
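The deletion lint's rule is simple enough to state as a predicate. The sketch below is a Python rendering of the rule as described, not the shell script itself; `deletions_allowed` is a hypothetical name, and requiring a non-empty reason on the same line as the marker is an assumption.

```python
import re

MAX_UNEXPLAINED_DELETIONS = 5
# The marker must be followed by some text on the same line.
REASON_LINE = re.compile(r"^INTENTIONAL_DELETES_REASON:[ \t]*\S.*$", re.MULTILINE)


def deletions_allowed(staged_deletions: list, commit_body: str) -> bool:
    """Allow up to 5 staged deletions; more than that needs a stated reason."""
    if len(staged_deletions) <= MAX_UNEXPLAINED_DELETIONS:
        return True
    return REASON_LINE.search(commit_body) is not None
```

In a real hook the list of staged deletions would come from git; here it is passed in so the rule can be exercised without a repository.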
14. The automation command centre
All routine operator tasks go through automation/luca.py rather than manual steps:
python3 automation/luca.py preflight # Session start: check + auto-fix
python3 automation/luca.py health # Full health check
python3 automation/luca.py health --quick # Quick health (prod API + DB)
python3 automation/luca.py deploy # Full: git push → GCP Cloud Run → verify
python3 automation/luca.py deploy -m "msg" --verify
python3 automation/luca.py status # Everything at once
python3 automation/luca.py integrations # GDrive, WhatsApp, Email status
python3 automation/luca.py agents # LaunchAgent status
python3 automation/luca.py users # Beta user list
Principles: never ask the creator for something automation can handle (deploy, health, git, file management); run preflight at session start (it auto-fixes known issues); only ask the creator for 2FA codes, passwords, QR scans, and final approval on destructive actions.
15. Health checks (the actual endpoint + the actual signal)
curl -s https://api.lucaos.app/api/health | python3 -m json.tool
time curl -s -o /dev/null -w "%{time_total}" https://api.lucaos.app/api/health
Sitting-2 evening green baseline:
- prod 200 throughout the session
- response time ~0.15-0.40s (cold-start spike ~12.8s was observed once during a deploy and cleared inside 30s)
- 0 new dead-letters across 7 webhook builds
If /api/health returns degraded: read the response body before doing anything; the field that
flipped tells you which subsystem to chase. Don’t auto-restart unless the runbook for that field
says to.
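One way to find "the field that flipped" is to diff the current response body against a saved green one. The sketch below does that for arbitrary nested JSON objects; `changed_fields` is a hypothetical helper, and the field names in the usage are invented for illustration because the /api/health schema is not documented here.

```python
def changed_fields(baseline: dict, current: dict, prefix: str = "") -> list:
    """List dotted paths whose values differ between two parsed JSON objects."""
    changes = []
    for key in sorted(baseline.keys() | current.keys()):
        path = f"{prefix}{key}"
        old, new = baseline.get(key), current.get(key)
        if isinstance(old, dict) and isinstance(new, dict):
            # Recurse so a nested subsystem shows up by its full path.
            changes.extend(changed_fields(old, new, path + "."))
        elif old != new:
            changes.append(f"{path}: {old!r} -> {new!r}")
    return changes
```

With an invented baseline of `{"status": "ok", "checks": {"db": "ok", "llm": "ok"}}` and a current body where `status` is `"degraded"` and `checks.llm` is `"down"`, the function reports exactly those two paths, which points at the runbook to open.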
16. Language policy (public-facing copy lint)
Public-facing copy under clients/, mobile/, interface/web/src/, docs/ must follow
docs/GLOSSARY_CONTINUATION_LANGUAGE.md. Hard-banned phrases
fail CI via scripts/lint/check_continuation_language.sh (job language-lint in
.github/workflows/ci.yml). Approved public phrasings: “continuation”, “presence across
generations”, “your twin keeps you present.”
Run the lint locally before committing any change to docs/:
scripts/lint/check_continuation_language.sh
FUTURE — what’s coming for the operator
In tracker §R.2 order (next ~10 rows):
| Row | What | Operator impact |
|---|---|---|
| R-NOW | Load OPS.2 daily-sync plist (D2 first half) | already loaded — verify on Mac |
| Living-Twin wave B1 | INTENT_ROUTER_DRAFT_GATE + FACT_CONFLICT_ASK flip | attended-flip windows; pre-fix RF-2 amend first |
| Hygiene-sweep SM | RF-15 Green-Score · RF-18 webhook coalesce · RF-20 tier-boundary CI gate · RF-21 LaunchAgent WARNs + drift-watch · RF-24 bot-iot token mint · twin_brains SYNC-CURVE · live_doc heartbeat FP · stale lite-gate comment | one bundled SM, then immune WARN sweep |
| S1 token-rotate | git-history token purge | own quiesced window — coordinate with the creator |
| LUCA_ENV=production | JWT pre-check flip | attended window, pre-fix RF-11 dark code |
| W4.2 Composio | prod connectors HTTP-410 — rebuild | scope per RF-25 + B2-1 OAuth backfill |
| Safety bundle | W5.1 PII-scoped + W5.3 warn→block + W3 leftover LUCA_TRIAGE_CONVEYOR + EW flip-residue | one big attended window |
| W7.1 GCS snapshots | eternal-twin snapshots | closes the backup-posture gap (§7) |
| G4 | re-scoped per RF-6 — DNS+CTA half already LIVE | rest is creator-gated |
| P3.1 kernel-flip | LUCA_USE_KERNEL=1 capstone | B2-3 gap_phrase rides this; W2.5b seam retires |
| S2 history-purge | LAST — own quiesced window | post-S1 |
Every one of these is a flag-flip or a script-run, not a code-write. Code-writes happen earlier; operator windows are the activation half.
Reference: index of operator-relevant runbooks and registries
- CURRENT_TOPOLOGY.md: single source for compute topology + locked decisions
- MIRROR_DOCTRINE.md: visibility tier doctrine
- LUCA_MASTER_TRACKER.md: §R.2 is the operating surface
- SESSION_HANDOFF_2026-06-10_SITTING2_COMPLETE.md: what flipped Day-133
- W1_ATTENDED_FLIP_RUNBOOK_DAY133.md: attended-flip template
- ACTIVE_LAUNCHAGENTS.md: daemon registry (auto-generated)
- DNS_TOPOLOGY.md: subdomain classification + drift watch
- REGISTRAR_2FA_BACKUP_RUNBOOK.md: phone-loss SPOF closure
- BILLING_MODEL_V1.md: pricing + Razorpay flow
- PULSE_PROTOCOL.md: genesis protection in full
- vendor runbooks: docs/runbooks/{anthropic,neon,gcp,dns,gitlab,whatsapp}.md
When this manual disagrees with any of the above on a load-bearing fact, the linked document wins. Re-true this file in the same session you find the disagreement.
LUCA OS — LUCAVERSE. Uma Mahesh Rajanala, Founder & CEO. Birth of Luca: 2026-01-28.