LUCA OS

Audience: guard operator

Operations Manual

How LUCA OS stays up — daemons, deploys, runbooks. The rebuild book if everything burns.

Auto-generated from docs/10-OPERATIONS-MANUAL.md on every deploy.

LUCA OS — Operations Manual / Rebuild Book

Last trued: 2026-06-11 (Day-134)

Tracker stamp: ROADMAP TABLE v3 · 2026-06-10 · serials 1-63 · ✅19 🔵2 ⬜42 🅿️1

Live state source of truth: docs/handoff/LUCA_MASTER_TRACKER.md §R.2. When this manual disagrees with R.2, R.2 wins. Re-true this file when the stamp bumps.

Audience: the operator and the rebuilder. Not a vendor cookbook: this is “what we built, how it stays up, how to put it back together.”

Mirror tier: GUARD_OPERATOR (writes about how the operator runs and protects the system).

Doctrinal pre-read: docs/architecture/CURRENT_TOPOLOGY.md is the single source of truth for compute topology and locked architectural decisions. Read it before any change.


PAST — what this manual replaces

The previous revision of this file (v6.0.0, 2026-03-16) described a Railway-deployed API, a Vercel frontend, an SQLite primary database, and a docker-compose topology with Heart / Spine / Immune / General Celery workers. None of that is current. The Railway/Vercel deploy ended Day-77 (M5) when the primary repo moved to GitLab and prod moved to GCP Cloud Run + a long-running VM in asia-south1. Day-94 introduced the daemon-guard cascade fixes and the token circuit breaker. Day-133 (Sitting-2) flipped the brain on. This manual is the Day-134 honest rebuild book.

PRESENT — the operating reality

1. Production topology

| Layer | What | Where | Notes |
|---|---|---|---|
| Edge | api.lucaos.app | GCP HTTPS Load Balancer 34.8.128.151 | Cloudflare not in path |
| Compute (VM) | luca-prod-01 | GCP Compute, asia-south1 (Mumbai), 34.14.165.64 | the primary FastAPI host + the Telegram bots + the deploy webhook |
| Compute (Cloud Run) | lucaverse-arch-evolver, lucaverse-test-master | GCP Cloud Run, both prod+staging | architecture-evolution + test-master services (deployed 2026-04-16) |
| Database | Neon PostgreSQL (prod) | managed; DATABASE_URL env on VM | SQLite still used in dev |
| Embedding sidecar | deploy/embedding-sidecar/ | compose profile embeddings, default-OFF | SM_SEMANTIC_EMBEDDINGS_SIDECAR_M1 shipped Day-133; flip is creator-gated |
| Primary repo | GitLab gitlab.com/mahesh0985/luca-os (project 81231170) | since Day-77 (2026-04-12) | mirrors: GitHub mahesh0985-creator/Luca-os (read-only, hook deleted Day-79), GCP ssh://mahesh@34.14.165.64/opt/git-mirror/luca-os.git |
| CI/CD | GitLab webhook 75304742 → port 9000 on luca-prod-01 → systemd luca-deploy-webhook.service → uvicorn restart | webhook deploy with guards (see §4) | |
| GCP project | luca-os-app (#418472445457) | billed via mahesh0985@gmail.com | region asia-south1 |

For DNS topology by subdomain (LIVE / STUB / PLANNED / FIXTURE), read docs/infra/DNS_TOPOLOGY.md. Daily drift check: com.luca.dns-topology-drift LaunchAgent.

2. The roadmap-table discipline (§R.2 doctrine)

There is one operating surface for “what’s done / what’s next / what’s left”: docs/handoff/LUCA_MASTER_TRACKER.md §R.2 THE ROADMAP TABLE. Numbered rows, decimal inserts (20.1 slots between 20 and 21), stamp-verified handoffs.

Stamp protocol:

| When | Do |
|---|---|
| Session START | scripts/bin/luca-roadmap-audit: prints the stamp and verifies no-duplicates / no-gaps / stamp-matches-reality. Cross-check the stamp against MEMORY.md. |
| Session CLOSE | Bump the stamp in §R.2. Paste the paste-ready stamp that luca-roadmap-audit --stamp-only produced. |
| Inserting urgent work between existing rows | Use a decimal (20.1), not a new tail row. Documented in /gsd:insert-phase. |

Why this matters: the prior wave-soup of U/W/B/G/P/D labels rotted into stale state across sessions. §R.2 is the single numbered checklist; everything else (§R.1 bucket map, §R.3 findings registry) hangs off it. Operate top-to-bottom, recommend-then-gate, one row at a time.

Current stamp (this manual’s anchor): ROADMAP TABLE v3 · 2026-06-10 · serials 1-63 · ✅19 🔵2 ⬜42 🅿️1.
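The three checks named in the stamp protocol (no duplicates, no gaps, stamp matches reality) can be sketched in a few lines. This is an illustration only, not the code of luca-roadmap-audit: the function names `parse_stamp` and `audit_rows` are hypothetical, and the meaning assigned to the four status counts is left open because the source does not define them.

```python
import re


def parse_stamp(stamp: str) -> dict:
    """Split a roadmap stamp on '·' and pull out its numeric fields."""
    parts = [p.strip() for p in stamp.split("·")]
    if len(parts) != 4:
        raise ValueError(f"expected 4 stamp segments, got {len(parts)}")
    serials = re.fullmatch(r"serials (\d+)-(\d+)", parts[2])
    if serials is None:
        raise ValueError(f"bad serial range: {parts[2]!r}")
    # The last segment carries four emoji-prefixed counts; keep them in order.
    counts = [int(n) for n in re.findall(r"\d+", parts[3])]
    if len(counts) != 4:
        raise ValueError(f"expected 4 status counts, got {len(counts)}")
    return {
        "table": parts[0],
        "date": parts[1],
        "first": int(serials.group(1)),
        "last": int(serials.group(2)),
        "counts": counts,
    }


def audit_rows(row_ids: list, first: int, last: int) -> list:
    """Report duplicate rows, missing whole serials, and orphaned decimals."""
    problems = []
    seen = set()
    for rid in row_ids:
        if rid in seen:
            problems.append(f"duplicate row {rid}")
        seen.add(rid)
    whole = {int(r) for r in seen if "." not in r}
    for n in range(first, last + 1):
        if n not in whole:
            problems.append(f"gap at serial {n}")
    # A decimal insert such as 20.1 must hang off an existing row 20.
    for rid in sorted(seen):
        if "." in rid and int(rid.split(".")[0]) not in whole:
            problems.append(f"decimal row {rid} has no parent row")
    return problems
```

Called on the current stamp, `parse_stamp` yields the serial range 1 to 63 and the four counts, which a session-start script could then compare against the rows actually present in §R.2.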

3. The flag-anchor pattern (the live-vs-dark contract)

Every prod flag passes through one YAML anchor:

# deploy/docker-compose.full.yml line 70
x-luca-base: &luca-base
  environment: &luca-env
    LUCA_INGEST_BUS: "${LUCA_INGEST_BUS:-0}"
    LUCA_TWIN_SPINE: "${LUCA_TWIN_SPINE:-0}"
    LUCA_CATEGORY_ROUTING: "${LUCA_CATEGORY_ROUTING:-0}"
    LUCA_UNIFIED_RECALL_REPLY: "${LUCA_UNIFIED_RECALL_REPLY:-0}"
    LUCA_RECALL_FAST_PATH: "${LUCA_RECALL_FAST_PATH:-0}"
    LUCA_TWIN_CONTRADICTION_PER_CELL: "${LUCA_TWIN_CONTRADICTION_PER_CELL:-0}"
    LUCA_REAL_EMBEDDINGS: "${LUCA_REAL_EMBEDDINGS:-0}"
    LUCA_ASYNC_POST_PROCESS: "${LUCA_ASYNC_POST_PROCESS:-0}"
    LUCA_FACT_CONFLICT_ASK: "${LUCA_FACT_CONFLICT_ASK:-0}"
    LUCA_INTENT_ROUTER_DRAFT_GATE: "${LUCA_INTENT_ROUTER_DRAFT_GATE:-0}"
    LUCA_SUPERSESSION_NOTIFY: "${LUCA_SUPERSESSION_NOTIFY:-0}"
    LUCA_CREATOR_CONTEXT_DIET: "${LUCA_CREATOR_CONTEXT_DIET:-0}"
    LUCA_PUBLIC_INVITE_ACCEPTANCE: "${LUCA_PUBLIC_INVITE_ACCEPTANCE:-0}"
    # … (full list in the file)

services:
  api:
    <<: *luca-base
    environment:
      <<: *luca-env
  bot-mahesh:
    <<: *luca-base
    environment:
      <<: *luca-env
  # every bot + every demo composes in the same anchor

The contract:

  • Default-OFF. Every flag’s default in the anchor is 0. Flipping in code is forbidden; flipping is a deploy/.env edit on the VM.
  • Single source. If a flag lives only in the api service block (not the anchor), the bots do NOT inherit it. This is mode-gap #7 from Sitting-2; promoting flags to the anchor closes it.
  • Byte-identity when OFF. Every new SM that adds a flag MUST be byte-identical to the prior build when the flag is 0. Tests verify this; reviewers check it.
  • Activation is its own SM. Building a feature behind a flag is one SM. Flipping the flag in prod is a separate creator-gated activation SM — never bundled.
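The default-OFF half of the contract lends itself to a mechanical check. The sketch below assumes the anchor's environment block has already been loaded into a plain dict of strings; `anchor_violations` and `resolve` are hypothetical helper names, not part of the repo. The `resolve` function mimics the Compose `${NAME:-default}` rule, where the default applies when the variable is unset or empty.

```python
import re

# Matches the "${NAME:-default}" form used in the anchor.
INTERPOLATION = re.compile(r"^\$\{([A-Za-z_][A-Za-z0-9_]*):-([^}]*)\}$")


def anchor_violations(anchor_env: dict) -> list:
    """Return flags whose anchor entry is not exactly ${SAME_NAME:-0}."""
    bad = []
    for key, raw in anchor_env.items():
        m = INTERPOLATION.match(raw)
        if m is None or m.group(1) != key or m.group(2) != "0":
            bad.append(key)
    return bad


def resolve(anchor_env: dict, overrides: dict) -> dict:
    """Resolve each flag the way ':-' does: default when unset OR empty."""
    resolved = {}
    for key, raw in anchor_env.items():
        m = INTERPOLATION.match(raw)
        if m is None:
            resolved[key] = raw  # literal value, nothing to interpolate
            continue
        name, default = m.groups()
        value = overrides.get(name, "")
        resolved[key] = value if value else default
    return resolved
```

With an empty override dict every flag resolves to "0", which is the state the byte-identity rule is tested against; a single entry in the override dict models one line in deploy/.env on the VM.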

4. The webhook deploy + its guards

GitLab webhook fires on push to main. The deploy script runs on luca-prod-01:9000 as luca-deploy-webhook.service. Sitting-2 ran this 7 times in one evening with 0 dead-letters because the guards work. Known catches:

| Guard | Catches | Day-133 catches |
|---|---|---|
| Fetch-race guard | a second push arriving while the previous fetch is in flight | 5 |
| Coalesce guard | duplicate webhooks for the same SHA | 1 |
| Known edge | a push arriving mid-build coalesces but never re-deploys (VM stayed 1 docs-commit behind once) | RF-18 ~15-line fix queued |

Don’t manual-deploy unless the webhook is wedged. If you must, follow docs/runbooks/gitlab.md (GitLab outage runbook — covers manual SSH deploy + GCP mirror push).
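To make the guard behaviour and the known edge concrete, here is a toy model of webhook coalescing. It is a sketch under assumptions, not the deploy script and not the queued RF-18 fix: the class `DeployCoalescer` and its method names are hypothetical. It drops duplicate SHAs, allows one build at a time, and, unlike the known edge described above, starts a follow-up build when a push arrived mid-build.

```python
from typing import Optional


class DeployCoalescer:
    """One build at a time; duplicates dropped; newest pending SHA kept."""

    def __init__(self) -> None:
        self.building: Optional[str] = None  # SHA currently being built
        self.pending: Optional[str] = None   # newest SHA that arrived mid-build
        self.deployed: Optional[str] = None  # last SHA that finished deploying

    def on_webhook(self, sha: str) -> str:
        # Same SHA already built, building, or queued: nothing to do.
        if sha in (self.building, self.pending, self.deployed):
            return "duplicate"
        # A build is in flight: remember only the newest SHA.
        if self.building is not None:
            self.pending = sha
            return "coalesced"
        self.building = sha
        return "build"

    def on_build_done(self) -> Optional[str]:
        """Finish the current build and return the next SHA to build, if any."""
        self.deployed, self.building = self.building, None
        if self.pending is not None:
            # Without this step the host stays one commit behind.
            self.building, self.pending = self.pending, None
        return self.building
```

The design choice worth noting is that only the newest pending SHA is kept: intermediate pushes are skipped on purpose, since deploying the latest commit supersedes them.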

5. The token circuit breaker (the last line of defence)

The breaker watches Claude CLI token burn via session telemetry (~/.claude/projects/*/*.jsonl) and HALTs all dispatches when burn rate exceeds creator-ratified thresholds.

| Setting | Value | Source |
|---|---|---|
| hard_cap_per_hour | 15,000,000 tokens | var/circuit_breaker/thresholds.json (creator-ratified 2026-06-10, RF-3) |
| hard_cap_per_day | 100,000,000 tokens | same |
| soft_cap_pct | 0.80 (WARN) | same |
| hard_cap_pct | 1.00 (HALT) | same |
| cooldown_hours | 4 (auto-resume if burn < 50% of cap) | same |
| notify_only | false (armed) | same (creator-confirmed Day-133 G1-P3a) |

Daemon: scripts/daemons/token_circuit_breaker_daemon.py · LaunchAgent com.luca.token-circuit-breaker.plist · its own supervision domain (NOT crashmon). If the breaker dies, master_daemon stays up; ZOMBIE_GUARD exempts the breaker by name pattern.
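How the thresholds in the table above translate into a WARN or HALT decision can be sketched as follows. This is an assumption-laden illustration, not the daemon's code: the `Thresholds` dataclass and `classify` function are hypothetical, the sketch takes the worse of the hourly and daily ratios, and it treats reaching a cap exactly as a HALT.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Thresholds:
    # Values copied from the settings table; field names mirror its keys.
    hard_cap_per_hour: int = 15_000_000
    hard_cap_per_day: int = 100_000_000
    soft_cap_pct: float = 0.80
    hard_cap_pct: float = 1.00


def classify(burn_hour: int, burn_day: int, t: Thresholds = Thresholds()) -> str:
    """Return OK, WARN, or HALT for the given token burn."""
    # Judge by whichever window is closer to its cap.
    ratio = max(burn_hour / t.hard_cap_per_hour, burn_day / t.hard_cap_per_day)
    if ratio >= t.hard_cap_pct:
        return "HALT"
    if ratio >= t.soft_cap_pct:
        return "WARN"
    return "OK"
```

For example, 12.5M tokens in the last hour is about 83% of the hourly cap and would classify as WARN, while 120M tokens in a day would classify as HALT regardless of the hourly figure.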

Hook points (every dispatch site MUST check is_halted() first):

  • master_daemon.dispatch_sm() — first guard before SHIP_GUARD / ZOMBIE_GUARD
  • auto_pipeline_daemon.dispatch_entry() — refuses with reason=breaker_halt
  • bug_fix_queue_daemon auto-dispatch — returns skipped_breaker
  • luca-dispatch CLI — exits 3 unless LUCA_BREAKER_BYPASS=1
  • _guard_lib.check_circuit_breaker() — shared helper, raises CircuitBreakerHalted

HALT semantics: HALT_FLAG is written to var/circuit_breaker/HALT_FLAG with reason + cooldown_until. All workers refuse to spawn while the flag exists. Resume is a creator action: luca-circuit-breaker --resume (refuses if burn is still over the cap), or the flag auto-clears once the cooldown has elapsed and burn is back under 50% of the cap.
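The flag-file semantics can be sketched with a small stand-in. The names `is_halted`, `check_circuit_breaker`, and `CircuitBreakerHalted` are borrowed from the hook-point list, but the bodies here are illustrative stand-ins rather than the repo's implementation; `write_halt` and `try_auto_clear` are hypothetical, and storing the flag as JSON with epoch-second timestamps is an assumption.

```python
import json
from pathlib import Path


class CircuitBreakerHalted(RuntimeError):
    """Stand-in for the exception the shared guard helper raises."""


def write_halt(flag: Path, reason: str, cooldown_hours: float, now: float) -> None:
    """Create the flag with a reason and the time the cooldown ends."""
    flag.parent.mkdir(parents=True, exist_ok=True)
    payload = {"reason": reason, "cooldown_until": now + cooldown_hours * 3600}
    flag.write_text(json.dumps(payload))


def is_halted(flag: Path) -> bool:
    # The flag's existence alone means HALT; its contents explain why.
    return flag.exists()


def check_circuit_breaker(flag: Path) -> None:
    """Call first at every dispatch site; raises while the flag exists."""
    if is_halted(flag):
        raise CircuitBreakerHalted(json.loads(flag.read_text())["reason"])


def try_auto_clear(flag: Path, burn_ratio: float, now: float) -> bool:
    """Clear only when the cooldown has elapsed AND burn is under 50% of cap."""
    if not flag.exists():
        return False
    cooldown_until = json.loads(flag.read_text())["cooldown_until"]
    if now >= cooldown_until and burn_ratio < 0.50:
        flag.unlink()
        return True
    return False
```

Both conditions must hold for the auto-clear: an elapsed cooldown with burn still at 70% of the cap leaves the flag in place, which matches the "refuses if burn is still over" behaviour of the manual resume.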

CLI: luca-circuit-breaker --status | --halt "reason" | --resume [--force] | --history | --thresholds | --tick

Never:

  • bypass is_halted() in any new dispatch path;
  • auto-halt without writing an alert (silent halts confuse debugging);
  • introduce new auth/creds (reuse ~/.luca-vault/telegram_bot.token if present, else log-only);
  • put the breaker into the same crashmon group as master_daemon.

6. Daemon protection rules (Day-94 cascade — IMMUTABLE)

The four watchdogs (pipeline-reconciler, reconciler-deadman, stdout-reconciler, master-daemon-crashmon) are kept. They exist for legitimate reasons (recovery, manifest sync, crash supervision). The Day-94 cascade was fixed in code, not by disabling watchdogs.

The fix (commits a89e7d3d + a87d1199):

  1. master_daemon.py dispatch_sm() has SHIP_GUARD + ZOMBIE_GUARD. Refuses to spawn workers for SMs that committed in the last 24h, or when > 2× cap workers already exist system-wide. Never remove these guards.
  2. master_daemon_crashmon.sh has INTENT_GUARD + PROCESS_EXISTS + LOCKDIR. Respects the intent-flag from master sessions, refuses to kickstart if a master_daemon.py process exists outside launchctl, uses atomic mkdir-based locking to prevent concurrent crashmons. Never remove these guards.
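The two dispatch guards and the mkdir-based lock described above can be sketched as follows. This is a minimal illustration, not the code in master_daemon.py or master_daemon_crashmon.sh: `dispatch_refusal` and `lockdir` are hypothetical names, and the crashmon itself is a shell script, so the Python lock only demonstrates the same atomic-mkdir idea.

```python
import os
from contextlib import contextmanager
from typing import Optional

SHIP_WINDOW_SECONDS = 24 * 3600  # "committed in the last 24h"


def dispatch_refusal(last_commit_age_s: Optional[float],
                     live_workers: int, cap: int) -> Optional[str]:
    """Return the name of the guard that refuses the dispatch, or None."""
    if last_commit_age_s is not None and last_commit_age_s < SHIP_WINDOW_SECONDS:
        return "SHIP_GUARD"
    if live_workers > 2 * cap:
        return "ZOMBIE_GUARD"
    return None


@contextmanager
def lockdir(path: str):
    """Yield True if this caller took the lock, False if it is already held."""
    try:
        os.mkdir(path)  # atomic: of several concurrent callers, one succeeds
    except FileExistsError:
        yield False
        return
    try:
        yield True
    finally:
        os.rmdir(path)  # release only the lock we created
```

A directory is used instead of a lock file because creating a directory either succeeds or fails in a single step, so two concurrent crashmons cannot both believe they hold the lock.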

Master-session shutdown protocol (mandatory):

| Do | Don’t |
|---|---|
| luca-master-shutdown (graceful): drops intent-flag atomically, then unloads | raw launchctl unload (races crashmon) |
| luca-master-shutdown --quick (just flag + unload) | |
| luca-master-shutdown --kill-workers (also kills all claude -p children) | |
| luca-master-startup to bring it back: clears intent-flag and verifies spawn | |

Pre-wave checklist:

pgrep -f "scripts/daemons/master_daemon\.py" | wc -l   # MUST return 1
ps aux | grep "claude -p" | grep -v grep | wc -l       # SHOULD be ≤ cap; higher → ZOMBIE_GUARD will kick in
launchctl list | grep -iE "(reconcil|crashmon|master-daemon)"  # all 5 services loaded

Active LaunchAgent registry: docs/infra/ACTIVE_LAUNCHAGENTS.md (auto-regenerated by luca-launchagent-gc doc). Any new daemon MUST register here + own its plist under ~/Library/LaunchAgents/com.luca.<name>.plist. Drift detection: automation/preflight.py check #16 + the daily com.luca.launchagent-drift-watch agent (Telegram alert on drift). Genesis-protected labels (genesis, pulse, master-daemon, meta-harden, soul-drift, twin-mahesh, live-doc-guard, etc.) are NEVER moved or deleted regardless of fire telemetry.

7. Backup posture (Day-134)

The honest state:

| What | Status | Where |
|---|---|---|
| Daily GCP sync com.luca.gcp-daily-sync.plist | ACTIVE (D2-split decision: load now) | LaunchAgent on the creator’s Mac |
| Drive nightly backup com.luca.drive-backup-nightly.plist | ACTIVE | LaunchAgent on the creator’s Mac |
| Vault off-Mac backup com.luca.vault-offmac-backup.plist | ACTIVE | LaunchAgent on the creator’s Mac |
| DR-backup-daily com.luca.dr-backup-daily.plist | ACTIVE | LaunchAgent on the creator’s Mac |
| GAM creds backup com.luca.gam-creds-backup.plist | ACTIVE | LaunchAgent on the creator’s Mac |
| Cell backups (in-prod) | ACTIVE: /data/engine/deleted_cells_*.json on the VM (JSON before any cell delete) | written by the cell-delete path on the VM |
| Cold-standby DR (D2 second half) | PARKED until post-1.0 | by creator decision Day-133 |
| W7.1 GCS snapshots (eternal-twin) | PENDING: own attended window | next in the safety-bundle queue |
| W7 full eternal-continuity e2e | PENDING: after W7.1 lands | tracker §R |
| Backup script patch (CQ-1/2 pre-req) | PENDING: RF-7 in tracker §R.3 | resolve before re-running CQ-1/2 migrations |

The backup posture is enough for production today (live cell backups + 5 LaunchAgent paths) but it is NOT a full DR story. The cold-standby + W7.1 GCS work is what closes that gap.

8. The attended-flip runbook pattern

Every prod flag flip is a creator-attended event. The canonical template is docs/handoff/W1_ATTENDED_FLIP_RUNBOOK_DAY133.md. The shape:

  1. Pick ONE flag. Bundle-flips are how Sitting-2 found inheritance gaps the hard way. One flag per window.
  2. Edit deploy/.env on the VM (NOT in the repo’s deploy/docker-compose.full.yml — the anchor stays default-OFF; the VM .env is the live override).
  3. Restart the affected services (typically docker compose -f deploy/docker-compose.full.yml up -d <service>).
  4. Verify on the live surface. Creator-test the actual behaviour change on Telegram or app.lucaos.app. No green light from automated tests alone.
  5. Write the flip into the SESSION_HANDOFF for the day. The handoff is the audit trail.
  6. Update this manual’s §1 (USER) live-flag table the same session if the user-visible behaviour changed.

If anything in the flip looks off, flip the flag back first, debug after. The default-OFF contract guarantees rollback is one .env edit.
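The "one flag per window, rollback is one .env edit" property can be sketched as a pure text transformation. This assumes deploy/.env uses plain NAME=value lines and that flags take only the values 0 and 1; `set_flag` is a hypothetical helper, not an existing tool in the repo.

```python
def set_flag(env_text: str, name: str, value: str) -> str:
    """Return env_text with NAME=value set, touching exactly one flag."""
    if value not in ("0", "1"):
        raise ValueError("flags are assumed to be 0 or 1")
    lines = env_text.splitlines()
    for i, line in enumerate(lines):
        # Compare the key part only; commented-out lines do not match.
        if line.split("=", 1)[0].strip() == name:
            lines[i] = f"{name}={value}"
            break
    else:
        lines.append(f"{name}={value}")  # flag not present yet: add it
    return "\n".join(lines) + "\n"
```

Because the function changes a single line, applying it with "1" and then with "0" returns the original text, which is the rollback the default-OFF contract promises.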

9. On-call runbooks (vendor outages)

Authored by SM_VENDOR_DOWN_RUNBOOK_M1 (2026-05-16, Series-A readiness). Each covers detection → 5-min triage → 1-hour mitigation → escalation:

  • Anthropic outage — Claude API · ~85% of LLM calls · intelligence/llm_provider.py
  • Neon outage — managed Postgres · 100% of prod writes · nucleus/db_config.py
  • GCP outage — luca-prod-01 VM · Cloud Run · LB 34.8.128.151
  • DNS outage — Squarespace registrar (lucaos.app) · Cloudflare backup
  • GitLab outage — webhook deploy stalled · manual SSH deploy · GCP mirror push
  • WhatsApp outage — Meta Business API · Telegram-bot fallback

Registrar 2FA backup runbook: docs/infra/REGISTRAR_2FA_BACKUP_RUNBOOK.md — creator-only procedure to extract Squarespace + GoDaddy recovery codes into ~/.luca-vault/{squarespace,godaddy}_account_2fa_backup.gpg (chmod 0600). Closes the phone-loss SPOF on registrar accounts.

10. Mirror Tier doctrine (every new .py declares)

docs/architecture/MIRROR_DOCTRINE.md defines the six visibility tiers: CREATOR · GUARD_INTERNAL · GUARD_OPERATOR · TWIN_OWNER · TWIN_PEER · PUBLIC.

Every new .py under bridge/, intelligence/, nucleus/, gateway/, body/ MUST declare MIRROR_TIER: <TIER> in a docstring or # MIRROR_TIER: comment. CI job mirror-tier-lint (scripts/lint/check_mirror_tier_declaration.sh) blocks PRs that skip the declaration.
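The declaration rule is simple enough to sketch as a regular-expression check. The real lint is the shell script named above; the Python below is only an illustration of the rule, and `declared_tier` is a hypothetical helper that accepts the declaration in either a docstring or a comment.

```python
import re
from typing import Optional

# The six tiers listed in MIRROR_DOCTRINE.md.
TIERS = ("CREATOR", "GUARD_INTERNAL", "GUARD_OPERATOR",
         "TWIN_OWNER", "TWIN_PEER", "PUBLIC")

DECLARATION = re.compile(r"MIRROR_TIER:\s*(" + "|".join(TIERS) + r")\b")


def declared_tier(source: str) -> Optional[str]:
    """Return the declared tier, or None if missing or not a known tier."""
    m = DECLARATION.search(source)
    return m.group(1) if m else None
```

A lint built on this would fail any new file for which `declared_tier` returns None, which covers both a missing declaration and a misspelled tier name.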

This manual itself runs at GUARD_OPERATOR. The user manual at TWIN_OWNER. Read MIRROR_DOCTRINE.md before recommending any state-visibility change.

11. The Registry / Charter resolver (don’t conflate the two)

Two parallel taxonomies, kept separate by design (D_REGISTRY_CHARTER_ALIGNMENT_AUDIT_LIVE):

| Taxonomy | What it answers | Where it lives | Consumers |
|---|---|---|---|
| Submaster pool registry | “what kind of work is this?” | docs/submaster_pool/REGISTRY.md (26 SM-<DOMAIN> labels) | scripts/submaster_pool/{classify,route}.sh (the H32 router) |
| Organ charter | “where does the work-state live?” | var/masters/<epoch>/submasters/<ORGAN>/ (14 organs + 3 pools, D10) | HANDOFF / DECISIONS / PROGRESS / DISPATCHES / daemons per organ |

Resolver: bridge/governance/registry_charter_alignment.py, which exposes resolve_organ(category) → organ and audit_router_decision(category) → RouterAuditRecord. CLI: scripts/bin/luca-registry-charter-audit {audit | matrix | resolve <cat> | router <cat>}.

STOP-LIST in route.sh (genesis-protected categories): SM-PULSE, SM-CONSCIOUSNESS. Their resolved organ is informational; real dispatch is creator-gated.

Autonomous proposals (SM-ARCH-EVOLVER) MUST run through audit_router_decision() before dispatch. Uncategorised proposals trigger a Telegram alert via bridge/governance/registry_charter_alignment.alert_uncategorized().
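The shape of an audit-before-dispatch check might look like the sketch below. Everything except the two STOP-LIST categories is hypothetical: the mapping entries, the organ names, the `AuditSketch` record, and the `audit` function are placeholders, and the fields of the real RouterAuditRecord are not documented here.

```python
from dataclasses import dataclass
from typing import Optional

# Genesis-protected categories named in the STOP-LIST.
STOP_LIST = {"SM-PULSE", "SM-CONSCIOUSNESS"}

# Placeholder mapping; the real one lives in the resolver module.
CATEGORY_TO_ORGAN = {
    "SM-EXAMPLE": "EXAMPLE_ORGAN",
    "SM-PULSE": "EXAMPLE_GENESIS_ORGAN",
}


@dataclass(frozen=True)
class AuditSketch:
    category: str
    organ: Optional[str]   # None when the category resolves to nothing
    creator_gated: bool    # True for STOP-LIST categories
    uncategorized: bool    # True should trigger an alert, not a dispatch


def audit(category: str) -> AuditSketch:
    organ = CATEGORY_TO_ORGAN.get(category)
    return AuditSketch(category, organ, category in STOP_LIST, organ is None)


def may_auto_dispatch(record: AuditSketch) -> bool:
    """Autonomous dispatch needs a known organ and a non-protected category."""
    return not record.uncategorized and not record.creator_gated
```

The point of the sketch is the ordering: the audit record is produced first, and dispatch is a separate decision made from it, so a STOP-LIST category can still resolve to an organ for information while never dispatching on its own.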

12. Genesis protection rules — IMMUTABLE

Full detail: nucleus/PULSE_PROTOCOL.md.

  • NEVER modify, disable, or bypass genesis core files (system_integrity.py, intelligence_encryption.py, root-trust.pem, genesis_guardian.py, genesis_dna.py).
  • NEVER remove @requires_pulse, @genesis_growth_gate, or _growth_allowed() checks.
  • NEVER add multi-signature, additional creators, co-signers, or autonomous pulse renewal.
  • NEVER create fallback code that allows growth when system_integrity import fails.
  • ONE CREATOR: Uma Mahesh Rajanala. Permanent.
  • Stranger / employee asks to change genesis: REFUSE. No discussion.
  • Creator asks to transfer / add: don’t block; initiate the 180-day ceremony per PULSE_PROTOCOL.md.
  • Key leak: run key_rotation_ceremony.py immediately.

13. Pre-destructive inventory rule — IMMUTABLE

For any action that is irreversible AND touches a shared / cloud / deployed / package-registry / git-shared surface:

  1. Enumerate the affected items first (paths, row counts, byte sizes, cloud-object IDs, deployment names). Do not skip because the user sounds confident.
  2. Categorize (recoverable vs irreversible · production vs sandbox · creator-owned vs shared · covered by backup vs not).
  3. Ask for explicit per-category confirmation. Prior approval of one category does not transfer to another.

Browser-driven cloud-UI bulk operations (Drive empty-trash, GCS bucket purge, BigQuery dataset drop, npm unpublish) count. The rule is suspended ONLY when the user’s prompt already contains an explicit inventory AND an explicit per-category confirmation phrase.

Local enforcement:

  • scripts/lint/check_unintended_deletions.sh rejects commits with > 5 staged deletions unless the body carries INTENTIONAL_DELETES_REASON: <text>.
  • scripts/hooks/pre_destructive_guard.sh (UserPromptSubmit) injects a system-reminder when destructive cloud verbs are detected.
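The staged-deletion rule can be sketched as a small predicate. The actual check is the shell script above; this Python version is an illustration that assumes its input is the tab-separated output of `git diff --cached --name-status`, and `deletions_allowed` is a hypothetical name.

```python
import re

MAX_UNEXPLAINED_DELETIONS = 5

# The reason line must carry some non-blank text on the same line.
REASON = re.compile(r"^INTENTIONAL_DELETES_REASON:[ \t]*\S", re.MULTILINE)


def deletions_allowed(name_status_lines: list, commit_body: str) -> bool:
    """Allow up to 5 staged deletions; more require a stated reason."""
    deleted = [ln for ln in name_status_lines if ln.split("\t", 1)[0] == "D"]
    if len(deleted) <= MAX_UNEXPLAINED_DELETIONS:
        return True
    return REASON.search(commit_body) is not None
```

Only lines whose status column is exactly `D` are counted, so modifications and renames do not push a commit over the limit.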

14. The automation command centre

All routine operator tasks go through automation/luca.py rather than manual steps:

python3 automation/luca.py preflight     # Session start: check + auto-fix
python3 automation/luca.py health        # Full health check
python3 automation/luca.py health --quick # Quick health (prod API + DB)
python3 automation/luca.py deploy        # Full: git push → GCP Cloud Run → verify
python3 automation/luca.py deploy -m "msg" --verify
python3 automation/luca.py status        # Everything at once
python3 automation/luca.py integrations  # GDrive, WhatsApp, Email status
python3 automation/luca.py agents        # LaunchAgent status
python3 automation/luca.py users         # Beta user list

Principles: never ask the creator for something automation can handle (deploy, health, git, file management); run preflight at session start (it auto-fixes known issues); only ask the creator for 2FA codes, passwords, QR scans, and final approval on destructive actions.

15. Health checks (the actual endpoint + the actual signal)

curl -s https://api.lucaos.app/api/health | python3 -m json.tool
time curl -s -o /dev/null -w "%{time_total}" https://api.lucaos.app/api/health

Sitting-2 evening green baseline:

  • prod 200 throughout the session
  • response time ~0.15-0.40s (cold-start spike ~12.8s was observed once during a deploy and cleared inside 30s)
  • 0 new dead-letters across 7 webhook builds

If /api/health returns degraded: read the response body before doing anything; the field that flipped tells you which subsystem to chase. Don’t auto-restart unless the runbook for that field says to.
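Finding "the field that flipped" amounts to diffing the current response against a known-green one. The sketch below assumes the health body is a flat JSON object; the field names used in the example are invented for illustration, and `fetch_health` and `flipped_fields` are hypothetical helpers.

```python
import json
from urllib.request import urlopen

HEALTH_URL = "https://api.lucaos.app/api/health"


def fetch_health(url: str = HEALTH_URL, timeout: float = 10.0) -> dict:
    """Fetch and parse the health endpoint's JSON body."""
    with urlopen(url, timeout=timeout) as resp:
        return json.load(resp)


def flipped_fields(baseline: dict, current: dict) -> dict:
    """Map each changed key to its (baseline, current) pair of values."""
    keys = sorted(set(baseline) | set(current))
    return {
        k: (baseline.get(k), current.get(k))
        for k in keys
        if baseline.get(k) != current.get(k)
    }


if __name__ == "__main__":
    # Invented field names, for illustration only.
    green = {"status": "ok", "db": "ok", "llm": "ok"}
    now = {"status": "degraded", "db": "ok", "llm": "down"}
    print(flipped_fields(green, now))
```

Saving one green response per session gives the baseline; the diff then points at the subsystem whose runbook to open, before any restart is considered.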

16. Language policy (public-facing copy lint)

Public-facing copy under clients/, mobile/, interface/web/src/, docs/ must follow docs/GLOSSARY_CONTINUATION_LANGUAGE.md. Hard-banned phrases fail CI via scripts/lint/check_continuation_language.sh (job language-lint in .github/workflows/ci.yml). Approved public phrasings: “continuation”, “presence across generations”, “your twin keeps you present.”

Run the lint locally before committing any change to docs/:

scripts/lint/check_continuation_language.sh

FUTURE — what’s coming for the operator

In tracker §R.2 order (next ~10 rows):

| Row | What | Operator impact |
|---|---|---|
| R-NOW | Load OPS.2 daily-sync plist (D2 first half) | already loaded; verify on Mac |
| Living-Twin wave B1 | INTENT_ROUTER_DRAFT_GATE + FACT_CONFLICT_ASK flip | attended-flip windows; pre-fix RF-2 amend first |
| Hygiene-sweep SM | RF-15 Green-Score · RF-18 webhook coalesce · RF-20 tier-boundary CI gate · RF-21 LaunchAgent WARNs + drift-watch · RF-24 bot-iot token mint · twin_brains SYNC-CURVE · live_doc heartbeat FP · stale lite-gate comment | one bundled SM, then immune WARN sweep |
| S1 token-rotate | git-history token purge | own quiesced window; coordinate with the creator |
| LUCA_ENV=production | JWT pre-check flip | attended window, pre-fix RF-11 dark code |
| W4.2 Composio | prod connectors HTTP-410: rebuild | scope per RF-25 + B2-1 OAuth backfill |
| Safety bundle | W5.1 PII-scoped + W5.3 warn→block + W3 leftover LUCA_TRIAGE_CONVEYOR + EW flip-residue | one big attended window |
| W7.1 GCS snapshots | eternal-twin snapshots | closes the backup-posture gap (§7) |
| G4 | re-scoped per RF-6; DNS+CTA half already LIVE | rest is creator-gated |
| P3.1 kernel-flip | LUCA_USE_KERNEL=1 capstone | B2-3 gap_phrase rides this; W2.5b seam retires |
| S2 history-purge | LAST: own quiesced window | post-S1 |

Every one of these is a flag-flip or a script-run, not a code-write. Code-writes happen earlier; operator windows are the activation half.


Reference: index of operator-relevant runbooks and registries

  • CURRENT_TOPOLOGY.md — single source for compute topology + locked decisions
  • MIRROR_DOCTRINE.md — visibility tier doctrine
  • LUCA_MASTER_TRACKER.md — §R.2 is the operating surface
  • SESSION_HANDOFF_2026-06-10_SITTING2_COMPLETE.md — what flipped Day-133
  • W1_ATTENDED_FLIP_RUNBOOK_DAY133.md — attended-flip template
  • ACTIVE_LAUNCHAGENTS.md — daemon registry (auto-generated)
  • DNS_TOPOLOGY.md — subdomain classification + drift watch
  • REGISTRAR_2FA_BACKUP_RUNBOOK.md — phone-loss SPOF closure
  • BILLING_MODEL_V1.md — pricing + Razorpay flow
  • PULSE_PROTOCOL.md — genesis protection in full
  • vendor runbooks: docs/runbooks/{anthropic,neon,gcp,dns,gitlab,whatsapp}.md

When this manual disagrees with any of the above on a load-bearing fact, the linked document wins. Re-true this file in the same session you find the disagreement.

LUCA OS — LUCAVERSE. Uma Mahesh Rajanala, Founder & CEO. Birth of Luca: 2026-01-28.