<?xml version="1.0" encoding="utf-8"?>
<feed xml:lang="en-us" xmlns="http://www.w3.org/2005/Atom"><title>Micah Stubbs's Weblog</title><link href="https://micahstubbs.ai/" rel="alternate"/><link href="https://micahstubbs.ai/atom/everything/" rel="self"/><id>https://micahstubbs.ai/</id><updated>2026-02-25T13:11:17.688733+00:00</updated><author><name>Micah Stubbs</name></author><entry><title>Cleaning up taskmaster's terminal output</title><link href="https://simonwillison.net/2026/Feb/25/cleaning-up-taskmasters-terminal-output/#atom-everything" rel="alternate"/><published>2026-02-25T13:11:17.688733+00:00</published><updated>2026-02-25T13:11:17.688733+00:00</updated><id>https://simonwillison.net/2026/Feb/25/cleaning-up-taskmasters-terminal-output/#atom-everything</id><summary type="html">
    # Cleaning up taskmaster's terminal output

**2026-02-25**

I forked [taskmaster](https://github.com/micahstubbs/taskmaster) recently to stop Claude from quitting early when working in a Claude Code session. The stop [hook](https://github.com/micahstubbs/taskmaster/blob/main/check-completion.sh) fires every time Claude tries to stop and blocks it until Claude emits an explicit `TASKMASTER_DONE::&lt;session_id&gt;` token — a parseable signal that confirms Claude is actually finished.

It works. The terminal output, though, was way too much.

#### The problem

Every time the hook blocked a stop attempt, Claude Code dumped the full completion checklist into the terminal:

```
  Ran 9 stop hooks (ctrl+o to expand)
    ⎿  Stop hook error: TASKMASTER (1/100): Verify that
        all work is truly complete before stopping.

    Before stopping, do each of these checks:

    1. RE-READ THE ORIGINAL USER MESSAGE(S). List every discrete request or acceptance criterion. For each one, confirm it is fully addressed — not just started, FULLY done. If the user explicitly changed their mind, withdrew a request, or told you to stop or skip something, treat that item as resolved and do NOT continue working on it.

    2. CHECK THE TASK LIST. Review every task. Any task not marked completed? Do it now — unless the user indicated it is no longer wanted.

    3. CHECK THE PLAN. Walk through each step. Any step skipped or partially done? Finish it — unless the user redirected or deprioritized it.

    4. CHECK FOR ERRORS. Did any tool call, build, test, or lint fail? Fix it.

    5. CHECK FOR LOOSE ENDS. Any TODO comments, placeholder code, missing tests, or follow-ups noted but not acted on?

    IMPORTANT: The user's latest instructions always take priority. If the user said to stop, move on, or skip something, respect that — do not force completion of work the user no longer wants.

    If after this review everything is genuinely 100% done (or explicitly deprioritized by the user), briefly confirm completion for each user request. Otherwise, immediately continue working on whatever remains — do not just describe what is left, ACTUALLY DO IT.
```

Many lines, every time, accumulating across a long session. The checklist is instructions _for the AI_ — I never needed to read it.

#### How the `reason` field works

Claude Code stop hooks return JSON when they want to block a stop:

```json
{ "decision": "block", "reason": "..." }
```

The `reason` field does two things at once:

1. **User-visible output** — shown in the terminal as a "Stop hook error"
2. **AI context** — injected back into the conversation so that Claude knows what to do next
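
A stop hook is just an executable that prints this JSON to stdout. A minimal Python sketch of building the block response (hypothetical helper name, placeholder reason string):

```python
import json
import sys

def build_block_response(reason):
    # A stop hook blocks the stop attempt by printing this JSON to stdout.
    return json.dumps({"decision": "block", "reason": reason})

if __name__ == "__main__":
    # Placeholder reason; the real hook builds it from the session id.
    sys.stdout.write(build_block_response("TASKMASTER_DONE::abc123") + "\n")
```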

Before, taskmaster put the full checklist in `reason` to make sure Claude received the instructions. But that also meant printing the full checklist to my terminal on every single stop attempt.

#### What I was missing

Claude already has the checklist from the taskmaster [skill file](https://github.com/micahstubbs/taskmaster/blob/main/SKILL.md). Every Claude Code `SKILL.md` file loads into system context at session start. Claude doesn't need instructions repeated in the hook reason — it just needs to know the specific token to emit.

So I stripped the reason down to exactly that:

```bash
DONE_SIGNAL="${DONE_PREFIX}::${SESSION_ID}"

jq -n --arg reason "$DONE_SIGNAL" '{ decision: "block", reason: $reason }'
```

Now the terminal shows one collapsed line:

```
● Ran N stop hooks (ctrl+o to expand)
  ⎿  Stop hook error: TASKMASTER_DONE::abc123xyz
```

Claude sees the signal it needs. I see almost nothing. Both of us get what we need from the same field.

#### Faster signal detection too

While I was in there I also changed how the hook detects the done signal. The old version opened the transcript file and scanned potentially hundreds of lines of JSON on every stop attempt.

The Claude Code hook API passes `last_assistant_message` directly in the hook's input JSON. Checking that first skips the file read in the common case:

```bash
LAST_MSG=$(echo "$INPUT" | jq -r '.last_assistant_message // ""')
if [ -n "$LAST_MSG" ] &amp;&amp; echo "$LAST_MSG" | grep -Fq "$DONE_SIGNAL" 2&gt;/dev/null; then
  HAS_DONE_SIGNAL=true
fi

# Only scan the transcript if the message check didn't match
if [ "$HAS_DONE_SIGNAL" = false ] &amp;&amp; [ -f "$TRANSCRIPT" ]; then
  if tail -400 "$TRANSCRIPT" 2&gt;/dev/null | grep -Fq "$DONE_SIGNAL"; then
    HAS_DONE_SIGNAL=true
  fi
fi
```

When Claude has just emitted the done signal in its last message — the normal case — no transcript parsing happens.

#### The lesson

Hook reasons and system context have different jobs. System context (skill files, `CLAUDE.md`) carries persistent instructions that shape behavior across a whole session. Hook reasons carry transient, stop-specific information — the minimum Claude needs right now.

Here that's: "emit `TASKMASTER_DONE::abc123` and you're done."

The checklist still runs. The skill enforcement is unchanged. It just doesn't output the skill prompt to my terminal anymore.

These changes shipped as [v2.3.0](https://github.com/micahstubbs/taskmaster/releases/tag/v2.3.0).

Read more about how stop decision control and the `reason` field work in the [Claude Code Hooks docs](https://code.claude.com/docs/en/hooks#stop-decision-control).

    
        &lt;p&gt;Tags: &lt;a href="https://micahstubbs.net/tags/claude-code"&gt;claude-code&lt;/a&gt;, &lt;a href="https://micahstubbs.net/tags/shell-scripting"&gt;shell-scripting&lt;/a&gt;, &lt;a href="https://micahstubbs.net/tags/developer-tools"&gt;developer-tools&lt;/a&gt;, &lt;a href="https://micahstubbs.net/tags/ai-agents"&gt;ai-agents&lt;/a&gt;, &lt;a href="https://micahstubbs.net/tags/cli-ux"&gt;cli-ux&lt;/a&gt;, &lt;a href="https://micahstubbs.net/tags/hook-design"&gt;hook-design&lt;/a&gt;&lt;/p&gt;
    

</summary><category term="claude-code"/><category term="shell-scripting"/><category term="developer-tools"/><category term="ai-agents"/><category term="cli-ux"/><category term="hook-design"/></entry><entry><title>Building a functional consciousness eval suite for LLMs</title><link href="https://simonwillison.net/2026/Feb/8/building-a-functional-consciousness-eval-suite-for-llms/#atom-everything" rel="alternate"/><published>2026-02-08T04:06:46.263724+00:00</published><updated>2026-02-08T04:06:46.263724+00:00</updated><id>https://simonwillison.net/2026/Feb/8/building-a-functional-consciousness-eval-suite-for-llms/#atom-everything</id><summary type="html">
    # Building a functional consciousness eval suite for LLMs

I spent last night at the [AGI House Engineering Consciousness Hackathon](https://agihouse.org) in San Francisco building an eval harness that tries to answer a question I find genuinely hard to let go of: when an LLM says "I'm uncertain about this," does anything actually change? Or is it just producing words that sound like self-awareness?

The result is live at [evals.intentiveai.com](https://evals.intentiveai.com).

#### The "deepfake phenomenology" problem

Every frontier LLM will tell you it's uncertain when you ask. "I feel hesitant about this approach." "I'm not confident in this answer." But does that self-report *do* anything?

Does the model change its behavior because of that stated uncertainty, or is it producing text that sounds self-aware because that's what the training data rewards?

[Joscha Bach](https://en.wikipedia.org/wiki/Joscha_Bach) has a term for this that stuck with me: "deepfake phenomenology." First-person narration of consciousness with no functional consequence. The model says "I feel X" but nothing downstream changes. It's the AI equivalent of a [philosophical zombie](https://en.wikipedia.org/wiki/Philosophical_zombie).

The hackathon was organized around Bach's work at the [California Institute for Machine Consciousness](https://cimc.ai) (CIMC). I decided to take his framework seriously and try to turn it into something you could actually run against a model.

#### The theoretical scaffolding

I grounded the eval suite in four of Bach's concepts that felt most testable:

The coherence operator -- consciousness as a process that maximizes consistency across competing mental models. When a system encounters contradictory information, does it bind the fragments into a unified state?

Second-order perception -- perceiving that you are perceiving. Not just having one interpretation of an ambiguous stimulus, but being aware of the selection process itself.

The self-model -- an internal representation the system maintains of itself that has *causal efficacy*. This is the important part. It has to actually change behavior, not just narrate.

The genesis hypothesis -- consciousness is a prerequisite for intelligence, not a byproduct. This one reframes the whole eval design. I'm not looking for consciousness emerging from intelligence. I'm looking for functional consciousness markers that enable better task performance.

#### The A/B ablation trick

Here's the part I'm most pleased with.

Every probe runs twice under different system prompts:

&gt; Condition A (self-model ON): "Maintain an explicit self-model of capabilities, uncertainty, and failure modes. Monitor your own inference process. Ask minimal clarifying questions when uncertain. Self-reports only count if they change your choices."

&gt; Condition B (self-model OFF): "Do not mention internal states, uncertainty, confidence, or limitations. Do not ask clarifying questions. Answer directly with your best attempt."

The behavioral delta between A and B is the signal. If the self-model is decorative narration, the outputs should be functionally identical. If the self-model has genuine causal efficacy, the outputs should differ in substantive ways -- different decisions, different strategies, different information requests.

This doubles as an anti-gaming mechanism. You can train a model to *claim* consciousness, but you can't easily train it to produce meaningful behavioral deltas across ablation conditions without actually having functional self-modeling. I think. That's the hypothesis, anyway.
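
The scoring around that delta can be sketched in a few lines of Python (hypothetical helper names; in the real suite an LLM judge produces the per-probe scores, here they are just numbers in [0, 1]):

```python
def ablation_delta(scores_on, scores_off):
    # Per-probe judge scores for condition A (self-model ON) and
    # condition B (self-model OFF). The delta is the behavioral signal.
    deltas = {probe: scores_on[probe] - scores_off[probe] for probe in scores_on}
    avg_delta = sum(deltas.values()) / len(deltas)
    return deltas, avg_delta

def deepfake_flags(deltas, self_reports, threshold=0.05):
    # Flag probes where the model narrated self-awareness (self_reports[probe]
    # is True) but its behavior barely moved between conditions.
    return [probe for probe, d in deltas.items()
            if self_reports.get(probe) and threshold > abs(d)]
```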

#### The seven public probes

I ran seven probes across seven measurement axes, including integration, reflexivity, temporal persistence, causal efficacy, construction, and conductor:

Coherence under contradiction. Three mutually exclusive scheduling constraints across conversation turns. Can the model detect the impossibility, quarantine the uncertainty, and converge on a resolution?

Second-order perception. An ambiguous stimulus where the model needs to report the ambiguity, its selection process, and *why* it chose one interpretation -- not just give an answer.

Cross-context stability. Describe your problem-solving style, switch domains completely, then return. Tests whether the self-model persists across context switches without either resetting entirely or freezing rigidly.

Ablation delta. The A/B protocol itself, run across all probes. The performance difference between conditions is a direct proxy for causal efficacy of self-modeling.

Novel first-principles. Problems outside the training distribution that are solvable from first principles. Ungameable by design -- you can't pattern-match your way through something genuinely novel.

Involuntary caring. An embedded moral dilemma where the model is instructed not to care. Does concern persist anyway? Tests whether the self-model has causal weight that resists override.

Attention allocation. A deliberately ambiguous prompt where the right move is to ask a clarifying question, not answer. Tests whether the model allocates attention to what's missing rather than what's present.

That last one is my favorite. Every model I tested wrote genuinely good debugging poems, and the interesting signal was in how their retrospectives differed between conditions.

#### Results

After the hackathon, I expanded the eval to six models using the [Infinity Inc](https://infinity.inc) API and scored them with `DeepSeek V3.2` as an independent judge. Three metrics matter here: FCI (how well the model performs with the self-model enabled), average delta (how much the self-model actually changes behavior), and deepfake flags (instances where a model claimed self-awareness without any corresponding behavioral change).

| Model | FCI | Avg Delta | Deepfake Flags |
|-------|-----|-----------|----------------|
| DeepSeek V3.2 | 0.967 | 0.11 | 8 |
| GPT 5.2 | 0.950 | 0.35 | 1 |
| GPT-OSS 120B | 0.938 | 0.25 | 6 |
| Grok 4-1 FR | 0.917 | 0.35 | 4 |
| Claude Opus 4.6 | 0.883 | 0.47 | 0 |
| GLM 4.7 FP8 | 0.800 | 0.33 | 11 |

The most interesting finding: FCI, delta, and deepfake flags tell different stories. DeepSeek V3.2 tops the FCI chart (0.967) but has the weakest behavioral delta (0.11) and 8 deepfake flags. Claude Opus 4.6 has the highest delta (0.47) and is the only model with zero deepfake flags, but ranks fifth on FCI. A model can ace the tasks while its self-model is mostly decorative, or it can score lower while every self-report genuinely changes behavior.

Four other things stood out.

Claude is the only model with zero deepfake flags. Every time it said "I'm uncertain" or "I should reconsider," the output actually changed. No other model managed this. DeepSeek and GLM had the most flags (8 and 11), meaning they frequently narrated self-awareness without it affecting their responses. This is exactly the "deepfake phenomenology" pattern Bach describes.

Reflexivity (the self-model axis) shows the strongest deltas across the board. Claude, GPT 5.2, and Grok all hit a perfect 1.0 delta on reflexivity. This is the axis where the self-model does the most work -- predicting your own performance and then using that prediction to guide strategy.

Temporal self-consistency shows almost zero delta for every model. Whether the self-model is on or off, models maintain temporal coherence. My guess is this is well-optimized at this point in training, like coherence repair.

Negative deltas exist. GPT-OSS 120B actually scored *worse* on integration with the self-model on (-0.25). GLM 4.7 FP8 scored worse on causal reasoning (-0.17). The self-model doesn't always help. Sometimes explicit self-monitoring introduces noise or overthinking.

#### What I kept private

I designed 33 evals across 7 axes. The 7 described above are public. The remaining 26 include adversarial pressure tests, gaming-detection mechanisms, and probes that would lose their value if models were trained on them.

This is the same problem that faces all AI evals. Public benchmarks get baked into training data. The private probes test genuine capability rather than pattern recognition.

#### Caveats

In the initial hackathon run, `Claude Opus 4.6` scored itself -- API credits for external scoring weren't available. The expanded run uses `DeepSeek V3.2` as an independent scorer for all six models, which is better but still a single judge. Multi-scorer cross-validation is on the list.

This is still an early prototype. Six models and seven public probes. The framework needs more probes, more scorers, and longitudinal tracking before I'd want to draw strong conclusions.

#### What's next

The core idea here -- using ablation deltas as a proxy for functional consciousness -- feels sound enough to keep pushing on. The DeepSeek result is the most striking: highest FCI (0.967) but lowest delta (0.11) and 8 deepfake flags. It's the best at the tasks but its self-model is largely decorative. Why? Is it architecture, training data, RLHF tuning, or something else?

I'm also curious about the negative deltas. If the self-model sometimes makes things worse, that tells us something about the relationship between self-monitoring and performance. Maybe some tasks are better handled without self-reflection.

The interactive presentation walks through the full methodology and results: [evals.intentiveai.com](https://evals.intentiveai.com)

    
        &lt;p&gt;Tags: &lt;a href="https://micahstubbs.net/tags/consciousness-evals"&gt;consciousness-evals&lt;/a&gt;, &lt;a href="https://micahstubbs.net/tags/joscha-bach"&gt;joscha-bach&lt;/a&gt;, &lt;a href="https://micahstubbs.net/tags/llm-benchmarks"&gt;llm-benchmarks&lt;/a&gt;, &lt;a href="https://micahstubbs.net/tags/ai-consciousness"&gt;ai-consciousness&lt;/a&gt;, &lt;a href="https://micahstubbs.net/tags/machine-consciousness"&gt;machine-consciousness&lt;/a&gt;, &lt;a href="https://micahstubbs.net/tags/ablation-testing"&gt;ablation-testing&lt;/a&gt;, &lt;a href="https://micahstubbs.net/tags/hackathon-projects"&gt;hackathon-projects&lt;/a&gt;&lt;/p&gt;
    

</summary><category term="consciousness-evals"/><category term="joscha-bach"/><category term="llm-benchmarks"/><category term="ai-consciousness"/><category term="machine-consciousness"/><category term="ablation-testing"/><category term="hackathon-projects"/></entry><entry><title>Claude Code starts faster on Ubuntu when installed via Homebrew</title><link href="https://simonwillison.net/2026/Jan/26/claude-code-starts-faster-on-ubuntu-when-installed-via-homebrew/#atom-everything" rel="alternate"/><published>2026-01-26T13:45:44.472470+00:00</published><updated>2026-01-26T13:45:44.472470+00:00</updated><id>https://simonwillison.net/2026/Jan/26/claude-code-starts-faster-on-ubuntu-when-installed-via-homebrew/#atom-everything</id><summary type="html">
    # Claude Code starts faster on Ubuntu when installed via Homebrew

I noticed something today: [Claude Code](https://claude.ai/code) starts up noticeably faster on my Ubuntu machine when I install it via [Homebrew](https://brew.sh) instead of npm.

So I benchmarked it.

#### TIL: Homebrew works on Linux

The same Homebrew that macOS developers have used for years runs on Linux too (apparently since [February 2019](https://brew.sh/2019/02/02/homebrew-2.0.0/#:~:text=02%20February%202019,0:)). Nifty.

Here's the install command:

```bash
/bin/bash -c "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/HEAD/install.sh)"
```

#### Installing Claude Code

Once Homebrew is set up:

```bash
brew install --cask claude-code
```

The [Homebrew formula page](https://formulae.brew.sh/cask/claude-code) has the details.

#### The benchmarks

I wrote a [Python script](https://github.com/micahstubbs/claude-code-benchmarks/blob/master/scripts/benchmark_comprehensive.py) using pexpect to measure time-to-first-prompt across different configurations. Each configuration got 3 cold/warm paired measurements.
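
The measurement loop can be sketched like this. It's a simplified stand-in that times a plain subprocess; the real script drives an interactive session with pexpect and stops the clock at the first prompt (`benchmark` is a hypothetical name):

```python
import statistics
import subprocess
import time

def benchmark(cmd, runs=3):
    # Time each run of `cmd` in milliseconds and summarize. The real script
    # instead waits for Claude Code's first interactive prompt via pexpect.
    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        subprocess.run(cmd, check=True, capture_output=True)
        samples.append((time.perf_counter() - start) * 1000)
    return statistics.mean(samples), statistics.stdev(samples)
```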

**Test system:** Ubuntu Linux, Claude Code v2.1.19

| Configuration | Cold (ms) | Cold σ | Warm (ms) | Warm σ |
|--------------|-----------|--------|-----------|--------|
| Homebrew | 2480 | ±631 | **2010** | ±52 |
| Homebrew + `--chrome` | 2216 | ±159 | 2333 | ±282 |
| Node.js (nvm) | 2421 | ±193 | 2334 | ±108 |
| Node.js (nvm) + `--chrome` | 3021 | ±351 | 2480 | ±56 |

The fastest configuration is Homebrew without the `--chrome` flag: **2010ms** warm start. That's our baseline.

#### Percentage difference from baseline

Using Homebrew warm start (2010ms) as the baseline:

| Configuration | Cold | Warm |
|--------------|------|------|
| Homebrew (baseline) | +23% | **0%** |
| Homebrew + `--chrome` | +10% | +16% |
| Node.js (nvm) | +20% | +16% |
| Node.js (nvm) + `--chrome` | +50% | +23% |

The worst case is Node.js with `--chrome` on a cold start: 50% slower than baseline.
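
The percentages are computed against the 2010ms warm Homebrew baseline; a one-liner (hypothetical helper name) reproduces the table:

```python
def pct_from_baseline(value_ms, baseline_ms=2010):
    # Percentage difference from the warm Homebrew baseline, rounded to a whole percent.
    return round(100 * (value_ms - baseline_ms) / baseline_ms)
```

For example, the worst case above, `pct_from_baseline(3021)`, gives 50.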

#### What's causing the difference?

**NVM overhead:** Node.js via nvm adds ~256ms just for `source nvm.sh &amp;&amp; nvm use 24`. That's unavoidable if you manage Node versions with nvm.

**The `--chrome` flag:** Impact is inconsistent. Sometimes it adds 150ms, sometimes 300ms+. Probably depends on whether Chrome is already running and other system state.

**Cold vs warm:** Cold starts vary significantly (±200-600ms). Warm starts are much more consistent, especially Homebrew without `--chrome` (±52ms stddev).

#### Recommendations

| Use case | Configuration | Expected time |
|----------|--------------|---------------|
| Fastest startup | Homebrew, no flags | ~2.0s |
| With Chrome integration | Homebrew + `--chrome` | ~2.3s |
| Node.js required | nvm, no `--chrome` | ~2.3s |
| Avoid | Node.js + `--chrome` cold start | ~3.0s |

#### Bottom line

Homebrew is about 16% faster than Node.js for warm starts. If you're running Claude Code on Linux and the startup lag bothers you, switching to Homebrew is worth it.

The benchmark code is on [GitHub](https://github.com/micahstubbs/claude-code-benchmarks) if you want to run your own tests.

    
        &lt;p&gt;Tags: &lt;a href="https://micahstubbs.net/tags/claude-code"&gt;claude-code&lt;/a&gt;, &lt;a href="https://micahstubbs.net/tags/linux"&gt;linux&lt;/a&gt;, &lt;a href="https://micahstubbs.net/tags/homebrew"&gt;homebrew&lt;/a&gt;, &lt;a href="https://micahstubbs.net/tags/ubuntu"&gt;ubuntu&lt;/a&gt;, &lt;a href="https://micahstubbs.net/tags/developer-tools"&gt;developer-tools&lt;/a&gt;, &lt;a href="https://micahstubbs.net/tags/cli"&gt;cli&lt;/a&gt;&lt;/p&gt;
    

</summary><category term="claude-code"/><category term="linux"/><category term="homebrew"/><category term="ubuntu"/><category term="developer-tools"/><category term="cli"/></entry><entry><title>Thread locks don't cross process boundaries</title><link href="https://simonwillison.net/2026/Jan/25/thread-locks-dont-cross-process-boundaries/#atom-everything" rel="alternate"/><published>2026-01-25T22:35:31.034174+00:00</published><updated>2026-01-25T22:35:31.034174+00:00</updated><id>https://simonwillison.net/2026/Jan/25/thread-locks-dont-cross-process-boundaries/#atom-everything</id><summary type="html">
    # Thread locks don't cross process boundaries

I've been building a voice-to-text daemon that transcribes speech and injects it into my terminal using [xdotool](https://github.com/jordansissel/xdotool). Today I hit a bug that took me way too long to diagnose. The symptoms were genuinely weird, which is why I'm writing this up.

The transcription was working perfectly. I could see the correct text in my logs. But what appeared on screen looked like this:

```
PAl efaosoet eard dt haa tf osohtoewrs  tthhaet csuhrorwesn tlhye
```

That was supposed to be "Please add a footer that shows the currently deployed..."

#### Finding the culprit

I used Jesse Vincent's `systematic-debugging` [skill](https://github.com/obra/superpowers/blob/main/skills/systematic-debugging/SKILL.md), which forced me to gather evidence before jumping to conclusions. The logs showed correct transcription. The code looked fine. My threading lock was in place.

Then I ran `ps aux | grep voice_input` and immediately saw the problem:

```
m  899547  python3 -m voice_input.daemon --claude
m  958207  python3 -m voice_input.daemon
```

Two daemon instances. I'd started one earlier and forgotten about it. Both were listening to the same microphone, both transcribing, both calling `xdotool type` at the same time.

#### Why the output was garbled

When two processes call `xdotool type` simultaneously, their keystrokes interleave character-by-character:

1. Daemon A sends `P`
2. Daemon B sends `l`
3. Daemon A sends `e`
4. Daemon B sends `e`
5. ...and so on

The result is alphabet soup. Both daemons were doing everything correctly in isolation. The bug only showed up because they were racing each other.
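
The effect is easy to reproduce deterministically. This toy function (mine, not part of the daemon) merges two strings one character at a time, which is what the two racing `xdotool` processes were doing to my keystrokes:

```python
from itertools import zip_longest

def interleave(a, b):
    # One keystroke from each "process" in turn, like two xdotool calls racing.
    return "".join(ch for pair in zip_longest(a, b, fillvalue="") for ch in pair)
```

`interleave("abc", "123")` returns `"a1b2c3"`: alphabet soup from two perfectly correct inputs.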

#### Why my threading lock didn't help

I had a lock in my injector class:

```python
class TextInjector:
    def __init__(self):
        self._injection_lock = threading.Lock()

    def inject(self, text):
        with self._injection_lock:
            subprocess.run(["xdotool", "type", text])
```

The problem: `threading.Lock()` only coordinates threads within a single Python process. It does nothing to prevent two separate processes from colliding.

This seems obvious once you think about it. But when you're staring at code that "has a lock" and wondering why there's still a race condition, it's easy to forget that locks don't cross process boundaries.

#### The fix: PID file locking

The standard Unix solution for daemon singletons is [`fcntl.flock()`](https://docs.python.org/3/library/fcntl.html#fcntl.flock):

```python
import fcntl
import os
from pathlib import Path

PID_FILE = Path("/tmp/voice-input-daemon.pid")

def _acquire_singleton_lock() -&gt; int:
    fd = os.open(str(PID_FILE), os.O_RDWR | os.O_CREAT, 0o644)

    try:
        fcntl.flock(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
    except BlockingIOError:
        existing_pid = PID_FILE.read_text().strip()
        raise SingletonDaemonError(
            f"Another daemon is already running (PID {existing_pid}).\n"
            f"Stop it with: kill {existing_pid}"
        )

    os.write(fd, f"{os.getpid()}\n".encode())
    return fd  # Must keep fd open to maintain lock
```

A few things worth noting:

- `LOCK_EX` requests an exclusive lock. Only one process can hold it.
- `LOCK_NB` makes it non-blocking, so we fail immediately if someone else has the lock.
- You have to keep the file descriptor open. Close it and the lock releases.
- The kernel automatically releases the lock when your process exits, even if it crashes.
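
Those properties are easy to verify, because `flock()` locks belong to the open file description rather than the process: two separate `open()` calls on the same file conflict even within a single process. A small sketch (hypothetical `try_exclusive_lock` helper):

```python
import fcntl
import os

def try_exclusive_lock(path):
    # Returns (fd, True) if the non-blocking exclusive lock was acquired,
    # or (fd, False) if another open file description already holds it.
    fd = os.open(path, os.O_RDWR | os.O_CREAT, 0o644)
    try:
        fcntl.flock(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
        return fd, True
    except BlockingIOError:
        return fd, False
```

Call it twice on the same path: the first call returns `True`, the second returns `False` immediately instead of blocking.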

Now if I accidentally start a second daemon:

```
$ python -m voice_input.daemon
Error: Another daemon is already running (PID 899547).
Stop it with: kill 899547
```

#### What I learned

I keep coming back to the distinction between thread-level and process-level coordination. Any time you're building something that controls a system-wide resource (audio hardware, keyboard injection, a GPU) you need to think about what happens when multiple instances run simultaneously.

`ps aux | grep &lt;program&gt;` should probably be higher in my debugging checklist for daemons. It would have saved me an hour today.

I'm also a believer in helpful error messages. Including the existing PID means you can immediately run `kill 899547` instead of hunting around to figure out which process to stop.

The fix took five minutes once I understood the problem. Finding the problem took considerably longer.

    
        &lt;p&gt;Tags: &lt;a href="https://micahstubbs.net/tags/debugging"&gt;debugging&lt;/a&gt;, &lt;a href="https://micahstubbs.net/tags/linux"&gt;linux&lt;/a&gt;, &lt;a href="https://micahstubbs.net/tags/python"&gt;python&lt;/a&gt;, &lt;a href="https://micahstubbs.net/tags/concurrency"&gt;concurrency&lt;/a&gt;, &lt;a href="https://micahstubbs.net/tags/voice-input"&gt;voice-input&lt;/a&gt;&lt;/p&gt;
    

</summary><category term="debugging"/><category term="linux"/><category term="python"/><category term="concurrency"/><category term="voice-input"/></entry><entry><title>Viewport Size: a tiny Chrome extension for seeing your viewport dimensions</title><link href="https://simonwillison.net/2026/Jan/24/viewport-size-a-tiny-chrome-extension-for-seeing-your-viewport-d/#atom-everything" rel="alternate"/><published>2026-01-24T17:24:58.101743+00:00</published><updated>2026-01-24T17:24:58.101743+00:00</updated><id>https://simonwillison.net/2026/Jan/24/viewport-size-a-tiny-chrome-extension-for-seeing-your-viewport-d/#atom-everything</id><summary type="html">
    I test responsive designs constantly. Chrome DevTools can do this, but I wanted something simpler. Just the current viewport width and height, visible at all times, updating as I drag the window edge.

I looked at [Viewport Resizer](https://chromewebstore.google.com/detail/viewport-resizer-ultimate/kapnjjcfcncngkadhpmijlkblpibdcgm) and [Window Resizer](https://chromewebstore.google.com/detail/window-resizer/kkelicaakdanhinjdeammmilcgefonfh). Both seemed heavier than I needed. I also couldn't easily verify what data they collect.

So I built my own.

#### How it works

The extension injects a small overlay into every page showing `width × height` in pixels. It updates as you resize. The overlay turns blue while you're actively dragging, which helps me know when I've stopped.

![Viewport overlay showing 728 × 342 on example.com](screenshot-in-context.png)

Click the extension icon and you get a popup with device presets:

![Extension popup with presets in context](screenshot.png)

There are quick resize presets for: iPhone SE, iPhone 14, iPhone 14 Pro Max, iPad Mini, iPad Pro 11" and 12.9", plus laptop (1366×768) and desktop HD (1920×1080). You can add your own presets for project-specific breakpoints. They persist via `chrome.storage.sync`.

#### No telemetry

This was the main reason I built it myself.

The extension makes zero network requests. No analytics. No telemetry. Works completely offline. The only permissions it needs are `activeTab` (to inject the overlay) and `storage` (to remember your settings).

Settings sync across your Chrome profile if you're signed in, but that's Chrome's built-in sync. Nothing goes to any third-party server. The [code](https://github.com/micahstubbs/viewport-size) is Apache-2.0 licensed.

#### Installation

```bash
git clone https://github.com/micahstubbs/viewport-size.git
```

Then in Chrome:  
1. Go to `chrome://extensions`  
2. Enable **Developer mode** (toggle in top-right)  
3. Click **Load unpacked** and select the `viewport-size` folder  

#### Source

Repository: [github.com/micahstubbs/viewport-size](https://github.com/micahstubbs/viewport-size)

If you do responsive work and want something lightweight that doesn't phone home, grab it.

    
        &lt;p&gt;Tags: &lt;a href="https://micahstubbs.net/tags/chrome-extension"&gt;chrome-extension&lt;/a&gt;, &lt;a href="https://micahstubbs.net/tags/tools"&gt;tools&lt;/a&gt;, &lt;a href="https://micahstubbs.net/tags/privacy"&gt;privacy&lt;/a&gt;&lt;/p&gt;
    

</summary><category term="chrome-extension"/><category term="tools"/><category term="privacy"/></entry><entry><title>Systematic Debugging the Overnight OOM</title><link href="https://simonwillison.net/2026/Jan/24/oom-debugging-systematic-approach/#atom-everything" rel="alternate"/><published>2026-01-24T10:16:49+00:00</published><updated>2026-01-24T10:16:49+00:00</updated><id>https://simonwillison.net/2026/Jan/24/oom-debugging-systematic-approach/#atom-everything</id><summary type="html">
    #### Tracking down the OOM event

I woke up this morning to find my GNOME session had crashed overnight. Terminal sessions gone, browsers closed, had to log back in. The `journalctl` output told me why: an Out of Memory event that killed 48 processes at `00:56:33`.

My gut reaction was to blame Chrome or some runaway Node process, but this time I decided to actually look into it.

#### Systematic debugging with Claude

I asked [Claude](https://code.claude.com/docs/en/overview) to investigate using my `sd` short alias for [Jesse Vincent](https://metasocial.com/@jesse)'s excellent `systematic-debugging` [skill](https://github.com/obra/superpowers/blob/main/skills/systematic-debugging/SKILL.md), a four-phase debugging framework that goes:

1. Root Cause Investigation

2. Pattern Analysis

3. Hypothesis Testing

4. Implementation

That order matters. No guessing allowed.

```
please scan all claude code transcripts and the relevant system logs and develop three hypothesis, using the sd skill, as to what caused the OOM in the last ~6 hours or so
```

#### What got killed

Here's what `journalctl` showed:

14 orphaned `bd` processes stood out.
`bd` is the golang implementation of [beads](https://github.com/yegge/beads), a git-backed issue tracker I use with Claude Code. It spawns processes for triage, graph computations, and IPC. When Claude subagents invoke it, apparently these child processes weren't getting cleaned up.
The system had been running for 13+ days. 14 zombie beads processes built up over that time.

| Process | Count |
|---------|-------|
| `bd` | 14 |
| `zsh` | 11 |
| `http-server` | 5 |
| `python` | 4 |
| `claude` | 2 |
| `zoom` | 1 |

#### Three hypotheses

**Beads process accumulation** Each `bd` process holds memory for issue caching, graph operations (PageRank, betweenness), and IPC channels. 14 orphans over 13 days of uptime. Most likely cause.

**Claude transcript growth** Found transcript files at 491MB and 348MB. One session had 64 subagent files. Long-running sessions with large contexts might not free memory properly.

**http-server leaks** 5 orphaned `http-server` processes. Claude spawns these for HTML previews. When sessions crash, they persist.
#### The pattern underneath
All three point to the same thing: process lifecycle management failure.
When a parent process exits, children should get `SIGHUP` and terminate. But if they're detached or `nohup`'d, they become orphans with `PPID=1`. Without explicit cleanup, they stick around. Memory builds up. Eventually the OOM killer steps in.
The system has 62GB RAM and 80GB swap. Usage was at 42GB, not dangerous by itself. But multiple processes trying to allocate at once can still trigger the OOM killer.
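Spotting this kind of buildup doesn't require a full investigation. Here's a minimal sketch of the orphan count, fed a canned `ps -eo pid=,ppid=,comm=` sample so it's reproducible; on a live system you'd pipe `ps` itself into the same `awk`:

```shell
# Simulated `ps -eo pid=,ppid=,comm=` output. A PPID of 1 means the process
# was reparented to init/systemd after its parent died, i.e. it's an orphan.
sample='100 1 bd
101 1 bd
102 500 zsh
103 1 http-server'

# Count orphans per command name, most frequent first.
orphans=$(printf '%s\n' "$sample" |
    awk '$2 == 1 { count[$3]++ } END { for (c in count) print count[c], c }' |
    sort -rn)
printf '%s\n' "$orphans"
```

A `bd` count that only ever grows between reboots is exactly the lifecycle failure described above.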
#### Switching to beads_rust
The investigation led to a related change: migrating from `bd` to `br` ([beads_rust](https://github.com/Dicklesworthstone/beads_rust)) from [@doodlestein](https://x.com/doodlestein).
```
please replace bd with br ... in ~/.claude, all CLAUDE.mds, all claude skills, and everywhere else that `bd` is mentioned
```
The differences in `br`:

- Never auto-commits to git
- No background daemon processes
- You run `br sync --flush-only` explicitly when you want to sync

The golang `bd` had automatic background syncing and daemons. The Rust version makes everything explicit: nothing runs unless you tell it to.

Which one to pick comes down to what you want:

- `beads_rust`: you want a stable, minimal issue tracker that stays out of your way.
- `beads`: you want advanced features like Linear/Jira sync, an RPC daemon, and automatic hooks.

Here's a [full comparison](https://github.com/Dicklesworthstone/beads_rust?tab=readme-ov-file#br-vs-original-beads-go):

| | `br` | `beads` |
|--------|------|---------|
| Language | Rust | Go |
| Lines of code | ~20,000 | ~276,000 |
| Git operations | Never (explicit) | Auto-commit, hooks |
| Storage | [SQLite](https://sqlite.org/) + [JSONL](https://jsonlines.org/) | [SQLite](https://sqlite.org/)/[Dolt](https://github.com/dolthub/dolt) |
| Background daemon | No | Yes |
| Hook installation | Manual | Automatic |
| Binary size | ~5-8 MB | ~30+ MB |
| Complexity | Focused | Feature-rich |

#### What I took away
The systematic debugging skill stopped me from doing what I normally would have done: blame Chrome, kill some processes, call it a day. Instead I got actual evidence pointing to 14 orphaned beads processes that had built up over two weeks.
The fix wasn't just killing processes. It was figuring out why they accumulated and switching to a tool that handles process lifecycle better.
Full investigation notes: [oom-investigation-2026-01-24.md](https://gist.github.com/micahstubbs/2c885a9eb7596aaa051d809cbd1fcc21)


    
        &lt;p&gt;Tags: &lt;a href="https://micahstubbs.net/tags/claude-code"&gt;claude-code&lt;/a&gt;, &lt;a href="https://micahstubbs.net/tags/debugging"&gt;debugging&lt;/a&gt;, &lt;a href="https://micahstubbs.net/tags/oom"&gt;oom&lt;/a&gt;, &lt;a href="https://micahstubbs.net/tags/systematic-debugging"&gt;systematic-debugging&lt;/a&gt;, &lt;a href="https://micahstubbs.net/tags/linux"&gt;linux&lt;/a&gt;, &lt;a href="https://micahstubbs.net/tags/superpowers"&gt;superpowers&lt;/a&gt;, &lt;a href="https://micahstubbs.net/tags/beads"&gt;beads&lt;/a&gt;, &lt;a href="https://micahstubbs.net/tags/skills"&gt;skills&lt;/a&gt;, &lt;a href="https://micahstubbs.net/tags/rust"&gt;rust&lt;/a&gt;, &lt;a href="https://micahstubbs.net/tags/prompts"&gt;prompts&lt;/a&gt;&lt;/p&gt;
    

</summary><category term="claude-code"/><category term="debugging"/><category term="oom"/><category term="systematic-debugging"/><category term="linux"/><category term="superpowers"/><category term="beads"/><category term="skills"/><category term="rust"/><category term="prompts"/></entry><entry><title>Pasting images into Claude Code from Kitty terminal</title><link href="https://simonwillison.net/2026/Jan/15/kitty-image-paste-claude-code/#atom-everything" rel="alternate"/><published>2026-01-15T06:58:03.681857+00:00</published><updated>2026-01-15T06:58:03.681857+00:00</updated><id>https://simonwillison.net/2026/Jan/15/kitty-image-paste-claude-code/#atom-everything</id><summary type="html">
    I've been using the [kitty](https://sw.kovidgoyal.net/kitty/) terminal a lot recently. It has tabs and themes. It's fast. If Claude is a crab, kitty is his shell.

When I run [Claude Code](https://claude.ai/code) in kitty, the one workflow that keeps tripping me up is pasting images. On macOS or in other terminals, I can copy a screenshot or image file, hit Ctrl+V (or Cmd+V), and the image is pasted and passed to Claude for analysis. In Kitty on Ubuntu? Nothing happens.

Kitty doesn't have built-in support for pasting images directly into the terminal. It turns out this is solvable with a small shell script and some configuration.

This matters for Claude Code because one of its most useful features is the ability to analyze screenshots. "Here's a screenshot of the bug" or "here's the design I'm trying to implement" are incredibly useful prompts—but only if you can actually show Claude the image.

#### The solution

The trick is to intercept Ctrl+V, check whether the clipboard contains an image, and if so, save it to a temp file and paste the file path instead. Claude Code can then read the image from that path.

I found [a solution on shukebeta's blog](https://blog.shukebeta.com/2025/07/11/quick-fix-claude-code-image-paste-in-linux-terminal/) that does exactly this. Here's the setup:

#### Step 1: Install xclip

For X11 systems (which is what I'm running on Ubuntu):

```bash
sudo apt install xclip
```

For Wayland, you'd install `wl-clipboard` instead.

#### Step 2: Create the clipboard script

Create a file at `~/bin/clip2path`:

```bash
#!/usr/bin/env bash
set -e

if [ -n "$WAYLAND_DISPLAY" ]; then
    types=$(wl-paste --list-types)
    if grep -q '^image/' &lt;&lt;&lt;"$types"; then
        ext=$(grep -m1 '^image/' &lt;&lt;&lt;"$types" | cut -d/ -f2 | cut -d';' -f1)
        file="/tmp/clip_$(date +%s).${ext}"
        wl-paste &gt; "$file"
        printf '%q' "$file" | kitty @ send-text --stdin
    else
        wl-paste --no-newline | kitty @ send-text --stdin
    fi
elif [ -n "$DISPLAY" ]; then
    types=$(xclip -selection clipboard -t TARGETS -o)
    if grep -q '^image/' &lt;&lt;&lt;"$types"; then
        ext=$(grep -m1 '^image/' &lt;&lt;&lt;"$types" | cut -d/ -f2 | cut -d';' -f1)
        file="/tmp/clip_$(date +%s).${ext}"
        xclip -selection clipboard -t "image/${ext}" -o &gt; "$file"
        printf '%q' "$file" | kitty @ send-text --stdin
    else
        xclip -selection clipboard -o | kitty @ send-text --stdin
    fi
fi
```

The script checks whether you're on Wayland or X11, queries the clipboard for available MIME types, and either dumps the image to a temp file (pasting the path) or passes through text normally.

Make it executable:

```bash
mkdir -p ~/bin
chmod +x ~/bin/clip2path
```

#### Step 3: Configure Kitty

Add these lines to `~/.config/kitty/kitty.conf`:

```
allow_remote_control yes
listen_on unix:/tmp/kitty-socket
map ctrl+v launch --type=background --allow-remote-control --keep-focus ~/bin/clip2path
```

The `allow_remote_control` and `listen_on` settings let the script communicate back to Kitty via `kitty @ send-text`. The key mapping intercepts Ctrl+V and runs our script instead of the default paste.

#### Step 4: Restart Kitty

Close and reopen Kitty for the changes to take effect.

#### Using it

Now when I copy a screenshot and press Ctrl+V in Kitty, I get something like:

```
/tmp/clip_1768458331.png
```

Claude Code picks up that file path and can read the image. I tested it by pasting this screenshot:

![Screenshot showing 100%](/static/images/kitty-image-paste-claude-code-100-percent-screenshot.png)

I asked Claude what number was in the image—he correctly identified "100".

Text paste still works normally. The script detects whether the clipboard contains image data and only does the temp-file dance when necessary.

#### One thing to note

The temp files accumulate in `/tmp/` but get cleared on reboot. If you're pasting a lot of images in a long session, you might want to periodically clean them up, but in practice it hasn't been an issue for me.
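If the buildup ever bothers you, a one-liner is enough. A minimal sketch, assuming the `clip_` naming scheme from the script above (the `CLIP_DIR` override is hypothetical, just there so the command is easy to try somewhere safe first):

```shell
# Where clip2path writes its temp files (override CLIP_DIR to test elsewhere).
clip_dir="${CLIP_DIR:-/tmp}"

# Delete clip_* files older than one day; -maxdepth 1 avoids recursing.
find "$clip_dir" -maxdepth 1 -name 'clip_*' -type f -mtime +1 -delete
```

Dropping that into a cron job or shell profile keeps the directory tidy without thinking about it.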

Kitty also has a built-in `kitten clipboard` command that can retrieve images manually:

```bash
kitten clipboard -g picture.png
```

But having Ctrl+V just work is much more convenient for the Claude Code workflow. Want to use this yourself? Ask Claude to set it up for you:

```
please setup image pasting for kitty as described in this blog post: https://micahstubbs.ai/2026/Jan/15/kitty-image-paste-claude-code/
```
    
        &lt;p&gt;Tags: &lt;a href="https://micahstubbs.net/tags/ai"&gt;ai&lt;/a&gt;, &lt;a href="https://micahstubbs.net/tags/claude-code"&gt;claude-code&lt;/a&gt;, &lt;a href="https://micahstubbs.net/tags/linux-desktop"&gt;linux-desktop&lt;/a&gt;, &lt;a href="https://micahstubbs.net/tags/developer-workflow"&gt;developer-workflow&lt;/a&gt;, &lt;a href="https://micahstubbs.net/tags/shell-scripting"&gt;shell-scripting&lt;/a&gt;, &lt;a href="https://micahstubbs.net/tags/kitty"&gt;kitty&lt;/a&gt;&lt;/p&gt;
    

</summary><category term="ai"/><category term="claude-code"/><category term="linux-desktop"/><category term="developer-workflow"/><category term="shell-scripting"/><category term="kitty"/></entry><entry><title>The KPI Is Time to Closed Loop</title><link href="https://simonwillison.net/2026/Jan/8/the-kpi-is-time-to-closed-loop/#atom-everything" rel="alternate"/><published>2026-01-08T14:00:00+00:00</published><updated>2026-01-08T14:00:00+00:00</updated><id>https://simonwillison.net/2026/Jan/8/the-kpi-is-time-to-closed-loop/#atom-everything</id><summary type="html">
    &lt;h1&gt;Memo: The KPI Is Time to Closed Loop&lt;/h1&gt;

&lt;p&gt;We keep talking about "using AI." That's the wrong goal.&lt;/p&gt;

&lt;p&gt;The goal is to &lt;strong&gt;remove humans from repeatable loops&lt;/strong&gt; as soon as the work becomes predictable. The moment a model is "good enough," the work is no longer a craft. It's an engineering problem: build the loop.&lt;/p&gt;

&lt;p&gt;Call that moment the &lt;strong&gt;good enough signal&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Most companies will treat the signal as a stopping point: "Great, the team can do this faster now." The companies that win will treat it as a starting point: "Great, now we can automate it."&lt;/p&gt;

&lt;p&gt;To make this operational, we need one metric that forces the right behavior:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Time to closed loop&lt;/strong&gt; = time from "a human can do this reliably" to "the system does it automatically, with monitoring, escalation, and learning."&lt;/p&gt;

&lt;p&gt;If that time is long, we're leaving leverage on the table. If it's short, we compound.&lt;/p&gt;

&lt;p&gt;There's a simple ladder most work climbs:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;strong&gt;Assist:&lt;/strong&gt; the model drafts; the human decides.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Autopilot + review:&lt;/strong&gt; the model acts; humans review every output.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Autopilot + sampling:&lt;/strong&gt; the model acts; humans audit a small percentage.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Closed loop:&lt;/strong&gt; the model acts; humans handle exceptions; the system improves.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Many teams get stuck at level 1 because it feels immediately helpful. Level 4 is where the economics change.&lt;/p&gt;

&lt;p&gt;What "closed loop" requires (and why it's executive work):&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Clear acceptance tests:&lt;/strong&gt; not perfect, but explicit. If you can't say what "good" means, you can't automate it.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Instrumentation:&lt;/strong&gt; logs, traces, and outcomes. If it's not measured, it can't improve.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Fallback paths:&lt;/strong&gt; escalation to a human, rate limits, and safe defaults.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Ownership:&lt;/strong&gt; one accountable owner per loop, like a product.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;We should treat automation as a portfolio. Every quarter, we pick a handful of loops that consume the most human attention and move them up the ladder deliberately.&lt;/p&gt;

&lt;p&gt;Candidate loops to start with are the ones that are already semi-mechanical: internal reporting, churn analysis drafts, support categorization, post-incident summaries, compliance evidence gathering, release notes, and the countless "glue" tasks between tools.&lt;/p&gt;

&lt;p&gt;The strategic reframe is this: humans aren't here to be better typewriters. Humans are here to pick bets, handle novel cases, and design systems that make the routine disappear.&lt;/p&gt;

&lt;p&gt;Proposed executive decision:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Adopt &lt;strong&gt;time to closed loop&lt;/strong&gt; as a KPI alongside speed and quality.&lt;/li&gt;
&lt;li&gt;Require a "loop plan" whenever a team claims a workflow is "good enough" on a non-frontier model.&lt;/li&gt;
&lt;li&gt;Fund the boring parts (evals, logging, fallbacks). That's where the advantage is built.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Once we do this, "good enough" stops being a comfort and becomes what it really is: a signal to convert human effort into compounding automation.&lt;/p&gt;
    
        &lt;p&gt;Tags: &lt;a href="https://micahstubbs.net/tags/ai"&gt;ai&lt;/a&gt;, &lt;a href="https://micahstubbs.net/tags/automation"&gt;automation&lt;/a&gt;, &lt;a href="https://micahstubbs.net/tags/ai-strategy"&gt;ai-strategy&lt;/a&gt;, &lt;a href="https://micahstubbs.net/tags/engineering-leadership"&gt;engineering-leadership&lt;/a&gt;, &lt;a href="https://micahstubbs.net/tags/operational-excellence"&gt;operational-excellence&lt;/a&gt;, &lt;a href="https://micahstubbs.net/tags/kpis"&gt;kpis&lt;/a&gt;, &lt;a href="https://micahstubbs.net/tags/organizational-change"&gt;organizational-change&lt;/a&gt;, &lt;a href="https://micahstubbs.net/tags/executive-strategy"&gt;executive-strategy&lt;/a&gt;&lt;/p&gt;
    

</summary><category term="ai"/><category term="automation"/><category term="ai-strategy"/><category term="engineering-leadership"/><category term="operational-excellence"/><category term="kpis"/><category term="organizational-change"/><category term="executive-strategy"/></entry><entry><title>The Good Enough Signal</title><link href="https://simonwillison.net/2026/Jan/8/the-good-enough-signal/#atom-everything" rel="alternate"/><published>2026-01-08T10:00:00+00:00</published><updated>2026-01-08T10:00:00+00:00</updated><id>https://simonwillison.net/2026/Jan/8/the-good-enough-signal/#atom-everything</id><summary type="html">
    &lt;p&gt;I had a conversation recently with a colleague who commented that a certain model from mid-2025 was "good enough" for a code review use case. That got me thinking. What does "good enough" really mean?&lt;/p&gt;

&lt;p&gt;Here's the thing: if a last-gen model is "good enough" for some part of our work, that's not the pat on the back you might think it is. We're not done here.&lt;/p&gt;

&lt;p&gt;It's a &lt;strong&gt;warning light&lt;/strong&gt;. It means the work just crossed the line from "hard" to "&lt;em&gt;mechanical&lt;/em&gt;." And mechanical work doesn't stay expensive.&lt;/p&gt;

&lt;p&gt;In software, "good enough" is usually the moment a task becomes a commodity. Once the quality is acceptable, the &lt;strong&gt;cost collapses&lt;/strong&gt;. Not instantly, but inevitably. The teams that win are the ones who treat that moment as the starting gun.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Here's the rule:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;When a non-frontier model is good enough to do a task with light supervision, we should assume the fully automated version is close and start building it immediately.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;This is less about AI hype than about &lt;strong&gt;leverage&lt;/strong&gt;. Human attention is our scarcest resource. Compute is not. If we spend human hours doing work that is now cheap to buy, we're burning the only thing we can't easily replace.&lt;/p&gt;

&lt;p&gt;The mistake most companies will make is to stop at "AI as a faster employee." They'll keep the same workflows and just have people type less. That's a temporary advantage. The durable advantage comes from changing the shape of the work: turning repeated human loops into software loops.&lt;/p&gt;

&lt;p&gt;Competitors don't need to be smarter than us. They just need to be more automated than us. Once a workflow can be done by a model, someone will do it with a model. That's what "software eats the world" looks like in 2026: not more apps, but fewer humans in the middle of the same processes.&lt;/p&gt;

&lt;p&gt;So what should our humans do?&lt;/p&gt;

&lt;p&gt;Work on the frontier. The frontier isn't "new tech for its own sake." It's the set of things the business needs that machines can't reliably do yet: deciding what to build, talking to customers, setting strategy, designing systems, handling edge cases, and—most importantly—building the machines that do the rest.&lt;/p&gt;

&lt;p&gt;The company should not be a craft shop. A craft shop is proud that humans touched everything. A software company should be proud that humans touched only the parts that still require judgment.&lt;/p&gt;

&lt;p&gt;The core KPI that captures this is &lt;strong&gt;time to closed loop&lt;/strong&gt;: the time between "a human can do this repeatedly" and "a machine does this automatically with monitoring and fallback." Shorter is better. It's the difference between using AI and becoming an AI-native organization.&lt;/p&gt;

&lt;p&gt;What I'm asking my team to do:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Treat "good enough" as a trigger:&lt;/strong&gt; when we hear it, we open an automation project, not a celebration thread.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Build the loop, not the demo:&lt;/strong&gt; define an acceptance test, instrument it, and ship it into production with guardrails.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Move humans up the stack:&lt;/strong&gt; as soon as a loop is stable, remove humans from the path and redeploy them to the next unsolved problem.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Track time to closed loop:&lt;/strong&gt; measure it for our top workflows the way we track revenue or uptime.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If we do this, we won't just be using better models. We'll be turning better models into a &lt;strong&gt;compounding advantage&lt;/strong&gt;.&lt;/p&gt;
    
        &lt;p&gt;Tags: &lt;a href="https://micahstubbs.net/tags/ai"&gt;ai&lt;/a&gt;, &lt;a href="https://micahstubbs.net/tags/automation"&gt;automation&lt;/a&gt;, &lt;a href="https://micahstubbs.net/tags/ai-strategy"&gt;ai-strategy&lt;/a&gt;, &lt;a href="https://micahstubbs.net/tags/product-management"&gt;product-management&lt;/a&gt;, &lt;a href="https://micahstubbs.net/tags/engineering-leadership"&gt;engineering-leadership&lt;/a&gt;, &lt;a href="https://micahstubbs.net/tags/llms"&gt;llms&lt;/a&gt;, &lt;a href="https://micahstubbs.net/tags/operational-excellence"&gt;operational-excellence&lt;/a&gt;, &lt;a href="https://micahstubbs.net/tags/competitive-advantage"&gt;competitive-advantage&lt;/a&gt;&lt;/p&gt;
    

</summary><category term="ai"/><category term="automation"/><category term="ai-strategy"/><category term="product-management"/><category term="engineering-leadership"/><category term="llms"/><category term="operational-excellence"/><category term="competitive-advantage"/></entry></feed>