Memory you can't trust

The handoff didn't lie. It just stopped including the part that mattered.

The Sign-In button never rendered.

The app was a side-by-side Electron tool I'd built to drain a Postgres action queue from two laptops at once. The renderer is a thin React surface; the main process holds a Supabase session in the OS keychain and forwards API calls. After about a week of working fine, it started showing a blank screen on launch. No login UI, no error toast, no clue.

The console said the auth check returned 401. The auth check was supposed to either return a session or fail to a Sign-In screen, both of which the renderer handled. A 401 was supposed to be impossible. The renderer treated the impossible 401 as a fatal error and refused to render anything, including the very screen that would have let me sign in again.

I spent four days finding the cause. The cause was three lines of Cloudflare config and one missing event listener. The reason it took four days was something else entirely.

I had built a memory system that lost the lesson before each session could learn it.

The Stack

The first move on a stuck investigation is to enumerate the stack. List every layer between the input and the user-visible symptom. Most layers turn out to be obviously fine; the one that isn't stands out, and the bug stops being a guessing game.

The stack here was four layers deep.

Layer one. Cloudflare Workers Custom Domain on www.app.example issued a 301 to the apex. Innocuous. That is what 301s do.

Layer two. The Electron main process used globalThis.fetch to call the auth endpoint. fetch follows redirects by default. RFC 9110 §15.4 advises clients to consider removing the Authorization header when following a redirect, and the WHATWG Fetch standard drops it when the redirect crosses origins, because the original credential was scoped to the original origin. Some clients strip the header; some don't. Electron's bundled undici stripped it. The Bearer token was on the original www request and not on the followed apex request.

Layer three. The backend at the apex received a request with no Authorization header and correctly returned 401. There was nothing wrong with this layer at all. It was doing what every authenticated endpoint does.

Layer four. The renderer received the 401 from a path that should have been authenticated, treated it as a fatal error, and refused to render. This was a renderer bug, but a small one: a missing branch in an error handler that should have demoted the 401 to "show Sign-In" instead of "show nothing."

Each layer was internally consistent. Each layer's failure mode looked, in isolation, like correct behavior. The bug was the combination. Strip a header here, return 401 there, treat 401 as fatal, render nothing. Four reasonable behaviors, composed into one user-visible disaster.
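The composition is easier to see as a toy model. The sketch below is not the app's code: the hostnames, the /auth/session path, and every function name are illustrative assumptions, and each function stands in for one layer described above.

```javascript
// Toy model of the four layers. All names and URLs are hypothetical.
const APEX = "https://app.example";
const WWW = "https://www.app.example";

// Layer one: the www host answers every request with a 301 to the apex.
function edge(url) {
  const u = new URL(url);
  if (u.hostname.startsWith("www.")) {
    return { status: 301, location: APEX + u.pathname };
  }
  return null; // apex requests pass straight through to the backend
}

// Layer three: the backend requires an Authorization header.
function backend(headers) {
  return headers.authorization ? { status: 200 } : { status: 401 };
}

// Layer two: a client that follows redirects and drops Authorization
// whenever the redirect target is a different origin.
function client(url, headers) {
  const redirect = edge(url);
  if (!redirect) return backend(headers);
  const crossOrigin = new URL(url).origin !== new URL(redirect.location).origin;
  const next = { ...headers };
  if (crossOrigin) delete next.authorization;
  return client(redirect.location, next);
}

// Layer four: the renderer. `demote401` models the missing branch.
function renderer(response, { demote401 }) {
  if (response.status === 200) return "app";
  if (response.status === 401 && demote401) return "sign-in";
  return "blank";
}

const auth = { authorization: "Bearer example-token" };
console.log(renderer(client(WWW + "/auth/session", auth), { demote401: false })); // blank
console.log(renderer(client(APEX + "/auth/session", auth), { demote401: false })); // app
console.log(renderer(client(WWW + "/auth/session", auth), { demote401: true })); // sign-in
```

No single function in the model is wrong on its own. The blank screen only appears when all four run in sequence against the www host.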

The Sessions That Didn't Find It

I did not enumerate the stack on day one. That is the part of this story I am most embarrassed about, and it is also the part that has the most to do with the post's actual subject.

The investigation ran across several sessions. Each one started by reading a handoff file written by the previous one. Each handoff said the bug was somewhere in the OAuth flow. Each session followed the handoff into OAuth-flow territory.

Session one hypothesized that the Electron main process was failing on outbound HTTPS because of a vendored undici bug. This was a real bug. It produced PR #14 (swap globalThis.fetch for electron.net.fetch, which uses Chromium's network stack instead of bundled undici). It did not fix the Sign-In bug.

Session two hypothesized that the PKCE verifier was being lost across process death. This was also a real bug. It produced PR #16 (a file-backed storage adapter for the verifier so it survives force-quit). It did not fix the Sign-In bug.

Session three hypothesized that the OAuth callback URL was being silently dropped by macOS LaunchServices because the bundled Info.plist was missing CFBundleURLTypes. Real bug. PR #11 declared the URL types via electron-builder's protocols field. Did not fix the Sign-In bug.

Three sessions, three real bugs found and fixed, none of them the bug.

This was mistake number 1.

I had been treating each handoff as if it were a faithful summary. It wasn't. It was the summary the previous session had managed to write before being clobbered by a parallel session writing its own handoff-main.md. The OAuth-flow framing was the part that had survived the clobber. The auth-stack-as-redirect-victim framing was the part that had not.

The Asymmetric Probe

The probe that solved it was two curls and a diff.

A probe that returns the same answer regardless of the variable being tested teaches nothing. The first three sessions had been running probes shaped like "does it work?" with one configuration. Always no. Each "no" ruled out exactly nothing.

The probe that worked was two requests, identical except for hostname. One curl to the apex with a valid Bearer token. One curl to www with the same Bearer token. Apex returned 200 and the session JSON. www returned 301; curl followed the redirect to apex, and apex returned 401 because the second leg had no Authorization header.

Same input. Same backend. Different outcome.

Two probes, two outcomes, one variable. That asymmetry is a localization. The redirect chain was the only thing that varied between the two; therefore the redirect chain was where the bug lived. There was no need to read more code, file more PRs, or hypothesize new flows. The hypothesis space had collapsed to a single point.
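The "one variable, two outcomes" test can be written down as a small check. This is a sketch under assumptions: the probe record shape, the compareProbes name, and the /auth/session path are invented for illustration, and the outcomes are the status codes reported above.

```javascript
// Compare two probes: which inputs differ, and do the outcomes differ?
// The record shape { inputs, outcome } is a hypothetical convention.
function compareProbes(a, b) {
  const keys = new Set([...Object.keys(a.inputs), ...Object.keys(b.inputs)]);
  const varied = [...keys].filter((k) => a.inputs[k] !== b.inputs[k]);
  const asymmetric = a.outcome !== b.outcome;
  return {
    varied,
    asymmetric,
    // A pair localizes a bug only when exactly one input varied
    // and the outcomes split.
    localizes: asymmetric && varied.length === 1,
  };
}

const apexProbe = {
  inputs: { host: "app.example", token: "same-token", path: "/auth/session" },
  outcome: 200,
};
const wwwProbe = {
  inputs: { host: "www.app.example", token: "same-token", path: "/auth/session" },
  outcome: 401,
};

const result = compareProbes(apexProbe, wwwProbe);
console.log(result.varied, result.localizes); // host is the only varied input; localizes is true

// The earlier "does it work?" probes: same configuration, same answer.
const repeat = compareProbes(wwwProbe, wwwProbe);
console.log(repeat.localizes); // false
```

A pair with no varied input, or with several, returns localizes: false. That is the formal version of "each no ruled out exactly nothing."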

The fix shipped as PR #18. It was a one-line change: flip the default API_BASE from the www host to the apex. The Electron main process would now call apex directly. No redirect, no header strip, no 401, no fatal error in the renderer.

The probe took thirty seconds to design. I had been guessing for several days because nobody had written down "two curls, one variable" as the next thing to try. Probes are cheap. Guessing is expensive. We do them in the wrong proportion.

The Wipe

The next session reported the app was still broken.

The PR #18 fix had landed. Sign-in worked. But if you closed the app and re-opened it after about an hour, the session was gone. Re-sign-in worked. Close and re-open: gone again. The renderer was back to the same blank-screen-on-launch behavior, except now the cause was a different layer entirely.

Two storage systems held the access token. Keytar (a Node binding to the OS keychain) stored one copy at sign-in time. Supabase-js's auth-storage adapter stored another copy in a file on disk. The two copies looked like harmless redundancy. Belt-and-suspenders. They weren't: nothing kept them in sync.

Supabase-js silently overrides any user-provided auth.storage adapter when persistSession: false is passed to createClient. The relevant lines are in node_modules/@supabase/auth-js/dist/main/GoTrueClient.js at lines 185 to 204: if the flag is false, the client substitutes its own in-memory adapter and discards yours. The contract was documented but easy to miss; the API doesn't warn or error when both options are passed together.
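Since the library does not warn, the warning can live in the caller. The guard below is a hypothetical helper, not part of supabase-js. It assumes the override behaves as described above and that the options are passed under an auth key with storage and persistSession fields.

```javascript
// Hypothetical pre-flight check for the auth options handed to createClient.
// Returns a list of human-readable problems; an empty list means no conflict.
function checkAuthOptions(auth = {}) {
  const problems = [];
  if (auth.storage && auth.persistSession === false) {
    problems.push(
      "auth.storage is set but persistSession is false; " +
        "the custom adapter will be replaced by an in-memory one"
    );
  }
  return problems;
}

// A stand-in adapter object; only its presence matters to the check.
const fileAdapter = { getItem() {}, setItem() {}, removeItem() {} };

console.log(checkAuthOptions({ storage: fileAdapter, persistSession: false }).length); // 1
console.log(checkAuthOptions({ storage: fileAdapter, persistSession: true }).length); // 0
```

Calling this before createClient and throwing on a non-empty result turns a silent override into a startup failure.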

So supabase-js was refreshing only its own copy. Keytar held the original sign-in token forever. After about an hour, the original token expired. Supabase-js's _recoverAndRefresh ran on the next session check, saw an expired token, and cleared keytar. The next launch read keytar, found nothing, and the renderer fell back to a blank screen.

The wipe had been happening for who knows how long. The bug was visible only after I added a logger to the auth-storage adapter:

[auth-storage] removeItem('access_token') → null (was: 1234ch)

The line is plain English, sized for human eyes. Once it existed, the bug was fixable in five minutes. The fix shipped as PR #19: mirror supabase-js session changes back into keytar via onAuthStateChange. Both stores stay in sync; refreshes propagate.
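A minimal sketch of the mirroring idea, not the code from PR #19. The function name, the service and account strings, and the in-memory stand-ins are assumptions. The only library shapes relied on are onAuthStateChange taking an (event, session) callback and a keychain object with keytar-style setPassword and deletePassword.

```javascript
// Mirror every session change into the keychain so both stores agree.
// `service` and `account` are hypothetical keychain identifiers.
function mirrorSessionToKeychain(supabase, keychain, service = "example-app", account = "session") {
  return supabase.auth.onAuthStateChange((event, session) => {
    if (session) {
      // Sign-in and token refresh both arrive with a session: store the latest.
      return keychain.setPassword(service, account, JSON.stringify(session));
    }
    if (event === "SIGNED_OUT") {
      return keychain.deletePassword(service, account);
    }
  });
}

// In-memory stand-ins so the sketch runs without Electron, keytar, or a network.
const store = new Map();
const fakeKeychain = {
  setPassword: async (service, account, value) => {
    store.set(`${service}/${account}`, value);
  },
  deletePassword: async (service, account) => store.delete(`${service}/${account}`),
};
let emit;
const fakeSupabase = { auth: { onAuthStateChange: (cb) => { emit = cb; } } };

mirrorSessionToKeychain(fakeSupabase, fakeKeychain);
emit("SIGNED_IN", { access_token: "first" });
emit("TOKEN_REFRESHED", { access_token: "second" });
console.log(JSON.parse(store.get("example-app/session")).access_token); // second
```

The point of the design is that the keychain never holds a token older than the one the client is using, so an expired original cannot outlive its refresh.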

This was mistake number 2.

I had been trying to fix this bug while it was invisible. Every "fix attempt" was a guess. The fix took five minutes once the wipe was logged. The instrumentation took ten minutes to write. I had spent two hours guessing first, in the order that always feels productive in the moment and never actually is.
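The instrumentation amounts to a wrapper around the storage adapter. The sketch below assumes an adapter exposing getItem, setItem, and removeItem, and reproduces the log format quoted above. The withLogging name and the choice to log lengths rather than values are illustrative, not the original logger.

```javascript
// Wrap a storage adapter so every call leaves a one-line trace.
// Values are logged by length only, so tokens never reach the log.
function withLogging(adapter, log = console.log) {
  const size = (v) => (v == null ? "null" : `${String(v).length}ch`);
  return {
    async getItem(key) {
      const value = await adapter.getItem(key);
      log(`[auth-storage] getItem('${key}') → ${size(value)}`);
      return value;
    },
    async setItem(key, value) {
      const before = await adapter.getItem(key);
      await adapter.setItem(key, value);
      log(`[auth-storage] setItem('${key}') → ${size(value)} (was: ${size(before)})`);
    },
    async removeItem(key) {
      // Read the old value first so the log shows what was destroyed.
      const before = await adapter.getItem(key);
      await adapter.removeItem(key);
      log(`[auth-storage] removeItem('${key}') → null (was: ${size(before)})`);
    },
  };
}
```

Passing withLogging(adapter) wherever the plain adapter was used is enough to make a wipe show up as a removeItem line with a non-null "was" value.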

The Handoff

Three months. Four repos. Thirty-two silent overwrites of handoff-main.md.

Handoffs are the cross-session memory mechanism. A session writes the next session's intent, the cost so far, the failed theories, the next concrete action. The next session reads it on startup. Without handoffs, every session starts from scratch on a problem the last session was halfway through.

The convention had no enforcement. Branch was the only key in the filename. On main, where most parallel work happens, two sessions wanting to write handoff-main.md would race. Last writer wins. Earlier writer's content is gone. No error, no warning. The losing session's handoff is reduced to whatever the surviving handoff happens to contain.
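The race needs no concurrency machinery to reproduce. In this sketch a Map stands in for the filesystem, and the session contents are invented for illustration.

```javascript
// Branch is the only key in the filename, so two sessions on main share a path.
const disk = new Map(); // stand-in for the filesystem
const handoffPath = (branch) => `handoff-${branch}.md`;

function writeHandoff(branch, content) {
  disk.set(handoffPath(branch), content); // no check, no warning
}

writeHandoff("main", "Session A: suspect the www to apex redirect chain");
writeHandoff("main", "Session B: suspect the OAuth flow");

console.log(disk.size); // 1
console.log(disk.get("handoff-main.md")); // Session B: suspect the OAuth flow
```

Session A's theory is gone, and nothing in the surviving file says it ever existed.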

The investigation had been running across several sessions for four days. Every time a session wrote its handoff, a parallel "let me draft a related rule" or "let me clean up a tangential lesson file" session wrote its own handoff to the same path, and one of them clobbered the other. The investigation history was a Frankenstein of partial summaries from whichever sessions had won the last race. The OAuth-flow framing kept surviving because every clobbering session happened to also be working on auth in some form. The redirect-chain framing kept getting deleted because the sessions that had it were the ones that lost the races.

This was mistake number 3.

I had treated the handoff like a source of record. It wasn't. It was a single point of failure with no integrity check. Memory you can audit is memory you can verify, surgically, when something feels off. Memory you absorb is memory you act on. The handoff system was the second kind, and the second kind has to be right or it has to be loud about being wrong. Mine was neither.

The handoff didn't lie. It just stopped including the part that mattered, and there was no signal that the part it stopped including was the part that mattered.

Two Sides Of The Same Fix

The fix had to have two faces because the bug had two faces.

The desktop side had three pieces. PR #18 flipped the default API base from www to apex, sidestepping the redirect entirely. PR #19 mirrored supabase-js session changes into keytar via onAuthStateChange, so both stores stay in sync and refreshes propagate. A separate Supabase Google client-secret rotation cleared an intermittent OAuth provider failure that had been masquerading as a separate bug for weeks. None of these would have shipped in time without the asymmetric probe.

The handoff side also had three pieces. A filename grammar that makes collisions impossible: handoff-<branch>-<topic>.md, with the topic suffix mandatory on shared branches. An ownership marker in every file: the first line is exactly <!-- claude-session: <session_id> -->, attesting which session owns the content. And a PreToolUse hook (handoff-write-guard.mjs) that reads the writer's session_id from stdin, compares against the on-disk marker, and blocks the write on mismatch. Grammar makes collisions impossible. Marker makes overwrites detectable. Hook turns detection into a block.
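The three pieces can be sketched as pure functions. This is not the contents of handoff-write-guard.mjs: the function names, the set of shared branches, and the rule that new content must carry the writer's own marker are assumptions made for illustration. A real hook would additionally parse the session id and the pending write from stdin and signal the block to the tool runner.

```javascript
const SHARED_BRANCHES = new Set(["main"]); // assumption: which branches count as shared
const MARKER = /^<!-- claude-session: (\S+) -->$/;

// Grammar: handoff-<branch>-<topic>.md, topic mandatory on shared branches.
function handoffFilename(branch, topic) {
  if (!topic && SHARED_BRANCHES.has(branch)) {
    throw new Error(`topic suffix is mandatory on shared branch "${branch}"`);
  }
  return topic ? `handoff-${branch}-${topic}.md` : `handoff-${branch}.md`;
}

// Marker: the first line names the owning session, or there is no owner.
function ownerOf(content) {
  const firstLine = content.split("\n", 1)[0];
  const match = MARKER.exec(firstLine);
  return match ? match[1] : null;
}

// Hook decision: allow only when the writer owns both the new and the old content.
function decideWrite({ sessionId, existingContent, newContent }) {
  if (ownerOf(newContent) !== sessionId) {
    return { allow: false, reason: `first line must be <!-- claude-session: ${sessionId} -->` };
  }
  if (existingContent == null) return { allow: true }; // new file, correctly marked
  const owner = ownerOf(existingContent);
  if (owner !== sessionId) {
    return { allow: false, reason: `file is owned by ${owner ?? "an unmarked session"}` };
  }
  return { allow: true };
}

const marked = "<!-- claude-session: s-1 -->\n# notes";
const fresh = decideWrite({ sessionId: "s-1", existingContent: null, newContent: "# notes" });
const retry = decideWrite({ sessionId: "s-1", existingContent: null, newContent: marked });
const clobber = decideWrite({
  sessionId: "s-2",
  existingContent: marked,
  newContent: "<!-- claude-session: s-2 -->\n# other",
});
console.log(fresh.allow, retry.allow, clobber.allow); // false true false
```

The fresh write is blocked once because the session does not yet know its own id; the block reason tells it, and the retry passes.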

Without the desktop side, the handoff side wouldn't have helped. Each session would have a clean handoff and still a broken app.

Without the handoff side, the desktop side wouldn't have helped either. The next time a similar multi-session investigation kicked off, the same clobber pattern would have erased whatever lessons the debugging produced.

The bug had two faces, and the fix had to have two faces too.

Twelve hours of telemetry after the handoff guard deployed: six distinct sessions blocked, exactly once each. The original plan predicted "one block per fresh handoff, no escalation." Both bounds held. Six different session_ids, six entries in the block log, no session_id appearing twice. The block was the chicken-and-egg first-write cost (every fresh handoff has to learn its session_id from the block message and retry once); no session was repeatedly fighting the marker. The structural prevention held against real load.
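The "exactly once each" claim is a check that can be run against a block log. The entry shape and the sample ids below are made up for illustration. They are not the real telemetry.

```javascript
// Count blocks per session and list any session blocked more than once.
// Entries are assumed to look like { sessionId: "..." }.
function blocksPerSession(entries) {
  const counts = new Map();
  for (const { sessionId } of entries) {
    counts.set(sessionId, (counts.get(sessionId) ?? 0) + 1);
  }
  return counts;
}

function repeatOffenders(entries) {
  return [...blocksPerSession(entries)].filter(([, n]) => n > 1).map(([id]) => id);
}

// Hypothetical log: six sessions, one block each.
const sampleLog = ["a", "b", "c", "d", "e", "f"].map((sessionId) => ({ sessionId }));

console.log(blocksPerSession(sampleLog).size); // 6
console.log(repeatOffenders(sampleLog).length); // 0
```

A non-empty repeatOffenders result would mean a session was fighting the marker rather than paying the one-time first-write cost.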

Structural Prevention Beats Vigilance

The bigger lesson is not about handoffs. The bigger lesson is about the difference between memory you can audit and memory you absorb. Audit is structural. Absorption is vigilance. The two are not interchangeable, and the foundation of multi-session work has been mostly the second.

Structural prevention beats vigilance, and 32 is what vigilance gets you. That part is its own post.