Memory you can't trust
Why the Workbench OAuth bug took five sessions to find, and the contract that wasn't enforced
The Sign-In button never rendered.
Workbench is the side-by-side Electron app I'd built to drain a Postgres action queue from two laptops at once. The renderer is a thin React surface; the main process holds a Supabase session in keychain and forwards API calls. After about a week of working fine, it started showing a blank screen on launch. No login UI, no error toast, no clue.
The console said the auth check returned 401. The auth check was supposed to either return a session or fail to a Sign-In screen, both of which the renderer handled. A 401 was supposed to be impossible. The renderer treated the impossible 401 as a fatal error and refused to render anything, including the very screen that would have let me sign in again.
I spent four days finding the cause. The cause was three lines of Cloudflare config and one missing event listener. The reason it took four days was something else entirely.
I had built a memory system that lost the lesson before each session could learn it.
01
The Stack
The first move on a stuck investigation is to enumerate the stack. List every layer between the input and the user-visible symptom. Most layers turn out to be obviously fine; the one that isn't stands out, and the bug stops being a guessing game.
The stack here was four layers deep.
Layer one. Cloudflare Workers Custom Domain on www.workbench.example issued a 301 to the apex. Innocuous. That is what 301s do.
Layer two. The Electron main process used globalThis.fetch to call the auth endpoint. fetch follows redirects by default, and the WHATWG Fetch standard says the Authorization header must be stripped when a redirect crosses to a different origin, because the original credential was scoped to the original host. Not every HTTP client implements the rule; Electron's bundled undici does, strictly. The Bearer token was on the original www request and not on the followed apex request.
Layer three. The backend at the apex received a request with no Authorization header and correctly returned 401. There was nothing wrong with this layer at all. It was doing what every authenticated endpoint does.
Layer four. The renderer received the 401 from a path that should have been authenticated, treated it as a fatal error, and refused to render. This was a renderer bug, but a small one: a missing branch in an error handler that should have demoted the 401 to "show Sign-In" instead of "show nothing."
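The missing branch is small enough to sketch. Assuming a renderer that maps the startup auth check's outcome to a screen (the function and result shape here are illustrative, not Workbench's actual code), the fix is one extra case:

```javascript
// Map the result of the startup auth check to a renderer screen.
// Before the fix, a 401 fell through to the fatal branch and the
// renderer drew nothing, including the Sign-In screen itself.
function screenForAuthCheck(result) {
  if (result.session) return 'app';            // valid session: render the app
  if (result.status === 401) return 'sign-in'; // the missing branch: demote, don't die
  if (result.signedOut) return 'sign-in';      // the expected signed-out path
  return 'fatal-error';                        // reserve fatality for the truly unknown
}

console.log(screenForAuthCheck({ status: 401 })); // sign-in
```

The point is the ordering: an unauthenticated answer on an authenticated path is a reason to show Sign-In, not a reason to show nothing.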
Each layer was internally consistent. Each layer's failure mode looked, in isolation, like correct behavior. The bug was the combination. Strip a header here, return 401 there, treat 401 as fatal, render nothing. Four reasonable behaviors, composed into one user-visible disaster.
02
The Sessions That Didn't Find It
I did not enumerate the stack on day one. That is the part of this story I am most embarrassed about, and it is also the part that has the most to do with the post's actual subject.
The Workbench investigation ran across several sessions. Each one started by reading a handoff file written by the previous one. Each handoff said the bug was somewhere in the OAuth flow. Each session followed the handoff into OAuth-flow territory.
Session one hypothesized that the Electron main process was failing on outbound HTTPS because of a vendored undici bug. This was a real bug. It produced PR #4 (swap globalThis.fetch for electron.net.fetch, which uses Chromium's network stack instead of bundled undici). It did not fix the Sign-In bug.
Session two hypothesized that the PKCE verifier was being lost across process death. This was also a real bug. It produced PR #5 (a file-backed storage adapter for the verifier so it survives force-quit). It did not fix the Sign-In bug.
Session three hypothesized that the OAuth callback URL was being silently dropped by macOS LaunchServices because the bundled Info.plist was missing CFBundleURLTypes. Real bug. PR #3 declared the URL types via electron-builder's protocols field. Did not fix the Sign-In bug.
Three sessions, three real bugs found and fixed, none of them the bug.
This was mistake number 1.
I had been treating each handoff as if it were a faithful summary. It wasn't. It was the summary the previous session had managed to write before being clobbered by a parallel session writing its own handoff-main.md. The OAuth-flow framing was the part that had survived the clobber. The auth-stack-as-redirect-victim framing was the part that had not.
03
The Asymmetric Probe
The probe that solved it was two curls and a diff.
A probe that returns the same answer regardless of the variable being tested teaches nothing. The first three sessions had been running probes shaped like "does it work?" with one configuration. Always no. Each "no" ruled out exactly nothing.
The probe that worked was identical except for hostname. One curl to the apex with a valid Bearer token. One curl to www with the same Bearer token. Apex returned 200 and the session JSON. www returned 301, the redirect followed to apex, and apex returned 401 because the second leg had no Authorization header.
Same input. Same backend. Different outcome.
Two probes, two outcomes, one variable. That asymmetry is a localization. The redirect chain was the only thing that varied between the two; therefore the redirect chain was where the bug lived. There was no need to read more code, file more PRs, or hypothesize new flows. The hypothesis space had collapsed to a single point.
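The asymmetry also reproduces locally, without Cloudflare. A sketch assuming Node 18 or later, whose global fetch is the same undici Electron bundles: one throwaway server plays the apex and checks the Bearer token, another plays www and answers with a 301 across origins. A port difference is an origin difference under the Fetch standard, so the redirected leg loses the header exactly as the real www-to-apex hop did.

```javascript
import http from 'node:http';
import { once } from 'node:events';

// "Apex": an authenticated endpoint. 200 with the token, 401 without.
const apex = http.createServer((req, res) => {
  const ok = req.headers.authorization === 'Bearer test-token';
  res.writeHead(ok ? 200 : 401).end();
});
apex.listen(0);
await once(apex, 'listening');
const apexUrl = `http://127.0.0.1:${apex.address().port}/`;

// "www": nothing but a 301 to the apex, like the custom-domain redirect.
const www = http.createServer((req, res) => {
  res.writeHead(301, { location: apexUrl }).end();
});
www.listen(0);
await once(www, 'listening');
const wwwUrl = `http://127.0.0.1:${www.address().port}/`;

// Two probes, one variable: which host receives the first request.
const headers = { authorization: 'Bearer test-token' };
const direct = await fetch(apexUrl, { headers });    // no redirect: header arrives
const redirected = await fetch(wwwUrl, { headers }); // 301 followed: header stripped

console.log(direct.status, redirected.status); // 200 401

apex.close();
www.close();
apex.closeAllConnections();
www.closeAllConnections();
```

Same token, same backend handler, different first hop; the 401 appears only on the redirected leg.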
The fix shipped as PR #6, commit 0153f47. It was a one-line change: flip the default WORKBENCH_API_BASE from the www host to the apex. The Electron main process would now call apex directly. No redirect, no header strip, no 401, no fatal error in the renderer.
The probe took thirty seconds to design. I had been guessing for several days because nobody had written down "two curls, one variable" as the next thing to try. Probes are cheap. Guessing is expensive. We do them in the wrong proportion.
04
The Wipe
The next session reported the app was still broken.
The PR #6 fix had landed. Sign-in worked. But if you closed the app and re-opened it after about an hour, the session was gone. Re-sign-in worked. Close and re-open: gone again. The renderer was back to the same blank-screen-on-launch behavior, except now the cause was a different layer entirely.
Two storage systems held the access token. Keytar (the OS keychain) stored one copy at sign-in time. Supabase-js's auth-storage adapter stored another copy in a file on disk. Both looked redundant. Belt-and-suspenders. They weren't.
Supabase-js silently overrides any user-provided auth.storage adapter when persistSession: false is passed to createClient. The relevant lines are in node_modules/@supabase/auth-js/dist/main/GoTrueClient.js, lines 185 to 204: if the flag is false, the client substitutes its own in-memory adapter and discards yours. The contract was documented but easy to miss; the API doesn't warn or error when both options are passed together.
So supabase-js was refreshing only its own copy. Keytar held the original sign-in token forever. After about an hour, the original token expired. Supabase-js's _recoverAndRefresh ran on the next session check, saw an expired token, and cleared keytar. The next launch read keytar, found nothing, and the renderer demoted to blank screen.
The wipe had been happening for who knows how long. The bug was visible only after I added a logger to the auth-storage adapter:
[auth-storage] removeItem('access_token') → null (was: 1234ch)
The line is plain English, sized for human eyes. Once it existed, the bug was fixable in five minutes. The fix shipped as PR #7, commit 83f9b87: mirror supabase-js session changes back into keytar via onAuthStateChange. Both stores stay in sync; refreshes propagate.
This was mistake number 2.
I had been trying to fix this bug while it was invisible. Every "fix attempt" was a guess. The fix took five minutes once the wipe was logged. The instrumentation took ten minutes to write. I had spent two hours guessing first, in the order that always feels productive in the moment and never actually is.
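The ten-minute instrumentation was a wrapper, not a rewrite. A sketch of a logging adapter around any storage-shaped object; the method names follow supabase-js's storage interface (getItem, setItem, removeItem), and the log format is the one-line plain-English style shown above:

```javascript
// Wrap a storage adapter so every mutation prints one plain-English
// line, sized for human eyes.
function withLogging(storage, log = console.log) {
  return {
    getItem: (key) => storage.getItem(key),
    async setItem(key, value) {
      const was = await storage.getItem(key);
      log(`[auth-storage] setItem('${key}') → ${value.length}ch (was: ${was ? was.length + 'ch' : 'null'})`);
      return storage.setItem(key, value);
    },
    async removeItem(key) {
      const was = await storage.getItem(key);
      log(`[auth-storage] removeItem('${key}') → null (was: ${was ? was.length + 'ch' : 'null'})`);
      return storage.removeItem(key);
    },
  };
}

// In-memory stand-in for the real file-backed adapter, to show the shape.
const lines = [];
const mem = new Map();
const adapter = withLogging(
  {
    getItem: (k) => mem.get(k) ?? null,
    setItem: (k, v) => void mem.set(k, v),
    removeItem: (k) => void mem.delete(k),
  },
  (line) => lines.push(line),
);

await adapter.setItem('access_token', 'x'.repeat(1234));
await adapter.removeItem('access_token');
console.log(lines[1]); // [auth-storage] removeItem('access_token') → null (was: 1234ch)
```

The wrapped adapter behaves identically to the original; the only addition is the evidence trail.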
05
The Handoff
Three months. Four repos. Thirty-two silent overwrites of handoff-main.md.
Handoffs are the cross-session memory mechanism. A session writes the next session's intent, the cost so far, the failed theories, the next concrete action. The next session reads it on startup. Without handoffs, every session starts from scratch on a problem the last session was halfway through.
The convention had no enforcement. Branch was the only key in the filename. On main, where most parallel work happens, two sessions wanting to write handoff-main.md would race. Last writer wins. Earlier writer's content is gone. No error, no warning. The losing session's handoff is reduced to whatever the surviving handoff happens to contain.
The Workbench investigation had been running across several sessions for four days. Every time a session wrote its handoff, a parallel "let me draft a related rule" or "let me clean up a tangential lesson file" session wrote its own handoff to the same path, and one of them clobbered the other. The Workbench investigation history was a Frankenstein of partial summaries from whichever sessions had won the last race. The OAuth-flow framing kept surviving because every clobbering session happened to also be working on auth in some form. The redirect-chain framing kept getting deleted because the sessions that had it were the ones that lost the races.
A fake heckler in the back is shouting that obviously the handoff was the bug, I should have read it skeptically the first time. You're right, oddly-well-informed heckler, and I should have. The reason I didn't is that handoffs read like authoritative summaries. They are written by past versions of me. The voice is mine. The structure is the one I taught the model to use. A partial-but-plausible summary in my own voice triggers exactly the same response as a complete one: nod, absorb, proceed. A blank handoff would have triggered the right response, which is "ask the user what was tried." A wrong-but-plausible handoff didn't trigger any response, because there was no signal it was wrong.
This was mistake number 3.
I had treated the handoff like a source of record. It wasn't. It was a single point of failure with no integrity check. Memory you can audit is memory you can verify, surgically, when something feels off. Memory you absorb is memory you act on. The handoff system was the second kind, and the second kind has to be right or it has to be loud about being wrong. Mine was neither.
06
Two Sides Of The Same Fix
The fix had to have two faces because the bug had two faces.
The Workbench side had three pieces. PR #6 (commit 0153f47) flipped the default API base from www to apex, sidestepping the redirect entirely. PR #7 (commit 83f9b87) mirrored supabase-js session changes into keytar via onAuthStateChange, so both stores stay in sync and refreshes propagate. A separate Supabase Google client-secret rotation cleared an intermittent OAuth provider failure that had been masquerading as a separate bug for weeks. None of these would have shipped in time without the asymmetric probe.
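The PR #7 mirror is a policy small enough to state as a function. A sketch with a Map standing in for keytar; the event names are supabase-js's real onAuthStateChange events, while the store interface and the function name are mine:

```javascript
// Mirror supabase-js auth state changes into a second store, so the
// keychain copy tracks every refresh instead of holding the original
// sign-in token forever. `store` stands in for keytar here.
function mirrorAuthChange(event, session, store) {
  if (event === 'SIGNED_IN' || event === 'TOKEN_REFRESHED') {
    store.set('access_token', session.access_token); // keep the mirror current
  } else if (event === 'SIGNED_OUT') {
    store.delete('access_token'); // explicit sign-out clears the mirror too
  }
}

// Wired up in the Electron main process it would look roughly like:
//   supabase.auth.onAuthStateChange((event, session) =>
//     mirrorAuthChange(event, session, keytarBackedStore));
const store = new Map();
mirrorAuthChange('SIGNED_IN', { access_token: 'tok-original' }, store);
mirrorAuthChange('TOKEN_REFRESHED', { access_token: 'tok-refreshed' }, store);
console.log(store.get('access_token')); // tok-refreshed
```

Once the refresh event writes through, the keychain can never again hold a token older than the one supabase-js is actually using.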
The handoff side also had three pieces. A filename grammar that makes collisions impossible: handoff-<branch>-<topic>.md, with the topic suffix mandatory on shared branches. An ownership marker in every file: the first line is exactly <!-- claude-session: <session_id> -->, attesting which session owns the content. And a PreToolUse hook (handoff-write-guard.mjs) that reads the writer's session_id from stdin, compares against the on-disk marker, and blocks the write on mismatch. Grammar makes collisions impossible. Marker makes overwrites detectable. Hook turns detection into a block.
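The guard's decision fits in a page. A sketch of the core of a hook like handoff-write-guard.mjs; the marker grammar is the one above, the stdin parsing and exit-code plumbing of a real PreToolUse hook are elided, and decideWrite is an illustrative name:

```javascript
// Core decision of the handoff write guard: a session may create a
// fresh handoff, or overwrite one whose ownership marker carries its
// own session_id. Anything else is a cross-session clobber and blocks.
const MARKER = /^<!-- claude-session: (\S+) -->/;

function decideWrite(writerSessionId, existingContent) {
  if (existingContent === null) {
    return { allow: true, reason: 'fresh handoff' };             // first write: allowed
  }
  const m = existingContent.match(MARKER);
  if (!m) {
    return { allow: false, reason: 'missing ownership marker' }; // unclaimed legacy file
  }
  if (m[1] !== writerSessionId) {
    return { allow: false, reason: `owned by session ${m[1]}` }; // the 32-clobber case
  }
  return { allow: true, reason: 'same session' };
}

console.log(decideWrite('s-2', '<!-- claude-session: s-1 -->\nnotes').allow); // false
```

In the real hook, a disallowed write means exiting with the blocking status the hook runner expects and the reason on stderr, which is what surfaces the collision in the writer's own console.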
Without the Workbench side, the handoff side wouldn't have helped. Each session would have a clean handoff and still a broken Workbench.
Without the handoff side, the Workbench side wouldn't have helped either. The next time a similar multi-session investigation kicked off, the same clobber pattern would have erased whatever lessons the Workbench debugging produced.
The bug had two faces, and the fix had to have two faces too.
Twelve hours of telemetry after the handoff guard deployed: six distinct sessions blocked, exactly once each. The original plan predicted "one block per fresh handoff, no escalation." Both bounds held. Six different session_ids, six entries in the block log, no session_id appearing twice. Each block was the expected chicken-and-egg first-write cost (every fresh handoff has to learn its session_id from the block message and retry once); no session was repeatedly fighting the marker. The structural prevention held against real load.
07
Structural Prevention Beats Vigilance
The bigger lesson is not about handoffs. The bigger lesson is about the difference between memory you can audit and memory you absorb.
Vigilance is the wrong layer for high-frequency low-salience hazards. Both humans and models fail at "remember to do X every single time forever," and both succeed at "respond to a clear signal when one fires." Structural prevention either makes the hazard impossible (filename grammar enforces uniqueness, so two sessions cannot pick the same filename) or makes failure loud (the hook block surfaces the collision in the writer's own console, in the writer's own session, before the bad write happens). Either is dramatically more reliable than asking the operator to be perpetually careful about a hazard they cannot see.
The number 32 is what happens when prevention gets delegated. None of those 32 clobbers were a discipline failure. Every one happened because the writing session genuinely could not see the collision. Same filename, no marker, no hook, equals silent data loss every time, no matter how careful any one session was. The cost of one hook plus one rule is far below the cumulative cost of the next 32 incidents. It is also far below the cost of one incident that produces four days of debugging the wrong layer.
The contract between two parties is the unit of failure. Cursor-delete was a contract between an indexer and a supervisor; both parties read the same file and disagreed on what its absence meant. Hash-skip was a contract between a hash function and the page it was hashing; the page contained telemetry the hash function couldn't see. Handoff-clobbering is a contract between two parallel sessions; both wanted to write to the same file, and neither knew the other existed. Three different mechanics. Same pattern. The bug is not in either party. It is in the unwritten thing between them.
This is the third post in a small series about substrate failures of LLM-driven multi-session development. The first was about a hash that drifted because the page it was hashing turned out to contain telemetry. The second was about two pieces of code that agreed on a file's name and disagreed on its meaning. This one is about a memory layer that returned partial-but-plausible content and was trusted because it looked authoritative. There is at least one more in the queue: a memory index that silently failed to register about a quarter of the files it was supposed to track. They are all about the same thing, which is that the bug is rarely in the layer; it is almost always in the contract between two layers, or between two sessions, or between a session and itself, and the contract is almost always unwritten.
Structural prevention beats vigilance, and 32 is what vigilance gets you.