Why Email?
Nate's guide cracked the daily capture habit. You type a thought into Slack, it gets embedded and classified in five seconds, and any AI you use can search it by meaning. That's genuinely transformative — but it captures what you decide to capture.
Your email is different. It's thinking you already did. Every email you write is a decision, a position, a relationship update, or a problem you solved. You wrote it, you sent it, and it immediately became invisible to your AI tools.
When we looked at 30 days of sent mail — project updates, client advice, personal advocacy, community work — it was 153 distinct thoughts that an AI couldn't access until now. Long emails explaining complex situations. Narrative summaries of ongoing projects. Strategic advice to colleagues. All of it, in your voice, about the things that actually matter to you.
Pull-based, not push-based. Nate's setup is push-based: you decide what to capture and push it in. This adds a parallel track that reaches out to where your thinking already lives and brings it in on your schedule.
RAG chunking for long documents. Embedding a 1,500-word email as a single vector produces a blurry average of a dozen topics. The fix: split it into 300-word segments, each embedded separately. When you search, you get the relevant section, not a low-confidence match on the whole thing.
More infrastructure. Google Cloud OAuth, schema migrations, updated Edge Functions. Budget an hour, not ten minutes. Every failure mode is documented below.
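The chunking idea above can be sketched in a few lines. This is a minimal illustration, not the repo's actual chunker; the function name and the 300-word threshold are taken from the description above, everything else is invented:

```typescript
// Sketch: paragraph-first chunking at ~300 words per chunk.
// Note: a single paragraph longer than maxWords still lands as one
// oversized chunk, so a real pipeline also needs a hard-split fallback.
function chunkEmail(text: string, maxWords = 300): string[] {
  const paragraphs = text.split(/\n\s*\n/);
  const chunks: string[] = [];
  let current: string[] = [];
  let count = 0;
  for (const p of paragraphs) {
    const words = p.trim().split(/\s+/).filter(Boolean).length;
    if (count + words > maxWords && current.length > 0) {
      chunks.push(current.join("\n\n"));   // flush the chunk in progress
      current = [];
      count = 0;
    }
    current.push(p.trim());
    count += words;
  }
  if (current.length > 0) chunks.push(current.join("\n\n"));
  return chunks;
}
```

Each chunk gets embedded separately, so a search hit points at the relevant 300-word section rather than a blurry average of the whole email.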
How We Built It
This pipeline was built in a single session with Dr. Brian — an AI agent running in Cursor. We're sharing the story because the hard parts weren't where we expected them.
What we thought would be hard: Gmail OAuth. Getting permission to read someone's email requires Google Cloud Console setup, consent screens, scopes, token refresh. It's genuinely involved.
What was actually hard:
Gmail's line-wrapping breaks quote detection
When you reply to an email, Gmail wraps "On Mon, Mar 2 at 8:56 AM Someone wrote:" across 2–3 lines in plain text. Our first stripper only matched it on one line. One reply showed 703 words when the actual reply was "hello."
Supabase's PostgREST cache doesn't update instantly
We added parent_id and chunk_index columns. The SQL ran fine. The REST API that Edge Functions use to talk to Postgres didn't see the new columns — at all — for hours. Tried four ways to reload the schema cache. None worked reliably.
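The workaround that held up was to stop inserting through the table endpoint entirely and go through an RPC, since a function call is not resolved against the cached column list. A sketch of the shape, assuming a supabase-js-style client (the `insert_thought` function is the one defined in the migration later in this guide; the wrapper name is invented):

```typescript
// Sketch: insert via the insert_thought RPC instead of
// .from("thoughts").insert(...), sidestepping a stale PostgREST
// schema cache. `client` is anything with supabase-js's .rpc() shape.
type RpcClient = {
  rpc: (fn: string, args: Record<string, unknown>) => Promise<{ data: unknown; error: unknown }>;
};

async function insertThought(
  client: RpcClient,
  content: string,
  embedding: number[],
  metadata: Record<string, unknown>,
  parentId: string | null = null,
  chunkIndex: number | null = null,
): Promise<string> {
  const { data, error } = await client.rpc("insert_thought", {
    p_content: content,
    p_embedding: embedding,
    p_metadata: metadata,
    p_parent_id: parentId,
    p_chunk_index: chunkIndex,
  });
  if (error) throw new Error(String(error));
  return data as string; // the new row's uuid
}
```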
A travel booking confirmation produced 23 chunks of CSS
The Gmail API returns HTML. Our HTML-to-text conversion preserved too much structure. One booking confirmation was 8,874 words of boilerplate, chunked into 23 meaningless fragments.
The fix: detect CSS-heavy content (`{...}` blocks) and skip it, plus filter sender/subject patterns for transactional email.

The Gmail label API is AND, not OR
Passing SENT and STARRED together returns messages that have both labels. We needed either. This is the opposite of what most people expect.
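Getting OR semantics means querying one label at a time and unioning the message IDs. A minimal sketch of that pattern (the fetcher is a stand-in for the actual Gmail API call, not a real function in the repo):

```typescript
// Sketch: OR semantics over Gmail labels. A single messages.list call
// ANDs its labelIds together, so we query per label and union the IDs.
async function listWithAnyLabel(
  labels: string[],
  fetchIdsForLabel: (label: string) => Promise<string[]>,
): Promise<string[]> {
  const seen = new Set<string>();
  for (const label of labels) {
    for (const id of await fetchIdsForLabel(label)) seen.add(id);
  }
  return [...seen]; // each message appears once, even if multiply labeled
}
```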
A 1,900-word email wouldn't chunk
Our chunking logic uses paragraph breaks as split points. One email had no double-newline breaks — just a wall of text. Paragraph-first splitting produced a single oversized chunk and stopped.
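The fallback for a wall of text is a hard split that ignores paragraph structure entirely. A sketch of the simplest version, splitting on raw word counts (illustrative only; the real script may split on sentence boundaries instead):

```typescript
// Sketch: hard-split fallback for emails with no paragraph breaks.
// Guarantees bounded chunk sizes even for a single wall of text.
function hardSplit(text: string, maxWords = 300): string[] {
  const words = text.trim().split(/\s+/).filter(Boolean);
  const chunks: string[] = [];
  for (let i = 0; i < words.length; i += maxWords) {
    chunks.push(words.slice(i, i + maxWords).join(" "));
  }
  return chunks;
}
```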
Step-by-Step Setup
You'll need Deno installed (`brew install deno` on Mac) and a Google account with Gmail. Budget about an hour.
Credential Tracker
You'll generate one new set of credentials. Add these to your tracker alongside the ones from Nate's guide:
```
GMAIL (Step 2)
Google Cloud Project: ____________
OAuth Client ID: ____________
OAuth Client Secret: ____________
credentials.json saved: yes / no
```
1. Get the code
Clone the repo or pull the latest if you already have it.
```bash
git clone https://github.com/MonkeyRun-com/monkeyrun-open-brain.git
cd monkeyrun-open-brain
# or if you already have it:
git pull
```
2. Create Google Cloud OAuth credentials
Go to console.cloud.google.com → New Project → "Open Brain". Enable the Gmail API. Configure the OAuth consent screen (External, add your Gmail as a test user, add the `gmail.readonly` scope). Create an OAuth Client ID (Desktop app type). Download the JSON and save it as `scripts/credentials.json`.

3. Run the database migration
In the Supabase dashboard → SQL Editor, paste and run:
```sql
-- Add chunking support
ALTER TABLE thoughts
  ADD COLUMN IF NOT EXISTS parent_id uuid REFERENCES thoughts(id) ON DELETE CASCADE,
  ADD COLUMN IF NOT EXISTS chunk_index integer;

CREATE INDEX IF NOT EXISTS thoughts_parent_id
  ON thoughts (parent_id) WHERE parent_id IS NOT NULL;

-- RPC to bypass PostgREST schema cache
CREATE OR REPLACE FUNCTION insert_thought(
  p_content text,
  p_embedding vector(1536),
  p_metadata jsonb,
  p_parent_id uuid DEFAULT NULL,
  p_chunk_index integer DEFAULT NULL
) RETURNS uuid
LANGUAGE plpgsql AS $$
DECLARE
  new_id uuid;
BEGIN
  INSERT INTO thoughts (content, embedding, metadata, parent_id, chunk_index)
  VALUES (p_content, p_embedding, p_metadata, p_parent_id, p_chunk_index)
  RETURNING id INTO new_id;
  RETURN new_id;
END;
$$;
```
4. Deploy the updated Edge Functions
```bash
supabase functions deploy ingest-thought --no-verify-jwt
supabase functions deploy open-brain-mcp --no-verify-jwt
```
5. Set your environment variables
```bash
export INGEST_URL="https://YOUR_PROJECT_REF.supabase.co/functions/v1/ingest-thought"
export INGEST_KEY="your-ingest-key"
```
Add to `~/.zshrc` to make them permanent.

6. First dry run
```bash
deno run --allow-net --allow-read --allow-write --allow-env \
  scripts/pull-gmail.ts --dry-run --window=24h
```
Authorize via the URL it prints. Check the output — are long emails showing `[N chunks]`? Is the content preview clean? When it looks right, drop `--dry-run` to go live.

7. Scale up
```bash
deno run --allow-net --allow-read --allow-write --allow-env \
  scripts/pull-gmail.ts --window=30d --labels=SENT,STARRED
```
Re-running is safe — the sync log tracks ingested IDs and skips duplicates.
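The shape of that idempotence is simple: keep a log of already-ingested Gmail message IDs and filter against it before ingesting. A sketch of the idea (the real script's log format may differ; the function name is invented):

```typescript
// Sketch: idempotent re-runs via a sync log of ingested Gmail IDs.
// Returns only the IDs not yet ingested, and records them in the log.
function filterNew(ids: string[], log: Set<string>): string[] {
  const fresh = ids.filter((id) => !log.has(id));
  for (const id of fresh) log.add(id);
  return fresh;
}
```

Persist the set to disk between runs and a `--window=30d` pull after a `--window=7d` pull only pays for the emails it hasn't seen.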
Label Strategy
| Label | What it contains | Recommendation |
|---|---|---|
| SENT | Everything you sent | Always include |
| STARRED | Emails you explicitly starred | Good addition |
| IMPORTANT | Gmail's auto-importance | More noise than signal |
| INBOX | Everything in inbox | Skip |
| Custom labels | Your own organization | Gold — add any |
Pro tip: Gmail labels become searchable metadata. The script stores all labels on each ingested thought, so you can later ask your AI "show me everything I tagged as Project X."
For Existing Open Brain Users
Database Changes
| Column | Type | Purpose |
|---|---|---|
| parent_id | uuid (nullable FK) | Links chunks to their parent document |
| chunk_index | integer (nullable) | Orders chunks within a parent (0-based) |
These are nullable — your existing thoughts are unaffected. The migration uses IF NOT EXISTS, safe to run on a live database.
Updated Edge Functions
ingest-thought now accepts parent_id, chunk_index, and extra_metadata. It retries OpenRouter failures up to three times, caps embedding input at 8,000 characters, and crashes on startup if INGEST_KEY is unset (rather than silently accepting all requests).
open-brain-mcp search now fetches 3× the requested limit and deduplicates chunks from the same parent — if 3 chunks of a long email match your query, you get one result with a note. New tool: email_sync_status.
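The dedupe step described above can be sketched as: over-fetch, keep the best-scoring chunk per parent, then trim back to the requested limit. Types and names here are illustrative, not the function's actual signature:

```typescript
// Sketch: collapse multiple chunks of the same parent document into one
// search result, keeping the highest-scoring chunk.
type Hit = { id: string; parentId: string | null; score: number };

function dedupeByParent(hits: Hit[], limit: number): Hit[] {
  const best = new Map<string, Hit>();
  for (const h of hits) {
    const key = h.parentId ?? h.id; // standalone thoughts keep their own id
    const prev = best.get(key);
    if (!prev || h.score > prev.score) best.set(key, h);
  }
  return [...best.values()].sort((a, b) => b.score - a.score).slice(0, limit);
}
```

Fetching 3× the limit before deduplicating means a long email whose chunks dominate the raw results doesn't crowd everything else out of the final list.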
New Metadata Fields
```json
{
  "type": "observation",
  "topics": ["strategy", "product"],
  "people": ["Alice", "Bob"],
  "sentiment": "positive",
  "source": "gmail",
  "gmail_labels": ["SENT", "IMPORTANT"],
  "gmail_id": "18e4f2a...",
  "gmail_thread_id": "18e4f1..."
}
```
What's Next
| Extension | What it adds | Complexity |
|---|---|---|
| Google Calendar | Meetings, prep context, recurring commitments | Low — same OAuth |
| Meeting transcripts | Fathom, Otter, or Fireflies via webhook | Low — webhook to ingest-thought |
| URL / article ingestion | Drop a link, get the full article chunked | Medium |
| Slack / Discord history | Pull existing threads that predate your setup (push capture is already Nate's) | Medium |
Prompt Injection Risk
When you ingest email content, you're pulling in text written by other people. A crafted email could contain text designed to manipulate your AI when it later retrieves and reasons about that content — "ignore previous instructions," etc. This is prompt injection.
The partial protection: The ingest-thought function sends content to OpenRouter to extract structured JSON metadata — not to reason freely. The structure acts as a summarization barrier. The embedding step is purely mathematical. Neither step is a high-risk surface.
Where the real risk lives: When you ask your AI to "summarize everything I emailed about Project X" — that's when retrieved email content enters the AI's context window alongside your instructions.
Practical mitigations:
- Stick to `SENT` as your primary label — you wrote those emails
- Be more cautious with `STARRED` or `INBOX` (content from untrusted senders)
- The `gmail.readonly` OAuth scope means the script can never send email on your behalf
- If an AI client behaves strangely after a memory search, check what was retrieved
Running This Automatically
Right now the script is manual. Here's where things stand and where they're going.
Option 1: cron (works today)
```bash
# Add to crontab: crontab -e
# Runs every Monday at 8am
0 8 * * 1 cd /path/to/monkeyrun-open-brain && \
  INGEST_URL="..." INGEST_KEY="..." \
  deno run --allow-net --allow-read --allow-write --allow-env \
  scripts/pull-gmail.ts --window=7d --labels=SENT,STARRED
```
Option 2: OpenClaw (the right long-term home)
OpenClaw is built for exactly this — scheduled tasks, local script execution, Telegram notifications when done. If you're already running it, give it this prompt:
"Add a weekly cron job that runs every Monday at 8am. It should cd into my Open Brain project directory and run deno run --allow-net --allow-read --allow-write --allow-env scripts/pull-gmail.ts --window=7d --labels=SENT,STARRED. When it finishes, send me a Telegram message with the summary line from the output."
OpenClaw builds the cron job, hooks it into its scheduler, and you get a weekly brain sync with a Telegram confirmation. The Gmail OAuth token lives in the project directory — no extra credential setup needed.
Option 3: MCP trigger (roadmap)
The cleanest eventual solution: a pull_emails MCP tool that lets you trigger a sync from any AI client just by asking for one. Deferred because it requires moving the OAuth token server-side or building a webhook architecture. On the roadmap as the system matures.
Script Options & Troubleshooting
Script Options
| Flag | Default | Description |
|---|---|---|
--window= | 24h | Time window: 24h, 7d, 30d, 1y, all |
--labels= | SENT | Comma-separated Gmail labels (OR logic) |
--dry-run | off | Preview without ingesting |
--limit= | 50 | Max emails per run |
--list-labels | off | Print all Gmail labels and exit |
Troubleshooting
- Ingest requests rejected with an auth error: your INGEST_KEY doesn't match the one in Supabase. Check with `supabase secrets list` and re-export the correct value.
- Script can't find its credential files: `cd /path/to/monkeyrun-open-brain` first.
- OAuth token expired or revoked: delete `scripts/token.json` and re-run to re-authorize.
- A label returns no messages: run `--list-labels` to see exactly what your account has.