
My Server Died Again (This Time It Was Cursor)

Not This Again

"server seems dead again wtf is going on??"

SSH connection timed out during banner exchange. Same symptom as Part 3 — kernel alive, userspace frozen.

In Part 3, the server was a 16GB c3-standard-4 that died from a snap Chrome crash loop. We applied 12 hardening measures, swapped to an e2-highmem-4 with 32GB RAM, and installed a memory watchdog. That was supposed to be the end of it.

It wasn't.

Diagnosis

First obstacle: GCP auth had expired.

gcloud auth login

Minor detour. Then check the VM:

gcloud compute instances list --filter="name~claude-devbox"

Status: RUNNING. Same as last time — the kernel was alive, but nobody was home.

Serial Console

gcloud compute instances get-serial-port-output claude-devbox --zone=us-west1-b

Two things jumped out.

First: the GCP Ops Agent (otelopscol) was spam-looping PermissionDenied errors — dropping 1,800+ metrics per batch, generating massive error logs on every cycle. The monitoring agent we installed in Part 3 to prevent crashes was now compounding the crash.

Second: the last line before silence was systemd-resolved: Under memory pressure, flushing caches. Then nothing.

Hard Reset and Watchdog Logs

gcloud compute instances reset claude-devbox --zone=us-west1-b

SSH back in. The watchdog had been logging the whole time:

01:39 UTC (8 hours into the session): 87% memory. Claude Code at 1GB. Nothing alarming yet.

09:12 UTC (15 hours in): 89% memory. Five node processes at 2-2.7GB each. Combined: ~12.8GB.

09:13 UTC: 94% memory, 88% swap. Watchdog killed Chrome and one node process — but it was too late. The system was already in a death spiral.

The Full Memory Picture

Process                     RSS
5x node (Cursor)            ~12.8 GB
Chrome                      ~0.9 GB
openclaw-gateway (Docker)   ~0.6 GB
PM2 services                ~0.8 GB
LiteLLM (Docker)            ~0.3 GB
GCP Ops Agent               ~0.5 GB
Docker + kernel overhead    ~0.5 GB
Total                       ~16.4 GB RAM + 3.5 GB swap

~20GB consumed on a 32GB machine. The five node processes alone ate nearly half the RAM.

The Twist

My first assumption: Claude Code was eating memory. I'd seen it grow to 1-2GB per session before. Five sessions at 2.7GB each — plausible.

But the watchdog logs said node, not claude.

Claude Code shows up as claude in the process comm field. If those had been Claude sessions, the watchdog would have logged claude. It logged node.

Those five processes were Cursor.
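The comm field is easy to inspect directly. Reading /proc/self/comm shows the name of whichever process performs the read, which makes for a quick demonstration (the pgrep pattern in the comment is illustrative, not from the watchdog):

```shell
# /proc/<pid>/comm holds the short process name the watchdog logs.
# /proc/self resolves to the reading process, so cat reports itself:
cat /proc/self/comm          # prints: cat

# To check a live editor server process, something like:
#   cat "/proc/$(pgrep -n -f cursor-server)/comm"
```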

Why Cursor Leaks

Cursor's SSH remote mode spawns multiple node workers:

  • server-main.js — caches open files, undo history, search indexes. Never releases.
  • extensionHost — runs ALL extensions in one process. V8's garbage collector is lazy — it only does major GC under memory pressure. Memory grows monotonically.
  • fileWatcher — maintains an in-memory file tree. Grows with repo size and filesystem events.
  • ptyHost — terminal scroll buffers. Never truncated.
  • tsserver — holds the entire project AST in memory. Re-parses but never shrinks.

No --max-old-space-size set by default. V8's default heap limit is ~4GB per process.
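A tighter cap is possible for any Node process; whether Cursor's remote server actually honors NODE_OPTIONS is an assumption on my part, but the mechanism looks like this:

```shell
# Cap V8's old-space heap at 1GB instead of the multi-GB default (sketch;
# whether Cursor's server picks this up from the environment is an assumption):
export NODE_OPTIONS="--max-old-space-size=1024"
# Equivalent per-process flag:
#   node --max-old-space-size=1024 server-main.js
```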

Over 15 hours, each process grew from ~400MB to 2-2.7GB. That's 6-7x growth. Five processes doing this simultaneously pushed total consumption past 12GB — on top of everything else already running.

Fixes

1. Fix the Ops Agent

The irony: the monitoring agent installed to prevent crashes was spamming errors because it didn't have the right IAM permissions. Each failed metric push generated error logs, which consumed disk I/O and memory.

Added monitoring.metricWriter and logging.logWriter roles to the VM's service account. The spam stopped.
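For reference, the bindings look like this (the project ID and service-account email are placeholders, not values from this setup):

```shell
# Grant the VM's service account permission to write metrics and logs.
gcloud projects add-iam-policy-binding PROJECT_ID \
  --member="serviceAccount:SA_EMAIL" \
  --role="roles/monitoring.metricWriter"
gcloud projects add-iam-policy-binding PROJECT_ID \
  --member="serviceAccount:SA_EMAIL" \
  --role="roles/logging.logWriter"
```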

2. Watchdog v2

The old thresholds (warn at 85%, kill Chrome at 92%) were too generous. By the time the watchdog acted, the system was already in swap thrashing.

New thresholds:

  • 75%: warn and log
  • 80%: kill Chrome (sacrificial lamb)
  • 85%: kill any node process using more than 1GB RSS

The old watchdog let 12GB of node processes accumulate before reacting. The new one would have caught them at 85% — before swap was saturated.
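The tiering logic is simple enough to sketch. This is a minimal reconstruction of the decision step, not the actual watchdog script, and the action names are mine:

```shell
# Map a memory-used percentage to the v2 watchdog's action tier.
watchdog_action() {
  pct=$1
  if [ "$pct" -ge 85 ]; then
    echo "kill-big-node"   # kill any node process over 1GB RSS
  elif [ "$pct" -ge 80 ]; then
    echo "kill-chrome"     # sacrificial lamb
  elif [ "$pct" -ge 75 ]; then
    echo "warn"
  else
    echo "ok"
  fi
}

watchdog_action 94   # prints: kill-big-node
```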

3. Cursor Memory Guard

A cron job running every 5 minutes:

# Kill any .cursor-server node process exceeding 1GB RSS
pgrep -f '\.cursor-server' | while read -r pid; do
  rss=$(awk '/VmRSS/{print $2}' "/proc/$pid/status" 2>/dev/null)
  if [ "${rss:-0}" -gt 1048576 ]; then
    kill -TERM "$pid"
    logger "cursor-guard: killed pid $pid (RSS: ${rss}kB)"
  fi
done

SIGTERM is gentle — Cursor's client detects the dropped connection and auto-reconnects in seconds. The user experience is a brief "Reconnecting..." banner, then everything is back. No work lost.

This is the key insight: Cursor's remote architecture is designed for reconnection. Killing a bloated server process isn't destructive — it's garbage collection that V8 refuses to do on its own.
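For completeness, wiring the guard into cron every five minutes is one line (the script path is hypothetical; use wherever you saved the loop above):

```shell
# Hypothetical crontab entry; install with `crontab -e`.
*/5 * * * * /usr/local/bin/cursor-guard.sh
```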

Why This Keeps Happening

The pattern from Part 3 repeats: long-running processes on a dev box with no memory ceiling.

Part 3 was snap Chrome fork-bombing on 16GB. This time it was Cursor node processes silently growing on 32GB. Different process, same failure mode — unbounded memory consumption over time.

The 32GB upgrade bought headroom but didn't change the fundamental dynamic. Without per-process memory limits, any long-running process will eventually fill whatever RAM you give it. V8's GC strategy guarantees this: it won't do a major collection until the heap is under pressure, and "under pressure" means "close to the limit." With a 4GB default limit and lazy GC, every node process is a slow memory leak by design.

Takeaway

Three lessons from round three of server death:

Your monitoring can become the problem. The Ops Agent we installed to catch memory issues was itself consuming resources and generating error spam. Monitoring tools need correct permissions and resource limits just like everything else.

Process names matter for diagnosis. The difference between node and claude in a watchdog log was the difference between blaming the right tool and the wrong one. If I'd assumed "node = Claude Code" without checking, I'd have been chasing the wrong fix.

Reconnectable architectures are your friend. Cursor's remote server is stateless enough that killing it is harmless — the client reconnects in seconds. This makes aggressive memory guards viable. Not every architecture has this property, but when it does, use it.

The dev box has been stable since these fixes. But I said that after Part 3 too.


© Xingfan Xia 2024 - 2026 · CC BY-NC 4.0