A Cat Entertainer

A Cat Entertainer, Just A Tech Blog


So What If the Benchmark Scores Are High?

Mar 5, 2026

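GPT 5.4 leads across every benchmark. But when I threw the same complex product strategy question at both models, the gulf between benchmark scores and real output quality was shocking.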

AI · LLM · Claude · GPT

GPT 5.4 vs Opus 4.6: Why Benchmarks Stopped Mattering

Mar 5, 2026

GPT 5.4 dominates every benchmark. But when I gave both models the same complex product strategy prompt, the gap between benchmark scores and real-world output was staggering. Here's what actually happened.

AI · LLM · Claude · GPT

© Xingfan Xia 2024 - 2026 · CC BY-NC 4.0
