Output Receipts
Pick a prompt to see how tools respond side by side, with token cost, duration, and a judge score. Every output is auditable, labelled by method, and cached for fair comparison.
analyze
blog
- 1000-word blog intro on AI coding assistants for small teams. Expected: ~1000 words
Balanced comparison, includes a real table, identifies trade-offs, avoids vendor-speak.
- 800-word blog intro on sustainable fashion. Expected: ~800 words
Strong hook, scannable structure, specific stats, clear article preview, no filler.
code
- Python CSV dedupe with fuzzy matching. Expected: 80-150 lines
Streams input, correct rapidfuzz usage, outputs merge report, reasonable complexity.
- React TypeScript todo list component. Expected: 120-250 lines
Compiles, types are tight, handles edge cases, accessible, idiomatic.
- SQL retention cohort query (Postgres). Expected: 30-80 lines
Correct cohort logic, efficient (uses generate_series), readable CTE layout.
marketing
social
summarize
- Summarize a 10K earnings call transcript. Expected: 200-350 words
Faithful, no hallucinated numbers, correct structure, scannable.
- Summarize a SaaS MSA contract into plain-English bullets. Expected: 300-500 words
Accurate, readable by a non-lawyer, flags real negotiation levers.
- Summarize an NLP research paper abstract. Expected: 150-250 words
Plausible thesis, method/results/limitations are distinct, not vague marketing speak.
translate
- Translate a 500-word EN technical doc to Simplified Chinese. Expected: ~500 Chinese characters equivalent
Fluent zh-CN, technical accuracy, consistent terminology, code preserved.
- Translate a Chinese marketing tagline set to English. Expected: 200-350 words
Natural, idiomatic, preserves marketing energy, variants differ meaningfully.