
# P2: Auto Paraphrase Generation

## Key Results

| Strategy | bg=0 | bg=500 | bg=2000 | Implementation cost |
|----------|------|--------|---------|---------------------|
| None | 95% | 65% | 55% | - |
| Heuristic (synonym swap) | 95% | 85% | **75%** | Zero cost |
| Oracle (hard cases only) | 100% | 95% | **95%** | Requires LLM |
| Oracle (full coverage) | 100% | 100% | **100%** | Requires LLM |

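## Findings

1. **The heuristic is already valuable**: 55% → 75% (+20 pp) at bg=2000, with no LLM required
2. **Oracle with full coverage = 100%**: shows that the problem is fully solvable through paraphrasing
3. **Most failures can be fixed by paraphrasing**: 8 of the 9 failures have an oracle fix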

## Failure Categories

| Type | Example | Cause | Fix |
|------|---------|-------|-----|
| Vocabulary gap | "Ship the release" ↔ "deploy" (cos=0.46) | Completely different words | LLM paraphrase ✓ |
| Concept mapping | "Need observability" ↔ "monitoring" (cos=0.26) | Abstract → concrete | LLM paraphrase ✓ |
| Domain knowledge | "Fix login issue" ↔ "auth bug" (cos=0.65) | Requires knowing login = auth | LLM paraphrase ✓ |
| Competition | "DB terrible" ↔ "DB slow" (cos=0.72), but a bg entry wins the match | cos is high enough, but a bg entry is closer | Increase augmentation density |
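
The competition case can be illustrated with a small sketch. It assumes retrieval returns the stored entry with the highest cosine similarity to the query; the 2-D vectors, labels, and the `retrieve` helper are hypothetical stand-ins, not the project's real embeddings or API. The target cue scores about 0.72, a bg entry scores 0.80 and wins, and one closer paraphrase of the target restores the match.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def retrieve(query, stored):
    """Return the label of the stored vector closest to the query (assumed retrieval rule)."""
    return max(stored, key=lambda label: cosine(query, stored[label]))

# Toy vectors chosen for illustration only.
query = (1.0, 0.0)
stored = {
    "target_cue": (0.72, 0.694),  # cos with query is about 0.72
    "bg_entry": (0.8, 0.6),       # cos with query is 0.80, so it wins
}
winner_before = retrieve(query, stored)

# Denser augmentation: store a paraphrase of the target that lies closer to the query.
stored["target_paraphrase"] = (0.95, 0.312)  # cos with query is about 0.95
winner_after = retrieve(query, stored)

print(winner_before, "->", winner_after)
```

In this toy setup the original cue never changes; the fix comes purely from adding more stored variants that point to the same target.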

## Practical Deployment Strategy

### At storage time (asynchronous, no impact on latency)
```
1. The user says something
2. Extract (cue, target)
3. Synchronously store the original cue
4. Asynchronously: the LLM generates 3-5 paraphrases → append them to the store
```
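
The flow above could be sketched with `asyncio` roughly as follows. `CueStore`, `generate_paraphrases`, `augment`, and `remember` are hypothetical names introduced for illustration, and the LLM call is replaced by a placeholder that returns canned variants; the point is only that the original cue is stored before any paraphrase work begins.

```python
import asyncio

class CueStore:
    """Hypothetical in-memory store mapping each cue string to its target."""
    def __init__(self):
        self.entries = {}

    def add(self, cue, target):
        self.entries[cue] = target

async def generate_paraphrases(cue):
    """Placeholder for the LLM call; returns canned variants instead of real output."""
    await asyncio.sleep(0)  # yield control, as a real network call would
    return [f"{cue} (variant {i})" for i in range(1, 4)]

async def augment(store, cue, target):
    """Step 4: append each generated paraphrase, pointing at the same target."""
    for paraphrase in await generate_paraphrases(cue):
        store.add(paraphrase, target)

def remember(store, cue, target):
    """Step 3 runs synchronously; step 4 is scheduled in the background."""
    store.add(cue, target)
    return asyncio.create_task(augment(store, cue, target))

async def main():
    store = CueStore()
    task = remember(store, "The database is slow again", "db_latency_note")
    count_before = len(store.entries)  # only the original cue so far
    await task                         # a real caller would not need to wait here
    return count_before, len(store.entries)

count_before, count_after = asyncio.run(main())
print(count_before, count_after)
```

Because `remember` returns as soon as the original cue is stored, the caller's latency does not depend on how long paraphrase generation takes.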

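### Heuristic fallback (when the LLM is unavailable)
The current heuristic rules have been validated as effective (+20 pp at bg=2000) and can serve as a baseline:
- Strip common prefixes ("Can you", "I need to", "How do I")
- Swap synonyms (deploy↔release, database↔DB, fix↔resolve)
- Add an "issue with X" pattern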
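
A minimal sketch of these three rules is shown below. The prefix and synonym lists are taken from the bullets above; the function names are hypothetical, and treating X as the whole prefix-stripped cue is an assumption, since the rule does not say how X is chosen (a real implementation would more likely extract a noun phrase).

```python
PREFIXES = ("can you", "i need to", "how do i")
SYNONYMS = {
    "deploy": "release", "release": "deploy",
    "database": "DB", "db": "database",
    "fix": "resolve", "resolve": "fix",
}

def strip_prefix(cue):
    """Remove one leading common prefix, matched case-insensitively."""
    lowered = cue.lower()
    for prefix in PREFIXES:
        if lowered.startswith(prefix + " "):
            return cue[len(prefix) + 1:]
    return cue

def swap_synonyms(cue):
    """Return one variant per swappable word, replacing a single word each time."""
    words = cue.split()
    variants = []
    for i, word in enumerate(words):
        replacement = SYNONYMS.get(word.lower())
        if replacement is not None:
            variants.append(" ".join(words[:i] + [replacement] + words[i + 1:]))
    return variants

def heuristic_paraphrases(cue):
    """Apply all three rules and return de-duplicated variants in a stable order."""
    base = strip_prefix(cue.strip().rstrip("?.!"))
    # Assumption: X in "issue with X" is the whole stripped cue.
    candidates = [base] + swap_synonyms(base) + [f"issue with {base}"]
    seen = {cue}
    result = []
    for candidate in candidates:
        if candidate not in seen:
            seen.add(candidate)
            result.append(candidate)
    return result

print(heuristic_paraphrases("Can you fix the database?"))
```

The variants are not always grammatical ("issue with fix the database"), which may be acceptable if they are only used as extra retrieval cues rather than shown to users.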

### LLM Prompt (to be validated once the Gateway is restored)
```
Generate 3-5 different ways a user might say this:
"The database is slow again"

Requirements:
- Same core meaning, different wording
- Include informal/colloquial versions
- Include technical jargon alternatives
```
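
As a rough sketch of how this prompt might be wired in, the helpers below fill the template for an arbitrary cue and parse a reply into a list. `build_prompt` and `parse_paraphrases` are hypothetical names, and the parser assumes the model answers with one paraphrase per line, optionally bulleted or numbered; the actual reply format has not been verified and may need a different parser.

```python
import re

PROMPT_TEMPLATE = """Generate 3-5 different ways a user might say this:
"{cue}"

Requirements:
- Same core meaning, different wording
- Include informal/colloquial versions
- Include technical jargon alternatives"""

def build_prompt(cue):
    """Fill the prompt template with the cue to paraphrase."""
    return PROMPT_TEMPLATE.format(cue=cue)

def parse_paraphrases(response, limit=5):
    """Split a reply into paraphrases, assuming one per line (unverified format)."""
    results = []
    for line in response.splitlines():
        # Drop a leading bullet ("-", "*", "•") or numbering ("1.", "2)").
        text = re.sub(r"^\s*(?:[-*•]|\d+[.)])\s*", "", line).strip().strip('"')
        if text and text not in results:
            results.append(text)
    return results[:limit]

# Hand-written sample reply, not real model output.
sample_reply = '1. "DB is crawling"\n- Database latency is high again\n\n2) The DB is slow'
parsed = parse_paraphrases(sample_reply)
print(build_prompt("The database is slow again"))
print(parsed)
```

Capping the list at five entries mirrors the "3-5" requested in the prompt and keeps a verbose reply from flooding the store.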