Backend: Fix three race conditions in SetSnapshot that caused account
scheduling anomalies and broken sticky sessions:
- Use Lua CAS script for atomic version activation, preventing version
rollback when concurrent goroutines write snapshots simultaneously
- Add UnlockBucket to release the rebuild lock immediately after completion
instead of waiting for the 30s TTL to expire
- Replace immediate DEL of old snapshots with 60s EXPIRE grace period,
preventing readers from hitting empty ZRANGE during version switches
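The CAS activation above can be sketched as follows. This is a minimal illustration, not the project's actual script: the key name, script body, and helper are assumptions; the point is that the active-version pointer only moves forward, so a slow goroutine finishing an older rebuild can never roll it back.

```go
package main

import "fmt"

// Hypothetical Lua CAS script: activate a snapshot version only if it is
// newer than the currently active one. EVAL runs atomically in Redis, so
// two goroutines racing to activate cannot interleave the GET and SET.
const activateVersionScript = `
local current = tonumber(redis.call('GET', KEYS[1]) or '0')
local candidate = tonumber(ARGV[1])
if candidate > current then
  redis.call('SET', KEYS[1], candidate)
  return 1
end
return 0
`

// shouldActivate mirrors the script's guard in pure Go for illustration.
func shouldActivate(current, candidate int64) bool {
	return candidate > current
}

func main() {
	fmt.Println(shouldActivate(3, 4)) // newer snapshot wins: true
	fmt.Println(shouldActivate(4, 3)) // stale writer is rejected: false
}
```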
Frontend: Remove serial queue throttle (1-2s delay per request) from
usage loading since backend now uses passive sampling. All usage
requests execute immediately in parallel.
The errcheck linter flagged an unchecked type assertion on
item["type"].(string). Use the two-value form with require.True
to satisfy the linter and fail clearly on unexpected types.
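The pattern is the standard two-value type assertion; a minimal sketch (without testify, whose `require.True(t, ok, ...)` call the real test uses):

```go
package main

import "fmt"

// itemType demonstrates the errcheck-clean two-value assertion form.
// A single-value assertion item["type"].(string) would panic on a
// non-string value; the two-value form reports failure via ok instead.
func itemType(item map[string]any) (string, bool) {
	typ, ok := item["type"].(string)
	return typ, ok
}

func main() {
	typ, ok := itemType(map[string]any{"type": "reasoning"})
	fmt.Println(typ, ok) // reasoning true
	_, ok = itemType(map[string]any{"type": 42})
	fmt.Println(ok) // false: non-string value, no panic
}
```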
- Security: force token_uri to Google default, preventing SSRF via crafted service account JSON
- Dedup: extract shared getVertexServiceAccountAccessToken() to eliminate ~35 lines of duplication between ClaudeTokenProvider and GeminiTokenProvider
- Fix: apply model mapping + Vertex model ID normalization in forward_as_responses and forward_as_chat_completions paths
- Fix: exclude service_account from AI Studio endpoint selection (Vertex cannot serve generativelanguage.googleapis.com)
- Feature: add model restriction/mapping UI for service_account in EditAccountModal
- Dedup: extract VERTEX_LOCATION_OPTIONS to shared constants
- i18n: replace all hardcoded Chinese strings in Vertex UI with translation keys
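The token_uri hardening in the first bullet can be sketched as below. Function and field names are assumptions; the idea is that whatever endpoint a crafted service-account JSON carries, it is overwritten with the Google default before any token exchange, so the signed JWT can never be POSTed to an attacker-chosen URL.

```go
package main

import "fmt"

// googleTokenURI is the only endpoint the token exchange should ever hit.
const googleTokenURI = "https://oauth2.googleapis.com/token"

// sanitizeServiceAccount (hypothetical name) forces token_uri to the
// Google default, closing the SSRF vector described above.
func sanitizeServiceAccount(sa map[string]any) map[string]any {
	sa["token_uri"] = googleTokenURI
	return sa
}

func main() {
	// A malicious JSON pointing token_uri at the cloud metadata service.
	sa := map[string]any{
		"client_email": "svc@proj.iam.gserviceaccount.com",
		"token_uri":    "http://169.254.169.254/latest/",
	}
	fmt.Println(sanitizeServiceAccount(sa)["token_uri"])
}
```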
(*net.OpError).Error() concatenates Source/Addr fields, so the previous
disconnectMsg surfaced internal source IP/port and upstream server address
to clients via SSE error frames and UpstreamFailoverError.ResponseBody
(reported by @Wei-Shaw on PR #2066).
- Add sanitizeStreamError that maps known errors (io.ErrUnexpectedEOF,
context.Canceled, syscall.ECONNRESET/EPIPE/ETIMEDOUT/...) to fixed
descriptions and falls back to a generic placeholder, with an explicit
*net.OpError branch that drops Source/Addr fields entirely.
- Use sanitized message in client-facing disconnectMsg; full ev.err is
still preserved in the existing operator log line for diagnosis.
- Tests cover net.OpError redaction, the failover ResponseBody path, and
every known sanitized error mapping.
Background
The ops cleanup task currently rejects retention days < 1 in both validate
and normalize, so operators who want minimal-history setups (e.g. high
churn deployments that prefer near-realtime cleanup) cannot express that
intent through the UI. The minimum configurable value is 1 day, which keeps
at least 24h of history regardless of cron frequency.
Purpose
Let admins set retention days to 0, meaning "every scheduled cleanup
run wipes the corresponding table(s) entirely". Combined with a more
frequent cron (e.g. `0 * * * *`) this yields effectively rolling cleanup.
Changes
Backend
- service/ops_settings.go: validate accepts [0, 365]; normalize only
refills default 30 when value is < 0 (negative is treated as legacy
bad data, 0 is honoured)
- service/ops_cleanup_service.go: introduce `opsCleanupPlan(now, days)`
returning `(cutoff, truncate, ok)`. days==0 returns truncate=true and
short-circuits to a new `truncateOpsTable` helper that uses
`TRUNCATE TABLE` (O(1), no WAL, no VACUUM pressure). days>0 keeps
the existing batched DELETE path unchanged. Empty tables skip
TRUNCATE to avoid the ACCESS EXCLUSIVE lock entirely
- Extract `isMissingRelationError` helper to dedupe the "table not
yet created" tolerance shared by both delete and truncate paths
- Add unit tests for `opsCleanupPlan` (three branches) and
`isMissingRelationError`
Frontend
- OpsSettingsDialog.vue: validation accepts [0, 365]; input min=0
- i18n (zh/en): hint mentions "0 = wipe all on every cleanup",
validation message updated to 0-365 range
Trade-offs
- TRUNCATE requires ACCESS EXCLUSIVE lock briefly, but ops tables only
have the cleanup task as a writer, so the lock is invisible to other
workloads
- Empty-table guard avoids the lock when there is nothing to clean
- Negative values are still treated as legacy bad data and replaced
with default 30 to preserve compatibility
Closes #1957
The OAuth path forwards client requests to chatgpt.com/backend-api/codex/responses,
where applyCodexOAuthTransform forces store=false (chatgpt.com's codex backend
rejects store=true). Reasoning items emitted under store=false are NEVER
persisted upstream, so any rs_* reference that a client carries forward in a
subsequent input[] array triggers a guaranteed upstream 404:
Item with id 'rs_...' not found. Items are not persisted when `store` is
set to false. Try again with `store` set to true, or remove this item
from your input.
sub2api wraps this as 502 "Upstream request failed" and the conversation
breaks on every multi-turn /v1/responses request that uses reasoning + tools
(reproducible with gpt-5.5; gpt-5.4 happens to dodge it because the upstream
does not emit reasoning items for that model).
Affected clients include any that follow the OpenAI Responses API spec and
replay prior assistant items verbatim — in practice this hit OpenClaw and
similar agent harnesses on every turn ≥2 with tool use.
The fix: in filterCodexInput, drop input items with type == "reasoning"
entirely. The model never reads reasoning summary text from input (only
encrypted_content can carry reasoning context across turns, and chatgpt.com
under store=false does not emit it), so this is a no-op for the model itself
and a clean removal of unreachable upstream lookups.
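The filter change can be sketched as below. The item shape is a stand-in for the decoded JSON the real `filterCodexInput` operates on; only the `type == "reasoning"` check comes from the commit.

```go
package main

import "fmt"

// inputItem stands in for one decoded element of the input[] array.
type inputItem map[string]any

// dropReasoningItems removes items with type == "reasoning"; everything
// else, including "item_reference" items, passes through untouched. This
// prevents rs_* lookups that are guaranteed to 404 under store=false.
func dropReasoningItems(items []inputItem) []inputItem {
	out := make([]inputItem, 0, len(items))
	for _, it := range items {
		if t, _ := it["type"].(string); t == "reasoning" {
			continue
		}
		out = append(out, it)
	}
	return out
}

func main() {
	items := []inputItem{
		{"type": "message", "role": "user"},
		{"type": "reasoning", "id": "rs_123"},
		{"type": "item_reference", "id": "msg_456"},
	}
	for _, it := range dropReasoningItems(items) {
		fmt.Println(it["type"]) // message, item_reference
	}
}
```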
Scope is intentionally narrow:
* Only OAuth account requests (account.Type == AccountTypeOAuth) reach
applyCodexOAuthTransform / filterCodexInput.
* API-key accounts going to api.openai.com/v1/responses are unaffected
(store=true works there, rs_* persists, multi-turn already works).
* Anthropic / Gemini platform groups go through different transforms and
are unaffected.
* /v1/chat/completions is unaffected (no reasoning items).
* item_reference items (different type) are unaffected — only type ==
"reasoning" is dropped.
Verification:
* Existing tests pass: go test ./internal/service/ -run 'Codex|Tool|OAuth'
* New regression test asserts reasoning items are dropped under both
preserveReferences=true and preserveReferences=false.
* End-to-end repro on gpt-5.5 multi-turn + tools: pre-patch 502, post-patch
200. Repro on gpt-5.4 unchanged. Three-turn deep loop on gpt-5.5 passes.
Two follow-ups to PR #2066's failover-wrap fix:
1. Failover ResponseBody (`UpstreamFailoverError.ResponseBody`) was encoded
as `{"error": "<msg>"}` (string field). `ExtractUpstreamErrorMessage`
probes for `error.message`, `detail`, or top-level `message` only — so
`handleFailoverExhausted` and downstream passthrough rules saw an empty
message, losing the EOF root cause in ops logs. Re-encode as the
Anthropic standard shape `{"type":"error","error":{"type":"upstream_disconnected","message":"..."}}`.
(Addresses the inline review comment from copilot-pull-request-reviewer
on Wei-Shaw/sub2api#2066.)
2. The streaming `event: error` SSE frame for `response_too_large`,
`stream_read_error`, and `stream_timeout` was non-standard
(`{"error":"<reason>"}`). Anthropic SDKs (and Claude Code) expect
`{"type":"error","error":{"type":"...","message":"..."}}` and parse
`error.type`/`error.message` accordingly. Refactor `sendErrorEvent` to
take both reason and message, and emit the standard frame so client
SDKs surface a real diagnostic message instead of a generic stream error.
This does not by itself prevent task interruption on long-stream EOF
(SSE has no resume; client-side retry remains the only complete fix), but
it gives both server-side ops logs and client-side error UIs a meaningful
upstream message so users know the next step is to retry.
Tests updated to assert the new body shape on both branches plus a new
assertion that `ExtractUpstreamErrorMessage` returns a non-empty string.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Anthropic streaming path (gateway_service.go) returned a plain error on
upstream SSE read failure, so the handler-level UpstreamFailoverError check
never fired and the client received a bare `stream_read_error` event,
breaking long-running tasks even when no bytes had been written yet.
The most common trigger is HTTP/2 GOAWAY from api.anthropic.com edge
backends doing graceful rotation: Go's http.Transport surfaces this as
`unexpected EOF` and never auto-retries.
Mirror what the OpenAI and antigravity gateways already do: when the read
error happens before any byte has reached the client (`!c.Writer.Written()`),
return `*UpstreamFailoverError{StatusCode: 502, RetryableOnSameAccount: true}`
so the handler can retry on the same or another account. After client
output has begun, SSE has no resume protocol — keep the existing passthrough
behavior.
Tests cover both branches via streamReadCloser-based fixtures.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
PR #1914 unconditionally applied the full mimicry pipeline to all OAuth
accounts, including real Claude Code CLI clients. This replaced the
client's long system prompt (~10K+ tokens with stable cache_control
breakpoints) with a short ~45 token [billing, CC prompt] pair, which
falls below Anthropic's 1024-token minimum cacheable prefix threshold.
The result: every request created a new cache but never hit an existing
one.
Fix: restore the Claude Code client detection gate so that real CC
clients bypass body-level mimicry (system rewrite, message cache
management, tool name obfuscation). Non-CC third-party clients
(opencode, etc.) continue to receive full mimicry.
Also harden the detection logic:
- Make UA regex case-insensitive (align with claude_code_validator.go)
- Validate metadata.user_id format via ParseMetadataUserID() instead of
just checking non-empty, preventing third-party tools from spoofing
a claude-cli/* UA with an arbitrary user_id string to bypass mimicry
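The hardened gate can be sketched as below. The user_id pattern is an assumption standing in for `ParseMetadataUserID()`; the case-insensitive UA match follows the commit. Both checks must pass, so a spoofed `claude-cli/*` UA with an arbitrary user_id no longer bypasses mimicry.

```go
package main

import (
	"fmt"
	"regexp"
)

var (
	// Case-insensitive, aligned with claude_code_validator.go.
	ccUserAgent = regexp.MustCompile(`(?i)^claude-cli/`)
	// Assumed user_id shape; the real code delegates to ParseMetadataUserID.
	ccUserID = regexp.MustCompile(`^user_[0-9a-f]{16,}`)
)

// isClaudeCodeClient passes only when both the UA and the metadata
// user_id look like a real Claude Code CLI client.
func isClaudeCodeClient(userAgent, metadataUserID string) bool {
	return ccUserAgent.MatchString(userAgent) && ccUserID.MatchString(metadataUserID)
}

func main() {
	fmt.Println(isClaudeCodeClient("Claude-Cli/2.0 (cli)", "user_0123456789abcdef"))
	fmt.Println(isClaudeCodeClient("claude-cli/2.0", "totally-made-up")) // spoof rejected
}
```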
- gofmt: realign AffiliateDetail struct tags in affiliate_service.go
- ineffassign: remove dead seenCompleted assignment before return in account_test_service.go
The hardcoded codex CLI version (0.104.0) causes upstream rejection
when using gpt-5.5 with compact, as the server treats the request
as an outdated client and returns 400/502.
Update codexCLIVersion, codexCLIUserAgent, and openAICodexProbeVersion
to 0.125.0 to match the current Codex CLI release.
Fixes #1933, #1887, #1865
Related: #1609, #1298, #849
- staticcheck QF1001: apply De Morgan's law to the OAuth-mimic header
passthrough guard (`!(a && b)` → `a != ... || !b`).
- unused: drop `isClaudeCodeRequest`, which became dead after PR #1914
switched both `/v1/messages` and `/count_tokens` paths to unconditional
`account.IsOAuth()` mimicry. The lowercase helper `isClaudeCodeClient`
is kept (still referenced by `TestIsClaudeCodeClient`).
- Drop SetAffiliateService setters and ProvideAuthService /
ProvidePaymentService / ProvideUserHandler wrappers in favor of direct
Wire constructor injection. AffiliateService has no back-edge to
Auth/Payment/User, so the indirection was never required.
- Change RegisterWithVerification's variadic affiliateCode to a fixed
parameter; adjust all call sites.
- Validate aff_code length and charset in BindInviterByCode before any
DB lookup, eliminating a timing side channel and useless DB roundtrips
on malformed input.
- Make affiliate cache invalidation synchronous; surface Redis errors
via the project logger instead of swallowing them in a detached
goroutine.
- Add an integration test guarding cross-layer tx propagation in
AccrueQuota and a unit test pinning the aff_code format rules.
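The pre-lookup guard can be sketched as below. The length bounds and alphabet are assumptions (the unit test mentioned above pins the real rules); the point is that the check is pure and DB-free, so malformed codes are rejected before any query runs.

```go
package main

import "fmt"

// validateAffCode rejects malformed affiliate codes before any DB lookup.
// Bounds and charset here are illustrative, not the project's actual rules.
func validateAffCode(code string) bool {
	if len(code) < 4 || len(code) > 16 {
		return false
	}
	for _, c := range code {
		switch {
		case c >= 'a' && c <= 'z', c >= 'A' && c <= 'Z', c >= '0' && c <= '9':
			// allowed alphanumeric character
		default:
			return false
		}
	}
	return true
}

func main() {
	fmt.Println(validateAffCode("AB12cd"))       // true
	fmt.Println(validateAffCode("x"))            // false: too short
	fmt.Println(validateAffCode("drop;table--")) // false: bad charset
}
```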
Root cause of persistent third-party detection: sub2api's
buildUpstreamRequest transparently forwards client headers via
allowedHeaders whitelist (addHeaderRaw) before applying mimicry
overrides. When third-party clients (opencode, etc.) send their own
anthropic-beta / user-agent / x-stainless-* / x-claude-code-session-id
values, these get appended to the request alongside our injected
headers, creating an inconsistent header set that Anthropic detects.
Parrot's build_upstream_headers constructs exactly 9 headers from
scratch and never forwards anything from the client. This explains why,
with the same opencode version, some users work and others don't:
different opencode configs/versions send different header combinations.
Fix: when tokenType=oauth and mimicClaudeCode=true, skip the
client header passthrough loop entirely. The subsequent
applyClaudeCodeMimicHeaders + ApplyFingerprint + beta merge
pipeline constructs all necessary headers from our controlled values.
Also: remove systemIncludesClaudeCodePrompt gate — OAuth accounts
now unconditionally rewrite system (even if client already sent a
Claude Code-style prompt), ensuring billing attribution block is
always present.