Make model extraction economically infeasible
Fairvisor acts as an LLM Abuse Firewall at the edge: multi-dimensional quotas, extraction-focused anomaly detection, and identity-aware enforcement with auditable incident evidence.
What Fairvisor Does for LLM Hosters
Multi-Dimensional Quotas
Control more than requests per second:
- Tokens/minute, tokens/day, and cost/minute
- Limits by endpoint type, model, route, and prompt class
- Burst + sliding-window controls and sustained throughput caps for long-running campaigns
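The quota dimensions above can be sketched as independent sliding windows that a request must clear simultaneously. This is a minimal illustration, not Fairvisor's actual API: the `QuotaWindow` class, the `admit` helper, and the limit values are all hypothetical.

```python
import time
from collections import deque

class QuotaWindow:
    """Sliding-window usage counter for one dimension (tokens, cost, ...)."""
    def __init__(self, limit, window_seconds):
        self.limit = limit
        self.window = window_seconds
        self.events = deque()  # (timestamp, amount) pairs

    def peek(self, amount, now):
        # Drop events that have slid out of the window, then check headroom.
        while self.events and self.events[0][0] <= now - self.window:
            self.events.popleft()
        return sum(a for _, a in self.events) + amount <= self.limit

    def record(self, amount, now):
        self.events.append((now, amount))

# One window per dimension; a request is admitted only if every
# dimension has headroom, so no window is charged on a reject.
tokens_per_minute = QuotaWindow(limit=10_000, window_seconds=60)
cost_per_minute = QuotaWindow(limit=5.00, window_seconds=60)

def admit(request_tokens, request_cost, now=None):
    now = time.monotonic() if now is None else now
    checks = [(tokens_per_minute, request_tokens),
              (cost_per_minute, request_cost)]
    if all(w.peek(amount, now) for w, amount in checks):
        for w, amount in checks:
            w.record(amount, now)
        return True
    return False
```

Checking every dimension before recording any of them keeps a reject from consuming quota, which matters when burst and sustained limits overlap.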
Identity-Aware Enforcement
Policy follows identity, not just IP:
- API key + tenant + user + device fingerprint + ASN/geo + reputation signals
- Distinct trust profiles per key or customer segment
- Step-up actions on suspicion (re-auth, attestation, low-trust profile)
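One way to picture identity-aware enforcement: policy is a function of the whole identity tuple plus reputation, not the API key alone. The field names, trust tiers, and thresholds below are illustrative assumptions, not Fairvisor's actual schema.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Identity:
    api_key: str
    tenant: str
    user: str
    device_fingerprint: str
    asn: int

# Distinct trust profiles per key or customer segment (example values).
TRUST_PROFILES = {
    "trusted":   {"tokens_per_min": 50_000, "step_up": None},
    "default":   {"tokens_per_min": 10_000, "step_up": None},
    "low_trust": {"tokens_per_min": 1_000,  "step_up": "re-auth"},
}

def profile_for(identity, reputation_score):
    # Low reputation triggers a step-up action regardless of segment.
    if reputation_score < 0.2:
        return TRUST_PROFILES["low_trust"]
    # Per-segment override: a hypothetical high-trust tenant.
    if identity.tenant == "enterprise-a":
        return TRUST_PROFILES["trusted"]
    return TRUST_PROFILES["default"]
```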
Extraction-Focused Detection
Detect patterns that look like harvesting, not normal product usage:
- Coverage-style prompt sweeps across broad task spaces
- High prompt variability with similar objective
- Large counts of short sessions and template-correlated prompts
- Uniform overnight runs and sustained high-entropy traffic
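Two of the signals above, template correlation and coverage breadth, can be sketched with simple heuristics. The normalization rule, thresholds, and function names here are illustrative assumptions, not Fairvisor's detection pipeline.

```python
import re
from collections import Counter

def normalize(prompt):
    """Collapse numbers and quoted spans so template variants map together."""
    p = re.sub(r"\d+", "<num>", prompt.lower())
    return re.sub(r"\"[^\"]*\"", "<quoted>", p)

def extraction_signals(prompts, categories, total_categories):
    templates = Counter(normalize(p) for p in prompts)
    top_share = templates.most_common(1)[0][1] / len(prompts)
    coverage = len(set(categories)) / total_categories
    return {
        "template_correlation": top_share,  # near 1.0: one template, many variants
        "coverage_breadth": coverage,       # near 1.0: systematic capability sweep
        "suspicious": top_share > 0.8 or coverage > 0.7,
    }
```

A sweep like "Translate sentence 1", "Translate sentence 2", ... collapses to a single template and scores a correlation of 1.0, while organic chat traffic produces many distinct templates.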
Edge Enforcement Playbooks
Automate responses in real time:
- Throttle
- Tarpit latency
- Quota squeeze
- Hard block
- Cooldown windows
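The response ladder above can be modeled as an escalation table indexed by repeated suspicion signals. This is a minimal sketch; the action parameters and the one-rung-per-strike policy are illustrative, not Fairvisor's defaults.

```python
# Ordered escalation ladder mirroring the playbook list above.
PLAYBOOK = [
    ("throttle",      {"rps_cap": 1}),
    ("tarpit",        {"added_latency_ms": 5_000}),
    ("quota_squeeze", {"tokens_per_min": 500}),
    ("hard_block",    {}),
    ("cooldown",      {"seconds": 3_600}),
]

def next_action(strike_count):
    """Escalate one rung per suspicion signal; stay at the top rung."""
    if strike_count <= 0:
        return None
    rung = min(strike_count, len(PLAYBOOK)) - 1
    return PLAYBOOK[rung]
```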
Forensics and Auditability
Produce proof for security and legal workflows:
- Who initiated traffic, when, and through which identity path
- Token/cost impact over time
- Which controls fired and why
- Exportable incident timeline for compliance and IR
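The evidence listed above can be captured as structured events and exported as one JSON object per line. The field names below are illustrative assumptions, not Fairvisor's actual export schema.

```python
import json
import datetime

def audit_event(identity_path, control, reason, tokens, cost_usd):
    return {
        "ts": datetime.datetime.now(datetime.timezone.utc).isoformat(),
        "identity_path": identity_path,   # e.g. key -> tenant -> user
        "control_fired": control,
        "reason": reason,
        "tokens": tokens,
        "cost_usd": cost_usd,
    }

def export_timeline(events, path):
    # One JSON object per line: easy to diff, sign, and hand to IR/legal.
    with open(path, "w") as f:
        for e in sorted(events, key=lambda e: e["ts"]):
            f.write(json.dumps(e) + "\n")
```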
Optional Honey Endpoints
Canary routes can improve detection fidelity and attribution for automated collection behavior.
What Extraction Looks Like

Model extraction campaigns don't look like DDoS. They look like unusually systematic product usage.
Prompt sweep
— A single API key sends 50,000 requests over 72 hours. Each prompt is a slight variation on the same task template, incrementally covering the model’s behavior space. Request volume is moderate; coverage breadth is the signal. Normal users don’t systematically probe every capability.
Coverage run
— Requests are distributed across task categories: summarization, translation, code generation, reasoning, classification. The prompt structure is formulaic. Completion lengths cluster tightly. This is a training data collection job, not product usage.
Overnight template run
— Traffic is uniform between 2am and 6am. Identical prompt structure, high retry rate, no user-session context. Human users don’t work like this; automated collection jobs do.
Key rotation
— When one key hits its limit, traffic shifts to a new key within seconds. The request fingerprint is identical. Fairvisor tracks this across identity transitions.
→ LLM Token Limiter | Loop Detector
Policy Playbooks Out of the Box
Prebuilt policy packs for common deployments:
- Public LLM API
- B2B Copilot
- Internal Assistant
- Consumer Chat
Each preset includes defaults for mass-query and scraping resistance, then can be tuned per tenant and model.
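The presets might be represented as a base configuration per deployment type with per-tenant overrides layered on top. Every name and number below is illustrative, not a shipped default.

```python
POLICY_PACKS = {
    "public_llm_api": {
        "tokens_per_min": 6_000, "cost_per_min_usd": 0.50,
        "coverage_alert": 0.6,
        "playbook": ["throttle", "tarpit", "hard_block"],
    },
    "b2b_copilot": {
        "tokens_per_min": 30_000, "cost_per_min_usd": 5.00,
        "coverage_alert": 0.8,
        "playbook": ["throttle", "quota_squeeze"],
    },
    "internal_assistant": {
        "tokens_per_min": 60_000, "cost_per_min_usd": 10.00,
        "coverage_alert": 0.9,
        "playbook": ["throttle"],
    },
    "consumer_chat": {
        "tokens_per_min": 3_000, "cost_per_min_usd": 0.25,
        "coverage_alert": 0.5,
        "playbook": ["throttle", "tarpit", "hard_block", "cooldown"],
    },
}

def tuned(pack_name, **overrides):
    """Start from a preset, then tune per tenant or model."""
    return {**POLICY_PACKS[pack_name], **overrides}
```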
Minimal MVP Scope (Practical Rollout)
Start with three controls:
- Token/cost limits with burst controls
- Policy engine with core extraction signals (velocity, coverage, template and retry patterns)
- Automated responses (throttle, tarpit, block) with complete audit log
This is enough to make a credible claim: model extraction attempts become measurably riskier and more expensive, backed by controls you can audit.
Who This Is For
- Teams exposing LLM functionality as paid API products
- Vertical copilots with expensive prompt pipelines and proprietary system prompts
- Platforms where behavioral cloning risk is material
FAQ
What types of rate limits can I apply per model?
Token/minute, token/day, cost/minute — configurable per model, endpoint type, prompt class, or customer segment. Separate burst and sustained throughput controls limit both spike traffic and long-running campaigns.
How does extraction detection work?
Fairvisor tracks prompt patterns across requests per identity: coverage breadth (are prompts systematically spanning task types?), variability with shared objective, timing uniformity, retry rates, and session context absence. Normal product usage doesn’t look like a systematic capability sweep.
What happens when extraction is detected?
Configurable enforcement playbook: throttle → tarpit latency → quota squeeze → block → cooldown window. You control the escalation levels. Fairvisor provides the signals and automates the response in real time.
How does Fairvisor track identity across key rotation?
By fingerprinting request patterns — prompt structure, timing, ASN, and session context — not just the API key. When one key hits a limit and traffic shifts immediately to a new key with the same behavioral fingerprint, Fairvisor tracks the continuity across the transition.
→ Loop Detector docs
Does Fairvisor support streaming (SSE) responses?
Yes. Token counting happens during streaming. If a completion exceeds configured limits mid-stream, Fairvisor closes the stream gracefully with finish_reason: length. No corrupted responses, no wasted tokens after the cutoff.
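The mid-stream cutoff behavior described above can be sketched as a relay that counts tokens per chunk and closes cleanly at the limit. The generator and chunk shape are illustrative assumptions; real SSE handling depends on the serving stack.

```python
def stream_with_limit(chunks, token_limit):
    """Relay streamed chunks, counting tokens; cut off cleanly at the limit."""
    sent = 0
    for chunk in chunks:
        n = len(chunk["tokens"])
        if sent + n > token_limit:
            # Forward only what fits, then close with finish_reason: length.
            room = token_limit - sent
            if room:
                yield {"tokens": chunk["tokens"][:room], "finish_reason": None}
            yield {"tokens": [], "finish_reason": "length"}
            return
        sent += n
        yield chunk
    yield {"tokens": [], "finish_reason": "stop"}
```

Truncating inside the final chunk, rather than dropping it whole, is what avoids wasting tokens past the cutoff while still ending the stream on a well-formed event.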