Make model extraction economically infeasible
Fairvisor acts as an LLM Abuse Firewall at the edge: multi-dimensional quotas, extraction-focused anomaly detection, and identity-aware enforcement with auditable incident evidence.
What Fairvisor Does for LLM Hosters
Multi-Dimensional Quotas
Control more than requests per second:
- Tokens/minute, tokens/day, and cost/minute
- Limits by endpoint type, model, route, and prompt class
- Burst + sliding-window controls and sustained throughput caps for long-running campaigns
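The quota dimensions above can be sketched as independent sliding windows that a request must clear simultaneously. This is a minimal illustration, not Fairvisor's actual API: the `QuotaWindow` class, the `admit` helper, and the limit values are all hypothetical.

```python
import time
from collections import deque

class QuotaWindow:
    """Sliding-window usage counter for one dimension (tokens, cost, ...)."""
    def __init__(self, limit, window_seconds):
        self.limit = limit
        self.window = window_seconds
        self.events = deque()  # (timestamp, amount) pairs

    def peek(self, amount, now):
        # Drop events that have slid out of the window, then check headroom.
        while self.events and self.events[0][0] <= now - self.window:
            self.events.popleft()
        return sum(a for _, a in self.events) + amount <= self.limit

    def record(self, amount, now):
        self.events.append((now, amount))

# One window per dimension; a request is admitted only if every
# dimension has headroom, so no window is charged on a reject.
tokens_per_minute = QuotaWindow(limit=10_000, window_seconds=60)
cost_per_minute = QuotaWindow(limit=5.00, window_seconds=60)

def admit(request_tokens, request_cost, now=None):
    now = time.monotonic() if now is None else now
    checks = [(tokens_per_minute, request_tokens),
              (cost_per_minute, request_cost)]
    if all(w.peek(amount, now) for w, amount in checks):
        for w, amount in checks:
            w.record(amount, now)
        return True
    return False
```

Checking every dimension before recording any of them keeps a reject from consuming quota, which matters when burst and sustained limits overlap.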
Identity-Aware Enforcement
Policy follows identity, not just IP:
- API key + tenant + user + device fingerprint + ASN/geo + reputation signals
- Distinct trust profiles per key or customer segment
- Step-up actions on suspicion (re-auth, attestation, low-trust profile)
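One way to picture identity-aware enforcement: policy is a function of the whole identity tuple plus reputation, not the API key alone. The field names, trust tiers, and thresholds below are illustrative assumptions, not Fairvisor's actual schema.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Identity:
    api_key: str
    tenant: str
    user: str
    device_fingerprint: str
    asn: int

# Distinct trust profiles per key or customer segment (example values).
TRUST_PROFILES = {
    "trusted":   {"tokens_per_min": 50_000, "step_up": None},
    "default":   {"tokens_per_min": 10_000, "step_up": None},
    "low_trust": {"tokens_per_min": 1_000,  "step_up": "re-auth"},
}

def profile_for(identity, reputation_score):
    # Low reputation triggers a step-up action regardless of segment.
    if reputation_score < 0.2:
        return TRUST_PROFILES["low_trust"]
    # Per-segment override: a hypothetical high-trust tenant.
    if identity.tenant == "enterprise-a":
        return TRUST_PROFILES["trusted"]
    return TRUST_PROFILES["default"]
```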
Extraction-Focused Detection
Detect patterns that look like harvesting, not normal product usage:
- Coverage-style prompt sweeps across broad task spaces
- High prompt variability with similar objective
- Large counts of short sessions and template-correlated prompts
- Uniform overnight runs and sustained high-entropy traffic
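Two of the signals above, template correlation and coverage breadth, can be sketched with simple heuristics. The normalization rule, thresholds, and function names here are illustrative assumptions, not Fairvisor's detection pipeline.

```python
import re
from collections import Counter

def normalize(prompt):
    """Collapse numbers and quoted spans so template variants map together."""
    p = re.sub(r"\d+", "<num>", prompt.lower())
    return re.sub(r"\"[^\"]*\"", "<quoted>", p)

def extraction_signals(prompts, categories, total_categories):
    templates = Counter(normalize(p) for p in prompts)
    top_share = templates.most_common(1)[0][1] / len(prompts)
    coverage = len(set(categories)) / total_categories
    return {
        "template_correlation": top_share,  # near 1.0: one template, many variants
        "coverage_breadth": coverage,       # near 1.0: systematic capability sweep
        "suspicious": top_share > 0.8 or coverage > 0.7,
    }
```

A sweep like "Translate sentence 1", "Translate sentence 2", ... collapses to a single template and scores a correlation of 1.0, while organic chat traffic produces many distinct templates.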
Edge Enforcement Playbooks
Automate responses in real time:
- Throttle
- Tarpit latency
- Quota squeeze
- Hard block
- Cooldown windows
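The response ladder above can be modeled as an escalation table indexed by repeated suspicion signals. This is a minimal sketch; the action parameters and the one-rung-per-strike policy are illustrative, not Fairvisor's defaults.

```python
# Ordered escalation ladder mirroring the playbook list above.
PLAYBOOK = [
    ("throttle",      {"rps_cap": 1}),
    ("tarpit",        {"added_latency_ms": 5_000}),
    ("quota_squeeze", {"tokens_per_min": 500}),
    ("hard_block",    {}),
    ("cooldown",      {"seconds": 3_600}),
]

def next_action(strike_count):
    """Escalate one rung per suspicion signal; stay at the top rung."""
    if strike_count <= 0:
        return None
    rung = min(strike_count, len(PLAYBOOK)) - 1
    return PLAYBOOK[rung]
```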
Forensics and Auditability
Produce proof for security and legal workflows:
- Who initiated traffic, when, and through which identity path
- Token/cost impact over time
- Which controls fired and why
- Exportable incident timeline for compliance and IR
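The evidence listed above can be captured as structured events and exported as one JSON object per line. The field names below are illustrative assumptions, not Fairvisor's actual export schema.

```python
import json
import datetime

def audit_event(identity_path, control, reason, tokens, cost_usd):
    return {
        "ts": datetime.datetime.now(datetime.timezone.utc).isoformat(),
        "identity_path": identity_path,   # e.g. key -> tenant -> user
        "control_fired": control,
        "reason": reason,
        "tokens": tokens,
        "cost_usd": cost_usd,
    }

def export_timeline(events, path):
    # One JSON object per line: easy to diff, sign, and hand to IR/legal.
    with open(path, "w") as f:
        for e in sorted(events, key=lambda e: e["ts"]):
            f.write(json.dumps(e) + "\n")
```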
Optional Honey Endpoints
Canary routes can improve detection fidelity and attribution for automated collection behavior.
What Extraction Looks Like

Model extraction campaigns don't look like DDoS. They look like unusually systematic product usage.
Prompt sweep
— A single API key sends 50,000 requests over 72 hours. Each prompt is a slight variation on the same task template, incrementally covering the model’s behavior space. Request volume is moderate; coverage breadth is the signal. Normal users don’t systematically probe every capability.
Coverage run
— Requests are distributed across task categories: summarization, translation, code generation, reasoning, classification. The prompt structure is formulaic. Completion lengths cluster tightly. This is a training data collection job, not product usage.
Overnight template run
— Traffic is uniform between 2am and 6am. Identical prompt structure, high retry rate, no user-session context. Human users don’t work like this; automated collection jobs do.
Key rotation
— When one key hits its limit, traffic shifts to a new key within seconds. The request fingerprint is identical. Fairvisor tracks this across identity transitions.
→ LLM Token Limiter | Loop Detector
Policy Playbooks Out of the Box
Prebuilt policy packs for common deployments:
- Public LLM API
- B2B Copilot
- Internal Assistant
- Consumer Chat
Each preset includes defaults for mass-query and scraping resistance, then can be tuned per tenant and model.
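The presets might be represented as a base configuration per deployment type with per-tenant overrides layered on top. Every name and number below is illustrative, not a shipped default.

```python
POLICY_PACKS = {
    "public_llm_api": {
        "tokens_per_min": 6_000, "cost_per_min_usd": 0.50,
        "coverage_alert": 0.6,
        "playbook": ["throttle", "tarpit", "hard_block"],
    },
    "b2b_copilot": {
        "tokens_per_min": 30_000, "cost_per_min_usd": 5.00,
        "coverage_alert": 0.8,
        "playbook": ["throttle", "quota_squeeze"],
    },
    "internal_assistant": {
        "tokens_per_min": 60_000, "cost_per_min_usd": 10.00,
        "coverage_alert": 0.9,
        "playbook": ["throttle"],
    },
    "consumer_chat": {
        "tokens_per_min": 3_000, "cost_per_min_usd": 0.25,
        "coverage_alert": 0.5,
        "playbook": ["throttle", "tarpit", "hard_block", "cooldown"],
    },
}

def tuned(pack_name, **overrides):
    """Start from a preset, then tune per tenant or model."""
    return {**POLICY_PACKS[pack_name], **overrides}
```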
Minimal MVP Scope (Practical Rollout)
Start with three controls:
- Token/cost limits with burst controls
- Policy engine with core extraction signals (velocity, coverage, template and retry patterns)
- Automated responses (throttle, tarpit, block) with complete audit log
This is enough to make a credible claim: model extraction attempts become measurably riskier and more expensive, backed by controls you can audit.
Who This Is For
- Teams exposing LLM functionality as paid API products
- Vertical copilots with expensive prompt pipelines and proprietary system prompts
- Platforms where behavioral cloning risk is material
FAQ
What types of rate limits can I apply per model?
Token/minute, token/day, cost/minute — configurable per model, endpoint type, prompt class, or customer segment. Separate burst and sustained throughput controls limit both spike traffic and long-running campaigns.
How does extraction detection work?
Fairvisor tracks prompt patterns across requests per identity: coverage breadth (are prompts systematically spanning task types?), variability with shared objective, timing uniformity, retry rates, and session context absence. Normal product usage doesn’t look like a systematic capability sweep.
What happens when extraction is detected?
Configurable enforcement playbook: throttle → tarpit latency → quota squeeze → block → cooldown window. You control the escalation levels. Fairvisor provides the signals and automates the response in real time.
How does Fairvisor track identity across key rotation?
By fingerprinting request patterns — prompt structure, timing, ASN, and session context — not just the API key. When one key hits a limit and traffic shifts immediately to a new key with the same behavioral fingerprint, Fairvisor tracks the continuity across the transition.
→ Loop Detector docs
Does Fairvisor support streaming (SSE) responses?
Yes. Token counting happens during streaming. If a completion exceeds configured limits mid-stream, Fairvisor closes the stream gracefully with finish_reason: length. No corrupted responses, no wasted tokens after the cutoff.
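The mid-stream cutoff behavior described above can be sketched as a relay that counts tokens per chunk and closes cleanly at the limit. The generator and chunk shape are illustrative assumptions; real SSE handling depends on the serving stack.

```python
def stream_with_limit(chunks, token_limit):
    """Relay streamed chunks, counting tokens; cut off cleanly at the limit."""
    sent = 0
    for chunk in chunks:
        n = len(chunk["tokens"])
        if sent + n > token_limit:
            # Forward only what fits, then close with finish_reason: length.
            room = token_limit - sent
            if room:
                yield {"tokens": chunk["tokens"][:room], "finish_reason": None}
            yield {"tokens": [], "finish_reason": "length"}
            return
        sent += n
        yield chunk
    yield {"tokens": [], "finish_reason": "stop"}
```

Truncating inside the final chunk, rather than dropping it whole, is what avoids wasting tokens past the cutoff while still ending the stream on a well-formed event.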