Live scheduler pressure / OpenAI-compatible edge

obleth

Fairshare admission for shared AI capacity.

Put one Rust gateway between clients and your OpenAI-compatible backends. obleth resolves tenant identity, admits work by weighted share when GPUs are full, and records token-accurate usage for the teams sharing the cluster.

Quick start Architecture View source

hot path: auth -> budget -> scheduler -> upstream

under load: queue by weighted share, not arrival time

ledger: tokens, wait time, model route, tenant

8-step

Hot path

Authenticate, budget, cache, schedule, proxy, stream, reconcile, and record every request

OpenAI

Compatible API

Chat, embeddings, images, audio, MCP, and model:auto routing behind one authenticated surface

Weights

Fair under load

Production, research, and sandbox traffic keep their configured share when the pool saturates

Live

Operator console

Tenants, keys, models, queue depth, model health, and usage are visible without shell-diving

Control plane

The gateway is visible

The dashboard is not decoration. It is where operators see capacity, queue pressure, model health, tenants, and keys while the data plane keeps serving traffic.

Scheduler pressure, route health, and tenant load update from the same control-plane concepts documented here.

Dashboard guide

Figure (control-plane / overview): obleth dashboard showing gateway capacity, queue depth, model health, traffic, and tenant load.

Admission path

Every request earns its place

Cache hits return before fairshare. Token budget exhaustion is a hard stop. Saturation is different: requests wait until their tenant deserves the next slot.
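The three outcomes above can be sketched as one decision function. This is a minimal illustration, not obleth's implementation: the names `RequestState`, `Outcome`, and `decide` are hypothetical, and running the budget check before the cache lookup is an assumption taken from the order the hot path is listed in (budget, cache, schedule).

```python
from dataclasses import dataclass
from enum import Enum


class Outcome(Enum):
    REJECTED = "budget_exhausted"  # hard stop: the tenant has no budget left
    CACHE_HIT = "cache_hit"        # answered from cache, never reaches fairshare
    QUEUED = "queued"              # pool is saturated, wait for a fairshare pick
    ADMITTED = "admitted"          # a slot is free right now


@dataclass
class RequestState:
    budget_remaining: int  # tokens left in the tenant's budget
    estimated_tokens: int  # tokens this request is expected to use
    cached: bool           # whether a cached response exists
    free_slots: int        # permits in the pool that nobody holds


def decide(state: RequestState) -> Outcome:
    # Assumed order: budget first, then cache, then the scheduler.
    if state.budget_remaining < state.estimated_tokens:
        return Outcome.REJECTED
    if state.cached:
        return Outcome.CACHE_HIT
    if state.free_slots == 0:
        return Outcome.QUEUED
    return Outcome.ADMITTED
```

Only the `QUEUED` branch involves waiting. The other three resolve immediately, which is the distinction the paragraph above draws between a hard stop and saturation.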

Identify the tenant

Bearer keys are hashed, cached, and resolved into tenant context before any upstream work starts.
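A rough sketch of that lookup, assuming SHA-256 as the hash (the page does not name the algorithm) and plain dictionaries standing in for the key store and the cache. `KEY_STORE`, `CACHE`, `resolve_tenant`, and the demo key are hypothetical.

```python
import hashlib
from typing import Dict, Optional

# Hypothetical stand-ins: only the hash of a key is stored, never the key itself.
KEY_STORE: Dict[str, str] = {
    hashlib.sha256(b"sk-demo-123").hexdigest(): "chatbot",
}
CACHE: Dict[str, str] = {}


def resolve_tenant(authorization_header: str) -> Optional[str]:
    """Map an 'Authorization: Bearer <key>' value to a tenant name, or None."""
    scheme, _, token = authorization_header.partition(" ")
    if scheme.lower() != "bearer" or not token:
        return None
    digest = hashlib.sha256(token.encode()).hexdigest()
    if digest in CACHE:  # fast path: key already resolved
        return CACHE[digest]
    tenant = KEY_STORE.get(digest)
    if tenant is not None:
        CACHE[digest] = tenant  # remember successful resolutions only
    return tenant
```

Unknown keys return `None` and are not cached here. A real gateway would also need cache expiry so that rotated keys stop resolving.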

Reserve the budget

Redis-backed token buckets enforce per-tenant tokens-per-minute (TPM) limits and term budgets across every gateway pod.
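For readers new to token buckets, here is a minimal in-memory sketch of a per-tenant TPM limit. It is not the Redis-backed version: a shared implementation would need the refill and the reservation to happen atomically in the store. The class, the helper, and the choice of a bucket capacity equal to one minute of tokens are assumptions.

```python
from dataclasses import dataclass


@dataclass
class TokenBucket:
    capacity: float        # most tokens available at once (assumed: one minute's worth)
    refill_per_sec: float  # TPM divided by 60
    tokens: float          # tokens currently available
    last_refill: float     # timestamp of the last refill, in seconds

    def try_reserve(self, amount: float, now: float) -> bool:
        """Refill for the elapsed time, then reserve `amount` if it fits."""
        elapsed = max(0.0, now - self.last_refill)
        self.tokens = min(self.capacity, self.tokens + elapsed * self.refill_per_sec)
        self.last_refill = now
        if amount > self.tokens:
            return False  # budget exhausted: the caller treats this as a hard stop
        self.tokens -= amount
        return True


def bucket_for_tpm(tpm: int, now: float) -> TokenBucket:
    """Build a full bucket for a tokens-per-minute limit."""
    return TokenBucket(capacity=tpm, refill_per_sec=tpm / 60.0, tokens=tpm, last_refill=now)
```

With a 6,000 TPM limit, a 5,000-token reservation succeeds, an immediate 2,000-token one fails, and the same 2,000-token request fits again ten seconds later once 1,000 tokens have refilled.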

Earn a slot

When capacity is tight, the scheduler admits whichever tenant is most behind its fair share.
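A small sketch of that pick, using the score this page describes for the scheduler demo: served tokens divided by weight, lowest score first. The `Tenant` type, the function names, and the weights and token counts in the usage note are illustrative assumptions.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass
class Tenant:
    name: str
    weight: float       # configured share weight
    served_tokens: int  # tokens already served to this tenant
    queued: int         # requests currently waiting


def share_score(tenant: Tenant) -> float:
    # Lower means the tenant has received less service relative to its weight.
    return tenant.served_tokens / tenant.weight


def pick_next(tenants: Iterable[Tenant]) -> Optional[Tenant]:
    """Return the waiting tenant that is furthest behind, or None if nobody waits."""
    waiting = [t for t in tenants if t.queued > 0]
    if not waiting:
        return None
    return min(waiting, key=share_score)
```

For example, with chatbot at 9,000,000 served tokens and weight 10, api-batch at 500,000 and weight 1, and analytics at 1,200,000 and weight 2, the scores are 900,000, 500,000, and 600,000, so api-batch gets the next released slot if all three are waiting.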

Reconcile reality

Actual input and output tokens land in ClickHouse with wait time, admission class, and model route.
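One way to picture a reconciled row, as a sketch only. The fields mirror what the sentence above lists, but the real ClickHouse schema is not shown on this page. The `reconcile` helper, which compares an earlier reservation with actual usage, and every value in the usage note are assumptions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class UsageRow:
    request_id: str
    tenant: str
    model_route: str      # which upstream model served the request
    admission_class: str  # hypothetical values, e.g. "immediate" or "queued"
    input_tokens: int
    output_tokens: int
    wait_ms: int          # time spent waiting for a slot

    @property
    def total_tokens(self) -> int:
        return self.input_tokens + self.output_tokens


def reconcile(reserved_tokens: int, row: UsageRow) -> int:
    """Return the budget adjustment once actual usage is known.

    Positive: unused reservation to give back. Negative: overrun to charge.
    """
    return reserved_tokens - row.total_tokens
```

For a request that reserved 8,000 tokens and actually used 6,000 input plus 1,200 output, the adjustment is 800 tokens returned to the tenant's budget.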

Live scheduler

Fairshare you can see

This is an illustrative scheduler lens: slots, queues, share_score, and target share move together so the fairness model is tangible.

Live admission console

Queue pressure becomes weighted admission, held permits, and reconciled usage rows.

queue: 12 waiting
pick: api-batch
pool: 64/64
ledger: ClickHouse

Incoming demand

chatbot: queued 0
api-batch: queued 12
analytics: queued 0

Fairshare picker

next permit: api-batch

Waiting tenants are ranked by served tokens divided by weight. The lowest score gets the next released slot.

#1 chatbot: 8,420,000
#2 api-batch: 8,420,000
#3 analytics: 8,420,000

Active pool

Held request permits by tenant: 64/64 (utilization 100%).

chatbot: held 49 / target 76.9%
api-batch: held 5 / target 7.7%
analytics: held 10 / target 15.4%
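The target percentages in this readout are consistent with each tenant's weight divided by the sum of all weights. The sketch below assumes weights of 10, 1, and 2. Those values are not stated on this page and were chosen only because they reproduce the displayed targets.

```python
from typing import Dict


def target_shares(weights: Dict[str, float]) -> Dict[str, float]:
    """Target share of the pool, in percent, for each tenant."""
    total = sum(weights.values())
    return {name: 100.0 * w / total for name, w in weights.items()}


def held_shares(held: Dict[str, int]) -> Dict[str, float]:
    """Share of currently held permits, in percent, for each tenant."""
    total = sum(held.values())
    return {name: 100.0 * n / total for name, n in held.items()}


# Assumed weights (not given in the source) and the held permits shown above.
weights = {"chatbot": 10, "api-batch": 1, "analytics": 2}
held = {"chatbot": 49, "api-batch": 5, "analytics": 10}

targets = {k: round(v, 1) for k, v in target_shares(weights).items()}
actual = {k: round(v, 1) for k, v in held_shares(held).items()}
```

Under those assumed weights, `targets` comes out as 76.9, 7.7, and 15.4 percent, and the held permits (49, 5, and 10 of 64) work out to 76.6, 7.8, and 15.6 percent, each within half a point of its target.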

Usage ledger

req_84f2  chatbot    1.8k tok  12ms
req_0ac9  api-batch  7.2k tok  1.4s
req_91be  analytics  3.1k tok  34ms

After streaming completes, actual tokens and wait time are reconciled into ClickHouse for dashboards and rollups.

Charts: slot utilization (100%), queue depth, entitlement gap.

What obleth adds

The layer inference backends skip

vLLM and Aibrix are excellent at serving models. obleth handles the multi-tenant policy layer they deliberately leave out.

fairshare_engine

Weighted admission

Hierarchical mode partitions the global in-flight slots by group, then splits each group's slots among its tenants. In weighted mode, all tenants compete globally on share_score. Both modes are starvation-free.
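To make the hierarchical split concrete, here is a sketch that divides a slot pool by group weight and then by tenant weight within each group. The largest-remainder rounding, the function names, and the group layout in the usage note are assumptions. The page does not say how obleth rounds fractional slots.

```python
from typing import Dict, Tuple


def apportion(total: int, weights: Dict[str, float]) -> Dict[str, int]:
    """Split `total` whole slots in proportion to `weights` (largest remainder)."""
    weight_sum = sum(weights.values())
    exact = {k: total * w / weight_sum for k, w in weights.items()}
    result = {k: int(v) for k, v in exact.items()}  # floor of each exact share
    leftover = total - sum(result.values())
    # Hand the remaining slots to the largest fractional parts.
    by_remainder = sorted(weights, key=lambda k: exact[k] - result[k], reverse=True)
    for k in by_remainder[:leftover]:
        result[k] += 1
    return result


def hierarchical_slots(
    total: int, groups: Dict[str, Tuple[float, Dict[str, float]]]
) -> Dict[str, Dict[str, int]]:
    """groups maps group name -> (group weight, {tenant: tenant weight})."""
    group_slots = apportion(total, {g: gw for g, (gw, _) in groups.items()})
    return {g: apportion(group_slots[g], tenants) for g, (_, tenants) in groups.items()}
```

With 64 slots, a hypothetical production group of weight 3 holding chatbot (3) and api-batch (1), and a research group of weight 1 holding analytics, the groups receive 48 and 16 slots, and production splits its 48 into 36 and 12.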

model_router

Capacity-aware routing

Send model:auto and obleth can choose by capacity, health, price, tags, and an optional intent classifier across registered providers.
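A simplified sketch of what such a choice could look like. It filters on health, free capacity, and tags, then prefers the cheapest candidate. The ordering of those criteria, the `Route` type, and the model names in the usage note are assumptions, and the optional intent classifier is left out.

```python
from dataclasses import dataclass
from typing import FrozenSet, Iterable, Optional


@dataclass(frozen=True)
class Route:
    model: str
    healthy: bool
    free_slots: int
    price_per_1k_tokens: float
    tags: FrozenSet[str] = frozenset()


def choose_route(
    routes: Iterable[Route], required_tags: FrozenSet[str] = frozenset()
) -> Optional[Route]:
    """Pick a route for model:auto, or None if nothing is eligible."""
    candidates = [
        r for r in routes
        if r.healthy and r.free_slots > 0 and required_tags <= r.tags
    ]
    if not candidates:
        return None
    # Cheapest first; break price ties by the most free capacity.
    return min(candidates, key=lambda r: (r.price_per_1k_tokens, -r.free_slots))
```

Given a healthy but full `small-chat` route, an unhealthy `mid-chat` route, and a healthy `large-chat` route with free slots, the sketch returns `large-chat` even though it costs more, because it is the only eligible candidate.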

token_budget

Budgets that mean it

Per-tenant TPM, in-flight caps, model allowlists, and lifetime or monthly spend budgets stop overload without hiding why a request was held.

control_plane

Operate the hot path

Create tenants, rotate keys, tune model slots, watch health, and inspect scheduler pressure from a dashboard backed by Postgres, Redis, and ClickHouse.

Operations

Built for shared clusters

Tune priority live

Raise a tenant's weight during an incident and every gateway pod honors it on the next request.

Find the model knee

Run capacity autotune probes and apply recommended model slot caps when you are ready.

Compose with Aibrix

Let Aibrix or vLLM handle replica execution while obleth owns tenant policy before routing.

Deploy

Bring a real gateway online fast.

Docker Compose gives you the data plane, dashboard, Postgres, Redis, ClickHouse, HAProxy, Prometheus, Grafana, and a benchmark backend. Helm charts are ready when the same shape moves to Kubernetes.

Stack

Compose

Gateway, dashboard, edge, and datastores

Ops

Grafana

Pre-wired Prometheus dashboards

K8s

Helm

Published chart and overrideable values

Start

~5 min

First tenant, key, and chat request