while you focus on building,
we handle the cache.
fgy cache is your inference caching department from day one. every application paying for LLM inference has a structural cost recovery opportunity — repeated prompts, semantically equivalent queries, concurrent identical requests. we capture all of it and charge only a fraction of what we save you.
import os
from openai import OpenAI

client = OpenAI(
    api_key="fgy_...",  # cache tenant key
    base_url="https://api.fgy.ai/v1",
    default_headers={
        "X-Provider-Auth": f"Bearer {os.environ['OPENAI_API_KEY']}",  # forwarded on miss
    },
)
fgy does not store your provider key. the X-Provider-Auth header travels with each request and is forwarded upstream only on cache misses. on a hit, it is never read.
why this exists
prompt repetition is structural, not an edge case.
every production LLM application has a class of traffic where the same or semantically equivalent prompt arrives repeatedly — support bots, semantic search, document Q&A, classification pipelines, code completion with shared context. that traffic is invisible money.
fgy sits in front of your provider and captures that traffic. exact matches return in microseconds from ETS. near-matches resolve against a pgvector store. concurrent identical in-flight requests collapse into a single upstream call. the savings accumulate from the first request.
you keep 85 cents of every dollar fgy saves you. we take 15. if nothing is saved, nothing is charged.
cache paths
three paths. two skip the upstream call entirely; the third makes exactly one.
exact match
the request is normalized and hashed. if the tenant's ETS shard has a matching entry, the cached response returns in microseconds. no embedding call, no database query, no upstream hop.
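the exact-match path can be sketched in a few lines of Python. this is a hedged illustration only: the normalization rules, hash choice, and ttl here are assumptions, not fgy's actual implementation.

```python
import hashlib
import json
import time

def normalize(request: dict) -> str:
    # canonicalize the request: sorted keys, no insignificant whitespace,
    # so equivalent payloads hash to the same key (assumed rules)
    return json.dumps(request, sort_keys=True, separators=(",", ":"))

def cache_key(tenant: str, request: dict) -> str:
    # tenant-scoped digest so entries never cross tenant boundaries
    digest = hashlib.sha256(normalize(request).encode()).hexdigest()
    return f"{tenant}:{digest}"

store: dict[str, tuple[float, dict]] = {}  # key -> (expires_at, response)

def exact_store(tenant: str, request: dict, response: dict, ttl: float = 300.0) -> None:
    store[cache_key(tenant, request)] = (time.monotonic() + ttl, response)

def exact_lookup(tenant: str, request: dict):
    entry = store.get(cache_key(tenant, request))
    if entry and entry[0] > time.monotonic():
        return entry[1]  # hit: no embedding call, no database, no upstream
    return None          # fall through to the semantic layer
```

because `normalize` sorts keys, two requests that differ only in field order resolve to the same entry.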
semantic match
on an ETS miss, the prompt is embedded and checked against prior responses via pgvector cosine distance. entries at or above the 0.92 cosine similarity threshold (a cosine distance of 0.08 or less) serve the stored payload. no provider call.
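the threshold check itself is simple. a hedged Python sketch of the decision (pure-Python cosine in place of the pgvector operator; assumes non-zero embedding vectors):

```python
import math

SIM_THRESHOLD = 0.92  # the document's similarity cutoff

def cosine_similarity(a, b) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def semantic_hit(query_vec, entries):
    # entries: list of (embedding, cached_response); nearest neighbour wins
    best = max(entries, key=lambda e: cosine_similarity(query_vec, e[0]), default=None)
    if best and cosine_similarity(query_vec, best[0]) >= SIM_THRESHOLD:
        return best[1]
    return None  # below threshold: fall through to a true miss
```

in production this comparison runs inside postgres; the sketch only shows why 0.92 similarity and 0.08 distance are the same cutoff.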
miss + coalescing
true misses go upstream using your forwarded provider key. concurrent callers with the same prompt collapse into one upstream request via the OTP GenServer coalescer. all waiters receive the broadcast simultaneously.
stack
built on the right runtime for this problem.
a cache is a concurrency problem. Elixir and OTP were built for exactly this — millions of lightweight processes, preemptive scheduling, per-process garbage collection with no global pauses, and message passing as the concurrency primitive. the entire cache layer is a natural expression of the BEAM.
the runtime
Elixir on the BEAM gives each request its own process. supervision trees mean crashes are isolated and recovered automatically. the scheduler handles thousands of concurrent cache lookups without blocking.
in-memory exact store
16 shards with read_concurrency: true. keyed via :erlang.phash2. lookup, validation, and ttl check happen without touching any external process. exact hits never leave the BEAM VM.
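the shard routing can be sketched outside the BEAM. a hedged Python analogue: `zlib.crc32` stands in for `:erlang.phash2` (chosen only because Python's built-in `hash()` is randomized per process), and plain dicts stand in for ETS tables.

```python
import zlib

NUM_SHARDS = 16                                # matches the 16 shards above
shards = [dict() for _ in range(NUM_SHARDS)]   # dicts standing in for ETS tables

def shard_for(key: str) -> dict:
    # stable hash modulo shard count: the same key always routes
    # to the same shard, spreading contention across 16 tables
    return shards[zlib.crc32(key.encode()) % NUM_SHARDS]
```

spreading keys across shards keeps any single table from becoming a write bottleneck while reads stay lock-free.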
semantic similarity
prompt embeddings stored per tenant and model. nearest-neighbour search via the <=> cosine distance operator directly in postgres. hit counts incremented via Task.start, never blocking the response path.
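the fire-and-forget increment has a direct analogue in any runtime. a hedged Python sketch with a thread standing in for `Task.start` (names are illustrative):

```python
import threading

hit_counts: dict[str, int] = {}
_lock = threading.Lock()

def record_hit(entry_id: str) -> threading.Thread:
    # Task.start analogue: the increment runs off the response path,
    # so serving the cached payload never waits on bookkeeping
    def bump():
        with _lock:
            hit_counts[entry_id] = hit_counts.get(entry_id, 0) + 1
    t = threading.Thread(target=bump, daemon=True)
    t.start()
    return t  # returned only so a caller or test can await it if needed
```

the point is ordering: the response is sent first, the counter catches up whenever the scheduler gets to it.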
global distribution
deployed across Fly.io regions. requests route to the nearest instance. the BEAM cluster handles state coordination across nodes. low-latency cache access regardless of where your traffic originates.
N concurrent requests → 1 upstream call.
this is not a debounce or a queue. the first process to register for an in-flight key gets :execute. every subsequent arrival gets :wait and blocks on receive. when the executing process completes, GenServer.cast broadcasts to all waiters simultaneously. your provider sees one request, billed once, regardless of how many clients triggered it.
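the same collapse can be sketched outside the BEAM. a hedged Python analogue: a lock and per-key events stand in for the GenServer, with error handling elided for brevity (if `upstream` raises, waiters are left without a value).

```python
import threading

class Coalescer:
    """first caller for an in-flight key executes; later callers wait."""

    def __init__(self):
        self._lock = threading.Lock()
        self._inflight = {}  # key -> (done_event, result_box)

    def fetch(self, key, upstream):
        with self._lock:
            entry = self._inflight.get(key)
            if entry is None:
                # :execute analogue: this caller owns the upstream call
                event, box = threading.Event(), {}
                self._inflight[key] = (event, box)
                leader = True
            else:
                # :wait analogue: a call for this key is already in flight
                event, box = entry
                leader = False
        if leader:
            try:
                box["value"] = upstream()   # the single upstream request
            finally:
                with self._lock:
                    self._inflight.pop(key, None)
                event.set()                 # wake every waiter at once
            return box["value"]
        event.wait()                        # block until the leader finishes
        return box["value"]
```

five concurrent callers, one upstream invocation, five identical responses: the provider bills once.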
pricing model
we take a cut of what we save you.
no platform fee. no minimum spend. misses pass through for free — your provider key is forwarded, and fgy charges nothing for the proxy hop. on hits, fgy bills 15% of the avoided provider cost at list price.
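the billing math above as a worked example. the percentages come from this page; the function name is illustrative.

```python
def fgy_fee(avoided_cost: float, share: float = 0.15) -> float:
    """fgy bills 15% of the provider cost a hit avoided, at list price."""
    return avoided_cost * share

# a hit that avoided a $0.40 provider call at list price:
fee = fgy_fee(0.40)   # fgy's cut of the saving
kept = 0.40 - fee     # the 85% you keep
# a miss avoids nothing, so nothing is billed:
assert fgy_fee(0.0) == 0.0
```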
roadmap
v1 is current. what's coming in v2.
v1 establishes the foundation — proxy, three-layer cache, pay-as-you-save billing. v2 focuses on depth: better controls, deeper observability, and expanding what the cache can serve.
a reproducible script that runs a real prompt corpus through fgy and graphs hit rates, latency distributions, and cost savings against baseline direct inference. will be linked here once ready.
we're building this in the open.
fgy is at round zero. if you're spending meaningful money on LLM inference and want to be involved early — as a design partner, pilot user, or in an investment conversation — we want to hear from you. we're looking for teams for whom the math on caching is obvious and who want to shape what the product becomes.
hello@fgy.ai