fast generative yield

while you focus on building, we handle the cache.

fgy cache is your inference caching department from day one. every application paying for LLM inference has a structural cost recovery opportunity — repeated prompts, semantically equivalent queries, concurrent identical requests. we capture all of it and charge only a fraction of what we save you.

fgy_charge = tokens_saved × provider_rate × 0.15
client.py

from openai import OpenAI
import os

- client = OpenAI(api_key="sk-...")
+ client = OpenAI(
+     api_key="fgy_...",                  # cache tenant key
+     base_url="https://api.fgy.ai/v1",
+     default_headers={
+         # forwarded upstream only on cache misses
+         "X-Provider-Auth": f"Bearer {os.environ['OPENAI_API_KEY']}",
+     },
+ )

response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[...],
)
response headers on a hit: x-fgy-cache: exact · x-fgy-tokens-saved: 1024. on a miss: x-fgy-cache: miss.

fgy does not store your provider key. the X-Provider-Auth header travels with each request and is forwarded upstream only on cache misses. on a hit, it is never read.
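a minimal sketch of acting on these headers client-side. the helper below is illustrative and not part of any fgy SDK; the header names are taken from the example response above (with the openai v1 Python SDK, raw response headers can be read via the with_raw_response variant of each call).

```python
# illustrative helper, not part of any fgy SDK: interpret the x-fgy-*
# response headers shown above. header names come from the example
# response; the function itself is an assumption.
def classify_fgy_response(headers: dict) -> tuple[str, int]:
    """Return (cache outcome, tokens saved) from fgy response headers."""
    outcome = headers.get("x-fgy-cache", "miss")       # "exact" or "miss" per the example above
    saved = int(headers.get("x-fgy-tokens-saved", 0))  # absent on a miss
    return outcome, saved

print(classify_fgy_response({"x-fgy-cache": "exact", "x-fgy-tokens-saved": "1024"}))
# ('exact', 1024)
```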

~0 μs · exact cache latency
<10 ms · semantic hit latency
N→1 · concurrent deduplication
$0 · fgy charge on miss

why this exists

prompt repetition is structural, not an edge case.

every production LLM application has a class of traffic where the same or semantically equivalent prompt arrives repeatedly — support bots, semantic search, document Q&A, classification pipelines, code completion with shared context. that traffic is invisible money.

fgy sits in front of your provider and captures that traffic. exact matches return in microseconds from ETS. near-matches resolve against a pgvector store. concurrent identical in-flight requests collapse into a single upstream call. the savings accumulate from the first request.

you keep 85 cents of every dollar fgy saves you. we take 15. if nothing is saved, nothing is charged.

cache paths

three paths. each one cheaper than going upstream.

01

exact match

the request is normalized and hashed. if the tenant's ETS shard has a matching entry, the cached response returns in microseconds. no embedding call, no database query, no upstream hop.

provider: $0  ·  fgy: $0
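the normalize-and-hash step can be sketched in a few lines. fgy's actual canonical form isn't documented here, so the rules below (sorted JSON keys, compact separators, SHA-256) are assumptions chosen to show the idea: two requests that differ only in key order produce the same cache key.

```python
# a sketch of the exact-match key under assumed normalization rules
# (sorted JSON keys, compact separators, SHA-256); fgy's actual
# canonical form is not specified here, so treat this as illustrative.
import hashlib
import json

def cache_key(tenant: str, request: dict) -> str:
    canonical = json.dumps(request, sort_keys=True, separators=(",", ":"))
    return tenant + ":" + hashlib.sha256(canonical.encode()).hexdigest()

a = cache_key("t1", {"model": "gpt-4o-mini", "messages": [{"role": "user", "content": "hi"}]})
b = cache_key("t1", {"messages": [{"role": "user", "content": "hi"}], "model": "gpt-4o-mini"})
print(a == b)  # True: key order is normalized away
```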
02

semantic match

on an ETS miss, the prompt is embedded and checked against prior responses via pgvector cosine distance. entries at or above the 0.92 cosine-similarity threshold serve the stored payload. no provider call.

provider: $0  ·  fgy: 15% of avoided cost
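the hit decision, sketched in plain Python. pgvector's <=> operator returns cosine *distance* (1 minus cosine similarity), so the 0.92 similarity threshold corresponds to a distance cutoff of 0.08; the toy 2-d vectors below just illustrate the comparison.

```python
# the semantic-hit decision, sketched without a database. pgvector's <=>
# operator returns cosine distance (1 - cosine similarity), so a 0.92
# similarity threshold is a distance cutoff of 0.08.
import math

def cosine_distance(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norm

def is_semantic_hit(query_vec, stored_vec, similarity_threshold=0.92):
    return cosine_distance(query_vec, stored_vec) <= 1.0 - similarity_threshold

print(is_semantic_hit([1.0, 0.0], [1.0, 0.0]))  # True: identical embeddings
print(is_semantic_hit([1.0, 0.0], [0.0, 1.0]))  # False: orthogonal embeddings
```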
03

miss + coalescing

true misses go upstream using your forwarded provider key. concurrent callers with the same prompt collapse into one upstream request via the OTP GenServer coalescer. all waiters receive the broadcast simultaneously.

provider: normal rate  ·  fgy: $0

stack

built on the right runtime for this problem.

a cache is a concurrency problem. Elixir and OTP were built for exactly this — millions of lightweight processes, preemptive scheduling, per-process garbage collection with no stop-the-world pauses, and message passing as the concurrency primitive. the entire cache layer is a natural expression of the BEAM.

elixir / otp

the runtime

Elixir on the BEAM gives each request its own process. supervision trees mean crashes are isolated and recovered automatically. the scheduler handles thousands of concurrent cache lookups without blocking.

ets

in-memory exact store

16 shards with read_concurrency: true. keyed via :erlang.phash2. lookup, validation, and ttl check happen without touching any external process. exact hits never leave the BEAM VM.
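the shard-routing idea is just hash-then-modulo. the real store uses :erlang.phash2 on the BEAM; in the sketch below crc32 stands in purely to keep the example deterministic, and the function name is illustrative.

```python
# a sketch of 16-way shard selection: hash the key, take it modulo the
# shard count. the real store uses :erlang.phash2; crc32 stands in here
# only to make the Python example deterministic.
import zlib

NUM_SHARDS = 16

def shard_for(key: str) -> int:
    return zlib.crc32(key.encode()) % NUM_SHARDS

print(shard_for("tenant1:abc123"))  # same key always routes to the same shard
```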

pgvector

semantic similarity

prompt embeddings stored per tenant and model. nearest-neighbour search via the <=> cosine distance operator directly in postgres. hit counts incremented via Task.start, never blocking the response path.

fly.io

global distribution

deployed across Fly.io regions. requests route to the nearest instance. the BEAM cluster handles state coordination across nodes. low-latency cache access regardless of where your traffic originates.

genserver coalescer

N concurrent requests → 1 upstream call.

this is not a debounce or a queue. the first process to register for an in-flight key gets :execute. every subsequent arrival gets :wait and blocks on receive. when the executing process completes, GenServer.cast broadcasts to all waiters simultaneously. your provider sees one request, billed once, regardless of how many clients triggered it.

pid 1 → :execute
pid 2 → :wait
pid 3 → :wait
...
pid n → :wait

1 upstream request → N simultaneous responses
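the :execute / :wait flow can be illustrated with a minimal single-flight sketch. asyncio futures stand in for GenServer state and message broadcast here, so this shows the pattern, not fgy's Elixir implementation.

```python
# a minimal single-flight sketch of the N -> 1 behaviour above. asyncio
# futures stand in for the GenServer's in-flight registry and broadcast;
# an illustration of the pattern, not fgy's OTP implementation.
import asyncio

class Coalescer:
    def __init__(self):
        self._inflight: dict[str, asyncio.Future] = {}

    async def fetch(self, key, upstream):
        if key in self._inflight:                 # :wait - join the in-flight call
            return await self._inflight[key]
        fut = asyncio.get_running_loop().create_future()
        self._inflight[key] = fut                 # :execute - first caller goes upstream
        try:
            result = await upstream(key)
        except BaseException as exc:
            fut.set_exception(exc)                # waiters see the failure too
            raise
        else:
            fut.set_result(result)                # broadcast to every waiter at once
            return result
        finally:
            del self._inflight[key]

calls = 0

async def fake_upstream(key):
    global calls
    calls += 1                                    # count upstream requests
    await asyncio.sleep(0.01)                     # simulated provider latency
    return f"response:{key}"

async def main():
    c = Coalescer()
    return await asyncio.gather(*(c.fetch("same-prompt", fake_upstream) for _ in range(5)))

results = asyncio.run(main())
print(calls, len(results))  # 1 5
```

five concurrent callers, one upstream call, five identical responses.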

pricing model

we take a cut of what we save you.

no platform fee. no minimum spend. misses pass through for free — your provider key is forwarded, and fgy charges nothing for the proxy hop. on hits, fgy bills 15% of the avoided provider cost at list price.

saved $10 · you keep $8.50 · fgy $1.50
full pricing and formula breakdown
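a worked example of the formula from the top of the page (fgy_charge = tokens_saved × provider_rate × 0.15) and the $10 split above. any provider rate you plug in is your model's list price; the numbers here are only for arithmetic.

```python
# a worked example of the billing formula above; provider rates are
# whatever your model's list price is, so the split is the only fixed part.
FGY_SHARE = 0.15

def fgy_charge(tokens_saved: int, provider_rate: float) -> float:
    return tokens_saved * provider_rate * FGY_SHARE

def split_savings(avoided_cost: float) -> tuple[float, float]:
    fee = avoided_cost * FGY_SHARE
    return avoided_cost - fee, fee

kept, fee = split_savings(10.00)
print(round(kept, 2), round(fee, 2))  # 8.5 1.5
```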

roadmap

v1 (current)

what's coming in v2.

v1 establishes the foundation — proxy, three-layer cache, pay-as-you-save billing. v2 focuses on depth: better controls, deeper observability, and expanding what the cache can serve.

streaming response cache
cache and replay streaming completions token by token. full streaming api compatibility.
per-tenant similarity thresholds
configure the pgvector cosine threshold per key. tighter for precise use cases, looser for high-repetition pipelines.
cache warming api
pre-seed the cache from your prompt corpus before traffic arrives. deploy knowing your hit rate is already primed.
multi-provider routing
route misses to the cheapest available provider for the requested model. cache the result regardless of which provider fulfilled it.
sdk packages
first-class sdk support for Python, TypeScript, and Go with built-in header inspection and fallback helpers.
savings analytics
per-key hit rate, savings breakdowns by model and prompt class, and projected monthly savings curves.
TODO
benchmark evaluation script

a reproducible script that runs a real prompt corpus through fgy and graphs hit rates, latency distributions, and cost savings against baseline direct inference. will be linked here once ready.

round 0

we're building this in the open.

fgy is at round zero. if you're spending meaningful money on LLM inference and want to be involved early — as a design partner, pilot user, or in an investment conversation — we want to hear from you. we're looking for teams where the math on caching is obvious and who want to shape what the product becomes.

hello@fgy.ai