fgy documentation
integration and system reference.
covers quickstart, credentials, cache behavior, response headers, fallback options, and the internals behind the three cache paths.
this is v1. some behavior and specs described here will change. things that are clearly marked as stable are stable. everything else is subject to revision.
integration
quickstart
fgy exposes an openai-compatible endpoint. any client already configured for the openai api requires two changes: the base_url and the api_key. the request body, model names, and response shape are unchanged. your provider key travels separately — see the credentials section below.
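a minimal sketch of the request shape using only the python standard library — the endpoint url and both key values below are placeholders, not real fgy values:

```python
import json
import urllib.request

# placeholder base_url for illustration -- substitute the real fgy endpoint
FGY_BASE_URL = "https://api.fgy.example/v1"

req = urllib.request.Request(
    url=f"{FGY_BASE_URL}/chat/completions",
    data=json.dumps({
        "model": "gpt-4o-mini",
        "messages": [{"role": "user", "content": "hello"}],
    }).encode("utf-8"),
    headers={
        "Authorization": "Bearer fgy_tenant_key",  # fgy key: tenant identity + billing
        "X-Provider-Auth": "sk-provider-key",      # provider key: forwarded only on a miss
        "Content-Type": "application/json",
    },
    method="POST",
)
# the request is built but intentionally not sent in this sketch
```

note that the body and model name are exactly what you would send to the provider directly; only the url and the two key headers differ.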
credentials
two keys, two jobs.
fgy operates on two separate credentials that serve completely different purposes. conflating them is the most common source of integration confusion.
the fgy key: issued by fgy when you create an account. identifies your tenant for cache namespace isolation and billing. sent as the Authorization: Bearer header to the fgy endpoint.
the provider key: your openai (or other provider) key. sent per-request in the X-Provider-Auth header. on a cache hit, it is ignored. on a miss, fgy forwards it upstream to fulfill the request.
this design means fgy cannot make upstream calls without an active request from you. it also means rotating your provider key requires no changes to fgy — just update the header value on your side.
observability
response headers.
every response from fgy appends headers indicating what happened at the cache layer. these are your primary observability surface during local testing and in production.
one header reports the provider cost avoided: nonzero on a hit, 0 on a miss. it is used to compute your billing amount.
design decisions
how much should you depend on the cache?
before integrating, it is worth thinking through how your application handles the cache at a strategic level. the two main questions are whether cache savings are a nice-to-have or a hard requirement, and whether fgy should ever be on your critical uptime path.
option a: cache savings are a bottom-line requirement
if your inference costs are high enough that cache hit rates directly affect profitability, route all traffic through fgy without a fallback. this maximizes cache coverage — every miss is stored and every future matching request benefits. the tradeoff is that fgy is on your critical path. if fgy is down, your inference is down.
appropriate for use cases with high prompt repetition (support bots, search suggestions, document Q&A) where the same or semantically similar prompts are common enough to justify the dependency.
option b: cache savings are opportunistic
if uptime is more important than guaranteed savings, configure a fallback to your provider directly when fgy is unreachable. your application keeps running. the cost is that fallback requests never populate the cache — a fgy outage during a traffic spike means you miss the window to seed high-value entries.
appropriate for experimental integrations or systems where no third-party can be on the critical path under any circumstance.
design decisions
fallback routing.
if you go with option b above, the implementation is straightforward. maintain two client instances and catch connection or timeout errors from the fgy client.
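a minimal sketch of that pattern — the client callables and the error type here are placeholders, not a real fgy sdk:

```python
class FgyUnreachable(Exception):
    """placeholder for the connection errors your http client raises."""


def complete(prompt, fgy_call, provider_call):
    # try the cache-aware path first; if fgy is unreachable or times out,
    # fall back to calling the provider directly
    try:
        return fgy_call(prompt)
    except (FgyUnreachable, TimeoutError):
        return provider_call(prompt)
```

in a real integration, fgy_call and provider_call would be two configured client instances pointed at the fgy endpoint and the provider endpoint respectively.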
requests that take the fallback path bypass the cache entirely. they are not stored by fgy and produce no future hits. if you experience fgy downtime during high-traffic periods, you lose the seeding opportunity for those prompts.
cache internals
ets exact cache.
the first lookup stage uses erlang term storage (ets) partitioned into 16 shards, each owned by a genserver. shards are configured with read_concurrency: true and write_concurrency: true, allowing concurrent readers without locking.
requests are normalized to a stable representation covering model name, sorted message content, and relevant sampling parameters, then hashed. the hash routes to a shard via :erlang.phash2(key, 16). a match is validated against the stored ttl and returned. hits bump the expiry on read, so active prompts stay warm.
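the normalize-hash-route flow can be illustrated in python. the field set and sha256 hash below are stand-ins for fgy's internal representation, and the modulo shard pick stands in for :erlang.phash2(key, 16):

```python
import hashlib
import json

NUM_SHARDS = 16


def cache_key(model: str, messages: list, params: dict) -> str:
    # stable representation: model, message content, sorted sampling params.
    # sorting makes the key order-insensitive, so equivalent requests collide.
    canonical = json.dumps(
        {"model": model, "messages": messages, "params": sorted(params.items())},
        sort_keys=True,
        separators=(",", ":"),
    )
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def shard_for(key: str) -> int:
    # stand-in for :erlang.phash2(key, 16): any stable hash mod 16 works
    return int(key[:8], 16) % NUM_SHARDS
```

two requests that differ only in parameter ordering produce the same key and therefore route to the same shard.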
cache internals
semantic store.
on an ets miss, the prompt text is embedded and its float vector is compared against stored embeddings for the same tenant and model using pgvector's <=> cosine distance operator.
the query returns the nearest neighbour. if the cosine distance is at or below 0.08 (equivalent to similarity ≥ 0.92), the stored response is returned without touching the provider. this threshold is a compile-time config and can be adjusted per enterprise deployment.
hit counts are incremented asynchronously via Task.start/1. the database write for incrementing is never on the critical response path.
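the hit test itself is simple. a pure-python sketch of the threshold check, with a hand-rolled cosine in place of pgvector's <=> operator:

```python
import math

SEMANTIC_THRESHOLD = 0.08  # distance <= 0.08 is equivalent to similarity >= 0.92


def cosine_distance(a, b):
    # 1 - cosine similarity, matching pgvector's <=> operator semantics
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norm


def is_semantic_hit(query_vec, stored_vec):
    return cosine_distance(query_vec, stored_vec) <= SEMANTIC_THRESHOLD
```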
cache internals
request coalescing.
when both the ets and pgvector checks miss, the request goes upstream using your forwarded provider key. if multiple callers arrive with identical keys before the first upstream call completes, only one executes. the rest register as waiters inside the Fgy.Cache.Coalescer genserver.
this is not a debounce or a queue. it is a beam message-passing pattern: the first pid to register for a key gets :execute. subsequent pids get :wait and block on a receive. when the executing pid completes, GenServer.cast sends the result to all waiters simultaneously. your provider sees one request regardless of how many clients triggered it.
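the same single-flight behavior can be approximated outside the beam with one event per key. this python sketch mirrors the execute/wait split but is an illustration, not fgy's actual implementation:

```python
import threading


class Coalescer:
    """single-flight sketch: the first caller for a key executes the upstream
    call; later callers with the same key wait and share the result
    (analogous to the :execute / :wait replies in the genserver)."""

    def __init__(self):
        self._lock = threading.Lock()
        self._inflight = {}  # key -> threading.Event
        self._results = {}

    def fetch(self, key, upstream):
        with self._lock:
            event = self._inflight.get(key)
            if event is None:          # first caller in: gets "execute"
                event = threading.Event()
                self._inflight[key] = event
                is_leader = True
            else:                      # later callers: get "wait"
                is_leader = False
        if is_leader:
            self._results[key] = upstream()  # exactly one upstream call
            with self._lock:
                del self._inflight[key]
            event.set()                      # wake every waiter at once
            return self._results[key]
        event.wait()
        return self._results[key]
```

the broadcast on event.set() plays the role of the GenServer.cast fan-out: all waiters observe the single result simultaneously.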
billing
how billing works.
fgy does not charge a platform fee. it charges a percentage of the provider cost it avoids on your behalf. if a request misses cache, your provider key is forwarded, the provider bills you normally, and fgy charges nothing for the proxy hop.
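as a worked example — the 20% share below is a made-up rate purely for the arithmetic; the actual percentage comes from your fgy plan:

```python
def fgy_charge(avoided_cost_usd: float, share: float = 0.20) -> float:
    # hypothetical rate: fgy bills a fraction of the provider cost it avoided.
    # a miss avoids nothing, so the fgy charge for it is zero.
    return avoided_cost_usd * share
```

under that assumed rate, a cache hit that would have cost $0.10 upstream is billed at $0.02, netting you $0.08 of savings; a miss is billed by the provider as usual and fgy charges nothing.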
get your api key
create an account in the dashboard to get your tenant cache key.