documentation

integration and system reference.

covers quickstart, credentials, cache behavior, response headers, fallback options, and the internals behind the three cache paths.

this is v1. some behavior and specs described here will change. things that are clearly marked as stable are stable. everything else is subject to revision.

integration

quickstart

fgy exposes an openai-compatible endpoint. any client already configured for the openai api requires two changes: the base_url and the api_key. the request body, model names, and response shape are unchanged. your provider key travels separately — see the credentials section below.

python
from openai import OpenAI
import os
 
client = OpenAI(
    api_key="fgy_tenant_key",         # your fgy cache key
    base_url="https://api.fgy.ai/v1",
    default_headers={
        "X-Provider-Auth": f"Bearer {os.environ['OPENAI_API_KEY']}",  # forwarded on miss only
    },
)
 
response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": "..."}],
)
typescript
import OpenAI from "openai";
 
const client = new OpenAI({
  apiKey: "fgy_tenant_key",
  baseURL: "https://api.fgy.ai/v1",
  defaultHeaders: {
    "X-Provider-Auth": `Bearer ${process.env.OPENAI_API_KEY}`,
  },
});

credentials

two keys, two jobs.

fgy operates on two separate credentials that serve completely different purposes. conflating them is the most common source of integration confusion.

fgy cache key (fgy_...)

issued by fgy when you create an account. identifies your tenant for cache namespace isolation and billing. sent as the Authorization: Bearer header to the fgy endpoint.

stored by fgy: yes (hashed, for auth)

provider key (sk-...)

your openai (or other provider) key. sent per-request in the X-Provider-Auth header. on a cache hit, it is ignored. on a miss, fgy forwards it upstream to fulfill the request.

stored by fgy: never

this design means fgy cannot make upstream calls without an active request from you. it also means rotating your provider key requires no changes to fgy — just update the header value on your side.


observability

response headers.

every response from fgy appends headers indicating what happened at the cache layer. these are your primary observability surface during local testing and in production.

x-fgy-cache: exact
  ets shard returned a match. no provider call was made. no provider cost.
x-fgy-cache: semantic
  pgvector found a match at or above the 0.92 cosine similarity threshold. no provider call was made.
x-fgy-cache: miss
  no match. your provider key was forwarded upstream. the response is stored for future hits. fgy charges $0 for the miss.
x-fgy-tokens-saved: integer
  total prompt and completion tokens avoided. 0 on a miss. used to compute your billing amount.
reading headers in python
raw = client.chat.completions.with_raw_response.create(...)

# headers live on the raw response; .parse() yields the usual completion object
headers = raw.headers
response = raw.parse()

cache_result = headers.get("x-fgy-cache")        # "exact" | "semantic" | "miss"
tokens_saved = headers.get("x-fgy-tokens-saved") # "0" | "n"

design decisions

how much should you depend on the cache?

before integrating, it is worth thinking through how your application handles the cache at a strategic level. the two main questions are whether cache savings are a nice-to-have or a hard requirement, and whether fgy should ever be on your critical uptime path.

option a

cache savings are a bottom-line requirement

if your inference costs are high enough that cache hit rates directly affect profitability, route all traffic through fgy without a fallback. this maximizes cache coverage — every miss is stored and every future matching request benefits. the tradeoff is that fgy is on your critical path. if fgy is down, your inference is down.

appropriate for use cases with high prompt repetition (support bots, search suggestions, document Q&A) where the same or semantically similar prompts are common enough to justify the dependency.

option b

cache savings are opportunistic

if uptime is more important than guaranteed savings, configure a fallback to your provider directly when fgy is unreachable. your application keeps running. the cost is that fallback requests never populate the cache — a fgy outage during a traffic spike means you miss the window to seed high-value entries.

appropriate for experimental integrations or systems where no third-party can be on the critical path under any circumstance.


design decisions

fallback routing.

if you go with option b above, the implementation is straightforward. maintain two client instances and catch connection or timeout errors from the fgy client.

python — soft fallback
from openai import OpenAI, APIConnectionError, APITimeoutError
 
fgy = OpenAI(
    api_key="fgy_...",
    base_url="https://api.fgy.ai/v1",
    default_headers={"X-Provider-Auth": "Bearer sk-..."},
)
direct = OpenAI(api_key="sk-...")
 
def complete(**kwargs):
    try:
        return fgy.chat.completions.create(**kwargs, timeout=4.0)
    except (APIConnectionError, APITimeoutError):
        # fgy unreachable — go direct, no cache seeding
        return direct.chat.completions.create(**kwargs)

requests that take the fallback path bypass the cache entirely. they are not stored by fgy and produce no future hits. if you experience fgy downtime during high-traffic periods, you lose the seeding opportunity for those prompts.


cache internals

ets exact cache.

the first lookup stage uses erlang term storage (ets) partitioned into 16 shards, each owned by a genserver. shards are configured with read_concurrency: true and write_concurrency: true, allowing concurrent readers without locking.

requests are normalized to a stable representation covering model name, sorted message content, and relevant sampling parameters, then hashed. the hash routes to a shard via :erlang.phash2(key, 16). a match is validated against the stored ttl and returned. hits bump the expiry on read, so active prompts stay warm.

shards: 16 (keyed by phash2)
default ttl: 3600 s (configurable per deploy)
eviction sweep: every 5 min (per-shard interval)
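the normalize-hash-route step can be sketched in python. `cache_key` and `shard_for` are illustrative names, and sha256 stands in for whatever hash fgy actually uses on the normalized form (only `:erlang.phash2` for shard routing is confirmed above):

```python
import hashlib
import json

def cache_key(model: str, messages: list, params: dict) -> str:
    # stable representation of model, message content, and sampling
    # parameters: sorted keys and fixed separators so two equivalent
    # requests serialize, and therefore hash, identically
    canonical = json.dumps(
        {"model": model, "messages": messages, "params": params},
        sort_keys=True,
        separators=(",", ":"),
    )
    return hashlib.sha256(canonical.encode()).hexdigest()

def shard_for(key: str, num_shards: int = 16) -> int:
    # stands in for :erlang.phash2(key, 16); any stable hash mod 16
    # gives the same property: a key always routes to the same shard
    return int(key[:8], 16) % num_shards
```

because routing is deterministic, each shard's genserver can own ttl bookkeeping for its keys without coordinating with the other fifteen.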

cache internals

semantic store.

on an ets miss, the prompt text is embedded and its float vector is compared against stored embeddings for the same tenant and model using pgvector's <=> cosine distance operator.

the query returns the nearest neighbour. if the cosine distance is at or below 0.08 (equivalent to similarity ≥ 0.92), the stored response is returned without touching the provider. this threshold is a compile-time config and can be adjusted per enterprise deployment.
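pgvector's <=> operator returns cosine distance, which is 1 minus cosine similarity, so the 0.08 and 0.92 figures are the same threshold seen from both sides. a minimal pure-python check (function names are illustrative):

```python
import math

THRESHOLD = 0.08  # cosine distance; 1 - 0.08 = 0.92 similarity

def cosine_distance(a, b):
    # 1 - (a . b) / (|a| |b|), matching pgvector's <=> operator
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norm

def is_semantic_hit(query_vec, stored_vec):
    # the nearest neighbour counts as a hit only when its distance
    # is at or below the threshold
    return cosine_distance(query_vec, stored_vec) <= THRESHOLD
```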

hit counts are incremented asynchronously via Task.start/1. the database write for incrementing is never on the critical response path.

elixir — semantic lookup query
from(e in Entry,
  where: e.tenant_id == ^tenant_id and e.model == ^model,
  order_by: fragment("embedding <=> ?::vector", ^embedding),
  limit: 1,
  select: %{
    entry: e,
    distance: fragment("embedding <=> ?::vector", ^embedding)
  }
)
 
# distance <= 0.08 means similarity >= 0.92 — serve cached response

cache internals

request coalescing.

when both the ets and pgvector checks miss, the request goes upstream using your forwarded provider key. if multiple callers arrive with identical keys before the first upstream call completes, only one executes. the rest register as waiters inside the Fgy.Cache.Coalescer genserver.

this is not a debounce or a queue. it is a beam message-passing pattern: the first pid to register for a key gets :execute. subsequent pids get :wait and block on a receive. when the executing pid completes, the coalescer sends the result to every registered waiter at once. your provider sees one request regardless of how many clients triggered it.
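the same single-flight guarantee can be sketched with python threads. `Coalescer` here is an illustrative stand-in, not the beam implementation (Fgy.Cache.Coalescer is a genserver, not a lock), but the invariant is identical: one upstream call per key, however many callers arrive:

```python
import threading

class Coalescer:
    """single-flight sketch: first caller for a key executes, the rest wait."""

    def __init__(self):
        self._lock = threading.Lock()
        self._inflight = {}  # key -> (done_event, result_holder)

    def fetch(self, key, upstream):
        with self._lock:
            entry = self._inflight.get(key)
            if entry is None:
                # first caller: register and take the executor role
                entry = (threading.Event(), {})
                self._inflight[key] = entry
                leader = True
            else:
                leader = False
        done, holder = entry
        if leader:
            try:
                holder["value"] = upstream()  # the one real upstream call
            finally:
                with self._lock:
                    del self._inflight[key]
                done.set()  # release all waiters at once
            return holder["value"]
        done.wait()  # waiter: block until the executor finishes
        return holder["value"]  # error propagation to waiters elided
```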

100 concurrent identical requests:
  pid 1: executes upstream call
  pid 2: registers as waiter
  pid 3: registers as waiter
  ... (97 more waiters)
provider receives 1 request. all 100 pids receive the result via broadcast.

billing

how billing works.

fgy does not charge a platform fee. it charges a percentage of the provider cost it avoids on your behalf. if a request misses cache, your provider key is forwarded, the provider bills you normally, and fgy charges nothing for the proxy hop.

billing formula
avoided_cost = tokens_saved * provider_rate
fgy_charge = avoided_cost * 0.15
 
# gpt-4o-mini output: $0.60 per 1M tokens = $0.0000006 per token
# 1000 tokens saved on a hit:
avoided_cost = 1000 * 0.0000006 # $0.0006
fgy_charge = 0.0006 * 0.15 # $0.00009
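the formula is simple enough to pin down as a runnable function. `fgy_charge_usd` is an illustrative helper; the 15% share and the gpt-4o-mini rate come from the example above, not an official price sheet:

```python
def fgy_charge_usd(tokens_saved: int, provider_rate_per_token: float,
                   share: float = 0.15) -> float:
    # charge is a fixed share of the provider cost the hit avoided;
    # a miss has tokens_saved == 0 and therefore charges nothing
    avoided_cost = tokens_saved * provider_rate_per_token
    return avoided_cost * share

# gpt-4o-mini output at $0.60 per 1M tokens
RATE = 0.60 / 1_000_000  # $0.0000006 per token
```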
outcome        provider cost   fgy charge     note
exact hit      $0              15% of saved   never touches provider
semantic hit   $0              15% of saved   pgvector matched
miss           normal          $0             key forwarded upstream

get your api key

create an account in the dashboard to get your tenant cache key.