documentation

integration and system reference.

covers quickstart, credentials, cache behavior, response headers, fallback options, and the internals behind the three cache paths.

this is v1. some behavior and specs described here will change. things that are clearly marked as stable are stable. everything else is subject to revision.

integration

quickstart

fgy exposes an openai-compatible endpoint. any client already configured for the openai api requires two changes: the base_url and the api_key. the request body, model names, and response shape are unchanged. your provider key travels separately — see the credentials section below.

python
from openai import OpenAI
import os
 
client = OpenAI(
    api_key="fgy_tenant_key",         # your fgy cache key
    base_url="https://api.fgy.ai/v1",
    default_headers={
        "X-Provider-Auth": f"Bearer {os.environ['OPENAI_API_KEY']}",  # forwarded on miss only
    },
)
 
response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": "..."}],
)
typescript
import OpenAI from "openai";
 
const client = new OpenAI({
  apiKey: "fgy_tenant_key",
  baseURL: "https://api.fgy.ai/v1",
  defaultHeaders: {
    "X-Provider-Auth": `Bearer ${process.env.OPENAI_API_KEY}`,
  },
});

credentials

two keys, two jobs.

fgy operates on two separate credentials that serve completely different purposes. conflating them is the most common source of integration confusion.

fgy cache key (fgy_...)

issued by fgy when you create an account. identifies your tenant for cache namespace isolation and billing. sent as the Authorization: Bearer header to the fgy endpoint.

stored by fgy: yes (hashed, for auth)

provider key (sk-...)

your openai (or other provider) key. sent per-request in the X-Provider-Auth header. on a cache hit, it is ignored. on a miss, fgy forwards it upstream to fulfill the request.

stored by fgy: never

this design means fgy cannot make upstream calls without an active request from you. it also means rotating your provider key requires no changes to fgy — just update the header value on your side.


observability

response headers.

every response from fgy appends headers indicating what happened at the cache layer. these are your primary observability surface during local testing and in production.

x-fgy-cache: exact
  ets shard returned a match. no provider call was made. no provider cost.
x-fgy-cache: semantic
  pgvector found a match at or above the 0.92 cosine similarity threshold. no provider call was made.
x-fgy-cache: miss
  no match. your provider key was forwarded upstream. the response is stored for future hits. fgy charges $0 for the miss.
x-fgy-tokens-saved: integer
  total prompt and completion tokens avoided. 0 on a miss. used to compute your billing amount.
reading headers in python
raw = client.chat.completions.with_raw_response.create(...)

# headers live on the raw response; .parse() yields the usual completion object
headers = raw.headers
response = raw.parse()

cache_result = headers.get("x-fgy-cache")        # "exact" | "semantic" | "miss"
tokens_saved = headers.get("x-fgy-tokens-saved") # "0" | "n"

design decisions

how much should you depend on the cache?

before integrating, it is worth thinking through how your application handles the cache at a strategic level. the two main questions are whether cache savings are a nice-to-have or a hard requirement, and whether fgy should ever be on your critical uptime path.

option a

cache savings are a bottom-line requirement

if your inference costs are high enough that cache hit rates directly affect profitability, route all traffic through fgy without a fallback. this maximizes cache coverage — every miss is stored and every future matching request benefits. the tradeoff is that fgy is on your critical path. if fgy is down, your inference is down.

appropriate for use cases with high prompt repetition (support bots, search suggestions, document Q&A) where the same or semantically similar prompts are common enough to justify the dependency.

option b

cache savings are opportunistic

if uptime is more important than guaranteed savings, configure a fallback to your provider directly when fgy is unreachable. your application keeps running. the cost is that fallback requests never populate the cache — a fgy outage during a traffic spike means you miss the window to seed high-value entries.

appropriate for experimental integrations or systems where no third-party can be on the critical path under any circumstance.


design decisions

fallback routing.

if you go with option b above, the implementation is straightforward. maintain two client instances and catch connection or timeout errors from the fgy client.

python — soft fallback
from openai import OpenAI, APIConnectionError, APITimeoutError
 
fgy = OpenAI(
    api_key="fgy_...",
    base_url="https://api.fgy.ai/v1",
    default_headers={"X-Provider-Auth": "Bearer sk-..."},
)
direct = OpenAI(api_key="sk-...")
 
def complete(**kwargs):
    try:
        return fgy.chat.completions.create(**kwargs, timeout=4.0)
    except (APIConnectionError, APITimeoutError):
        # fgy unreachable — go direct, no cache seeding
        return direct.chat.completions.create(**kwargs)

requests that take the fallback path bypass the cache entirely. they are not stored by fgy and produce no future hits. if you experience fgy downtime during high-traffic periods, you lose the seeding opportunity for those prompts.


cache internals

ets exact cache.

the first lookup stage uses erlang term storage (ets) partitioned into 16 shards, each owned by a genserver. shards are configured with read_concurrency: true and write_concurrency: true, allowing concurrent readers without locking.

requests are normalized to a stable representation covering model name, sorted message content, and relevant sampling parameters, then hashed. the hash routes to a shard via :erlang.phash2(key, 16). a match is validated against the stored ttl and returned. hits bump the expiry on read, so active prompts stay warm.

shards: 16 (keyed by phash2)
default ttl: 3600 s (configurable per deploy)
eviction sweep: every 5 min (per-shard interval)
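the normalize-hash-route step can be sketched in python. `cache_key` and `shard_for` are illustrative names, and sha256 stands in for whatever hash fgy actually uses on the normalized form (only `:erlang.phash2` for shard routing is confirmed above):

```python
import hashlib
import json

def cache_key(model: str, messages: list, params: dict) -> str:
    # stable representation of model, message content, and sampling
    # parameters: sorted keys and fixed separators so two equivalent
    # requests serialize, and therefore hash, identically
    canonical = json.dumps(
        {"model": model, "messages": messages, "params": params},
        sort_keys=True,
        separators=(",", ":"),
    )
    return hashlib.sha256(canonical.encode()).hexdigest()

def shard_for(key: str, num_shards: int = 16) -> int:
    # stands in for :erlang.phash2(key, 16); any stable hash mod 16
    # gives the same property: a key always routes to the same shard
    return int(key[:8], 16) % num_shards
```

because routing is deterministic, each shard's genserver can own ttl bookkeeping for its keys without coordinating with the other fifteen.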

cache internals

semantic store.

on an ets miss, the prompt text is embedded and its float vector is compared against stored embeddings for the same tenant and model using pgvector's <=> cosine distance operator.

the query returns the nearest neighbour. if the cosine distance is at or below 0.08 (equivalent to similarity ≥ 0.92), the stored response is returned without touching the provider. this threshold is a compile-time config and can be adjusted per enterprise deployment.
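pgvector's <=> operator returns cosine distance, which is 1 minus cosine similarity, so the 0.08 and 0.92 figures are the same threshold seen from both sides. a minimal pure-python check (function names are illustrative):

```python
import math

THRESHOLD = 0.08  # cosine distance; 1 - 0.08 = 0.92 similarity

def cosine_distance(a, b):
    # 1 - (a . b) / (|a| |b|), matching pgvector's <=> operator
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norm

def is_semantic_hit(query_vec, stored_vec):
    # the nearest neighbour counts as a hit only when its distance
    # is at or below the threshold
    return cosine_distance(query_vec, stored_vec) <= THRESHOLD
```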

hit counts are incremented asynchronously via Task.start/1. the database write for incrementing is never on the critical response path.

elixir — semantic lookup query
from(e in Entry,
  where: e.tenant_id == ^tenant_id and e.model == ^model,
  order_by: fragment("embedding <=> ?::vector", ^embedding),
  limit: 1,
  select: %{
    entry: e,
    distance: fragment("embedding <=> ?::vector", ^embedding)
  }
)
 
# distance <= 0.08 means similarity >= 0.92 — serve cached response

cache internals

request coalescing.

when both the ets and pgvector checks miss, the request goes upstream using your forwarded provider key. if multiple callers arrive with identical keys before the first upstream call completes, only one executes. the rest register as waiters inside the Fgy.Cache.Coalescer genserver.

this is not a debounce or a queue. it is a beam message-passing pattern: the first pid to register for a key gets :execute. subsequent pids get :wait and block on a receive. when the executing pid completes, the coalescer sends the result to every registered waiter at once. your provider sees one request regardless of how many clients triggered it.
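the same single-flight guarantee can be sketched with python threads. `Coalescer` here is an illustrative stand-in, not the beam implementation (Fgy.Cache.Coalescer is a genserver, not a lock), but the invariant is identical: one upstream call per key, however many callers arrive:

```python
import threading

class Coalescer:
    """single-flight sketch: first caller for a key executes, the rest wait."""

    def __init__(self):
        self._lock = threading.Lock()
        self._inflight = {}  # key -> (done_event, result_holder)

    def fetch(self, key, upstream):
        with self._lock:
            entry = self._inflight.get(key)
            if entry is None:
                # first caller: register and take the executor role
                entry = (threading.Event(), {})
                self._inflight[key] = entry
                leader = True
            else:
                leader = False
        done, holder = entry
        if leader:
            try:
                holder["value"] = upstream()  # the one real upstream call
            finally:
                with self._lock:
                    del self._inflight[key]
                done.set()  # release all waiters at once
            return holder["value"]
        done.wait()  # waiter: block until the executor finishes
        return holder["value"]  # error propagation to waiters elided
```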

100 concurrent identical requests:
  pid 1: executes upstream call
  pid 2: registers as waiter
  pid 3: registers as waiter
  ... (97 more waiters)
provider receives 1 request. all 100 pids receive the result via broadcast.

billing

how billing works.

fgy does not charge a platform fee. it charges a percentage of the provider cost it avoids on your behalf. if a request misses cache, your provider key is forwarded, the provider bills you normally, and fgy charges nothing for the proxy hop.

billing formula
avoided_cost = tokens_saved * provider_rate
fgy_charge = avoided_cost * 0.15
 
# gpt-4o-mini output: $0.60 per 1M tokens = $0.0000006 per token
# 1000 tokens saved on a hit:
avoided_cost = 1000 * 0.0000006 # $0.0006
fgy_charge = 0.0006 * 0.15 # $0.00009
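the formula is simple enough to pin down as a runnable function. `fgy_charge_usd` is an illustrative helper; the 15% share and the gpt-4o-mini rate come from the example above, not an official price sheet:

```python
def fgy_charge_usd(tokens_saved: int, provider_rate_per_token: float,
                   share: float = 0.15) -> float:
    # charge is a fixed share of the provider cost the hit avoided;
    # a miss has tokens_saved == 0 and therefore charges nothing
    avoided_cost = tokens_saved * provider_rate_per_token
    return avoided_cost * share

# gpt-4o-mini output at $0.60 per 1M tokens
RATE = 0.60 / 1_000_000  # $0.0000006 per token
```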
outcome        provider cost   fgy charge     note
exact hit      $0              15% of saved   never touches provider
semantic hit   $0              15% of saved   pgvector matched
miss           normal          $0             key forwarded upstream

get your api key

create an account in the dashboard to get your tenant cache key.