Local Semantic Caching
TokenSense includes a local semantic cache powered bysqlite-vec. This cache intercepts requests before they hit the LLM provider, saving both cost and latency.
By storing prompt embeddings locally, TokenSense can instantly return identical past responses. This drops your latency to ~1ms and brings your API cost for duplicate questions to $0.00.
Usage
How it works
- TokenSense intercepts your prompt.
- It generates an embedding for the prompt and performs a semantic search in your local SQLite vector database.
- If an exact match is found within your defined similarity threshold, the cached response is immediately returned.
- If no match is found, the request proceeds to the provider, and the new response is logged in the cache for future use.
