TokenCut + LlamaIndex
Compress retrieved RAG context before it hits your LLM.
Installation
pip install agentready-sdk llama-index

Basic Usage
import os

from llama_index.core import VectorStoreIndex
from agentready.integrations.llamaindex import TokenCutPostprocessor

# Create the postprocessor (reads your API key from the environment)
postprocessor = TokenCutPostprocessor(api_key=os.environ["AGENTREADY_API_KEY"])
# Add to your query engine
index = VectorStoreIndex.from_documents(documents)
query_engine = index.as_query_engine(
node_postprocessors=[postprocessor]
)
# Context is compressed before LLM call
response = query_engine.query("What is the system architecture?")
# Check savings
print(postprocessor.stats)
# {'total_tokens_saved': 5847, 'total_savings_usd': 0.1754}

How It Works
1. Query → Your question goes to the vector store
2. Retrieve → The top-k nodes are returned
3. Compress → TokenCut compresses each node's text ← this step
4. Generate → The compressed context goes to the LLM
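The retrieve → compress → generate flow can be illustrated with a minimal sketch. Note that `toy_compress` below is a hypothetical stand-in, not the TokenCut API: a real postprocessor sends each node's text to the AgentReady service instead.

```python
# Toy sketch of the retrieve -> compress -> generate pipeline.
# toy_compress is a hypothetical stand-in for TokenCut; the real
# postprocessor calls the AgentReady API to compress node text.

def toy_compress(text: str, min_length: int = 200) -> str:
    """Drop filler words as a crude proxy for compression."""
    if len(text) < min_length:  # skip short nodes, as min_length does
        return text
    filler = {"basically", "essentially", "in", "order", "to"}
    words = [w for w in text.split() if w.lower() not in filler]
    return " ".join(words)

# Step 2 (Retrieve): pretend the vector store returned these nodes
nodes = ["short note", "In order to deploy, basically run the build " * 10]

# Step 3 (Compress): each node's text is compressed before the LLM call
compressed = [toy_compress(n) for n in nodes]

# Step 4 (Generate): the compressed context would now go to the LLM
assert len(" ".join(compressed)) <= len(" ".join(nodes))
```

The short first node passes through untouched, mirroring how `min_length` skips nodes that are too small to be worth a compression round trip.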
Configuration
postprocessor = TokenCutPostprocessor(
api_key=os.environ["AGENTREADY_API_KEY"],
level="medium", # light, medium, aggressive
preserve_code=True, # keep code blocks intact
min_length=200, # skip short nodes (chars)
)

Why Compress RAG Context?
- Cost reduction: Retrieved context is often the largest part of the prompt
- More context: Fit more retrieved chunks within the context window
- Better responses: Denser context means less noise for the LLM
- Lower latency: Fewer input tokens mean faster LLM inference
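The savings figure in `stats` above is easy to sanity-check as tokens saved times the per-token price. The $0.03 per 1K input tokens used below is an assumed rate chosen to match the example numbers, not a quoted model price.

```python
# Sanity-check the example stats: tokens saved x per-token input price.
# $0.03 per 1K input tokens is an assumed rate, not a quoted one.
tokens_saved = 5847
price_per_1k_usd = 0.03
savings_usd = tokens_saved / 1000 * price_per_1k_usd
print(round(savings_usd, 4))  # → 0.1754
```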