
TokenCut + LlamaIndex

Compress retrieved RAG context before it hits your LLM.

Installation

pip install agentready-sdk llama-index

Basic Usage

from llama_index.core import VectorStoreIndex
from agentready.integrations.llamaindex import TokenCutPostprocessor

# Create the postprocessor (use your API key from env)
import os
postprocessor = TokenCutPostprocessor(api_key=os.environ["AGENTREADY_API_KEY"])

# Add to your query engine
index = VectorStoreIndex.from_documents(documents)
query_engine = index.as_query_engine(
    node_postprocessors=[postprocessor]
)

# Context is compressed before LLM call
response = query_engine.query("What is the system architecture?")

# Check savings
print(postprocessor.stats)
# {'total_tokens_saved': 5847, 'total_savings_usd': 0.1754}
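The dollar figure in `stats` is derived from the saved token count and an input-token price. As a minimal sketch, the conversion looks like the following; the `$30 per million tokens` rate used here is an assumption for illustration, not TokenCut's actual pricing, and `savings_usd` is a hypothetical helper, not part of the SDK.

```python
def savings_usd(tokens_saved: int, price_per_million_usd: float) -> float:
    """Convert saved input tokens into dollars at a given per-million price.

    NOTE: the price is an assumption for this example; check your model's
    actual input-token pricing.
    """
    return tokens_saved * price_per_million_usd / 1_000_000

# e.g. 5,847 tokens saved at an assumed $30 per million input tokens
print(round(savings_usd(5_847, 30.0), 4))  # → 0.1754
```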

How It Works

  1. Query → Your question goes to the vector store
  2. Retrieve → Top-k nodes are returned
  3. Compress → TokenCut compresses each node's text ← this step
  4. Generate → Compressed context goes to LLM
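The steps above can be sketched in plain Python. Everything here — `retrieve`, `compress`, `generate`, and the sample corpus — is a hypothetical stand-in to show where compression sits in the pipeline, not the SDK's or LlamaIndex's internals.

```python
def retrieve(query: str, top_k: int = 3) -> list[str]:
    # Stand-in for a vector-store lookup returning top-k node texts.
    corpus = [
        "The system is a three-tier architecture: API, queue, workers.",
        "Deploys run through CI; images are pushed to a registry.",
        "Logs are shipped to a central store; retention is 90 days.",
    ]
    return corpus[:top_k]

def compress(text: str) -> str:
    # Stand-in for TokenCut: here we crudely trim trailing detail.
    return text.split(";")[0].strip()

def generate(context: list[str], query: str) -> str:
    # Stand-in for the LLM call; returns the prompt it would send.
    return f"Context: {' '.join(context)}\nQuestion: {query}"

query = "What is the system architecture?"
nodes = retrieve(query)                    # step 2: retrieve
compressed = [compress(n) for n in nodes]  # step 3: compress ← this step
prompt = generate(compressed, query)       # step 4: generate
```

The key point is that compression happens per node, after retrieval and before the prompt is assembled, so the vector store and the LLM never need to know it occurred.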

Configuration

postprocessor = TokenCutPostprocessor(
    api_key=os.environ["AGENTREADY_API_KEY"],
    level="medium",       # light, medium, aggressive
    preserve_code=True,   # keep code blocks intact
    min_length=200,       # skip short nodes (chars)
)
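To make the options concrete, here is a simplified sketch of how `min_length` and `preserve_code` might gate compression per node. This is an assumption about behavior, not the SDK's implementation — in particular, the real `preserve_code` option keeps code blocks intact while still compressing surrounding prose, whereas this sketch just skips any node that contains a fence.

```python
FENCE = "`" * 3  # markdown code-fence marker

def should_compress(text: str, min_length: int = 200,
                    preserve_code: bool = True) -> bool:
    """Hypothetical gate: decide whether a node's text gets compressed."""
    if len(text) < min_length:
        return False  # skip short nodes (chars) — not worth the API call
    if preserve_code and FENCE in text:
        return False  # simplification: skip the whole node, don't split it
    return True
```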

Why Compress RAG Context?

  • Cost reduction: RAG context is often the largest part of the prompt
  • More context: Fit more retrieved chunks within the context window
  • Better responses: Dense context = less noise for the LLM
  • Faster: Fewer tokens = faster LLM inference