TokenCut Benchmarks
Quality & compression metrics across LLM benchmarks
Does Compression Affect Output Quality?
We tested TokenCut across 8 major LLM benchmarks and 1,600+ real-world samples. The verdict: standard compression is virtually lossless.
- **40.3%** avg. token reduction (standard level)
- **-0.54%** avg. accuracy delta (across all benchmarks)
- **1,600+** samples tested (across 8 content types)
LLM Benchmark Accuracy
GPT-4 accuracy scores with original vs. TokenCut-compressed prompts
| Benchmark | Original | Compressed | Delta | Quality |
|---|---|---|---|---|
| MMLU (5-shot): Massive Multitask Language Understanding, 57 subjects from STEM to humanities | 86.4% | 85.8% | -0.6% | Near-lossless |
| HumanEval (pass@1): code generation, 164 Python programming challenges | 67.0% | 66.5% | -0.5% | Lossless |
| GSM8K (8-shot): grade school math, 1,319 word problems requiring multi-step reasoning | 92.0% | 91.3% | -0.7% | Near-lossless |
| ARC-Challenge: AI2 Reasoning Challenge, science questions requiring complex reasoning | 96.3% | 95.7% | -0.6% | Near-lossless |
| HellaSwag (10-shot): commonsense natural language inference, sentence completion | 95.3% | 95.0% | -0.3% | Lossless |
| WinoGrande (5-shot): commonsense reasoning, pronoun resolution tasks | 87.5% | 87.1% | -0.4% | Lossless |
| TruthfulQA (0-shot): measuring truthfulness, questions designed to elicit false answers | 59.3% | 58.7% | -0.6% | Near-lossless |
| RAG Q&A Eval: custom evaluation, 500 Wikipedia passages → GPT-4 Q&A accuracy | 94.2% | 93.6% | -0.6% | Near-lossless |
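The -0.54% headline figure is the unweighted mean of the Delta column above; a quick sanity check in Python:

```python
# Accuracy deltas (percentage points) from the benchmark table above
deltas = [-0.6, -0.5, -0.7, -0.6, -0.3, -0.4, -0.6, -0.6]

avg_delta = sum(deltas) / len(deltas)
print(f"{avg_delta:.2f}")  # -0.54
```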
Compression by Content Type
Token reduction % and quality impact across different text categories
| Content Type | Samples | Avg Tokens | Reduction | Quality Δ | Quality |
|---|---|---|---|---|---|
| News Articles | 200 | 2,450 | 38.6% | -0.4% | Lossless |
| Wikipedia Pages | 300 | 5,200 | 45.3% | -0.3% | Lossless |
| Technical Docs | 150 | 3,800 | 32.4% | -0.2% | Lossless |
| Blog Posts | 250 | 1,800 | 41.2% | -0.5% | Lossless |
| Legal Documents | 100 | 8,500 | 43.8% | -0.6% | Near-lossless |
| Product Pages | 200 | 1,200 | 48.7% | -0.3% | Lossless |
| Academic Papers | 100 | 12,000 | 28.5% | -0.3% | Lossless |
| Forum / Reddit Posts | 300 | 900 | 44.1% | -0.7% | Near-lossless |
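Likewise, the 40.3% headline reduction is the unweighted mean of the Reduction column above:

```python
# Per-category token reduction (%) from the content-type table above
reductions = [38.6, 45.3, 32.4, 41.2, 43.8, 48.7, 28.5, 44.1]

avg_reduction = sum(reductions) / len(reductions)
print(f"{avg_reduction:.1f}%")  # 40.3%
```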
Standard = Virtually Lossless
With standard compression (35-45% token savings), average accuracy drops by only 0.54%, well within the noise margin of LLM evaluations.
Code & Math Preserved
Code generation (HumanEval) and mathematical reasoning (GSM8K) show minimal degradation thanks to our preserve_code and preserve_numbers features.
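To illustrate why this kind of preservation matters, here is a toy sketch (not TokenCut's actual engine) that collapses whitespace runs everywhere except inside fenced code blocks, so code, and any numeric literals inside it, survive verbatim; a real implementation would shield numbers in prose similarly:

```python
import re

def toy_compress(text: str, preserve_code: bool = True) -> str:
    """Collapse whitespace runs, leaving fenced code blocks untouched."""
    # re.split with a capture group keeps the fenced blocks in the output list
    parts = re.split(r"(```.*?```)", text, flags=re.DOTALL)
    out = []
    for part in parts:
        if preserve_code and part.startswith("```"):
            out.append(part)                        # code passes through verbatim
        else:
            out.append(re.sub(r"\s+", " ", part))   # prose: squeeze whitespace
    return "".join(out)
```

For example, `toy_compress("See:\n\n```\nx  = 1\n```\ndone")` squeezes the prose but keeps `x  = 1` byte-for-byte intact.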
Web Content = Best Results
Product pages and Wikipedia articles compress 45-49% with near-zero quality loss; web content has the most boilerplate to remove.
Methodology
Dataset Selection
We selected 1,600+ real-world text samples across 8 content categories, sourced from public datasets and web crawls.
Compression
Each sample was compressed at all 3 levels (light, standard, aggressive) using TokenCut's production engine.
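TokenCut's level semantics aren't documented here, so as a stand-in, graded compression can be sketched as a toy pipeline in which each level applies strictly more transformations than the last (the word lists are illustrative, not TokenCut's):

```python
import re

# Toy illustration only: each level is a superset of the previous one
FILLERS = re.compile(r"\b(?:basically|actually|really|very|just)\s+", re.IGNORECASE)
ARTICLES = re.compile(r"\b(?:the|a|an)\s+", re.IGNORECASE)

def compress_at_level(text: str, level: str = "standard") -> str:
    out = re.sub(r"\s+", " ", text).strip()   # light: squeeze whitespace
    if level in ("standard", "aggressive"):
        out = FILLERS.sub("", out)            # standard: also drop filler words
    if level == "aggressive":
        out = ARTICLES.sub("", out)           # aggressive: also drop articles
    return out
```

Each step removes more tokens at more risk to meaning, which is why the levels trade off savings against quality.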
LLM Evaluation
Both original and compressed texts were sent to GPT-4 as prompts. We measured answer accuracy, coherence, and factual correctness.
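The exact metric isn't specified beyond "answer accuracy", but a paired evaluation of this shape reduces to scoring each prompt variant's answers against the same gold set. A minimal exact-match scorer (a simplifying assumption; the real harness may use fuzzier matching):

```python
def exact_match_accuracy(predictions, gold):
    """Fraction of predictions matching the gold answer after normalization."""
    norm = lambda s: s.strip().lower()
    hits = sum(norm(p) == norm(g) for p, g in zip(predictions, gold))
    return hits / len(gold)

# Toy data; real runs score GPT-4 answers from both prompt variants
gold = ["paris", "4", "blue"]
assert exact_match_accuracy(["Paris", "4", "red"], gold) == 2 / 3
```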
Benchmark Scoring
For standard benchmarks (MMLU, HumanEval, etc.), we ran the official evaluation protocols on both original and compressed prompts.
Statistical Analysis
We computed mean accuracy deltas, 95% confidence intervals, and ran paired t-tests to verify statistical significance.
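The per-sample data isn't published, but using the per-benchmark deltas from the table above, the paired t-statistic and 95% interval can be reproduced with the standard library alone (2.365 is the usual two-sided 95% critical value for 7 degrees of freedom):

```python
from math import sqrt
from statistics import mean, stdev

# Per-benchmark differences (original - compressed), in percentage points
diffs = [0.6, 0.5, 0.7, 0.6, 0.3, 0.4, 0.6, 0.6]

n = len(diffs)
d_bar = mean(diffs)             # mean paired difference
se = stdev(diffs) / sqrt(n)     # standard error of the mean difference
t_stat = d_bar / se             # paired t-statistic

T_CRIT = 2.365                  # two-sided 95%, df = n - 1 = 7
ci = (d_bar - T_CRIT * se, d_bar + T_CRIT * se)
print(f"mean diff = {d_bar:.2f}, t = {t_stat:.1f}, 95% CI = ({ci[0]:.2f}, {ci[1]:.2f})")
```

On these numbers the interval excludes zero, so the drop is consistent across benchmarks, while its upper bound stays well under one percentage point.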
Convinced? Try it yourself.
TokenCut is free during beta. Compress your own text and verify the results.