Case Study · LLM Ops · Cost Optimization

LLM Evaluation Framework

Comprehensive benchmarking system for comparing Claude and GPT-4 performance, with intelligent prompt caching achieving a 25% cost reduction.

2024 · ~2 weeks · Solo Project

The Problem

Choosing the right LLM for production involves complex trade-offs. Organizations struggle with:

  • No standardized way to compare models
  • Hidden costs from inefficient prompts
  • Lack of quality metrics beyond "vibes"
  • Difficulty justifying model selection to stakeholders

The Solution

A data-driven evaluation framework that quantifies performance:

  • Multi-dimensional benchmarking suite
  • Automated Claude vs GPT-4 comparison
  • 25% cost reduction via prompt optimization
  • Interactive dashboard for stakeholder reporting

System Architecture

Modular design separating evaluation, optimization, and visualization concerns.

Architecture diagram (summarized):

  • Benchmark Suite: 8 task categories feeding the LLM Evaluator core engine
  • LLM Evaluator: dispatches identical prompts to the Claude API (Anthropic) and the GPT-4 API (OpenAI)
  • Optimization Layer: raw prompt → LRU cache (deduplication) → compressor (token reduction) → ~25% savings
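
As a rough sketch of how these layers fit together, the optimization layer fronts the evaluator, which fans out to whichever model client is requested. The names Pipeline and EvalOutcome, and the stub clients, are illustrative stand-ins rather than the framework's actual classes.

from dataclasses import dataclass
from typing import Callable, Dict, Tuple

@dataclass
class EvalOutcome:
    model: str
    response: str
    from_cache: bool

class Pipeline:
    """Illustrative wiring: compressor -> cache -> model client."""

    def __init__(self, clients: Dict[str, Callable[[str], str]]):
        self.clients = clients                       # model name -> callable returning a response
        self.cache: Dict[Tuple[str, str], str] = {}  # stand-in for the LRU cache layer

    def compress(self, prompt: str) -> str:
        # Stand-in for the compressor shown in the implementation highlights
        return " ".join(prompt.split())

    def run(self, prompt: str, model: str) -> EvalOutcome:
        prompt = self.compress(prompt)
        key = (model, prompt)
        if key in self.cache:                        # cache hit: skip the API call
            return EvalOutcome(model, self.cache[key], from_cache=True)
        response = self.clients[model](prompt)       # Claude or GPT-4 client
        self.cache[key] = response
        return EvalOutcome(model, response, from_cache=False)

# Stub clients keep the sketch runnable without API keys
pipeline = Pipeline({"claude-3-sonnet": lambda p: "[claude] " + p,
                     "gpt-4-turbo": lambda p: "[gpt] " + p})
print(pipeline.run("  summarize   this  report  ", "claude-3-sonnet"))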

Tech Stack

Python

Core framework built with modern Python 3.9+ features including dataclasses, type hints, and async support for efficient API calls.

dataclasses · typing
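
The async support matters because a benchmark run issues many independent API calls. A minimal sketch of the pattern, with a placeholder coroutine standing in for the real client:

import asyncio
import time
from typing import List

async def call_model(prompt: str) -> str:
    """Placeholder for an async API call; simulates network latency."""
    await asyncio.sleep(0.5)
    return "response to: " + prompt

async def evaluate_batch(prompts: List[str]) -> List[str]:
    # Fan out all requests concurrently instead of awaiting them one by one
    return await asyncio.gather(*(call_model(p) for p in prompts))

start = time.perf_counter()
results = asyncio.run(evaluate_batch(["task 1", "task 2", "task 3"]))
print(len(results), f"{time.perf_counter() - start:.2f}s")  # ~0.5s total, not ~1.5s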

Anthropic API

Direct integration with Claude models (Opus, Sonnet, Haiku) for comprehensive benchmarking across the Claude family.

claude-3-sonnet · claude-3-opus
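
A minimal example of the kind of call the framework issues against the Messages API (interface per the anthropic Python SDK as of 2024; the model ID and max_tokens value are illustrative):

import os
from anthropic import Anthropic

client = Anthropic(api_key=os.environ["ANTHROPIC_API_KEY"])

message = client.messages.create(
    model="claude-3-sonnet-20240229",   # illustrative model ID
    max_tokens=512,
    messages=[{"role": "user", "content": "Summarize LRU caching in two sentences."}],
)

print(message.content[0].text)                                   # response text
print(message.usage.input_tokens, message.usage.output_tokens)   # feeds cost tracking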

OpenAI API

GPT-4 and GPT-3.5 Turbo integration for head-to-head comparison against Claude models on identical tasks.

gpt-4-turbo · gpt-3.5-turbo
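
The equivalent call on the OpenAI side (openai Python SDK v1 interface; model ID and parameters are again illustrative), so both providers can be scored on identical prompts:

import os
from openai import OpenAI

client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])

completion = client.chat.completions.create(
    model="gpt-4-turbo",                # illustrative model ID
    max_tokens=512,
    messages=[{"role": "user", "content": "Summarize LRU caching in two sentences."}],
)

print(completion.choices[0].message.content)
print(completion.usage.prompt_tokens, completion.usage.completion_tokens)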

Streamlit + Plotly

Interactive dashboard for real-time benchmarking visualization, cost analysis, and stakeholder reporting.

Interactive Charts · Live Metrics
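
A stripped-down sketch of the dashboard pattern (the dataframe values below are placeholders, not benchmark results):

# app.py -- run with: streamlit run app.py
import pandas as pd
import plotly.express as px
import streamlit as st

st.title("LLM Evaluation Dashboard")

# Placeholder data; the real app would load exported benchmark results
df = pd.DataFrame({
    "model": ["claude-3-sonnet", "gpt-4-turbo", "claude-3-sonnet", "gpt-4-turbo"],
    "task": ["reasoning", "reasoning", "coding", "coding"],
    "latency_ms": [820, 1140, 990, 1310],
    "cost_usd": [0.004, 0.012, 0.005, 0.014],
})

metric = st.selectbox("Metric", ["latency_ms", "cost_usd"])
fig = px.bar(df, x="task", y=metric, color="model", barmode="group")
st.plotly_chart(fig, use_container_width=True)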

Implementation Highlights

Core Evaluator with Multi-Metric Tracking

from dataclasses import dataclass
from typing import Optional

@dataclass
class EvaluationResult:
    """Container for evaluation metrics"""
    model: str
    prompt: str
    response: str
    latency_ms: float
    input_tokens: int
    output_tokens: int
    cost_usd: float
    quality_score: Optional[float] = None

    @property
    def tokens_per_second(self) -> float:
        """Calculate throughput (output tokens per second)"""
        if self.latency_ms <= 0:
            return 0.0
        return (self.output_tokens / self.latency_ms) * 1000

LRU Cache for Prompt Deduplication

from collections import OrderedDict
from typing import Optional

class PromptCache:
    """LRU cache reducing redundant API calls"""

    def __init__(self, max_size=1000, ttl_seconds=3600):
        self.max_size = max_size
        self.ttl_seconds = ttl_seconds
        self.cache: OrderedDict = OrderedDict()
        self.stats = {"hits": 0, "misses": 0, "tokens_saved": 0}

    def get(self, prompt: str, model: str) -> Optional["CacheEntry"]:
        # _hash_prompt (defined elsewhere) maps (prompt, model) to a stable key
        prompt_hash = self._hash_prompt(prompt, model)
        if prompt_hash in self.cache:
            entry = self.cache[prompt_hash]
            self.cache.move_to_end(prompt_hash)   # refresh recency for LRU ordering
            self.stats["hits"] += 1
            self.stats["tokens_saved"] += entry.total_tokens
            return entry
        self.stats["misses"] += 1
        return None
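
The excerpt shows the read path; the write path is not part of the original snippet, so this put method and its eviction policy are an assumed sketch of how the LRU bound would typically be enforced on the same class:

    def put(self, prompt: str, model: str, entry: "CacheEntry") -> None:
        """Insert or refresh an entry, evicting the least recently used when full."""
        prompt_hash = self._hash_prompt(prompt, model)
        self.cache[prompt_hash] = entry
        self.cache.move_to_end(prompt_hash)   # newest entries live at the end
        if len(self.cache) > self.max_size:
            self.cache.popitem(last=False)    # drop the least recently used entry

    def hit_rate(self) -> float:
        total = self.stats["hits"] + self.stats["misses"]
        return self.stats["hits"] / total if total else 0.0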

Verbose Phrase Compression

import re
from typing import List, Tuple

class ContextWindowOptimizer:
    # Compression patterns for token reduction
    COMPRESSION_MAP = {
        r'\bplease\s+could you\s+': '',
        r'\bin order to\b': 'to',
        r'\bdue to the fact that\b': 'because',
        r'\bit is important to note that\b': 'note:',
    }

    def compress_prompt(self, prompt: str) -> Tuple[str, List[str]]:
        techniques_applied: List[str] = []
        for pattern, replacement in self.COMPRESSION_MAP.items():
            if re.search(pattern, prompt):
                techniques_applied.append(pattern)   # record which rule fired
            prompt = re.sub(pattern, replacement, prompt)
        return prompt, techniques_applied
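
A quick usage example against the class above; the word-count ratio is only a rough proxy for token savings (real numbers would come from the provider's tokenizer):

optimizer = ContextWindowOptimizer()

verbose = ("it is important to note that the API is rate limited, "
           "so in order to avoid errors, retry with backoff "
           "due to the fact that limits reset every minute")

compressed, applied = optimizer.compress_prompt(verbose)
print(compressed)
# -> "note: the API is rate limited, so to avoid errors, retry with backoff
#     because limits reset every minute"

saving = 1 - len(compressed.split()) / len(verbose.split())
print(f"~{saving:.0%} fewer words")   # ~38% fewer words on this contrived example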

Cost Optimization Results

25% Cost Reduction

Combined savings from caching and prompt compression techniques

15-20% Cache Savings

Avoided API calls through LRU cache with 1-hour TTL

5-10% Compression

Token reduction via verbose phrase elimination

Optimization Technique Breakdown

  • Prompt Caching (LRU): 15-20%
  • Verbose Compression: 5-10%
  • Context Window Optimization: 3-5%

Benchmark Suite

Comprehensive task coverage across reasoning, coding, and creative domains.

Logical Reasoning

Multi-step deduction puzzles testing inference capabilities.

Math Problems

Word problems requiring step-by-step calculation.

Code Generation

Algorithm implementation with edge case handling.

Code Analysis

Bug identification and explanation tasks.

Summarization

Concise extraction from long-form content.

Creative Writing

Constrained generation like haiku composition.

Entity Extraction

Structured JSON output from unstructured text.

Factual Q&A

Knowledge recall and accuracy testing.
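
A sketch of how a task in any of these categories could be represented; BenchmarkTask, its fields, and the keyword-based scorer are assumptions for illustration, not the framework's actual schema:

from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class BenchmarkTask:
    """One prompt in the suite, tagged with its category."""
    category: str                   # e.g. "logical_reasoning", "code_generation"
    prompt: str
    expected_keywords: List[str] = field(default_factory=list)
    reference_answer: Optional[str] = None

    def score(self, response: str) -> float:
        """Naive quality score: fraction of expected keywords present."""
        if not self.expected_keywords:
            return 0.0
        hits = sum(kw.lower() in response.lower() for kw in self.expected_keywords)
        return hits / len(self.expected_keywords)

task = BenchmarkTask(
    category="entity_extraction",
    prompt="Extract all company names from: 'Anthropic and OpenAI released new models.'",
    expected_keywords=["Anthropic", "OpenAI"],
)
print(task.score('{"companies": ["Anthropic", "OpenAI"]}'))   # 1.0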

Key Features

Multi-Dimensional Metrics

Track latency (avg, P50, P95), throughput (tokens/sec), cost (per-request and aggregate), and quality scores with custom evaluators. Export results to JSON for further analysis.
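
An illustrative aggregation over EvaluationResult objects (the dataclass shown in the implementation highlights); the JSON export shape is an assumption:

import json
import statistics
from typing import Dict, List

def aggregate_metrics(results: List["EvaluationResult"]) -> Dict[str, float]:
    latencies = sorted(r.latency_ms for r in results)
    cuts = statistics.quantiles(latencies, n=100)    # needs at least 2 results
    return {
        "latency_avg_ms": statistics.mean(latencies),
        "latency_p50_ms": statistics.median(latencies),
        "latency_p95_ms": cuts[94],                  # 95th percentile cut point
        "tokens_per_sec_avg": statistics.mean(r.tokens_per_second for r in results),
        "cost_total_usd": sum(r.cost_usd for r in results),
    }

def export_json(results: List["EvaluationResult"], path: str = "results.json") -> None:
    with open(path, "w") as fh:
        json.dump(aggregate_metrics(results), fh, indent=2)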

Model Pricing Calculator

Built-in pricing data for Claude (Opus, Sonnet, Haiku) and GPT (4, 4-Turbo, 3.5-Turbo). Project daily, monthly, and yearly costs based on expected usage patterns.
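
A sketch of the projection logic. The per-million-token rates below are early-2024 list prices included purely as placeholders; the framework's own pricing table is the source of truth:

# USD per million tokens: (input, output); placeholder early-2024 list prices
PRICING = {
    "claude-3-opus":   (15.00, 75.00),
    "claude-3-sonnet": (3.00, 15.00),
    "claude-3-haiku":  (0.25, 1.25),
    "gpt-4-turbo":     (10.00, 30.00),
    "gpt-3.5-turbo":   (0.50, 1.50),
}

def project_cost(model: str, daily_requests: int,
                 avg_input_tokens: int, avg_output_tokens: int) -> dict:
    in_rate, out_rate = PRICING[model]
    per_request = (avg_input_tokens * in_rate + avg_output_tokens * out_rate) / 1_000_000
    daily = per_request * daily_requests
    return {"per_request": per_request, "daily": daily,
            "monthly": daily * 30, "yearly": daily * 365}

print(project_cost("claude-3-sonnet", daily_requests=10_000,
                   avg_input_tokens=800, avg_output_tokens=300))
# per_request = (800*3 + 300*15) / 1e6 = $0.0069 -> ~$69/day, ~$2,070/month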

Interactive Dashboard

Streamlit-based UI with Plotly charts for real-time visualization. Compare models side-by-side, analyze task-level performance, and generate stakeholder-ready reports.

Mock Mode for Development

Full framework functionality without API keys. Simulated latency and responses allow testing the evaluation pipeline before committing to API costs.
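
A sketch of what such a mock client could look like (the class name and the latency/token ranges are assumptions; the point is that it returns the same shape of data as a real client without needing keys):

import random
import time

class MockLLMClient:
    """Drop-in stand-in for a real API client: no keys, simulated latency."""

    def __init__(self, model: str, seed: int = 42):
        self.model = model
        self.rng = random.Random(seed)          # deterministic runs for testing

    def complete(self, prompt: str) -> dict:
        latency = self.rng.uniform(0.3, 1.2)    # simulate 300-1200 ms round trips
        time.sleep(latency)
        return {
            "model": self.model,
            "response": f"[mock {self.model}] {prompt[:40]}...",
            "latency_ms": latency * 1000,
            "input_tokens": len(prompt.split()),
            "output_tokens": self.rng.randint(50, 200),
        }

client = MockLLMClient("claude-3-sonnet")
print(client.complete("Explain the difference between P50 and P95 latency."))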

Explore the Code

Full source code with documentation available on GitHub.