🤖 Model Performance Comparison Tool

Compare LLM performance on multiple-choice questions using Hugging Face models.

Format: Each line should have: Question,Correct Answer,Choice1,Choice2,Choice3

💡 Features:

  • Model evaluation using HuggingFace transformers
  • Support for custom models via HF model paths
  • Detailed question-by-question results
  • Performance charts and statistics

Enter the delimiter used in your dataset:

Choose a sample dataset or enter your own

Format Requirements:

  • Each data line: Question, Correct Answer, Choice1, Choice2, Choice3 (No header)
  • Use commas or tabs as separators (see the parsing sketch below)
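
Below is a minimal Python sketch of how a dataset in this format might be parsed. It assumes Choice1-Choice3 are the incorrect alternatives; the parse_dataset helper and the sample lines are illustrative only, not part of this tool.

```python
# Illustrative parsing sketch; not the tool's actual loader.
import csv
import io

SAMPLE = """What is 2 + 2?,4,3,5,22
Which planet is closest to the Sun?,Mercury,Venus,Mars,Jupiter
"""

def parse_dataset(text: str, delimiter: str = ",") -> list[dict]:
    """Parse 'Question,Correct Answer,Choice1,Choice2,Choice3' lines (no header)."""
    rows = []
    for fields in csv.reader(io.StringIO(text), delimiter=delimiter):
        if not fields:
            continue  # skip blank lines
        question, correct, *distractors = [f.strip() for f in fields]
        rows.append({
            "question": question,
            "correct": correct,
            "choices": [correct] + distractors,  # assumption: correct answer plus distractors
        })
    return rows

print(parse_dataset(SAMPLE))            # comma-delimited
# parse_dataset(tsv_text, "\t")         # tab-delimited variant
```
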
Select from popular models

⚠️ Note:

  • Larger models require more GPU memory; this tool currently runs on CPU only
  • The first run downloads each model, which may take time (see the loading sketch below)
  • Models are cached for subsequent runs
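
As a rough sketch, loading a model on CPU with the transformers library might look like the following; the model name "gpt2" is only an example of a small model, and the exact loading code in this tool may differ.

```python
# Sketch only: load a small causal LM on CPU with Hugging Face transformers.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # example; any HF model path works here
tokenizer = AutoTokenizer.from_pretrained(model_name)  # downloaded once, then cached locally
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.float32,  # CPU-friendly dtype
)
model.eval()  # inference only; no gradients needed for evaluation
```
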

📊 Results

Results will appear here...

📥 Export Results

📋 Markdown Table Format

📊 CSV Format

Detailed results will appear here...
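
For illustration, results could be serialized into the two export formats roughly as follows; the result fields (model, accuracy, correct, total) are assumptions, not necessarily this tool's exact schema.

```python
# Illustrative export helpers; the result keys below are assumed, not the tool's schema.
import csv
import io

def to_markdown(rows: list[dict]) -> str:
    lines = ["| Model | Accuracy | Correct | Total |", "| --- | --- | --- | --- |"]
    for r in rows:
        lines.append(f"| {r['model']} | {r['accuracy']:.1%} | {r['correct']} | {r['total']} |")
    return "\n".join(lines)

def to_csv(rows: list[dict]) -> str:
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=["model", "accuracy", "correct", "total"])
    writer.writeheader()
    writer.writerows(rows)
    return buf.getvalue()
```
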


About Model Evaluation

This tool loads and runs HuggingFace models for evaluation:

๐Ÿ—๏ธ How it works:

  • Downloads models from HuggingFace Hub
  • Formats questions as prompts for each model
  • Runs likelihood-based evaluation (see the sketch after this list)
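
The sketch below shows one common way likelihood-based multiple-choice scoring is done: each choice is appended to the question prompt, the model's total log-probability of the choice tokens is computed, and the highest-scoring choice is taken as the prediction. The prompt template and helper names are illustrative assumptions, not necessarily what this tool uses.

```python
# Likelihood-based multiple-choice scoring sketch (assumes model/tokenizer are
# already loaded as in the CPU example above; the prompt template is an assumption).
import torch

@torch.no_grad()
def score_choice(model, tokenizer, question: str, choice: str) -> float:
    prompt = f"Question: {question}\nAnswer:"
    prompt_ids = tokenizer(prompt, return_tensors="pt").input_ids
    full_ids = tokenizer(prompt + " " + choice, return_tensors="pt").input_ids

    logits = model(full_ids).logits                        # [1, seq_len, vocab]
    log_probs = torch.log_softmax(logits[:, :-1, :], -1)   # log P(token_i | tokens_<i)
    targets = full_ids[:, 1:]
    token_log_probs = log_probs.gather(-1, targets.unsqueeze(-1)).squeeze(-1)

    answer_start = prompt_ids.shape[1] - 1  # score only the choice tokens
    return token_log_probs[0, answer_start:].sum().item()

def predict(model, tokenizer, question: str, choices: list[str]) -> str:
    scores = [score_choice(model, tokenizer, question, c) for c in choices]
    return choices[scores.index(max(scores))]  # highest log-likelihood wins
```

Note that summing log-probabilities tends to favor shorter choices; dividing by the number of answer tokens (length normalization) is a common variant.
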

⚡ Performance Tips:

  • Use smaller models for testing
  • Larger models (7B+) require significant GPU memory
  • Models are cached after first load

🔧 Supported Models:

  • Any HuggingFace autoregressive language model
  • Both instruction-tuned and base models
  • Custom fine-tuned models via HF paths