---
title: GPU Poor LLM Arena
emoji: 🏆
colorFrom: blue
colorTo: purple
sdk: gradio
sdk_version: 5.5.0
app_file: app.py
pinned: false
license: mit
short_description: 'Compact LLM Battle Arena: Frugal AI Face-Off!'
---

# ๐Ÿ† GPU-Poor LLM Gladiator Arena ๐Ÿ†

Welcome to the GPU-Poor LLM Gladiator Arena, where frugal meets fabulous in the world of AI! This project pits compact language models (currently topping out at around 12B parameters) against each other in a battle of wits and words.

## 🤔 Starting from "Why?"

In recent months, we've seen a lot of these "tiny" models released, and some of them are really impressive.

- **Gradio Exploration**: This project serves as my playground for experimenting with Gradio app development; I'm learning how to build interactive AI interfaces with it.

- **Tiny Model Evaluation**: I wanted to develop a personal (and now public) stats system for evaluating tiny language models. It's not too serious, but it provides valuable insights into the capabilities of these compact powerhouses.

- **Accessibility**: Built on Ollama, this arena allows pretty much anyone to experiment with these models themselves. No need for expensive GPUs or cloud services!

- **Pure Fun**: At its core, this project is about having fun with AI. It's a lighthearted way to explore and compare different models. So, haters, feel free to chill – we're just here for a good time!


## 🌟 Features

- **Battle Arena**: Pit two mystery models against each other and decide which pint-sized powerhouse reigns supreme.
- **Leaderboard**: Track the performance of different models over time using an improved scoring system.
- **Performance Chart**: Visualize model performance with interactive charts.
- **Privacy-Focused**: Uses local Ollama API, avoiding pricey commercial APIs and keeping data close to home.
- **Customizable**: Easy to add new models and prompts.

## 🚀 Getting Started

### Prerequisites

- Python 3.7+
- Gradio
- Plotly
- Ollama (running locally)

### Installation

1. Clone the repository:
   ```
   git clone https://huggingface.co/spaces/k-mktr/gpu-poor-llm-arena.git
   cd gpu-poor-llm-arena
   ```

2. Install the required packages:
   ```
   pip install gradio plotly requests
   ```

3. Ensure Ollama is running locally or is reachable on a remote server (a quick connectivity check is sketched after these steps).

4. Run the application:
   ```
   python app.py
   ```
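
If you want to confirm that Ollama is reachable before launching the app, a quick check along the lines of the sketch below works. This helper script is not part of the repository; it only assumes Ollama's standard `/api/tags` endpoint, which lists the models available to the server:

```python
# check_ollama.py - hypothetical helper, not included in this repo
import requests

OLLAMA_URL = "http://localhost:11434"  # adjust if Ollama runs elsewhere

try:
    # /api/tags returns the models currently available to the Ollama server.
    resp = requests.get(f"{OLLAMA_URL}/api/tags", timeout=5)
    resp.raise_for_status()
    models = [m["name"] for m in resp.json().get("models", [])]
    print(f"Ollama is up; {len(models)} model(s) available: {models}")
except requests.RequestException as exc:
    print(f"Could not reach Ollama at {OLLAMA_URL}: {exc}")
```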

## 🎮 How to Use

1. Open the application in your web browser (typically at `http://localhost:7860`).
2. In the "Battle Arena" tab:
   - Enter a prompt or use the random prompt generator (🎲 button).
   - Click "Generate Responses" to see outputs from two random models.
   - Vote for the better response.
3. Check the "Leaderboard" tab to see overall model performance.
4. View the "Performance Chart" tab for a visual representation of model wins and losses.

## 🛠 Configuration

You can customize the arena by modifying the `arena_config.py` file:

- Add or remove models from the `APPROVED_MODELS` list.
- Adjust the `API_URL` and `API_KEY` if needed.
- Customize `example_prompts` for more variety in random prompts.
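
As an illustration, a minimal `arena_config.py` might look like the sketch below. The names `APPROVED_MODELS`, `API_URL`, `API_KEY`, and `example_prompts` come from the list above; the actual file in the repository contains more settings, and the model tags and prompts shown here are made-up examples:

```python
# arena_config.py - illustrative sketch only, not the real configuration file

# Ollama endpoint; adjust if the server runs on another host or port.
API_URL = "http://localhost:11434/api/chat"
API_KEY = ""  # a local Ollama server normally needs no key

# Models eligible for battles (example tags; any model Ollama can pull works).
APPROVED_MODELS = [
    "llama3.2:3b-instruct-q8_0",
    "gemma2:2b-instruct-q4_0",
    "qwen2.5:1.5b-instruct-q8_0",
]

# Prompts offered by the random prompt generator.
example_prompts = [
    "Explain recursion to a five-year-old.",
    "Write a haiku about debugging at 3 a.m.",
]
```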

## 📊 Leaderboard

The leaderboard data is stored in `leaderboard.json`. This file is automatically updated after each battle.

### Main Leaderboard Scoring System

We use a scoring system to rank the models fairly. The score for each model is calculated using the following formula:

```
Score = Win Rate * (1 - 1 / (Total Battles + 1))
```

Let's break down this formula:

1. **Win Rate**: This is the number of wins divided by the total number of battles. It ranges from 0 (no wins) to 1 (all wins).

2. **1 - 1 / (Total Battles + 1)**: This factor adjusts the win rate based on the number of battles:
   - We add 1 to the total battles to avoid division by zero and to ensure that even with just one battle, the score isn't discounted too heavily.
   - As the number of battles increases, this factor approaches 1.
   - For example:
     - With 1 battle: 1 - 1/2 = 0.5
     - With 10 battles: 1 - 1/11 ≈ 0.91
     - With 100 battles: 1 - 1/101 ≈ 0.99

3. **Purpose of this adjustment**:
   - It gives more weight to models that have participated in more battles.
   - A model with a high win rate but few battles will have a lower score than a model with the same win rate but more battles.
   - This encourages models to participate in more battles to improve their score.

4. **How it works in practice**:
   - For a new model with just one battle, its score will be only 50% of its win rate.
   - As the model participates in more battles, its score will approach its actual win rate.
   - This prevents models with very few battles from dominating the leaderboard based on lucky wins.

In essence, this formula balances two factors:
1. How well a model performs (win rate)
2. How much experience it has (total battles)

It ensures that the leaderboard favors models that consistently perform well over a larger number of battles, rather than those that might have a high win rate from just a few lucky encounters.

We sort the results primarily by this calculated score, and secondarily by the total number of battles. This ensures that models with similar scores are ranked by their experience (number of battles).

The leaderboard displays this calculated score alongside wins, losses, and other statistics.
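
To make the ranking rule concrete, here is a small self-contained sketch of the score and sort order described above. The function name and the stats are made up for illustration; the real computation lives in the app code:

```python
def arena_score(wins, total_battles):
    """Score = win rate * (1 - 1 / (total battles + 1))."""
    if total_battles == 0:
        return 0.0
    win_rate = wins / total_battles
    experience_factor = 1 - 1 / (total_battles + 1)
    return win_rate * experience_factor

# Hypothetical stats, purely to show how experience shifts the ranking.
stats = {
    "new-model": {"wins": 2, "battles": 2},        # 100% win rate, 2 battles
    "veteran-model": {"wins": 10, "battles": 10},  # 100% win rate, 10 battles
    "steady-model": {"wins": 80, "battles": 100},  # 80% win rate, 100 battles
}

# Sort by score first, then by number of battles, as the leaderboard does.
ranked = sorted(
    stats.items(),
    key=lambda kv: (arena_score(kv[1]["wins"], kv[1]["battles"]), kv[1]["battles"]),
    reverse=True,
)
for model, s in ranked:
    print(model, round(arena_score(s["wins"], s["battles"]), 3))
# veteran-model 0.909, steady-model 0.792, new-model 0.667
```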

### ELO Leaderboard

In addition to the main leaderboard, we also maintain an ELO-based leaderboard:

- Models start with an initial ELO rating based on their size.
- ELO ratings are updated after each battle, with adjustments made based on the size difference between models.
- The ELO leaderboard provides an alternative perspective on model performance, taking into account the relative strengths of opponents.
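
For reference, a plain ELO update looks like the sketch below. This is the textbook formula with a fixed K-factor; the arena's size-based starting ratings and size-difference adjustments live in the app code and are not reproduced here:

```python
def expected_score(rating_a, rating_b):
    """Probability that A beats B under the ELO model."""
    return 1 / (1 + 10 ** ((rating_b - rating_a) / 400))

def update_elo(rating_a, rating_b, a_won, k=32):
    """Return the new (rating_a, rating_b) after one battle.

    k controls how much a single result moves the ratings; the arena
    presumably varies this (and the initial ratings) with model size.
    """
    expected_a = expected_score(rating_a, rating_b)
    actual_a = 1.0 if a_won else 0.0
    new_a = rating_a + k * (actual_a - expected_a)
    new_b = rating_b + k * ((1.0 - actual_a) - (1.0 - expected_a))
    return new_a, new_b

# Example: a 1500-rated model upsets a 1600-rated one.
print(update_elo(1500, 1600, a_won=True))  # approximately (1520.5, 1579.5)
```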

## 🤖 Models

The arena currently supports the following compact models:

- LLaMA 3.2 (1B, 3B, 8-bit)
- LLaMA 3.1 (8B, 4-bit)
- Gemma 2 (2B, 4-bit; 2B, 8-bit; 9B, 4-bit)
- Qwen 2.5 (0.5B, 8-bit; 1.5B, 8-bit; 3B, 4-bit; 7B, 4-bit)
- Mistral 0.3 (7B, 4-bit)
- Phi 3.5 (3.8B, 4-bit)
- Mistral Nemo (12B, 4-bit)
- GLM4 (9B, 4-bit)
- InternLM2 v2.5 (7B, 4-bit)
- Falcon2 (11B, 4-bit)
- StableLM2 (1.6B, 8-bit; 12B, 4-bit)
- Yi v1.5 (6B, 4-bit; 9B, 4-bit)
- Ministral (8B, 4-bit)
- Dolphin 2.9.4 (8B, 4-bit)
- Granite 3 Dense (2B, 8-bit; 8B, 4-bit)
- Granite 3 MoE (1B, 8-bit; 3B, 4-bit)

## 🤝 Contributing

Contributions are welcome! Please feel free to suggest a model that Ollama supports. Some results are already quite surprising.

## 📜 License

This project is open-source and available under the MIT License.

## 🙏 Acknowledgements

- Thanks to the Ollama team for providing such an amazing tool.
- Shoutout to all the AI researchers and compact language model teams for making this frugal AI arena possible!

Enjoy the battles in the GPU-Poor LLM Gladiator Arena! May the best compact model win! 🏆