---
title: DmxPerplexity
emoji: 🌖
colorFrom: purple
colorTo: pink
sdk: gradio
sdk_version: 4.7.1
app_file: app.py
pinned: false
license: apache-2.0
tags:
- evaluate
- metric
description: >-
  Perplexity metric implemented by d-Matrix.
  Perplexity (PPL) is one of the most common metrics for evaluating language models.
  It is defined as the exponentiated average negative log-likelihood of a sequence, calculated with exponent base `e`.
  For more information, see https://huggingface.co/docs/transformers/perplexity
---

# Metric Card for Perplexity


## Metric Description

Perplexity metric implemented by d-Matrix.
Perplexity (PPL) is one of the most common metrics for evaluating language models.
It is defined as the exponentiated average negative log-likelihood of a sequence, calculated with exponent base `e`.
For more information, see https://huggingface.co/docs/transformers/perplexity
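
Written out in symbols, for a tokenized sequence $X = (x_1, \dots, x_t)$ and a model with parameters $\theta$, the definition above reads:

```latex
\mathrm{PPL}(X) = \exp\left(-\frac{1}{t}\sum_{i=1}^{t}\log p_\theta\left(x_i \mid x_{<i}\right)\right)
```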

## How to Use
At minimum, this metric requires a model and references as inputs.
```python
>>> import evaluate
>>> perplexity = evaluate.load("dmx_perplexity", module_type="metric")
>>> input_texts = ["lorem ipsum", "Happy Birthday!", "Bienvenue"]
>>> results = perplexity.compute(model='distilgpt2', references=input_texts)
>>> print(results)
{'loss': 4.993086338043213, 'perplexity': 147.390625}
```

### Inputs
- **model** (`str` or `AutoModelForCausalLM`): the model used for calculating perplexity, given either as a checkpoint name or as a loaded model instance (a sketch of the second form follows this list).
- **references** (`list` of `str`): input text; each text snippet is one list entry.
- **device** (`str`): device to run on; defaults to `'cuda'` when available.
- **max_length** (`int`): maximum sequence length; defaults to 2048.
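
Since **model** also accepts a loaded `AutoModelForCausalLM`, here is a minimal sketch of that form using the parameters listed above (the `device` and `max_length` values shown are just the documented defaults):
```python
>>> import evaluate
>>> from transformers import AutoModelForCausalLM
>>> model = AutoModelForCausalLM.from_pretrained("distilgpt2")
>>> perplexity = evaluate.load("dmx_perplexity", module_type="metric")
>>> results = perplexity.compute(
...     model=model,                # preloaded model instead of a checkpoint name
...     references=["lorem ipsum", "Happy Birthday!"],
...     device="cuda",              # documented default when available
...     max_length=2048,            # documented default
... )
```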

### Output Values
- **loss** (`float`): the loss of the model's predictions on the references, i.e. the average negative log-likelihood.
- **perplexity** (`float`): measures the uncertainty of the model when predicting text; model performance is better when perplexity is lower.

Output Example(s):
```python
{'loss': 4.993086338043213, 'perplexity': 147.390625}
```
This metric outputs a dictionary containing the loss and the perplexity score.
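
Since `loss` is the average negative log-likelihood in nats and perplexity is its base-`e` exponentiation (per the description above), the two outputs should agree up to floating-point precision:
```python
>>> import math
>>> round(math.exp(4.993086338043213), 2)  # e**loss from the example above
147.39
```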

### Examples
```python
>>> import evaluate
>>> from datasets import load_dataset
>>> perplexity = evaluate.load("dmx_perplexity", module_type="metric")
>>> input_texts = load_dataset("wikitext", "wikitext-2-raw-v1", split="test")["text"][:10]
>>> results = perplexity.compute(model='distilgpt2', references=input_texts)
>>> print(list(results.keys()))
['loss', 'perplexity']
>>> print(results['loss'])
3.8299286365509033
>>> print(results['perplexity'])
46.05925369262695
```

## Citation(s)
https://huggingface.co/docs/transformers/perplexity