|
--- |
|
title: DGEB |
|
app_file : leaderboard/app.py |
|
sdk: docker |
|
sdk_version: 4.36.1 |
|
--- |
|
<h1 align="center">Diverse Genomic Embedding Benchmark</h1> |
|
|
|
<p align="center"> |
|
<a href="https://github.com/tattabio/dgeb/releases"> |
|
<img alt="GitHub release" src="https://img.shields.io/github/v/release/tattabio/dgeb.svg"> |
|
</a> |
|
<a href=""> |
|
<img alt="arXiv URL" src=""> |
|
</a> |
|
<a href="https://github.com/tattabio/dgeb/blob/main/LICENSE"> |
|
<img alt="License" src="https://img.shields.io/github/license/tattabio/dgeb.svg"> |
|
</a> |
|
<a href="https://pepy.tech/project/dgeb"> |
|
<img alt="Downloads" src="https://static.pepy.tech/personalized-badge/dgeb?period=total&units=international_system&left_color=grey&right_color=orange&left_text=Downloads"> |
|
</a> |
|
</p> |
|
|
|
<h4 align="center"> |
|
<p> |
|
<a href="#installation">Installation</a> | |
|
<a href="#usage">Usage</a> | |
|
<a href="https://huggingface.co/spaces/tattabio/DGEB">Leaderboard</a> | |
|
<a href="#citing">Citing</a> |
|
<p> |
|
</h4> |
|
|
|
<h3 align="center"> |
|
<a href="https://huggingface.co/spaces/dgeb"><img style="float: middle; padding: 10px 10px 10px 10px;" width="100" height="100" src="./docs/images/tatta_logo.png" /></a> |
|
</h3> |
|
|
|
DGEB is a benchmark for evaluating biological sequence models on functional and evolutionary information. |
|
|
|
DGEB is designed to evaluate model embeddings using: |
|
|
|
- Diverse sequences accross the tree of life. |
|
- Diverse tasks that capture different aspects of biological function. |
|
- Both amino acid and nucleotide sequences. |
|
|
|
The current version of DGEB consists of 18 datasets covering all three domains of life (Bacteria, Archaea and Eukarya). DGEB evaluates embeddings using six different embedding tasks: Classification, BiGene mining, Evolutionary Distance Similarity (EDS), Pair Classification, Clustering, and Retrieval. |
|
|
|
We welcome contributions of new tasks and datasets. |
|
|
|
## Installation |
|
|
|
Install DGEB using pip. |
|
|
|
```bash |
|
pip install dgeb |
|
``` |
|
|
|
## Usage |
|
|
|
- Launch evaluation using the python script (see [cli.py](https://github.com/tattabio/dgeb/blob/main/dgeb/cli.py)): |
|
|
|
```bash |
|
dgeb --model facebook/esm2_t6_8M_UR50D |
|
``` |
|
|
|
- To see all supported models and tasks: |
|
|
|
```bash |
|
dgeb --help |
|
``` |
|
|
|
- Using the python API: |
|
|
|
```py |
|
import dgeb |
|
|
|
model = dgeb.get_model("facebook/esm2_t6_8M_UR50D") |
|
tasks = dgeb.get_tasks_by_modality(dgeb.Modality.PROTEIN) |
|
evaluation = dgeb.DGEB(tasks=tasks) |
|
evaluation.run(model, output_folder="results") |
|
``` |
|
|
|
### Using a custom model |
|
|
|
Custom models should be wrapped with the `dgeb.models.BioSeqTransformer` abstract class, and specify the modality, number of layers, and embedding dimension. See [models.py](https://github.com/tattabio/dgeb/blob/main/dgeb/models.py) for additional examples on custom model loading and inference. |
|
|
|
```python |
|
import dgeb |
|
from dgeb.models import BioSeqTransformer |
|
from dgeb.tasks.tasks import Modality |
|
|
|
class MyModel(BioSeqTransformer): |
|
|
|
@property |
|
def modality(self) -> Modality: |
|
return Modality.PROTEIN |
|
|
|
@property |
|
def num_layers(self) -> int: |
|
return self.config.num_hidden_layers |
|
|
|
@property |
|
def embed_dim(self) -> int: |
|
return self.config.hidden_size |
|
|
|
|
|
model = MyModel(model_name='path_to/huggingface_model') |
|
tasks = dgeb.get_tasks_by_modality(model.modality) |
|
evaluation = dgeb.DGEB(tasks=tasks) |
|
evaluation.run(model) |
|
``` |
|
|
|
### Evaluating on a custom dataset |
|
|
|
**We strongly encourage users to contribute their custom datasets to DGEB. Please open a PR adding your dataset so that the community can benefit!** |
|
|
|
To evaluate on a custom dataset, first upload your dataset to the [Huggingface Hub](https://huggingface.co/docs/hub/en/datasets-adding). Then define a `Task` subclass with `TaskMetadata` that points to your huggingface dataset. For example, a classification task on a custom dataset can be defined as follows: |
|
|
|
```python |
|
import dgeb |
|
from dgeb.models import BioSeqTransformer |
|
from dgeb.tasks import Dataset, Task, TaskMetadata, TaskResult |
|
from dgeb.tasks.classification_tasks import run_classification_task |
|
|
|
class MyCustomTask(Task): |
|
metadata = TaskMetadata( |
|
id="my_custom_classification", |
|
display_name="...", |
|
description="...", |
|
type="classification", |
|
modality=Modality.PROTEIN, |
|
datasets=[ |
|
Dataset( |
|
path="path_to/huggingface_dataset", |
|
revision="...", |
|
) |
|
], |
|
primary_metric_id="f1", |
|
) |
|
|
|
def run(self, model: BioSeqTransformer) -> TaskResult: |
|
return run_classification_task(model, self.metadata) |
|
|
|
model = dgeb.get_model("facebook/esm2_t6_8M_UR50D") |
|
evaluation = dgeb.DGEB(tasks=[MyCustomTask]) |
|
evaluation.run(model) |
|
``` |
|
|
|
## Leaderboard |
|
|
|
To add your submission to the DGEB leaderboard, proceed through the following instructions. |
|
|
|
1. Fork the DGEB repository by following GitHub's instruction [Forking Workflow](https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/proposing-changes-to-your-work-with-pull-requests/creating-a-pull-request-from-a-fork). |
|
|
|
2. Add your submission .json file to the leaderboard/submissions/<HF_MODEL_NAME>/ directory. |
|
|
|
```bash |
|
mv /path/to/<SUBMISSION_FILE>.json /path/to/DGEB/leaderboard/submissions/<HF_MODEL_NAME>/ |
|
``` |
|
|
|
4. Update your fork with the new submission: |
|
|
|
```bash |
|
git add leaderboard/submissions/<HF_MODEL_NAME>/<SUBMISSION_FILE>.json |
|
git commit -m "Add submission for <HF_MODEL_NAME>" |
|
git push |
|
``` |
|
|
|
5. Open a pull request to the main branch of the repository via the Github interface. |
|
|
|
6. Once the PR is review and merged, your submission will be added to the leaderboard! |
|
|
|
|
|
## Acknowledgements |
|
|
|
DGEB follows the design of text embedding bechmark [MTEB](https://github.com/embeddings-benchmark/mteb) developed by Huggingface 🤗. The evaluation code is adapted from the MTEB codebase. |
|
|
|
## Citing |
|
|
|
DGEB was introduced in "[Diverse Genomic Embedding Benchmark for Functional Evaluation Across the Tree of Life]()", feel free to cite: |
|
|
|
TODO |
|
|