File size: 6,116 Bytes

---
license: mit
datasets:
- jet-universe/jetclass2
tags:
- particle physics
- jet tagging
---

# Model Card: Sophon

<!-- Provide a quick summary of what the model is/does. -->

The Sophon model is a jet tagging model pre-trained on a 188-class classification task using the JetClass-II dataset. It is based on the [Particle Transformer](https://github.com/jet-universe/particle_transformer) architecture.
This model represents the first practical implementation under the **Sophon** (Signature-Oriented Pre-training for Heavy-resonance ObservatioN) methodology.

For more details, refer to the following links: [[Paper]](https://arxiv.org/abs/2405.12972), [[Github]](https://github.com/jet-universe/sophon).

Try out this [[Demo on Colab]](https://colab.research.google.com/github/jet-universe/sophon/blob/main/notebooks/Interacting_with_JetClassII_and_Sophon.ipynb) to get started with the model.


## Model Details

The Sophon model functions both as a generic jet tagging model and a pre-trained model tailored for LHC's analysis needs.
As a jet tagger, the model is trained to distinguish among 188 classes:

![jetclass2_table](figures/jetclass2_table.png)

Key features of the model include:

- Training in a mass-decorrelated scenario, achieved by (1) ensuring the training dataset covers a wide range of jet transverse momentum (*p*<sub>T</sub>) and soft-drop mass (*m*<sub>SD</sub>), and (2) reweighting samples in each major training class to achieve similar jet distributions for *p*<sub>T</sub> and *m*<sub>SD</sub>.
- Enhanced "scale invariance" through the use of normalized 4-vectors as input.


## Uses and Impact

### Inferring Sophon model via ONNX

The Sophon model is valuable for future LHC phenomenological research, particularly for estimating physics measurement sensitivity using fast-simulation (Delphes) datasets. For a quick example of using this model in Python, or integrating this model in C++ workflows to process Delphes files, check [[here]](https://github.com/jet-universe/sophon?tab=readme-ov-file#using-sophon-model-pythonc).

This model also offers insights for the future development of generic and foundation AI models for particle physics experiments.


## Training Details

### Install dependencies

The Sophon model is based on the [ParT](https://github.com/jet-universe/particle_transformer) architecture. It is implemented in PyTorch, with training based on the [weaver](https://github.com/hqucms/weaver-core) framework for dataset loading and transformation. To install `weaver`, run:

```bash
pip install git+https://github.com/hqucms/weaver-core.git@dev/custom_train_eval
```

 > **Note:** We are temporarily using a development branch of `weaver`.

For instructions on setting up Miniconda and installing PyTorch, refer to the [`weaver`](https://github.com/hqucms/weaver-core?tab=readme-ov-file#set-up-a-conda-environment-and-install-the-packages) page.

### Download Sophon repository

```bash
git clone https://github.com/jet-universe/sophon.git
cd sophon
```

### Download dataset

Download the JetClass-II dataset from [[HuggingFace Dataset]](https://huggingface.co/datasets/jet-universe/jetclass2). 
The training and validation files are used in this work, while the test files are not used.

Ensure that all ROOT files are accessible from:

```bash
./datasets/JetClassII/Pythia/{Res2P,Res34P,QCD}_*.parquet
```

### Training

**Step 1:** Generate dataset sampling weights according to the `weights` section in the data configuration. The processed config with pre-calculated weights will be saved to `data/JetClassII`.

```bash
./train_sophon.sh make_weight
```

**Step 2:** Start training.

```bash
./train_sophon.sh train
```

 > **Note:** Depending on your machine and GPU configuration, additional settings may be useful. Here are a few examples:
 > - Enable PyTorch's DDP for parallel training, e.g., `CUDA_VISIBLE_DEVICES=0,1,2,3 DDP_NGPUS=4 ./train_sophon.sh train --start-lr 2e-3` (the learning rate should be scaled according to `DDP_NGPUS`).
> - Configure the number of data loader workers and the number of splits for the entire dataset. The script uses the default configuration `--num-workers 5 --data-split-num 200`, which means there are 5 workers, each responsible for processing 1/5 of the data files and reading the data synchronously; the data assigned to each worker is split into 200 parts, with each worker sequentially reading 1/200 of the total data in order.

**Step 3** (optional): Convert the model to ONNX.

```bash
./train_sophon.sh convert
```

## Evaluation

The Sophon model has been evaluated on several LHC experimental tasks. The evaluation dataset is a dedicated Standard Model dataset, collected using a generic large-*R* jet trigger that selects large-*R* (*R* = 0.8) jets with *p*<sub>T</sub> > 400 GeV and trimmed mass *m*<sub>trim</sub> > 50 GeV.

Key evaluation results include:

- Superior performance in directly tagging *X→bb* jets against QCD background jets, and in tagging *X→bs* jets against QCD jets after fine-tuning. The model outperforms the best experimental taggers for *X→bb* and *X→bs*.
- Significant potential for searching for unknown heavy resonances by constructing various tagging discriminants, selecting data, and performing generic bump hunts.
- Excellent results in anomaly detection using a weakly-supervised training approach, showing greater sensitivity to signals at very low signal injection levels and improved significance with adequate signal.

For more details, refer to the [[Paper]](https://arxiv.org/abs/2405.12972).


## Citation

If you use the JetClass-II dataset or the Sophon model, please cite:

```
@article{Li:2024htp,
    author = "Li, Congqiao and Agapitos, Antonios and Drews, Jovin and Duarte, Javier and Fu, Dawei and Gao, Leyun and Kansal, Raghav and Kasieczka, Gregor and Moureaux, Louis and Qu, Huilin and Suarez, Cristina Mantilla and Li, Qiang",
    title = "{Accelerating Resonance Searches via Signature-Oriented Pre-training}",
    eprint = "2405.12972",
    archivePrefix = "arXiv",
    primaryClass = "hep-ph",
    month = "5",
    year = "2024"
}
```