---
language: en
tags:
- transformers
- protein
- peptide-receptor
license: apache-2.0
datasets:
- custom
---
## Model Description
This model predicts receptor classes, identified by their PDB IDs, from peptide sequences using the [ESM2](https://huggingface.co/docs/transformers/model_doc/esm) (Evolutionary Scale Modeling) protein language model with `esm2_t6_8M_UR50D` pre-trained weights. The model is fine-tuned for receptor prediction on datasets from [PROPEDIA](http://bioinfo.dcc.ufmg.br/propedia2/) and [PepNN](https://www.nature.com/articles/s42003-022-03445-2), as well as on novel peptides experimentally validated to bind their target proteins, with binding conformations determined using ClusPro, a protein-protein docking tool. The name `pep2rec_cppp` reflects the model's ability to predict peptide-to-receptor relationships, leveraging training data from ClusPro, PROPEDIA, and PepNN.
It is particularly useful for researchers and practitioners in bioinformatics, drug discovery, and related fields who aim to understand or predict peptide-receptor interactions.
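For orientation, the sketch below shows how such a fine-tuning setup might look. It is not the authors' training code: the base checkpoint `facebook/esm2_t6_8M_UR50D`, the use of the `Trainer` API, and all data and hyperparameters shown are assumptions or placeholders.

```python
# Hypothetical fine-tuning sketch: ESM2 with a sequence-classification head over
# receptor (PDB ID) classes. Peptides, labels, and hyperparameters are placeholders.
import torch
from sklearn.preprocessing import LabelEncoder
from transformers import (
    AutoModelForSequenceClassification,
    AutoTokenizer,
    Trainer,
    TrainingArguments,
)

peptides = ["GNLIVVGRVIMS", "ACDEFGHIKLMN", "MKTAYIAKQRQS"]  # placeholder peptides
receptors = ["1JXP", "2OIN", "1JXP"]                         # placeholder PDB ID labels

# Encode PDB IDs as integer class indices
label_encoder = LabelEncoder()
labels = label_encoder.fit_transform(receptors)

tokenizer = AutoTokenizer.from_pretrained("facebook/esm2_t6_8M_UR50D")
model = AutoModelForSequenceClassification.from_pretrained(
    "facebook/esm2_t6_8M_UR50D", num_labels=len(label_encoder.classes_)
)

class PeptideDataset(torch.utils.data.Dataset):
    """Wraps tokenized peptide sequences and their receptor class labels."""
    def __init__(self, sequences, labels):
        self.encodings = tokenizer(sequences, truncation=True, padding=True)
        self.labels = labels
    def __len__(self):
        return len(self.labels)
    def __getitem__(self, idx):
        item = {k: torch.tensor(v[idx]) for k, v in self.encodings.items()}
        item["labels"] = torch.tensor(int(self.labels[idx]))
        return item

train_ds = PeptideDataset(peptides, labels)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="pep2rec_cppp", num_train_epochs=1, report_to="none"),
    train_dataset=train_ds,
)
trainer.train()
```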
## How to Use
Here is how to predict the receptor class for a peptide sequence using this model:
```python
import torch
from huggingface_hub import hf_hub_download
from joblib import load
from transformers import AutoModelForSequenceClassification, AutoTokenizer

MODEL_PATH = "littleworth/esm2_t6_8M_UR50D_pep2rec_cppp"

# Load the fine-tuned classifier and its tokenizer from the Hub
model = AutoModelForSequenceClassification.from_pretrained(MODEL_PATH)
tokenizer = AutoTokenizer.from_pretrained(MODEL_PATH)

# Download the fitted label encoder that maps class indices to receptor PDB IDs
label_encoder_path = hf_hub_download(repo_id=MODEL_PATH, filename="label_encoder.joblib")
label_encoder = load(label_encoder_path)

# Tokenize the peptide sequence and run a forward pass
input_sequence = "GNLIVVGRVIMS"
inputs = tokenizer(input_sequence, return_tensors="pt", truncation=True, padding=True)
with torch.no_grad():
    outputs = model(**inputs)

# Convert logits to class probabilities and decode the top prediction
probabilities = torch.softmax(outputs.logits, dim=1)
predicted_class_idx = probabilities.argmax(dim=1).item()
predicted_class = label_encoder.inverse_transform([predicted_class_idx])[0]

# Rank all receptor classes by predicted probability
class_probabilities = probabilities.squeeze().tolist()
class_labels = label_encoder.inverse_transform(list(range(len(class_probabilities))))
sorted_indices = torch.argsort(probabilities, descending=True).squeeze()
sorted_class_labels = [class_labels[i] for i in sorted_indices.tolist()]
sorted_class_probabilities = probabilities.squeeze()[sorted_indices].tolist()

print(f"Predicted Receptor Class: {predicted_class}")
print("Top 10 Class Probabilities:")
for label, prob in zip(sorted_class_labels[:10], sorted_class_probabilities[:10]):
    print(f"{label}: {prob:.4f}")
```
This produces the following output:
```
Predicted Receptor Class: 1JXP
Top 10 Class Probabilities:
1JXP: 0.7793
2OIN: 0.0058
1A1R: 0.0026
2QV1: 0.0025
3KEE: 0.0022
3KF2: 0.0016
5LAS: 0.0016
1QD6: 0.0014
6ME1: 0.0013
2XCF: 0.0013
```
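Several peptides can also be scored in one batch. A minimal sketch, assuming `model`, `tokenizer`, and `label_encoder` have been loaded as in the example above; the peptide list is purely illustrative:

```python
# Batch prediction: report the top receptor class for several peptides at once.
peptides = ["GNLIVVGRVIMS", "ACDEFGHIKLMN"]  # illustrative sequences

batch = tokenizer(peptides, return_tensors="pt", truncation=True, padding=True)
with torch.no_grad():
    logits = model(**batch).logits

probs = torch.softmax(logits, dim=1)
top_probs, top_idx = probs.max(dim=1)
top_labels = label_encoder.inverse_transform(top_idx.tolist())

for seq, label, p in zip(peptides, top_labels, top_probs.tolist()):
    print(f"{seq} -> {label} ({p:.4f})")
```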
## Evaluation Results
The model was evaluated on a held-out test set. The final training and evaluation metrics, as logged by Weights & Biases, are:
```
{
"train/loss": 0.7338,
"train/grad_norm": 4.333151340484619,
"train/learning_rate": 2.3235385792411667e-8,
"train/epoch": 10,
"train/global_step": 352910,
"_timestamp": 1711654529.5562913,
"_runtime": 204515.04906344414,
"_step": 715,
"eval/loss": 0.7718502879142761,
"eval/accuracy": 0.7761048124023759,
"eval/runtime": 2734.4878,
"eval/samples_per_second": 34.416,
"eval/steps_per_second": 34.416,
"train/train_runtime": 204505.5285,
"train/train_samples_per_second": 13.806,
"train/train_steps_per_second": 1.726,
"train/total_flos": 143220103846625280,
"train/train_loss": 1.0842229404661865,
"_wandb": {
"runtime": 204514
}
}
```
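The reported `eval/accuracy` can be approximated on your own labeled data with the same inference loop as above. A minimal sketch, assuming `model`, `tokenizer`, and `label_encoder` are loaded as in the usage example; the peptide-receptor pairs shown are placeholders, not the actual test set:

```python
# Hypothetical accuracy check on user-provided peptide-receptor (PDB ID) pairs.
eval_pairs = [("GNLIVVGRVIMS", "1JXP"), ("ACDEFGHIKLMN", "2OIN")]  # placeholder pairs

correct = 0
for peptide, true_pdb_id in eval_pairs:
    enc = tokenizer(peptide, return_tensors="pt", truncation=True, padding=True)
    with torch.no_grad():
        pred_idx = model(**enc).logits.argmax(dim=1).item()
    pred_pdb_id = label_encoder.inverse_transform([pred_idx])[0]
    correct += int(pred_pdb_id == true_pdb_id)

print(f"Accuracy: {correct / len(eval_pairs):.4f}")
```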