+---
+license: cc-by-nc-4.0
+library_name: transformers
+---
+# PromoGen2 Model for Prokaryotic Promoter Sequence Generation
+PromoGen2 is a specialized language model developed for generating and scoring prokaryotic promoter sequences. The model is particularly suitable for species with limited experimentally verified data. This model card provides guidance on loading the model, generating sequences, and scoring them using a custom scoring function.
+## Model Details
+- **Model type**: Transformer-based language model (GPT-2 architecture)
+- **Primary use case**: Generating and scoring species-specific promoter sequences
+- **Tags**: Prokaryotic promoters, sequence generation, synthetic biology
+## Installation
+Ensure the required packages are installed:
+```bash
+pip install torch "transformers[torch]" biopython datasets pandas numpy scipy seaborn matplotlib jupyter notebook
+```
+## Loading the Model and Tokenizer
+To get started, load the model and tokenizer with Hugging Face's `transformers` library.
+```python
+from transformers import GPT2LMHeadModel, AutoTokenizer, pipeline
+import torch
+# Load model and tokenizer
+model = GPT2LMHeadModel.from_pretrained("jinyuan22/promogen2-small")
+tokenizer = AutoTokenizer.from_pretrained("jinyuan22/promogen2-small")
+# Set device (CPU or GPU)
+device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
+pipe = pipeline("text-generation", model=model, device=device, tokenizer=tokenizer)
+```
+## Generating Sequences
+Use the `text-generation` pipeline to generate sequences based on an input sequence and various parameters such as sampling temperature, repetition penalty, and top-p sampling. Customize the input sequence (`txt`), number of sequences, and sampling parameters.
+```python
+# Define input text and generation parameters
+txt = "<|bos|>5"
+num_return_sequences = 5
+batch_size = 2
+max_new_tokens = 50
+repetition_penalty = 1.2
+top_p = 0.9
+temperature = 0.7
+do_sample = True
+# Generate sequences
+all_outputs = []
+for i in range(0, num_return_sequences, batch_size):
+    # Request only as many sequences as are still needed, so the final batch may be smaller
+    n = min(batch_size, num_return_sequences - i)
+    outputs = pipe(
+        txt,
+        num_return_sequences=n,
+        max_new_tokens=max_new_tokens,
+        repetition_penalty=repetition_penalty,
+        top_p=top_p,
+        temperature=temperature,
+        do_sample=do_sample
+    )
+    all_outputs.extend(outputs)
+```
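A loop that always requests a full `batch_size` returns more sequences than asked for whenever `num_return_sequences` is not a multiple of `batch_size` (for example, 5 requested in batches of 2). The batch-size arithmetic can be sketched on its own, without the model; `batch_sizes` is a hypothetical helper name, not part of `transformers`.

```python
def batch_sizes(total, batch_size):
    """Yield per-batch sizes that add up to exactly `total`."""
    if total < 0 or batch_size <= 0:
        raise ValueError("total must be >= 0 and batch_size must be > 0")
    for start in range(0, total, batch_size):
        # The last batch shrinks to whatever is still missing
        yield min(batch_size, total - start)


sizes = list(batch_sizes(5, 2))
print(sizes)  # [2, 2, 1]
```

Each yielded value can be passed as `num_return_sequences` for one pipeline call.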
+## Scoring Generated Sequences
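+A custom scoring function (`score`) evaluates each generated sequence. It wraps the sequence in the provided tag (or in no tag if `tag` is `"none"`) and returns the model's mean per-token cross-entropy loss, that is, the average negative log-likelihood. Lower scores therefore mean the model considers the sequence more likely.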
+```python
+@torch.no_grad()
+def score(seq, tag="none"):
+    # Format input with specified tag
+    if tag == "none":
+        inputs = tokenizer(f"<|bos|>5{seq}3<|eos|>", return_tensors="pt")
+    else:
+        inputs = tokenizer(f"<|bos|>{tag}5{seq}3{tag}<|eos|>", return_tensors="pt")
+    # Move the encoded inputs to the same device as the model
+    inputs = inputs.to(device)
+    # Passing labels makes the model return the mean cross-entropy loss over tokens
+    pred = model(input_ids=inputs["input_ids"], attention_mask=inputs["attention_mask"], labels=inputs["input_ids"])
+    return pred["loss"].item()
+```
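When `labels` is passed, `GPT2LMHeadModel` returns the mean cross-entropy over the predicted tokens, so the value from `score` is an average negative log-likelihood in nats. The sketch below shows one way to read such values: convert them to perplexity and rank candidates with the lowest loss first. The loss values are made-up placeholders, not real model output, and `perplexity` and `rank_by_score` are hypothetical helper names.

```python
import math


def perplexity(mean_nll):
    """Convert a mean per-token negative log-likelihood (nats) to perplexity."""
    return math.exp(mean_nll)


def rank_by_score(scored):
    """Sort (sequence, loss) pairs so the most likely sequence comes first."""
    return sorted(scored, key=lambda pair: pair[1])


# Placeholder values for illustration only
scored = [("ACGT", 1.8), ("TTGACA", 1.2), ("TATAAT", 1.5)]
for seq, loss in rank_by_score(scored):
    print(f"{seq}\tloss={loss:.2f}\tperplexity={perplexity(loss):.2f}")
```

Because the loss is averaged per token, it is comparable across sequences of different lengths, which a summed log-likelihood would not be.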
+## Post-processing and Saving Outputs
+The generated sequences are cleaned of special tokens and then scored using the `score` function. Each sequence and its score are saved to an output file.
+```python
+# Post-process generated sequences
+tag = "none"
+seqs = [
+    output["generated_text"]
+    .replace("<|bos|>", "").replace("<|eos|>", "")
+    .replace("5", "").replace("3", "").replace(tag, "")
+    for output in all_outputs
+]
+scores = [score(seq, tag) for seq in seqs]
+# Save sequences and scores
+with open("output.txt", "w") as f:
+    # Use a loop variable other than `score` so the scoring function is not overwritten
+    for i, (seq, seq_score) in enumerate(zip(seqs, scores)):
+        f.write(f">{i}|score={seq_score}\n{seq}\n")
+```
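The file written above is FASTA-like: a header line of the form `>{index}|score={value}` followed by one sequence line. A minimal sketch for reading it back is shown here, assuming exactly that two-line layout. `read_scored` is a hypothetical helper, and the demo writes placeholder records to a temporary file rather than using real model output.

```python
import os
import tempfile


def read_scored(path):
    """Parse '>idx|score=value' headers, each followed by one sequence line."""
    with open(path) as f:
        lines = [line.rstrip("\n") for line in f if line.strip()]
    records = []
    for header, seq in zip(lines[0::2], lines[1::2]):
        if not header.startswith(">"):
            raise ValueError(f"expected a header line, got {header!r}")
        idx, _, score_field = header[1:].partition("|score=")
        records.append((int(idx), float(score_field), seq))
    return records


# Demo with placeholder records
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "output.txt")
    with open(path, "w") as f:
        f.write(">0|score=1.5\nACGT\n>1|score=1.2\nTTGACA\n")
    records = read_scored(path)

# Lowest loss means highest likelihood under the model
best = min(records, key=lambda r: r[1])
print(best)  # (1, 1.2, 'TTGACA')
```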
+## Example Parameters
+- **txt**: Input sequence string for generation
+- **tag**: Tag to define the context or label for generation (`"none"` if no specific tag is used)
+- **num_return_sequences**: Number of sequences to generate
+- **batch_size**: Number of sequences generated per batch
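+- **max_new_tokens**: Maximum number of new tokens to generate per sequence, not counting the prompt
+- **repetition_penalty**: Penalty applied to tokens that have already appeared, to discourage repetition (values above 1.0 penalize repeats)
+- **top_p**: Cumulative probability threshold for nucleus (top-p) sampling
+- **temperature**: Sampling temperature (lower values make output more deterministic, higher values more diverse)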
+- **do_sample**: Set to `True` for sampling-based generation
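To make `temperature` and `top_p` concrete, the pure-Python sketch below shows how the two settings reshape a next-token distribution. It illustrates the general technique, not the exact code path inside `transformers`, and the four logits are made up.

```python
import math


def softmax_with_temperature(logits, temperature=1.0):
    """Divide logits by the temperature, then apply a numerically stable softmax."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def top_p_filter(probs, top_p):
    """Keep the smallest set of most probable tokens whose mass reaches top_p, then renormalize."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for i in order:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= top_p:
            break
    mass = sum(probs[i] for i in kept)
    return {i: probs[i] / mass for i in kept}


logits = [2.0, 1.0, 0.5, -1.0]  # made-up values
sharp = softmax_with_temperature(logits, temperature=0.7)
flat = softmax_with_temperature(logits, temperature=1.5)
print(max(sharp) > max(flat))  # True: lower temperature concentrates probability

nucleus = top_p_filter(sharp, top_p=0.9)
print(sorted(nucleus))  # [0, 1]: token indices that remain candidates for sampling
```

Sampling then draws the next token only from the tokens kept by the filter, which is why a lower `top_p` or `temperature` gives less varied sequences.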
+## Usage Notes
+- A GPU is optional: the loading code selects CUDA when it is available and falls back to CPU otherwise. Inputs passed to the model must be on the same device as the model.
+- This setup supports sequence generation tasks tailored to synthetic biology, particularly for organisms lacking experimentally verified promoter data.