jinyuan22 committed on
Commit
9767d2f
1 Parent(s): 0c463ad

Update README.md

Files changed (1)
  1. README.md +122 -3
README.md CHANGED
@@ -1,3 +1,122 @@
- ---
- license: cc-by-nc-4.0
- ---
+ ---
+ license: cc-by-nc-4.0
+ library_name: transformers
+ ---
+
+
+ # PromoGen2 Model for Prokaryotic Promoter Sequence Generation
+
+ PromoGen2 is a specialized language model developed for generating and scoring prokaryotic promoter sequences. The model is particularly suitable for species with limited experimentally verified data. This model card provides guidance on loading the model, generating sequences, and scoring them using a custom scoring function.
+
+ ## Model Details
+
+ - **Model type**: Transformer-based language model (GPT-2 architecture)
+ - **Primary use case**: Generating and scoring species-specific promoter sequences
+ - **Tags**: Prokaryotic promoters, sequence generation, synthetic biology
+
+ ## Installation
+
+ Install the required packages (the extra is quoted so that shells such as zsh do not expand the brackets):
+
+ ```bash
+ pip install torch "transformers[torch]" biopython datasets pandas numpy scipy seaborn matplotlib jupyter notebook
+ ```
+
+ ## Loading the Model and Tokenizer
+
+ To get started, load the model and tokenizer with Hugging Face's `transformers` library.
+
+ ```python
+ from transformers import GPT2LMHeadModel, AutoTokenizer, pipeline
+ import torch
+
+ # Load model and tokenizer
+ model = GPT2LMHeadModel.from_pretrained("jinyuan22/promogen2-small")
+ tokenizer = AutoTokenizer.from_pretrained("jinyuan22/promogen2-small")
+
+ # Set device (CPU or GPU)
+ device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
+ pipe = pipeline("text-generation", model=model, device=device, tokenizer=tokenizer)
+ ```
+
+ ## Generating Sequences
+
+ Use the `text-generation` pipeline to generate sequences from a prompt (`txt`). Sampling is controlled by the temperature, repetition penalty, and top-p value; adjust these, the prompt, and the number of sequences as needed.
+
+ ```python
+ # Define the prompt and generation parameters
+ txt = "<|bos|>5"
+ num_return_sequences = 5
+ batch_size = 2
+ max_new_tokens = 50
+ repetition_penalty = 1.2
+ top_p = 0.9
+ temperature = 0.7
+ do_sample = True
+
+ # Generate sequences in batches; the last batch is shortened so that
+ # exactly num_return_sequences sequences are produced
+ all_outputs = []
+ for i in range(0, num_return_sequences, batch_size):
+     n = min(batch_size, num_return_sequences - i)
+     outputs = pipe(
+         txt,
+         num_return_sequences=n,
+         max_new_tokens=max_new_tokens,
+         repetition_penalty=repetition_penalty,
+         top_p=top_p,
+         temperature=temperature,
+         do_sample=do_sample,
+     )
+     all_outputs.extend(outputs)
+ ```
+
+ ## Scoring Generated Sequences
+
+ A custom scoring function (`score`) evaluates each generated sequence. It returns the model's mean per-token cross-entropy loss for the sequence, conditioned on the provided tag (or `"none"` if no tag is used). Lower values mean the sequence is more likely under the model.
+
+ ```python
+ @torch.no_grad()
+ def score(seq, tag="none"):
+     # Wrap the sequence in the model's prompt format, with or without a tag
+     if tag == "none":
+         text = f"<|bos|>5{seq}3<|eos|>"
+     else:
+         text = f"<|bos|>{tag}5{seq}3{tag}<|eos|>"
+     # Tokenize and move the tensors to the same device as the model
+     inputs = tokenizer(text, return_tensors="pt").to(device)
+     input_ids = inputs["input_ids"]
+     attention_mask = inputs["attention_mask"]
+     # With labels set, the model returns the mean per-token cross-entropy loss
+     pred = model(input_ids=input_ids, attention_mask=attention_mask, labels=input_ids)
+     return pred.loss.item()
+ ```
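
Because `score` returns the loss reported by `GPT2LMHeadModel` (a mean per-token cross-entropy in nats), it can be easier to compare sequences after converting that value to a perplexity. The sketch below is illustrative only: the helper names `loss_to_perplexity` and `rank_by_loss` are hypothetical and not part of PromoGen2, and the example sequences and losses are made-up numbers.

```python
import math


def loss_to_perplexity(loss: float) -> float:
    """Convert a mean per-token cross-entropy (in nats) to a perplexity."""
    return math.exp(loss)


def rank_by_loss(seqs, losses):
    """Return (sequence, loss) pairs sorted by ascending loss (best first)."""
    return sorted(zip(seqs, losses), key=lambda pair: pair[1])


# Hypothetical sequences and losses, for illustration only
example = rank_by_loss(["ACGT", "TTGACA", "TATAAT"], [1.9, 1.2, 1.5])
```

In practice the `losses` list would be the values returned by `score` for each generated sequence.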
+
+ ## Post-processing and Saving Outputs
+
+ The generated sequences are stripped of special tokens and then scored with the `score` function. Each sequence and its score are written to `output.txt` in a FASTA-like format.
+
+ ```python
+ # Post-process generated sequences: drop special tokens, the 5/3 markers, and the tag
+ tag = "none"
+ seqs = [output["generated_text"].replace("<|bos|>", "").replace("<|eos|>", "").replace("5", "").replace("3", "").replace(tag, "") for output in all_outputs]
+ scores = [score(seq, tag) for seq in seqs]
+
+ # Save sequences and scores; the loop variable is named seq_score
+ # so that it does not shadow the score() function
+ with open("output.txt", "w") as f:
+     for i, (seq, seq_score) in enumerate(zip(seqs, scores)):
+         f.write(f">{i}|score={seq_score}\n{seq}\n")
+ ```
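
The file written above uses a FASTA-like layout: a header line of the form `>{index}|score={value}` followed by one sequence line. A minimal sketch for reading such a file back is shown below, assuming exactly that layout and single-line sequences. `read_scored_fasta` is a hypothetical helper name, not part of the repository, and the demo content is made up.

```python
import io


def read_scored_fasta(handle):
    """Parse '>index|score=value' headers, each followed by one sequence line."""
    records = []
    header = None
    for line in handle:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            header = line[1:]
        elif header is not None:
            index, _, score_field = header.partition("|score=")
            records.append((int(index), float(score_field), line))
            header = None
    return records


# Illustrative content only; the sequences and scores are made up
demo = io.StringIO(">0|score=1.25\nTTGACA\n>1|score=1.5\nTATAAT\n")
records = read_scored_fasta(demo)
```

With a real run, the same function could be applied to the saved file, for example `with open("output.txt") as f: records = read_scored_fasta(f)`.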
+
+ ## Example Parameters
+
+ - **txt**: Prompt string that starts generation (for example `"<|bos|>5"`)
+ - **tag**: Tag that defines the context or label for generation and scoring (`"none"` if no specific tag is used)
+ - **num_return_sequences**: Total number of sequences to generate
+ - **batch_size**: Number of sequences generated per pipeline call
+ - **max_new_tokens**: Maximum number of new tokens generated per sequence, not counting the prompt
+ - **repetition_penalty**: Penalty on already generated tokens; values above 1.0 discourage repetition
+ - **top_p**: Cumulative probability threshold for nucleus (top-p) sampling
+ - **temperature**: Sampling temperature; lower values give less diverse output
+ - **do_sample**: Set to `True` for sampling-based generation
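
To make `temperature` and `top_p` more concrete, the toy sketch below applies both to a small, made-up logit vector in plain Python. It illustrates the general technique only (temperature scaling followed by nucleus filtering) and is not the `transformers` implementation; the function names and the numbers are assumptions chosen for illustration.

```python
import math


def softmax_with_temperature(logits, temperature=1.0):
    """Divide the logits by the temperature, then normalise to probabilities."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the maximum for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def top_p_filter(probs, top_p=0.9):
    """Keep the smallest set of highest-probability tokens whose cumulative
    probability reaches top_p, renormalise them, and zero out the rest."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = set(), 0.0
    for i in order:
        kept.add(i)
        cumulative += probs[i]
        if cumulative >= top_p:
            break
    total = sum(probs[i] for i in kept)
    return [probs[i] / total if i in kept else 0.0 for i in range(len(probs))]


# Made-up logits for a four-token vocabulary
logits = [2.0, 1.0, 0.5, -1.0]
probs = softmax_with_temperature(logits, temperature=0.7)
filtered = top_p_filter(probs, top_p=0.9)
```

A lower temperature concentrates probability on the highest-scoring tokens, and a lower `top_p` removes more of the low-probability tail before sampling.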
+
+ ## Usage Notes
+
+ - The model runs on either CPU or GPU; the loading code above selects CUDA when it is available. Keep the model and its inputs on the same device.
+ - This setup supports sequence generation tasks tailored to synthetic biology, particularly for organisms lacking experimentally verified promoter data.