enigma-1.5b
Model Details
It's a 2.5b model trained on ~1 billion individual letters of DNA, similar to training a text model at the per-character level instead of the sub-word level. It has its own tokenizer, which is a cross between a char-level and a BPE tokenizer.
EnBERT, the decoder-only model, is trained on large numbers of DNA sequences tokenized with a k-mer tokenizer trained specifically for this purpose, which means it has a larger vocab size than enigma-2.5b. The same architecture was used to train a 430m model that is per-char based like the 2.5b model, but performs better than it.
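For illustration, here is a minimal sketch of the two tokenization styles mentioned above (per-character vs. k-mer). The class names and details are hypothetical; the repo's actual PerCharTokenizer and k-mer tokenizer may handle special tokens and vocabulary differently.

```python
# Illustrative tokenizers only; the repo's PerCharTokenizer and k-mer tokenizer
# may handle special tokens and unknown characters differently.
from itertools import product

class CharTokenizer:
    """One token per DNA letter: tiny vocabulary, maximum precision."""
    def __init__(self, chars="ACGT"):
        self.stoi = {c: i for i, c in enumerate(chars)}
        self.itos = {i: c for c, i in self.stoi.items()}

    def encode(self, seq):
        return [self.stoi[c] for c in seq]

    def decode(self, ids):
        return "".join(self.itos[i] for i in ids)

class KmerTokenizer:
    """One token per overlapping k-mer: vocabulary grows as 4**k."""
    def __init__(self, k=4):
        self.k = k
        self.stoi = {"".join(p): i for i, p in enumerate(product("ACGT", repeat=k))}
        self.itos = {i: km for km, i in self.stoi.items()}

    def encode(self, seq):
        return [self.stoi[seq[i:i + self.k]] for i in range(len(seq) - self.k + 1)]

print(CharTokenizer().encode("TGCA"))  # [3, 2, 1, 0]
print(len(KmerTokenizer(k=4).stoi))    # 256 tokens vs 4 for per-char
```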
Model Description
- Developed by: Shivendra Singh
- License: MIT
Model Sources
- Repository: github/enigma-1.5b
- Papers: Understanding the Natural Language of DNA using Encoder-Decoder Foundation Models with Byte-level Precision
Uses
Can be used to generate new sequences of DNA from a given input of tokens, or as a starting point for further research. For now it's very basic in nature. I'll add more functionality later, including DNA classification, masked-token generation, etc., and maybe even an MoE technique in the future.
Direct Use
Load the model and use it to generate new sequences, with `max_length=512` for the 2.5b model and `max_length=256` for the enigma-430m model.
Bias, Risks, and Limitations
This model was trained on only ~500MB of DNA data, and at the per-character level rather than the sub-word or sequence level used in language models. That gives it more precision, but it is limited by the amount of training. I wasn't able to train it on other datasets for better generalization because of technical limits: lack of GPUs and good hardware.
How to Get Started with the Model
Use the code below to get started with the model.
```python
# Load the model directly from the Hugging Face Hub
from transformers import AutoModel

model = AutoModel.from_pretrained("Shivendrra/enigma-1.5b")
```

```python
# Generate from the model using the repo's own Transformer and tokenizer
import torch

from model import Transformer
from tokenizer import PerCharTokenizer

device = "cuda" if torch.cuda.is_available() else "cpu"

token = PerCharTokenizer()
model = Transformer(vocab_size=vocab_size)  # vocab_size must match the tokenizer's vocabulary
model = model.to(device)

seq = "TGCCCTGGCTGCTCCGCATTGCAGGAGCTGCGCCCTTCCTTTC"
token_input = token.encode(seq)
context = torch.tensor([token_input], dtype=torch.long, device=device)

generated_output = token.decode(model.generate(context, max_new_tokens=500)[0].tolist())
print(generated_output)
```
Training Details
Training Data
Taken from this dataset: human_ref_data. Eight ~500MB files were consolidated into one big dataset. I've uploaded the training data as well.
Training Procedure
These models were trained for 3k-4k iterations each, on ~500 million letters of DNA, roughly 500MB of data. Final losses were around ~0.02 for the 47-million-parameter model and ~0.003 for the 2.5-billion-parameter model. I had saved a lot more data than this, but couldn't train further due to hardware limitations.
Try to train it yourself if possible. The enigma/TrainEnigma file contains all the functions needed to train it from scratch or to pre-train it.
Functions:
This uses a basic training procedure: `get_batch()` generates batches of data, `estimate_loss()` estimates the losses, and `train()` is the master function that calls the others every set number of iterations. A rough sketch of how they fit together is shown below.
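A minimal sketch of how these three functions might fit together, assuming the model's forward pass returns a `(logits, loss)` pair and reusing the hyperparameter names from `config-enigma.json`; the actual TrainEnigma script may differ:

```python
# Rough sketch of how get_batch(), estimate_loss() and train() fit together.
# Assumes model(x, y) returns (logits, loss); everything else is illustrative.
import torch

def get_batch(data, block_size, batch_size, device):
    # sample random contiguous windows of tokens as (input, target) pairs
    ix = torch.randint(len(data) - block_size, (batch_size,))
    x = torch.stack([data[i:i + block_size] for i in ix])
    y = torch.stack([data[i + 1:i + 1 + block_size] for i in ix])
    return x.to(device), y.to(device)

@torch.no_grad()
def estimate_loss(model, data, block_size, batch_size, eval_iters, device):
    model.eval()
    losses = torch.zeros(eval_iters)
    for k in range(eval_iters):
        x, y = get_batch(data, block_size, batch_size, device)
        _, loss = model(x, y)
        losses[k] = loss.item()
    model.train()
    return losses.mean()

def train(model, data, cfg, device):
    optimizer = torch.optim.AdamW(model.parameters(), lr=cfg["learning_rate"])
    for it in range(cfg["max_iters"]):
        if it % cfg["eval_interval"] == 0:
            val = estimate_loss(model, data, cfg["block_size"],
                                cfg["batch_size"], cfg["eval_iters"], device)
            print(f"iter {it}: loss {val.item():.4f}")
        x, y = get_batch(data, cfg["block_size"], cfg["batch_size"], device)
        _, loss = model(x, y)
        optimizer.zero_grad(set_to_none=True)
        loss.backward()
        optimizer.step()
```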
Training Hyperparameters
Configurations are saved in the enigma/config-enigma.json file. These settings are suitable for the 2.5b model.
```json
{
  "batch_size": 10,
  "block_size": 512,
  "max_iters": 5000,
  "eval_interval": 50,
  "learning_rate": 3e-5,
  "eval_iters": 100,
  "d_model": 384,
  "n_head": 12,
  "n_layer": 12,
  "dropout": 0.2,
  "norm_eps": 1e-5
}
```
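Loading this config before building the model is a one-liner with the standard library (path as given above):

```python
# Load the hyperparameters from the JSON config before building the model.
import json

with open("enigma/config-enigma.json") as f:
    config = json.load(f)

print(config["d_model"], config["n_head"], config["n_layer"])  # 384 12 12
```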
Model Architecture and Objective
EnBERT is a 47-million-parameter model that follows the BERT architecture, with one extra masked self-attention layer to predict next tokens. Enigma-2.5b is an encoder-decoder transformer with a fairly complex architecture.
#### Encoder Part:
It consists of two different layers, each followed by its own normalization and dropout layers. Input embeddings along with positional embeddings are fed to the encoder block:
Self Attention:
- Each self-attention head is similar to the one used in grokAI. The Key and Query matrices have biases whereas the Value matrix doesn't.
- After applying `torch.matmul()` to Key and Query, relative positional embeddings are added to the attention matrix.
- The attention and value matrices are then multiplied using `torch.matmul()`.
- The multi-head attention layer then concatenates all the outputs and passes them through a linear layer. A sketch of one head follows this list.
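A minimal sketch of one such head, based only on the description above (biases on Key and Query but not Value, a learned relative-position term added to the score matrix before softmax). The `rel_pos` parameterization here is an assumption, not necessarily the repo's implementation:

```python
# Hypothetical single attention head matching the description above; the real
# implementation in the repo may differ in how relative positions are learned.
import torch
import torch.nn as nn
import torch.nn.functional as F

class AttentionHead(nn.Module):
    def __init__(self, d_model, head_size, block_size, dropout):
        super().__init__()
        self.key = nn.Linear(d_model, head_size, bias=True)     # Key has a bias
        self.query = nn.Linear(d_model, head_size, bias=True)   # Query has a bias
        self.value = nn.Linear(d_model, head_size, bias=False)  # Value does not
        # one learned scalar per relative offset (assumed parameterization)
        self.rel_pos = nn.Parameter(torch.zeros(2 * block_size - 1))
        self.dropout = nn.Dropout(dropout)
        self.block_size = block_size

    def forward(self, x):
        B, T, _ = x.shape
        k, q, v = self.key(x), self.query(x), self.value(x)
        # torch.matmul() on Query and Key gives the raw attention matrix
        scores = torch.matmul(q, k.transpose(-2, -1)) / k.shape[-1] ** 0.5
        # relative positional embeddings are added to the attention matrix
        idx = torch.arange(T, device=x.device)
        scores = scores + self.rel_pos[idx[None, :] - idx[:, None] + self.block_size - 1]
        attn = self.dropout(F.softmax(scores, dim=-1))
        # attention and value matrices are multiplied with torch.matmul()
        return torch.matmul(attn, v)
```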
FeedForward:
- Normalized outputs are then passed to a position-wise `feedforward` layer with an `expansion_factor` of 5 (see the sketch after this list).
- GELU is used as the activation function, between two linear layers, one for the input and one for the output.
- Finally dropout is applied, and the produced outputs carry deep global contextual information about the input tokens.
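A minimal sketch of this block, assuming only what is stated above (`expansion_factor` of 5, GELU between two linear layers, dropout at the end):

```python
# Position-wise feed-forward block as described: expand by 5x, GELU, project back.
import torch.nn as nn

class FeedForward(nn.Module):
    def __init__(self, d_model, dropout, expansion_factor=5):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(d_model, expansion_factor * d_model),  # input projection
            nn.GELU(),
            nn.Linear(expansion_factor * d_model, d_model),  # output projection
            nn.Dropout(dropout),
        )

    def forward(self, x):
        return self.net(x)
```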
#### Decoder Part:
It consists of three different layers:
Masked Attention:
- This layer is similar to the self-attention implemented in the encoder part, except it has a triangular mask that forbids tokens from looking at the context of later tokens (only the masking step differs, as sketched after this list).
- The rest is the same: relative positional embeddings are applied in the same way, but to the masked attention matrix this time.
- The attention and value matrices are then multiplied using `torch.matmul()`.
- The multi-head attention layer then concatenates all the outputs and passes them through a linear layer.
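The masking step on its own, as a small sketch; the triangular mask is the only thing that distinguishes this from the encoder head sketched earlier:

```python
# A triangular mask is applied to the score matrix before softmax,
# so position t cannot attend to positions t+1, t+2, ...
import torch
import torch.nn.functional as F

def masked_softmax(scores):
    T = scores.shape[-1]
    tril = torch.tril(torch.ones(T, T, dtype=torch.bool, device=scores.device))
    scores = scores.masked_fill(~tril, float("-inf"))
    return F.softmax(scores, dim=-1)
```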
Self-Attention:
- Before this layer, the outputs from the encoder and the masked-attention layer are added together and then passed in.
- Otherwise it is the same as the encoder's unmasked attention layer; the Key, Query and Value matrices are created using the same technique.
- Finally, all the outputs are normalized and passed to a dropout layer.
FeedForward:
- Normalized outputs are then passed to a position-wise `feedforward` layer with an `expansion_factor` of 5.
- GELU is used as the activation function, between two linear layers, one for the input and one for the output.
- Finally dropout is applied, and the produced outputs carry deep global contextual information about the input tokens.
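To show how these three decoder layers might be wired together, here is a hedged sketch of one decoder block. `nn.MultiheadAttention` stands in for the custom heads sketched earlier, and the norm/dropout placement is an assumption; the point is the add-then-attend wiring between the encoder output and the masked-attention output:

```python
# Hypothetical decoder block: masked attention, add encoder output, self-attention,
# then feed-forward. Attention internals and norm placement are assumptions.
import torch
import torch.nn as nn

class DecoderBlock(nn.Module):
    def __init__(self, d_model, n_head, dropout, norm_eps, expansion_factor=5):
        super().__init__()
        self.masked_attn = nn.MultiheadAttention(d_model, n_head, dropout=dropout, batch_first=True)
        self.self_attn = nn.MultiheadAttention(d_model, n_head, dropout=dropout, batch_first=True)
        self.ffwd = nn.Sequential(
            nn.Linear(d_model, expansion_factor * d_model),
            nn.GELU(),
            nn.Linear(expansion_factor * d_model, d_model),
        )
        self.norm1 = nn.LayerNorm(d_model, eps=norm_eps)
        self.norm2 = nn.LayerNorm(d_model, eps=norm_eps)
        self.norm3 = nn.LayerNorm(d_model, eps=norm_eps)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x, enc_out):
        T = x.shape[1]
        causal = torch.triu(torch.ones(T, T, dtype=torch.bool, device=x.device), diagonal=1)
        # masked self-attention: tokens cannot attend to later positions
        h, _ = self.masked_attn(x, x, x, attn_mask=causal)
        x = self.norm1(x + self.dropout(h))
        # encoder outputs and masked-attention outputs are added, then self-attended
        y = x + enc_out
        h, _ = self.self_attn(y, y, y)
        x = self.norm2(x + self.dropout(h))
        # position-wise feed-forward with expansion_factor 5 and GELU
        x = self.norm3(x + self.dropout(self.ffwd(x)))
        return x
```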