---
license: cc-by-nc-sa-4.0
widget:
- text: >-
    AAAACATAATAATTTGCCGACTTACTCACCCTGTGATTAATCTATTTTCACTGTGTAGTAAGTAGAGAGTGTTACTTACTACAGTATCTATTTTTGTTTGGATGTTTGCCGTGGACAAGTGCTAACTGTCAAAACCCGTTTTGACCTTAAACCCAGCAATAATAATAATGTAAAACTCCATTGGGCAGTGCAACCTACTCCTCACATATTATATTATAATTCCTAAACCTTGATCAGTTAAATTAATAGCTCTGTTCCCTGTGGCTTTATATAAACACCATGGTTGTCAGCAGTTCAGCA
tags:
- DNA
- biology
- genomics
datasets:
- zhangtaolab/plant-multi-species-core-promoters
metrics:
- accuracy
base_model:
- zhangtaolab/plant-dnabert-BPE
---

# Plant foundation DNA large language models

The plant DNA large language models (LLMs) are a series of foundation models built on different model architectures and pre-trained on various plant reference genomes.
All models have a comparable size, between 90 MB and 150 MB. A BPE tokenizer is used for tokenization, with a vocabulary of 8000 tokens.

**Developed by:** zhangtaolab

### Model Sources

- **Repository:** [Plant DNA LLMs](https://github.com/zhangtaolab/plant_DNA_LLMs)
- **Manuscript:** [Versatile applications of foundation DNA large language models in plant genomes]()

### Architecture

The model is based on the Google BERT base model, with a modified tokenizer specific to DNA sequences.

This model is fine-tuned to predict active core promoters.
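To make the idea of BPE tokenization on DNA more concrete, here is a toy sketch in plain Python. It is not the tokenizer shipped with this model: the function names (`learn_bpe_merges`, `apply_merge`, `tokenize`) and the tiny training corpus are hypothetical, and the real tokenizer is trained on plant genomes to a vocabulary of 8000 tokens. The sketch only shows the general principle of starting from single nucleotides and repeatedly merging the most frequent adjacent pair.

```python
from collections import Counter


def apply_merge(tokens, pair):
    """Replace each non-overlapping occurrence of `pair` with one merged token."""
    merged, i = [], 0
    while i < len(tokens):
        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) == pair:
            merged.append(tokens[i] + tokens[i + 1])
            i += 2
        else:
            merged.append(tokens[i])
            i += 1
    return merged


def learn_bpe_merges(sequences, num_merges):
    """Learn up to `num_merges` merge rules, starting from single nucleotides."""
    corpus = [list(seq) for seq in sequences]
    merges = []
    for _ in range(num_merges):
        # Count every adjacent token pair across the whole corpus.
        pair_counts = Counter()
        for tokens in corpus:
            pair_counts.update(zip(tokens, tokens[1:]))
        if not pair_counts:
            break
        # Pick the most frequent pair; ties are broken alphabetically.
        best = min(pair_counts, key=lambda p: (-pair_counts[p], p))
        merges.append(best)
        corpus = [apply_merge(tokens, best) for tokens in corpus]
    return merges


def tokenize(sequence, merges):
    """Tokenize a sequence by replaying the learned merges in order."""
    tokens = list(sequence)
    for pair in merges:
        tokens = apply_merge(tokens, pair)
    return tokens


# Toy corpus (made up for illustration only).
merges = learn_bpe_merges(["TATATA", "TATAAA"], num_merges=2)
print(merges)
print(tokenize("GTATA", merges))
```

With a larger corpus and thousands of merges, frequent motifs end up as single tokens, so a sequence is represented by far fewer tokens than nucleotides.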

### How to use

Install the runtime library first:

```bash
pip install transformers
```

Here is a simple example for inference:

```python
from transformers import AutoModelForSequenceClassification, AutoTokenizer, pipeline

model_name = 'plant-dnabert-BPE-promoter'

# load model and tokenizer
model = AutoModelForSequenceClassification.from_pretrained(f'zhangtaolab/{model_name}', trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained(f'zhangtaolab/{model_name}', trust_remote_code=True)

# inference
sequences = ['TTACTAAATTTATAACGATTTTTTATCTAACTTTAGCTCATCAATCTTTACCGTGTCAAAATTTAGTGCCAAGAAGCAGACATGGCCCGATGATCTTTTACCCTGTTTTCATAGCTCGCGAGCCGCGACCTGTGTCCAACCTCAACGGTCACTGCAGTCCCAGCACCTCAGCAGCCTGCGCCTGCCATACCCCCTCCCCCACCCACCCACACACACCATCCGGGCCCACGGTGGGACCCAGATGTCATGCGCTGTACGGGCGAGCAACTAGCCCCCACCTCTTCCCAAGAGGCAAAACCT',
             'GACCTAATGATTAACCAAGGAAAAATGCAAGGATTTGACAAAAATATAGAAGCCAATGCTAGGCGCCTAAGTGAATGGATATGAAACAAAAAGCGAGCAGGCTGTCTATATATGGACAATTAGTTGCATTAATATAGTAGTTTATAATTGCAAGCATGGCACTACATCACAACACCTAAAAGACATGCCGTGATGCTAGAACAGCCATTGAATAAATTAGAAAGAAAGGTTGTGGTTAATTAGTTAACGACCAATCGAGCCTACTAGTATAAATTGTACCTCGTTGTTATGAAGTAATTC']
# top_k=None returns the scores for all labels, not only the best one
pipe = pipeline('text-classification', model=model, tokenizer=tokenizer,
                trust_remote_code=True, top_k=None)
results = pipe(sequences)
print(results)
```

### Training data

The model is fine-tuned with `BertForSequenceClassification`.

The detailed training procedure can be found in our manuscript.

#### Hardware

The model was trained on an NVIDIA GTX1080Ti GPU (11 GB).
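
### Reading the predictions

Because the inference example under "How to use" passes `top_k=None`, the pipeline returns, for each input sequence, a list of `{'label', 'score'}` dictionaries covering every label. The following minimal sketch shows one way to keep only the highest-scoring label per sequence. The helper name `top_predictions` is hypothetical, and the labels and scores in `example_results` are made-up placeholders, not real output of this model; check the model's configuration for the actual label names.

```python
def top_predictions(results):
    """Return the highest-scoring {'label', 'score'} dict for each input sequence."""
    return [max(candidates, key=lambda c: c["score"]) for candidates in results]


# Placeholder data in the same shape as the pipeline output for two sequences.
# Labels and scores are invented for illustration only.
example_results = [
    [{"label": "LABEL_1", "score": 0.93}, {"label": "LABEL_0", "score": 0.07}],
    [{"label": "LABEL_0", "score": 0.81}, {"label": "LABEL_1", "score": 0.19}],
]

for best in top_predictions(example_results):
    print(f"{best['label']}\t{best['score']:.2f}")
```

In practice, `top_predictions(results)` can be called directly on the `results` variable from the inference example.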