---
license: other
base_model: microsoft/phi-1_5
tags:
- generated_from_trainer
model-index:
- name: titletor-phi_1-5
  results: []
datasets:
- zelalt/scientific-papers-3.5-withprompt
---

<div align="center">

# Titletor

</div>

<div align="center">
<img src="./titletor.png" width="300"/>
</div>

This model is a fine-tuned version of [microsoft/phi-1_5](https://huggingface.co/microsoft/phi-1_5) on the [zelalt/scientific-papers-3.5-withprompt](https://huggingface.co/datasets/zelalt/scientific-papers-3.5-withprompt) dataset.
It achieves the following results on the evaluation set:
- Loss: 2.1587

### Requirements
Run this in a notebook cell; from a shell, drop the leading `!`.
```python
!pip install accelerate transformers einops datasets peft bitsandbytes
```

## Test Dataset
If you prefer, you can use a test dataset from [zelalt/scientific-papers](https://huggingface.co/datasets/zelalt/scientific-papers)
or [zelalt/arxiv-papers](https://huggingface.co/datasets/zelalt/arxiv-papers), or read your own PDF as text with `PyPDF2.PdfReader` and pass that text to the model, prefixed with the prompt "What is the title of this paper?".

```python
from datasets import load_dataset

# Load the papers and expose the paper body under the 'text' column.
test_dataset = load_dataset("zelalt/scientific-papers", split='train')
test_dataset = test_dataset.rename_column('full_text', 'text')

def formatting(example):
    # Build the prompt from the first 180 characters of the paper.
    text = f"What is the title of this paper? {example['text'][:180]}\n\nAnswer: "
    return {'text': text}

formatted_dataset = test_dataset.map(formatting)
```

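For the PDF route mentioned above, the following is a minimal sketch rather than the author's own pipeline. It assumes PyPDF2 2.x or later (where `PdfReader` and `page.extract_text()` are available) and reuses the 180-character cut-off from the dataset formatting code. The helper names `build_title_prompt` and `read_first_page`, and the path `paper.pdf`, are hypothetical.

```python
PROMPT_PREFIX = "What is the title of this paper? "
PROMPT_SUFFIX = "\n\nAnswer: "


def build_title_prompt(paper_text: str, max_chars: int = 180) -> str:
    """Wrap the first `max_chars` characters of a paper in the prompt format used in this card."""
    return f"{PROMPT_PREFIX}{paper_text[:max_chars]}{PROMPT_SUFFIX}"


def read_first_page(pdf_path: str) -> str:
    """Return the text of the first page of a PDF (requires `pip install PyPDF2`)."""
    # Imported here so build_title_prompt stays usable without PyPDF2 installed.
    from PyPDF2 import PdfReader

    reader = PdfReader(pdf_path)
    # extract_text() may yield nothing for scanned PDFs, so fall back to "".
    return reader.pages[0].extract_text() or ""


# With a local file (placeholder path):
# prompt = build_title_prompt(read_first_page("paper.pdf"))
prompt = build_title_prompt("Bursting Dynamics of the 3D Euler Equations\nin Cylindrical Domains")
print(prompt)
```

The resulting `prompt` string can be passed to the tokenizer in the same way as the dataset prompts in the sample code.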
### Sample Code
```python

import torch
from peft import PeftModel, PeftConfig
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load the base model named in the adapter config, then attach the PEFT adapter.
peft_model_id = "zelalt/titletor-phi_1-5"
config = PeftConfig.from_pretrained(peft_model_id)
model = AutoModelForCausalLM.from_pretrained(config.base_model_name_or_path, trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained(config.base_model_name_or_path, trust_remote_code=True)
model = PeftModel.from_pretrained(model, peft_model_id)

# Prompt taken from the formatted dataset
inputs = tokenizer(formatted_dataset[120]['text'], return_tensors="pt", return_attention_mask=False)
outputs = model.generate(**inputs, max_new_tokens=50, pad_token_id=tokenizer.eos_token_id, eos_token_id=tokenizer.eos_token_id)
text = tokenizer.batch_decode(outputs)[0]
print(text)
```

```python
# Prompt passed as a plain string
inputs = tokenizer("What is the title of this paper? ...[your pdf as text]..\n\nAnswer: ", return_tensors="pt", return_attention_mask=False)
outputs = model.generate(**inputs, max_new_tokens=50, pad_token_id=tokenizer.eos_token_id, eos_token_id=tokenizer.eos_token_id)
text = tokenizer.batch_decode(outputs)[0]
print(text)
```

**Notes**
- Once the model and tokenizer have been loaded, rerun only the generation part for further prompts. Loading them again can exhaust RAM and crash the session.

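One way to follow this note is to keep the generation step in a small function and call it repeatedly after a single load. The sketch below assumes `model` and `tokenizer` are the objects created in the sample code. The name `generate_title_text` is hypothetical and not part of this repository; the generation arguments are the ones used in the sample code.

```python
def generate_title_text(model, tokenizer, prompt, max_new_tokens=50):
    """Run one generation with an already-loaded model and tokenizer."""
    inputs = tokenizer(prompt, return_tensors="pt", return_attention_mask=False)
    outputs = model.generate(
        **inputs,
        max_new_tokens=max_new_tokens,
        pad_token_id=tokenizer.eos_token_id,
        eos_token_id=tokenizer.eos_token_id,
    )
    # The decoded string still contains the prompt and any end-of-text token.
    return tokenizer.batch_decode(outputs)[0]


# After loading once, call it as often as needed:
# print(generate_title_text(model, tokenizer, formatted_dataset[120]['text']))
```

Because the function only takes the loaded objects as arguments, rerunning it does not touch `from_pretrained` again.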
### Output
Input:
```markdown
What is the title of this paper? Bursting Dynamics of the 3D Euler Equations\nin Cylindrical Domains\nFrançois Golse ∗ †\nEcole Polytechnique, CMLS\n91128 Palaiseau Cedex, France\nAlex Mahalov ‡and Basil Nicolaenko §\n\nAnswer: 
```

## Output from LLM:

```markdown
What is the title of this paper? Bursting Dynamics of the 3D Euler Equations
in Cylindrical Domains
François Golse ∗ †
Ecole Polytechnique, CMLS
91128 Palaiseau Cedex, France
Alex Mahalov ‡and Basil Nicolaenko §

Answer: Bursting Dynamics of the 3D Euler Equations in Cylindrical Domains<|endoftext|>
```

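As the example shows, the decoded text repeats the prompt and ends with `<|endoftext|>`. Below is a minimal sketch for keeping only the title, assuming the prompt format above with its `Answer:` marker. `extract_title` is a hypothetical helper, not part of the model or of `transformers`, and it would misfire if the paper text itself contained `Answer:`.

```python
def extract_title(generated: str, eos_token: str = "<|endoftext|>") -> str:
    """Return the text after the first 'Answer:' marker, without the end-of-text token."""
    # Drop the echoed prompt; the result is "" if the marker is missing.
    _, _, answer = generated.partition("Answer:")
    # Cut at the first end-of-text token, if the model emitted one.
    answer = answer.split(eos_token, 1)[0]
    return answer.strip()


sample = (
    "What is the title of this paper? Bursting Dynamics of the 3D Euler Equations\n"
    "in Cylindrical Domains\n\n"
    "Answer: Bursting Dynamics of the 3D Euler Equations in Cylindrical Domains<|endoftext|>"
)
print(extract_title(sample))
```

Passing `skip_special_tokens=True` to `tokenizer.batch_decode` would also drop the end-of-text token at decode time, but the prompt would still need to be removed.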
## Training and evaluation data
Training and validation dataset:
[zelalt/scientific-papers-3.5-withprompt](https://huggingface.co/datasets/zelalt/scientific-papers-3.5-withprompt)

## Training procedure

### Training hyperparameters

- total_train_batch_size: 8
- lr_scheduler_type: cosine

### Framework versions

- Transformers 4.35.2
- Pytorch 2.1.0+cu118
- Datasets 2.15.0
- Tokenizers 0.15.0