File size: 10,397 Bytes
9d19428
 
73b9fd5
 
 
 
 
 
 
 
 
 
6c03a2b
73b9fd5
6c03a2b
9d19428
61c30f6
73b9fd5
61c30f6
73b9fd5
 
 
16a85c1
d19f39b
261cca9
 
73b9fd5
 
 
 
 
 
 
6cc6e2d
 
 
 
73b9fd5
 
 
 
 
 
 
 
 
 
 
 
11a78fd
73b9fd5
 
 
 
 
 
 
 
 
 
 
 
 
cae3774
73b9fd5
 
 
36dafc0
73b9fd5
 
 
 
 
 
 
 
cae3774
73b9fd5
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
11a78fd
73b9fd5
 
 
 
 
 
 
 
 
 
 
 
cae3774
73b9fd5
 
 
36dafc0
73b9fd5
 
 
 
 
 
 
 
cae3774
73b9fd5
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
da7b55e
73b9fd5
 
 
 
 
 
 
 
 
 
 
da7b55e
73b9fd5
 
 
 
 
 
 
 
 
 
 
 
cae3774
73b9fd5
 
 
36dafc0
73b9fd5
 
 
 
 
 
 
 
cae3774
73b9fd5
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
da7b55e
73b9fd5
 
 
 
 
 
 
 
 
 
 
da7b55e
73b9fd5
 
 
 
 
 
 
 
 
 
 
cae3774
73b9fd5
 
 
36dafc0
73b9fd5
 
 
 
 
 
 
 
cae3774
73b9fd5
 
 
 
 
 
 
 
 
 
 
 
 
61c30f6
 
cb0eb27
61c30f6
73b9fd5
 
 
 
 
 
 
f2c159e
73b9fd5
 
 
f2c159e
73b9fd5
 
 
 
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
---
license: cc-by-nc-4.0
datasets:
- MBZUAI/Bactrian-X
language:
- id
- en
tags:
- qlora
- wizardlm
- uncensored
- instruct
- chat
- alpaca
- indonesia
---
# DukunLM V1.0 - Indonesian Language Model πŸ§™β€β™‚οΈ

πŸš€ Welcome to the DukunLM V1.0 repository! DukunLM V1.0 is an open-source language model trained to generate Indonesian text using the power of AI. DukunLM, meaning "WizardLM" in Indonesian, is here to revolutionize language generation 🌟. This is an updated version from [azale-ai/DukunLM-Uncensored-7B](https://huggingface.co/azale-ai/DukunLM-Uncensored-7B) with full model release, not only adapter model like before πŸ‘½.

## Model Details

| Name Model                                                                       | Parameters | Google Colab                                                                                                                                                                       | Base Model                                                                                             | Dataset                                                                                                    | Prompt Format                                          | Fine Tune Method                           | Sharded Version                           |
|----------------------------------------------------------------------------------|------------|----------------------------------------------------------------------------------------------------------------------------------------------------------------------------|--------------------------------------------------------------------------------------------------------|------------------------------------------------------------------------------------------------------------|--------------------------------------------------------|--------------------------------------------|--------------------------------------------|
| [DukunLM-7B-V1.0-Uncensored](https://huggingface.co/azale-ai/DukunLM-7B-V1.0-Uncensored)   | 7B         | [Link](https://colab.research.google.com/drive/1UEiRqkfU1jGVMM9we4X3ooN1kkNGtfLd) | [ehartford/WizardLM-7B-V1.0-Uncensored](https://huggingface.co/ehartford/WizardLM-7B-V1.0-Uncensored)  | [MBZUAI/Bactrian-X (Indonesian subset)](https://huggingface.co/datasets/MBZUAI/Bactrian-X/viewer/id/train) | [Alpaca](https://github.com/tatsu-lab/stanford_alpaca) | [QLoRA](https://github.com/artidoro/qlora) | [Link](https://huggingface.co/azale-ai/DukunLM-7B-V1.0-Uncensored-sharded) |
| [DukunLM-13B-V1.0-Uncensored](https://huggingface.co/azale-ai/DukunLM-13B-V1.0-Uncensored) | 13B        | [Link](https://colab.research.google.com/drive/19xXYcAwVFLSItHm__GhPTYMryOGjdFkF) | [ehartford/WizardLM-13B-V1.0-Uncensored](https://huggingface.co/ehartford/WizardLM-13B-V1.0-Uncensored) | [MBZUAI/Bactrian-X (Indonesian subset)](https://huggingface.co/datasets/MBZUAI/Bactrian-X/viewer/id/train) | [Alpaca](https://github.com/tatsu-lab/stanford_alpaca) | [QLoRA](https://github.com/artidoro/qlora) | [Link](https://huggingface.co/azale-ai/DukunLM-13B-V1.0-Uncensored-sharded) |

⚠️ **Warning**: DukunLM is an uncensored model without filters or alignment. Please use it responsibly as it may contain errors, cultural biases, and potentially offensive content. ⚠️

## Installation

To use DukunLM, ensure that PyTorch has been installed and that you have an Nvidia GPU (or use Google Colab). After that you need to install the required dependencies:
```bash
pip3 install -U git+https://github.com/huggingface/transformers.git
pip3 install -U git+https://github.com/huggingface/peft.git
pip3 install -U git+https://github.com/huggingface/accelerate.git
pip3 install -U bitsandbytes==0.39.0 einops==0.6.1 sentencepiece
```

## How to Use

### Normal Model

#### Stream Output

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, TextStreamer

model = AutoModelForCausalLM.from_pretrained("azale-ai/DukunLM-7B-V1.0-Uncensored", torch_dtype=torch.float16).to("cuda")
tokenizer = AutoTokenizer.from_pretrained("azale-ai/DukunLM-7B-V1.0-Uncensored")
streamer = TextStreamer(tokenizer)

instruction_prompt = "Jelaskan mengapa air penting bagi kehidupan manusia."
input_prompt = ""

if not input_prompt:
  prompt = """Below is an instruction that describes a task. Write a response that appropriately completes the request.

### Instruction:
{instruction}

### Response:
"""
  prompt = prompt.format(instruction=instruction_prompt)

else:
  prompt = """Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.

### Instruction:
{instruction}

### Input:
{input}

### Response:
"""
  prompt = prompt.format(instruction=instruction_prompt, input=input_prompt)

inputs = tokenizer(prompt, return_tensors="pt").to("cuda")
_ = model.generate(
    inputs=inputs.input_ids,
    streamer=streamer,
    pad_token_id=tokenizer.pad_token_id,
    eos_token_id=tokenizer.eos_token_id,
    max_length=2048, temperature=0.7,
    do_sample=True, top_k=4, top_p=0.95
)
```

#### No Stream Output

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained("azale-ai/DukunLM-7B-V1.0-Uncensored", torch_dtype=torch.float16).to("cuda")
tokenizer = AutoTokenizer.from_pretrained("azale-ai/DukunLM-7B-V1.0-Uncensored")

instruction_prompt = "Jelaskan mengapa air penting bagi kehidupan manusia."
input_prompt = ""

if not input_prompt:
  prompt = """Below is an instruction that describes a task. Write a response that appropriately completes the request.

### Instruction:
{instruction}

### Response:
"""
  prompt = prompt.format(instruction=instruction_prompt)

else:
  prompt = """Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.

### Instruction:
{instruction}

### Input:
{input}

### Response:
"""
  prompt = prompt.format(instruction=instruction_prompt, input=input_prompt)

inputs = tokenizer(prompt, return_tensors="pt").to("cuda")
outputs = model.generate(
    inputs=inputs.input_ids,
    pad_token_id=tokenizer.pad_token_id,
    eos_token_id=tokenizer.eos_token_id,
    max_length=2048, temperature=0.7,
    do_sample=True, top_k=4, top_p=0.95
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

### Quantize Model

#### Stream Output

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig, TextStreamer

model = AutoModelForCausalLM.from_pretrained(
    "azale-ai/DukunLM-7B-V1.0-Uncensored-sharded",
    load_in_4bit=True,
    torch_dtype=torch.float32,
    quantization_config=BitsAndBytesConfig(
        load_in_4bit=True,
        llm_int8_threshold=6.0,
        llm_int8_has_fp16_weight=False,
        bnb_4bit_compute_dtype=torch.float16,
        bnb_4bit_use_double_quant=True,
        bnb_4bit_quant_type="nf4",
    )
)
tokenizer = AutoTokenizer.from_pretrained("azale-ai/DukunLM-7B-V1.0-Uncensored-sharded")
streamer = TextStreamer(tokenizer)

instruction_prompt = "Jelaskan mengapa air penting bagi kehidupan manusia."
input_prompt = ""

if not input_prompt:
  prompt = """Below is an instruction that describes a task. Write a response that appropriately completes the request.

### Instruction:
{instruction}

### Response:
"""
  prompt = prompt.format(instruction=instruction_prompt)

else:
  prompt = """Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.

### Instruction:
{instruction}

### Input:
{input}

### Response:
"""
  prompt = prompt.format(instruction=instruction_prompt, input=input_prompt)

inputs = tokenizer(prompt, return_tensors="pt").to("cuda")
_ = model.generate(
    inputs=inputs.input_ids,
    streamer=streamer,
    pad_token_id=tokenizer.pad_token_id,
    eos_token_id=tokenizer.eos_token_id,
    max_length=2048, temperature=0.7,
    do_sample=True, top_k=4, top_p=0.95
)
```

#### No Stream Output

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model = AutoModelForCausalLM.from_pretrained(
    "azale-ai/DukunLM-7B-V1.0-Uncensored-sharded",
    load_in_4bit=True,
    torch_dtype=torch.float32,
    quantization_config=BitsAndBytesConfig(
        load_in_4bit=True,
        llm_int8_threshold=6.0,
        llm_int8_has_fp16_weight=False,
        bnb_4bit_compute_dtype=torch.float16,
        bnb_4bit_use_double_quant=True,
        bnb_4bit_quant_type="nf4",
    )
)
tokenizer = AutoTokenizer.from_pretrained("azale-ai/DukunLM-7B-V1.0-Uncensored-sharded")

instruction_prompt = "Jelaskan mengapa air penting bagi kehidupan manusia."
input_prompt = ""

if not input_prompt:
  prompt = """Below is an instruction that describes a task. Write a response that appropriately completes the request.

### Instruction:
{instruction}

### Response:
"""
  prompt = prompt.format(instruction=instruction_prompt)

else:
  prompt = """Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.

### Instruction:
{instruction}

### Input:
{input}

### Response:
"""
  prompt = prompt.format(instruction=instruction_prompt, input=input_prompt)

inputs = tokenizer(prompt, return_tensors="pt").to("cuda")
outputs = model.generate(
    inputs=inputs.input_ids,
    pad_token_id=tokenizer.pad_token_id,
    eos_token_id=tokenizer.eos_token_id,
    max_length=2048, temperature=0.7,
    do_sample=True, top_k=4, top_p=0.95
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

## Benchmark

Coming soon, stay tune πŸ™‚πŸ™‚.

## Limitations

- The base model language is English and fine-tuned to Indonesia
- Cultural and contextual biases

## License

DukunLM V1.0 is licensed under the [Creative Commons NonCommercial (CC BY-NC 4.0) license](https://creativecommons.org/licenses/by-nc/4.0/legalcode).

## Contributing

We welcome contributions to enhance and improve DukunLM V1.0. If you have any suggestions or find any issues, please feel free to open an issue or submit a pull request. Also we're open to sponsor for compute power.

## Contact Us

[[email protected]](mailto:[email protected])