ArunMoonpai committed
Commit: da9b686
Parent(s): 6d195e7
First model version
Browse files
- .DS_Store +0 -0
- LICENSE +21 -0
- README.md +69 -0
- config.json +38 -0
- gpt-2-tamil/config.json +36 -0
- gpt-2-tamil/events.out.tfevents.1626336540.t1v-n-ebe36c53-w-0.751183.3.v2 +3 -0
- gpt-2-tamil/events.out.tfevents.1626339585.t1v-n-ebe36c53-w-0.759145.3.v2 +3 -0
- gpt-2-tamil/events.out.tfevents.1626340740.t1v-n-ebe36c53-w-0.765413.3.v2 +3 -0
- gpt-2-tamil/events.out.tfevents.1626341319.t1v-n-ebe36c53-w-0.768105.3.v2 +3 -0
- gpt-2-tamil/flax_model.msgpack +3 -0
- gpt-2-tamil/tokenizer.json +0 -0
- model.safetensors +3 -0
- pyproject.toml +31 -0
- pytorch_model.bin +3 -0
- requirements.txt +8 -0
- tokenizer.json +0 -0
.DS_Store
ADDED
Binary file (6.15 kB)
LICENSE
ADDED
@@ -0,0 +1,21 @@
+MIT License
+
+Copyright (c) 2021 Abinaya Mahendiran
+
+Permission is hereby granted, free of charge, to any person obtaining a copy
+of this software and associated documentation files (the "Software"), to deal
+in the Software without restriction, including without limitation the rights
+to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
+copies of the Software, and to permit persons to whom the Software is
+furnished to do so, subject to the following conditions:
+
+The above copyright notice and this permission notice shall be included in all
+copies or substantial portions of the Software.
+
+THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
+AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
+OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+SOFTWARE.
README.md
ADDED
@@ -0,0 +1,69 @@
+---
+
+language: ta
+datasets:
+- oscar
+- IndicNLP
+widget:
+- text: 'ஒரு ஊரிலே ஒரு காக்கைக்கு'
+
+---
+# GPT2-Tamil
+
+This repository was created as part of the Flax/JAX community week organized by Hugging Face. The aim of this project is to pretrain a GPT-2 language model specifically for the Tamil language.
+
+## Setup:
+To set up the project, run the following command:
+```bash
+pip install -r requirements.txt
+```
+
+## Model:
+Pretrained model on the Tamil language using a causal language modeling (CLM) objective.
+
+## Dataset Used:
+The GPT-2 model is trained on the [oscar dataset - ta](https://huggingface.co/datasets/oscar) and the [IndicNLP dataset - ta](https://indicnlp.ai4bharat.org/corpora/).
+
+## Intended uses & limitations:
+You can use the raw model for text generation, but it is mostly intended to be fine-tuned on a downstream task. See the [model hub](https://huggingface.co/models?filter=gpt2) to look for fine-tuned versions on a task that interests you.
+
+## How to pretrain the model:
+To perform training, follow these steps:
+
+- Export the model directory (where you want to store the model artifacts such as the config, tokenizer, etc.):
+```bash
+export MODEL_DIR=<model_dir>
+```
+- Create the config.json by running the following command:
+```bash
+python src/create_config.py
+```
+- Create the tokenizer by running the following command:
+```bash
+python src/train_tokenizer.py
+```
+- Once the config and tokenizer are created, run the following script to start training the flax model:
+```bash
+bash scripts/train_gpt2-oscar-tamil.sh
+```
+
+## How to use:
+To perform language generation using the model, the pipeline can be used directly.
+
+- First convert the flax model to PyTorch using the following command:
+```bash
+python src/convert_flax_to_pytorch.py
+```
+- Use the following snippet to perform language generation:
+```python
+>>> from transformers import AutoTokenizer, AutoModelWithLMHead, pipeline, set_seed
+>>> model_name = 'abinayam/gpt-2-tamil'
+>>> model = AutoModelWithLMHead.from_pretrained(model_name)
+>>> tokenizer = AutoTokenizer.from_pretrained(model_name)
+>>> set_seed(42)
+>>> input_text = "ஒரு ஊரிலே ஒரு காக்கைக்கு"
+>>> max_len = 300
+>>> no_seq = 5
+>>> generator = pipeline('text-generation', model=model, tokenizer=tokenizer)
+>>> sequence = generator(input_text, max_length=max_len, num_return_sequences=no_seq)
+```
config.json
ADDED
@@ -0,0 +1,38 @@
+{
+  "_name_or_path": "../gpt-2-tamil",
+  "activation_function": "gelu_new",
+  "architectures": [
+    "GPT2LMHeadModel"
+  ],
+  "attn_pdrop": 0.0,
+  "bos_token_id": 50256,
+  "embd_pdrop": 0.0,
+  "eos_token_id": 50256,
+  "gradient_checkpointing": false,
+  "initializer_range": 0.02,
+  "layer_norm_epsilon": 1e-05,
+  "model_type": "gpt2",
+  "n_ctx": 1024,
+  "n_embd": 768,
+  "n_head": 12,
+  "n_inner": null,
+  "n_layer": 12,
+  "n_positions": 1024,
+  "resid_pdrop": 0.0,
+  "scale_attn_weights": true,
+  "summary_activation": null,
+  "summary_first_dropout": 0.1,
+  "summary_proj_to_labels": true,
+  "summary_type": "cls_index",
+  "summary_use_proj": true,
+  "task_specific_params": {
+    "text-generation": {
+      "do_sample": true,
+      "max_length": 300
+    }
+  },
+  "torch_dtype": "float32",
+  "transformers_version": "4.9.0.dev0",
+  "use_cache": true,
+  "vocab_size": 50257
+}
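The config describes the standard GPT-2 "small" architecture (12 layers, 12 heads, 768-dim embeddings, 50257-token vocabulary). As a rough sanity check (a sketch, not part of the commit), the parameter count these hyperparameters imply can be derived by hand; at float32 it lines up with the ~498 MB flax_model.msgpack added in this commit:

```python
# Architecture hyperparameters copied from config.json (GPT-2 "small").
cfg = {"vocab_size": 50257, "n_positions": 1024, "n_embd": 768, "n_layer": 12}

d = cfg["n_embd"]

# Token and position embedding tables (wte, wpe); the LM head shares wte.
embed = cfg["vocab_size"] * d + cfg["n_positions"] * d

# Per block: fused qkv projection + output projection (with biases),
# the 4x-wide MLP (with biases), and two LayerNorms (weight + bias).
attn = (d * 3 * d + 3 * d) + (d * d + d)
mlp = (d * 4 * d + 4 * d) + (4 * d * d + d)
block = attn + mlp + 2 * (2 * d)

# All blocks plus the final LayerNorm.
total = embed + cfg["n_layer"] * block + 2 * d

print(f"{total:,} parameters")               # 124,439,808
print(f"~{total * 4 / 1e6:.0f} MB float32")  # ~498 MB
```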
gpt-2-tamil/config.json
ADDED
@@ -0,0 +1,36 @@
+{
+  "activation_function": "gelu_new",
+  "architectures": [
+    "GPT2LMHeadModel"
+  ],
+  "attn_pdrop": 0.0,
+  "bos_token_id": 50256,
+  "embd_pdrop": 0.0,
+  "eos_token_id": 50256,
+  "gradient_checkpointing": false,
+  "initializer_range": 0.02,
+  "layer_norm_epsilon": 1e-05,
+  "model_type": "gpt2",
+  "n_ctx": 1024,
+  "n_embd": 768,
+  "n_head": 12,
+  "n_inner": null,
+  "n_layer": 12,
+  "n_positions": 1024,
+  "resid_pdrop": 0.0,
+  "scale_attn_weights": true,
+  "summary_activation": null,
+  "summary_first_dropout": 0.1,
+  "summary_proj_to_labels": true,
+  "summary_type": "cls_index",
+  "summary_use_proj": true,
+  "task_specific_params": {
+    "text-generation": {
+      "do_sample": true,
+      "max_length": 50
+    }
+  },
+  "transformers_version": "4.9.0.dev0",
+  "use_cache": true,
+  "vocab_size": 50257
+}
gpt-2-tamil/events.out.tfevents.1626336540.t1v-n-ebe36c53-w-0.751183.3.v2
ADDED
@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:f1799847ce42c1a5f9fe25dfa8d8da9e1a6ff57595979b2bd0daea658d9ea785
+size 40
gpt-2-tamil/events.out.tfevents.1626339585.t1v-n-ebe36c53-w-0.759145.3.v2
ADDED
@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:1b47918f07e65192c48181c8f775cbf29f08585ac3a559e67df1e3f13fb1ca01
+size 40
gpt-2-tamil/events.out.tfevents.1626340740.t1v-n-ebe36c53-w-0.765413.3.v2
ADDED
@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:5855b0a71977e29453739fe2c5055c32753a62fa6d3db8ea3f105fd8ca75357b
+size 40
gpt-2-tamil/events.out.tfevents.1626341319.t1v-n-ebe36c53-w-0.768105.3.v2
ADDED
@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:938ebc19608236e36e53fd65f7c12c9d7ad0de447d01d60627441645872ef573
+size 22272043
gpt-2-tamil/flax_model.msgpack
ADDED
@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:89396995064d16071519a20c2771d661400da8c3d644966f0a586d299d1b2fa3
+size 497764120
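The large weight and event files in this commit are tracked with Git LFS, so the repository itself only stores small text pointers in the `key value` format shown above. A minimal sketch (not part of the commit) of reading one such pointer:

```python
# Parse a git-lfs pointer file (format per https://git-lfs.github.com/spec/v1).
# The pointer text below is the flax_model.msgpack entry from this commit.
pointer = """\
version https://git-lfs.github.com/spec/v1
oid sha256:89396995064d16071519a20c2771d661400da8c3d644966f0a586d299d1b2fa3
size 497764120
"""

# Each line is "key value"; split on the first space only.
fields = dict(line.split(" ", 1) for line in pointer.strip().splitlines())
algo, digest = fields["oid"].split(":", 1)

print(fields["version"])    # the LFS spec URL
print(algo, len(digest))    # sha256 64
print(int(fields["size"]))  # true file size in bytes (~498 MB here)
```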
gpt-2-tamil/tokenizer.json
ADDED
The diff for this file is too large to render.
model.safetensors
ADDED
@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:1a428e17fe6c41c96b3ac840af045afc1bf45a043e04d220f78944c38168b740
+size 510359598
pyproject.toml
ADDED
@@ -0,0 +1,31 @@
+# Black formatting
+[tool.black]
+line-length = 85
+include = '\.pyi?$'
+exclude = '''
+/(
+    \.eggs         # exclude a few common directories in the
+  | \.git          # root of the project
+  | \.hg
+  | \.mypy_cache
+  | \.tox
+  | \.venv
+  | _build
+  | buck-out
+  | build
+  | dist
+  | wandb
+  | model
+  | dataset
+  | notebook
+)/
+'''
+
+# iSort
+[tool.isort]
+profile = "black"
+line_length = 85
+multi_line_output = 3
+include_trailing_comma = true
+skip_gitignore = true
+virtual_env = "venv"
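Black's `exclude` option takes a verbose regular expression, so the pattern above skips any path that contains one of the listed directory names between slashes. A quick standalone check (a sketch using plain `re`, independent of Black's own matching) that the pattern behaves as intended:

```python
import re

# The exclude pattern from [tool.black] above: any path segment named
# one of these directories is skipped by the formatter.
exclude = r"""
/(
    \.eggs
  | \.git
  | \.hg
  | \.mypy_cache
  | \.tox
  | \.venv
  | _build
  | buck-out
  | build
  | dist
  | wandb
  | model
  | dataset
  | notebook
)/
"""
pat = re.compile(exclude, re.VERBOSE)

print(bool(pat.search("/project/wandb/run-1/log.py")))  # True  -> excluded
print(bool(pat.search("/project/src/train.py")))        # False -> formatted
```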
pytorch_model.bin
ADDED
@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:183e23beb6421156e9472504710d66104d4d43829fb87cffd22d888565f27a3a
+size 510401385
requirements.txt
ADDED
@@ -0,0 +1,8 @@
+tqdm
+transformers
+datasets
+jax
+jaxlib
+flax
+optax
+wandb
tokenizer.json
ADDED
The diff for this file is too large to render.