sagorsarker committed on
Commit
60023e6
1 Parent(s): 725f3d9

Update README.md

  1. README.md +24 -19
README.md CHANGED
@@ -17,15 +17,15 @@ base_model:
17
 
18
  ## Model Information
19
 
20
- This model is a continually pretrained version of the [meta-llama/Llama-3.2-3B](https://huggingface.co/meta-llama/Llama-3.2-3B) architecture, fine-tuned on extensive Bangla datasets. The primary goal of the continual pretraining was to enhance the model's ability to generate high-quality Bangla text. By extending the pretraining process specifically on Bangla data, the model has demonstrated superior performance in Bangla language understanding evaluation benchmarks and text generation tasks.
21
 
22
- **Model Architecture:** Llama 3.2 is an auto-regressive language model with optimized transformer architecture.
23
 
24
  | | Training Data | Params | Input modalities | Output modalities | Context Length | GQA | Shared Embeddings | Token count | Knowledge cutoff |
25
  | :---- | :---- | :---- | :---- | :---- | :---- | :---- | :---- | :---- | :---- |
26
  | Llama 3.2 (text only) | Hishab curated Bangla text corpus | 3B(3.21B) | Monolingual Text(Bangla) | Monolingual Text(Bangla) | 4096 | Yes | Yes | 6B tokens | |
27
 
28
- **Supported Languages:** Bengali (primary) and English (secondary)
29
 
30
  **Llama 3.2 Model Family:** Token counts refer to pretraining data only. All model versions use Grouped-Query Attention (GQA) for improved inference scalability.
31
 
@@ -66,25 +66,25 @@ pipe("আমাদের দেশের নাম")
66
 
67
  ## Training Data
68
 
69
- **Overview:** We have collected a large Bangla raw dataset of text data from a wide variety of sources. Our collected data so far includes a mix of web documents, books, translated text, transliterated text, transcribed text, code-mixed text, conversations, and open-sources raw data. The dataset is cleaned and filtered by different filtering criteria to ensure the quality of the data. Our collected data size roughly around 268 GB. We separated __22GB__ data from that using a ratio of the actual data size. Total trained tokens are __6B__ tokens.
70
 
71
  Data sources summary:
72
- - Web documents: Extracted, clean, and filtered common crawl data
73
- - Books: Extracted, clean, filtered books data
74
  - Transcribed text: Used in-house Bangla ASR model to transcribe Bangla audio data
75
- - Translation data: We trained an English-Bangla translation LLM model and used it to translate English data to Bangla
76
- - Code-mixed data: We trained an English-Bangla code-mixed LLM model and used it to generate code-mixed data
77
  - Transliteration data: We trained a Bangla-English transliteration LLM model and used it to generate transliterated data
78
  - Synthetic data: We generated synthetic data using a Bangla LLM model
79
- - Others: We scrapped some selected website data, used open-source data, and used some other data sources
80
 
81
 
82
- ## Benchmarks \- Bangla Text
83
 
84
  In this section, we report the results for the __titulm-llama-3.2-3b-v1.0__ model on standard automatic benchmarks. For all these evaluations, we used the [lm-evaluation-harness](https://github.com/EleutherAI/lm-evaluation-harness) evaluation library.
85
 
86
  ### Evaluation Datasets
87
- We evaluated our pretrained models on both Bangla and English benchmark datasets. Although the model is trained on Bangla data, its English capability is also evaluated on English benchmark datasets. The evaluation datasets are as follows:
88
 
89
  #### Bangla Benchmark datasets
90
  We evaluated the models on the following datasets:
@@ -96,22 +96,28 @@ We evaluated the models on the following datasets:
96
 
97
  #### English Benchmark datasets
98
  - [MMLU](https://huggingface.co/datasets/cais/mmlu): This is a massive multitask test consisting of multiple-choice questions from various branches of knowledge.
99
- - [CommonseQa](https://huggingface.co/datasets/tau/commonsense_qa): CommonsenseQA is a new multiple-choice question-answering dataset that requires different types of commonsense knowledge to predict the correct answers.
100
  - [OpenbookQA](https://huggingface.co/datasets/allenai/openbookqa): OpenBookQA aims to promote research in advanced question-answering, probing a deeper understanding of both the topic (with salient facts summarized as an open book, also provided with the dataset) and the language it is expressed in.
101
  - [Piqa](https://huggingface.co/datasets/ybisk/piqa): The PIQA dataset focuses on physical commonsense reasoning, challenging AI to handle everyday situations requiring practical knowledge and unconventional solutions. Inspired by instructables.com, it aims to enhance AI's ability to understand and reason about physical interactions.
102
- - [BoolQ](https://huggingface.co/datasets/google/boolq): BoolQ is a question-answer dataset for yes/no questions containing 15942 examples. These questions are naturally occurring. They are generated in unprompted and unconstrained settings. Each example is a triplet of (question, passage, answer), with the title of the page as optional additional context. The text-pair classification setup is similar to existing natural language inference tasks.
103
 
104
  ### Evaluation Results
105
 
106
- #### Evaluation on Bangla Benchmark datasets
 
 
 
 
107
  | Model | Shots | Bangla MMLU | BoolQ BN | Commonsense QA BN | OpenBook QA BN | PIQA BN |
108
  |---------------------------------|---------|-------------|----------|-------------------|----------------|---------|
109
- | llama-3.2-3b | 0-shot | **0.36** | 0.55 | 0.26 | 0.31 | 0.56 |
110
- | | 5-shot | **0.38** | - | 0.29 | 0.32 | 0.58 |
111
- | titulm-llama-3.2-3b-v1.0 | 0-shot | **0.36** | **0.67** | **0.30** | **0.35** | **0.61**|
112
- | | 5-shot | 0.36 | - | **0.30** | **0.35** | **0.61**|
113
 
114
  #### Evaluation of English Benchmark datasets
 
 
115
 
116
  | Model | Shots | MMLU | BoolQ | Commonsense QA | OpenBook QA | PIQA |
117
  |--------------------------------------|--------|--------------|------------|--------------------|-----------------|-----------|
@@ -120,7 +126,6 @@ We evaluated the models on the following datasets:
120
  | titulm-llama-3.2-3b-v1.0 | 0-shot | 0.47 | 0.70 | 0.58 | 0.39 | 0.76 |
121
  | | 5-shot | 0.53 | 0.70 | 0.63 | 0.44 | 0.78 |
122
 
123
-
124
  ### Instruction Tuned Models
125
 
126
 
 
17
 
18
  ## Model Information
19
 
20
+ This model is a continually pre-trained version of the [meta-llama/Llama-3.2-3B](https://huggingface.co/meta-llama/Llama-3.2-3B) architecture, fine-tuned on extensive Bangla datasets. The primary goal of the continual pretraining was to enhance the model's ability to generate high-quality Bangla text. By extending the pretraining process specifically on Bangla data, the model has demonstrated superior performance on Bangla language understanding benchmarks and text generation tasks.
21
 
22
+ **Model Architecture:** Llama 3.2 is an auto-regressive language model that uses an optimized transformer architecture.
23
 
24
  | | Training Data | Params | Input modalities | Output modalities | Context Length | GQA | Shared Embeddings | Token count | Knowledge cutoff |
25
  | :---- | :---- | :---- | :---- | :---- | :---- | :---- | :---- | :---- | :---- |
26
  | Llama 3.2 (text only) | Hishab curated Bangla text corpus | 3B(3.21B) | Monolingual Text(Bangla) | Monolingual Text(Bangla) | 4096 | Yes | Yes | 6B tokens | |
27
 
28
+ **Supported Languages:** Bengali (primary) and English (secondary)
29
 
30
  **Llama 3.2 Model Family:** Token counts refer to pretraining data only. All model versions use Grouped-Query Attention (GQA) for improved inference scalability.
31
 
 
66
 
67
  ## Training Data
68
 
69
+ **Overview:** We have collected a large raw Bangla text dataset from a wide variety of sources. The data collected so far includes a mix of web documents, books, translated text, transliterated text, transcribed text, code-mixed text, conversations, and open-source raw data. The dataset is cleaned and filtered using several filtering criteria to ensure data quality. The collected data is roughly 268 GB in size. We separated __22GB__ of data from it using a ratio of the actual data size. The model was trained on a total of __6B__ tokens.
70
 
71
  Data sources summary:
72
+ - Web documents: Extracted, cleaned, and filtered Common Crawl data
73
+ - Books: Extracted, cleaned, and filtered book data
74
  - Transcribed text: Used in-house Bangla ASR model to transcribe Bangla audio data
75
+ - Translation data: We trained a Bangla-English translation LLM and used it to translate English data to Bangla
76
+ - Code-mixed data: We trained a Bangla-English code-mixed LLM and used it to generate code-mixed data
77
  - Transliteration data: We trained a Bangla-English transliteration LLM and used it to generate transliterated data
78
  - Synthetic data: We generated synthetic data using a Bangla LLM
79
+ - Others: We scraped data from selected websites, used open-source data, and used some other data sources
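
The overview says that 22 GB was separated from roughly 268 GB "using a ratio of the actual data size", but the exact procedure is not described. One possible reading is a proportional split, where each source contributes to the 22 GB subset in proportion to its share of the full corpus. The sketch below illustrates that reading only. The function name and the per-source sizes are hypothetical placeholders, not the authors' actual figures or method.

```python
def proportional_budget(source_sizes_gb, target_gb):
    """Split target_gb across sources in proportion to each source's size."""
    total = sum(source_sizes_gb.values())
    if total <= 0:
        raise ValueError("total corpus size must be positive")
    return {name: target_gb * size / total for name, size in source_sizes_gb.items()}


# Hypothetical per-source sizes in GB, chosen only so that they sum to 268.
example_sizes = {"web": 150.0, "books": 40.0, "transcribed": 30.0, "synthetic": 48.0}

# 22 GB is the subset size stated in the overview.
budget = proportional_budget(example_sizes, target_gb=22.0)
for name, gb in budget.items():
    print(f"{name}: {gb:.2f} GB")
```

With a split like this, the subset keeps the same source mix as the full corpus, which is one reason a size-based ratio might be chosen over equal per-source quotas.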
80
 
81
 
82
+ ## Benchmarks
83
 
84
  In this section, we report the results for the __titulm-llama-3.2-3b-v1.0__ model on standard automatic benchmarks. For all these evaluations, we used the [lm-evaluation-harness](https://github.com/EleutherAI/lm-evaluation-harness) evaluation library.
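
As a rough illustration of how such an evaluation might be launched from Python, the sketch below builds arguments for `lm_eval.simple_evaluate` from lm-evaluation-harness. The task names listed are assumed to match the harness's built-in English tasks, and the Bangla tasks are left out because their task names are not given here. The helper names `build_eval_kwargs` and `run` are hypothetical, and this is not necessarily the exact configuration used for the reported numbers.

```python
# Assumed names of the English tasks in lm-evaluation-harness.
ENGLISH_TASKS = ["mmlu", "boolq", "commonsense_qa", "openbookqa", "piqa"]
MODEL_ID = "hishab/titulm-llama-3.2-3b-v1.0"


def build_eval_kwargs(model_id, tasks, num_fewshot):
    """Assemble keyword arguments for lm_eval.simple_evaluate."""
    return {
        "model": "hf",  # Hugging Face transformers backend
        "model_args": f"pretrained={model_id}",
        "tasks": list(tasks),
        "num_fewshot": num_fewshot,
    }


def run(num_fewshot):
    """Run one evaluation pass; needs lm-eval installed and downloads the model."""
    import lm_eval  # imported lazily so the config can be inspected without it

    return lm_eval.simple_evaluate(**build_eval_kwargs(MODEL_ID, ENGLISH_TASKS, num_fewshot))


# Show the configuration for the two settings reported in the tables.
for shots in (0, 5):
    print(build_eval_kwargs(MODEL_ID, ENGLISH_TASKS, shots))
```

Calling `run(0)` or `run(5)` would start the actual evaluation; the per-task metrics are expected under the `"results"` key of the returned dictionary.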
85
 
86
  ### Evaluation Datasets
87
+ We evaluated our pre-trained models on both Bangla and English benchmark datasets. Although the model is trained on Bangla data, its English capability is also evaluated on English benchmark datasets. The evaluation datasets are as follows:
88
 
89
  #### Bangla Benchmark datasets
90
  We evaluated the models on the following datasets:
 
96
 
97
  #### English Benchmark datasets
98
  - [MMLU](https://huggingface.co/datasets/cais/mmlu): This is a massive multitask test consisting of multiple-choice questions from various branches of knowledge.
99
+ - [CommonsenseQA](https://huggingface.co/datasets/tau/commonsense_qa): CommonsenseQA is a multiple-choice question-answering dataset that requires different types of commonsense knowledge to predict the correct answers.
100
  - [OpenbookQA](https://huggingface.co/datasets/allenai/openbookqa): OpenBookQA aims to promote research in advanced question-answering, probing a deeper understanding of both the topic (with salient facts summarized as an open book, also provided with the dataset) and the language it is expressed in.
101
  - [Piqa](https://huggingface.co/datasets/ybisk/piqa): The PIQA dataset focuses on physical commonsense reasoning, challenging AI to handle everyday situations requiring practical knowledge and unconventional solutions. Inspired by instructables.com, it aims to enhance AI's ability to understand and reason about physical interactions.
102
+ - [BoolQ](https://huggingface.co/datasets/google/boolq): BoolQ is a question-answering dataset for yes/no questions containing 15942 examples. These questions are naturally occurring: they are generated in unprompted and unconstrained settings. Each example is a triplet of (question, passage, answer), with the title of the page as optional additional context. The text-pair classification setup is similar to existing natural language inference tasks.
103
 
104
  ### Evaluation Results
105
 
106
+ #### Evaluation of Bangla Benchmark datasets
107
+ - **llama-3.2-3b** scores higher only on **Bangla MMLU** in the 5-shot setting (**0.38** vs. 0.36); the two models tie at **0.36** in the 0-shot setting.
108
+ - **hishab/titulm-llama-3.2-3b-v1.0** outperforms in **BoolQ BN** (0-shot, **0.67** vs. 0.55) and in **Commonsense QA BN**, **OpenBook QA BN**, and **PIQA BN** in both 0-shot and 5-shot settings, reaching **0.61** in **PIQA BN**.
109
+
110
+
111
  | Model | Shots | Bangla MMLU | BoolQ BN | Commonsense QA BN | OpenBook QA BN | PIQA BN |
112
  |---------------------------------|---------|-------------|----------|-------------------|----------------|---------|
113
+ | llama-3.2-3b | 0-shot | **0.36** | 0.55 | 0.26 | 0.31 | 0.56 |
114
+ | | 5-shot | **0.38** | - | 0.29 | 0.32 | 0.58 |
115
+ | hishab/titulm-llama-3.2-3b-v1.0 | 0-shot | **0.36** | **0.67** | **0.30** | **0.35** | **0.61**|
116
+ | | 5-shot | 0.36 | - | **0.30** | **0.35** | **0.61**|
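
To make the comparison in the Bangla table easier to check, the short sketch below recomputes the per-task difference between the two models from the scores listed above. The scores are copied from the table; the data layout and the `deltas` helper are only illustrative.

```python
TASKS = ["Bangla MMLU", "BoolQ BN", "Commonsense QA BN", "OpenBook QA BN", "PIQA BN"]

# Scores copied from the Bangla benchmark table; None marks a cell reported as "-".
SCORES = {
    "llama-3.2-3b": {
        "0-shot": [0.36, 0.55, 0.26, 0.31, 0.56],
        "5-shot": [0.38, None, 0.29, 0.32, 0.58],
    },
    "hishab/titulm-llama-3.2-3b-v1.0": {
        "0-shot": [0.36, 0.67, 0.30, 0.35, 0.61],
        "5-shot": [0.36, None, 0.30, 0.35, 0.61],
    },
}


def deltas(base, tuned):
    """Return tuned minus base per task, rounded to 2 decimals; None if either is missing."""
    return [
        None if b is None or t is None else round(t - b, 2)
        for b, t in zip(base, tuned)
    ]


for shots in ("0-shot", "5-shot"):
    row = deltas(
        SCORES["llama-3.2-3b"][shots],
        SCORES["hishab/titulm-llama-3.2-3b-v1.0"][shots],
    )
    for task, d in zip(TASKS, row):
        label = "n/a" if d is None else f"{d:+.2f}"
        print(f"{shots} {task}: {label}")
```

A positive value means the continually pretrained model scores higher than the base model on that task.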
117
 
118
  #### Evaluation of English Benchmark datasets
119
+ - **llama-3.2-3b** consistently achieves the best scores across all English tasks, with top performances in **MMLU**, **BoolQ**, **Commonsense QA**, **OpenBook QA**, and **PIQA** in both 0-shot and 5-shot settings. It reaches a 5-shot score of **0.796** in **PIQA**.
120
+ - **titulm-llama-3.2-3b-v1.0** shows competitive performance but trails behind **llama-3.2-3b** in most English benchmarks, particularly in 0-shot settings, though it still performs well in **PIQA** and **Commonsense QA**.
121
 
122
  | Model | Shots | MMLU | BoolQ | Commonsense QA | OpenBook QA | PIQA |
123
  |--------------------------------------|--------|--------------|------------|--------------------|-----------------|-----------|
 
126
  | titulm-llama-3.2-3b-v1.0 | 0-shot | 0.47 | 0.70 | 0.58 | 0.39 | 0.76 |
127
  | | 5-shot | 0.53 | 0.70 | 0.63 | 0.44 | 0.78 |
128
 
 
129
  ### Instruction Tuned Models
130
 
131