Update README.md
README.md
CHANGED
@@ -21,17 +21,6 @@ tags:
 KStack-full models is a collection of fine-tuned open-source generative text models fine-tuned on KStack dataset with rule-based filtering.
 This is a repository for fine-tuned CodeLlama-7b model in the Hugging Face Transformers format.
 
-## Rule-based filtering
-To increase the quality of the dataset and filter out statistical outliers such as homework assignments, we filter out the dataset entries according to the following rules:
-* We filter out files which belong to the low-popular repos (the sum of stars and forks is less than 6)
-* Next, we filter out files which belong to the repos with less than 5 Kotlin files
-* Finally, we remove files which have less than 20 SLOC
-
-We clean the content of the remaining dataset entries according to the following rules:
-* We remove all non-ASCII entries
-* We remove all package lines such as _package kotlinx.coroutines.channels_
-* We remove half of the import lines.
-
 # Model use
 
 ```python
@@ -83,10 +72,17 @@ The model was trained on one A100 GPU with following hyperparameters:
 
 More details about finetuning can be found in the technical report
 
-#
+# Data filtering
 
-
-
+To increase the quality of the dataset and filter out statistical outliers such as homework assignments, we filter out the dataset entries according to the following rules:
+* We filter out files which belong to the low-popular repos (the sum of stars and forks is less than 6)
+* Next, we filter out files which belong to the repos with less than 5 Kotlin files
+* Finally, we remove files which have less than 20 SLOC
+
+We clean the content of the remaining dataset entries according to the following rules:
+* We remove all non-ASCII entries
+* We remove all package lines such as _package kotlinx.coroutines.channels_
+* We remove half of the import lines.
 
 # Evaluation
 
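For reviewers, the filtering and cleaning rules listed under `# Data filtering` can be sketched roughly as below. This is an illustrative sketch only, not the actual KStack pipeline: the `Entry` class, its field names, and all function names are hypothetical. It also makes three assumptions the README does not state: SLOC is counted as non-blank lines that are not `//` comments, "non-ASCII entries" is read as non-ASCII characters, and "half of the import lines" is read as every second import line.

```python
import re
from dataclasses import dataclass

# Thresholds taken from the README text.
MIN_POPULARITY = 6      # stars + forks
MIN_KOTLIN_FILES = 5    # Kotlin files per repository
MIN_SLOC = 20           # source lines of code per file

PACKAGE_RE = re.compile(r"^\s*package\s+[\w.`]+\s*;?\s*$")
IMPORT_RE = re.compile(r"^\s*import\s+\S+")


@dataclass
class Entry:
    """Hypothetical dataset entry; field names are assumptions."""
    stars: int
    forks: int
    repo_kotlin_files: int
    content: str


def count_sloc(content: str) -> int:
    # Assumption: SLOC = lines that are neither blank nor pure `//` comments.
    return sum(
        1
        for line in content.splitlines()
        if line.strip() and not line.strip().startswith("//")
    )


def keep_entry(entry: Entry) -> bool:
    """Apply the three filtering rules in the order the README lists them."""
    if entry.stars + entry.forks < MIN_POPULARITY:
        return False
    if entry.repo_kotlin_files < MIN_KOTLIN_FILES:
        return False
    return count_sloc(entry.content) >= MIN_SLOC


def clean_content(content: str) -> str:
    """Apply the three cleaning rules to one file's text."""
    # Assumption: "non-ASCII entries" means non-ASCII characters are dropped.
    ascii_only = content.encode("ascii", errors="ignore").decode("ascii")
    cleaned = []
    imports_seen = 0
    for line in ascii_only.splitlines():
        if PACKAGE_RE.match(line):
            continue  # drop package declarations
        if IMPORT_RE.match(line):
            imports_seen += 1
            # Assumption: "half" means every second import line is dropped.
            if imports_seen % 2 == 0:
                continue
        cleaned.append(line)
    return "\n".join(cleaned)
```

A dataset pass would then look like `[clean_content(e.content) for e in entries if keep_entry(e)]`. If the original pipeline samples import lines at random rather than by position, only the branch marked as an assumption would need to change.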