Update README.md
README.md CHANGED
````diff
@@ -32,11 +32,12 @@ set a seed for reproducibility:
 >>> # the previous text.
 >>> generator = pipeline('text-generation', model='olm/olm-gpt2-dec-2022', bad_words_ids=[[0,2]])
 >>> set_seed(42)
->>> # This example also illustrates that sometimes our model generates
->>> # bloggy/spammy/webb-y things, even though it gets higher evaluation results
->>> # than the original GPT-2 accross a variety of benchmarks. See the first output.
 >>> generator("Hello, I'm a language model,", max_length=30, num_return_sequences=5)
+[{'generated_text': "Hello, I'm a language model, but you want to know if I have a language in that language. Is this possible? Please explain"},
+ {'generated_text': "Hello, I'm a language model, and here's some useful news for you all: The C++ API is becoming more and more popular for"},
+ {'generated_text': "Hello, I'm a language model, I'm not trying to learn or understand a new tool, my job is to be as happy as"},
+ {'generated_text': "Hello, I'm a language model, a language analyst, and a language system designer. I'm just a curious guy.\n"},
+ {'generated_text': "Hello, I'm a language model, I'm not doing anything that needs to be done for the current time (or previous)."}]
 ```
 
 Here is how to use this model to get the features of a given text in PyTorch:
@@ -52,7 +53,7 @@ output = model(**encoded_input)
 
 ## Dataset
 
-The model and tokenizer were trained with this [December 2022 cleaned Common Crawl dataset](
+The model and tokenizer were trained with this [December 2022 cleaned Common Crawl dataset](https://huggingface.co/datasets/olm/olm-CC-MAIN-2022-49-sampling-ratio-olm-0.15114822547) plus this [December 2022 cleaned Wikipedia dataset](https://huggingface.co/datasets/olm/olm-wikipedia-20221220).\
 The tokenized version of these concatenated datasets is [here](https://huggingface.co/datasets/olm/olm-december-2022-tokenized-1024).\
 The datasets were created with this [repo](https://github.com/huggingface/olm-datasets).
 
````