# Twitter June 2022 (RoBERTa-base, 154M)

This is a RoBERTa-base model trained on 153.86M tweets until the end of June 2022 (a 15M-tweet increment).
More details and performance scores are available in the [TimeLMs paper](https://arxiv.org/abs/2202.03829).

Below, we provide some usage examples using the standard Transformers interface. For another interface more suited to comparing predictions and perplexity scores between models trained at different temporal intervals, check the [TimeLMs repository](https://github.com/cardiffnlp/timelms).

For other models trained until different periods, check this [table](https://github.com/cardiffnlp/timelms#released-models).

## Preprocess Text
Replace usernames and links with the placeholders "@user" and "http".
If you're interested in retaining verified users that were also retained during training, you may keep the users listed [here](https://github.com/cardiffnlp/timelms/tree/main/data); see the sketch after the code block.
```python
def preprocess(text):
    # Replace user mentions and URLs with generic placeholders.
    new_text = []
    for t in text.split(" "):
        t = '@user' if t.startswith('@') and len(t) > 1 else t
        t = 'http' if t.startswith('http') else t
        new_text.append(t)
    return " ".join(new_text)
```
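For instance, `preprocess("@SomeUser check this: https://example.com")` returns `"@user check this: http"`. If you would rather keep verified users, one option is to check mentions against a local copy of the list linked above. This is a minimal sketch only: the `verified_users.txt` filename and one-handle-per-line format are assumptions, so check the actual files in the TimeLMs data folder.
```python
# Sketch of a variant that keeps verified users. Assumes a local file
# verified_users.txt with one handle per line (e.g. "@Twitter"); the
# filename and format are assumptions, not part of the TimeLMs release.
with open("verified_users.txt") as f:
    verified_users = set(line.strip() for line in f)

def preprocess_keep_verified(text):
    new_text = []
    for t in text.split(" "):
        if t.startswith('@') and len(t) > 1 and t not in verified_users:
            t = '@user'
        elif t.startswith('http'):
            t = 'http'
        new_text.append(t)
    return " ".join(new_text)
```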

## Example Masked Language Model

```python
from transformers import pipeline, AutoTokenizer

MODEL = "cardiffnlp/twitter-roberta-base-jun2022-15M-incr"
fill_mask = pipeline("fill-mask", model=MODEL, tokenizer=MODEL)
tokenizer = AutoTokenizer.from_pretrained(MODEL)

def print_candidates(candidates):
    # Print the top 5 candidates returned by the fill-mask pipeline.
    for i in range(5):
        token = tokenizer.decode(candidates[i]['token'])
        score = candidates[i]['score']
        print("%d) %.5f %s" % (i + 1, score, token))

texts = [
    "So glad I'm <mask> vaccinated.",
    "I keep forgetting to bring a <mask>.",
    "Looking forward to watching <mask> Game tonight!",
]
for text in texts:
    t = preprocess(text)
    print(f"{'-'*30}\n{t}")
    print_candidates(fill_mask(t))
```

Output:

```
------------------------------
So glad I'm <mask> vaccinated.
1) 0.36928 not
2) 0.29651 fully
3) 0.15332 getting
4) 0.04144 still
5) 0.01805 all
------------------------------
I keep forgetting to bring a <mask>.
1) 0.06048 book
2) 0.03458 backpack
3) 0.03362 lighter
4) 0.03162 charger
5) 0.02832 pen
------------------------------
Looking forward to watching <mask> Game tonight!
1) 0.65149 the
2) 0.14239 The
3) 0.02432 this
4) 0.00877 End
5) 0.00866 Big
```
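By default the pipeline returns the top 5 candidates; the standard `top_k` argument of the Transformers fill-mask pipeline controls this, and each candidate also carries a decoded `token_str`, so the tokenizer round-trip above is optional:
```python
# Request the top 10 candidates for a single example.
candidates = fill_mask(preprocess("I keep forgetting to bring a <mask>."), top_k=10)
for i, c in enumerate(candidates):
    print("%d) %.5f %s" % (i + 1, c['score'], c['token_str']))
```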

## Example Tweet Embeddings
```python
from transformers import AutoTokenizer, AutoModel
import numpy as np
from scipy.spatial.distance import cosine
from collections import Counter

def get_embedding(text):
    # Mean-pool the last hidden states over all tokens.
    text = preprocess(text)
    encoded_input = tokenizer(text, return_tensors='pt')
    features = model(**encoded_input)
    features = features[0].detach().cpu().numpy()
    features_mean = np.mean(features[0], axis=0)
    return features_mean


MODEL = "cardiffnlp/twitter-roberta-base-jun2022-15M-incr"
tokenizer = AutoTokenizer.from_pretrained(MODEL)
model = AutoModel.from_pretrained(MODEL)

query = "The book was awesome"
tweets = ["I just ordered fried chicken 🐣",
          "The movie was great",
          "What time is the next game?",
          "Just finished reading 'Embeddings in NLP'"]

sims = Counter()
for tweet in tweets:
    sim = 1 - cosine(get_embedding(query), get_embedding(tweet))
    sims[tweet] = sim

print('Most similar to: ', query)
print(f"{'-'*30}")
for idx, (tweet, sim) in enumerate(sims.most_common()):
    print("%d) %.5f %s" % (idx + 1, sim, tweet))
```
Output:

```
Most similar to: The book was awesome
------------------------------
1) 0.98882 The movie was great
2) 0.96087 Just finished reading 'Embeddings in NLP'
3) 0.95450 I just ordered fried chicken 🐣
4) 0.95300 What time is the next game?
```
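Computing embeddings one tweet at a time works for small lists, but it is usually faster to batch the tokenizer call and average only over non-padding positions. A minimal sketch of that variant, reusing the `tokenizer` and `model` loaded above:
```python
import torch

def get_embeddings_batched(texts):
    # Tokenize all tweets at once, padding to the longest sequence.
    texts = [preprocess(t) for t in texts]
    encoded = tokenizer(texts, return_tensors='pt', padding=True, truncation=True)
    with torch.no_grad():
        hidden = model(**encoded)[0]  # (batch, seq_len, hidden_size)
    # Average only over real tokens, ignoring padding.
    mask = encoded['attention_mask'].unsqueeze(-1).float()
    return ((hidden * mask).sum(1) / mask.sum(1)).numpy()

embeddings = get_embeddings_batched([query] + tweets)
print(embeddings.shape)  # (5, 768)
```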

## Example Feature Extraction

```python
from transformers import AutoTokenizer, AutoModel, TFAutoModel
import numpy as np

MODEL = "cardiffnlp/twitter-roberta-base-jun2022-15M-incr"
tokenizer = AutoTokenizer.from_pretrained(MODEL)

text = "Good night 😊"
text = preprocess(text)

# PyTorch
model = AutoModel.from_pretrained(MODEL)
encoded_input = tokenizer(text, return_tensors='pt')
features = model(**encoded_input)
features = features[0].detach().cpu().numpy()
features_mean = np.mean(features[0], axis=0)
# features_max = np.max(features[0], axis=0)

# # TensorFlow
# model = TFAutoModel.from_pretrained(MODEL)
# encoded_input = tokenizer(text, return_tensors='tf')
# features = model(encoded_input)
# features = features[0].numpy()
# features_mean = np.mean(features[0], axis=0)
# # features_max = np.max(features[0], axis=0)
```
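If you only need token features without writing the forward pass yourself, the built-in feature-extraction pipeline exposes the same hidden states. A brief sketch; the pipeline returns one vector per token, so we still mean-pool over the sequence:
```python
from transformers import pipeline
import numpy as np

extractor = pipeline("feature-extraction", model=MODEL, tokenizer=MODEL)
features = np.array(extractor(preprocess("Good night 😊"))[0])  # (seq_len, hidden_size)
features_mean = features.mean(axis=0)
```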