---
license: mit
language:
- en
tags:
- sentiment
- finance
- sentiment-analysis
- financial-sentiment-analysis
- twitter
- tweets
- stocks
- stock-market
- crypto
- cryptocurrency
datasets:
- StephanAkkerman/stock-market-tweets-data
- StephanAkkerman/financial-tweets
- StephanAkkerman/crypto-stock-tweets
metrics:
- perplexity
widget:
- text: Paris is the [MASK] of France.
  example_title: Generic 1
- text: The goal of life is [MASK].
  example_title: Generic 2
- text: AAPL is a [MASK] sector stock.
  example_title: AAPL
- text: I predict that this stock will go [MASK].
  example_title: Stock Direction
- text: $AAPL is the ticker for the company named [MASK].
  example_title: Ticker
base_model: yiyanghkust/finbert-pretrain
model-index:
- name: FinTwitBERT
  results:
  - task:
      type: financial-tweet-prediction
      name: Financial Tweet Prediction
    dataset:
      name: Stock Market Tweets Data
      type: finance
    metrics:
    - type: Perplexity
      value: 5.022
---

# FinTwitBERT

FinTwitBERT is a language model specifically pre-trained on a large dataset of financial tweets. This specialized BERT model aims to capture the unique jargon and communication style found in the financial Twitter sphere, making it an ideal tool for sentiment analysis, trend prediction, and other financial NLP tasks.

## Dataset
FinTwitBERT is pre-trained on several financial tweet datasets consisting of tweets that mention stocks and cryptocurrencies (see the loading sketch after the list):
- [StephanAkkerman/crypto-stock-tweets](https://huggingface.co/datasets/StephanAkkerman/crypto-stock-tweets): 8,024,269 tweets
- [StephanAkkerman/stock-market-tweets-data](https://huggingface.co/datasets/StephanAkkerman/stock-market-tweets-data): 923,673 tweets
- [StephanAkkerman/financial-tweets](https://huggingface.co/datasets/StephanAkkerman/financial-tweets): 263,119 tweets
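
These corpora are hosted on the Hugging Face Hub, so they can be inspected directly with the `datasets` library. A minimal sketch, assuming the default `train` split (check each dataset card for the actual schema and column names):

```python
from datasets import load_dataset

# Load one of the pre-training corpora from the Hugging Face Hub.
tweets = load_dataset("StephanAkkerman/stock-market-tweets-data", split="train")

# Inspect the schema and a sample row before any further processing.
print(tweets)
print(tweets[0])
```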

## Model Details
Based on the [FinBERT](https://huggingface.co/yiyanghkust/finbert-pretrain) model and tokenizer, FinTwitBERT adds two special tokens, `@USER` and `[URL]`, to handle the user mentions and links that are common in tweets. The model was pre-trained for 10 epochs with early stopping to prevent overfitting.
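
Because the tokenizer reserves `@USER` and `[URL]` as special tokens, user mentions and links in input tweets should be replaced accordingly before inference. A minimal preprocessing sketch; the regex patterns are assumptions, not necessarily the exact cleaning used during pre-training:

```python
import re

def preprocess_tweet(text: str) -> str:
    # Replace user mentions and links with the model's special tokens.
    text = re.sub(r"@\w+", "@USER", text)
    text = re.sub(r"https?://\S+", "[URL]", text)
    return text

print(preprocess_tweet("@trader check https://example.com $AAPL is going [MASK]!"))
# -> "@USER check [URL] $AAPL is going [MASK]!"
```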

## More Information
For a comprehensive overview, including the complete training setup details and more, visit the [FinTwitBERT GitHub repository](https://github.com/TimKoornstra/FinTwitBERT).

## Usage
Using [HuggingFace's transformers library](https://huggingface.co/docs/transformers/index), the model and tokenizer can be loaded into a fill-mask pipeline for masked language modeling.

```python
from transformers import pipeline

pipe = pipeline(
    "fill-mask",
    model="StephanAkkerman/FinTwitBERT",
    tokenizer="StephanAkkerman/FinTwitBERT",
)
print(pipe("Bitcoin is a [MASK] coin."))
```
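
If you prefer to work with the model directly rather than through a pipeline, the standard masked-LM API of `transformers` can be used to inspect the top predictions at the `[MASK]` position. A minimal sketch (the prompt and top-k value are arbitrary choices):

```python
import torch
from transformers import AutoModelForMaskedLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("StephanAkkerman/FinTwitBERT")
model = AutoModelForMaskedLM.from_pretrained("StephanAkkerman/FinTwitBERT")

inputs = tokenizer("Bitcoin is a [MASK] coin.", return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits

# Find the [MASK] position and print the 5 most likely tokens for it.
mask_pos = (inputs["input_ids"][0] == tokenizer.mask_token_id).nonzero(as_tuple=True)[0]
top_ids = logits[0, mask_pos].topk(5).indices[0]
print(tokenizer.convert_ids_to_tokens(top_ids.tolist()))
```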

## License
This project is licensed under the [MIT License](https://choosealicense.com/licenses/mit/).