---
license: mit
language:
- en
tags:
- sentiment
- finance
- sentiment-analysis
- financial-sentiment-analysis
- twitter
- tweets
- stocks
- stock-market
- crypto
- cryptocurrency
datasets:
- StephanAkkerman/stock-market-tweets-data
- StephanAkkerman/financial-tweets
- StephanAkkerman/crypto-stock-tweets
metrics:
- perplexity
widget:
- text: Paris is the [MASK] of France.
  example_title: Generic 1
- text: The goal of life is [MASK].
  example_title: Generic 2
- text: AAPL is a [MASK] sector stock.
  example_title: AAPL
- text: I predict that this stock will go [MASK].
  example_title: Stock Direction
- text: $AAPL is the ticker for the company named [MASK].
  example_title: Ticker
base_model: yiyanghkust/finbert-pretrain
model-index:
- name: FinTwitBERT
  results:
  - task:
      type: financial-tweet-prediction
      name: Financial Tweet Prediction
    dataset:
      name: Stock Market Tweets Data
      type: finance
    metrics:
    - type: Perplexity
      value: 5.022
---
# FinTwitBERT
FinTwitBERT is a BERT language model pre-trained on a large dataset of financial tweets. It aims to capture the jargon and communication style of financial Twitter, which makes it a suitable starting point for sentiment analysis, trend prediction, and other financial NLP tasks.
## Dataset
FinTwitBERT is pre-trained on several datasets of financial tweets, each consisting of tweets that mention stocks and cryptocurrencies:
- [StephanAkkerman/crypto-stock-tweets](https://huggingface.co/datasets/StephanAkkerman/crypto-stock-tweets): 8,024,269 tweets
- [StephanAkkerman/stock-market-tweets-data](https://huggingface.co/datasets/StephanAkkerman/stock-market-tweets-data): 923,673 tweets
- [StephanAkkerman/financial-tweets](https://huggingface.co/datasets/StephanAkkerman/financial-tweets): 263,119 tweets
## Model Details
FinTwitBERT is based on the [FinBERT](https://huggingface.co/yiyanghkust/finbert-pretrain) model and tokenizer, and adds two special tokens (`@USER` and `[URL]`) to handle user mentions and links, which are common in tweets. The model was pre-trained for up to 10 epochs, with early stopping to prevent overfitting.
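To illustrate how the `@USER` and `[URL]` placeholders could be applied to raw tweets before tokenization, here is a minimal sketch. The regular expressions and the `normalize_tweet` helper are assumptions made for illustration; the exact preprocessing used to train FinTwitBERT is documented in the GitHub repository and may differ.

```python
import re

# Hypothetical patterns, not taken from the FinTwitBERT codebase.
URL_PATTERN = re.compile(r"https?://\S+|www\.\S+")
# The lookbehind avoids matching the "@" inside e-mail addresses.
MENTION_PATTERN = re.compile(r"(?<!\w)@\w+")


def normalize_tweet(text: str) -> str:
    """Replace links with [URL] and user mentions with @USER."""
    text = URL_PATTERN.sub("[URL]", text)
    text = MENTION_PATTERN.sub("@USER", text)
    return text


print(normalize_tweet("@elonmusk $TSLA to the moon https://t.co/abc123"))
```

Cashtags such as `$TSLA` are left untouched in this sketch, since they carry the ticker information the model is meant to learn from.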
## More Information
For a comprehensive overview, including the complete training setup, visit the [FinTwitBERT GitHub repository](https://github.com/TimKoornstra/FinTwitBERT).
## Usage
Using [Hugging Face's transformers library](https://huggingface.co/docs/transformers/index), the model and tokenizer can be loaded into a pipeline for masked language modeling.
```python
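from transformers import pipeline

# Load the model and its tokenizer into a fill-mask pipeline
pipe = pipeline(
    "fill-mask",
    model="StephanAkkerman/FinTwitBERT",
    tokenizer="StephanAkkerman/FinTwitBERT",
)

# Print the candidate tokens predicted for the [MASK] position
print(pipe("Bitcoin is a [MASK] coin."))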
```
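For a single-mask input, a `fill-mask` pipeline returns a list of dictionaries with the keys `score`, `token`, `token_str`, and `sequence`. The sketch below shows one way to reduce that output to the top candidates. The `top_predictions` helper is hypothetical, and the `sample` list is hand-written placeholder data in the same shape as the pipeline output, not actual model predictions.

```python
from typing import Dict, List, Tuple


def top_predictions(results: List[Dict], k: int = 3) -> List[Tuple[str, float]]:
    """Return the k highest-scoring (token, score) pairs from fill-mask output."""
    ranked = sorted(results, key=lambda r: r["score"], reverse=True)
    return [(r["token_str"], round(r["score"], 4)) for r in ranked[:k]]


# Placeholder data shaped like pipeline output; tokens and scores are made up.
sample = [
    {"score": 0.30, "token": 2, "token_str": "b", "sequence": "Bitcoin is a b coin."},
    {"score": 0.60, "token": 1, "token_str": "a", "sequence": "Bitcoin is a a coin."},
    {"score": 0.10, "token": 3, "token_str": "c", "sequence": "Bitcoin is a c coin."},
]

print(top_predictions(sample, k=2))
```

With the real pipeline, the same helper could be called as `top_predictions(pipe("Bitcoin is a [MASK] coin."))`.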
## License
This project is licensed under the MIT License. See the [LICENSE](https://choosealicense.com/licenses/mit/) file for details.