---
license: mit
language:
- en
tags:
- sentiment
- finance
- sentiment-analysis
- financial-sentiment-analysis
datasets:
- Stock-Market-Tweets-Data
metrics:
- perplexity
widget:
- text: "Paris is the [MASK] of France."
  example_title: "Generic 1"
- text: "The goal of life is [MASK]."
  example_title: "Generic 2"
- text: "AAPL is a [MASK] sector stock."
  example_title: "AAPL"
- text: "I predict that this stock will go [MASK]."
  example_title: "Stock Direction"
- text: "$AAPL is the ticker for the company named [MASK]."
  example_title: "Ticker"
base_model: yiyanghkust/finbert-pretrain
model-index:
- name: FinTwitBERT
  results:
    - task:
        type: financial-tweet-prediction
        name: Financial Tweet Prediction
      dataset:
        name: Stock Market Tweets Data
        type: finance
      metrics:
        - type: Perplexity
          value: 5.156
---

# FinTwitBERT

FinTwitBERT is a language model pre-trained on a large dataset of financial tweets. This specialized BERT model aims to capture the jargon and communication style of financial Twitter, making it a suitable base for sentiment analysis, trend prediction, and other financial NLP tasks.

## Table of Contents
- [Dataset](#dataset)
- [Model Details](#model-details)
- [Installation](#installation)
- [Usage](#usage)
- [Contributing](#contributing)
- [License](#license)

## Dataset
FinTwitBERT is pre-trained on Taborda et al.'s [Stock Market Tweets Data](https://ieee-dataport.org/open-access/stock-market-tweets-data), which consists of 943,672 tweets, 1,300 of which are labeled. All labeled tweets are held out to evaluate the pre-trained model, with perplexity as the metric. The remaining tweets are used for pre-training, with 10% of them reserved for validation.
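
The split and the metric described above can be sketched in a few lines of plain Python. This is a minimal illustration, not the project's training code: the function names, the fixed seed, and the use of a simple random shuffle are assumptions, and perplexity is taken to be the exponential of the mean per-token cross-entropy loss, which is the usual convention but is not spelled out here.

```python
import math
import random


def train_validation_split(items, validation_fraction=0.1, seed=42):
    """Shuffle items and hold out a fraction of them for validation.

    Hypothetical helper: the 10% fraction comes from the text above,
    the shuffle and seed are illustrative assumptions.
    """
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    n_validation = int(len(shuffled) * validation_fraction)
    # First slice is the training set, second slice is the validation set.
    return shuffled[n_validation:], shuffled[:n_validation]


def perplexity(token_losses):
    """Return exp(mean loss) for per-token cross-entropy losses in nats."""
    if not token_losses:
        raise ValueError("need at least one token loss")
    return math.exp(sum(token_losses) / len(token_losses))
```

Under this definition, a lower perplexity means the model assigns higher probability to the held-out tokens, so it can be compared across checkpoints evaluated on the same labeled tweets.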

## Model Details
We use the [FinBERT](https://huggingface.co/ProsusAI/finbert) model and tokenizer from ProsusAI as our base. We added two placeholder tokens to the tokenizer: `@USER` for user mentions and `[URL]` for URLs in tweets. The model is then pre-trained for 10 epochs, using loss as the metric for selecting the best model. We apply early stopping to prevent overfitting.
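
To illustrate how the `@USER` and `[URL]` placeholders might be applied to raw tweets, here is a small sketch. The regular expressions and the function name `normalize_tweet` are assumptions made for this example; the preprocessing actually used for training may differ.

```python
import re

# Assumed patterns for illustration; not taken from the project's code.
URL_PATTERN = re.compile(r"https?://\S+|www\.\S+")
MENTION_PATTERN = re.compile(r"@\w+")


def normalize_tweet(text):
    """Replace URLs with [URL] and user mentions with @USER."""
    # URLs are replaced first so that an "@" inside a URL is not
    # mistaken for a user mention.
    text = URL_PATTERN.sub("[URL]", text)
    text = MENTION_PATTERN.sub("@USER", text)
    return text
```

With the Hugging Face `transformers` library, new tokens like these are typically registered through `tokenizer.add_tokens(...)` followed by `model.resize_token_embeddings(len(tokenizer))`, so that each placeholder is kept as a single token instead of being split into word pieces.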

The latest pre-trained model and tokenizer are available on Hugging Face: https://huggingface.co/StephanAkkerman/FinTwitBERT.

## Installation
```bash
# Clone this repository
git clone https://github.com/TimKoornstra/FinTwitBERT
# Move into the repository so that requirements.txt is found
cd FinTwitBERT
# Install required packages
pip install -r requirements.txt
```

## Usage
The model can be fine-tuned for specific tasks such as sentiment classification. For more information on fine-tuning, see our other repository: https://github.com/TimKoornstra/stock-sentiment-classifier.
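
Before fine-tuning, the pre-trained model can also be queried directly for masked-token predictions. The sketch below assumes the `transformers` library and a backend such as PyTorch are installed, and that the model ID matches the Hugging Face URL given earlier; the helper names are hypothetical.

```python
def top_fill_mask_tokens(predictions, k=3):
    """Return the k highest-scoring (token, score) pairs from fill-mask output."""
    ranked = sorted(predictions, key=lambda p: p["score"], reverse=True)
    return [(p["token_str"], round(p["score"], 3)) for p in ranked[:k]]


def load_fill_mask(model_id="StephanAkkerman/FinTwitBERT"):
    """Build a fill-mask pipeline; downloads the model on first call."""
    # Imported lazily so this module can be loaded without transformers.
    from transformers import pipeline

    return pipeline("fill-mask", model=model_id)


# Example usage (requires network access to download the model):
#   fill_mask = load_fill_mask()
#   predictions = fill_mask("I predict that this stock will go [MASK].")
#   print(top_fill_mask_tokens(predictions))
```

For an input with a single `[MASK]`, the pipeline returns a list of dictionaries that include `token_str` and `score`, which is the shape `top_fill_mask_tokens` expects.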

## Contributing
Contributions are welcome! If you have a feature request, bug report, or proposal for code refactoring, please open an issue on GitHub. We appreciate your help in improving this project.

## License
This project is licensed under the [MIT License](https://choosealicense.com/licenses/mit/).