StephanAkkerman committed on
Commit 8497bf7
1 Parent(s): 7e021a6

Update README.md

Files changed (1)
  1. README.md +26 -27
README.md CHANGED
@@ -7,12 +7,16 @@ tags:
  - finance
  - sentiment-analysis
  - financial-sentiment-analysis
 datasets:
  - StephanAkkerman/stock-market-tweets-data
  - StephanAkkerman/financial-tweets
- - StephanAkkerman/financial-tweets-crypto
- - StephanAkkerman/financial-tweets-stocks
- - StephanAkkerman/financial-tweets-other
 metrics:
  - perplexity
 widget:
@@ -45,36 +49,31 @@ model-index:

 FinTwitBERT is a language model specifically pre-trained on a large dataset of financial tweets. This specialized BERT model aims to capture the unique jargon and communication style found in the financial Twitter sphere, making it an ideal tool for sentiment analysis, trend prediction, and other financial NLP tasks.

- ## Table of Contents
- - [Dataset](#dataset)
- - [Model Details](#model-details)
- - [Installation](#installation)
- - [Usage](#usage)
- - [Training](#training)
- - [Evaluation](#evaluation)
- - [Contributing](#contributing)
- - [License](#license)
-
 ## Dataset
- FinTwitBERT is pre-trained on Taborda et al.'s [Stock Market Tweets Data](https://ieee-dataport.org/open-access/stock-market-tweets-data), consisting of 943,672 tweets, of which 1,300 are labeled. The labeled tweets are used to evaluate the pre-trained model, with perplexity as the metric. The remaining tweets are used for pre-training, with 10% held out for model validation.

- ## Model details
- We use the [FinBERT](https://huggingface.co/ProsusAI/finbert) model and tokenizer from ProsusAI as our base. We added two masks to the tokenizer: `@USER` for user mentions and `[URL]` for URLs in tweets. The model is then pre-trained for 10 epochs, using loss as the metric for selecting the best model, and we apply early stopping to prevent overfitting.

- The latest pre-trained model and tokenizer can be found on Hugging Face: https://huggingface.co/StephanAkkerman/FinTwitBERT.

- ## Installation
- ```bash
- # Clone this repository
- git clone https://github.com/TimKoornstra/FinTwitBERT
- # Install required packages
- pip install -r requirements.txt
- ```
 ## Usage
- The model can be fine-tuned for specific tasks such as sentiment classification. For more information, see our other repository: https://github.com/TimKoornstra/stock-sentiment-classifier.

- ## Contributing
- Contributions are welcome! If you have a feature request, bug report, or proposal for code refactoring, please feel free to open an issue on GitHub. I appreciate your help in improving this project.

 ## License
 This project is licensed under the MIT License. See the [LICENSE](https://choosealicense.com/licenses/mit/) file for details.
 
  - finance
  - sentiment-analysis
  - financial-sentiment-analysis
+ - twitter
+ - tweets
+ - stocks
+ - stock-market
+ - crypto
+ - cryptocurrency
 datasets:
  - StephanAkkerman/stock-market-tweets-data
  - StephanAkkerman/financial-tweets
+ - StephanAkkerman/crypto-stock-tweets
 metrics:
  - perplexity
 widget:
 
 FinTwitBERT is a language model specifically pre-trained on a large dataset of financial tweets. This specialized BERT model aims to capture the unique jargon and communication style found in the financial Twitter sphere, making it an ideal tool for sentiment analysis, trend prediction, and other financial NLP tasks.

 ## Dataset
+ FinTwitBERT is pre-trained on several financial tweet datasets consisting of tweets that mention stocks and cryptocurrencies (a loading sketch follows the list):
+ - [StephanAkkerman/crypto-stock-tweets](https://huggingface.co/datasets/StephanAkkerman/crypto-stock-tweets): 8,024,269 tweets
+ - [StephanAkkerman/stock-market-tweets-data](https://huggingface.co/datasets/StephanAkkerman/stock-market-tweets-data): 923,673 tweets
+ - [StephanAkkerman/financial-tweets](https://huggingface.co/datasets/StephanAkkerman/financial-tweets): 263,119 tweets
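
As a quick sanity check, each of these datasets can be pulled straight from the Hugging Face Hub with the `datasets` library. This is a minimal sketch, not part of the commit itself; the `train` split name and the record layout are assumptions.

```python
# Minimal sketch: inspect one of the pre-training datasets from the Hub.
# Assumes a `train` split exists; actual split and column names may differ.
from datasets import load_dataset

tweets = load_dataset("StephanAkkerman/stock-market-tweets-data", split="train")
print(tweets)     # row count and column names
print(tweets[0])  # first tweet record
```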

+ ## Model Details
+ Based on the [FinBERT](https://huggingface.co/yiyanghkust/finbert-pretrain) model and tokenizer, FinTwitBERT includes additional masks (`@USER` and `[URL]`) to handle common elements in tweets. The model underwent 10 epochs of pre-training, with early stopping to prevent overfitting.
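
To illustrate what adding such masks involves, the snippet below registers extra tokens on a BERT-style tokenizer and resizes the embedding matrix to match. It is a hedged sketch of the general technique, not FinTwitBERT's actual pre-training code; see the repository linked below for the real setup.

```python
# Illustrative sketch only: this is NOT the FinTwitBERT training code.
from transformers import AutoModelForMaskedLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("yiyanghkust/finbert-pretrain")
model = AutoModelForMaskedLM.from_pretrained("yiyanghkust/finbert-pretrain")

# Register the tweet-specific tokens and grow the embeddings to cover them.
tokenizer.add_tokens(["@USER", "[URL]"])
model.resize_token_embeddings(len(tokenizer))
```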

+ ## More Information
+ For a comprehensive overview, including the complete training setup and more, visit the [FinTwitBERT GitHub repository](https://github.com/TimKoornstra/FinTwitBERT).

 ## Usage
+ Using [Hugging Face's transformers library](https://huggingface.co/docs/transformers/index), the model and tokenizer can be loaded into a pipeline for masked language modeling:

+ ```python
+ from transformers import pipeline
+
+ pipe = pipeline(
+     "fill-mask",
+     model="StephanAkkerman/FinTwitBERT",
+     tokenizer="StephanAkkerman/FinTwitBERT",
+ )
+ print(pipe("Bitcoin is a [MASK] coin."))
+ ```
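
For each `[MASK]`, the pipeline returns a list of candidate fills ranked by probability, each a dict containing the completed `sequence`, the predicted `token_str`, and its `score`, so the top suggestion is the first element of the returned list.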

 ## License
 This project is licensed under the MIT License. See the [LICENSE](https://choosealicense.com/licenses/mit/) file for details.