StephanAkkerman committed on
Commit 8497bf7
1 Parent(s): 7e021a6

Update README.md

Files changed (1)
  1. README.md +26 -27
README.md CHANGED
@@ -7,12 +7,16 @@ tags:
  - finance
  - sentiment-analysis
  - financial-sentiment-analysis
 datasets:
  - StephanAkkerman/stock-market-tweets-data
  - StephanAkkerman/financial-tweets
- - StephanAkkerman/financial-tweets-crypto
- - StephanAkkerman/financial-tweets-stocks
- - StephanAkkerman/financial-tweets-other
 metrics:
  - perplexity
 widget:
@@ -45,36 +49,31 @@ model-index:

 FinTwitBERT is a language model specifically pre-trained on a large dataset of financial tweets. This specialized BERT model aims to capture the unique jargon and communication style found in the financial Twitter sphere, making it an ideal tool for sentiment analysis, trend prediction, and other financial NLP tasks.

- ## Table of Contents
- - [Dataset](#dataset)
- - [Model Details](#model-details)
- - [Installation](#installation)
- - [Usage](#usage)
- - [Training](#training)
- - [Evaluation](#evaluation)
- - [Contributing](#contributing)
- - [License](#license)
-
 ## Dataset
- FinTwitBERT is pre-trained on Taborda et al.'s [Stock Market Tweets Data](https://ieee-dataport.org/open-access/stock-market-tweets-data), consisting of 943,672 tweets, of which 1,300 are labeled. The labeled tweets are used to evaluate the pre-trained model, with perplexity as the metric. The remaining tweets are used for pre-training, with 10% held out for model validation.

- ## Model details
- We use the [FinBERT](https://huggingface.co/ProsusAI/finbert) model and tokenizer from ProsusAI as our base. We added two masks to the tokenizer: `@USER` for user mentions and `[URL]` for URLs in tweets. The model is then pre-trained for 10 epochs, using loss as the metric for selecting the best model, and we apply early stopping to prevent overfitting.

- The latest pre-trained model and tokenizer can be found on Hugging Face: https://huggingface.co/StephanAkkerman/FinTwitBERT.

- ## Installation
- ```bash
- # Clone this repository
- git clone https://github.com/TimKoornstra/FinTwitBERT
- # Install required packages
- pip install -r requirements.txt
- ```
 ## Usage
- The model can be fine-tuned for specific tasks such as sentiment classification. For more information, see our other repository: https://github.com/TimKoornstra/stock-sentiment-classifier.

- ## Contributing
- Contributions are welcome! If you have a feature request, bug report, or proposal for code refactoring, please feel free to open an issue on GitHub. I appreciate your help in improving this project.

 ## License
 This project is licensed under the MIT License. See the [LICENSE](https://choosealicense.com/licenses/mit/) file for details.
 
  - finance
  - sentiment-analysis
  - financial-sentiment-analysis
+ - twitter
+ - tweets
+ - stocks
+ - stock-market
+ - crypto
+ - cryptocurrency
 datasets:
  - StephanAkkerman/stock-market-tweets-data
  - StephanAkkerman/financial-tweets
+ - StephanAkkerman/crypto-stock-tweets
 metrics:
  - perplexity
 widget:
 
 FinTwitBERT is a language model specifically pre-trained on a large dataset of financial tweets. This specialized BERT model aims to capture the unique jargon and communication style found in the financial Twitter sphere, making it an ideal tool for sentiment analysis, trend prediction, and other financial NLP tasks.

 ## Dataset
+ FinTwitBERT is pre-trained on several financial tweet datasets consisting of tweets that mention stocks and cryptocurrencies (a loading sketch follows the list):
+ - [StephanAkkerman/crypto-stock-tweets](https://huggingface.co/datasets/StephanAkkerman/crypto-stock-tweets): 8,024,269 tweets
+ - [StephanAkkerman/stock-market-tweets-data](https://huggingface.co/datasets/StephanAkkerman/stock-market-tweets-data): 923,673 tweets
+ - [StephanAkkerman/financial-tweets](https://huggingface.co/datasets/StephanAkkerman/financial-tweets): 263,119 tweets
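
As a quick sanity check, each of these datasets can be pulled straight from the Hugging Face Hub with the `datasets` library. This is a minimal sketch, not part of the commit itself; the `train` split name and the record layout are assumptions.

```python
# Minimal sketch: inspect one of the pre-training datasets from the Hub.
# Assumes a `train` split exists; actual split and column names may differ.
from datasets import load_dataset

tweets = load_dataset("StephanAkkerman/stock-market-tweets-data", split="train")
print(tweets)     # row count and column names
print(tweets[0])  # first tweet record
```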

+ ## Model Details
+ Based on the [FinBERT](https://huggingface.co/yiyanghkust/finbert-pretrain) model and tokenizer, FinTwitBERT includes additional masks (`@USER` and `[URL]`) to handle common elements in tweets. The model underwent 10 epochs of pre-training, with early stopping to prevent overfitting.
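
To illustrate what adding such masks involves, the snippet below registers extra tokens on a BERT-style tokenizer and resizes the embedding matrix to match. It is a hedged sketch of the general technique, not FinTwitBERT's actual pre-training code; see the repository linked below for the real setup.

```python
# Illustrative sketch only: this is NOT the FinTwitBERT training code.
from transformers import AutoModelForMaskedLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("yiyanghkust/finbert-pretrain")
model = AutoModelForMaskedLM.from_pretrained("yiyanghkust/finbert-pretrain")

# Register the tweet-specific tokens and grow the embeddings to cover them.
tokenizer.add_tokens(["@USER", "[URL]"])
model.resize_token_embeddings(len(tokenizer))
```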

+ ## More Information
+ For a comprehensive overview, including the complete training setup and more, visit the [FinTwitBERT GitHub repository](https://github.com/TimKoornstra/FinTwitBERT).

 ## Usage
+ Using [Hugging Face's transformers library](https://huggingface.co/docs/transformers/index), the model and tokenizer can be loaded into a pipeline for masked language modeling:

+ ```python
+ from transformers import pipeline
+
+ pipe = pipeline(
+     "fill-mask",
+     model="StephanAkkerman/FinTwitBERT",
+     tokenizer="StephanAkkerman/FinTwitBERT",
+ )
+ print(pipe("Bitcoin is a [MASK] coin."))
+ ```
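
For each `[MASK]`, the pipeline returns a list of candidate fills ranked by probability, each a dict containing the completed `sequence`, the predicted `token_str`, and its `score`, so the top suggestion is the first element of the returned list.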

 ## License
 This project is licensed under the MIT License. See the [LICENSE](https://choosealicense.com/licenses/mit/) file for details.