Model Card for Model Geo-BERT-multilingual
This model predicts the geolocation of short texts (fewer than 500 words) in the form of two-dimensional distributions, also referred to as a Gaussian Mixture Model (GMM).
Model Details
- Number of predicted points: 5
- Custom transformers pipeline and result visualization: https://github.com/K4TEL/geo-twitter/tree/predict
Model Description
This project aims to solve the tweet/user geolocation prediction task and to provide a flexible methodology for geotagging textual big data. The suggested approach uses BERT-based neural networks for NLP to estimate the location in the form of two-dimensional GMMs, where each component is described by a longitude, a latitude, a weight, and a covariance. The base model has been finetuned on a Twitter dataset containing the text content and metadata context of the tweets.
- Developed by: Kateryna Lutsai
- Model type: regression
- Language(s) (NLP): multilingual
- Finetuned from model: bert-base-multilingual-cased
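To illustrate the output format, here is a minimal sketch of how a two-dimensional GMM built from (longitude, latitude, weight, covariance) values could be evaluated at a point. It assumes a spherical covariance (a single variance per component) and treats degrees as planar coordinates. The names `Component` and `gmm_density` are hypothetical and are not taken from the repository.

```python
import math
from typing import List, NamedTuple


class Component(NamedTuple):
    lon: float     # component mean, longitude in degrees
    lat: float     # component mean, latitude in degrees
    weight: float  # mixture weight; weights are assumed to sum to 1
    var: float     # assumed spherical variance, in squared degrees


def gmm_density(components: List[Component], lon: float, lat: float) -> float:
    """Density of a 2D spherical-covariance GMM at the point (lon, lat)."""
    total = 0.0
    for c in components:
        sq_dist = (lon - c.lon) ** 2 + (lat - c.lat) ** 2
        norm = 1.0 / (2.0 * math.pi * c.var)  # normaliser of a 2D isotropic Gaussian
        total += c.weight * norm * math.exp(-sq_dist / (2.0 * c.var))
    return total
```

With five predicted points, `components` would hold five such entries, and the density is highest near the means of the heavily weighted components.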
Model Sources
- Repository: https://github.com/K4TEL/geo-twitter
- Paper: https://arxiv.org/pdf/2303.07865.pdf
- Demo: https://github.com/K4TEL/geo-twitter/blob/predict/prediction.ipynb
Uses
Geotagging of big data
Direct Use
Per-tweet geolocation prediction
Out-of-Scope Use
Per-tweet geolocation prediction without "user" metadata is expected to show lower accuracy of predictions.
Bias, Risks, and Limitations
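There is a risk of unethical use on data that is not publicly available.
The text length limit is dictated by the input capacity of the BERT-based model, roughly 500 tokens (512 including special tokens for bert-base-multilingual-cased). Tokens are subword units rather than words, so the effective limit in words is usually lower.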
How to Get Started with the Model
Use the code in the repository branch below to get started with the model:
https://github.com/K4TEL/geo-twitter/tree/predict
A short startup guide is given in the repository branch description.
Training Details
Training Data
The Twitter dataset contained tweets with their text content, metadata ("user" and "place") context, and geolocation coordinates.
Training Procedure
Information about training the model on user-defined data can be found in the GitHub repository: https://github.com/K4TEL/geo-twitter
Training Hyperparameters
- Learning rate start: 1e-5
- Learning rate end: 1e-6
- Learning rate scheduler: cosine
- Number of epochs: 3
- Batch size: 10
- Optimizer: Adam
- Intra-feature loss: mean
- Inter-feature loss: mean
- Neg log-likelihood domain: positive
- Features: NON-GEO + GEO-ONLY
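The learning-rate settings above can be read as a cosine decay from 1e-5 to 1e-6. The following is a minimal sketch that assumes the standard cosine annealing formula and a per-step update; the scheduler actually used in the repository may differ, and `cosine_lr` is a hypothetical helper.

```python
import math


def cosine_lr(step: int, total_steps: int,
              lr_start: float = 1e-5, lr_end: float = 1e-6) -> float:
    """Cosine-annealed learning rate at `step`, for 0 <= step <= total_steps."""
    progress = step / total_steps
    # Starts at lr_start (progress 0) and decays smoothly to lr_end (progress 1).
    return lr_end + 0.5 * (lr_start - lr_end) * (1.0 + math.cos(math.pi * progress))
```

Here `total_steps` would be the number of optimizer steps over the 3 epochs, which depends on the dataset size and the batch size of 10.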
Evaluation
All performance metrics and results are reported in the Results section of the article pre-print: https://arxiv.org/pdf/2303.07865.pdf
Testing Data, Factors & Metrics
Testing Data
Worldwide dataset of tweets with TEXT-ONLY and NON-GEO features
Metrics
- Spatial metrics: mean and median Simple Accuracy Error (SAE), Acc@161
- Probabilistic metrics: mean and median Cumulative Accuracy Error (CAE), mean and median Prediction Area Region (PRA) for 95% density area, Coverage of PRA
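For the spatial metrics, the following is a minimal sketch of how a distance error and Acc@161 could be computed. It assumes that the error is the great-circle (haversine) distance in kilometres between the predicted and the true coordinates, and that Acc@161 is the share of predictions within 161 km of the truth. The function names are illustrative and are not taken from the repository.

```python
import math
from statistics import mean, median
from typing import Dict, List, Tuple

EARTH_RADIUS_KM = 6371.0  # mean Earth radius

Point = Tuple[float, float]  # (longitude, latitude) in degrees


def haversine_km(lon1: float, lat1: float, lon2: float, lat2: float) -> float:
    """Great-circle distance in kilometres between two points given in degrees."""
    lon1, lat1, lon2, lat2 = map(math.radians, (lon1, lat1, lon2, lat2))
    a = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def spatial_metrics(pred: List[Point], true: List[Point]) -> Dict[str, float]:
    """Mean and median distance error plus the fraction of errors within 161 km."""
    errors = [haversine_km(p[0], p[1], t[0], t[1]) for p, t in zip(pred, true)]
    return {
        "mean_km": mean(errors),
        "median_km": median(errors),
        "acc_at_161": sum(e <= 161.0 for e in errors) / len(errors),
    }
```

The probabilistic metrics (CAE, PRA) depend on the full predicted distribution and are not covered by this sketch.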
Results
Tweet geolocation prediction task
- TEXT-ONLY: mean error 1588 km, median error 50 km, Acc@161 of 61%
- NON-GEO: mean error 800 km, median error 25 km, Acc@161 of 80%
User home geolocation prediction task
- TEXT-ONLY: mean error 892 km, median error 31 km, Acc@161 of 74%
- NON-GEO: mean error 567 km, median error 26 km, Acc@161 of 82%
Model Architecture and Objective
The model adds a linear regression wrapper layer with a configurable number of output variables, which operates on the classification token generated by the base BERT model.
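As an illustration only, the sketch below shows how a linear regression head could map a classification-token embedding to GMM parameters. The hidden size of 768 matches bert-base-multilingual-cased; everything else is an assumption made for the example: the four-values-per-point output layout (longitude, latitude, weight, spherical covariance), the softmax over weights, and the softplus on covariances. The actual output layout is defined in the repository.

```python
import math
from typing import Dict, List

HIDDEN_SIZE = 768     # hidden size of bert-base-multilingual-cased
N_POINTS = 5          # number of predicted points stated in this card
VALUES_PER_POINT = 4  # assumed layout: lon, lat, weight, spherical covariance


def linear_head(cls_vec: List[float], weights: List[List[float]],
                bias: List[float]) -> List[float]:
    """Plain linear layer: one output per weight row, computed as W @ x + b."""
    return [sum(w * x for w, x in zip(row, cls_vec)) + b
            for row, b in zip(weights, bias)]


def softplus(x: float) -> float:
    """Numerically stable log(1 + exp(x)); keeps covariances positive."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def split_outputs(raw: List[float]) -> List[Dict[str, float]]:
    """Split raw head outputs into per-point GMM parameters (assumed layout)."""
    n = len(raw) // VALUES_PER_POINT
    lons, lats = raw[:n], raw[n:2 * n]
    logits, raw_cov = raw[2 * n:3 * n], raw[3 * n:4 * n]
    peak = max(logits)  # subtract the maximum for a stable softmax
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [{"lon": lons[i], "lat": lats[i],
             "weight": exps[i] / total, "cov": softplus(raw_cov[i])}
            for i in range(n)]
```

Under this assumed layout the head would have `N_POINTS * VALUES_PER_POINT` = 20 outputs, and changing the number of predicted points only changes the number of rows in `weights`.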
Hardware
NVIDIA GeForce GTX 1080 Ti
Software
Python IDE
Model Card Contact