BERTopic_ArXiv / README.md
MaartenGr's picture
Update README.md
43dd6fb
---
tags:
- bertopic
library_name: bertopic
pipeline_tag: text-classification
---
# BERTopic_ArXiv
This is a [BERTopic](https://github.com/MaartenGr/BERTopic) model.
BERTopic is a flexible and modular topic modeling framework that allows for the generation of easily interpretable topics from large datasets.
This pre-trained model demonstrates the use of several representation models that can be used within BERTopic. This model was trained on ~30000 ArXiv abstracts with the following topic representation methods (`bertopic.representation`):
* POS
* KeyBERTInspired
* MaximalMarginalRelevance
* KeyBERT + MaximalMarginalRelevance
* ChatGPT labels
* ChatGPT summaries
An example of the default c-TF-IDF representations:
!["multiaspect.png"](visualization_multiaspect.png)
An example of labels generated by ChatGPT (`gpt-3.5-turbo`):
!["multiaspect.png"](visualization_multiaspect_openai.png)
To generate these images, you can follow along with this tutorial: [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/1W7aEdDPxC29jP99GGZphUlqjMFFVKtBC?usp=sharing)
## Usage
To use this model, please install BERTopic:
```
pip install -U bertopic
pip install -U safetensors
```
You can use the model as follows:
```python
from bertopic import BERTopic
topic_model = BERTopic.load("MaartenGr/BERTopic_ArXiv")
topic_model.get_topic_info()
```
To view all different topic representations (keywords, labels, summary, etc.) you can run the following:
```python
>>> topic_model.get_topic(0, full=True)
{'Main': [['dialogue', 0.02704485163341523],
['dialog', 0.01677038224466311],
['response', 0.011692640237477233],
['responses', 0.01002788412923778],
['intent', 0.00990720856306287],
['oriented', 0.009217253131615378],
['slot', 0.009177118721490055],
['conversational', 0.009129311385144046],
['systems', 0.009101146153425574],
['conversation', 0.008845392252307181]],
'POS': [['dialogue', 0.02704485163341523],
['dialog', 0.01677038224466311],
['response', 0.011692640237477233],
['responses', 0.01002788412923778],
['intent', 0.00990720856306287],
['slot', 0.009177118721490055],
['conversational', 0.009129311385144046],
['systems', 0.009101146153425574],
['conversation', 0.008845392252307181],
['user', 0.008753551043296965]],
'KeyBERTInspired': [['task oriented dialogue', 0.6559894680976868],
['dialogue systems', 0.6249060034751892],
['oriented dialogue', 0.5788208246231079],
['dialog systems', 0.530449628829956],
['dialogue state', 0.5167528390884399],
['response generation', 0.5143576860427856],
['spoken language understanding', 0.46739083528518677],
['oriented dialog', 0.4600704610347748],
['dialog', 0.4534587264060974],
['dialogues', 0.44082391262054443]],
'MMR': [['dialogue', 0.02704485163341523],
['dialog', 0.01677038224466311],
['response', 0.011692640237477233],
['responses', 0.01002788412923778],
['intent', 0.00990720856306287],
['oriented', 0.009217253131615378],
['slot', 0.009177118721490055],
['conversational', 0.009129311385144046],
['systems', 0.009101146153425574],
['conversation', 0.008845392252307181]],
'KeyBERT + MMR': [['task oriented dialogue', 0.6559894680976868],
['dialogue systems', 0.6249060034751892],
['oriented dialogue', 0.5788208246231079],
['dialog systems', 0.530449628829956],
['dialogue state', 0.5167528390884399],
['response generation', 0.5143576860427856],
['spoken language understanding', 0.46739083528518677],
['oriented dialog', 0.4600704610347748],
['dialog', 0.4534587264060974],
['dialogues', 0.44082391262054443]],
'OpenAI_Label': [['Challenges and Approaches in Developing Task-oriented Dialogue Systems',
1]],
'OpenAI_Summary': [['Task-oriented dialogue systems and their components, such as dialogue policy, natural language understanding, dialogue state tracking, response generation, and end-to-end training using neural networks. These components are crucial in assisting users to complete various activities such as booking tickets and restaurant reservations through spoken language understanding dialogue. The challenge lies in tracking dialogue states of multiple domains and obtaining annotations for training. Effective SLU is achieved by utilizing context from the prior dialogue history.',
1]]}
```
## Topic overview
* Number of topics: 107
* Number of training documents: 33189
<details>
<summary>Click here for an overview of all topics.</summary>
| Topic ID | Topic Keywords | Topic Frequency | Label |
|----------|----------------|-----------------|-------|
| -1 | language - models - model - data - based | 20 | -1_language_models_model_data |
| 0 | dialogue - dialog - response - responses - intent | 14247 | 0_dialogue_dialog_response_responses |
| 1 | speech - asr - speech recognition - recognition - end | 1833 | 1_speech_asr_speech recognition_recognition |
| 2 | tuning - tasks - prompt - models - language | 1369 | 2_tuning_tasks_prompt_models |
| 3 | summarization - summaries - summary - abstractive - document | 1109 | 3_summarization_summaries_summary_abstractive |
| 4 | question - answer - qa - answering - question answering | 893 | 4_question_answer_qa_answering |
| 5 | sentiment - sentiment analysis - aspect - analysis - opinion | 837 | 5_sentiment_sentiment analysis_aspect_analysis |
| 6 | clinical - medical - biomedical - notes - patient | 691 | 6_clinical_medical_biomedical_notes |
| 7 | translation - nmt - machine translation - neural machine - neural machine translation | 586 | 7_translation_nmt_machine translation_neural machine |
| 8 | generation - text generation - text - language generation - nlg | 558 | 8_generation_text generation_text_language generation |
| 9 | hate - hate speech - offensive - speech - detection | 484 | 9_hate_hate speech_offensive_speech |
| 10 | news - fake - fake news - stance - fact | 455 | 10_news_fake_fake news_stance |
| 11 | relation - relation extraction - extraction - relations - entity | 450 | 11_relation_relation extraction_extraction_relations |
| 12 | ner - named - named entity - entity - named entity recognition | 376 | 12_ner_named_named entity_entity |
| 13 | parsing - parser - dependency - treebank - parsers | 370 | 13_parsing_parser_dependency_treebank |
| 14 | event - temporal - events - event extraction - extraction | 314 | 14_event_temporal_events_event extraction |
| 15 | emotion - emotions - multimodal - emotion recognition - emotional | 300 | 15_emotion_emotions_multimodal_emotion recognition |
| 16 | word - embeddings - word embeddings - embedding - words | 292 | 16_word_embeddings_word embeddings_embedding |
| 17 | explanations - explanation - rationales - rationale - interpretability | 212 | 17_explanations_explanation_rationales_rationale |
| 18 | morphological - arabic - morphology - languages - inflection | 204 | 18_morphological_arabic_morphology_languages |
| 19 | topic - topics - topic models - lda - topic modeling | 200 | 19_topic_topics_topic models_lda |
| 20 | bias - gender - biases - gender bias - debiasing | 195 | 20_bias_gender_biases_gender bias |
| 21 | law - frequency - zipf - words - length | 185 | 21_law_frequency_zipf_words |
| 22 | legal - court - law - legal domain - case | 182 | 22_legal_court_law_legal domain |
| 23 | adversarial - attacks - attack - adversarial examples - robustness | 181 | 23_adversarial_attacks_attack_adversarial examples |
| 24 | commonsense - commonsense knowledge - reasoning - knowledge - commonsense reasoning | 180 | 24_commonsense_commonsense knowledge_reasoning_knowledge |
| 25 | quantum - semantics - calculus - compositional - meaning | 171 | 25_quantum_semantics_calculus_compositional |
| 26 | correction - error - error correction - grammatical - grammatical error | 161 | 26_correction_error_error correction_grammatical |
| 27 | argument - arguments - argumentation - argumentative - mining | 160 | 27_argument_arguments_argumentation_argumentative |
| 28 | sarcasm - humor - sarcastic - detection - humorous | 157 | 28_sarcasm_humor_sarcastic_detection |
| 29 | coreference - resolution - coreference resolution - mentions - mention | 156 | 29_coreference_resolution_coreference resolution_mentions |
| 30 | sense - word sense - wsd - word - disambiguation | 153 | 30_sense_word sense_wsd_word |
| 31 | knowledge - knowledge graph - graph - link prediction - entities | 149 | 31_knowledge_knowledge graph_graph_link prediction |
| 32 | parsing - semantic parsing - amr - semantic - parser | 146 | 32_parsing_semantic parsing_amr_semantic |
| 33 | cross lingual - lingual - cross - transfer - languages | 146 | 33_cross lingual_lingual_cross_transfer |
| 34 | mt - translation - qe - quality - machine translation | 139 | 34_mt_translation_qe_quality |
| 35 | sql - text sql - queries - spider - schema | 138 | 35_sql_text sql_queries_spider |
| 36 | classification - text classification - label - text - labels | 136 | 36_classification_text classification_label_text |
| 37 | style - style transfer - transfer - text style - text style transfer | 136 | 37_style_style transfer_transfer_text style |
| 38 | question - question generation - questions - answer - generation | 129 | 38_question_question generation_questions_answer |
| 39 | authorship - authorship attribution - attribution - author - authors | 127 | 39_authorship_authorship attribution_attribution_author |
| 40 | sentence - sentence embeddings - similarity - sts - sentence embedding | 123 | 40_sentence_sentence embeddings_similarity_sts |
| 41 | code - identification - switching - cs - code switching | 121 | 41_code_identification_switching_cs |
| 42 | story - stories - story generation - generation - storytelling | 118 | 42_story_stories_story generation_generation |
| 43 | discourse - discourse relation - discourse relations - rst - discourse parsing | 117 | 43_discourse_discourse relation_discourse relations_rst |
| 44 | code - programming - source code - code generation - programming languages | 117 | 44_code_programming_source code_code generation |
| 45 | paraphrase - paraphrases - paraphrase generation - paraphrasing - generation | 114 | 45_paraphrase_paraphrases_paraphrase generation_paraphrasing |
| 46 | agent - games - environment - instructions - agents | 111 | 46_agent_games_environment_instructions |
| 47 | covid - covid 19 - 19 - tweets - pandemic | 108 | 47_covid_covid 19_19_tweets |
| 48 | linking - entity linking - entity - el - entities | 107 | 48_linking_entity linking_entity_el |
| 49 | poetry - poems - lyrics - poem - music | 103 | 49_poetry_poems_lyrics_poem |
| 50 | image - captioning - captions - visual - caption | 100 | 50_image_captioning_captions_visual |
| 51 | nli - entailment - inference - natural language inference - language inference | 96 | 51_nli_entailment_inference_natural language inference |
| 52 | keyphrase - keyphrases - extraction - document - phrases | 95 | 52_keyphrase_keyphrases_extraction_document |
| 53 | simplification - text simplification - ts - sentence - simplified | 95 | 53_simplification_text simplification_ts_sentence |
| 54 | empathetic - emotion - emotional - empathy - emotions | 95 | 54_empathetic_emotion_emotional_empathy |
| 55 | depression - mental - health - mental health - social media | 93 | 55_depression_mental_health_mental health |
| 56 | segmentation - word segmentation - chinese - chinese word segmentation - chinese word | 93 | 56_segmentation_word segmentation_chinese_chinese word segmentation |
| 57 | citation - scientific - papers - citations - scholarly | 85 | 57_citation_scientific_papers_citations |
| 58 | agreement - syntactic - verb - grammatical - subject verb | 85 | 58_agreement_syntactic_verb_grammatical |
| 59 | metaphor - literal - figurative - metaphors - idiomatic | 83 | 59_metaphor_literal_figurative_metaphors |
| 60 | srl - semantic role - role labeling - semantic role labeling - role | 82 | 60_srl_semantic role_role labeling_semantic role labeling |
| 61 | privacy - private - federated - privacy preserving - federated learning | 82 | 61_privacy_private_federated_privacy preserving |
| 62 | change - semantic change - time - semantic - lexical semantic | 82 | 62_change_semantic change_time_semantic |
| 63 | bilingual - lingual - cross lingual - cross - embeddings | 80 | 63_bilingual_lingual_cross lingual_cross |
| 64 | political - media - news - bias - articles | 77 | 64_political_media_news_bias |
| 65 | medical - qa - question - questions - clinical | 75 | 65_medical_qa_question_questions |
| 66 | math - mathematical - math word - word problems - problems | 73 | 66_math_mathematical_math word_word problems |
| 67 | financial - stock - market - price - news | 69 | 67_financial_stock_market_price |
| 68 | table - tables - tabular - reasoning - qa | 69 | 68_table_tables_tabular_reasoning |
| 69 | readability - complexity - assessment - features - reading | 65 | 69_readability_complexity_assessment_features |
| 70 | layout - document - documents - document understanding - extraction | 64 | 70_layout_document_documents_document understanding |
| 71 | brain - cognitive - reading - syntactic - language | 62 | 71_brain_cognitive_reading_syntactic |
| 72 | sign - gloss - language - signed - language translation | 61 | 72_sign_gloss_language_signed |
| 73 | vqa - visual - visual question - visual question answering - question | 59 | 73_vqa_visual_visual question_visual question answering |
| 74 | biased - biases - spurious - nlp - debiasing | 57 | 74_biased_biases_spurious_nlp |
| 75 | visual - dialogue - multimodal - image - dialog | 55 | 75_visual_dialogue_multimodal_image |
| 76 | translation - machine translation - machine - smt - statistical | 54 | 76_translation_machine translation_machine_smt |
| 77 | multimodal - visual - image - translation - machine translation | 52 | 77_multimodal_visual_image_translation |
| 78 | geographic - location - geolocation - geo - locations | 51 | 78_geographic_location_geolocation_geo |
| 79 | reasoning - prompting - llms - chain thought - chain | 48 | 79_reasoning_prompting_llms_chain thought |
| 80 | essay - scoring - aes - essay scoring - essays | 45 | 80_essay_scoring_aes_essay scoring |
| 81 | crisis - disaster - traffic - tweets - disasters | 45 | 81_crisis_disaster_traffic_tweets |
| 82 | graph - text classification - text - gcn - classification | 44 | 82_graph_text classification_text_gcn |
| 83 | annotation - tools - linguistic - resources - xml | 43 | 83_annotation_tools_linguistic_resources |
| 84 | entity alignment - alignment - kgs - entity - ea | 43 | 84_entity alignment_alignment_kgs_entity |
| 85 | personality - traits - personality traits - evaluative - text | 42 | 85_personality_traits_personality traits_evaluative |
| 86 | ad - alzheimer - alzheimer disease - disease - speech | 40 | 86_ad_alzheimer_alzheimer disease_disease |
| 87 | taxonomy - hypernymy - taxonomies - hypernym - hypernyms | 39 | 87_taxonomy_hypernymy_taxonomies_hypernym |
| 88 | active learning - active - al - learning - uncertainty | 37 | 88_active learning_active_al_learning |
| 89 | reviews - summaries - summarization - review - opinion | 36 | 89_reviews_summaries_summarization_review |
| 90 | emoji - emojis - sentiment - message - anonymous | 35 | 90_emoji_emojis_sentiment_message |
| 91 | table - table text - tables - table text generation - text generation | 35 | 91_table_table text_tables_table text generation |
| 92 | domain - domain adaptation - adaptation - domains - source | 35 | 92_domain_domain adaptation_adaptation_domains |
| 93 | alignment - word alignment - parallel - pairs - alignments | 34 | 93_alignment_word alignment_parallel_pairs |
| 94 | indo - languages - indo european - names - family | 34 | 94_indo_languages_indo european_names |
| 95 | patent - claim - claim generation - chemical - technical | 32 | 95_patent_claim_claim generation_chemical |
| 96 | agents - emergent - communication - referential - games | 32 | 96_agents_emergent_communication_referential |
| 97 | graph - amr - graph text - graphs - text generation | 31 | 97_graph_amr_graph text_graphs |
| 98 | moral - ethical - norms - values - social | 29 | 98_moral_ethical_norms_values |
| 99 | acronym - acronyms - abbreviations - abbreviation - disambiguation | 27 | 99_acronym_acronyms_abbreviations_abbreviation |
| 100 | typing - entity typing - entity - type - types | 27 | 100_typing_entity typing_entity_type |
| 101 | coherence - discourse - discourse coherence - coherence modeling - text | 26 | 101_coherence_discourse_discourse coherence_coherence modeling |
| 102 | pos - taggers - tagging - tagger - pos tagging | 25 | 102_pos_taggers_tagging_tagger |
| 103 | drug - social - social media - media - health | 25 | 103_drug_social_social media_media |
| 104 | gender - translation - bias - gender bias - mt | 24 | 104_gender_translation_bias_gender bias |
| 105 | job - resume - skills - skill - soft | 21 | 105_job_resume_skills_skill |
</details>
## Training Procedure
The model was trained as follows:
```python
from cuml.manifold import UMAP
from cuml.cluster import HDBSCAN
from bertopic import BERTopic
from sklearn.feature_extraction.text import CountVectorizer
from bertopic.representation import PartOfSpeech, KeyBERTInspired, MaximalMarginalRelevance, OpenAI
# Prepare sub-models
embedding_model = SentenceTransformer('sentence-transformers/all-mpnet-base-v2')
umap_model = UMAP(n_components=5, n_neighbors=50, random_state=42, metric="cosine", verbose=True)
hdbscan_model = HDBSCAN(min_samples=20, gen_min_span_tree=True, prediction_data=False, min_cluster_size=20, verbose=True)
vectorizer_model = CountVectorizer(stop_words="english", ngram_range=(1, 3), min_df=5)
# Summarization with ChatGPT
summarization_prompt = """
I have a topic that is described by the following keywords: [KEYWORDS]
In this topic, the following documents are a small but representative subset of all documents in the topic:
[DOCUMENTS]
Based on the information above, please give a description of this topic in the following format:
topic: <description>
"""
summarization_model = OpenAI(model="gpt-3.5-turbo", chat=True, prompt=summarization_prompt, nr_docs=5, exponential_backoff=True, diversity=0.1)
# Representation models
representation_models = {
"POS": PartOfSpeech("en_core_web_lg"),
"KeyBERTInspired": KeyBERTInspired(),
"MMR": MaximalMarginalRelevance(diversity=0.3),
"KeyBERT + MMR": [KeyBERTInspired(), MaximalMarginalRelevance(diversity=0.3)],
"OpenAI_Label": OpenAI(model="gpt-3.5-turbo", exponential_backoff=True, chat=True, diversity=0.1),
"OpenAI_Summary": [KeyBERTInspired(), summarization_model],
}
# Fit BERTopic
topic_model= BERTopic(
embedding_model=embedding_model,
umap_model=umap_model,
hdbscan_model=hdbscan_model,
vectorizer_model=vectorizer_model,
representation_model=representation_models,
verbose=True
).fit(docs)
```
## Training hyperparameters
* calculate_probabilities: False
* language: None
* low_memory: False
* min_topic_size: 10
* n_gram_range: (1, 1)
* nr_topics: None
* seed_topic_list: None
* top_n_words: 10
* verbose: True
## Framework versions
* Numpy: 1.22.4
* HDBSCAN: 0.8.29
* UMAP: 0.5.3
* Pandas: 1.5.3
* Scikit-Learn: 1.2.2
* Sentence-transformers: 2.2.2
* Transformers: 4.29.2
* Numba: 0.56.4
* Plotly: 5.13.1
* Python: 3.10.11