---
tags:
- bertopic
library_name: bertopic
pipeline_tag: text-classification
---

# ArXiv

This is a [BERTopic](https://github.com/MaartenGr/BERTopic) model.
BERTopic is a flexible and modular topic modeling framework that allows for the generation of easily interpretable topics from large datasets.

## Usage

To use this model, please install BERTopic:

```
pip install -U bertopic
```

You can use the model as follows:

```python
from bertopic import BERTopic

# Load the trained model from the Hugging Face Hub
topic_model = BERTopic.load("OSN2/ArXiv")

# Summarize every topic (ID, size, label) as a DataFrame
topic_model.get_topic_info()
```
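
Individual topics can also be inspected by ID. The sketch below assumes the loaded model exposes BERTopic's `get_topic(topic_id)`, which returns a list of `(word, score)` pairs for a known topic and a falsy value otherwise. `top_keywords` is a hypothetical helper name for illustration, not part of BERTopic.

```python
def top_keywords(topic_model, topic_id, k=5):
    """Return the k highest-ranked keywords for one topic.

    Assumes `topic_model.get_topic(topic_id)` returns a list of
    (word, score) pairs, or a falsy value for an unknown topic ID.
    """
    representation = topic_model.get_topic(topic_id)
    if not representation:
        return []
    return [word for word, _score in representation[:k]]


# Usage with the model loaded as shown above (downloads from the Hub):
#   from bertopic import BERTopic
#   topic_model = BERTopic.load("OSN2/ArXiv")
#   print(top_keywords(topic_model, 0))
```

Returning an empty list for an unknown ID keeps the helper safe to call in a loop over arbitrary IDs.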

## Topic overview

* Number of topics: 171
* Number of training documents: 12693

<details>
<summary>Click here for an overview of all topics.</summary>

| Topic ID | Topic Keywords | Topic Frequency | Label |
|----------|----------------|-----------------|-------|
| -1 | the - and - to - of - in | 15 | -1_the_and_to_of |
| 0 | recipe - food - recipes - pizza - salad | 3814 | 0_recipe_food_recipes_pizza |
| 1 | trump - election - law - the - that | 849 | 1_trump_election_law_the |
| 2 | anaysa - fashion - pants - swimwear - sneakers | 393 | 2_anaysa_fashion_pants_swimwear |
| 3 | arsenal - liverpool - rugby - match - haaland | 382 | 3_arsenal_liverpool_rugby_match |
| 4 | weather - bengal - storm - west - snow | 271 | 4_weather_bengal_storm_west |
| 5 | crypto - bitcoin - cryptocurrency - gaming - trading | 172 | 5_crypto_bitcoin_cryptocurrency_gaming |
| 6 | her - she - was - on - related | 143 | 6_her_she_was_on |
| 7 | 420m - dog - animal - animals - dogs | 138 | 7_420m_dog_animal_animals |
| 8 | god - lord - prayer - jesus - church | 127 | 8_god_lord_prayer_jesus |
| 9 | cars - sale - used - under - for | 119 | 9_cars_sale_used_under |
| 10 | pro - vivo - v23 - phone - google | 117 | 10_pro_vivo_v23_phone |
| 11 | news - iptv - tv - interview - latest | 110 | 11_news_iptv_tv_interview |
| 12 | art - museum - artists - artist - of | 108 | 12_art_museum_artists_artist |
| 13 | my - nephews - nieces - poetry - love | 107 | 13_my_nephews_nieces_poetry |
| 14 | film - review - his - as - but | 102 | 14_film_review_his_as |
| 15 | bike - helmet - bikes - mountain - pilots | 98 | 15_bike_helmet_bikes_mountain |
| 16 | hair - bite - steel - care - haircut | 97 | 16_hair_bite_steel_care |
| 17 | police - rhonda - mcdowell - was - said | 90 | 17_police_rhonda_mcdowell_was |
| 18 | property - room - bedrooms - bedroom - home | 86 | 18_property_room_bedrooms_bedroom |
| 19 | ukraine - russia - russian - putin - news | 86 | 19_ukraine_russia_russian_putin |
| 20 | business - jobs - income - data - part | 86 | 20_business_jobs_income_data |
| 21 | vaccinated - vaccine - covid - va - unvaccinated | 84 | 21_vaccinated_vaccine_covid_va |
| 22 | music - band - students - orchestra - tickets | 83 | 22_music_band_students_orchestra |
| 23 | workout - abs - workouts - fitness - exercise | 83 | 23_workout_abs_workouts_fitness |
| 24 | school - teachers - dmc - 804 - children | 83 | 24_school_teachers_dmc_804 |
| 25 | women - robotics - bali - spanish - lutheran | 82 | 25_women_robotics_bali_spanish |
| 26 | lima - tourism - parks - urban - our | 79 | 26_lima_tourism_parks_urban |
| 27 | godzilla - movies - movie - spider - marvel | 77 | 27_godzilla_movies_movie_spider |
| 28 | fishing - backpacks - fish - packs - swimming | 74 | 28_fishing_backpacks_fish_packs |
| 29 | yoga - stretching - kru - nidra - oct | 74 | 29_yoga_stretching_kru_nidra |
| 30 | researchers - species - of - the - university | 72 | 30_researchers_species_of_the |
| 31 | wholesale - market - saree - delhi - software | 71 | 31_wholesale_market_saree_delhi |
| 32 | skin - acne - cream - blackheads - whitening | 70 | 32_skin_acne_cream_blackheads |
| 33 | rodents - pets - pest - dogs - animals | 70 | 33_rodents_pets_pest_dogs |
| 34 | books - book - salinger - fiction - literary | 67 | 34_books_book_salinger_fiction |
| 35 | class - pst - exams - preparation - test | 66 | 35_class_pst_exams_preparation |
| 36 | 5g - airlines - bsnl - flight - network | 64 | 36_5g_airlines_bsnl_flight |
| 37 | treetops - dementia - children - people - barbara | 62 | 37_treetops_dementia_children_people |
| 38 | lottery - thai - thailand - lotto - win | 62 | 38_lottery_thai_thailand_lotto |
| 39 | wedding - weddings - survival - gift - day | 61 | 39_wedding_weddings_survival_gift |
| 40 | quantum - solar - energy - material - light | 61 | 40_quantum_solar_energy_material |
| 41 | beauty - makeup - products - sephora - skin | 60 | 41_beauty_makeup_products_sephora |
| 42 | games - xbox - game - solitaire - free | 60 | 42_games_xbox_game_solitaire |
| 43 | insurance - insurers - insurer - company - aig | 59 | 43_insurance_insurers_insurer_company |
| 44 | green - saf - haiti - industry - solar | 58 | 44_green_saf_haiti_industry |
| 45 | diet - meat - foods - plant - body | 55 | 45_diet_meat_foods_plant |
| 46 | edinburgh - tour - royal - travel - castle | 55 | 46_edinburgh_tour_royal_travel |
| 47 | horses - horse - friesian - goëngamieden - post | 54 | 47_horses_horse_friesian_goëngamieden |
| 48 | your - you - mental - health - anal | 51 | 48_your_you_mental_health |
| 49 | weight - obesity - loss - lose - fat | 51 | 49_weight_obesity_loss_lose |
| 50 | estate - real - property - home - you | 50 | 50_estate_real_property_home |
| 51 | camping - surfing - guess - landmark - lego | 50 | 51_camping_surfing_guess_landmark |
| 52 | dorm - sex - birthday - my - joy | 50 | 52_dorm_sex_birthday_my |
| 53 | covid - 19 - vaccinated - vaccine - cases | 50 | 53_covid_19_vaccinated_vaccine |
| 54 | spain - morocco - gas - energy - industry | 49 | 54_spain_morocco_gas_energy |
| 55 | gardening - garden - grow - plants - fertilizer | 49 | 55_gardening_garden_grow_plants |
| 56 | tenant - transport - apartments - department - condos | 49 | 56_tenant_transport_apartments_department |
| 57 | cricket - england - engw - indw - vs | 48 | 57_cricket_england_engw_indw |
| 58 | trump - election - party - votes - former | 48 | 58_trump_election_party_votes |
| 59 | tesla - marine - electric - musk - ev | 47 | 59_tesla_marine_electric_musk |
| 60 | surf - surfing - ski - swimming - lessons | 47 | 60_surf_surfing_ski_swimming |
| 61 | disabled - disability - thailand - scholarship - scholarships | 47 | 61_disabled_disability_thailand_scholarship |
| 62 | programming - udemy - svelte - language - courses | 44 | 62_programming_udemy_svelte_language |
| 63 | diy - ideas - desk - wood - woodworking | 43 | 63_diy_ideas_desk_wood |
| 64 | wrestling - pearson - tiga - wwe - nfl | 43 | 64_wrestling_pearson_tiga_wwe |
| 65 | smart - gadgets - appliances - home - kitchen | 42 | 65_smart_gadgets_appliances_home |
| 66 | experiments - fu - kung - xxxtentacion - copyright | 40 | 66_experiments_fu_kung_xxxtentacion |
| 67 | job - small - businesses - hiring - business | 40 | 67_job_small_businesses_hiring |
| 68 | hiv - health - care - hospital - hospice | 40 | 68_hiv_health_care_hospital |
| 69 | he - was - it - empire - movie | 38 | 69_he_was_it_empire |
| 70 | beat - type - ringtone - lofi - beats | 37 | 70_beat_type_ringtone_lofi |
| 71 | castellvi - marines - marine - corps - county | 37 | 71_castellvi_marines_marine_corps |
| 72 | casino - xbox - game - games - poker | 37 | 72_casino_xbox_game_games |
| 73 | bellanaijaweddings - bride - handmadepaper - weddingplanner - makeup | 36 | 73_bellanaijaweddings_bride_handmadepaper_weddingplanner |
| 74 | music - jsem - bushcraft - se - festival | 36 | 74_music_jsem_bushcraft_se |
| 75 | gemini - tarot - horoscope - september - pisces | 35 | 75_gemini_tarot_horoscope_september |
| 76 | career - husni - magazines - magazine - employees | 35 | 76_career_husni_magazines_magazine |
| 77 | his - film - movie - review - but | 34 | 77_his_film_movie_review |
| 78 | gps - aircraft - trucks - vehicles - electric | 34 | 78_gps_aircraft_trucks_vehicles |
| 79 | raya - merch - magazines - cards - kongamidyearshoppingfestival | 34 | 79_raya_merch_magazines_cards |
| 80 | baby - she - birth - says - women | 34 | 80_baby_she_birth_says |
| 81 | covid - 19 - uk - health - interventions | 33 | 81_covid_19_uk_health |
| 82 | climate - gore - dm - eastman - change | 33 | 82_climate_gore_dm_eastman |
| 83 | buhari - anambra - apc - anyim - chief | 32 | 83_buhari_anambra_apc_anyim |
| 84 | orchestra - hotel - janice - chicago - symphony | 31 | 84_orchestra_hotel_janice_chicago |
| 85 | ramen - pierre - soulz - magic - westfieldcarousel | 31 | 85_ramen_pierre_soulz_magic |
| 86 | interior - design - home - decorate - bedroom | 30 | 86_interior_design_home_decorate |
| 87 | hindi - movie - explained - hollywood - lankybox | 30 | 87_hindi_movie_explained_hollywood |
| 88 | xbox - playstation - game - card - console | 30 | 88_xbox_playstation_game_card |
| 89 | insurance - car - policy - feener - policyworld | 30 | 89_insurance_car_policy_feener |
| 90 | share - nepal - stock - market - analysis | 29 | 90_share_nepal_stock_market |
| 91 | marketing - content - strategy - cart - your | 28 | 91_marketing_content_strategy_cart |
| 92 | songs - kids - song - rhymes - hindi | 28 | 92_songs_kids_song_rhymes |
| 93 | tax - cd - money - itr - 401 | 27 | 93_tax_cd_money_itr |
| 94 | inflation - housing - prices - chorley - hydrow | 27 | 94_inflation_housing_prices_chorley |
| 95 | venkat - spectre - spending - attacks - intel | 26 | 95_venkat_spectre_spending_attacks |
| 96 | band - grammys - recording - musical - doo | 26 | 96_band_grammys_recording_musical |
| 97 | drawing - draw - art - mandala - painting | 26 | 97_drawing_draw_art_mandala |
| 98 | shop - insurance - design - restaurant - food | 26 | 98_shop_insurance_design_restaurant |
| 99 | kamran - feride - iqiyi - drama - selim | 26 | 99_kamran_feride_iqiyi_drama |
| 100 | poetry - prize - mondaymotivation - publication - apologize | 26 | 100_poetry_prize_mondaymotivation_publication |
| 101 | jobs - tcs - part - job - work | 25 | 101_jobs_tcs_part_job |
| 102 | card - credit - rewards - cash - tracking | 25 | 102_card_credit_rewards_cash |
| 103 | vlog - vlogs - dexerto - video - blog | 25 | 103_vlog_vlogs_dexerto_video |
| 104 | brother - 5½ - burge - poetry - thank | 25 | 104_brother_5½_burge_poetry |
| 105 | anime - manga - disney - animes - recap | 25 | 105_anime_manga_disney_animes |
| 106 | fox - news - msnbc - biden - business | 25 | 106_fox_news_msnbc_biden |
| 107 | thoreau - wildness - maldives - malé - wildlife | 24 | 107_thoreau_wildness_maldives_malé |
| 108 | condo - minutes - rent - condominium - เช | 24 | 108_condo_minutes_rent_condominium |
| 109 | freshworks - sales - requirements - job - development | 24 | 109_freshworks_sales_requirements_job |
| 110 | insurance - management - property - company - loans | 24 | 110_insurance_management_property_company |
| 111 | aew - wrestling - highlights - esports - impact | 23 | 111_aew_wrestling_highlights_esports |
| 112 | ctv - cbc - _x000d_ - news - bridge | 23 | 112_ctv_cbc__x000d__news |
| 113 | ukrainian - music - lyatoshynsky - solos - concert | 23 | 113_ukrainian_music_lyatoshynsky_solos |
| 114 | abc - ladzinski - campaign - carlton - news | 23 | 114_abc_ladzinski_campaign_carlton |
| 115 | gaming - pc - headset - byte - cosmic | 23 | 115_gaming_pc_headset_byte |
| 116 | climate - environmental - noaa - literacy - education | 23 | 116_climate_environmental_noaa_literacy |
| 117 | game - players - sonic - its - the | 22 | 117_game_players_sonic_its |
| 118 | olympic - olympics - chen - biles - medal | 22 | 118_olympic_olympics_chen_biles |
| 119 | loans - loan - student - paying - naira | 22 | 119_loans_loan_student_paying |
| 120 | nail - art - nails - compilation - acrylic | 22 | 120_nail_art_nails_compilation |
| 121 | peppa - pig - wolfoo - nguyen - favorite | 21 | 121_peppa_pig_wolfoo_nguyen |
| 122 | jazz - music - blues - heat - waves | 21 | 122_jazz_music_blues_heat |
| 123 | rónán - march - composer - lyricist - tickets | 21 | 123_rónán_march_composer_lyricist |
| 124 | olympic - beijing - olympics - china - athletes | 21 | 124_olympic_beijing_olympics_china |
| 125 | smoking - breakover - smokers - heart - hind | 21 | 125_smoking_breakover_smokers_heart |
| 126 | pets - animals - pet - panda - dog | 21 | 126_pets_animals_pet_panda |
| 127 | cycling - gcn - bike - feroce - wheels | 21 | 127_cycling_gcn_bike_feroce |
| 128 | musique - proposée - libre - par - la | 21 | 128_musique_proposée_libre_par |
| 129 | male - girlfriend - roseanne - unagi - twohill | 20 | 129_male_girlfriend_roseanne_unagi |
| 130 | gymnastics - moana - always - drugs - week | 20 | 130_gymnastics_moana_always_drugs |
| 131 | musk - gambling - twitter - elon - deduction | 20 | 131_musk_gambling_twitter_elon |
| 132 | lichfield - google - sat - stoke - mon | 20 | 132_lichfield_google_sat_stoke |
| 133 | reasonable - greenhouse - accommodation - robots - ai | 20 | 133_reasonable_greenhouse_accommodation_robots |
| 134 | icebox - maxo - theme - kream - koo | 19 | 134_icebox_maxo_theme_kream |
| 135 | whio - ong - ang - canal - birds | 19 | 135_whio_ong_ang_canal |
| 136 | codyfight - tattooing - brothers - marriage - extreme | 19 | 136_codyfight_tattooing_brothers_marriage |
| 137 | nuro - gm - vehicle - vehicles - electric | 19 | 137_nuro_gm_vehicle_vehicles |
| 138 | kcs - railroads - cn - rail - stb | 19 | 138_kcs_railroads_cn_rail |
| 139 | strengths - music - grief - leisure - life | 19 | 139_strengths_music_grief_leisure |
| 140 | drones - drone - uae - missile - dhabi | 19 | 140_drones_drone_uae_missile |
| 141 | massage - dubai - jumeirah - japanese - oil | 18 | 141_massage_dubai_jumeirah_japanese |
| 142 | bowl - super - bengals - bet - rams | 18 | 142_bowl_super_bengals_bet |
| 143 | pension - 9news - pensions - pay - tax | 18 | 143_pension_9news_pensions_pay |
| 144 | dog - toy - pet - supplies - toys | 18 | 144_dog_toy_pet_supplies |
| 145 | english - travellers - students - course - syllabus | 18 | 145_english_travellers_students_course |
| 146 | mentoring - cbs - mentor - mentors - teachers | 18 | 146_mentoring_cbs_mentor_mentors |
| 147 | picnic - park - blankets - basket - acompañantes | 18 | 147_picnic_park_blankets_basket |
| 148 | orig - 99 - amazon - prime - dollar | 18 | 148_orig_99_amazon_prime |
| 149 | primary - english - genetics - wilanów - education | 18 | 149_primary_english_genetics_wilanów |
| 150 | hardin - film - he - she - oscar | 17 | 150_hardin_film_he_she |
| 151 | laptop - gaming - alienware - laptops - hp | 17 | 151_laptop_gaming_alienware_laptops |
| 152 | ufc - tmz - owens - onlyfans - tonight | 17 | 152_ufc_tmz_owens_onlyfans |
| 153 | basketball - vs - varsity - darien - canaan | 17 | 153_basketball_vs_varsity_darien |
| 154 | workers - hanford - state - law - doe | 17 | 154_workers_hanford_state_law |
| 155 | cdl - freight - broker - logistics - eldt | 17 | 155_cdl_freight_broker_logistics |
| 156 | builders - connell - brenton - firm - wage | 17 | 156_builders_connell_brenton_firm |
| 157 | bookstore - easter - my - menger - eastershelfie | 16 | 157_bookstore_easter_my_menger |
| 158 | prince - royal - duke - charles - queen | 16 | 158_prince_royal_duke_charles |
| 159 | ดตามเราได - จำก - มหาชน - voicetv - oppday | 16 | 159_ดตามเราได_จำก_มหาชน_voicetv |
| 160 | nba - trades - stream - espn - live | 16 | 160_nba_trades_stream_espn |
| 161 | school - students - science - brandon - twig | 16 | 161_school_students_science_brandon |
| 162 | morning - sleep - your - kaplan - routine | 16 | 162_morning_sleep_your_kaplan |
| 163 | kat - author - desires - louise - charmaine | 16 | 163_kat_author_desires_louise |
| 164 | movie - recapped - uche - academia - dizzyeight | 16 | 164_movie_recapped_uche_academia |
| 165 | awka - religion - suspects - anambra - echeng | 15 | 165_awka_religion_suspects_anambra |
| 166 | wrc - f1 - rally - championship - formula1 | 15 | 166_wrc_f1_rally_championship |
| 167 | hillstream - algae - scape - goby - aquarium | 15 | 167_hillstream_algae_scape_goby |
| 168 | skin - filler - touche - éclat - dermal | 15 | 168_skin_filler_touche_éclat |
| 169 | pets - cats - hopkins - cat - niblo | 15 | 169_pets_cats_hopkins_cat |

</details>
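
The Label column appears to follow a fixed pattern: the topic ID followed by the first four keywords, all joined by underscores. A minimal sketch of that convention, inferred from the rows in the table rather than taken from BERTopic's source; `make_label` is a hypothetical helper:

```python
def make_label(topic_id, keywords, n_words=4):
    """Join a topic ID and its first n_words keywords with underscores."""
    return "_".join([str(topic_id), *keywords[:n_words]])


print(make_label(0, ["recipe", "food", "recipes", "pizza", "salad"]))
# prints: 0_recipe_food_recipes_pizza
```

Because a keyword can itself contain underscores (topic 112 includes `_x000d_`), splitting a label on `_` does not reliably recover the keywords. The Topic ID column is the safer key for lookups.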

## Training hyperparameters

* calculate_probabilities: False
* language: None
* low_memory: False
* min_topic_size: 10
* n_gram_range: (1, 1)
* nr_topics: None
* seed_topic_list: None
* top_n_words: 10
* verbose: True
* zeroshot_min_similarity: 0.7
* zeroshot_topic_list: None
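
These names match keyword arguments of the `BERTopic` constructor. As a rough sketch, an untrained model with the same settings could be configured as below, assuming a BERTopic release that accepts the zero-shot arguments. This does not reproduce the published model: the embedding, UMAP, and HDBSCAN components are not listed on this card, and `language: None` likely means a custom embedding model was supplied at training time. `build_model` is a hypothetical helper.

```python
# Hyperparameters copied from the list above.
HYPERPARAMETERS = {
    "calculate_probabilities": False,
    "language": None,
    "low_memory": False,
    "min_topic_size": 10,
    "n_gram_range": (1, 1),
    "nr_topics": None,
    "seed_topic_list": None,
    "top_n_words": 10,
    "verbose": True,
    "zeroshot_min_similarity": 0.7,
    "zeroshot_topic_list": None,
}


def build_model(**overrides):
    """Create an untrained BERTopic instance with the listed settings.

    Keyword overrides (for example embedding_model=...) replace or extend
    the listed values before they are passed to the constructor.
    """
    # Imported lazily so the settings can be inspected without bertopic installed.
    from bertopic import BERTopic

    return BERTopic(**{**HYPERPARAMETERS, **overrides})
```

Calling `build_model(embedding_model=my_embedder)` would pass a custom embedding model alongside the listed settings, where `my_embedder` stands for whatever model you choose.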

## Framework versions

* Numpy: 1.23.5
* HDBSCAN: 0.8.33
* UMAP: 0.5.5
* Pandas: 1.5.3
* Scikit-Learn: 1.2.2
* Sentence-transformers: 2.2.2
* Transformers: 4.36.0
* Numba: 0.58.1
* Plotly: 5.15.0
* Python: 3.10.12
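
To check whether a local environment matches these versions, the installed packages can be compared against the list. A minimal sketch using the standard library's `importlib.metadata`; the PyPI distribution names (for example `umap-learn` for UMAP) are assumptions on my part, and `compare_versions` is a hypothetical helper. The Python version itself is not checked here.

```python
from importlib.metadata import PackageNotFoundError, version

# Versions from the list above, keyed by assumed PyPI distribution name.
EXPECTED_VERSIONS = {
    "numpy": "1.23.5",
    "hdbscan": "0.8.33",
    "umap-learn": "0.5.5",
    "pandas": "1.5.3",
    "scikit-learn": "1.2.2",
    "sentence-transformers": "2.2.2",
    "transformers": "4.36.0",
    "numba": "0.58.1",
    "plotly": "5.15.0",
}


def compare_versions(expected=EXPECTED_VERSIONS):
    """Map each package to (expected, installed); installed is None if absent."""
    report = {}
    for name, wanted in expected.items():
        try:
            installed = version(name)
        except PackageNotFoundError:
            installed = None
        report[name] = (wanted, installed)
    return report


for name, (wanted, installed) in compare_versions().items():
    status = "ok" if installed == wanted else "differs"
    print(f"{name}: expected {wanted}, installed {installed} ({status})")
```

A mismatch is not necessarily a problem, but it is a useful first thing to check if loading the model fails.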