UCTopic

This repository contains the code of model UCTopic and an easy-to-use tool UCTopicTool used for Topic Mining, Unsupervised Aspect Extractioin or Phrase Retrieval.

Our ACL 2022 paper UCTopic: Unsupervised Contrastive Learning for Phrase Representations and Topic Mining.

Quick Links

Overview
Pretrained Model
Getting Started
- UCTopic Model
- UCTopicTool
Experiments in Paper
Pretraining
Contact
Citation

Overview

We propose UCTopic, a novel unsupervised contrastive learning framework for context-aware phrase representations and topic mining. UCTopic is pretrained in a large scale to distinguish if the contexts of two phrase mentions have the same semantics. The key to pretraining is positive pair construction from our phrase-oriented assumptions. However, we find traditional in-batch negatives cause performance decay when finetuning on a dataset with small topic numbers. Hence, we propose cluster-assisted contrastive learning(CCL) which largely reduces noisy negatives by selecting negatives from clusters and further improves phrase representations for topics accordingly.

Pretrained Model

Our released model:

Model	Note
uctopic-base	Pretrained UCTopic model based on LUKE-BASE

Unzip to get uctopic-base folder.

Getting Started

We provide an easy-to-use phrase representation tool based on our UCTopic model. To use the tool, first install the uctopic package from PyPI

pip install uctopic

Or directly install it from our code

python setup.py install

UCTopic Model

After installing the package, you can load our model by just two lines of code

from uctopic import UCTopic
model = UCTopic.from_pretrained('JiachengLi/uctopic-base')

The model will automatically download pre-trained parameters from HuggingFace's models. If you encounter any problem when directly loading the models by HuggingFace's API, you can also download the models manually from the above table and use model = UCTopic.from_pretrained({PATH TO THE DOWNLOAD MODEL}).

To get pre-trained phrase representations, our model inputs are same as LUKE. Note: please input only ONE span each time, otherwise, will have performance decay according to our empirical results.

from uctopic import UCTopicTokenizer, UCTopic

tokenizer = UCTopicTokenizer.from_pretrained('JiachengLi/uctopic-base')
model = UCTopic.from_pretrained('JiachengLi/uctopic-base')

text = "Beyoncé lives in Los Angeles."
entity_spans = [(17, 28)] # character-based entity span corresponding to "Los Angeles"

inputs = tokenizer(text, entity_spans=entity_spans, add_prefix_space=True, return_tensors="pt")
outputs, phrase_repr = model(**inputs)

phrase_repr is the phrase embedding (size [768]) of the phrase Los Angeles. outputs has the same format as the outputs from LUKE.

UCTopicTool

We provide a tool UCTopicTool built on UCTopic for efficient phrase encoding, topic mining (or unsupervised aspect extraction) or phrase retrieval.

Initialization

UCTopicTool is initialized by giving the model_name_or_path and device.

from uctopic import UCTopicTool

topic_tool = UCTopicTool('JiachengLi/uctopic-base', device='cuda:0')

Phrase Encoding

Phrases are encoded by our method UCTopicTool.encode in batches, which is more efficient than UCTopic.

phrases = [["This place is so much bigger than others!", (0, 10)],
           ["It was totally packed and loud.", (15, 21)],
           ["Service was on the slower side.", (0, 7)],
           ["I ordered 2 mojitos: 1 lime and 1 mango.", (12, 19)],
           ["The ingredient weren't really fresh.", (4, 14)]]

embeddings = topic_tool.encode(phrases) # len(embeddings) is equal to len(phrases)

Note: Each instance in phrases contains only one sentence and one span (character-level position) in format [sentence, span].

Arguments for UCTopicTool.encode are as follows,

phrase (List) - A list of [sentence, span] to be encoded.
return_numpy (bool, optional, defaults to False) - Return numpy.array or torch.Tensor.
normalize_to_unit (bool, optional, defaults to True) - Normalize all embeddings to unit vectors.
keepdim (bool, optional, defaults to True) - Keep dimension size [instance_number, hidden_size].
batch_size (int, optional, defaults to 64) - The size of mini-batch in the model.

Topic Mining and Unsupervised Aspect Extraction

The method UCTopicTool.topic_mining can mine topical phrases or conduct aspect extraction from sentences with or without spans.

sentences = ["This place is so much bigger than others!",
             "It was totally packed and loud.",
             "Service was on the slower side.",
             "I ordered 2 mojitos: 1 lime and 1 mango.",
             "The ingredient weren't really fresh."]

spans = [[(0, 10)],                       # This place
         [(15, 21), (26, 30)],            # packed; loud
         [(0, 7)],                        # Service
         [(12, 19), (21, 27), (32, 39)],  # mojitos; 1 lime; 1 mango
         [(4, 14)]]                       # ingredient
# len(sentences) is equal to len(spans)
output_data, topic_phrase_dict = tool.topic_mining(sentences, spans, \
                                                   n_clusters=[15, 25])

# predict topic for new phrases
phrases = [["The food here is amazing!", (4, 8)],
           ["Lovely ambiance with live music!", (21, 31)]]

topics = tool.predict_topic(phrases)

Note: If spans is not given, UCTopicTool will extract noun phrases by spaCy.

Arguments for UCTopicTool.topic_mining are as follows,

Data arguments:

sentences (List) - A List of sentences for topic mining.
spans (List, optional, defaults to None) - A list of span list corresponding sentences, e.g., [[(0, 9), (5, 7)], [(1, 2)]] and len(sentences)==len(spans). If None, automatically mine phrases from noun chunks.

Clustering arguments:

n_clusters (int or List, optional, defaults to 2) - The number of topics. When n_clusters is a list, n_clusters[0] and n_clusters[1] will be the minimum and maximum numbers to search, n_clusters[2] is the search step length (if not provided, default to 1).
meric (str, optional, defaults to "cosine") - The metric to measure the distance between vectors. "cosine" or "euclidean".
batch_size (int, optional, defaults to 64) - The size of mini-batch for phrase encoding.
max_iter (int, optional, defaults to 300) - The maximum iteration number of kmeans.

CCL-finetune arguments:

ccl_finetune (bool, optional, defaults to True) - Whether to conduct CCL-finetuning in the paper.
batch_size_finetune (int, optional, defaults to 8) - The size of mini-batch for finetuning.
max_finetune_num (int, optional, defaults to 100000) - The maximum number of training instances for finetuning.
finetune_step (int, optional, defaults to 2000) - The number of training steps for finetuning.
contrastive_num (int, optional, defaults to 5) - The number of negatives in contrastive learning.
positive_ratio (float, optional, defaults to 0.1) - The ratio of the most confident instances for finetuning.
n_sampling (int, optional, defaults to 10000) - The number of sampled examples for cluster number confirmation and finetuning. Set to -1 to use the whole dataset.
n_workers (int, optional, defaults to 8) - The number of workers for preprocessing data.

Returns for UCTopicTool.topic_mining are as follows,

output_data (List) - A list of sentences and corresponding phrases and topic numbers. Each element is [sentence, [[start1, end1, topic1], [start2, end2, topic2]]].
topic_phrase_dict (Dict) - A dictionary of topics and the list of phrases under a topic. The phrases are sorted by their confidence scores. E.g., {topic: [[phrase1, score1], [phrase2, score2]]}.

The method UCTopicTool.predict_topic predicts the topic ids for new phrases based on your training results from UCTopicTool.topic_mining. The inputs of UCTopicTool.predict_topic are same as UCTopicTool.encode and returns a list of topic ids (int).

Phrase Similarities and Retrieval

The method UCTopicTool.similarity compute the cosine similarities between two groups of phrases:

phrases_a = [["This place is so much bigger than others!", (0, 10)],
           ["It was totally packed and loud.", (15, 21)]]

phrases_b = [["Service was on the slower side.", (0, 7)],
           ["I ordered 2 mojitos: 1 lime and 1 mango.", (12, 19)],
           ["The ingredient weren't really fresh.", (4, 14)]]

similarities = tool.similarity(phrases_a, phrases_b)

Arguments for UCTopicTool.similarity are as follows,

queries (List) - A list of [sentence, span] as queries.
keys (List or numpy.array) - A list of [sentence, span] as keys or phrase representations (numpy.array) from UCTopicTool.encode.
batch_size (int, optional, defaults to 64) - The size of mini-batch in the model.

UCTopicTool.similarity returns a numpy.array contains the similarities between phrase pairs in two groups.

The methods UCTopicTool.build_index and UCTopicTool.search are used for phrase retrieval:

phrases = [["This place is so much bigger than others!", (0, 10)],
           ["It was totally packed and loud.", (15, 21)],
           ["Service was on the slower side.", (0, 7)],
           ["I ordered 2 mojitos: 1 lime and 1 mango.", (12, 19)],
           ["The ingredient weren't really fresh.", (4, 14)]]

# query multiple phrases
query1 = [["The food here is amazing!", (4, 8)],
           ["Lovely ambiance with live music!", (21, 31)]]  

# query single phrases
query2 = ["The food here is amazing!", (4, 8)]

tool.build_index(phrases)
results = tool.search(query1, top_k=3)
# or
results = tool.search(query2, top_k=3)

We also support faiss, an efficient similarity search library. Just install the package following instructions here and UCTopicTool will automatically use faiss for efficient search.

UCTopicTool.search returns the ranked top k phrases for each query.

Save and Load finetuned UCTopicTool

The methods UCTopicTool.save and UCTopicTool.load are used for save and load all paramters of UCTopicTool.

Save:

tool = UCTopicTool('JiachengLi/uctopic-base', 'cuda:0')
# finetune UCTopic with CCL
output_data, topic_phrase_dict = tool.topic_mining(sentences, spans, \
                                                   n_clusters=[15, 25])

tool.save(**your directory**)

Load:

tool = UCTopicTool('JiachengLi/uctopic-base', 'cuda:0')
tool.load(**your directory**)

The loaded parameters will be used for all methods (for encoding, topic mining, phrase similarities and retrieval) introduced above.

Experiments

In this section, we re-implement experiments in our paper.

Requirements

First, install PyTorch by following the instructions from the official website. To faithfully reproduce our results, please use the correct 1.9.0 version corresponding to your platforms/CUDA versions.

Then run the following script to install the remaining dependencies,

pip install -r requirements.txt

Download en_core_web_sm model from spacy,

python -m spacy download en_core_web_sm

Datasets

The downstream datasets used in our experiments can be downloaded from here.

Entity Clustering

The config file of entity clustering is clustering/consts.py and most arguments are self-explained. Please setup --gpu and --data_path before running. The clustering scores will be printed.

Clustering with our pre-trained phrase embeddings.

python clustering.py --gpu 0

Clustering with our pre-trained phrase embeddings and Cluster-Assisted Constrastive Learning (CCL) proposed in our paper.

python clustering_ccl_finetune.py --gpu 0

Topic Mining

The config file of entity clustering is topic_modeling/consts.py.

Key Argument Table

Arguments	Description
--num_classes	Min and Max number of classes, e.g., `[5, 15]`. Our model will find the class number by silhouette_score.
--sample_num_cluster	Number of sampled phrases to confirm class number.
--sample_num_finetune	Number of sampled phrases for CCL finetuning.
--contrastive_num	Number of negative classes for CCL finetuning.
--finetune_step	CCL finetuning steps (maximum global steps for finetuning).

Tips: Please tune --batch_size or --contrastive_num for suitable GPU memory usage.

Topic mining with our pre-trained phrase embeddings and Cluster-Assisted Constrastive Learning (CCL) proposed in our paper.

python find_topic.py --gpu 0

Outputs

We output three files under topic_results:

File Name	Description
`merged_phraes_pred_prob.pickle`	A dictionary of phrases and their topic number and prediction probability. A topic of a phrase is merged from all phrase mentioins. `{phrase: [topic_id, probability]}`, e.g., {'fair prices': [0, 0.34889686]}
`phrase_instances_pred.json`	A list of all mined phrase mentions. Each element is `[[doc_id, start, end, phrase_mention], topic_id]`.
`topics_phrases.json`	A dictionary of topics and corresponding phrases sorted by probability. `{'topic_id': [[phrase1, prob1], [phrase2, prob2]]}`

Pretraining

Data

For unsupervised pretraining of UCTopic, we use article and span with links from English Wikipedia and Wikidata. Our processed dataset can be downloaded from here.

Training scripts

We provide example training scripts and our default training parameters for unsupervised training of UCTopic in run_example.sh.

bash run_example.sh

Arguments description can be found in pretrain.py. All the other arguments are standard Huggingface's transformers training arguments.

Convert models

Our pretrained checkpoints are slightly different from the checkpoint uctopic-base. Please refer convert_uctopic_parameters.py to convert it.

Contact

If you have any questions related to the code or the paper, feel free to email Jiacheng ([email protected]). If you encounter any problems when using the code, or want to report a bug, you can open an issue. Please try to specify the problem with details so we can help you better and quicker!

Citation

Please cite our paper if you use UCTopic in your work:

@inproceedings{Li2022UCTopicUC,
    title = "{UCT}opic: Unsupervised Contrastive Learning for Phrase Representations and Topic Mining",
    author = "Li, Jiacheng  and
      Shang, Jingbo  and
      McAuley, Julian",
    booktitle = "Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)",
    month = may,
    year = "2022",
    address = "Dublin, Ireland",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2022.acl-long.426",
    doi = "10.18653/v1/2022.acl-long.426",
    pages = "6159--6169"
}