Contra-Topic-bottleneck-t5-large: Linear Topic Extraction using Bottleneck T5

A lightweight approach to topic extraction leveraging the Bottleneck T5 autoencoder architecture with learned transformation matrices. This project provides three specialized transformation matrices for mapping content embeddings to topic embeddings across different domains.


TL;DR: Transform content embeddings into topic embeddings using domain-specific 1024×1024 transformation matrices, one learned for each of three distinct datasets. Built on top of the Bottleneck T5 architecture, the approach extracts topics efficiently without training or fine-tuning the base model.

Motivation

Large Language Models (LLMs) have become the go-to solution for many NLP tasks, including topic extraction and classification. However, they come with significant overhead:

High computational requirements
Large memory footprint
Considerable inference latency
Complex deployment needs
Limited to pre-specified classes

This project offers a lightweight alternative specifically for topic extraction by leveraging the semantic structure of the Bottleneck T5's latent space. Instead of training a new model or fine-tuning existing ones, we learn a simple linear transformation between content and topic embeddings, providing:

Fast inference (milliseconds)
Minimal memory footprint (single 1024×1024 matrix per domain)
Simple deployment (basic matrix multiplication)
No need for GPU at inference time
Generative in nature (topics are decoded as free text rather than picked from a pre-specified class list)

Architecture Overview

Base Model

Uses Bottleneck T5 Large (thesephist/contra-bottleneck-t5-large-wikipedia)
Fixed embedding dimension: 1024
Pre-trained on Wikipedia data
Autoencoder architecture with attention pooling

Transformation Layers

Three domain-specific transformation matrices (1024×1024 each)
Linear mapping from content to topic space
Learned using simple Mean Squared Error optimization
Total additional parameters: ~1M per domain (1024 × 1024 = 1,048,576), ~3M across all three domains
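
To make the idea of a linear mapping learned under a Mean Squared Error objective concrete, here is a minimal sketch on synthetic data. It assumes NumPy and uses the closed-form least-squares solution, which minimizes the same squared-error objective; this README does not state which optimizer was actually used, and the names `fit_transformation`, `content`, and `topic` are illustrative only. A dimension of 8 stands in for 1024 to keep the example quick.

```python
import numpy as np

def fit_transformation(content: np.ndarray, topic: np.ndarray) -> np.ndarray:
    """Return W minimizing mean((content @ W - topic) ** 2).

    content: (n_samples, dim) content embeddings
    topic:   (n_samples, dim) matching topic embeddings
    """
    # lstsq solves min_W ||content @ W - topic||^2, which is the MSE
    # objective up to a constant factor.
    W, *_ = np.linalg.lstsq(content, topic, rcond=None)
    return W

def mse(pred: np.ndarray, target: np.ndarray) -> float:
    return float(np.mean((pred - target) ** 2))

# Synthetic stand-in data: topics are a noisy linear function of content.
rng = np.random.default_rng(0)
dim, n = 8, 200
true_W = rng.normal(size=(dim, dim))
content = rng.normal(size=(n, dim))
topic = content @ true_W + 0.01 * rng.normal(size=(n, dim))

# Fit on the first 80% of samples and evaluate on the remaining 20%.
split = int(0.8 * n)
W = fit_transformation(content[:split], topic[:split])
train_mse = mse(content[:split] @ W, topic[:split])
test_mse = mse(content[split:] @ W, topic[split:])
print(f"train MSE={train_mse:.6f}  test MSE={test_mse:.6f}")
```

With real data, `content` and `topic` would hold Bottleneck T5 embeddings of each text and of its topic label, and gradient descent on the same loss would be an equally valid way to fit `W`.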

Datasets and Performance Metrics

1. ArXiv Abstracts Dataset (ankitagr01/dynamic_topic_modeling_arxiv_abstracts)

Scientific paper abstracts paired with their research topics, providing a test bed for academic content classification.

Performance Metrics:

Training MSE: 0.00225 (error on samples used to learn transformation)
Testing MSE: 0.00268 (error on held-out validation set)
Inter-topic MSE: 0.00620 (minimum distance between different topic embeddings)

Use Cases:

Automated paper categorization
Research trend analysis
Academic content recommendation

2. TopicSUM Dataset (knkarthick/topicsum)

241,171 dialogue samples with human-annotated topic labels, ideal for conversational content analysis.

Performance Metrics:

Training MSE: 0.00252
Testing MSE: 0.00255
Inter-topic MSE: 0.00737

Use Cases:

Meeting summarization
Customer service dialogue categorization
Chat log analysis

3. MSD Manual Topics (nuvocare/MSD_manual_topics_user_base)

Medical content from Merck's Manual, featuring both professional and patient-oriented content.

Performance Metrics:

Training MSE: 0.00174
Testing MSE: 0.00197
Inter-topic MSE: 0.00566

Use Cases:

Medical document classification
Healthcare content organization
Patient information routing

Understanding the Metrics

Computational Requirements

Resource	Requirement	Notes
Storage	~4 MB per matrix	1024×1024 float32 values (4 bytes each)
Memory	~13 MB total	All three domain matrices
Inference Time	~10 ms	On CPU, per text sample
Training Hardware	P100 GPU	Free tier on Kaggle
Training Time	~4 hours total	Mostly embedding generation
Base Model	~770M parameters	Needed to create embeddings and to decode topics into text
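
The storage figures follow from the matrix shape and the width of each stored value. The calculation below is a small sketch in plain Python; `matrix_megabytes` is an illustrative helper, and the float64 line is included only for comparison, since the on-disk dtype of the published matrices is not stated here.

```python
def matrix_megabytes(rows: int, cols: int, bytes_per_value: int) -> float:
    """Size of a dense matrix in megabytes (1 MB = 10**6 bytes)."""
    return rows * cols * bytes_per_value / 1e6

n_params = 1024 * 1024                        # weights in one domain matrix
float32_mb = matrix_megabytes(1024, 1024, 4)  # 4 bytes per float32 value
float64_mb = matrix_megabytes(1024, 1024, 8)  # 8 bytes per float64 value

print(f"parameters per matrix: {n_params:,}")
print(f"float32: {float32_mb:.1f} MB per matrix, {3 * float32_mb:.1f} MB for three")
print(f"float64: {float64_mb:.1f} MB per matrix, {3 * float64_mb:.1f} MB for three")
# parameters per matrix: 1,048,576
# float32: 4.2 MB per matrix, 12.6 MB for three
# float64: 8.4 MB per matrix, 25.2 MB for three
```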

Performance Metrics Explained

Training MSE (Mean Squared Error)
- Measures how well the transformation matrix maps content to topic embeddings
- Calculated on the 80% training split
- Lower values indicate better alignment between transformed content and actual topic embeddings
Testing MSE
- Same metric but on 20% held-out test set
- Indicates generalization capability
- Similar train and test values suggest good generalization; a test MSE slightly above the training MSE is expected
Inter-topic MSE
- Minimum squared distance between any pair of topic embeddings
- Higher values indicate better topic separation
- Critical for preventing topic confusion
- Example: MSD's 0.00566 means medical topics maintain distinct representations
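
The metrics above can be sketched in a few lines. This example assumes NumPy and assumes that inter-topic MSE is the mean squared difference per dimension, minimized over all pairs of topic embeddings; the exact formula is not given in this README. The toy 4-dimensional vectors and the helper names are hypothetical.

```python
import numpy as np
from itertools import combinations

def mse(a, b) -> float:
    """Mean squared difference between two embeddings."""
    return float(np.mean((np.asarray(a) - np.asarray(b)) ** 2))

def min_inter_topic_mse(topic_embeddings: np.ndarray) -> float:
    """Smallest MSE between any two distinct topic embeddings."""
    return min(mse(a, b) for a, b in combinations(topic_embeddings, 2))

# Toy topic embeddings; topics 0 and 2 are deliberately close together.
topics = np.array([
    [1.0, 0.0, 0.0, 0.0],
    [0.0, 1.0, 0.0, 0.0],
    [0.9, 0.1, 0.0, 0.0],
])
separation = min_inter_topic_mse(topics)

# A transformed content embedding that lands near topic 0.
predicted = np.array([0.98, 0.02, 0.0, 0.0])
errors = [mse(predicted, t) for t in topics]
closest = int(np.argmin(errors))

print("inter-topic MSE:", separation)
print("error to each topic:", errors)
print("closest topic:", closest)
```

When the error to the correct topic stays well below the inter-topic MSE, the prediction cannot be mistaken for the nearest competing topic, which is why the two numbers are worth reading together.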

Comparative Analysis

MSD dataset shows best training performance (0.00174 MSE)
- Likely due to well-structured medical vocabulary
- Clear topic boundaries in medical domain
TopicSUM has highest inter-topic MSE (0.00737)
- Reflects diverse nature of conversational topics
- Important for distinguishing between varied dialogue contexts
ArXiv sits between the other two on training and inter-topic MSE, but has the highest test MSE (0.00268)
- Scientific content has natural overlap between fields
- Still maintains good topic separation (0.00620 inter-topic MSE)
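
One way to put the three datasets side by side is to derive two quantities from the reported numbers: the gap between test and training MSE, and the ratio of inter-topic MSE to test MSE. The sketch below uses only figures already listed in this README; `generalization_gap` and `separation_ratio` are illustrative names for derived values that the author does not report.

```python
# Reported metrics, copied from the dataset sections above.
metrics = {
    "ArXiv":    {"train": 0.00225, "test": 0.00268, "inter_topic": 0.00620},
    "TopicSUM": {"train": 0.00252, "test": 0.00255, "inter_topic": 0.00737},
    "MSD":      {"train": 0.00174, "test": 0.00197, "inter_topic": 0.00566},
}

summary = {}
for name, m in metrics.items():
    summary[name] = {
        # How much larger the held-out error is than the training error.
        "generalization_gap": m["test"] - m["train"],
        # How many times larger the closest topic pair is than the test error.
        "separation_ratio": m["inter_topic"] / m["test"],
    }

for name, s in summary.items():
    print(f"{name:9s} gap={s['generalization_gap']:.5f}  ratio={s['separation_ratio']:.2f}")
```

On these numbers the inter-topic MSE is roughly 2.3 to 2.9 times the test MSE in every domain, and TopicSUM shows the smallest train/test gap.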

Implementation

Try it out in Colab: https://colab.research.google.com/drive/1_SuTiL3QS-PUYjSrugqqD5mQlMv8Hbfc?usp=sharing

1. Base Model Wrapper

import torch
import torch.nn.functional as F
from transformers import AutoTokenizer, AutoModelForCausalLM

class BottleneckT5Autoencoder:
    def __init__(self, model_path: str, device='cpu'):
        self.device = device
        self.tokenizer = AutoTokenizer.from_pretrained(model_path, model_max_length=512)
        self.model = AutoModelForCausalLM.from_pretrained(
            model_path, 
            trust_remote_code=True
        ).to(device)
        self.model.eval()

    @torch.no_grad()
    def embed(self, text: str) -> torch.FloatTensor:
        inputs = self.tokenizer(text, return_tensors='pt').to(self.device)
        decoder_inputs = self.tokenizer('', return_tensors='pt').to(self.device)
        return self.model(
            **inputs,
            decoder_input_ids=decoder_inputs['input_ids'],
            encode_only=True,
        )[0]

    @torch.no_grad()
    def generate_from_latent(self, latent: torch.FloatTensor, max_length=512, temperature=1.0) -> str:
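        # Encode a placeholder input and store its offset from `latent` on the model.
        # The model's remote code is expected to apply `perturb_vector` during
        # generation, so decoding the placeholder yields text for `latent`.
        dummy_text = '.'
        dummy = self.embed(dummy_text)
        perturb_vector = latent - dummy
        self.model.perturb_vector = perturb_vector
        input_ids = self.tokenizer(dummy_text, return_tensors='pt').to(self.device).input_ids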
        output = self.model.generate(
            input_ids=input_ids,
            max_length=max_length,
            do_sample=True,
            temperature=temperature,
            top_p=0.9,
            num_return_sequences=1,
        )
        return self.tokenizer.decode(output[0], skip_special_tokens=True)

2. Topic Mapper

Transformations Available:

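import requests

url = 'https://huggingface.co/AmanPriyanshu/Contra-Topic-bottleneck-t5-large/resolve/main/transformation_matrix_arxiv.pt'
file_path = 'transformation_matrix.pt'
response = requests.get(url)
response.raise_for_status()  # fail loudly rather than saving an error page
with open(file_path, 'wb') as f:
    f.write(response.content)
# weights_only=False unpickles arbitrary objects, so only load files you trust.
transformation_matrix = torch.load(file_path, weights_only=False).float()
print(transformation_matrix.shape, type(transformation_matrix))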

3. Final Conversion


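model_path = 'thesephist/contra-bottleneck-t5-large-wikipedia'
device = 'cuda' if torch.cuda.is_available() else 'cpu'
content = 'Your input text goes here.'  # placeholder: replace with the text to analyze

autoencoder = BottleneckT5Autoencoder(model_path=model_path, device=device)
content_embedding = autoencoder.embed(content)
# Keep the matrix on the same device as the embedding before multiplying.
topic_embedding = content_embedding @ transformation_matrix.to(device)
topic = autoencoder.generate_from_latent(topic_embedding)
print(topic)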
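
The conversion step is a right-multiplication, so each embedding is treated as a row vector and several embeddings can be transformed at once by stacking them. The sketch below illustrates the shapes with NumPy and random stand-in values; it assumes `embed` yields a 1024-dimensional vector, and none of the arrays here are real embeddings or the published matrix.

```python
import numpy as np

rng = np.random.default_rng(0)
dim = 1024

# Random stand-ins: real values would come from autoencoder.embed(...)
# and from the downloaded transformation matrix.
transformation_matrix = rng.normal(size=(dim, dim)).astype(np.float32)
single_embedding = rng.normal(size=dim).astype(np.float32)       # shape (1024,)
batch_embeddings = rng.normal(size=(5, dim)).astype(np.float32)  # shape (5, 1024)

# Right-multiplication maps each row from content space to topic space.
single_topic = single_embedding @ transformation_matrix  # shape (1024,)
batch_topics = batch_embeddings @ transformation_matrix  # shape (5, 1024)

print(single_topic.shape, batch_topics.shape)
```

Each row of `batch_topics` would then be decoded separately with `generate_from_latent`.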

Limitations and Future Work

Representation Quality
- System inherits Bottleneck T5's encoding limitations
- Performance depends on input text fitting model's training distribution
Domain Specificity
- Each matrix is domain-optimized
- Cross-domain performance not guaranteed
- Future work: Investigate domain adaptation techniques
Fixed Dimensionality
- Locked to Bottleneck T5's 1024D space
- Potential future work: Dimension reduction studies
Linear Transformation Limitations
- Assumes linear relationship between content and topic spaces
- Future work: Explore non-linear transformations

Memory and Computation Requirements

Transformation Matrix: 1024 × 1024 × 4 bytes ≈ 4 MB per domain (float32)
Inference Time: ~10ms on CPU (matrix multiplication)
Total Model Size: ~13 MB (all three domains)
Base Model: ~770M parameters (needed to create embeddings and to decode topics into text)
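
To check the matrix-multiplication cost on your own hardware, a rough timing loop is enough. This sketch assumes NumPy and uses random stand-in values; it times only the transformation step, not embedding or decoding with the base model, and results will vary by machine.

```python
import time
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(size=(1024, 1024)).astype(np.float32)  # stand-in matrix
x = rng.normal(size=1024).astype(np.float32)          # stand-in embedding

runs = 100
start = time.perf_counter()
for _ in range(runs):
    y = x @ W  # one content-to-topic transformation
elapsed_ms = (time.perf_counter() - start) * 1000 / runs

print(f"mean time per transformation: {elapsed_ms:.3f} ms")
```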

Acknowledgments

Special thanks to:

Linus Lee (@thesephist) for the Bottleneck T5 model
The T5 team at Google Research
Dataset providers:
- @ankitagr01 for the ArXiv abstracts dataset
- @knkarthick for the TopicSUM dataset
- @nuvocare for the MSD Manual topics dataset
Kaggle for providing free P100 GPU resources

License

MIT

Contributing

Contributions are welcome! Please feel free to submit a Pull Request.
