dleemiller committed on
Commit
0e1a7fc
1 Parent(s): 8c5dfed

Upload README.md with huggingface_hub

Files changed (1)
  1. README.md +167 -39
README.md CHANGED
@@ -1,50 +1,85 @@
- ---
- license: mit
- language:
- - en
- datasets:
- - sentence-transformers/all-nli
- - sentence-transformers/gooaq
- ---
- # wordllama
-
- ## Installation
-
- Use the github repo or install via pip: https://github.com/dleemiller/WordLlama
- ```python
- pip install wordllama
- ```

- ## Intended Use

- This model is intended for use in natural language processing applications that require text embeddings, such as text classification, sentiment analysis, and document clustering.
- It's a token embedding model that is comparable to word embedding models, but substantionally smaller in size (16mb default 256-dim model).

  ```python
  from wordllama import WordLlama

  wl = WordLlama.load()
  similarity_score = wl.similarity("i went to the car", "i went to the pawn shop")
  print(similarity_score) # Output: 0.06641249096796882
  ```

- ## Model Architecture

- Wordllama is based on token embedding codebooks extracted from large language models.
- It is trained like a general embedding, with MultipleNegativesRankingLoss using the sentence transformers library,
- using Matryoshka Representation Learning so that embeddings can be truncated to 64, 128, 256, 512 or 1024 dimensions.

- To create WordLlama L2 "supercat", we extract and concatenate the token embedding codebooks from several large language models that
- use the llama2 tokenizer vocabulary (32k vocab size). This includes models like Llama2 70B and Phi-3 Medium.
- Then we add a trainable token weight parameter and initialize stopwords to a smaller value (0.1). Finally, we
- train a projection from the large, concatenated codebook down to a smaller dimension and average pool.

- We use popular embeddings datasets from sentence transformers, and matryoshka representation learning (MRL) so that
- dimensions can be truncated. For "binary" models, we train using a straight through estimator, so that the embeddings
- can be binarized eg, (x>0).sign() and packed into integers for hamming distance computation.

- After training, we save a new, small token embedding codebook, which is analogous to vectors of a word embedding.

  ## MTEB Results (l2_supercat)
 
@@ -58,11 +93,104 @@ After training, we save a new, small token embedding codebook, which is analogou
  | CQA DupStack | 18.76 | 22.54 | 24.12 | 24.59 | 24.83 | 15.47 | 16.79 | 41.32 |
  | SummEval | 30.79 | 29.99 | 30.99 | 29.56 | 29.39 | 28.87 | 30.49 | 30.81 |

- ---
- license: mit
- datasets:
- - sentence-transformers/all-nli
- - sentence-transformers/gooaq
- language:
- - en
- ---

+ # WordLlama
+
+ **WordLlama** is a fast, lightweight NLP toolkit that handles tasks like fuzzy deduplication, similarity, and ranking with minimal inference-time dependencies, optimized for CPU hardware.

+ <p align="center">
+   <img src="wordllama.png" alt="Word Llama" width="50%">
+ </p>

+
+ ## Table of Contents
+ - [Quick Start](#quick-start)
+ - [What is it?](#what-is-it)
+ - [MTEB Results](#mteb-results-l2_supercat)
+ - [Embed Text](#embed-text)
+ - [Training Notes](#training-notes)
+ - [Roadmap](#roadmap)
+ - [Extracting Token Embeddings](#extracting-token-embeddings)
+ - [Citations](#citations)
+ - [License](#license)
+
+ ## Quick Start
+
+ Install:
+ ```bash
+ pip install wordllama
+ ```
+
+ Load the default 256-dim model:
  ```python
  from wordllama import WordLlama

+ # Load the default WordLlama model
  wl = WordLlama.load()
+
+ # Calculate similarity between two sentences
  similarity_score = wl.similarity("i went to the car", "i went to the pawn shop")
  print(similarity_score) # Output: 0.06641249096796882
+
+ # Rank documents based on their similarity to a query
+ query = "i went to the car"
+ candidates = ["i went to the park", "i went to the shop", "i went to the truck", "i went to the vehicle"]
+ ranked_docs = wl.rank(query, candidates)
+ print(ranked_docs)
+ # Output:
+ # [
+ #   ('i went to the vehicle', 0.7441646856486314),
+ #   ('i went to the truck', 0.2832691551894259),
+ #   ('i went to the shop', 0.19732814982305436),
+ #   ('i went to the park', 0.15101404519322253)
+ # ]
+
+ # additional inference methods
+ wl.deduplicate(candidates, threshold=0.8) # fuzzy deduplication
+ wl.cluster(candidates, k=2, max_iterations=100, tolerance=1e-4) # labels using kmeans/kmeans++ init
+ wl.filter(query, candidates, threshold=0.3) # filter candidates based on query
+ wl.topk(query, candidates, k=3) # return top-k strings based on query
  ```

+ ## What is it?
+
+ WordLlama is an NLP utility and word embedding model that recycles components from large language models (LLMs) to create efficient, compact word representations in the spirit of GloVe, Word2Vec or FastText.
+ WordLlama begins by extracting the token embedding codebook from a state-of-the-art LLM (e.g., Llama3 70B), then training a small, context-less model in a general-purpose embedding framework.
+
+ WordLlama outperforms word models like GloVe 300d on all MTEB benchmarks, while being substantially smaller in size (**16MB default model** @ 256 dimensions vs >2GB).

+ Features of WordLlama include:

+ 1. **Matryoshka Representations**: Truncate the embedding dimension as needed.
+ 2. **Low Resource Requirements**: A simple token lookup with average pooling enables fast operation on CPU.
+ 3. **Binarization**: Models trained with the straight-through estimator can be packed into small integer arrays for even faster Hamming distance calculations. (coming soon)
+ 4. **Numpy-only inference**: Lightweight and simple.

+ For flexibility, WordLlama employs the Matryoshka representation learning training technique. The largest model (1024-dim) can be truncated to 64, 128, 256 or 512 dimensions.
+ For binary embedding models, we implement straight-through estimators during training. For dense embeddings, 256 dimensions capture most of the performance, while for binary embeddings validation accuracy is close to saturation at 512 dimensions (64 bytes packed).

+ The final weights are saved *after* weighting, projection and truncation of the entire tokenizer vocabulary. Thus, WordLlama becomes a single embedding matrix (nn.Embedding) that is considerably smaller than the gigabyte-sized LLM codebooks we start with. The original tokenizer is still used to preprocess the text into tokens, and the reduced-size token embeddings are average pooled. There is very little computation required, and the resulting model sizes range from 16MB to 250MB for the 128k Llama3 vocabulary.
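+
+ Conceptually, inference is just a token lookup into that matrix followed by average pooling. The sketch below illustrates the idea with a toy vocabulary and a random matrix standing in for the saved codebook (both are hypothetical stand-ins, not WordLlama internals):
+
+ ```python
+ import numpy as np
+
+ # Toy stand-ins: a tiny vocabulary and a random "codebook" (vocab_size x 256)
+ vocab = {"i": 0, "went": 1, "to": 2, "the": 3, "car": 4}
+ codebook = np.random.default_rng(0).normal(size=(len(vocab), 256)).astype(np.float32)
+
+ def embed(text: str) -> np.ndarray:
+     # Tokenize (the real model reuses the original LLM tokenizer), look up rows, average pool
+     ids = [vocab[token] for token in text.lower().split() if token in vocab]
+     return codebook[ids].mean(axis=0)
+
+ print(embed("i went to the car").shape)  # (256,)
+ ```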

+ It's a good option for NLP-lite tasks. You can train sklearn classifiers on it (see the sketch below), perform basic semantic matching, fuzzy deduplication, ranking and clustering.
+ I think it should work well for creating LLM output evaluators, or other preparatory tasks involved in multi-hop or agentic workflows.
+ You can perform your own LLM surgery and train your own model on consumer GPUs in a few hours.
+ Because it is fast and compact, it makes a good "Swiss-Army knife" utility for exploratory analysis and utility applications.
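+
+ For instance, here is a minimal sketch of training an sklearn classifier on WordLlama embeddings (the texts and labels are illustrative placeholders):
+
+ ```python
+ from sklearn.linear_model import LogisticRegression
+ from wordllama import WordLlama
+
+ wl = WordLlama.load()
+
+ # Tiny illustrative sentiment dataset
+ texts = ["great product, works well", "terrible, broke after a day",
+          "love it, highly recommend", "total waste of money"]
+ labels = [1, 0, 1, 0]
+
+ # WordLlama embeddings become the feature matrix for a linear classifier
+ clf = LogisticRegression().fit(wl.embed(texts), labels)
+ print(clf.predict(wl.embed(["really happy with this purchase"])))
+ ```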

  ## MTEB Results (l2_supercat)

  | CQA DupStack | 18.76 | 22.54 | 24.12 | 24.59 | 24.83 | 15.47 | 16.79 | 41.32 |
  | SummEval | 30.79 | 29.99 | 30.99 | 29.56 | 29.39 | 28.87 | 30.49 | 30.81 |

+ The [l2_supercat](https://huggingface.co/dleemiller/word-llama-l2-supercat) is a Llama2-vocabulary model. To train this model, I concatenated codebooks from several models, including Llama2 70B and Phi-3 Medium (after removing additional special tokens).
+ Because several models have used the Llama2 tokenizer, their codebooks can be concatenated and trained together (sketched below). Performance of the resulting model is comparable to training the Llama3 70B codebook, while being 4x smaller (32k vs 128k vocabulary).
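+
+ As an illustration of the concatenation step (the file names and shapes here are hypothetical; the repo's extraction and training scripts handle this), the rows of each codebook index the same 32k Llama2 vocabulary, so they can simply be stacked along the feature axis:
+
+ ```python
+ import numpy as np
+
+ # Hypothetical pre-extracted codebooks, each with one row per token in the 32k vocabulary
+ llama2_70b = np.load("llama2_70b_embed.npy")    # e.g. (32000, 8192)
+ phi3_medium = np.load("phi3_medium_embed.npy")  # e.g. (32000, 5120)
+
+ # Same row order in both, so concatenate along the feature (column) axis
+ supercat = np.concatenate([llama2_70b, phi3_medium], axis=1)
+ print(supercat.shape)  # e.g. (32000, 13312)
+ ```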
+
+ ### Other Models
+ [Results](wordllama/RESULTS.md)
+
+ Llama3-based: [l3_supercat](https://huggingface.co/dleemiller/wordllama-l3-supercat)
+
+ ## Embed Text
+
+ Here’s how you can load pre-trained embeddings and use them to embed text:
+
+ ```python
+ from wordllama import WordLlama
+
+ # Load pre-trained embeddings
+ # truncate dimension to 64
+ wl = WordLlama.load(trunc_dim=64)
+
+ # Embed text
+ embeddings = wl.embed(["the quick brown fox jumps over the lazy dog", "and all that jazz"])
+ print(embeddings.shape)  # (2, 64)
+ ```
+
+ Binary embedding models can be used like this:
+
+ ```python
+ # Binary embeddings are packed into uint64
+ # 64-dims => array of 1x uint64
+ wl = WordLlama.load(trunc_dim=64, binary=True)  # this will download the binary model from huggingface
+ wl.embed("I went to the car") # Output: array([[3029168427562626]], dtype=uint64)
+
+ # load a binary model trained with the straight-through estimator
+ wl = WordLlama.load(dim=1024, binary=True)
+
+ # Binarized embeddings are compared with hamming similarity
+ similarity_score = wl.similarity("i went to the car", "i went to the pawn shop")
+ print(similarity_score)  # Output: 0.57421875
+
+ ranked_docs = wl.rank("i went to the car", ["van", "truck"])
+
+ wl.binary = False  # turn off hamming and use cosine
+
+ # load a different model class
+ wl = WordLlama.load(config="l3_supercat", dim=1024)  # downloads model from HF
+ ```
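+
+ Under the hood, binary similarity is a Hamming-distance computation on the packed integers. Below is a conceptual numpy sketch of binarizing, packing and comparing vectors (illustrative only, not WordLlama's internal packing code):
+
+ ```python
+ import numpy as np
+
+ def pack_bits(x: np.ndarray) -> np.ndarray:
+     # Binarize by sign, then pack 64 bits into each uint64
+     bits = (x > 0).astype(np.uint8)
+     return np.packbits(bits).view(np.uint64)
+
+ def hamming_similarity(a: np.ndarray, b: np.ndarray) -> float:
+     # Popcount of XOR counts the differing bits
+     differing = sum(bin(int(word)).count("1") for word in np.bitwise_xor(a, b))
+     return 1.0 - differing / (a.size * 64)
+
+ a, b = np.random.randn(1024), np.random.randn(1024)
+ print(hamming_similarity(pack_bits(a), pack_bits(b)))  # ~0.5 for random vectors
+ ```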
+
+ ## Training Notes
+
+ Binary embedding models showed more pronounced improvement at higher dimensions, so either 512 or 1024 dimensions is recommended for binary embeddings.
+
+ L2 Supercat was trained using a batch size of 512 on a single A100 for 12 hours.
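+
+ WordLlama is trained like a general embedding model, using MultipleNegativesRankingLoss in the sentence-transformers library together with Matryoshka representation learning. The snippet below is a minimal sketch of that objective, not the repo's actual training script (the base model, toy data and hyperparameters are illustrative stand-ins; the real run used batch size 512):
+
+ ```python
+ from torch.utils.data import DataLoader
+ from sentence_transformers import SentenceTransformer, InputExample, losses
+
+ # Any sentence-transformers model stands in here for the WordLlama training wrapper
+ model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")
+
+ train_examples = [
+     InputExample(texts=["what is the capital of france?", "paris is the capital of france"]),
+     InputExample(texts=["how do plants make food?", "plants produce food through photosynthesis"]),
+ ]
+ train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=2)
+
+ # In-batch negatives ranking loss, wrapped so truncated dimensions are trained too
+ base_loss = losses.MultipleNegativesRankingLoss(model)
+ mrl_loss = losses.MatryoshkaLoss(model, base_loss, matryoshka_dims=[256, 128, 64])
+
+ model.fit(train_objectives=[(train_dataloader, mrl_loss)], epochs=1)
+ ```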

+ ## Roadmap
+
+ - Working on adding inference features:
+   - Semantic text splitting
+ - Add example notebooks
+ - DSPy evaluators
+ - RAG pipelines
+
+ ## Extracting Token Embeddings
+
+ To extract token embeddings from a model, ensure you have agreed to the user agreement and logged in using the Hugging Face CLI (for Llama3 models). You can then use the following snippet:
+
+ ```python
+ from wordllama.extract import extract_safetensors
+
+ # Extract embeddings for the specified configuration
+ extract_safetensors("llama3_70B", "path/to/saved/model-0001-of-00XX.safetensors")
+ ```
+
+ HINT: Embeddings are usually in the first safetensors file, but not always. Sometimes there is a manifest; sometimes you have to snoop around and figure it out (the snippet below can help).
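+
+ One way to snoop is to list the tensor names and shapes in a shard and look for the token embedding (this snippet uses the safetensors library directly and is illustrative; the path is a placeholder):
+
+ ```python
+ from safetensors import safe_open
+
+ # Print every tensor name and shape; the token embedding is typically
+ # something like "model.embed_tokens.weight" with shape (vocab_size, hidden_dim)
+ with safe_open("path/to/model.safetensors", framework="pt") as f:
+     for name in f.keys():
+         print(name, f.get_slice(name).get_shape())
+ ```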
+
+ For training, use the scripts in the GitHub repo. You will need to add a configuration file (copy and modify an existing one into the folder).
+ ```bash
+ $ pip install wordllama[train]
+ $ python train.py train --config your_new_config
+ (training stuff happens)
+ $ python train.py save --config your_new_config --checkpoint ... --outdir /path/to/weights/
+ (saves 1 model per matryoshka dim)
+ ```
+
+ ## Citations
+
+ If you use WordLlama in your research or project, please consider citing it as follows:
+
+ ```bibtex
+ @software{miller2024wordllama,
+   author = {Miller, D. Lee},
+   title = {WordLlama: Recycled Token Embeddings from Large Language Models},
+   year = {2024},
+   url = {https://github.com/dleemiller/wordllama},
+   version = {0.2.5}
+ }
+ ```
+
+ ## License
+
+ This project is licensed under the MIT License.