Embedding Dimensions
Is there a way to control the embedding dimension size or is that fixed? A vector index I'd like to use is limited to 2048.
from sklearn.decomposition import PCA
import numpy as np

# Stack the model's high-dimensional embeddings into an (n_samples, n_features) matrix
big_embeddings = np.array([...])

# Project down to 2048 dimensions; note PCA needs at least 2048 rows (samples) to fit
pca = PCA(n_components=2048)
small_embeddings = pca.fit_transform(big_embeddings)
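One practical note on the snippet above: keep the fitted pca object around and reuse it for anything you later compare against the index, otherwise query vectors and stored vectors end up in different projections. A minimal sketch, assuming a hypothetical query_embeddings array shaped like the training matrix:

# Reuse the already-fitted PCA so query vectors land in the same 2048-d space
small_queries = pca.transform(query_embeddings)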
Answering this for posterity
The embedding dimension of a model like the one you're using, Salesforce's SFR-Embedding-Mistral, is fixed: it is determined by the architecture of the model itself. The dimension is a core part of how the model is structured and trained, an intrinsic characteristic that can't be changed after training without altering the model's performance or intended function.
For transformer-based models, the embedding size is tied to the hidden size of the model's layers and its overall architecture. Smaller models have fewer parameters and typically smaller embedding sizes, while larger models have more parameters and often larger embedding sizes. The embedding size affects the model's ability to capture complex patterns and nuances in the data it was trained on.
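You can confirm the fixed size by reading it straight from the model configuration. A minimal sketch using Hugging Face transformers (assuming you have network access to pull the config; for SFR-Embedding-Mistral this should report 4096, the hidden size of its Mistral-7B backbone):

from transformers import AutoConfig

# Download only the model config (not the weights) and read its hidden size,
# which is the dimensionality of the embeddings the model produces
config = AutoConfig.from_pretrained("Salesforce/SFR-Embedding-Mistral")
print(config.hidden_size)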
If the vector index you intend to use is limited to 2048 dimensions and the model produces embeddings larger than that, you can apply a dimensionality-reduction technique such as PCA, as in the snippet above, to bring them down to 2048.
There will, of course, be some loss of information.
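If you want to gauge how much is lost, scikit-learn reports the variance retained by the projection. A rough sketch, reusing the pca object from the snippet above:

# Fraction of the original variance captured by the 2048 retained components
retained = pca.explained_variance_ratio_.sum()
print(f"About {retained:.1%} of the variance is preserved after reduction")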
Brilliant, thank you for the response.