`IndexError: index out of range in self` when creating embeddings
I'm using T-Systems-onsite/cross-en-de-roberta-sentence-transformer
as the embedding model for creating a privateGPT chatbot.
My setup is:

| privateGPT Setup | Used / Parameter |
|---|---|
| Hardware | Ubuntu Server with 48 CPUs |
| Source documents | One PDF, around 100 pages |
| llm_hf_repo_id | TheBloke/Leo-Mistral-Hessianai-7B-Chat-GGUF |
| llm_hf_model_file | leo-mistral-hessianai-7b-chat.Q4_K_M.gguf |
| embedding_hf_model_name | T-Systems-onsite/cross-en-de-roberta-sentence-transformer |
Now, my problem is: when I ingest the PDF file to create the embeddings, the ingestion fails at around 70% with the following error:
File "/*****/.cache/pypoetry/virtualenvs/private-gpt-igPs2cci-py3.11/lib/python3.11/site-packages/torch/nn/functional.py", line 2233, in embedding
return torch.embedding(weight, input, padding_idx, scale_grad_by_freq, sparse)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
IndexError: index out of range in self
Does that mean that T-Systems-onsite/cross-en-de-roberta-sentence-transformer cannot handle long PDFs, or do I need to set some parameters/options?
Is this a problem of privateGPT or of T-Systems-onsite/cross-en-de-roberta-sentence-transformer?
Can you please use Sentence Transformers to load this model and do some tests?
Code can be found here: https://www.sbert.net/docs/usage/semantic_textual_similarity.html
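For reference, something like the minimal sketch below (placeholder sentences, following the linked docs) should be enough to check that the model loads and encodes correctly in isolation:

```python
from sentence_transformers import SentenceTransformer, util

# Load the embedding model discussed in this thread
model = SentenceTransformer("T-Systems-onsite/cross-en-de-roberta-sentence-transformer")

# Encode two placeholder sentences (one German, one English) and compare them
sentences = ["Das ist ein Testsatz.", "This is a test sentence."]
embeddings = model.encode(sentences)
print(embeddings.shape)                          # (2, 768) for this RoBERTa-base model
print(util.cos_sim(embeddings[0], embeddings[1]))
```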
I guess this is an issue specific to privateGPT.
Hope that helps. If you still have problems, please give me code to reproduce the error.
Is there an update on this?
I'm currently having a similar issue.
Edit:
In LlamaIndex you can set the maximum token length to 512, which solves the problem for me:
```python
# Import path may differ between llama-index versions
from llama_index.embeddings.huggingface import HuggingFaceEmbedding

# Embeddings model, truncating inputs to the model's 512-token limit
embed_model_base = HuggingFaceEmbedding(
    model_name=config.EMBEDDING_MODEL_NAME,
    max_length=512,
)
```
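If you are not going through LlamaIndex, the same idea can be applied with Sentence Transformers directly by capping the sequence length before encoding, since the error above is typically caused by chunks longer than the model's 512-token position limit. A minimal sketch (the chunk text is a placeholder):

```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("T-Systems-onsite/cross-en-de-roberta-sentence-transformer")

# Truncate inputs at 512 tokens so long chunks cannot overflow
# the position embeddings and trigger the IndexError
model.max_seq_length = 512

chunks = ["...a long text chunk extracted from the PDF..."]  # placeholder
embeddings = model.encode(chunks, show_progress_bar=True)
```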
Has this problem been solved? I have the same problem now.
Thank you very much.