Facing Issues with Model Output and Inference Times
I am implementing a RAG architecture with ChromaDB as my vector store and Falcon-7B as my LLM. I have used LangChain's retriever to tie these together. While testing with a single PDF and the search configured to return the top 3 matches, I am facing a number of issues:
- The returned answers are not accurate (I tried different temperature settings).
- The model takes a long time and then responds with the same sentence repeated multiple times (increasing the repetition penalty mitigated this to an extent).
- The model sometimes does not return an answer for extended periods, in some cases more than 10-15 minutes.
- The model is slow to respond, up to 5x slower in some cases than models like Llama-2 7B or 13B.
I reduced the number of returned search results from 3 to 1, which improved both accuracy and response time to some degree; however, the model still stops responding after being queried 3-4 times.
All of these issues have been reported previously in some form:
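For anyone trying to reproduce this, here is a rough sketch of how the generation settings mentioned above could be collected in one place. The keys are keyword arguments accepted by the Hugging Face transformers `generate()` method; the helper name `build_generation_kwargs` and all default values are my own assumptions for illustration, not the settings I actually ran with. Capping `max_new_tokens` is included because an unbounded generation length is one plausible contributor to the long run times and repeated sentences, though I have not confirmed that.

```python
def build_generation_kwargs(max_new_tokens=200, repetition_penalty=1.2, temperature=0.1):
    """Build a dict of generation settings (hypothetical helper, illustrative defaults)."""
    kwargs = {
        # Upper bound on generated tokens, so a single call cannot run indefinitely.
        "max_new_tokens": max_new_tokens,
        # Values above 1.0 discourage the model from repeating earlier tokens.
        "repetition_penalty": repetition_penalty,
    }
    if temperature > 0:
        # Sampling must be enabled for the temperature setting to have any effect.
        kwargs["do_sample"] = True
        kwargs["temperature"] = temperature
    else:
        # A temperature of 0 is treated here as a request for greedy decoding.
        kwargs["do_sample"] = False
    return kwargs


generation_kwargs = build_generation_kwargs()
print(generation_kwargs)
```

The resulting dict would typically be unpacked into the generation call (for example `model.generate(**inputs, **generation_kwargs)`), assuming the model is loaded through transformers.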
Wrong Output
while giving a input but getting the wrong output for the particular input
falcon-7b-instruct is answering out of context
Repeats the same sentence
any success in In-context question-answering?
Model keeps generating multiple rounds of conversation
Model is Slow or does not give output
Slow inference
4th inference in a row does not work for Falcon7B in 8 or 4 bit
I am using the 16-bit version of the model, running on two T4 GPUs on AWS.
Please let me know if there are any workarounds or fixes for the above.
When I set the number of search results returned from the vector DB to 3 (larger prompt), the model takes:
1st question (answer is wrong): CPU times: user 2min 39s, sys: 391 ms, total: 2min 39s
2nd question (answer is wrong): CPU times: user 47.4 s, sys: 7.26 ms, total: 47.4 s
and then it does not respond from the third question onwards.
When I decrease the number of results to 1 (smaller prompt):
1st question (answer is right): CPU times: user 44.6 s, sys: 288 ms, total: 44.9 s
2nd question (answer is somewhat right): CPU times: user 17.4 s, sys: 0 ns, total: 17.4 s
and then it does not respond from the third question onwards, as above.
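The figures above are CPU times, in the format a notebook timing magic prints. Since GPU work can leave the CPU idle while it waits, wall-clock time may be the more useful number when comparing runs. Below is a minimal sketch that records both for each question. `timed_call` and `fake_qa` are hypothetical names; `fake_qa` is only a stand-in for the real retrieval and generation call.

```python
import time


def timed_call(fn, *args, **kwargs):
    """Run fn and return (result, wall_seconds, cpu_seconds)."""
    wall_start = time.perf_counter()   # elapsed real time
    cpu_start = time.process_time()    # CPU time used by this process
    result = fn(*args, **kwargs)
    cpu_seconds = time.process_time() - cpu_start
    wall_seconds = time.perf_counter() - wall_start
    return result, wall_seconds, cpu_seconds


def fake_qa(question):
    """Stand-in for the real QA chain call (hypothetical)."""
    time.sleep(0.05)
    return f"answer to: {question}"


for question in ["first question", "second question", "third question"]:
    answer, wall, cpu = timed_call(fake_qa, question)
    print(f"{question!r}: wall={wall:.2f}s cpu={cpu:.2f}s -> {answer}")
```

Logging both numbers per question should make it easier to see whether the third query is genuinely stuck or just very slow.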