Santiago Viquez
santiviquez's activity
Exactly. Now try to do the same, but this time imagine (or draw) an extra dimension perpendicular to the three spatial dimensions we see.
Oh thanks! I really appreciate it 🫶
I'm writing a book on ML metrics.
Together with Wojtek Kuberski, we're creating the missing piece of every ML university program and online course: a book solely dedicated to Machine Learning metrics!
The book will cover the following types of metrics:
• Regression
• Classification
• Clustering
• Ranking
• Vision
• Text
• GenAI
• Bias and Fairness
Check out the book: https://www.nannyml.com/metrics
NannyML: hold my beer...
https://huggingface.co/blog/santiviquez/data-drift-estimate-model-performance
For these experiments, I built a technique that relies on drift signals to estimate model performance. I compared its results against the current SoTA performance estimation methods and checked which technique performs best.
The plot below summarizes the general results. It measures the quality of performance estimation versus the absolute performance change (lower is better).
Full experiment: https://www.nannyml.com/blog/data-drift-estimate-model-performance
In it, I describe the setup, datasets, models, benchmarking methods, and the code used in the project.
Any suggestions?
Performance estimation is currently the best way to quantify the impact of data drift on model performance. 💡
I've been benchmarking performance estimation methods (CBPE and M-CBPE) against data drift signals.
I'm using drift results as features for many regression algorithms, and then I'm taking those to estimate the model's performance. Finally, I'm measuring the Mean Absolute Error (MAE) between the regression models' predictions and actual performance.
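To make the setup concrete, here is a minimal sketch of the idea with a single drift signal and a one-feature least-squares fit. The numbers are made up for illustration, and the real experiments use many drift signals and several regression algorithms, so treat this as a toy under those assumptions.

```python
from statistics import mean

def fit_line(x, y):
    """Ordinary least squares for one feature: y ~ slope * x + intercept."""
    x_bar, y_bar = mean(x), mean(y)
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return slope, y_bar - slope * x_bar

def mean_absolute_error(actual, predicted):
    """Average absolute gap between realized and estimated performance."""
    return mean(abs(a - p) for a, p in zip(actual, predicted))

# Hypothetical data: one drift score and the realized accuracy per chunk.
drift_train = [0.05, 0.10, 0.20, 0.30]
acc_train = [0.92, 0.90, 0.86, 0.82]
slope, intercept = fit_line(drift_train, acc_train)

# Estimate performance on new chunks from their drift score alone.
drift_new = [0.15, 0.25]
acc_new = [0.89, 0.83]
acc_pred = [slope * d + intercept for d in drift_new]
mae = mean_absolute_error(acc_new, acc_pred)
print(f"slope={slope:.2f}, intercept={intercept:.2f}, MAE={mae:.3f}")
```

The same MAE can be computed for a performance estimation method, which is what makes the two approaches comparable.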
So far, for all my experiments, performance estimation methods do better than drift signals. 👨‍🔬
Bear in mind that these are some early results; I'm running the flow on more datasets as we speak.
Hopefully, by next week, I will have more results to share.
I'm working on a benchmarking analysis, and I'm currently doing the following:
- Get univariate and multivariate drift signals and measure their correlation with realized performance.
- Use drift signals as features of a regression model to predict the model's performance.
- Use drift signals as features of a classification model to predict a performance drop.
- Compare all the above experiments with results from Performance Estimation algorithms.
Any other ideas?
Nicee, I'll take a look
Next week we'll be hosting our first Post-Deployment Data Science Meetup in Paris!
My boss will be talking about Quantifying the Impact of Data Drift on Model Performance.
The event is completely free, and there's only space for 50 people, so if you are interested, RSVP as soon as possible.
Thursday, March 14
5:30 PM - 8:30 PM GMT+1
RSVP: https://lu.ma/postdeploymentparis
Let me tell you about a post-deployment data science algorithm that we recently developed to measure the impact of Concept Drift on a model's performance.
How can we detect Concept Drift? 🤔
All ML models are designed to do one thing: learn a probability distribution in the form of P(y|X). In other words, they try to learn how to model an outcome 'y' given the input variables 'X'.
This probability distribution, P(y|X), is also called Concept. Therefore, if the Concept changes, the model may become invalid.
But how do we know if there is a new Concept in our data?
Or, more importantly, how do we measure if the new Concept is affecting the model's performance?
💡 We came up with a clever solution where the main ingredients are a reference dataset, one where the model's performance is known, and a dataset with the latest data we would like to monitor.
👣 Step-by-step solution:
1️⃣ We start by training an internal model on a chunk of the latest data. ➡️ This allows us to learn the new possible Concept present in the data.
2️⃣ Next, we use the internal model to make predictions on the reference dataset.
3️⃣ We then estimate the monitored model's performance on the reference dataset, treating the internal model's predictions (the Concept learned from the monitoring data) as ground truth.
4️⃣ If the estimated performance and the monitored model's actual performance on reference are very different, we say that there has been a Concept Drift.
To quantify how this Concept impacts performance, we subtract the monitored model's actual performance on reference from the estimated performance and report a delta of the performance metric. ➡️ This is what the plot below shows: the change of the F1-score due to Concept Drift! 🚨
This process is repeated for every new chunk of data that we get.
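The four steps can be sketched in a few lines. This is a toy under loud assumptions: the "internal model" is a one-feature threshold picker (`fit_threshold` is a hypothetical helper, not part of any library), the data is invented, and accuracy stands in for F1 to keep the code short.

```python
def accuracy(y_true, y_pred):
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def fit_threshold(xs, ys):
    """Toy internal model: pick the cut-off on x that best separates the labels."""
    best = max(sorted(set(xs)),
               key=lambda c: accuracy(ys, [int(x >= c) for x in xs]))
    return lambda x: int(x >= best)

def concept_drift_delta(monitored_model, x_ref, y_ref, x_chunk, y_chunk):
    internal = fit_threshold(x_chunk, y_chunk)            # 1. learn the chunk's Concept
    concept_labels = [internal(x) for x in x_ref]         # 2. predict on reference
    monitored_preds = [monitored_model(x) for x in x_ref]
    # 3. score the monitored model against the learned Concept
    estimated = accuracy(concept_labels, monitored_preds)
    actual = accuracy(y_ref, monitored_preds)
    return estimated - actual                             # 4. delta of the metric

# Reference data follows the Concept "y = 1 when x >= 5".
x_ref = [1, 2, 3, 4, 5, 6, 7, 8]
y_ref = [0, 0, 0, 0, 1, 1, 1, 1]
monitored = lambda x: int(x >= 5)

# In the drifted chunk the Concept has moved to "y = 1 when x >= 7".
y_drifted = [0, 0, 0, 0, 0, 0, 1, 1]

no_drift = concept_drift_delta(monitored, x_ref, y_ref, x_ref, y_ref)
drifted = concept_drift_delta(monitored, x_ref, y_ref, x_ref, y_drifted)
print(no_drift, drifted)
```

A delta near zero means the chunk's Concept agrees with the reference one; a clearly negative delta means the monitored model would lose performance under the new Concept.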
* check the image to get the joke
This paper breaks down LLM hallucinations into six different types:
1️⃣ Entity: Involves errors in nouns. Changing that single entity can make the sentence correct.
2️⃣ Relation: Involves errors in verbs, prepositions, or adjectives. They can be fixed by correcting the relation.
3️⃣ Contradictory: Sentences that contradict factually correct information.
4️⃣ Invented: When the LLM generates sentences with concepts that don't exist in the real world.
5️⃣ Subjective: When the LLM generates sentences influenced by personal beliefs, feelings, biases, etc.
6️⃣ Unverifiable: When the LLM comes up with sentences containing information that can't be verified, e.g., personal or private matters.
The first two types of hallucinations are relatively easy to correct, given that we can rewrite them by changing the entity or relation. However, the other four would mostly need to be removed to make the sentence factually correct.
Paper: Fine-grained Hallucination Detection and Editing for Language Models (2401.06855)
omg this is super cool! Definitely ping me when you have a demo.
@gsarti curious to know if you have seen something like this. It is very similar to a weighted version of UQ, but not exactly... haha
The premise is that not all output tokens of a generated response share the same importance. Hallucinations are more dangerous in the form of a noun, date, number, etc.
The idea is to have a "token selection" layer that filters the output token probabilities sequence. Then, we use only the probabilities of the relevant tokens to calculate uncertainty quantification metrics.
The big question is how we know which tokens are the relevant ones. 🤔
My idea is to take the decoded output sequence, run an NLP model over it (it doesn't need to be a fancy one) for entity recognition and part-of-speech tagging, and then do uncertainty quantification only on the tokens we have marked as relevant (nouns, dates, numbers, etc.).
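A rough sketch of that "token selection" layer, assuming the tags have already been produced by some tagger and aligned one-to-one with the LLM's output tokens (in practice the two tokenizations rarely match, so that alignment is a step of its own). The tag names and the probabilities below are illustrative assumptions.

```python
import math

# Assumption: which tags count as "relevant" is a design choice, not a standard.
RELEVANT_TAGS = {"NOUN", "PROPN", "NUM"}

def mean_neg_logprob(logprobs):
    """Plain uncertainty score over every token."""
    return -sum(logprobs) / len(logprobs)

def selective_uncertainty(logprobs, tags, relevant=RELEVANT_TAGS):
    """Same score, but computed only on tokens whose tag is relevant."""
    kept = [lp for lp, tag in zip(logprobs, tags) if tag in relevant]
    if not kept:
        return 0.0  # no relevant tokens, so nothing to score
    return mean_neg_logprob(kept)

tokens = ["Paris", "is", "the", "capital", "of", "Peru"]
tags = ["PROPN", "AUX", "DET", "NOUN", "ADP", "PROPN"]
probs = [0.90, 0.99, 0.99, 0.80, 0.99, 0.20]  # made-up token probabilities
logprobs = [math.log(p) for p in probs]

overall = mean_neg_logprob(logprobs)
selective = selective_uncertainty(logprobs, tags)
print(f"all tokens: {overall:.3f}, relevant tokens only: {selective:.3f}")
```

In this toy case the confident function words dilute the low-probability "Peru" in the plain average, while the filtered score keeps it visible.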
What are your thoughts? Have you seen anyone try this before?
Curious to see if anyone has tried this before, and to know whether it would have an impact on the correlation with human-annotated evaluations.
I found out about this paper thanks to @gsarti's post from last week; I got curious, so I want to post my take on it.
The paper proposes a new metric called EigenScore to detect LLM hallucinations.
Their idea is that given an input question, they generate K different answers, take their internal embedding states, calculate a covariance matrix with them, and use it to calculate an EigenScore.
We can think of the EigenScore as the mean of the log-eigenvalues of the (regularized) covariance matrix of the embeddings of the K generated answers.
But why eigenvalues?
Well, if the K generations have similar semantics, the sentence embeddings will be highly correlated, and most eigenvalues will be close to 0.
On the other hand, if the LLM hallucinates, the K generations will have diverse semantics, and the eigenvalues will be significantly different from 0.
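Here is a small pure-Python sketch of that intuition. As far as I can tell, the paper defines the score as (1/K) · log det(Σ + α·I), with Σ the K×K covariance of the centered sentence embeddings and α a small regularizer; the centering step, the value of α, and the toy 4-dimensional "embeddings" below are my assumptions, so treat this as an illustration rather than the authors' implementation.

```python
import math

def eigenscore(embeddings, alpha=1e-3):
    """(1/K) * log det(cov + alpha * I) for K embedding vectors."""
    k = len(embeddings)
    # Center each embedding by subtracting its own mean over dimensions.
    centered = [[v - sum(z) / len(z) for v in z] for z in embeddings]
    # K x K covariance (Gram) matrix, regularized on the diagonal.
    cov = [[sum(a * b for a, b in zip(zi, zj)) + (alpha if i == j else 0.0)
            for j, zj in enumerate(centered)]
           for i, zi in enumerate(centered)]
    # Cholesky factorization: log det = 2 * sum(log L[i][i]).
    chol = [[0.0] * k for _ in range(k)]
    logdet = 0.0
    for i in range(k):
        for j in range(i + 1):
            s = cov[i][j] - sum(chol[i][m] * chol[j][m] for m in range(j))
            if i == j:
                chol[i][j] = math.sqrt(s)
                logdet += 2.0 * math.log(chol[i][j])
            else:
                chol[i][j] = s / chol[j][j]
    return logdet / k

# Three identical "answers" vs. three mutually orthogonal ones.
similar = [[1.0, -1.0, 0.0, 0.0]] * 3
diverse = [[1.0, -1.0, 0.0, 0.0], [0.0, 0.0, 1.0, -1.0], [1.0, 1.0, -1.0, -1.0]]
print(eigenscore(similar), eigenscore(diverse))
```

The identical answers produce two eigenvalues close to zero (very negative logs, so a low score), while the diverse answers keep all eigenvalues well away from zero and score higher.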
The idea is pretty neat and shows better results when compared to other methods like sequence probabilities, length-normalized entropy, and other uncertainty quantification-based methods.
What I'm personally missing from the paper is that they don't compare their results with other methods like LLM-Eval and SelfCheckGPT. They do mention that EigenScore is much cheaper to implement than SelfCheckGPT, but that's all on the topic.
Paper: INSIDE: LLMs' Internal States Retain the Power of Hallucination Detection (2402.03744)
Here is a black-box method for hallucination detection that shows strong correlation with human annotations. 🔥
💡 The idea is the following: ask GPT, or any other powerful LLM, to sample multiple answers for the same prompt, and then ask it if these answers align with the statements in the original output. Make it say yes/no and measure the frequency with which the generated samples support the original statements.
This method is called SelfCheckGPT with Prompt and shows very nice results.
The downside: we have to make many LLM calls just to evaluate a single generated paragraph...
More details and variations of this method are in the paper: SelfCheck: Using LLMs to Zero-Shot Check Their Own Step-by-Step Reasoning (2308.00436)
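The yes/no counting can be sketched as follows. The `judge` argument stands in for the extra LLM call, and `toy_judge` is a hypothetical substring check used only so the example runs offline; the sentences and samples are invented.

```python
def support_frequency(sentences, samples, judge):
    """For each sentence, the share of sampled answers that support it."""
    scores = []
    for sentence in sentences:
        votes = [judge(sample, sentence) for sample in samples]  # True = "yes"
        scores.append(sum(votes) / len(votes))
    return scores

def toy_judge(sample, sentence):
    # Placeholder for "ask the LLM: does this sample support the sentence?"
    return sentence.lower() in sample.lower()

sentences = [
    "Paris is the capital of France.",
    "It has 40 million inhabitants.",
]
samples = [
    "Paris is the capital of France. About 2 million people live there.",
    "Paris is the capital of France. It sits on the Seine.",
    "Paris is the capital of France. It has 40 million inhabitants.",
]
support = support_frequency(sentences, samples, toy_judge)
print(support)
```

Each sentence costs one judge call per sample, which is exactly where the many LLM calls come from.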
santiviquez/llm-hallucination-detection-papers-65c4d2399096960aa80776d3
Which one should I add to the list?
Retrieval Augmented Generation (RAG) is a strategy to alleviate LLM hallucinations and improve the quality of generated responses.
A standard RAG architecture has two main blocks: a Retriever and a Generator.
1️⃣ When the system receives an input sequence, it uses the Retriever to retrieve the top-K most relevant documents associated with the input sequence. These documents typically come from an external source (e.g., Wikipedia) and are then concatenated to the original input's context.
2️⃣ It then uses the Generator to generate a response given the gathered information in the first step.
But what happens if the retrieval goes wrong and the retrieved documents are of very low quality?
Well, in such cases, the generated response will probably be of low quality, too.
But here is where CRAG (Corrective RAG) *might* help. I say it might help because the paper is very new (only one week old), and I don't know if anyone has actually tried this in practice.
However, the idea is to add a Knowledge Correction block between the Retrieval and Generation steps to evaluate the retrieved documents and correct them if necessary.
This step goes as follows:
🟢 If the documents are correct, they will be refined into more precise knowledge strips and concatenated to the original context to generate a response.
🔴 If the documents are incorrect, they will be discarded, and instead, the system searches the web for complementary knowledge. This external knowledge is then concatenated to the original context to generate a response.
🟡 If the documents are ambiguous, a combination of the previous two resolutions is triggered.
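The three branches can be sketched as a small routing function. Everything here is an assumption for illustration: `evaluate`, `refine`, and `web_search` are hypothetical callables you would plug in, and the two thresholds are arbitrary values, not numbers from the paper.

```python
def crag_knowledge(query, documents, evaluate, refine, web_search,
                   upper=0.7, lower=0.3):
    """Decide which knowledge to pass to the Generator."""
    scores = [evaluate(query, doc) for doc in documents]
    best = max(scores, default=0.0)
    if best >= upper:
        # Correct: keep the retrieved documents, refined into knowledge strips.
        return "correct", [refine(query, doc) for doc in documents]
    if best <= lower:
        # Incorrect: discard the documents and rely on web search instead.
        return "incorrect", web_search(query)
    # Ambiguous: combine both sources.
    refined = [refine(query, doc) for doc in documents]
    return "ambiguous", refined + web_search(query)

# Toy stand-ins so the sketch runs without a retriever or a search API.
toy_scores = {"good doc": 0.9, "so-so doc": 0.5, "bad doc": 0.1}
evaluate = lambda query, doc: toy_scores[doc]
refine = lambda query, doc: doc.upper()
web_search = lambda query: ["web result for: " + query]

action, knowledge = crag_knowledge("who wrote Dune?", ["so-so doc", "bad doc"],
                                   evaluate, refine, web_search)
print(action, knowledge)
```

The returned knowledge is what would be concatenated to the original context before generation.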
The experimental results from the paper show how the CRAG strategy outperforms traditional RAG approaches in both short and long-form text generation tasks.
Paper: Corrective Retrieval Augmented Generation (2401.15884)
🔥
ageML is a Python library I've been building to study the temporal performance degradation of ML models.
The goal of the project is to facilitate the exploration of performance degradation by providing tools for people to easily test how their models would evolve over time when trained and evaluated on different subsets of their data.
Check it out: https://github.com/santiviquez/ageml
BARTScore is a text-generation evaluation metric that treats model evaluation as a text-generation task.
Other metrics approach the evaluation problem from different ML task perspectives; for instance, ROUGE and BLEU formulate it as an unsupervised matching task, BLEURT and COMET as a supervised regression task, and BEER as a supervised ranking task.
Meanwhile, BARTScore formulates it as a text-generation task. Its idea is to leverage BART's pre-trained contextual embeddings to return a score that measures the faithfulness, precision, recall, or F-score of the main text-generation model's response.
For example, if we want to measure faithfulness, the way it works is that we would take the source and the generated text from our model and use BART to calculate the log token probability of the generated text given the source; we can then weight those results and return the sum.
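Written out, the weighted sum described above looks like this, where x is the source text, y = (y_1, ..., y_m) is the generated text, θ are BART's parameters, and ω_t is the weight of token t. The notation follows my reading of the paper, which, as far as I recall, mostly uses uniform weights in practice.

```latex
\mathrm{BARTScore} = \sum_{t=1}^{m} \omega_t \, \log p\left(y_t \mid y_{<t},\, x,\, \theta\right)
```

Swapping which text plays the role of x and which plays y (source, generated text, or reference) is what yields the faithfulness, precision, and recall variants.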
BARTScore correlates nicely with human scores, and it is relatively simple to implement.
Here is the original BARTScore paper: BARTScore: Evaluating Generated Text as Text Generation (2106.11520)
🧑‍💻 And the GitHub repo to use this metric: https://github.com/neulab/BARTScore
First, the two main ideas used in the experiments (using token probabilities and LLM-Eval scores) are taken from these three papers:
1. Looking for a Needle in a Haystack: A Comprehensive Study of Hallucinations in Neural Machine Translation (2208.05309)
2. SelfCheckGPT: Zero-Resource Black-Box Hallucination Detection for Generative Large Language Models (2303.08896)
3. LLM-Eval: Unified Multi-Dimensional Automatic Evaluation for Open-Domain Conversations with Large Language Models (2305.13711)
In the first two, the authors claim that computing the average of the sentence-level token probabilities is the best heuristic for detecting hallucinations. And from my results, we do see a weak positive correlation between average token probabilities and ground truth.
The nice thing about this method is that it comes with almost no implementation cost since we only need the output token probabilities from the generated text, so it is straightforward to implement.
The third paper proposes an evaluation schema where we do an extra call to an LLM and kindly ask it to rate on a scale from 0 to 5 how good the generated text is on a set of different criteria.
I was able to reproduce similar results to those in the paper. There is a moderate positive correlation between the ground truth scores and the ones produced by the LLM.
Of course, this method is much more expensive since we would need one extra call to the LLM for every prediction that we would like to evaluate, and it is also very sensitive to prompt engineering. 🤷
Yes, of course, I was actually gonna add the explanation as a comment, but I forgot.
The idea is that models have confident and less confident areas. The confidence is influenced by the characteristics and distribution of the training data.
In the example above, during testing, the model classifies all data points almost perfectly. And we observe only a small portion of them gathering in the center (the model's less confident area).
However, in production, more and more examples start coming from the conflicted region. A shift like that one will definitely translate into a performance drop.
So, you need monitoring to realize that the model might be underperforming.
The issue is that monitoring performance changes in production is hard because we rarely have ground truth there. The good news is that we could monitor the estimated performance instead!
Text generation tasks are challenging because a sentence can be written in multiple ways but still preserve its meaning.
For instance, "France's capital is Paris" means the same as "Paris is France's capital." 🇫🇷
In uncertainty quantification, we often look at token-level probabilities to quantify how "confident" an LLM is about its output. However, in this paper, the authors look at uncertainty at a meaning level.
Their motivation is that meanings are especially important for LLMs' trustworthiness; a system can be reliable even with many different ways to say the same thing, but answering with inconsistent meanings shows poor reliability.
To estimate semantic uncertainty, they introduce an algorithm for clustering sequences that mean the same thing, based on the principle that two sentences mean the same thing if we can infer one from the other.
Then, they determine the likelihood of each meaning by summing the probabilities of the sequences that share it, and estimate the semantic entropy over those meanings.
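A minimal sketch of that clustering-then-entropy step. The `same_meaning` callable stands in for the bidirectional entailment check (which would need an NLI model); `toy_same_meaning` is a hypothetical bag-of-words comparison used only to keep the example self-contained, and the probabilities are invented.

```python
import math

def semantic_entropy(answers, probs, same_meaning):
    """Entropy over meaning clusters instead of over individual strings."""
    clusters = []  # each cluster is a list of indices into `answers`
    for i, answer in enumerate(answers):
        for cluster in clusters:
            if same_meaning(answers[cluster[0]], answer):
                cluster.append(i)
                break
        else:
            clusters.append([i])  # no match found: start a new meaning
    total = sum(probs)
    # Probability of a meaning = summed probability of its sequences.
    meaning_probs = [sum(probs[i] for i in c) / total for c in clusters]
    return -sum(p * math.log(p) for p in meaning_probs if p > 0)

def toy_same_meaning(a, b):
    # Placeholder for "a entails b and b entails a".
    words = lambda s: {w.strip(".,!?").lower() for w in s.split()}
    return words(a) == words(b)

answers = ["France's capital is Paris", "Paris is France's capital.",
           "France's capital is Lyon"]
probs = [0.5, 0.3, 0.2]

semantic = semantic_entropy(answers, probs, toy_same_meaning)
naive = semantic_entropy(answers, probs, lambda a, b: a == b)
print(f"semantic: {semantic:.3f}, string-level: {naive:.3f}")
```

The two Paris phrasings collapse into one meaning, so the semantic entropy comes out lower than the entropy computed over the raw strings.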
There's a lot more to it, but their results look quite nice when compared with non-semantic approaches.
Paper: Semantic Uncertainty: Linguistic Invariances for Uncertainty Estimation in Natural Language Generation (2302.09664)
Nice! Thank you, I'll take a look
Ohh that's so cool! I actually played with the space last week when I was reading the paper. Don't remember how I found it 🤔
A simple average of the log probabilities of the output tokens from an LLM might be all it takes to tell if the model is hallucinating. 🫨
The idea is that if a model is not confident (low output token probabilities), the model may be inventing random stuff.
In these two papers:
1. https://aclanthology.org/2023.eacl-main.75/
2. https://arxiv.org/abs/2303.08896
The authors claim that this simple method is the best heuristic for detecting hallucinations. The beauty is that it only uses the generated token probabilities, so it can be implemented at inference time ⚡
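In code, the heuristic is about as small as it gets. The threshold below is an arbitrary assumption (it would have to be tuned on labeled examples), and the token probabilities are made up.

```python
import math

def average_logprob(token_logprobs):
    """Mean log-probability of the generated tokens (higher = more confident)."""
    return sum(token_logprobs) / len(token_logprobs)

def looks_hallucinated(token_logprobs, threshold=math.log(0.5)):
    # Assumption: flag outputs whose average token probability is below ~0.5.
    return average_logprob(token_logprobs) < threshold

confident = [math.log(p) for p in (0.90, 0.95, 0.85)]
hesitant = [math.log(p) for p in (0.30, 0.20, 0.40)]
print(looks_hallucinated(confident), looks_hallucinated(hesitant))
```

Since most LLM APIs can return token log-probabilities alongside the text, the score comes for free with each generation.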
ohhh @victor can you add me to the list too?
helloooo