Stick To Your Role! About

Motivation

Benchmarks usually compare models with MANY QUESTIONS from A SINGLE MINIMAL CONTEXT, e.g. as multiple choices questions. This kind of evaluation is little informative of LLMs' behavior in deployment when exposed to new contexts (especially when we consider the LLMs highly context-dependant nature). We argue that CONTEXT-DEPENDENCE can be seen as a PROPERTY of LLMs: a dimension of LLM comparison alongside others like size, speed, or knowledge. We evaluate LLMs by asking the SAME QUESTIONS from MANY DIFFERENT CONTEXTS .

LLMs are often used to simulate personas and populations. We study the coherence of simulated populations over different contexts (conversations on different topics). To do that we leverage the psychological methodology to study the interpersonal stability of personal value expression of those simulated populations. We adopt the Schwartz Theory of Basic Personal Values that defines 10 values: Self-Direction, Stimulation, Hedonism, Achievement, Power, Security, Conformity, Tradition, Benevolence, and Universalism, to evaluate their expression we use the associated questionnaires: PVQ-40, and SVS.

Administering a questionnaire to a simulated persona in context

To evaluate the stability on a population level we need to be able to evaluate a value profile expressed by a simulated individual in a specific context (conversation topic). We do with the following procedure:

The Tested model is instructed to simulate a persona
A separate model instance - The Interlocutor - is instructed to simulate a “human using a chatbot”
A conversation topic is induced by manually setting the first Interlocutor’s message (e.g. Tell me a joke)
A conversation is simulated
A question from the questionnaire is set as the last Interlocutor’s last message and The Tested model’s response is recorded (this is repeated for every item in the questionnaire)
The questionnaire is scored to obtain scores for the 10 personal values

Contexts

We aim to score the expressed value profile for each simulated persona in different contexts. More precisely a population (50 personas) is evaluated with a context chunk (50 topics: one per persona). Then, the simulated population in one context chunk is compared to the same population in another context chunk. Here are the considered context chunks:

no_conv : no conversation is simulated the questions from the PVQ-40 questionnaire are given directly
no_conv_svs : no conversation is simulated the questions from the SVS questionnaire are given directly
chunk_0-chunk-4 : 50 reddit posts used as the initial Interlocutor model messages (one per persona). chunk_0 contains the longest posts, chunk_4 the shortest.
chess : "1. e4" is given as the initial message to all personas, but for each persona the Interlocutor model is instructed to simulate a different persona (instead of a human user)
grammar : like chess, but "Can you check this sentence for grammar? \n Whilst Jane was waiting to meet hers friend their nose started bleeding." is given as the initial message.

Validation

Validity refers to the extent the questionnaire measures what it purports to measure. It can be seen the questionnaire's accuracy in measuring the intended factors, i.e. values. Following the recommendations in this paper, the validation consists of two phases: Theory-Based Multidimensional Scaling (MDS) and Confirmatory Factor Analysis (CFA).

Theory-Based Multidimensional Scaling (MDS) tests that the expressed values are organized in a circular structure as predicted by the theory. Values should be ordered in a circle in the same order as shown on the figure below (Tradition and Conformity should be on the same angle with Tradition closer to the center). To compute the structure in our data, we calculate the intercorrelations between different items (questions). This provides us with 40 points in a 40D space (for PVQ-40), which is space is then reduced to 2D by MDS. Crucially, MDS is initialized with the theoretical circular value structure, i.e. items corresponding to the same value are assigned the same angle. When MDS is fit, it provides the Stress (↓) metric ('Stress-1 index') indicating the goodness of the fit. A value of 0 indicates 'perfect' fit, 0.025 excellent, 0.05 good, 0.1 fair, and 0.2 poor. It is common to also qualitatively analyze this structure to see if the items are organized in distinct regions. Since a leaderboard cannot contain qualitative measures, we construct a quantitative measure with the same intuition. The Separability (↑) metric is the accuracy of a linear SVM OvO classifier. The intuition is that all values should be linearly separable.

Confirmatory Factor Analysis (CFA) fits a model on the data. The model is defined according to the theory and the fit of this model is used as a metric. Due to the circular structure of basic personal values, it is recommended to employ a Magnifying glass CFA strategy. Four separate models are fit, one for each of the high level values (consisting of several low-level values): Conservation (security, conformity, tradition), Openness to Change (self-direction, stimulation, hedonism), Self-transcendence (benevolence, universalism), Self-enhancement (achievement, power). Fit is measured with three standard metrics: Comparative Fit Index - CFI (↑) - compares the fit of a model to a more restricted baseline model (>.90 considered acceptable fit). Standardized root mean square residual - SRMR (↓) compares the sample variances and covariances to the estimated ones. (<.05 considered good fit, <.08 considered reasonable fit). Root mean square error of approximation - RMSEA (↓) reflects the degree to which a model fits the population covariance matrix, while taking into account the degrees of freedom and sample size (<.05 considered good fit; < .08 considered reasonable fit).

Rank-Order stability

Rank-Order stability (↑) is used to estimate the stability of some value inside a population. In psychology, it is computed as the correlation in the order of individuals at two points in time (individuals are ordered based on their expression of that value). Intuitively, this can be seen as addressing the following question: "Does Jack always value Tradition more than Jane does?". As shown below, instead of comparing two points in time, we compare the simulated population in different contexts (simulated conversations of different topics). We then average over different context pairs and values to obtain the final estimate.

Aggregate Metrics

To rank models, we aggregate the rank-order and validity metrics in two ways :

Cardinal - Score (↑) - the score is averaged over all metrics (with descending metrics inverted), context pairs (for stability) and contexts (for validity metrics)
Ordinal - Win rate (↑) - for each metric, each context pair (for stability) and each context (for validity metrics) is considered as a game between two models, the win rate of a model is the percentage of won games against all models

Following this paper and associated benchbench library, we can compute the diversity and the sensitivity of the two ranking methods. A benchmark is considered diverse if different tasks order models in different ways. We use the reversed Kendall’s coefficient of concordance (W) diversity metric. A benchmark is considered sensitive if the model ordering is sensitive to the addition of new irrelevant models (for ordinal benchmarks), or to the label noise (for cardinal benchmarks). We use the max rank change (MRC) sensitivity metric.

Differences with the paper

This leaderboard is grounded in the methodology presented in our research paper. The paper contains various experiments which are not included in the leaderboard such as: multiple populations, within-person stability, stability on downstream tasks, correlations of value expression and behavior on downstream tasks, and so on. The leaderboard focused on population-level stability (Rank-Order) and contains various additions to the methodology. These changes were made to keep up with the newly released model and to make the evaluation more detailed. We describe additions made in the leaderboard here for clarity:

a new population was created and was balanced with respect to gender
context chunks - instead of evaluating the stability of a population between pairs of contexts, where all personas are given the same topic (e.g. chess), we evaluate it between pairs of context chunks, where each participant is given a different random context
more diverse and longer contexts (up to 6k tokens) were created with reddit posts from the webis dataset (the dataset was cleaned to exclude posts from NSFW subreddits)
different interlocutors - chess and grammar topic were still introduced as in the paper (same context for all participants), but the interlocutor model was instructed to simulate a random persona from the same population (as opposed to a human user in other settings)
evaluations were also done without simulating conversations (no_conv setting)
evaluations were also done with the SVS questionnaire (in the no_conv setting)
validation metrics - Stress, Separability, CFI, SRMR, RMSEA metrics were introduced
cardinal and ordinal ordering with sensitivity and diversity estimates were added
newer models were evaluated

Stick To Your Role! Leaderboard