What are the online evaluation metrics for Information Retrieval Systems?
Information retrieval (IR) aims to satisfy an information need (e.g., a user query) by finding the most relevant set of documents (the effectiveness, or quality, goal) as quickly as possible (the efficiency, or speed, goal). Therefore, an IR application is designed around efficiency and effectiveness as its two fundamental requirements.
How do we ensure we can achieve the goals of efficiency and effectiveness?
Evaluation helps us measure and monitor effectiveness and efficiency, which is key to making progress in building better IR systems. Without evaluation, it is difficult to examine new retrieval techniques or make informed deployment decisions. Online evaluation (a.k.a. interactive evaluation) is one of the most prevalent methods for measuring the effectiveness of an IR system; recording and analyzing user behavior from log data is a central part of it. In this blog, we focus on online evaluation and provide a summary of its methods and challenges.
What is an online evaluation?
Online evaluation is based on implicit measurement of real users’ experience with an IR system. Implicit measurements are the by-products of users’ natural interactions, such as clicks or dwell time. Online evaluation uses a specific set of tools and methods complementary to other evaluation approaches used in academic and industry research settings. Compared to offline evaluation (which relies on human relevance judgments), online evaluation is more realistic, as it addresses questions about actual users’ experience with an IR system.
Approaches
Online evaluations are carried out in controlled experiments where we can identify causal effects (e.g., effects of changes in algorithm or parameters) on user metrics. These experiments can be categorized depending on how we define the quality (effectiveness) and at what granularity level we measure it.
In terms of quality, experimental approaches are divided into absolute and relative types. In an absolute quality experiment, one is interested in measuring the performance of a single IR system, while in a relative quality experiment two IR systems are compared, which makes it harder to draw general conclusions over time. Figure 1 illustrates an analogy for the two evaluation types. An absolute evaluation is like looking at a single tree and asking how tall it is: we need a metric to measure height, the output is a single number, and we can track the tree's height by measuring it at regular intervals. A relative evaluation, on the other hand, is like comparing the heights of two trees with the same metric. Relative evaluation can be challenging because transitivity and performance comparisons across systems are not always straightforward.
Absolute online evaluation is usually carried out with A/B testing, a user experience research methodology based on a randomized experiment with two variants, ‘A’ and ‘B’, of the same application; it determines which variant drives more user conversions, with different segments of users assigned to each variant. Relative online evaluation uses interleaving comparison, a popular technique for evaluating IR systems based on implicit user feedback. The basic idea behind the different variations of interleaving is to perform paired online comparisons of two rankings: the two rankings are merged into one interleaved ranking, which is presented to the user interactively. The goal of this technique is to interpret user click data and comparison judgments in a fair and unbiased way, and to eliminate the post-hoc interpretation of observational data. Table 1 summarizes the comparison between absolute and relative online evaluation.
| Absolute Quality | Relative Quality |
|---|---|
| Measures the performance of a single IR system | Compares two IR systems directly |
| Typically implemented with A/B testing across user segments | Typically implemented with interleaving comparison |
| Produces a single-valued score that can be tracked over time | Produces a preference between the compared systems; harder to generalize over time |
In terms of granularity, one might be interested in the overall quality of a ranking system (list level) or in the quality of individual returned documents for a given query (result level). Approaches to online evaluation therefore vary based on the questions we want to address. Table 2 structures the possible research questions along two dimensions: granularity level and absolute vs. relative evaluation. This table helps us identify the correct level of assessment for our experiment; using those dimensions, we can then select the appropriate metrics for the evaluation task.
| Granularity | Absolute | Relative |
|---|---|---|
| Document | Are documents returned by this system relevant? | Which documents returned by this system are best? |
| Ranking | How good is this system for an average query? | Which of these IR systems is better on average? |
| Session | How good is this system for an average task? | Which of these IR systems leads to better sessions on average? |
Here is a summary of available metrics for both absolute and relative approaches at different assessment levels.
1) Absolute evaluation metrics
Absolute evaluation is used when we calculate an absolute score that tells us how relevant the ranked list of documents returned by the system is. We can track this score over time for a single IR system, or compute it for multiple IR systems and compare them over time.
As shown in Table 2, depending on the questions we want to address, we may choose to perform absolute online evaluation at the document level (general quality of results), the ranking level (position of the top relevant results on average), or the session level (a specific task or goal). To quantify an IR system's performance, we need to know which measures to use, so below we introduce a list of metrics for each level of assessment along with a short definition and examples. The goal is to give a concise, high-level overview of available metrics to those who want to start their own absolute online evaluation experiment.
Document level: Are documents returned by this system relevant?
Click-through rate (CTR):
- the simplest click-based metric. It is commonly used as a baseline.
- represents the average number of clicks a given document receives when shown on the Search Engine Results Page (SERP) for some query.
- the data collected is noisy and strongly biased, particularly due to document position.
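As a minimal sketch, the snippet below computes per-document CTR from a click log; the log format (a list of (query, doc_id, clicked) impression records) and all names are assumptions for illustration, not a standard API.

```python
from collections import defaultdict

def click_through_rate(impressions):
    """Compute CTR per (query, document) pair from (query, doc_id, clicked)
    impression records.  The log format is assumed for this sketch."""
    shown = defaultdict(int)
    clicked = defaultdict(int)
    for query, doc_id, was_clicked in impressions:
        shown[(query, doc_id)] += 1
        if was_clicked:
            clicked[(query, doc_id)] += 1
    # CTR = clicks / impressions for each (query, document) pair
    return {key: clicked[key] / shown[key] for key in shown}

# Toy log: doc1 shown twice (clicked once), doc2 shown once (clicked once)
log = [("ir evaluation", "doc1", True),
       ("ir evaluation", "doc1", False),
       ("ir evaluation", "doc2", True)]
print(click_through_rate(log))
# {('ir evaluation', 'doc1'): 0.5, ('ir evaluation', 'doc2'): 1.0}
```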
Dwell time:
- is frequently used to improve over simple click metrics.
- estimates satisfied clicks by applying a dwell-time cut-off derived from log analysis; a common choice is 30 seconds.
- aims to reflect the quality of the document rather than the caption produced by a search engine.
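A minimal sketch of flagging satisfied clicks with the 30-second dwell-time cut-off mentioned above; the record format (dicts with a 'dwell_time' field) is hypothetical.

```python
SAT_DWELL_SECONDS = 30  # common cut-off from log analysis

def satisfied_click_rate(clicks):
    """Fraction of clicks whose dwell time meets the satisfaction cut-off.
    `clicks` is assumed to be a list of dicts with a 'dwell_time' field."""
    if not clicks:
        return 0.0
    satisfied = sum(1 for c in clicks if c["dwell_time"] >= SAT_DWELL_SECONDS)
    return satisfied / len(clicks)

print(satisfied_click_rate([{"dwell_time": 45}, {"dwell_time": 12}]))  # 0.5
```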
Learned click satisfaction metrics:
- combine several features and information sources to obtain more accurate estimates of document-level user satisfaction.
- example: combinations of dwell time, scrolling behavior, and characteristics of the result document in a Bayesian network model.
- example: combine query (e.g., query length, frequency in logs), ranking (e.g., number of ads, diversity of results), and session (e.g., number of queries, time in session so far) features to predict user satisfaction at the click level.
- example: a sophisticated query-dependent satisfaction classifier, that improves prediction of user satisfaction by modeling dwell time in relation to query topic and document complexity.
- example: studying the contribution of dwell time to learned metrics in more detail, e.g., considering dwell time across a search trail and preferring server-side dwell time over client-side dwell time for predicting document-level satisfaction.
Click behavior models:
- take learned click satisfaction further, learning a latent relevance score for individual documents from clicks and possibly other observations.
- example: learning a dynamic Bayesian click model based on observed actions.
- example: estimate the relevance of documents from log data by training a probabilistic model of observation and click action.
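To illustrate the idea behind click behavior models, the toy sketch below uses the examination hypothesis (a click requires the document to be both examined and attractive) with an assumed, fixed examination probability per rank. It is a simplified illustration, not the dynamic Bayesian click model mentioned above, and the data format is made up for the example.

```python
from collections import defaultdict

def estimate_attractiveness(sessions, examination_prob):
    """Toy position-based click model: P(click | doc at rank r) is modeled as
    examination_prob[r] * attractiveness(doc).  Attractiveness is estimated
    as observed clicks divided by expected examinations."""
    clicks = defaultdict(float)
    expected_exams = defaultdict(float)
    for ranking, clicked_ranks in sessions:  # ranking: list of doc ids (top first)
        for rank, doc in enumerate(ranking):
            expected_exams[doc] += examination_prob[rank]
            if rank in clicked_ranks:
                clicks[doc] += 1.0
    return {doc: clicks[doc] / expected_exams[doc] for doc in expected_exams}

# Hypothetical examination curve (probability of looking at each rank)
exam = [1.0, 0.7, 0.5, 0.35, 0.25]
sessions = [(["a", "b", "c"], {0}),   # clicked rank 0 (doc "a")
            (["b", "a", "c"], {0})]   # clicked rank 0 (doc "b")
print(estimate_attractiveness(sessions, exam))
```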
Ranking level: How good is this system for an average query?
Click rank:
- the most basic such metric: it measures the position of the clicked documents. Placing clicked documents higher is better, so the lower the mean click rank, the better the IR system performs. However, mean click rank can lead to wrong conclusions when comparing two ranking systems, one with few relevant documents and the other with many.
- a variant is the reciprocal rank (inverse click rank).
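A minimal sketch of mean click rank and its reciprocal-rank variant; the input (a list of 1-based clicked positions per logged query) is an assumed format.

```python
def mean_click_rank(click_positions):
    """Average 1-based rank over all clicks; lower is better."""
    all_clicks = [rank for query_clicks in click_positions for rank in query_clicks]
    return sum(all_clicks) / len(all_clicks) if all_clicks else float("nan")

def mean_reciprocal_click_rank(click_positions):
    """Mean of 1/rank of the first click per query; higher is better."""
    rr = [1.0 / min(query_clicks) for query_clicks in click_positions if query_clicks]
    return sum(rr) / len(rr) if rr else 0.0

clicks = [[1, 3], [2], []]                 # clicked ranks for three logged queries
print(mean_click_rank(clicks))             # (1 + 3 + 2) / 3 = 2.0
print(mean_reciprocal_click_rank(clicks))  # (1.0 + 0.5) / 2 = 0.75
```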
CTR@k:
- Click-through rate within the top k positions.
pSkip:
- the probability of a user skipping over any result and clicking on one that is lower.
- a more advanced variant of click position.
- the lower the probability of skipping, the higher the quality of search ranking produced.
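One simple way to estimate pSkip from logs, under the assumption that users examined every result down to their lowest click; the input format is hypothetical.

```python
def pskip(sessions):
    """Estimate pSkip: among the results the user is assumed to have examined
    (everything down to the lowest click), the fraction that were skipped
    rather than clicked.  Lower is better.
    `sessions` holds a set of clicked 1-based ranks per query."""
    skipped, examined = 0, 0
    for clicked_ranks in sessions:
        if not clicked_ranks:
            continue  # sessions without clicks are ignored in this sketch
        lowest_click = max(clicked_ranks)
        examined += lowest_click
        skipped += lowest_click - len(clicked_ranks)
    return skipped / examined if examined else 0.0

print(pskip([{1, 3}, {2}]))  # (1 + 1) / (3 + 2) = 0.4
```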
Time to click:
- the overall time from the SERP being shown until the user's further interactions with the IR system.
- example: time to first click and to last click.
- the shorter the time to click, the better the performance.
Abandonment:
- it is recognized that a lack of interaction does not necessarily mean dissatisfaction.
- good abandonment can capture satisfaction without users needing to interact with the search system.
- machine learning approaches can help better account for good abandonment, resulting from more sophisticated search result page elements such as factoids and entity panes.
Learned metrics at absolute rank level:
- historically less investigated than learning absolute user satisfaction at either the click or session level.
- example: a satisfaction model that considers follow-on query reformulations to decide whether the interactions with results for the previous query were indicative of success.
Session level: How good is this system for an average task?
Simple session level measures:
- such as the number of queries per session, session length, or time to first click; these measures may be counter-intuitive and unreliable.
Learned measures:
- combine several session-level and lower-level user interactions (query count, average dwell time, number of clicked documents...) to obtain reliable estimates of search success.
Detection of searcher frustration, success:
- consider success separately from frustration, i.e., whether satisfying an information need was more difficult than the user considers it should have been.
Loyalty measures:
- account for users who repeatedly engage with an IR system. This is a common long-term goal for commercial IR systems.
- how long it takes until a user returns; this can be modeled with survival analysis.
- queries per user: users who find a system effective will engage with it more.
- sessions per user, daily sessions per user.
- success rate per user.
- these measures usually change slowly as users establish habits, making them difficult to apply.
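A minimal sketch of two simple loyalty aggregates (queries per user and sessions per user) computed from a session log; the (user_id, session_id, query) record format is an assumption for illustration.

```python
from collections import defaultdict

def loyalty_aggregates(session_log):
    """Compute queries-per-user and sessions-per-user from records of the
    form (user_id, session_id, query).  Assumed log format."""
    queries = defaultdict(int)
    sessions = defaultdict(set)
    for user_id, session_id, _query in session_log:
        queries[user_id] += 1
        sessions[user_id].add(session_id)
    if not queries:
        return {"queries_per_user": 0.0, "sessions_per_user": 0.0}
    users = list(queries)
    qpu = sum(queries[u] for u in users) / len(users)
    spu = sum(len(sessions[u]) for u in users) / len(users)
    return {"queries_per_user": qpu, "sessions_per_user": spu}

log = [("u1", "s1", "q1"), ("u1", "s1", "q2"), ("u1", "s2", "q3"), ("u2", "s3", "q4")]
print(loyalty_aggregates(log))  # {'queries_per_user': 2.0, 'sessions_per_user': 1.5}
```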
User engagement metrics:
- designed to capture engagement over multiple weeks.
2) Relative evaluation metrics
Relative evaluation is used when the research question is simpler and we only want to determine which of two IR systems performs better. This kind of experiment is typically harder to generalize over time: for example, knowing that systems A and B both perform better than system C does not tell us the relative performance of A and B.
Like absolute online evaluation, relative online evaluation is guided by the research question, and one may choose to run experiments at either the document or the ranking level (Table 2). Note that there is no metric for relative evaluation at the session level: we cannot blend two systems to produce a single session the way interleaving blends rankings. For relative comparison at the ranking level, instead of comparing an absolute score from each system, we can use an alternative approach called interleaving. The intuition is that rather than showing each user the results from just one of the systems, the results are combined in an unbiased way so that every user sees results produced by both systems; if one system returns better results on average, users will select them more often. This approach is more efficient, and users are not limited to the results of a single system. Here, we briefly cover the available metrics at the document and ranking levels, with a short description of each.
Document level: Which documents returned by this system are best?
Click-skip:
- can be described as relative preferences among the documents on the SERP (the "click > skip above" rule).
- assume that the user scans the SERP from top to bottom.
- when users skip a document to click on a lower ranked one, they are expressing a preference for the lower ranked document over the higher ranked document.
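A minimal sketch of extracting "click > skip above" preference pairs from a single SERP, assuming a top-to-bottom scan; the document ids and input format are illustrative.

```python
def click_skip_above_preferences(ranking, clicked):
    """Return (preferred_doc, over_doc) pairs: each clicked document is
    preferred over every unclicked document ranked above it."""
    prefs = []
    for pos, doc in enumerate(ranking):
        if doc in clicked:
            for above in ranking[:pos]:
                if above not in clicked:
                    prefs.append((doc, above))  # doc preferred over the skipped one
    return prefs

print(click_skip_above_preferences(["d1", "d2", "d3", "d4"], {"d3"}))
# [('d3', 'd1'), ('d3', 'd2')]
```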
FairPairs:
- randomizing the document presentation order such that every adjacent pair of documents is shown to users in both possible orders.
- observing which is more often clicked when at the lower position.
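A simplified sketch of the FairPairs presentation step: adjacent documents are paired (starting at a randomly chosen offset) and each pair is flipped with probability 0.5, so both orders are shown equally often. Collecting the click preferences from the randomized pairs is omitted, and the function name is ours, not from the original algorithm description.

```python
import random

def fairpairs_randomize(ranking):
    """Pair adjacent documents, starting at the first or second position
    (chosen at random), and flip each pair with probability 0.5."""
    reordered = list(ranking)
    offset = random.randint(0, 1)  # where pairing starts
    pairs = []
    for i in range(offset, len(reordered) - 1, 2):
        if random.random() < 0.5:
            reordered[i], reordered[i + 1] = reordered[i + 1], reordered[i]
        pairs.append((reordered[i], reordered[i + 1]))
    return reordered, pairs

# A click on the document shown in the lower position of a pair counts as
# evidence that it is preferred over its partner shown above it.
print(fairpairs_randomize(["d1", "d2", "d3", "d4"]))
```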
Hybrid relative-absolute approach
Ranking level: Which of these IR systems is better on average?
Interleaving with simple click scoring:
- balanced interleaving or the Team Draft algorithm: the preferred ranking is the one whose documents receive more clicks.
- extensions of the Team Draft algorithm account for more than two rankings or use different mixing policies.
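A minimal sketch of the Team Draft idea: in each round the two rankings take turns, in random order, picking their highest-ranked document not yet shown, and the system whose contributed documents attract more clicks is preferred. This is a simplified illustration, not a production implementation.

```python
import random

def team_draft_interleave(ranking_a, ranking_b):
    """Interleave two rankings with the Team Draft scheme: in each round the
    two 'teams' pick, in random order, their highest-ranked document not yet
    in the interleaved list."""
    interleaved = []
    team_of = {}  # doc -> "A" or "B"
    total_docs = len(set(ranking_a) | set(ranking_b))
    while len(team_of) < total_docs:
        for team, ranking in random.sample([("A", ranking_a), ("B", ranking_b)], 2):
            pick = next((d for d in ranking if d not in team_of), None)
            if pick is not None:
                interleaved.append(pick)
                team_of[pick] = team
    return interleaved, team_of

def credit_clicks(team_of, clicked_docs):
    """Count clicks per team; the team with more clicks wins the comparison."""
    score = {"A": 0, "B": 0}
    for doc in clicked_docs:
        if doc in team_of:
            score[team_of[doc]] += 1
    return score

merged, team_of = team_draft_interleave(["d1", "d2", "d3"], ["d2", "d4", "d1"])
print(merged)                          # e.g. ['d1', 'd2', 'd4', 'd3']
print(credit_clicks(team_of, {"d4"}))  # {'A': 0, 'B': 1}
```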
Interleaving with learned optimized click scoring:
- for sensitivity.
- for position bias.
- for agreement with absolute metrics.
Challenges with online evaluation
- Relevance: user feedback is implicit and not fully representative of user behavior. For instance, user clicks are not relevance scores (although they are correlated). It is therefore challenging to link online metrics to user satisfaction or relevance.
- Biases: factors such as the position of documents on the result page can affect user behavior, leading to biased user feedback such as position-biased clicks.
- Experiment effects: How to balance experimenting with a single ranker versus exploring other rankers?
- Reusability: unlike labelled data used for offline evaluation, collected online data cannot be confidently re-used for evaluating other rankers.
Summary
One of the core problems in IR research and development is the evaluation of IR systems. Having a reliable methodology to measure the effectiveness of an IR system in response to the users' information need is critical. There are two main categories of approaches: 1) offline approaches, which are successful at quantifying topical relevance but not at capturing contextual information such as a user's interaction history and changes in the user's information need; and 2) online approaches, which quantify the actual utility of the IR system through implicit user feedback. Due to this implicit nature of assessment, online evaluation needs a different set of methods from offline evaluation.
In this blog, we offered a high-level summary of the available methods for online evaluation of an IR system, extending from simple click-based metrics to composite metrics of engagement and user satisfaction; Table 3 lists all the discussed methods. There is no single metric that everyone should use: the choice depends on the question and the goal of the experiment, and different IR applications may need different metrics. As IR systems move towards more conversational settings, long-term metrics become more important to consider. Longer-term metrics (over weeks or months) allow us to measure the effects of learning and information gain by aggregating findings across all search tasks; user engagement metrics are examples of long-term metrics. For short-term metrics, the challenge is to define measures that effectively predict long-term impact. That is why picking more than one measure to evaluate an IR application is a good idea. It is also important to ensure that the chosen measure agrees with what we want to optimize in the problem.
There are a couple of advantages to online evaluation. It lets users interact with the IR system, and from that interaction we can collect information such as click data that can be used in different types of evaluation. It can also provide a large dataset inexpensively, as it does not require hiring experts to assign relevance scores to results. On the flip side, the main disadvantage is that the data is often very noisy due to users' behavior. Therefore, there is a need to reduce this noise, typically by collecting more data, aggregating data, applying click-filtering approaches, or other noise-removal policies.
| Level | Absolute Measures | Relative Measures |
|---|---|---|
| Document | Click-through rate (CTR), dwell time, learned click satisfaction metrics, click behavior models | Click-skip (click > skip above), FairPairs, hybrid relative-absolute approaches |
| Ranking | Click rank (reciprocal rank), CTR@k, pSkip, time to click, abandonment, learned rank-level metrics | Interleaving with simple click scoring, interleaving with learned/optimized click scoring |
| Session | Simple session-level measures, learned measures, detection of searcher frustration and success, loyalty measures, user engagement metrics | (none) |