
Generative AI and LLMs in Search and Recommendation Engines
Andrei Lopatenko, PhD
VP AI and Engineering
What is this deck about
In this presentation, we will explore the benefits of generative AI for search and
recommender engines.

We'll start with an overview and then delve deeper into specific approaches that
have proven valuable in the industry, such as FreshLLMs/FreshPrompts and
LLMRank.

Additionally, we will discuss LLM evaluation methods that are particularly useful
for search and recommender engines.
What is this deck about
Strategic Focus

This discussion extends beyond immediate business metrics such as conversion rates, transaction volumes, and cross-merchandising value, which, as evidenced by industry practices, are comparatively straightforward to enhance.

Instead, our focus is on cultivating long-term value. For customers, this means enabling better, more impactful decisions that save time and money and enhance value post-purchase. For the company, it involves fostering sustained engagement with customers, ensuring a lasting relationship that extends well beyond initial transactions.
Current State of AI in Search
Generative AI has demonstrated tremendous value for search, as evidenced by recent launches from OpenAI, Google, and Microsoft. In this presentation, we will explore the fundamental technologies behind these advancements and discuss how generative AI can enhance the search and recommender engines within your company.

We will also explore the potential contributions of Large Language Models (LLMs) and Generative AI to enhance value for customers looking to make purchases, bookings, or rentals. We'll focus on how these technologies can not only refine the search process but also fundamentally improve the decision-making experience for users.

About Andrei Lopatenko
● I bring over two decades of expertise in developing search and
recommendation engines that have served billions of users at major
corporations including Google, Apple (specifically Maps, App Store, iTunes),
Walmart, eBay, and Zillow.
● Since 2008, I have been involved with language models, expanding my
expertise to the BERT family in 2019 and more recently to large language
models (LLMs).
● PhD in Computer Science (The University of Manchester, UK), 1600 citations to
my publications, 29 patents
● And please check my LLM Evaluation Compendium:
https://github.com/alopatenko/LLMEvaluation
About Srijan Kumar
● Co-founder and CEO of Lighthouz AI, a multiagent AI platform for Gen AI
development and evaluations
● Assistant Professor at Georgia Tech working on AI, NLP, Graph Neural
Networks, Recommender Systems
● 60+ publications, 5500+ citations
● 10+ years in AI/ML
● Previously worked at Stanford, Google, Univ. Maryland
Gen AI for Recommender Systems

Fig. 3 from https://arxiv.org/abs/2306.05817
Search & LLMs
New paradigms
● Generating vs listing
● Conversational search (multi-turn)
● Search integrated into the whole customer journey
● Expert knowledge, question answering and decision making
● In-search task solving
Model improvements for better experience but within the existing paradigm (silent heroes)
● Better retrieval and ranking and query understanding
● Multi-modality
● Deeper understanding of customers and items and queries
● Better navigation elements (facets, navigational panels)
● Evaluation
FreshLLMs / FreshPrompt
Google / OpenAI / Univ of Mass. Amherst Oct 2023

https://arxiv.org/abs/2310.03214

Perplexity AI launched a system inspired by FreshLLMs one month later:

https://www.perplexity.ai/hub/blog/introducing-pplx-online-llms

Experiments in the paper were conducted in April 2023.

Gemini, Perplexity.AI, You.com, and Contextual.AI are inspired by this approach.


FreshLLMs / FreshPrompt
There are multiple aspects of answer quality for search (including search based on LLMs): factuality, authority, helpfulness, freshness, etc.
This approach focuses on freshness.
Different types of queries require different freshness of the answer: in the FreshLLMs paper, the authors talk about never-changing, slow-changing, fast-changing, and false-premise queries.
It is an oversimplified approach, and there are more types of temporal behavior in search, but this model helps to solve many freshness problems.
Let's call it Freshness V0.
FreshLLMs / FreshPrompt
Figure from the FreshLLMs paper: plain LLM question answering vs. FreshPrompt.
FreshPrompt
Algorithm (high level):

1. Use a search engine to retrieve relevant information. In their case, retrieve all information boxes from Google and get all results as (source, date, title, snippet, highlighted words).

2. Use an LLM to extract the answer.

At the retrieval stage, there are several aggregation subroutines, such as sorting the responses from oldest to newest (by date), to help the model pick up fresher results.
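A minimal sketch of the FreshPrompt idea, assuming a hypothetical search() retrieval call and a hypothetical llm.complete() call; the evidence fields follow the slide above, but the exact prompt format in the paper differs.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class Evidence:
    source: str
    date: str        # publication date, e.g. "2024-03-01"
    title: str
    snippet: str
    highlights: str

def build_fresh_prompt(question: str, evidences: List[Evidence]) -> str:
    """Assemble a FreshPrompt-style prompt: evidence sorted oldest-to-newest
    so the most recent facts sit closest to the question."""
    evidences = sorted(evidences, key=lambda e: e.date)
    blocks = []
    for e in evidences:
        blocks.append(f"source: {e.source}\ndate: {e.date}\ntitle: {e.title}\n"
                      f"snippet: {e.snippet}\nhighlight: {e.highlights}\n")
    return (
        "\n".join(blocks)
        + f"\nquestion: {question}\n"
        + "Answer the question using the most up-to-date evidence above. "
          "Reason about the dates before answering.\nanswer: "
    )

# evidences = search(question)                               # hypothetical retrieval call
# answer = llm.complete(build_fresh_prompt(question, evidences))  # hypothetical LLM call
```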
LLMs and Recommender systems
Can LLMs be used for recommender engines?

Retrieval, ranking, user understanding (sequential), item understanding (using text and images), explanations of recommendations.

Can LLMs, leveraging the large amount of information from pre-training, serve as ranking models?
LLMs and Recommender systems
LLMRank
LLM as a zero-shot ranker
We assume that we have a set of candidate items to be recommended; the retrieval is a separate system.
The recommendations are conditioned on user history, i.e. previous interactions with items.
There are significantly better rankers; the target here is to build an efficient zero-shot ranker. Also, there are other useful features of this framework that will be discussed below.
LLM Rank
Several key features:

1. Sequential prompting: encode the user's history of engagement with items into a sequential prompt that lists the recent items.

2. Recency-focused prompting: emphasize the most recent interactions so the LLM can understand which items are the most recent (the LLM does not handle order well in plain sequential prompting).

3. In-context learning: using previous interactions, create examples to be learned from as context.
LLM Rank
The retrieval service returns a small number of items (up to 20).

Retrieved items are encoded sequentially in the prompt for ranking.

The ranking task is encoded in a ranking template (a sketch of the prompt construction follows).
LLM Rank
Low sensitivity to the order in the history (negative feature); mitigation: emphasize the most recent items by prompting.

Sensitivity to the order of the candidates (negative feature); mitigation: bootstrapping (see the sketch below).

A too-long history affects recommendation quality negatively (negative feature); mitigation: cut the history to the most recent items.
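One way to implement the bootstrapping mitigation for candidate-order sensitivity is to shuffle the candidates, rank them several times, and aggregate the positions. This is a sketch under that assumption; rank_once is a placeholder for a single LLM ranking call.

```python
import random
from collections import defaultdict
from typing import Callable, List

def bootstrap_rank(candidates: List[str],
                   rank_once: Callable[[List[str]], List[str]],
                   n_rounds: int = 3,
                   seed: int = 0) -> List[str]:
    """Shuffle candidates, rank them n_rounds times with the LLM, then
    aggregate by average position (assumes rank_once returns every
    candidate exactly once)."""
    rng = random.Random(seed)
    position_sum = defaultdict(float)
    for _ in range(n_rounds):
        shuffled = candidates[:]
        rng.shuffle(shuffled)
        ranking = rank_once(shuffled)          # one LLM call per round
        for pos, item in enumerate(ranking):
            position_sum[item] += pos
    # Lower average position means the item was consistently ranked higher.
    return sorted(candidates, key=lambda c: position_sum[c] / n_rounds)
```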
LLM Rank Possible Extensions:
Recommendation of categories
Conditional recommendations (by attributes, user preferences)
Explanations of recommendations
LLM Evaluation
for
Search and Recommender Engines
LLM Evaluation for Search and Recommender engines
There are many types of evaluation needed. We will focus on a few of them that, in our experience, are the most frequently needed:

Hallucinations

RAGAs

LLM Embeddings
Hallucination Evaluation
HaluEval https://aclanthology.org/2023.emnlp-main.397.pdf
Three tasks: question answering, knowledge-grounded dialogue, summarization.
Responses from the LLM are labeled by human annotators.
The set is focused on understanding which hallucinations an LLM can produce; the LLM is asked to produce a wrong answer through hallucination patterns (one-pass instruction and conversational schema).
● Four types of hallucination patterns for question answering (comprehension, factualness, specificity, and inference)
● Three types of hallucination patterns for knowledge-grounded dialogue (extrinsic-soft, extrinsic-hard, and extrinsic-grouped)
● Three types of hallucination patterns for text summarization (factual, non-factual, and intrinsic)
Focus: understanding hallucination patterns and whether an LLM can detect hallucinations.
Hallucination Evaluation
Search-Augmented Factuality Evaluator (SAFE), an evaluation framework from Google DeepMind and Stanford.

Available in open source.

● Send a request, get an answer
● Break the answer into atomic facts
● Check each fact with the verifier (in their case Google Search; it can be your own system)

https://arxiv.org/pdf/2403.18802.pdf
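SAFE itself is open source (link above); the sketch below only illustrates the overall loop (decompose into atomic facts, verify each against a search-backed verifier) with hypothetical decompose_into_facts and verify_with_search helpers, not DeepMind's actual implementation.

```python
from typing import Callable, Dict, List

def safe_style_eval(answer: str,
                    decompose_into_facts: Callable[[str], List[str]],
                    verify_with_search: Callable[[str], bool]) -> Dict[str, float]:
    """Break a long-form answer into atomic facts and verify each one against
    an external verifier (e.g. a search engine, or your own retrieval system)."""
    facts = decompose_into_facts(answer)            # usually an LLM call
    supported = [f for f in facts if verify_with_search(f)]
    return {
        "num_facts": len(facts),
        "num_supported": len(supported),
        "supported_ratio": len(supported) / max(len(facts), 1),
    }
```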
Hallucination Evaluation
Hallucination Leaderboard https://arxiv.org/abs/2404.05904

A variety of tasks: closed-book open-domain QA, summarization, reading comprehension, instruction following, fact checking, hallucination identification.
Evaluation of RAG as an example of LLM app evaluation
RAG architectures will be one of the most frequent industrial patterns of LLM usage.
● Correctness
● Comprehensiveness
● Readability
● Novelty/Actuality
● Quality of information
● Factual answering correctness
● Depth
● Other metrics
This is similar to traditional search engine evaluation, as we evaluate the requested information, but there is a substantial difference: we evaluate a generated response rather than external documents.
Traditional IR architecture: retrieval -> ranker. RAG architecture: retrieval -> generator. These require different types of evaluation.
Evaluation of RAG
A key problem is the comparison of the generated text (answer) with the reference answer, i.e. a semantic similarity problem.

Old lexical metrics (BLEU, ROUGE) are easy to compute but give little usable signal.

BERTScore, BARTScore, BLEURT, and other text similarity functions.
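For example, BERTScore can be computed with the open-source bert-score package. A minimal sketch with made-up candidate/reference strings; defaults such as the underlying model vary by package version.

```python
# pip install bert-score
from bert_score import score

candidates = ["The store opens at 9 am on weekdays."]               # generated answers
references = ["On weekdays the store opens at 9 in the morning."]   # reference answers

# P, R, F1 are tensors with one entry per candidate/reference pair.
P, R, F1 = score(candidates, references, lang="en")
print(f"BERTScore F1: {F1.mean().item():.3f}")
```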

Calls to external LLMs


Lighthouz evaluations - via multiagent evaluators
Eval criteria: hallucinations / correct responses
Eval criteria: completeness
Other evaluation criteria:
● Helpfulness
● Creativity
● Summary correctness
● Summary completeness
● Code interpretability
● Math reasoning

You can build any criteria you like!


Agents at work for evaluations
Quick demo
RAG Evaluation: 3 typical metrics
Context Relevance

Answer Relevance

Groundedness (precision)

(common across multiple frameworks: RAGAs, ARES, the RAG Triad of metrics)

but...
Evaluation of RAG - RAGAs
Zero-shot LLM evaluation
4 metrics:
● Faithfulness
● Answer relevancy
● Context precision
● Context recall
It is important to extend these with whatever represents your RAG intents for your customers.
https://arxiv.org/abs/2309.15217
https://docs.ragas.io/en/stable/
The RAGAs framework is integrated with LlamaIndex and LangChain.
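A minimal sketch of running the four core RAGAs metrics with the open-source ragas package on a tiny made-up dataset; exact import paths and dataset column names have changed across ragas versions, so treat this as illustrative.

```python
# pip install ragas datasets
from datasets import Dataset
from ragas import evaluate
from ragas.metrics import (faithfulness, answer_relevancy,
                           context_precision, context_recall)

eval_data = Dataset.from_dict({
    "question":     ["When does the store open on weekends?"],
    "answer":       ["It opens at 10 am on weekends."],
    "contexts":     [["Weekend opening hours: 10:00-18:00."]],
    "ground_truth": ["The store opens at 10 am on Saturdays and Sundays."],
})

# RAGAs calls an LLM under the hood (OpenAI by default); set credentials accordingly.
result = evaluate(eval_data,
                  metrics=[faithfulness, answer_relevancy,
                           context_precision, context_recall])
print(result)
```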
RAGAs
Diagram: the generation step c(q) -> a_c(q) is evaluated with Answer Relevance and Faithfulness; the retrieval step q -> c(q) is evaluated with Context Precision and Context Recall against a human-curated ground truth (question, answer, context).
Evaluation of RAG - RAGAs
● Faithfulness: consistency of the answer with the context (but not the query!). Two LLM calls: get the context that was used to derive the answer, then check whether each statement is supported by the context.
● Answer Relevancy: is the answer relevant to the query? One LLM call: generate queries that may have produced the answer and verify whether they are similar to the original query.
● Context Relevance: how "focused" the retrieved context is on the answer, i.e. the amount of relevant info vs. noise; uses the LLM to compute relevant sentences / total number of retrieved sentences.
● Context Recall (external, optional): whether all relevant sentences are retrieved, assuming the existence of a ground_truth answer.
RAGAs
● Faithfulness: an LLM prompt decomposes the answer and context into statements.
F = supported statements / total statements
● Context Relevance: an LLM prompt decomposes the contexts into sentences and evaluates whether each sentence is relevant to the question.
CR = relevant sentences / total sentences in the context
● Answer Relevance: an LLM prompt generates questions for the answer; for each question, generate an embedding and compute the average semantic similarity between the original query and all generated queries.
AR = 1/n * sum_i sim(q, q_i)
● Context Recall:
CR = ground-truth sentences attributed to the retrieved context / total ground-truth sentences
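The Answer Relevance formula can be sketched directly with any sentence-embedding model; a minimal version assuming the sentence-transformers package and a hypothetical generate_questions LLM helper.

```python
# pip install sentence-transformers
from typing import Callable, List
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")   # any embedding model works here

def answer_relevance(original_query: str, answer: str,
                     generate_questions: Callable[[str], List[str]]) -> float:
    """AR = 1/n * sum_i sim(q, q_i): average cosine similarity between the
    original query and questions the LLM generates from the answer."""
    generated = generate_questions(answer)        # LLM call, e.g. "write 3 questions this answer satisfies"
    q_emb = model.encode(original_query, convert_to_tensor=True)
    g_emb = model.encode(generated, convert_to_tensor=True)
    sims = util.cos_sim(q_emb, g_emb)             # 1 x n similarity matrix
    return sims.mean().item()
```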
Evaluation of RAG - RAGAs
Prompts should be well tuned; they are hard to move to another context or LLM and require a lot of prompt-tuning work.
Each metric (faithfulness, answer relevancy, context relevance, context recall) can be dependent on your domain and business. It requires tuning to measure what your business depends upon.
Available in open source (https://docs.ragas.io/en/stable/) and integrated with key RAG frameworks.
More metrics in newer versions (Aspect Critique, Answer Correctness, etc.).
LLM Embedding evaluation
MTEB (Massive Text Embedding Benchmark) is a good example of an embedding evaluation benchmark.

https://arxiv.org/pdf/2210.07316.pdf

https://huggingface.co/spaces/mteb/leaderboard

It is relatively comprehensive: 8 core embedding tasks (bitext mining, classification, clustering, pair classification, reranking, retrieval, semantic textual similarity (STS), summarization) and open source.
LLM Embedding Evaluation
MTEB evaluation learning: there is no clear winner model across all tasks. The same will most probably hold in your case: you'll find different winning models for different tasks and may need to make your system multi-model.

MTEB: easy to plug in new models through a very simple API (a mandatory requirement for model development; a sketch follows).

MTEB: easy to plug in new datasets for existing tasks (a mandatory requirement for model development; your items are very different from public dataset items).
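A minimal sketch of plugging a model into MTEB via the mteb package; the task names are illustrative and the API has evolved across versions, so check the current documentation.

```python
# pip install mteb sentence-transformers
from mteb import MTEB
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")   # your embedding model here

# Pick tasks close to your use case (retrieval, STS, ...) instead of the full suite.
evaluation = MTEB(tasks=["Banking77Classification", "STSBenchmark"])
evaluation.run(model, output_folder="results/all-MiniLM-L6-v2")
```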
LLM Embedding Evaluation
There are standard 'traditional IR' methods to evaluate embeddings, such as recall@k, precision@k, and NDCG@k, that are easy to implement on your own against a dataset representing your data (a sketch follows below).

They are even supported by ML tools such as MLflow LLM Evaluate.

Important learning: find metrics that truly match customer expectations (for example, NDCG is very popular, but it is old and was built in a different setting; one most probably needs to tune it to their system to represent the way users interact with it).
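These metrics are simple enough to implement directly on your own judged dataset. A minimal sketch:

```python
import math
from typing import Dict, List

def recall_at_k(ranked_ids: List[str], relevant_ids: set, k: int) -> float:
    """Fraction of relevant items that appear in the top-k results."""
    hits = sum(1 for doc_id in ranked_ids[:k] if doc_id in relevant_ids)
    return hits / max(len(relevant_ids), 1)

def ndcg_at_k(ranked_ids: List[str], gains: Dict[str, float], k: int) -> float:
    """NDCG@k with graded relevance judgments in `gains` (doc_id -> gain)."""
    dcg = sum(gains.get(doc_id, 0.0) / math.log2(i + 2)
              for i, doc_id in enumerate(ranked_ids[:k]))
    ideal = sorted(gains.values(), reverse=True)[:k]
    idcg = sum(g / math.log2(i + 2) for i, g in enumerate(ideal))
    return dcg / idcg if idcg > 0 else 0.0
```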
LLM Embedding Evaluation

Another critical part of LLM evaluation for embeddings is software performance/operations evaluation of the model: cost, latency, throughput.
In many cases, embeddings must be generated for every item on every update (e.g. 100M+ updates per day), or for every query (1,000,000+ QPS with latency limits such as 50 ms).
The number of model calls and the latency/throughput requirements are different for embedding tasks than for other LLM tasks; most embedding tasks are high-load tasks.
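A rough way to sanity-check latency and throughput of an embedding model before committing to it; embed_batch is a placeholder for your model call, and the percentile choices are illustrative.

```python
import statistics
import time
from typing import Callable, List, Sequence

def benchmark_embeddings(embed_batch: Callable[[Sequence[str]], object],
                         texts: List[str], batch_size: int = 32,
                         warmup_batches: int = 3) -> dict:
    """Measure per-batch latency percentiles and rough items-per-second
    for an embedding function on a representative sample of your data."""
    batches = [texts[i:i + batch_size] for i in range(0, len(texts), batch_size)]
    for b in batches[:warmup_batches]:
        embed_batch(b)                          # warm up caches / lazy initialization
    latencies = []
    for b in batches:
        start = time.perf_counter()
        embed_batch(b)
        latencies.append(time.perf_counter() - start)
    latencies.sort()
    return {
        "p50_ms": 1000 * statistics.median(latencies),
        "p95_ms": 1000 * latencies[int(0.95 * (len(latencies) - 1))],
        "items_per_sec": len(texts) / sum(latencies),
    }
```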
LLM Embedding evaluation
All traditional search evaluation dataset requirements are still valid.
Your evaluation set must represent user queries or documents (what you measure) with a similar distribution (proper sampling of documents and query logs) and cover different topics, popular and tail queries, languages, and types of queries (with proper metrics).
Queries and documents change over time; the evaluation set must reflect these changes.
Take into account rater disagreement (typically high in retrieval and ranking) and use techniques to diminish subjectivity (pairwise comparison, control of the number of raters, etc.).
LLM Embeddings
In most cases, core tasks serve several customer- or business-facing tasks. For example, text similarity might be a part of a discovery/recommendation engine (text similarity of items as one of the features for item similarity) or of ranking (query similarity to decide whether the historical performance of one query, such as click signals, is applicable to another query).

Evaluation is therefore not only of the text-similarity LLM output, but of the whole end-to-end ranking, recommendation, etc. output.