FOUNDATIONS OF
INFORMATION RETRIEVAL
Lynda Tamine-Lechani
[email protected]
https://www.irit.fr/~Lynda.Tamine-Lechani/
•Course description
Study the theory, design, and implementation of information retrieval systems
from the perspectives of:
✓ information representation: focus on texts
✓ theoretical information retrieval models: focus on language models and learning-based models
✓ performance evaluation: focus on system-centred evaluation
•Learning objectives
✓ Index and represent textual information;
✓ Recall and discuss well-known information retrieval models;
✓ Design, implement and evaluate the performance of information retrieval systems using retrieval algorithms and models discussed in class.
© L. Tamine-Lechani
•Organization
o 12H course, 6H tutorial: Lynda Tamine-Lechani
o 10H hands-on work: Jesus-Lovon Melgajero, José G. Moréno and Lynda Tamine-Lechani
•Prerequisites
o Python programming
o Basics in probability and statistics
•Course material
o Copies of the lecture slides are posted on the MOODLE site
o Book and readings references are provided
•Grading
o 1st session
✓ Hands-on work applying the techniques discussed in class: 30% of the final grade
✓ Final written exam in class: 70% of the final grade
o 2nd session
✓ Final written exam in class: 100% of the final grade
•Schedule
Lecture  Topic
1        Course introduction; text indexing, vector semantics
2        Static embeddings, contextual embeddings
3        Information retrieval (IR) models: query reformulation, learning to rank
4        Tutorial 1: text indexing and representation
5        Neural models for IR
6        PageRank, performance evaluation
7        Tutorial 2: information retrieval techniques and models
8        Question answering systems and chatbots
9        Tutorial 3: performance evaluation
Books
Information Retrieval: Algorithms and Heuristics
David A. Grossman, Ophir Frieder, Kluwer Academic Publishers, 1998
Modern Information Retrieval
R. Baeza-Yates, B. Ribeiro-Neto, ACM Press / Addison-Wesley, 1999
Recherche d'information : applications, modèles et algorithmes
M.-R. Amini and E. Gaussier, Eyrolles, 2012
Search Engines: Information Retrieval in Practice
B. Croft, D. Metzler, T. Strohman, Pearson, 2010
Introduction
Information Retrieval (IR): definitions
Calvin Mooers, 1951:
"Information retrieval (IR) is the name for the process or method whereby a prospective user of
information is able to convert his need for information into an actual list of citations to documents in
storage containing information useful to him. ... Information retrieval is crucial to documentation and
organization of knowledge". (Mooers, 1951, p. 25)
Salton, 1980:
Information retrieval systems are designed to help analyze and describe the items stored in a file, to
organize them and search among them, and finally to retrieve them in response to a user's query.
Designing and using a retrieval system involves four major activities: information analysis, information
organization and search, query formulation, and information retrieval and dissemination.
Information retrieval (IR) in computing and information science is the process of obtaining
information system resources that are relevant to an information need from a
collection of those resources. Searches can be based on full-text or other content-based
indexing.
Do these definitions refer to well-known search engines?
... Yes, but they also refer to:
- Search in digital libraries
- Search in a company corpus
- Search in specialized corpora (health, legal, biology-related resources)
- Search for a location
- Search for answers
- Recommend items
- Summarize reviews
- ...
... and different forms of user-system interaction
• Wide variety of search systems and interaction environments
o Web search engines ... with voice only!
o Conversational agents
o E-commerce: Amazon, Airbnb, ...
o Media recommendation: Netflix, Spotify, ...
[Figures: from search to conversation; heatmaps on SERP; cross-device search; search and navigate on maps]
Focus in this lecture
(Web) search systems that select, from a corpus of text documents, those that are
relevant to an information need that the user expresses as a query.
[Figure: Information need → Query → Selection ← Documents ← Corpus; Selection → System's answer to the query]
Basic notions: Document
• Document: the information unit being searched
- Document
- Paragraph
- Sentence
- Structure unit (section, chapter, ...)
• Different views
- Structure: 1. Introduction, 2. Basics, ...
- Content: "This course introduces the basics of information retrieval ...", "The notion of query ..."
- Metadata: Date: 15/01/2013, Author: Albert, Language: French
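The three views of a document (structure, content, metadata) can be sketched as a small data structure. This is only an illustrative sketch: the class name `Document` and its field names are assumptions made for the example, and the values are the ones shown on the slide.

```python
from dataclasses import dataclass, field


@dataclass
class Document:
    """One information unit, seen through the three views named above (hypothetical class)."""
    content: str                                             # content view: the raw text
    structure: list[str] = field(default_factory=list)       # structure view: section titles
    metadata: dict[str, str] = field(default_factory=dict)   # metadata view: date, author, language


doc = Document(
    content="This course introduces the basics of information retrieval",
    structure=["1. Introduction", "2. Basics"],
    metadata={"date": "15/01/2013", "author": "Albert", "language": "French"},
)
print(doc.metadata["author"])  # Albert
```

A retrieval system may index any of these views; in this course the focus is on the content view.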
• Different media
- Text (monomedia)
- Image
- Video
- Multimedia
•Different forms
-Document
-Blog
-Tweet
-News
-Presentation
-E-mail
-...
Basic notions: information need, query
• What the user is looking for: an information need
• How the user expresses this information need: a query
In this course: a query is a list of keywords
Basic notions: Relevance
• A key concept in information retrieval
A document is relevant if it matches the information need. Numerous types of
relevance:
o Topical (aboutness) relevance: the document covers the query topic
o Situational relevance: the document matches the user's situation (e.g., task,
location, ...)
o Cognitive relevance: the document fits the user's knowledge state
o ...
and numerous criteria of relevance:
- Novelty
- Freshness
- Language
- Specificity
- Trust
- ...
The main focus in this course is topical relevance: useful and "easy" to define and to
measure, but it does not cover everything related to relevance
What makes information retrieval challenging?
[Figure not reproduced; image © NIST (TREC)]
•Deluge of information
o Large-scale information
o Often only a small fraction of the information is relevant and/or useful for a query
o Information is noisy
o Information is not always trustworthy
o Heterogeneous information forms and sources
o ...
Information is everywhere
Increasing volumes of information are available from a growing number of sources:
social applications, mobile devices, sensors, ...
Timeline:
1972 ARPANET
1990 WWW
1994 E-commerce
1995 Directories
1998 Search
1999 Blogs
2001 Wikis
2003 Social networks
Source: Infographic
Focus on Web 3.0: The digital world today
•1st place: platforms for publication/sharing of texts (mostly), newsletters, podcasts, videos, photos
o Wikipedia, Blogger, Google Podcasts, YouTube, Flickr, TripAdvisor, ...
•2nd place: platforms for messaging
o Facebook, Messenger, Telegram, ...
•3rd place: platforms for conversations
o Quora, StackExchange, Reddit, Facebook groups, Google Groups, ...
•4th place: platforms for collaboration
o Facebook Workplace, TeamWork, Chatter, ...
Image credit: https://fredcavazza.net/2021/05/06/panorama-des-medias-sociaux-2021/
Some statistics 2020-2021: information and users
• In 2020, Google processed more than 7 billion queries every day, 15% of which had never been submitted before (new queries)
• The number of social media users worldwide is estimated at 2.77 billion, compared with 2.46 billion in 2017
• 51%, or more than 240 billion dollars, of all advertising money spent worldwide in 2019 was expected to go to digital media
• Online sales were expected to reach 3.45 trillion dollars in 2020
• 47.3% of the world population was expected to shop online in 2020
[Figure: users and information shared live, 2021; image credit https://www.internetlivestats.com/]
[Figure: statistics on the usage of information access systems, 2014-2020; source: https://datastudio.google.com/embed/reporting/1sImC_rjeWqNXdgQt5MtmrQMbH44qFjtA/page/1fzh]
What makes information retrieval challenging?
•Information needs are ambiguous
o Queries are generally short and ambiguous
o The mapping between queries and intents is many-to-many (M-N)
1 query → N intents, e.g. "Roi lion" (The Lion King)
M queries → 1 intent, e.g.
- Master UPS Intelligence artificielle
- Université Paul Sabatier IA
- Formation IA Toulouse
- Master IAFA
- ...
What makes information retrieval challenging?
•Relevance is subjective
✓ User-dependent
✓ Situation-dependent
✓ Topicality is often the minimum threshold for relevance
•Relevance faces vocabulary mismatch between queries and documents
o Matching as word overlap: is it really semantic overlap?
Q: "most jurisdictions exercise a high degree of regulation over banks" [financial institution]
D1: "I was robbed when I withdrew the money from the bank" [building]
D2: "fish lined the bank of the stream" [the land alongside or sloping down to a river or lake]
o Matching is not exact: queries and documents match only approximately
Q: "Presidential Elections in France"
D1: "Election campaign is running"
[relevant, but missing ‘presidential’ and ‘France’]
D2: "Macron, the President of France is attending COP21"
[irrelevant, and matching ‘France’ and ‘President’]
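The "Presidential Elections in France" example can be checked with a minimal word-overlap sketch. The tokenizer, the tiny stopword list and the function names are assumptions made for this illustration. Note that with exact string matching and no stemming, only "france" is shared with D2 ("president" and "presidential" are different strings), and "elections" does not match "election" in D1.

```python
import re

# Tiny illustrative stopword list (an assumption, not a standard list).
STOPWORDS = {"in", "is", "the", "of"}


def tokens(text):
    """Lowercase the text, split on non-alphanumeric characters, drop stopwords."""
    return {t for t in re.findall(r"[a-z0-9]+", text.lower()) if t not in STOPWORDS}


def overlap(query, document):
    """Number of distinct terms shared by query and document (exact string match)."""
    return len(tokens(query) & tokens(document))


q = "Presidential Elections in France"
d1 = "Election campaign is running"
d2 = "Macron, the President of France is attending COP21"

print(overlap(q, d1))  # 0: the relevant document shares no exact term with the query
print(overlap(q, d2))  # 1: the irrelevant document shares "france"
```

Under pure word overlap the irrelevant D2 outscores the relevant D1, which is the vocabulary mismatch problem in miniature.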
What makes information retrieval challenging?
•Queries and documents vary in length
o Models must handle variable-length input
o Relevant documents also contain irrelevant content
Q: "Omicron variant symptoms"
D: "The Omicron variant has already infected several patients in France after first appearing in South Africa. While it appears to be more
transmissible, it is reportedly no more virulent. But what are its symptoms?
On 26 November, the World Health Organization (WHO) classified the Omicron variant, newly emerged in South
Africa, as ‘of concern’ on the basis of how fast it spreads. Many cases have since begun to emerge around the
world, including a few in France.
But as for how dangerous it is or what its symptoms are, much remains unclear. So, what do we know?
Based on the situations in South Africa and the United Kingdom, the WHO stated in a technical briefing that the
Omicron variant appears to spread faster than Delta.
However, unlike Delta, its symptoms are reportedly less severe.
No loss of taste or smell
Interviewed by the BBC, Dr Angelique Coetzee, chair of the South African Medical Association, who was one of the first to
encounter Omicron, said that the symptoms she observed seem less specific than those of the original
disease. ‘It started with a male patient of about 33 years old,’ she explained in that interview.
‘He said he had been extremely tired over the past few days and complained of body aches and mild headaches.’ But
the man had not lost his sense of taste or smell; he had a ‘scratchy throat’, rather than a sore throat and
a cough as with previous variants.
She also said that the other patients examined that same day ‘had the same mild symptoms’."
Source: https://www.leprogres.fr/magazine-sante/2021/12/13/variant-omicron-quels-sont-les-premiers-symptomes-detectes
What makes information retrieval similar to vs. different from data retrieval (databases)?
                             Information retrieval                     Data retrieval
Information unit             Information                               Data (attribute-value)
Query                        Vague expression of an information need   Vague expressio
Language of the query        Natural language                          Formal language
Query-information matching   Approximate                               Exact
Selected information         Information relevant to the query         All the data that satisfies the query
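The contrast between exact and approximate matching can be sketched as follows. Everything here (the toy records, the toy documents, the function names and the word-count score) is an assumption made for illustration, not a method prescribed by the course.

```python
# Data retrieval: attribute-value records, exact predicate, unordered result set.
records = [
    {"author": "Albert", "year": 2013},
    {"author": "Marie", "year": 2013},
    {"author": "Albert", "year": 2020},
]


def data_retrieval(records, **conditions):
    """Return every record whose attributes equal all the requested values."""
    return [r for r in records if all(r.get(k) == v for k, v in conditions.items())]


# Information retrieval: free text, graded score, ranked list.
docs = {
    "d1": "information retrieval models and evaluation",
    "d2": "database systems and query languages",
    "d3": "evaluation of retrieval systems",
}


def information_retrieval(docs, query):
    """Rank documents by the number of query words they contain; drop zero scores."""
    q = set(query.lower().split())
    scored = [(len(q & set(text.lower().split())), doc_id) for doc_id, text in docs.items()]
    ranked = sorted(scored, key=lambda pair: (-pair[0], pair[1]))
    return [doc_id for score, doc_id in ranked if score > 0]


print(data_retrieval(records, author="Albert", year=2013))
# [{'author': 'Albert', 'year': 2013}]
print(information_retrieval(docs, "retrieval evaluation systems"))
# ['d3', 'd1', 'd2']
```

The data retrieval call returns all and only the records that satisfy the condition, while the information retrieval call returns partially matching documents ordered by an estimated degree of relevance.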
The basic process of information retrieval
[Figure:
Documents → Indexing → Document representations
Information need → Expression → Query
Document representations + Query → Matching → Selected documents
Selected documents → Feedback]
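The steps of this process (indexing, matching, feedback) can be sketched end to end on a toy corpus. This is a minimal sketch under stated assumptions: the inverted index, the term-count score, the three function names and the naive query-expansion form of feedback are all choices made for this example; the slide only names the steps.

```python
from collections import defaultdict


def index(docs):
    """Indexing: map each term to the set of documents containing it (inverted index)."""
    inverted = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            inverted[term].add(doc_id)
    return inverted


def match(inverted, query):
    """Matching: score each document by the number of distinct query terms it contains."""
    scores = defaultdict(int)
    for term in set(query.lower().split()):
        for doc_id in inverted.get(term, ()):
            scores[doc_id] += 1
    # Highest score first; ties broken by document id for a stable order.
    return sorted(scores, key=lambda d: (-scores[d], d))


def feedback(query, docs, relevant_id):
    """Feedback (assumed form): expand the query with terms of a document judged relevant."""
    query_terms = query.lower().split()
    new_terms = [t for t in docs[relevant_id].lower().split() if t not in query_terms]
    return " ".join(query_terms + new_terms)


docs = {
    "d1": "text indexing and representation",
    "d2": "neural retrieval models",
    "d3": "evaluation of retrieval models",
}
inverted = index(docs)
print(match(inverted, "retrieval models"))  # ['d2', 'd3']

expanded = feedback("retrieval models", docs, "d3")
print(expanded)                             # retrieval models evaluation of
print(match(inverted, expanded))            # ['d3', 'd2']
```

After feedback, the document judged relevant moves to the top of the ranking because the expanded query shares more terms with it.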
• Lecture structure
o Introduction
o Chapter 1: Text indexing and representation
"How to transform raw texts into machine-processable representations?"
Keywords: indexing, words, documents, representation learning of texts
o Chapter 2: Information retrieval (IR) models
"How to score the relevance of a document as an answer to a user's
query?"
Keywords: relevance status value, retrieval model
o Chapter 3: Performance evaluation of an IR system
"How to measure the performance of an information retrieval system?"
Keywords: evaluation metrics, test collections
o Chapter 4: From question-answering systems to chatbots
"How to interact with systems while searching for information?"
Keywords: conversation, turn, clarification