Tutorial 3

Open Source IR Tools and Libraries

Giorgos Vasiliadis

CS-463 Information Retrieval Models

Computer Science Department
University of Crete

Outline

Google Search API

Lucene

Terrier

Lemur

Google Search API

Google Search API: Overview

The API exposes the Google search engine to developers.
You can write scripts that query Google Search in real time.
Google is no longer issuing new API keys for the SOAP Search API.
Instead, Google provides an AJAX Search API.
  You can put Google Search in your web pages with JavaScript.

Google Search API: SOAP

Based on the web services technology SOAP (the XML-based Simple Object Access Protocol).
Developers write software programs that connect remotely to the Google SOAP Search API service.
Developers can issue search requests to Google’s index of billions of web pages and receive results as structured data, access information in the Google cache, and check the spelling of words.
Limitations
  Default limit of 1,000 queries per day.
  Can only request 10 results at a time.
  Can only access Google Web Search (not Google Images, Google Groups, and so on).
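
The limits above bound how much of a result list a client can page through. A minimal sketch of that arithmetic in Python: the two constants come from the slide, while the function names and the zero-based paging scheme are assumptions made for illustration. No Google API is called, and the sketch assumes each page request counts as one query.

```python
import math

RESULTS_PER_REQUEST = 10   # from the slide: 10 results at a time
QUERIES_PER_DAY = 1000     # from the slide: default daily limit


def requests_needed(wanted_results: int) -> int:
    """Number of requests needed to page through `wanted_results` hits."""
    return math.ceil(wanted_results / RESULTS_PER_REQUEST)


def max_results_per_day() -> int:
    """Upper bound on results retrievable per day under the default limit."""
    return RESULTS_PER_REQUEST * QUERIES_PER_DAY


def start_offsets(wanted_results: int) -> list[int]:
    """Zero-based start offset of each page (an assumed paging scheme)."""
    return list(range(0, wanted_results, RESULTS_PER_REQUEST))
```

For example, fetching 25 results takes three requests, at offsets 0, 10 and 20.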

Google Search API: AJAX

Lets you put Google Search in your web pages with JavaScript.
Does not have a limit on the number of queries per day.
Supports additional features like Video, News, Maps, and Blog search results.

Google Search API: AJAX

Web Search
  Incorporate results from Web Search, News Search, and Blog Search.

Local Search
  Provides access to local search results from Google Maps.

Video Search
  Incorporate a simple search box.
  Incorporate dynamic, search-powered strips of video and book thumbnails.

Google Search API: Demo

Google Search API: References

Google SOAP Search API
  http://code.google.com/apis/soapsearch/
Google AJAX Search API
  http://code.google.com/apis/ajaxsearch/
Google AJAX Search API Developer Guide
  http://code.google.com/apis/ajaxsearch/documentation/
Google AJAX Search API Samples
  http://code.google.com/apis/ajaxsearch/samples.html

Lucene

Lucene

Named after Doug Cutting’s wife’s middle name (also her maternal grandmother’s first name)
Cross-platform API
  Implemented in Java
  Ported to C++, C#, Perl, Python
Offers scalable, high-performance indexing
  Incremental indexing as fast as batch indexing
  Index size roughly 20-30% of the size of the indexed text
Supports many powerful query types

Lucene: Modules

Analysis
  Tokenization, stop words, stemming, etc.
Document
  Unique ID for each document
  Title of document, date modified, content, etc.
Index
  Provides access to indexes and maintains them.
Query Parser
Search / Search Spans

Lucene: Indexing

A Document is a collection of Fields
  Document: Field 1, Field 2, ..., Field N
A Field is free text, keywords, dates, etc.
A Field can have several characteristics
  indexed, tokenized, stored, term vectors
Apply an Analyzer to alter Tokens during indexing
  Stemming
  Stop-word removal
  Phrase identification
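
The analysis step above can be sketched outside Lucene. The following Python is a minimal illustration, not Lucene's API: the stop-word list, the crude suffix stripper (a stand-in for a real stemmer such as Porter) and all function names are assumptions. A document is modelled as a plain mapping from field name to text, and only the fields marked as tokenized pass through the analyzer.

```python
import re

# Illustrative stop-word list (an assumption, not Lucene's default list).
STOP_WORDS = {"a", "an", "and", "the", "of", "to", "is", "in"}


def tokenize(text: str) -> list[str]:
    """Lower-case the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def strip_suffix(token: str) -> str:
    """Very crude stemmer: strips a few English suffixes (not Porter)."""
    for suffix in ("ing", "es", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def analyze(text: str) -> list[str]:
    """Tokenize, drop stop words, then stem what is left."""
    return [strip_suffix(t) for t in tokenize(text) if t not in STOP_WORDS]


def analyze_document(doc: dict[str, str], tokenized_fields: set[str]) -> dict[str, list[str]]:
    """Analyze tokenized fields; keep other fields (dates, IDs) as one token."""
    return {
        name: analyze(value) if name in tokenized_fields else [value]
        for name, value in doc.items()
    }
```

Here `analyze("Indexing the documents")` yields `["index", "document"]`, while a date field left out of `tokenized_fields` is kept verbatim as a single keyword.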

Lucene: Searching

Uses a modified Vector Space Model
The user’s query is converted into an internal representation that can be searched against the Index
Queries are usually analyzed in the same manner as the indexed text
Get back Hits and use them in an application
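
To make the vector space idea concrete, here is a minimal Python sketch that ranks documents by tf-idf cosine similarity. It is a textbook formulation under stated assumptions, not Lucene's actual scoring formula (which the slide only describes as "modified"); the function names and the `log(N/df) + 1` idf variant are choices made for this example.

```python
import math
from collections import Counter


def build_idf(docs: dict[str, list[str]]) -> dict[str, float]:
    """Inverse document frequency for every term in the collection."""
    n = len(docs)
    df = Counter(term for tokens in docs.values() for term in set(tokens))
    return {term: math.log(n / count) + 1.0 for term, count in df.items()}


def tfidf_vector(tokens: list[str], idf: dict[str, float]) -> dict[str, float]:
    """Sparse tf-idf vector; terms unknown to the collection are ignored."""
    tf = Counter(tokens)
    return {t: freq * idf[t] for t, freq in tf.items() if t in idf}


def cosine(a: dict[str, float], b: dict[str, float]) -> float:
    """Cosine similarity between two sparse vectors."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def search(query_tokens: list[str], docs: dict[str, list[str]]) -> list[str]:
    """Return ids of matching documents, best score first."""
    idf = build_idf(docs)
    q = tfidf_vector(query_tokens, idf)
    scored = [(cosine(q, tfidf_vector(tokens, idf)), doc_id)
              for doc_id, tokens in docs.items()]
    ranked = sorted(scored, key=lambda pair: (-pair[0], pair[1]))
    return [doc_id for score, doc_id in ranked if score > 0.0]
```

The query and the documents are assumed to have gone through the same analysis beforehand, which is why the function takes token lists rather than raw text.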

Lucene: Query Parser Syntax

Terms
  Single terms and phrases
Fields
  E.g. title:"Do it right" AND right
Wildcard Searches
  ‘?’ for a single character
  ‘*’ for multiple characters
Proximity Searches
  E.g. "jakarta apache"~10
Fuzzy Searches
  Levenshtein Distance (Edit Distance) algorithm
Range Searches
  mod_date:[20020101 TO 20030101]
  title:{Aida TO Carmen}
Boosting a Term
  E.g. jakarta^4 apache
Boolean Operators
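
Two of the query types above are easy to illustrate on a plain list of terms. The Python below is a sketch under assumptions, not Lucene code: `edit_distance` is the standard dynamic-programming Levenshtein distance, `fuzzy_matches` applies a hypothetical threshold of two edits, and `wildcard_matches` borrows the standard library's `fnmatchcase`, which shares the `?` and `*` semantics (it also interprets `[...]` character sets, which the slide does not mention).

```python
from fnmatch import fnmatchcase


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: minimum insertions, deletions, substitutions."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # delete from a
                               current[j - 1] + 1,       # insert into a
                               previous[j - 1] + cost))  # substitute
        previous = current
    return previous[-1]


def fuzzy_matches(term: str, vocabulary: list[str], max_edits: int = 2) -> list[str]:
    """Vocabulary terms within `max_edits` edits of `term`."""
    return [t for t in vocabulary if edit_distance(term, t) <= max_edits]


def wildcard_matches(pattern: str, vocabulary: list[str]) -> list[str]:
    """Vocabulary terms matching a pattern with '?' and '*' wildcards."""
    return [t for t in vocabulary if fnmatchcase(t, pattern)]
```

A real engine would run these checks against the term dictionary of the index rather than a Python list.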

Lucene: More Advanced Options

Relevance Feedback
  Manual
    The user selects which documents are relevant/non-relevant.
    Get the terms from the term vector of each document and construct a new query.
  Automatic
    The application assumes the top X documents are relevant and the bottom Y are non-relevant, and constructs a new query based on the terms in those documents.
Span Queries
Phrase Matching
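
The automatic variant of relevance feedback can be sketched in a few lines of Python. This is a simplified illustration, not a Lucene feature: the function name, the default values of X and Y, and the scoring rule (term count in the top X documents minus term count in the bottom Y) are all assumptions made for the example.

```python
from collections import Counter


def expand_query(query: list[str], ranked_docs: list[list[str]],
                 top_x: int = 2, bottom_y: int = 1,
                 extra_terms: int = 2) -> list[str]:
    """Add the best-scoring feedback terms to the query.

    `ranked_docs` holds the token list of each retrieved document,
    best-ranked first.
    """
    relevant = ranked_docs[:top_x]
    non_relevant = ranked_docs[len(ranked_docs) - bottom_y:] if bottom_y else []
    scores = Counter()
    for tokens in relevant:
        scores.update(tokens)      # reward terms from assumed-relevant docs
    for tokens in non_relevant:
        scores.subtract(tokens)    # penalise terms from assumed-non-relevant docs
    candidates = [(term, score) for term, score in scores.items()
                  if score > 0 and term not in query]
    # Highest score first; ties broken alphabetically for determinism.
    candidates.sort(key=lambda pair: (-pair[1], pair[0]))
    return list(query) + [term for term, _ in candidates[:extra_terms]]
```

In the manual variant, the same function could be reused by passing the documents the user marked instead of slicing the ranking.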
Lucene: Basic Demo

The latest version can be obtained from
http://www.apache.org/dyn/closer.cgi/lucene/java/

To build an index, type:

java org.apache.lucene.demo.IndexFiles <dir>

To search an index, type:

java org.apache.lucene.demo.SearchFiles <index>

Terrier

Terrier: Overview

Stands for TERabyte RetrIEveR.
Open-source API (Mozilla Public Licence).
Written in cross-platform Java.
Highly compressed disk data structures.
Handles large-scale document collections.
Standard evaluation of TREC ad-hoc and known-item search retrieval results.

Terrier: Indexing
Create your own Collection decoder and Document implementation.
Centralized or distributed setting.
The indexer iterates through the collection and creates the following data structures:
  Direct Index
  Document Index
  Lexicon

Terrier: Indexing

Each document in the collection is tokenized and parsed.

Terrier: Indexing
In this way, we build the direct and document indices.

Terrier: Indexing

We also build temporary lexicons in order to reduce the memory required during indexing.

Terrier: Indexing
The inverted index is built from the existing direct index, document index and lexicon.
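
The inversion step can be illustrated with in-memory dictionaries. The Python below is only a conceptual sketch: Terrier's real structures are compressed, disk-based and keyed by term ids, whereas here the direct index is assumed to be a mapping from document id to term frequencies, and the function names are hypothetical.

```python
from collections import defaultdict


def invert(direct_index: dict[int, dict[str, int]]) -> dict[str, list[tuple[int, int]]]:
    """Turn doc -> {term: tf} into term -> [(doc, tf), ...] sorted by doc id."""
    inverted = defaultdict(list)
    for doc_id in sorted(direct_index):
        for term, tf in direct_index[doc_id].items():
            inverted[term].append((doc_id, tf))
    return dict(inverted)


def build_lexicon(inverted: dict[str, list[tuple[int, int]]]) -> dict[str, tuple[int, int]]:
    """Per-term statistics: (document frequency, total term frequency)."""
    return {term: (len(postings), sum(tf for _, tf in postings))
            for term, postings in inverted.items()}
```

For instance, a direct index `{0: {"terrier": 2, "index": 1}, 1: {"index": 3}}` inverts to a posting list `[(0, 1), (1, 3)]` for the term `index`.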

Terrier: Retrieval

Parsing
Pre-processing
Matching
Post Processing
Post Filtering

Query Language
  term1 term2
  term1^2.3
  +term1 -term2
  "term1 term2"~n

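The four query forms listed above can be recognised with a small parser. The Python below is a simplified sketch, not Terrier's own parser: the `Clause` class, its field names and the regular expression are assumptions, and the interpretation (`^` as a term weight, `+`/`-` as required/excluded, `~n` as a proximity window) follows the usual reading of this syntax.

```python
import re
from dataclasses import dataclass
from typing import Optional

# One clause: optional sign, then a quoted phrase (optionally ~n) or a bare
# term, then an optional ^weight.
TOKEN = re.compile(r'''
    (?P<sign>[+-])?
    (?:"(?P<phrase>[^"]+)"(?:~(?P<window>\d+))?
      |(?P<term>[^\s"^]+))
    (?:\^(?P<boost>\d+(?:\.\d+)?))?
''', re.VERBOSE)


@dataclass
class Clause:
    terms: list
    required: bool = False
    excluded: bool = False
    boost: float = 1.0
    window: Optional[int] = None  # n in "t1 t2"~n; None otherwise


def parse_query(query: str) -> list:
    """Split a query string into clauses (simplified, no error handling)."""
    clauses = []
    for m in TOKEN.finditer(query):
        terms = m["phrase"].split() if m["phrase"] else [m["term"]]
        clauses.append(Clause(
            terms=terms,
            required=m["sign"] == "+",
            excluded=m["sign"] == "-",
            boost=float(m["boost"]) if m["boost"] else 1.0,
            window=int(m["window"]) if m["window"] else None,
        ))
    return clauses
```

Parsing `+term1 -term2` yields one required and one excluded clause; parsing `"term1 term2"~5` yields a single two-term clause with a window of 5.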
Terrier: Retrieval

Remove stop words and apply stemming to the query.

Terrier: Retrieval
Terrier automatically selects the optimal document weighting model.

Terrier: Retrieval

If Query Expansion is applied, an appropriate term weighting model is selected and the most informative terms from the top-ranked documents are added to the query.

Terrier: Sample Applications

TREC Terrier
  An application that allows Terrier to index and retrieve from standard TREC test collections.
  Instructions are available at
  http://ir.dcs.gla.ac.uk/terrier/doc/trec_terrier.html

Terrier: Sample Applications

Desktop Search
  A Swing (graphical) application that can be used to index files from the local machine and then perform queries on them.
  The scripts for running the desktop search application are:
    desktop_search.sh (Linux, Mac OS X)
    desktop_search.bat (Windows)

Terrier: Sample Applications

Interactive Querying
  A console application for performing simple queries on an existing index and seeing which documents are returned.
  The scripts for running the console application are:
    interactive_terrier.sh (Linux, Mac OS X)
    interactive_terrier.bat (Windows)

Terrier: Demo

Lemur

Lemur: Overview

Support for XML and structured document retrieval
Interactive interfaces for Windows, Linux, and Web
Cross-platform, fast and modular code written in C++
C++, Java and C# APIs
Free and open-source software

Lemur: API

Provides interfaces to Lemur classes that are grouped at three different levels:
  Utility level
    Common utilities, such as memory management, document parsing, etc.
  Indexer level
    Converts a raw text collection to data structures for efficient retrieval.
  Retrieval level
    Abstract classes for a general retrieval architecture and concrete classes for several specific information retrieval

Lemur: Indexing

Multiple indexing methods for small, medium and large-scale (terabyte) collections.
Built-in support for English, Chinese and Arabic text.
Porter and Krovetz word stemming.
Incremental indexing.

Lemur: Retrieval

Supports major language modelling approaches such as Indri and KL-divergence, as well as vector space, tf-idf, Okapi and InQuery
Relevance and pseudo-relevance feedback
Wildcard term expansion (using Indri)
Supports arbitrary document priors (e.g., PageRank, URL depth)
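
How a language-modelling score combines with a document prior can be shown with a small query-likelihood sketch in Python. This is not Lemur's API: the function name is hypothetical, and Dirichlet smoothing with `mu = 2000` is an assumption chosen for the example, since the slide does not name a smoothing method. The score is the log prior plus the sum of smoothed log term probabilities.

```python
import math
from collections import Counter


def score(query: list[str], doc: list[str], collection_tf: dict[str, int],
          collection_len: int, mu: float = 2000.0, prior: float = 1.0) -> float:
    """log P(D) + sum over query terms of log P(q | D), Dirichlet-smoothed."""
    tf = Counter(doc)
    log_score = math.log(prior)  # document prior, e.g. derived from PageRank
    for term in query:
        p_collection = collection_tf.get(term, 0) / collection_len
        p = (tf[term] + mu * p_collection) / (len(doc) + mu)
        if p == 0.0:
            return float("-inf")  # term never occurs in the collection
        log_score += math.log(p)
    return log_score
```

With a uniform prior, a document containing the query term outranks one that does not; a sufficiently larger prior on the second document can reverse that order.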

Questions?
