Szymon Grabowski Jakub Swacha: [email protected] - PL

This document proposes a new method for compactly representing large collections of URLs to allow for fast access. The method combines front coding, phrase replacement on residuals, and Deflate compression. It achieves a compressed representation of about 5-9 bytes per URL with average extraction times of 150-600 microseconds. The technique divides the URLs into blocks that are compressed individually and stores common phrases separately for improved compression. Evaluation on real-world URL datasets shows this approach effectively balances compact representation with fast access to URLs.


http://compact.representation.of.URL.collections.with.fast.access/

Szymon Grabowski (1), Jakub Swacha (2)

(1) Technical University of Łódź, Computer Engineering Dept., al. Politechniki 11, 90-924 Łódź. E-mail: [email protected]
(2) University of Szczecin, Institute of Information Technology in Management, Mickiewicza 64, 71-101 Szczecin. E-mail: [email protected]

Słok near Bełchatów, June 2011

Classical string dictionaries

trie

burst trie (Heinz et al. 2002)

minimal acyclic DFA (Ciura and Deorowicz 2001)

Old assumptions and when they fail

Dictionary size according to Heaps' law: O(n^β), where n is the text size (in words) and β is usually in the range 0.4–0.6. Texts on the web: dictionaries may have 10^8+ terms. According to (Ahmad & Kondrak 2005), about 20% of all query words submitted to web search engines are non-dictionary words. These include typos, but typos are only a small fraction of them. The remaining terms are numerous names (people, brands, product numbers, geographical names, etc.), neologisms and e-speak.
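To make the sublinear growth concrete, here is a minimal sketch of Heaps' law in its usual form V(n) = K * n^β. The constant K = 30 and the exponent β = 0.5 are illustrative assumptions, not values taken from the slides, and the function name is hypothetical.

```python
def heaps_vocabulary(n_words: float, k: float, beta: float) -> float:
    """Estimated number of distinct terms in a text of n_words words."""
    return k * n_words ** beta


# K and beta below are assumed, illustrative values only.
for n in (10**6, 10**9, 10**12):
    print(n, round(heaps_vocabulary(n, k=30, beta=0.5)))
```

Under this model the dictionary grows roughly with the square root of the text size, which is the assumption that breaks down for web-scale term and URL collections.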

Why URLs
Web graph analyses need the graph structure plus node information, i.e. the URLs of the nodes. URL collections have specific characteristics (Heaps' law for natural language does not apply) and may be huge.

Modern ideas for URL representation

Belazzougui et al. 2009: minimal monotone perfect hashing. E.g., for a 106M-key URL dataset (uk-2007-05) it is enough to spend about 6.5 bits per key on average, on top of the keys themselves, with fast access (about 30 μs per key). If the keys (URLs) themselves are not needed, the average 6.5 bits per key is enough to map each key to its lexicographical position.

Brisaboa et al. 2011: an experimental study with many algorithms tested. The two most successful techniques for URLs: the grammar-based Re-Pair algorithm, and plain front coding combined with Hu–Tucker coding of the remaining suffixes. Hu–Tucker codes are optimal among the codes that preserve the lexicographical order of the keys.
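Plain front coding, mentioned above, can be sketched as follows. This is a minimal illustration under stated assumptions, not the code of any of the cited papers: the function names are hypothetical, the input is assumed to be in (roughly) sorted order, and the prefix length is capped at 255 so that it fits in one byte.

```python
def common_prefix_len(a: str, b: str, limit: int = 255) -> int:
    """Length of the prefix shared by a and b, capped at `limit`."""
    n = min(len(a), len(b), limit)
    i = 0
    while i < n and a[i] == b[i]:
        i += 1
    return i


def front_encode(urls):
    """Encode each URL as (shared prefix length, residual) w.r.t. its predecessor."""
    encoded = []
    prev = ""
    for url in urls:
        k = common_prefix_len(prev, url)
        encoded.append((k, url[k:]))
        prev = url
    return encoded


def front_decode(encoded):
    """Rebuild the URLs by re-attaching each prefix to its residual."""
    urls = []
    prev = ""
    for k, residual in encoded:
        prev = prev[:k] + residual
        urls.append(prev)
    return urls


urls = [
    "http://www.example.com/",
    "http://www.example.com/about",
    "http://www.example.org/",
]
encoded = front_encode(urls)
print(encoded)
```

Only the residuals (here "about" and "org/") still need to be stored in full, which is why they are the natural target for further coding.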

What we do (1 / 2)
Front coding (a standard technique) + phrase replacement on the residuals + Deflate (zip).
Phrases: popular URL segments delimited by the separators [.&=/_-], with a minimum length of 2.
Example: http://www.skwigly.co.uk/banner/abmc.asp?b=62&z=45
Potential phrases: http: | www | skwigly | co | uk | banner | abmc | 62 | 45 (b and z are eliminated as being too short).
Note that front coding is likely to remove http:// or http://www. first.
The 127 most frequent phrases in a superblock are replaced with 1-byte symbols.
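The phrase step can be sketched as below. Several details are assumptions made for illustration: the function names are hypothetical, the input is pure ASCII, and the 1-byte symbols are taken from the byte values 128 to 254, a choice the slides do not specify. The sketch also works on whole URLs rather than on front-coding residuals. Splitting strictly on [.&=/_-] additionally yields the token asp?b for the example URL, which the slide's list does not show, so the authors' tokenizer may treat ? differently.

```python
import re
from collections import Counter

SEPARATORS = re.compile(r"([.&=/_-])")  # capturing group keeps the separators
MIN_PHRASE_LEN = 2
MAX_PHRASES = 127
FIRST_SYMBOL = 128  # assumed: byte values unused by ASCII URLs


def tokens(line: str):
    """Split a line into alternating phrase candidates and separators."""
    return SEPARATORS.split(line)


def build_phrase_table(lines):
    """Pick the most frequent phrase candidates of one superblock."""
    counts = Counter(
        tok for line in lines for tok in tokens(line)[::2]
        if len(tok) >= MIN_PHRASE_LEN
    )
    return [phrase for phrase, _ in counts.most_common(MAX_PHRASES)]


def encode_line(line: str, table) -> bytes:
    """Replace every token found in the table with its 1-byte symbol."""
    code = {p: bytes([FIRST_SYMBOL + i]) for i, p in enumerate(table)}
    out = bytearray()
    for tok in tokens(line):
        out += code.get(tok, tok.encode("ascii"))
    return bytes(out)


def decode_line(data: bytes, table) -> str:
    """Expand 1-byte symbols back into phrases."""
    return "".join(
        table[b - FIRST_SYMBOL] if b >= FIRST_SYMBOL else chr(b) for b in data
    )


url = "http://www.skwigly.co.uk/banner/abmc.asp?b=62&z=45"
candidates = [t for t in tokens(url)[::2] if len(t) >= MIN_PHRASE_LEN]
table = build_phrase_table([url])
packed = encode_line(url, table)
print(candidates, len(url), len(packed))
```

Because separators are single characters and phrases have at least two, a separator can never collide with a table entry, so decoding is unambiguous.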

What we do (2 / 2)
General philosophy: different streams, block-based. Individual blocks are compressed separately, and access entries to the blocks are provided. Deflate (zip) compression is used.
Front coding: prefix lengths up to 255. The prefix bytes are sent to a separate stream, in blocks of size bp.
Residuals: in blocks of b lines.
Common phrases: represented at the superblock level, a superblock being sb lines (sb is a multiple of b).
extract(i) queries: find the prefix block, decode it, find the phrase block, extract its phrases, find the residuals block, decode it, insert the phrases back, attach the prefixes, and refer to the required line.
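The block layout and the extract(i) query can be sketched as follows. This is a simplified illustration, not the authors' implementation: the phrase stream is left out, the prefix and residual streams use the same block size (the slides use separate sizes bp and b), front coding is assumed to restart at each block so that blocks decode independently, and Python's zlib (Deflate inside a zlib wrapper) stands in for the compressor. All function names are hypothetical.

```python
import zlib


def common_prefix_len(a: bytes, b: bytes, limit: int = 255) -> int:
    """Length of the prefix shared by a and b, capped so it fits in one byte."""
    n = min(len(a), len(b), limit)
    i = 0
    while i < n and a[i] == b[i]:
        i += 1
    return i


def build_blocks(urls, block_size):
    """Front-code each block and compress its two streams separately."""
    blocks = []
    for start in range(0, len(urls), block_size):
        prefix_lens = bytearray()
        residuals = []
        prev = b""  # front coding restarts in every block (assumption)
        for url in urls[start:start + block_size]:
            line = url.encode("ascii")
            k = common_prefix_len(prev, line)
            prefix_lens.append(k)
            residuals.append(line[k:])
            prev = line
        blocks.append((zlib.compress(bytes(prefix_lens)),
                       zlib.compress(b"\n".join(residuals))))
    return blocks


def extract(blocks, i, block_size):
    """Return the i-th URL by decoding only the block that contains it."""
    prefix_z, residual_z = blocks[i // block_size]
    prefix_lens = zlib.decompress(prefix_z)
    residuals = zlib.decompress(residual_z).split(b"\n")
    prev = b""
    for k, residual in zip(prefix_lens[: i % block_size + 1], residuals):
        prev = prev[:k] + residual
    return prev.decode("ascii")


urls = [
    "http://www.example.com/",
    "http://www.example.com/a.html",
    "http://www.example.com/b.html",
    "http://www.example.org/",
    "http://www.example.org/x/y",
    "http://www.example.org/x/z",
]
BLOCK_SIZE = 4
blocks = build_blocks(urls, BLOCK_SIZE)
print(extract(blocks, 5, BLOCK_SIZE))
```

The block size sets the trade-off: larger blocks give Deflate more context and compress better, while each extract(i) has to decompress and replay more lines.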

Datasets and results

Results in brief: about 5–9 bytes / URL in compressed form, with an average extract time of about 150–600 μs (on an Intel Core 2 Duo 6420, 2.13 GHz).

Datasets available from the WebGraph project: http://webgraph.dsi.unimi.it/

Future work
Speed optimization. Experiments on fully lex-sorted URL collections as well (better compression). Support for locate queries (given a key, return its index in the structure, or -1 if it does not exist). Smarter phrase replacement?
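One possible shape of a locate query is a binary search driven by extract. This is only a sketch of the interface described above, not a design proposed in the slides: it assumes the collection is fully lexicographically sorted, and the `extract` argument is a hypothetical callable returning the i-th key.

```python
def locate(key, size, extract):
    """Return the index of `key` among `size` sorted keys, or -1 if absent."""
    lo, hi = 0, size
    while lo < hi:
        mid = (lo + hi) // 2
        if extract(mid) < key:
            lo = mid + 1
        else:
            hi = mid
    return lo if lo < size and extract(lo) == key else -1


# Usage with a plain sorted list standing in for the compressed structure.
keys = sorted([
    "http://www.example.com/",
    "http://www.example.com/about",
    "http://www.example.org/",
])
print(locate("http://www.example.org/", len(keys), keys.__getitem__))
print(locate("http://www.example.net/", len(keys), keys.__getitem__))
```

Each probe costs one extract call, so a practical version would more likely search over the first key of each block and decode a single block at the end.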
