http://compact.representation.of.URL.collections.with.fast.
access/ Szymon Grabowski1, Jakub Swacha2
1 Technical University of d, Computer Engineering Dept., al. Politechniki 11, 90-924 d. E-mail:
[email protected] 2 University of Szczecin, Institute of Information Technology in Management, Mickiewicza 64, 71-101 Szczecin. E-mail:
[email protected]Sok k/Bechatowa, June 2011
trie
Classical string dictionaries
burst trie (Heinz et al. 2002)
minimal acyclic DFA (Ciura i Deorowicz 2001)
Old assumptions and when they fail
Dictionary size applying to Heaps law: (n), n text size (in words), usually in 0.40.6. Texts on the web: dictionaries may have 108+ terms. According to (Ahmad & Kondrak 2005) about 20% of all query words in web searchers are non-dictionary words, including typos, but typos are only a small fraction of them. Those terms are: numerous names (people, brands, product numbers, geographical names etc.), neologisms and e-speak.
Why URLs
Web graph analyses: graph structure PLUS node info, ie. their URLs, needed. Specific characteristics (Heaps law for NL doesnt apply). May be huge.
Modern ideas for URL representation
Belazzougui et al. 2009: minimal monotone perfect hashing. E.g. it is enough to spend about 6.5 bits / key (avg) for a 106M-key URL dataset (uk-2007-05), with fast access (about 30 s per key) apart from the keys themselves. If the keys (URLs) themselves are not needed, the average 6.5 bits per key is enough to map each key to its lexicographical position.
Brisaboa et al. 2011: experimental study, many algs tested. Two techniques most successful for URLs: grammar-based RePair algororithm and plain front coding accompanied with HuTucker coding of the remaining suffixes. HuTucker: optimal among those codes that preserve lexicographical order of the keys.
5
What we do (1 / 2)
Front coding (standard technique) + phrase replacement on the residuals + Deflate (zip). Phrases: popular URL segments separated with [.&=/_-], min. length 2. http://www.skwigly.co.uk/banner/abmc.asp?b=62&z=45 potential phrases: http: | www | skwigly | co | uk | banner | abmc | 62 | 45 (b and z are eliminated as being too short). Note that front coding is also likely to remove http:// or http://www. first. 127 most freq phrases in a superblock replaced with 1-byte symbols.
What we do (2 / 2)
General philosophy: different steams, block based. Indidivual blocks compressed, access entries to blocks given. Deflate (zip) compression used. Front coding: up to length 255. The prefix bytes sent to a separate stream, with blocks of bp size. Residuals: in blocks of b lines. Common phrases: represented on a superblock level, of sb lines (sb being a multiple of b). extract(i) queries: find the prefix block, decode it, find the phrase block, extract its phrases, find the residuals block, decode it, insert back phrases, attach prefixes, 7 refer to the required line.
Datasets and results
Results in brief: about 59 bytes / URL in compressed form with avg extract time about 150600 s (@ Intel Core 2 Duo 6420 2.13 GHz).
Datasets available from the WebGraph project: http://webgraph.dsi.unimi.it/
Future work
Speed-optimize. Experiments also on fully lex-sorted URL collections (better compression). Add support for locate queries (given a key, return its index in the structure, or -1 if it doesnt exist). Smarter phrase replacement?