Commit b027ad9

Jan Wieck authored and committed
Added comments about the compression algorithm as requested by Tom
Jan
1 parent 43f6ab8 commit b027ad9

1 file changed: +58 −4 lines changed

src/backend/utils/adt/pg_lzcompress.c

@@ -1,7 +1,7 @@
 /* ----------
  * pg_lzcompress.c -
  *
- *	$Header: /cvsroot/pgsql/src/backend/utils/adt/pg_lzcompress.c,v 1.6 2000/07/03 23:09:52 wieck Exp $
+ *	$Header: /cvsroot/pgsql/src/backend/utils/adt/pg_lzcompress.c,v 1.7 2000/07/06 21:02:07 wieck Exp $
  *
  * This is an implementation of LZ compression for PostgreSQL.
  * It uses a simple history table and generates 2-3 byte tags
@@ -46,7 +46,7 @@
  * The return value is the size of bytes written to buff.
  * Obviously the same as PGLZ_RAW_SIZE() returns.
  *
- * The compression algorithm and internal data format:
+ * The decompression algorithm and internal data format:
  *
  * PGLZ_Header is defined as
  *
@@ -57,8 +57,8 @@
  *
  * The header is followed by the compressed data itself.
  *
- * The algorithm is easiest explained by describing the process
- * of decompression.
+ * The data representation is easiest explained by describing
+ * the process of decompression.
  *
  * If varsize == rawsize + sizeof(PGLZ_Header), then the data
  * is stored uncompressed as plain bytes. Thus, the decompressor
@@ -108,6 +108,60 @@
  * and end up with a total compression rate of 96%, what's still
  * worth a Whow.
  *
+ * The compression algorithm
+ *
+ * The following uses the numbers of the default strategy.
+ *
+ * The compressor works best for attributes of a size between
+ * 1K and 1M. For smaller items there's not much chance of
+ * redundancy in the character sequence (except for large areas
+ * of identical bytes like trailing spaces), and for bigger ones
+ * the allocation of the history table is expensive (it needs
+ * 8 times the size of the input!).
+ *
+ * The compressor creates a table of 8192 lists of positions.
+ * For each input position (except the last 3), a hash key is
+ * built from the 4 next input bytes and the position is
+ * remembered in the appropriate list. Thus, the table points
+ * to linked lists of strings that likely match in at least
+ * their first 4 characters. This is done on the fly while the
+ * input is compressed into the output area.
+ *
+ * For each byte in the input, its hash key (built from this
+ * byte and the next 3) is used to find the appropriate list
+ * in the table. The lists remember the positions of all bytes
+ * that had the same hash key in the past, in increasing backward
+ * offset order. Now, for all entries in the used lists, the
+ * match length is computed by comparing the characters from the
+ * entry's position with the characters from the actual input
+ * position.
+ *
+ * The compressor starts with a so-called "good_match" of 128.
+ * It is a "prefer speed over compression ratio" optimizer.
+ * So if the first entry looked at already has 128 or more
+ * matching characters, the lookup stops and that position is
+ * used for the next tag in the output.
+ *
+ * For each subsequent entry in the history list, the "good_match"
+ * is lowered by 10%. So the compressor will be happier with
+ * short matches the farther it has to go back in the history.
+ * Another "speed over ratio" preference characteristic of
+ * the algorithm.
+ *
+ * Thus there are 3 stop conditions for the lookup of matches:
+ *
+ *  - a match >= good_match is found
+ *  - there are no more history entries to look at
+ *  - the next history entry is already too far back
+ *    to be coded into a tag.
+ *
+ * Finally the match algorithm checks that at least a match
+ * of 3 or more bytes has been found, because that's the smallest
+ * amount of copy information to code into a tag. If so, a tag
+ * is emitted and all the input bytes covered by it are just
+ * scanned for the history adds; otherwise a literal character
+ * is emitted and only its history entry is added.
+ *
  * Acknowledgements:
  *
  * Many thanks to Adisak Pochanayon, who's article about SLZ
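
The history-table scheme the new comment describes can be made concrete with a short C sketch. Everything below is illustrative: the entry layout, the names, and the shift/xor hash mix are assumptions, not the actual macros in pg_lzcompress.c; only the 8192-list table and the 4-byte hash key come from the comment.

#include <stddef.h>

#define HIST_SIZE 8192                 /* 8192 hash lists, as in the comment */
#define HIST_MASK (HIST_SIZE - 1)

/* One history entry: a remembered input position, linked to the next
 * older position that hashed to the same key. */
typedef struct HistEntry
{
    const unsigned char *pos;          /* position in the input buffer */
    struct HistEntry    *next;         /* older entry, larger backward offset */
} HistEntry;

/* Illustrative hash over the next 4 input bytes; the real code uses
 * its own macro, this particular mix is only an assumption. */
static unsigned int
hist_key(const unsigned char *p)
{
    return (unsigned int) ((p[0] << 9) ^ (p[1] << 6) ^ (p[2] << 3) ^ p[3]) & HIST_MASK;
}

/* Remember position p in the list for its hash key.  Prepending keeps
 * each list in increasing backward-offset order, as the comment says.
 * Entries come from a caller-provided pool, so no per-byte malloc. */
static void
hist_add(HistEntry **table, HistEntry *entry, const unsigned char *p)
{
    unsigned int key = hist_key(p);

    entry->pos = p;
    entry->next = table[key];
    table[key] = entry;
}

With two pointers per input byte, a pool of entries like this costs 8 bytes per byte of input on a 32-bit machine, which matches the "8 times the size of the input" cost the comment mentions.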

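The match lookup with the shrinking "good_match" bar and the three stop conditions could then look roughly like the following, continuing the sketch above. The tag limits MAX_OFFSET and MAX_MATCH are assumed values for what a 2-3 byte tag might encode, not figures taken from the source.

#define GOOD_MATCH_INIT 128            /* the "good_match" start value */
#define MAX_OFFSET      4095           /* assumed largest offset a tag encodes */
#define MAX_MATCH       273            /* assumed largest length a tag encodes */

/* Find the best match for input position ip by walking its history
 * list.  Returns the match length (0 if none) and sets *match_pos. */
static size_t
find_match(HistEntry **table, const unsigned char *ip,
           const unsigned char *end, const unsigned char **match_pos)
{
    unsigned int good_match = GOOD_MATCH_INIT;
    size_t       best_len = 0;
    HistEntry   *e;

    for (e = table[hist_key(ip)]; e != NULL; e = e->next)
    {
        size_t len = 0;

        /* Stop condition 3: entry too far back to be coded into a tag.
         * Later entries are even farther back, so stop entirely. */
        if ((size_t) (ip - e->pos) > MAX_OFFSET)
            break;

        /* Compare the characters at the entry's position with the
         * characters at the actual input position. */
        while (ip + len < end && len < MAX_MATCH && e->pos[len] == ip[len])
            len++;

        if (len > best_len)
        {
            best_len = len;
            *match_pos = e->pos;
        }

        /* Stop condition 1: the match is already "good enough". */
        if (len >= good_match)
            break;

        /* Prefer speed: be happier with shorter matches the farther
         * back the entries lie. */
        good_match -= good_match / 10;
    }
    /* Stop condition 2: e == NULL, no more history entries. */
    return best_len;
}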
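Finally, a sketch of the main loop that chooses between a copy tag and a literal, again continuing the sketches above. emit_tag() and emit_literal() are hypothetical stand-ins for the real control-byte/tag encoding, which the file's other comments describe.

#include <stdio.h>

/* Hypothetical output primitives; the real encoder packs tags and
 * literals behind control bytes.  Printing stands in for that here. */
static void emit_tag(size_t off, size_t len) { printf("tag(off=%zu,len=%zu)\n", off, len); }
static void emit_literal(unsigned char c)    { printf("lit(%c)\n", c); }

static void
compress_sketch(const unsigned char *in, size_t len,
                HistEntry **table, HistEntry *pool)
{
    const unsigned char *ip = in;
    const unsigned char *end = in + len;
    size_t next = 0;                   /* next free pool entry */

    while (ip + 4 <= end)              /* need 4 bytes for a hash key */
    {
        const unsigned char *mp = NULL;
        size_t mlen = find_match(table, ip, end, &mp);

        if (mlen >= 3)                 /* smallest copy a tag can code */
        {
            emit_tag((size_t) (ip - mp), mlen);
            /* Bytes covered by the tag produce no output of their own;
             * they are only scanned for the history adds. */
            for (; mlen > 0 && ip + 4 <= end; mlen--, ip++)
                hist_add(table, &pool[next++], ip);
            ip += mlen;                /* skip any covered remainder near the end */
        }
        else
        {
            emit_literal(*ip);         /* literal: one history entry only */
            hist_add(table, &pool[next++], ip);
            ip++;
        }
    }
    while (ip < end)                   /* last 3 bytes: always literals */
        emit_literal(*ip++);
}

A caller would zero-initialize the table and supply a pool of len entries; the table lookup and the history adds happen on the fly while the input is compressed, just as the comment describes.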