
Tags: kaitz/fxcm


v26

Update

Known dictionary words are now compared by their codewords; previously, the text strings were compared.
Adjusted the global StateMap prediction.
ContextMap (HT 128): reduced predictions from 6/4 to 4/3 per context. Uses a single internal StateMap; all context states are updated with it.
ContextMap (HT 32): reduced predictions from 5/4 to 3/2 per context. Removed StateMap-based predictions.
Added a StationaryMap for 2 contexts.
WordsContext now also uses the codeword of a dictionary word.
Added SentenceContext for sentence management; max 64 sentences (WordsContexts). Similarity search is performed by comparing codewords (by default, 53% similarity counts as a match).
Added the Pronoun word type to the stemmer.
Added InDirectStateMap with order-w mixing of primary predictions (similar to paq9a/zpaq).
Added partial sentence contexts.
Grouped SentenceContexts for lists ('*'), tables, wikilinks, and regular sentences (4 in total).
Removed SparseMatchModel.
Removed 4 SmallStationaryContextMap contexts.
Increased mixer count from 12 to 24.
Added 7 new ContextMaps.
Added 22 new InDirectStateMap contexts.
Adjusted mixer parameters and contexts.
Adjusted ContextMap memory usage.
There are 3 mixer layers (+1 in every InDirectStateMap).
For layer-0 mixers, about 40% of updates are skipped.
Some predictions are skipped if the line is a Category link, or after the 'See also', 'References', 'Bibliography' or 'External links' sections.
Some low-memory ContextMaps are reset after every page (Wikipedia article); the StateMap, if one exists, is preserved.
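As a rough illustration of the SentenceContext similarity search above, here is a minimal sketch. Only the 53% default threshold comes from the notes; the set-overlap measure and all names here are assumptions, not the actual fxcm implementation.

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

// Fraction of codewords in `a` that also occur in `b` (hypothetical
// similarity measure; fxcm's actual comparison may differ).
double CodewordSimilarity(const std::vector<uint32_t>& a,
                          const std::vector<uint32_t>& b) {
  if (a.empty()) return 0.0;
  int shared = 0;
  for (uint32_t cw : a)
    if (std::find(b.begin(), b.end(), cw) != b.end()) ++shared;
  return static_cast<double>(shared) / a.size();
}

// Default threshold of 53% per the changelog.
bool SentencesMatch(const std::vector<uint32_t>& a,
                    const std::vector<uint32_t>& b) {
  return CodewordSimilarity(a, b) >= 0.53;
}
```

A stored sentence would then be reused as context whenever a new sentence's codewords clear the threshold against it.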

v24

Update fxcm.cpp

* Update model to fx2-cmix level
* Add match skip

v22


* Reverse dictionary transform: the dictionary is loaded once it is found and decompressed. Decoded text has a separate buffer from the coded byte stream.
* Natural-language processing using a stemmer (from paq8px(d)).
* The stemmer has new word types: Article, Conjunction, Adposition, ConjunctiveAdverb.
* Some word-related contexts change based on the type of the previous word, and some words are removed from word streams depending on the last word type. This improves compression.

There are four word streams:
* 1. Basic stream of undecoded words.
* 2. Decoded word stream after stemming, for sentences. Contains all words. Reset when a sentence ends.
* 3. Decoded word stream after stemming, for paragraphs. Contains words that are not Conjunction, Article, Male, Female, or ConjunctiveAdverb. Reset when a paragraph ends.
* 4. Decoded word stream after stemming. Contains words that are not Conjunction, Article, Male, Female, Adposition, AdverbOfManner, or ConjunctiveAdverb.
* Word limit per stream is increased from 64 to 256 words.
* New context that uses the stemmer and the decoded plaintext. Some global contexts change based on word type: for ConjunctiveAdverb or Conjunction, updating in stream 1 is skipped; Conjunction also triggers a sentence reset; etc. Recognizing these word types enabled large compression improvements.
* In some cases words between certain character pairs are removed from streams 2 and 3: \
=| - wiki templates \
<> - html/xml tags \
[| - wiki links \
() - usually words in sentences
* Main predictors are split between three different ContextMaps, which improves compression. Hash-table sizes are 32, 64 (standard for paq8 versions), or 128 bytes per context: 32 bytes suits small-memory contexts (below 256 KB), 64 medium-sized contexts (up to 16 MB), and 128 large-memory contexts (more than 16 MB).
* One state table is removed and replaced with another. State tables are generated at runtime to reduce code size.
* Added a sparse match model with a gap of 1-2 bytes and a minimum match length of 3-6 bytes, mostly for escaped UTF-8.
* Detection of math, pre, nowiki, and text tags in the decoded text stream. Some word-related contexts are not used while the content of the first three tags is compressed. Improves compression speed.
* More parsing of lists and paragraphs, so that predictor contexts are as good as they can be.
* Optimized context skipping in the main predictors.
* The main predictor's context bias is not forwarded to the cmix floating-point mixers; instead, a single prediction bias is set. This avoids unnecessary expansion of the mixer weight space and keeps memory/CPU usage lower. Some other predictions are also not forwarded, as they make compression worse and slower.
* Some mixers' and APMs' context sizes are larger so that predictions can be better.
* The partially/fully decoded word's index into the dictionary is used as a mixer context in the fxcm mixer.
* Some variables are renamed for better readability.
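The per-stream word-type filtering described for streams 3 and 4 can be sketched as follows. The enum and function names are hypothetical; only the sets of filtered word types are taken from the stream descriptions above.

```cpp
// Hypothetical word-type tags mirroring the stemmer categories in the notes.
enum class WordType {
  None, Conjunction, Article, Male, Female,
  Adposition, AdverbOfManner, ConjunctiveAdverb
};

// Stream 3 (paragraph stream): drop Conjunction, Article, Male, Female,
// and ConjunctiveAdverb.
bool KeepInParagraphStream(WordType t) {
  switch (t) {
    case WordType::Conjunction:
    case WordType::Article:
    case WordType::Male:
    case WordType::Female:
    case WordType::ConjunctiveAdverb:
      return false;
    default:
      return true;
  }
}

// Stream 4 (global stream): additionally drop Adposition and AdverbOfManner.
bool KeepInGlobalStream(WordType t) {
  return KeepInParagraphStream(t) &&
         t != WordType::Adposition && t != WordType::AdverbOfManner;
}
```

Each decoded, stemmed word would be appended only to the streams whose filter it passes, so the narrower streams stay focused on content-bearing words.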
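The hash-table sizing rule for the three ContextMaps amounts to something like the following sketch. The function name is made up; the thresholds are the ones stated in the notes.

```cpp
#include <cstdint>

// Bytes per hash-table bucket chosen from a context's memory footprint:
// 32 for small, 64 for medium, 128 for large contexts (per the notes).
int BucketBytesForContext(uint64_t contextMemBytes) {
  const uint64_t kSmall = 256ull * 1024;         // 256 KB
  const uint64_t kMedium = 16ull * 1024 * 1024;  // 16 MB
  if (contextMemBytes < kSmall) return 32;
  if (contextMemBytes <= kMedium) return 64;
  return 128;
}
```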
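A sparse match with a fixed gap can be sketched as below: only every (gap+1)-th byte is compared, walking backwards from the current position and from a candidate history position. This is an illustrative assumption, not the fxcm code; the real model would also locate candidates via a hash table, which is omitted here.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Sparse match check: with gap `gap`, compare every (gap+1)-th byte ending
// at `pos` against the same stride ending at `candidate`, requiring
// `minLen` sampled bytes to agree before reporting a match.
bool SparseMatch(const std::vector<uint8_t>& buf, size_t pos,
                 size_t candidate, int gap, int minLen) {
  const size_t step = static_cast<size_t>(gap) + 1;
  int matched = 0;
  while (pos >= step && candidate >= step && matched < minLen) {
    pos -= step;
    candidate -= step;
    if (buf[pos] != buf[candidate]) break;
    ++matched;
  }
  return matched >= minLen;
}
```

With a gap of 1-2 and a minimum length of 3-6 as stated above, such a model can pick up periodic patterns like escaped UTF-8, where literal bytes alternate with escape bytes.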

v17

v17-v18

Some cleanup
Adjust contexts

v15

v15-v16

-Add one new context
-Adjust one mixer
-Some cleanup
-Add comments so the reader can understand what is happening
-Tune some variables and contexts
-Improve compression speed

v13

v13-v14

-Word&Sentence context.
-Adjust some contexts

v11

v11-v12

-Wiki table/row & column context
-Add 3 new contexts based on tables
-Change some contexts

v9

v9-v10

-Change some contexts
-Increase one mixer size

v7

Page

One article as seen by the first-char context

v5

v5-6

-Add another context based on the first char
-Adjust some contexts