Text Compression

From StoneHome

This is the current plan for the giant evil monster wikipedia compression fang-drooling idiot project of doom, +3 vorpal, acid (lsd) damage

The Scheme

  • We will be keeping a series of "keys", substrings (words and control strings, mostly) which are assembled to reconstitute text strings. Alongside that they're substrings, they carry some boolean flags to assist some of the training transformations. Since we're pre-parsing, we don't need to bother building key tables incrementally, nor seeding with letter primitives, though with a dataset this large we'll probably end up with all the letters anyway.

Setting up the dictionary

  • Encode wiki commands and common html as keys. Flag these as commands for the transformations. Make individual flags for each command, so that the net quantizer has as much information about what's going on as possible.
  • Make a dictionary of the entire source. Each word becomes a key.
  • Encode the data as dictionary entry key sequences.
  • Encode differences from sentence case as keys. Flag these as case transformations for the transformations.
  • Encode differences from normalized sentence spacing as keys. Flag these as whitespace for the transformations.
  • Encode paragraph breaks as keys. Flag these as paragraph breaks, so that the the generic text predictor pass can learn to use paragraphs to indicate contextual shifts.
  • Encode every page title as a key.

Maximizing space

  • Fill the keyspace's bit width.
    • If the dictionary has E entries, then for the smallest integer N for which E is less than (2^n)-1 (that is, the minimum number of bits needed to encode E), then there are (((2^n)-1)-E) "free" dictionary slots without expanding the key recording width.
    • Make a LZ table of strings of the encoded pages. Sort the results by an entry's size times its occurrence count (ie, the total amount of space it encodes) minus the amount of space using it however many times would be needed (ie, the cost of using it.)
    • Fill the free dictionary slots with the highest ranked LZ table entries; "phrases" are now also encoded optimally.
  • This should balance common links, headers, sections, weird spacing, recording categories, et cetera, all of which should be optimal.
  • Re-encode the first-order page encodings against phrases. Keep phrases recorded in terms

Learning about our data

  • Add context to each page
    • First order context is the vector of compressed strings referring to the categories this page is a member of
    • Second order context is the vector of compressed strings referring to the categories this page's categories are subcategories of
    • Third order context is the vector of compressed strings referring to the categories this page links to

Removing entropy

  • Two-pass entropy encode
    • Generate context-aware dictionaries
      • When parsing an article, keep word counts for each category an article is a member of also. This results in smaller dictionaries for a local string generator to deal with, reducing local entropy.
      • When encoding an article, encode it against the smallest of its member dictionaries, and note which dictionary was used; this is third-order data.
      • Entropy encode the key string part of the third-order data with MTF. The entropy encoded key string with the subdictionary stamp is considered fourth-order data.

Yay for Statistics

  • Generate Markhov Predictors
    • Train a Markhov Chain against every page.
    • Train a Markhov Chain for each category against each page which belongs to said category
  • Generate placement statistics against fourth-order encoded pages
    • Preceder and Postceder frequency tables: how often A is followed by x1, and also by x1x2; similarly, preceded by y1 or y2y1 (these are also called bigrams and trigrams)

Building a Network With Gobs Of Data

  • Train a neural net to compress the pages in various passes meant to represent incremental learning
  • Provide this network with significant amounts of contextual information, categorical information, special-case encodings and statistics, in order to allow it to learn to provide superior case-aware behavior
  • Spend some time teaching the network with out-of-goal material in order to weight its sane behavior with novel data, but run a final training pass against the desired target material to reinforce overtraining.

First Steps

  • Start the training by going through every article in alphabetical order
    • Pass as inputs the fourth order encoded page data, Markhov data, the frequency tables and the three orders of context.
    • This pass represents the network's basic training, which is relatively intense: the entire MediaWiki database. This is to get the network generally functioning.
    • Pass as a boolean input a true representing that content is wikitext.

Training for Generality

  • Next, run through the entire database again. Pass no contexts.
    • When passing the data, strain out all the keys marked as whitespace, the formatting commands, but not linking, lists and other structural markup, or paragraphs.
    • Include the Markhov data from all pages, but not the category-specific Marhhov data.
    • This pass represents grammar and syntax training, and the ability to handle arbitrary text other than WikiPedia where context won't be well known.
    • Find a geometric progression to go through the pages in a fundamentally different order than alphabetical.
    • Add to this training run a bunch of classic text, as the network's "literary training."
      • For those added blocks, turn off the wikitext flag.
      • Mix the literary training in evenly with the second run of the database.

Hyperspecificity

  • Finally, run through the database a third time. Reintroduce all extra data.
    • Use a reproducible random behavior to generate the article order.
    • This pass represents over-specialization once general grammar is well understood, which for our purposes is desirable.

Final Encoding

Page Encoding

  • Data for one page is as simple as keeping a local dictionary indicator, a list of category memberships, and a content string. Therefore, a class or struct can very naturally handle a page.
  • The content string will not be encoded in width K from above, but rather in the with of the local dictionary, which we will now call L. In almost every case, L will be dramatically smaller than K, and by definition cannot be larger. Content strings are terminated with the L-bit-wide symbol 0, the equivalent of a null-terminated string. After an l-bit 0, the remaining bits are padded to u8 width with off bits.

Dictionary Encoding

  • Use a control-string system and a 4- or 5-bit encoding to represent the dictionary decoding data for the keys
  • Use arithmetic encoding to compress the dictionary decoder; decode it into RAM on boot, RAM allowing.

Neat upshots of this system

  • Since we know that there is a power of two keys, for that power of two as K, there is a K-bit perfect encoding of our dictionary. Given that each article's title will be representable as a single integer in that scheme, if we make sure that those keys are in some predictable position - say, we encode them first (after null) and keep a count - then the articles have a built-in numeric index, and we can keep a simple pointer table to get their data. If there's a class or struct to handle a page, then you therefore have a simple way to track all your pages naturally in an array or vector.
  • Because of the middle training pass, text that's not part of Wikipedia which is encoded without context shouldn't suffer too badly.

Open Questions

  • Is Wikipedia already so big that adding literature to average out its network is silly and counterproductive?
  • Should I add the alphabet as tokens anyway, so that outside data has the potential to compress words which aren't in the network, or should I so heavily special case it that there's the potential for outside data to fail? In a K-bit arbitrary size encoding of every word in a gigabyte near-flat-text database going to bat an eye at 256 new entries?
  • Using special-case encoding, is pushing a gig of text down to 200 meg even reasonable? Is this stupid?