Copyright (C) 2010, 2011, Dell, Inc. Vi se permite sa copieze si sa distribuie materiale de la aceasta carte furnizate (1) orice material ce poate fi distribuit include aceasta licenta, (2) materialul nu este modificat, si (3) faceti nu percepe un comision sau solicita orice alte considerente, pentru copii sau pentru orice lucrari care includ material de pe acest carte. Aceste restrictii nu se aplica la normal "utilizare echitabila", definit ca citate citata in valoare totala de mai mult de o pagina. Aceasta carte poate fi descarcata gratuit de la http://mattmahoney.net/dc/dce.html.
Ultima actualizare: 3 august 2011.
Aceasta carte este pentru cititorul care vrea sa inteleaga cum functioneaza de compresie de date, sau care vrea sa scrie date software-ul de compresie. Capacitatii de programare prealabila si unele competente matematice vor fi necesare. Subiecte specifice includ:
Aceasta carte este destinat sa fie de sine statatoare. Surse sunt legate atunci cand este cazul, dar nu aveti nevoie sa faceti clic pe ele pentru a intelege materialul.
De compresie a datelor este arta de a reduce numarul de biti necesar pentru a stoca sau transmite date. De compresie pot fi pierderi sau pierderi. Detalii de Losslessly comprimat poate fi decomprimat exact la valoarea ei initiala. Un exemplu este 1848 Codul Morse. Fiecare litera a alfabetului este codificat ca o succesiune de puncte si linii. Literele cele mai frecvente in limba engleza cum ar fi E si T a primi cel mai scurt codurile. Cel putin comune precum J, Q, X, Z si sunt incadrate la codurile de cea mai lunga.
Toate algoritmi de compresie a datelor consta in cel putin un model si un programator (cu transforma preprocesing optional). Un model de estimarile de distributie a probabilitatii (E este mai frecventa decat Z). Coder atribuie coduri scurte la mai multe simboluri probabil. Exista solutii eficiente si optime a problemei de codificare. Cu toate acestea, modelarea optima a fost dovedit a nu calculabil. Modelare (sau echivalent, pronosticuri) este atat o inteligenta artificiala (AI), problema si o arta.
Compresie lossy se debaraseaza "neimportante" de date, de exemplu, detalii ale unui clip audio sau imagini care nu sunt perceptibile pentru ochi sau urechi. Un exemplu este 1953 NTSC standard pentru transmisie TV color, utilizate pana in anul 2009. Ochiul uman este mai putin sensibil la detalii fine intre culori de luminozitate egal (cum ar fi rosu si verde), decat sa luminozitate (alb-negru). Astfel, semnalul de culoare este transmis cu o rezolutie mai putin intr-o banda de frecventa mai restransa decat semnalul monocrom.
Compresie lossy consta intr-o transformare pentru a separa datele importante din neimportant, urmata de compresie fara pierderi importante pe de o parte si aruncarea inapoi restul. Transforma este o problema AI-ul, deoarece este nevoie de intelegerea a ceea ce creierul uman poate si ce nu pot percepe.
Teoria informatiei locuri de limite greu asupra a ceea ce poate si ce nu poate fi comprimat losslessly, si de cat de mult:
Acest lucru este dovedit de argumentul de numarare. Sa presupunem ca au existat un algoritm de compresie, care ar putea comprima toate sirurile de cel putin o anumita dimensiune, sa zicem, n biti. Nu sunt siruri de caractere exact 2 n diferite binare de lungime n. Un compresor universal ar trebui sa codifice fiecare intrare diferit. In caz contrar, daca doua intrari comprimat la iesire aceeasi, atunci decompresser nu ar fi in masura pentru a decomprima ca productia corect. Totusi, exista doar 2 n - 1 siruri binare mai scurta decat n biti.
De fapt, marea majoritate a siruri de caractere nu poate fi comprimat de foarte mult. Fractiune de siruri de caractere care poate fi comprimat de la n biti la biti m, este cel mai la 2 m - n. De exemplu, mai putin de 0,4% din siruri de caractere poate fi comprimat de un octet.
Fiecare compresor care poate comprima orice intrare trebuie sa se extinda, de asemenea, unele de intrare sale. Cu toate acestea, niciodata de expansiune trebuie sa fie mai mult de un simbol. Orice algoritm de compresie pot fi modificate prin adaugarea unui bit pentru a indica faptul ca restul de date sunt stocate necomprimat.
Argumentul de numarare se aplica sistemelor care ar putea comprima recursiv productia proprie. In general, comprimat de date apare aleatoriu pentru algoritmul care se comprimat, astfel incat acesta nu poate fi comprimat din nou.
Sa presupunem ca dorim sa comprime cifre de?, de exemplu, "314159265358979323846264...". Presupunem modelul nostru este ca fiecare cifra apare cu probabilitate de 0,1, independent de orice alte cifre. Luati in considerare posibile 3 coduri binare:
Cifra BCD Huffman binar
---- ---- ---- ----
0 0000 000 0
1 0001 001 1
2 0010 010 10
3 0011 011 11
4 0100 100 100
5 0101 101 101
6 0110 1100 110
7 0111 1101 111
8 1000 1110 1000
9 1001 1111 1001
--- ---- ---- ----
BPC 4.0 nu valabile 3.4
Folosind un BCD (Binary Coded Decimal) codul,? ar fi codificate ca 0011 0001 0100 0001 0101... (Spatiile sunt afisate pentru o lizibilitate numai). Raportul de compresie este de 4 biti per caracter (4 BPC). In cazul in care de intrare a fost text ASCII, productia ar fi comprimat cu 50%. Decompresser ar decoda datele prin impartire in siruri de biti 4.
Codul Huffman ar cod? ca 011 001 100 001 101 1111... Decodorul ar citi un biti la un moment dat si decoda o cifra de indata ce a constatat un meci in tabelul (fie dupa 3 sau 4 biti). Codul este unic, deoarece nu decodable cod este un prefix de orice alt cod. Raportul de compresie este de 3,4 BPC.
Codul binar nu este unic decodable. De exemplu, 111 ar putea fi decodat ca 7 sau 31 sau 13 sau 111.
Exista coduri mai bine decat codul Huffman de mai sus. De exemplu, am putea atribui codurile Huffman la perechi de cifre. Exista 100 de perechi, fiecare cu probabilitate 0.01. Am putea atribui 6 coduri de biti (000000 prin 011011), la 00 pana la 27, si 7 biti (0,111,000 prin 1,111,111), la 28, prin 99. Lungimea medie este de 6.72 Codul de biti per pereche de cifre, sau 3.36 BPC. In mod similar, grupuri de codificare de 3 cifre folosind 9 sau 10 biti ar genera 3.3253 BPC.
Shannon si Weaver (1949) a demonstrat ca cele mai bune le puteti face pentru un simbol cu probabilitatea p este atribui un cod de log lungime de 2 1 / p. In acest exemplu, log 2 1/0.1 = 3.3219 BPC.
Shannon a definit continutul informatiei scontate sau echivoc (numit acum entropia ) de un X variabila aleatoare ca o lungime de cod de asteptat. Sa presupunem ca X poate avea valorile X 1, X 2,... si ca fiecare X i are probabilitatea p (i). Apoi entropia X este H (X) = E [log 2 1 / p (X)] =? i p (i) log 2 1 / p (i). De exemplu, entropia de cifre de?, in conformitate cu modelul nostru, este de 10 (0.1 log 2 1/0.1) = 3.3219 BPC. Nu exista nici un cod mai mic pentru acest model, care ar putea fi decodificate fara echivoc.
Continutul de informatii al unui set de siruri de caractere este de cel mult suma de continutul informational al siruri de caractere individuale. Daca X si Y sunt siruri de caractere, atunci H (X, Y)? H (X) + H (Y). Daca acestea sunt egale, atunci X si Y sunt independente. Stiind un sir doriti sa va spun nimic despre celelalte.
Conditionata entropia H (X | Y) = H (X, Y) - H (Y) este continutul de informatii al X dat Y. Daca X si Y sunt independente, atunci H (X | Y) = H (X).
Daca X este un sir de simboluri x 1 x 2... x n, apoi de catre lantul de regula, p (X) pot fi exprimate ca un produs al secventei de predictii simbol conditionata de simboluri anterioare: p (X) =? I p (x i | x 1.. i-1). De asemenea, continutul de informatii H (X) X sir aleatorii este suma entropiile conditionata de fiecare simbol, avand in vedere simbolurile precedente: H (X) =? i H (x i | x 1.. i-1).
Entropia este o masura atat de incertitudine si o limita inferioara de compresie de asteptat. Entropia de o sursa de informare este limita pe care ar trebui sa aveti posibilitatea sa-l comprimati. Exista metode eficiente de codificare, cum ar fi codurile aritmetica, care sunt pentru toate scopurile practice optime in acest sens. Ar trebui sa se sublinieze, totusi, ca entropia poate fi calculata doar pentru o distributie de probabilitate cunoscuta. Dar, in general, modelul nu este cunoscuta.
Am modelat cifre de? ca uniform distribuite si independente. Avand in vedere ca modelul, teorema lui Shannon locuri de codificare o limita de greu pe cele mai bune de compresie, care ar putea fi atins. Cu toate acestea, este posibil sa se utilizeze un model mai bun. Cifre de? nu sunt cu adevarat aleatoare. Cifrele sunt doar necunoscute pana cand le calcula. Un compresor inteligent ar putea recunoaste cifre de? si codifica ca pe o descriere care inseamna "primul milion de cifre de pi", sau ca un program care reconstituie datele atunci cand rulati. Cu modelul nostru anterior, cel mai bun am putea face este (10 6 log 2 10) / 8? 415241 bytes. Cu toate acestea, exista programe foarte mici, care poate iesire?, unele la fel de mici ca 52 bytes.
Argumentul de numarare spune ca cele mai multe siruri de caractere nu sunt compresibile. Deci, este un fapt destul de remarcabil ca cele mai multe siruri de caractere ca ne pasa, de exemplu textul in limba engleza, imagini, software-ul, senzorul de lecturi, si ADN-ul, sunt compresibile fapt. Aceste siruri de caractere au, in general, scurte descrieri, daca acestea sunt descrise in limba engleza sau ca un program in C sau codul masina x86.
Solomonoff (1960, 1964), Kolmogorov (1965), si Chaitin (1966) a propus un independent universal a-priori distributie de probabilitate de peste siruri de caractere pe baza lungimii lor descriere minime. Probabilitatea algoritmica a unui sir x este definit ca fractiune de programe aleatorii in unele L limba pe care x iesire, in cazul in care fiecare program de M este ponderat de 2 - | M | si | M | este lungimea de M in biti. Aceasta probabilitate este dominat de cel mai scurt astfel de program. Noi numim aceasta lungime complexitatea Kolmogorov K L (x) din x.
Probabilitate algoritmica si complexitatea unui sir x depind de alegerea L limba, ci numai printr-o constanta, care este independent de x. Sa presupunem ca M1 si M2 sunt codificari a lui x in limbile L1 si L2, respectiv. De exemplu, daca L1 este C + +, atunci M1 va fi un program in C + + care iesiri x. In cazul in care L2 este limba engleza, M2 ar fi o descriere de x cu detalii suficienta pentru ca ar permite sa scrie x exact. Acum este posibil pentru orice pereche de limbi straine pentru a scrie intr-o limba de un compilator sau interpret sau norme pentru a intelege alte limbi. De exemplu, ai putea scrie o descriere a limbajului C + + in limba engleza, astfel incat ai putea (in teorie) a citit orice program C + + si a prezice productia sa de "functionare" se in capul tau. In schimb, ai putea (in teorie) a scrie un program in C + +, care de intrare o descriere in limba engleza si tradus in C + +. Marimea limbaj de descriere sau compilatorul nu depinde de x in nici un fel. Atunci, pentru orice M1 descriere in orice limba de L1 orice x, este posibil, sa gaseasca o descrption M2 in orice limba L2 alte x prin adaugarea la M1 o descriere lungime fixa??de L1 scrise in L2.
Nu este dovedit faptul ca probabilitatea de algoritmica este o probabilitate de adevarat universal anterior. Cu toate acestea, este larg acceptat pe motive empirice datorita succesului ei in predictie ordine si masina de invatare intr-o gama larga de tipuri de date. In masina de invatare, lungimea minima descriere principiul de a alege cea mai simpla ipoteza in concordanta cu datele de formare se aplica la o gama larga de algoritmi. Acesta formalizeaza Occam Razor lui. Occam a remarcat in secolul al 14'th, ca (parafrazand), "cel mai simplu raspuns este, de obicei raspunsul corect". Razor Occam este universal aplicat in toate stiintele, deoarece stim din experienta ca cea mai simpla sau mai elegant (adica cel mai scurt), teorie care explica de date tinde sa fie cel mai bun predictor de experimente viitoare.
Pentru a rezuma, in cele mai bune de compresie putem realiza pentru orice x sir este sa-l codifica ca M cel mai scurt program in unele L limba pe care iesirile x. In plus, alegerea L devine mai putin importanta ca siruri de caractere obtine mai mult. Tot ceea ce ramane este de a gasi o procedura care constata M pentru orice x, in unele L. limba Cu toate acestea, Kolmogorov a demonstrat ca nu exista nici o astfel de procedura in orice limba. De fapt, nu exista o procedura pe care tocmai va ofera | M |, lungimea descriere mai scurt timp posibil de x, pentru toate x. Sa presupunem ca au existat. Apoi, ar fi posibil pentru a descrie "string in primul rand ca nu poate fi descris in mai putin de un milion de biti", care duce la paradoxul pe care am avut doar facut acest lucru. (Prin "primul", presupune o ordonare peste siruri de caractere mai scurt de la pana la cel mai lung, ruperea de raporturi lexicografic).
De asemenea, nu exista un test general pentru dezordine. Li si Vitanyi (2007) ofera o dovada simpla de o variatie puternica a teoremei Godel de incompletitudine prima, si anume ca, in orice sistem coerent, formale (un set de axiome si reguli de inferenta), suficient de puternice pentru a descrie situatiile in aritmetica, ca nu exista la cel putin o declaratie adevarata, care nu pot fi dovedite. Li si Vitanyi dovedi ca in cazul in care sistemul este de sunet (nu poti dovedi orice declaratie falsa), atunci exista un numar infinit de situatii adevarate, dar nedovedit. In special, exista doar un numar finit de declaratii de forma "x este aleator" (K (x)? | x |), care poate fi dovedit, dintr-un numar infinit de posibile finit siruri de caractere x. Sa presupunem ca in caz contrar. Atunci ar fi posibil sa se pentru a enumera toate dovezile (listele de axiome si aplicatii ale regulilor de inferenta) si descriu "x sirul astfel ca acesta este primul care urmeaza sa fie dovedit a fi un milion de biti lung si aleatorii", in ciuda Faptul ca am dat doar o scurta descriere a acesteia. Daca F descrie orice sistem formal, atunci cel mai lung sir care poate fi dovedit a fi aleator nu este mult mai mult decat F, desi exista un numar infinit de siruri de caractere mai lungi si cele mai multe dintre ele sunt aleatoare.
De asemenea, nu exista un test general, pentru a determina daca comprimare a unui sir poate fi imbunatatita mai departe, pentru ca ar fi echivalenta cu testare, indiferent daca sirul de caractere comprimat este aleatoare.
Deoarece determinarea lungimii celei mai scurte descrieri ale siruri de caractere nu este calculabil, nici unul nu este de compresie optim. Nu este greu de gasit cazurile dificile. De exemplu, ia in considerare scurta descriere "un sir de un milion de bytes zero, comprimat cu AES in mod CBC cu tasta" foo ". Pentru orice program care nu stie cheia, datele arata complet aleatorii si incompresibil.
Predictii este intuitiv legate de intelegerea. Daca ati inteles-o secventa, atunci aveti posibilitatea sa il prezice. Daca ati inteles-o limba, atunci ati putea prezice ceea ce ar putea aparea alaturi de cuvant intr-un paragraf in aceasta limba. Daca ati inteles o imagine, atunci ati putea prezice ce s-ar putea afla sub portiuni ale imaginii care sunt acoperite in sus. Alternativ, date aleatoare are nici un sens si nu este previzibil. Acest lucru sugereaza ca intuitia Predictii sau de compresie ar putea fi folosite pentru a testa sau masura intelegere. Am formaliza acum aceasta idee.
Compresie optima, in cazul in care au fost calculabil, ar rezolva optim de inteligenta artificiala (AI) problema in conformitate cu doua definitii foarte diferite de "inteligenta": testul Turing (Turing, 1950), si universala de informatii (Legg si Hutter, 2006).
Turing propus pentru prima data un test pentru AI pentru a ocoli intrebarea filozofic dificila (pe care el a considerat irelevante) de masini daca ar putea crede. Acest test, acum cunoscut sub numele de testul Turing, este acum larg acceptat. Testul este un joc jucat de doi oameni care nu au mai intalnit si aparatul supus incercarii. Un om (judecator), comunica cu alt om (confederat) si cu masina printr-un terminal. Atat confederat si masina incerca sa convinga judecatorul ca fiecare este uman. In cazul in care judecatorul nu poate ghici corect, care este masina de 70% din timp, dupa 10 minute de interactiune, atunci aparatul se spune ca au AI. Turing a dat urmatorul exemplu al unui dialog posibile:
I: Va rugam sa scrieti-mi un sonet pe tema Podul Forth.
A: ma numar pe aceasta. N-am putea scrie poezie.
I: Se adauga 34957 la 70764.
A: (Pauza aproximativ 30 de secunde si apoi da-ca raspuns) 105621.
Q: Nu joci sah?
A: Da.
I: Am K la K1 meu, si nici alte piese. Ai K doar la K6 si R de la R1. Este a muta dumneavoastra. Ce vrei sa joci?
A: (Dupa o pauza de 15 secunde) R-R8 mate.
Ar trebui sa fie evident ca comprimarea transcrierile ca acest lucru necesita abilitatea de a calcula un model de forma P (A | Q) = p (AC) / P (Q) in cazul in care Q este contextul de pana la intrebarea curenta, iar A este raspuns. Dar, in cazul in care un model ar putea face astfel de predictii cu precizie, atunci acesta ar putea genera, de asemenea, raspunsurile imposibil de distins de cea a unui om.
Prezicerea transcrierile este similar cu problema de a prezice limbajul comun in scris. Este nevoie, in ambele cazuri vast, din lumea reala de cunostinte. Shannon (1950) estimeaza ca, continutul de informatii in scris case-insensitive limba engleza fara punctuatie este 0.6 - 1.3 biti pe caracter, bazate pe experimente in care subiectii umani ghicit caractere succesive in text cu ajutorul tabelelor de scrisori de frecventa n-gram si dictionare. Incertitudinea se datoreaza nu atat de mult de variatie in obiectul de indemanare si umane, aceasta se datoreaza faptului ca diferite misiuni probabilitatea sa conduca la aceleasi secvente observate ghicitul. Cu toate acestea, in cele mai bune compresoarele text sunt doar acum comprimarea langa capatul de sus al acestei game.
Legg si Hutter a propus doua definitie, universal de informatii, care urmeaza sa fie mult mai generale decat inteligenta umana Turing. Ei considera problema a recompensa-seeking agenti in medii complet arbitrare descrise de programe aleatorii. In acest model, un agent comunica cu un mediu prin trimiterea si primirea de simboluri. Mediul trimite, de asemenea, o consolidare sau semnal de recompensa pentru a agentului. Scopul acestui agent este de a maximiza recompensa acumulate. Universale de informatii este definita ca rasplata asteptat peste toate mediile posibile, in cazul in care probabilitatea de fiecare mediu descris de un program de M este algoritmice, proportionala cu 2 - | M |. Hutter (2004, 2007) a demonstrat ca optima (dar nu calculabil) strategie pentru agentul este de a ghici dupa fiecare intrare ca distributia de peste M este dominata de cel mai scurt program coerent de observare cu trecutul.
Hutter solicita aceasta strategie AIXI. Este, desigur, este doar problema noastra de compresie uncomputable aplicat la o transcriere de interactiune trecut. AIXI poate fi considerat, de asemenea, o declaratie formala si o dovada a lui Occam Razor. Cel mai bun predictor al viitorului este simpla teorie sau mai scurt, care explica trecut.
Nu exista nici un astfel de lucru ar fi compresia universal, de compresie recursive, sau de compresie de date aleatoare.
Cele mai multe siruri de caractere sunt aleatoare. Cele mai multe siruri de caractere nu sunt semnificative.
Compresie modelare = + codare. Programarea este o problema rezolvata. Modelarea este provably nu rezolvabil.
Compresie este atat o arta si o problema de inteligenta artificiala. Cheia de compresie este de a intelege datele pe care doriti sa le comprimati.
Un reper de compresie a datelor masuri de raportul de compresie pe un set de date, si, uneori, utilizarea memoriei si viteza pe un anumit computer. Unele valori de referinta a evalua dimensiunea numai, pentru a evita dependente hardware. Raportul de compresie este adesea masurata prin marimea fisierului comprimat de iesire, sau in biti per caracter (BPC) sensul cuvantului biti comprimat pe octet necomprimat. In ambele cazuri, un numar mai mic sunt mai bune. 8 bpc nu inseamna compresie. 6 BPC mijloace de compresie de 25% sau 75% din dimensiunea originala.
In general, exista o modalitate de 3 compromis intre marimea, viteza, si utilizarea memoriei. Compresoarele clasat pe primul loc de marime necesita o multime de resurse de calcul.
Corpus Calgary este reperul cel mai vechi de compresie inca in uz. El a fost creat in 1987 si a descris intr-un sondaj de modele de compresie textului in 1989 (Bell, Witten si Cleary, 1989). Se compune din 14 imagini, cu o suprafata totala de 3141622 octeti, dupa cum urmeaza:
111261 BIB - text ASCII in UNIX "se refera" format - 725 referinte bibliografice.
768771 BOOK1 - text neformatat ASCII - Thomas Hardy: Departe de multime Madding.
610856 BOOK2 - Witten - text ASCII in UNIX "troff" Format: Principiile de vorbire calculator.
102400 GEO - 32 de numere de biti in format in virgula mobila IBM - date seismice.
377109 NOUTATI - text ASCII - USENET fisier batch pe o varietate de subiecte.
21504 obj1 - VAX program executabil - elaborarea de PROGP.
246814 obj2 - Macintosh program executabil - "Sistemul de Sprijin de cunostinte".
53161 PAPER1 - UNIX "troff" format - Witten, Neal, Cleary: Aritmetica codificare pentru compresie a datelor.
82199 PAPER2 - UNIX "troff" format - Witten: Computer (in) de securitate.
513216 PIC - 1728 x 2376 imagine bitmap (MSB primul): textul in limba franceza si diagrame linie.
39611 PROGC - codul sursa in C - UNIX comprima v4.0.
71646 PROGL - codul sursa in Lisp - software de sistem.
49379 PROGP - codul sursa in Pascal - program de compresie pentru a evalua PPM.
93695 TRANS - transcrierea unei sesiuni de terminale - caractere ASCII si controlul.
Struture a corpusului este prezentata in diagrama de mai jos. Fiecare pixel reprezinta un meci intre aparitiile consecutive dintr-un sir. Culoarea pixel reprezinta lungimea meciului: negru pentru 1 octet, rosu pentru 2, verde si albastru pentru 4 pentru 8. Axa orizontala reprezinta pozitia a doua aparitie a sirului. Axa verticala reprezinta distanta inapoi la meci pe o scara logaritmica. (Imagine a fost generata de FV program cu etichete adaugat de mana).

Meci de structura sir de corpus Calgary
Linia orizontala in jurul valorii de intuneric, la 60-70 in BOOK1 este rezultatul de caractere linefeed repetarea lor la acest interval regulat. Linii intunecate la 1280 si 5576 in OUG se datoreaza unor date repetate in anteturile bloc. Benzile de intuneric la multipli de 4 sunt datorita structurii octet 4 al datelor. PIC are o banda intunecata la 1 ca urmare a ruleaza lung de la zero octeti, si linii la multipli de 216, care este latimea imaginii in bytes. Curbele de albastru din partea de sus a imaginii arata meciurile intre diferite fisiere de text, si anume BIB, BOOK1, BOOK2, NEWS, PAPER1, PAPER2, PROGC, PROGL, PROGP si TRANS, toate din care contin textul in limba engleza. Astfel, comprimarea acestor fisiere impreuna pot duce la o compresie mai buna, deoarece acestea contin informatii reciproca. Nu sunt meciuri astfel sunt vazute intre OUG, obj2, PIC si cu alte fisiere. Astfel, aceste fisiere trebuie sa fie comprimate separat.
Orice alta structura poate fi vazut. De exemplu, exista dark in obj2 la 4, 6, 8, si chiar mai mare numar de cauza la lungimi de instructiuni masina 68000. Trupa intuneric in jurul valorii de 150-200 la BIB reprezinta lungimea medie a unei intrari bibliografice. Aceasta cunoastere este utila in proiectarea algoritmilor de compresie.
O mostra de text de la BOOK1:
a fost primit un procent de pana la astfel de agricultor
timp in avans ar trebui sa fie eliminate de pe stejar gasite +
ca valoarea stocului, a plantelor, si pune in aplicare care
au fost intr-adevar propria lui ar fi de aproximativ suficiente pentru a plati lui
datorii, lasand el insusi un om liber, cu hainele pe care le
sa ridicat in, si nimic mai mult.
<C Vi>
<P 88>
FAIR - CALATORIE - FOC
Doua luni a murit. Ne sunt aduse pe o
zile in luna februarie, pe care a avut loc statutul anual
sau inchirierea echitabil in judet-oras din Casterbridge.
La un capat al strazii se afla doi la trei
sute muncitori vesel si consistent de asteptare pe Sanse
- Toti oamenii de timbru a fortei de munca care sugereaza nimic
mai rau decat o tranta cu gravitatiei, si placere
nimic mai bun decat o renuntare de aceeasi intre
aceste carutasi, si waggoners au fost distins de catre
avand o bucata de bici-snur rasucit vs palariile lor;
Tabelele de mai jos arata apoi cele mai frecvente 10 de grame-n (n secvente byte) in BOOK1 pentru N de la 1 la 5, si 10 cuvinte cele mai frecvente, bigrams, si trigrame. (2 sau 3 secvente de cuvant). Cuvinte (dar nu n-grame) au fost luate in considerare ca secvente de scrisori cu alte semne de punctuatie ignorate.
Count n = 1 numar n = 2 numar n = 3 numar n = 4 numar n = 5 cuvinte Bigrams trigrame
--------- --------- ---------- -------- ----------- --- ----------- ------------------ -------
125551 20452 e 11229-lea 7991 5869 7079 890 din cele 127 I don t
72431 e 17470 el 9585 t 6366 3238 si 4071 si 648 in cele 38 pot
50027 t 17173 t 8989 el 3771 si 1918, si 3760 de la 401 la un 36 din
47836 o 15995-lea 4728 ING 3647 si 1590 a fost 3701 un 297 si a celor 34 nu stiu
44795 o 13649 o ING 4666 si 3373 1407 3565 - 292, care sa fie de 30 ar fi fost
40919 n 13255 d 4564-a 3179 din 1356 n 2265, in 268 intr-un 30 la sfarsitul anului
37561 h 11153 in 4539 o 2966 - 1262 pe care am 254 din 2217 de 30 A fost o
37007 i 10937 t 4501 ed 1942, o d 1056 din 1966 a fost de 231 nu t 28 din
36788 e 10749 E 3936 la 1853 in 1050 cu 225 la 1536 ca cele 26 de a fi o
32889 R 9974 er 3756 ng 1843 a fost 1030 din 1430-lea ei 201 de la 24 am castigat t
PIC este o scanare FAX necomprimata de o pagina de un manual francez de mai jos. Imaginea este reprezentata prin scanarea in randurile din partea din stanga sus folosind un 1 pentru negru sau 0 pentru alb. Biti sunt ambalate 8 la un octet (MSB primele) cu 216 octeti per rand. Fisierul nu are antet. Acesta poate fi vizualizat in unele software-ul de imagine ca un fisier. Pbm adaugand urmatoarele doua linii de text la inceputul.
P4
1728 2376


Stanga: PIC (redus ca marime). Dreptul: closeup de PIC.
OUG contine date seismice IBM in 32 biti format punctul plutitoare, care a devenit caduc. Fiecare numar are un bit de semn (1 daca este negativ), un exponent 7 biti, si o mantisa 24 biti, reprezentand o fractiune intre 0 si 1. Numarul este decodificat ca semn mantisa x 16 x exponent - 64. Datele decodat este grafic de mai jos. Fiecare linie reprezinta o secventa de date fie de 1344 sau 320 probe cu incepere de la partea de sus. Fisierul are o structura repetarea cu un antet octet 200 (nu apare pe site) care precede fiecare secventa.

OUG
Valorile numerice gama de -147504 - 159280. Valorile au cel mult 16 biti de precizie. Astfel, ultimii 8 biti a tuturor valorilor mantisa sunt 0. Punctele sunt mai rotunjite, dupa cum urmeaza: intre -240 si 240, toate valorile sunt rotunjite la un multiplu de 1 / 64. Alte valori cu magnitudini mai putin de 960 sunt rotunjite la 1 / 16. Intre 960 si 4048, valorile sunt rotunjite la multipli de 1 / 4. Intre 4048 si 61440, valorile sunt rotunjite la intregi. 61440 de mai sus, valorile sunt rotunjite la multipli de 16.
Testele de inceputul anului utilizat, uneori, o versiune 18 fisier de corpus, care a inclus 4 lucrari addtional (PAPER3 prin PAPER6). Programe au fost de multe ori clasificate, prin masurarea biti per caracter (BPC), pe fiecare fisier separat si raportare a acestora in mod individual sau luand media. Adaugand pur si simplu dimensiunile comprimat se numeste o "medie ponderata", deoarece acesta este ponderat fata de fisiere mai mari.
Corpus Calgary nu mai este utilizat pe scara larga, datorita dimensiunilor sale mici. Cu toate acestea, acesta a fost utilizat incepand cu anul 1996 intr-o continua provocare de compresie condusa de Leonid A. Broukhis cu premii in bani mici. Cele mai bune rapoarte de compresie stabilit ca din februarie 2010 sunt dupa cum urmeaza.
Tabel. Calgary compresie Challenge History
Dimensiune Data Nume
------- -------------------- ------
759881 septembrie 1997 Malcolm Taylor
692154 august 2001 Maxim Smirnov
680558 septembrie 2001 Maxim Smirnov
653720 noiembrie 2002, Serge Voskoboynikov
645667 ianuarie 2004 Matt Mahoney
637116 aprilie 2004 Alexander Ratushnyak
608980 decembrie 2004 Alexander Ratushnyak
603416 aprilie 2005 Przemyslaw Skibinski
596314 octombrie 2005 Alexander Ratushnyak
593620 decembrie 2005 Alexander Ratushnyak
589863 mai 2006 Alexander Ratushnyak
580170 iulie 2010 Alexander Ratushnyak
Normele de provocare Calgary specifica faptul ca dimensiunea comprimat include dimensiunea de programul de decompresie, fie ca un fisier executabil Windows sau Linux, sau sub forma de cod sursa. Acest lucru este de a evita programele care trisa prin ascunderea de informatii din corpusul in programul de decompresie. In plus, programul si fisierele comprimate trebuie sa fie ambalate intr-o arhiva (in una din mai multe formate specifice), sau 4 octeti, plus lungimea de fiecare nume de fisier este adaugat. Aceasta este pentru a preveni trisarea prin ascunderea de informatii in numele de fisiere si dimensiunile. Fara aceste masuri de precautie, cum ar fi programe de barf ar putea pretinde a comprima la zero octeti.
Documentele inainte de 2004 sunt variante particularizate de compresoare de catre autori pe baza unor algoritmi PPM (rk pentru Taylor, subtire pentru Voskoboynikov, ppmn pentru Smirnov). Observatiile ulterioare sunt variante ale open source paq6, un algoritm de amestecare context. Depunerea 2010 se bazeaza pe paq8.
Sunt urmatoarele dimensiuni comprimat si pentru unele compresoare ori pe fisierul calgary.tar, un TAR fisier care contine o concatenare de 14 imagini. Ori de compresie (CT) si ori decompresie (DT) sunt ori de proces (nu ori de perete), in cateva secunde si sunt masurate pe o T3200 de 2 GHz sub Windows Vista. Valorile subliniate indica frontiera Pareto, in sensul ca nici un alt program este mai rapid si atat de rang superior. Atunci cand optiunile sunt afisate, ele sunt de obicei, setat pentru compresie maxim sau aproape maxim pe cheltuiala de viteza si de memorie (de pana la 1 GB). Cu toate acestea, utilizarea de memorie nu este critica din cauza dimensiunii reduse a datelor de testare. Unele programe sunt testate din nou, cu optiunile implicite. Programul Dimensiunea decompresie nu este inclus. An este pentru versiunea testata. Multe programe au mai mult de un autor. Autorul listate este pentru versiunea testata.
calgary.tar CT DT Programul de Anul Optiuni Algoritmul Autor ----------- ----- ----- ------- ------- ----- ---- ---- ------ 595533 411 401 paq8l -6 CM 2007 Matt Mahoney 598983 462 463 paq8px_v67 -6 CM 2009 Jan Ondrus 605656 182 197 paq8f -6 CM 2005 Matt Mahoney 610871 549 528 paqar 4.1 - 6 CM 2006 Alexander Ratushnyak 643746 59 60 PC8 -6 CM 2010 Jan Ondrus 644190 19.9 19.9 zpaq 1.10 CM 2009 ocmax.cfg Matt Mahoney 647110 140 138 paq6 -8 CM 2003 Matt Mahoney 647,440 7.6 7.3 nanozip 0.07a-cc CM 2009 Sami Runsas 666216 9.1 9.4 durilca 0,5 ppm 2006 Dmitri Shkarin 669,497 8.5 8.1 ppmonstr J PPM-12 mai 2006, Dmitri Shkarin 674145 22.6 22.8 PPM-32 subtire 23d 2004, Serge Voskoboynikov 676904 12.4 12.3 paq9a -6 LZP + CM 2007 Matt Mahoney 681419 10.9 10.2 3.2 xwrt-L14 dict + CM 2007 Przemyslaw Skibinski 682,512 8.7 9.3 lpaq1 6 CM 2007 Matt Mahoney 682581 51.2 51.6 pimple2 CM 2006 Ilia Muraviev 686,139 9.5 8.8 lpaq9m 8 CM 2009 Alexander Ratushnyak 694,050 9.0 8.6 EPM R2 PPM 2003, Serge Osnach 696850 43.9 44.5 frasin 04A CM 2003 Eugene D. Shelwien 699,191 7.6 7.4 zpaq 1.10 CM 2009 ocmid.cfg Matt Mahoney 705,962 6.9 7.0 0.7-biti p = 5 CM 2008 Osman Turan 706,911 5.0 5.2 cmm4 CM 2008 Christopher Mattern 716731??18.7 18.7 paq1 CM 2001 Matt Mahoney 716734 103 18.2 enc CM 2003 Serge Osnach 720980 8.7 8.5 tc 5.2dev2 CM 2007 Ilia Muraviev 725,423 5.0 4.3 ccmx 7 CM 2008 crestine Martelock 731926 10.2 10.0 albine 0.79-m3-D8 PPM 2005 Andrew Filinsky, Melchiorre Caruso 733,864 5.0 4.1 uharc 0.6b-mx-md32768 PPM 2005 Uwe Herklotz 738016 3.8 3.6 cmc 7 CM 2008 crestine Martelock 741419 64.4 65.5 ocamyd 1.66 test1-M0-S8-f DMC 2008 Frank Schwellinger 743,070 1.7 1.4 ctxf-mi PPM 2003 Nikita Lesnikov 743,351 1.4 0.53 nanozip 0.07a BWT 2009 Sami Runsas 745162 662 448 bwmonstr 0.02 BWT 2009 Runsas Sami 754,037 1.8 1.8 ppmvc-O16-m256 ROLZ + PPM 2006 Przemyslaw Skibinski 754,243 1.4 1.4 ppmd J-O16-m256 PPM 2006 Dmitri Shkarin 769,400 8.5 0.62 2.00 mcomp_x32-mf3x ROLZ 2008 Malcolm Taylor 773141 0.93 0.98 ppms J-O6 PPM 2006 Dmitri Shkarin 774,397 1.1 0.62 BSC 1.0.3 BWT 2010 Ilya Grebnov 775,647 3.6 3.7 ppmx 0,5 ppm 2010 Ilia Muraviev 776,262 79.4 0.87 cuarci 0.95r beta-m1-D25-L8 LZ77 2006 Frederic Bautista 776,273 9.6 9.7 boa 0.58b PPM 1998 Ian Sutton 778760 18.7 18.7 ACB 2.00a LZ77 1997 George Buyanovsky 780549 0.93 0.71 BSC 1.0.3-2010 m2 ST5 Ilya Grebnov 783,884 1.4 1.1 miliarde de metri cubi 0.10 x86-BWT 2009 Ilia Muraviev 784,981 3.9 4.0 1.0 px CM 2006 Ilia Muraviev 785,427 14.1 6.9 m03exp 32MB BWT 2005 Michael Maniscalco, mij4x 785,541 2.3 1.8 M03 0.2 alfa 4000000 BWT 2009 Michael Maniscalco 787,007 1.6 0.72 SBC 0.970r2-m3-B63 BWT 2002 Sami Runsas 787653 0.73 0.73 0.51 intuneric BWT 2007 Malyshev Dmitri Alexandrovici 791,446 2.3 2.0 1.0 uhbc-m3-cf BWT 2003 Uwe Herklotz 794255 0.84 0.70 GRZipII 0.2.4 LZP + BWT 2004 Ilya Grebnov 798,705 10.3 3.8 BBB cf BWT 2006 Matt Mahoney 800402 0.96 0.87 2.00 mcomp_x32-MW BWT 2008 Malcolm Taylor 804992 0.57 0.67 ppmd J PPM-4 mai 2006, Dmitri Shkarin 806,208 1.1 0.57 0.95 bssc a-b16383 BWT 2005 Sergeo Sizikov 812,123 3.1 2.8 lzpxj 9 LZP + PPM 2007 Ilia Muraviev, Jan Ondrus 816,982 3.1 3.3 mcomp_x32 2.00 mc DMC-2008 Malcolm Taylor 820,307 3.6 0.42 3.2 xwrt-L6 dict + LZMA 2007 Przemyslaw Skibinski 824,573 1.5 0.16 9.08 7zip o LZMA 2009 Igor Pavlov 831828 0.36 0.50 1.6 10 inele BWT 2009 Nania Francesco Antonio 832,035 6.4 6.6 p12 NN 2000 Matt Mahoney 842,265 5.9 5.9 P6 NN 2000 Matt Mahoney 842696 0.95 0.70 szip-B41 ST6 1998 Michael Schindler 851,043 1.9 1.8 1.4 512 carlig LZP + DMC 2009 Nania Francesco Antonio 858,791 8.7 0.23 0.15 9 lzpm ROLZ 2008 Ilia Muraviev 860097 0.60 0.39 1.0.3 bzip2 -9 BWT 2005 Julian Seward 860,984 4.6 0.53 flashzip 0.99b8-m3-C7-b8 ROLZ 2010 Nania Francesco Antonio 863,350 3.0 0.06 1.00 cabarc 0.0601-m lzx: 21 LZX 1997 Microsoft Corp 864,499 17.0 4.2 qc -8 LZP? 2006 Denis Kyznetsov 870,948 1.9 0.37 1.15 balz LZ77 2008 Ilia Muraviev 872,028 5.2 0.31 0.9 lzturbo-m59 LZ77 2007 Hamid Bouzidi 873,681 1.4 0.44 1.12 quad-x ROLZ 2007 Ilia Muraviev 876,234 8.8 7.3 0.98 ha a2 PPM-4 1993 Harri Hirvola 885,273 1.3 1.3 tarsalzp LZP 2007 Piotr Tarsa 885,954 6.9 2.2 qazar-D9-X7-L7 LZP 2006 Denis Kyznetsov 900,048 2.6 2.9 dmc 100000000 DMC 1987 Gordon V. Cormack 905,889 1.2 0.17 csc31-m3-D7 LZ77 2009 Fu Siyuan 917,136 8.3 0.16 pachete de 0.91-M6-S9 LZ77 2009 Nania Francesco Antonio 935598 43.5 43.6 fpaq2 PPM + CM 2006 Nania Francesco Antonio 948568 0.67 0.75 1.60 winturtle 512 PPM 2008 Nania Francesco Antonio 975,889 35.8 0.09 kzip LZ77 2006 Ken Silverman 979349 0.37 0.40 SR2 SR 2007 Matt Mahoney 996,074 1.2 0.15 2.50 rar LZ77 1999 Eugene Roshal 997381 33.5 33.9 ctw CTW 2002 Erik Franken, Marcel Peeters 1001364 0.31 0.12 Thor e5 LZ77 2007 Oscar Garcia 1004728 0.33 0.36 BZP LZP 2008 Nania Francesco Antonio 1011786 0.47 0.03 tornada LZ77 2008 Bulat Ziganshin 1016779 3.0 0.90 0.98 ha LZ77 1993 Harri Hirvola 1020896 5.6 1,0 cm-m25659172-t3152896-O4 PPM-4 1997 Bulat Ziganshin 1021485 1.0 0.32 xwrt-L3 dict + LZ77 2007 Przemyslaw Skibinski 1021821 0.64 0.17 PKZIP 2.04e-ex LZ77 1993 Phil Katz, PKWARE Inc 1022414 0.09 0.11 slug 1.27b ROLZ 2007 Christian Martelock 1022810 0.68 0.11 gzip -9 1.3.5 LZ77 2002 Jean-Loup Gailly, Mark Adler 1023037 0.68 0.10 2.32 zip -9 LZ77 2005 Jean-Loup Gailly, Mark Adler 1029644 0.34 0.11 2.32 zip LZ77 2005 Jean-Loup Gailly, Mark Adler 1045532 0.31 0.14 Thor e4 LZP 2007 Oscar Garcia 1053371 4.2 4.2 fpaq3 O3 2006 Nania Francesco Antonio 1093944 0.12 0.09 etincelle ROLZ 2010 Yann Collet 1093958 3.6 5.1 lzgt3a LZ77 2008 Gerald R. Tamayo 1112774 1.4 0.17 lzuf LZ77 2009 Gerald R. Tamayo 1,131,100 23.1 24.7 lcssr 0.2 SR 2007 Frank Schwellinger 1,137,453 30.9 0.03 0.01 LZSS franco LZSS 2008 Ilia Muraviev 1140859 2.8 0.46 ulz c6 LZ77 2010 Ilia Muraviev 1155175 8.4 9.1 arbc2z PPM-2 mai 2006, David A. Scott 1156301 0.50 0.31 0.08 LZC LZ77 2007 Nania Francesco Antonio 1162904 1.2 0.04 lzop 1.01 LZ77 2003 Markus FXJ Oberhumer 1237612 0.08 0.08 zhuff LZ77 2009 Yann Collet 1243434 6.7 0.09 LZW LZW 2008 Ilia Muraviev 1267256 0.20 0.20 srank-C8-O3 SR 1997 Peter Fenwick 1319521 1.6 0.22 comprima LZW 1985 Spencer Thomas et. Al. 1331783 1,0 1,0 fastari O2 2006 Nania Francesco Antonio 1337403 0.14 0.03 1.30 quicklz LZ77 2007 Lasse Mikkel Reinhold 1340008 0.16 0.09 brieflz LZ77 2004 Joergen Ibsen 1401857 0.12 0.06 lzrw3-o LZ77 1991 Ross Williams 1,430,672 1.0 0.05 LZSS e LZSS 2008 Ilia Muraviev 1537191 1.8 0.02 BPE 5000 4096 200 3 BPE 1994 Filip Gage 1545939 0.08 0.06 lzrw3 LZ77 1991 Ross Williams 1601207 0.65 0.64 fpaq0f2 o0 (IND) 2008 Matt Mahoney 1615386 0.06 0.04 lzrw2 LZ77 1991 Ross Williams 1669277 0.08 0.04 vioi 1.0.1 LZ77 2011 Google 1670642 0.50 0.06 lzrw5 LZW 1991 Ross Williams 1685882 0.36 0.40 fpaq0p o0 (adapta) 2007 Ilia Muraviev 1691924 0.21 0.14 flzp LZP 2008 Matt Mahoney 1725464 0.06 0.06 lzrw1-o LZ77 1991 Ross Williams 1737702 0.04 0.03 lzrw1 LZ77 1991 Ross Williams 1884160 0.03 0.03 lznt1/NTFS LZSS 1995 Microsoft Corp 1903024 0.70 0.76 fpaq0 o0 (Stat). 2004 Matt Mahoney 1938635 0.06 0.03 lzp2 LZP 2009 Yann Collet 3152896 necomprimate
Algoritmi (atunci cand se cunoaste) sunt dupa cum urmeaza. Ele sunt descrise in detaliu mai mult in tot restul acestei carti. "CM" inseamna "context de amestec". "Dict" inseamna codificare dictionar. "PPM-N" inseamna, pentru N-PPM. "ON" inseamna, pentru N-modelare context (stationare, adaptive, sau indirecte). "STN" inseamna o sortare ordine-n (Schindler) transforma. "NN" inseamna retele neuronale. "SR" inseamna simbolul clasament.
Benchmark-mare de compresie Text Unicode consta dintr-un singur fisier codat XML care contine o bena de text din Wikipedia 03 martie 2006, trunchiate la 10 9 octeti (fisier enwik9). Scopul sau declarat este acela de a incuraja cercetarea in inteligenta artificiala, in special, procesarea limbajului natural. Ca de februarie 2010, 128 programe diferite (889, inclusiv diferite versiuni si optiuni) au fost evaluate pentru marimea comprimat (inclusiv decompresia sursa programului sau executabil, precum si orice alte fisiere necesare pentru ca o arhiva zip), viteza, si utilizarea memoriei. Referinta este deschis, ceea ce inseamna ca oricine poate depune rezultatele.
Programele sunt clasificate in functie de marime comprimat cu optiuni selectarea de compresie maxima, daca este cazul. Cel mai bun rezultat obtinut este 127,784,888 bytes de Dmitry Shkarin pentru o versiune personalizata a durilca folosind 13 GB de memorie. A fost nevoie de 1398 secunde pentru a comprima si de 1,797 secunde pentru a decomprima utilizati o dimensiune a-optimizat programul de decompresie la 3,8 GHz Quad Core Q9650 cu memorie de 16 GB in 64 de biti Windows XP Pro pe 21 iulie 2009. Datele au fost preprocessed cu un dictionar particularizat construit de referinta si codificate cu scopul de 40 ppm. durilca este o versiune modificata a ppmonstr de acelasi autor. ppmonstr este la randul sau un program de comprimare mai lenta, dar mai bine ppmd care este folosit pentru compresia maxima in arhivare mai multe, cum ar fi rar, WinZip, 7zip, si freearc.
Prin comparatie, zip -9 comprese la 322,649,703 bytes in 104 secunde si decomprima in 35 de secunde folosind 0.1 MB de memorie. Este clasat pe locul 92'nd.
De referinta arata o cale 3 compromis intre dimensiunea comprimat, viteza, si utilizarea memoriei. Cele doua grafice de mai jos arata cele frontiera Pareto, compresoare cele pentru care nu comprima alte compresoare atat mai mici si mai rapid (sau mai mici si de a folosi mai putina memorie). Graficele sunt din august 2008, dar datele actuale arata o tendinta similara. In special, nici un algoritm unic (indicat in paranteza) este "cel mai bun".

Pareto frontiera: comprimat dimensiunea vs timp de compresie de la 18 august 2008 (pentru optiunile de compresie maxima).

Pareto frontiera: comprimat dimensiunea vs memorie de 18 august 2008 (pentru optiunile de compresie maxima).
Retineti ca teste de viteza pot fi rulate pe masini diferite, si ca numai optiunile pentru compresie maxime pentru fiecare program sunt utilizate. Cu toate acestea, tendinta generala ramane valabila. Compresoare individuale au adesea optiuni care permit utilizatorului sa faca acelasi fel in afara comertului 3.
Motivul pentru care nivelul de referinta este atat de dependent de memorie este ca exista meciuri string care acopera intreaga lungime a fisierului 1 GB. Acest lucru poate fi vazut in analiza de mai jos. Zona albastra la partea din dreapta sus a imaginii reprezinta meciurile intre inceputul si sfarsitul fisierului. Trupa intuneric in mijlocul imaginii se datoreaza mai multor mii de articole Wikipedia pentru orase din SUA, care au fost aparent generate de datele recensamantului de un program. Aceasta regiune este foarte compresibil.

Repeta analiza String enwik9 cu FV.
Premiul Hutter se bazeaza pe primele 10 8 octeti (fisier enwik8) de referinta mari de compresie Text cu norme similare si goluri. Este un concurs in care premiul in bani (500 de euro pentru 1% din castig) se acorda pentru imbunatatiri de 3% sau mai mult de depunere anterior, sub rezerva unor limite de timp si de memorie de pe computerul de test. Cel mai bun rezultat este de 15,949,688 bytes pentru o arhiva si un decompresser inscrise de Alexandru Ratushnyak la 23 mai 2009. Este nevoie de 7608 secunde si 936 MB de memorie pentru a decomprima pe un miez de 2 GHz dublu T3200 sub 32 de biti Windows Vista. Depunerea se bazeaza pe doua open source, programe de context si de amestecare paq8hp12 lpaq9m cu un dictionar particularizat pentru preprocesare.
Prin comparatie, zip -9 comprese aceleasi date de 36,445,373 bytes si decompreseaza in 3,5 secunde folosind 0.1 MB de memorie.
De referinta de compresie maxim are doua parti: un set de 10 imagini publice in valoare totala de 53 MB, si o colectie particulara de 510 fisiere in valoare totala de 301 MB. In set de date publice (de compresie a fisierelor SFC sau singur), fiecare fisier este comprimat separat si dimensiunile adaugat. Programele sunt clasificate in functie de marime numai, cu optiuni setate pentru cele mai bune de compresie, individual pentru fiecare fisier. Setul este format din urmatoarele 10 imagini:
842468 a10.jpg - un nivel ridicat de calitate 1152 x 864 JPEG imagine de baza a unui avion de lupta.
3870784 acrord32.exe - codul executabil x86 - Acrobat Reader 5.0.
4067439 english.dic - o lista ordonata alfabetic de 354,941 cuvinte in limba engleza.
4526946 FlashMX.pdf - fisier PDF cu incorporate si imagini JPEG BMP fermoar.
20617071 fp.log - server de web log, de text ASCII.
Mso97.dll 3782416 - codul executabil x86 de la Microsoft Office.
4168192 ohs.doc - document Word incorporat cu imagini JPEG.
4149414 rafale.bmp - 1356 imagini color 1020 x 16 bit la 24 de biti format RGB.
4121418 vcfiu.hlp - OCX fisier de ajutor - de date binare cu text incorporat.
World95.txt 2988578 - text ASCII - 1995 CIA World Factbook.
Programul clasat pe primul loc ca din 31 decembrie 2009, cu o suprafata totala de 8,813,124 bytes este paq8px, un algoritm context amestec cu modele specializate pentru imagini JPEG, BMP imagini, codul x86, text, si date structurate binare. WinRK 3.1.2, un alt algoritm de amestecare context, este clasat pe primul loc pe 4 fisiere (txt, exe, dll, pdf). WinRK utilizeaza un dictionar, care nu este inclusa in volumul total. 208 de programe sunt clasificate. zip 2.2 este clasat pe locul 163, cu o dimensiune de 14948761.
In valoarea de referinta a doua sau a MFC (de compresie a fisierelor multiple), programele sunt clasificate in functie de marimea, viteza de compresie, viteza de decompresie, si printr-o formula care combina dimensiunea si viteza cu timp redus logaritmic. Datele nu este disponibil pentru descarcare. Fisierele sunt comprimate impreuna intr-o arhiva unica. Daca un compresor nu poate crea arhive, atunci fisierele sunt colectate intr-o arhiva necompresat (TAR sau QFC), care este comprimat.
La testul de MFC, paq8px este clasat pe primul loc de marime. freearc este clasat pe primul loc de scor combinat, urmata de nanozip, WinRAR si 7zip. Toate sunt arhivare care detecteaza tip de fisier si de a aplica algoritmi diferite in functie de tipul.
Benchmark-Compresie generice are ca scop de a evalua algoritmilor de compresie, in contextul de predictie universal sau de informatii, astfel cum sunt definite de boala Legg si Hutter (2006). Prin aceasta definitie, surse de date se presupune ca au o distributie Solomonoff universal, si anume generate de programe aleatorii, cu o preferinta pentru programele mai mici sau mai simple. Dovezi pentru o astfel de distributie este de succesul punerii in aplicare a Occam Razor la masina de invatare si de a stiintei in general: cele mai simple teorii care se potrivesc cu datele observate tind sa fie cei mai buni predictori. Scopul testului este de a gasi algoritmi de compresie buna, care nu sunt reglate la tipuri de fisiere specifice.
De referinta nu publica toate datele de testare. Mai degraba, aceasta publica un program pentru a genera datele dintr-o samanta secret sau un numar intern hardware generate aleator. Datele consta in iesirile sir de biti de un milion de masini Turing aleatoare, trunchiat la 256 de biti si ambalate in siruri de caractere null byte incheiata. Marimea medie de iesire este de aproximativ 6,5 MB. Testul permite verificarea publice in acelasi timp eliminand necesitatea de a masura dimensiunea decompresser, deoarece nu este posibil pentru a ascunde datele de incercare, in decompresser fara sa stie ca semintele criptografic de numere aleatorii. Testul produce rezultate repetabile, cu aproximativ 0,05% precizie. Programele sunt ordonate in functie de raportul de iesire comprimat la iesire comprimat de un compresor de referinta (ppmonstr) pentru a imbunatati repetabilitate.
Din pacate, nivelul de referinta nu a elimina complet problema de compresoare de tuning pentru repere publice. Programul clasat pe primul loc, este o configuratie stationara context model de amestecare puse in aplicare in zpaq cu ajutorul unui preprocesor de J. Ondrus care imparte fiecare sir intr-un prefix incompresibil si o instructiune repeta pic. Scorul sau este de 0.8750, comparativ cu 1.3124 pentru zip -9. In general, cu toate acestea, ordinea rangul de compresoare este similara cu cea a altor valori de referinta.
Calificative de compresie de Sami Runsas locul programelor de 5.4 GB de diverse tipuri de date din surse publice utilizand un scor care combina dimensiunea si viteza. Datele includ textul in limba engleza, cod executabil Windows, RGB si tonuri de gri, CD-uri audio de calitate, si un amestec de tipuri de date de la jocurile video doua. Nici una dintre datele de incercare este comprimat (JPEG, PDF, ZIP, etc). Fiecare din cele 10 seturi de date contine mai multe fisiere. Compresoare singur fisier sunt testate pe un echivalent de gudron fisier. Programe trebuie sa treaca o runda de calificare, cu rata de compresie cerintele minime de timp, pe un mic set de date. De referinta include un calculator care permite utilizatorului sa rang compresoare pe baza unor ponderi diferite pentru importanta de dimensiune, viteza viteza de compresie, si decompresie. Scara implicita este "raportul 1 / 10 sau de doua ori mai mica timpul total este in valoare de jumatate de rating". Aceasta este efectiv aceeasi formula de utilizat de catre maximumcompression.com. Dimensiuni comprimat includ dimensiunea programului (nu comprimat) Windows executabil decompresie. Ca de februarie 2010, 427 combinatii de calificare versiuni de program si optiuni au fost testate. Programele de top clasate pentru setarile implicite au fost urmate de nanozip freearc, CCM, flashzip, si 7-Zip. Runsas este, de asemenea, autorul nanozip.
Unele valori de referinta sunt mentionate alte scurt.
Strangeti Grafic de Stephen Busch, pe locul programelor de 6.4 GB de date cea mai mare parte privat, de diferite tipuri de marimea numai. In partea de sus este clasat ca paq8px_v67 de 28 decembrie 2009.
Monstru de compresie de Nania Francesco Antonio, pe locul programe de dimensiune pe 1061420156 bytes cea mai mare parte a datelor publice de diferite tipuri, cu o limita de timp 40 minute. Exista teste separate pentru compresoare singur fisier si arhivare. Ca de 20 decembrie 2009 arhivatorul clasat pe primul loc este nanozip 0.7a si compresorul de top fisierul este clasat pe locul ccmx 1.30c. Ambele context utilizarea de amestecare.
UCLC de Johan de Bock contine mai multe repere de date publice, pentru compresoare, cu o interfata linie de comanda (care este cel mai dintre ele). Ca din februarie 2009, paq8i sau paq8p a fost clasat pe primul loc de marime pe cele mai multe dintre ele.
Metacompressor este un reper automatizat care permite dezvoltatorilor sa prezinte programe si fisiere de testare si compara dimensiunea (excl. decompresser), timp de compresie, decompresie si timp.
Xtreme de compresie de compresie compara 60 / optiune combinatii cu produsul lor proprii pentru dimensiunea si viteza pe un fisier de 80 GB de baze de date sintetice. Excluzand produsele lor, care functioneaza doar pe tabele de baze de date, zp C2 (zpaq mid.cfg) este clasat pe primul loc, dar, de asemenea, de dimensiunea cea mai mica ca a octombrie 2010.
Meyer si Bolosky (2011), intr-un studiu de deduplicare practice, a examinat continutul de 857 sisteme de fisiere pentru Windows in randul angajatilor Microsoft dintr-o gama larga de departamente in 2009. Tabelul de mai jos prezinta distributia aproximative de tip, comparativ cu studii similare de ei in 2000 si 2004.
2000 2004 2009 tip de date
---- ---- ---- ---------
2% 3% 13% nr prelungire filename
13% 10% 10%. Dll (biblioteca x86 dinamic)
3% 4% 8%. Lib (x86 biblioteca statica)
5% 7%. VHD (hard drive virtual)
7% 5% 7% PPB (baza de date).
Exe 6% 4% 4%. (Executabil x86)
4% 3%. PCH (precompilate C / C + + fisierul header)
3% 4% 2% cabina (comprimat arhiva).
4% 1%. Wma (comprimat audio)
1%. Iso (CD / DVD imagine)
5% 4%. Pst (cererea de baze de date de e-mail)
4% 3%. Mp3 (comprimat audio)
3%. CHM (comprimat HTML fisierul de ajutor)
49% 55% 43% Altele
Sistemul de fisiere media in 2009 a avut o capacitate de 150 GB si a fost de 40% plin. Fractiunea din uz a ramas aproape constant din 2000, in ciuda o dublare a capacitatii de disc aproape in fiecare an. Dimensiunea fisierului medie, de asemenea, a ramas aproape constant la aproximativ 4 KO din 1981. Cu toate acestea, numarul de dosare a crescut, la o medie de 225,000 fisiere in directoare 36000 in 2009. Dimensiunea fisierului variaza foarte mult. Aproximativ jumatate din toate datele se afla in fisiere mai mari de 25 MB. Al patrulea este in fisiere mai mici de 3 MB si un sfert in fisiere mai mari de 1 GB. Aceste procentuale sunt de asemenea in crestere ca numar de fisiere mai mari cresc. Repriza a tuturor datelor a fost in fisiere mai mari de 1 MB, in 2000 si 5 MB in 2004.
Un sistem de fisiere media in 2009 ar putea fi comprimat cu 22% pana in deduplicarea intregul dosar, adica redus la 78% din dimensiunea originala prin inlocuirea copii identice ale fisierelor cu link-uri. De compresie ar putea fi crescut la 28% prin impartirea fisierelor in 64 blocuri KO si inlaturarea blocuri duplicat, sau la 31% folosind 8 blocuri KB. Daca restrictiile aliniere sunt indepartate, apoi creste de compresie la 39% prin inlocuirea segmente duplicat mai mare de 64 KO, sau 42% folosind 8 segmente KB. In cazul in care link-uri li se permite sa indicati spre duplicate pe alte sisteme de fisiere, apoi de compresie imbunatateste cu aproximativ 3 puncte procentuale pentru fiecare dublare a numarului de calculatoare. Pentru sistemele de intregul dosar 857, deduplicare intregul dosar comprimat cu 50%, si 8K segmente de 69%.
Cererea evidenta a deduplicarea este backup-uri incrementale. Este necesar doar de rezerva a datelor care au fost modificate. In studiu din 2009, valoarea medie a timpului de la ultima modificare a fost de 100 de zile. 7% din fisiere de dosare au fost modificate in termen de o saptamana, si 75% in termen de un an.
Un cod este o atribuire de siruri de biti la simboluri, astfel incat siruri de caractere poate fi decodat fara echivoc pentru a recupera datele initiale. Codul optima pentru un simbol cu probabilitatea p va avea o lungime de log 2 biti 1 / P. Mai multe algoritmi de codificare eficient sunt cunoscute.
Huffman (1952) a dezvoltat un algoritm care calculeaza o cesiune optima de peste un alfabet de simboluri n in O (n log n) timp. dezumfle (zip) si bzip2 utilizarea codurilor Huffman. Cu toate acestea, codurile Huffman sunt ineficiente in practica, deoarece lungimi de cod trebuie sa fie rotunjita la un numar intreg de biti. Daca o probabilitate simbol nu este o putere de 1 / 2, apoi codul de atribuire este mai mic decat optim. Aceasta ineficienta de codificare poate fi redus prin atribuirea de probabilitati pentru grupuri mai mare de simboluri, ci numai la costul de o crestere exponentiala in marime alfabet, si, astfel, in timp rula.
Algoritmul este dupa cum urmeaza. Ni se da un alfabet si o probabilitate pentru fiecare simbol. Am construi un arbore binar de catre incepand cu fiecare simbol din structura proprie si care uneste cele doua arbori, care au cea mai mica probabilitatile doua pana cand vom dispune de un copac. Apoi, numarul de biti in fiecare cod Huffman este adancimea de simbol care, in copac, iar codul sau este o descriere de calea de la radacina (0 = stanga, 1 = dreapta). De exemplu, sa presupunem ca ne este data alfabetul {0,1,2,3,4,5,6,7,8,9} cu fiecare simbol avand in probabilitate de 0,1. Vom incepe cu fiecare simbol intr-un copac de un nod:
.1.1.1.1.1.1.1.1.1.1
0 1 2 3 4 5 6 7 8 9
Deoarece fiecare arbore mic are aceeasi probabilitate, putem alege oricare doua si sa le combine:
0.2
/ \
.1.1.1.1.1.1.1.1.1.1
0 1 2 3 4 5 6 7 8 9
, Continuarea
.2.2.2.2
/ \ / \ / \ / \
.1.1.1.1.1.1.1.1.1.1
0 1 2 3 4 5 6 7 8 9
In acest moment, 8 si 9 au cele mai mici probabilitati doua asa am sa aleaga acele:
.2.2.2.2.2
/ \ / \ / \ / \ / \
.1.1.1.1.1.1.1.1.1.1
0 1 2 3 4 5 6 7 8 9
Acum, toate de copaci au probabilitatea de 0.2 asa ca am alege orice pereche de ele:
0.4
/ \
/ \
/ \
.2.2.2.2.2
/ \ / \ / \ / \ / \
.1.1.1.1.1.1.1.1.1.1
0 1 2 3 4 5 6 7 8 9
Am ales pentru oricare doua dintre cele trei copaci ramase cu probabilitatea 0.2:
.4.4
/ \ / \
/ \ / \
/ \ / \
.2.2.2.2.2
/ \ / \ / \ / \ / \
.1.1.1.1.1.1.1.1.1.1
0 1 2 3 4 5 6 7 8 9
Acum doi sunt mai mici probabilitatile 0.2 si una dintre 0.4:
0.6
/ \
/ \
/ \
0.4 0.4 \
/ \ / \ \
/ \ / \ \
/ \ / \ \
.2.2.2.2.2
/ \ / \ / \ / \ / \
.1.1.1.1.1.1.1.1.1.1
0 1 2 3 4 5 6 7 8 9
Acum, cele mai mici doua sunt 0.4 si 0.6. Dupa acest pas, copac este terminat. Putem eticheta de la 0 ramurile de stanga si 1 pentru dreapta, cu toate ca alegerea este arbitrara.
1.0
/ \
/ \
/ \
/ \
0 / \ 1
/ \
/ \
/ 0.6
/ / \
/ 0 / \ 1
/ / \
0.4 0.4 \
/ \ / \ \
0 / \ 1 0 / \ 1 \
/ \ / \ \
.2.2.2.2.2
/ \ / \ / \ / \ / \
.1.1.1.1.1.1.1.1.1.1
0 1 2 3 4 5 6 7 8 9
Din acest copac vom construi codul:
Simbolul codului
------ ----
0 000
1 001
2 010
3 011
4 1000
5 1001
6 1010
7 1011
8 110
9 111
Un cod poate fi static sau dinamic. Un cod de statica este calculat de compresor si transmise decompresser ca parte a datelor comprimate. Un cod de dinamic este calculat de compresor si actualizate periodic, dar nu transmit. In schimb, decompresser reconstituie folosind codul exact acelasi algoritm utilizand datele pentru a estima anterior decodat probabilitatilor. Nici metoda comprima mai bine, deoarece orice spatiu salvate prin faptul ca nu transmite modelul este varsat de catre avand mai putine date cu care sa se estima probabilitati.
Codurile Huffman sunt de obicei statice, in principal pentru viteza. Compresor trebuie doar pentru a calcula codul de data, utilizand metoda intreaga pentru a calcula probabilitatile. Pentru a transmite un tabel de Huffman, este necesar doar pentru a trimite dimensiunea fiecarui simbol, de exemplu: (3,3,3,3,4,4,4,4,3,3). Atat compresor si decompresser-ar aloca apoi codurile de catre incepand cu cel mai scurt simboluri, numarand de la 0, si adaugarea unui bit 0 ori de cate ori devine codul de mai mult timp. Acest lucru ar duce la codul de mai jos diferite, dar la fel de eficiente:
Simbolul Dimensiune Cod
---- ---- ------
0 3 000
1 3 001
2 3 010
3 3 011
8 3 100
9 3 101
4 4 1100
5 4 1101
6 4 1110
7 4 1111
Pentru compresie a fisierelor, date codificate Huffman inca trebuie sa fie ambalate in octeti. JPEG pachete de biti in MSB (bitul cel mai semnificativ) la LSB (bitul cel mai putin semnificativ) ordine. De exemplu, codurile 00001 00111 ar fi ambalate ca 00001001 11....... Dezumfle formatul utilizat in zip, gzip, si PNG fisiere de biti, in cutii LSB MSB la comanda, ca in cazul in care fiecare octet este scris inapoi, si anume 10010000...... 11.
Un alt complicatie este faptul ca ultimul octet trebuie sa fie capitonate in asa fel ca nu este interpretat ca un cod Huffman. JPEG face acest lucru prin faptul ca nu orice simbol atribuirea unui cod de toate 1 biti, iar apoi padding ultimul octet cu 1 biti. Dezumfle se ocupa de acest lucru prin rezervarea unui cod care sa indice sfarsitul termenului de date. Acest lucru spune decodor nu pentru a decoda biti ramase din ultimul octet.
Huffman de codificare a drawback ca lungimile cod trebuie sa fie un numar intreg de biti. Aceasta constrange in mod eficient model probabilitatile care sunt competentele de 1 / 2. Pedeapsa cu dimensiune pentru erori de modelare este aproximativ proportionala cu patratul de eroare. De exemplu, un rezultat de 10% de eroare intr-o dimensiune penalizare de 1%. Pedeapsa poate fi de mare pentru codurile de mici. De exemplu, singura modalitate posibila de a Huffman un cod binar alfabet este codul fiecare bit ca ea insasi (sau opusul sau), care rezulta in niciun compresie.
Aritmetica de codificare (Rissanen, 1976), de asemenea, numit codare gama, nu sufera de aceasta dificultate. Fie P un model, ceea ce inseamna ca pentru orice x sir, P (x) este probabilitatea de acel sir de caractere. Fie P (<x) este suma probabilitatilor de toate sirurile lexicografic mai mic de x. Fie P (? x) = P (<x) + P (x). Apoi, codul de aritmetica pentru un sir x este cel mai scurt numarul binar y, astfel incat P (<x)? y <P (? x). Un astfel de numar poate fi intotdeauna a constatat ca nu este mai mult de 1 pic mai mult decat limita Shannon log 2 1 / P (x).
Un cod de aritmetice pot fi calculate eficient prin exprimarea P (x), ca un produs de predictii simbol succesive de regula lant, P (x) =? i P (x i | x 1 x 2... x i-1) in cazul in care x I inseamna simbolul i'th (biti sau caracter) in x. Apoi codul de aritmetica poate fi calculata prin actualizarea unei game [scazuta, mare) (initial [0, 1)) pentru fiecare simbol prin impartirea in intervalul proportional cu distributia de probabilitate pentru acel simbol. Apoi portiunea din gama corespunzatoare simbolului care urmeaza sa fie codificate este folosit pentru a actualiza gama. In ceea ce micsoreaza gama, bitii de conducere de meci de mica si mare si poate fi de iesire imediat deoarece y codul trebuie sa fie intre ele. Decompresser este capabil sa decodeze y facand un sir identic de predictii si reduceri gama.
Compresoare cele mai moderne utilizarea datelor de codificare aritmetica. Compresoare timpurie utilizate de codare Huffman de codificare pentru ca aritmetica a fost brevetat si pentru ca punerea sa in aplicare necesare operatiunilor de multiplicare, care a fost lent pe procesoare mai mari. Nici unul dintre aceste aspecte sunt relevante, deoarece astazi brevete importante au expirat si procesoare mai noi au multiplica rapid instructiuni (mai repede decat de acces de memorie).
Cele mai frecvente aritmetica Codul de programatori un octet la un moment dat (PPM) sau un bit la un moment dat (CM, DMC). Cod sursa gratuit ce fara restrictii de acordare a licentelor pentru un encoder bytewise pot fi gasite in codul sursa pentru ppmd (D. Shkarin). Programatori la nivel de bit sub licenta GPL, pot fi gasite in codul sursa pentru PAQ compresoare bazate inclusiv FPAQ, LPAQ, si ZPAQ, BWT BBB compresor, clasament si simbolul SR2 compresor. Cea mai simpla dintre acestea este de ordinul 0 programator fpaq0.
Urmatoarele este programator de la aritmetica zpaq 1.10 Se codifica biti y (0 sau 1) cu probabilitatea p = P (1) * 65536 (un numar de 16 biti la scara) si codurile care FILE * afara. Gama encoder este reprezentata de doua numere intregi pe 32 de biti (unsigned int) scazute si inalte, care sunt initial 1 si, respectiv, 0xFFFFFFFF. Dupa ce gama este divizat, un 1 este codat in partea inferioara a intervalului si un 0 in partea de sus. Dupa intervalul este impartita si a redus, este normalizat prin transferarea orice octeti de conducere care sunt identice intre mica si mare. Biti scazut mutat in sunt toate 0 biti pentru low si toate 1 biti pentru mare.
/ / Initial de stat
unsigned int scazut = 1, mare = 0xFFFFFFFF;
/ / Codare y biti cu probabilitate p/65536
inline void Encoder:: codifica (int y, int p) {
assert (out); / / iesire fisier
assert (p> = 0 & & p <65536);
assert (y == 0 | | y == 1);
assert (mare> mic & & low> 0);
unsigned int = mijlocul low + ((high-low)>> 16) * p +(((( high-low) si 0xffff) * p)>> 16); / / divizat interval
assert (mare> jumatatea & & mijlocul> = scazut);
daca (y) = mare la mijlocul; altceva low = mijlocul +1; / / pick jumatate
in timp ce ((mare ^ low) <0x1000000) {/ / scrie identice octeti de conducere
putc (mare>> 24, out); / / acelasi nivel cat mai scazut>> 24
mare = mare <<8 | 255;
mica = low <<8;
low + = (low == 0); / / deci nu codul de 4 0 bytes dintr-un rand
}
}
Gama este impartita in scris, pentru a evita 32 overflow aritmetica biti. Acesta este echivalent cu:
unsigned int = mijlocul low + ((unsigned long lungime) (high-low) * p>> 16);
in cazul in care (unsigned long lungime) este un tip pe 64 biti fara semn, care nu tot sprijinul compilatoare. Initializare a scazut la 1 in loc de 0 si declaratia
low + = (low == 0);
aruncati un pic din gama, pentru a evita scrierea 4 consecutive 0 bytes, care compresorul ZPAQ foloseste pentru a marca sfarsitul datelor codificate astfel incat sa poata fi gasite rapid, fara decodare. Aceasta nu este o cerinta, in general. Acesta nu este folosit in restul din seria PAQ. Decodorul arata astfel:
/ / Initial de stat
unsigned int scazut = 1, mare = 0xFFFFFFFF, pac;
pentru (int i = 0; i <4; + + i)
pac pac = <<8 | getc (in); / / primele 4 octeti de intrare
/ / Intoarce-bit decodificate de la dosar "in" cu probabilitate p/65536
inline Decoder int:: decodeaza (int p) {
assert (p> = 0 & & p <65536);
assert (mare> mic & & low> 0);
daca (pac <low || curr> mare) de eroare ("arhiva corupt");
assert (pac> = low & & pac <= mare);
unsigned int = mijlocul low + ((high-low)>> 16) * p +(((( high-low) si 0xffff) * p)>> 16); / / divizat interval
assert (mare> jumatatea & & mijlocul> = scazut);
int y = pac <= mijlocul;
daca (y) = mare la mijlocul; altceva low = mijlocul +1; / / pick jumatate
in timp ce ((mare ^ low) <0x1000000) {/ / transfer out identice octeti de conducere
mare = mare <<8 | 255;
mica = low <<8;
low + = (low == 0);
int c = getc (in);
daca (c == EOF) eroare ("sfarsitul de fisier neasteptat");
pac pac = <<8 | c;
}
intoarcere y;
}
Decodorul primeste ca intrare p, probabilitatea 16 biti ca urmatorul bit este 1, si intoarce bit decodat. Decodorul are o variabila suplimentara in starea sa, pe 32 de biti intreg pac, care este initializat cu primii 4 octeti de date comprimate. De fiecare data cand gama este rescaled, un alt octet este citit de la * FILE inch de inalta si joasa sunt initializate ca in encoder.
Un detaliu suplimentar este cum sa se ocupe de la sfarsitul de fisier. Cea mai mare parte a compresoarelor PAQ seria codifica dimensiunea fisierului separat si de a efectua operatiuni de codificare pe 8 byte. Dupa operatia de codificare ultima, de 4 octeti fie mica sau mare sau o anumita valoare in intre trebuie sa fie curatate periodic pentru a arhiva, deoarece decodor vor citi aceste 4 octeti inch
ZPAQ codifica 9 biti pe octet, folosind un lider 1 bit de modelat cu probabilitatea p = 0, pentru a marca sfarsitul de fisier. Efectul asupra encoder este de a stabili la mijlocul = low cauza 4 octeti care urmeaza sa fie spalat. Dupa aceea, 4 octeti zero, sunt scrise pentru a marca sfarsitul de date comprimate. In cazul in care sfarsitul de biti fisier este decodat, decodor citeste aceste 4 zero octeti in pac, rezultand in mici = 1, pac = 0, mare = 0xFFFFFFFF. Orice continuare decodare ar duce la o eroare, deoarece conditia low? pac? ridicat nu reuseste.
Afirmatii nu sunt necesare pentru a face munca codul. Mai degraba, ele sunt utile, deoarece acestea fac presupuneri programatorului explicit, sub forma de teste auto documentare timpului de functionare. Codul este compilat cu-DNDEBUG pentru a le elimina dupa depanare.
Cele mai multe compresoare de inalta sfarsitul utilizarea de codificare aritmetica. Cu toate acestea, o alta posibilitate de codare cu aceeasi teoretice si eficienta de timp pentru siruri de biti este binar de codificare asimetric sau ABC (Duda, 2007). Un programator asimetric are un singur n-biti variabile integer de stat y, spre deosebire de doua variabile (mica si mare), intr-un programator aritmetica. Acest lucru permite o punere in aplicare tabel de cautare. O y bit (0 sau 1) cu probabilitatea p = P (y = 1) (0 <p <1, un multiplu de 2-n) este codat in stare x, 2 n initial:
daca y = 0, atunci x: = ceil ((x +1) / (1-p)) - 1
daca Y = 1, atunci x: = podea (x / p)
Pentru a decoda, dat x si p
y = ceil ((x +1) * p) - ceil (x * p) (0 in cazul in care fract (x * p) <1-p, altfel 1)
daca y = 0, atunci x: x = - ceil (x * p)
daca Y = 1, atunci x: = ceil (x * p)
x este mentinut in intervalul 2 n 2 n 1 - 1 prin scrierea de biti scazut x y inainte de codare si de lectura in biti scazute de x dupa decodificare. Deoarece de compresie si de decompresie sunt operatiuni reverse unul de altul, acestea trebuie sa fie efectuate in ordine inversa prin stocarea previziuni si biti codificate intr-o stiva in nici compresor sau decompresser.
Programator este pus in aplicare in ordinea-0 compresoare fpaqa, fpaqb, si fpaqc si contextul de amestecare compresor lpaq1a din seria PAQ. fpaqa utilizeaza tabele de cautare pentru atat de compresie si decompresie. Acesta foloseste un pic de stat 10 si probabilitatea este cuantificat la 7 biti pe o scara de neliniare (mai fine aproape de 0 si 1 si grosier de aproape de 1 / 2). Dimensiunea stivei este 500.000. Cresterea aceste numere ar duce la o compresie mai buna in detrimentul unui tabel mai mare. fpaqb utilizeaza calcule directe, cu exceptia pentru divizare, care utilizeaza un tabel de inverse, deoarece diviziune este lenta. fpaqc este fpaqb cu unele optimizari de viteza. Programator pentru fpaqc este utilizat in lpaq1a (un program de context de amestec), in locul programatorul aritmetica in lpaq1.
Desi o punere in aplicare tabel de cautare ar putea fi mai rapid pe unele procesoare mici, este mai lent pe computerele moderne, deoarece multiplicare este mai rapida decat memoria de acces aleatoriu. Unele repere sunt afisate pentru enwik8 (100 MB text) pe un procesor de 2.0 GHz, T3200 care ruleaza pe unul din cele doua nuclee. Raportul este fractiune din dimensiunea originala. Ori de compresie si de decompresie sunt nanosecunde pe octet.
Raportului Comp Decomp Coder
---- ---- ------- -----
fpaqa 0.61310 247 238 ABC cautare de masa
fpaqb 0.61270 244 197 ABC directe de calcul
fpaqc 0.61270 246 173 ABC directe de calcul
fpaqa-DARITH 0.61280 130 112 aritmetica (fpaqa compilate cu-DARITH)
Pentru compresoare high end, timp CPU si memorie sunt dominate de modelul, astfel incat alegerea programatorul face o diferenta mica. lpaq1 este un compresor de amestecare context, un predecesor al lpaq9m, clasat pe locul al treilea pe 127 de referinta text mare ca de februarie 2010. lpaq1a este acelasi cu exceptia faptului ca programatorul aritmetica a fost inlocuit cu programatorul binar asimetrice din fpaqb. (Cronometrat la 2.188 GHz Athlon-64 3500 +).
Raportului Comp Decomp Coder
---- ---- ------- -----
lpaq1a 0.19755 3462 3423 ABC calcul direct (fpaqb)
lpaq1 0.19759 3646 3594 aritmetica
Coder aritmetica in lpaq1 si fpaqa-DARITH comprese putin mai rau decat programatorul ABC, pentru ca foloseste 12 biti de precizie pentru probabilitate, mai degraba decat 16 biti la fel ca in ZPAQ.
Codurile numerice sunt codurile Huffman pentru numere cu unele distributii de probabilitate comune, cum ar fi ar putea sa apara in codificare erori de predictie in imagini sau audio, sau executati in lungimi de date redundant. In general, astfel de coduri au descrieri mai simplu sa transmita decodor decat o masa plina Huffman cod de lungime.
Un cod unar pentru un numar intreg negativ care nu este n cele n urmat de un zero (sau zero-uri n urmat de unul). De exemplu, 5 va fi codificate ca 111110. Un cod de unar implica o probabilitatea P (n) = 1 / 2 n +1 = 1 / 2, 1 / 4, 1 / 8, 1 / 16, etc
Semnat intregi pot fi codificate de cartografiere n pozitive si negative n 2n la-2n-1 pentru a forma secventa de 0, -1, 1, -2, 2, 3, 3, etc
Un cod Golomb de n non intreg negativ cu parametrul M imparte n intr-un coeficient de q = podea (n / M) si si un rest r = n - Mq. Apoi, q este codat in unar si R in binar. Daca M este o putere a lui 2, apoi codul este numit un "cod Rice", si toate remainers sunt codificate in jurnalul de bord 2 biti M. In caz contrar, resturile mai mici sunt codificate folosind etaj (log 2 M) biti si cele mari, cu un pic mai mult. Un cod de Golomb cu M = 1 este doar un cod unar. De exemplu:
n M = 1 (unar) M = 2 (orez) M = 3 (Golomb) M = 4 (orez)
QR QR code Codul de cod QR code QR
--- ----------- ------------ ---------- ----------
0 0 0 0 0 0 00 0 0 00 0 0 000
1 1 0 10 0 1 01 0 1 010 0 1 001
2 2 0 110 1 0 100 0 2 011 0 2 010
3 3 0 1110 1 1 101 1 0 100 0 3 011
4 4 0 11110 2 0 1100 1 1 1010 1 0 1000
5 5 0 111110 2 1 1101 1 2 1011 1 1 1001
6 6 0 1111110 3 0 11100 2 0 1100 1 2 1010
Un cod de Golomb cu parametrul cresteri m in lungime o data la valorile M. Astfel, ea implica o distributie aproximativa geometrica in cazul in care n are o probabilitate proportionala cu 2-n / M.
Un "extra bit" cod foloseste un cod Huffman pentru a codifica o serie de numere, apoi codifica o valoare in acel interval (biti suplimentar), in binar. Aceasta metoda poate fi folosita pentru a aproxima orice distributie care este de aproximativ fara probleme diferite.
JPEG utilizeaza coduri suplimentare de biti pentru a codifica cosinus discrete, coeficientii de transformare. Intervalele sunt 0, 1, 2.. 3, 4.. 7, 8.. 15, in crestere cu puteri de 2. Se foloseste, de asemenea, un bit de semn pentru toate gamele, cu exceptia 0. Dezumfle formatul utilizat in zip si gzip utilizeaza coduri suplimentare de biti pentru a codifica lungimi meci si offseturi utilizand un set mai complex de zone.
Un numar in virgula mobila este un caz special de un cod extra-biti in cazul in care reali sunt aproximate de cartografiere la intregi pe o scara logaritmica, si in cazul in care fiecare interval este de aceeasi marime si la fel de probabile. O IEEE pe 64 de biti punctul numarul plutitoare (de tip dublu in C / C + +) are un bit de semn e reprezentand -1 sau 1, un mesaj e exponent 11 biti reprezentand un numar in intervalul -1022 - 1023, si un pic mantisa 52 m, reprezentand un numar din intervalul 1 / 2 la 1 (sau 0 la 1 pentru cel mai mic exponent). Aceasta codifica reala e SM2 numar. O ?-lege sau mu-lege cod, folosit pentru a codifica discurs de telefon esantionate in America de Nord si Japonia, este un numar de 8 biti in virgula mobila cu un bit de semn, exponent 3 biti si 4 biti mantisa.
Fisierele comprimate sunt stocate in arhive. O arhiva poate contine un singur fisier sau mai multe fisiere. De exemplu, gzip comprese un fisier, adauga o extensie. gz la numele si sterge originalul. Decompression restabileste original si sterge fisierul comprimat. zip stocheaza mai multe fisiere intr-o arhiva cu extensia. zip nume de fisier. Ea are comenzi pentru a lista, adauga, actualiza, sterge, si extrage fisierele din arhiva (fara a sterge fisierele de intrare). Cu toate acestea, ambele programe utilizeaza algoritmul de compresie aceeasi.
De compresie poate fi, uneori, imbunatatita prin comprimarea fisierelor similare impreuna pentru a profita de informare reciproca. Unele arhivare suport acest lucru, folosind un "solid", format, care impune ca toate fisierele sa fie comprimate sau extras, in acelasi timp. zip este optimizat pentru viteza, astfel arhivele sale nu sunt solide. PAQ este optimizat pentru compresie maxima, astfel incat arhivele sunt intotdeauna solide. Acesta este facultativa pentru unele arhivare, cum ar fi WinRAR.
Un compresor fisier, cum ar fi gzip pot fi folosite pentru a crea arhive solid prin colectarea intai fisierele de intrare impreuna folosind un arhivator non-comprimare, cum ar fi gudron si apoi comprimarea. Amestec de gudron + gzip este dat de obicei, o extensie de nume de fisier. Tar.gz sau. Tgz.
Formatul de gudron a fost proiectat initial pentru backup-urile UNIX sau Linux pe banda. Astfel, este posibil sa se faca un fisier tar a unui sistem intreg fisier si a restabili structura sa neschimbate. Adauga un antet de gudron Programul 512 octet pentru fiecare fisier care contine numele fisierului, cale, dimensiune, amprenta de timp, proprietar, grup, permisiuni, si padding cu zero octeti. Apoi, toate fisierele sunt adaugate impreuna. Fisierele sunt, de asemenea, captusit cu zero octeti la un multiplu de 512. Aceste octeti zero, sunt usor de comprimate.
Arhivare zip, WinRAR, si 7-Zip au caracteristici similare cu gudron in masura in care pastreaza amprente de timp si poate comprima si extract de copaci director. Totusi, acestea nu se proprietar de magazin, de grup, precum si informatii permisiunea pentru a fi compatibil cu Windows, care are un model de securitate diferit.
Arhivare adauga uneori o suma de control pentru a detecta erorile de transmisie. Suma de control este calculata si depozitat inainte de compresie si verificate dupa decompresie. Nevoie de o suma de control este discutabila, deoarece retele si medii de stocare, de obicei folosesc corectarea erorilor, astfel incat erorile sunt foarte putin probabil. Erori Decompression sunt mult mai susceptibile de a rezulta din bug-uri software sau manipularea deliberata. O schimbare singur bit intr-o arhiva poate produce un fisier complet diferit. Un control mareste dimensiunea arhiva usor (de obicei, de 4 octeti) si creste, de asemenea, de compresie si decompresie timp de timpul necesar pentru ao calcula. Unele functii comune de control:
Suma de control este calculat prin numararea numarului de 1 biti in fluxul de intrare. Suma de control este 1, daca totalul este ciudat sau 0 daca chiar. Toate erorile singur bit sunt detectate, dar alte erori sunt detectate doar cu probabilitate 1 / 2.
CRC-32 este un frecvent utilizat controlul redundantei ciclice. Acesta detecteaza toate erorile de spargere pana la 31 biti si toate alte erori cu probabilitatea 1 - 2 -32.
De intrare este un flux de biti care sunt transferate printr-o fereastra pe 32 de biti. De fiecare data cand un pic 1 este mutat afara, continutul fereastra noua este XORed cu 0x04C11DB7, altfel este neschimbat. Continutul final al ferestrei este suma de control de 32 de biti. O punere in aplicare rapid se va deplasa si XOR un octet la un moment dat folosind un 256 cu 32 biti tabel de cautare.
Adler-32 de control este utilizat in zlib punerea in aplicare a reduce preturile, care este folosit in zip si gzip. Este o suma de control pe 32 de biti, care detecteaza erori cu probabilitate de 1 la 65521 -2, care este aproape 1 - 2 -32.
Suma de control are doua 16-bit piese. Prima parte este de 1, plus suma tuturor octeti de intrare, modulo 65521. A doua parte este suma tuturor sumelor intermediare pe de o parte in primul rand, modulo 65521. Calculul este foarte rapid, deoarece operatiunea (lent) mod doar trebuie sa fie calculat folosind frecvent 32 de biti aritmetice. 65521 este cel mai mare numar prim mai putin de 2 16.
CRC-32 si Adler-32 detecta cele mai multe erori accidentale, dar este usor sa modifice in mod deliberat de intrare, fara a schimba suma de control. functiile hash criptografic sunt concepute pentru a rezista chiar si incercari deliberate de a gasi coliziuni (doua intrari diferite care produc aceeasi suma de control). Ele sunt proiectate astfel incat in??cele mai bune algoritmi posibile nu sunt mai bune decat ghicitul aleatoriu intrarile si iesirile de testare. Dezavantajul este ca acestea sunt mai mari (de obicei 20 - 32 bytes) si dura mai mult pentru a calcula (desi inca suficient de rapid).
ZPAQ adauga optional un pic 160 (20 byte) SHA-1 de control. Acesta detecteaza erori cu probabilitatea 1 - 2 -160, chiar daca introduse deliberat. Calcul de checksum adauga 5.6 ns pe octet pe o GHz 2 T3200, comparativ cu 5.0 ns pentru CRC-32 si 5,3 ns pentru Adler-32 folosind SlavaSoft fsum v2.51 punerea in aplicare si testarea pe un fisier 1 GB de text.
Unele arhive va cripta de date, precum si comprima cum. Eu nu recomandam sa adaugati aceasta functie daca nu stii ce faci. Este tentant pentru a proiecta propriile algoritm, poate ca combinarea cu compresie, prin modificarea starii initiale a modelului intr-un mod care depinde de o parola. O astfel de incercare va fi aproape sigur esua. Doar pentru ca arata aleatoare de iesire nu inseamna ca este securizat. Nu exista nici un test pentru dezordine.
Daca adaugati criptare, utilizeaza un algoritm bine testate, cum ar fi AES. Comprimare date inainte de criptare, pentru ca datele criptate nu poate fi comprimat. Folositi codul de criptare publicate. Nu scrie singur. Publica atunci toate codul sursa si au expertii in securitate ai uita la ea. Altii vor incerca sa rupa codul, si ar trebui sa faca munca lor, cat mai usor posibil, prin documentarea in mod clar ceea ce face codul dvs. si cum functioneaza. Doar pentru ca se utilizeaza un algoritm securizat, bine testate nu inseamna restul programului dvs. este sigur. De exemplu, daca dumneavoastra arhivatorul Sterge plaintext, aceasta ar putea fi inca recuperat de la sectoare de disc nealocat sau fisierul de swap. De asemenea, pentru datele parola sau parola derivate in memorie care ajunge la inlocuit pe disc. Sunt multe lucruri subtile care pot merge prost.
Nu cred ca un algoritm secret sau o versiune executabil numai ca este mai sigur. Un algoritm de criptare nu este considerat sigur exceptia cazului in care poate rezista la atacuri plaintext alese de catre adversarii cu deplina cunostinta de algoritm. Cei mai multi experti in securitate nu va pierde nici timpul lor incearca sa atace un sistem inchis. Ei vor folosi doar alte solutii, bine testate. Daca sistemul dvs. vreodata nu devina populare, atunci altii vor inversa inginer IT si le vor rupe.
Un model este o estimare a distributiei de probabilitate de intrari la un compresor. De obicei, acest lucru este exprimat ca o succesiune de predictii de simboluri succesive (bits, bytes, sau cuvinte) in secventa de intrare, avand in vedere contextul de intrare anterioara. Odata ce avem un model, de codificare este o problema rezolvata. Dar (dupa cum sa dovedit prin Kolmogorov) nu exista nici un algoritm pentru determinarea cel mai bun model. Aceasta este problema greu in compresie a datelor.
Un model poate fi static sau de adaptare. In cazul in statica, compresor analizele de intrare, calculeaza distributiei de probabilitate pentru simbolurile sale, si transmite aceste date pentru a decompresser urmat de simboluri codificate. Atat compresor si decompresser selectati codurile de lungimi corespunzatoare folosind algoritmi identice. Aceasta metoda este deseori folosita cu Huffman de codificare.
De obicei in cele mai bune compresoare utilizarea modelele dinamice si codificare aritmetica. Compresorul foloseste de intrare trecut pentru a estima o distributie de probabilitate (predictie) pentru simbolul urmator fara sa se uite la ea. Apoi se trece de predictie si simbol la coder aritmetica, si actualizeaza in cele din urma modelul cu simbolul doar codificate. Decompresser face o predictie identica utilizand datele pe care le are deja decodat, decodeaza simbolul, apoi actualizarile sau model, cu simbolul de iesire decodat. Modelul nu are cunostinta de faptul ca este comprimarea sau decomprima. Aceasta este tehnica vom folosi in restul acestui capitol.
Cel mai simplu model este un model de ordine fixa. Un model de ordinul n intrari bytes ultimele n sau simboluri (context), intr-un tabel si iesiri cu o distributie de probabilitate pentru simbolul urmator. In faza de actualizare, simbolul prezis este revelat si tabelul este actualizat pentru a creste probabilitatea acesteia atunci cand acelasi context, urmatoarea apare. Un model 0 pentru utilizari nr context.
O distributie de probabilitate este de obicei calculata cu ajutorul unui contor pentru fiecare simbol din alfabet. In cazul in care simbolurile sunt bytes, atunci dimensiunea a alfabetului este de 256. De predictie pentru fiecare simbol este luate in considerare pentru acest simbol impartit la numarul total. Procedura de actualizare este de a incrementa luate in considerare pentru acest simbol. In cazul in care codurile aritmetica programator un octet la un moment dat, apoi treci gama de numarul si total la programator aritmetica. Pentru compresie, tu treci, de asemenea, byte care urmeaza sa fie codificate. Pentru decompresie, returneaza octet decodat. Procedura arata astfel:
const int CONTEXT_SIZE = 1 <<(n * 8); / / pentru ordinul n
int count [CONTEXT_SIZE] [256] = __undefined__; / / simbolul conteaza
int context = 0; / / octeti ultimele n ambalate impreuna
/ / Actualizati modelul cu octet c
void update (int c) {
+ + Conta [context] [c];
context = (context <<8 | c) CONTEXT_SIZE%;
}
/ / Comprima byte c
void comprima (int c) {
codifica (count [context], c); / / anticipa si codifica
actualizare (c);
}
/ / Decomprima de un octet si il returneaza
int decompresie () {
int c = decoda (count [context]);
actualizare (c);
intoarcere c;
}
Functiile codifica () si decoda () se presupune a fi procedurile de codare si decodare pentru un programator aritmetica bytewise. Fiecare dintre ei ia o matrice de 256 conteaza si impart intervalul actual de variatie in functie de aceste conteaza. codifica () actualizari, apoi intervalul de subrange c'th. decodifica () citeste datele comprimate si considera ca acest lucru este delimitata de intervalul c'th si intoarce C. Actualizare () procedura de magazine bytes ultimele n in biti scazute a contextului.
Exista cateva probleme cu aceasta metoda. In primul rand, ce se intampla daca un octet are o probabilitate de zero? Un encoder ideala ar da un cod de dimensiune infinit. In practica, encoderul ar esua. Un fixa este de a initializa toate elementele de conta la 1. Uneori se imbunatateste comprimare in cazul in care numarul initial au fost mai mici, sa zicem 0,5 sau 0.1. Acest lucru ar putea fi realizat in mod eficient prin cresterea increment, sa zicem, 2 sau 10.
O a doua problema este ca un numar poate in cele din urma preaplin. O solutie este ca, atunci cand un numar devine prea mare, pentru a rescala toate conteaza prin impartirea la 2. Stabilirea unei limite superioare mic (de obicei 30 pana la cateva sute de) poate imbunatati de compresie a datelor cu blocuri de tip mixt (cum ar fi text cu imagini incorporate), deoarece statisticile reflecta de intrare recente. Acesta este un model de adaptare sau nestationare. De date cu statisticile uniforme cum ar fi text pure sunt comprimate mai bine cu modele de stationare, in cazul in care conteaza sunt lasate sa creasca mari. In acest caz, probabilitatile depind in mod egal pe toate de intrare trecut.
O a treia problema este ca tabelul a numarului de creste exponential cu ordinea context. Unele de memorie pot fi salvate prin schimbarea numarului de [] cu numarul unsigned char si limitarea la 255. O alta este de a inlocui context, cu un hash, de exemplu:
const int k = 5; / / de biti de octet pe hash
const int CONTEXT_SIZE = 1 <<(n * k); / / ordinul n
...
context = (context * (3 <<k) + c) CONTEXT_SIZE%; / / update contextul hash
Multiplicatorul poate fi orice numar impar stanga mutat de catre un alt numar de k in intervalul 1 pana la 8. Apoi biti ultimele nk din context depinde de bytes ultimele n de intrare. A k mai mare va avea ca rezultat o compresie mai buna in detrimentul de mai multa memorie.
O patra problema este ca codificare aritmetica simpla bytewise este ineficient. Decodorul trebuie sa calculeze 256 intervale de gama pentru a gasi unul care contine datele comprimate. Acest lucru ar putea fi rezolvata prin utilizarea numarului cumulativ, si anume numarului de [context] [c] este suma conteaza pentru toate valorile octet? c, dar care se misca numai ineficienta pentru actualizarea () functie, care trebuie sa crestere de pana la 256 de valori. O solutie este de a codifica un bit la un moment dat folosind encoderul bit, precum cea descrisa la punctul 3.2.
Ideea este de a codifica un bit la un moment dat, prin utilizarea biti anterioare ale octet curent ca context suplimentare. Doar doua valori sunt stocate: un numar de cele, numarul 1, si un numar total de. Predictii este numarul 1 / luate in considerare. Procedura de actualizare este de a crestere numarul si la cresterea numarului 1 in cazul in care este de 1 bit. Ne ocupam de la zero probabilitatile, preaplin, si contexte de mare ca inainte.
Alternativ, putem evita o operatiune de (lente) divizare prin stocarea de predictie in mod direct. Fiecare context la nivel de bit este asociat cu o predictie care biti urmator va fi o 1 (initial 1 / 2) si un numar de actualizare (initial 0). Rata de actualizare este initial rapid si scade ca numarul este incrementat, rezultand intr-un model stationar. Alternativ, numarul poate fi delimitate, rezultand intr-un model de adaptare.
/ / Predictie si luate in considerare pentru un context de bit
Modelul struct {
Predictii dubla; / / intre 0 si 1 bit, care urmatoare va fi de 1
int count; / / numarul de actualizari
Model (): Predictii (0,5), numarul (0) {}
};
Modelul Modelul [CONTEXT_SIZE] [256]; / / context, bit_context -> predictie, iar numarul de
int context = 0; / / pentru bytewise n contextul
/ / Comprimare c byte in MSB LSB la comanda
void comprima (int c) {
pentru (int i = 7; i> = 0 - i) {
bit_context int = c 256>> i +1;
int = bit (c>> i) 2%;
codifica (bit, modelul [context] [bit_context] Predictii.);
actualizare (bit, modelul [context] [bit_context]);
}
context = (context <<8 | c) CONTEXT_SIZE%;
}
/ / Decomprima si a reveni un octet
int decompresie () {
int c; / / bit_context
int bit;
pentru (c = 1; c <256; c = c * 2 + biti) {
biti = decoda (modelul [context] [c] Predictii.);
actualizare (bit, modelul [context] [c]);
}
c -= 256; / / decodate byte
context = (context <<8 | c) CONTEXT_SIZE%;
intoarcere c;
}
/ / Actualizati modelul
void update (int bit, Model & m) {
const double DELTA = 0,5;
const int LIMITA = 255;
daca (m.count <limita) + + m.count;
m.prediction + = (bit - m.prediction) / (m.count + DELTA);
}
Compresa () functie ia un c byte si comprese-l un bit la un moment dat, incepand cu bitul cel mai semnificativ. La fiecare din cele 8 etape, biti anterior codificate sunt ambalate intr-un numar din intervalul (1.. 255) ca un numar binar 1 urmat de pana la 7 biti mai devreme. De exemplu, daca c = 00011100, apoi bit_context are 8 valori succesive de 1, 10, 100, 1000, 10001, 100011, 1000111, 10001110. In decomprima (), c joaca acelasi rol. Dupa 8 operatiunile de decodare are valoarea 100011100 si de conducere 1 este dezbracat inainte de a fi returnate.
Ca si inainte, contextul poate fi, de asemenea, un hash.
Functia de actualizare calculeaza eroarea de predictie (bit - m.prediction) si ajusteaza de predictie in proportie inversa cu numarul. Numarul este incrementat de pana la o valoare maxima. In acest moment, modelul trece de la stationare la adaptare.
DELTA si pentru a limita parametrii sunt reglabile. Cele mai bune valori depind de date. O LIMITA mare functioneaza cel mai bine pentru stationare de date. O LIMITA mai mici functioneaza mai bine pentru tipurile de date mixte. Pe surse stationare, dimensiunea comprimat este de obicei mai mare de 1/LIMIT. Alegerea DELTA este mai putin critica, deoarece are doar un efect mare atunci cand dimensiunea de date este mic (relativ la dimensiunea modelului). Cu DELTA = 1, o serie de zero biti ar duce la secventa de predictie 1 / 2, 1 / 4, 1 / 6, 1 / 8, 1 / 10. Cu DELTA = 0,5, secventa ar fi 1 / 2, 1 / 6, 1 / 10, 1 / 14, 1 / 18. Cleary si Teahan (1995) masurat probabilitatile reale in textul in limba engleza si a constatat un sir langa 1 / 2, 1 / 30, 1 / 60, 1 / 90... pentru zero-uri si 1 / 2, 19/20, 39/40, 59/60... pentru cele consecutive. Acest lucru ar potrivi DELTA in jurul valorii de 0.07 - 0.1.
Un implementarea reala ar folosi aritmetica integer, pentru a reprezenta numere fixe punct, si de a folosi un tabel de cautare pentru a calcula 1 / (m.count + DELTA), in actualizare () pentru a evita o operatiune de divizare lent. ZPAQ cutii o predictie 22 biti si 10 biti conta intr-un element de 32 biti modelului. Ca o optimizare suplimentara, modelul este stocata ca o matrice un dimensional aliniate pe o linie de granita 64 byte cache. Contextul bytewise este actualizat o data pe octet ca de obicei, dar biti suplimentare sunt extinse in grupuri de 4 intr-un mod care provoaca numai doua cache dor de pe octet. Biti de conducere sunt extinse la 9 biti dupa cum se arata mai jos, apoi exclusive-ORed cu adresa contextul bytewise.
0 0000 0001
0 0000 001x
Zero 0,000 01xx
0 0000 1xxx
1 xxxx 0001
1 xxxx 001x
1 xxxx 01xx
1 xxxx 1xxx
ZPAQ stabileste DELTA la 1 / 2, dar limita este configurabil la 4, 8, 12,..., 1020. Tabelul de mai jos prezinta efectul de diferite LIMITA pentru un model de ordinul 0 la 10 6 cifre ale? (stationare) si ordinele de la 0 la 2 pe 14 fisierul Calgary corpus concatenate intr-un flux de date unic (nestationare). Folosind un model de ordin superior pot imbunatati comprimare la costul de memorie. Cu toate acestea, tabele directa de cautare nu sunt practice pentru comenzi mai mari decat de aproximativ 2. Ordinul 2 model pe ZPAQ foloseste 134 MB de memorie. Ordin superior nu au nici un efect asupra?, deoarece cifrele sunt independente (scurta de fapt de calcul?).
pi Calgary corpus ordin limita-0-0 pentru comanda pentru a-1-2 ----- ------- --------- --------- ---- ----- 4 455.976 1.859.853 1.408.402 1.153.855 8 435.664 1.756.081 1.334.979 1.105.621 16 425.490 1.704.809 1.306.838 1.089.660 32 420.425 1.683.890 1.304.204 1.091.029 64 417.882 1.680.784 1.315.988 1.101.612 128 416.619 1.686.478 1.335.080 1.115.717 256 415.990 1.696.658 1.357.396 1.129.790 512 415.693 1.710.035 1.379.823 1.141.800 1020 415.566 1.726.280 1.399.988 1.150.737
Un model context indirect raspunde la intrebarea cum se harta o secventa de biti intr-o predictie pentru bit urmatoare. Sa presupunem ca vi se ofera o secventa ca 0000000001 si a cerut sa prezica ceea ce biti este urmatoarea. Daca presupunem ca sursa este stationara, atunci raspunsul este de 0,1 pentru ca 1 din 10 biti este un 1. Daca presupunem o sursa de nestationare atunci raspunsul este mai mare deoarece le dam prioritate la istoria mai noi. Cum vom decide?
Un model indirect invata raspunsul prin observarea ceea ce sa intamplat dupa ce a aparut secvente similare. Modelul utilizeaza doua tabele. Primul tabel harti-un context intr-o istorie pic, un stat care reprezinta o secventa din trecut de biti. Hartile doilea tabel istorie la o prezicere, la fel ca un model de context directa.
Modelele indirecte au fost introduse in paq6 in perioada 2004 O istorie biti poate fi scrisa sub forma (n 0, n 1, LB), ceea ce inseamna ca s-au n 0 zerouri, cele n 1, si ca ultimul bit a fost LB (0 sau 1). De exemplu, secventa de 00101 ar duce la stat (3, 2, 1). Starea initiala este (0, 0, -) insemnand ca nu exista nici ultimul bit.
In paq6 si derivati ai acestuia (inclusiv ZPAQ), o istorie de biti sunt stocate ca 1 octet, care limiteaza numarul de stari la 256. Diagrama de stare de mai jos prezinta statele permis in ZPAQ cu n 0 pe axa orizontala si N 1 pe axa verticala. Doua puncte (:) reprezinta doua state pentru LB = 0 si LB = 1. Un punct unic reprezinta un singur stat in cazul in care LB poate lua numai o singura valoare, deoarece statul se poate ajunge fie cu un. 0 sau 1, dar nu atat (LB este cel mai mare dintre cele doua conteaza). In general, o actualizare cu un 0 se muta la dreapta si de o actualizare cu un 1 se misca in sus. Starea initiala este marcata cu un 0 in coltul din stanga jos. Diagrama este simetrica despre diagonala. Exista un numar total de 219 state.
n 1 48. 47. 46. : 23. 22. 21. 20.. 19.. 18.. 17.. 16.. 15... 14... 13... 12... 11... 10... 9... 8.... 7::.: 6.:::: 5.::::: 4.:::::: 3.:::::::. 2.:::::::........ 1.:::::::........................................ 0 0.................... 012345678 15 20 48 N 0
Exista unele exceptii de la regula de actualizare. Deoarece nu este posibil sa se mearga pe la sfarsitul schemei, regula generala este sa se mute inapoi la cel mai apropiat de stat permise in directia de coltul din stanga jos (pastrarea raportului de la 0 la n n 1). Nu exista o alta regula care sa faca din modelul oarecum nestationare, si ca este atunci cand unul dintre conteaza este mare, iar celalalt este incrementat, apoi numarul de mai mare este redusa. Norme specifice de la standard, ZPAQ este faptul ca in cazul in care numarul mai mare este de 6 sau 7 este decrementat, iar daca aceasta este mai mare decat 7, atunci este redus la 7. Aceasta regula se aplica in primul rand, inainte de a se muta inapoi de la o stare de invalid. De exemplu, o secventa de 10 zerouri, 43 si cele rezultate zero in:
De intrare de stat articolul
-------- ------ ----------
0000000000 (10,0,0) Normal caz, va deplasa spre dreapta
1 (7,1,1) Reducere poza conta
1 (6,2,1) Reducere
1 (5,3,1) Reducere
1 (5,4,1) Normal caz, va deplasa in sus
1 (5,5,1) Normal caz, va deplasa in sus
1 (4,5,1) Mutare in sus de pe diagrama, apoi inapoi
1 (4,6,1) Normal caz, va deplasa in sus
1 (3,5,1) Mutare in sus de pe diagrama, apoi inapoi
111 (3,8,1) Normal, va deplasa in sus
1 (2,6,1) Mutare de pe diagrama si inapoi
111111111 (2,15,1) Normal
1 (1,8,1) Mutare de pe diagrama si inapoi
111.. (20x) (1,48,1) Normal, va deplasa in sus
1 (1,48,1), nu pot merge mai departe
0 (2,7,0) Reducere
O istorie biti este mapata la o predictie ca un model de context directe, cu exceptia faptului ca nu exista nici o conta, iar rata de invatare este stabilita la 1 / 4096 al erorii. Predictie initiala pentru fiecare bit este istorie (n 1 + 0.5) / (n 0 + n 1 + 1).
Detaliile de proiectare au fost determinate experimental pentru a imbunatati compresie usor peste seria PAQ8, care foloseste un design similar.
Un model indirecta este mai eficienta de memorie, deoarece foloseste doar de un octet pentru fiecare contextul in loc de patru. In toate din seria PAQ, este pus in aplicare ca un tabel de hash. In seria PAQ8, tabelul hash este conceput pentru a permite cautari cu cel mult 3 cache dor de pe octet. In ZPAQ, exista 2 cache rateaza pe octet, similar cu modelul directe. Hartile ZPAQ tabel hash-un context pe o granita bitul 4 la o serie de istorii biti 15 si un control 8-biti. Istoriile reprezinta toate cele 15 contextele posibile care au loc dupa ce pana la 3 biti mai mult. Pasii sunt dupa cum urmeaza:
Intr-un model de context directa, nu ne verifica pentru coliziuni hash, deoarece acestea au doar un efect foarte mic asupra compresie. Efectul este mai mare pentru modelele indirecte, astfel vom folosi o confirmare a 8 biti. Nu exista inca despre o sansa de 1,2% care o coliziune nu va fi detectat, dar efectul asupra compresie este foarte mic.
Urmatoarele dimensiuni au fost obtinute pentru? si corpus Calgary, cu scopul de la 0 la 5 modele si o dimensiune tabel hash de 2 28 (268 MB). Pentru comparatie, cele mai bune rezultate pentru modele de context directe sunt prezentate. Direct modele 3, 4, si 5 hashes context, utilizarea cu limita stabilita la 32 si utilizarea memoriei aceeasi.
Modelul directe indirecte --------------- ------- -------- pi ordinul 0 415566 426343 pentru Calgary 0 1680784 1716974 pentru Calgary 1 1304204 1289769 Calgary comanda 2 1089660 1048050 pentru Calgary 3 1017354 964942 pentru Calgary 4 1069981 1010329 Calgary pentru 5 1232997 1148677
Modelul indirecte directe
------- -------- ---------------
pi de ordinul 0 415566 426343
Calgary de ordinul 0 1680784 1716974
Calgary ordinul 1 1304204 1289769
Calgary ordinul 2 1089660 1048050
Calgary ordinul 3 1017354 964942
Calgary ordinul 4 1069981 1010329
Calgary, pentru 5 1232997 1148677
Exista numeroase alte posibilitati. De exemplu, M1, un compresor context de amestecare, de Christopher Mattern, foloseste 2 modele dimensionale context, trebuie sa luati 2 istoriile cuantificat fel de intrare.
Modele fixe, pentru a comprima mai bine folosind contexte mai mult pana la un punct (pentru 3 pentru corpus Calgary). Dincolo de faptul ca, de compresie se inrautateste, deoarece multe contexte de ordin superior sunt vazute pentru prima data si nici o predictie pot fi facute. O solutie este de a colecta statisticile pentru comenzi diferite in acelasi timp, si apoi sa utilizati cea mai lunga contextul potrivite pentru care noi stiu ceva. DMC face acest lucru pentru predictii nivel de bit, si PPM pentru predictii nivel de octet.
DMC (dinamic Markov codificare) a fost descrisa intr-o lucrare de Gordon Cormack si Horspool Nigel in 1987 si puse in aplicare in C, compresor cu carlig de Nania Francesco Antonio a fost scris in 2007 si se bazeaza pe aceasta punere in aplicare. DMC a fost implementd separat in ocamyd de Frank Schwellinger in 2006.
DMC utilizeaza un tabel de contexte variabile la nivel de biti lungime ca harta la o pereche de numarul n 0 si n 1. It predicts the next bit with probability n 1 /(n 0 +n 1 ) and updates by incrementing the corresponding count. This implements a stationary, direct context model. There are other possibilities, of course.
Contexts are not stored or looked up in the table. Rather, each table entry has a pair of pointers to the next entries corresponding to appending a 0 or 1 bit to the current context. The next context might also drop bits off the back. Normally it is arranged so that each context starts on a byte boundary. In DMC and HOOK, the initial table contains 64K entries consisting of all contexts of length 8 to 15 bits that start on a byte boundary, ie bytewise order 1.
In addition to updating counts, we may also add new table entries corresponding to the input just observed. We do this by "cloning" the next state with one that does not drop any bits off the back. Consider the case below where the context is 1111 and we update the model with a 0 bit. Without cloning, we would increment ny, transition to state 110, and the next prediction would be n1/(n0+n1).
n0 ----> 1100 n0*(1-w) ----> 1100
ny / / /
1111 -----> 110 1111 110 /
(y=0) \ | \ /
n1 ----> 1101 | n1*(1-w) ----> 1101
| / /
| n0*w / /
| ny / /
+----> 11110 /
\ /
n1*w --
Before cloning After cloning 110 to 11110
The cloning procedure is to allocate a new state (labeled 11100) and copy its two output pointers from 110. The new state will also "inherit" the same prediction by copying the two counts. However we proportionally reduce the two original and two new counts in proportion to the contribution from the original input (1111) and from other states. The weight w = ny/(n0+n1) is the fraction of counts in state 110 that came from 1111 after a 0 bit. The newly cloned state gets that fraction of the counts and the original gets the rest. This scaling maintains the condition that the input and output counts are equal. The counts are implemened as fixed or floating point numbers to allow fractional values. The transition from 1111 to 110 is changed to point to the new state 11110.
Before cloning (which uses memory), there should be sufficient justification to do so. The two conditions are that the cloned state appears often enough (ny exceeds a threshold) and that there are sufficient other transitions to the original state that would remain after cloning (n0+n1-ny exceeds a threshold). In the original DMC, both thresholds are 2. Normally when the state table is full, the model is discarded and re-initialized. Setting higher thresholds can delay this from happening.
hook v1.4 also has an LZP preprocessor to encode long, repeated strings with the match length from the last matching context prior to DMC encoding. It compresses calgary.tar to 851,043 bytes in 1.9 seconds with 70 MB memory on a 2.0 GHz T3200. It compresses the files individually to a total of 840,970 bytes.
PPM (prediction by partial match) uses a byte-wise context model. Each context up to some maximum order is mapped to an array of counts for the next byte that occurred in this context. To predict the next byte, we find the longest context seen at least once before and allocate probabilities in proportion to those counts. The main complication is that some of the counts might be zero but we must never assign a zero probability to any byte value because if it did occur, it would have an infinite code length. This is handled by assigning an "escape" probability that the next byte will not be any of the ones that have nonzero counts, and divide those according to the frequency distribution of the next lower context. This is repeated all the way down to order 0. If any of the 256 byte values have still not been assigned a probability, then the remaining space is divided equally (an order -1 model).
Example: Suppose the input is BANANABOAT and we are at the point of coding the second B.
Estimating the escape probability can be complex. Suppose you draw 10 marbles from an urn. There are 8 blue, 1 green, and 1 white. What is the probability that the next marble will be a different color not seen before?
Method X was shown to be optimal under certain assumptions, including that the source is stationary. Of course, that is not always the case. Suppose you receive the sequence "BBBBBBBBWG" and are asked to predict whether the next character will be novel. The answer might be different for the sequence "WGBBBBBBBB".
These methods are not used in modern PPM compressors. Rather, better compression can be obtained by learning the escape probability directly. The method is known as secondary escape estimation (SEE). Charles Bloom (1998) used SEE in PPMZ. Dmitry Shkarin (2002) refined this method in ppmd. ppmd variant I, released as source code in 2002, is used in several archivers such as the open source 7zip and freearc, and the commercial WinZIP, Stuffit, and WinRAR. (WinRAR uses an older variant H). A newer variant ppmd J in 2006 gets slightly better compression than I. The code is the basis of slower but better compressing programs such as ppmonstr and durilca. A variant of durilca using 13 GB memory is top ranked on the large text benchmark.
ppmd uses a complex SEE model. It considers 3 cases:
In the binary context, a 13 bit context to a direct context model is constructed:
In the nm-context, the program fits the frequency distribution to a geometric approximation such that the n'th most frequent value is proportional to r n. Then r is the context.
In the m-context, the SEE context is constructed from:
ppmonstr uses an even more complex SEE context, and additionally uses interpolation to smooth some of the quantized contexts. It also adjusts the prediction for the most probable byte using secondary symbol estimation (SSE). This is a direct context model taking as input the quantized prediction and a (very complex) context and outputting a new prediction.
Both programs use other techniques to improve compression. They use partial update exclusion. When a character is counted in some context, it is counted with a weight of 1/2 in the next lower order context. Also, when computing symbol probabilities, it performs a weighted averaging with the predictions of the lower order context, with the weight of the lower order context inversely proportional to the number of different higher order contexts of which it is a suffix.
Statistics are stored in a tree which grows during modeling. When the memory limit is reached, the tree is discarded and rebuilt from scratch. Optionally, the the statistics associated with each context are scaled down and those with zero counts are pruned until the tree size becomes smaller than some threshold (25%). This improves compression but takes longer.
Although bytewise arithmetic encoding can be inefficient, ppmd is in practice faster than many equivalent bitwise context mixing models. First, a byte is encoded as a sequence of escape symbols followed by a non-escape. Each of these encodings is from a smaller alphabet. Second, the alphabet within each context can be ordered so that the most likely symbols are first. This reduces the number of operations in the majority of cases.
Shown below are compressed sizes of the Calgary corpus as a tar file and separate files. Compression and decompression times are the same. Option -o16 means use maximum order 16. -m256 says use 256 MB memory. -r1 says to prune the context tree rather than discard it.
Compressor Options calgary.tar 14 files Time
----------- -------------- --------- -------- -------
ppmd J -o16 -m256 -r1 754,243 740,737 2 sec
ppmonstr J -o16 -m256 -r1 674,704 668,459 8 sec
durilca 0.5 -o128 -m256 672,752 666,216 10 sec
Some additional programs using PPM are as follows:
CTW (Context Tree Weighting) is an algorithm developed by Frans Willems, Yuri Shtarkov and Tjalling Tjalkens in 1995 and implemented by Erik Franken and Marcel Peeters in 2002. In the US and Europe it is protected by patents EP0913035B1, EP0855803, US6150966, US5986591, US5864308 owned by Koninklijke KPN NV, Netherlands.
CTW, like DMC, predicts one bit at a time using bit count statistics for contexts that begin on byte boundaries. Like PPM and unlike DMC, it combines the predictions of different order contexts from order 0 up to some maximum order D. Each context is associated with two bit counters, n 0 and n 1, and a mixing weight, w. Each context order i from 0 through D computes an estimator e i that the next bit will be a 1 depending on the two counts. The highest order context predicts p D = e D. All of the lower order contexts predict p i which is a weighted average of e i and the next higher order p i+1. The final prediction is either p 0 or p 1.
On update with bit y (0 or 1), the count n y is incremented and w is updated to favor the more accurate prediction in each context, either toward e i or p i+1.
Specifically, CTW uses the Krichevski-Trofimov (KT) estimator:
e i (0) = (n 0 + 1/2)/(n 0 + n 1 + 1)
e i (1) = (n 1 + 1/2)/(n 0 + n 1 + 1) = 1 - e i (0).
except for the special case where one of the counts is 0, which we consider later. Then the prediction, expressed as a ratio, is:
p D (0) = e D (0)
p D (1) = e D (1)
p i (1)/p i (0) = (w i e i (1) + p i+1 (1))/(w i e i (0) + p i+1 (0)), i < D.
Given bit y, the weight and one count for each context is updated:
w i := w i e i (y)/p i+1 (y), i < D
n y := n y + 1.
CTW uses a zero redundancy estimator for the special case where one of the two counts is zero. The idea is to predict more strongly the other bit. Consider the case of n 0 = 0, ie only ones have been observed. Then define k(n) as the cumulative probability of n consecutive bits by the KT estimator, and define the ZR estimator e() as:
k(0) = 1/2
k(i) = k(i-1)(i+1/2)/(i+1), i > 0
e(1) = (n 1 + 1/2)/(n 0 + n 1 + 1), n 1 > 0, n 0 > 0 (KT estimator)
e(1) = k(n 1 )/(2(n 1 + 1)k(n 1 ) + n 1 + 1), n 1 > 0, n 0 = 0 (ZR estimator)
All estimators are symmetric with respect to 0 and 1. The probability of a 1 given the sequence 00000 is the same as the probability of 0 given 11111. The ZR estimator gives a smaller probability for these cases than the KT estimator. The following table gives the probability that a 1 will follow n 0 zeros and no ones with each estimator.
n 0 KT ZR ---- ------ ------ 0 0.500000 0.500000 1 0.250000 0.107143 2 0.166667 0.064103 3 0.125000 0.044192 4 0.100000 0.032984 5 0.083333 0.025908 6 0.071429 0.021089 7 0.062500 0.017625 8 0.055556 0.015032 9 0.050000 0.013029 10 0.045455 0.011441 100 0.004950 0.000499 1000 0.000500 0.000017
For large runs of identical bits, the total code length for KT grows logarithmically but the ZR estimator is bounded to 2 bits. However, ZR adds 1 bit of redundancy when the other bit value eventually occurs. For large n 0, the ZR estimated probability of a 1 approaches 0 at the rate n 0 -3/2. This is faster than n 0 -1 for the KT estimator.
The CTW implementation uses a context tree to store the two counts and w. Each count is 8 bits. If one count exceeds 255, then both counts are divided by 2. It stores log(w) rather than w and does all arithmetic in the log domain so that multiplication and division are implemented using addition and subtraction. The KT and ZR estimators are looked up in a table indexed by the two counts.
A context tree is a 256-ary tree where each node at depth N+1 from the root represents an order-N context. The parent of each node is the suffix obtained by removing one byte. The children of the root node represent the 255 possible partial byte order 0 contexts of length 0 to 7 bits. To save space, children of contexts that have only occurred once are not stored. Each tree node uses 8 bytes, including a 24 bit pointer to the context in a separate file buffer.
CTW is best suited for stationary sources. The published CTW implementation compresses the Calgary corpus to 767,594 bytes as 14 separate files in 62 seconds with options -n16M -b16M -d12 set for maximum compression. With the same options, it compresses calgary.tar to 965,855 bytes. -d12 sets the maximum context order to 12. -b16M sets the file buffer to 16 MB. -n16M limits the tree to 16 million nodes. When the tree is full, no new contexts are added but the counts and weights continue to be updated. These settings require 144 MB memory, the maximum that the published implemention can use.
Context mixing algorithms based on the PAQ series are top ranked on many benchmarks by size, but are very slow. These algorithms predict one bit at a time (like CTW) except that weights are associated with models rather than contexts, and the contexts need not be mixed from longest to shortest context order. Contexts can be arbitrary functions of the history, not just suffixes of different lengths. Often the result is that the combined prediction of independent models compresses better than any of the individuals that contributed to it.
DMC, PPM, and CTW are based on the premise that the longest context for which statistics is available is the best predictor. This is usually true for text but not always the case. For example, in an audio file, a predictor would be better off ignoring the low order bits of the the samples in its context because they are mostly noise. For image compression, the best predictors are the neighboring pixels in two dimensions, which do not form a contiguous context. For text, we can improve compression using some contexts that begin on word boundaries and merge upper and lower case letters. In data with fixed length records such as spreadsheets, databases or tables, the column number is a useful context, sometimes in combination with adjacent data in two dimensions. PAQ based compressors may have tens or hundreds of these different models to predict the next input bit.
A fundamental question is how do we combine predictions? Suppose you are given two predictions pa = P(y=1|A) and pb = P(y=1|B), probabilities that the next bit y will be a 1 given contexts A and B. Assume that A and B have occurred often enough for the two models to make reliable guesses, but that both contexts have never occurred together before. What is p = P(y=1|A,B)?
Probability theory does not answer the question. It is possible to create sequences where p can be anything at all for any pa and pb. For example, we could have pa=1, pb=1, p=0. But intuitively, we should do some kind of averaging or weighted averaging. For example, if we wish to estimate P(car accident | dark and raining) given P(car accident | dark) and P(car accident | raining), we would expect the effects to be additive.
Furthermore, we want to mix predictions weighed by confidence. If pa = 0.5 and pb = 0.99, then intuitively it seems that pb expresses greater confidence in its prediction, so it should be given greater weight. All PAQ versions do this. Early versions expressed predictions as pairs of counts (observed zeros and ones) and added them together. This implicitly gave greater weight to predictions near 0 or 1 because such predictions can only occur when one of the counts is large. Later versions improved on this by transforming probabilities into the logistic domain, log(p/(1-p)), followed by averaging. This allowed greater flexibility in modeling techniques.
In most PAQ based algorithms, there is also a procedure for evaluating the accuracy of models and further adjusting the weights to favor the best ones. Early versions used fixed weights.
In PAQ6 (Mahoney, 2005a), a probability is expressed as a count of zeros and ones. Probabilities are combined by weighted addition of the counts. Weights are adjusted in the direction that minimizes coding cost in weight space. Let n 0i and n 1i be the counts of 0 and 1 bits for the i'th model. The combined probabilities p 0 and p 1 that the next bit will be a 0 or 1 respectively, are computed as follows:
S0 =? +? i w i n 0i = evidence for 0
S1 =? +? i w i n 1i = evidence for 1
S = S0 + S1 = total evidence
p 0 = S0/S = probability that next bit is 0
p 1 = S1/S = probability that next bit is 1
where w i is the non-negative weight of the i'th model and? is a small positive constant needed to prevent degenerate behavior when S is near 0.
The optimal weight update can be found by taking the partial derivative of the coding cost with respect to w i. The coding cost of a 0 is -log p 1. The coding cost of a 1 is -log p 0. The result is that after coding bit y (0 or 1), the weights are updated by moving along the cost gradient in weight space:
w i := max[0, w i + (y - p i )(S n 1i - S1 n i ) / S0 S1]
Counts are discounted to favor newer data over older. A pair of counts is represented as a bit history similar to the one described in section 4.1.3, but with more aggressive discounting. When a bit is observed and the count for the opposite bit is more than 2, the excess is halved. For example if the state is (n 0, n 1 ) = (0, 10), then successive zero bits will result in the states (1, 6), (2, 4), (3, 3), (4, 2), (5, 2), (6, 2).
PAQ7 introduced logistic mixing, which is now favored because it gives better compression. It is more general, since only a probability is needed as input. This allows the use of direct context models and a more flexible arrangement of different model types. It is used in the PAQ8, LPAQ, PAQ8HP series and in ZPAQ.
Given a set of predictions p i that the next bit will be a 1, and a set of weights w i, the combined prediction is:
p = squash(? i w i stretch(p i ))
unde
stretch(p) = ln(p/(1 - p))
squash(x) = stretch -1 (x) = 1/(1 + e -x )
The probability computation is essentially a neural network evaluation taking stretched probabilities as input. Again we find the optimal weight update by taking the partial derivative of the coding cost with respect to the weights. The result is that the update for bit y (0 or 1) is simpler than back propagation (which would minimizes RMS error instead).
w i := w i +? (y - p) stretch(p i )
where? is the learning rate, typically around 0.01, and (y - p) is the prediction error. Unlike linear mixing, weights can be negative.
Compression can often be improved by using a set of weights selected by a small context, such as a bytewise order 0 context.
In PAQ and ZPAQ, squash() and stretch() are implemented using lookup tables. In PAQ, both output 12 bit fixed point numbers. A stretched probability has a resolution of 2 -8 and range of -8 to 8. Squashed probabilities are multiples of 2 -12. ZPAQ represents stretched probabilities as 12 bits with a resolution of 2 -6 and range -32 to 32. Squashed probabilities are 15 bits as an odd multiple of 2 -16. This representation was found to give slightly better compression than PAQ.
ZPAQ allows different components (models and mixers) to be connected in arbitrary ways. All components output a stretched probability, which simplifies the mixer implementation. ZPAQ has 3 types of mixers:
Mixer weights in PAQ are 16 bit signed values to facilitate vectorized implementation using MMX/SSE2 parallel instructions. In ZPAQ, 16 bits was found to be inadequate for best compression. Weights were expanded to 20 bit signed values with range -8 to 8 and precision 2 -16.
SSE (secondary symbol estimation) is implemented in all PAQ versions beginning with PAQ2. Like in ppmonstr, it inputs a prediction and a context and outputs a refined prediction. The prediction is quantized typically to 32 or 64 values on a nonlinear scale with finer resolution near 0 and 1 and sometimes interpolated between the two closest values. On update, one or both values are adjusted to reduce the prediction error, typically by about 1%. A typical place for SSE is to adjust the output of a mixer using a low order (0 or 1) context. SSE components may be chained in series with contexts typically in increasing order. Or they may be in parallel with independent contexts, and the results mixed or averaged together. The table is initialized so that the output prediction is equal to the input prediction for all contexts.
SSE was introduced to PAQ in PAQ2 in 2003 with 64 quantization levels and no interpolation. Later versions used 32 levels and interpolation with updates to the two nearest values above and below. In some versions of PAQ, SSE is also known as an APM (adaptive probability map).
ZPAQ allows a SSE to be placed anywhere in the prediction sequence with any context. Recall that ZPAQ probabilities are stretched by mapping to ln(p/(1-p)) as a 12 bit fixed point number in the range -32 to +32 with resolution 1/64. The SSE input prediction is clamped and quantized to an odd multiple of 1/2 between -15.5 and 15.5. The low 6 bits serve as an interpolation weight. For example, if stretch(p) = 2.7, then the two table entries are selected by below=2.5 and above=3.5, and the interpolation weight is 0.2. Then the output prediction is SSE[context][below]*(1-w) + SSE[context][above]*w. Upon update with bit y, the table entry nearest the input prediction (SSE[context][below] in this example) is updated by reducing the prediction error (y - output) by a user specified fraction.
Exista si alte posibilitati. CCM, a context mixing compressor by Christian Martelock, uses a 2 dimensional SSE taking 2 quantized predictions as input.
ISSE (indirect secondary symbol estimation) is a technique introduced in paq9a in Dec. 2007 and is a component in ZPAQ. The idea is to use SSE as a direct prediction method rather than to refine an existing prediction. However, SSE does not work well with high order contexts because the large table size uses too much memory. More generally, a large model with lots of free parameters (each table entry is a free parameter) will overfit the training data and have no predictive power for future input. As a general rule, a model should not be larger than the input it is trained on.
ISSE does not use a 2-D table. Instead it first maps a context to a bit history as with an indirect context map. Then the 8-bit bit history is used as a context to select the pair of weights for a 2 input mixer taking the input prediction and a fixed constant as its two inputs. The weights are initialized to (1.0, 0.0) meaning that the initial output prediction is equal to the input.
PAQ9A and the default compression mode of ZPAQ both start with an order 0 model prediction and refine it using a chain of ISSE components in increasing order.
In ZPAQ, the weights are 20 bit signed, fixed point numbers with range -8 to 8 and precision 2 -16 like in a MIX. The fixed input is 4.0 and the learning rate is fixed at? = 2 -8.
A match model finds the last occurrence of a high order context and predicts whatever symbol came next. The accuracy of the prediction depends on the length of the context match. Longer matches generally give more confidence to the prediction. Typically a match model of order 6-8 is mixed with lower order context models. A match model is faster and uses less memory than a corresponding context model but does not model well for low orders.
Match models are used in PAQ and ZPAQ. They consist of a rotating history buffer and a hash table mapping contexts to pointers into the buffer. In ZPAQ, a pointer to the match is maintained until a mismatching bit is found. The model will then look for a new match at the start of the next byte. On each byte boundary, the buffer is updated with the modeled byte and the hash table is updated so that the current context hash points to the end of the buffer. ZPAQ allows both the hash table size and buffer size to be user specified (as powers of 2). For best compression, the history buffer should be as large as the input (if this is practical) and the hash table size is typically 1/4 of this. Because each pointer is 4 bytes, both data structures use the same amount of memory.
Match models in PAQ maintain multiple context hashes of different orders and multiple pointers into the buffer. The prediction is indirect by mapping the match length to a prediction through a direct context model. ZPAQ uses a simpler match model with just one pointer and one hash, although it is possible to have multiple, independent match models. The prediction for a match of L bytes (or 8L bits) is that the next bit will be the same with probability 1 - 1/8L.
The user may specify the context length by using a rolling hash that depends on the desired number of characters. If h is the context hash, c is the input byte, then the update:
h = h*((2*k+1) << m) + c;
specifies an order ceil(n/m) context hash for a hash table size of 2 n and any k. For example, the hash update "h=h*40+c;" (m = 3) is an order 6 context hash for a table size of 2 18. Only the low 18 bits of h would be used to index the hash table of this size, and these bits depend only on the last 6 values of c.
The high compression ratio (and slow speed) in PAQ comes from using many different context models for different types of data. These are described in historical order.
Schmidhuber and Heil (1996) developed an experimental neural network data compressor. It used a 3 layer network trained by back propagation to predict characters from an 80 character alphabet in text. It used separate training and prediction phases. Compressing 10 KB of text required several days of computation on an HP 700 workstation.
Mahoney (2000) made several improvements that made neural network compression practical.
These changes make the algorithm about 10 5 times faster, mainly because only a few neurons (out of millions) are active at any one time. To make the training online, it is necessary to add a pair of counters to each weight (to count 0 and 1 bits) so that the learning rate is initially large. The rate decreases in inverse proportion to the smaller of the two counts.
There are 3 versions: P5, P6, and P12. P5 uses 256 KB memory to represent 5 orders using 2 16 input neurons (each representing a context hash) and one output (the bit prediction). P6 uses 2 20 input neurons. P12 is the same size as P6 but adds a whole word model. This context hash depends only on alphabetic characters mapped to lower case, and is reset after a nonalphabetic character. It improves compression of text files.
In PAQ1 (Mahoney, 2002), it was realized that the counts alone could be used to make predictions, so the weights were eliminated. Predictions are combined by adding the 0 and 1 counts associated with each context. Each counter is 1 byte.
PAQ2 added SSE by Serge Osnach in May 2003. It uses 64 quantization levels without interpolation.
PAQ3 modified SSE to use 32 levels with interpolation in Sept. 2003.
PAQ3N by Serge Osnach in Oct. 2003 added sparse models: three order-2 models that skipped 1, 2, or 3 bytes of context between the two context bytes. This improves compression of some binary files.
PAQ4 (Nov. 2003) uses adaptive linear weighting of models as described in section 4.3.1. It also introduced a record model. It identifies structures that repeat at regular intervals, as found in spreadsheets, tables, databases, and images, and adds contexts of adjacent bytes in two dimensions.
PAQ5 (Dec. 2003) has some minor improvements over PAQ4, including word models for text, models for audio and images, an improved hash table, and modeling of run lengths within contexts. It uses two mixers with different contexts to select their weights. The two mixer outputs are averaged together. It uses about 186 MB of memory.
PAQ6 (Jan. 2004) adds models for x86 code (modeling jump/call addresses) and CCITT images and more aggressive discounting of opposing bit counts. It takes options allowing up to 1616 MB memory. It is the basis of a number of forks and dozens of versions. An early version won the Calgary challenge. Many other models and optimizations were added by Berto Destasio, Johan de Bock, David A. Scott, Fabio Buffoni, Jason Schmidt, Alexandar Ratushnyak (PAQAR), Przemyslaw Skibinski (PASQDA, text preprocessing), Rudi Cilibrasi, Pavel Holoborodko, Bill Pettis, Serge Osnach, and Jan Ondrus.
PAQAR (v1.0 to 4.0, May-July 2004) by Alexander Ratushnyak is a PAQ6 fork which is the basis of several winning submissions to the Calgary Challenge. The primary difference is a greatly increased number of mixers and SSE chains.
PAQ7 (Dec. 2005) was a complete rewrite. It uses logistic mixing rather than linear mixing, as described in section 4.3.2. It has models for color BMP, TIFF, and JPEG images. The BMP and TIFF models use adjacent pixels as context. JPEG is already compressed. The model partially undoes the compression back to the DCT (discrete cosine transform) coefficients and uses these as context to predict the Huffman codes.
PAQ8A (Jan. 2006) adds a E8E9 preprocessor to improve compression of x86 (EXE and DLL) files. The preprocessor replaces relative CALL and JMP addresses with absolute addresses, which improves compression because an address may appear multiple times. Many other compressors use this technique.
PASQDA (v1.0-v4.4, Jan. 2005 to Jan. 2006) is a fork by Przemyslaw Skibinski. It adds an external dictionary to replace words in text files with 1 to 3 byte symbols. This technique was used successfully in the Hutter Prize and in the top ranked programs in the large text benchmark. PAQ8A2, PAQ8B, PAQ8C, PAQ8D, PAQ8E, and PAQ8G (Feb. to Mar. 2006) also use this technique, as does PAQAR 4.5 by Alexander Ratushnyak.
PAQ8F (Feb. 2006) adds a more memory efficient context model and a new indirect model: The byte history within a low order context is modeled by another low order context.
PAQ8L (Mar. 2006) adds a DMC model. Its predictions are mixed with those of other models.
As of Feb. 2010, development remains active on the PAQ8 series. There have been hundreds of versions with improvements and additional models. The latest is PAQ8PX_V67. Most of the improvements have been for file types not included in the Calgary corpus such as x86, JPEG, BMP, TIFF, and WAV.
A benchmark for the Calgary corpus is given below for versions of PAQ from 2000 to Jan. 2010 showing major code changes. About 150 intermediate versions with minor improvements have been omitted. Older programs marked with * were benchmarked on slower machines such as a 750 MHz Duron and have been adjusted to show projected times on a 2.0 GHz T3200, assumed to be 5.21 times faster. Sizes marked with a D use an external English dictionary that must be present during decompression. The size shown does not include the dictionary, so it is artificially low. However, including it would give a size artificially high because the dictionary is not extracted from the test data. All versions of PAQ are archivers that compress in solid mode, exploiting similarities between files. Decompression time is about the same as compression time.
Compressor Calgary Seconds Memory Date Author Major changes
---------- ------- ------- ------ -------- ------ --------------------
P5 992,902 6.1* 256 KB 2000 Mahoney 64K x 1 neural network
P6 841,717 7.4* 16 MB 2000 Mahoney 1M neurons
P12 831,341 7.5* 16 MB 2000 Mahoney Word context model
PAQ1 716,704 13* 48 MB 2002 Mahoney Linear mixing with fixed weights
PAQ2 702,382 18* 48 MB May 2003 Osnach SSE
PAQ3 696,616 15* 48 MB Sep 2003 Mahoney Interpolated SSE
PAQ3N 684,580 30* 80 MB Oct 2003 Osnach Sparse models
PAQ4 672,134 43* 84 MB Nov 2003 Mahoney Adaptive mixer weights, record models
PAQ5 661,811 70* 186 MB Dec 2003 Mahoney Models for text, audio, images, runs, 2 mixers
PAQ6 -6 648,892 99* 202 MB Jan 2004 Mahoney Models for PIC, x86
PAQAR 4.0 -6 604,254 408* 230 MB Jul 2004 Ratushnyak Many mixers and SSE chains
PAQ7 -5 611,684 142* 525 MB Dec 2005 Mahoney Logistic mixing, image models
PAQ8A -4 610,624 152* 115 MB Jan 2006 Mahoney E8E9 preprocessor
PASQDA 4.4 -7 D 571,011 283* 470 MB Jan 2006 Skibinski PAQ7 + external dictionary
PAQAR 4.5 -5 D 570,374 299* 191 MB Feb 2006 Ratushnyak PAQAR + external dictionary
PAQ8F -6 605,650 161* 435 MB Feb 2006 Mahoney Bytewise indirect model, memory tuning
PAQ8L -6 595,586 368 435 MB Mar 2007 Mahoney DMC model
PAQ8PX_V67 -6 598,969 469 421 MB Jan 2010 Ondrus Improved JPEG, TIFF, BMP, WAV models
Since 2007, development has contined on PAQ. In addition to PAQ8PX, there are 3 additional forks, no longer under active development.
Of the hundreds of PAQ versions, no program can decompress files compressed by any other version. The goal of the proposed ZPAQ standard is to solve this problem. It specifies a format in which a description of the compression algorithm is stored in the compressed archive. The specification includes a reference decoder.
The specification does not describe the compression algorithm. However, several compression programs and models are available on the ZPAQ page. There is a ZPAQ program that takes a configuration file to describe the compression algorithm, as well as other programs like ZPIPE that use a fixed compression algorithm. All of these produce files that can be decompressed with the reference decoder or by any of these programs. The standard was published in Mar. 2009 by Matt Mahoney.
ZPAQ describes an archive format, although it may be used for single file compression or memory to memory compression. A compressed stream consists of a sequence of blocks that are independent and can be reordered. Each block starts with a header that describes the decompression algorithm. A block consists of a sequence of compressed segments that must be decompressed in sequence from the start of the block. A segment may be a file or a part of a file. Each segment has an optional file name, an optional comment (file size, timestamp, etc.), and ends with an optional 20 byte SHA-1 checksum. If the file name is omitted, then the decompresser must supply it.
An algorithm description consists of a network of components (each making a prediction), a program that computes the bytewise contexts for each component, and an optional program that post-processes the output. Both programs are written in a language called ZPAQL which is compiled or interpreted by the decompresser. If the post-processing program is present, then it is appended to the front of the first uncompressed segment and compressed along with its input data. If not, then the data is compressed directly with a one byte header to indicate its absence.
Up to 255 components are placed in an array. Each component in the model inputs a context hash and possibly the predictions of earlier components, and outputs a stretched prediction. The output of the last component is squashed and used to arithmetic code the data one bit at a time. The components are as follows:
Contexts are computed by a ZPAQL program that is called once per modeled byte with that byte as input. The program places the context hashes in an array H of 32 bit unsigned values. Each element of H is the input for one component, beginning with H[0] for the first one. Only the low bits of the hash are used, depending on the table size of each component. Because each bit must be modeled, the context hashes are combined with the previous bits of the current byte. This is done by expanding the previous bits to a 9 bit value (ICM or ISSE) or 8 bits (all others) and adding it to the bytewise context.
ZPAQL is designed for fast execution rather than ease of programming. It resembles an assembly language instruction set. A ZPAQL program runs on a virtual machine with the following state, all initialized to 0 at the start of a block:
The sizes of H and M are specified as powers of 2 in the block header. Most instructions are either 1 byte, or 2 bytes including a numeric operand. The instruction set is as follows:
The post-processor (called PCOMP), if it is present, is called once per decoded byte with that byte in the A register. At the end of each segment, it is called once more with -1 in A. The decompresser output is whatever is output by the OUT instruction.
The context model (called HCOMP) is always present. It is called once per decoded byte. It puts its result in H. OUT has no effect. HCOMP sees as input the PCOMP code (if present) followed by a contiguous stream of segments with no separator symbols.
The ZPAQ program is a development environment for writing and debugging models. It allows programs to be single stepped or run separate from compression. It accepts control statements IF/IFNOT-ELSE-ENDIF and DO-WHILE/UNTIL/FOREVER and converts them to conditional jumps. It allows passing of numeric arguments and comments in parenthesis. If a C++ compiler is present, then ZPAQL code is compiled by converting it to C++ and then running it. Otherwise the code is interpreted. Compiling makes compression and decompression 2 to 4 times faster.
The default configuration for both ZPAQ and ZPIPE is described by the file mid.cfg below.
(zpaq 1.07+ config file tuned for average compression.
Uses 111 * 2^$1 MB memory, where $1 is the first argument.)
comp 3 3 0 0 8 (hh hm ph pm n)
0 icm 5 (order 0...5 chain)
1 isse 13 0
2 isse $1+17 1
3 isse $1+18 2
4 isse $1+18 3
5 isse $1+19 4
6 match $1+22 $1+24 (order 7)
7 mix 16 0 7 24 255 (order 1)
hcomp
c++ *c=ab=ca=0 (save in rotating buffer M)
d= 1 hash *d=a (orders 1...5 for isse)
b-- d++ hash *d=a
b-- d++ hash *d=a
b-- d++ hash *d=a
b-- d++ hash *d=a
b-- d++ hash b-- hash *d=a (order 7 for match)
d++ a=*c a<<= 8 *d=a (order 1 for mix)
opri
post
0
capat
The comment about $1 means that the model can be run with additional memory to improve compression. De exemplu:
zpaq ocmid.cfg archive files...
will compress with 111 MB memory, and
zpaq ocmid.cfg,3 archive files...
will compress with 111 * 2 3 = 888 MB memory. Decompression requires the same amount. The effect of ",3" is to make substitutions like "$1+17" with 20 throughout the configuration file. Up to 9 parameters (to $9) are allowed.
The command "oc" means optimize (compile the ZPAQL into C++) and compress. If the "o" is dropped then no external C++ compiler is required, but compression and decompression takes twice as long.
The configuration file is divided into 3 sections. COMP describes the arrangement of components. HCOMP contains ZPAQL code that computes the contexts and puts them in H. POST 0 indicates that there is no post-processing.
COMP is followed by 5 arguments: hh, hm, ph, pm, n. hh and hm specify the sizes of H and M in HCOMP as powers of 2 (2 3 = 8 each). ph and pm are 0 because these arrays are not used. (Their size is actually 1). n = 8 is the number of components. They must be numbered 0 through n-1 in the COMP section. Except for the line numbers, each token compiles to one byte of ZPAQL. (Thus, ZPAQ requires "A= 10" be written exactly like this and not "A=10" or "A = 10" to indicate it is a 2 byte instruction).
Linia
0 icm 5
describes an indirect context model with a table size of 64 * 2 5 bytes. It takes its context from the low 15 bits of H[0]. The low 7 bits index a table of 16 byte arrays, and the next 8 bits are the checksum to detect collisions. Since this is an order 0 context, H[0] is left at 0 and only the bitwise context (a 9 bit value) is used. Linia
1 isse 13 0
describes an indirect SSE using 64 * 2 13 bytes taking its input from component number 0 and context from H[1]. Linia
2 isse $1+17 1
describes an indirect SSE using 64 * 2 $1+17 bytes, where $1 is the argument passed to mid.cfg. For example, if the argument is 3, it uses 64 * 2 20 bytes. $1 defaults to 0. It gets its input prediction from component 1 and its context from H[2]. Linia
6 match $1+22 $1+24
specifies a match model with a hash table size of 2 22 and history buffer of size 2 24 (taking 16 MB each if $1 is 0). Its context is H[6]. Linia
7 mix 16 0 7 24 255
specifies a mixer with 2 16 sets of weights selected by the low 16 bits of the context, taking as input predictions from the range of components 0 through 7-1, with a learning rate of 24/4096, and no masking (AND the bitwise context with 255). The context is H[7].
The HCOMP section computes the contexts and puts them in H. It puts order 0 through 5 context hashes in H[0] through H[5], an order 7 context in H[6] for the match model, and an unhashed order 1 context in bits 8..15 of H[7] for the mixer. It leaves bits 0..7 clear because the bitwise context will be added to this. This is not a concern for the other contexts because they are hashed.
HCOMP is called once after modeling each byte with the byte in the A register. All state information except A and PC (which is reset to the first instruction) is preserved between calls.
This program uses M as a rotating history buffer of 8 bytes (hm = 3) with the low 3 bits of C pointing to the last byte stored. It uses B as a working pointer to compute hashes and D as a pointer into H to store the result. The instructions
c++ *c=ab=ca=0
increment C, store the input byte in M[C], copy C to B, and clear A.
d= 1 hash *d=a
assigns 1 to D so that it points to H[1]. The hash instruction takes input from M[B] and combines it with A (0), so A now contains a hash of the last input byte. It is stored in H[D] = H[1] as an order 1 context for component 1. Subsequent instructions store order 2, 3, 4, 5, and 7 hashes in H[2] through H[6]. Note that the space in "d= 1" is required because it is a 2 byte instruction. "a=0" doesn't require this because there is a special 1 byte instruction for clearing a register.
d++ a=*c a<<= 8 *d=a
computes the mixer context by putting the input byte (saved in *C) into bits 8..15 of D[7] and leaving the other bits at 0. Execution ends at the halt instruction.
The following is the configuration max.cfg, which gets better compression but is slower.
(zpaq 1.07+ config file tuned for high compression (slow)
Uses 245 x 2^$1 MB memory, where $1 is the first argument.
comp 5 9 0 0 22 (hh hm ph pm n)
0 const 160
1 icm 5 (orders 0-6)
2 isse 13 1 (sizebits j)
3 isse $1+16 2
4 isse $1+18 3
5 isse $1+19 4
6 isse $1+19 5
7 isse $1+20 6
8 match $1+22 $1+24
9 icm $1+17 (order 0 word)
10 isse $1+19 9 (order 1 word)
11 icm 13 (sparse with gaps 1-3)
12 icm 13
13 icm 13
14 icm 14 (pic)
15 mix 16 0 15 24 255 (mix orders 1 and 0)
16 mix 8 0 16 10 255 (including last mixer)
17 mix2 0 15 16 24 0 (order -1)
18 sse 8 17 32 255 (order 0)
19 mix2 8 17 18 16 255 (order 0)
20 sse 16 19 32 255 (order 1)
21 mix2 0 19 20 16 0 (order -1)
hcomp
c++ *c=ab=ca=0 (save in rotating buffer)
d= 2 hash *d=a b-- (orders 1,2,3,4,5,7)
d++ hash *d=a b--
d++ hash *d=a b--
d++ hash *d=a b--
d++ hash *d=a b--
d++ hash b-- hash *d=a b--
d++ hash *d=a b-- (match, order 8)
d++ a=*c a&~ 32 (case insensitive words)
a> 64 if
a< 91 if (if az)
d++ hashd d-- (update order 1 word hash)
*d<>a a+=*da*= 20 *d=a (order 0 word hash)
jmp 9
endif
endif
(else not a letter)
a=*da== 0 ifnot (move word order 0 to 1)
d++ *d=a d--
endif
*d=0 (clear order 0 word hash)
(end else)
d++
d++ b=c b-- a=0 hash *d=a (sparse 2)
d++ b-- a=0 hash *d=a (sparse 3)
d++ b-- a=0 hash *d=a (sparse 4)
d++ a=b a-= 212 b=aa=0 hash
*d=a b<>a a-= 216 b<>aa=*b a&= 60 hashd (pic)
d++ a=*c a<<= 9 *d=a (mix)
d++
d++
d++ d++
d++ *d=a (sse)
opri
post
0
capat
The COMP section begins with an ISSE chain of orders 0 through 6 like mid.cfg (with one extra ISSE). "const 160" provides a bias for the mixers that follow later. It outputs a fixed prediction of (160-128)/16 = 2 (stretched). The match model is order 8. As with mid.cfg, M is used as a rotating history buffer, but with a size of 2 hm = 2 9 = 512. H is 2 hh = 32 elements. There are n = 22 components.
Components 9 and 10 are an ISSE chain of word-oriented order 0 and 1 contexts for modeling text. These form a separate chain. Generally, the best compression is obtained when each ISSE context contains the lower order context of its input. Otherwise the models should be independent and mixed later. The context is formed by mapping upper and lower case letters together and discarding all other input. The order 0 context is a hash of all of the letters since the beginning of the word or 0 if the last byte was not a letter. The order 1 hash combines this with the previous word.
Components 11 through 13 are sparse order 1 contexts with a gap of 1 to 3 bytes between the context and the current byte. These are useful for modeling binary files.
Component 14 is a model for CCITT binary fax images (PIC in the Calgary corpus). The image width is 1728 pixels or 216 bytes, mapped one bit per pixel in MSB to LSB order (0=white, 1=black). The context is the 8 bits from the previous scan line and 2 additional bits from the second to last scan line.
Components 15 and 16 are order 1 and 0 mixers taking all earlier components as inputs. The second (order 0) mixer has a slower learning rate because it has fewer free parameters. Those two mixer outputs are mixed by the context free (size 0) MIX2 at 17. Its output is refined by the order 0 SSE at 18. The input and output of the SSE are mixed at 19. That prediction is refined by the order 1 SSE at 20. Finally the input and output of that SSE are mixed by the context free MIX2 at 21.
The code for computing the order 0 and 1 word contexts in H[9..10] starts at
d++ a=*c a&~ 32 (lowercase words)
This increments D to point to H[9] (the order 0 word model), retrieves the input byte saved in M[C], and clears bit 5 (meaning a &= ~32) which converts lower case to upper case. Apoi
a> 64 if
a< 91 if (if az)
tests if A is in the range A..Z (ASCII 65..90). The "if" is converted to a conditional jump to the matching "else" or "endif". If the test passes then the two word hashes are updated. The instruction *d<>a means swap H[D] with A. The hash in D[9] is a rolling hash but in D[10] is cumulative. JMP 9 skips 9 bytes past the commented "else" clause. If the input is not a letter then H[9] is moved to H[10] and H[9] is cleared.
The following results are for the Calgary corpus as a solid archive when supported. Compression is timed in seconds on a 2 GHz T3200.
Program Size Time Memory (MB)
------- --------- ---- ------
zip -9 1,020,831 0.6 0.5
ppmd 804,992 0.6 7.5 (calgary.tar)
ppmd -m256 -o16 754,243 1.3 62 (calgary.tar)
ppmonstr 669,497 8 51 (calgary.tar)
lpaq1 6 682,512 8 195 (calgary.tar)
lpaq9m 6 686,161 8 198 (calgary.tar)
paq9a -6 676,914 12 209
zpaq ocmid.cfg 699,474 8 111
zpaq ocmax.cfg 644,433 20 246
zpaq ocmax.cfg,1 644,320 20 477
paq8l -6 595,474 368 435
paq8px_v67 -6 598,969 469 421
A transform converts data into a sequence of symbols which can be compressed with a simpler or faster model, or one using less memory, such as an order 0 context model. Those symbols still need to be modeled and coded as described in sections 4 and 3 respectively.
A transform should ideally be a bijection. For each input, there should be exactly one possible representation. More often, the transform is an injection, and its inverse a surjection. An input may have more than one valid representation, either of which is transformed back to the original data during decompression. This increases the information content of the transformed data because the arbitrary choice of representation has to be modeled and coded, taking up space.
A run length code replaces a long repeating sequence of identical symbols with two symbols representing a count and the value to be repeated. For example, the string "AAAAA" would be coded as (5,"A").
In LZ77 compression, duplicate strings in the input are replaced by pointers to previous occurrences. LZ77 is not a bijection. For example, given the string:
AB..BC..ABC
"ABC" could be coded as:
A pointer consists of an offset and a length. It is allowed for the copied string to overlap the output. For example "AAAAA" could be coded as a A,(-1,4) meaning write a literal "A" and then go back 1 and copy the next 4 bytes. Thus, LZ77 may also encode run lengths.
LZ77 decompression is extremely fast, faster than compression. The compressor must search for matching strings, typically using a hash table or tree. The decompresser only needs to maintain an output buffer from which to copy repeated strings, and then write a copy of its output to the buffer.
The name "LZ77" comes from Lempel and Ziv, who described it in a 1977 paper (Ziv and Lempel, 1977).
LZSS (Lempel-Ziv-Storer-Szymanski, 1982) uses 1 bit flags to mark whether the next symbol is a literal or a pointer. LZSS is used in NTFS file compression in Windows when the folder property is set to store files compressed. Its primary goal is to be extremely fast (faster than disk access) rather than provide good compression. The exact format was not published. Rather, it was reverse engineered (in Russian) in 1998. 16 literal/pointer flags are packed into 2 bytes. This is followed by 16 symbols which are either 1 byte literals or 2 byte pointers. The offset is variable length with a maximum of 12 bits. Any remaining bits are allocated to the length, which has a minimum value of 3. Thus, after 2K of input, each pointer is a 12 bit offset and a 4 bit length ranging from 3 to 18.
Windows indicates that a compressed folder containing the Calgary corpus occupies 1,916,928 bytes. On the large text benchmark, the 1 GB text file enwik9 compresses to 636 MB, slightly larger than an order 0 coder and about twice the size of zip. Copying enwik9 between 2 uncompressed folders takes 41 seconds on the test machine (a laptop with a 2.0 GHz T3200). Copying from a compressed folder to an uncompressed folder takes 35 seconds, ie decompression is faster than copying. Copying from an uncompressed folder to a compressed folder takes 51 seconds. This is equivalent to compressing the Calgary corpus in 0.03 seconds over the time to copy it.
The NTFS implementation of LZSS is very similar to lzrw1-a implemented by Ross Williams in 1991. lzrw1-a uses a fixed 12 bit offset and 4 bit length.
The widely popular deflate format is used in zip and gzip and is supported by many other archivers. It is used internally in PNG images, PDF documents, and Java JAR archives. The format is documented in RFC 1951 (1996) and supported by the open source zlib library.
In the deflate format, pointer offsets range from 1 to 32768 and length from 3 to 258. Literals and lengths are coded in a 286 symbol alphabet which is Huffman coded followed by up to 5 extra uncompressed bits of the length. A length code is followed by an offset from a 30 symbol Huffman coded alphabet followed by up to 13 extra uncompressed bits. Specifically the alphabets are as follows:
0..255 = literal byte
256 = end of data
257..264 = lengths 3..10
265..268 = lengths 11..18 followed by 1 extra bit
269..272 = lengths 19..34, 2 extra bits
273..276 = lengths 35..66, 3 extra bits
277..280 = lengths 67..130, 4 extra bits
281..284 = lengths 131..257, 5 extra bits
285 = length 258
Lengths are followed by an offset coded from a 30 symbol alphabet:
0..3 = offset 1..4
4..5 = offset 5..8 followed by 1 extra bit
6..7 = offset 9..16, 2 extra bits
8..9 = offset 17..32, 3 extra bits
...
28..29 = offset 16385..32768, 13 extra bits
The format allows either a default or a custom Huffman code. The default code lengths are as follows:
Literal/length
0..143 = 8 bits
144..255 = 9 bits
256..279 = 7 bits
280..287 = 8 bits
Compensa
0..29 = 5 bits
If a custom Huffman table is used, then the table is transmitted as a sequence of code lengths. That sequence is itself compressed by run length encoding using another Huffman code to encode the literals and run lengths. It uses a 19 symbol alphabet:
0..15 = code lengths of 0..15
16 = copy the previous code 3..6 times, followed by 2 extra bits
17 = copy 3..10 times, 3 extra bits
18 = copy 11..138 times, 7 extra bits
The Huffman table for these codes are sent as a sequence of up to 19 3-bit numbers. This sequence is further compressed by reordering the sequence so that the values most likely to be 0 (not used) are at the end, and sending the sequence only up to the last nonzero value. A 4 bit number indicates the sequence length. The order is: 16, 17, 18, 0, 8, 7, 9, 6, 10, 5, 11, 4, 12, 3, 13, 2, 14, 1, 15. All Huffman codes are packed in LSB to MSB order.
zip and gzip take an option -1 through -9 to select compression level at the expense of speed. All options produce compressed data in deflate format which decompresses at the same speed (much faster than compression) with the same algorithm. The difference is that with the higher options, the compressor spends more time looking for encodings that compress better. A typical implementation will keep a list of 3 byte matches (the shortest allowed) in a hash table and test the following data to find the longest match. With a higher option, the compressor will spend more time searching. It is also sometimes possible to improve compression by encoding a literal even if a match is found, if it results in a longer match starting at the next byte. Such testing also increases compression time. kzip performs an extreme level of optimizations like this. Compressed sizes and compression times on a 2.0 GHz T3200 are shown below for the 14 file Calgary corpus.
Program Size Time
------ --------- ----
zip -1 1,194,720.17 sec.
zip -2 1,151,711.23
zip -3 1,115,078.25
zip -4 1,072,909.25
zip -5 1,041,083.33
zip -6 1,028,171.40 (default)
zip -7 1,025,244.42
zip -8 1,021,607.50
zip -9 1,020,831.67
kzip 978,707 24.21
unzip.10
LZMA (Lempel Ziv Markov Algorithm) is the native compression mode of 7-zip. Although 7-zip is open source, the algorithm was never well documented or described until it was analyzed by Charles Bloom in Aug. 2010. Compression is improved over LZ77 by using a longer history buffer (up to 4 GB), optimal parsing, shorter codes for recently repeated matches (like LZX), literal exclusion after matches, and arithmetic coding.
Optimal parsing. Simple LZ77 uses greedy parsing. The next byte is always coded using the longest match found in the history buffer. 7-zip uses look-ahead to consider a shorter match or a literal if it allows a longer match to be coded afterward with a shorter total code length.
Matias and Sahinalp (1999) proved that for any dictionary encoding method with the prefix property, that there is a fast ( O (n)) algorithm that optimally parses the input into the minimum number of phrases by looking only at the lengths of the next two phrases considered. A dictionary has the prefix property if for any phrase in it, all of its prefixes are also in the dictionary. LZ77 and LZW both have this property. However, optimal parsing is more difficult to achieve in practice because phrases may have different code lengths, so that the minimum number of phrases does not necessarily give the best compression.
Repeated matches. 7-zip uses shorter codes (like LZX) to code the 3 most recent matches in place of the LZ77 offset. In the case of the most recent match, the match length can be as small as 1.
Literal exclusion after matches. 7-zip uses a "delta literal" to code the first literal after the match. The idea is that the literal to be coded is known to be different than the predicted byte following the match in the history buffer. An ordinary context model would incorrectly assign a high probability to the predicted byte, which would waste code space. To prevent this, the coded literal is exclusive-ored with the predicted byte.
De codificare. 7-zip uses a binary arithmetic coder. It first codes one of 4 possible choices: literal, match, rep (repeat of one of the 3 last matches) or short rep (repeat the last match with a length of 1). A literal code is followed by the 8 bits of the coded byte in MSB to LSB order. A match is coded as a length followed by the offset, each using a variable length code in which the length of the binary value is transmitted followed by the value. A rep is followed by a length and a 1 or 2 byte code indicating the position in a move-to-front queue of the last 3 offsets. A short rep (SRep) code is complete.
Context modeling. Literals are coded using an order 1 model plus the low bits of the current position. The position is useful for modeling binary files that have a repeating structure that is a power of 2, for example, an array of integers or floating point numbers. An order 1 model means that the context for each bit is the 8 bits of the previous byte and the 0 to 7 previously coded bits of the current byte.
Match offsets are coded by transmitting the binary length of the offset and then the offset in LSB to MSB order. The length of the offset is coded using the match length as context. The context for the 4 low bits is the 0 to 3 previously coded bits. The higher order bits are not compressed.
The match/literal flags (literal, match, rep, short rep) are coded using the low bits of the position as context (as with literals) plus the encoding state. The state is one of 12 values and is updated by the following state table:
Next state
State Lit Mat Rep SRep Meaning of current state
----- --- --- --- ---- ------------------------
0 0 7 8 9 literal after 2 or more literals
1 0 7 8 9 literal after match, literal
2 0 7 8 9 literal after rep, literal or after non-literal, short rep, literal
3 0 7 8 9 literal after literal, short rep, literal
4 1 7 8 9 (delta) literal after match
5 2 7 8 9 (delta) literal after rep or after non-literal, short rep
6 3 7 8 9 (delta) literal after literal, short rep
7 4 10 11 11 match after literal
8 5 10 11 11 rep after literal
9 6 10 11 11 short rep after literal
10 4 10 11 11 match after non-literal
11 5 10 11 11 rep or short rep after non-literal
LZX is an LZ77 variant used in Microsoft CAB files and compressed help (CHM) files. It uses a history buffer of up to 2 MB and Huffman coding. To improve compression, it uses shorter codes to code the 3 most recent matches as with LZMA.
The idea of using shorter codes for recent matches can be extended. The compressor for lzrw3 builds a dictionary (a hash table) of pointers into the history buffer as usual to find matching strings, but instead of coding the offset, it codes the index into the table. The decompresser builds an identical hash table from the data it has already decompressed, then uses the index to find the match. The length is coded as usual.
ROLZ (reduced offset LZ) extends this idea further by replacing the large hash table with many smaller hash tables selected by a low order context. This reduces the size of the offset, although it can sometimes cause the best match to be missed. ROLZ was implemented in WinRK.
The extreme case of ROLZ is to use one element per hash table. In this case, only a literal or length must be coded. This algorithm is called LZP. It was first described by Charles Bloom in 1995. LZP works best with a high order context. Thus, it is often used as a preprocessor to a low or moderate order context model, rather than a fast order 0 model like LZ77.
snappy is a free, open source (Apache) compression library from Google in 2011. It uses byte aligned LZ77 intended for high speed (especially on x86-64 hardware) rather than good compression. Google uses snappy internally to compress its data structures for its search engine.
The compressed data contains tag bytes such that the low 2 bits indicate literals and matches as follows:
00 = literal
01 = 1 byte match
10 = 2 byte match
11 = 4 byte match (not used)
A literal of length 1 to 60 is encoded by storing the length - 1 in the upper 6 bits. Longer literals are coded by storing 60..63 in the upper 6 bits to indicate that the length - 1 is encoded in the next 1 to 4 bytes in little-endian (LSB first) format. This is followed by the uncompressed literals.
Matches of length 4 to 11 with offsets of 1 to 2047 are encoded using a 1 byte match. The match length - 4 is stored in the middle 3 bits of the tag byte. The most significant 3 bits of the offset are stored in the upper 3 bits of the tag byte. The lower 8 bits of the offset are stored in the next byte. A match may overlap the area to be copied. Thus, the string "abababa" could be written using a literal "ab" and a match with an offset of 2 and length of 5. This would be encoded as:
000001 00 (literal of length 2)
01100001 (literal 'a')
01100010 (literal 'b')
000 001 01 (high bits of offset, match of length 5)
00000010 (low 8 bits of offset)
Matches of length 1 to 64 with offsets of 1 through 65535 are encoded using a 2 byte match. The length - 1 is encoded in the high 6 bits of the tag byte. The offset is stored in the next 2 bytes with the least significant byte first. Longer matches are broken into matches of length 64 or less.
A 4 byte match allows offsets up to 2 32 - 1 to be encoded as with a 2 byte match. The decompresser will decode them but the compressor does not produce them because the input is compressed in 32K blocks such that a match does not span a block boundary.
The entire sequence of matches and literals is preceded by the uncompressed length up to 2 32 - 1 written in base 128, LSB first, using 1 to 5 digits in the low 7 bits. The high bit is 1 to indicate that more digits follow.
Compression searches for matches by comparing a hash of the 4 current bytes with previous occurrences of the same hash earlier in the 32K block. The hash function interprets the 4 bytes as a 32 bit value, LSB first, multiplies by 0x1e35a7bd, and shifts out the low bits. The hash table size is the smallest power of 2 in the range 256 to 16384 that is at least as large as the input string. This is a speed optimization to reduce the time to clear the hash table for small inputs.
As an optimization for hard to compress data, after 32 failures to find a match, the compressor checks only every second location in the input for the next 32 tests, then every third for the next 32 tests, and so on. When it finds a match, it goes back to testing every location.
As another optimization for the x86-64 architecture, copies of 16 bytes or less are done using two 64-byte assignments rather than memcpy(). To support this, if 15 or fewer bytes remain after a match then they are encoded as literals with no further search.
Deduplication is the application of LZ77 to a file system rather than a data stream. The idea is to find duplicate files or files containing large blocks of data duplicated elsewhere, and replace them with pointers. Deduplication can save disk space, but is more commonly used for incremental backups where only data that has been changed is saved.
Whole file deduplication is easy to implement. If a file system enforces a "last modified" timestamp or an "archive" flag, then files that have not been modified since the last backup need not be saved. Otherwise, or for on-disk deduplication, a secure hash such as SHA-1 of each file can be computed. If two files have the same 160 bit hash, then it is safe to assume that the contents are identical. It is not impossible for different files to have the same hash, but the probability of a collision, 2 -160 is far smaller than the 2 -60 probability of a disk I/O error. Furthermore, it is believed to be computationally infeasible to deliberately produce a collision.
Compression can be improved by dividing large files into smaller blocks and searching for duplicates among them. This can be done efficiently by computing a hash of each block and saving pointers to the blocks in a hash table. To save space, it might be desirable to use a smaller but insecure hash and check for potential collisions by comparing the blocks.
Dividing large files into fixed sized blocks (typically 8 KB to 64 KB) will not find duplicates between versions in which a small amount of data was inserted or deleted near the beginning because the data afterward would has moved relative to the block boundaries. This problem can be solved by selecting boundaries that depend on a rolling hash function over a sliding window. We set a boundary whenever n bits of the hash are equal to a fixed value. The average block size will be 2 n. For example, the following rolling hash function uses a sliding window of 32 bytes and marks a boundary on average every 16 KB when the high 14 bits of the hash are set to 1.
uint32_t h = 0; // rolling hash value
int c; // input byte
const int K = 876543210; // a random 32 bit even number not a multiple of 4
while ((c = getchar())!= EOF) {
h = (h + c + 1) * K;
if (h >= 0xfffc0000)
mark_boundary();
}
Dictionary methods substitute codes for common strings from a table or dictionary. A dictionary code may be, fixed, static or dynamic. In the fixed case, the dictionary is specified as part of the algorithm. In the static case, the compressor analyzes the input, constructs a dictionary, and transmits it to the decompresser. In the dynamic case, both the compressor and decompresser construct identical dictionaries from past data using identical algorithms.
LZW (Lempel-Ziv-Welch) is a dynamic dictionary method. It is used by the UNIX compress program, GIF images, and is one of the compressed modes in TIFF images. The algorithm was patented by both Sperry (later Unisys) in 1981 and by IBM and 1983 when the USPTO did not realize that they were the same algorithm. Unisys was criticized for waiting until GIF became an established standard before demanding royalties from makers of software that could read or write GIF images. Both patents are now expired.
LZW starts with a dictionary of 256 1-byte symbols. It parses the input into the longest possible strings that match a dictionary entry, then replaces the string with its index. After each encoding, that string plus the byte that follows it is added to the dictionary. For example, if the input is ABCABCABCABC then the encoding is as follows:
65 = A (add AB to dictionary as code 256)
66 = B (add BC as 257)
67 = C (add CA as 258)
256 = AB (add ABC as 259)
258 = CA (add CAB as 260)
257 = BC (add BCA 261)
259 = ABC (end of input)
Dictionary codes grow in length as it becomes larger. When the size is 257 to 512, each code has 9 bits. When it is 513 to 1024, each code is 10 bits, and so on. When the dictionary is full (64K = 16 bits), it is discarded and re-initialized.
A Windows version of the UNIX compress program compresses the Calgary corpus to 14 files totaling 1,272,722 bytes in 0.34 seconds and decompresses in 0.23 seconds.
Other variants of LZW may use larger dictionaries, or may use other replacement strategies like LRU (least recently used), or other strategies for adding new symbols such as concatenating the last two coded symbols instead of a symbol plus the next byte.
Dictionary encoding improves the compression of text files by replacing whole words with symbols ranging from 1 to 3 bytes. Fixed English dictionaries are used in WinRK, durilca, and in some versions of PAQ such as PAsQDacc 4.3c -7, which compresses the Calgary corpus to 567,668 using a dictionary extracted from the corpus itself, but not included in the compressed size. It is, of course, possible to compress to arbitrarily small sizes using this technique. The extreme case is barf. It has a built in 14 word dictionary, one for each file of the Calgary corpus. When the compressor detects a match, it "compresses" the file to 1 byte, which the decompresser correctly expands.
For this reason, the large text benchmark and contests like the Calgary challenge and Hutter prize include the size of the decompression program and all other files needed to decompress. Still, it may be useful to use a dictionary for one or more languages if the input is expected to contain text in those languages.
Of more interest are static dictionaries. These are used by the top 3 programs on the large text benchmark (as of Feb. 2010), and in all of the Hutter prize winners. Some of the later Calgary challenge winners also use small dictionaries.
Recall from section 1.4 that text compression is an AI problem. This can be seen by playing Shannon's character guessing game which he used to estimate that the entropy of written English is about 1 bpc (Shannon, 1950). Try partially covering some text with your hand and guessing what letters come next from the earlier context, for example: "the cat caught a mo___". Humans can beat computers at the game because the prediction problem requires vast understanding of English and of the world. Nevertheless, some of the constraints of natural language can be modeled. These rules are categorized as follows:
While playing the game, you will notice that useful contexts start on word boundaries. Thus, "a mo_" and "caught a mo_" are useful contexts, but "ght a mo_" is no more useful than the lower order "a mo_". Thus, text models in PAQ construct contexts that start on word boundaries.
It should be irrelevant if a context spans a line break. Thus, word contexts in PAQ discard the characters between words. Furthermore, it should be irrelevant if the context is upper or lower case, because it does not change the meaning. Thus, there are word models that ignore case.
Semantics can be modeled by associating each pair of words like (cat, mouse) with a co-occurrence frequency over a small window. Words that frequenly occur near each other tend to have related meanings. This can be modeled with a sparse order-1 word model, skipping one or more words in between the context and the predicted word. Many PAQ versions have sparse word models with small gaps of 1 to 3 words.
For modeling semantics, it is useful to split text into "meaningful" units or morphemes if possible. For example, "moves" really has two independent parts, the stem "move" and suffix "s". Ideally these should be modeled as separate words.
Grammar constrains text to make certain sequences more likely, such as (the NOUN) or (a ADJ NOUN). It is possible to learn the parts of speech by observing when words occur in similar contexts and grouping them. For example "the dog", "the cat", and "a dog" could be used to predict the unseen sequence "a cat". This works by grouping "the" and "a" into one semantic category and "dog" and "cat" into another category.
A dictionary transform works by replacing the input text with a sequence of highly predictive symbols corresponding to morphemes, independent of capitalization and punctuation. This improves compression both by allowing simpler models to be used, and by reducing the size of the input, which improves speed and reduces pressure on memory. Compression can be improved further by grouping semantically or grammatically related words so that the compressor can merge them into single contexts by ignoring the low bits of dictionary codes. Care should be taken not to remove useful context, which can happen if a dictionary is too large or divides words in the wrong places. For example, coding "move" and "remove" as unrelated symbols would not help compression.
A capitalization model replaces upper case letters with a special symbol followed by a lower case letter. For example, "The" could be coded as "Athe" where "A" directs the decompresser to capitalize the next letter. Alternatively, a more sophisticated model might automatically capitalize the first letter after a period, and insert a special symbol to encode exceptions.
Because newlines are semantically equivalent to spaces, it is sometimes useful to replace then with spaces, and encode in a separate stream the information to put them back. A simple transform is space stuffing, where a space is inserted after every newline. For example, this has the effect of merging the order 4 contexts " the" and "\nthe" (where \n is a linefeed) by replacing the latter with "\n the". Space stuffing does not help with multi-word contexts that span lines. However the alternative is to remove context that could predict newlines, such as periods at the end of a paragraph.
Word encoding is done in two passes. In the first pass, the text is parsed into words (sequences of AZ, az, and possibly UTF-8 characters in non-English alphabets) and counted. Words with counts below a threshold are discarded. In the second pass, words found in the dictionary are replaced with 1 or 2 byte codes (or 3 bytes for large dictionaries). The dictionary is listed at the beginning of the output, followed by the encoded data. Words not found in the dictionary and non-letters are passed unchanged.
Words are typically encoded with bytes from the part of the ASCII set that does not normally appear in text, namely 0..8, 11..12, 14..31, and 127..255. If capitalization modeling was done, then 65..90 (AZ) may also be used. If such bytes do appear, they must be preceded by an escape byte, designated as one of the above. The remainder of the alphabet may be used to encode words.
XWRT (XML Word Reducing Transform) by Przemyslaw Skibinski in Oct. 2007 performs dictionary encoding. The dictionary is appended as a header to the output with one word per line in decreasing order of frequency. The most frequent words are encoded with one byte.
The following table shows the effect of simple capitalization modeling (using "A" followed by lowercase), space stuffing, and word encoding using XWRT on book1 from the Calgary corpus on 4 compressors. The -f option selects the minimum word frequency for inclusion in the dictionary. The number in parenthesis shows the dictionary size that results. (The exact options are -o -l0 -c -s -n -w -m256 for xwrt 3.2 to turn off other transform options. There is no space or capitalization modeling).
book1 zip -9 sr2 bzip2 -9 ppmonstr
------- ------- ------- ------- -------
No transform 768,771 312,502 276,364 232,598 203,247
Capitalization 785,101 311,696 275,124 231,594 202,650
Space stuffing 785,393 313,640 275,161 229,988 202,274
Both 801,723 312,856 273,864 229,861 201,706
xwrt -f3 (4307) 366,323 265,721 246,897 233,928 211,382
xwrt -f6 (2806) 378,289 267,522 246,760 231,675 208,801
xwrt -f20 (789) 449,233 278,639 254,227 230,243 204,897
xwrt -f100 (174) 542,670 290,268 262,575 228,904 202,832
The table shows that space stuffing and capitalization usually help, but that word encoding becomes less effective as the compression improves. It is nevertheless useful for compressing the large text benchmark where memory constraints are severe because it reduces the size of the input. The top 3 programs use it. Capitalization modeling is also useful, but space stuffing is not because line breaks are only used to separate paragraphs.
XWRT is ranked sixth (as of Feb. 2010) on the large text benchmark when used with its built in LPAQ6 compressor. The optimal setting in this case is -f200 to select a dictionary size of about 40,000 words. It does slightly better (ranking fifth) as a preprocessor to ppmonstr with option -f64.
paq8hp12 and drt|lpaq9m, both by Alexander Ratushnyak, are ranked second and third on the large text benchmark and are the basis of winning entries for the Hutter prize. These both use a custom dictionary of about 44,000 words. The higher frequency words are grouped semantically, such as "son" with "daughter" and "monday" with "tuesday". Among the lower frequency words, they are grouped by common suffix (alphabetical order when reversed) to make the dictionary compress smaller.
durilca_kingsize is the top ranked program on the large text benchmark, but only because it uses 13 GB of memory, vs. 2 GB. It uses a dictionary of about 124,000 words. These are also grouped semantically, but by an automated process that clustered words in context space. The algorithm was not documented, but the idea is roughly to group words together if they are likely to appear in the same context.
Symbol ranking, or move-to-front (MTF), is a transform in which the alphabet is maintained in a queue from newest to oldest and coded by its position. The idea is that the most recently seen symbol is the most likely to occur in the future.
srank is a symbol ranking compressor by Peter Fenwick in 1996. An order 3 context hash without collision detection is mapped to a queue of length 3 representing the 3 most frequently seen bytes in that context. These are Huffman coded with 1, 3, or 4 bits respectively. Long runs of first place bytes are run length encoded with 12 bits to encode the run length. Bytes not seen in the queue are modeled in an order 0 pseudo-MTF queue using 7 bits for the first 32 positions and 12 bits for the other 224. It is called "pseudo-MTF" because when a byte is observed, it is moved only about half way to the front with some random dithering. This is an optimization for speed. It allows a fast update of an index into the queue. The order 3 hash table maximum size is 2 18 queues (1 MB memory).
sr2 is an improved (but slower) symbol ranking compressor by Matt Mahoney in Aug. 2007. An order 4 context hash is mapped to a table of 2 20 3 byte MTF queues and a counter for consecutive first place hits ranging from 0 to 63. When the first place byte is observed, the counter is incremented. For all other values, the counter is reset to 1 if in the queue or 0 if not. The new value is pushed to the front of the queue and the others pushed back. For example, the sequence ABCCC would result in the queue (C,B,A,3) with C at the front. A subsequent B would result in (B,C,A,1). A subsequent D would result in (D,B,C,0).
A byte is first Huffman coded and then arithmetic coded. The point of the Huffman code is to reduce the number of arithmetic coding operations for better speed. Suppose the queue contains (c1,c2,c3,n). The coding and next state is as follows:
Input Code (c1 c2 c3 n) next state
----- ---- --------------------------
Initial (0, 0, 0, 0)
c1 0 (c1, c2, c3, min(n+1, 63))
c2 110 (c2, c1, c3, 1)
c3 111 (c3, c1, c2, 1)
other c 10cccccccc (c, c1, c2, 0)
The bits are coded using a direct context model with a count ranging from 2 to 128 (section 4.1.2). For n? 4, the context is order 1 plus n plus the previous bits of the current symbol. For n < 4, the model is the same except order 2.
The following comparison is for the Calgary corpus as 14 files compressed separately.
Program Size Compr Decompression
--------- --------- ---- --------
srank -C8 1,281,984 0.20 0.20 sec.
sr2 975,208 0.48 0.50 sec.
During development, it was observed that an order 3 context sometimes compressed better on smaller files, but order 4 works better on larger files. Increasing the hash table beyond 2 20 did not help much, in spite of the fact that more memory almost always helps any algorithm.
Arithmetic decoding is slightly slower than encoding. Recall that the steps to compress are:
For decompression:
Modern processes can reorder instructions and execute them in parallel. During compression, steps 2 and 3 are independent, so they can overlap. During decompression, the model cannot be updated until the bit has finished being decoded.
A Burrows-Wheeler Transform sorts the input by its right context. By bringing together characters with similar contexts, the transformed data can be more easily compressed with a fast adapting order 0 model. Shown below is a portion of the Burrows Wheeler transform of book1 from the Calgary corpus (with newlines converted to spaces for clarity). The column in bold is the BWT.
BWT block --+ +--- Sorted on this column
\ /
VV
ING. Her cul p ability lay in her m
e of the ins t ability of a woman?
hat the desi r ability of her exist
tion, from i n ability to guide inc
husband's i n ability to meet the
nervous exci t ability, he returned
stimony, pro b ability, induction -
le of respec t ability, were now si
of a ship's c abin, with wood slid
new riding- h abit -- the most ele
mostly her h abit hen excited, he
's virtuous h abit of entering the
new riding- h abit of myrtle-green
aracter and h abit, and seemed so
e no riding- h abit, looked around
besides the h abitable inn itself,
ceived no in h abitant for the spac
' by the in h abitants of Caster+
ttle <P 61> h abitation, and here
every human h abitation, and the h
those old-in h abited walls. Acesta a fost
ntly to old h abits and usages, si
o imply his h abitual reception of
rgrass, who h abitually spoke on a
at everybody abjured her -- for w
osed all the able-bodied men upon
en the favou r able-con+ junction s
ame to the s t able-door and looked
d been answe r able.! ' " We must
The BWT is the column in bold. It is "... ptrnntbtchhhhhhhhhhhhhhh rtr...".
BWT is best suited for stationary sources. For example, a sorted list of words would be compressed poorly because local rules (newline is followed by "A", later changing to "B") become spread throughout the transform. For these cases, separating different data types into independently compressed blocks can improve compression. Otherwise, the largest possible block size should be used.
BWT compression depends on the alphabet order. Best compression is obtained when related symbols such as letters or digits that make similar predictions are grouped together. The ASCII character set already has this property, but is not optimal.
Compressors that use BWT are called context sorting or block sorting algorithms. A typical implementation is to divide the input into fixed sized blocks (as large as memory allows) and sort an array (of the same size) of pointers into the block. The sort order is the lexicographical order of the string to which it points, wrapping around to the beginning of the block if necessary. Wrapping around can be avoided by appending a virtual "end of block" symbol which is lexicographically less than any of the symbols in the alphabet (eg -1).
A straightforward sort using a fast sorting algorithm like a radix sort or quicksort can be very slow on highly redundant or compressible data. Quicksort requires on average O (n log n) string comparisons, but worst case string comparison is O (n), as in the case of a block of identical bytes. This would result in O (n 2 log n) run time. There are several ways to avoid the problem of long common prefixes.
There are fast suffix sorting algorithms that use 5 or 6 times the block size (5n to 6n) in memory. One of the earliest is a suffix tree sorting algorithm developed by Ukkonen (1995), which is O (n) but requires 13n-15n memory. The idea is to build a tree in which each path from root node to a leaf node represents one of the n suffixes of the block. Each edge is labeled with a string of one or more bytes (represented as a pair of pointers into the block) Each of the n leaf nodes is a pointer to the start of the suffix. Each internal node except possibly the root node is required to have two or more children (in lexicographical order from left to right). This limits the number of nodes to at most 2n. The algorithm is to build the tree and then output the sorted suffix pointers by a recursive inorder traversal from the root node:
traverse(node):
if the node is a leaf then
output block[leaf-1]
altfel
for each child from left to right
traverse(child).
A suffix tree for the string "BANANA" with end of block symbol "$" is shown below. The downward arrows are called transitions. Each transition is labeled with a string of one or more bytes. The dotted lines are called suffix links. They form a path from the longest suffix represented by a nonterminal node (having a child $) to successively shorter suffixes back to the root (which represents the empty suffix). The suffix links are maintained to make the tree building algorithm efficient.

A suffix tree of "BANANA$" (from Wikipedia ).
Ukkonen's algorithm is to start with just a root node and scan the block from left to right in a single pass. For each byte, the suffix pointers form a linked list of nodes that need to be updated. An update consists of appending a child node, splitting edges as needed, and updating the suffix pointers. The suffix pointer traverse stops when there is already a child node for the character being added.
A suffix array represents a suffix tree using just the block, the sorted list of pointers, and the longest common prefix (LCP), which is the number of initial bytes shared with the previous suffix after sorting. The suffix array of "BANANA$" is shown:
Index Suffix LCP
----- ------ ---
6 $ 0
5 A$ 0
3 ANA$ 1
1 ANANA$ 3
0 BANANA$ 0
4 NA$ 0
2 NANA$ 2
Yuta Mori has benchmarked several suffix array sorting algorithms. The fastest are his own libdivsufsort, MSufSort, used in M99 and M03 by Michael Maniscalco, and archon4r0 used in the dark archiver. All of the sorting algorithms are open source.
It is rather surprising that a BWT block can be inverted to recover the original data. The only other information needed is the new position of the original first byte. We are given a BWT string BWT[0..n-1] of length n, and the location p (0? p < n) of the position of the first byte BWT[p]. The algorithm to output the original string is:
T = sort(BWT)
Repeat n times:
output BWT[p]
move p from the i'th location of c in T to the i'th location of c in BWT
For example, suppose that BWT[0..5] = "NNBAAA" is the BWT of "BANANA". dupa cum se arata. (To simplify the explanation, we omit the $, but it works the same either way).
BWT Sorted context
\ /
N ABANA
N ANABA
B ANANA p=2
A BANAN
A NABAN
A NANAB
Create T[0..5] = "AAABNN". Avem acum:
P
0 1 2 3 4 5
BWT = NNBAAA
T = AAABNN
Pasii sunt:
output BWT[2] = B
p is the third A in T. Move to p=5, the third A in BWT.
output BWT[5] = A
p is the second N in T. Move to p=1, the second N in BWT.
output BWT[1] = N
p is the second A in T. Move to p=4, the second A in BWT.
output BWT[4] = A
p is the first N in T. Move to p=0, the first N in BWT.
output BWT[0] = N
p is the first A in T. Move to p=3, the first A in BWT.
output BWT[3] = A.
As an optimization, we may represented the sorted array T solely by the starting position of each of the 256 sequences of byte values, for example A=0, B=3, C=3,..., N=4, O=6. Furthermore, we can construct in advance a list NEXT[0..n-1] such that NEXT[p] is the next move for p. For example, NEXT[2] = 5 would be the first move. To build this list we scan BWT and use T to count the occurrence of each byte value.
for i in 0..n-1 do
NEXT[T[BWT[i]]++] = i
In C++, the inverse BWT looks like this:
// Invert and output the BWT in bwt[0...n-1] starting at p
void invert_BWT(unsigned char *bwt, int n, int p) {
// Collect cumulative counts of bwt:
// t[i] = number of bytes < i
int t[257]={0}; // cumulative counts
for (int i=0; i<n; ++i)
++t[bwt[i]+1];
for (int i=1; i<257; ++i)
t[i]+=t[i-1];
assert(t[256]==n);
// Build linked list
int *next=calloc(n, sizeof(int)); // linked list
assert(next); // out of memory?
for (int i=0; i<n; ++i)
next[t[bwt[i]]++]=i;
assert(t[255]==n);
// Traverse and output list
for (int i=0; i<n; ++i) {
assert(p>=0 && p<n);
putc(bwt[p], out);
p=next[p];
}
free(next);
}
bzip2 is a popular open source BWT based file compressor developed in 1996 by Julian Seward. It takes an option -1 through -9 to select a block size of 100 KB to 900 KB. -9 generally gives the best compression. The compression algorithm is as follows:
bzip2 uses 2 to 6 Huffman tables, which are selected every 50 symbols to make the code adaptive. The tables are kept in a MTF queue. The selection code is unary coded. A unary code for a number n is n 1 bits and a 0. For example, 4 = 11110.
The Huffman tables are coded as a sequence of lengths. The lengths are delta coded, ie as the difference from the previous length. A difference is coded as 0 = 0, 10 = -1, 11 = +1, repeating as needed. A bitmap is used to mark unused queue selection codes, which are omitted from the sequence. The bitmap is divided into 16 16-bit words, where a 0 bit means the code is not used. If all 16 bits are 0, then the word is omitted. One additional 16 bit word is used to mark which words are omitted.
The original bzip was arithmetic coded, which is better suited for a fast adapting model. It was replaced with a Huffman code due to patents (now expired) on arithmetic coding.
BBB (big block BWT) is an open source file compressor written by Matt Mahoney in Aug. 2006. It has two innovations: a "slow" mode that allows blocks up to 80% of available memory, and a context mixing model of the BWT sequence. When released, it was top ranked on the large text benchmark among BWT compressors because it was the only program that could fit the 1 GB test file into a single block on a 2 GB machine.
To context sort a large block, it is first divided into 16 smaller blocks of 4-byte pointers which are sorted with wraparound using std::stable_sort() (normally a generic comparison sort such as quicksort ). The pointers are then written to 16 temporary files, which are merged to produce the final result.
The inverse transform does not build a linked list, because this takes 4 times as much memory as the block size. Recall that the inverse transform first sorts the bytes in the block into an array T (represented by 256 cumulative counts), and that p points to the next output byte in the block. The steps to be iterated are:
Normally, step 2 is done by traversing a link in the list NEXT. Instead, BBB searches the block for the i'th occurrence of c. To do this efficiently, it first consults an index that locates every 16'th occurrence of c in BWT to get the approximate location, and searches linearly from there. This index takes 1/4 as much memory as the BWT block.
BBB uses an order 0 indirect context model (section 4.1.3) followed by 6 SSE stages and bitwise arithmetic coding. The model uses 5 MB of memory. Se pare ca aceasta:
0.5 0.25 0.5
+ SSE1 ->+ +---->----+ +---->----+
| | | | | |
ICM ->+ +-> SSE3 -> SSE4 +-> SSE5 -+-+-> SSE6 -+-> Encoder
| |
+ SSE2 ->+
0.5
The ICM takes a bytewise order 0 context, ie just the previous bits of the current byte. Recall that an ICM maps a context to a bit history (an 8 bit state), which is mapped to a slow adapting probability table.
Each of the SSE (section 4.3.3) maps a context and a probability (stretched and quantized to 32 levels) to a new probability interpolated between the two nearest quantized outputs. SSE1 and SSE2 are both order 0, but SSE1 is fast adapting (learning rate 1/32) and SSE2 is slow adapting (learning rate 1/512). The two predictions are averaged in the linear domain. The SSE in BBB update both quantized table entries above and below the input probability, unlike ZPAQ which updates only the nearest.
SSE3 takes a bytewise order 1 context. SSE4 takes the previous but not the current byte as context, plus the run length quantized to 4 levels (0, 1, 2..3, 4+).
SSE5 takes a sparse order 1 context of just the low 5 bits and a gap of 1 byte, ie 5 of the last 16 bits, plus the current byte:...xxxxx........ The output is averaged linearly with the input with weight 3/4 to the output.
SSE6 takes a 14 bit hash of the order 3 context. It is averaged linearly with its input with weight 1/2 each.
The table below compares the models for bzip2 and BBB on the Calgary corpus with each file compressed separately. In both cases the block size is 900 KB, which is large enough to hold each file in a single block. BBB is run in both fast and slow modes. About half of the compression time in both cases is due to PIC, which has long runs of 0 bytes. Unlike bzip2, BBB has no protection against long string comparisons while sorting.
Note also that compressing all of the data together as a tar file makes compression worse. As mentioned, BWT is poorly suited for mixed data types.
Program Calgary Compr Decomp (seconds) calgary.tar
-------- ------- ----- ------ -----------
bzip2 -9 828,347 0.68 0.42 860,097
bbb cfk900 785,672 10.33 1.46 (fast mode) 800,762
bbb ck900 785,672 13.74 5.54 (slow mode)
Most modern BWT based compressors use some type of suffix array sorting algorithm for the forward BWT transform. MSufSort version 2 is a fast and memory efficient (about 6n block size) algorithm developed by Michael Maniscalco (Puglisi, 2005; Puglisi, Maniscalco, 2006a). Later an improved version, MSufSort v3, used a different algorithm.
Recall that a suffix array (SA) is a list of pointers to the suffixes of a string in lexicographical order. Thus, the SA for "BANANA$" is (6,5,3,1,0,4,2), where "$" precedes all other characters. The BWT is calculated by outputting the characters to the left of each of the pointers, eg "ANNB$AA".
MSufSort v2 actually calculates the inverse suffix array (ISA). The ISA lists the lexicographical rank of each character in the string. For example, the ISA of "BANANA$" is (4,3,6,2,5,1,0). There is a simple algorithm for converting an ISA to an SA:
for each i in 0..n-1 do
SA[ISA[i]] = i
However, because SA and ISA each require 4n memory, the inversion is typically done in place. This can be achived by using one bit of each element (such as the sign bit) to mark those that have already been moved (Puglisi, 2005).
for i from 0 to n-1 do
if ISA[i] > 0
start := i
suff := i
rank := ISA[i] - 1
repeta
tmp := ISA[rank]
ISA[rank] = -suff (use sign bit to mark as moved)
suff := rank
rank := tmp
until rank = start
ISA[rank] := -suff
SA sorting based on comparison sorts have worst case run times of O (n 2 ) or worse because of the O (n) time required to compare strings when the average LCP is large, as in the case of highly redundant data. A common feature of efficient algorithms, such as the 15 described by (Puglisi, Smyth, Turpin, 2007), is that they exploit the fact that if two suffixes have a common prefix of m characters and their order is known, then we also know the order of their first m suffixes. For example, once we know that "ANA$" comes before "ANANA$", then we also know the order of their next m = 3 suffixes, specifically that "NA$" comes before "NANA$", "A$" comes before "ANA$", and "$" comes before "NA$". Suffix trees implicitly exploit this feature by merging the m common characters into a single edge, thus achieving O (n) sort times, although at a high cost in memory.
MSufSort v2 is faster in practice than a suffix tree, and uses less memory, which allows for better compression because blocks can be larger. However its run time is difficult to analyze because of its complexity.
MSufSort v2 begins by partially sorting the suffixes by their first two characters using a counting sort and storing the result as a stack of 64K linked lists reading right to left in the input string. To save space, the list pointers are overlayed on the ISA, and the head of each list is stored in a separate stack which is popped in lexicographical order. If a list contains only one element, then it is replaced by its rank in the ISA. The sign bit is used to distinguish ranks from pointers. If a list has more than one element, then that list is split into 64K new lists by a counting sort keyed on the next two characters, and the head of each list is pushed on the stack maintaining lexicographical order. Each stack entry has a head pointer and a suffix length. ~ denotes the last element in a list. De exemplu:
i 0 1 2 3 4 5 6 Stack
BANANA $ Input string
ISA[i] ~ ~ ~ 1 2 ~ Initial chains (5,2) A$, (3,2) AN, (0,2) BA, (4,2) NA
ISA[i] ~ ~ ~ 1 2 -1 Rank singleton A$ (3,2) AN, (0,2) BA, (4,2) NA
ISA[i] ~ ~ ~ ~ 2 -1 Split AN (3,4) ANA$, (3,4) ANAN, (0,2) BA, (4,2) NA
ISA[i] ~ ~ ~ -2 2 -1 Rank singleton ANA$ (3,4) ANAN, (0,2) BA, (4,2) NA
ISA[i] ~ -3 ~ -2 2 -1 Rank singleton ANAN (0,2) BA, (4,2) NA
ISA[i] -4 -3 ~ -2 2 -1 Rank singleton BA (4,2) NA
ISA[i] -4 -3 ~ -2 ~ -1 Split NA (4,4) NA$, (2,4) NANA
ISA[i] -4 -3 ~ -2 -5 -1 Rank singleton NA$ (2,4) NANA
ISA[i] -4 -3 -6 -2 -5 -1 Rank singleton NANA, stack is empty, so done.
There are two important optimizations. First, we note that during the chain splitting, if the rank of a suffix one or two characters to the right is known, then it must precede other suffixes where this is not the case. Those suffixes can also be sorted using those ranks as keys, and don't have to be pushed on the stack. For example, at the point where the chain NA at locations 2 and 4 is sorted, the positions at 3 and 5 have already been ranked (-2 and -1 respectively). Thus, we already know from those ranks that position 5 must come first. This optimization is known as induced sorting.
This optimization is insufficient to prevent worst case O (n 2 ) behavior on data with high average LCP. To handle this, MSufSort v2 detects repeating strings (like "ANAN") and applies the observation that if the suffix order of the last repeat is known, then the same order can be applied to all of the remaining prefixes and they can be written into the ISA. Once those ranks are applied, other rotations of the repeating string (like "NANA") are ranked by induced sorting.
The open source dark compressor by Dmitry Malyshev uses Itoh-Tanaka suffix array sorting (Puglisi, Smyth, Turpin, 2007, sec. 3.3) for the forward BWT transform. Itoh-Tanaka has O (n 2 log n) worst case complexity, so additional steps such as LZP preprocessing are needed to avoid long common prefixes.
Itoh-Tanaka classifies each input byte by whether or not it is larger than the byte following it. De exemplu:
0 1 2 3 4 5 6
Input: BANANA $
Type: LSLSLLS
We denote a character as type L if it is lexicographically larger than the byte after it and S (smaller) if it is smaller or equal. Itoh-Tanaka sorts the S characters using ordinary string comparison, then induces the sort order of the L bytes in a single pass. The first step is to put the unsorted type S characters into the suffix array, SA. We note that SA has 256 buckets plus the terminating sentinel, and furthermore each bucket is divided into two parts with all of the L bytes on the left and S on the right. Thus:
SA: $={6} A={- 1 3} B={-} N={- -}
Next we sort the right halves of all of the buckets that have two or more type S symbols. Itoh and Tanaka use a multikey quicksort although any string sorting algorithm could be used. Recall that the quicksort algorithm picks one element at random, called a pivot, then moves all elements smaller than the pivot to the left side of the array and all others to the right side, with the pivot in between. Then it recursively quicksorts the two halves. In a multikey quicksort, the array is divided into three parts (less, equal, greater) based on the first character of the string. The left and right parts are recursively sorted, and the middle part is recursively sorted using the next character of the string as the sort key, similar to a radix sort. After sorting we have:
SA: $={6} A={- 3 1} B={-} N={- -}
We now have enough information in SA to induce the rest of the elements. We can use the fact that when the order of two suffixes in SA is known and they are both preceded by the same character, then those two longer suffixes must also be in the same order. For example, if we know that suffix 6 ($) comes before 4 (NA$) and both are preceded by "A", then it must be the case that suffix 5 (A$) comes before 3 (ANA$).
The procedure for the induced sort is to scan SA left to right. For each pointer already filled in, we look at the character to its left. If it is type L then we place it in the leftmost available position in the appropriate bucket. De exemplu:
SA: $={6} A={- 3 1} B={-} N={- -}
Scan 6. Input[5] = 'A' is type L, so we put 5 in bucket A.
SA: $={6} A={5 3 1} B={-} N={- -}
Scan 5. Input[4] = 'N' is type L, so we put 4 at the left of bucket N.
SA: $={6} A={5 3 1} B={-} N={4 -}
Scan 3. Input[2] = 'N' is type L, so we put 2 at the next open spot in bucket N.
SA: $={6} A={5 3 1} B={-} N={4 2}
Scan 1. Input[0] = 'B' is type L, so we put 0 at the first open spot in bucket B.
SA: $={6} A={5 3 1} B={0} N={4 2}
Scan 0. There is no character to the left, so continue.
Scan 4. Input[3] is type S so continue.
Scan 2. Input[1] is type S so continue.
We now have the complete suffix array, (6,5,3,1,0,4,2). Besides the input array and SA, the only additional memory needed is an array of 256 pointers to the first available position in each bucket.
LibDivSufSort is an open source library for suffix array sorting by Yuta Mori. It is used for the forward BWT in bsc (Ilya Grebnov), mcomp, (the engine for WinRK by Malcolm Taylor), and was used in early versions of bcm (Ilia Muraviev).
DivSufSort improves on Ikoh-Tanaka by directly sorting only the last S type suffix in each run of consecutive type S. Then it induces the rest of the S's in a right to left scan, filling each bucket in from the right side. For each element SA[i], if input[i-1] is type S, then its index is put at the rightmost open position in the appropriate bucket (but overwriting the original rightmost S elements). This process is just like the induced sort of Itoh-Kanaka, but in the opposite direction. After this step, the type L suffixes are sorted as in Itoh-Kanaka. This 3 step method was first described by (Ko, Alura, 2003).
DivSufSort would still have O (n 2 log n) complexity due to the string sorting in the first step. To avoid this problem, it detects and induces tandem repeats like with MSufSort v2.
MSufSort version 3 uses a completely different (and faster) suffix array sorting algorithm than version 2 (Puglisi, Maniscalco, 2006b). It uses slightly more than 5n memory. It is used in the M03 compressor.
MSufSort v3 builds the SA rather than the ISA. It first uses a counting sort on the first two bytes to sort into 64K buckets. Then each bucket is sorted using the DivSufSort algorithm with detection and induction of tandem repeats as in MSufSort v2 to avoid delays caused by long common prefixes. Typically about 30% of the suffixes need to be sorted and the rest are induced. Strings are sorted using multikey quicksort (in increasing lexicographical order to support tandem repeat induction). As an optimization, groups smaller than 64 elements are sorted using insertion sort. Also, to control the size of the recursion stack, the sorting algorithm is changed to a heapsort after a recursion depth of 48.
A Burrows-Wheeler transform of a string consists of the characters of that string sorted by context, plus the index of the new position of the original start of the string. This index is needed because the context sorted string is not unique. For example, the BWT of BANANA (without a sentinel) is (NNBAAA,2), and the BWT of its rotation ANANAB is (NNBAAA,5). Thus, the BWT is not a bijection.
Note also that adding the sentinel does not make BWT a bijection either. Although the sentinel makes the context sort unique, it increases the size of the output string.
David A. Scott discovered in 2007 a bijective variant of the BWT (called BWTS for BWT Scottified) which omits the need to transmit the index of the starting string. Because the output of the BWTS is the same size as the input, all strings also have a valid inverse BWTS. The algorithm is described in (Kufleitner, 2009), who also extended the idea to the sort (Schindler) transform.
The idea is to divide the input string into a set of smaller strings such that the starting index of each substring is at a known, fixed location, such as at the first character. Specifically, the input is partitioned into a lexicographically nonincreasing sequence of Lyndon words. A Lyndon word is a string that lexicographically precedes any of its rotations. For example, the Lyndon factorization of BANANA is (B)(AN)(AN)(A). A Lyndon factorization is unique and can be calculated in O (n) time.
Recall that the inverse BWT is to use a counting sort to calculate T=sort(BWT) and then to build and traverse a linked list from the i'th occurrence of c in T to the i'th occurrence of c in BWT. In a BWT, this linked list usually forms a single loop that returns to the initially transmitted index when complete. (A repeating string will have multiple loops but will decode the same way). In the BWTS, the list forms one loop for each Lyndon word. The end of each word is detected by traversing the list back to its start. The next loop starts at the first unused element. The Lyndon words are restored right to left. For example, we compute the BWTS of "BANANA" by sorting the rotations of its 4 Lyndon words, and taking the last character of the sorted column as the BWTS:
BB
B AN AN AA
A Lyndon rotations NA sort AN last character N
N -------> ----------> -----> AN ---------------> N
A AN AN BB
N NA NA A
A NA A
AA
(When sorting, string comparison wraps around. Thus, B would fall between BA and BC).
To compute the inverse BWTS, we construct a linked list just like with a BWT.
i T BWTS Next
- - ---- ----
0 AA 0
1 AN 4
2 AN 5
3 BB 3
4 NA 1
5 NA 2
The linked list has 4 cycles: (0), (1,4), (2,5), (3). Reversing the order of the cycles and concatenating, we get (3,2,5,1,4,0). The corresponding elements of T spell out "BANANA".
A BWTS is 4 bytes smaller than a BWT. Experimentally, the effect on compression is small. The following table shows the effect of a BWT and BWTS followed by encoding using an adaptive order 0 coder (fpaq0p, learning rate 1/32), and an indirect order 0 model (fpaq0f2). For BWT, MSufSort 3.1.1 was used. It adds a 4 byte starting index to the beginning. For BWTS, the BWTS program by Scott was used with BLOCKSIZE set to 4000000 in the source code. In both cases, the input file is sorted in a single block. The test files are calgary.tar (mixed data types) and book1 from the Calgary corpus (uniform data).
File Encoder BWT BWTS
----------- ------- ------- -------
book1 fpaq0p 244,106 244,096
calgary.tar fpaq0p 985,906 985,885
book1 fpaq0f2 220,338 220,344
calgary.tar fpaq0f2 845,526 845,549
One possible benefit of a bijective compression algorithm is for encryption. Because all possible decryptions are valid inputs to the decompresser, it eliminates one possible test that an attacker could use to check whether a guessed key is correct. (However, one should not assume that there are not other tests, and should still choose strong keys). Scott has also written a bijective arithmetic coder, arb255. Thus, it is possible to make the entire compression algorithm a bijection.
A predictive filter is a transform which can be used to compress numeric data such as audio, images, or video. The idea is to predict the next sample, and then encode the difference (the error) with an order 0 model. The decompresser makes the same sequence of predictions and adds them to the decoded prediction errors. Better predictions lead to smaller errors, which generally compress better.
The simplest predictive filter is a delta code. The predicted value is just the previous value. For example, the sequence (5,6,7,9,8) would be delta coded as (5,1,1,2,-1). A second pass would result in (5,-4,0,1,-3).
Delta coding works well on sampled waveforms containing only low frequencies (relative to the sampling rate), such as blurry images or low sounds. Delta coding computes a discrete derivative. Consider what happens in the frequency domain. A discrete Fourier transform represents the data as a sum of sine waves of different frequencies and phases. In the case of a sine wave with frequency? radians per sample and amplitude A, the derivative is another sine wave with the same frequency and amplitude?A. From the Nyquist theorem, the highest frequency that can be represented by a sampled waveform is? or half the sampling rate. Frequencies above 1 radian per sample will increase in amplitude after delta coding, and lower frequencies will decrease. Thus, if high frequencies are absent, it should be possible in theory to reduce the amplitude to arbitrarily small values by repeated delta coding.
Eventually this fails because any noise (which is not compressible) in the prediction is added to noise in the sample with each pass. (Noise has a uniformly distributed spectrum, so its high frequency components are amplified by delta coding). Noise can come either from the original data or from quantization (rounding) errors during sampling. These are opposing sources. Decreasing the number of quantization levels removes noise from the original data but adds quantization noise.
The images below show the effects of 3 passes of delta coding horizontally and vertically of the image lena.bmp (a widely used benchmark image). The original image is in BMP format, which consists of a 54 byte header and a 512 by 512 array of pixels, scanned in rows starting at the bottom left. Each pixel is 3 bytes with the numbers 0..255 representing the brightness of the blue, green, and red components. The image is delta coded by subtracting the pixel value to the left of the same color, and again on the result by subtracting the pixel value below. (The order of the two encodings does not matter). To show the effects better, 128 is added to all pixel values (which does not affect compression). Thus, a pixel equal to its neighbors appears medium gray.

Imagine originale

Delta coded once horizontally and vertically.

Delta coded twice.

Delta coded 3 times.
The original image is 786,486 bytes (with or without delta coding). The following table shows the compressed sizes when compressed with an order 0 indirect context model (ICM-0), with each of the 3 colors compressed in a separate stream.
ICM-0
-------
lena.bmp 569,299
delta 1 511,316
delta 2 645,634
delta 3 768,154
For comparison, Image Magick compresses to PNG format with size 474,573, and the top ranked paq8px_v67 -6 to 412,641 bytes.
Details: The ICM-0 model was implemented in ZPAQ 1.10 using the following configuration:
comp 0 0 0 0 1 0 icm 7 (indirect context model using 2 7+6 bytes) hcomp b++ a=ba== 3 if (b is 0,1,2 depending on color) a=0 b=0 endif a<<= 9 *d=a halt (context is color in bits 10..9 + order 0) post 0 end
lena.bmp can be compressed to 499,139 bytes by performing the color transform (red, green, blue) to (red-green/4, green, blue-green*3/4), then delta coding and modeling with ICM-0. The transform was tuned to this image, but is not optimal for all images. For many others, the transform (red-green, green, blue-green) works well. The transform works because when one color is brighter, the others tend to be too. Thus, one color can predict the others. The ideal transform depend on the average color of the image.
A linear filter is a finite impulse response filter with n taps that predicts a sample x i in the sequence x 1 x 2..., dupa cum urmeaza:
prediction =? j=1..n w j x ij
where w j is called the j'th coefficient. A delta filter is the special case of the coefficient array n = 1, w = (1). Two passes of a delta filter is equivalent to n = 2, w = (2, -1), and 3 passes is equivalent to n = 3, w = (3, -3, 1).
An adaptive filter is a linear filter whose coefficients are adjusted to reduce prediction errors. A simple update rule is:
w j := w j +?x ji (x i - prediction), j = 1..n
where? is the learning rate. The update rule is unstable because of a positive feedback loop: when the error is large, it can lead to large updates which could increase the error even more. An adaptive filter must compensate by limiting the magnitudes of the weights and updates.
It is often possible to find transforms that improve compression for specialized data types. We mention two.
The E8E9 transform is used to compress x86 executable code (EXE or DLL files). In x86, a CALL or JMP instruction (E8 or E9 hex) is followed by a 4 byte address (LSB first) relative to the program counter. Compression can be improved by converting to an absolute address, because code often contains many calls or jumps to the same address. The transform consists of searching for a byte with the value E8 or E9 hex (232 or 233), interpreting the next 4 bytes as a 32 bit number, and adding the offset from the beginning of the input file. The decompresser does the same, except that it subtracts the offset. E8E9 is used in CAB format (for CALL instructions only) and in many top end compressors. Recent versions of PAQ8PX by Jan Ondrus also transform conditional branch addresses.
In x86-64, all references to static memory (not just JMP and CALL) are relative addresses. Currently transforms for x86-64 are not yet widely used.
Precomp is a program by Christian Schnaader that searches its input for segments of data compressed in deflate (zip) format and uncompresses them. This can improve compression if the (now larger) data is compressed with a better algorithm. Many applications and formats use deflate compression internally, including PDF, PNG, JAR (Java archive), ODT (OpenOffice) and SWF (Shockwave Flash).
To make the inverse transform bitwise identical, precomp tests by recompressing the data with zlib and comparing it. Recall that LZ77 is not a bijection. There are many different ways to compress a string that will decompress the same way. Precomp relies on the fact that most applications use zlib rather than write their own implementation. Still, precomp must test 81 combinations of options to find one that compresses to exactly the original data, and then stores those options in the output. If it fails to find a match (even in valid deflate data), then it must insert additional data.
Precomp can be used by itself. It is also built into two compressors, lprepaq (precomp+lpaq6) and paq8o8pre (precomp+paq8o8). paq8o8pre -7 compresses flashmx.pdf from the Maximum Compression corpus to 1,821,939 bytes in 692 seconds. As of Feb. 2010 the program has not yet been benchmarked and the best result is 3,549,197 bytes by WinRK 3.1.2. The improvement is obtained partly by unzipping many embedded BMP images and compressing them with paq8o8's specialized BMP model (which is also top ranked on rafale.bmp ).
The open source file compressor M1x2 v0.6 by Christopher Mattern in Feb. 2010 uses order 1 Huffman coding as a preprocessor to a context mixing model. The idea is to reduce the size of the input to make compression faster. The model contexts are aligned on Huffman code boundaries instead of byte boundaries. The order 1 coder is actually 256 tables selected by the previous byte. Huffman codes are limited to 12 bits in length to simplify the implementation.
Lossy compression refers to discarding unimportant information. Generally this means compressing images, video, or audio by discarding data that the human perceptual system cannot see or hear.
Lossy compression is a hard AI problem. To illustrate, speech could theoretically be compressed by transcribing it into text and compressing it with standard techniques to about 10 bits per second. We are nowhere near that.
Even worse, we could imagine a lossy video compressor translating a movie into a script, and the decompresser reading the script and creating a new movie with different details but close enough so that the average person watching both movies one after the other would not notice any differences. We may use a result by Landauer (1986) to estimate just how tiny this script could be. He tested people's memory (over a period of days) over a wide range of formats such as words, numbers, pictures and music, and concluded that the human brain writes to long term memory at a fairly constant rate of about 2 bits per second. Currently we need 10 7 bits per second to store DVD quality MPEG-2 video.
The state of the art is to apply lossy compression only at a very low level of human sensory modeling, where the model is well understood.
All image formats, even BMP, may be regarded as a form of lossy image compression. An uncompressed image is normally a 2 dimensional array of pixels, where each pixel has 3 color components (red, green, blue) represented as an integer with a fixed range and resolution. A pixel array is an approximation of a 2 dimensional continuous field where the light intensity at any point would be properly described as a continuous spectrum. Note how lossy compression is applied:
A BMP image uses 8 bits per pixel per color, which matches the resolution of most monitors. Each value is an 8 bit number ranging from 0 (darkest) to 255 (brightest). The values are proportional to apparent light intensity, not actual intensity. The actual intensity is gamma corrected by the monitor by raising it to the power of? = 2.2. Thus, a pixel value of 200 is a little over 4 times as bright as a pixel value of 100, although it only looks twice as bright.
The GIF image format is lossless except that it uses a color palette of up to 256 colors. The format is an array of 8 bit indexes into the palette. The limited number of colors noticeably reduces the quality of color photographs, although it is sufficient for grayscale or diagrams. A GIF file may contain multiple images for animations.
GIF uses LZW compression (section 5.3.1) with a maximum dictionary size of 4K. When the table is full, it is discarded and re-initialized. It reserves two codes to initialize the table and to mark the end of data.
Use of GIF was discouraged due to a patent on LZW, which is now expired.
PNG is a lossless image format. Images are normally 8 bits per pixel but can be more. A pixel has 3 color components and an optional fourth component for an alpha channel to indicate transparency.
PNG is compressed by predictive filtering (section 5.6) followed by deflate (section 5.2.2). There are 5 filters which can be selected for each scan line. The image is scanned left to right starting at the top. Let A, B, and C be the previously coded neighboring pixels of the predicted pixel x:
CB
Un x
The 5 possible predicted values are 0, A, B, (A+B)/2, or Paeth. The Paeth filter is to predict A, B, or C, whichever is closest to A + B - C. The Paeth filter usually gives the best compression.
TIFF is an image container format. Most commonly it is used for uncompressed images when BMP cannot be used because more than 8 bits per pixel or more than 3 color components are needed. TIFF supports several lossless compression modes. The most common is delta coding (subtracting the pixel to the left) followed by LZW.
JPEG is the most widely used representation for photographic images. It uses lossy compression. It exploits two limitation of human visual perception. First, the eye has different degrees of sensitivity to brightness variation depending on spatial frequency, peaking at 0.1 to 0.2 degrees (a few pixels). Second, the eye is much less sensitive to color variation at high spatial frequencies. The compression steps for baseline JPEG are as follows:
The color transform from RGB (red, green, blue) to YCbCr is:
Y = 0.299 R + 0.587 G + 0.114 B (black-white)
Cb = 128 - 0.168736 R - 0.331264 G + 0.5 B (yellow-blue)
Cr = 128 + 0.5 R - 0.418688 G - 0.081312 B (green-red)
The eye is less sensitive to fine detail in Cb and Cr than in Y, so these two are (optionally) downsampled 2 to 1 by averaging 2 by 2 blocks of pixels into 1 pixel. The inverse transform is:
R = Y + 1.402(Cr - 128)
G = Y - 0.34414(Cb - 128) - 0.71414(Cr - 128)
B = Y + 1.772(Cb - 128)
The DCT represents 8 by 8 blocks of pixels in the spatial frequency domain. The 64 DCT coefficients S uv, u, v in 0..7, of the 8 by 8 pixel block S xy, x, y in 0..7, are computed as follows:
S uv =?(u)?(v)? x=0..7? y=0..7 S xy cos[?/8 (x+1/2) u] cos[?/8 (y+1/2) v]
where?(0) = 1/8 1/2 and?(1..7) = 1/4 are normalizing factors. u is the horizontal spatial frequency and v is the vertical spatial frequency. The image below shows the contribution of each of the 64 S uv DCT coefficients to an 8 by 8 pixel block with u reading across and v reading down. The S 00 coefficient is at the top left.


Left: 8 by 8 DCT (from Wikipedia ). Right: zig-zag encoding order of the S uv coefficients (from Wikipedia ).
The inverse DCT has the same form:
S xy =?(x)?(y)? u=0..7? v=0..7 S uv cos[?/8 (u+1/2) x] cos[?/8 (v+1/2) y]
Each of the 64 coefficients in Y and the 64 in Cb and Cr are quantized by dividing by one of 128 values from two quantization tables and rounding. Because the eye is less sensitive to high spatial frequencies (u and v large), especially in the two chroma components, these divisors can be larger.
The coefficient S 00 is the average brightness of the 8 by 8 block. It is called the "DC" coefficient. It is the only one that depends significantly on neighboring blocks, so it is delta coded by subtracting the DC coefficient of the last coded block of the same color. The other 63 coefficients are called "AC".
For most images, the high spatial frequencies will be small except in regions with fine detail. Therefore the coefficients are reordered in zigzag order by increasing u+v, resulting in the largest coefficients appearing first.
The coefficients are grouped into runs of R zeros followed by one nonzero value, where R ranges from 0 to 15. The nonzero coefficient is a 12 bit signed number, but is usually near 0. It is coded as an S bit signed number, where S ranges from 1 to 12, followed by S extra bits. For example, the sequence 0,0,0,0,0,3 would be coded as R=5,S=2,11. The RS code (52) would be Huffman coded, and the two bits "11" (binary 3) would follow uncompressed. Negative numbers are coded by subtracting 1 and sending the same number of bits. For example, -3 would be coded as "00", which are the last 2 bits of -4. After the last nonzero coefficient, a RS code of 00 marks the end of block.
There are 4 Huffman tables for the RS codes, one each for DC-Y, DC-color, AC-Y, and AC-color. The tables are transmitted by sending 16 lists of RS codes (1 byte each) having code lengths of 1 through 16. Each list is preceded by one byte to indicate the length of the list. Other data such as the quantization tables are sent uncompressed. Huffman codes are packed in MSB to LSB order.
JPEG supports inserting restart codes into the Huffman coded data to mark the start of independently compressed image slices for error recovery. There is also an end of image symbol. The Huffman code is designed so that symbols can be found without decoding. Symbols are marked with a starting FF byte (11111111 binary). No symbol is assigned a Huffman code of all 1 bits. Also, if a byte of all 1 bits is coded, then it is followed by a 0 byte which the decoder skips.
The JPEG specification describes several modes in addition to baseline, described above. About 95% of images are baseline JPEG. The rest are mostly progressive mode. In this mode, a coarse approximation of the image is sent first so that the receiver can start displaying it before the rest of it is received. Progressive mode uses two techniques to do this. One is spectral selection, in which the low frequency DCT coefficients are sent first. The other is progressive approximation, in which the high order bits of the coefficients are sent first. Usually both techniques are combined.
JPEG allows up to 4 colors (for an alpha channel) and 12 bits per pixel. These modes are rare, but are supported by the IJG reference implementation. Images may also be grayscale by dropping the Cb and Cr components.
The JPEG standard also describes arithmetic coding as an alternative to Huffman coding, and a lossless hierarchical mode in which successively higher resolution images are sent as differences from the previous one. Neither of these two were implemented by IJG or any subsequent software because the methods were patented at the time.
In 2002, Forgent claimed US patent 4,698,672 on JPEG, specifically the invention of using a single code to represent a run length followed by a second value, which is used in the RS codes. By April 2004, Forgent announced that it had collected US$90 million from 30 companies and filed patent infringement suits against 31 others. In May 2006 the USPTO ruled that the claims of the patent related to JPEG were invalid due to prior art. The patent expired 5 months later.
The images below show the effects of the color transform and DCT. The first image was created using IJG 's public domain software cjpeg to convert lena.bmp (786,486 bytes) to JPEG (23,465 bytes). This is 17.5 times smaller than the best result obtained with lossless compression.


Left: lena.bmp, 786,486 bytes. Right: JPEG created with cjpeg -quality 50 -optimize, 23,465 bytes.
The -quality setting sets the quantization tables. These range from 16 for the Y-DC coefficient to 99 for high frequency coefficients. The -optimize parameter creates the best possible Huffman tables.
The images below show a JPEG image separated into its three color components by setting all of the other coefficients to 0. All images below are high quality (-quality 100) without chroma downsampling as above.

Y coefficients only.


Left: Cb only. Right: Cr only.
The images below separate some of the different frequency coefficients.

S 00 (DC)


First two AC coefficients. Left: S 01. Right: S 10. Note that these are color images.


Left: S 20. Right: S 11.
The following image illustrates the eye's insensitivity to fine detail in Cb and Cr. All of the 63 AC coefficients in Cb and Cr are set to 0, and yet the effect is barely noticeable. Compare with S 00 above when the Y AC coefficients are also removed.

All 63 AC chroma coefficients set to 0.
Even though JPEG is already compressed, there are lossless algorithms that can improve the compression by 20% to 30% and decompress to the original JPEG image.
On Jan. 7, 2005, Darryl Lovato of Allume (now Smith Micro) announced that their Stuffit archiver would compress JPEG files by 30%. Until then, it was widely believed that JPEG files could not be compressed further because they were already compressed. Indeed, most compression algorithms will not compress JPEG files more than about 1 to 2%. On the ACT benchmark by Jeff Gilchrist, an early version compressed 3 test images by 25% to 30%. The Maximum Compression JPEG benchmark gives similar results for newer versions.
According to US Patent 7,502,514, filed Jan. 4, 2005, the compressor uses a transform that partially decompresses the JPEG image back to the quantized DCT coefficients, undoing all of the lossless steps. These values are then compressed with a proprietary model. Some details are described in the patent. There are separate models for S 00 (the DC coefficient), S 0v, S u0 (DC in one dimension), and S uv for u, v > 0. The DC coefficient is modeled with a predictive filter. The decompresser decompresses the coefficients and inverts the transform by repeating the normal lossless JPEG compression steps to restore the original image. JPEG encoding from the DCT coefficients onward is deterministic, so the result is bitwise identical. The patent was issued to Lovato and Yaakov Gringeler. Gringeler was the deleloper of the (now unsupported) Compressia archiver.
PAQ versions beginning with PAQ7 in Dec. 2005 also include a JPEG model for baseline images (but not the less common progressive mode). Both PAQ and Stuffit compress to about the same size, but the PAQ compressor is much slower. PAQ is open source, so its algorithm can be described in better detail. It uses a context model instead of a transform. The context model decodes the image back to the DCT coefficients like the Stuffit algorithm, but instead this information is used as context to predict the Huffman coded data for both compression and decompression without a transform. The PAQ algorithm is not patented.
The PAQ model has evolved over time but all versions work basically the same way. In paq8px_v67 released in Nov. 2009, the Huffman codes are predicted using 28 indirect context models. These are mixed with 3 mixers. The output of those 3 are adaptively mixed and passed through 2 SSE stages and then arithmetic encoded. All contexts are computed on Huffman code boundaries and combined with earlier bits in the same Huffman code.
The most important context is the combination of the component (Y, Cb, Cr) and the u,v coordinates (horizontal and vertical spatial frequency) to one of 192 values. JPEG uses effectively only 4 values because it uses the same Huffman tables for Cb and Cr, and the same tables for all 63 AC components. The PAQ model distinguishes them. Coefficients with large u and v tend to be smaller. Note that what is being modeled is a Huffman code representing a run of zeros along a zig-zag path followed by a nonzero coefficient with extra uncompressed bits appended.
The second most important context is a prediction based on the partial DCT in one dimension of the 8 pixels along each of the two adjacent edges in the neighboring 8 by 8 blocks to the left and above. This model was added by Jan Ondrus beginning in July 2007 (starting with paq8fthis). (Ondrus, 2009). The model reflects the constraint that neighboring pixels in adjacent blocks should have similar values. More precisely, the 1 dimensional vertical DCT of the two adjacent columns of 8 pixels in two horizontally adjacent blocks should have similar values. Likewise, the horizontal DCT of adjacent rows of pixels in vertically adjacent blocks should be similar. A partial DCT is computed using one of the two summations in the DCT equation above, for example in the horizontal direction:
S u =?(u)? x=0..7 S x cos[?/8 (x+1/2) u],
where?(0) = 1/8 1/2,?(u) = 1/4, u > 0.
At the beginning of a block, the context is computed for the two neighboring blocks. For the block to the left, the DCT is inverted in the horizontal direction producing 8 columns each containing a DCT of the 8 pixels in each column. The rightmost column coefficients C 70..C 77 provide the context. For each coefficient S uv being predicted, only the coefficient C 7v of the neighboring block is used as context. Furthermore, the context C 7v is updated by computing the partial inverse DCT of the current block in the horizontal direction and incrementally subtracting from C 7v the coefficient of the leftmost column, S 0v. Thus, C 70..C 77 represents at all times the difference between the vertical DCT of the two neighboring columns of pixels. For a smooth image, these coefficients will all tend toward small values as the higher frequency coefficients of the current block are coded. A similar calculation is done on the top row of the current block and the bottom row of the block above.
There are many other contexts which have a smaller effect on compression. These include preceding coefficients or their RS or Huffman codes within the same block, or corresponding coefficients in neighboring blocks or in earlier components (colors) in the same block. All of these contexts may be rounded or combined to provide additional contexts. The two most important contexts are also used as contexts for the mixers and SSE stages.
WinZIP added a JPEG compression algorithm to version 12.0 in Sept. 2008. The format is specified precisely to allow third party developers to decompress WinZIP compressed JPEG files. The algorithm is very fast, but does not compress as well as PAQ or Stuffit. On the Maximum Compression JPEG test, it is ranked third at about 18% compression, vs. 24% for PAQ and Stuffit. The algorithm handles the most common formats: baseline, progressive, and extended sequential with 8 or 12 bits per pixel and 1 to 4 component colors.
WinZIP uses a transform to partially decode a JPEG image back to the DCT coefficients. The Y, Cb, and Cr components are compressed in separate streams. Within each 8 by 8 DCT block, the coefficients are coded in reverse zigzag order from highest spatial frequency (S 77 ) to DC (S 00 ) starting with the first nonzero coefficient. The DC value is predicted using a more sophisticated model and replaced with a prediction error. The coefficients are then encoded using a variable length code similar to an exponential Golomb code. The numbers are coded in the order: range, magnitude, and sign, each modeled with different contexts, with special cases for magnitudes of 0 and 1. Contexts depend on u, v, and other coefficients in the same block and the blocks to the left and above the current block. Only one context is computed for each bit. Then the context is passed to a binary arithmetic coder described in US patent 4,791,403 by IBM (now expired). The context model is tightly integrated into the coder. It is essentially a direct context model with a fixed learning rate. The encoder is modified to support encoding extra bits without compression with a fixed probability of 1/2.
The WinZIP algorithm describes 3 functions used to compute contexts:
Magnitudes uses as contexts are typically scaled logarithmically and quantized to n discrete levels using the function cat(x, n) = min(n-1, ceil(log 2 (x+1))).
Coefficients are represented using a variable length code of the form (zero, pivot, range, magnitude, sign) with each part coded using different contexts. The parts have the following meaning:
De exemplu:
Number zero pivot range magnitude sign
------ ---- ----- ----- --------- ----
-2 1 1 0 1 1
-1 1 0 1
0 0
1 1 0 0
2 1 1 0 0 0
3 1 1 0 1 0
4 1 1 10 00 0
5 1 1 10 01 0
6 1 1 10 10 0
7 1 1 10 11 0
8 1 1 110 000 0
9 1 1 110 001 0
For each block, the position of the first nonzero coefficient is coded as an ordinary 6 bit number. The context is the previous 0 to 5 bits and cat(sum(U 00 ) + sum(L 00 ), 13), ie the sum of the absolute values of the 126 adjacent AC coefficients of the two neighboring blocks. For the AC coefficients S uv, the contexts are as follows:
The DC coefficient is predicted by the weighted average of predictions p U from the block above and p L from the block to the left.
prediction = (w U p U + w L p L ) / (w U + w L ).
The prediction is estimated from the neighboring block and the gradient given by the first AC coefficient of both blocks.
p U = U 00 - 2.2076(U 01 - S 01 )
p L = L 00 - 2.2076(L 10 - S 10 )
The prediction weights account for the smoothness of the image in the horizontal and vertical directions. This is indicated by the absence of high frequency coefficients in the predicted direction.
w U = 2 -|? i=1..7 (|U i0 | - |S i0 |)|
w L = 2 -|? i=1..7 (|L 0i | - |S 0i |)|
The DC coefficient prediction error is coded as above. The context for the zero, pivot, range, and magnitude is cat(sum(0,0), 13). The context for the sign is the 3 bits from the signs of the two neighboring DC signs and the sign of the prediction.
Note that contexts do not include the previously coded bits of the current number. This context is implicit in the bit position for the zero, pivot, range, and sign, but not in the magnitude. This does not hurt compression too much because the magnitude bits tend to be random.
The context and bits are fed to the IBM binary arithmetic coder. The coder has a tightly integrated context model which is essentially a direct context model with a fixed learning rate. The coder is designed to avoid multiplication and division operations by representing the range in the logarithmic domain and using addition and subtraction instead. It uses lookup tables to convert to and from the linear domain when addition or subtraction is required. In 1987 when the coder was designed, table lookup was faster than multiplication, although this is no longer true on modern processors.
The coder maintains a range (low, low+R). It receives as input a bit to be coded, y, a prediction for the most probable symbol, MPS=0 or MPS=1, and a prediction, p, for the MPS between 1/2 and 1. It always codes the MPS in the lower part of the range. Thus, most of the updates have the form (low, R) := (low, pR), which is a simple addition in the log domain. In the less likely case that y? MPS, then the update is (low, R) := (low+(1-p)R, (1-p)R), which requires a table lookup to convert low to the linear domain and back. Some care must be taken with the tables so that the two symbol probabilities don't add up to more than 1 due to rounding.
The coder represents the range and probabilities as 16 bit fixed point integers with 10 bit fractional parts in the linear domain and 12 bit fractional parts in the log domain. It outputs and normalizes one byte at a time. This also requires a table lookup to convert to the linear domain.
The context model quantizes probabilities into 48 discrete values between 1/2 and 1. This results in a coding inefficiency of less than 0.2% for most of the probability range.
Each entry in the context table contains a prediction, p, for the MPS, the choice of MPS, a counter k, and a threshold, LRM. The counter and threshold are used to decide when it is time to update the prediction. When the MPS or LPS (least probable symbol) is coded, the prediction should be increased or decreased respectively, but increasing or decreasing p by one step every time would be too fast.
Initially k is 0 and LRM is set to a fraction of the current range, R, depending on a table lookup of the current value of p. Each time an MPS is coded, the range gets smaller. When the range reaches LRM, p is increased and LRM is reset again as a fraction of R depending on p. The amount of the increase depends on k, the LPS count. If k is 5 or more, then p is unchanged. If k is in the range 2 to 4, then p is increased by 1 in the log domain, which has the effect of reducing the LPS probability (1-p) by about 10% to 15%. If k is 1 then (1-p) is halved. If k is 0 then (1-p) is halved again.
When an LPS is coded, p should be decreased, but again not by a whole step. If k is less than 5 then it is incremented and p is unchanged. Otherwise the fraction of the distance moved by R toward LRM is used as an implicit count of MPS symbols to determine the update. If R has moved less than 1/4 of the way to LRM then p is decremented once in the log domain, or about 10% to 15% in the linear domain. If R has moved between 1/4 and 1/2 of the way toward LRM, then p is decremented twice. If R has moved between 1/2 and 3/4 then p is halved in the linear domain (decreased by about 6 to 10). If R has moved more than 3/4, then p is divided by 4. Then k is reset to 0 and a new LRM is chosen based on a fraction of the range depending on p. The table is designed so that for any p, the ratio of increments to decrements is 1-p to p, so that p reflects the true probability.
It should be noted that the IBM model and coder was developed for encoding binary and grayscale images. Because the context model is tightly integrated with the coder, it lacks the flexibility required to implement context mixing algorithms or to fine tune the learning rate for stationary or adaptive models. Its design goals were for speed, not maximum compression, at a time when computation was expensive.
PackJPG is a JPEG compressor by Mattias Stirner (Stirner and Seelmann, 2007). As of Dec. 2009 it was ranked fourth on the Maximum Compression JPEG benchmark with 17% compression, similar to WinZIP.
PackJPG uses a simple order 1 context model and arithmetic coder to encode the DCT coefficients. However, it changes the scan order. First, the image is divided into 4 approximately equal sized sets of blocks representing regions of high and low detail as indicated by the EOB (end of block) value. EOB indicates the last nonzero coefficient in zigzag order. EOB tends to be higher in parts of the image with a lot of fine detail. Each set of blocks is compressed as a separate stream.
Within each EOB set, the S uv coefficients are encoded as separate streams. Instead of encoding the coefficients within one block in zigzag order, the scan order is from block to block for the same value of u, v, and EOB category. The scan direction depends on u and v. If u > v (in the upper right corner of uv space in the first figure in section 6.1.5), then S uv represents a low spatial frequency in the vertical direction and the image is scanned in columns because these coefficients are more likely to be correlated with their neighbors in the vertical direction. If u < v (lower left corner of uv space), then the image is scanned horizontally. If u = v (on the diagonal in uv space, including the DC coefficient), then the image is scanned in a meandering path described by a Hilbert curve. In a Hilbert curve, the image is divided into 4 quadrants and scanned in a U shaped path. Each quadrant is further subdivided and scanned recursively, for example:
1 4
| | (Quadrant scan order)
| |
2--------3
2 1 16 15
3 4 13 14 (Scan order within quadrants)
6 5 12 11
7 8 9 10
The DC coefficient S 00 is first predicted using a Paeth filter. Recall that the prediction is whichever of its three neighbors is closest to (left + above - aboveleft). Then the prediction error is encoded.
Coefficients are encoded as a sign, category, and extra bits, similar to the variable length codes used in baseline JPEG. The context is the category of the previous coefficient.
The EOB list must also be stored to allow decoding. The EOB list is compressed using a 2-D context. For each block, its category is encoded using the categories of its two previoulsly coded neighboring blocks as context.
Video approximates continuously moving images by using a sequence of still images, called frames. The neural circuitry of the human visual system has a delayed response to light on the order of tens of milliseconds. Thus, a frame rate of at least 24 to 30 per second produces a sensation that is nearly indistinguishable from continuous motion. However, simply flashing images at this rate would produce a noticeable flicker. The eye can detect flicker at rates of up to about 75 flashes per second. Sensitivity to flicker increases in bright light, toward the blue end of the spectrum, and in peripheral vision away from the fovea where visual acuity is sharpest. Thus, a computer monitor viewed up close requires a higher refresh rate than a television viewed from a distance. Movie theatres display 24 frames per second and remove flicker by flashing each frame on the screen two to four times.
Video frames do not have to have as much resolution as still images. The eye moves in saccades, jumping from one part of the image to another at a rate of 30 to 70 times per minute. In still images, the eye is attracted to regions of high contrast such as edges or corners, and to areas of interest such as faces. In text, the eye jumps from word to word. This requires all of the image to be displayed in fine detail. In video, there is not enough time to look at more than one part of a frame before the next frame is displayed. Thus, the rest of the frame can be displayed at a low resolution.
Between saccades, the eye tracks moving objects smoothly. If a frame is displayed more than once or for more than a small fraction of the frame interval, then the effect is to blur the object as the eye moves across each frame.
Time sampling of images can produce artifacts such as the wagon wheel effect, where a spoked wheel appears to spin slowly backward. This artifact is analogous to the Moire effect caused by spatial sampling of a repeating pattern in still images.
NTSC is one of three standards for analog television, the others being PAL and SECAM, used in different parts of the world. NTSC standardized black and white television in 1941 and color TV in 1953 in North America. It was used until 2009 for over the air broadcasts in the US, when it was replaced by HDTV.
NTSC is displayed at 29.97 frames per second. Each frame consists of 525 horizontal scan lines starting at the top left corner of the screen. To reduce flicker, the display is interlaced: each frame is divided into two fields which alternately display the even and odd numbered lines. (PAL and SECAM use 625 scan lines at a rate of 50 fields or 25 frames per second). NTSC is an analog format, so there is no concept of a "pixel". However, the luma (brightness) signal is transmitted over a band that extends about 4.5 MHz above the carrier. This corresponds to a Nyquist sampling rate of 9 million pixels per second, equivalent to about 571 pixels per scan line.
When color TV was introduced in 1953, a chroma signal was added without increasing the bandwidth or breaking compatibility with black and white TV sets. The spectrum allocation is shown below.

NTSC frequency allocation (from Wikipedia ).
The video signal is split into three color components similar to YCbCr as in JPEG. The black-white (luma) signal is unchanged. It is amplitude modulated in the same band of 4.95 MHz. The blue-yellow and red-green signals are transmitted in a smaller band with a width of 2 MHz that overlaps the luma signal. A narrower band is possible because the eye is less sensitive to high spatial frequencies in chroma (especially blue-yellow) than luma. The two signals are amplitude modulated 90 degrees out of phase, allowing them to be separated by the receiver.
Because the luma and chroma overlap, the color signal can produce black and white artifacts and vice versa. The carrier frequencies are carefully chosen so that the artifacts of successive frames cancel out, reducing their visibility.
MPEG is the most widely used format for video compression. The most commonly used versions are as follows.
Although video files can stand alone, they are more often embedded in a container format such as AVI or streamed through a Flash player. MPEG-2 defines a transport stream for over the air transmission of HDTV by encapsulating the data in 188 byte packets with error correction.
MPEG-1 and MPEG-2 use a compression algorithm similar to JPEG, but obtain additional compression by delta coding between frames with motion compensation. There are 3 types of frames, designated I (inter-frame), P (predictive), and B (bidirectional). An I-frame can be decoded by itself. A P-frame is described in terms of differences from the previous frame. A B-frame is described in terms of difference from both the previous and next frame. A typical pattern is one I-frame every 0.5 seconds, repeating the sequence IBBPBBPBBPBBPBB. To facilitate decoding, the frames are sent out of order with the B frames sent after any future frame it depends on. The decoder then reorders the frames before displaying them. The reason for having I frames every 0.5 seconds is to allow decoding to start from the middle of a video after rewinding or fast forwarding.
Motion compensation is implemented by dividing a frame into 16 by 16 macroblocks. A macroblock in a P or B frame is decoded using a motion vector which points to a same sized region of the previous (or next) decoded image with a specified offset horizontally and vertically. After motion compensation, P and B frames are encoded using a JPEG-like algorithm as with I frames.
It is up to the encoder to find good matches in adjacent frames for encoding macroblocks. The encoder calculates the differences from the decoded image, not the original, by encoding and then decoding the adjacent frame.
Frame compression is like JPEG except that it uses fixed Huffman tables (called variable length codes) and fixed quantization tables with only a scale factor transmitted. MPEG is often transmitted at a constant bit rate, which is achieved by adjusting the quantization scale factor as needed. For DVDs, the maximum bit rate is about 10 Mbits/second. The result is that scenes with lots of motion are transmitted at a lower resolution.
MPEG-4/AVC (H.264) differs from MPEG-1/2 in that it supports arithmetic coding in addition to Huffman coding. It also supports variable sized macroblocks, from 4 by 4 to 16 by 16, with motion vectors pointing to any of 16 adjacent frames (or 32 fields) in 1/4 pixel resolution. Fractional motion vectors are obtained by using a 6 tap filter to infer half pixel intensities, followed by simpler interpolation. It also uses a deblocking filter.
Lossy audio compression uses a psychoacoustic model to determine which part of the signal can be discarded without changing the original sound. The human ear can only perceive sounds in the range 20 Hz to 20 KHz. Sensitivity peaks around 1 KHz to 5 KHz, the middle of the range of human speech. Frequency resolution is 3.6 Hz in the range 1 KHz to 2 KHz.
Like the eye, the ear perceives sound on a logarithmic scale. The range of hearing is from 0 decibels, the threshold of hearing, to 120 decibels, which is loud enough to be painful and cause hearing damage. An increment of 10 decibels (dB) represents an increase in power by a factor of 10, although we perceive it as closer to twice as loud. An increment of 20 dB represents an increase in amplitude by a factor of 10 (because power is proportional to amplitude squared).
The logarithmic scaling is partially due to the masking effect, in which a sound decreases sensitivity to other sounds at different frequencies that occur at the same time (frequency masking) or other sounds at the same frequency that occur earlier or later (temporal masking). The graph below illustrates how frequency masking affects the threshold of hearing. Temporal masking has an exponentially decaying effect, starting from about 20 milliseconds before the sound to 100 milliseconds afterward.

The effect of frequency masking on the threshold of human hearing (from Wikipedia ).
Humans can perceive the direction of a sound source to an accuracy of about 3 degrees. Stereoscopic sound perception depends on two effects. First, a sound is louder in the ear closer to the source. Second, there is a time delay in reaching the further ear because sound travels at about 300 meters per second through air. Earphones can reproduce both of these effects, but stereo speakers typically do not. The sound seems to come from one speaker or the other or from some point in between. This has led to sound systems with more than 2 channels.
High frequency sounds are harder to locate when the distance between the ears is more than 1/2 wavelength because the phase shift is ambiguous and because neurons can't fire fast enough to transmit phase information. This occurs at around 1.5 KHz. This suggests an approach of transmitting stereo information (left minus right) at a lower bandwidth.
The simplest form of lossy audio compression is to filter out the high frequencies where most of the information is located. AM radio discards frequencies above 10.5 KHz. FM uses a bandwidth of 15 KHz for the mono signal (left plus right) and 13 KHz for the stereo signal.
DS0 digital telephony uses a sampling rate of 8 KHz, which requires filtering out all audio above the Nyquist rate of 4 KHz. In practice, audio above 3.3 to 3.5 KHz is filtered out. Each sample is 8 bits which is companded, or quantized on a logarithmic scale. Essentially, each 8 bit value is a floating point representation of a 14 bit integer (78 dB dynamic range) using a sign bit, 3 exponent bits and 4 mantissa bits. This 64 Kbit/second signal is sufficient to reproduce speech in spite of the fact that some sounds such as /s/ are almost entirely outside the bandwidth (4 to 8 KHz).
CD audio is stored uncompressed. It consists of two channels sampled at 44.1 KHz (22.05 maximum frequency) at 16 bits per sample (90 dB range). The bit rate is 1411.2 Kbit/second.
Lossy audio formats such as MP3, AAC, Dolby, and Ogg Vorbis are based on dividing the audio into blocks of samples, computing the modified (overlapped) discrete cosine transform (MDCT), quantizing, and transmitting the coefficients without further compression. Quantization uses the psychoacousitic model to determine the appropriate precision at each frequency. All formats support a wide range of sampling rates, compressed bit rates, and number of channels. All but Ogg Vorbis are protected by patents.
All of these formats support joint frequency encoding, a technique which compresses the stereo (left minus right) signal. Because the ear can detect intensity differences but not timing differences at high frequencies, this part of the stereo signal can be removed and replaced with information to control the overall intensity for each channel.
MP3 (MPEG-1 layer III) was the first widely used compressed format for encoding music on the Internet. Audio is divided into blocks of either 576 samples, or 192 samples to encode transients (rapid changes in the audio signal). Two channels (left and right) are supported. Good quality is achieved at a bit rate of 128 Kbits/seconds, or 9% of uncompressed CD audio. Bit rates can be variable.
AAC (Advanced Audio Codec, MPEG-2 part 7) is the default audio format used by Apple's iPod and iTunes. It is understood by most music players, phones, and video game consoles. AAC supports a greater range of bit rates and sampling rates than MP3. Block sizes are 1024 and 128 samples (or 960 and 120 depending on sampling rates). Good quality audio is about 96 Kbits/second.
Dolby Digital (AC-3, ATSC A/52) is the audio format used in DVDs and HDTV. It supports 5.1 Surround Sound. The "5.1" refers to 5 channels (left front, right front, left rear, right rear, center) and a low frequency subwoofer channel.
Ogg is a free, open source container for the Vorbis audio format. At a bit rate of 96 Kbits/second it has an audio quality slightly better than AAC and better than MP3.
The MDCT computes N frequency coefficients from 2N samples x 0..x 2N-1 in 2 adjacent blocks. The overlap is necessary to prevent artifacts where the blocks join together. The coefficients are computed:
X k =? i=0..2N-1 w i x i cos[(?/N)(i + (N+1)/2)(k + 1/2)], k = 0..N-1
The coefficients X k represent frequencies ranging from 1/2?N up to the Nyquist rate 1/2. The inverse transform (IMDCT) has the same form:
y i = (w i /N)? k=0..N-1 X k cos[(?/N)(i + (N+1)/2)(k + 1/2)], i = 0..2N-1
The inverse transform has N inputs and 2N outputs. To complete the transform, the two overlapping sets of samples, y i and y i+N, from adjacent blocks are added together. To minimize boundary artifacts, the window function weights w 0..2N-1 are selected such that the weights at the ends (near 0 and 2N-1) go to 0 and are 1 in the middle, usually with a rounded shape. Because each weight is used twice in each block, they must satisfy w i 2 + w i+N 2 = 1. MP3 and AAC use the window:
w i = sin[(?/2N)(i + 1/2)].
Vorbis uses:
w i = sin{(?/2) sin 2 [(?/2N)(i + 1/2)]}
Dolby AC-3 uses a Kaiser-Bessel derived window, which has a similar shape.
Data compression is the art of finding short descriptions for long strings. Every compression algorithm can be decomposed into zero or more transforms, a model, and a coder. Programarea este o problema rezolvata. Given a symbol with probability p, Shannon proved that the best you can do is code it using log 2 1/p bits. An arithmetic coder does this efficiently.
There is no general procedure for finding good models or prediction algorithms. It is both an art and a hard problem in artificial intelligence. There is (provably) no test to tell you if a string can be compressed or if a better model exists.
Prediction is closely related to understanding. If you understand Chinese, then you can predict a sequence of Chinese symbols. This principle can be applied to context modeling. Useful contexts are semantically independent units, for example, words in text, instructions in executable code, fields in a database, or recognizable features in an image. Different symbols that have similar meanings should be treated as if they were the same context. For example, in text, it is useful to merge upper and lower case, spaces and newlines, or related words like "someone" and "somebody" or "cold" and "wet". A context model for images would distinguish blue from green pixels but ignore fine differences. The best compressors combine the predictions of many independent models.
Preprocessing transforms are optimizations that sacrifice compression for speed and memory. Often, the output can be compressed with a simple order 0 or low order model. A transform by itself does not compress. It may hide useful contexts and add arbitrary information that makes the output ultimately larger. A good transform should minimize these effects. Thus, a BWT compressor that works on words as units would compress text better than the usual case of sorting bytes. Likewise, an LZ77 compressor that replaced duplicate strings on whole word boundaries would be preferred. Dictionary preprocessors improve both BWT and LZ77 compression by forcing those transforms to split the input on word boundaries. The best compressors on the large text benchmark use dictionary preprocessing, but I believe that is because the benchmark is tightly constrained by memory. When computers with hundreds of gigabytes become available, I believe that the top ranked programs will no longer use large dictionaries.
Lossless compressors ignore meaningless data in selecting contexts. Meaningless or random data has no predictive value and is itself not compressible. A lossy compressor not only ignores the meaningless data, but also discards it completely. Deciding which data is meaningful is a hard AI problem that applies to both lossless and lossy compression. Both require a deep understanding of human cognitive psychology.
Coding theory says that the vast majority of strings do not have simple descriptions, so it is rather remarkable that compressible strings are so common in practice. Solomonoff, Kolmogorov, and Chaitin independently proposed that strings have a universal probability proportional to 2 -|M|, where M is its shortest description, independent of the language in which M is written. Kolmogorov proved that there is no algorithm for finding such descriptions in any language. Hutter showed that the compression problem, if it were computable, would solve the general AI problem of optimizing arbitrary utility functions. In effect, he proved Occam's Razor, which is the foundation of all science: the simplest theory that explains the past is the best predictor of future events.
The prevalence of compressible strings, and thus science, depends on two facts. First, that all strings are the result of computable (or finitely describable) processes, and second, that shorter programs or descriptions are more likely than longer ones. To show the second, consider a probability distribution over the infinite set of all finite length descriptions. Any such distribution must favor shorter descriptions. Consider any description M having probability p > 0. There can be at most a finite number (1/p) of more likely descriptions, and therefore an infinite number of less likely descriptions. Of the latter, there can be at most a finite number (2 |M| - 1) that are shorter than M. Therefore there must be an infinite number of less likely descriptions that are longer than M, for all M.
The question remains whether all strings found in the real world are created by computable or finitely describable processes. This must be true for finite strings, but there are known to exist, at least in mathematics, infinite length strings such as Chaitin's constant? (the probability that a random program will halt) that are not computable. In fact, the vast majority of infinite length strings do not have finite length descriptions. Could there exist phenomena in the real world that have infinite length descriptions that are not compressible? For example, would it be possible to take an infinite number of measurements or observations, or to measure something with infinite precision? Do there exist infinite sources of random data?
The laws of physics say no. At one time it was believed that the universe could be infinitely large and made up of matter that was infinitely divisible. The discoveries of the expanding universe and of atoms showed otherwise. The universe has a finite age, T, about 13.7 billion years. Because information cannot travel faster than the speed of light, c, our observable universe is limited to an apparent 13.7 billion light years, although the furthest objects we can see have since moved further away. Its mass is limited by the gravitational constant, G, to a value that prevents the universe from collapsing on itself.
A complete description of the universe could therefore consist of a description of the exact positions and velocities of a finite number (about 10 80 ) of particles. But quantum mechanics limits any combination of these two quantities to discrete multiples of Planck's constant, h. Therefore the universe, and everything in it, must have a finite description length. The entropy in nats (1 nat = 1/ln(2) bits = 1.4427 bits) is given by the Bekenstein bound as 1/4 of the area of the event horizon in Planck units of area hG/2?c 3, a square of 1.616 x 10 -35 meters on a side. For a sphere of radius Tc = 13.7 billion light years, the bound is 2.91 x 10 122 bits.
We now make two observations. First, if the universe were divided into regions the size of bits, then each volume would be about the size of a proton or neutron. This is rather remarkable because the number is derived only from T and the physical constants c, h, and G, which are unrelated to the properties of any particles. Second, if the universe were squashed flat, it would form a sheet about one neutron thick. Occam's Razor, which the computability of physics makes true, suggests that these two observations are not coincidences.
I thank Ivo Danihelka, Szymon Grabowski, Aki Jantti, Harri Hirvola, Christopher Mattern, Ilia Muraviev, Jan Ondrus, Alexander Ratushnyak, Friedrich Regen, Steve Richfield, Sami Runsas, David A. Scott, Eugene Shelwien, Ali Imran Khan Shirani, Yan King Yin, and Bulat Ziganshin for helpful comments on this book. Thanks to Osman Turan for convering this book to XHTML compliant. Please send corrections or comments to matmahoney at yahoo dot com.
T. Bell, IH Witten. JG Cleary (1989), Modeling for Text Compression, ACM Computing Surveys (21)4, pp. 557-591.
C. Bloom (1998), Solving the Problems of Context Modeling.
M. Burrows, DJ Wheeler (1994), A Block-sorting Lossless Data Compression Algorithm, Digital Systems Research Center.
G. Chaitin (1966), On the length of programs for computing finite binary sequences. Journal of the ACM, 13:547-569.
G. Cleary, WJ Teahan (1995), Experiments on the zero frequency problem, Proc. Data Compression Conference, 480.
G. Cormack, N. Horspool (1987), Data compression using dynamic Markov modeling, Computer Journal 30:6 (December).
P. Deustch (1996), RFC 1951 - DEFLATE Compressed Data Format Specification version 1.3.
D. Dube, V. Beaudoin (2006), Improving LZ77 data compression using bit recycling, Seoul: International Symposium on Information Theory and its Applications. ISITA2006.
D. Dube, V. Beaudoin (2010), Lossless data compression via substring enumeration, Proc DCC.
J. Duda (2007), Optimal encoding on discrete lattice with translational invariant constrains using statistical algorithms (section 3).
M. Gagliolo (2007), Universal search. Scholarpedia, 2(11):2575.
S. Grabowski, J. Swacha (2010), Language-independent word-based text compression with fast decompression, Proc. MEMSTECH, Polyana, Ukraine.
D. Huffman (1952), A Method for the Construction of Minimum-Redundancy Codes. Proc. IRE: 1098-1101.
M. Hutter (2004), Universal artificial intelligence: Sequential decisions based on algorithmic probability. Springer, Berlin.
M. Hutter et al. (2007), Algorithmic probability. Scholarpedia, 2(8):2572.
H. Itoh, H. Tanaka (1999), An efficient method for in memory construction of suffix arrays, Proceedings of the IEEE String Processing and Information Retrieval Symposium, pp. 81-88.
P. Ko, S. Aluru (2003), Space-efficient linear time construction of suffix arrays, Proceedings of the 14th Annual Symposium on Combinatorial Pattern Matching, pp. 200-210.
A. Kolmogorov (1965), Three approaches to the quantitative definition of information. Problems Inform. Transmission, 1, 1-7.
M. Kufleitner (2009) On bijective variants of the Burrows-Wheeler transform, Proc. Prague Stringology Conference.
T. Landauer (1986), How much do people remember? Some estimates of the quantity of learned information in long term memory, Cognitive Science (10) 477-493.
NJ Larsson, K. Sadakane (1999), Faster suffix soting, Technical report LU-SC-TR:99:214, Lund University, Sweden.
S. Legg, M. Hutter (2006), A Formal Measure of Machine Intelligence, Proc. Annual machine learning conference of Belgium and The Netherlands (Benelearn-2006). Ghent.
LA Levin (1973), Universal sequential search problems. Problems of Information Transmission, 9(3):265--266.
LA Levin (1984), Randomness Conservation Inequalities: Information and Independence in Mathematical Theories. Information and Control, 61:15-37.
M. Li and PMB Vitanyi (2007) Applications of algorithmic information theory. Scholarpedia, 2(5):2658.
M. Mahoney (2000), Fast Text Compression with Neural Networks, Proc. AAAI FLAIRS, Orlando.
M. Mahoney (2002), The PAQ1 Data Compression Program
M. Mahoney (2005), Adaptive Weighing of Context Models for Lossless Data Compression, Florida Tech. Technical Report CS-2005-16.
G. Manzini and P. Ferragina (2002) Engineering a lightweight suffix array construction algorithm (extended abstract).
Y. Matias, SC Sahinalp (1999), On the Optimality of Parsing in Dynamic Dictionary Based Data Compression, Proc. tenth annual ACM-SIAM symposium on Discrete algorithms, 943-944.
DT Meyer, WJ Bolosky (2011), A Study of Practical Deduplication, FAST-2011.
G. Nong, S. Zhang, WH Chan (2009) Linear suffix array construction by almost pure induced sorting, Proc. 19'th IEEE Data Compression Conference.
J. Ondrus (2009), Bezztratova komprese JPEG grafiky (Lossless compression of JPEG images), thesis, Univerzita Karlova v Praze (in Czech).
SJ Puglisi (2005), Exposition and analysis of a suffix sorting algorithm, Technical Report Number CAS-05-02-WS (May, 2005), Dept of Computing and Software, McMaster University, Hamilton, Ontario, Canada.
SJ Puglisi, MA Maniscalco (2006a), Faster lightweight suffix array construction, 17th Australasian Workshop on Combinatorial Algorithms (AWOCA'06), Joe Ryan & Dafik (eds.) pp. 16-29.
SJ Puglisi, MA Maniscalco (2006b), An efficient, versatile approach to suffix sorting, ACM Journal of Experimental Algorithmics.
SJ Puglisi, WF Smyth, AH Turpin (2007), A taxonomy of suffix array construction algorithms, ACM Computing Surveys 39(2).
J. Rissanen (1976), Generalized Kraft Inequality and Arithmetic Coding, IBM Journal of Research and Development 20(3): 198–203.
J. Schmidhuber, S. Heil (1996), Sequential Neural Text Compression, IEEE Trans. on Neural Networks 7(1): 142-146.
C. Shannon, W. Weaver (1949), The Mathematical Theory of Communication, Urbana: University of Illinois Press.
CE Shannon (1950), Prediction and Entropy of Printed English, Bell Sys. Tech. J. 3:50-64.
D. Shkarin (2002), PPM: one step to practicality, proc. DCC.
P. Skibinski, S. Grabowski (2004), Variable-length contexts for PPM, Proc. Data Compression Conference (DCC).
P. Skibinski, S. Grabowski, S. Deorowicz (2005), Revisiting dictionary-based compression, Software - Practice and Experience, Wiley.
R. Solomonoff (1960), A Preliminary Report on a General Theory of Inductive Inference, Report V-131, Zator Co., Cambridge, Ma.
R. Solomonoff (1964), A Formal Theory of Inductive Inference, Information and Control, 7(1) 1-22, 7(2) 224-254.
M. Stirner and G. Seelmann (2007), Improved Redundancy Reduction for JPEG Files. Lisboa: Picture Coding Symposium CD-ROM Proceedings.
AM Turing (1950), Computing Machinery and Intelligence, Mind, 59:433-460.
E. Ukkonen (1995), On-line construction of suffix trees, Algorithmica 14(3): 249-260.
IH Witten, TC Bell (1991), The zero-frequency problem: estimating the probabilities of novel events in adaptive text compression, IEEE Trans. on Information Theory, 37(4): 1085-1094.
J. Ziv, A. Lempel (1977), A universal algorithm for sequential data compression, IEEE Trans. Information Theory 23(3) 337-343.
J. Ziv, A. Lempel (1978), Compression of Individual Sequences via Variable-Rate Coding, IEEE Transactions on Information Theory, 24 (5), 530-536.
Zongjie Tu, Shiyong Zhang (2006) A Novel Implementation of JPEG 2000 Lossless Coding Based on LZMA, Proc. Sixth IEEE International Conference on Computer and Information Technology, 140.
Note: Website links are current as of Feb. 2010.