Monday, October 23, 2006

Pi compression

This idea I've had for a long time, and if it were mathematically and computationally feasable, it would probably have already been done. But never mind all that. I still like the idea, and I'm still going to spout it off.

I chose pi because I just like it, but e or any of those other fun infinite-digit, non-repeating numbers would probably work, and possibly be better. An unfathomably good pseudo-random number generator might even work. But let's call it pi for now, because the important thing is that it's this infinite sequence in which, theoretically, every possible sub-sequence exists. Somewhere in that infinitely-long series of digits, there'll be a hundred sequential zeros. Somewhere, there'll be a million. And somewhere, there'll be a sequence that matches the sequence of ASCII values for this blog entry, and every other text you've ever read (and ever will read... but that's a different story).

The idea is, whatever data you have, your term paper, that movie you swear you didn't download, or that log of leaked AOL searches, that data is already encoded somewhere in Pi. It may be several centillion digits out, but it's there. Theoretically, you just need to let a grep algorithm slide along Pi until it finds a match, and then just remember the offset and the length to get it back out. Suddenly all files, all volumes, all anything, can be represented by two numbers.

Parts of this have already been implemented, though on a scale much too small to be useful. (Four billion binary digits of Pi? Pff, weak sauce.)
Finding sequences in pi:
Getting hex digits (half-bytes) out of pi:

Our CPUs' abilities to crunch numbers aren't advancing as fast as our networks' abilities to pump big bunches of data around, so this whole thing isn't terribly practical. Still, you could back up your whole hard drive on a floppy diskette. The idea is fun even if it wouldn't be worth doing.

Let's take it a little further, though. Maybe we can still do something useful with those first four billion binary digits. Play around with that Pi-search link above. It only uses a 5-bit character set rather than ASCII's 8, but let's experiment anyway. I'm going to search for "penduin".
In 5-bit-per-character binary, that's:
The sequence is found at 8748556 binary digits in. 8748556 in binary is:

...that's a noticably smaller sequence than the binary representation of "penduin". Our data is 7 "bytes" long, which in binary is:
Those two numbers, the offset and length, happen to take fewer bits to represent than the string "penduin", in ascii or even in 5-bit characters. Houston, we have compression! But hang on, let's say that sequence didn't happen within 4 billion bits of pi (according to that site, an arbitrary 7-character sequence has just an 11% chance of being found that early). We could still find "pend" and "uin", though - the odds of finding three or four letters are essentially 100%. If we do that, though, we'd need more bits than the raw ascii just to keep the two offset/length pairs, let alone any padding we'd need for any sanely-packed descriptor.

That packed descriptor would be 35 bits per chunk: 32 bits (maximum value:
4,294,967,295) for offset, and 3 bits for length. Why 3? We're unlikely to find an arbitrary 8-letter pattern (8 is 4 digits, 1000 in binary) in just the first 4 billion bits of pi, so our algorithm won't even try. So every 35 bits in our "compressed" file points to the next up-to-7-letter chunk of our data. On a good day, that'll beat ASCII, which would take 56 bits to store 7 letters. But, we're still playing with 5-bit characters, a paltry subset of ASCII. We're also playing with odds where only 11% of our chunks will be able to point to the maximum 7 characters.

Looking at hex digits instead of 5-bit letters gives us a better picture of how useful this might be. Any byte (8 bits) can be expressed in 2 hex digits. Any four sequential bytes has about a 60% chance of being found in these first four billion bits of pi, and sequences of three and a half (7 hex digits) are pretty much guaranteed. 7 was the maximum length of our chunks from before anyway, so let's keep using that. So, to recap:
With our 35 bits descriptor, we can know where in pi to find up to three and a half bytes, or 28 bits, of data.
As far as compression algorithms go, that doesn't seem particularly good. :^) But we're chopping into 3.5-byte pieces or less because of the near statistical certainty of finding any 3.5 bytes in the first half-gigabyte of pi (four billion bits). Give us twice as many digits of pi, and we'd only need one more address bit, plus another bit or so for length, as longer patterns would have improved odds of being found. Keep bumping up the searchable pi digits and we could eventually hit a sweet spot beyond which we'd actually be able save disk space. Once that happens, we can keep on increasing the available pi digits to achieve better compression, but more interestingly, we can add another layer of logic in which we take our packed bits, chunk them up, and look for those sequences in pi. If we're going to do that, we need to pack an extra bit for each chunk to say either "I point to data" or "I point to a pointer", but if we've got so many pi digits to search that we're actually saving space every time, we can still wind up with simply a pi digit offset and size, which would lead to a lot more lookups, and eventually, chunk by chunk, the data we care about.

A tip of the hat to the first person to find this post within pi. It's there! Extra points if you also find my next post before I post it. :^)

edit from much much later:
I'm glad people keep finding this post and having fun with the idea!  I'm sorry to have gotten anybody's hopes up, but this is one of those fun ideas that is impossible in practice.  (Well it's possible, it just can't achieve any data compression.  :^)  See the comments below, including my own, for reasons this doesn't and can't work.


svoid said...

I wasn't able to find your entire next post in the digits of pi, but I was able to find this title, "Free Drivers Ed, Part 3 - That Spoiler Doesn't Make You Go Faster" :)

Penduin said...

Hah! Perfect. If only you'd given me the offset, I could just copy & paste the bytes that follow until the post was done, and I wouldn't have to bother writing it.

Gary said...

Hi, just to let you know I had the exact same idea some years ago, though the idea suggests breaking one or several laws of thermodynamics.
But then again, today, those laws are highly disputable...

steve said...

I had the same idea about 15 years ago but, like you, concluded that it wasn't currently feasible however I am still slowly working away at an implementation that would demonstrate the method. If you're interested in a collaberative effort I'd love to chat.

Frits said...

Maybe be you can use this metod to re/find the place in Pi where a certain file exist. Once this place is located, you search for the shortest sequence of number, beginning from the begin of the file, that exists only ONCE in Pi uptill this file shows up. This sequence is surely shorter that the original file, and it is a piece of cake to write a program to locate the original file when this sequence is given.
Sorry for my clumsy English. By the way, i thought this method out yesterday and was happy to find out that i am one of a bunch!

Frits said...

Maybe be you can use this method to re/find the place in Pi where a certain file exist. Once this place is located, you search for the shortest sequence of numbers, beginning from the begin of the file, that exists only ONCE in Pi uptill this file shows up. This sequence is surely shorter than the original file, and it is a piece of cake to write a program to locate the original file when this sequence is given.
Sorry for my clumsy English. By the way, i thought this method out yesterday and was happy to find out that i am one of a bunch!

Anonymous said...

I have a right to say that long before I read your post I already had this whole thing floating around in my mind also. It was nice to see that someone else also had these thoughts. One suggestion is that you add another dimesion to it by first adding a counting bit to the beginning. So for decoding you would get something like "7,8675....587362" which means process according to the PI sequence the stream after the comma, and then take that result after processing and add "6" in front of it. Process 6 more times. So in this way you could first compress a stream of data and add "1" in front of it. Then try compressing that result also using PI to something even smaller and add two in front of that. Eventually you would get a number of layers that won't compress anymore, but every time you are able to compress it you would get slightly smaller size until you reach the threshold. In this way I believe it would be possible to reduce a movie sized file down to a floppy disk. Better yet, you could make layers of compression in a way that as each layer is uncompressed it also gives a chuck of uncompressed data along with the next layer of compressed data, so that if its a music file or a movie file you can stream the little chuck of data that pops out each layer, as the rest of the movie unfolds itself.

Anonymous said...

Next thought is that if you can find a sequence in PI somewhere of a certain length, you suggested that you record the distance of the first character from the beginning of PI. But I suggest that you find any sequence in PI that you can from your stream of data, and then record where that sequence is in PI and in your file. Then start from that place in PI and go out both directions towards the beginning of PI and out farther away from the beginning of PI. The distance then to the next closest sequence is the best possibility. If you first parse your whole file so that you find as many of the sequences sections as possible, you can then make some analysis of how to order then so that the distance in PI from one section to the next is relative and shorter than from the beginning.

Roger Pack said...

I had a "similar thought" last night...

My guess is that you'd need something like a super computer to actually compress the text, though apparently somebody was able to compute pi to like 10 trillion digits in 72 hours on a home computer:

You could uncompress it quickly (I hope) with the BPP algorithm or something like it:

Anonymous said...

Yesterday's episode of Person of Interest ("2 Pi R", S02E11) mentioned this, with Finch explaining that everything is contained in Pi.

Then it came to my mind that it'd be great to be able to determine an offset and lenght inside Pi for everything as a compression algorithm.

And that's how I arrived here, maybe you get more visits from PoI viewers. It's one of the few TV series where tech-stuff is more or less science based and one could even believe it's feasible.

Jumping Mo said...

Thats sick i came to the same Idea also by watching Person of Interest S02E11.
On the way back from Work i went thru all of the stuff you mentioned here.
So lets make this happen by the way there might not be as much practicle application right now. However i see a HUGH benefit once we manage to get Quantum Computing working.
That would mean to Calculate PI and find strings within PI in very very short amount of time propably even Real Time.
I truly believe that Quantum Computing and somekind of PI Number based compression will be a revolution.
Iam not into datacompression right now and have some other projects going but i really wanna try this my self and try to optimize the algorithems.

Tommy Lacroix said...

It's an interesting idea but it doesn't would for a reason easy to overlook.

If Pi (or any file or sequence for that matter) contains every possible data sequence, the offset to the wanted data sequence will be as large as the data itself.

To view it differently, forget Pi, and imagine you can write a file with every possibility. That file would be as large as the entropy of your data (8 bits=256 bytes,32 bits=4GB) and your pointer would need to be as big as the data you're trying to point to.

For example, lets say we're trying to compress 8 bits. With 8 bits you've got 256 possibilities, 0 to 255. If you line up all the numbers, the position of any number will be between 0 and 255, thus will take 8 bits too.

Therefore, you won't save anything except maybe if you're lucky enough to find your data sequence with an offset low enough so it can be represented by a lower number of bits than the data itself. On average, however, you're compression ratio will be 0%, since average will take the luck out of the equation.

Penduin said...

Tommy: You're right, and I've had a draft of a follow-up to this post sitting around for a long time that goes into why it's just a (tantalizing) fantasy. Haven't gotten around to publishing yet though. :^) Soon, with any luck.

Anonymous said...

This entire idea is built on a fallacy. Pi hasn't been proven to contain every known sequence of numbers. It's not even intuitive that it should.

Simon Holzman said...

Since data varies (text vs binary, for example), the ideal compression algorithm would work dynamically, depending on what works best with the data provided.

The compression tool would try different compression methods, both on the whole file and on sections of the file to see what the most efficient method/s are and it could then combine methods into a single compressed file.

For example, if someone compressed a folder that included text files and music files, some of the individual files would compress best with something like Zip while others would work best with something like Flac. This tool could combine them into a single file basically comprised of chunks of data each saying "use compression method X for the following Y bytes".

This would mean that, if it were efficient to include "use the 53 bytes of pi starting at offset 137,243,521" (presumably in a more compact expression), it would be able to do so and, if it were better to have multiple statements selecting shorter strings from a shorter offset instead, that would also work.

Perhaps the first 64 bits would be a code for the algorithm to be used, how to determine the length of the data to be decompressed and whether this is the only chunk of data (if there are others, they would start at the end of this chunk and have their own initial 64 bits detailing how to decompress it).

One of the compression methods would be 'Customized' and would basically allow a non-standard algorithm to be embedded in the file. This would only be worthwhile if the non-standard algorithm was more efficient even with the embedded code (and the code would have extremely limited functionality so it would not be a security risk) but this would provide additional flexibility.

Penduin said...

Wow, somehow this still gets attention. This doesn't work. I know! It was a fun idea to explore, as long as I was sufficiently ignorant. :^) A few careful thought experiments (and in my case pseudocode) demonstrate without much effort that even if pi does contain every imaginable sequence (and it's been pointed out that this is not safe to assume) that this would be a silly way to try to achieve compression.

The thing that made my little brain understand that this is all bunk was to imagine a set of data that _does_ contain every imaginable sequence up to some length. (Could be a gigabyte or two bits, doesn't matter.) 00...00, 00...01, 00...10, 00...11 and on and on. Now try and implement any kind of chopping-up compression using that. It's no good -- the index size is identical to the data it points to, and no space can be saved.

I'm glad people still have fun reading this, and I think it still has value as a mental exercise, but I'll add a note to help avoid anyone taking this seriously. :^)