This idea I've had for a long time, and if it were mathematically and computationally feasable, it would probably have already been done. But never mind all that. I still like the idea, and I'm still going to spout it off.
I chose pi because I just like it, but e or any of those other fun infinite-digit, non-repeating numbers would probably work, and possibly be better. An unfathomably good pseudo-random number generator might even work. But let's call it pi for now, because the important thing is that it's this infinite sequence in which, theoretically, every possible sub-sequence exists. Somewhere in that infinitely-long series of digits, there'll be a hundred sequential zeros. Somewhere, there'll be a million. And somewhere, there'll be a sequence that matches the sequence of ASCII values for this blog entry, and every other text you've ever read (and ever will read... but that's a different story).
The idea is, whatever data you have, your term paper, that movie you swear you didn't download, or that log of leaked AOL searches, that data is already encoded somewhere in Pi. It may be several centillion digits out, but it's there. Theoretically, you just need to let a grep algorithm slide along Pi until it finds a match, and then just remember the offset and the length to get it back out. Suddenly all files, all volumes, all anything, can be represented by two numbers.
Parts of this have already been implemented, though on a scale much too small to be useful. (Four billion binary digits of Pi? Pff, weak sauce.)
Finding sequences in pi:
Getting hex digits (half-bytes) out of pi:
Our CPUs' abilities to crunch numbers aren't advancing as fast as our networks' abilities to pump big bunches of data around, so this whole thing isn't terribly practical. Still, you could back up your whole hard drive on a floppy diskette. The idea is fun even if it wouldn't be worth doing.
Let's take it a little further, though. Maybe we can still do something useful with those first four billion binary digits. Play around with that Pi-search link above. It only uses a 5-bit character set rather than ASCII's 8, but let's experiment anyway. I'm going to search for "penduin".
In 5-bit-per-character binary, that's:
The sequence is found at 8748556 binary digits in. 8748556 in binary is:
...that's a noticably smaller sequence than the binary representation of "penduin". Our data is 7 "bytes" long, which in binary is:
Those two numbers, the offset and length, happen to take fewer bits to represent than the string "penduin", in ascii or even in 5-bit characters. Houston, we have compression! But hang on, let's say that sequence didn't happen within 4 billion bits of pi (according to that site, an arbitrary 7-character sequence has just an 11% chance of being found that early). We could still find "pend" and "uin", though - the odds of finding three or four letters are essentially 100%. If we do that, though, we'd need more bits than the raw ascii just to keep the two offset/length pairs, let alone any padding we'd need for any sanely-packed descriptor.
That packed descriptor would be 35 bits per chunk: 32 bits (maximum value: 4,294,967,295) for offset, and 3 bits for length. Why 3? We're unlikely to find an arbitrary 8-letter pattern (8 is 4 digits, 1000 in binary) in just the first 4 billion bits of pi, so our algorithm won't even try. So every 35 bits in our "compressed" file points to the next up-to-7-letter chunk of our data. On a good day, that'll beat ASCII, which would take 56 bits to store 7 letters. But, we're still playing with 5-bit characters, a paltry subset of ASCII. We're also playing with odds where only 11% of our chunks will be able to point to the maximum 7 characters.
Looking at hex digits instead of 5-bit letters gives us a better picture of how useful this might be. Any byte (8 bits) can be expressed in 2 hex digits. Any four sequential bytes has about a 60% chance of being found in these first four billion bits of pi, and sequences of three and a half (7 hex digits) are pretty much guaranteed. 7 was the maximum length of our chunks from before anyway, so let's keep using that. So, to recap:
With our 35 bits descriptor, we can know where in pi to find up to three and a half bytes, or 28 bits, of data.
As far as compression algorithms go, that doesn't seem particularly good. :^) But we're chopping into 3.5-byte pieces or less because of the near statistical certainty of finding any 3.5 bytes in the first half-gigabyte of pi (four billion bits). Give us twice as many digits of pi, and we'd only need one more address bit, plus another bit or so for length, as longer patterns would have improved odds of being found. Keep bumping up the searchable pi digits and we could eventually hit a sweet spot beyond which we'd actually be able save disk space. Once that happens, we can keep on increasing the available pi digits to achieve better compression, but more interestingly, we can add another layer of logic in which we take our packed bits, chunk them up, and look for those sequences in pi. If we're going to do that, we need to pack an extra bit for each chunk to say either "I point to data" or "I point to a pointer", but if we've got so many pi digits to search that we're actually saving space every time, we can still wind up with simply a pi digit offset and size, which would lead to a lot more lookups, and eventually, chunk by chunk, the data we care about.
A tip of the hat to the first person to find this post within pi. It's there! Extra points if you also find my next post before I post it. :^)