Tuesday, November 13, 2007

Compression #4

Further thoughts on compression.

I've been looking through more interesting information, including Huffman coding, which is simply a method for building a prefix-free code with variable-length codewords.
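For anyone following along, here's a minimal sketch of plain, static Huffman coding in Python. This is just the textbook version, not my modified format; the huffman_codes name and the tuple-based tree are only my own illustration.

import heapq

def huffman_codes(freqs):
    """freqs: dict mapping symbol -> weight. Returns dict symbol -> bitstring."""
    # Heap entries are (weight, tiebreaker, tree); a tree is either a bare
    # symbol (a leaf) or a (left, right) tuple (an internal node).
    heap = [(w, i, sym) for i, (sym, w) in enumerate(freqs.items())]
    heapq.heapify(heap)
    counter = len(heap)
    while len(heap) > 1:
        w1, _, t1 = heapq.heappop(heap)   # pop the two lightest subtrees...
        w2, _, t2 = heapq.heappop(heap)
        counter += 1
        heapq.heappush(heap, (w1 + w2, counter, (t1, t2)))  # ...and merge them
    codes = {}
    def walk(tree, prefix):
        if isinstance(tree, tuple):       # internal node: recurse left/right
            walk(tree[0], prefix + "0")
            walk(tree[1], prefix + "1")
        else:                             # leaf: record its codeword
            codes[tree] = prefix or "0"   # lone-symbol edge case
    _, _, root = heap[0]
    walk(root, "")
    return codes

print(huffman_codes({"a": 5, "b": 2, "c": 1, "d": 1}))
# frequent symbols get short codes, rare symbols get longer ones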

I've also tried some experiments based on that code, and they look very promising. Taking the basic format and modifying it in very simple ways, I've tested it on random numbers (32-bit, experimental, using base 10), on phrases, and on some generic patterns.

My findings so far: data that would yield 32 bits under regular Huffman coding yields between 25 and 29 bits with the modified code. The tree structure is quite different, though. Storing the original tree (for 8-bit characters) costs roughly 10 bits for every 8-bit symbol, while my format would take about 12 bits per 8-bit symbol. (Expected, not yet tested.)
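For the curious, the roughly-10-bits figure for storing a plain Huffman tree comes from a common serialization trick: a pre-order walk that spends one flag bit per node plus the raw 8 bits at each leaf, which is (2n - 1) + 8n bits for n symbols, just under 10 bits per symbol. The sketch below only shows that baseline; the 12-bit estimate for my modified tree is still untested.

def serialize_tree(tree):
    """tree: a byte value (leaf) or a (left, right) tuple, as in the sketch above."""
    if isinstance(tree, tuple):
        return "1" + serialize_tree(tree[0]) + serialize_tree(tree[1])
    return "0" + format(tree, "08b")          # leaf flag followed by the raw 8-bit symbol

bits = serialize_tree(((65, 66), (67, 68)))   # a tiny 4-leaf tree over bytes A-D
print(len(bits), "bits for 4 symbols")        # 7 structure bits + 32 symbol bits = 39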

With that in mind, the best use for it would be as an adaptive code: quite simply, every single character starts off with a weight of 1, and the code space is divided up accordingly. As characters are added, the weights shift a little at a time. (No table needs to be stored, but the compression isn't as good right at the start.)
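Something along these lines is what I mean by adaptive, shown here as a rough Python sketch rather than my exact scheme: every byte starts at weight 1, the code table is built from the current weights, and each coded byte bumps its own weight, so a decoder can mirror the process without ever receiving a table. (Rebuilding the whole code per byte is wasteful; a real adaptive coder updates the tree incrementally.)

def adaptive_encode(data):
    weights = {b: 1 for b in range(256)}   # uniform start: every byte has weight 1
    out = []
    for byte in data:
        codes = huffman_codes(weights)     # reuses the huffman_codes sketch above
        out.append(codes[byte])
        weights[byte] += 1                 # the decoder makes the exact same update
    return "".join(out)

print(len(adaptive_encode(b"abracadabra")), "bits")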

The downside of the experimental format is that it requires large integers (32-1024 bits), or rather it requires actual math instead of pasting X bits in a specific pattern. But it can still work that way. With text, it could compress X elements at a time, or as many as possible before the value overflows, and then output the result as a plain 32-bit word. That way it wouldn't need a big-integer library and could reasonably be added to most standard applications with little to no trouble.
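Here's a speculative sketch of that "fit as much as possible into a 32-bit word" approach, using plain mixed-radix arithmetic; the pack_words function and its details are my own illustration, not a finished codec.

import math, struct

def pack_words(symbols, radix):
    """Pack symbols (each in range(radix)) into plain 32-bit little-endian words."""
    per_word = int(32 / math.log2(radix))   # e.g. 9 decimal digits fit in one word
    out = bytearray()
    for i in range(0, len(symbols), per_word):
        acc = 0
        for s in symbols[i:i + per_word]:
            acc = acc * radix + s           # actual math, no bit pasting
        out += struct.pack("<I", acc)       # flush as an ordinary 32-bit word
    # (a real codec would also record how many symbols went into the final word)
    return bytes(out)

print(pack_words([3, 1, 4, 1, 5, 9, 2, 6, 5, 3, 5, 8, 9], 10))
# 13 decimal digits pack into two 32-bit words instead of 13 bytes of text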

Testing and building of the new experimental Huffman code (ideas welcome for a name) are still incomplete, but the results so far are promising.

Depending on how LZMA works, the same underlying idea might also be implemented there with little to no trouble, gaining an additional 5-30% compression, but that's just an idea thrown out there until testing and proof come forward.

Another recent test, which didn't turn out very well due to setup and formatting but was still promising, was an x3 format. That format MIGHT have worked in certain cases, but not always. I may say more on this later when I have more time.

But at this moment, no compression can be gained after encryption; compressing and then encrypting is the only real option. For a real butt-kicker, add a 2-byte offset and a 2-byte size (4 bytes total). Fill everything up to that offset with random bytes, put the compressed data there, fill with more random bytes until you reach the original message size, and then encrypt the whole thing. That makes it many, many levels harder to break, since an attacker is staring at either random data or compressed (dense) data and can't confirm or test anything. But that's just another random thought :P
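To make the layout concrete, here's a rough sketch of that padding idea; the encrypt argument is a stand-in for whatever cipher you'd actually use, and zlib just plays the role of the compressor.

import os, random, struct, zlib

def pad_then_encrypt(message, encrypt):
    compressed = zlib.compress(message)
    size = len(compressed)                  # assumes it's smaller than the original
    offset = random.randrange(0, len(message) - size + 1)   # where the payload hides
    header = struct.pack("<HH", offset, size)                # 2-byte offset + 2-byte size
    padded = (os.urandom(offset)                             # random bytes up to the offset
              + compressed
              + os.urandom(len(message) - size - offset))    # random bytes out to full length
    return encrypt(header + padded)         # encrypt the whole thing, header included

# Example with a do-nothing "cipher" just to show the layout:
blob = pad_then_encrypt(b"some fairly repetitive message " * 20, encrypt=lambda b: b)
print(len(blob), "bytes")                   # 4-byte header + original message length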

Era
