Tuesday, November 13, 2007

Compression #4

Further thoughts on compression.

I've been looking through more interesting material, including Huffman coding, which is simply a method for building a variable-length, prefix-free code.

I've also tried some experiments based on that code, and they look very promising. By taking the basic format and modifying it in very simple ways, I've tested random numbers (32 bits, experimental using base 10), phrases, and some generic patterns.

My findings so far: data that would yield 32 bits in a regular Huffman code yields between 25 and 29 bits with the modified code. The tree structure is quite different, though, so storing the tree in its new format (for 8-bit characters) changes too: the original tree could be stored at about 10 bits per 8-bit character, while mine would take about 12 bits per 8 bits. (Expected, not tested.)

With that in mind, the best use for it would be as an adaptive code: quite simply, every character starts off with a weight of 1, and the code space is divided up from there. As characters are added, the weights shift a little at a time. (No table needs to be stored, but compression isn't as good at the start.)
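As a rough illustration of the adaptive part (not the modified code itself, just the standard start-everything-at-weight-1 idea), here is a minimal Python sketch that rebuilds a plain Huffman code from running byte counts. The names are mine, and rebuilding after every single byte is only for clarity, not speed.

    import heapq
    from collections import Counter

    def code_lengths(weights):
        """Return {symbol: code length} for a Huffman code built from weights."""
        heap = [(w, i, [s]) for i, (s, w) in enumerate(weights.items())]
        heapq.heapify(heap)
        lengths = {s: 0 for s in weights}
        while len(heap) > 1:
            w1, _, grp1 = heapq.heappop(heap)
            w2, i, grp2 = heapq.heappop(heap)
            for s in grp1 + grp2:
                lengths[s] += 1          # each merge adds one bit to its members
            heapq.heappush(heap, (w1 + w2, i, grp1 + grp2))
        return lengths

    weights = Counter({b: 1 for b in range(256)})  # every byte starts at weight 1
    data = b"weights adapt as the message is read"
    total_bits = 0
    for b in data:
        total_bits += code_lengths(weights)[b]     # cost under the current code
        weights[b] += 1                            # then bump this byte's weight
    print(total_bits, "bits vs", 8 * len(data), "bits raw")

Both ends can run the same update rule, which is why no table has to travel with the data.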

The downside of the experimental format is that it requires large INTs (32-1024 bits), or at least actual math, instead of just pasting X bits in a specific pattern. But it can be made to work that way too. With text, it could compress X elements at a time, or as many as possible before the accumulator overflows, then output a straight 32-bit word. Done that way it wouldn't need a huge big-integer library, and could reasonably be added to most standard applications with little to no trouble.
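Here is a minimal sketch of the pack-until-it-overflows idea, assuming each element has already been reduced to an index and a range (how many values that index could take). The (index, range) model and the names are stand-ins, not the actual format; a decoder would replay the same ranges in reverse to peel the symbols back out.

    WORD_LIMIT = 1 << 32   # flush once the accumulator no longer fits in 32 bits

    def pack_words(symbols):
        """symbols is a list of (index, range) pairs with 0 <= index < range."""
        words, counts = [], []    # packed words, plus symbols-per-word for decoding
        acc, n = 0, 0
        for index, rng in symbols:
            if acc * rng + index >= WORD_LIMIT:   # next symbol would overflow: flush
                words.append(acc)
                counts.append(n)
                acc, n = 0, 0
            acc = acc * rng + index
            n += 1
        words.append(acc)
        counts.append(n)
        return words, counts

    # Example: packing base-10 digits; about nine or ten fit per 32-bit word.
    print(pack_words([(d, 10) for d in [3, 1, 4, 1, 5, 9, 2, 6, 5, 3, 5]]))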

Current testing and building of the new experimental Huffman code (ideas for a name are welcome) are incomplete, but show promising results.

Depending on how LZMA works, it might also be implemented on the same underlying idea with little to no trouble, gaining an additional 5-30% compression, but that's just a thrown-out idea until testing and proof come forward.

A current test that didn't turn out very well, due to setup and formatting, but was still promising, was an x3 format. That format MIGHT have worked at certain times, but not always. I may say more on this later when I have more time.

But at this moment, no compression can be gained from encrypted data; compressing first and then encrypting is the only real option. For a real butt-kicker, add 2 bytes of offset information and a 2-byte size (4 bytes total). Up to that offset, fill with random numbers; after the compressed payload, fill with random numbers until it reaches the original message size; then encrypt the whole thing. That makes it many, many levels harder to break, since an attacker is trying to break random data mixed with compressed (dense) data, and can't confirm or test against either. But that's just another random thought :P
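A rough sketch of that padding layout, under my own assumptions about the header (2-byte offset, then 2-byte payload size); encrypt_bytes/decrypt_bytes are placeholder stand-ins for whatever cipher actually gets used.

    import os
    import struct

    def encrypt_bytes(data):   # stand-in: swap in a real cipher here
        return data

    def decrypt_bytes(data):   # stand-in: swap in a real cipher here
        return data

    def wrap(compressed, original_size):
        slack = original_size - len(compressed)            # room gained by compression
        offset = int.from_bytes(os.urandom(2), "big") % (slack + 1)
        header = struct.pack(">HH", offset, len(compressed))
        body = os.urandom(offset) + compressed + os.urandom(slack - offset)
        return encrypt_bytes(header + body)

    def unwrap(blob):
        plain = decrypt_bytes(blob)
        offset, size = struct.unpack(">HH", plain[:4])
        return plain[4 + offset : 4 + offset + size]       # the real payload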

Era

Monday, November 5, 2007

Compression #3 - Continued.. Part B

Further notes. This will take a while, but it's somewhere to start.
I've compiled a list (specifics later) of how many bits it takes to reach a given number, and then to encode that number. I've seen it jump from 1 bit to 12 bits, or to 180 bits. The larger it gets, the more data I need to represent it. On top of that, it doesn't include the required counts of how many times each value is used. On a failure, the compression rate drops sharply, and the header information becomes a loss as well.

I'm also working on the following thoughts. The exact way to encode for the smallest size in every case is still up for ideas, and requires a lot of math, or large INTs that need some kind of library. I will post mine later; there are others available for free, but this is one I built and know works, and it's still being developed.

OK, say the value is 5, used in a 4-bit (16-value) area. Simply, you add all the other values on top, and when it gets too large to represent the number, you subtract it back off. (For 8 bits, the counts were 8, 28, 56, 70, etc.)
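Those 8, 28, 56, 70 figures are the counts of 8-bit patterns with exactly 1, 2, 3, and 4 bits set (binomial coefficients). Assuming that is the list being referred to, here is a small Python sketch that prints them and ranks one such pattern using the combinatorial number system; the ranking function is my own illustration, not the exact method described above.

    from math import comb

    # C(8, k) for k = 1..4 gives 8, 28, 56, 70: the number of 8-bit
    # patterns with exactly k bits turned on.
    for k in range(1, 5):
        print(k, comb(8, k))

    def rank_pattern(bits, n):
        """Rank an n-bit pattern among all patterns with the same number of
        set bits.  Given k, the rank alone rebuilds the pattern, and it only
        needs ceil(log2(C(n, k))) bits instead of n."""
        positions = [i for i in range(n) if bits >> i & 1]
        return sum(comb(p, i + 1) for i, p in enumerate(positions))

    # Example: 0b00100100 has 2 bits set, so its rank is one of 28 values
    # and fits in 5 bits rather than 8.
    print(rank_pattern(0b00100100, 8))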

I've gotten numbers in the 12k (15-bit) range as a starting point, while representing 5-6 data parts (20-24 bits). Which of the 1-6 combination cases the parts fell into determined how they would be stored.

If you can think of a better way to store the number, please tell me and I'll tell you whether it will work. If I get this working, the compression idea goes GPL :)

The 1 case wouldn't include any further information, since it's the same number 6 times.
The 2 case would simply do 2x2x2x2x2x2 = 64, which is within reach.

The 3 case: 3x3x3x3x3x3 = 729.

The 4 case: 10 (4+6, for the one- and two-reuse patterns) x 6 (positions of the 2 repeat inserts) x 2 (worst case, order of the repeats) x 4x3x2x1 = 2880.

The 5 case: the best way to represent this is 5x5 (which value is duplicated, and the offset of where to put it), then 5x4x3x2x1 = 3000.

The 6 case: each factor simply falls off as it goes: 6x5x4x3x2x1 = 720.

OK, so we compare against the worst case. If it's 24-bit, it can't exceed 2^24 (about 16.8 million); if it does, hopefully it stays under 2^25 (about 33.5 million), since that's 25+ bits.

Another way to help get the proper size is to take your maximum value for representing on-bits in their combinations, then divide the area (about 16 million) by it. Then try to keep all your other represented values below that value; if you can do it, you are guaranteed compression :)
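A quick arithmetic check of that divide-the-area idea, with a made-up worst-case count standing in for the real one:

    AREA = 1 << 24          # 16,777,216 possible 24-bit inputs
    WORST = 3000            # hypothetical outcome count for the biggest case

    print("head-room left for the other cases:", AREA // WORST)
    print("index for the biggest case needs", (WORST - 1).bit_length(),
          "bits instead of 24")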

I've also tried dropping the lowest value (the 1 case) and leaving an extra escape value in; if that was hit, we spell out the 4-bit value manually, which raises the size from 12 bits to about 17-18 bits, though that's still less than 24 bits. The likelihood of hitting it isn't high (we're talking about a 3-byte combination here), and it did lower the main-case numbers quite a bit, but not nearly enough.

I am still working the values up, but verifying them can be very hard, considering the math involved and the need to manually build all the possible combinations for the worst-case scenario (which is the only one I compare against).

A note: if the chance of landing on a case that ends up a bit (or more) smaller is higher than the chance of landing on one that ends up a bit (or more) larger, it is a success, and it will be built and tested to see whether it's worth it. But don't expect huge compression on small files; expect a decent compression over a large area, and even that can fail.

Era

Thursday, November 1, 2007

Compression #3 - Continued

More compression approaches I've attempted; although they theoretically work in certain situations, they don't guarantee endless compression as of yet. But I'm working on it.

1) High Low. Ever played the gambling game where you flip a card up and guess whether the next card is higher or lower? Using this as a compression type has had some limited success.

You keep the last value as history and compare it against the next value. The odds are in your favor that the next value lands in the larger area. When that fails, you emit a fallback and re-encode the correct value within the smaller area. So... it means using very large INTs.

It tends to turn 8 bits into about 9.5, and 4 bits into about 8.8 bits(?), so it isn't worth it.
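A minimal sketch of how the high/low step could be reduced to (index, range) pairs for a big-integer accumulator; this is my own illustration of the mechanism, not the exact scheme tested above, and the names are made up.

    from functools import reduce
    from math import log2

    def high_low_pairs(data):
        prev, out = 128, []
        for b in data:
            low_size, high_size = prev, 256 - prev   # values below / at-or-above prev
            hit = (b >= prev) == (high_size >= low_size)
            out.append((0 if hit else 1, 2))          # the hit/miss "guess" symbol
            side = max(low_size, high_size) if hit else min(low_size, high_size)
            out.append((b - prev if b >= prev else b, side))  # index inside its side
            prev = b
        return out

    data = b"some sample text to measure"
    pairs = high_low_pairs(data)
    cost = log2(reduce(lambda acc, p: acc * p[1], pairs, 1))
    print(f"{cost:.1f} bits vs {8 * len(data)} bits raw")

The guess symbol costs a full extra bit per byte here, which lines up with the expansion noted above unless the guesses are heavily skewed in your favor.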

2) Attempted: encoding X unique values in a row, and specifying a failure (a non-unique value), which we then encode from our list. There is of course a success at every level EXCEPT the highest one. Using this knowledge we can then encode with something cheaper than whole bits (combinations of a certain number of 1s on; for example, 2 bits on in an 8-bit area gives 28 combinations).

There are problems.

Problem 1) It can only go to 5 on 8-bit, and 6 on 16-bit. Higher values mean less compression.
Problem 2) The highest value doesn't guarantee a repeat or a failure, so our last value can't be compressed into a smaller bit value.

I tried a straight pair of two different unique 3-bit values, which works: 8 x 7 = 56 of 64 (or 28 unordered, x2x1). However, there's the problem, with a 1-in-8 chance, that the two values are identical. The failure rate, or the references (3-5 bits) needed to mark successful non-matches in a row, nearly negates any gain. IF a repeat happens every 8 values on average, the 3 bits cover themselves; if repeats come more or less often than every 8, the bit sizes get lost in the big math and it isn't worth it.
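A tiny worked version of the 8 x 7 = 56 case, just to show that an ordered pair of two different 3-bit values really can be given a single index below 56 and recovered again (the function names are mine):

    def pair_index(a, b):
        """a, b in 0..7 with a != b; returns a unique index in 0..55."""
        return a * 7 + (b if b < a else b - 1)

    def pair_from_index(i):
        a, r = divmod(i, 7)
        return a, (r if r < a else r + 1)

    # Every distinct pair round-trips, and every index stays below 56.
    assert all(pair_from_index(pair_index(a, b)) == (a, b)
               for a in range(8) for b in range(8) if a != b)
    print(pair_index(5, 2), pair_index(7, 6))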

I've also tested using a 2-level compression for bit sizes, getting counts by type. It's not yet successful, but it's not finished testing either.

Era