
Wednesday, March 12, 2008

Compression Idea #4

I've had a multitude of ideas, but recently I've decided to put one to the test.

I know I'm drifting from encryption yet again, but this one may be worth it.

I've been considering the idea that you take data, turn it into a puzzle with some redundancy, then remove all the data you can and compress the remainder needed to rebuild it. With this in mind, I'm talking about something similar to Sudoku, although it may be larger than 9x9, more like 25x25 or so, too large for humans (short of very, very intelligent and bored people) to solve. The exact method is still uncertain, but I'll give you what I have so far.

1) I've successfully created a virtual Sudoku environment, which works in that it looks and feels like Sudoku and is built to grow in size, but for testing and building purposes it can't go beyond 9x9 yet.

2) I've programmed only one method (called method1), which does the simplest of calculations and fills in the more obvious solutions; using it alone can solve puzzles rated easy, and maybe intermediate. I haven't fully tested it, but the results were so astounding for a first run with no debugging that I was surprised.
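
For illustration only, here is a minimal sketch of the simplest kind of rule I mean (filling any cell that has exactly one possible candidate). This is a hypothetical example written for this post, not the actual method1 code:

# Sketch only: a "single candidate" pass over a 9x9 grid, where 0 marks an
# empty cell.  Illustrative, not the actual method1 implementation.
def single_candidate_pass(grid):
    changed = True
    while changed:
        changed = False
        for r in range(9):
            for c in range(9):
                if grid[r][c] != 0:
                    continue
                used = set(grid[r]) | {grid[i][c] for i in range(9)}
                br, bc = 3 * (r // 3), 3 * (c // 3)
                used |= {grid[i][j] for i in range(br, br + 3)
                                    for j in range(bc, bc + 3)}
                candidates = set(range(1, 10)) - used
                if len(candidates) == 1:      # only one value can go here
                    grid[r][c] = candidates.pop()
                    changed = True
    return grid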

There are more methods to add, I think in the range of 4-5 more. When it can successfully do any puzzle on the fly in seconds, the algorithm is ready for the next stages, which I will outline now.

1) Build the solver algorithm
2) Build the converter to/from Sudoku
3) Build the number remover, which selectively removes as many numbers as it can until it can do no more (expert level, hopefully)
4) Experiment with compression until a proper method is found that will hopefully compress the data down enough to compensate for the added redundancy and make the whole venture worth it

This is all in the hope that the compressed puzzle is smaller than the original data used to build the puzzle in the first place. But even with only one method so far, the results for a solver (even if the compression fails) are promising enough to see the project through.

Era

Tuesday, November 13, 2007

Compression #4

Further thoughts on compression.

I've been looking through more interesting information, including Huffman coding, which is simply a method for creating a variable-length prefix-free code.

I've also tried some experiments based on that code, and they look very promising. By taking the basic format and modifying it in very simple ways, I've tested random numbers (32 bits, experimental using base 10), phrases, and some generic patterns.

My findings say that a compression that would yield 32 bits in regular Huffman code yields between 25 and 29 bits with the modified code. The tree structure is quite different, so storing the tree in its new format (for 8-bit characters) changes too: the original could be stored at 10 bits for every 8 bits, while mine would take 12 bits for every 8 bits (expected, not tested).

With that in mind, the great use for it would be as an adaptive code: quite simply, every single character starts off with a weight of 1, and the space is divided up. As characters are added, the weights change a bit at a time. (No table needs to be stored, but the compression isn't as good at the start.)

The downside of the experimental format is that it requires large INTs (32-1024 bits), or rather it requires actual math instead of pasting x bits in a specific pattern. But it can also work that way: with text, it could be compressed X elements at a time, or as many as possible before it overflows, then output as a straight 32-bit word. That way it wouldn't need a huge library, and could reasonably be added to most standard applications with little to no problem.
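
As a rough illustration of the "actual math" approach (a sketch only, assuming a plain 26-letter alphabet, not my experimental code):

# Sketch: pack letters into one integer by repeated multiplication, and
# unpack with divmod.  Each letter then costs about log2(26) = 4.7 bits
# instead of a flat 5, at the price of real arithmetic (or big-int math
# for long runs) rather than pasting bits.
def pack(text, radix=26):
    value = 0
    for ch in reversed(text):              # reversed so unpack reads in order
        value = value * radix + (ord(ch) - ord('a'))
    return value

def unpack(value, count, radix=26):
    out = []
    for _ in range(count):
        value, digit = divmod(value, radix)
        out.append(chr(digit + ord('a')))
    return ''.join(out)

# Six letters fit in 29 bits (26**6 < 2**29) versus 30 bits at 5 bits each.
assert unpack(pack("packed"), 6) == "packed"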

Current testing and building of the new experimental Huffman code (ideas welcome for a name) are incomplete, but show promising results.

Depending on how LZMA works, it might also be implemented with little to no problem using the same underlying idea, gaining an additional 5-30% compression, but this is just a thrown-out idea until testing and proof come forward.

A recent test, which didn't turn out very well due to setup and formatting but was still promising, was an x3 format. This format MIGHT have worked in certain cases, but not always. I might say more on this later when I have more time.

But at this moment, no compression can be gained from encryption; compression and then encryption is the only real option. For a real butt-kicker, add a 2-byte offset and a 2-byte size (4 bytes). Then, up to that offset, fill with random numbers; afterwards fill with random numbers until it reaches the original message size, then encrypt the whole thing. That will make it many, many levels harder to break, since the attacker is trying to break random data or compressed (dense) data, and can't confirm or test against anything. But that's just another random thought :P
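
A quick sketch of that padding step (my reading of the layout; the encryption itself would follow with whatever cipher you like):

import os, struct

# Sketch only: 2-byte offset, 2-byte payload size, random filler up to the
# offset, the compressed payload, then random filler out to the original
# message length.  Encryption of the whole thing comes afterwards.
def pad_compressed(payload, original_size):
    header = 4
    max_offset = original_size - header - len(payload)
    if max_offset < 0:
        raise ValueError("payload too large for the original size")
    offset = int.from_bytes(os.urandom(2), "big") % (max_offset + 1)
    return (struct.pack(">HH", offset, len(payload))
            + os.urandom(offset)
            + payload
            + os.urandom(max_offset - offset))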

Era

Monday, November 5, 2007

Compression #3 - Continued.. Part B

Further notes. This will take a while, but it's somewhere to start.
I've compiled a list (I'll give specifics later) of how many bits it takes to represent a given number, and then to encode that number. I've seen it jump from 1 bit to 12 bits, or to 180 bits. However, the larger it gets, the more data I need to represent it. On top of that, it doesn't include the required counts of how many times each is used. If that fails, the compression rate is much diminished, not to mention the header information becomes a loss as well.

I'm also working on the following thoughts. The exact way to encode for the smallest size in all cases is currently open to ideas, and requires a lot of math, or large INTs that need some type of library. I will post mine later; although others are available for free, this is one I built and know works, and it is still being developed.

OK, the value is 5, using it in a 4-bit (16) area. Simply, you add all the other values on top, and when it becomes too large to represent the number, you subtract it. (For 8 bits, the values were 8, 28, 56, 70, etc.)

I've gotten numbers in the 12k (15-bit) range as a start, while representing 5-6 data parts (20-24 bits). Depending on what they were (1-6 distinct combinations), that determined how it would be stored.

If you think of a better way to store the number, please tell me and I'll tell you if it will work. If I get this to work, the compression idea goes GPL :)

The 1 case wouldn't include any further information, since it's the same number 6 times.
The 2 case would simply be 2x2x2x2x2x2 = 64, which is within reach.

The 3 case: 3x3x3x3x3x3 = 729.

The 4 case: 10 (4+6 for the 1- and 2-value reuses) x 6 (2 repeat inserts) x 2 (worst case, order of repeats) x 4x3x2x1 = 2880.

The 5 case: the best way to represent this is 5x5 for the duplicate and the offset of where to put it, then 5x4x3x2x1, giving 3000.

The 6 case: each value simply falls off as it's used, 6x5x4x3x2x1 = 720.

OK, so we compare against the worst case. If it's 24 bits, the total can't exceed 16 million; if it does, hopefully it's not higher than 32 million, since that's 25+ bits.

Another way to help with getting the proper size: take your maximum value for representing the on-bits in their combinations, then divide the area (16 million) by it. Then try to keep all your other represented values below this value; if you can do it, you are guaranteed compression :)
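
As a back-of-envelope version of that check (my own sketch; it ignores the per-case header counts that also have to be stored, which is where the real cost shows up):

from math import ceil, log2

# Sketch: total the worst-case counts for every case you might have to
# represent (the 1-6 distinct-value cases listed above, for instance) and
# compare against the 24-bit budget of 16,777,216.
def fits_in_bits(case_counts, budget_bits=24):
    total = sum(case_counts)
    return total <= 2 ** budget_bits, ceil(log2(total))

ok, bits_needed = fits_in_bits([1, 64, 729, 2880, 3000, 720])
print(ok, bits_needed)   # True, 13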

I've also tried dropping the lowest value (the 1 case) and leaving an extra value in; if that case is hit, we spell out the 4-bit value manually, which increases the size from 12 bits to about 17-18 bits, though it's still less than 24 bits. The likelihood of hitting it is not high (we're talking a 3-byte combination here), but it did lower the main hit numbers quite a bit, just not nearly enough.

I am increasing the values, but verifying them can be very hard, considering the math involved and manually building all the possible combinations for the worst-case scenario (which is the only one I compare against).

A note: if there is a higher chance of hitting a case at or below the break-even bit size than above it, it is a success, and it will be built and tested to see if it is worth it. But don't expect huge compression on small files; expect a decent compression over a large area instead, and even that can fail.

Era

Thursday, November 1, 2007

Compression #3 - Continued

More attempts at compression: although they theoretically work in certain situations, they don't guarantee endless compression yet. But I'm working on it.

1) High-Low. Ever played the gambling game where you flip a card face up and guess whether the next card is higher or lower? Well, this compression type has some limited success.

You keep a history of your last value and compare it against your next value. The odds are in your favour that the next value falls in the larger area. But when that fails, you refer to a fallback value and re-encode the correct value using the smaller range; so, this uses very large INTs.

It tends to turn 8 bits into 9.5 bits, and 4 bits into 8.8 bits? Not worth it.

2) Attempted: encoding X unique values in a row, and specifying a failure (a non-unique value), which we then encode from our list. There is of course success at every level EXCEPT the highest. Using this knowledge we can then encode with a smaller form based on on-bits (combinations of a certain number of 1's on; for example, 2 on-bits in an 8-bit area gives 28 combinations).

There are problems.

Problem 1) It can only go to 5 on 8 bits, and 6 on 16 bits. Higher values equal less compression.
Problem 2) The highest value doesn't guarantee a repeat or failure, so our last value can't be compressed to a smaller bit value.

I tried a straight two-different 3-bit unique byte type, which works: 8 x 7 = 56 of 64 (or 28, x2x1). However, there's the problem and a 1-in-8 chance that they are identical. The failure rate, and the references (3-5 bits) for successful non-matches in a row, nearly negate any value. If a failure comes every 8 values on average, then the 3 bits cover themselves; if it comes more or less often than every 8, the bit sizes are lost in the big math and it isn't worth it.

I've also tested using a 2-level compression for bit sizes and getting counts on types. It's not yet successful, but testing isn't finished either.

Era

Wednesday, October 24, 2007

Compression #3

A few ideas have come up as of late, a few I thought were pretty good and am currently working on, but they haven't been perfected.

First is using the previously mentioned multiplication to fold several values together inside one number; by using the appropriate divisor and taking the remainder, you can rebuild the original data. In this way, for the 1-8 combinations, you could store the data in 15.7 bits instead of 18-24 bits.
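
To make the multiply/divide idea concrete, here is a small sketch of my own (an illustration, not the working code): packing an ordering of the values 1-8 into a single integer and rebuilding it with divmod.

from math import factorial, log2

# Sketch: there are 8! = 40320 arrangements of eight distinct values, so the
# rank fits in 16 bits (about 15.3 bits of information) instead of
# 8 values x 3 bits = 24.
def rank(perm):
    value, remaining = 0, sorted(perm)
    for p in perm:
        i = remaining.index(p)
        value = value * len(remaining) + i   # multiply in the next "digit"
        remaining.pop(i)
    return value

def unrank(value, values):
    remaining, digits = sorted(values), []
    for radix in range(1, len(remaining) + 1):   # peel the digits back off
        value, d = divmod(value, radix)
        digits.append(d)
    digits.reverse()
    return [remaining.pop(d) for d in digits]

perm = [3, 1, 4, 2, 8, 6, 7, 5]
assert unrank(rank(perm), range(1, 9)) == perm
print(rank(perm), log2(factorial(8)))            # rank value, ~15.3 bits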

The first idea involves using the above and extracting the appropriate data. I'm still fiddling with it, so I will de-classify it as I go. I have a 1-4 match count (2 bits) and can then rebuild accordingly, with the next few bits saying which matches we are working with. When a match is made, we only need to reference the internal match, therefore saving a lot of space. Example:

1 unique, 2 long (4 * 8) : 0,0
2 unique, 3 long (4*28*2*2) : 1,2,2
3 unique, 4 long (4*56*3*2*3) 5,4,3,4
4 unique, 4 long (4*70*4*3*2) 1,2,3,4

All levels offer some compression except the fourth one, which takes up about 12.7 bits while its source is 12 bits. (I will give exact bit details, and how I get the 8, 28, 56, 70 numbers, later.)

I'm also testing two current ideas. The first is +1, which goes something like this: let's say I'm working with 3 bits (8 values), so the multiplier is always +1 larger on all of them. The reason is that the final nth value is a breakout protocol, and afterwards comes a number telling me what the repeat is. So if the example goes 1,2,5,4,3,2, this covers 3x6 = 18 bits. The max number is then 9x8x7x6x5x4x6 = 362880 (58980h), which is about 19-and-something bits. Obviously we can't get compression there, but if the match were one larger, it would be equal or smaller in size, and could then be saved.

In order to save more bits, a very powerful arbitrary-precision (infinite) int library would do well (I have a homemade one, which is working).

I will say more when I have more to say.

Era

Sunday, October 21, 2007

Compression #2 - notes

Here are notes on the combinations of bits in a larger area. This says, for example, that if you have 12 active (on) bits in a 32-bit area, there will be X combinations.

32:1 - 32 (5)
32:2 - 496 (9)
32:3 - 4960 (13)
32:4 - 35960 (16)
32:5 - 201376 (18)
32:6 - 906192 (20)
32:7 - 3365856 (22)
32:8 - 10518300 (24)
32:9 - 28048800 (25)
32:10 - 64512240 (26)
32:11 - 129024480 (27)
32:12 - 225792840 (28)
32:13 - 347373600 (29)
32:14 - 471435600 (29)
32:15 - 565722720 (30)
32:16 - 601080390 (30)
32:17 - 565722720 (30)
32:18 - 471435600 (29)
32:19 - 347373600 (29)
32:20 - 225792840 (28)
32:21 - 129024480 (27)
32:22 - 64512240 (26)
32:23 - 28048800 (25)
32:24 - 10518300 (24)
32:25 - 3365856 (22)
32:26 - 906192 (20)
32:27 - 201376 (18)
32:28 - 35960 (16)
32:29 - 4960 (13)
32:30 - 496 (9)
32:31 - 32 (5)
32:32 - 1 (0)

These are the values I've come across using a set of functions. '32:' is the bit area currently referred to, ':x' is the number of bits on, '- n' is the number of combinations, and '(x)' is the number of bits needed to encode one of them.
In order to get any compression, assuming you also need to spend 5 bits to say how many of the 32 bits are on, you would need the count of on-bits to fall between 1-10 or 22-32. There might be other things or ways to use this; in much larger areas it might be more useful, but the counts near the middle usually only save 2-3 bits.
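
For reference, these are the binomial coefficients, and the table can be regenerated with a few lines (a sketch, using the same bits-to-encode convention as above):

from math import comb, ceil, log2

# Sketch: C(32, k) ways to choose k on-bits out of 32, and the number of
# bits needed to index one of those combinations.
for k in range(1, 33):
    n = comb(32, k)
    print("32:{} - {} ({})".format(k, n, ceil(log2(n)) if n > 1 else 0))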

full ref 4-64 bits.

Era

Compression #2 - Continued....

More on the topic.

I've tried a few different combinations. One is a pair encoding, which is a unique idea. On dense data it failed to be useful, but it did successfully compress a text file to half its size.

Its format is simple: a header with bit sizes and the number of different values we are working with. It then encodes a 0 to say we are using a number already mentioned, encoded at its best size. When a 1 is used, it encodes a new number and says how many times it is used. That number is then used right away once, and tracked as it goes.

Once all the different types have been referenced, the 0/1 flag is no longer used, and values are encoded directly with their current codes instead. When a value's remaining count drops to 0 it is removed, so the last x types will compress better than the rest.

-

Another experiment I will go into is trying to find the maximum number of values you can encode in a range while each is used only once. It has a chance to succeed, assuming the header information can be encoded correctly and at a decent size (preferably half or less of all the combinations; for example, 7 bits gives 128 combinations, so try to say which ones are used in 64 bits or less).

I'll give specifics later on other useful information regarding multiplication to maximize saved space (bits).

An example of this is the 1-8 combinations (24 bits, 3 bits each, all different). You can easily encode this without much expense down to 18 bits, but using multiplication (about 40k combinations) it can be compressed to 15.6 bits. However, if not all of the fields are used, you need to specify which ones you do and don't use: 16+8 = 24 bits, breaking even with the original encoding and not helping (though not making it worse either).

I'll get back with more thoughts and details later.

Era

Tuesday, October 16, 2007

Compression #2 - Continued

As a continuation of my previous post, I shall describe the exact details of the universal code that I'm experimenting with for use in this compression.

The gamma coding would easily follow this pattern. Bold is the ending, italic is the 'continued' bit, and normal is the other code.

0 - 0
1 - 10 - 1 already assumed at highest point
2 - 1100
3 - 1110
4 - 110100
5 - 110110
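
For comparison, standard Elias gamma (the unmodified code from the Wikipedia article, not my variant above) can be sketched in a few lines; note it only encodes n >= 1, so a 0 would need an offset:

# Sketch of standard Elias gamma for reference: floor(log2(n)) zeros,
# followed by the binary form of n.
def elias_gamma(n):
    binary = bin(n)[2:]                 # binary form of n, no leading zeros
    return "0" * (len(binary) - 1) + binary

print([elias_gamma(n) for n in range(1, 6)])
# ['1', '010', '011', '00100', '00101']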

The second encoding uses more bits per bit reference, giving higher compression, but it loses the first bit since it goes in reverse instead. For example, to encode 1101 you would write 1011 and it would be reversed.

0 - 0
1 - 1100
2 - 1010
3 - 1110
4 - 1001100
5 - 1101100

Other codes are being considered but will be tested and experimented with later. However, the tests do suggest that if I can compress the header information (the ints used) correctly, it could compress the actual data from 1024 bytes to 300 bytes, meaning I need to fit the header into 700 bytes or less. Real fun.

Era

Monday, October 15, 2007

Compression #2

I'm making a second attempt, using the following tests, in the hope of getting something to work.

1) It will use a variant of Elias gamma coding, http://en.wikipedia.org/wiki/Elias_gamma_coding , but instead of using it the way it is described on Wikipedia, it makes 2 changes. I specify how many bits will be used for every first true bit. (The 1 can be extra condensed, since the first true bit is assumed to be 1 because you would never reference a 0, so the lowest numbers save the most space.)

2) For a single bit as a separator, I can reference all combinations (in a 9-13 bit size) and get the ones that are only used once. The compressor will see and decrease those numbers, and in the end reduce the size later required to compress it. The second run-through would actually have to either a) give all the counts of the values, or b) ensure all other values never go lower, which could be bad. (When it exceeds 50%, it references the ones that are not present instead.)

In cases where, no matter how it is compressed, it exceeds the number of bits for the combinations (10 bits is 1024), it can fall back to an alternate that references everything as on/off directly, but no numbering index would be available.

3) It uses a variant with the highest compression for a combination of bits (not weighted or self-modifying like gz, which is really good). Say there's a maximum of 6 numbers; it would determine the lowest number of bits needed to represent all of the numbers equally. Example:

0-5 (6 combinations)

type 1: 000, 001, 010, 011, 10, 11 (4x3 = 12, plus 2x2 = 4, for 16 bits)
type 2: 00, 01, 100, 101, 110, 1110, 1111 (2x2 = 4, plus 3x3 = 9, plus 2x4 = 8, for 21 bits)

Some combinations alter this, but for lower counts type 1 works, while type 2 works for larger ones, giving extra savings over a larger number of bits. Once decided, it always uses that type.
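
For what it's worth, type 1 has the same codeword lengths as what is usually called truncated binary coding (just with the short codes assigned to different symbols); a minimal sketch of that standard form:

from math import floor, log2

# Sketch: truncated binary coding for n symbols.  With n = 6, two symbols
# get 2-bit codes and four get 3-bit codes, matching the type 1 lengths.
def truncated_binary(value, n):
    k = floor(log2(n))
    u = 2 ** (k + 1) - n                # number of short (k-bit) codewords
    if value < u:
        return format(value, "0{}b".format(k))
    return format(value + u, "0{}b".format(k + 1))

print([truncated_binary(v, 6) for v in range(6)])
# ['00', '01', '100', '101', '110', '111']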

Some tests have shown promise, but at the same time the compression was done on a file where I was able to save an additional 300 bytes over gzipping it. Truly random, uncompressable data still needs to be re-tested with it.