Wednesday, October 24, 2007

Compression #3

A few ideas have come up lately that I think are pretty good. I'm currently working on them, but they haven't been perfected yet.

First is using the previously mentioned multiplication to fold several values together, hidden inside a single number; by dividing by the appropriate divisor and taking its remainder, you can rebuild the original data. In this way, for the 1-8 combinations (8! = 40320 states), you could store the data in about 15.3 bits instead of 18-24 bits.
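A minimal sketch of that pack/unpack in Python (whose built-in ints are arbitrary precision; the field values here are just for illustration):

def pack(values, radices):
    # Fold the values into one integer, last field first, so that
    # unpacking with divmod returns them first field first.
    n = 0
    for v, r in zip(reversed(values), reversed(radices)):
        assert 0 <= v < r
        n = n * r + v
    return n

def unpack(n, radices):
    # Divide by each radix in turn; the remainder is the hidden value.
    values = []
    for r in radices:
        n, v = divmod(n, r)
        values.append(v)
    return values

# Eight fields with 8,7,6,...,1 choices give 8! = 40320 states,
# about 15.3 bits, instead of 8 fields * 3 bits = 24 bits.
radices = [8, 7, 6, 5, 4, 3, 2, 1]
fields = [3, 1, 4, 2, 0, 1, 0, 0]
assert unpack(pack(fields, radices), radices) == fields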

The first idea builds on the above. I'm still fiddling with it, so I'll reveal the details as I go. I store the number of unique values, 1-4 (2 bits), and can then rebuild accordingly, with the next few bits saying which values we are working with. When a value repeats, we only need to reference it internally, which saves a lot of space. Example:

1 unique, 2 long (4*8): 0,0
2 unique, 3 long (4*28*2*2): 1,2,2
3 unique, 4 long (4*56*3*2*3): 5,4,3,4
4 unique, 4 long (4*70*4*3*2): 1,2,3,4

All levels offer some compression except the fourth, which takes up about 12.7 bits while its source is only 12 bits. (I will give exact bit details and how I get the 8, 28, 56, 70 numbers later; a quick check is sketched below.)
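A quick sanity check of those level sizes, assuming the 8, 28, 56, 70 factors are the binomial coefficients C(8,1) through C(8,4):

from math import comb, log2

levels = [
    ("1 unique, 2 long", 4 * comb(8, 1)),              # 32 states
    ("2 unique, 3 long", 4 * comb(8, 2) * 2 * 2),      # 448 states
    ("3 unique, 4 long", 4 * comb(8, 3) * 3 * 2 * 3),  # 4032 states
    ("4 unique, 4 long", 4 * comb(8, 4) * 4 * 3 * 2),  # 6720 states
]
for name, states in levels:
    print(f"{name}: {states} states, {log2(states):.1f} bits")
# The last line gives 6720 states, about 12.7 bits, matching the above.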

I'm also testing two current ideas. The first is "+1", which goes something like this. Say I'm working with 3 bits (8 values); the multiplier is then always one larger than the choices, because the extra nth value is a breakout protocol, followed by a number telling me what the repeat is. So if the example goes 1,2,5,4,3,2, that covers 3*6 = 18 bits. The maximum number is then 9*8*7*6*5*4*6 = 362880 (58980h), which is about 18.5 bits (19 to store). Obviously we can't get compression there, but if the match were one longer, it would be equal or smaller in size, and could then be saved.
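A quick check of that arithmetic:

from math import log2, prod

# Six 3-bit positions with the +1 escape: multipliers 9,8,7,6,5,4,
# times 6 for the repeat indicator at the end.
states = prod([9, 8, 7, 6, 5, 4]) * 6
print(states, hex(states), log2(states))  # 362880 0x58980 ~18.47 bits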

In order to save more bits, a powerful arbitrary-precision integer library does well here (I have a homemade one, which is working).

I will say more when I have more to say.

Era

Sunday, October 21, 2007

Compression #2 - notes

Here are notes on combination counts within a larger area. The table below says, for example, that if you have 12 active (on) bits in a 32-bit area, there will be X combinations.

32:1 - 32 (5)
32:2 - 496 (9)
32:3 - 4960 (13)
32:4 - 35960 (16)
32:5 - 201376 (18)
32:6 - 906192 (20)
32:7 - 3365856 (22)
32:8 - 10518300 (24)
32:9 - 28048800 (25)
32:10 - 64512240 (26)
32:11 - 129024480 (27)
32:12 - 225792840 (28)
32:13 - 347373600 (29)
32:14 - 471435600 (29)
32:15 - 565722720 (30)
32:16 - 601080390 (30)
32:17 - 565722720 (30)
32:18 - 471435600 (29)
32:19 - 347373600 (29)
32:20 - 225792840 (28)
32:21 - 129024480 (27)
32:22 - 64512240 (26)
32:23 - 28048800 (25)
32:24 - 10518300 (24)
32:25 - 3365856 (22)
32:26 - 906192 (20)
32:27 - 201376 (18)
32:28 - 35960 (16)
32:29 - 4960 (13)
32:30 - 496 (9)
32:31 - 32 (5)
32:32 - 1 (0)

These are the values I've computed using a set of functions. "32:" is the size of the bit area, ":x" is the number of on bits, the next number is the count of combinations, and "(x)" is the number of bits needed to encode an index into them.
To get any compression, assuming you also need 5 bits to say how many of the 32 bits are on, you need the on-bit count to fall in 1-10 or 22-32; only then does 5 + index bits come in under 32 (5 + 26 = 31 works, 5 + 27 = 32 only breaks even). There might be other ways to use this, and in much larger areas it might be more useful, but the counts usually stay within 2-3 bits of the middle value.
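One concrete way to turn "which k of 32 bits are on" into a single index with exactly that many combinations is the combinatorial number system. A sketch:

from math import comb

def rank_subset(positions, n=32):
    # Map a sorted list of on-bit positions (0-based) to one
    # index in [0, C(n, k)) -- the combinatorial number system.
    return sum(comb(p, i + 1) for i, p in enumerate(sorted(positions)))

def unrank_subset(index, k, n=32):
    # Inverse: recover the on-bit positions from the index.
    positions = []
    for i in range(k, 0, -1):
        p = i - 1                     # find the largest p with C(p, i) <= index
        while comb(p + 1, i) <= index:
            p += 1
        positions.append(p)
        index -= comb(p, i)
    return sorted(positions)

bits_on = [0, 3, 7, 12, 19, 31]       # 6 of 32 bits set
r = rank_subset(bits_on)              # fits in 20 bits, since C(32,6) = 906192
assert unrank_subset(r, len(bits_on)) == bits_on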

full ref 4-64 bits.

Era

Compression #2 - Continued....

More on the topic.

I've tried a few different approaches. One is a pair encoding, which I think is a fairly unique idea. On dense data it failed to be useful, but it did successfully compress a text file to half its size.

Its format is simple: a header with the bit sizes and the number of distinct values we are working with. Then a 0 flag says we are reusing a number already mentioned, encoded at the best size. A 1 flag introduces a new number and says how many times it will be used; it is then used once right away and tracked from there.

Once all the distinct types have been referenced, the 0/1 flag is no longer needed, and values are encoded directly instead. When a value's remaining count drops to 0 it is removed, so the last few types compress better than the rest. A simplified sketch is below.
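This is my own reconstruction of the scheme as described; tokens stand in for the actual bit packing, and the header and the flag-dropping refinement are left out:

from collections import Counter

def encode(symbols):
    remaining = Counter(symbols)   # the header would carry these counts' bit sizes
    table, tokens = [], []
    for s in symbols:
        if s in table:
            tokens.append((0, table.index(s)))   # 0: reuse an introduced value
        else:
            tokens.append((1, s, remaining[s]))  # 1: new value plus its use count
            table.append(s)
        remaining[s] -= 1
        if remaining[s] == 0:
            table.remove(s)        # exhausted values drop out, so the tail
                                   # of the stream indexes a smaller table
    return tokens

print(encode("abracadabra"))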

-

Another experiment I will go into is finding the maximum number of values in a range that you can mark as used-once. It has a chance to succeed, assuming the header information can be encoded correctly and at a decent size (preferably half or less of the full combination count; for example, 7 bits gives 128 combinations, so try to say which are used in 64 bits or less).

Later on I'll give specifics on other useful information regarding multiplication to maximize saved space (bits).

An example of this is a permutation of 1-8 (24 bits, 3 bits each, all different). You can encode that without much effort in 18 bits, but using multiplication (8! = 40320 combinations) it compresses to about 15.3 bits. If not all the fields are used, though, you need to specify which ones you do and don't use: 16 + 8 = 24 bits, breaking even with the original encoding and not helping (though not making it worse either).
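A rough sketch of one way to realize that multiplication for the full 1-8 case, as straight permutation ranking:

from math import factorial, log2

def perm_rank(perm):
    # Rank a permutation of 0..n-1 into a single integer in [0, n!).
    perm = list(perm)
    rank, n = 0, len(perm)
    for i in range(n):
        smaller = sum(1 for x in perm[i + 1:] if x < perm[i])
        rank += smaller * factorial(n - 1 - i)
    return rank

def perm_unrank(rank, n):
    # Inverse: rebuild the permutation from its rank.
    pool = list(range(n))
    out = []
    for i in range(n - 1, -1, -1):
        digit, rank = divmod(rank, factorial(i))
        out.append(pool.pop(digit))
    return out

p = [4, 0, 6, 2, 7, 1, 5, 3]
r = perm_rank(p)                      # 0 <= r < 40320
assert perm_unrank(r, 8) == p
print(f"{r} in {log2(factorial(8)):.1f} bits vs 24 raw")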

I'll get back with more thoughts and details later.

Era

Tuesday, October 16, 2007

Compression #2 - Continued

Continuing from my previous post, I'll describe the exact details of the universal code that I'm experimenting with for this compression.

The gamma coding easily follows this pattern. Bold is the ending bit, italic is the 'continue' bit, and normal is the rest of the code.

0 - 0
1 - 10 (a leading 1 is already assumed at the highest point)
2 - 1100
3 - 1110
4 - 110100
5 - 110110

The second encoding uses more bits per bit reference, giving higher compression, but loses the first-bit saving since it works in reverse instead. For example, to encode 1101 you would write it as 1011 and it would be reversed on decode.

0 - 0
1 - 1100
2 - 1010
3 - 1110
4 - 1001100
5 - 1101100

Other codes are being considered, but will be tested and experimented with later. However, the tests so far do suggest that if I can compress the header information (the ints used) correctly, the actual data could go from 1024 bytes to 300 bytes, meaning the header has to fit into roughly 700 bytes or less to come out ahead. Real fun.
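For comparison, here is the textbook Elias gamma code from the Wikipedia article; my variants above rearrange its bits:

def elias_gamma(n):
    # Standard Elias gamma (n >= 1): a unary run of zeros giving the
    # length, then the value's binary digits (leading 1 included).
    b = bin(n)[2:]
    return "0" * (len(b) - 1) + b

for n in range(1, 7):
    print(n, elias_gamma(n))
# 1 -> 1, 2 -> 010, 3 -> 011, 4 -> 00100, 5 -> 00101, 6 -> 00110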

Era

Monday, October 15, 2007

Compression #2

I'm making a second attempt, using the following tests, and hopefully will get something to work.

1) I will use a variant of the Elias gamma coding, http://en.wikipedia.org/wiki/Elias_gamma_coding , but instead of using it the way it is referenced on Wikipedia, it makes two changes. I specify how many bits will be used for each leading true bit. (The 1 can be condensed further, since the first true bit is assumed to be 1; you wouldn't reference a 0, so the lowest numbers save the most space.)

2) Using a single bit as a separator, I can reference all combinations (at a 9-13 bit size) and pick out the ones that are only used once. The compressor can then drop those numbers and reduce the size later required to compress the rest. The second pass would either have to a) give the counts of all the values, or b) assume all other values never go lower, which could be bad. (When the used fraction exceeds 50%, it references the ones that are not present instead.)

In cases where, no matter how it's compressed, it exceeds the number of bits for the combinations (10 bits is 1024), it can fall back to an alternate form that references everything as a flat on/off bitmap, though no numbering index would be available then.

3) Uses a variant that picks the highest compression for a combination of bits (not weighted or self-modifying like gzip, which is really good). Say there's a maximum of 6 numbers: it would determine the lowest number of bits that represents all of the numbers equally. Example:

0-5 (6 combinations)

type 1: 000, 001, 010, 011, 10, 11 (4*3 = 12, plus 2*2 = 4: 16 bits total)
type 2: 00, 01, 100, 101, 110, 1110, 1111 (2*2 = 4, plus 3*3 = 9, plus 2*4 = 8: 21 bits total)

The best layout shifts around: for smaller alphabets type 1 wins, while type 2 works for larger ones, which gives extra savings over a larger number of bits. Once decided, it is always used. (Type 1 is essentially truncated binary coding; see the sketch below.)
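A sketch of truncated binary coding, which produces the same codeword lengths as type 1 for n = 6, up to relabeling:

from math import floor, log2

def truncated_binary(x, n):
    # Encode x in [0, n) with the minimal fixed/plus-one-bit code:
    # for n = 6 this gives two 2-bit and four 3-bit codewords.
    k = floor(log2(n))
    u = 2 ** (k + 1) - n              # how many short (k-bit) codewords
    if x < u:
        return format(x, f"0{k}b")
    return format(x + u, f"0{k + 1}b")

print([truncated_binary(x, 6) for x in range(6)])
# ['00', '01', '100', '101', '110', '111'] -> 2*2 + 4*3 = 16 bits total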

Some tests have shown promise, but the compression was done on a file where gzipping could still save an additional 300 bytes, so it wasn't fully dense. It needs to be re-tested on truly random, incompressible data.

Monday, October 8, 2007

Compression?

I've attempted an experimental compression idea, although it doesn't do very well.

Here's an overview of my thoughts. I would take a stream of dense encrypted data from my OTP pad, find all the matches (especially large ones), and use a structure to handle them. It goes through several rotations before it just comes up kinda dead. The format I was considering, although it seems to be failing, is the following:

1 bit - 1 for encoded, 0 for skip and come back later in a different window
(the match length is not known ahead of time, so it's kinda... eh?)
12 bits - lookback into the table
2 bits - match length, 2-5 bytes
= 15 bits.

Each 0 on the leading flag increases the bit size by one, so after 6 rotations a match costs 21 bits, just shy of 3 bytes.
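A back-of-envelope cost model of this format (my own sketch; the window and rotation handling is simplified):

def match_cost_bits(rotation):
    # Skip flags from earlier rotations, the 1 flag bit,
    # 12 bits of lookback, 2 bits of length.
    return rotation + 1 + 12 + 2

def worthwhile(rotation, match_len_bytes):
    # A match only pays off if it beats storing the bytes raw.
    return match_cost_bits(rotation) < 8 * match_len_bytes

print(match_cost_bits(0), match_cost_bits(6))  # 15 and 21 bits
print(worthwhile(6, 2))  # False: by rotation 6 a 2-byte match costs more than it saves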

In small chunks (4K and less), about half the numbers are 1 byte, 100-200 are 2 bytes, and the rest are 3-4 bytes (1-3 overall).

To test its usefulness I've attempted a bit count, using its best compression output, to estimate how big the result would be. Where it says 8000 bits original, it comes back 10,000-12,000 bits. Occasionally I get it closer to or equal to the original value, but at that point there's no compression, or it is simply far too much work (16 MB of data for a 24-26 bit data structure, hoping for 3-6 byte matches). It doesn't seem to work.

I will consider different forms of compression to add to the cipher, but dense data is just that: the probability of 0 and 1 is 50% each, equally. No matter how many levels deep it goes, it stays about even, and when tested on very large and differently sized samples, the counts come out almost exactly where they should be.

Era