Difference between revisions of "Kosinski compression"

From Sega Retro

m (Fixing links)
(Tidying up Compression Theory and correcting a few errors, and adding a small note on KosM)
Line 8: Line 8:
 
===Basic Format===
 
===Basic Format===
  
The Kosinski compression scheme is very powerful and includes both run-length encoding (RLE) and also relative backreferences to optimally compress repeating patterns.
+
The Kosinski compression scheme is very powerful and includes support for both uncompressed data and two types of [[wikipedia:run-length encoding|run-length encoding]] to optimally compress repeating patterns.
  
 
The compressed data follows the following format:
 
The compressed data follows the following format:
Line 16: Line 16:
 
</pre>
 
</pre>
  
I will call the A field the 'description' field. This is always 2 bytes in length, starting at the beginning of the compression, and is followed by the B field, which i'll call the 'data' field. This goes on for as long as necessary to cover all the data described in the 'description' section. After that, the pattern repeats until the 'end of compression' sequence is given, which i'll explain at the end.
+
The A field is referred to as the description field. This is always 2 bytes in length, starting at the beginning of the compressed data, and is followed by the B field, which is referred to as the data field. This goes on for as long as is necessary to cover all the data described in the description field. After that, the pattern repeats until the end of compression sequence is is encountered.
  
 
===Description Field===
 
===Description Field===
  
The 'description' field is made up of 16 bits in little endian bit (not byte) order, which means that the bytes are in the correct order, but each byte is read backwards.  So for example, if the description field is:
+
The description field is made up of 16 bits in little endian bit (not byte) order, which means that the bytes are in the correct order, but each byte is read backwards.  So for example, if the description field is:
  
 
<pre>FF FE</pre>
 
<pre>FF FE</pre>
  
then the description field, in bits, in the order we care about, will be:
+
then the description field, in bits, in the correct order for interpretation, will be:
  
 
<pre>[1111 1111] [0111 1111]</pre>
 
<pre>[1111 1111] [0111 1111]</pre>
  
Now, starting left to right, we interpret each of the bits in the following way:
+
Now, going from left to right, each bit is interpreted in the following way:
  
* If the bit is a 1, The corresponding data object in the data field is uncompressed.  This only uses 1 bit to denote.
+
* If the bit is a 1, it indicates uncompressed data.
* If the bit is a 0, followed by a 1, the corresponding data object uses backreferencing.  This only uses 2 bits to denote.
+
* If the bit is a 0, followed by a 1, it indicates separate run-length encoding.
* If the bit is a 0, followed by a 0, the corresponding data object is run-length encoded. The two bits immediately after the '00' are also used to denote how many times the RLE is copied.  This uses 4 bits to denote.
+
* If the bit is a 0, followed by a 0, it indicates inline run-length encoding. The two bits following this give the copy count - 2.
* If the end of the description field occurs, start reading a new description field from the next 2 bytes of the data.
+
 
* If the end of compression sequence occurs and there is still description field bits left, ignore them.
+
If the end of the description field occurs, a new description field is read from the next 2 bytes of the data. If the end of compression sequence occurs and there are still description field bits left, they are ignored.
  
 
===Uncompressed Data===
 
===Uncompressed Data===
  
Bytes with a corresponding description of '1' indicate that the corresponding byte in the data field is already uncompressed.  This only refers to a single byte. So for example, the following block of kosinski compression:
+
If a 1 is read in the description field, the corresponding byte in the data field is already uncompressed and simply copied to the destination as-is. So for example, the following block of Kosinski-compressed data:
  
 
<pre>FF FF 00 01 02 03 04 05 06 07 08 09 0A 0B 0C 0D 0E 0F</pre>
 
<pre>FF FF 00 01 02 03 04 05 06 07 08 09 0A 0B 0C 0D 0E 0F</pre>
Line 46: Line 46:
 
<pre>00 01 02 03 04 05 06 07 08 09 0A 0B 0C 0D 0E 0F</pre>
 
<pre>00 01 02 03 04 05 06 07 08 09 0A 0B 0C 0D 0E 0F</pre>
  
because all bits in the description field are 1, indicating that each byte in the data field (00 through 0F) are already uncompressed.
+
because all bits in the description field are 1, indicating that each byte in the data field (00 through 0F) is already uncompressed.
 
 
===Run Length Encoding===
 
 
 
Run length encoding works when '00 XX' is found in the description field.  There are two parameters to the encode,  the first is the number of times the data is repeated, which is encoded in the 'XX' found in the description field.  This number (possible values are 0 through 3) is increased by 2 to get the number of bytes used to repeat the pattern.  Therefore:
 
  
<pre>
+
===Inline Run-Length Encoding===
00 00 - RLE repeated for 2 bytes
 
00 01 - RLE repeated for 3 bytes
 
00 10 - RLE repeated for 4 bytes
 
00 11 - RLE repeated for 5 bytes
 
</pre>
 
  
The other parameter is the data object, which is only a single byte.  Rather than this byte describing the data to be repeated, it actually describes the 'previous number of bytes' to repeat, encoded as a negative number.  Therefore:
+
Inline run-length encoding is indicated by 00 XX in the description field. The XX is incremented by 2 to get the number of bytes to copy, and the corresponding byte in the data field is added to -256 to get the offset from the current position in the uncompressed stream of the byte to start copying from. So for example (including the description field):
 
 
<pre>
 
Data byte FF (-1) - Repeat the last 1 byte
 
Data byte FE (-2) - Repeat the last 2 bytes
 
Data byte FD (-3) - Repeat the last 3 bytes
 
Data byte FC (-4) - Repeat the last 4 bytes
 
Data byte FB (-5) - Repeat the last 5 bytes
 
</pre>
 
 
 
So for example (including the description field):
 
  
 
<pre>
 
<pre>
Line 85: Line 66:
 
</pre>
 
</pre>
  
starts with an uncompressed [1] value (25), then a RLE value repeated for 3 bytes [00 01] of FF, which means the 1 previous byte(s) .. which is 25 repeated 3 times, followed by 11 more uncompressed [1] bytes (01 through 0C).
+
starts with an uncompressed [1] value (25), then an inline RLE value repeated for 3 bytes [00 01] with an offset of -$100 + $FF = -1, which means the 1 previous byte (which is 25) is repeated 3 times, followed by 11 more uncompressed [1] bytes (01 through 0C).
  
Note: Keep in mind that the bitfield after 01 in the description field is NOT how many times the pattern is repeated, but for how many ''bytes'' the pattern is repeated. Using 11 (repeat for 5 bytes) and a value of FB (copy the last 5 bytes) only repeats the pattern once. Note that anything farther back than the last 5 bytes cannot be repeated entirely using this method because the largest number of bytes you can repeat is 5. Anything that needs to be repeated that is larger than 5 bytes should use backreferencing.
+
'''Note:''' Keep in mind that the repeat count in the description field is ''not'' how many times the pattern is repeated, but for how many bytes the pattern is repeated. Using a repeat count of 11 (3 + 2 = 5 bytes) and an offset of $FB (-$100 + $FB = -5) only repeats the pattern once. Note that anything farther back than the last 5 bytes cannot be repeated entirely using this method because the largest possible repeat count is 5. Anything that needs to be repeated for longer than 5 bytes needs to use separate run-length encoding.
  
===Relative Backreferencing===
+
===Separate Run-Length Encoding===
  
Backreferencing is marked with in the description field with a '01' and refers to a data object 2 or 3 bytes wide, depending on the second byte. The format of this data object is:
+
Separate run-length encoding is indicated in the description field with a 01 and has 2 or 3 corresponding bytes in the data field, depending on the second byte. The format of these bytes is:
  
 
<pre>
 
<pre>
 
[LLLL LLLL] [HHHH HCCC]
 
[LLLL LLLL] [HHHH HCCC]
 
</pre>
 
</pre>
or
+
if there are two bytes or
 
<pre>
 
<pre>
 
[LLLL LLLL] [HHHH H000] [CCCC CCCC]
 
[LLLL LLLL] [HHHH H000] [CCCC CCCC]
 
</pre>
 
</pre>
the second format is used when the C bits of the second byte are all unset.  The values of L and H are negative.
+
if there are three bytes. The three byte format is used when the C bits of the second byte are all unset.
 
 
What this type of data object means is that the output should be copied from a location ''x'' bytes back, and copied for ''c + 2'' bytes.  9 bytes of data can be copied using the 2 byte format, and 257 bytes using the 3 byte format.
 
  
The offset to the data to be copied, ''x'' is calculated using:  <pre>x = L - (256 * (H + 1))</pre>
+
In the two byte format, -8192 + HHHHH * 256 + LLLLLLLL gives the offset from the current position in the uncompressed stream of the byte to start copying from, and CCC + 2 gives the number of bytes to copy. In the three byte format, the offset is calculated in the same way but the copy count is given by CCCCCCCC + 1. Therefore the maximum number of bytes that can be copied is 9 in the 2 byte format and 256 in the 3 byte format.
  
 
So for example, in:
 
So for example, in:
Line 110: Line 89:
 
F0 F8 40 = [1111 0000] [1111 1000] [0100 0000]
 
F0 F8 40 = [1111 0000] [1111 1000] [0100 0000]
 
</pre>
 
</pre>
''x'' would be -15, or 15 bytes backward since ''-15 - (256 * (-1 + 1)) = -15''.  Also in this example, the data at that offset would be repeated for 66 bytes (''c+2'' where ''c'' is 64 [40]).
+
the offset would be -16, or 16 bytes backward since -8192 + 31 * 256 + 240 = -16.  Also in this example, 65 bytes would be copied (CCCCCCCC + 1 where the value of the C bits is 64 [40]).
 +
 
 +
Two special cases occur when the three byte format is used and the third byte is either 0 or 1, and are described below.
  
=== End of Compression Sequence ===
+
====End of Compression sequence====
  
The end of compression sequence is a 3-byte format backreference where L is 0, H is -1 (all bits set), and C is 0:
+
The end of the compressed data is indicated by a three byte data field corresponding to separate run-length encoding in which the third byte is 0.
  
<pre>
+
====Read Next Description sequence====
00 F8 00
 
</pre>
 
  
Obviously this sequence would not come up elsewhere in the compressed data because it is literally a backreference to itself.
+
Alternatively, if the third byte is 1 rather than 0 it indicates that the current description is to be discarded and a new one to be read.
  
 
=== One Final Example ===
 
=== One Final Example ===
Line 136: Line 115:
 
   First 14 data bytes are uncompressed (1111 1111 1111 11)
 
   First 14 data bytes are uncompressed (1111 1111 1111 11)
  
       Output: 3B C4 44 54 33 33 5B 2D 5C 44 5C C4 C5
+
       Output: 54 3B C4 44 54 33 33 5B 2D 5C 44 5C C4 C5
  
   The next object is RLE (00) but we need more description data, so we start the next chunk:
+
   The next description (00) indicates inline RLE but more description data is needed, so the next description field is read:
 
   FC 15 translates to [0011 1111] [1010 1000]
 
   FC 15 translates to [0011 1111] [1010 1000]
  
   The RLE goes for 2 bytes (00) and is copied from the previous 2 bytes (FE from data):
+
   2 bytes (00) are to be copied and the offset is -$100 + $FE = -2:
  
       Output: 3B C4 44 54 33 33 5B 2D 5C 44 5C C4 C5 C4 C5
+
       Output: 54 3B C4 44 54 33 33 5B 2D 5C 44 5C C4 C5 C4 C5
  
 
   The next 7 data bytes are uncompressed (11 1111 1)
 
   The next 7 data bytes are uncompressed (11 1111 1)
  
       Output: 3B C4 44 54 33 33 5B 2D 5C 44 5C C4 C5 C4 C5 C3 44 78 88 98 44 30
+
       Output: 54 3B C4 44 54 33 33 5B 2D 5C 44 5C C4 C5 C4 C5 C3 44 78 88 98 44 30
  
   The next data bytes are a backreference (01).  These bytes (FF FF) translate to a backreference
+
   The next description (01) indicates separate RLEThe corresponding data bytes (FF FF) translate to an offset of
   of ''x = -1'' for ''c+2 = 9'' (the last 1 output byte repeated for 9 output bytes).
+
   -8192 + 31 * 256 + 255 = -1 and a copy count of 7 + 2 = 9 (i.e. the last 1 output byte repeated for 9 output bytes).
  
       Output: 3B C4 44 54 33 33 5B 2D 5C 44 5C C4 C5 C4 C5 C3 44 78 88 98 44 30
+
       Output: 54 3B C4 44 54 33 33 5B 2D 5C 44 5C C4 C5 C4 C5 C3 44 78 88 98 44 30
 
               30 30 30 30 30 30 30 30 30
 
               30 30 30 30 30 30 30 30 30
  
   The next data bytes are also a backreference (01). These bytes (00 F8 00) translate to an end
+
   The next description also indicates separate RLE. The three byte format is used and the third byte is 0, so
   of compression sequence.  The rest of the description field is ignored and the compression is
+
   the rest of the description field is ignored and the compression is complete.
  complete.
 
 
    
 
    
       Output: 3B C4 44 54 33 33 5B 2D 5C 44 5C C4 C5 C4 C5 C3 44 78 88 98 44 30
+
       Output: 54 3B C4 44 54 33 33 5B 2D 5C 44 5C C4 C5 C4 C5 C3 44 78 88 98 44 30
 
               30 30 30 30 30 30 30 30 30
 
               30 30 30 30 30 30 30 30 30
  
Line 165: Line 143:
  
 
==Kosinski Moduled compression==
 
==Kosinski Moduled compression==
 +
 
Kosinski Moduled compression (KosM compression) is a variant of standard Kosinski compression, used by ''[[Sonic 3 & Knuckles]]''. KosM compressed data starts off with a 2-byte header. The upper nibble is the total number of modules minus 1, and the lower three nibbles are the uncompressed size of the last module in bytes (all other modules have a fixed uncompressed size of $1000 bytes). As a result, the entire header word can be read as the total uncompressed size of the data. The only special case occurs when the header is $A000, in which case the game reads it as $8000. After the header are the actual modules. Each module can be read as standard Kosinski-compressed data, and is padded out to a size which is a multiple of $10 bytes.
 
Kosinski Moduled compression (KosM compression) is a variant of standard Kosinski compression, used by ''[[Sonic 3 & Knuckles]]''. KosM compressed data starts off with a 2-byte header. The upper nibble is the total number of modules minus 1, and the lower three nibbles are the uncompressed size of the last module in bytes (all other modules have a fixed uncompressed size of $1000 bytes). As a result, the entire header word can be read as the total uncompressed size of the data. The only special case occurs when the header is $A000, in which case the game reads it as $8000. After the header are the actual modules. Each module can be read as standard Kosinski-compressed data, and is padded out to a size which is a multiple of $10 bytes.
 +
 +
KosM can be thought of as an archival format for multiple pieces of Kosinski-compressed data, and is used as an alternative to [[Nemesis compression]] for compressing art. The maximum uncompressed module size was kept as $1000 bytes because the intermediate RAM buffer to which the data is decompressed before being DMAed to VRAM has a size of $1000 bytes.
  
 
[[Category:Data Formats]]
 
[[Category:Data Formats]]

Revision as of 15:43, 21 June 2010

Kosinski compression is the name given to a compression format used in Sonic games for the Sega Genesis/Mega Drive. The name first showed up on SHaC and was to pay tribute to the person who cracked the format, Brett Kosinski. It is a variation of the LZSS algorithm.

Kosinski compression is used to compress the following data types:

Compression Theory

Basic Format

The Kosinski compression scheme is very powerful and includes support for both uncompressed data and two types of run-length encoding to optimally compress repeating patterns.

The compressed data follows the following format:

AA AA BB BB .. AA AA BB BB ..

The A field is referred to as the description field. This is always 2 bytes in length, starting at the beginning of the compressed data, and is followed by the B field, which is referred to as the data field. This goes on for as long as is necessary to cover all the data described in the description field. After that, the pattern repeats until the end of compression sequence is is encountered.

Description Field

The description field is made up of 16 bits in little endian bit (not byte) order, which means that the bytes are in the correct order, but each byte is read backwards. So for example, if the description field is:

FF FE

then the description field, in bits, in the correct order for interpretation, will be:

[1111 1111] [0111 1111]

Now, going from left to right, each bit is interpreted in the following way:

  • If the bit is a 1, it indicates uncompressed data.
  • If the bit is a 0, followed by a 1, it indicates separate run-length encoding.
  • If the bit is a 0, followed by a 0, it indicates inline run-length encoding. The two bits following this give the copy count - 2.

If the end of the description field occurs, a new description field is read from the next 2 bytes of the data. If the end of compression sequence occurs and there are still description field bits left, they are ignored.

Uncompressed Data

If a 1 is read in the description field, the corresponding byte in the data field is already uncompressed and simply copied to the destination as-is. So for example, the following block of Kosinski-compressed data:

FF FF 00 01 02 03 04 05 06 07 08 09 0A 0B 0C 0D 0E 0F

simply produces the following output:

00 01 02 03 04 05 06 07 08 09 0A 0B 0C 0D 0E 0F

because all bits in the description field are 1, indicating that each byte in the data field (00 through 0F) is already uncompressed.

Inline Run-Length Encoding

Inline run-length encoding is indicated by 00 XX in the description field. The XX is incremented by 2 to get the number of bytes to copy, and the corresponding byte in the data field is added to -256 to get the offset from the current position in the uncompressed stream of the byte to start copying from. So for example (including the description field):

F1 FF 25 FF 01 02 03 04 05 06 07 08 09 0A 0B 0C

outputs to:

25 25 25 25 01 02 03 04 05 06 07 08 09 0A 0B 0C

because the description field

F1 FF = [1000 1111] [1111 1111]

starts with an uncompressed [1] value (25), then an inline RLE value repeated for 3 bytes [00 01] with an offset of -$100 + $FF = -1, which means the 1 previous byte (which is 25) is repeated 3 times, followed by 11 more uncompressed [1] bytes (01 through 0C).

Note: Keep in mind that the repeat count in the description field is not how many times the pattern is repeated, but for how many bytes the pattern is repeated. Using a repeat count of 11 (3 + 2 = 5 bytes) and an offset of $FB (-$100 + $FB = -5) only repeats the pattern once. Note that anything farther back than the last 5 bytes cannot be repeated entirely using this method because the largest possible repeat count is 5. Anything that needs to be repeated for longer than 5 bytes needs to use separate run-length encoding.

Separate Run-Length Encoding

Separate run-length encoding is indicated in the description field with a 01 and has 2 or 3 corresponding bytes in the data field, depending on the second byte. The format of these bytes is:

[LLLL LLLL] [HHHH HCCC]

if there are two bytes or

[LLLL LLLL] [HHHH H000] [CCCC CCCC]

if there are three bytes. The three byte format is used when the C bits of the second byte are all unset.

In the two byte format, -8192 + HHHHH * 256 + LLLLLLLL gives the offset from the current position in the uncompressed stream of the byte to start copying from, and CCC + 2 gives the number of bytes to copy. In the three byte format, the offset is calculated in the same way but the copy count is given by CCCCCCCC + 1. Therefore the maximum number of bytes that can be copied is 9 in the 2 byte format and 256 in the 3 byte format.

So for example, in:

F0 F8 40 = [1111 0000] [1111 1000] [0100 0000]

the offset would be -16, or 16 bytes backward since -8192 + 31 * 256 + 240 = -16. Also in this example, 65 bytes would be copied (CCCCCCCC + 1 where the value of the C bits is 64 [40]).

Two special cases occur when the three byte format is used and the third byte is either 0 or 1, and are described below.

End of Compression sequence

The end of the compressed data is indicated by a three byte data field corresponding to separate run-length encoding in which the third byte is 0.

Read Next Description sequence

Alternatively, if the third byte is 1 rather than 0 it indicates that the current description is to be discarded and a new one to be read.

One Final Example


Compressed Data:

  FF 3F 54 3B C4 44 54 33 33 5B 2D 5C 44 5C C4 C5
  FC 15 FE C3 44 78 88 98 44 30 FF FF 00 F8 00

Decompression process:

  FF 3F translates to [1111 1111] [1111 1100]
  First 14 data bytes are uncompressed (1111 1111 1111 11)

      Output: 54 3B C4 44 54 33 33 5B 2D 5C 44 5C C4 C5

  The next description (00) indicates inline RLE but more description data is needed, so the next description field is read:
  FC 15 translates to [0011 1111] [1010 1000]

  2 bytes (00) are to be copied and the offset is -$100 + $FE = -2:

      Output: 54 3B C4 44 54 33 33 5B 2D 5C 44 5C C4 C5 C4 C5

  The next 7 data bytes are uncompressed (11 1111 1)

      Output: 54 3B C4 44 54 33 33 5B 2D 5C 44 5C C4 C5 C4 C5 C3 44 78 88 98 44 30

  The next description (01) indicates separate RLE.  The corresponding data bytes (FF FF) translate to an offset of 
  -8192 + 31 * 256 + 255 = -1 and a copy count of 7 + 2 = 9 (i.e. the last 1 output byte repeated for 9 output bytes).

      Output: 54 3B C4 44 54 33 33 5B 2D 5C 44 5C C4 C5 C4 C5 C3 44 78 88 98 44 30
              30 30 30 30 30 30 30 30 30

  The next description also indicates separate RLE. The three byte format is used and the third byte is 0, so
  the rest of the description field is ignored and the compression is complete.
  
      Output: 54 3B C4 44 54 33 33 5B 2D 5C 44 5C C4 C5 C4 C5 C3 44 78 88 98 44 30
              30 30 30 30 30 30 30 30 30

Kosinski Moduled compression

Kosinski Moduled compression (KosM compression) is a variant of standard Kosinski compression, used by Sonic 3 & Knuckles. KosM compressed data starts off with a 2-byte header. The upper nibble is the total number of modules minus 1, and the lower three nibbles are the uncompressed size of the last module in bytes (all other modules have a fixed uncompressed size of $1000 bytes). As a result, the entire header word can be read as the total uncompressed size of the data. The only special case occurs when the header is $A000, in which case the game reads it as $8000. After the header are the actual modules. Each module can be read as standard Kosinski-compressed data, and is padded out to a size which is a multiple of $10 bytes.

KosM can be thought of as an archival format for multiple pieces of Kosinski-compressed data, and is used as an alternative to Nemesis compression for compressing art. The maximum uncompressed module size was kept as $1000 bytes because the intermediate RAM buffer to which the data is decompressed before being DMAed to VRAM has a size of $1000 bytes.