Text compression: Difference between revisions

From NESdev Wiki
Jump to navigationJump to search
(→‎Dictionary compression: maybe structure like this instead)
Line 1: Line 1:
'''Text compression''' refers to techniques that allow fitting more text data into a smaller space.
'''Text compression''' refers to techniques that allow fitting more text data into a smaller space.


== Dictionary compression ==
== Dictionary compression and DTE ==


Dictionary compression is a technique where part of the character set is
Dictionary compression is a technique where part of the character set is
Line 11: Line 11:
Sometimes the compression may be applied recursively, where the dictionary string
Sometimes the compression may be applied recursively, where the dictionary string
itself may contain references to other dictionary strings.
itself may contain references to other dictionary strings.
Dual-tile encoding, or DTE for short, is a special case of dictionary compression. It is also known as [https://en.wikipedia.org/wiki/Byte_pair_encoding byte-pair encoding], or digram coding.
In this case, the dictionary strings are all two bytes long.


Example implementations:
Example implementations:
=== Simon's Quest (NES) ===


In ''Simon's Quest'' (NES) (all versions), the font is 256 characters long, although only a small part of that is actual text symbols used in dialog text. The following bytes have a special meaning:
In ''Simon's Quest'' (NES) (all versions), the font is 256 characters long, although only a small part of that is actual text symbols used in dialog text. The following bytes have a special meaning:
Line 20: Line 25:
* FC = Denotes the end of a substring. Restores the string pointer saved by opcode FD.
* FC = Denotes the end of a substring. Restores the string pointer saved by opcode FD.
* 00..FB = Print this character.
* 00..FB = Print this character.
There is room for only one saved string pointer, so substrings can not refer to other substrings, unless it is to terminate the entire string. The substring mechanism is used in the Japanese diskette version of the game. The cartridge versions of the game still contain support for this mechanism, even though the actual text data does not utilize it.
Dictionary strings are arbitrary length. There is room for only one saved string pointer, so substrings can not refer to other substrings, unless it is to terminate the entire string. The substring mechanism is used in the Japanese diskette version of the game. The cartridge versions of the game also support this mechanism, even though the actual text data does not utilize it.
 
=== Chrono Trigger (SNES) ===


In ''Chrono Trigger'' (SNES) (all versions), the font is 768 characters long, but the strings are 8-bit. The following special bytes are defined:
In ''Chrono Trigger'' (SNES) (all versions), the font is 768 characters long, but the strings are 8-bit. The following special bytes are defined:
Line 29: Line 36:
* 21..xx = Reference to a dictionary string. xx is a compile-time constant that determines the length of the dictionary. This number is 0x9F in the USA version and 0x3F in the Japanese version.
* 21..xx = Reference to a dictionary string. xx is a compile-time constant that determines the length of the dictionary. This number is 0x9F in the USA version and 0x3F in the Japanese version.
* xx+1..FF = Print this character.
* xx+1..FF = Print this character.
Dictionary strings are not applied recursively. The dictionary strings are stored in length-string format without an end delimiter.
Dictionary strings have a length limit of 255 bytes. They are not applied recursively. The dictionary strings are stored in length-data format without an end delimiter.
 
=== Dual-tile encoding ===
 
Dual-tile encoding, or DTE for short, is a special case of dictionary compression. It is also known as [[https://en.wikipedia.org/wiki/Byte_pair_encoding byte-pair encoding]], or digram coding.
In this case, the dictionary strings are all two bytes long.


== Bitrate reduction methods ==
== Bitrate reduction methods ==

Revision as of 22:44, 28 April 2016

Text compression refers to techniques that allow fitting more text data into a smaller space.

Dictionary compression and DTE

Dictionary compression is a technique where part of the character set is reserved to denote references to a "dictionary". If the byte falls into this range, a string is copied from the dictionary rather than the byte being copied verbatim. As this compression technique does not require knowledge of past data, it is very easy to implement on machines having little memory like the NES.

Sometimes the compression may be applied recursively, where the dictionary string itself may contain references to other dictionary strings.

Dual-tile encoding, or DTE for short, is a special case of dictionary compression. It is also known as byte-pair encoding, or digram coding. In this case, the dictionary strings are all two bytes long.

Example implementations:

Simon's Quest (NES)

In Simon's Quest (NES) (all versions), the font is 256 characters long, although only a small part of that is actual text symbols used in dialog text. The following bytes have a special meaning:

  • FF = End of text rendering.
  • FE = Newline.
  • FD = Save current string pointer. The next byte determines the string number; all consecutive characters will be read from that string rather than the current one.
  • FC = Denotes the end of a substring. Restores the string pointer saved by opcode FD.
  • 00..FB = Print this character.

Dictionary strings are arbitrary length. There is room for only one saved string pointer, so substrings can not refer to other substrings, unless it is to terminate the entire string. The substring mechanism is used in the Japanese diskette version of the game. The cartridge versions of the game also support this mechanism, even though the actual text data does not utilize it.

Chrono Trigger (SNES)

In Chrono Trigger (SNES) (all versions), the font is 768 characters long, but the strings are 8-bit. The following special bytes are defined:

  • 00 = End of string.
  • 01 = Read next byte; print character (byte+0x100).
  • 02 = Read next byte; print character (byte+0x200).
  • 03..20 = Various text effects, references to item tables, and references to party member names.
  • 21..xx = Reference to a dictionary string. xx is a compile-time constant that determines the length of the dictionary. This number is 0x9F in the USA version and 0x3F in the Japanese version.
  • xx+1..FF = Print this character.

Dictionary strings have a length limit of 255 bytes. They are not applied recursively. The dictionary strings are stored in length-data format without an end delimiter.

Bitrate reduction methods

Fixed-bit encoding

When the character set is small, such as 64 characters at most, strings could be encoded in a bitstream that packs 6 bits per character rather than 8 bits per character. This results in 20 % reduction of data size.

Variable-bit encodings

Huffman encoding

LZ based methods