Text compression
Text compression refers to techniques that allow fitting more text data into a smaller space.
Dictionary compression and DTE
Dictionary compression is a technique where part of the character set is reserved to denote references to a "dictionary". If the byte falls into this range, a string is copied from the dictionary rather than the byte being copied verbatim. As this compression technique does not require knowledge of past data, it is very easy to implement on machines having little memory like the NES.
Sometimes the compression may be applied recursively, where the dictionary string itself may contain references to other dictionary strings.
Dual-tile encoding, or DTE for short, is a special case of dictionary compression. It is also known as byte-pair encoding, or digram coding. In this case, the dictionary strings are all two bytes long.
Example implementations:
Simon's Quest (NES)
In Simon's Quest (NES) (all versions), the font is 256 characters long, although only a small part of that is actual text symbols used in dialog text.
Value | Meaning |
---|---|
FF | End of text. If used in a substring, will not return to the main string. A string that ends in FD can omit the trailing FF, if the substring ends in FF. |
FE | Newline. |
FD | Save current string pointer. The next byte determines the string number; this number will be used to calculate the new string pointer. Interpreting will continue from that address. |
FC | Denotes the end of a substring. Restores the string pointer saved by opcode FD. |
00–FB | Print this character. |
Dictionary strings are arbitrary length. There is room for only one saved string pointer, so substrings can not refer to other substrings, unless it is to terminate the entire string. The substring mechanism is used in the Japanese diskette version of the game. The cartridge versions of the game also support this mechanism, even though the actual text data does not utilize it.
Chrono Trigger (SNES)
In Chrono Trigger (SNES) (all versions), the font is 768 characters long, but the strings are 8-bit.
Value | Meaning |
---|---|
00 | End of string. |
01 | Read next byte; print character byte+0x100. |
02 | Read next byte; print character byte+0x200. |
03–20 | Various text effects, references to item tables, and references to party member names. |
21–xx | Reference to a dictionary string. xx is a compile-time constant that determines the length of the dictionary. This number is 0x9F in the USA version and 0x3F in the Japanese version. |
xx+1–FF | Print this character. |
Dictionary strings have a length limit of 255 bytes. They are not applied recursively. The dictionary strings are stored in length-data format without an end delimiter.
Bisqwit's ppu_read_buffer test (NES)
In Bisqwit's emulator test ROM ppu_read_buffer ([1]) the font is 128 symbols long, and in addition to the alphabet it includes some pre-rendered substrings in variable-width font. The ROM uses a combination of DTE and dictionary for text compression. For DTE, the compression is applied recursively, in both DTE bytes. Additionally the string may contain a jump label to another string, permitting the reuse of the same string in different test descriptions.
Value | Meaning |
---|---|
00 | End of string. |
01–80 | Print this symbol. After printing, if the stack pointer is wrong, pop a byte and interpret it rather than loading the next byte from the string. |
81–FE | Push DTE_TABLE1[n-0x81] in stack, and interpret DTE_TABLE0[n-0x81]. |
FF | A 16-bit word follows. This word is loaded as a new string pointer and loading continues from that address. The old string pointer is not saved. |
Bitrate reduction methods
Fixed-bit encoding
When the character set is small, such as 64 characters at most, strings could be encoded in a bitstream that packs 6 bits per character rather than 8 bits per character. This results in 20 % reduction of data size.
Main article: Fixed Bit Length Encoding
Variable-bit encodings
In variable-bit encodings different symbols are stored in different number of bits.
Example code in C for storing and retrieving variable-bit length integers:
#include <limits.h> // CHAR_BIT void PutBits(void* memory, unsigned* bitpos, unsigned long V, unsigned nbits) { unsigned char* buffer = (unsigned char*)(memory); while(nbits > 0) { unsigned bytepos = *bitpos/CHAR_BIT, bits_remain = CHAR_BIT-*bitpos%CHAR_BIT, bits_taken = CHAR_BIT-bits_remain; unsigned bits_to_write = std::min(nbits, bits_remain); unsigned value_mask = (1 << bits_to_write)-1; unsigned value_to_write = V & value_mask; buffer[bytepos] = (buffer[bytepos] & ~(value_mask << bits_taken)) | (value_to_write << bits_taken); V >>= bits_to_write; nbits -= bits_to_write; *bitpos += bits_to_write; } } unsigned long GetBits(const void* memory, unsigned* bitpos, unsigned nbits) { const unsigned char* buffer = (const unsigned char*)(memory); unsigned long result = 0, shift=0; while(nbits > 0) { unsigned bytepos = *bitpos/CHAR_BIT, bits_remain = CHAR_BIT-*bitpos%CHAR_BIT, bits_taken = CHAR_BIT-bits_remain; unsigned bits_to_take = std::min(nbits, bits_remain); unsigned v = (buffer[bytepos] >> bits_taken) & ((1 << bits_to_take)-1); result |= v << shift; shift += bits_to_take; nbits -= bits_to_take; *bitpos += bits_to_take; } return result; }
Ideally, you would store common symbols using few bits and infrequent symbols using more bits. Which brings us to…
Huffman coding
A special case of variable-bit encodings is Huffman coding.
Huffman coding defines the optimal coding for a given character set based on a table of frequencies that each character is used.
Arithmetic coding
Huffman coding is also a special case of arithmetic coding. In arithmetic coding, each symbol is represented by a range of binary fractions rather than a particular number of bits. As arithmetic coding was covered by a number of patents up to 1993, and it is calculation intensive, chances are no game uses it, unless part of e.g. LZMA compression, which was released in 1999.