PYC.Compression 1.0.1

.NET 9.0

dotnet add package PYC.Compression --version 1.0.1

NuGet\Install-Package PYC.Compression -Version 1.0.1

This command is intended to be used within the Package Manager Console in Visual Studio, as it uses the NuGet module's version of Install-Package.

<PackageReference Include="PYC.Compression" Version="1.0.1" />

For projects that support PackageReference, copy this XML node into the project file to reference the package.

<PackageVersion Include="PYC.Compression" Version="1.0.1" />
                    

                            Directory.Packages.props

<PackageReference Include="PYC.Compression" />
                    

                            Project file

For projects that support Central Package Management (CPM), copy this XML node into the solution Directory.Packages.props file to version the package.

paket add PYC.Compression --version 1.0.1

The NuGet Team does not provide support for this client. Please contact its maintainers for support.

#r "nuget: PYC.Compression, 1.0.1"

#r directive can be used in F# Interactive and Polyglot Notebooks. Copy this into the interactive tool or source code of the script to reference the package.

#addin nuget:?package=PYC.Compression&version=1.0.1
                    

                            Install PYC.Compression as a Cake Addin

#tool nuget:?package=PYC.Compression&version=1.0.1
                    

                            Install PYC.Compression as a Cake Tool

The NuGet Team does not provide support for this client. Please contact its maintainers for support.

PYC

PYC is a word sequence compression. It is a simple compression method that interprets the input as a sequence of words and compresses it by replacing repeated words by its index of first occurence.

Those indexes are then compressed by a dynamic canonical Huffman tree which allows code length up to 22 bits.

The tree is stored with the compressed data and is itself encoded in a special way.

The compression is quite fast in contrast to algorithms like LZ77 as it has not to search for reoccuring patterns. You could say that the word indexes encode a byte sequence match of length 2. But it does not need any offset or length information. The index itself is enough and has always the same size.

As indexes (repeated symbols) are less common as all possible 2^16 symbols, they are ideal to encode them with Huffman coding in most cases.

The tricky part is to store the tree, as it might encode way more than 256 symbols and can consume a lot of space.

The compression works pretty good on binary data with repeated word values, uncompressed assemblies, uncompressed bitmaps with low color count and even text data.

The algorithm has some limitations. Most notably the maximum of 22 bits per index encoding. For very large amounts of different words, a compression is not possible. Therefore it cannot compress several files including benchmark files like enwik8 etc.

Compression

The raw data is interpreted as a sequence of little endian 16-bit values (words). If the raw data size is odd a padding byte with value zero is appended to the data.

If a word value is found for the first time, an auto increased index is assigned which starts at zero. So the first found word gets index 0, the second found word index 1 and so on.

If the same word is encountered again, it is replaced by its index.

In addition a counter is used to track the count of sequential words and sequential indexes.

The compressed output is then generated as a bit stream. To identify words and indexes in the output, header bytes are used. As every compressed data has to start with at least one normal word value, the first header just gives the number of following words. As a byte can express values from 0 to 255 but a count of 0 is impossible, the first header stores the number of words minus 1. So a header value of 5 means that 6 words will follow.

Words are written directly as 16 bits to the output stream. Note that this is basically big endian encoding as the bits are written to the stream from highest to lowest bit. This is true in general for writing data to the bit stream.

After the first word sequence, the next header must be inserted. This and all following headers now have a special format but they are still always 8 bits in size. Dependent on the data a header can take one of three formats:

If the value is less than 0x80 (128) it again gives a number of words minus 1, so it can encode 1 to 128 words.
If the value is less than 0xC0 (192) the lowest 6 bits give a number of indexes minus 1, so it can encode 1 to 64 indexes.
Otherwise the lowest 6 bits specify a bit mask where starting from lowest bit, each bit states if a symbol (0) or index (1) follows.

The last two cases can also be interpreted as:

If the upper two bits are 10, the lowest 6 bits give a number of indexes minus 1.
If the upper two bits are 11, the lowest 6 bits specify the bit mask.

The bit mask can be used if words and indexes change frequently. In general it is beneficial if the amount of words and the amount of indexes together is less or equal than 6. For example if you have 5 words, then 1 index and then words again, you can encode the 5 words and the index in one bit mask header. Otherwise it would need two headers.

If a sequence of words exceeds 128, the same header can be used again of course. Same with sequences of indexes which exceed an amount of 64.

The header 0xc0 is the end marker. It marks the end of the compressed data. In general it would mean that it is a bit mask header but as all bits are 0, this would mean that just 6 words follow. But this can be expressed with the first header format so this header is not needed. So never use header 0xc0 to express a word sequence. Always end the compressed data with that header.

Whenever writing an index, it is Huffman encoded. A canonical Huffman tree is used which depends on the frequency an index occurs. For example if the input data contains word 0x1234 only once, the index frequency is zero as it is never used. If another word appears 5 times, the frequency of its index would be 4, as the index only appears 4 times while the word itself appears once.

As the index is associated even if it is not used again, the frequency of 0 might be common. Those indexes are encoded in the Huffman tree as having a length of 0, as they are never used. This is why the symbol length 0 is encoded small when storing the pretree for the main Huffman tree (see Tree encoding below).

Tree encoding

The Huffman tree is first based on a simple frequency table of all word indexes. Canonical huffman codes are then generated from this table. If an index is not used at all (frequency = 0), then it is stored with a length of zero in the tree.

Only the length of the indexes are stored. First the total amount of indexes is stored as a 16 bit value. This includes all indexes up to the highest which is actually used (= repeated word).

Then the lengths of each index is stored in a bitstream. The lengths are stored in a special way to save space.

There is a static Huffman table for the lengths of the index lengths (aka the pretree).

Length	Bit representation
0	000
12	001
13	010
6	0110
7	0111
8	1000
9	1001
10	1010
11	1011
2	11000
3	11001
4	11010
5	11011
14	11100
15	111010
16	111011
17	111100
1	1111010
18	1111011
19	1111100
20	1111101
21	1111110
22	1111111

From those information a canonical huffman tree is generated in both the encoder and decoder.

The encoder writes the lengths with this Huffman tree. Non-used indexes have length 0.

For larger gaps of unused indexes (multiple lenght entries with value 0) there is a special encoding.

Whenever a length of 0 is read, the next bits are checked to determine the amount of following zero lengths.

Bit pattern	Amount
`0`	0
`100..110`	1..3
`1110000..1111111`	4..19

For a single zero length this wastes 1 bit, for 2 zero lengths in a row it is not changing the bit count (as the length zero needs 3 bits to encode). For 3 zero lengths in a row, it saves 3 bits. For 4 to 19 zero lengths it saves 5, 8, 11, ..., 50 bits respectively.

This encoding can only express up to 19 zero lengths in a row, but you can repeat this of course. So for example a sequence of 76 zero lengths can be expressed by four of such encodings saving 200 bits in total (28 bits instead of 228 bits).

Data format

Note that the whole tree data section always starts at a full byte boundary!

Offset	Length	Description
0x00	2	Number of indexes
0x02	n	Tree data

n is an arbitrary size but in total it should match the amount of bits needed to encode all the given indexes.

For every given index (0 up to the number of indexes minus 1) a length is stored with the mentioned bit encoding. From those lengths the main Huffman tree for the indexes can be reconstructed.

Note that after the tree data, it is ensured that following data will start at a full byte boundary!

Decompression

After the tree, the actual compressed data follows. It is a sequence of 8-bit headers, 16-bit symbols and encoded indexes.

The data starts with an 8-bit header. This header directly specifies how many symbols follow. The value must be increased by 1, as 0 symbols at the beginning would not make sense. So an amount of 1 to 256 symbols can be given here.

Then this amount of symbols is read as 16-bit values. Note that the values are big endian as we effectively read from a bitstream.

For each symbol, its index of occurence is tracked. So for example the first symbol will be associated to index 0, the second symbol to index 1, and so on.

After the symbols the next header is read as 8-bit. Now this and all following headers can have one of three encodings:

If the value is less than 0x80 (128) it again gives a number of symbols minus 1, so it can encode 1 to 128 symbols.
If the value is less than 0xC0 (192) the lowest 6 bits give a number of indexes minus 1, so it can encode 1 to 64 indexes.
Otherwise the lowest 6 bits specify a bit mask where starting from lowest bit, each bit states if to read a symbol (0) or index (1).

The last two cases can also be interpreted as:

If the upper two bits are 10, the lowest 6 bits give a number of indexes minus 1.
If the upper two bits are 11, the lowest 6 bits specify the bit mask.

The bit mask is useful if small amounts of symbols and indexes are mixed. It is more efficient to store them in one byte than to specify multiple headers with small amounts.

For example if the header is 0xD8 which in binary is 1101 1000, it means that there will be 3 symbols, then 2 indexes and then 1 symbol again. The header bits are 11 and the meaningful bits in order are 0 0 0 1 1 0.

Based on each header the data is read as symbols or indexes. Symbols are read as 16-bit BE values as before. Indexes are read with the main Huffman tree. The read index can be transformed to the associated symbol via a mapping table.

For example if the first symbol was 0xABCD, it is associated to index 0. If later the index 0 is read, it means that the symbol 0xABCD is meant.

Going forward more and more indexes are tracked and can be used. For text data and uncompressed binary data with many similar word values, this can lead to a good compression ratio. The fewer different indexes are used, the better the compression will be.

Product	Compatible and additional computed target framework versions.
.NET	net9.0 is compatible. net9.0-android was computed. net9.0-browser was computed. net9.0-ios was computed. net9.0-maccatalyst was computed. net9.0-macos was computed. net9.0-tvos was computed. net9.0-windows was computed. net10.0 was computed. net10.0-android was computed. net10.0-browser was computed. net10.0-ios was computed. net10.0-maccatalyst was computed. net10.0-macos was computed. net10.0-tvos was computed. net10.0-windows was computed.

Product

.NET

Compatible target framework(s)

Included target framework(s) (in package)

Learn more about Target Frameworks and .NET Standard.

net9.0
- PYC.Core (>= 1.0.0)

NuGet packages

This package is not used by any NuGet packages.

GitHub repositories

This package is not used by any popular GitHub repositories.

Version	Downloads	Last updated
1.0.1	129	5/20/2025
1.0.0	127	5/20/2025