language agnostic - Should I provide consistency checks in the Huffman tree building algorithm for DEFLATE?


RFC 1951 gives a simple algorithm that restores the Huffman tree from a list of code lengths. It is described as follows:

  1) Count the number of codes for each code length. Let bl_count[N] be the
     number of codes of length N, N >= 1.

  2) Find the numerical value of the smallest code for each code length:

         code = 0;
         bl_count[0] = 0;
         for (bits = 1; bits <= MAX_BITS; bits++) {
             code = (code + bl_count[bits-1]) << 1;
             next_code[bits] = code;
         }

  3) Assign numerical values to all codes, using consecutive values for all
     codes of the same length with the base values determined at step 2.
     Codes that are never used (which have a bit length of zero) must not be
     assigned a value.

         for (n = 0; n <= max_code; n++) {
             len = tree[n].Len;
             if (len != 0) {
                 tree[n].Code = next_code[len];
                 next_code[len]++;
             }
         }
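For reference, the three RFC steps above can be written as a small C function. This is only a sketch: the name assign_codes and the flat lens/codes arrays are illustrative choices rather than anything the RFC prescribes, and, like the pseudocode, it performs no consistency checking on the lengths.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_BITS 15 /* longest code length DEFLATE allows */

/* Assign canonical Huffman codes from a list of code lengths.
 * lens[n] is the length of symbol n (0 means "unused");
 * codes[n] receives its code value (0 for unused symbols).
 * Hypothetical helper: the name and signature are illustrative. */
static void assign_codes(const unsigned char *lens, size_t count,
                         unsigned *codes)
{
    unsigned bl_count[MAX_BITS + 1] = {0};
    unsigned next_code[MAX_BITS + 1] = {0};
    unsigned code = 0;

    /* Step 1: count the number of codes of each length. */
    for (size_t n = 0; n < count; n++) {
        assert(lens[n] <= MAX_BITS);
        bl_count[lens[n]]++;
    }

    /* Step 2: find the smallest code value for each length. */
    bl_count[0] = 0;
    for (int bits = 1; bits <= MAX_BITS; bits++) {
        code = (code + bl_count[bits - 1]) << 1;
        next_code[bits] = code;
    }

    /* Step 3: hand out consecutive values within each length,
     * skipping symbols whose length is zero. */
    for (size_t n = 0; n < count; n++) {
        unsigned len = lens[n];
        codes[n] = 0;
        if (len != 0)
            codes[n] = next_code[len]++;
    }
}
```

With the RFC's own example lengths (3, 3, 3, 3, 3, 2, 4, 4), this assigns the codes 010, 011, 100, 101, 110, 00, 1110 and 1111.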

But the algorithm contains no consistency checks on the data. On the other hand, it is clear that the list of lengths may be invalid. The individual length values cannot be invalid, because they are encoded in 4 bits, but there may be, for example, more codes of some length than that length can encode.

What is the minimal set of checks? Or is there some reason I am missing why no such checks are needed?

zlib checks that the set of code lengths is both complete, that is, it uses up all bit patterns, and does not overflow the bit patterns. The one allowed exception is when there is a single symbol with length 1, in which case the code is allowed to be incomplete (a 0 bit means that symbol; a 1 bit is undefined).

This helps zlib reject random, corrupted, or improperly coded data with higher probability and earlier in the stream. It is a different kind of robustness from the one suggested in another answer here, where you could instead permit incomplete codes and return an error only when an undefined code is encountered in the compressed data.
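To illustrate the second, more permissive approach, here is a rough C sketch of a decoder that accepts an incomplete code and reports an error only when the input bits match no assigned code. The function name decode_symbol and the one-bit-per-array-element input are assumptions made to keep the example short, and it assumes the lengths are not oversubscribed. It is not how zlib itself decodes.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_BITS 15 /* longest code length DEFLATE allows */

/* Decode one symbol of a canonical Huffman code.
 * lens[i] is the code length of symbol i (0 = unused).
 * bits[] holds the input, one bit per element, in stream order.
 * Returns the symbol index, or -1 if the bits match no assigned
 * code or the input runs out. Hypothetical helper for illustration. */
static int decode_symbol(const unsigned char *lens, size_t count,
                         const unsigned char *bits, size_t nbits)
{
    long bl_count[MAX_BITS + 1] = {0};
    for (size_t i = 0; i < count; i++) {
        assert(lens[i] <= MAX_BITS);
        bl_count[lens[i]]++;
    }
    bl_count[0] = 0;

    long code = 0;  /* the bits read so far, as a number */
    long first = 0; /* smallest code of the current length */
    for (int len = 1; len <= MAX_BITS && (size_t)len <= nbits; len++) {
        code |= bits[len - 1] & 1;
        long cnt = bl_count[len];
        if (code - cnt < first) {
            /* The bits are the (code - first)-th code of this length. */
            long rank = code - first;
            for (size_t i = 0; i < count; i++)
                if (lens[i] == len && rank-- == 0)
                    return (int)i;
        }
        /* No match at this length: move on to the next one. */
        first = (first + cnt) << 1;
        code <<= 1;
    }
    return -1; /* undefined bit pattern, or ran out of input */
}
```

With lengths 2, 2, 2 the codes 00, 01 and 10 are assigned and 11 is not, so the sketch decodes the first three patterns and returns -1 only if 11 actually appears in the data.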

To calculate completeness, you start with the number of bits in the code, k = 1, and the number of possible codes, n = 2. There are two possible one-bit codes. You subtract from n the number of length-1 codes, n -= a[k]. Then you increment k to look at the two-bit codes, and you double n. Subtract the number of two-bit codes, and so on for each length. When you are done, n should be zero. If at any point n goes negative, you can stop right there, as you have an invalid set of code lengths. If n is greater than zero when you are done, you have an incomplete code.
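A minimal C sketch of that calculation follows. The names check_lengths and lengths_acceptable and the three status constants are made up for this example, and the single-symbol exception is implemented only as described in this answer, so treat it as an approximation of zlib's rules rather than a copy of them.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_BITS 15 /* longest code length DEFLATE allows */

/* Illustrative status values, not taken from zlib. */
enum { CODE_OVERSUBSCRIBED = -1, CODE_COMPLETE = 0, CODE_INCOMPLETE = 1 };

/* Classify a set of code lengths (0 = unused symbol). */
static int check_lengths(const unsigned char *lens, size_t count)
{
    long a[MAX_BITS + 1] = {0}; /* a[k] = number of codes of length k */
    for (size_t i = 0; i < count; i++) {
        assert(lens[i] <= MAX_BITS);
        a[lens[i]]++;
    }

    long n = 1; /* doubled to 2 possible codes at k = 1 */
    for (int k = 1; k <= MAX_BITS; k++) {
        n <<= 1;   /* each extra bit doubles the remaining patterns */
        n -= a[k]; /* remove the patterns used by length-k codes */
        if (n < 0)
            return CODE_OVERSUBSCRIBED; /* more codes than patterns */
    }
    return n == 0 ? CODE_COMPLETE : CODE_INCOMPLETE;
}

/* Accept complete codes, plus the one exception described above:
 * exactly one used symbol, and its length is 1. */
static int lengths_acceptable(const unsigned char *lens, size_t count)
{
    int status = check_lengths(lens, count);
    if (status == CODE_COMPLETE)
        return 1;
    if (status == CODE_OVERSUBSCRIBED)
        return 0;

    size_t used = 0;
    unsigned char only = 0;
    for (size_t i = 0; i < count; i++) {
        if (lens[i] != 0) {
            used++;
            only = lens[i];
        }
    }
    return used == 1 && only == 1;
}
```

For example, the lengths 3, 3, 3, 3, 3, 2, 4, 4 come out complete, three codes of length 1 are oversubscribed, and two codes of length 2 are incomplete and rejected.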

