Stateless ZIP library - SLZ
Branch | name | Last version | Released | Links | Notes |
Development | master | 1.2.1 | 2022-10-23 | git / web / github | may be broken |
1.2 | 1.2-stable | 1.2.1 | 2022-10-23 | git / web / github | Stable version |
Introduction
SLZ is a fast and
memory-less stream compressor which produces an output that
can be
decompressed with zlib or gzip. It
does not implement decompression at
all, zlib is perfectly fine for this.
The purpose is to use SLZ in situations where a
zlib-compatible stream is
needed and zlib's resource usage would be too high while the compression ratio
is not critical. The typical use case is in
HTTP servers and gateways which
have to compress many streams in parallel with little CPU resources to assign
to this task, and without having to thottle the compression ratio due to the
memory usage. In such an environment, the server's
memory usage can easily be divided by
10 and the CPU usage by 3. In addition its high performance made it fill a gap in
network backup applications.
How does it work ?
While
zlib uses 256 kB of memory per stream in addition to a few tens of bytes
for the stream descriptor itself,
SLZ only stores a stream descriptor made of
28 bytes. Thus it is particularly suited to environments having to deal with
tens to hundreds of thousands of concurrent streams.
The key difference between zlib and SLZ is that
SLZ is stateless in that it
doesn't consider the data previously compressed as part of its dictionary. It
doesn't hurt compression performance when it is fed in large enough chunks (at
least a few kB at once), but it
must not be used if it has to read less than
a few kB at a time, or the caller will have to implement the
input buffer to
maximize the compression ratio. For example, an HTTP gateway may defragment all
the parts from a chunked-encoded response buffer into a linear buffer that is
then fed at once to SLZ. This temporary buffer would then be static and shared
between all streams running on the same thread.
A side effect from its stateless nature is that it can emit a zlib-compatible
stream
without being vulnerable to CRIME-like attacks by processing the
sensitive and attacker-controlled data in distinct batches. Since
no dictionary
persists between calls, there is no risk that attacker-controlled data is used
to guess sensitive data using dictionary lookups.
Another induced benefit which was not initially seeked is that
SLZ is on
average 3 times as fast as zlib in its fastest mode. This is the result of the
simpler matching and encoding mechanisms involved which permit even more
optimisations.
Reference tests
Some performance tests were run using version 1.0 on various types of data showing anything between
227 MB/s and 753 MB/s per core on the test laptop depending on the difficulty
to find matching strings and on the amount of data emitted as literals (hence not
compressed). In the end, comparative tests were performed using some
reference
datasets that are already being used for other libraries. This performance
report was produced using the
Silesia Corpus to get something comparable to
what is reported by the LZ4 utility here :
https://github.com/Cyan4973/lz4.
New tests would be deserved on more recent versions since performance on x86_64
and armv8 was further improved.
The tests were run on a single core of a core i5-3320M at 3.3 GHz (turbo mode).
Product | Input BW | Compression ratio | Notes |
lz4-r124 | 352 MB/s | 2.09 | |
lzo-2.09 | 347 MB/s | 2.10 | |
slz-142 | 266 MB/s | 2.12 | deflate format |
slz-142 | 214 MB/s | 2.12 | gzip format |
zlib-1.28 -1 -h | 88 MB/s | 1.64 | huffman only |
zlib-1.28 -1 -r | 81 MB/s | 1.65 | RLE only |
zlib-1.28 -1 -F | 70 MB/s | 2.32 | fixed huffman |
zlib-1.28 -1 | 70 MB/s | 2.74 | |
SLZ was included in Inikep's
lzbench benchmark
suite, which compares a wide range of compression tools. The project maintains an updated
performance table where other solutions can be compared.
Only zlib and SLZ can emit a stream that a regular zlib-compatible browser can
decompress (RFC1950/1951/1952 aka zlib, deflate, gzip). Here
SLZ is 3 times
faster than zlib for a less compressed payload (29% larger than zlib). Tests
run on other file types (including HTML, CSS, JS) show that the ratio of
compression speed between SLZ and zlib varies from 2.5 to 3.5. The output
compressed stream is often 24 to 32% larger for SLZ than zlib.
Since SLZ focuses a lot on performance savings gained by
keeping most of the
workload in CPU cache, it
scales very well when multiple cores are used in
parallel (low memory bandwidth) as shown in this tests run on a quad-core ARM
Cortex A17 (RK3288) running at 1.8 GHz and the same input data :
#Cores | Input size | Time(s) | BW/core(MB/s) | Total BW(MB/s) |
1 | 211938580 | 3.575 | 59.3 | 59.3 |
2 | 211938580 | 3.669 | 57.8 | 115.5 |
3 | 211938580 | 3.697 | 57.3 | 172.0 |
4 | 211938580 | 3.748 | 56.5 | 226.2 |
Increasing compression ratio in HTTP gateways
Since some products
automatically throttle the compression level based on the
CPU usage or the available memory, on high speed networks, zlib can end up
showing a much lower average compression ratio in order to fill an uplink. With
SLZ,
no extra memory usage is needed for compression, and the
CPU usage is
reduced by 2.5-3.5 times, meaning that the
throttling will happen at much higher data
rates. The net effect is a
significant improvement on the compression ratio
compared to zlib at a given resource usage.
HAProxy running on a single core of a core i5-3320M, with 500 concurrent users,
the default 16kB buffers, and accessing
www.haproxy.org (76 kB) shows the
following measures with traffic leaving the machine via a gigabit Ethernet link
(hence the 976 Mbps of IP traffic) after being compressed with zlib and SLZ,
respectively :
At 100% CPU (no CPU usage limit) :
Library | BW In | BW Out | BW Saved | Ratio | memory VSZ/RSS | hits/s |
zlib | 480 Mbps | 188 Mbps | 292 Mbps | 2.55 | 130M / 101M | 744 |
slz | 1700 Mbps | 810 Mbps | 890 Mbps | 2.10 | 23.7M / 9.7M | 2382 |
At 85% CPU (limited) :
Library | BW In | BW Out | BW Saved | Ratio | memory VSZ/RSS | hits/s |
zlib | 1240 Mbps | 976 Mbps | 264 Mbps | 1.27 | 130M / 100M | 1738 |
slz | 1600 Mbps | 976 Mbps | 624 Mbps | 1.64 | 23.7M / 9.7M | 2210 |
In the CPU-limited test, once CPU usage limit was reached,
blocks were sent
uncompressed. This explains how the gigabit link was always saturated, and
it
is then the input bandwidth which varies according to the average compression
ratio. In all situations,
SLZ resulted in bandwidth savings 3 times higher than
zlib.
No magics inside!
There are 6 key points having a large impact on compression speed in any LZ-
based compressor :
dictionary lookup : looking up a sequence of bytes in a recent window is not
necessarily fast. SLZ uses a hash-based technique similar to LZO or LZ4,
focusing on keeping track of only the most recent occurrence of a given
sequence. The hash table contains the absolute position in the stream of the
last occurrence of the hashed entry. This table is initialized to impossible
values prior to compressing so that unused entries are not even evaluated, as
it was found to be faster than not initializing them. During tests, it was
found that hashing 4 bytes to one value among 4096-65536 resulted in the best
compression ratio, and that 8192 entries was the best tradeoff between speed
and compression ratio. Smaller hash tables cause more collisions hurting the
compression ratio, and larger ones experience a significant slowdown on
certain datasets due to the hashtable having more difficulties to fit
entirely in the CPU cache.
indexing : if sequences of less than 3 bytes are considered for a lookup,
since we only keep the last occurrence, there's a high risk of not keeping
the most useful sequence which has large chances of matching a similar word.
For example in HTML a lot of occurrences of "><", "</", "/>" can be found so
the probability that the one from "</a>" is picked instead of the one from
"</td>" will make it hard to find long matches. Conversely, sequences larger
than 4 bytes will only match once in a while and will prevent shorter series
of characters from being matched. Experimentation has shown that hashing 3
bytes only provided the same compression ratio as hashing 4 bytes, with up to
10% faster input processing thanks to the reduced cost of hashing. Attempts
to index the repeated data were made, and caused a significant performance
drop (about 5%) for less than 1% compression ratio improvement, so they were
abandonned.
longest match : the longest match measurement is very similar to an strcmp()
call except that we count similar characters. In order to avoid this slow
operation most of the time, we store in the hash table a copy of the 4 bytes
that were hashed. This makes it easy to perform a comparison and to skip
those which do not match, then to only start the longest match measurement
for parts which already match on the first 4 bytes. With a 13-bits hash on
32-bit character sequence, there's a theorical 1 chance over 524288 that a
matching hash implies a matching sequence. But in practice on selected text
this chance is significantly higher since not all 32 bits are used, it's more
like 24 bits (6 bits per char), so it's one chance over 2048 that an entry
will match. That's still 2047/2048 useless calls to the memmatch function
that are saved this way.
encoding the output : deflate is often criticized for its bit-oriented stream
which is expensive to manipulate as it requires bit shifts. Moreover, on top
of this the deflate stream supports several encoding modes, the most common
one being dynamic huffman trees, which requires some significant maintenance
to keep a valid tree up to date and which SLZ doesn't use. Instead, SLZ uses
static huffman trees, and when too many binary entries are found to keep this
encoding efficient, it switches to the plain literal mode. Proceeding like
this comes with a significant benefit : all code sequences, all distances and
all lengths are stored as their respective huffman output bits. This avoids
an encoding phase, sequences of bits are immediately sent to the destination
buffer without any further operation, and remain reasonably cheap compared to
the more common usages.
checksumming : 3 checksum methods exist depending on the encoding format.
RFC1950 (zlib format) uses an Adler-32 checksum which theorically involves
modulus (divide) but which can be optimized as it's done with a number very
close to a power of two. RFC1951 (raw deflate) doesn't involve any checksum.
RFC1952 (gzip) uses a CRC-32 which is somewhat expensive but can be optimized
in various ways. Since SLZ was designed around gzip mainly, most of the
optimizations were done to optimize CRC-32. Some well-known methods involve
pre-computed tables, which can be further optimized for multi-byte processing
and single-byte processing as well. It was found that updating the CRC-32 on
a per-character basis before looking up the hash came at almost no cost
because it would probably compensate for some pipeline stalls when the
hashed entry was not in the CPU's L1 cache. But in order to avoid multiplying
the functions, the checksum computation was moved out of the function with a
very low extra cost (less than 1%) as it can process all the input at once 4
bytes at a time, contributes to preloading the input data into the caches.
implementation-specific optimizations : modern CPUs can load unaligned words
from memory with minimal to no overhead. On architectures that support this
(ix86, x86_64, armv7), this is used to limit the number of operations and
memory accesses. Some CPUs have a smaller cache and the direct mapping
between distances and huffman sequences can make it thrash a lot. Some
experimentations were made using a direct mapping only for shortest distances
(the most common ones), but results were not encouraging for now as a cache
miss is not completely offset by the amount of extra operations.
These two factors have a significant impact on the compression ratio :
collisions in the dictionary, caused by suboptimal hash distribution. Here
with 8k entries we have what looks like the best tradeoff between speed and
compression ratio for most workloads tested, and it can easily be changed
using a macro.
output encoding : the dynamic huffman trees would definitely shorten the
output stream but at an important performance cost. But even with fixed trees
it is possible to waste a lot of space. First, input bytes 144 to 255 are
encoded on 9 bits per byte. SLZ constantly monitors the output stream to
decide when it's preferable to switch to plain literals which cost exactly
52 bits to switch back and forth and will not inflate the output stream any
further. Second, referencing a match can be longer than the data it replaces.
This is true for short sequences, but it can be true even for a long sequence
in the middle of a series of plain literals. By monitoring the output stream
and verifying the encoding cost before switching, it is possible to avoid any
overhead beyond the base 5 bytes per 32kB plus headers for non-compressible
data.
Get it!
SLZ is provided as a library with a few extra tools (eg:
zenc, a compressor
emitting the various formats). It is distributed under the
X11 license, meaning that
you can do a lot of things with it, such as merge it into
GPL or BSD-licensed programs,
as well as use it in proprietary software.
Contributions are welcome and should be
sent as
Git patches (see
"git format-patch") and will be made under
the same license exclusively.
Implementations
HAProxy can use SLZ since version 1.6.0 to compress HTTP responses.
External links