Web Compression
I wrote this comment about pre-compressing web artifacts with zstd:
I have read somewhere (can’t find links handy) that for web server case, zstd may not be as useful as brotli due to longer decompression speed, but I may be wrong here.
That felt wrong: someone is suggesting a cool change in a module, and I am just FUDing it. If I were the PR submitter, I would certainly not appreciate this comment. So I decided to conduct a non-scientific experiment: take a big piece of JavaScript and compare brotli with zstd.
Executive Summary
- brotli compresses my chosen piece of JavaScript better than zstd: the zstd output is 4-22% larger at levels 19 down to 6.
- brotli takes 50-80% longer to decompress than zstd (depending on platform) and uses more system resources.
As a result, I will rephrase my comment on GitHub and welcome zstd to the default compressors.
Benchmark Setup
Hardware:
- AMD Ryzen 7 7840HS, DDR5-5600.
- Raspberry Pi 4.
CPU scaling governor set to performance on both nodes:
for f in /sys/devices/system/cpu/cpufreq/*/scaling_governor; do echo 'performance' | sudo tee $f; done
Software:
- NixOS 24.05-2908-g883180e6550c.
- Linux v6.6.44.
- brotli 1.1.0 from the distribution.
- zstd v1.5.6 from the distribution.
Test Harness
I picked YouTube’s desktop_polymer.js, because:
- That file weighs 8.52MB.
- YouTube is a somewhat frequently accessed website, so that file is frequently downloaded and decompressed, making it somewhat representative, albeit anecdotal.
Acquiring and compressing it:
$ wget https://www.youtube.com/s/desktop/bf8c00d7/jsbin/desktop_polymer.vflset/desktop_polymer.js -O y.js
$ for prog in 'zstd -3' 'zstd -6' 'zstd -9' 'zstd -12' 'zstd -15' 'zstd -19' 'zstd --ultra -22'; do $prog y.js -o y.js.${prog##*-}.zst; done
$ brotli y.js
poop accepts a single command to run, so we have this wrapper:
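#!/bin/sh
# Dispatch on the name this script was invoked as (via symlink).
set -e
FILE=y.js
case "$0" in
    ./brotli)
        exec brotli -cd "${FILE}.br"
        ;;
    ./zstd-*)
        # Strip the "./zstd-" prefix to recover the compression level.
        level=${0#./zstd-}
        exec zstd -cd "${FILE}.${level}.zst"
        ;;
    *)
        >&2 echo "invalid program $0"
        exit 1
        ;;
esac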
Then symlink to it for each compression level:
$ for l in 3 6 9 19 22; do ln -s wrap zstd-${l}; done
$ ln -s wrap brotli
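Before timing anything, it is probably worth checking that every wrapper really reproduces the original file, since a decompressor that bails out early would look deceptively fast. A minimal sketch, assuming the symlinks above live in the current directory; `same_output` and `check_wrappers` are hypothetical helper names, not part of any tool used here:

```shell
#!/bin/sh
# same_output ORIGINAL COMMAND [ARGS...]
# Runs COMMAND and compares its stdout byte-for-byte with ORIGINAL.
# Returns 0 when they are identical, non-zero otherwise.
same_output() {
    orig=$1
    shift
    "$@" | cmp -s - "$orig"
}

# check_wrappers ORIGINAL WRAPPER...
# Prints one line per wrapper; returns non-zero if any of them mismatched.
check_wrappers() {
    orig=$1
    shift
    rc=0
    for w in "$@"; do
        if same_output "$orig" "$w"; then
            echo "ok   $w"
        else
            echo "FAIL $w"
            rc=1
        fi
    done
    return $rc
}
```

With the files from this post, the call would be something like `check_wrappers y.js ./zstd-3 ./zstd-6 ./zstd-9 ./zstd-19 ./brotli`.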
Compression Ratio
| Filename | Bytes | % larger than br |
|---|---|---|
| y.js.br | 1380352 | 0.00% |
| y.js.22.zst | 1437696 | 4.15% |
| y.js.19.zst | 1437696 | 4.15% |
| y.js.15.zst | 1548288 | 12.17% |
| y.js.12.zst | 1581056 | 14.54% |
| y.js.9.zst | 1609728 | 16.62% |
| y.js.6.zst | 1687552 | 22.26% |
| y.js.3.zst | 1892352 | 37.09% |
| y.js | 8519680 | 517.21% |
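The last column is just (size / brotli size - 1) × 100. A rough sketch of how such a table could be produced, assuming sizes are taken as exact byte counts with `wc -c` (which may not be how the numbers above were measured); `pct_larger` and `size_table` are hypothetical helper names:

```shell
#!/bin/sh
# pct_larger BASE_BYTES OTHER_BYTES
# Prints how much larger OTHER is than BASE, in percent, with two decimals.
pct_larger() {
    awk -v base="$1" -v other="$2" \
        'BEGIN { printf "%.2f\n", (other / base - 1) * 100 }'
}

# size_table BASELINE_FILE FILE...
# Prints "name bytes pct%" for the baseline and for every other file given.
size_table() {
    # $(( ... )) strips the padding some wc implementations add.
    base=$(( $(wc -c < "$1") ))
    for f in "$@"; do
        size=$(( $(wc -c < "$f") ))
        printf '%s %d %s%%\n' "$f" "$size" "$(pct_larger "$base" "$size")"
    done
}
```

For example, `pct_larger 1380352 1437696` prints `4.15`, matching the zstd -19 row, and `size_table y.js.br y.js.*.zst y.js` would list every artifact against the brotli baseline.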
As we can see, zstd -19 yielded a 4% larger file than brotli. We should keep in mind that brotli has web-specific tricks (notably a built-in static dictionary of strings common in web content), giving itself somewhat of an advantage over other compressors for this corpus.
Since zstd -19 and zstd -22 yield the same compression ratio, there is no point going ultra, so I will exclude zstd -22 from the tests.
Decompression Speed
hyperfine --export-markdown $(hostname) -w 1 -N ./zstd-{6,3,9,19} ./brotli
AMD Ryzen
| Command | Mean [ms] | Min [ms] | Max [ms] | Relative |
|---|---|---|---|---|
| ./zstd-6 | 8.4 ± 0.4 | 7.8 | 11.3 | 1.00 |
| ./zstd-3 | 8.7 ± 0.4 | 8.0 | 11.6 | 1.03 ± 0.07 |
| ./zstd-9 | 9.1 ± 0.6 | 8.3 | 13.6 | 1.08 ± 0.09 |
| ./zstd-19 | 10.8 ± 0.7 | 9.8 | 14.1 | 1.28 ± 0.10 |
| ./brotli | 15.5 ± 0.8 | 14.4 | 19.4 | 1.83 ± 0.13 |
Raspberry Pi 4
| Command | Mean [ms] | Min [ms] | Max [ms] | Relative |
|---|---|---|---|---|
| ./zstd-6 | 53.9 ± 0.6 | 52.6 | 55.3 | 1.00 |
| ./zstd-3 | 56.2 ± 2.3 | 54.9 | 68.2 | 1.04 ± 0.04 |
| ./zstd-9 | 57.7 ± 0.6 | 56.7 | 59.3 | 1.07 ± 0.02 |
| ./zstd-19 | 65.3 ± 0.6 | 64.3 | 66.8 | 1.21 ± 0.02 |
| ./brotli | 82.7 ± 2.2 | 81.5 | 91.2 | 1.53 ± 0.04 |
Summary: zstd -6 is fastest; brotli is slower by 50-80%.
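"Slower by X%" and "faster by X%" are easy to mix up, so here is the arithmetic spelled out. A small sketch working from the rounded means in the tables; `slowdown_pct` and `time_saved_pct` are hypothetical helper names. Because the inputs are rounded, the results can differ slightly from hyperfine's Relative column, which is computed from unrounded means.

```shell
#!/bin/sh
# slowdown_pct FAST_MS SLOW_MS
# How much longer the slow command takes than the fast one, in percent.
slowdown_pct() {
    awk -v fast="$1" -v slow="$2" \
        'BEGIN { printf "%.1f\n", (slow / fast - 1) * 100 }'
}

# time_saved_pct FAST_MS SLOW_MS
# How much less time the fast command takes than the slow one, in percent.
time_saved_pct() {
    awk -v fast="$1" -v slow="$2" \
        'BEGIN { printf "%.1f\n", (1 - fast / slow) * 100 }'
}
```

For the Raspberry Pi means, `slowdown_pct 53.9 82.7` prints `53.4` (brotli takes about 53% longer), while `time_saved_pct 53.9 82.7` prints `34.8` (zstd -6 finishes in about 35% less time). Both describe the same pair of measurements.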
Resource Usage
poop ./zstd-{6,3,9,19} ./brotli
AMD Ryzen
Benchmark 1 (563 runs): ./zstd-6
measurement mean ± σ min … max outliers delta
wall_time 8.83ms ± 1.02ms 7.79ms … 11.8ms 89 (16%) 0%
peak_rss 6.31MB ± 84.0KB 6.03MB … 6.42MB 1 ( 0%) 0%
cpu_cycles 29.1M ± 565K 28.6M … 37.5M 43 ( 8%) 0%
instructions 90.6M ± 14.9K 90.6M … 90.7M 20 ( 4%) 0%
cache_references 1.76M ± 23.0K 1.72M … 1.99M 26 ( 5%) 0%
cache_misses 122K ± 2.76K 116K … 136K 3 ( 1%) 0%
branch_misses 334K ± 1.25K 332K … 342K 51 ( 9%) 0%
Benchmark 2 (546 runs): ./zstd-3
measurement mean ± σ min … max outliers delta
wall_time 9.11ms ± 1.01ms 8.13ms … 12.0ms 84 (15%) 💩+ 3.2% ± 1.4%
peak_rss 6.31MB ± 84.9KB 6.03MB … 6.42MB 1 ( 0%) - 0.1% ± 0.2%
cpu_cycles 30.3M ± 428K 29.9M … 34.3M 44 ( 8%) 💩+ 4.2% ± 0.2%
instructions 97.5M ± 14.7K 97.5M … 97.5M 24 ( 4%) 💩+ 7.6% ± 0.0%
cache_references 1.85M ± 21.9K 1.80M … 2.05M 29 ( 5%) 💩+ 4.8% ± 0.2%
cache_misses 125K ± 2.57K 119K … 142K 12 ( 2%) 💩+ 2.2% ± 0.3%
branch_misses 319K ± 1.62K 317K … 329K 49 ( 9%) ⚡- 4.3% ± 0.1%
Benchmark 3 (524 runs): ./zstd-9
measurement mean ± σ min … max outliers delta
wall_time 9.51ms ± 1.03ms 8.36ms … 12.7ms 82 (16%) 💩+ 7.6% ± 1.4%
peak_rss 8.40MB ± 84.2KB 8.13MB … 8.65MB 4 ( 1%) 💩+ 33.1% ± 0.2%
cpu_cycles 29.1M ± 1.21M 28.1M … 33.7M 80 (15%) - 0.1% ± 0.4%
instructions 85.2M ± 14.1K 85.2M … 85.2M 20 ( 4%) ⚡- 6.0% ± 0.0%
cache_references 1.79M ± 16.4K 1.76M … 1.87M 19 ( 4%) 💩+ 1.7% ± 0.1%
cache_misses 156K ± 2.76K 151K … 168K 10 ( 2%) 💩+ 28.4% ± 0.3%
branch_misses 331K ± 1.04K 329K … 336K 39 ( 7%) - 0.9% ± 0.0%
Benchmark 4 (442 runs): ./zstd-19
measurement mean ± σ min … max outliers delta
wall_time 11.3ms ± 1.19ms 10.00ms … 15.0ms 86 (19%) 💩+ 27.6% ± 1.5%
peak_rss 12.5MB ± 87.9KB 12.2MB … 12.6MB 1 ( 0%) 💩+ 97.6% ± 0.2%
cpu_cycles 31.3M ± 1.58M 30.2M … 39.2M 56 (13%) 💩+ 7.5% ± 0.5%
instructions 88.5M ± 15.5K 88.4M … 88.5M 22 ( 5%) ⚡- 2.4% ± 0.0%
cache_references 1.81M ± 18.2K 1.77M … 1.94M 20 ( 5%) 💩+ 2.6% ± 0.1%
cache_misses 192K ± 2.68K 186K … 200K 3 ( 1%) 💩+ 57.3% ± 0.3%
branch_misses 346K ± 1.06K 344K … 352K 23 ( 5%) 💩+ 3.6% ± 0.0%
Benchmark 5 (316 runs): ./brotli
measurement mean ± σ min … max outliers delta
wall_time 15.8ms ± 1.27ms 14.6ms … 20.5ms 72 (23%) 💩+ 78.7% ± 1.7%
peak_rss 12.0MB ± 102KB 11.7MB … 12.2MB 2 ( 1%) 💩+ 90.3% ± 0.2%
cpu_cycles 52.9M ± 1.53M 51.8M … 71.4M 12 ( 4%) 💩+ 81.7% ± 0.5%
instructions 101M ± 14.4K 101M … 101M 8 ( 3%) 💩+ 11.8% ± 0.0%
cache_references 1.96M ± 155K 1.91M … 3.45M 11 ( 3%) 💩+ 11.3% ± 0.7%
cache_misses 165K ± 1.60K 161K … 172K 1 ( 0%) 💩+ 35.5% ± 0.3%
branch_misses 898K ± 905 896K … 903K 9 ( 3%) 💩+169.1% ± 0.0%
Raspberry Pi 4
Benchmark 1 (91 runs): ./zstd-6
measurement mean ± σ min … max outliers delta
wall_time 54.8ms ± 1.71ms 53.0ms … 63.7ms 6 ( 7%) 0%
peak_rss 5.69MB ± 69.8KB 5.51MB … 5.77MB 0 ( 0%) 0%
cpu_cycles 65.7M ± 2.23M 63.4M … 77.1M 12 (13%) 0%
instructions 82.2M ± 888 82.2M … 82.2M 4 ( 4%) 0%
cache_references 29.2M ± 14.1K 29.2M … 29.3M 2 ( 2%) 0%
cache_misses 666K ± 116K 553K … 1.02M 11 (12%) 0%
branch_misses 344K ± 1.50K 341K … 349K 1 ( 1%) 0%
Benchmark 2 (89 runs): ./zstd-3
measurement mean ± σ min … max outliers delta
wall_time 56.2ms ± 871us 55.0ms … 60.0ms 3 ( 3%) 💩+ 2.6% ± 0.7%
peak_rss 5.68MB ± 67.3KB 5.51MB … 5.77MB 0 ( 0%) - 0.2% ± 0.4%
cpu_cycles 68.1M ± 1.14M 66.5M … 73.7M 4 ( 4%) 💩+ 3.6% ± 0.8%
instructions 88.5M ± 436 88.5M … 88.5M 2 ( 2%) 💩+ 7.7% ± 0.0%
cache_references 31.4M ± 10.8K 31.4M … 31.4M 6 ( 7%) 💩+ 7.4% ± 0.0%
cache_misses 676K ± 100K 577K … 1.06M 5 ( 6%) + 1.5% ± 4.8%
branch_misses 326K ± 1.36K 322K … 328K 0 ( 0%) ⚡- 5.3% ± 0.1%
Benchmark 3 (85 runs): ./zstd-9
measurement mean ± σ min … max outliers delta
wall_time 58.6ms ± 2.36ms 56.8ms … 70.5ms 5 ( 6%) 💩+ 7.0% ± 1.1%
peak_rss 7.77MB ± 72.7KB 7.60MB … 8.00MB 0 ( 0%) 💩+ 36.6% ± 0.4%
cpu_cycles 67.7M ± 2.64M 65.8M … 81.7M 7 ( 8%) 💩+ 3.1% ± 1.1%
instructions 77.4M ± 923 77.4M … 77.4M 6 ( 7%) ⚡- 5.8% ± 0.0%
cache_references 27.7M ± 11.9K 27.7M … 27.7M 8 ( 9%) ⚡- 5.4% ± 0.0%
cache_misses 661K ± 86.0K 563K … 958K 6 ( 7%) - 0.8% ± 4.6%
branch_misses 341K ± 1.23K 338K … 344K 0 ( 0%) - 0.8% ± 0.1%
Benchmark 4 (76 runs): ./zstd-19
measurement mean ± σ min … max outliers delta
wall_time 66.0ms ± 811us 64.7ms … 68.9ms 4 ( 5%) 💩+ 20.5% ± 0.8%
peak_rss 11.9MB ± 49.9KB 11.8MB … 12.1MB 11 (14%) 💩+109.8% ± 0.3%
cpu_cycles 71.4M ± 981K 70.0M … 75.1M 7 ( 9%) 💩+ 8.7% ± 0.8%
instructions 80.1M ± 413 80.1M … 80.1M 0 ( 0%) ⚡- 2.5% ± 0.0%
cache_references 28.6M ± 11.8K 28.6M … 28.6M 12 (16%) ⚡- 2.3% ± 0.0%
cache_misses 670K ± 96.4K 560K … 990K 7 ( 9%) + 0.6% ± 4.9%
branch_misses 355K ± 726 352K … 356K 2 ( 3%) 💩+ 3.1% ± 0.1%
Benchmark 5 (61 runs): ./brotli
measurement mean ± σ min … max outliers delta
wall_time 82.7ms ± 1.69ms 81.1ms … 91.8ms 2 ( 3%) 💩+ 50.9% ± 1.0%
peak_rss 11.4MB ± 0 11.4MB … 11.4MB 0 ( 0%) 💩+100.5% ± 0.3%
cpu_cycles 94.1M ± 2.12M 92.1M … 105M 2 ( 3%) 💩+ 43.2% ± 1.1%
instructions 98.8M ± 11.2 98.8M … 98.8M 10 (16%) 💩+ 20.2% ± 0.0%
cache_references 42.9M ± 46.3K 42.8M … 43.0M 1 ( 2%) 💩+ 46.6% ± 0.0%
cache_misses 933K ± 53.1K 874K … 1.06M 0 ( 0%) 💩+ 40.0% ± 4.7%
branch_misses 999K ± 931 997K … 1.00M 1 ( 2%) 💩+190.5% ± 0.1%
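Summary: brotli decompression resource use, compared to zstd -6, is shit: roughly twice the peak memory, 43-82% more CPU cycles, and nearly three times as many branch misses.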