Web Compression

2024-08-07

I wrote this comment about pre-compressing web artifacts with zstd:

I have read somewhere (can’t find links handy) that for web server case, zstd may not be as useful as brotli due to longer decompression speed, but I may be wrong here.

That felt wrong: someone is suggesting a cool change in a module, and I am just FUDing it. If I were the PR submitter, I would certainly not appreciate this comment. So I decided to conduct a non-scientific experiment: take a big piece of JavaScript and compare brotli with zstd.

Executive Summary

brotli compresses my chosen piece of JavaScript better than zstd: the zstd output is 4-22% larger at levels 6 to 19, and 37% larger at level 3.
brotli decompresses 50-80% slower than zstd -6 (depending on platform) and uses more system resources than zstd.

As a result, I will re-phrase my comment on GitHub and welcome zstd to the default compressors.

Benchmark Setup

Hardware:

AMD Ryzen 7 7840HS, DDR5-5600.
Raspberry Pi 4.

CPU scaling governor set to performance on both nodes:

for f in /sys/devices/system/cpu/cpufreq/*/scaling_governor; do echo 'performance' | sudo tee "$f"; done
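Before benchmarking it may be worth confirming that the setting actually stuck. The following is a minimal sketch, not part of the original setup: the function name all_governors_are and its optional ROOT argument are hypothetical, and ROOT exists only so the check can be pointed at a directory other than the real sysfs path.

```shell
#!/bin/sh
# all_governors_are WANT [ROOT] (hypothetical helper): succeed only if every
# cpufreq policy under ROOT reports the governor WANT.
# ROOT defaults to the sysfs directory used in the loop above.
all_governors_are() {
    want=$1
    root=${2:-/sys/devices/system/cpu/cpufreq}
    found=0
    for f in "$root"/*/scaling_governor; do
        # Skip the unexpanded glob (no policies) and unreadable files.
        [ -r "$f" ] || continue
        found=1
        if [ "$(cat "$f")" != "$want" ]; then
            echo "$f: $(cat "$f")" >&2
            return 1
        fi
    done
    # Fail if no policy was found at all.
    [ "$found" -eq 1 ]
}

if all_governors_are performance; then
    echo "all policies use the performance governor"
else
    echo "governor check failed" >&2
fi
```

On a machine without cpufreq support the check reports a failure rather than silently passing.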

Software:

NixOS 24.05-2908-g883180e6550c.
Linux v6.6.44.
brotli 1.1.0 from the distribution.
zstd v1.5.6 from the distribution.

Test Harness

I picked YouTube’s desktop_polymer.js, because:

That file weighs 8.52MB.
YouTube is a somewhat frequently accessed website, so that file is frequently downloaded and decompressed, making it somewhat representative, albeit anecdatal¹.

Acquiring and compressing it:

$ wget https://www.youtube.com/s/desktop/bf8c00d7/jsbin/desktop_polymer.vflset/desktop_polymer.js -O y.js
$ for prog in 'zstd -3' 'zstd -6' 'zstd -9' 'zstd -12' 'zstd -15' 'zstd -19' 'zstd --ultra -22'; do $prog y.js -o y.js.${prog##*-}.zst; done
$ brotli y.js

poop accepts a single command to run, so we have this wrapper:

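#!/bin/sh
# Dispatch on the name this script was invoked as (via a symlink).
set -e
FILE=y.js
case "$0" in
    ./brotli)
        exec brotli -cd "${FILE}.br"
        ;;
    ./zstd-*)
        # Strip the "./zstd-" prefix to recover the compression level.
        level=${0#./zstd-}
        exec zstd -cd "${FILE}.${level}.zst"
        ;;
    *)
        >&2 echo "invalid program $0"
        # Exit non-zero so a bad symlink name fails the benchmark run.
        exit 1
        ;;
esac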

Then symlink to it for each compression level:

$ for l in 3 6 9 19 22; do ln -s wrap zstd-${l}; done
$ ln -s wrap brotli
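Both the compression loop and the wrapper rely on POSIX prefix-stripping parameter expansion to recover the zstd level. Here is a small sketch of the two forms in isolation; the function names level_of_prog and level_of_name are made up for illustration.

```shell
#!/bin/sh
# ${var##pattern} removes the LONGEST matching prefix: everything up to and
# including the last '-', which leaves the level at the end of the command.
level_of_prog() {
    printf '%s\n' "${1##*-}"
}

# ${var#pattern} removes the SHORTEST matching prefix: here the literal
# './zstd-' of the symlink name the wrapper was invoked as.
level_of_name() {
    printf '%s\n' "${1#./zstd-}"
}

level_of_prog 'zstd --ultra -22'   # prints 22
level_of_name './zstd-19'          # prints 19
```

The longest-match form matters for 'zstd --ultra -22': a single '#' would stop at the first dash and leave '-ultra -22'.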

Compression Ratio

Filename     Bytes    % larger than br
y.js.br      1380352             0.00%
y.js.22.zst  1437696             4.15%
y.js.19.zst  1437696             4.15%
y.js.15.zst  1548288            12.17%
y.js.12.zst  1581056            14.54%
y.js.9.zst   1609728            16.62%
y.js.6.zst   1687552            22.26%
y.js.3.zst   1892352            37.09%
y.js         8519680           517.21%

As we can see, zstd -19 yielded 4% worse compression for this file than brotli. We should keep in mind that brotli has web-specific tricks, such as a built-in static dictionary of common web strings, which gives it somewhat of an advantage over other compressors for this corpus.
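The last column of the table can be reproduced with a few lines of shell. This is a sketch under assumptions: pct_larger and size_table are hypothetical helper names, and size_table uses wc -c, which reports exact byte counts and so may differ slightly from the table if those sizes were measured some other way.

```shell
#!/bin/sh
# pct_larger SIZE BASELINE: percent by which SIZE exceeds BASELINE.
pct_larger() {
    awk -v size="$1" -v base="$2" \
        'BEGIN { printf "%.2f\n", (size - base) * 100 / base }'
}

# size_table BASELINE_FILE FILE...: one row per file with its byte count
# and how much larger it is than the baseline file.
size_table() {
    base=$(( $(wc -c < "$1") ))
    for f in "$@"; do
        size=$(( $(wc -c < "$f") ))
        printf '%-12s %9d %9s%%\n' "$f" "$size" "$(pct_larger "$size" "$base")"
    done
}

# Sizes copied from the table: zstd -19 versus brotli.
pct_larger 1437696 1380352   # prints 4.15
```

With the compressed files in place, something like `size_table y.js.br y.js.*.zst y.js` would print one row per artifact.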

Since zstd -19 and zstd -22 yield the same compression ratio, there is no point in going ultra, so I will exclude zstd -22 from the tests.

Decompression Speed

hyperfine --export-markdown $(hostname) -w 1 -N ./zstd-{6,3,9,19} ./brotli

AMD Ryzen

Command	Mean [ms]	Min [ms]	Max [ms]	Relative
`./zstd-6`	8.4 ± 0.4	7.8	11.3	1.00
`./zstd-3`	8.7 ± 0.4	8.0	11.6	1.03 ± 0.07
`./zstd-9`	9.1 ± 0.6	8.3	13.6	1.08 ± 0.09
`./zstd-19`	10.8 ± 0.7	9.8	14.1	1.28 ± 0.10
`./brotli`	15.5 ± 0.8	14.4	19.4	1.83 ± 0.13

Raspberry Pi 4

Command	Mean [ms]	Min [ms]	Max [ms]	Relative
`./zstd-6`	53.9 ± 0.6	52.6	55.3	1.00
`./zstd-3`	56.2 ± 2.3	54.9	68.2	1.04 ± 0.04
`./zstd-9`	57.7 ± 0.6	56.7	59.3	1.07 ± 0.02
`./zstd-19`	65.3 ± 0.6	64.3	66.8	1.21 ± 0.02
`./brotli`	82.7 ± 2.2	81.5	91.2	1.53 ± 0.04

Summary: zstd -6 is fastest, brotli is slower by 50-80%.
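The 50-80% range can be sanity-checked from the mean times in the tables. This is a rough sketch with a hypothetical helper name, slower_pct; it works from the rounded means printed above, so its results can differ a little from hyperfine's Relative column, which is computed from unrounded measurements.

```shell
#!/bin/sh
# slower_pct FAST SLOW: percent by which the SLOW mean exceeds the FAST mean.
slower_pct() {
    awk -v fast="$1" -v slow="$2" \
        'BEGIN { printf "%.1f\n", (slow - fast) * 100 / fast }'
}

# Rounded mean times in ms from the tables: zstd -6 versus brotli.
slower_pct 8.4 15.5    # AMD Ryzen
slower_pct 53.9 82.7   # Raspberry Pi 4
```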

Resource Usage

poop ./zstd-{6,3,9,19} ./brotli

AMD Ryzen

Benchmark 1 (563 runs): ./zstd-6
  measurement          mean ± σ            min … max           outliers         delta
  wall_time          8.83ms ± 1.02ms    7.79ms … 11.8ms         89 (16%)        0%
  peak_rss           6.31MB ± 84.0KB    6.03MB … 6.42MB          1 ( 0%)        0%
  cpu_cycles         29.1M  ±  565K     28.6M  … 37.5M          43 ( 8%)        0%
  instructions       90.6M  ± 14.9K     90.6M  … 90.7M          20 ( 4%)        0%
  cache_references   1.76M  ± 23.0K     1.72M  … 1.99M          26 ( 5%)        0%
  cache_misses        122K  ± 2.76K      116K  …  136K           3 ( 1%)        0%
  branch_misses       334K  ± 1.25K      332K  …  342K          51 ( 9%)        0%
Benchmark 2 (546 runs): ./zstd-3
  measurement          mean ± σ            min … max           outliers         delta
  wall_time          9.11ms ± 1.01ms    8.13ms … 12.0ms         84 (15%)        💩+  3.2% ±  1.4%
  peak_rss           6.31MB ± 84.9KB    6.03MB … 6.42MB          1 ( 0%)          -  0.1% ±  0.2%
  cpu_cycles         30.3M  ±  428K     29.9M  … 34.3M          44 ( 8%)        💩+  4.2% ±  0.2%
  instructions       97.5M  ± 14.7K     97.5M  … 97.5M          24 ( 4%)        💩+  7.6% ±  0.0%
  cache_references   1.85M  ± 21.9K     1.80M  … 2.05M          29 ( 5%)        💩+  4.8% ±  0.2%
  cache_misses        125K  ± 2.57K      119K  …  142K          12 ( 2%)        💩+  2.2% ±  0.3%
  branch_misses       319K  ± 1.62K      317K  …  329K          49 ( 9%)        ⚡-  4.3% ±  0.1%
Benchmark 3 (524 runs): ./zstd-9
  measurement          mean ± σ            min … max           outliers         delta
  wall_time          9.51ms ± 1.03ms    8.36ms … 12.7ms         82 (16%)        💩+  7.6% ±  1.4%
  peak_rss           8.40MB ± 84.2KB    8.13MB … 8.65MB          4 ( 1%)        💩+ 33.1% ±  0.2%
  cpu_cycles         29.1M  ± 1.21M     28.1M  … 33.7M          80 (15%)          -  0.1% ±  0.4%
  instructions       85.2M  ± 14.1K     85.2M  … 85.2M          20 ( 4%)        ⚡-  6.0% ±  0.0%
  cache_references   1.79M  ± 16.4K     1.76M  … 1.87M          19 ( 4%)        💩+  1.7% ±  0.1%
  cache_misses        156K  ± 2.76K      151K  …  168K          10 ( 2%)        💩+ 28.4% ±  0.3%
  branch_misses       331K  ± 1.04K      329K  …  336K          39 ( 7%)          -  0.9% ±  0.0%
Benchmark 4 (442 runs): ./zstd-19
  measurement          mean ± σ            min … max           outliers         delta
  wall_time          11.3ms ± 1.19ms    10.00ms … 15.0ms        86 (19%)        💩+ 27.6% ±  1.5%
  peak_rss           12.5MB ± 87.9KB    12.2MB … 12.6MB          1 ( 0%)        💩+ 97.6% ±  0.2%
  cpu_cycles         31.3M  ± 1.58M     30.2M  … 39.2M          56 (13%)        💩+  7.5% ±  0.5%
  instructions       88.5M  ± 15.5K     88.4M  … 88.5M          22 ( 5%)        ⚡-  2.4% ±  0.0%
  cache_references   1.81M  ± 18.2K     1.77M  … 1.94M          20 ( 5%)        💩+  2.6% ±  0.1%
  cache_misses        192K  ± 2.68K      186K  …  200K           3 ( 1%)        💩+ 57.3% ±  0.3%
  branch_misses       346K  ± 1.06K      344K  …  352K          23 ( 5%)        💩+  3.6% ±  0.0%
Benchmark 5 (316 runs): ./brotli
  measurement          mean ± σ            min … max           outliers         delta
  wall_time          15.8ms ± 1.27ms    14.6ms … 20.5ms         72 (23%)        💩+ 78.7% ±  1.7%
  peak_rss           12.0MB ±  102KB    11.7MB … 12.2MB          2 ( 1%)        💩+ 90.3% ±  0.2%
  cpu_cycles         52.9M  ± 1.53M     51.8M  … 71.4M          12 ( 4%)        💩+ 81.7% ±  0.5%
  instructions        101M  ± 14.4K      101M  …  101M           8 ( 3%)        💩+ 11.8% ±  0.0%
  cache_references   1.96M  ±  155K     1.91M  … 3.45M          11 ( 3%)        💩+ 11.3% ±  0.7%
  cache_misses        165K  ± 1.60K      161K  …  172K           1 ( 0%)        💩+ 35.5% ±  0.3%
  branch_misses       898K  ±  905       896K  …  903K           9 ( 3%)        💩+169.1% ±  0.0%

Raspberry Pi 4

Benchmark 1 (91 runs): ./zstd-6
  measurement          mean ± σ            min … max           outliers         delta
  wall_time          54.8ms ± 1.71ms    53.0ms … 63.7ms          6 ( 7%)        0%
  peak_rss           5.69MB ± 69.8KB    5.51MB … 5.77MB          0 ( 0%)        0%
  cpu_cycles         65.7M  ± 2.23M     63.4M  … 77.1M          12 (13%)        0%
  instructions       82.2M  ±  888      82.2M  … 82.2M           4 ( 4%)        0%
  cache_references   29.2M  ± 14.1K     29.2M  … 29.3M           2 ( 2%)        0%
  cache_misses        666K  ±  116K      553K  … 1.02M          11 (12%)        0%
  branch_misses       344K  ± 1.50K      341K  …  349K           1 ( 1%)        0%
Benchmark 2 (89 runs): ./zstd-3
  measurement          mean ± σ            min … max           outliers         delta
  wall_time          56.2ms ±  871us    55.0ms … 60.0ms          3 ( 3%)        💩+  2.6% ±  0.7%
  peak_rss           5.68MB ± 67.3KB    5.51MB … 5.77MB          0 ( 0%)          -  0.2% ±  0.4%
  cpu_cycles         68.1M  ± 1.14M     66.5M  … 73.7M           4 ( 4%)        💩+  3.6% ±  0.8%
  instructions       88.5M  ±  436      88.5M  … 88.5M           2 ( 2%)        💩+  7.7% ±  0.0%
  cache_references   31.4M  ± 10.8K     31.4M  … 31.4M           6 ( 7%)        💩+  7.4% ±  0.0%
  cache_misses        676K  ±  100K      577K  … 1.06M           5 ( 6%)          +  1.5% ±  4.8%
  branch_misses       326K  ± 1.36K      322K  …  328K           0 ( 0%)        ⚡-  5.3% ±  0.1%
Benchmark 3 (85 runs): ./zstd-9
  measurement          mean ± σ            min … max           outliers         delta
  wall_time          58.6ms ± 2.36ms    56.8ms … 70.5ms          5 ( 6%)        💩+  7.0% ±  1.1%
  peak_rss           7.77MB ± 72.7KB    7.60MB … 8.00MB          0 ( 0%)        💩+ 36.6% ±  0.4%
  cpu_cycles         67.7M  ± 2.64M     65.8M  … 81.7M           7 ( 8%)        💩+  3.1% ±  1.1%
  instructions       77.4M  ±  923      77.4M  … 77.4M           6 ( 7%)        ⚡-  5.8% ±  0.0%
  cache_references   27.7M  ± 11.9K     27.7M  … 27.7M           8 ( 9%)        ⚡-  5.4% ±  0.0%
  cache_misses        661K  ± 86.0K      563K  …  958K           6 ( 7%)          -  0.8% ±  4.6%
  branch_misses       341K  ± 1.23K      338K  …  344K           0 ( 0%)          -  0.8% ±  0.1%
Benchmark 4 (76 runs): ./zstd-19
  measurement          mean ± σ            min … max           outliers         delta
  wall_time          66.0ms ±  811us    64.7ms … 68.9ms          4 ( 5%)        💩+ 20.5% ±  0.8%
  peak_rss           11.9MB ± 49.9KB    11.8MB … 12.1MB         11 (14%)        💩+109.8% ±  0.3%
  cpu_cycles         71.4M  ±  981K     70.0M  … 75.1M           7 ( 9%)        💩+  8.7% ±  0.8%
  instructions       80.1M  ±  413      80.1M  … 80.1M           0 ( 0%)        ⚡-  2.5% ±  0.0%
  cache_references   28.6M  ± 11.8K     28.6M  … 28.6M          12 (16%)        ⚡-  2.3% ±  0.0%
  cache_misses        670K  ± 96.4K      560K  …  990K           7 ( 9%)          +  0.6% ±  4.9%
  branch_misses       355K  ±  726       352K  …  356K           2 ( 3%)        💩+  3.1% ±  0.1%
Benchmark 5 (61 runs): ./brotli
  measurement          mean ± σ            min … max           outliers         delta
  wall_time          82.7ms ± 1.69ms    81.1ms … 91.8ms          2 ( 3%)        💩+ 50.9% ±  1.0%
  peak_rss           11.4MB ±    0      11.4MB … 11.4MB          0 ( 0%)        💩+100.5% ±  0.3%
  cpu_cycles         94.1M  ± 2.12M     92.1M  …  105M           2 ( 3%)        💩+ 43.2% ±  1.1%
  instructions       98.8M  ± 11.2      98.8M  … 98.8M          10 (16%)        💩+ 20.2% ±  0.0%
  cache_references   42.9M  ± 46.3K     42.8M  … 43.0M           1 ( 2%)        💩+ 46.6% ±  0.0%
  cache_misses        933K  ± 53.1K      874K  … 1.06M           0 ( 0%)        💩+ 40.0% ±  4.7%
  branch_misses       999K  ±  931       997K  … 1.00M           1 ( 2%)        💩+190.5% ±  0.1%

Summary: compared to zstd -6, brotli decompression uses roughly twice the peak memory, 43-82% more CPU cycles, and about 2.7-2.9 times as many branch misses.

¹ anecdatal is a creative variation of anecdata.