introduction

This post compares the standard compression methods available for tar. By standard, I mean those described in tar's man page:

       -j, --bzip2
              Filter the archive through bzip2(1).

       -J, --xz
              Filter the archive through xz(1).

       -z, --gzip, --gunzip, --ungzip
              Filter the archive through gzip(1).

       -Z, --compress, --uncompress
              Filter the archive through compress(1).

       --zstd Filter the archive through zstd(1).
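All of these options simply pipe the archive through the corresponding external compressor, so each flag has a plain pipe equivalent. A minimal sketch of that equivalence for gzip (archive.tar.gz and some_dir are placeholder names):

# these two commands produce equivalent compressed archives
tar czf archive.tar.gz some_dir
tar cf - some_dir | gzip > archive.tar.gz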

preparation

Tests were done on a low-end laptop.

hardware

  • CPU: i7-8550U @ 1.80 GHz (cores: 4 / threads: 8, max turbo frequency: 4.0 GHz)
  • disk: SATA SSD

operating system

Xubuntu 24.04 LTS

$ tar --version
tar (GNU tar) 1.35
Copyright (C) 2023 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <https://gnu.org/licenses/gpl.html>.
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.

Written by John Gilmore and Jay Fenlason.

I did not have the compress tool installed, so I installed it with:

sudo apt install ncompress
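Before running the benchmark, it is worth checking that all five compressors are actually present; a quick sketch:

# report any of the five filters that tar would fail to find
for c in gzip bzip2 xz compress zstd; do
    command -v "$c" >/dev/null || echo "$c is missing"
done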

data sets

I used two data sets of ASCII files, described below:

name         size   number of files   number of lines
data_set01   751M   394               19588813
data_set02   1.5G   61                38079940
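Statistics like these can be gathered with standard tools; a sketch for one data set (the directory name is just an example):

DATASET="data_set01"
du -sh "$DATASET"                                        # size
find "$DATASET" -type f | wc -l                          # number of files
find "$DATASET" -type f -print0 | xargs -0 cat | wc -l   # number of lines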

test script

I used the test script below to create archives with the different compression methods on the two data sets I prepared. The sync command synchronizes cached writes to persistent storage, and the sleep 120 lets my hardware (a laptop) ‘cool down’ back to its idle state between runs.

#!/bin/bash
# For each data set, build one archive per compression method.
# sync flushes cached writes; sleep 120 lets the machine settle back to idle.

DATASET="data_set01"
time tar czf ${DATASET}.tar.gz  ${DATASET} ; sync; sleep 120; sync
time tar cjf ${DATASET}.tar.bz2 ${DATASET} ; sync; sleep 120; sync
time tar cJf ${DATASET}.tar.xz  ${DATASET} ; sync; sleep 120; sync
time tar cZf ${DATASET}.tar.Z   ${DATASET} ; sync; sleep 120; sync
time tar -c --zstd -f ${DATASET}.tar.zst ${DATASET} ; sync; sleep 120; sync
time tar cf ${DATASET}.tar      ${DATASET} ; sync; sleep 120; sync

DATASET="data_set02"
time tar czf ${DATASET}.tar.gz  ${DATASET} ; sync; sleep 120; sync
time tar cjf ${DATASET}.tar.bz2 ${DATASET} ; sync; sleep 120; sync
time tar cJf ${DATASET}.tar.xz  ${DATASET} ; sync; sleep 120; sync
time tar cZf ${DATASET}.tar.Z   ${DATASET} ; sync; sleep 120; sync
time tar -c --zstd -f ${DATASET}.tar.zst ${DATASET} ; sync; sleep 120; sync
time tar cf ${DATASET}.tar      ${DATASET} ; sync; sleep 120; sync
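The same runs can also be expressed as a loop over the data sets; a functionally equivalent variant:

#!/bin/bash

for DATASET in data_set01 data_set02; do
    time tar czf ${DATASET}.tar.gz  ${DATASET} ; sync; sleep 120; sync
    time tar cjf ${DATASET}.tar.bz2 ${DATASET} ; sync; sleep 120; sync
    time tar cJf ${DATASET}.tar.xz  ${DATASET} ; sync; sleep 120; sync
    time tar cZf ${DATASET}.tar.Z   ${DATASET} ; sync; sleep 120; sync
    time tar -c --zstd -f ${DATASET}.tar.zst ${DATASET} ; sync; sleep 120; sync
    time tar cf ${DATASET}.tar      ${DATASET} ; sync; sleep 120; sync
done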

execution of the test

The test script was run once, which took about 24 minutes.

results

The combined results for both data sets are shown in the table below.

compression type   size [B]       time [s]   compression rate   execution ratio
xz                  157250668     1171.74    14.89              451
bz2                 234273588      175.43     9.99               67
gz                  270529843       53.55     8.65               20
zst                 290103115        8.13     8.07                3
Z                   545122420       28.13     4.29               10
tar                2341601280        2.59     1.00                1

‘compression rate’ is defined as the ratio of the uncompressed data size to the compressed data size, indicating how effectively the data has been reduced.

‘execution ratio’ is defined as the ratio of the execution time to the time it took to create the archive without compression (plain tar).
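For example, taking the xz row: dividing the plain tar size by the xz-compressed size reproduces the compression rate (the execution ratio is the analogous division of times; small deviations in the table come from rounding):

# compression rate for xz: uncompressed tar size / compressed size
awk 'BEGIN { printf "%.2f\n", 2341601280 / 157250668 }'   # prints 14.89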

summary

I plan more work in this area, covering compression tools that use parallel processing.

I think that will be more useful and show a bigger picture.
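As a preview: GNU tar can filter the archive through any external program via --use-compress-program (-I), which is how parallel compressors plug in. A sketch, assuming pigz (parallel gzip) is installed; zstd’s own -T0 flag uses all available cores (recent GNU tar, including the 1.35 used here, accepts arguments in the program string):

tar --use-compress-program=pigz -cf data_set01.tar.gz data_set01
tar --use-compress-program='zstd -T0' -cf data_set01.tar.zst data_set01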