TAR compression comparison
introduction
This post compares the standard compression methods that are available for TAR. By standard I mean those described in tar's man page:
-j, --bzip2
Filter the archive through bzip2(1).
-J, --xz
Filter the archive through xz(1).
-z, --gzip, --gunzip, --ungzip
Filter the archive through gzip(1).
-Z, --compress, --uncompress
Filter the archive through compress(1).
--zstd
Filter the archive through zstd(1).
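As a quick illustration of these filters, an archive can be created with any of them and extracted with plain `tar xf`, since GNU tar auto-detects the compression format on extraction. The sketch below uses a made-up `sample` directory and only the gzip filter; the other filters work the same way:

```shell
# build a tiny sample directory (hypothetical, for illustration only)
mkdir -p sample && echo "hello" > sample/a.txt

# compress with gzip (-z); -j, -J, -Z and --zstd follow the same pattern
tar czf sample.tar.gz sample

# on extraction GNU tar detects the compression format automatically
mkdir -p out && tar xf sample.tar.gz -C out

# verify the round trip
diff -r sample out/sample && echo "round trip OK"
```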
preparation
Tests were done on a low-end laptop.
hardware
- CPU: i7-8550U @ 1.80 GHz (cores: 4 / threads: 8, max turbo frequency 4.0 GHz)
- disk: SATA SSD
operating system
xubuntu 24.04 LTS
$ tar --version
tar (GNU tar) 1.35
Copyright (C) 2023 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <https://gnu.org/licenses/gpl.html>.
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.
Written by John Gilmore and Jay Fenlason.
The compress tool was not installed, so I installed it with:
sudo apt install ncompress
data sets
I used two data sets of ASCII files, described below:
| name | size | number of files | number of lines |
|---|---|---|---|
| data_set01 | 751M | 394 | 19588813 |
| data_set02 | 1.5G | 61 | 38079940 |
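For reference, figures like those in the table can be gathered with commands along these lines. This is a sketch against a tiny stand-in directory (the real data sets are not included here), not the exact commands used:

```shell
# build a tiny stand-in data set (the real data_set01/02 are much larger)
mkdir -p data_set_demo
printf 'one\ntwo\n' > data_set_demo/a.txt
printf 'three\n'    > data_set_demo/b.txt

du -sh data_set_demo                               # total size
find data_set_demo -type f | wc -l                 # number of files
find data_set_demo -type f -exec cat {} + | wc -l  # number of lines
```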
test script
I used the test script below to create archives with each compression method on the two data sets I prepared.
The sync command flushes cached writes to persistent storage, and sleep 120 lets the laptop 'cool down' and return to an idle state between runs.
#!/bin/bash
DATASET="data_set01"
time tar czf ${DATASET}.tar.gz ${DATASET} ; sync; sleep 120; sync
time tar cjf ${DATASET}.tar.bz2 ${DATASET} ; sync; sleep 120; sync
time tar cJf ${DATASET}.tar.xz ${DATASET} ; sync; sleep 120; sync
time tar cZf ${DATASET}.tar.Z ${DATASET} ; sync; sleep 120; sync
time tar -c --zstd -f ${DATASET}.tar.zst ${DATASET} ; sync; sleep 120; sync
time tar cf ${DATASET}.tar ${DATASET} ; sync; sleep 120; sync
DATASET="data_set02"
time tar czf ${DATASET}.tar.gz ${DATASET} ; sync; sleep 120; sync
time tar cjf ${DATASET}.tar.bz2 ${DATASET} ; sync; sleep 120; sync
time tar cJf ${DATASET}.tar.xz ${DATASET} ; sync; sleep 120; sync
time tar cZf ${DATASET}.tar.Z ${DATASET} ; sync; sleep 120; sync
time tar -c --zstd -f ${DATASET}.tar.zst ${DATASET} ; sync; sleep 120; sync
time tar cf ${DATASET}.tar ${DATASET} ; sync; sleep 120; sync
execution of the test
The test script was run once, which took about 24 minutes.
results
The combined results, for both datasets, are shown in the table below.
| compression type | size [B] | time [s] | compression rate | execution ratio |
|---|---|---|---|---|
| xz | 157250668 | 1171.74 | 14.89 | 451 |
| bz2 | 234273588 | 175.43 | 9.99 | 67 |
| gz | 270529843 | 53.55 | 8.65 | 20 |
| zst | 290103115 | 8.13 | 8.07 | 3 |
| Z | 545122420 | 28.13 | 4.29 | 10 |
| tar | 2341601280 | 2.59 | 1.00 | 1 |
'compression rate' is defined as the ratio of uncompressed data size to compressed data size, indicating how effectively the data has been reduced
'execution ratio' is defined as the ratio of execution time to the time it took to create the archive without compression (plain tar)
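The derived columns can be recomputed from the raw sizes and times. The sketch below assumes the published values were truncated (not rounded) to two decimals, which reproduces the compression-rate column exactly; the xz execution ratio comes out one higher than the table (452 vs 451), presumably because the published times are themselves rounded:

```python
# raw measurements copied from the results table (bytes, seconds)
results = {
    "xz":  (157250668, 1171.74),
    "bz2": (234273588, 175.43),
    "gz":  (270529843, 53.55),
    "zst": (290103115, 8.13),
    "Z":   (545122420, 28.13),
    "tar": (2341601280, 2.59),
}

tar_size, tar_time = results["tar"]

def truncate(x: float, digits: int = 2) -> float:
    """Truncate (not round) to the given number of decimals."""
    factor = 10 ** digits
    return int(x * factor) / factor

for name, (size, seconds) in results.items():
    rate = truncate(tar_size / size)   # compression rate vs. plain tar
    ratio = int(seconds / tar_time)    # execution ratio, whole multiples
    print(f"{name:4s} rate={rate:5.2f} ratio={ratio}")
```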
summary
I plan further work in this area that will include compression tools that use parallel processing.
I think that will be more useful and show a bigger picture.
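As a preview of that direction, GNU tar can already hand compression off to an external program via --use-compress-program (-I), which is how parallel tools are typically plugged in. A sketch, assuming xz (with its built-in -T0 threading, available since xz 5.2) is installed and using a hypothetical sample directory:

```shell
# hypothetical sample directory standing in for a real data set
DATASET="data_set_demo"
mkdir -p "${DATASET}" && echo "sample" > "${DATASET}/a.txt"

# xz -T0 uses all available cores for compression
tar --use-compress-program='xz -T0' -cf "${DATASET}.tar.xz" "${DATASET}"

# the same mechanism works for pigz (parallel gzip), if installed:
#   tar --use-compress-program=pigz -cf "${DATASET}.tar.gz" "${DATASET}"

# list the archive contents to verify it was written
tar tf "${DATASET}.tar.xz"
```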