Introduction

This is part of a series of articles on backing up computers. The top page is Design for an Archiving Backup System.
Redundancy of Source Data

On a network of multiple similar computers there can be a lot of redundant data, particularly in the operating system and applications, because each machine has many of the same files installed on it. A backup system that recognizes and exploits this redundancy can reduce both the volume of backup media and the time taken to perform backups (especially full backups).
The previous solution of using a cryptographic hash function to identify when a particular file has really changed can also tell us whether a file on some other machine is really the same as the file on the current machine, and if so, allow us to avoid storing it redundantly.
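As a minimal sketch of the idea (not the article's actual implementation), the content of each file can be hashed, and files from different machines grouped by digest: any digest that appears more than once need only be stored once. The function names here (`file_digest`, `find_shared_files`) are illustrative, and SHA-256 is assumed as the hash, though any cryptographic hash would serve.

```python
import hashlib
from collections import defaultdict
from pathlib import Path


def file_digest(path, chunk_size=65536):
    """Return the SHA-256 hex digest of a file's contents, read in chunks."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        while chunk := f.read(chunk_size):
            h.update(chunk)
    return h.hexdigest()


def find_shared_files(roots):
    """Group files under several roots (e.g. one per machine) by content digest.

    Returns only the groups with more than one member: these are the files
    that need to be stored just once in the backup set.
    """
    by_digest = defaultdict(list)
    for root in roots:
        for p in Path(root).rglob("*"):
            if p.is_file():
                by_digest[file_digest(p)].append(p)
    return {d: paths for d, paths in by_digest.items() if len(paths) > 1}
```

In a real backup system the digests would be computed incrementally and kept in an index, rather than rescanning every machine, but the grouping logic is the same.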
I studied five machines on a network, running NT 4.0 Workstation, Windows 2000 Pro, XP Pro, and NT 4.0 Server. The following file and byte counts were observed (I also looked at the number of chunks that would be needed for chunk sizes of 8 KB, 64 KB, 256 KB, 1 MB, and 4 MB; more on chunks, or blocks, later).