Introduction

This is part of a series of articles on backing up computers. The top page is Design for an Archiving Backup System.
This article describes the main features of the ArcvBack backup software that I have written.
Backup Type

ArcvBack is a traditional backup system that implements a full backup followed by a set of incremental backups. However, it is intended to be used as an archiving backup system, in which a full backup is followed by a long run of incremental backups. Through the use of a large backup cache drive and an online version database, ArcvBack makes the restore process easy, even in the presence of a large number of incremental backups. When the backup cache space is exhausted it is time to start a new backup cycle (a full backup followed by incrementals) again.
Supported Media

ArcvBack does not place any particular requirements on backup media, apart from requiring hard disk space for the backup cache. Backup package files (the files containing the saved data) can be configured to any convenient size, and it is up to the user to copy these to the final backup media. Removable drives (particularly external USB-attached IDE drives) can be used as backup media, and a program called ArcvPkgCopy is provided to copy and verify them automatically. Optical media such as DVD+RW or DVD-R (and, in the future, Blu-ray) is quite attractive from a cost and durability standpoint, and is especially suitable for smaller systems (say, with up to 100GB of data). Tape drives can also be used, though these are typically less cost effective than removable hard drives. Finally, if the cache is on a robust drive (such as a RAID array) it might never be necessary to copy the backup package files to backup media for some applications.
Backup Cache

ArcvBack uses a disk cache to record the backup data while it is running. This data is organized into a series of "package" (.pkg) files. These files are uniquely identified so that when a restore takes place the restoration utility can find the necessary data. They have a configurable maximum size, so depending on the type of backup media used one may want to set the size to fit it well. For backup to DVD media I find that making the package files about 100MB in size works well with DVD recording software, as about 45 files will fit on a DVD.
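The arithmetic behind that package-size choice is easy to check. This quick sketch assumes a single-layer DVD capacity of about 4.7 * 10^9 bytes and 100MiB package files; the exact count depends on the recording software's overhead.

```python
# Assumed values: single-layer DVD capacity and a 100MiB package size.
DVD_BYTES = 4.7e9
PKG_BYTES = 100 * 2**20

packages_per_dvd = int(DVD_BYTES // PKG_BYTES)
print(packages_per_dvd)  # roughly 44-45 packages per disc
```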
ArcvBack allows the backup cache to use several directories (obviously on separate drives) so that if you have free space on a number of disks it can make use of it. At some cost in backup speed these directories could even be located on different machines in the LAN.
ArcvBack can also provide redundant package file storage: if you have two drives with free space you can configure ArcvBack to write each package file to both drives, providing redundancy at the cache level so that if one cache drive fails the backup is unaffected. For greater protection this form of redundancy can be used with drives on different machines through network shares, so that if one machine fails the backup data is still available on the second machine.
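The redundant-cache idea can be sketched in a few lines. This is only an illustration of writing one package to every configured cache directory; the real ArcvBack configuration and on-disk layout may differ, and the function name is hypothetical.

```python
from pathlib import Path

def write_package_redundant(data: bytes, name: str, cache_dirs):
    """Write one package file to every configured cache directory,
    so the loss of any single cache drive leaves a full copy intact.
    A minimal sketch; not the actual ArcvBack implementation."""
    written = []
    for d in cache_dirs:
        path = Path(d) / name
        path.write_bytes(data)
        written.append(path)
    return written
```

Network shares can simply appear in `cache_dirs` as UNC paths, which is how the cross-machine redundancy described above would be configured.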
For the easiest restore experience it is best to keep a copy of all the package files that makeup the current backup set (i.e. the full backup and all following incrementals) in the cache. However, the restore utility does have the ability to temporarily mount additional directories (or even drives) that contain package files that have since been removed from the cache to make space. And if you are only doing a partial restore the restore utilities can tell you which additional package files you will need to make available for the restore to succeed, so you don't necessarily have to restore excessive amounts of data to just get the few files you really need back.
Cache to Archive

Copying backup packages from the cache to archive media can be done automatically using the ArcvPkgCopy tool if the archive media is a hard drive. If removable media such as DVD or tape is used, then some other utility program (such as DVD writing software) must be used to perform the copy.
Once package files have been copied it is up to the user to decide whether to delete them from the backup cache. The recommendation here is not to delete the files from the cache. By leaving them in the cache you gain two significant advantages:
Backup Integrity Checks

ArcvBack includes a number of automatic integrity checks as part of its backup and restore processes. As each file is backed up and added to a package file, an SHA-1 digest (a 160-bit Secure Hash Algorithm checksum) is computed for the file and written to the package and the version database. In addition, if a file that is too large to fit in a single package is being backed up, then each portion of that file will have an individual SHA-1 checksum computed for it, as well as an overall checksum for the whole file.
Once the package is written to disk it is closed, then reopened and the backup program will re-read the whole package and check that all the checksums are still valid. If errors are found then the package is not considered valid and the package will not be entered into the version database.
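The write-then-reverify step described above can be sketched as follows. This is an illustration of the idea, not the actual ArcvBack package format: digests are computed as data is written, the file is closed and reopened, and every digest is rechecked before the package would be accepted.

```python
import hashlib

def write_and_verify(path, chunks):
    """Write chunks with per-chunk SHA-1 digests, then reopen the
    file and re-read everything to confirm the digests still match.
    Returns True only if every chunk verifies after the re-read."""
    digests = [hashlib.sha1(c).hexdigest() for c in chunks]
    with open(path, "wb") as f:
        for c in chunks:
            f.write(c)
    # Re-read the whole file and recheck every digest; only then
    # would the package be entered into the version database.
    with open(path, "rb") as f:
        for c, d in zip(chunks, digests):
            if hashlib.sha1(f.read(len(c))).hexdigest() != d:
                return False
    return True
```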
When using ArcvPkgCopy to copy package files to external hard drives, it will also perform the same read-after-write integrity testing.
The file restore programs will perform the same tests as they are reading the data from the packages and will also compute an overall SHA1 checksum for the whole file and compare that to ensure that the restored files are exactly as they were originally.
Even with all this testing there is still the possibility that when the original file was first read from the disk an error (perhaps a memory error) occurred and the SHA1 digests all include this error in their computations. This sort of error is rather difficult to eliminate since Windows aggressively caches the contents of files, so if one were to do a second read of each file and compare the contents they might well always get the same error due to the file cache.
Large File Handling

ArcvBack can back up very large files (such as large video files or drive images) if needed. The size of these files is not limited by the maximum package file size the user has set, which allows large files to be backed up onto media that is smaller than they are. ArcvBack can back up files that are larger than 4GB, though this may depend a bit on the file system and the version of the operating system it is running on.
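Splitting a large file into portions with a per-portion digest plus an overall digest, as described in the integrity-check section above, can be sketched like this. The function name and return shape are illustrative, not ArcvBack's actual interface.

```python
import hashlib

def split_with_checksums(stream, portion_size):
    """Read a file stream in fixed-size portions, computing a SHA-1
    digest per portion plus an overall digest for the whole file.
    Each portion could then be stored in a different package file."""
    whole = hashlib.sha1()
    portions = []
    while True:
        data = stream.read(portion_size)
        if not data:
            break
        whole.update(data)
        portions.append((hashlib.sha1(data).hexdigest(), len(data)))
    return portions, whole.hexdigest()
```

Because each portion carries its own checksum, a restore can verify every piece independently and then confirm the reassembled file against the overall digest.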
Online Version Database

ArcvBack has an online version database. While this is actually stored as a directory tree of files on disk (and not in a true database), it serves the same purpose as the version databases in other products. This database stores some information about each file that has been backed up or encountered by the backup program in the current backup set (i.e. the full backup and all subsequent incremental backups). You can browse this information with a command line tool or with a GUI tool that provides a tree view of its contents.
The main information in this database is the machine and directory structure, together with the files and the versions of files that have been backed up over time. Each recorded file version has its size, date, and backup event ID stored here. Two other significant features are supported: first, one can see files that have been added to the system but which the backup program has not yet succeeded in backing up; second, when a file that has been backed up gets deleted, the date and time at which the backup program noticed this is recorded.
The online database is most important for facilitating the file restore process, so the ArcvBack program makes a special backup of the database at the end of each run. This is done by creating a zip file that contains the whole database and saving it in a special directory. The system keeps a number (say 10) of the most recent zip files, automatically deleting any older ones, so if for some reason you need to revert to an older database version you can just pick one and unzip it.
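The zip-and-rotate scheme above can be sketched in a few lines. This is an illustration only, assuming a sequence-numbered file name pattern (`versiondb-NNNNNN.zip`) that is not ArcvBack's actual naming.

```python
import zipfile
from pathlib import Path

def backup_database(db_dir, zip_dir, keep=10):
    """Zip the whole version-database directory tree into zip_dir,
    then delete all but the `keep` newest zip files."""
    db_dir, zip_dir = Path(db_dir), Path(zip_dir)
    existing = sorted(zip_dir.glob("versiondb-*.zip"))
    seq = int(existing[-1].stem.split("-")[1]) + 1 if existing else 0
    out = zip_dir / f"versiondb-{seq:06d}.zip"
    with zipfile.ZipFile(out, "w", zipfile.ZIP_DEFLATED) as z:
        for f in db_dir.rglob("*"):
            if f.is_file():
                z.write(f, f.relative_to(db_dir))
    # Prune: sequence numbers sort oldest-first, so drop the head.
    for old in sorted(zip_dir.glob("versiondb-*.zip"))[:-keep]:
        old.unlink()
    return out
```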
Almost all of the information in the version database is also contained in the actual package files (the two things that are missing are the file deletion dates and the new files that were seen but could not be backed up). A special tool is provided so that, as a last resort, one could take a set of packages and recreate the version database from them (just lacking the two special dates).
File Restoration

Restoration proceeds by first identifying the versions of the files that must be restored and the directory where they will be placed. Then:
Backup Events

From an organizational perspective it can be useful to record particular backup events for future reference, such as when a full backup pass is completed. These backup events record all the files that were included in a particular event, so that if you wanted to return a machine to that particular state you could.
Treatment of Recently Deleted Files

When a file that has been backed up in the current set is deleted, you will still be able to see that file and any older versions of it that were backed up for as long as the current set is continued. When you start a new set, since that file is no longer on disk, it will not be present in the new database. If, after starting a new set, you find you need to restore the file, you can do so by using the database from the older set and packages from that set.
Handling Large Files

There are three cases of interest here:
When a file exceeds a particular size, say 1MB, it is recorded as a set of 1MB blocks plus one block of less than 1MB to hold the remainder of the file (each of these is a block object); when a file is less than 1MB it is recorded as a single block object of variable size. I have picked 1MB as this seems to be a midpoint: below this size the overhead due to tracking the individual chunks (which is about 60 bytes/chunk in a system that archives 4 copies) grows rapidly and comes to dominate the file record database. In theory this size would even allow you to save archive data to floppy disks. With a size such as 1MB the directory of a single CD of backup data would still be quite short, and a DVD disc would hold about 4500 files. Increasing the block size to 4MB or even 16MB would shrink these directories further and still would not waste significant amounts of space on either of these kinds of media. This might also matter for the cache disk's directory, since that might be an 80GB disk. The other factor in directory size is that most of the files will actually be less than 1MB long; in fact, with the average file length being on the order of 100K, the directory size will be dominated by these small files.
How did I arrive at the 1MB chunk size? Consider a machine with roughly 5GB of data in about 50,000 files, giving an average file size of about 100Kbytes. Since a 1MB chunk size is much greater than the average file size, the number of chunks will be only slightly more than the number of files, but if we reduce the chunk size the number of chunks rises. We can estimate this with the following formula:
total_chunks = #of_files + total_size/chunk_size
The database_size is the approximate storage space that must be allocated for each file object, multiplied by the number of files; it varies with the average number of chunks per file, and we don't want it getting too large as that would prove troublesome. Each chunk needs to have its BID (20 bytes) stored along with the MIDs of where it is archived (I've assumed 1 MID for the cache and 3 MIDs for three archive copies, hence 16 bytes). Each file object then has overhead composed of fixed items like the file name, date, etc. (let's assume this is 128 bytes), a count of chunks, and a list of the chunks (BIDs) in the file; the average length of this list is total_chunks/#of_files:
database_size = total_chunks * (16 + 20) + #of_files * (128 + 4 + 20*total_chunks/#of_files)

database_size = total_chunks * 56 + #of_files * 132
Since #of_files is constant at 50,000, that term gives us a fixed minimum database size (when there is only one chunk per file) of 6.6M.
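The formulas above are easy to evaluate mechanically. This sketch computes the estimated database size for the example machine (50,000 files, 5GB of data) at a few candidate chunk sizes; it is just the estimate formula, not a measurement.

```python
def database_size(n_files, total_bytes, chunk_size):
    """Version-database size estimate using the formulas above:
    total_chunks  = n_files + total_bytes / chunk_size
    database_size = total_chunks * 56 + n_files * 132
    """
    total_chunks = n_files + total_bytes // chunk_size
    return total_chunks * 56 + n_files * 132

# The 5GB / 50,000-file example machine from the text, at
# chunk sizes of 64K, 256K, and 1M.
for cs in (64 * 2**10, 256 * 2**10, 2**20):
    print(cs, database_size(50_000, 5 * 2**30, cs))
```

At 1MB chunks the estimate is under 10M, and even at 64K it stays well under 15M, which is why the chunk term only starts to dominate at much smaller chunk sizes.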
So a chunk size of 256K to 1M would seem about right. I had originally done these calculations for a lower number of files per machine (22000) and a larger number of bytes (10GB) per machine, in which case the effect of chunk size is more significant (for an 8K chunk size a 126M database was expected).
Note the database sizes above do not include storage of the directory structure, but this is a much smaller storage requirement as the number of directories is less than 1/10 the number of files and the data to be stored per directory is less than that stored for each file (as there is no need to track any data chunks). So this can be neglected.
What to do about files that are too large to fit on the cache disk? This situation can arise in three ways:
This sort of situation could arise with large cache sizes when a new large system is being backed up for the first time, so the software needs to be able to tolerate and work through a cache shortage; it must not be treated as an exceptional condition that causes failure.
Implementation Approach

Some general thoughts on the way to go about doing this.
Add Network Capability
Additional File Data