Features of ArcvBack

Copyright 2008 by Stephen Vermeulen
Last updated: 2008 Oct 12


Introduction

This is part of a series of articles on backing up computers. The top page is Design for an Archiving Backup System.

This article describes the main features of the ArcvBack backup software that I have written.

Backup Type

ArcvBack is a traditional backup system that implements a full backup followed by a set of incremental backups. However, it is intended to be used as an archiving backup system, whereby one does a full backup followed by a long run of incremental backups. Through the use of a large backup cache drive and an online version database, ArcvBack keeps the restore process easy even in the presence of a large number of incremental backups. When the backup cache space is exhausted, it is time to start a new backup cycle (a full backup followed by incrementals) again.

Supported Media

ArcvBack does not make any particular requirements of the backup media, apart from requiring hard disk space for the backup cache. Backup package files (the files containing the saved data) can be configured to any convenient size, and it is up to the user to copy these to the final backup media. Removable drives (particularly external USB-attached IDE drives) can be used as backup media, and a program called ArcvPkgCopy is provided to copy and verify the package files automatically. Optical media such as DVD+RW or DVD-R (and in the future Blu-ray) is quite attractive from a cost and durability standpoint, and is especially suitable for smaller systems (say with up to 100GB of data). Tape drives can also be used, though these are typically less cost effective than removable hard drives. Finally, if the cache is on a robust drive (such as a RAID array) it might not be necessary to ever copy the backup package files to backup media for some applications.

Backup Cache

ArcvBack uses a disk cache to record the backup data while it is running. This data is organized into a series of "package" (.pkg) files. These files are uniquely identified so that when a restore takes place the restoration utility can find the necessary data. The files have a configurable maximum size, so depending on the type of backup media used, one may want to set the size so the files fit the media well. For backup to DVD media I find that making the package files about 100MB in size works well with DVD recording software, as about 45 files will fit on a DVD.
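
As a rough sanity check on that figure (this is just arithmetic, not anything taken from ArcvBack itself): a single-layer DVD holds about 4.7 billion bytes, so:

    # roughly how many 100MB package files fit on a single-layer DVD
    dvd_bytes = 4.7e9            # nominal single-layer DVD capacity, in bytes
    package_bytes = 100 * 2**20  # a 100MB package file
    print(int(dvd_bytes // package_bytes))  # -> 44, consistent with "about 45" above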

ArcvBack allows the backup cache to use several directories (obviously on separate drives) so that if you have free space on a number of disks it can make use of it. At some cost in backup speed these directories could even be located on different machines on the LAN.

ArcvBack can also provide redundant package file storage: if you have two drives with free space, you can configure ArcvBack to write each package file to both drives, providing redundancy at the cache level so that if one cache drive fails the backup is unaffected. For greater protection this form of redundancy can be used with drives on different machines through network shares, so that if one machine fails the backup data is still available on the second machine.

For the easiest restore experience it is best to keep a copy of all the package files that make up the current backup set (i.e. the full backup and all following incrementals) in the cache. However, the restore utility does have the ability to temporarily mount additional directories (or even drives) containing package files that have since been removed from the cache to make space. And if you are only doing a partial restore, the restore utilities can tell you which additional package files you will need to make available for the restore to succeed, so you don't have to bring back excessive amounts of data just to get the few files you really need.

Cache to Archive

Copying backup packages from the cache to archive media can be done automatically using the ArcvPkgCopy tool if the archive media is a hard drive. If removable media such as DVD or tape is used, then some other utility program (such as DVD writing software) must be used to perform the copy.

Once package files have been copied it is up to the user to decide whether to delete them from the backup cache. The recommendation here is not to delete the files from the cache. By leaving them in the cache you gain two significant advantages:
  1. an additional level of redundancy
  2. faster restore operations as all necessary data is still in the cache
However, if you are running low on cache space and still want to continue with the incremental phase of the backup cycle, you can choose to delete package files after they have been copied to backup media.

Backup Integrity Checks

ArcvBack includes a number of automatic integrity checks as part of its backup and restore processes. As each file is backed up and added to a package file, an SHA1 digest (a 160-bit secure hash algorithm checksum) is computed for the file and written to both the package and the version database. In addition, if a file that is too large to fit in a single package is being backed up, then each portion of that file will have an individual SHA1 checksum computed for it, as well as an overall checksum for the whole file.
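
A minimal sketch of this per-portion checksumming (illustrative only, not ArcvBack's actual code; the 1MB portion size is an assumption):

    import hashlib

    def chunk_digests(path, chunk_size=1024 * 1024):
        """Return ([SHA1 hex digest per portion], whole-file SHA1 hex digest)."""
        whole = hashlib.sha1()
        portions = []
        with open(path, 'rb') as f:
            while True:
                data = f.read(chunk_size)
                if not data:
                    break
                portions.append(hashlib.sha1(data).hexdigest())  # per-portion checksum
                whole.update(data)                               # overall file checksum
        return portions, whole.hexdigest()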

Once the package is written to disk it is closed and then reopened, and the backup program re-reads the whole package and checks that all the checksums are still valid. If errors are found then the package is not considered valid and it will not be entered into the version database.

When using ArcvPkgCopy to copy package files to external hard drives, it will also perform the same read-after-write integrity testing.
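
The idea behind that read-after-write check can be sketched as follows (a simplified illustration, not ArcvPkgCopy's actual implementation; the function names are made up):

    import hashlib, shutil

    def sha1_of(path, block=1024 * 1024):
        """SHA1 of a whole file, read in blocks."""
        h = hashlib.sha1()
        with open(path, 'rb') as f:
            for data in iter(lambda: f.read(block), b''):
                h.update(data)
        return h.hexdigest()

    def copy_and_verify(src, dst):
        """Copy a package file, then re-read the copy to confirm it matches the source."""
        shutil.copyfile(src, dst)
        if sha1_of(dst) != sha1_of(src):
            raise IOError("verification failed for %s" % dst)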

The file restore programs perform the same tests as they read the data from the packages, and also compute an overall SHA1 checksum for each whole file and compare it against the recorded value to ensure that the restored files are exactly as they were originally.

Even with all this testing there is still the possibility that when the original file was first read from the disk an error (perhaps a memory error) occurred and the SHA1 digests all include this error in their computations. This sort of error is rather difficult to eliminate since Windows aggressively caches the contents of files, so if one were to do a second read of each file and compare the contents they might well always get the same error due to the file cache.

Large File Handling

ArcvBack can back up very large files (such as large video files or drive images) if needed. The size of these files is not limited by the maximum package file size the user has set; this allows large files to be backed up onto media that is smaller than they are. ArcvBack can back up files that are larger than 4GB, though this may depend a bit on the file system and version of the operating system it is running on.

Online Version Database

ArcvBack has an online version database. While this is actually stored as a directory tree of files on disk (and not in a true database) it serves the same purpose as the version databases in other products. This database stores some information about each file that has been backed up or encountered by the backup program in the current backup set (i.e. the full backup and all subsequent incremental backups). You can browse this information with a command line tool or with a GUI tool that provides a tree view of its contents.

The main information in this database is the machine and directory structure, together with the files and the versions of files that have been backed up over time. Each recorded file version has its size, date and backup event ID recorded here. Two other significant features are supported: first, one can see files that have been added to the system but which the backup program has not been successful at backing up; second, when a file that has been backed up gets deleted, the date and time at which the backup program noticed this is recorded.

The online database is most important for facilitating the file restore process, so the ArcvBack program makes a special backup of the database at the end of each run. This is done by creating a zip file that contains the whole database and saving it in a special directory. The system will keep a number (say 10) of the most recent zip files around, automatically deleting any older ones, so if for some reason you need to revert to an older database version you can just pick one and unzip it.
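
In outline, that snapshot step could look something like this (a sketch only; the file names, directory layout and retention count here are assumptions, not ArcvBack's actual behaviour):

    import os, time, zipfile

    def snapshot_database(db_dir, zip_dir, keep=10):
        """Zip the version database directory, keeping only the newest `keep` zips."""
        name = time.strftime("versiondb-%Y%m%d-%H%M%S.zip")
        zf = zipfile.ZipFile(os.path.join(zip_dir, name), 'w', zipfile.ZIP_DEFLATED)
        for root, dirs, files in os.walk(db_dir):
            for fn in files:
                full = os.path.join(root, fn)
                zf.write(full, os.path.relpath(full, db_dir))
        zf.close()
        # automatically delete all but the `keep` most recent snapshots
        zips = sorted(f for f in os.listdir(zip_dir) if f.endswith('.zip'))
        for old in zips[:-keep]:
            os.remove(os.path.join(zip_dir, old))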

Almost all of the information in the version database is also contained in the actual package files (the two things that are missing are the file deletion dates and the new files that were seen but could not be backed up). A special tool is provided so that, as a last resort, one could take a set of packages and recreate the version database from them (just lacking the two special dates).

File Restoration

This proceeds by first identifying the versions of the files that must be restored and the directory where they will be placed. Then:
  1. the system identifies the media that contains this data
  2. cache is considered to be media, so if all the data is in cache (which it may well be) then the restore starts immediately (step 4)
  3. for any data (packages) that are not in cache the system identifies the packages and then gets the user to place them online in one or more user-defined directories
  4. the restore proceeds from the cached data pool, also searching any user-defined directories for packages that are not in the cache
You can restore:
  1. the most recent version of any file or directory tree
  2. a particular version of any file
  3. the version of a directory tree as it was at the time a particular backup event took place
An additional tool is provided for directly restoring files from within particular packages. This is a last-resort tool, which might be useful if you have some backup media that you know holds a particular version of a particular file you want, and you don't necessarily want to rebuild the version database to get at it (perhaps it's from an older media set).

Backup Events

From an organizational perspective it might be useful to be able to record particular backup events for future reference, such as when a full backup pass was done. These backup events would record all the files that were included in a particular event. That way, if you wanted to return a machine to this particular state, you could.

Treatment of Recently Deleted Files

When a file is deleted that has been backed up in the current set, you will still be able to see that file and any older versions of it that were backed up so long as the current set is continued. When you start a new set, as that file is no longer on disk, it will not be present in the new database. If, after starting a new set, you find you need to restore the file you can do so by using the database from the older set and packages from that set.

Handling Large Files

There are three cases of interest here:
  1. files larger than 4GB
  2. large files that may not fit easily on backup media
  3. files too large to fit in the cache
We should be able to address the issue of files larger than 4GB by using the appropriate file access API.

When a file exceeds a particular size, say 1MB, it will be recorded as a set of 1MB blocks plus one block of less than 1MB to hold the remainder of the file (i.e. each of these is a block object). When a file is less than 1MB it is recorded as a single block object of variable size. I have picked 1MB as this seems to be a midpoint: below this size the overhead of tracking the individual chunks (which is about 60 bytes/chunk in a system that archives 4 copies) grows rapidly and dominates the file record database. In theory this size would even allow you to save archive data to floppy disks. With a size such as 1MB the directory of a single CD of backup data would still be quite short, and a DVD disk would hold about 4500 files. Increasing the block size to 4M or even 16M would reduce these directories further and still would not waste significant amounts of space on either of these pieces of media. This might also be of significance in the cache disk's directory, since that might be an 80GB disk. The other factor in directory size is that most of the files will actually be less than 1MB long; in fact, with the average file length being on the order of 100K, the directory size will be dominated by these small files.

How did I arrive at the 1MB chunk size? Consider a machine with roughly 5GB of data in about 50,000 files, giving an average file size of about 100 Kbytes. Note that as a 1MB chunk size is much greater than the average file size, the number of chunks will be only slightly more than the number of files, but if we reduce the chunk size we will see the number of chunks rise. We can estimate this with the following formula:

total_chunks = #of_files + total_size/chunk_size

The database_size is the approximate storage space that must be allocated for each file object, multiplied by the number of files; it changes with the approximate number of chunks per file, and we don't want it getting too large as that will prove troublesome. Each chunk needs to have its BID (20 bytes) stored, along with the MIDs of where it is archived (I've assumed 1 MID for the cache and 3 MIDs for three archive copies, hence 16 bytes). Then the file objects have overhead composed of fixed items like the file name, date, etc. (let's assume this is 128 bytes), a chunk count (4 bytes), and a list of the chunks (BIDs) that are in the file, whose average length is total_chunks/#of_files:

database_size = total_chunks * (16 + 20) + #of_files * (128 + 4 + 20*total_chunks/#of_files)

or, rearranging:

database_size = total_chunks * 56 + #of_files * 132

As #of_files is constant at 50,000, the second term contributes a fixed 6.6M to the database size regardless of chunk size.

  chunk_size    total_chunks    database_size
  8K            675,000         44.4M
  64K           128,125         13.8M
  256K          69,531          10.5M
  1M            55,000          9.7M

So a chunk size of 256K to 1M would seem about right. I had originally done these calculations for a smaller number of files per machine (22,000) and a larger amount of data (10GB) per machine, in which case the effect of chunk size is more significant (for an 8K chunk size a 126M database was expected).
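
The table can be reproduced with a few lines of Python using the formula above (the 56 and 132 byte figures follow directly from the assumptions already stated):

    # estimate the version database size for 50,000 files totalling 5GB
    num_files = 50000
    total_size = 5e9  # bytes

    for chunk_size in (8e3, 64e3, 256e3, 1e6):
        total_chunks = num_files + total_size / chunk_size
        # 56 bytes per chunk (20-byte BID + 16 bytes of MIDs + a 20-byte BID list entry)
        # plus 132 bytes of fixed overhead per file
        database_size = total_chunks * 56 + num_files * 132
        print("%7d byte chunks -> %7d chunks, %5.1fM database"
              % (chunk_size, total_chunks, database_size / 1e6))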

Note the database sizes above do not include storage of the directory structure, but this is a much smaller storage requirement as the number of directories is less than 1/10 the number of files and the data to be stored per directory is less than that stored for each file (as there is no need to track any data chunks). So this can be neglected.

What to do about files that are too large to fit in the cache disk? This situation can arise in three ways:
  1. the cache has little free space because it contains a lot of files that still have not been written to the archive media the desired number of times
  2. the file to be backed up is actually larger than the total cache disk
  3. the file to be backed up is larger than the free space in the cache disk
Pretty much all of these amount to the same thing: without enough free space in the cache disk, the program will not be able to create a file object and the complete set of data blocks needed for a true snapshot of the file. But this might not matter, since the file could be backed up one piece at a time over a number of attempts as space becomes available. All that needs to be done is to record the SHA1 values of each part of the file, return to the file from time to time to finish the job off, and only mark the file as backed up once all of the blocks have at least made it to the cache.

This sort of situation could arise with large cache sizes when a new large system is being backed up for the first time, so the software needs to be able to tolerate and work through a cache shortage - this is not an exceptional condition that should cause a failure.
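
A minimal sketch of this resumable approach, assuming a 1MB block size and a hypothetical done_blocks record of progress (this is not ArcvBack's actual code):

    import hashlib, os, shutil

    CHUNK = 1024 * 1024  # 1MB blocks, as discussed above

    def backup_large_file(path, cache_dir, done_blocks):
        """Cache any blocks of `path` not yet saved, stopping if cache space runs low.

        `done_blocks` is a hypothetical dict of block index -> SHA1 digest for
        blocks cached on earlier attempts.  Returns True only once every block
        is in the cache, at which point the file could be marked as backed up.
        """
        total_blocks = (os.path.getsize(path) + CHUNK - 1) // CHUNK
        with open(path, 'rb') as f:
            for index in range(total_blocks):
                if index in done_blocks:
                    continue  # this block made it to the cache on a previous pass
                if shutil.disk_usage(cache_dir).free < 2 * CHUNK:
                    return False  # cache is short on space; try again later
                f.seek(index * CHUNK)
                data = f.read(CHUNK)
                digest = hashlib.sha1(data).hexdigest()
                with open(os.path.join(cache_dir, digest + '.blk'), 'wb') as out:
                    out.write(data)
                done_blocks[index] = digest
        return True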

Implementation Approach

Some general thoughts on the way to go about doing this.

Initial

  • single machine backup (since the software can use the administrative shares to access the drives on other machines, it could just run on the server and back up all machines on the network)
  • in Python
  • support cache file
  • need to consider files that have a newer last-modified time than the last backup time (and/or use the win32api to get at the archive bit); see the sketch after this list
  • archive to media would be done by a process preparing a directory of the files to be placed on media, then using regular CD/DVD burning or backup software to record to CD or tape, and then deleting this directory
  • try using win32api to write/read tape drives directly
  • allow it to run as an NT service
  • restore from media would be done by:
    • restore listing the MIDs needed for data that is not in cache
    • you place the needed media online somewhere
    • restore tool then gets the blocks it needs from the media and puts them into the cache
    • final restore step takes place using the data that has been assembled in the cache
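
For the "newer last-modified time" check above, a minimal sketch (the function name is made up; the archive-bit variant via the win32 API would be an alternative):

    import os

    def needs_backup(path, last_backup_time):
        """Has the file been modified since the previous backup run?

        last_backup_time is a Unix timestamp recorded at the end of the last run.
        The Windows archive bit (via the win32 API) could serve as an additional
        or alternative signal.
        """
        return os.path.getmtime(path) > last_backup_time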

Add Network Capability

  • add a server to handle cache requests
  • add a client to gather data from remote machines and copy to cache

Additional File Data

  • add modified, created, accessed date info
  • add ACL security info

Utilities

  • reporting on the file database, to identify
    • files that have never been backed up
    • files that have been changed since they were last backed up, but which the program has been unable to back up (due to errors such as the file being in use...)
    • files that have been backed up but have not been archived the desired number of times
  • pruning old / unused info
  • repacking old archive media to reclaim wasted space so it can be reused more effectively
  • taking media out of service (for lost, bad or media that is too old)
  • preparing an independent off-site archive set, or just identifying which set of media to take off site: "... the following list of MIDs is a complete set or else the smallest set"

Registry Backup

  • for Windows, the ability to back up the system registry
  • handling for backup of system files and open files

Archive Media

  • integrated / easier reading and writing
  • faster retrieval of data from tape (when only some chunks are needed)
  • support for multiple archive writers (on the server or on network nodes)
  • support for jukeboxes
