Backup Software

Copyright 2008 by Stephen Vermeulen
Last updated: 2008 Oct 12


Introduction

This is part of a series of articles on backing up computers. The top page is Design for an Archiving Backup System.

Disk Image Backup

Image based backups are a useful tool for restoring a dead system, particularly because they skip the long operating system and application re-installation process. The general idea is to identify the blocks on a partition or an entire drive that are actually in use by the file system and copy those to some other media (which may be a DVD, another hard drive or a file on a network drive). This is done in such a way that the blocks can be put back onto another hard drive (of equal or larger size), after which the machine will boot and run from this drive just as it did when the image was made, with all applications, settings and user files intact and in place.
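
To make the block-copying idea concrete, here is a minimal sketch in Python of raw device imaging. Note that it naively copies every block; real imaging tools parse the file system so they can skip the blocks that are not in use. The device and image paths are illustrative assumptions.

    BLOCK_SIZE = 1024 * 1024  # copy in 1 MB chunks

    def image_device(device_path, image_path):
        """Copy every block of a raw device into an image file."""
        with open(device_path, "rb") as src, open(image_path, "wb") as dst:
            while True:
                block = src.read(BLOCK_SIZE)
                if not block:          # end of device reached
                    break
                dst.write(block)

    # Example (requires read access to the raw device):
    # image_device("/dev/sdb1", "/mnt/backup/sdb1.img")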

Image backups are commonly used in corporate environments to quickly deploy an identical set of machines. They are also used in the home environment to migrate from an old, slow, small hard drive to a new, fast, big drive.

They can make selective restoration of individual lost files difficult or impossible (though some do provide a tool for this). As well, they must normally be run with the machine off-line, often booted from a special disk (some do provide for backing up a live system), so manual attention to the process is needed and they are not well suited to an automatic, scheduled regimen. These systems also get slower as the disks fill up, because the backup size gets larger.

Recently some vendors have started to provide an "incremental" imaging mode, so that once you have a full image you can create incremental images of just the blocks that changed, which will take less time to make and space to store.

These are probably best used to back up a boot and applications partition, rather than a data partition.

There are a number of commercial image backup solutions; I recommend TrueImage from Acronis. There are also some free software projects that provide this sort of tool.

Traditional File Based Backup

File-based backup systems are not good at restoring a system to a completely working state from scratch. Their strength is that they allow regularly scheduled backups to take place (normally without booting to a special operating system, as imaging tools require), which means there is a much better chance that your data will actually have been backed up when trouble strikes. Windows NT, 2000 and XP come with a reasonable backup tool that will work on a single computer (I am not certain how well it supports scheduled backups). But to get a centralized backup solution for a small network of computers, the software will be costly.

This form of backup also gets expensive due to the quantity of media needed to do a good job. For a discussion of this aspect see my article: Backup Media Costs.

There are a number of twists to the file-based backup story; these affect both cost and ease of use and are described in the following sections.

Simple Copy

The easiest form of backup to understand is the act of simply copying the important files to another device. This might be to another drive in the same machine, to another machine on the LAN, or to some form of removable media such as a CD or DVD. While this is easy to understand, it can be wasteful in both time and media: once the file count gets large you pretty much have to make a full copy of everything each time, because it's too difficult to figure out by hand which files changed or were added and copy just those.

There are some programs that will do this sort of thing for you (especially from one drive to another across a LAN) and will copy only the files that have changed. These are sometimes called directory (or folder) mirroring or synchronization programs.
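
As a sketch of what such a mirroring program does, the following Python copies only the files that are new or appear changed, assuming a changed file can be detected by a newer modification time or a different size. The paths in the usage comment are illustrative.

    import os
    import shutil

    def mirror(src_root, dst_root):
        """One-way mirror of a directory tree; copies new or changed files."""
        for dirpath, dirnames, filenames in os.walk(src_root):
            rel = os.path.relpath(dirpath, src_root)
            dst_dir = os.path.join(dst_root, rel)
            os.makedirs(dst_dir, exist_ok=True)
            for name in filenames:
                src = os.path.join(dirpath, name)
                dst = os.path.join(dst_dir, name)
                if (not os.path.exists(dst)
                        or os.path.getmtime(src) > os.path.getmtime(dst)
                        or os.path.getsize(src) != os.path.getsize(dst)):
                    shutil.copy2(src, dst)  # copy2 preserves timestamps

    # mirror("/home/alice/documents", "/mnt/backup/documents")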

Archive Tools

In the past, tools like zip and tar have been used to back up directory trees. These gather up all the files into a single archive file (along with the directory tree information). The user can then copy that one file onto the backup media for safekeeping. Zip also allows one to update ("freshen") an existing zip file.
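
For example, Python's standard tarfile module can roll a directory tree into one compressed archive, much as tar is used from the command line; the paths are illustrative.

    import tarfile

    def archive_tree(tree_root, archive_path):
        """Gather a directory tree into a single gzip-compressed tar file."""
        with tarfile.open(archive_path, "w:gz") as tar:
            tar.add(tree_root)  # recurses, preserving the directory structure

    # archive_tree("/home/alice/projects", "/mnt/backup/projects.tar.gz")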

Backup Programs

Dedicated backup programs will usually provide for both a full backup (all the files in some user-selected list of drives or directories) and one of two types of incremental backups:
  1. incremental, which will record the new version of any files that have been added or modified since the last time the backup program was run in either full backup or incremental mode
  2. differential, which will record the new version of any files that have been added or modified since the last time the backup program did a full backup
The difference shows up at restore time: if one does a full backup and then follows it with a number of incremental backups, one will need to access files from the full backup and all of the incrementals in order to do a restore; whereas if one does a full backup followed by a number of differentials, one only needs the full backup and the most recent differential.

Since each differential contains all the changes since the last full backup, each subsequent differential will be as large as (or, more typically, larger than) the previous one. This makes differential backups less space efficient than incremental backups (and they will also take longer to run). As always there is a trade off: here you trade some storage space for convenience in restoration.
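
The file-selection rule that separates the two modes can be sketched in a few lines of Python: an incremental compares file times against the time of the last backup of any kind, while a differential compares against the time of the last full backup. The timestamps are assumed to be recorded by the backup program itself.

    import os

    def files_to_back_up(roots, cutoff_time):
        """Yield files created or modified after cutoff_time (epoch seconds)."""
        for root in roots:
            for dirpath, dirnames, filenames in os.walk(root):
                for name in filenames:
                    path = os.path.join(dirpath, name)
                    if os.path.getmtime(path) > cutoff_time:
                        yield path

    # incremental:  files_to_back_up(roots, last_backup_time)
    # differential: files_to_back_up(roots, last_full_backup_time)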

Backup Databases

To aid in managing the file restoration process (particularly when one is trying to restore an older version of something that was backed up a few days or weeks ago) some backup programs add an online backup database. This contains a record of all the versions (names, dates, sizes) of all the files that have been backed up, together with their current locations on the backup media. The idea is that if you wanted to restore a particular file to its contents of a few days ago, you could browse the backup database, locate a suitable version (by looking at the dates or sizes) and then have the restore program retrieve that particular version (it would probably prompt you to load the appropriate backup media into a drive).

Some packages give a unique event ID to all the files that are backed up in a particular backup run, so that if you can identify the version of one file you want to restore, you can then restore the other files that were in place at the same time by just specifying that event ID.
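
A minimal sketch of such a database in SQLite follows: one row per backed-up file version, tagged with the event (run) ID and the media label it was written to. The schema and the file path in the query are illustrative assumptions, not the layout of any particular product.

    import sqlite3

    db = sqlite3.connect("backup_catalog.db")
    db.executescript("""
    CREATE TABLE IF NOT EXISTS events (
        event_id   INTEGER PRIMARY KEY,
        started_at TEXT NOT NULL            -- date/time of the backup run
    );
    CREATE TABLE IF NOT EXISTS versions (
        path     TEXT NOT NULL,             -- original file path
        mtime    REAL NOT NULL,             -- modification time when backed up
        size     INTEGER NOT NULL,
        media    TEXT NOT NULL,             -- label of the tape/DVD/cache file
        event_id INTEGER NOT NULL REFERENCES events(event_id)
    );
    """)

    # Browse all stored versions of one file, newest first, to pick a restore:
    for mtime, size, media, event_id in db.execute(
            "SELECT mtime, size, media, event_id FROM versions "
            "WHERE path = ? ORDER BY mtime DESC", ("/home/alice/budget.xls",)):
        print(mtime, size, media, event_id)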

Backup Cache Drives

Because changing tapes or DVDs requires operator intervention, some backup systems implement a two stage backup process. First they locate new or modified files that need backing up and copy them to a local hard drive. This allows the backup to run without operator attention (while there is sufficient free cache drive space) and also to run faster. They then provide a second utility that allows the operator to copy the data from the cache drive to the final backup media. This second stage empties the cache drive, allowing the first stage to do more work when it needs to. Since the act of writing to the backup media is now decoupled from the act of acquiring the files that need backing up, one can arrange to run the first stage during the night (so as not to interfere with other users) and then run the second stage during the day, while the operators are around to feed tapes or DVDs into the backup drives. The use of a cache drive can also allow for better utilization of the backup media, since the cache flush can be postponed until enough data has been collected to fill a new piece of media.
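
In outline, the two stages might look like the following Python, assuming the cache directory and the list of changed files come from elsewhere in the backup program; the paths are illustrative.

    import os
    import shutil

    CACHE = "/var/backup_cache"   # local cache drive (illustrative path)

    def stage_one(paths_needing_backup):
        """Copy new/modified files into the cache; runs unattended."""
        for src in paths_needing_backup:
            dst = os.path.join(CACHE, src.lstrip("/"))
            os.makedirs(os.path.dirname(dst), exist_ok=True)
            shutil.copy2(src, dst)

    def stage_two(media_mount_point):
        """Flush the cache to the mounted backup media, emptying it."""
        for name in os.listdir(CACHE):
            shutil.move(os.path.join(CACHE, name), media_mount_point)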

A logical extension of this idea is to make the cache drive large enough to hold a full backup and any desired incrementals; one can then rapidly restore files by using the data in the cache directly, without having to reload from tape or DVD media. If one is doing this in a business environment it would be prudent to place the cache area on a RAID protected drive set.

The Amanda backup system is an example of a cache based system.

Multiple Caches

Given that cache drives (especially those based on IDE devices) can be added to a number of machines on a network easily, cheaply and conveniently (many PCs have the space and ports to support one or two additional IDE drives), it would make sense for backup software to support this in two ways:
  1. Allow data blocks to be saved on any available cache drive in the network, thus combining a number of drives into one large cache
  2. Allow data blocks to be redundantly saved to a number of parallel cache drives in the network. This would allow for a fault tolerant caching system: if any cache drive failed, the data on it would probably still be available on one of the other cache drives. This would also be simpler to implement than arranging for the cache directories to be on RAID-ed drives.
The above could be achieved by grouping cache drives into pools and having the software write each backup record to each pool, thus storing it in multiple places. When a restore operation takes place, the software would look for the data in the first pool; if it did not find it there, it would look for a copy in the second pool (and so on if configured for additional redundancy), as sketched below.
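
Here is a minimal sketch of this pooling scheme in Python, assuming each pool is a list of mount points and each backup record is a single file; the pool paths and record naming are illustrative.

    import os
    import shutil

    POOLS = [
        ["/mnt/cache1", "/mnt/cache2"],   # pool 1: two drives combined
        ["/mnt/cache3", "/mnt/cache4"],   # pool 2: holds a redundant copy
    ]

    def store(record_path, record_name):
        """Write one backup record into every pool (one drive per pool)."""
        size = os.path.getsize(record_path)
        for pool in POOLS:
            # Use the first drive in this pool with enough free space.
            for drive in pool:
                if shutil.disk_usage(drive).free > size:
                    shutil.copy2(record_path, os.path.join(drive, record_name))
                    break

    def fetch(record_name):
        """Look for a record pool by pool, falling back on failure."""
        for pool in POOLS:
            for drive in pool:
                candidate = os.path.join(drive, record_name)
                if os.path.exists(candidate):
                    return candidate
        return None               # not cached; must reload from offline media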

Archive File Based Backup

This is an extension of the idea of the traditional full plus incremental backup, made feasible by employing a version database for organization and cache drives for storage.

The idea here is to greatly increase the number of incremental passes that take place between full backups. This strategy arises from the observation that the actual amount of data that gets modified (or newly created) each day is a relatively small fraction of the whole data set. This means the time (and space) needed to perform incremental backups may be much less than the time (and space) needed to perform full backups.

I got thinking about this issue (some years ago) when I was using a product called NovaNet to back up my LAN and my digital archive grew beyond what would fit on one, then two, then three tapes (I was using DDS3 12GB tapes). At that point it was taking nearly a day to do a full backup (with verify) of the archive, yet the daily incrementals still ran quite quickly, since I rarely added more than a few hundred megabytes in any one week.

After some thought I came to realize that the act of creating a full backup is only going to get worse with time, and that it made sense to look for ways to minimize the number of these events. I also knew that even though NovaNet had an online database of backup version information, the database was not enough to make the process of doing a restore simple and error-free (even with a once a week full backup followed by 6 days of differential or incremental backups). This was because of issues with tape management.

However, if one could provide a system where the raw backup data was largely online (such as on a cache drive) along with a version database, then it would be possible to make restoring any particular version of a set of files easy (with little or no operator intervention) and error free. If this is done, one can run full backups a long time apart (perhaps several months), running fast incrementals between them; once the online cache is out of space, one can either extend the cycle by flushing to offline media, or finish the backup set, reset the system and start a new set with a new full backup.

Curiously enough, if you consider a system like this it becomes apparent that one could even use write-once media for the backups, which provides an additional, long term redundancy advantage. Each time you start a new set you end up keeping the old set, and since it is read-only you cannot change it, so you now have another copy of most of the data on your system. In fact, for some types of files (like the family photo archive) to which you only add files over time, this copy will remain a current backup of a significant portion of your data. By periodically starting a new set you end up producing a redundant copy of your most important and irreplaceable files.

For example, assume you have 100GB of "static files" to back up (say, a large MP3 and photo collection) which are not going to change over time (apart from new files being added), plus about one DVD's worth of files that get updated each week (email folders, a few word processing documents and so on). If you choose to start a fresh set once every 6 months, you will need about 25 disks for the static files and another 25 disks for the weekly changes every 6 months, which is 100 disks per year. That would cost you only US$28.75 per year in media (the arithmetic is sketched below). This is quite inexpensive for a complete backup in which you have access to pretty much any weekly (or perhaps even daily) version of any file over the entire year. Moreover, after 6 months you have two copies of all the hard to replace material in the event that one of the copies becomes unreadable, and every 6 months you get another redundant copy (you could always stretch the cycle to a year or two to reduce the redundancy rate).
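
The arithmetic behind those numbers, assuming 4.7GB DVDs at the roughly US$0.29 per disk implied by the US$28.75 total above:

    static_disks = 25        # 100 GB / 4.7 GB per DVD is ~22; call it 25 with overhead
    weekly_disks = 25        # roughly one DVD per week for 6 months
    disks_per_year = (static_disks + weekly_disks) * 2   # a fresh set every 6 months
    price_per_disk = 0.2875  # assumed price implied by the US$28.75 total
    print(disks_per_year, disks_per_year * price_per_disk)  # -> 100 disks, US$28.75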

If you want to minimize total cost you can still use read/write media; just have several sets of it and rotate through the sets. If you have three sets then you can be writing one, have one on hand in case you need to go back a month or so, and store the third off site to protect against loss due to fire, flood or theft. As the rotation period between sets is longer than the traditional week or so, it is not too inconvenient to place one of the sets in a safe deposit box.

Another thing to consider when trying to decide between using write-once or RW type media (given the factor of 5 difference in cost) is that in a few more years new technologies (such as Blu-ray) will be on the market and will probably offer a lower cost per byte of storage along with much higher transfer rates. So an investment in more expensive RW media may well be wasted, because you will be switching to another storage technology in a few years.

If your intended backup volume (the full backup and the following incrementals) will fit on a single IDE disk then it becomes practical to use disks for backup. To protect against drive failure, you could have one internally mounted disk plus two or three external drives (say, in USB attached cases). When you start a new backup set you record the full backup and the incrementals to the internal drive, but at the same time you copy those backup files to one of the external drives. This gives you three copies of every file in your system: one on the internal backup drive, one on the external backup drive, and the original file itself. The chances of all three drives having problems at once are quite low (unless you are in a trailer park in Tornado Alley), but if you are concerned you could keep the external drive in a fire proof safe when you are not actually copying backup files over to it. Once the backup cache drives fill up it's time to end the current backup cycle and start a new one. At that point you put the external drive away somewhere safe (in a safe deposit box, locked in your office at work, or ...), clear the internal drive, attach the next external drive and start a new cycle with a full backup.

A traditional full plus incremental backup system can be made to approximate an archival backup solution by just increasing the number of incremental backup passes between full backup runs.

In a true archive-type backup system, the backup database tracks the number of copies of each file that have been written to backup media; once the desired degree of redundancy for a particular file is achieved, that file no longer needs to be copied. Such a system also does not rotate through the media in the way traditional approaches do: once a piece of media is written it may not get overwritten for a long time (or ever). This differs from the typical combination of full plus incremental or differential backups found in traditional (and even image) backup systems, in that backups of a new machine will take a long time to run for the first few passes, but once the desired degree of redundant storage is achieved the backups will go very fast, as only the new or changed files need to be saved. A true archive backup may provide additional safety by periodically (say once a year) forcing the creation of a fresh backup copy of each file (thus increasing the number of redundant copies in storage and protecting against media failure due to bit rot over time). It may also provide tools to free up old media for reuse once the files on it have been backed up more than the minimum number of times, or were deleted more than a certain amount of time ago. Such a system may also allow one to configure some files or directories as more important, and thus needing a higher number of redundant copies.
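
The core redundancy rule can be sketched as follows, assuming the backup database can report how many copies of a file are already on media; the catalog helpers and the per-directory targets are illustrative assumptions.

    DEFAULT_COPIES = 2
    IMPORTANT = {"/home/alice/photos": 3}   # irreplaceable data gets more copies

    def copies_needed(path, copies_on_media):
        """How many more copies of this file must still be written to media."""
        target = DEFAULT_COPIES
        for prefix, n in IMPORTANT.items():
            if path.startswith(prefix):
                target = n
        return max(0, target - copies_on_media)

    # Back up only files still short of their target (helpers hypothetical):
    # for path in all_files:
    #     if copies_needed(path, catalog_copy_count(path)) > 0:
    #         queue_for_backup(path)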

The second version of ArcvBack had some support for true archive storage, but I found that this provided little real benefit and was complicating the code too much, so this support was removed in the development of the third version.


...remove the following section?

The Fundamental Problem with Traditional Approaches

The previous two approaches suffer from one common problem: they do too much work. Because they both periodically re-copy all the data on the systems they are protecting (say once a week), the size of the full backup becomes a major problem. It quickly exceeds the size of common (hence inexpensive) storage media such as CD-RW or DVD-RW, and drives one towards more expensive solutions (like tape drives), either because of the cost of the media or the inconvenience of performing the backup (try imaging a 50GB disk to CD-RW some day...).

When one stops to think about the data being backed up, one realizes that in both the home and office environments not many files are actually created or changed each day. Traditional file based backup systems have exploited this fact for some time via their incremental and differential backup modes. In an incremental backup, only the files that are new or have changed since the last backup (either full or incremental) are saved. In a differential backup, all the files that are new or have changed since the last full backup are saved. This greatly reduces the size of these incremental or differential runs, and hence the time taken to do them. The disadvantage is that it can make restoring lost data more difficult, especially with incremental runs: not only the last full backup needs to be applied, but also all the incrementals since then, whereas with differential runs you only need two backups, the last full one and (typically) the last differential.

There also appear to be some image based backup tools that claim to provide an incremental mode.

Storing backup data on a cache drive can solve some of the inconvenience problems when doing a restore, since data will probably come from the cache rather than slower backup media.


