ArcvBack
An Archiving Backup System

Copyright 2009 by Stephen Vermeulen
Last updated: 2009 Oct 18


Background

The whole topic of backing up computer data is fairly complex and it is rather difficult to describe in a single linear document. For this reason I have split this into a number of separate articles that discuss different aspects of this issue.
  • Backup Software, a general overview of backup software,
  • Backup Media Costs discusses the various types of backup media and devices and does a simple cost analysis to determine that, depending on the amount of data to be backed up, either DVD+RW or an external hard drive is the least expensive system. Tape based systems are currently the most expensive by a substantial margin, and Blu-ray is far too expensive to be practical.
  • Types of Backup Systems discusses the various types of backup software and concludes that a system with a large online cache of backup data along with a backup version tracking database can allow a full plus incremental approach to be used over a much longer than normal run of incrementals. This reduces cost and may also allow write-once media to be used economically. As well, such a system is ideal for implementation on a set of inexpensive USB-attached IDE drives.
  • Data Redundancy, examines the issue of redundancy in the source data. For a typical home LAN or small office it concludes that there is not enough redundancy to make it worthwhile complicating the software to exploit it.
  • Block Based Backup, discusses a technique for reducing the size of backups by storing data on a unique-block basis allowing identical sections of multiple files to only be recorded once
  • The Case for DVD in Backups, examines how DVD based solutions could be made practical
  • Other Products and Resources, lists some other products and resources to explore in this area
  • Future Ideas in Backup, some ideas for further research and development
While these articles discuss many things and come to a variety of conclusions, for someone who has a small network of computers (such as a home LAN or a small business) the key conclusions are:
  1. you possess data that is at risk and should be backed up
  2. you should focus your efforts on protecting your unique and irreplaceable data
  3. a system which consists of a periodic rotation of a full backup followed by a sequence of incremental backups can be implemented in a cost and time-effective manner using either DVD media or IDE hard drives
  4. for most users tape is not an option
  5. Blu-ray is far too expensive, perhaps in 2012...

The ArcvBack Solution

Why Did I Write This?

I needed a backup system for user data on my home LAN. Over the years I have used a number of commercial systems (both traditional file based and imaging systems) and kept running into issues. Eventually I settled on NovaNet, a commercial system designed to back up a LAN of Windows machines. I was generally happy with it, until a few things happened:
  1. my data storage requirements started to exceed the size of my tape drive (a DDS-3, 12GB unit); I was able to work around this until about the time my weekly full backup was hitting three tapes
  2. NovaNet's version database went senile on me a couple of times, pretty much forcing me to reinstall it and restart my backup cycle - I didn't lose any data, but it was a big pain, and possibly the only solution would have been to upgrade to a new major version, which would have cost several hundred dollars more
  3. my tape drive passed its third birthday and started to report a lot of tapes (even brand new ones) as bad
At this point I was looking at replacing the tape drive, getting a new pile of tapes and getting a new version of NovaNet, and so I started to think seriously about the backup problem. To avoid the multi-tape per full backup issue for the following few years I could see that I would need a tape drive with more than 50GB of native capacity. Not only were these drives expensive a couple of years ago (and they are still too expensive for an individual), but the tapes for them were in the range of $100 each (now they have fallen to about $40, but that's still too much), so the obvious upgrade path was not going to be cheap.

At this time DVD+RW media had fallen to about the $0.50 to $0.75/GB range, which made it less than half the price of any blank tape, and DVD burners were in the $100-200 range, which made them about 1/10th the price of the sort of tape drive I needed. So I started to look into what it would take to write my own software to address this issue.

The first version of arcvback was pretty simple, and as I used it I thought of useful enhancements to it. These were added after about a year, along with some other features I thought might be useful, to make version 2. After using version 2 for another year I noticed that the way the backup media was structured could be changed to reduce the time it took to burn the final backup DVDs. Also, during the two years or so since the first version was written the cost of IDE drives had dropped significantly and the USB2.0 interface had made it possible to easily add and remove external drives, making this sort of drive an excellent backup medium.

With these issues in mind, in the fall of 2006 I rewrote arcvback for the third time. At this point I did some code simplification and dropped a number of features that had sounded like good ideas earlier, but which over time I had not found to be that useful. The end result was less code and fewer features, but a program that did a better job of its main task.

After another 2 years (in the summer of 2008) I revised arcvback to address a number of bugs, to update it to Python 2.5, and to replace the file-based version database with an object database using the Zope Object Database (ZODB).

Intended Application

ArcvBack is primarily intended to back up user data files that are distributed over a set of directories on multiple computers on a LAN. It is not intended to back up operating system files, installed applications or such things as the Windows Registry or boot blocks.

ArcvBack uses a backup cache (usually on one or more local drives) to save the first copy of the backup data. This allows the backups to happen at maximum speed, with minimum operator intervention (especially if you install it as a Windows Service). ArcvBack also employs an online backup version database so that the user can locate older versions for easier restoration tasks. In most configurations the backup cache will be as large as the backup media set so that in the event a restore is needed quickly, it can be done from the data in the cache without further bother.

I envision two common ways in which ArcvBack might be deployed which are discussed in the following sections.

For a small system

For a system with smaller amounts of data (say less than 50GB or so) ArcvBack would be used with a DVD writer and three sets of read/write DVDs. The user would select the oldest set, erase it and then burn a full backup to it, followed by a number of days of incremental backups until he runs out of blank disks in the set. Then he would save the version database along with the set that just filled up, put it aside (perhaps taking it off site for storage) and repeat the whole process again. Because he has three sets of media he will have one in use and two others (perhaps containing the previous month or two of backups) available for file restore purposes. This gives him a certain degree of redundancy, especially with data that is not frequently revised (like the family photos). If there was a problem with the media in the most recent set for a file he needed to restore, there is a good chance that the same file is available in one of the other two sets.

Assuming DVD+RW media is used, you need to protect about 50GB of data, you modify about 10% of that per month, and you want the media for a set to last about a month, the cost of doing this is $80.00 total (including the drive).

If you don't want to bother with the time taken to burn on average about one DVD every 2 days you could implement the same thing with three USB attached hard drives of at least 55GB capacity each. Of course, you can't buy drives this small today in the more cost effective 3.5 inch size (though you could use the smaller, slower and more expensive laptop 2.5 inch drives for this), so let's say you use the common 500GB size, which can typically be purchased for about $100 each (including the USB case). You are still only looking at about a $300 total investment. You can't even buy the least expensive tape drive for that, let alone the 10 or more tapes you would need to have a decent rotation frequency.

For a larger system

Consider a system (a home or small office LAN) where there is a total of 200GB to be backed up. Assume a small fraction of this data also changes on a daily basis, for about 10GB per day. This data may be distributed across a number of hard disks and across a number of computers. Most of these computers are running some version of Windows, but there might be data that resides on Linux file servers and is accessible through SMB (SAMBA) drive shares.

This is probably about the limit of what one might want to back up using DVDs. If you assume a one month duration per media set then you would have the 200GB full backup plus about 30 daily backups of 10GB each, for about 500GB total. That's about 110 DVDs, or an average of about 4 DVDs burned per day. The cost of the drive plus three sets of RW media would be about $300, which is still quite reasonable.
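
The arithmetic behind these figures can be sketched in a few lines of Python. This is only an illustration: the function name is made up for this example, and the 4.7GB figure is the nominal capacity of a single-layer DVD, which is an assumption rather than something stated above.

```python
import math

DVD_CAPACITY_GB = 4.7  # nominal single-layer DVD capacity (an assumption)

def media_set_size(full_gb, daily_change_gb, days, disc_gb=DVD_CAPACITY_GB):
    """Return (total GB, number of discs, discs per day) for one media set."""
    # One full backup plus one incremental per day for the life of the set.
    total_gb = full_gb + daily_change_gb * days
    # A partly filled disc still counts as a whole disc.
    discs = math.ceil(total_gb / disc_gb)
    return total_gb, discs, discs / days

# The larger system: 200GB full backup, 10GB of changes per day, 30 days.
total, discs, per_day = media_set_size(200, 10, 30)
print(total, discs, round(per_day, 1))
```

Under these assumptions the result is 500GB on 107 discs, or a little under 4 discs per day, which is where the rounded figures of "about 110 DVDs" and "about 4 DVDs burned per day" come from.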

If instead you use 500GB USB attached hard drives (three of them, one per "set") you are looking at about $300 total, as you can pick up drives of this sort for about $100 each including the case. You could even pick up external 1TB drives for about $140 each and not have to worry about backup space for even longer.

If you are worried about the reliability of hard drives consider the following points:

  1. the vast majority of the data you are backing up is unchanging, so at any time there are really 5 copies of it in existence: the original copy on the user's drive, the copy in the ArcvBack cache drive, and the three copies on the external media drives. If you have a smaller cache drive then you may only have 4 copies of the data.
  2. for the data that does change often (or has been freshly created that month) there are at least three copies in existence, the copy on the user's drive, the copy in the ArcvBack cache drive and the copy on the current media set.
If you are really worried about three drives failing at once you could do several things:
  1. place the ArcvBack cache on a RAID-1 or RAID-5 protected hard drive (then two disks in the cache would have to fail at once for you to lose the cache copy)
  2. you could buy an additional external drive and make a second copy of the current backup data files in parallel with the current set (you would just reuse this drive every time you changed sets)
  3. You could configure ArcvBack to write to two (or more) cache drives in parallel (perhaps even in different machines)
  4. You could choose to burn to DVD as well as writing to an external drive. The external drive gives you rapid protection and the burn to DVD can happen at a slower pace as time permits.
Note, there are reasons that three (or even more) drives could fail at once, such as theft, fire or flood. Your best protection against these issues is to take some of the backup media to another location. An external USB hard drive is an excellent way of taking data off site. If you have two, you can use the arcvpkgcopy.py command to copy the data to one drive, take it off site, swap it with the one that was off site, bring that one back and use arcvpkgcopy.py to bring it up to date until it is time to do another swap.

If you have a much larger set of static data the ArcvBack solution will still work, but you will probably want to extend the duration of each backup cycle so that the load of the full backup is distributed over a longer period of time. You could also configure more than one machine to act as a backup server, thus sharing the load across a number of machines - you might want to do this anyway to make use of free drive space that happens to be available.

Note that if you have all the backup data in your cache drive restores are easily done and are not greatly affected by the number of incremental passes so having longer runs of incrementals is not a particularly bad thing. This is the main reason why ArcvBack provides support for spreading the cache across several drives if needed.

What I Do

I use ArcvBack to back up key data on three PCs, one running Linux (configured as a file server) and the other two running Windows XP Pro. In total there is about 130GB of data that is backed up, so the initial pass takes about 5 hours to run, for an average rate of about 25GB per hour on my gigabit LAN (when my LAN was only 100Mbit/s this took about twice as long, at about 13GB per hour). Once the initial pass is complete the daily incremental backups take about 10-20 minutes to search out any updated or new files on these machines and to record the new contents to the backup disk. Typically the daily backup process requires about 700MB of additional storage. Over a 6 month period the total backup media ended up requiring 251GB (so about 130GB of initial files and then about 121GB of changed files - so about 670MB per day of revised data). The database for this was about 105MB and compressed to 35MB. In my case much of the revised data is due to my large email folders (inbox, trash, etc.) that get updated on a daily basis.
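
For anyone who wants to redo this estimate with their own numbers, the calculation is simple enough to write down. The sketch below uses only the figures quoted above; the one assumption is that "6 months" is taken as 180 days.

```python
initial_gb = 130.0     # size of the initial full backup
initial_hours = 5.0    # time the initial pass took
total_gb = 251.0       # backup media used after about 6 months
days = 180             # "6 months" taken as 180 days (an assumption)

# Average throughput of the initial pass.
rate_gb_per_hour = initial_gb / initial_hours

# Everything beyond the initial backup is changed or new data.
changed_gb = total_gb - initial_gb
changed_mb_per_day = changed_gb * 1000 / days

print("%.0f GB/hour, %.0f GB changed, %.0f MB/day"
      % (rate_gb_per_hour, changed_gb, changed_mb_per_day))
```

This gives 26GB per hour, 121GB of changed data and roughly 672MB per day, in line with the rounded figures above.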

I have configured one of the Windows XP workstations as the backup server. It has a RAID-5 array with enough free space to hold a full 6 months or more of backups (i.e. the initial full backup and all the incrementals). I have ArcvBack configured as a Windows Service to run one backup pass a day first thing in the morning; the backup saves all its data and the backup database to the RAID-5 array. About once a week I connect an external USB drive (300GB or so) to the system and run the arcvpkgcopy.py utility to copy all the new backup media files and the current database to the external drive. In this way I have enough redundancy that I can tolerate losing 2 hard drives at once without losing any data (remember that most of the time the data is available in 3 places, unless it is less than a week old - in which case just two places, but one of them is RAID-5 redundant, so I would still have to lose 2 drives on the RAID at once to lose data). If data is less than a day old it is at some risk as it only exists on one drive.

Periodically I take the external drive off site and swap it for a second unit so that I have protection against fire, theft and flood. Since the off site unit is still in the same city I'm not fully protected against a nuclear strike or anything larger than a very small asteroid impact.

With this setup (since I have all the incrementals for all the files for the last few months online) I can, and have, used the arcvrestore programs to recover lost files and earlier versions of existing files. This works quite well, and is quite quick to do, since all of the database and the necessary media files are online all the time in the RAID array.

About once every 6 months or so it's time to consider throwing out the current backup and starting a new one. At this point you might consider keeping a copy of the old backup around on a second USB drive in case you later need to get to an earlier version of something.

Downloading

The latest version of ArcvBack can be downloaded here: arcvback2009-10-18.zip (version 4.4).

If you need the older version 3 you can download it here: arcvback2007-01-03.zip, the documentation for it is in this arcvback3-docs.zip zip file.

Release notes are here.

Installation

Note that the version database storage system changed completely from version 3 to version 4, so you will need to start a fresh backup cycle. Version 4 commands cannot be used to work with a version 3 backup (and version 3 commands will not work with a version 4 database either).

It is recommended that you install ArcvBack by just unzipping the download archive into a directory called C:\programs\arcvback4, as this will minimize the installation effort. See the section on the config.ini file for more details.

Before you attempt to use any ArcvBack utilities you will also need to install Python (any version from 2.5 and up should work) and, if you are running on a Windows machine, the Win32 extensions to Python. To install these, download their installers and follow the installation instructions. I have written this using Python 2.5, and it should work with newer versions of Python too. It might still work with an older Python, but I seem to recall there was one place I used something that was new in Python 2.5. You can find the appropriate install packages here:

  • Python is available for many platforms, this link gets you to the Python 2.5.2 release download page.
  • The Python Win32 Extensions package, the install kit can be found on this page, click on the "Download" link to get to a list of the various builds and select the one that matches your installed Python version.
  • The wxPython package if you want to use the arcvrestoregui.py program (recommended), the installation kits for this can be found here. There are both unicode and ansi versions, the unicode version is probably the best to use with Windows.
  • You will also need to install the Zope Object Database (ZODB) package. This was developed with ZODB3 (version 3.8.0). The package is available here. If you have ez_setup.py you can install ZODB by typing: ez_setup.py zodb3 at a DOS command prompt. ez_setup.py is part of EasyInstall, which is available as part of setuptools, which you can get here (note you may need to add the Python\Scripts directory to your search path after the install). Once EasyInstall is installed you should be able to install ZODB by typing easy_install zodb3 at a command prompt.
You need to install Python first and then the Python Win32 extensions and ZODB.

I typically install my Python to C:\programs\python25, if you install it to a different directory you may need to change one or two things.

You may need to add the C:\programs\python25 directory to your Windows PATH environment variable.

When running ArcvBack on Windows you can choose to run it as a Windows Service, which will allow it to run automatically once the machine is booted, even when no one is logged into the computer. To allow it to work in this way you must have the Python Win32 Extensions installed and you then need to use the following additional installation instructions:

Install Python as a Windows service (this only needs to be done once on a machine, so if you have other Python services running you don't need to do this again). I have my Python installed in c:\Programs\Python25 so the commands to execute (in a CMD console window) are:

cd C:\Programs\Python25\lib\site-packages\win32
PythonService.exe /register

Then do the following to install the ArcvBack service:

cd c:\programs\arcvback4
service.py install

It will output the following:
Installing service ArcvBack4
Service installed

You should then be able to see ArcvBack in the Windows control panel as in the following picture:
[Picture: Windows control panel]

Before you start the service you will want to configure it to start automatically when the machine boots up. To do this, right click on the ArcvBack Backup Server title in the control panel and select "Properties" from the popup menu. The property sheet should be configured like:

[Picture: Setting the autostart]

You may also want to configure the service to run under a particular account (you must do this if it needs to access directories on other machines in the LAN). To do this you first create the account that the service will be run under and then use the "Log On" tab to tell the service to run under this account, as you can see in the following picture:
[Picture: Setting service account rights]

In this case I have created an account on the local machine that is called arcvback. I have also created local machine accounts with the same user name and password on all the machines that it needs to back up. If you are running under a Windows Domain then this is easier: just set up a domain account for the service to run under and then there is no need to create additional accounts on the individual machines. If you are on a LAN where the domain controller is unreliable or is not powered up most of the time, then you can also use the individual machine accounts instead of the domain account. You can use Samba 3 on a Linux box to provide a Windows Domain account system instead of a Windows Server.

If you need to remove the arcvback service you can do:

cd C:\programs\arcvback4
service.py remove

It will output the following:

Removing service ArcvBack4
Service removed

If you install a new version of the Python code, or you modify some of the files (such as the service.py) then you can get the service to use the new code by just clicking on the "Restart" link in the Service control panel. The ArcvBack service checks the contents of its config.ini file once a minute, so if you make any changes to config.ini you don't have to restart the service for them to take effect.
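
How the service notices a changed config.ini is not described here. One common way to get this behaviour is to remember the file's modification time and re-read the file only when it changes. The following is a minimal sketch of that idea, not the actual service.py code: the ConfigWatcher class is hypothetical, and it is written for current Python 3 (where the module is called configparser) rather than the Python 2.5 that ArcvBack itself targets.

```python
import configparser
import os

class ConfigWatcher:
    """Reload an INI file only when its modification time changes."""

    def __init__(self, path):
        self.path = path
        self.mtime = None  # modification time seen at the last load
        self.config = configparser.ConfigParser()

    def poll(self):
        """Return True if the file was (re)loaded on this call."""
        mtime = os.stat(self.path).st_mtime_ns
        if mtime == self.mtime:
            return False  # unchanged since the last load
        # Parse into a fresh object so stale settings cannot linger.
        parser = configparser.ConfigParser()
        parser.read(self.path)
        self.config = parser
        self.mtime = mtime
        return True
```

A long-running service would call poll() from its main loop about once a minute (for example after a time.sleep(60)) and pick up any new settings from the config attribute.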

Configuration

Most of the configuration settings for ArcvBack are stored in a file called config.ini, the various settings are explained here, along with what you need to change if you have installed your ArcvBack to some directory other than c:\programs\arcvback4. A sample config.ini file is included in the distribution zip file, you can also read the comments in it.

Command summaries:

arcvback.py is the main backup utility. Use it for manual backups (or include it in a script file). It shares the same version database as the Windows ArcvBack Service, and because ZODB locks the database while it is in use you will probably have to stop the service before running it.

arcvdbrebuild.py is an emergency repair utility, in the event the version database is lost or corrupted and all the backups of it are also missing, then you can run this utility against a set of backup package files to build a new version database so that you can select and restore files as normal.

arcvlist.py is a command line utility to list the contents of the version database. This should be considered a last resort as arcvrestoregui is a lot easier to use and is more powerful.

arcvpkgcopy.py is a command line utility that copies (with verify) package files from the cache directories to another device (typically an external USB drive) so that you can have extra redundancy (and even take that drive off-site for better protection against data loss).

arcvpkglist.py is used to list the contents of package files, normally this is not required (the arcvlist.py and arcvrestoregui.py are better). Note that the arcvpkglist program does have a verification function, and with it you can check a single package or all the packages in a directory for integrity.

arcvpkgrestore.py is an emergency command used to extract the contents of a package file, normally this should never be needed.

arcvreset.py is a utility command to manipulate the next package and event IDs that the backup will use. Normally this should not be needed, it is primarily used for testing a few special cases.

arcvreschedule.py is a utility used when setting up a schedule of backup events for the service.py program

arcvrestore.py is the file restore command line utility. Probably best to just use the arcvrestoregui program instead.

arcvrestoregui.py is the file restore utility with a graphical user interface, this is the preferred way of selecting files or directories to be restored. You can restore single files, previous versions of single files, the latest version of a directory tree, just the files that were saved in one backup pass in a directory tree, or the most recent version of all files in a directory tree up to and including a particular backup pass. It also verifies all restored data with SHA1 hashes to check the integrity of the restore.
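
The last of these restore modes (the most recent version of every file up to and including a particular backup pass) is the least obvious, so here is a small sketch of the selection logic. The data layout is hypothetical and is not the ArcvBack database format: history simply maps each file path to a list of (pass number, package name) pairs, and the paths and package names are made up.

```python
def select_versions(history, up_to_pass):
    """For each path pick the newest saved version with pass <= up_to_pass.

    history maps a file path to a list of (pass_id, package) tuples.
    Files first saved after up_to_pass are left out of the result.
    """
    selected = {}
    for path, versions in history.items():
        eligible = [v for v in versions if v[0] <= up_to_pass]
        if eligible:
            selected[path] = max(eligible, key=lambda v: v[0])
    return selected

# Hypothetical history: three files saved over five backup passes.
history = {
    "docs/report.txt": [(1, "pkg0001"), (3, "pkg0004"), (5, "pkg0009")],
    "docs/notes.txt": [(1, "pkg0001")],
    "docs/new.txt": [(4, "pkg0007")],
}
print(select_versions(history, 3))
```

Restoring the tree as of pass 3 picks the pass 3 copy of report.txt, the pass 1 copy of notes.txt (it has not changed since) and skips new.txt, which did not exist yet.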

arcvrisk.py is a utility to report on files that may be at risk, this will search the database and list any files that:

  1. have never been backed up
  2. have been backed up in the past, but have been seen to have changed and not been backed up since
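
These two conditions can be expressed as a simple check over the recorded file information. The sketch below assumes a hypothetical record layout (it is not the real ArcvBack database schema): each path maps to the modification time last seen on disk and the modification time of the copy that was last backed up, or None if there is no backup.

```python
def files_at_risk(records):
    """Split paths into (never backed up, changed since last backup).

    records maps path -> (seen_mtime, backed_up_mtime or None).
    """
    never, stale = [], []
    for path, (seen, backed_up) in sorted(records.items()):
        if backed_up is None:
            never.append(path)   # condition 1: no backup exists at all
        elif seen > backed_up:
            stale.append(path)   # condition 2: file is newer than its backup
    return never, stale

# Hypothetical records with made-up timestamps.
records = {
    "a.txt": (100.0, 100.0),   # backed up and unchanged
    "b.txt": (250.0, 100.0),   # changed after its last backup
    "c.txt": (300.0, None),    # never backed up
}
print(files_at_risk(records))
```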

service.py is the Windows ArcvBack Service application (largely described in the Installation section above)

treecopy.py is a command line utility for copying the contents of a directory tree, I wrote this largely for testing the arcvback backup/restore processing to see that it was all correct. However, from time to time I find this utility useful for migrating data from one drive to another. This utility does SHA1 hashes on the data to check that the destination files were correctly written and that the source data was read the same way twice. This way if one of the drives (or controllers) is misbehaving there is a chance that you will notice it. Also, if your computer has faulty memory there is also a chance this utility will notice it.
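
The verification idea described here can be sketched for a single file as follows. This is an illustration written for current Python 3, not the treecopy.py source, and the function names are made up. Note also that on a real system the second read of the source may be served from the operating system's cache, so this check is weaker than it looks for small files.

```python
import hashlib
import shutil

def sha1_of(path, chunk_size=1024 * 1024):
    """Return the SHA1 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha1()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

def copy_verified(src, dst):
    """Copy src to dst, then check both source reads and the destination."""
    before = sha1_of(src)        # first read of the source
    shutil.copyfile(src, dst)
    after = sha1_of(src)         # second read of the source
    copied = sha1_of(dst)        # read back what was written
    if before != after:
        raise IOError("source read differently twice: %s" % src)
    if copied != before:
        raise IOError("destination does not match source: %s" % dst)
    return copied
```

A tree copy would walk the source directory and call copy_verified() for each file, collecting any errors rather than stopping at the first one.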

treediff.py is a command line utility for comparing the contents of two directory trees, I wrote this largely for testing the arcvback backup/restore processing to see that it was all correct. However, from time to time I find this utility useful for determining what has changed between two directory trees.
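
A basic version of this kind of comparison fits in a few lines. The sketch below is not the treediff.py source; it is a minimal illustration that compares two trees by relative path and file size only, which is an assumption made to keep it short. A content hash could be substituted for the size where a stricter check is wanted.

```python
import os

def list_tree(root):
    """Map each file's path relative to root to its size in bytes."""
    files = {}
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            full = os.path.join(dirpath, name)
            files[os.path.relpath(full, root)] = os.path.getsize(full)
    return files

def tree_diff(left, right):
    """Return (only_in_left, only_in_right, different_size) as sorted lists."""
    a, b = list_tree(left), list_tree(right)
    only_left = sorted(set(a) - set(b))
    only_right = sorted(set(b) - set(a))
    changed = sorted(p for p in set(a) & set(b) if a[p] != b[p])
    return only_left, only_right, changed
```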


Licensing

This is free for non-commercial use. If you have a web site or a blog I would appreciate a link to this page, that way I can see how much interest there is in this package.

If you want to use it for commercial purposes you are free to evaluate it for 100 days, then contact me for licensing if you find it suitable.



