File Formats Used in ArcvBack

Introduction

This is part of a series of articles on backing up computers. The top page is Design for an Archiving Backup System.

Caution: the following discusses version 2 of arcvback and needs updating for version 3.

The File Object

In a file-based archival system the file object is the fundamental object. For practical reasons each file is, in fact, composed of one or more blocks (or chunks) of data, and the file object contains a table of the SHA1 identifiers of these blocks (the BIDs). When a file is modified so that it contains some different BIDs, a new File Object is allocated with the same file name but a different version number, which allows the system to restore older versions of files if needed. (Open question: should there instead be a file ID, FID, that is an integer of 4 or more bytes? Would that do?) The purpose of the file objects is to track the location and properties of the files on the various machines being backed up, along with the BID (SHA1 value) of each block in the file.

If there are multiple copies of a file that differ only in machine, directory or name, they all have the same BIDs, so all their file objects share the same Block Objects. If there are multiple versions of a file that have the same name but different contents, there will be multiple file objects, each using a different set of block objects, although they may still share some BIDs if parts of the files are the same.
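The relationship between files, blocks and BIDs can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the 64KB block size, the FileObject fields and the helper names block_ids and next_version are hypothetical and are not taken from arcvback itself.

```python
import hashlib
from dataclasses import dataclass, field

CHUNK_SIZE = 64 * 1024  # assumed block size; the article does not fix one


def block_ids(data: bytes, chunk_size: int = CHUNK_SIZE) -> list[str]:
    """Split data into fixed-size blocks and return the SHA1 hex digest (BID) of each."""
    return [
        hashlib.sha1(data[i:i + chunk_size]).hexdigest()
        for i in range(0, len(data), chunk_size)
    ]


@dataclass
class FileObject:
    """Hypothetical file object: a name, a version number and a table of BIDs."""
    name: str
    version: int
    bids: list[str] = field(default_factory=list)


def next_version(previous: FileObject, new_data: bytes) -> FileObject:
    """Return the previous object if the BIDs are unchanged, else a new version."""
    bids = block_ids(new_data)
    if bids == previous.bids:
        return previous
    return FileObject(previous.name, previous.version + 1, bids)
```

With this layout, two versions of a file that differ only in their second block share the BID of the first block, so only the changed block would need to be stored again.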

The file objects will also contain data about the file, such as the creation and modification time stamps, the file name and the security attributes. The approximate deletion date should also be maintained; this is the time at which the backup system first noticed that a file it had previously backed up was deleted from the system. With this date one can determine when it is safe to delete the File and Block objects of a deleted file, simply by checking whether the current time minus the deletion time is greater than the archive time period.
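The retention check described above amounts to a single comparison. A minimal sketch, assuming the deletion time is stored as a datetime (or None while the file still exists); the function name can_purge is hypothetical:

```python
from datetime import datetime, timedelta
from typing import Optional


def can_purge(deleted_at: Optional[datetime], now: datetime,
              archive_period: timedelta) -> bool:
    """True once a deleted file has been gone for longer than the archive period."""
    if deleted_at is None:  # file has not been seen as deleted
        return False
    return now - deleted_at > archive_period
```

Because block objects can be shared between file objects, a real implementation would presumably also need to confirm that no other file object still references a block before removing it.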

The Block Data and Chunk Objects

The block data object contains the SHA1 value for the data (the BID) and the media identifiers (MIDs) for where the actual data Chunks are stored. The user sets the desired level of redundancy when the system is configured, expressed as the number of copies that must be stored for each block. Each block will have a short array that contains the MIDs where each copy of its chunk data is to be found. Since files are broken into a set of chunks for storage, a single file may be larger than the available space on one piece of media; in this case the chunks for that file will be stored under different MIDs. We will never split a single chunk across two pieces of media.
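As a rough sketch of such a block data object, the following Python class keeps the BID together with the short list of MIDs and reports how many copies are still missing for a configured redundancy level. The class and method names are hypothetical, not part of arcvback.

```python
from dataclasses import dataclass, field


@dataclass
class BlockObject:
    """Hypothetical block data object: a BID plus the media that hold a copy."""
    bid: str                                       # SHA1 hex digest of the block data
    mids: list[int] = field(default_factory=list)  # MIDs where a copy is stored

    def record_copy(self, mid: int) -> None:
        """Note that a copy of this block's chunk was written to media `mid`."""
        if mid not in self.mids:
            self.mids.append(mid)

    def copies_needed(self, redundancy: int) -> int:
        """Number of further copies required to reach the configured redundancy."""
        return max(0, redundancy - len(self.mids))
```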

Chunks may also be stored in the backup cache. In fact most restore operations will be satisfied by loading the desired chunk data from the cache, especially as the cache could be made quite large. In a home environment a 100-200GB drive could act as the cache, allowing the complete backup information for several machines to be kept there for immediate access, in addition to being recorded on backup media for security. The file data object will record the MID of the cache(s) that the chunk is currently contained in.
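If a cache is treated as just another MID, choosing where to read a chunk from at restore time becomes a small lookup. The sketch below is an assumption about how that might work; pick_source and its parameters are hypothetical names.

```python
from typing import Iterable, Optional


def pick_source(block_mids: Iterable[int], cache_mids: set[int],
                mounted_mids: set[int]) -> Optional[int]:
    """Prefer a cache copy, then any mounted media; None means media must be requested."""
    mids = list(block_mids)
    for mid in mids:
        if mid in cache_mids:    # fastest: chunk is still in a cache
            return mid
    for mid in mids:
        if mid in mounted_mids:  # next best: media already loaded
            return mid
    return None                  # operator has to load one of the listed media
```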

Package Identifiers

In an archive system there are many pieces of media in use, and some of them may be quite old. To keep things simple, when a piece of media is formatted (or erased) it is given a unique media identifier (MID). We never reuse old media identifiers, so the MID needs to be more than a byte, probably either a 2 byte or a 4 byte quantity. In a system that uses CD-RW for backup, a 2 byte MID (65,536 values) would cover about 42TB of total storage (assuming roughly 650MB per disk), and at one disk written per day it would take about 179 years to exhaust the available MIDs. A more reasonable rate of 5 disks per day would still take about 36 years. If we increase the MID size to 4 bytes (about 4.3 billion values), we would have to write one disk per second, 24 hours a day, for about 136 years to exhaust the MIDs, which is going to be impossible. So a 4 byte MID is clearly sufficient even for the worst case of the smallest and cheapest backup media, and unless the design shows the MID size to be a significant problem we'll stick with 4 bytes.
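The sizing argument is easy to check with a few lines of arithmetic. The sketch below assumes a 650MB CD-RW and a 365.25-day year; both constants and the function name are illustrative choices rather than values fixed by the design.

```python
DISK_BYTES = 650 * 10**6  # assumed CD-RW capacity in bytes
DAYS_PER_YEAR = 365.25    # assumed average year length


def years_to_exhaust(mid_bytes: int, disks_per_day: float) -> float:
    """Years until every MID of the given width has been handed out."""
    return 256 ** mid_bytes / disks_per_day / DAYS_PER_YEAR


# Total capacity addressable with a 2 byte MID, in terabytes (10**12 bytes).
total_tb_2byte = 256 ** 2 * DISK_BYTES / 10**12

if __name__ == "__main__":
    print(f"2 byte MID capacity: {total_tb_2byte:.1f} TB")
    print(f"2 byte MID, 1 disk/day:  {years_to_exhaust(2, 1):.0f} years")
    print(f"2 byte MID, 5 disks/day: {years_to_exhaust(2, 5):.0f} years")
    print(f"4 byte MID, 1 disk/sec:  {years_to_exhaust(4, 86400):.0f} years")
```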

The Directory Object

This contains the information about the location of a file. It might also contain the machine name and the drive or device name. This could be as simple as a UNC formatted string of the complete path to each file, but since directories are true trees it's more space efficient to have a one to one mapping between these directory objects and the real ones.
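A one to one mapping of this kind can be sketched as a tree of small objects that each store only their own name and a link to their parent, with the full path rebuilt on demand. The class below is a hypothetical illustration; the field names and the use of a UNC share as the root are assumptions.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class DirectoryObject:
    """Hypothetical directory object: its own name plus a link to the parent."""
    name: str
    parent: Optional["DirectoryObject"] = None

    def full_path(self, sep: str = "\\") -> str:
        """Walk up to the root and join the names into a complete path."""
        parts = []
        node: Optional["DirectoryObject"] = self
        while node is not None:
            parts.append(node.name)
            node = node.parent
        return sep.join(reversed(parts))
```

Each file object then needs only a reference to its directory object, instead of repeating the complete path string for every file.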

In a full-blown system additional security information will need to be stored here too.

The Machine Object

This will identify each machine that is being backed up.

The Drive Object

This might be the same as a directory object. It identifies which drive within a machine particular directories are stored on.

Media Layout

The backup media (and this includes the cache drives) will primarily contain Chunk Objects as these are the actual data for the files. Other things to think about:

Should there be a table of contents on the media so that at restore time (especially for tape) the first part of the media can be read, and the restore program can then seek directly to the chunks it needs? It would hold the BID, length and location of each chunk on the tape.
Should there also be file/directory object data on the media, perhaps just for the files that reference the chunks on this media, so that in the event of database loss it would be possible to rebuild the database by re-reading all the archive media (or at least this first part of all the media)?
Should there be a backup of the database on the media, or should this be done separately?
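If the first question is answered with a table of contents, one possible encoding is a count followed by fixed-size records of BID, location and length. The layout below (big-endian, 20 raw SHA1 bytes, 8 byte offset, 4 byte length) is purely an assumption for illustration and is not the arcvback media format.

```python
import struct

# Assumed record layout: 20 raw SHA1 bytes, 8 byte offset, 4 byte length.
ENTRY = struct.Struct(">20sQI")
COUNT = struct.Struct(">I")


def pack_toc(entries: list[tuple[bytes, int, int]]) -> bytes:
    """Serialise (bid, offset, length) records, preceded by the record count."""
    parts = [COUNT.pack(len(entries))]
    parts += [ENTRY.pack(bid, offset, length) for bid, offset, length in entries]
    return b"".join(parts)


def unpack_toc(buf: bytes) -> dict[bytes, tuple[int, int]]:
    """Read the table back into a BID -> (offset, length) lookup."""
    (count,) = COUNT.unpack_from(buf, 0)
    toc = {}
    pos = COUNT.size
    for _ in range(count):
        bid, offset, length = ENTRY.unpack_from(buf, pos)
        toc[bid] = (offset, length)
        pos += ENTRY.size
    return toc
```

A restore program could read this table from the start of the media and then seek straight to the offset recorded for each BID it needs.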
