Redundant data storage



This idea came to me after reading the article "Digital media archiving challenges Hollywood" on Slashdot.

The system I have in mind consists of three large data stores, each containing a copy of the data that needs to be preserved.
To keep an eye out for bitrot, a process reads each copy and compares them. If one file has suffered bitrot, it will differ from the other two. The differing part is then checked for media corruption (bad blocks on disk, etc.), overwritten with the correct data from one of the other copies, and checked once more. If it is still corrupt, the block is flagged and the data is moved to a different part of the media. The process then continues.
Depending on the chance of media failure, this process would run every 6 months to 2 years. The more copies there are, the more reliable the data.
This differs from things like RAID sets and .par files in that one whole file can be lost or corrupted and the original data can still be retrieved reliably and automatically.
If only two copies are available, it's not possible to tell which one holds the correct data without manual confirmation.
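
The comparison described above can be sketched as a block-by-block majority vote. This is only a minimal illustration in Python working on in-memory byte strings; the block size and the function name majority_blocks are my own assumptions, not part of any existing tool.

```python
from collections import Counter

BLOCK_SIZE = 4096  # assumed block size; the text above does not specify one


def majority_blocks(copies, block_size=BLOCK_SIZE):
    """Yield (offset, good_block, bad_indexes) for every block where copies disagree.

    good_block is None when no strict majority exists, which is always
    the case when only two copies are compared and they differ.
    """
    length = max(len(c) for c in copies)
    for offset in range(0, length, block_size):
        blocks = [c[offset:offset + block_size] for c in copies]
        block, votes = Counter(blocks).most_common(1)[0]
        if votes == len(copies):
            continue  # all copies agree on this block
        if votes * 2 <= len(copies):
            # no strict majority: cannot tell which copy is correct
            yield offset, None, list(range(len(copies)))
        else:
            bad = [i for i, b in enumerate(blocks) if b != block]
            yield offset, block, bad
```

With three copies, a single damaged block is reported together with the index of the copy that needs rewriting. With two differing copies, good_block comes back as None, which is the "manual confirmation" case.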

Put in a shell script, it would be something like:

check diff between file01 and file02 -> diff01
check diff between file01 and file03 -> diff02
if no differences, continue with the next file
if differences detected, check diff between file02 and file03 -> diff03
if diff01 and diff02 are true and diff03 is false, file01 is corrupt and an rsync of file02 or file03 to file01 is necessary
likewise, if diff02 and diff03 are true and diff01 is false, file03 is corrupt, etc.
if all three are true, no two copies agree and the file has to be flagged for a manual check
after the rsync the script continues to the next file
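
As a rough, runnable version of the steps above, here is a sketch in Python. The function names are my own, and shutil.copyfile stands in for the rsync step. Note that filecmp caches its results, so the cache is cleared before every comparison; otherwise a long-running checker could miss bitrot that leaves size and mtime untouched.

```python
import filecmp
import shutil


def differs(a, b):
    """Return True if the two files differ byte for byte."""
    filecmp.clear_cache()  # never trust an earlier comparison result
    return not filecmp.cmp(a, b, shallow=False)


def check_and_repair(file01, file02, file03):
    """Repair the odd one out; return its path, or None if all copies agree."""
    diff01 = differs(file01, file02)
    diff02 = differs(file01, file03)
    if not diff01 and not diff02:
        return None  # no differences, continue with the next file
    diff03 = differs(file02, file03)
    if diff01 and diff02 and not diff03:
        bad, good = file01, file02
    elif diff01 and diff03 and not diff02:
        bad, good = file02, file01
    elif diff02 and diff03 and not diff01:
        bad, good = file03, file01
    else:
        raise RuntimeError("no two copies agree; manual check needed")
    shutil.copyfile(good, bad)  # stands in for the rsync step
    return bad
```

Calling check_and_repair on the three paths of one file returns the path that was overwritten, which a wrapper script could log before moving on to the next file.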

As an extra, there could be routines for missing files:

file01 missing, diff between file02 and file03 is false: copy file02 or file03 to file01
if only 1 file is available, assume it is new and copy it to the other locations
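
The missing-file routine could look roughly like this in Python. It is a sketch under my own assumptions: the name restore_missing is made up, the target directories are assumed to exist already, and surviving copies that disagree are left for a manual check rather than guessed at.

```python
import filecmp
import os
import shutil


def restore_missing(paths):
    """Recreate missing copies from the surviving ones; return the paths written."""
    present = [p for p in paths if os.path.exists(p)]
    missing = [p for p in paths if not os.path.exists(p)]
    if not missing or not present:
        return []  # nothing to restore, or nothing to restore from
    if len(present) >= 2:
        filecmp.clear_cache()  # force a fresh byte-by-byte comparison
        first = present[0]
        if not all(filecmp.cmp(first, p, shallow=False) for p in present[1:]):
            raise RuntimeError("surviving copies differ; manual check needed")
    # A single surviving file is treated as new and copied everywhere.
    for target in missing:
        shutil.copyfile(present[0], target)
    return missing
```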

Deleting a file would mean deleting all 3 copies at the same time; otherwise the script, if it happens to be checking that file, could copy it back from the remaining copies.
Modifying a file would likewise mean modifying all 3 copies at once; otherwise the changes made to 1 copy would be reverted.

The consistency check could be done by a script; the deletion, creation and modification of files could be the task of a file system driver.


The possible solution?

The trouble with the previous idea is backup; there is none.
What I'm planning to set up is the redundant storage combined with rdiff-backup.
The first setup will use four disks, each with three partitions. Each set of data is stored on three partitions on three different disks, for a total of four storage groups.

         part 1    part 2    part 3
disk 1   data 1a   data 2b   data 3c
disk 2   data 1b   data 2c   data 4a
disk 3   data 1c   data 3a   data 4b
disk 4   data 2a   data 3b   data 4c
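
To double-check the layout, the table can be written down as data and tested for the property that matters: every storage group sits on three different disks, so losing any one disk still leaves two copies of each group. A small Python sketch (the helper names are my own):

```python
# The partition table above, one entry per disk.
LAYOUT = {
    "disk 1": ["1a", "2b", "3c"],
    "disk 2": ["1b", "2c", "4a"],
    "disk 3": ["1c", "3a", "4b"],
    "disk 4": ["2a", "3b", "4c"],
}


def disks_per_group(layout):
    """Map each storage group number to the set of disks holding a copy of it."""
    groups = {}
    for disk, parts in layout.items():
        for part in parts:
            groups.setdefault(part[0], set()).add(disk)
    return groups


def surviving_disks(layout, failed_disk):
    """Count, per group, the disks that still hold a copy after one disk fails."""
    return {
        group: len(disks - {failed_disk})
        for group, disks in disks_per_group(layout).items()
    }
```

For example, surviving_disks(LAYOUT, "disk 1") shows groups 1, 2 and 3 dropping to two copies while group 4 keeps all three.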

Data 1a is the source partition; this is the one the client will use.
Data 1b and 1c will be copies made by rdiff-backup. For extra redundancy, the rdiff-backup directory holding the backup data will be copied (with rsync?) to the two other data partitions in the storage group.

The directory structure on each partition in a group would look like this:

/data     -> containing the data itself
/rdiff_1  -> containing the backup data from the first backup partition
/rdiff_2  -> containing the backup data from the second backup partition

This way it should be possible to restore data from at least one other rdiff copy, even if one copy of the rdiff backup data becomes corrupt.
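
One possible reading of this plan, sketched in Python as a list of commands that is built but not run. The mount points are hypothetical, and the mapping of each backup partition's rdiff-backup-data directory (where rdiff-backup keeps its increments and metadata inside the destination) to /rdiff_1 and /rdiff_2 on the other partitions is my assumption. The commands use the classic "rdiff-backup source destination" form and "rsync -a --delete".

```python
def backup_plan(source, backups):
    """Return the commands for one storage group as argument lists, in run order.

    source and backups are mount points of the group's partitions
    (hypothetical paths, e.g. "/mnt/1a" and ["/mnt/1b", "/mnt/1c"]).
    """
    plan = []
    # Step 1: rdiff-backup from the source partition to each backup partition.
    for backup in backups:
        plan.append(["rdiff-backup", f"{source}/data", f"{backup}/data"])
    # Step 2: copy each backup's rdiff-backup-data to the other two partitions.
    members = [source] + list(backups)
    for n, backup in enumerate(backups, start=1):
        meta = f"{backup}/data/rdiff-backup-data/"
        for other in members:
            if other != backup:
                plan.append(["rsync", "-a", "--delete", meta, f"{other}/rdiff_{n}/"])
    return plan
```

Each entry could be handed to subprocess.run once the paths have been checked; printing the plan first makes it easy to review before anything touches the disks.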