Redundant data storage
This idea came to me after reading the Slashdot article "Digital media archiving challenges Hollywood".
The system I have in mind consists of three large data stores, each one containing a copy of the data that needs to be preserved.
To keep an eye out for bitrot, a process reads each copy and compares them. If there is bitrot in one file, it will differ from the other two. The affected part is then checked for corruption of the media (bad blocks on disk, etc.), overwritten with the correct data from one of the other files, and checked once more. If it stays corrupt, the block is flagged and the data is moved to a different part of the media. The process then continues.
Depending on the chance of media failure, this process would run every 6 months to 2 years. The more copies there are, the more reliable the data.
This differs from things like RAID sets and .par files in that one whole file can be lost or corrupted but the original data can still be retrieved reliably and automatically.
If only two copies are available it's not possible to tell which is the correct data without manual confirmation.
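A minimal sketch of the "more copies, more reliable" point, in Python. It votes on SHA-256 checksums instead of diffing the files directly, which is a swapped-in technique and not what the text describes. The function names `sha256_of` and `majority_digest` are made up for illustration.

```python
import hashlib
from collections import Counter


def sha256_of(path):
    """Return the SHA-256 hex digest of a file, read in 1 MiB chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def majority_digest(paths):
    """Return the digest shared by a strict majority of copies, or None.

    None means no version has more than half of the votes, so the
    correct data cannot be picked automatically.
    """
    if not paths:
        return None
    counts = Counter(sha256_of(path) for path in paths)
    digest, votes = counts.most_common(1)[0]
    return digest if votes * 2 > len(paths) else None
```

With three copies, one corrupt file is outvoted by the other two. With two copies that differ, each version has one vote, so the function returns None and a person has to decide.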
Put in a shell script, it would be something like:
check diff between file01 and file02 -> diff01
check diff between file01 and file03 -> diff02
if no differences, continue with the next file
if differences are detected, check diff between file02 and file03 -> diff03
if diff01 and diff02 are true and diff03 is false, file01 is corrupt and an rsync of file02 or file03 to file01 is necessary
the same goes for diff02 and diff03 true, diff01 false: file03 is corrupt
and for diff01 and diff03 true, diff02 false: file02 is corrupt
after the rsync the script continues with the next file
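The steps above can be sketched in Python (not the shell script the text has in mind). This is an illustration under assumptions: `shutil.copyfile` stands in for the rsync step, the names `same_content` and `repair_triplet` are hypothetical, and raising an error when all three copies differ is an added assumption, since the text does not cover that case.

```python
import shutil


def same_content(path_a, path_b, chunk=1 << 20):
    """Return True if both files hold exactly the same bytes."""
    with open(path_a, "rb") as fa, open(path_b, "rb") as fb:
        while True:
            block_a, block_b = fa.read(chunk), fb.read(chunk)
            if block_a != block_b:
                return False
            if not block_a:  # both files ended at the same point
                return True


def repair_triplet(file01, file02, file03):
    """Overwrite the odd one out of three copies.

    Returns the repaired path, or None when all copies already match.
    """
    diff01 = not same_content(file01, file02)
    diff02 = not same_content(file01, file03)
    if not diff01 and not diff02:
        return None  # no differences, continue with the next file
    diff03 = not same_content(file02, file03)
    if diff01 and diff02 and not diff03:
        bad, good = file01, file02
    elif diff01 and diff03 and not diff02:
        bad, good = file02, file01
    elif diff02 and diff03 and not diff01:
        bad, good = file03, file01
    else:
        # Assumption: all three differ, so no two copies agree.
        raise ValueError("no two copies agree; manual confirmation needed")
    shutil.copyfile(good, bad)  # stand-in for the rsync step
    if not same_content(good, bad):  # check once more after the repair
        raise OSError("%s still differs after repair" % bad)
    return bad
```

The bad-block handling (flag the block, move the data elsewhere on the media) is left out; here a failed re-check only raises an error.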
As an extra there could be routines for missing files:
file01 missing, diff between file02 and file03 is false: copy file02 or file03 to file01
if only 1 file available, assume it is new and copy it to the other locations
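A small Python sketch of the missing-file routine, under stated assumptions: the target directories already exist, `restore_missing` is a hypothetical name, and refusing to copy when the remaining copies differ is an added safety choice the text does not spell out.

```python
import os
import shutil


def same_content(path_a, path_b):
    """Compare two files byte for byte (reads them whole; fine for a sketch)."""
    with open(path_a, "rb") as fa, open(path_b, "rb") as fb:
        return fa.read() == fb.read()


def restore_missing(paths):
    """Copy a surviving file to every location where it is missing.

    Returns the list of paths that were recreated.
    """
    present = [p for p in paths if os.path.exists(p)]
    missing = [p for p in paths if not os.path.exists(p)]
    if not present or not missing:
        return []  # nothing to copy from, or nothing to restore
    source = present[0]
    # With two survivors, only copy when they agree (the diff is "false").
    if any(not same_content(source, other) for other in present[1:]):
        raise ValueError("remaining copies differ; cannot pick a source")
    # A single survivor is assumed to be a new file and is copied as-is.
    for target in missing:
        shutil.copyfile(source, target)
    return missing
```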
Deleting files would mean having to delete all 3 copies at the same time, otherwise the script could start copying the files back if it happens to be checking them at that moment.
Modifying files would also mean having to modify all 3 copies at once, otherwise changes made to only 1 file would be reverted.
The consistency check could be done by a script, while the deletion, creation and modification of files could be the task of a file system driver.
The possible solution?
The trouble with the previous idea is backup; there is none.
What I'm planning on setting up is the redundant storage with rdiff-backup.
The first setup will be done with four disks, each having three partitions on it. Data is then stored on three partitions on three different disks, for a total of four storage groups.
|        | part 1  | part 2  | part 3  |
| disk 1 | data 1a | data 2b | data 3c |
| disk 2 | data 1b | data 2c | data 4a |
| disk 3 | data 1c | data 3a | data 4b |
| disk 4 | data 2a | data 3b | data 4c |
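The table follows a simple rotation, which the Python sketch below reproduces. The rule is inferred from the table and is not stated in the text; `build_layout` is a hypothetical name. It assumes one storage group per disk and no more copies per group than there are disks.

```python
import string


def build_layout(disks=4, copies=3):
    """Map each disk number to the partitions it holds, in partition order."""
    layout = {disk: [] for disk in range(1, disks + 1)}
    for group in range(disks):  # assumption: as many groups as disks
        start = (-group) % disks  # disk index holding copy "a" of this group
        for copy in range(copies):
            disk = (start + copy) % disks + 1
            layout[disk].append("%d%s" % (group + 1, string.ascii_lowercase[copy]))
    return layout


for disk, parts in build_layout().items():
    print("disk %d:" % disk, " ".join("data " + part for part in parts))
# disk 1: data 1a data 2b data 3c
# disk 2: data 1b data 2c data 4a
# disk 3: data 1c data 3a data 4b
# disk 4: data 2a data 3b data 4c
```

Because each group's copies land on consecutive disks (wrapping around), no group has two copies on the same disk, so losing one disk costs every group at most one copy.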
Data 1a is the source partition; this is the one the client will use.
Data 1b and 1c will be copies made by rdiff-backup. For extra redundancy the rdiff-backup directory holding the backup data will be copied (with rsync?) to the two other data partitions in the storage group.
The directory structure on each partition in a group would look like this:
/data -> containing the data itself
/rdiff_1 -> containing the backup data from the first backup partition
/rdiff_2 -> containing the backup data from the second backup partition
This way it should be possible to restore data from at least one other rdiff copy, even if one copy of the rdiff backup data becomes corrupt.