An Antidote to Bloated Storage
Just a few years ago, disk-to-disk backup seemed almost too good to be true. Powered by inexpensive ATA (and later SATA) disk drives, D2D, whether implemented as virtual tape libraries or as a backup-to-disk option in your favorite backup application, made backups faster, eliminated mechanical failures in tape drives and libraries, and made it easier to deal with the continuous chorus of calls to the helpdesk for individual file restores.
Today, our disk-backup devices are filling up, and there's not enough space or power in the data center to add another petabyte of backup space, so we're keeping only two to three days' worth of backups on disk, when we'd like to keep a month's worth. Problem is, there's too much duplicate data in our backup sets. The good news is, vendors--smelling money, of course--are promising that their new data de-duplication products can provide 20-to-1, even 300-to-1 reductions in the amount of data we need to store. Can it be? Let's take a look.
De-duplication technology lets you store more backup data on a given set of disks. This can extend the period you keep disk backups and reduce your data center power and cooling costs. If you de-dupe data before sending it across the WAN, you can save on bandwidth, making online off-site backups practical at companies that used to rely on tape. The only drawback to data de-duplication is that it can slow down the backup process.
Point Of Origin
Duplicate data makes its way into backups in two ways: over time (the temporal realm), as your backup program backs up the same file from the same directory multiple times, and across locations (the geographic realm), as the same files are backed up from multiple locations in your network. Most networks have a surprising amount of duplicate data, from the holiday party invitation PDF 56 users saved to their home directories to the 3 GB of Windows files on the system drive of every server.
One solution to file duplication in the temporal realm is incremental backup. Although we're big fans of this, especially the incremental-forever approach used by Tivoli Storage Manager and others, we don't consider incremental backups to be data de-duplication any more than we consider RAID disaster recovery. Incremental backups fall in the realm of duplicate avoidance.
The most basic form of data de-duplication is the file-level single-instance store found in CAS (content-addressable storage) devices, such as EMC's Centera. As each file is stored on a CAS system, the device generates a hash of the file's contents; should a file with the same hash already exist, rather than saving another copy, the system just creates another pointer to the copy it already has.
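To make the pointer idea concrete, here is a minimal sketch of a file-level single-instance store. It is a toy, not how Centera or any other product is built: the class name, the in-memory dictionaries and the choice of SHA-1 are all assumptions made for illustration.

```python
import hashlib


class SingleInstanceStore:
    """Toy file-level single-instance store keyed by content hash (illustrative only)."""

    def __init__(self):
        self.blobs = {}     # content hash -> file contents, stored once
        self.pointers = {}  # file name -> content hash

    def put(self, name, data):
        digest = hashlib.sha1(data).hexdigest()
        # Keep the bytes only if this content has not been seen before.
        self.blobs.setdefault(digest, data)
        # Every file name gets a pointer, whether or not the bytes were new.
        self.pointers[name] = digest
        return digest

    def get(self, name):
        return self.blobs[self.pointers[name]]


store = SingleInstanceStore()
invite = b"%PDF holiday party invitation"
for user in range(56):
    store.put(f"/home/user{user}/party.pdf", invite)
store.put("/home/user0/notes.txt", b"meeting notes")

print(len(store.pointers), "files,", len(store.blobs), "stored copies")
```

The 56 copies of the invitation cost one stored object plus 56 small pointers.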
Microsoft's latest version of Windows Storage Server, the OEM NAS (network-attached storage) version of Windows server, uses a slightly different approach to eliminating duplicate files. Rather than identify duplicates as they're written, WSS runs a background process, the SIS (single-instance storage) Groveler, which identifies duplicate files using a partial file hash function followed by a full binary comparison, moves the file to a common storage area and replaces the files in their original locations with links to the file in the common store.
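The two-stage check described above, a cheap partial hash to find candidates and then a full binary comparison to confirm them, might look roughly like this. This is a sketch of the general technique, not Microsoft's code: the 4-KB partial region, the SHA-1 hash and the dictionary standing in for a volume are all assumptions.

```python
import hashlib
from collections import defaultdict

PARTIAL_BYTES = 4096  # assumed size of the region covered by the partial hash


def find_duplicates(files):
    """Return {duplicate path: path of the copy that is kept}.

    `files` is a dict of path -> bytes standing in for a volume.
    """
    # Stage 1: group files by size plus a hash of their first few KB.
    candidates = defaultdict(list)
    for path, data in files.items():
        key = (len(data), hashlib.sha1(data[:PARTIAL_BYTES]).digest())
        candidates[key].append(path)

    # Stage 2: confirm each candidate with a full byte-for-byte comparison.
    links = {}
    for paths in candidates.values():
        keepers = []
        for path in paths:
            for keeper in keepers:
                if files[path] == files[keeper]:
                    links[path] = keeper  # would become a link to the common store
                    break
            else:
                keepers.append(path)
    return links


volume = {
    "a.bin": b"A" * 5000,
    "b.bin": b"A" * 4999 + b"B",  # same size and same first 4 KB, different tail
    "c.bin": b"A" * 5000,         # true duplicate of a.bin
}
links = find_duplicates(volume)
print(links)
```

The full comparison is what keeps b.bin from being mistaken for a duplicate even though its partial hash matches.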
Although file-level SIS can save some space, things get really interesting if we eliminate not only duplicate files but also data duplicated within files. Think of Outlook's lowly .PST file. A typical user may have a 300-MB or larger .PST holding all his e-mail from time immemorial; every day he receives one or more new messages, and since his .PST file is changed that day, your backup program includes it in the incremental backup even though there are only 25 KB of changes in the 300-MB file.
A de-duping product that could identify that 25 KB of new data and store it without the rest of the baggage could save lots of disk space. Extend that concept so that duplicate data, such as the 550-KB attachment that's in 20 users' .PST files, can be eliminated, and you could achieve staggering data-reduction factors. One group of such solutions is the data de-duping backup targets pioneered by Data Domain. These devices look to a backup application like a VTL (virtual tape library) or NAS device. They take their data from the backup app and do their de-duplication magic on it transparently.
Modus Operandi
Vendors have taken three basic approaches to the data de-duplication process. The hash-based approach, used by Data Domain, FalconStor Software in its VTL software and Quantum in its new DXi-series appliances, breaks the data stream from the backup app into blocks and generates a hash for each block, using SHA-1, MD-5 or a similar algorithm. If the hash for a new block matches a hash that's in the device's hash index, the data has already been backed up, and the device just updates its tables to say the data exists in the new location too.
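A stripped-down sketch of the hash-based approach follows. It assumes fixed 4-KB blocks, SHA-1 and an in-memory dictionary as the hash index; real appliances choose their own block sizes and index structures, and none of the names below come from any vendor.

```python
import hashlib

BLOCK_SIZE = 4096  # assumed block size for illustration


class BlockStore:
    """Toy hash-based de-duplicating backup target."""

    def __init__(self):
        self.index = {}  # block hash -> block data, each unique block stored once

    def write_stream(self, data):
        """Store a backup stream; return the list of hashes needed to rebuild it."""
        recipe = []
        for offset in range(0, len(data), BLOCK_SIZE):
            block = data[offset:offset + BLOCK_SIZE]
            digest = hashlib.sha1(block).digest()
            if digest not in self.index:
                self.index[digest] = block  # new data: store it
            recipe.append(digest)           # known or new, record where it belongs
        return recipe

    def read_stream(self, recipe):
        return b"".join(self.index[d] for d in recipe)


# Monday's backup: ten distinct blocks. Tuesday's: the same, with one block changed.
monday = b"".join(bytes([i]) * BLOCK_SIZE for i in range(10))
tuesday = monday[:3 * BLOCK_SIZE] + bytes([99]) * BLOCK_SIZE + monday[4 * BLOCK_SIZE:]

target = BlockStore()
monday_recipe = target.write_stream(monday)
tuesday_recipe = target.write_stream(tuesday)
print("blocks written:", len(monday_recipe) + len(tuesday_recipe),
      "blocks stored:", len(target.index))
```

Twenty blocks arrive from the backup app, but only eleven are stored; both backups can still be restored in full from their hash lists.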
The hash-based approach has a built-in scalability issue. To quickly tell if a given block of data has been backed up, the device needs to hold the hash index in memory. As the number of backed-up blocks grows, so does the index. Once the index grows beyond the device's ability to hold it in memory, performance falls off, as disk searches are much slower than memory searches. As a result, most hash-based systems are self-contained appliances balancing the amount of memory with the amount of disk space for storing data so the hash table never grows too big.
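A quick back-of-the-envelope calculation shows why the index outgrows memory. Every number here is an assumption chosen for illustration (10 TB of unique stored data, 8-KB average blocks, a 20-byte SHA-1 digest plus 12 bytes of location data per entry); actual products will differ.

```python
def index_size_bytes(stored_bytes, block_size, entry_bytes):
    """Rough hash-index size: one entry per unique stored block."""
    blocks = stored_bytes // block_size
    return blocks * entry_bytes


TB = 10**12
GB = 10**9

# Assumed figures, not taken from any vendor's specifications.
stored = 10 * TB
block_size = 8 * 1024
entry = 20 + 12  # SHA-1 digest plus an assumed 12 bytes of location data

size = index_size_bytes(stored, block_size, entry)
print(f"index for {stored // TB} TB of unique data: about {size / GB:.1f} GB")
```

Halving the block size doubles the index, which is one reason block size is a trade-off between de-duplication ratio and memory.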
The second approach, content-aware de-duplication, relies on the backup appliance being aware of the data format it's recording. It can use the file-system metadata embedded in the backup data to identify files; it then does byte-by-byte comparisons with other versions in its data repository to create a delta file of the changes in this version compared with the first version stored. This approach avoids the possibility of a hash collision (see "Don't Fear Collisions," below), but requires the use of a supported backup app so the device can extract metadata.
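The delta step can be sketched with Python's standard difflib module. This only illustrates the idea of storing a new version as "copy these ranges from the base, plus these new bytes"; the delta format and function names are hypothetical, and commercial products use their own, far faster differencing engines.

```python
from difflib import SequenceMatcher


def make_delta(base, new):
    """Describe `new` as copy-from-base ranges plus literal new bytes."""
    delta = []
    matcher = SequenceMatcher(None, base, new, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            delta.append(("copy", i1, i2))      # bytes already held in the base version
        elif j2 > j1:                           # "replace" or "insert"
            delta.append(("data", new[j1:j2]))  # bytes that must actually be stored
        # "delete" needs no entry: those base bytes are simply not copied
    return delta


def apply_delta(base, delta):
    """Rebuild the new version from the base version and its delta."""
    out = bytearray()
    for op in delta:
        if op[0] == "copy":
            out += base[op[1]:op[2]]
        else:
            out += op[1]
    return bytes(out)


first_version = b"PST header " + b"old mail " * 50
new_version = first_version + b"one new message"

delta = make_delta(first_version, new_version)
literal_bytes = sum(len(op[1]) for op in delta if op[0] == "data")
print("new version:", len(new_version), "bytes; literal bytes in delta:", literal_bytes)
```

Because the comparison is byte for byte against a known earlier version, no hash is involved and there is nothing to collide.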
ExaGrid Systems' InfiniteFiler is an example of a content-aware de-duplication device that uses its knowledge of the common backup apps like CommVault Galaxy and Symantec Backup Exec to identify files from the source system as they're backed up. After the backup is completed, it identifies files that have been backed up multiple times and generates deltas. Multiple InfiniteFilers can be combined into a grid supporting up to 30 TB of backup data. The de-duping approach ExaGrid uses does a good job of storing the one new message in a 1-GB .PST file but it can't eliminate duplicate data across multiple different files, like the same attachment in four .PSTs.
Sepaton's DeltaStor for its VTLs also uses the content-aware approach, but compares the new file both with previous versions from the same location and with versions backed up from other locations so it can eliminate geographical duplicates.
The third approach, used by Diligent Technologies in its ProtecTier VTL, divides data into blocks like the hash-based products but uses a proprietary algorithm to determine if given blocks are similar to one another. It then does a byte-by-byte compare of the data in similar blocks to determine if the block has been backed up.
Hardware Or Software
In addition to their de-duping approach, backup targets differ in their physical architectures. Data Domain, ExaGrid and Quantum make monolithic appliances that contain their disk arrays. The Data Domain and Quantum appliances can have NAS or VTL interfaces, while ExaGrid is always a NAS. Diligent and FalconStor sell their products as software, running on an Intel or Opteron server, to create a VTL gateway to external storage.
Although a backup appliance with a VTL interface may seem more sophisticated and could be easier to integrate into an existing tape-based backup environment, using a NAS interface gives your backup application more control over virtual media management. When a backup file reaches the end of its retention period, some backup apps, including Symantec's NetBackup, can delete the file from their disk repository. When a de-duping NAS appliance sees the deletion, it can re-allocate its free space and hash index. Since you don't delete tapes, there's no way to release space on a VTL until the virtual tape is overwritten.
Of course, there is a price to pay for fitting 25 TB of data in a 1-TB bag, and not just in dollars. All the work of slicing your data into chunks and indexing it to remove the duplicates does slow things down more than just a little. A midrange VTL like an Overland REO 9000 can back up data at 300 MBps or better. Diligent has been able to achieve 200-MBps backup rates on its ProtecTier in third-party benchmarks, but that required a quad Opteron server front-ending an array of more than 100 disk drives.
Other vendors address the problem by de-duping the data as a separate process that runs after the backup. On a system running FalconStor's VTL software, data is written from the backup app to a compressed but not de-duped virtual tape file. Then a background process chunks the data, removes the duplicates and creates a virtual virtual tape that is an index of which de-duped data blocks were on the original virtual tape. Once the data from a virtual tape is de-duped, the space it occupied is returned to the available space pool. Sepaton's DeltaStor and ExaGrid also perform their de-duping as a post-backup process.
Although post-processing can boost backup speeds, it has its own costs. A system that does post-process de-duping must have enough disk space to hold a full set of standard backups in addition to its de-duped data. If you're looking to keep to a weekly full/daily incremental backup schedule, you may need a couple of times more disk space on a system that de-dupes in the background to hold those full backups until it can digest them.
Just because the de-duping is running in the background, don't ignore de-duping performance. If your VTL hasn't finished digesting the weekend's backups by the time you start backing up your servers again on Monday night, you may not be happy with the results. Disk space may not be available or the de-duping process may slow down your backups.
Bandwidth Conservation
Saving disk space on a backup appliance isn't the only application of subfile de-duping technology. A new generation of backup applications, including Asigra's Televaulting, EMC's Avamar Axion and Symantec's NetBackup PureDisk, use hash-based data de-duplication to reduce the bandwidth needed to send backups across a WAN.
First, like any conventional backup application making an incremental backup, these use the usual methods like archive bits, last-modified dates and the file system change journal to ID the files that have changed since the last backup. They then slice, dice and julienne the file into smaller blocks and calculate hashes for each block.
The hashes are then compared with a local cache of the hashes of blocks that have been backed up at the local site. The hashes that don't appear in the local cache are then sent, along with file system metadata, to the central backup server, which compares them with its hash tables. The backup server sends back a list of the hashes that it hasn't seen before; the server being backed up then sends the data blocks represented by those hashes to the central server for safekeeping.
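The exchange described above can be sketched in a few lines. The class and method names are hypothetical and the "WAN" is just a method call; this shows the order of the conversation, not any vendor's wire protocol.

```python
import hashlib


def digest(block):
    return hashlib.sha1(block).hexdigest()


class CentralServer:
    def __init__(self):
        self.blocks = {}  # hash -> block data held at the central site

    def unseen(self, hashes):
        """Reply with the hashes the central site has never stored."""
        return [h for h in hashes if h not in self.blocks]

    def store(self, blocks_by_hash):
        self.blocks.update(blocks_by_hash)


class BranchClient:
    def __init__(self, server):
        self.server = server
        self.local_cache = set()  # hashes this site has already backed up

    def backup(self, blocks):
        """Back up blocks; return how many block payloads crossed the WAN."""
        by_hash = {digest(b): b for b in blocks}
        # Step 1: drop anything the local cache says is already backed up.
        new_locally = [h for h in by_hash if h not in self.local_cache]
        # Step 2: send only hashes; the server answers with the ones it lacks.
        wanted = self.server.unseen(new_locally)
        # Step 3: send the data for just those hashes.
        self.server.store({h: by_hash[h] for h in wanted})
        self.local_cache.update(by_hash)
        return len(wanted)


server = CentralServer()
office_a, office_b = BranchClient(server), BranchClient(server)
presentation = [f"slide {i}".encode() for i in range(100)]

sent_a = office_a.backup(presentation)
sent_b = office_b.backup(presentation)
print("office A sent", sent_a, "blocks; office B sent", sent_b)
```

The second office transfers only hashes, because the central site already holds every block from the first office's backup.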
These backup solutions could reach even higher data-reduction levels than the backup targets by de-duplicating not just the data from the set of servers that are backed up to a single target or even a cluster of targets but across the entire enterprise. If the CEO sends a 100-MB PowerPoint presentation to all 500 branch offices, it will be backed up from the one whose backup schedule runs first. All the others will just send hashes to the home office and be told, "We already got that, thanks."
This approach is also less susceptible to the scalability issues that affect hash-based systems. Since each remote server only caches the hashes for its local data, that hash table shouldn't outgrow available space, and since the disk I/O system at the central site is much faster than the WAN feeding the backups, even searching a huge hash index on disk is much faster than sending the data.
Although Televaulting, Avamar Axion and NetBackup PureDisk all share a similar architecture and are priced based on the size of the de-duplicated data store, there are some differences. NetBackup PureDisk uses a fixed 128-KB block size, whereas Televaulting and Avamar Axion use variable block sizes, which should result in greater de-duplication. PureDisk can be managed from NetBackup, and Symantec promises greater integration in the future, which we hope means de-duplication integrated into data center backup jobs. Asigra also markets Televaulting for service providers so small businesses that don't want to set up their own infrastructure can take advantage of de-duplication too.
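Why should variable block sizes de-dupe better? One common way to get variable blocks is to let the data itself decide where blocks end, so that an insertion near the start of a file doesn't shift every later block boundary. The sketch below assumes that technique; we don't know how Televaulting or Avamar Axion actually pick their boundaries, and the window size, divisor and CRC-32 checksum are arbitrary choices for illustration.

```python
import hashlib
import random
import zlib


def fixed_chunks(data, size=256):
    return [data[i:i + size] for i in range(0, len(data), size)]


def variable_chunks(data, window=16, divisor=64):
    """Cut wherever a checksum of the preceding `window` bytes hits a chosen value."""
    chunks, start = [], 0
    for i in range(window, len(data)):
        if zlib.crc32(data[i - window:i]) % divisor == 0:
            chunks.append(data[start:i])
            start = i
    chunks.append(data[start:])
    return chunks


def shared(old_chunks, new_chunks):
    """Count chunks of the new version that the store already holds."""
    seen = {hashlib.sha1(c).digest() for c in old_chunks}
    return sum(hashlib.sha1(c).digest() in seen for c in new_chunks)


rng = random.Random(0)
original = bytes(rng.getrandbits(8) for _ in range(20000))
edited = b"X" + original  # one byte inserted at the very front

fixed_shared = shared(fixed_chunks(original), fixed_chunks(edited))
variable_shared = shared(variable_chunks(original), variable_chunks(edited))
print("fixed blocks reused:", fixed_shared, "variable blocks reused:", variable_shared)
```

With fixed blocks, the one-byte insertion changes the contents of every block that follows it; with content-defined boundaries, only the chunk containing the change is new.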
Backup targets, including FalconStor's VTL, Quantum's DXi series and Data Domain's appliances that can replicate data after it has been de-duped, can see the same kind of bandwidth reductions for branch data center off-site backups and disaster recovery of applications that don't require real-time replication.
Data de-duplication is here to stay for at least a while. We spoke to several users who report they really do get 20-to-1 and greater data-reduction factors without making major changes to their backup processes. Small organizations can use the new-generation backup programs from Asigra, EMC and Symantec to replace their conventional backup solutions. Midsize organizations can use backup targets in the data center. Large enterprises with very high backup performance needs may have to wait for the next generation.
Don't Fear Collisions
We've heard several comments from users afraid to use hash-based de-duping because there's a possibility of a hash collision--two sets of data generating the same hash--and, therefore, data corruption. Although there's some risk of data corruption through a hash collision, it's much smaller than the risks storage admins live with every day.
De-duping setups typically use MD-5 (a 128-bit hash) or SHA-1 (a 160-bit hash). The probability of two random blocks of data generating identical MD-5 hashes is approximately 1 in 10^37. If a petabyte of data were de-duped using MD-5 with an average block size of 4 KB, the probability of two blocks having the same hash would be about 1 in 10^20 (or 1 in a hundred billion billion).
By comparison, the probability that both drives of a mirrored set of drives with an MTBF of 1 million hours will fail within 1 hour of each other is 1 in 10^12--over 1 billion times more likely than a hash collision. Data sent across Ethernet or Fibre Channel is protected by a CRC-32 checksum, which has a probability of undetected data errors of approximately 1 in 4 x 10^9 (or 1 in 4 billion).
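Readers who want to run their own numbers can use the standard "birthday" approximation, sketched below. This is one common way to estimate the chance of any collision among n random hashes, not necessarily how the figures quoted above were derived, and it yields a less favorable number for MD-5 than the one quoted. Even so, the result stays well below the mirrored-drive figure. The petabyte and 4-KB block size come from the text; the binary units are an assumption.

```python
from fractions import Fraction


def collision_probability(num_blocks, hash_bits):
    """Birthday approximation: p ~= n * (n - 1) / (2 * 2**bits), valid while p is tiny."""
    return Fraction(num_blocks * (num_blocks - 1), 2 * 2**hash_bits)


PETABYTE = 2**50                 # assumed binary petabyte
blocks = PETABYTE // (4 * 1024)  # number of 4-KB blocks in a petabyte

md5 = float(collision_probability(blocks, 128))
sha1 = float(collision_probability(blocks, 160))
mirror_failure = 1e-12           # mirrored-drive figure quoted in the text

print(f"blocks: {blocks:.3e}")
print(f"MD-5 collision estimate:  {md5:.2e}")
print(f"SHA-1 collision estimate: {sha1:.2e}")
print("both below the mirrored-drive figure:", md5 < mirror_failure and sha1 < mirror_failure)
```

Moving from a 128-bit to a 160-bit hash cuts the estimate by a factor of 2^32, roughly 4 billion.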
It's also important to remember that a hash collision, however unlikely, doesn't mean a total loss of data. If a de-duping system incorrectly identifies two data blocks as containing the same data, when they don't, the system will continue operating. When the data is restored, the one file whose data was misidentified will, however, be corrupted. All the other data would be restored correctly. We put hash collisions on our list of worries somewhere below asteroid strikes and the mega-volcano at Yellowstone erupting.
The larger risk inherent in data de-duplication is catastrophic data loss from a hardware failure. Since the data from any given backup job--and, in fact, any given large file--is broken up into blocks and spread across the whole de-duping data store, it doesn't matter how many times you backed up that server, if you lose a RAID set in the de-duping device, you'll lose lots of data. This makes enhanced data-protection features, such as battery backup cache and RAID 6, even more important for de-duping targets than for primary storage applications.
Howard Marks, an NWC contributing editor, is chief scientist at Networks Are Our Lives, a consultancy in Hoboken, N.J. Write to him at hmarks@nwc.com.