If time is money, most data admins are overdrawn. Users want the sales conference PowerPoints they made four years ago, while legal says to dispose of business documents ASAP. New regulations crop up constantly, requiring fine-tuning of data-retention policies. To make life even more hectic, if an order hits your desk asking for all electronic business documents--e-mail and IM, spreadsheets, Word and Excel files--pertaining to a lawsuit, the Federal Rules of Civil Procedure that took effect late last year give you just 120 days to comply. Forget sending the lawyers to say, "Too expensive! No tools!" The onus is on you to prove to a judge you can't pull it together.
With so little time, will you be able to recall all relevant backup tapes from the warehouse, restore them to new servers, extract the data that might be germane and have your legal eagles review everything?
We didn't think so. This, of course, has vendors with any play in data management licking their chops in anticipation of the pain new e-discovery rules will cause IT. Last year, the records-management market stood at $280 million, according to Forrester. By next year it's projected to grow nearly fivefold, to a whopping $1.3 billion.
Many of these dollars will likely go to ILM (information lifecycle management) applications. In a nutshell, ILM is IT's answer to the old adage, "a place for everything, and everything in its place." It mandates storing data in locations commensurate with its value while recognizing that the value of any given data item changes over time, and that different access methods may be appropriate for data items at different points in their lifecycles.
Hard Row To Hoe
We won't sugarcoat it. There are currently no unified ILM systems. You can get some of the way there by pulling together e-mail archiving tools and their file-management and database counterparts, and by developing a comprehensive policy that defines the business value of your data--structured, e-mail and files--so you can manage it in a way that's commensurate with its current worth. But to hit an ILM bull's-eye requires technologies that just aren't here yet, like sophisticated data-classification engines.
Can you just sit tight? Unless you have minimal data-storage needs and are in a relatively unregulated industry, probably not. Sure, disk is still cheap, but retention rules combined with thousand-fold increases in file sizes--witness the 2-KB WordPerfect letter of 10 years ago versus the 2-MB Microsoft Word file of today--have pushed all but the smallest companies to the breaking point.
It doesn't need to be that way. The journey to ILM isn't easy, but it is proving worthwhile: In our reader poll for this article, three out of four respondents with production ILM initiatives enjoyed easier management of primary systems and were spending less on high-end disks. See nwcanalytics.com in March for complete poll results and an in-depth ILM market analysis.
Finally, Movement
If you're drowning now, point solutions like e-mail archiving applications can give you a little breathing room while ensuring you can meet regulatory requirements. Over the next two to three years, a new generation of file-management systems, including classification and migration services, will emerge from vendors such as Acopia Networks, Brocade Communications Systems, NeoPath Networks, Njini, and EMC as it integrates Infoscape with Rainfinity.
Ideally, these vendors will realize the bulk of e-mail archive space is taken up by attachments that also exist in file systems and give IT a way to integrate these archives. In "Your Money or Your Data," we tested products that claim to classify unstructured file data using detailed, flexible criteria and migrate files or provide an interface to a separate data-migration engine. See page 46 for a summary; the complete product review is at nwcreports.com.
Finally, managing structured data will always be dependent not only on the database server environment, but on the application's database schemas and utilizations. As a result, point products that are application-aware will work substantially better than any integrated solution. Princeton Softech's Optim and Solix Technologies' ArchiveJinni, for example, have modules and policies for apps like PeopleSoft and Oracle financials.
Pony Up
Getting an ILM project off the ground takes a significant commitment in both time and money. From a labor perspective, ILM is, first and foremost, a policy issue. Before ILM tools can automate the process of finding data and moving it to appropriate storage, correspondingly appropriate retention policies must be set. The file-classification software we reviewed can help here, but because it's limited to keywords and metadata, it won't automatically make the contextual distinctions we hope tomorrow's offerings will.
Policy, like politics, is local, but one universal truth is that determining which data qualifies as a record means close collaboration among IT, legal and business stakeholders. We discuss policy, including e-discovery and unified message archiving, in greater depth in "Don't Get Burned," at nwc.com/showArticle.jhtml?articleID=193003540.
Getting an ILM project rolling is a financial investment as well. File-classification software could easily set an average enterprise back $50,000 to $100,000. Managing 12 TB of data with our Editor's Choice would run $80,000. E-mail archiving costs $10 to $50 per mailbox. But there are offsetting savings. You'll be using less--and less expensive--storage because you're keeping fewer copies of low-value data. Productivity gains can be realized by reducing backup and restore windows and speeding up e-mail servers and databases by removing inactive data. Then there's being able to satisfy e-discovery requests in days without recalling tapes and putting a pair of admins on tape monkey duty for a month.
Mail On The Fore
Corporate America has come closest to achieving the ILM dream through e-mail archiving products, like EMC's EmailXtender, Symantec's Enterprise Vault and Zantaz's EAS, that migrate e-mail messages out of the primary data store based on age. Messages are placed in a secondary data store, where they're semitransparently available to users, then deleted as dictated by the organization's data-retention policy.
While we now think of e-mail archiving primarily as a tool for ensuring compliance with data-retention regulations and providing the ability to index e-mail messages and allow searches across multiple mailboxes for e-discovery, the systems were originally marketed as tools to make the e-mail administrator's life easier. Because restoring even a single message to an Exchange server long required restoring the entire information store or making incredibly slow mailbox-by-mailbox backups, admins had a strong motivation to limit the size of their information stores. But imposing user mailbox quotas led to the proliferation of user .PST files, inappropriate deletion of messages--and an extremely inadvisable shift of data management to individual users, who are apt to keep amusing off-color jokes but delete communications that might qualify as business records.
Most archiving software requires an Outlook or Notes client plug-in that displays a "message migrated" icon for users and automatically retrieves messages and attachments from the archive. Users with Macs and Linux machines may not have full functionality.
Ideally, ILM vendors will integrate their e-mail and file-management tools. Because many users create documents on file servers and then send the files to co-workers as attachments, the same file exists in both the file system and mail server data store. Using a collision-resistant hashing algorithm, like SHA-2, an integrated file/e-mail ILM system could identify these redundancies and keep only a single copy, saving disk space.
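To make the hashing idea concrete, here is a minimal Python sketch. It assumes attachments have already been pulled out of the mail store as raw bytes and uses in-memory dictionaries as stand-ins for the file system and the archive. The function names are ours for illustration, not any vendor's API.

```python
import hashlib


def sha256_digest(data: bytes) -> str:
    """Return the SHA-256 (a SHA-2 family hash) hex digest of a byte string."""
    return hashlib.sha256(data).hexdigest()


def find_redundant_attachments(files: dict, attachments: dict) -> dict:
    """Map each attachment ID to a file path that holds identical content.

    `files` maps file-server paths to content and `attachments` maps
    attachment IDs to content. Both are hypothetical in-memory stand-ins.
    """
    path_by_digest = {}
    for path, content in files.items():
        # Keep the first path seen for each distinct digest.
        path_by_digest.setdefault(sha256_digest(content), path)

    redundant = {}
    for att_id, content in attachments.items():
        digest = sha256_digest(content)
        if digest in path_by_digest:
            redundant[att_id] = path_by_digest[digest]
    return redundant
```

A real system would hash large files in chunks rather than load them whole, and would have to decide which copy becomes the single retained instance.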
We expect vendors like EMC that have both file and e-mail management products to come to the plate soon with integrated suites. Right now, Scentric's Destiny provides some policy-based data extraction for Exchange and file data, but it lacks message logging and the ability to redirect messages to the archive, both capabilities found in full-featured e-mail archiving applications.
Database Dilemma
With a database migration engine and a little work, your storage admins and DBAs can get production, test, development and spare databases allocated to appropriate storage tiers. However, when it comes to migrating data as it ages and changes value, files and e-mail messages have the advantage in that they always have a time stamp in the same place.
In contrast, tables and rows in an Oracle or SQL Server database may be segregated by time or have individual time stamps, and each application will organize its data differently. Thus, classifying structured data requires a more intimate association between the database schema created by the application and the classification engine.
As a result, database ILM solutions, including EMC's DatabaseXtender, Hewlett-Packard's Reference Information Manager for Databases and Princeton's Optim, come in editions for common applications, like SAP, to simplify the process. Solix similarly has application definitions that it provides to customers of its ArchiveJinni database-archiving software.
In addition to migrating data as it ages through its lifecycle, most vendors have a module as part of their database ILM suites to generate smaller copies of working databases for development or testing purposes. These minidatabases can contain a full set of triggers and stored procedures, with a representative set of data that can be one-tenth the size of the primary database, enabling programmers to test their code without consuming several terabytes of disk space.
Oracle's ILM assistant, available for free download from the company's site, lets a DBA easily define data lifecycles and assign database tables to the lifecycle. It will then, using Oracle's table partitions, move data from a partition in one tablespace to a partition in another tablespace that's stored on a lower-cost tier. Because Oracle partitions are transparent to user applications, users will be blissfully unaware.
What To Do Now
Once you put into place a management policy and get e-mail and databases squared away, it's time for the lifecycle piece of the equation. That means building a three-legged stool: tiered storage; data classification, where through a combination of business processes and an automated classification engine you evaluate each set of data; and a migration engine that moves data to a location commensurate with its current value.
Mention tiered storage to most IT folks and they think of high-performance Fibre Channel drives for their valuable data and low-cost SATA drives for data that's determined to be of lesser value. That's a start, but storing data in a manner consistent with its business value isn't just about expensive and less-expensive storage on a dollars-per-terabyte basis. Think of storage tiers as providing different SLAs (service-level agreements) rather than just varying costs: The primary storage tier is optimized for performance, with frequent backups to reduce the RPO (recovery point objective), and it's kept small to minimize restore times.
We also must define storage tiers with security in mind--for example, servers accessible only from one side of the Chinese Wall between investment banking and brokerage operations may keep all sensitive data encrypted and have extensive access auditing and controls.
When analyzing savings derived from migrating data between storage tiers, include not just the raw cost-per-gigabyte of disk array acquisition but the fully loaded cost of storing data, including snapshots and DR (disaster recovery) replicas. Take a typical large enterprise: Critical applications are feeding data to a monolithic disk array, say, an EMC Symmetrix or Hitachi TagmaStore. The arrays are configured to take hourly split-mirror snapshots and replicate to one or more DR sites, where snapshots are taken again. These organizations may keep six or more copies of the application's data on its most expensive Tier 1 storage.
If you could identify data that's reached the point in its lifecycle where it's essentially static, and migrate it to a storage environment where only two copies are kept online--one at the primary data center and one at the DR site--you could save a huge amount of disk space. Take advantage of the fact that older, less frequently accessed data can be placed on slower, less expensive storage arrays using RAID 6 and high-capacity SATA drives instead of mirrored Fibre Channel drives, and the savings can be even more significant.
Also bear in mind that as data moves through its lifecycle, the ratio of reads to writes increases substantially, so RAID 5 or RAID 6, with its greater storage capacity and lower write performance compared with mirrored arrays, becomes more attractive.
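The copy-count arithmetic is easy to sketch in Python. The per-terabyte prices in the usage note are hypothetical placeholders, not quotes from any vendor. Only the six-copy and two-copy figures come from the scenario above.

```python
def loaded_cost(data_tb, copies, cost_per_tb):
    """Fully loaded cost: every online copy consumes capacity at the tier's price."""
    return data_tb * copies * cost_per_tb


def migration_savings(static_tb, tier1_copies, tier1_cost_per_tb,
                      archive_copies, archive_cost_per_tb):
    """Savings from moving static data off Tier 1 to a cheaper tier with fewer copies."""
    before = loaded_cost(static_tb, tier1_copies, tier1_cost_per_tb)
    after = loaded_cost(static_tb, archive_copies, archive_cost_per_tb)
    return before - after
```

With assumed prices of $30,000 per TB for Tier 1 and $5,000 per TB for the archive tier, `migration_savings(5, 6, 30000, 2, 5000)` puts the savings on 5 TB of static data at $850,000. Plug in your own numbers before taking that to the CFO.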
But tiered storage alone does not ILM make. Even the most intelligent storage array just sees your data as bits stored in blocks. The most you can manage at block level is the kind of automated data migration that 3Par and Compellent do--moving data blocks from high-cost, high-performance drives to lower-cost drives based on block-access frequency. Although this can have some impact on the raw cost of storing your data, it doesn't have any real effect on backup and restore times because these functions are performed on a volume, file or database level.
What's It Worth To You?
The second half of the equation is knowing the business value of any piece of data. The majority of information reaches the end of its useful life after a finite period. Deleting data as soon as possible eliminates the possibility of exposure and minimizes expensive searches.
However, some files--archives of advertising materials and annual reports, for example--have value for an extended period of time beyond normal business processes. Data in a permanent archive need not--and often should not--be easily accessible by users. Leave a metadata bread-crumb trail so an archivist can determine what needs to be retrieved from offline storage.
Determining retention periods is relatively easy for structured data. Your DBAs should know what each database is for and how it impacts your business.
Unstructured data is much messier. For e-mail, we have sender, recipient, date and content. For other files we have several possible sources from which to derive value. The most obvious is file system metadata in your servers or NAS devices. Modern file systems, from Linux's ext3 to Network Appliance's WAFL, store file-creation, last-modified and last-accessed dates along with file attributes like "hidden," "read only" or "this file is stored offline."
Each file also has security information attached, including an access control list and, for most systems, a file owner. Conventional HSM (hierarchical storage management) solutions use the "file last accessed" date as their only indication of value, migrating files after they have not been accessed for a given period.
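The classic HSM test is simple enough to sketch in a few lines of Python. This is an illustration of the last-accessed rule only, assuming the file system actually maintains access times. It is not how any particular HSM product is implemented.

```python
import os
import time
from typing import List, Optional


def stale_files(root: str, max_idle_days: int,
                now: Optional[float] = None) -> List[str]:
    """Return paths under `root` whose last-accessed time is older than the cutoff."""
    now = time.time() if now is None else now
    cutoff = now - max_idle_days * 86400  # seconds per day
    stale = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            # st_atime is the file system's last-accessed time stamp.
            if os.stat(path).st_atime < cutoff:
                stale.append(path)
    return sorted(stale)
```

Calling `stale_files("/srv/home", 90)` (a hypothetical path) would list migration candidates under a 90-day rule. The result is only as trustworthy as the time stamps behind it.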
As we discuss in "A Fine Metamess," page 48, there are problems with using file system metadata to classify files. Even so, some data-management systems, including Arkivio's Auto-stor and Kazeon Systems' appliances, rely on file system metadata as their only information source.
Fortunately, most files have additional embedded metadata that give file-classification applications more to work with; for example, a JPEG could have an embedded date years before the file-system creation date. Unfortunately, most users don't go to the effort of entering even basic metadata, like subject, into their files.
Classification products, including Abrevity's FileData Classifier and Manager, Scentric's Destiny and EMC's Infoscape, create an index of file content and allow IT to create policies based on key words or phrases. Better than depending on users, but still a pretty crude indication of a file's value or sensitivity.
StoredIQ's ICM 5000 appliance takes it a big step further by recognizing text strings in unstructured data as meaningful object types, like names, places, dates and Social Security numbers, enabling policies like, "All files with Social Security numbers should be stored in the encrypted data store."
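A rough Python sketch of that kind of policy follows. The pattern is a deliberately simplified stand-in for real object recognition, and the store names are hypothetical. We make no claim about how StoredIQ implements it.

```python
import re

# Simplified pattern: three digits, two digits, four digits, dash-separated.
# It will also match other numbers that happen to share this shape.
SSN_PATTERN = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")


def choose_store(text: str) -> str:
    """Return the (hypothetical) target store for a file's extracted text."""
    return "encrypted" if SSN_PATTERN.search(text) else "standard"
```

A production engine would add validation and context checks to cut false positives, which is where the "meaningful object type" part earns its keep.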
Eventually, we'll see a classification engine smart enough to recognize that a word-processing document with a date at the top, followed by a name, followed within six lines by another name, and the name at the bottom matching the name at the top, is a business letter. It could then do a database search to see if the intended recipient is a customer; if so, the engine would know this is a business letter subject to SEC Rule 17a-4 or other regulatory retention requirements. StoredIQ comes closest now, but no one has the LDAP linkup piece.
Ideally, your ILM classification engine would also know how frequently a file is accessed. There's a big difference between knowing that a file was last accessed on Friday and knowing it was accessed 104 times over the past 30 days, the last time on Friday.
Unfortunately, again, we're not there yet on common NAS and file server systems. A classification vendor could create a file system filter or similar agent that used a NetApp filer or EMC Celerra's antivirus scanning API to keep track of file-access frequencies, but this would require an agent be installed on every managed server.
As a workaround, this data also could be collected from an in-band NAS virtualization appliance like those from NeoPath Networks or Attune Systems. Because they see all file access requests, they can track access frequency and use that data as a trigger to migrate data to another location. Njini's IAM sits in-band looking at the CIFS data stream to collect stats, including who the file creator is, to classify files and place them in appropriate locations.
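However the access events get captured, turning them into a frequency figure is straightforward. This Python sketch assumes a hypothetical log of (path, time stamp) pairs. The log format is our assumption, not something any of these appliances is documented to emit.

```python
from collections import Counter
from datetime import datetime, timedelta


def access_counts(events, now: datetime, window_days: int = 30) -> Counter:
    """Count accesses per path within the trailing window.

    `events` is an iterable of (path, datetime) pairs from a hypothetical
    access log. Paths with no accesses in the window report a count of 0.
    """
    cutoff = now - timedelta(days=window_days)
    return Counter(path for path, when in events if cutoff <= when <= now)
```

A migration trigger could then combine the count with the last-accessed date, so a file touched 104 times this month is treated differently from one opened once on Friday.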
Go South, Young Data
Finally, data's storage location must be in line with its value. And, we must decide whether the data should still be available from its original location: When we create a data-migration policy that says, "Move all Word documents not accessed in the past 90 days from users' home directories to the intermediate archive file share," how will you manage user access?
The simplest case would be to just move the files, deleting them from the home directory. While this is easy for IT--and for ILM vendors--it will seriously annoy users. We must move data yet still allow owners to access it from the original location. An in-band NAS virtualization appliance such as those from NeoPath and Acopia Networks could redirect user requests for migrated files to the new location and be truly transparent, even showing the actual file size in the user's directory.
We also could leave a pointer file in the original location that causes the user's computer to load the file from the new server. But links and pointers may be treated differently by different OSs. Even if you have all Windows workstations, users who open migrated files through a link will save their changes to the migrated location, which would make that location less than static and interfere with archiving files to preserve state and version.
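The pointer-file approach can be sketched in Python using a symbolic link as the pointer. This assumes a platform and account that allow symlink creation, and it flattens files into a single archive directory for brevity. Real products use their own stub or shortcut formats.

```python
import os
import shutil


def migrate_with_link(src: str, archive_dir: str) -> str:
    """Move `src` into `archive_dir` and leave a symbolic link at the old path."""
    os.makedirs(archive_dir, exist_ok=True)
    dest = os.path.join(archive_dir, os.path.basename(src))
    if os.path.exists(dest):
        # Refuse to overwrite a file that was archived earlier.
        raise FileExistsError(dest)
    shutil.move(src, dest)
    # The link stores the absolute path of the migrated copy.
    os.symlink(os.path.abspath(dest), src)
    return dest
```

Note that the sketch has the same weakness described above: anything a user saves through the link lands in the archive location.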
The migration engine should also integrate into other pieces of the storage-management toolkit, updating the enterprise search index when it moves files and, ideally, integrating with the backup application to update its catalogs with the new file locations.
Clearly, this is a lot to pull together. But the alternative is data anarchy and seeing storage costs continue to be the beast that ate the budget. Having a policy in place is crucial to show regulators that your company is operating in good faith. And even small steps, like implementing e-mail archiving and planning future storage hardware purchases with tiers in mind, will put you ahead of the game.
Not Their First Bite at the Apple
EMC, CommVault and others tried to address data lifecycle issues in the late 1980s and early '90s with the HSM (hierarchical storage management) technology used in the mainframe world. Several pitched three-tier HSM for Windows and Novell NetWare file servers that migrated files from standard hard drives to an optical disk jukebox, and from the jukebox to a tape library, based on last-modified and last-accessed date attributes. Migrated files were replaced with stubs and recalled from near-line storage when a user or application accessed them.
HSM seemed like such a good idea that Microsoft even cooked it into Windows 2000, calling it Remote Storage.
But while HSM was relatively successful in the batch-processing mainframe environment, it went over like a lead balloon in the more interactive distributed systems arena. There's no shortage of reasons HSM didn't take off--many issues remain today as sticking points for ILM.
The biggest nail in HSM's coffin? Hard-drive capacity increased and costs decreased so quickly that spooling files, even onto a tape library, didn't save enough money to make the exercise worthwhile. Organizations also discovered that age alone wasn't enough information on which to classify data. They'd set up an 80-day migration policy, then have users complaining they couldn't open the end-of-quarter spreadsheet for last quarter because they weren't patient enough to wait for the system to restore the file from tape. Still an issue.
The stub file and retrieval mechanism is also broken. If an employee tries to use Windows search, or, heaven forbid, Google Desktop, to find the letter he sent to a popular customer by searching for the customer's name in document files, HSM systems recall all the files, creating a significant load on the server, or cause the search to fail by not recalling files in a timely manner, ruining the user's experience and productivity.
Still, some HSM offerings, like CommVault's DataMigrator, EMC's Disk Extender and Symantec's NetBackup Storage Migrator, are available. They're frequently used as migration engines by data-management systems that have their own data-classification processes. CommVault is also adding data-classification options to migrate files based on content as well as age.
A Fine Metamess
Why is metadata so inadequate for defining the value of files?
First and foremost, metadata cannot specify data that may sit idle but then be important again. Think of an expense report: It may decline in value after a check is cut, but the accounting department is going to need month-, quarter- and year-end spreadsheets for reference when closing the next corresponding period.
Even worse, much of the file-system metadata in corporate America is just wrong. Most organizations have now had file servers of one type or another for about 20 years, during which time administrators have migrated data as servers have been upgraded or reorganized. Unfortunately, we've sometimes just used Windows Explorer to drag and drop folders from one server to another, updating the last-accessed date and making the administrator account the owner of all those files. Go on, admit it.
In addition, antivirus, anti-spyware and backup applications may fail to properly maintain the last-accessed-date attribute. Some Citrix and terminal server gurus even recommend disabling NTFS last-accessed time stamping to improve performance. So, before you blame users for shoddy metadata, look inward.
Review Scenario
We asked for products capable of classifying a set of unstructured data using flexible criteria, including age, file name and frequency of access, and of either migrating files themselves or providing an interface to a data-migration engine.
PARTICIPATING VENDORS
Abrevity, Arkivio, Scentric
TESTING SCENARIO
We classified three datasets from production file servers with 25,000, 200,000 and 1 million files, respectively, then attempted to identify data to be migrated to alternative locations to support an ILM (information lifecycle management) initiative with tiered storage. To grade pricing, we asked for quotes based on 2 million file/ 1-TB and 25 million file/12-TB scenarios.
SCORING CRITERIA
Types of data and classifications: 25%
Data movement and other features: 20%
Price: 20%
Management/ease of use: 15%
Reporting: 10%
Scalability/performance: 10%
RESULTS
Scentric's Destiny is our Editor's Choice. Destiny is strong on ease of use, data-migration capabilities and types of classifications supported. However, its price may put it out of reach--$50,000 to get in the game. Arkivio's Auto-stor posted consistently decent scores across all categories and led the pack in reporting, an important consideration for products that affect groups outside IT, such as legal. CAS (content-addressable storage) support is a plus, but we wanted the ability to manage data based on file name and content, not just file system metadata. Abrevity FileData Classifier was the lowest-priced entry, but it lagged behind rivals in most categories. Moreover, the fact that it runs as an application means you have to run Classifier from a workstation that's logged into your network with full data access, a potential security nightmare.
Find our complete product evaluation and report card at nwcreports.com. Go to nwcanalytics.com for our original in-depth research and analysis of the ILM market.
To test file-classification products we used the production file shares from an organization with more than 200 users. We created a small subset of the data, occupying 6 GB with approximately 10,000 files, and built a larger set of approximately 150,000 files totaling 100 GB. We noted how long each application took to create its index for each set.
We set up a Windows 2003 server for each product under test to use as its data source, then assigned each application a volume on a Dell PowerVault 745 NAS as its migration target.
Each application was installed in turn on an IBM X345 server (2- x 2.4-GHz Xeon, 2 GB of memory) and asked to migrate data according to the following policies, using the larger dataset of 150,000 files.
1. Move all .PPT files with last-accessed dates before 1/1/2004 (372 files, 450 MB).
2. Copy all .PST files (four files, 100 MB).
3. Move all .XLS files with creation dates before 1/1/2004 semi-transparently using links (6,324 files, 600 MB).
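For readers who think in code, the three test policies can be expressed as simple rules. This Python sketch is our own illustrative encoding, with hypothetical field and action names. None of the reviewed products uses this format.

```python
from dataclasses import dataclass
from datetime import date
from typing import List, Optional


@dataclass(frozen=True)
class Policy:
    action: str                        # "move", "copy" or "move_with_link"
    extension: str                     # matched case-insensitively
    date_field: Optional[str] = None   # "accessed" or "created"
    before: Optional[date] = None      # file's date must be earlier than this


POLICIES = [
    Policy("move", ".ppt", "accessed", date(2004, 1, 1)),
    Policy("copy", ".pst"),
    Policy("move_with_link", ".xls", "created", date(2004, 1, 1)),
]


def actions_for(name: str, accessed: date, created: date) -> List[str]:
    """Return the actions whose policy matches the given file attributes."""
    dates = {"accessed": accessed, "created": created}
    result = []
    for policy in POLICIES:
        if not name.lower().endswith(policy.extension):
            continue
        if policy.date_field is not None and dates[policy.date_field] >= policy.before:
            continue
        result.append(policy.action)
    return result
```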
Finally, we reviewed the results and generated reports describing the changes. We ensured that the files we wanted moved were where they were supposed to be, that the links worked, and that no files other than the ones we wanted to move were moved. Performance didn't vary enough to be a point of differentiation.
All Network Computing product reviews are conducted by current or former IT professionals in our Real-World Labs® or partner labs, according to our own test criteria. Vendor involvement is limited to assistance in configuration and troubleshooting. Network Computing schedules reviews based solely on our editorial judgment of reader needs, and we conduct tests and publish results without vendor influence.
It's no secret Corporate America is drowning in data, but what makes the problem especially intractable is that much of it is user-managed, tucked willy-nilly into file systems on servers and NAS devices distributed across the enterprise. Unlike its view into more tightly managed databases and e-mail stores, IT typically has little visibility into this unstructured data. In a typical organization you'll find documents from clients and vendors that could be important evidence in a dispute, intermixed with 15-year-old expense reports from employees who've left the company. Clearly, tools are needed to help classify this text-based data and sort it into appropriate storage silos.
What's that you say? You also have rich media, like audio, video and image files, that defy simple indexing, yet still need to be managed? Unfortunately, this is currently possible only if your applications were designed so file and folder names carry data. Otherwise, you'll need to build a metadata database manually; we found no classification or ILM product with image-file indexing support.
When we started discussing file classification and management with users, some wondered whether host-centric SRM (storage resource management) suites like those we reviewed in our June 2, 2006 issue couldn't fill the bill (see "A Cure for the Terrible Terabytes"). And indeed, file classification and SRM solutions do have a lot in common: Both scan file systems, building indices of file data. EMC even used the file crawler from its Visual SRM as a key component of its Infoscape classification tool. But SRM as a discipline is much more concerned with reporting and analytics than with the dirty work of touching and moving files. We were more interested in tools that classified files and let us move them around for ILM-type management.
While enterprise indexing and search vendors tried hard to convince us that their products are ILM data-classification tools, we limited the scope of this review to products that could classify unstructured file data using more detailed and/or flexible criteria than an HSM solution's simple age criteria, and migrate files themselves or provide an interface to another data-migration engine. If it don't move data, it ain't ILM.
Abrevity, Arkivio and Scentric answered the call, sending us FileData Classifier, Auto-stor and Destiny, respectively. EMC, Kazeon and StoredIQ declined.
Beyond The Basics
We expected all the products to be able to classify data based on file name, extension, location and file system metadata items, such as last accessed date and owner. We hoped they would also be able to use file content in some way. Abrevity's FileData Classifier and Scentric Destiny have this insight, but Arkivio's Auto-stor cannot use file content or names. In addition to file data, Destiny can even manage data in Exchange and, to a limited extent, SQL Server databases.
As for management, we found Abrevity's FileData Classifier unwieldy; even basic tasks, like moving files based on their content, require creating and scheduling several jobs. This isn't just a problem faced at the initial setup--it's the core of what the product is supposed to do. If scheduling is a pain in the ass, it will be an ongoing ache. Again, Scentric's Destiny is the standout thanks to its ability to delegate different levels of management to different users, and a really intuitive rule generator that reminds us of the one for processing e-mail in Microsoft's Outlook.
Data On The Move
All three programs can copy and move files on command or on a schedule; they also can migrate files semi-transparently, leaving a stub or link in the original location. Destiny adds compress, delete and run-script options. We also liked that its command line let us design all sorts of creative migration schemes.
Only Arkivio's Auto-stor supports end times for policies, an oversight by Abrevity and Scentric. We could tell all the products, "Move data at 4:00 a.m." Processing would then begin at 4:00 a.m. and run until finished ... which could be noon, with users seeing files disappear right in front of their eyes.
With Auto-stor, we could say, "Move data from 4:00 a.m. until 8:00 a.m." so data movement wouldn't interfere with business. If needed, it will pick up where it left off the next night.
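The end-time behavior is easy to picture as a loop that checks the clock before each move and hands back whatever is left for the next night. This Python sketch is our own illustration, not Arkivio's implementation. It assumes the window ends the same day it starts and does not cross midnight.

```python
from datetime import datetime, time


def run_window(pending, move, clock=datetime.now, end=time(8, 0)):
    """Move files from `pending` until the clock reaches `end`.

    `move` is a callable that migrates one path. `clock` returns the current
    datetime. The unprocessed remainder is returned so the next run can
    pick up where this one left off.
    """
    remaining = list(pending)
    while remaining and clock().time() < end:
        move(remaining.pop(0))
    return remaining
```

Because the clock is checked between files, a single very large file could still run past the end time. A stricter implementation would estimate transfer time before starting each move.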
Scentric's Destiny and Arkivio's Auto-stor both provide for multiple data movers on separate servers; this should let either product scale to manage prodigious amounts of data as a single pool. Both are substantially faster than Abrevity's FileData Classifier as well.
Managing more than 3 TB of data will require an upgrade to Abrevity's FileData Manager, which can oversee multiple FileData Classifier stations and requires dividing data up into 3-TB or smaller slices. This could be a problem for large sites but, on the other hand, a very decentralized organization may be well served by the Abrevity architecture because one FileData Manager can control multiple slices.
Pain Points
Unfortunately, reporting was weak across the board. We found a limited number of "SRM-lite" reports about discovered files, but only Auto-stor can produce reports showing the results of a policy. This is a serious shortcoming for compliance-heavy organizations, which would have to purchase another product for detailed reporting.
We rated pricing on the cost to manage both 1 TB and 12 TB data stores. At 1 TB, Abrevity quoted us $12,500, in line with Arkivio's $14,000. Unfortunately, at a whopping $50,000 to manage up to 3 TB of data, Scentric Destiny's cost will likely keep it from being part of many organizations' futures. With a 12-TB store, Abrevity's system would run $34,000, compared with $58,000 for Arkivio and $80,000 for Scentric.
Yes, you would have to save a ton of storage to offset the cost, but these products are also helpful in getting data out of the primary store to reduce RTO and backup windows, and for e-discovery, as we discuss in "ILM: Off The Mark."
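To put the quotes above on a common footing, the short Python sketch below divides each quoted price by the capacity it covers. The figures come straight from the quotes we received; the dictionary layout and the cost_per_tb helper are our own naming.

```python
# Quoted prices in US dollars, keyed by managed capacity in TB.
QUOTES = {
    "Abrevity": {1: 12_500, 12: 34_000},
    "Arkivio": {1: 14_000, 12: 58_000},
    # Scentric's entry price covers up to 3 TB, so 1 TB still costs $50,000.
    "Scentric": {1: 50_000, 12: 80_000},
}


def cost_per_tb(vendor, terabytes):
    """Return the quoted price divided by the capacity being managed."""
    return QUOTES[vendor][terabytes] / terabytes


for vendor in QUOTES:
    print(f"{vendor:9s} 1 TB: ${cost_per_tb(vendor, 1):>9,.0f}/TB   "
          f"12 TB: ${cost_per_tb(vendor, 12):>9,.0f}/TB")
```

At 12 TB the quotes work out to roughly $2,800 per terabyte for Abrevity, $4,800 for Arkivio and $6,700 for Scentric.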
Scentric Destiny 1.0
Scentric's Destiny is both the easiest to use and most powerful of the products we tested, and it can scale to handle all but the largest environments. It can classify data based on the full range of file system information, like last accessed date and file name, and by whether key words and/or phrases appear in the file's contents. Destiny's core server, Windows console and indexing and file movement engine can be installed on a single server, or for larger implementations, you can install the components on separate servers and assign data resources to specific indexing and movement engines.
If we could just get in the game for under $50,000, we'd be happy.
Scentric sent us a Dell PowerEdge 1900 with Destiny preinstalled. We then set out to add hosts and file systems to manage. As we added each host, Destiny prompted us for a user ID and password for the admin charged with managing that host, so we didn't need to create a single account that would have access to all our managed data. We could have done so, but we appreciated not being required to, as we were with the other products, because a single all-access account could spark a turf war.
Next we created a data group for our source data, assigned as Tier 1, and a data group to be the destination of our policies, assigned to Tier 2. We could then apply a predefined policy, such as "migrate data over 45 days old," or create a new set of rules as a new policy and apply it to our data.
The rule creation screens use a format much like the rule engine in Microsoft's Outlook. To migrate old expense reports, for example, we chose a statement, "Where the file contains TEXT" and an action, "Copy file to a location type." Then, we simply clicked on the text and location-type placeholders and entered values of "Expense" and "Tier 2." We then selected the source data group we had earlier created and assigned our new policy to it, scheduling it to run at 3:00 a.m. on Sundays.
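The statement-plus-action shape of that rule can be sketched in Python. This is only an illustration of the idea, not Destiny's rule engine or API; the class name ContainsTextRule and its fields are hypothetical, and the raw byte search stands in for real document parsing.

```python
import shutil
from dataclasses import dataclass
from pathlib import Path


@dataclass
class ContainsTextRule:
    """'Where the file contains TEXT' -> 'Copy file to a location'."""
    text: str          # the TEXT placeholder, e.g. "Expense"
    destination: Path  # stands in for the "Tier 2" location type

    def matches(self, path: Path) -> bool:
        # Naive case-insensitive byte search; a real product would parse
        # formats such as .XLS rather than scan raw bytes.
        return self.text.lower().encode() in path.read_bytes().lower()

    def apply(self, source_dir: Path) -> list:
        """Copy every matching file under source_dir; return the copies."""
        self.destination.mkdir(parents=True, exist_ok=True)
        copied = []
        for path in sorted(source_dir.rglob("*")):
            if path.is_file() and self.matches(path):
                copied.append(Path(shutil.copy2(path, self.destination)))
        return copied
```

The sketch assumes the destination lies outside the source tree. Scheduling, such as our 3:00 a.m. Sunday run, would sit in a separate layer that calls apply.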
In addition to CIFS and NFS, Destiny can manage Exchange server and SQL Server data, though these servers must be in the same Active Directory domain as the Destiny core server. To test this feature, we created policies that run daily to migrate all messages from a specific sender out of an Exchange mailbox into a .PST file and to send newsletter messages more than 30 days old to the OLD_NEWS folder.
SQL Server management is, sadly, limited to running scripts. If you're looking for database ILM, look elsewhere; see "ILM: Off The Mark" for more on managing structured data.
Scentric also provides a CLI in Destiny that got our geek flag flying. It let us script common functions, like adding the home directories of employees who are leaving the company to a data group that gets migrated to archives. Feeding a script the weekly list of the dearly departed you get from HR is a lot easier than ctrl-clicking 25 folders in the HOME share.
You can also delegate different users in the Destiny system to have different roles. Some users could apply existing policies to data items, others might create rules and add data items, while others simply have the ability to view the configuration.
Version 2.0, which was released as we finished our testing, can also send data to dedicated archive repositories, like EMC's Centera and Powervault's DVD libraries.
Destiny is the class of the field, doing just about all we can ask of products in this emerging market. Hopefully Scentric will add more sophisticated indexing and query functions to future versions to help justify the sky-high price.
Arkivio Auto-stor
Arkivio's Auto-stor manages files on CIFS and NFS file shares using the usual file system metadata fields as criteria. Auto-stor can migrate files to WORM storage for compliance purposes and supports the EMC Centera content addressable storage system, making it a good choice for those looking for a classification engine to populate a long-term archive.
We installed the Auto-stor central server on a Windows 2003 server and connected to the Web site installed in the server's IIS Web server. The Auto-stor console is implemented as a series of ASP pages and ActiveX controls, so we used Internet Explorer to add the servers to be managed and to schedule data collections to run nightly at 1:00 a.m. Nice feature.
The next morning, we tried to set up a job to migrate all old expense report files from users' home directories to secondary storage. We created a file group that defined the folders we wanted to move data from and selected file attributes--in this case .XLS extension and a last-accessed date before Jan. 1, 2006. We then selected a volume group that included the home shares on each of the servers we wanted the policy to get files from.
Once the preliminaries were out of the way, we set out to create a migration policy, leaving links so users could still find their files even if they were moved to low-end storage. Auto-stor can create tag files that will recall a migrated file when accessed on EMC Celerra and Network Appliance filers through those vendors' APIs, in addition to links; that option wasn't available in our Windows-centric test setup. In a nutshell, all three products provide links only for Windows systems. If a user edits a migrated file, the change is made on the migrated store, which then must be backed up.
To create the policy, we selected our volume list of home directories, the file group of selection criteria and a destination folder on one of our managed servers and told Auto-stor to run the policy any time the source volume got over 70 percent full, until it was down to 50 percent full.
Policies can include multiple file groups at different priorities; Auto-stor will move high-priority files first and leave lower-priority files unprocessed once utilization has dropped below the threshold.
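A threshold policy of this kind amounts to a high-water mark that triggers the run and a low-water mark that ends it. The Python sketch below shows one way that selection could work under our reading of the behavior; it is not Arkivio's algorithm, and the function name and argument layout are assumptions.

```python
def select_for_migration(capacity, used, file_groups, high=0.70, low=0.50):
    """Pick files to move once a volume crosses the high-water mark.

    capacity, used -- volume size and space in use, in the same unit
    file_groups    -- lists of (path, size) tuples, highest priority first
    Returns the paths to move, in order, stopping as soon as projected
    utilization is at or below the low-water mark.
    """
    if used / capacity <= high:
        return []  # below the trigger point, so the policy does not run
    selected = []
    for group in file_groups:
        for path, size in group:
            if used / capacity <= low:
                return selected  # target reached; skip remaining files
            selected.append(path)
            used -= size
    return selected
```

With a 100-GB volume holding 80 GB, a high-priority group of 20-GB and 15-GB files is enough to reach the target, so nothing in a lower-priority group is touched.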
Then it hit us: We didn't get a chance to specify anything about the file's name or its contents. Where other offerings let us migrate only files that had "expense" in their name or contained keywords, for example, Auto-stor is limited to just file types and attributes. How disappointing.
On the plus side, Auto-stor has far and away the best reporting of any of the classification products we tested, using the Crystal Reports engine to create, and optionally e-mail, reports on policy results as well as the SRM-like file type and disk capacity reports the other products provided.
To accommodate larger environments, Auto-stor supports remote server assistants, which run on additional servers to perform data collection and data movement. Each managed server is controlled by either the central server or a remote server assistant.
We wish Arkivio would take the next step and let Auto-stor manage data based on file name and content as well as file system metadata. It would then be a significant player in the classification market. Using metadata alone is just too crude a tool to solve today's file management problems.
Abrevity FileData Classifier 2.5
Abrevity's FileData Classifier is definitely the bantamweight in our file classification shootout. Rather than using a series of background services running on one or more servers, Abrevity has chosen to implement FileData Classifier as a Windows application. FileData Classifier is also the only program to eschew the use of a relational database for its indexes, instead using a proprietary file format. This constrains its ability to do reports and limits the function to 3-TB slices.
Moreover, while FileData Classifier's $7,500 entry cost makes it attractive to the budget conscious, add-on and expansion costs add up quickly. If you wanted to find files that contained Social Security numbers and use Active Directory attributes, those two add-ons would almost triple the price. Add-ons include duplicate elimination, embedded metadata, pattern matching, NFS and Active Directory integration.
The product's architecture means you can get FileData Classifier up and running quickly, without having to provision a server. A consultant or central IT staffer could even install it on a laptop and carry it to multiple locations. On the downside, running as an application means that you have to run Classifier from a workstation that's logged into your network as a user, with full access to all the data you're managing, even when running data classification and movement events overnight using FileData Classifier's scheduler.
FileData Classifier is also limited to managing 3 TB of data per Classifier, which isn't much for today's enterprises. Abrevity's FileData Manager can push queries and policies down to multiple machines, so you can manage multiple slices of data for larger environments and delegate administration to different users by slice. Large enterprises will probably prefer an approach like Scentric's that spreads functions across multiple servers but lets you manage a single integrated view.
When we told FileData Classifier which file systems and shares to manage, it built an index based on file system metadata. In addition to file names and folders, the index tracks the words, separated by spaces, dashes, dots and underscores, in the file and folder names.
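Splitting names on those four separators is simple to illustrate in Python. This sketch is ours, not Abrevity's indexing code; lowercasing the words and treating both slash styles as folder separators are assumptions we made for the example.

```python
import re

# Separators named above: spaces, dashes, dots and underscores.
_SEPARATORS = re.compile(r"[ \-._]+")


def name_words(path):
    """Split each file and folder name in a path into lowercase words."""
    words = []
    for component in re.split(r"[\\/]+", path):
        words.extend(w.lower() for w in _SEPARATORS.split(component) if w)
    return words


print(name_words("HOME/jsmith/Expense_Report-2005.Q4 final.xls"))
# ['home', 'jsmith', 'expense', 'report', '2005', 'q4', 'final', 'xls']
```

An index built this way finds "expense" in Expense_Report but not in a run-together name like ExpenseReport, which is the usual trade-off of separator-based word splitting.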
To move all old expense reports to a lower tier of storage we first created a query to find all files of the type .XLS with the word "expense" in their name and a last-modified date before Jan. 1, 2006. After viewing its results to verify it didn't include false positives, we scheduled a file move event to migrate these files every Sunday at 1:00 a.m. and leave links in their original locations.
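The same three criteria (file type, a word in the name and a last-modified cutoff) can be expressed as a plain file system walk in Python. This is a sketch of the query logic only, not how FileData Classifier queries its index; the function name is hypothetical.

```python
import re
from datetime import datetime
from pathlib import Path

CUTOFF = datetime(2006, 1, 1).timestamp()  # Jan. 1, 2006, local time


def find_old_expense_reports(root, cutoff=CUTOFF):
    """Return .xls files with the word 'expense' in their name that were
    last modified before the cutoff."""
    hits = []
    for path in Path(root).rglob("*"):
        if not path.is_file() or path.suffix.lower() != ".xls":
            continue
        # Same word separators as the index: space, dash, dot, underscore.
        words = [w.lower() for w in re.split(r"[ \-._]+", path.stem)]
        if "expense" in words and path.stat().st_mtime < cutoff:
            hits.append(path)
    return sorted(hits)
```

Reviewing the returned list before acting on it mirrors the false-positive check we ran before scheduling the move.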
The word-in-filename and folder feature makes it easy to locate files but, unfortunately, there's no easy way to limit the query to specific file systems or folders, other than to create separate data slices. We'd rather see a folder tree that allows us to select or deselect folders, similar to what Destiny or Auto-stor offers.
Note that FileData Classifier doesn't create an index from the content of all your files as it scans file systems into its database. To manage data by its content, we had to run an extraction event that scanned those files selected by the metadata query to find the files that included the key words we were looking for. Moving all the files with a keyword--or if you spring for the optional security discovery module, SSN or credit card number--means scheduling a scan event to include new files in the database, then an extraction, and finally a file move event.
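The scan, extract and move sequence can be pictured as three chained stages, with the content pass reading only the files the metadata pass selected. The Python sketch below is our illustration, not Abrevity's code; the function names are hypothetical and the Social Security number pattern is a deliberately simplistic assumption.

```python
import re
import shutil
from pathlib import Path

# Simplistic SSN shape (ddd-dd-dddd); real detection needs more validation.
SSN_PATTERN = re.compile(rb"\b\d{3}-\d{2}-\d{4}\b")


def scan(root, suffix):
    """Stage 1: metadata pass that selects files by extension only."""
    return sorted(p for p in Path(root).rglob("*")
                  if p.is_file() and p.suffix.lower() == suffix)


def extract(candidates, pattern=SSN_PATTERN):
    """Stage 2: read only the pre-selected files and keep pattern matches."""
    return [p for p in candidates if pattern.search(p.read_bytes())]


def move(files, destination):
    """Stage 3: move the matching files into the destination folder."""
    dest = Path(destination)
    dest.mkdir(parents=True, exist_ok=True)
    return [Path(shutil.move(str(p), str(dest / p.name))) for p in files]
```

Chained together as move(extract(scan(root, ".txt")), archive), the stages run in the order described above. In the product each stage is a separately scheduled event, which is where the management overhead comes from.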
While FileData Classifier starts off significantly less expensive than rivals, the security discovery module, ILM module for de-duplicating files, and enterprise features module for AD integration and NFS scanning each cost an additional $4,995 per 3 TB of managed data.
All in all, we found ourselves spending more time fitting what we wanted to do into FileData Classifier's way of working than we'd like. It could be a useful tool for smaller companies looking to re-organize and consolidate file systems, but didn't really fit the way we wanted to work to automagically migrate data between storage tiers.
Howard Marks is founder and chief scientist at Networks Are Our Lives, a network design and consulting firm in Hoboken, N.J. Write to him at hmarks@aol.com.