Data Retention: In IT We Trust

May 27, 2015 | Archiving, Backup

While discussing backups and recovery validation in the first post of this series, “The Case for Disaster Recovery Validation”, I cited “…The secondary purpose of backups is to recover data from an earlier time, according to a user-defined data retention policy…” [Wikipedia]. In this post, I review data loss and retention as they relate to backups and archives, and the difficulties inherent in specifying retention times.

The burgeoning cost of storing ever-larger amounts of data may have put an unfair burden on most IT organizations, because the business and application “users” are failing to specify the retention needs of the data they create. Now, more than ever before, IT has to balance budgets with regulatory compliance, industry standards and company constraints. Retention periods may vary from five days to fifty years, depending on the purpose of the retention strategy, the type of data and the functional use of the data: financial, health, education, research, government, etc. Further, retention policies apply to one or both systems of data retention: backup and archiving.

Data Loss Is Not Tolerable

March 31, 2015 was World Backup Day 2015, a reminder of the pain of lost data. The cautionary lesson is this: you can only afford to lose the data you can afford to lose.

Here are a number of instances where data loss is possible:

  • Confusion of objectives – you back up too little data, or you back up too much of the wrong kind
  • State of complacency – you inherit a backup and archive setup from a previous team and, content in your job, never question it
  • Multiplicity of responsibilities – you tell everyone that data is a shared responsibility so everyone believes someone else is responsible
  • Plethora of archives – you deploy a number of technologies in the data center and in the cloud, hoping that the data is on at least one of them
  • Matter of timing – you keep the data for too short a time (media is reused) or you keep the data for too long (media is no longer of use)

Note that we use the words backup and archive loosely here. Strictly speaking, backups protect active and inactive data by making a copy of production data and keeping it for a set length of time. Archives, by contrast, are inactive production data (not a copy) that are likely moved out of their original location for cost and other reasons. In times of disaster, backups are the source from which production data is recovered; archives are not copies, so they must include their own preservation methods when created. In the absence of true archiving, backup copies with long-term retention act as substitutes for archives.

All Data Is Not Equal

Email retention has received considerable attention in recent years, as it is often the target in investigations and external requests. Many vendors now offer software that can archive email in the data center or in the cloud in optimal ways (offloaded at lower costs). Companies have also developed policies for deleting email after a specific length of time, making it easier to specify retention policies for email backups and archives.

However, retention requirements for other forms of unstructured data are rarely specified as clearly. Keeping data “forever” in the backup system results in poorly performing backups over time, and retaining archives “forever” leads to spiraling costs.

The following questions can help in understanding the data retention requirements:

  • When can one comfortably delete the data?
  • What is the regulation requirement (if any)?
  • Are there penalties associated with deleting the data?
  • Are there liabilities associated with keeping the data?
  • Is the data owned by a known individual, department or application?
  • Who will request the data after its creators are gone?
  • Do applications need the older data as part of the workflow?
  • How often will one request the retrieval of the data?
  • When requested, how quickly must one perform the retrieval?
  • Is the data a business asset – artist gallery, trading algorithm, intellectual property?
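One way to make the answers to these questions actionable is to record them per data set and let a simple rule decide when deletion is allowed. The following Python sketch is only an illustration: the `RetentionPolicy` fields, the `may_delete` function and the seven-year example period are all hypothetical names and values, not part of any product or regulation.

```python
from dataclasses import dataclass
from datetime import date, timedelta
from typing import Optional


@dataclass(frozen=True)
class RetentionPolicy:
    """Answers to the retention questions for one data set (hypothetical fields)."""
    data_set: str
    owner: Optional[str]           # individual, department or application; None if unknown
    retention_days: int            # how long the data must be kept
    regulation: Optional[str]      # governing regulation, if any
    legal_hold: bool = False       # True while deleting the data would carry penalties
    max_retrieval_hours: int = 72  # how quickly a requested retrieval must complete
    business_asset: bool = False   # e.g. intellectual property that is never deleted


def may_delete(policy: RetentionPolicy, created: date, today: date) -> bool:
    """Return True once the retention period has passed and nothing blocks deletion."""
    if policy.legal_hold or policy.business_asset:
        return False
    return today >= created + timedelta(days=policy.retention_days)


# Hypothetical example: project files kept for seven years (7 * 365 days).
policy = RetentionPolicy(
    data_set="project-files",
    owner="engineering",
    retention_days=7 * 365,
    regulation=None,
)
print(may_delete(policy, created=date(2008, 1, 1), today=date(2015, 5, 27)))  # True
print(may_delete(policy, created=date(2012, 1, 1), today=date(2015, 5, 27)))  # False
```

Keeping the owner and regulation alongside the retention period means that whoever inherits the backup or archive system can still see why a data set is being kept, and who to ask before deleting it.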

The Data Life Cycle

IT needs all users of data to be truly objective about their data needs and long-term use. As data volumes move to critical mass, it will become more important that organizations understand all of their data and the stages the data is likely to pass through: generate, store, use, transfer, share, archive, and destroy. IT must also recognize which data sets are true business assets that are vital to the life of the business. Implementing the data life cycle in collaboration with business and application teams is one of the ways in which IT adds tremendous value. This can be a daunting task, and it requires a fair amount of data classification.
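As a rough illustration, the life cycle stages can be modeled as a small state machine. The stage names below come from the list above. The permitted transitions in `ALLOWED` and the `advance` helper are assumptions made for this sketch; in practice they would be agreed with the business and application teams.

```python
from enum import Enum


class Stage(Enum):
    GENERATE = "generate"
    STORE = "store"
    USE = "use"
    TRANSFER = "transfer"
    SHARE = "share"
    ARCHIVE = "archive"
    DESTROY = "destroy"


# Assumed transitions, for illustration only.
ALLOWED = {
    Stage.GENERATE: {Stage.STORE},
    Stage.STORE: {Stage.USE, Stage.ARCHIVE, Stage.DESTROY},
    Stage.USE: {Stage.STORE, Stage.TRANSFER, Stage.SHARE, Stage.ARCHIVE},
    Stage.TRANSFER: {Stage.STORE, Stage.USE},
    Stage.SHARE: {Stage.STORE, Stage.USE},
    Stage.ARCHIVE: {Stage.USE, Stage.DESTROY},
    Stage.DESTROY: set(),  # terminal: destroyed data cannot come back
}


def advance(current: Stage, target: Stage) -> Stage:
    """Move a data set to the target stage, rejecting transitions not listed in ALLOWED."""
    if target not in ALLOWED[current]:
        raise ValueError(f"cannot move from {current.value} to {target.value}")
    return target


# Walk one data set through a typical path.
stage = Stage.GENERATE
for nxt in (Stage.STORE, Stage.USE, Stage.ARCHIVE, Stage.DESTROY):
    stage = advance(stage, nxt)
print(stage.value)  # destroy
```

Writing the transitions down, even in a form this simple, forces the classification discussion: someone has to decide, for each data set, whether it may go straight from storage to destruction or must be archived first.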