As we look at the many ways to improve storage utilization, data deduplication often pops up as a potential technique. Data deduplication, sometimes referred to as “intelligent compression” or “single-instance storage”, is a method of reducing storage needs by eliminating redundant data. Deduplication is quite similar to data compression, but it looks for repeating sequences of very large chunks of data across very large comparison windows. Long sequences are compared to the history of other such sequences, and where a match is found, only one unique instance of the data sequence is actually retained on storage media. Redundant data is replaced with a pointer to that first unique copy.
For example, a typical email system might contain 300 instances of the same two-megabyte (2 MB) file attachment. If the email platform is backed up or archived, all 300 instances are saved, requiring 600 MB of storage space. With data deduplication, only one instance of the attachment is actually stored; each subsequent instance is just a reference back to the one saved copy. In this example, a 600 MB storage demand could be reduced to just 2 MB. Imagine the huge economic benefits! Of course, in a storage system, this is all hidden from users and applications, so every copy still reads back as the whole file after it has been written.
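To make the arithmetic in the email example explicit, here is a minimal Python sketch. The function name `storage_needed_mb` is made up for illustration, and the sketch ignores the small overhead of the pointers themselves.

```python
# Figures taken from the email example above.
ATTACHMENT_MB = 2
COPIES = 300


def storage_needed_mb(copies: int, size_mb: int, deduplicated: bool) -> int:
    """Storage needed for identical copies of one file (hypothetical helper).

    With deduplication only the first copy is stored; the space taken by
    the pointers to it is ignored in this sketch.
    """
    return size_mb if deduplicated else copies * size_mb


raw = storage_needed_mb(COPIES, ATTACHMENT_MB, deduplicated=False)
deduped = storage_needed_mb(COPIES, ATTACHMENT_MB, deduplicated=True)
print(f"raw: {raw} MB, deduplicated: {deduped} MB, ratio: {raw // deduped}:1")
```

Running it prints `raw: 600 MB, deduplicated: 2 MB, ratio: 300:1`, which is the 600 MB to 2 MB reduction described above.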
In actual practice, data deduplication is often used in conjunction with other forms of data reduction such as conventional compression and delta differencing. Together, these three techniques can be very effective at optimizing the utilization of storage space. Data deduplication techniques vary and are mostly vendor dependent, but they can usually operate at the file, block, and even the bit level. File deduplication eliminates duplicate files, but this is not a very efficient means of deduplication. Block and bit deduplication look within a file and save unique iterations of each block or bit. Each chunk of data is processed using a hash algorithm such as MD5 or SHA-1. This process generates a number that is, for practical purposes, unique to each piece, and that number is then stored in an index. If a file is updated, only the changed data is saved; for example, if only a few bytes of a document are changed, only the changed blocks or bytes are saved, not an entire new file. In summary, block and bit deduplication are far more efficient, but they take more processing power and use a much larger index to track the individual pieces.
How deduplication works is that the algorithm breaks the incoming data stream into uniquely identifiable data segments and then compares the segments to previously stored data. If an incoming data segment is a duplicate of what has already been stored, the segment is not stored again, but a reference to it is created. A segment is written to disk only if it is unique. There are two ways to process this logic, namely:
- Inline deduplication means the data is deduplicated before it is written to disk (inline). This is the most efficient and economical method of deduplication, as duplicate data is never written to disk, which significantly reduces the raw disk capacity needed in the system. If replication is supported as part of the inline deduplication process, inline also optimizes time-to-DR (disaster recovery) far beyond all other methods, as the system does not need to wait to absorb the entire data set and then deduplicate it before it can begin replicating to the remote site.
- Post-process deduplication technologies wait for the data to be written in full on disk before initiating the deduplication process. As this approach writes the complete data first, it requires a greater initial capacity overhead than the inline way. It also increases the lag time before deduplication is complete, and before replication completes (if a replication function is in place), since it is highly advantageous to replicate only the deduplicated (small) data.
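The segment-and-index logic and the two processing styles above can be sketched in a few lines of Python. This is only a toy under stated assumptions, not any vendor's implementation: the `DedupStore` class, its method names, the fixed 4-byte segment size, and the use of SHA-256 (in place of the MD5 or SHA-1 mentioned earlier) are all choices made for illustration, and "disk" is just memory.

```python
import hashlib

CHUNK_SIZE = 4  # bytes; hypothetical and tiny so the result is easy to follow


def split_chunks(data: bytes, size: int = CHUNK_SIZE) -> list[bytes]:
    """Break a data stream into fixed-size segments."""
    return [data[i:i + size] for i in range(0, len(data), size)]


def fingerprint(chunk: bytes) -> str:
    """Hash a segment; SHA-256 is an assumption of this sketch."""
    return hashlib.sha256(chunk).hexdigest()


class DedupStore:
    """Toy store: an index of fingerprints plus one copy of each unique segment."""

    def __init__(self) -> None:
        self.index: dict[str, bytes] = {}  # fingerprint -> the single stored copy
        self.staging: list[bytes] = []     # raw writes awaiting post-processing
        self.peak_bytes = 0                # highest capacity seen after any write

    def _used_bytes(self) -> int:
        unique = sum(len(c) for c in self.index.values())
        staged = sum(len(d) for d in self.staging)
        return unique + staged

    def _ingest(self, data: bytes) -> list[str]:
        """Store only unseen segments; return the references that rebuild the data."""
        recipe = []
        for chunk in split_chunks(data):
            key = fingerprint(chunk)
            if key not in self.index:      # unique segment: store it
                self.index[key] = chunk
            recipe.append(key)             # duplicate or not, keep a reference
        return recipe

    def write_inline(self, data: bytes) -> list[str]:
        """Inline: deduplicate before anything lands on 'disk'."""
        recipe = self._ingest(data)
        self.peak_bytes = max(self.peak_bytes, self._used_bytes())
        return recipe

    def write_raw(self, data: bytes) -> None:
        """Post-process, step 1: write the complete data first."""
        self.staging.append(data)
        self.peak_bytes = max(self.peak_bytes, self._used_bytes())

    def post_process(self) -> list[list[str]]:
        """Post-process, step 2: deduplicate what was already written."""
        recipes = [self._ingest(data) for data in self.staging]
        self.staging.clear()
        return recipes

    def read(self, recipe: list[str]) -> bytes:
        """Follow the references to return the whole original data."""
        return b"".join(self.index[key] for key in recipe)


backups = [b"AAAABBBBCCCC", b"AAAABBBBDDDD", b"AAAABBBBCCCC"]

inline = DedupStore()
inline_recipes = [inline.write_inline(b) for b in backups]

post = DedupStore()
for b in backups:
    post.write_raw(b)
post_recipes = post.post_process()

print("inline peak bytes:", inline.peak_bytes)
print("post-process peak bytes:", post.peak_bytes)
print("unique segments stored:", len(inline.index))
```

With these three small "backups", both stores end up holding the same 4 unique segments (16 bytes), but the inline store never needs more than 16 bytes while the post-process store peaks at 36 bytes because it lands everything before deduplicating. That is the capacity overhead difference described in the two bullets.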
Data deduplication offers many benefits. Reduction in the use of storage space will directly translate to savings on disk expenditures. The more efficient use of disk space also allows for more copies of data to be retained, i.e. longer disk retention periods, which provides higher availability and better recovery time objectives (RTO) for a longer time, reducing the need for tape backups. Data deduplication also reduces the amount of data that must be sent across a WAN for remote backups, replication, and disaster recovery.
So, is data deduplication safe? How is the performance going to be affected? First of all, do understand that data deduplication typically creates a hash number for each chunk of data and compares it with the existing hash numbers in its index. If that hash number is already in the index, the piece of data is considered a duplicate and does not need to be stored again. Otherwise, the new hash number is added to the index and the new data is stored. This is much faster than comparing raw data chunks. However, hash collisions are a potential problem with deduplication. In rare cases, the hash algorithm may produce the same hash number for two different chunks of data. When a hash collision occurs, the system won’t store the new data because it sees that its hash number already exists in the index. This is referred to as a false positive, and it can result in data loss. Some vendors combine hash algorithms to reduce the possibility of a hash collision. Some vendors are also examining metadata to identify data and prevent collisions. I don’t think it is fool-proof yet.
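The false positive described above, and the idea of combining hash algorithms to make it less likely, can be shown with a small Python sketch. Real hashes such as MD5 or SHA-1 collide far too rarely to demonstrate by hand, so the two 8-bit hashes here (`sum_hash` and `weighted_hash`) are deliberately weak, made-up functions, and the `Store` class is equally hypothetical.

```python
def sum_hash(chunk: bytes) -> int:
    """Deliberately weak 8-bit hash (hypothetical): byte order is ignored."""
    return sum(chunk) % 256


def weighted_hash(chunk: bytes) -> int:
    """Second weak 8-bit hash (hypothetical): each byte is weighted by its position."""
    return sum((i + 1) * b for i, b in enumerate(chunk)) % 256


class Store:
    """Keeps one chunk per index key; a chunk whose key exists is treated as a duplicate."""

    def __init__(self, key_func) -> None:
        self.key_func = key_func
        self.index = {}  # key -> stored chunk

    def write(self, chunk: bytes):
        key = self.key_func(chunk)
        if key not in self.index:  # only unseen keys cause data to be stored
            self.index[key] = chunk
        return key

    def read(self, key) -> bytes:
        return self.index[key]


first, second = b"ab", b"ba"  # different data, identical byte sum

# One hash: the second chunk collides with the first and is never stored.
single = Store(sum_hash)
single.write(first)
lost = single.read(single.write(second))

# Two hashes combined into one key: this particular collision disappears.
combined = Store(lambda c: (sum_hash(c), weighted_hash(c)))
combined.write(first)
kept = combined.read(combined.write(second))

print("single hash returns:", lost)
print("combined hashes return:", kept)
```

The single-hash store hands back `b'ab'` when asked for the second chunk, which is exactly the silent data loss a false positive causes. The combined key returns `b'ba'` correctly, but note that it only lowers the odds: two chunks that collide under both hashes would still be treated as duplicates.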
In terms of performance, throughput will vary by vendor, as deduplication is a resource-intensive process and its performance depends on vendor-specific algorithms and implementation. For example, during writes, the deduplication process must determine whether the chunk of data has been stored before, often across hundreds of terabytes of prior data. An index of this data will likely be too big to fit in RAM unless it is a very small deployment. If the index is on disk, each lookup will need a disk seek, and disk seeks are notoriously slow. The easiest way to make data deduplication go faster is to sacrifice reduction ratio, i.e. only look for large sequences, so you don’t have to perform disk seeks as frequently.
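A rough back-of-the-envelope calculation shows why the index outgrows RAM and why looking only for larger sequences helps. The numbers below are assumptions for illustration, not vendor figures: 100 TiB of stored data in which every chunk is unique (the worst case for the index), a 20-byte SHA-1 digest per entry, and a hypothetical 8 bytes per entry to record where the chunk lives on disk.

```python
DIGEST_BYTES = 20    # size of a SHA-1 digest
LOCATION_BYTES = 8   # hypothetical: on-disk location of the chunk
KIB = 2**10
GIB = 2**30
TIB = 2**40


def index_size_gib(stored_bytes: int, chunk_bytes: int) -> float:
    """Estimate the index size, assuming one entry per chunk (all chunks unique)."""
    entries = stored_bytes // chunk_bytes
    return entries * (DIGEST_BYTES + LOCATION_BYTES) / GIB


for chunk_kib in (8, 128):
    size = index_size_gib(100 * TIB, chunk_kib * KIB)
    print(f"{chunk_kib} KiB chunks: index is about {size:.1f} GiB")
```

Under these assumptions, 8 KiB chunks need an index of about 350 GiB, while 128 KiB chunks bring it down to roughly 22 GiB. The larger chunks are far more likely to fit in memory and avoid disk seeks, at the cost of missing duplicates smaller than the chunk size.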