What is Data Deduplication?

When it comes to computers, storage space is a hot commodity. More often than not, we find that we need far more storage space than we originally anticipated, so it’s always a good idea to optimise the amount of storage space you require with cloud or dedicated servers.

One very popular method of reducing the amount of data we need to store is a process called data deduplication. By using this simple procedure, you can dramatically reduce the build-up of redundant, duplicate data on your machines and in the cloud.

For those asking the question, 'what is data deduplication?', we'll break it down without using any excessive terminology.

At its basic level, data deduplication, or dedup for short, is a way of reducing the amount of storage space required when backing up files to a server by preventing the recording of duplicate files.

For example, imagine you work in a business environment where each employee has their own desktop machine. On these machines, they each have their own folder system for storing their files. These files are then backed up onto the company’s network as well.

In an enterprise of hundreds or thousands of employees, all of these files add up to a huge amount of storage space. As such, methods like data deduplication are vital for reducing storage space on a server by ensuring that identical files from numerous machines are only ever backed up once.

How does data deduplication work?

The process of data deduplication is as simple as it sounds. Let's stick with our office environment example to help explain it.

In this scenario, imagine all five members of your team receive the same copy of a PowerPoint document called example.ppt. Everyone saves example.ppt to their own machines, which means there are now six copies of example.ppt on the company network: the original and the five copies.

Say example.ppt is 10MB in size. Because everyone now has a copy, it actually takes up 60MB of storage space on the network.

Admittedly, 60MB doesn’t sound like much, but what if it was a 50MB photo file sent to 100 employees or a 1GB HD video file sent to 1000 employees?

Suddenly the 1GB video file takes up a terabyte of storage space across the company's dedicated server.

Data deduplication ensures that only one copy of any file is ever backed up; everyone else simply receives a pointer to the original. The user doesn’t even realise that they haven’t got the original document, and if they make changes and save it as a new file, this new version will be backed up because it counts as a different file.

Thus, by ensuring no duplicates are made, storage requirements are cut dramatically.
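To put numbers on the examples above, here is a minimal Python sketch of the arithmetic. The function name storage_needed_mb is made up for illustration, it ignores the small amount of space the pointers themselves take up, and it assumes 1GB is 1000MB.

```python
def storage_needed_mb(file_size_mb: int, copies: int, deduplicated: bool) -> int:
    """Space needed to back up `copies` identical files of `file_size_mb` each.

    Simplification: pointer overhead is treated as zero.
    """
    if copies == 0:
        return 0
    # With deduplication only one copy is stored, however many people hold it.
    return file_size_mb if deduplicated else file_size_mb * copies


# The 10MB presentation held by six people (the original plus five copies).
print(storage_needed_mb(10, 6, deduplicated=False))       # 60
print(storage_needed_mb(10, 6, deduplicated=True))        # 10

# The 1GB video sent to 1000 employees, taking 1GB as 1000MB.
print(storage_needed_mb(1000, 1000, deduplicated=False))  # 1000000 (about 1TB)
print(storage_needed_mb(1000, 1000, deduplicated=True))   # 1000
```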

Data deduplication techniques and analysis

When performing data deduplication, there are two main methods you can employ: file-level data deduplication and block-level data deduplication.

Both methods work well but it's worth explaining the differences between the two to better understand which method is better for you.

File-level data deduplication

File-level data deduplication is the most basic level of data deduplication available, the one used in our previous examples. In short, file-level data deduplication goes through a server and scans for exact duplicates of files and, well... deduplicates them.

When an employee saves a file to their area of the network, data deduplication systems check the file against the index of all of the other files on the network. If it comes up as a unique file, then it's stored and the server index is updated.

However, if it doesn't register with the data deduplication systems as unique, a pointer to the original file is saved instead.
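The check-the-index-then-store-or-point flow described above might look something like the sketch below. It is only an illustration: the FileLevelStore class and its method names are hypothetical, and using a SHA-256 hash of the file's contents as the index key is an assumption on our part, since real products differ in how they build their index.

```python
import hashlib


class FileLevelStore:
    """Toy file-level deduplicating store (illustrative, not a real product)."""

    def __init__(self):
        self.index = {}     # content hash -> file contents, stored once
        self.pointers = {}  # user's path -> content hash

    def save(self, path: str, content: bytes) -> bool:
        """Save a file; return True if its contents were new to the store."""
        digest = hashlib.sha256(content).hexdigest()
        is_unique = digest not in self.index
        if is_unique:
            self.index[digest] = content  # unique file: store it, update index
        self.pointers[path] = digest      # either way, record a pointer
        return is_unique

    def read(self, path: str) -> bytes:
        """Follow the pointer back to the single stored copy."""
        return self.index[self.pointers[path]]

    def stored_bytes(self) -> int:
        return sum(len(content) for content in self.index.values())


store = FileLevelStore()
presentation = b"example.ppt contents " * 100

# Six people save the same file to their own area of the network.
results = [store.save(f"user{i}/example.ppt", presentation) for i in range(6)]
print(results)                                     # only the first save is new
print(store.stored_bytes() == len(presentation))   # True: one copy is kept
```

Every user can still read their "own" example.ppt, because read simply follows the pointer to the one stored copy.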

This basic method of data deduplication does save space, but it's a pretty inefficient method. If one of the employees corrects a typo in example.ppt then file-level data deduplication systems will consider this to be a new unique file and save it.

This essentially doubles the previous storage amount because there is now one 10MB original, four pointers, and one 10MB edited copy, saving less space than originally intended.
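The doubling effect is easy to reproduce. In this rough sketch, file_level_stored_bytes is a hypothetical helper, 10,000 bytes stand in for the 10MB presentation, and files are compared by a SHA-256 hash of their contents, which is our assumption rather than a description of any particular product.

```python
import hashlib


def file_level_stored_bytes(files: dict) -> int:
    """Bytes kept after file-level deduplication: one copy per distinct content."""
    unique_sizes = {}
    for content in files.values():
        unique_sizes[hashlib.sha256(content).hexdigest()] = len(content)
    return sum(unique_sizes.values())


SIZE = 10_000  # stand-in for the 10MB presentation
original = b"x" * SIZE

# The original plus five copies, all identical.
files = {f"user{i}/example.ppt": original for i in range(6)}
before = file_level_stored_bytes(files)

# One employee corrects a single-character typo: same size, different content.
files["user5/example.ppt"] = original[:-1] + b"y"
after = file_level_stored_bytes(files)

print(before, after)  # 10000 20000
```

A one-character change is enough to make the whole file count as new, which is exactly the weakness of working at file level.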

Block-level data deduplication

Block-level data deduplication systems offer a solution to this problem. Instead of treating files as singular entities, block-level data deduplication breaks each file down into smaller blocks of binary data.

Say two employees are sent example.ppt, and they both make some changes to it. File-level data deduplication would treat this as three separate files, but block-level data deduplication systems would split these files into blocks and store each unique block only once, no matter how many copies contain it.

This means that instead of having three unique 10MB files totalling 30MB of space, only the unique blocks of data need to be stored, taking up little over 10MB of space.

In non-computer terms, imagine that example.ppt is a PowerPoint presentation made up of four slides. Block-level data deduplication would treat each slide as a unique block of data within the file, saving the file as four total blocks.

This is saved in the data deduplication system as blocks ABCD, giving us three copies of the same file, all formed of the ABCD blocks. The two employees who received the original file then both make slight changes to it, creating three unique files made up of blocks ABCD, ABCE, and ABDE.

As block-level data deduplication systems store only the unique blocks, not the unique files, only the five blocks A, B, C, D and E are stored on the network and recombined to produce the different files when requested.

When any further changes are made, the system simply adds the new block to the set of blocks it already stores.
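The ABCD, ABCE and ABDE example can be sketched in a few lines of Python. This is a toy model under stated assumptions: the BlockStore class is hypothetical, each "slide" is a fixed 4-byte block, and blocks are identified by a SHA-256 hash. Real systems choose their own block sizes and do not split a PowerPoint file neatly by slide.

```python
import hashlib


class BlockStore:
    """Toy block-level deduplicating store using fixed-size blocks."""

    def __init__(self, block_size: int = 4):
        self.block_size = block_size
        self.blocks = {}   # block hash -> block bytes, stored once
        self.recipes = {}  # file name -> ordered list of block hashes

    def save(self, name: str, content: bytes) -> None:
        recipe = []
        for start in range(0, len(content), self.block_size):
            block = content[start:start + self.block_size]
            digest = hashlib.sha256(block).hexdigest()
            self.blocks.setdefault(digest, block)  # keep only unseen blocks
            recipe.append(digest)
        self.recipes[name] = recipe

    def read(self, name: str) -> bytes:
        """Rebuild a file by joining its blocks in order."""
        return b"".join(self.blocks[d] for d in self.recipes[name])

    def stored_bytes(self) -> int:
        return sum(len(block) for block in self.blocks.values())


# Each "slide" is one 4-byte block.
A, B, C, D, E = (letter * 4 for letter in (b"A", b"B", b"C", b"D", b"E"))

store = BlockStore(block_size=4)
store.save("original.ppt", A + B + C + D)
store.save("edit1.ppt", A + B + C + E)
store.save("edit2.ppt", A + B + D + E)

print(len(store.blocks))     # 5 unique blocks: A, B, C, D and E
print(store.stored_bytes())  # 20 bytes stored, against 48 for three full files
```

Each file is kept only as a list of block references, so reading edit2.ppt reassembles A, B, D and E from the shared pool.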

Benefits of data deduplication systems

The obvious benefit of data deduplication is an overall reduction in storage demand, but it also provides the benefit of reduced bandwidth consumption on the server, and faster speeds when uploading or downloading to and from the backup.

Data deduplication is also highly customisable. You can tell the server to only process certain folders, to exclude duplicates of certain file types, or to exclude files that are less than a set number of days old.

Deduplication can also be set up to happen as soon as data is backed up to the server, or it can happen in the background at set increments. The choice is yours and it lets you establish exactly how you want your data to be saved.
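As a rough illustration of the folder, file-type and file-age rules mentioned above, here is a hypothetical policy check in Python. The DedupPolicy class, its field names and the example paths are all invented for this sketch; a real server will expose these settings through its own configuration tools.

```python
from dataclasses import dataclass
from pathlib import PurePosixPath


@dataclass
class DedupPolicy:
    """Hypothetical deduplication rules (not a real product's settings)."""

    include_folders: tuple = ("/srv/backups",)  # only process these folders
    excluded_extensions: tuple = (".iso",)      # skip these file types
    min_age_days: int = 3                       # skip files newer than this

    def should_process(self, path: str, age_days: float) -> bool:
        file_path = PurePosixPath(path)
        in_scope = any(
            PurePosixPath(folder) in file_path.parents
            for folder in self.include_folders
        )
        type_allowed = file_path.suffix.lower() not in self.excluded_extensions
        old_enough = age_days >= self.min_age_days
        return in_scope and type_allowed and old_enough


policy = DedupPolicy()
print(policy.should_process("/srv/backups/team/example.ppt", age_days=10))  # True
print(policy.should_process("/srv/other/example.ppt", age_days=10))         # False
print(policy.should_process("/srv/backups/team/disk.iso", age_days=10))     # False
print(policy.should_process("/srv/backups/team/example.ppt", age_days=1))   # False
```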

Ultimately, the primary benefit of data deduplication is cost savings. Saving storage space saves money.

Of course, data deduplication is not the only way to save storage space; there's also file compression.

Data deduplication isn’t an alternative to file compression; it’s more of a supplement to it. In fact, many companies will use a combination of data deduplication and file compression to fully optimise their storage space.
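To show how the two can work together, the sketch below deduplicates a handful of files first and then compresses each unique file with Python's built-in zlib module. The dedup_then_compress function is a hypothetical helper, and the choice of SHA-256 and zlib is an assumption made for the example, not a description of how any particular backup product does it.

```python
import hashlib
import zlib


def dedup_then_compress(files: dict) -> dict:
    """Keep one compressed copy per distinct file content."""
    stored = {}
    for content in files.values():
        digest = hashlib.sha256(content).hexdigest()
        if digest not in stored:          # deduplication: skip repeats
            stored[digest] = zlib.compress(content)  # compression: shrink it
    return stored


slides = b"slide text " * 1000
notes = b"meeting notes " * 1000
files = {
    "alice/example.ppt": slides,
    "bob/example.ppt": slides,
    "carol/example.ppt": slides,
    "alice/notes.txt": notes,
}

stored = dedup_then_compress(files)
raw_bytes = sum(len(content) for content in files.values())
stored_bytes = sum(len(blob) for blob in stored.values())

print(len(stored))               # 2 unique files kept out of 4
print(stored_bytes < raw_bytes)  # True
```

Deduplication removes the repeated copies, and compression then shrinks what is left, which is why the two are often used side by side.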

Naturally, data deduplication can be enabled on all of the dedicated servers available here at Fasthosts. With our 24/7 support services, you can be sure you'll always be saving as much storage space as possible.
