Did you get a chance to read “The Digital Universe Decade - Are You Ready?” published by IDC (May 2010)? This report is a follow-up to the 2007 report titled “The Expanding Digital Universe”. I ask because I'm surprised by the number of individuals who are unaware of its existence. In fact, I don’t believe I met more than a dozen individuals at VMworld (US & Europe) or Insight (US & Europe) who had read either of these reports.
I should admit that this report kind of flew under my radar as well. I owe thanks to The Skeptic’s Guide to the Universe podcast (one of my favorites, and one I recommend to you) for covering this report in episode 251.
In short, the Digital Universe is defined as all of the digital data in existence; these reports attempt to measure its current size while also providing a forecast for future growth. These reports are fantastic reads, and I wanted to share a bit of them with you in order to prepare you for the amount of data storage capacity your infrastructure will be required to address.
Buckle up, it's gonna get wild!
The 2007 report estimated that the digital universe, all of the known data in the world, totaled 161 exabytes (or 161,000 petabytes). In addition, this report forecasted a growth of the digital universe to 988 exabytes by 2010.
The 2010 report states the Digital Universe was nearly 880 exabytes in 2009 and that it grew by 62% over 2008. If you’d like to attempt to visualize 880 exabytes, just imagine a column of DVD discs stacked high enough to reach the moon and back to Earth. As a DVD is 1.2mm thick and the distance between the Earth and the moon is 384,403 km, I fear to calculate the number of DVDs required to make this visualization!
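For the curious, the calculation isn’t so scary. A quick sketch (my own arithmetic, not from the IDC report) using the figures above:

```python
# Back-of-the-envelope: how many 1.2mm-thick DVDs would it take to
# stack from the Earth to the moon and back?
EARTH_MOON_KM = 384_403    # one-way distance cited above
DVD_THICKNESS_MM = 1.2     # thickness of a single disc

round_trip_mm = EARTH_MOON_KM * 2 * 1_000_000  # km -> mm
dvds = round_trip_mm / DVD_THICKNESS_MM
print(f"{dvds:.2e} DVDs")  # roughly 6.4e11, i.e. about 640 billion discs
```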
For 2010, IDC projects that the Digital Universe will reach 1.2 zettabytes (or 1,200 exabytes or 1.2 million petabytes or 1.2 billion terabytes if you prefer). This estimate is roughly 21.4% greater than the estimate made just three years prior.
Are you familiar with just how large a zettabyte of data is?
I'm sure most are familiar with data capacities ranging from a byte up through a terabyte, or possibly a petabyte. Many of us aren’t familiar with capacities beyond that, so I thought I’d share the following scale (in decimal).
1,000 bytes is a kilobyte (KB)
1,000,000 bytes is a megabyte (MB)
1,000,000,000 bytes is a gigabyte (GB)
1,000,000,000,000 bytes is a terabyte (TB)
1,000,000,000,000,000 bytes is a petabyte (PB)
1,000,000,000,000,000,000 bytes is an exabyte (EB)
1,000,000,000,000,000,000,000 bytes is a zettabyte (ZB)
What is clear is that there’s more data than you can imagine, and today’s conversation on petabytes will soon transition to one on yottabytes...
1,000,000,000,000,000,000,000,000 bytes is a yottabyte (YB)
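Since these decimal prefixes come up constantly, here’s a small sketch (my own helper, not anything from the report) that formats a raw byte count using the scale above:

```python
# Decimal (power-of-1000) units, matching the scale listed above.
UNITS = ["bytes", "KB", "MB", "GB", "TB", "PB", "EB", "ZB", "YB"]

def humanize(n_bytes):
    """Format a byte count using decimal (SI) units."""
    value = float(n_bytes)
    for unit in UNITS:
        if value < 1000 or unit == UNITS[-1]:
            return f"{value:g} {unit}"
        value /= 1000.0

# IDC's 2010 projection, expressed in raw bytes:
print(humanize(1_200_000_000_000_000_000_000))  # -> 1.2 ZB
```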
A Few Highlights From the Report
IDC forecasts that over the next decade the amount of data will grow by a factor of 44, the number of files will grow by a factor of 67, and storage capacity will grow by a factor of 30. Here’s a rhetorical question… Are you looking to increase your data storage team by a factor of 30, 44, or 67 over the next decade? What about increasing your storage footprint in correlation with the growth of your data?
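To put those decade-long factors in annual terms, here’s a quick sketch (my own arithmetic, not IDC’s) of the compound annual growth rate each factor implies:

```python
# Implied compound annual growth rate for a total growth factor over N years.
def implied_cagr(factor, years=10):
    """Annual growth rate implied by `factor`-times growth over `years`."""
    return factor ** (1 / years) - 1

for label, factor in [("data", 44), ("files", 67), ("storage capacity", 30)]:
    print(f"{label}: x{factor} over a decade = {implied_cagr(factor):.0%} per year")
```

In other words, a 44x decade is roughly 46% growth every single year, compounding.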
IDC also stated their data shows that nearly 75% of our digital universe is a copy. In other words, only 25% of all data is unique! Redundant copies of data are the norm when one considers backup, DR, and application test & development, and these are the sources IDC highlights. Based on NetApp’s block-level data deduplication, we know that there is also a fair amount of redundancy in objects that are dissimilar or that are copies of primary data sets. An example of this type of redundancy would be the OS and application binaries found within virtual infrastructures.
This type of redundancy is not clearly identified by IDC in the report, and since we know it does exist, I think it’s fair to say that the actual amount of original data may be less than 25%.
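To illustrate why block-level deduplication finds redundancy even across dissimilar objects, here’s a minimal sketch (my own toy model, not NetApp’s implementation; the 4 KB block size is an assumption) where identical fixed-size blocks are stored once and referenced thereafter:

```python
import hashlib

BLOCK_SIZE = 4096  # bytes per block; a 4 KB block size is assumed here

def dedupe(data):
    """Return (block_store, references) for a byte string."""
    store = {}  # block fingerprint -> block contents (stored once)
    refs = []   # ordered fingerprints that reconstruct the original data
    for i in range(0, len(data), BLOCK_SIZE):
        block = data[i:i + BLOCK_SIZE]
        fp = hashlib.sha256(block).hexdigest()
        store.setdefault(fp, block)
        refs.append(fp)
    return store, refs

# Two hypothetical "VMs" sharing identical OS-image blocks but holding
# different application data -- dissimilar objects with common blocks:
os_image = b"\x00" * (BLOCK_SIZE * 10)
vm1 = os_image + b"app-one".ljust(BLOCK_SIZE, b"\x01")
vm2 = os_image + b"app-two".ljust(BLOCK_SIZE, b"\x02")
store, refs = dedupe(vm1 + vm2)
print(f"logical blocks: {len(refs)}, physical blocks stored: {len(store)}")
# -> logical blocks: 22, physical blocks stored: 3
```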
Today every storage vendor promotes storage savings technologies like data deduplication and compression; however, these capabilities are almost exclusively locked into backup solutions or are so restrictive they are relegated to slideware-only implementations.
From “The Digital Universe Decade - Are You Ready?”
Published by IDC
Storage still seems to remain somewhat of a black box to many of those designing virtual and cloud architectures. Folks unfamiliar with storage tend to focus on performance and availability, yet these characteristics are commodities in today’s market. Today’s storage arrays must address the complexity of scaling storage capacities while enabling application and infrastructure integration, both of which have scaling challenges (for instance, can you say copy-offload?).
Call to Action
The verdict is in: we either need to stop creating and sharing data, or we need to begin storing our data in a more efficient manner. Speaking specifically to virtualization, cloud infrastructures are an especially attractive place to eliminate data redundancy due to their content aggregation and the high level of commonality contained within.
At NetApp we are advocating that our customers begin reducing the capacity required to run every data set within their data center. The ‘cost’ of enabling technologies like dedupe, FlexClone, and thin provisioning is almost nothing, yet the benefits in terms of storage space reclamation can be significant.
“Since I began using NetApp’s deduplication technology, it’s deferred the need to purchase any additional storage hardware by at least six to eight months."
Manager of IT, Polysius USA
I plan to cover more of these capabilities in upcoming posts addressing backup, disaster recovery, enterprise-class applications, virtual desktops, and more, all in the context of private clouds and virtual infrastructures. Future-proofing your datacenter is an exciting goal, and one I hope you enjoy pursuing.