The three Vs of big data are volume, velocity, and variety. In order to derive value from big data, IT departments need tools to store massive amounts of data (volume) and make sure it is always available for quick analysis (velocity).
There are two core components to doing this: 1) managing all the IT processes and solutions responsible for storing and making information available, and 2) actually analyzing that data. The second is where revenue is generated.
Organizations that get bogged down in the data management aspect of analytics will prolong the time-to-value of their analytics solutions, regardless of how robust the software is. Worse yet, the explosion of volume that will occur in the next several years means that those without streamlined storage and data management will face excessive costs.
Does IBM i have anything in its arsenal to equip IT teams to combat big data? Or will the three Vs of big data be the downfall of its reputation for reliability?
Turning Up the Volume
Volume is perhaps the most talked about characteristic of big data. After all, it’s difficult to do much with all that information if there’s nowhere to store it and, by its very nature, big data refers to data sets that are too large for traditional tools to handle.
It’s not just what organizations are collecting today that is an issue, though. It’s what they will collect in the future. IDC’s Digital Universe Study provided some insight into what the average data ecosystem is going to look like by 2020. There are two main points that are essential for understanding the big data problem:
- The digital universe is expected to grow to 40 zettabytes from 2010 to 2020, a 50-fold increase
- 23% of all digital information has big data value but less than 3% of it is actively being used this way
These statistics show that there is a lot of untapped potential in big data, but there are obstacles.
“As the volume and complexity of data barraging businesses from all angles increases, IT organizations have a choice: they can either succumb to information-overload paralysis, or they can take steps to harness the tremendous potential teeming within all of those data streams,” said Jeremy Burton, executive vice president of product operations and marketing at EMC Corporation. This means it is important to start identifying the upcoming challenges and create strategies to overcome them.
Challenge #1: The Expense of Storage
One issue involved in storing huge volumes of information is that it is expensive, which has been one of the major roadblocks in cancer research. As Nextgov reported, the National Cancer Institute maintains a library of data that could drive treatment innovation. However, the cost of storing the 2.5 petabytes of information expected to be in the Cancer Genome Atlas in 2014 is projected to reach $2 million per year, meaning that this wealth of insight has been unavailable to the majority of researchers.
The National Cancer Institute is turning toward the cloud to solve its dilemma, but this is not the optimal approach for every initiative. Concern over data security and privacy aside, there is no reason to spend more on third-party storage when there are robust in-house solutions for managing data even in light of massive volumes.
Challenge #2: Availability
Another issue that must be overcome to glean value from big data is ensuring that all of the necessary information is actually available. Here's where IBM i shines. The platform benefits from embedded technology that makes high availability clustering easy. Tools like PowerHA and well-established software replication products are the keys to tapping that innate power and controlling the massive influx of enterprise information.
PowerHA offers a complete portfolio of high availability solutions for IBM i, including SystemMirror, which allows the setup of high availability clusters.
Each cluster is composed of a primary and secondary server and a shared storage pool, connected so that, in the event of planned downtime, one server can take over for the other. This configuration means that no data needs to be unnecessarily replicated. Furthermore, configuring a high availability cluster is useful outside the scope of big data, since it ensures that mission-critical applications can still be accessed during maintenance. A copy of the data can be split off for backups and other maintenance without impacting current business.
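The primary/secondary takeover described above can be sketched in a few lines. This is a conceptual illustration only; the node names and the `Cluster` class are hypothetical and do not reflect PowerHA's actual interfaces.

```python
# Conceptual sketch of a two-node high availability cluster with shared
# storage. Names and classes are illustrative, not PowerHA's real API.

class Node:
    def __init__(self, name):
        self.name = name
        self.active = True

class Cluster:
    """Two nodes sharing one storage pool."""
    def __init__(self, primary, secondary):
        self.primary = primary
        self.secondary = secondary

    def current_server(self):
        # The secondary takes over whenever the primary is down
        # (e.g. planned maintenance). Because storage is shared,
        # no data has to be copied during the switch.
        return self.primary if self.primary.active else self.secondary

cluster = Cluster(Node("PROD01"), Node("PROD02"))
print(cluster.current_server().name)   # PROD01

cluster.primary.active = False          # planned downtime on the primary
print(cluster.current_server().name)   # PROD02
```

The point of the shared-pool design is visible in `current_server`: the switch is a change of which node serves requests, not a data copy.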
Software replication gives the customer a more granular approach: replicating parts of the database, program libraries, or the entire system. This technology has been around for years on IBM i and, when managed properly, delivers great results. The most successful customers have added automation software to help with role swaps and with monitoring the replication process.
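The kind of check that role-swap automation performs can be illustrated with a lag comparison between source and target journal positions. The function names and threshold here are hypothetical sketches, not any vendor's actual tooling.

```python
# Illustrative sketch of replication monitoring before a role swap:
# compare how far the target system has applied the source's journal
# entries. All names and numbers are hypothetical.

def replication_lag(source_seq, target_seq):
    """Journal entries the target still has to apply."""
    return source_seq - target_seq

def safe_to_swap(source_seq, target_seq, max_lag=0):
    # A role swap should only proceed once the target has caught up;
    # swapping while entries are unapplied risks losing transactions.
    return replication_lag(source_seq, target_seq) <= max_lag

print(safe_to_swap(10_500, 10_500))  # True: fully caught up
print(safe_to_swap(10_500, 10_420))  # False: 80 entries behind
```

Automation software runs checks like this continuously, so a swap can be triggered (or blocked) without an operator manually comparing journal positions.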
Automation Drives Efficiency on the i
The key to automated storage tiering is that it can detect how often data is accessed and automatically move it to the appropriate storage environment, resulting in fewer wasted resources and effectively eliminating many of the issues that make big data difficult to manage. Automated storage tiering reduces the need to classify data manually, since data is dynamically moved to different hardware based on how often it is accessed. Similarly, PowerHA eliminates the need to identify which objects need to be replicated and reduces the complexity of keeping that data in sync.
Keeping every type of data on storage of the same performance grade is highly inefficient. This makes the tried and true strategy of automated storage tiering particularly valuable, as performance demands even within a single big data environment can vary drastically. One file might be used in numerous reports while others collect dust in the system for months.
Automated storage tiering allows organizations to configure multiple tiers of storage that are designed to house data based on performance requirements so information that requires the fastest speed can be stored in solid-state drives, while rarely accessed data is stored in more affordable devices.
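The tiering decision described above reduces to a threshold rule on access frequency. The sketch below assumes three hypothetical tiers and made-up thresholds; real products tune these dynamically rather than using fixed cutoffs.

```python
# Illustrative sketch of automated storage tiering: each file is placed
# on a tier according to how often it is accessed. Tier names and
# thresholds are hypothetical, not from any specific product.

TIERS = [
    (100, "ssd"),      # hot: frequently read, goes to solid-state
    (10,  "disk"),     # warm: ordinary spinning disk
    (0,   "archive"),  # cold: cheapest, rarely accessed devices
]

def assign_tier(accesses_per_month):
    """Return the tier whose threshold the access count meets."""
    for threshold, tier in TIERS:
        if accesses_per_month >= threshold:
            return tier
    return "archive"

files = {"sales_report.dat": 250, "audit_2019.log": 3, "staging.tmp": 40}
for name, hits in files.items():
    print(name, "->", assign_tier(hits))
# sales_report.dat -> ssd
# audit_2019.log -> archive
# staging.tmp -> disk
```

Because placement is recomputed as access patterns change, a report file that goes cold would migrate down the tiers automatically, with no one classifying it by hand.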