In the previous century, Big Data was the archive full of ring binders and folders, which grew at the same pace each year. Now the definition is that the data grows each year as much as in all previous years combined.
A few months ago SunGard named ten Big Data trends transforming financial services. I have used their list as a base, but with my own focus: increased computation demands, and not specific to that one market. This resulted in seven general trends where Big Data meets (and needs) OpenCL.
Since the start of StreamHPC we have sought customers who could not compute through all their data in time. Back then Big Data was still a buzzword catching on, but it best describes this core business of ours.
Historical Data
Historical data can be pre-computed, so there is less need for high-performance processors here. Take for example the distribution and movement of (real) clouds in meteorological data: the most compute-intensive part is handling the raw data, which can then be saved in a form that is easier to search through and to post-process. Another example is hashing the data, also with the goal of making it better searchable.
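As a small illustration of that second example, here is a minimal sketch in Java (the record format and station names are purely hypothetical) of pre-computing a hash index over raw records once, so that later queries become cheap lookups instead of full scans:

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class HashIndex {
    public static void main(String[] args) {
        // Hypothetical raw records: "stationId;timestamp;measurement"
        List<String> rawRecords = List.of(
                "DEBILT;2012-06-01T12:00;18.4",
                "DEBILT;2012-06-01T13:00;19.1",
                "SCHIPHOL;2012-06-01T12:00;17.2");

        // Pre-compute once: map each station to its measurements.
        Map<String, List<String>> index = new HashMap<>();
        for (String record : rawRecords) {
            String station = record.split(";")[0];
            index.computeIfAbsent(station, k -> new java.util.ArrayList<>()).add(record);
        }

        // Later queries are a hash lookup instead of a scan over the raw data.
        System.out.println(index.get("DEBILT"));
    }
}
```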
This kind of data searching can also be sped up using GPUs and parallel computing. But interest in historical data is highest when new algorithms have been developed: finding an algorithm's best parameters means running it over the full data set many times, and that needs to be done fast. By writing the algorithm in OpenCL we get two advantages: the computations can be sped up by a few factors, and they can easily be scaled up.
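As an illustration, here is a minimal CPU-side sketch in Java of such a parameter sweep; the scoring function and the parameter range are hypothetical, and in practice the inner evaluation is exactly the part we would port to an OpenCL kernel so that all candidates run on the GPU in parallel.

```java
import java.util.stream.IntStream;

public class ParameterSweep {
    // Hypothetical scoring function: how well does parameter p fit the data?
    static double score(double p, double[] data) {
        double error = 0.0;
        for (double x : data) {
            double diff = x - p * x * x;   // placeholder model
            error += diff * diff;
        }
        return error;
    }

    public static void main(String[] args) {
        double[] data = new double[1_000_000];
        for (int i = 0; i < data.length; i++) data[i] = Math.sin(i);

        int steps = 1000;                  // number of candidate parameters
        double[] errors = new double[steps];

        // Evaluate all candidates in parallel; in an OpenCL version each
        // candidate (or each data element) would map to one work-item.
        IntStream.range(0, steps).parallel()
                 .forEach(i -> errors[i] = score(i / (double) steps, data));

        int best = 0;
        for (int i = 1; i < steps; i++) if (errors[i] < errors[best]) best = i;
        System.out.println("Best parameter: " + best / (double) steps);
    }
}
```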
Desire to Leverage More Data
We know that companies like Facebook and Google are collecting everybody's personal data, that the seemingly endless internet can actually be downloaded, that it is possible to save all data each internet user downloaded in the past half year, and so on. Big Data is like nuclear fission: it can be used for good or bad, and we have to be very careful with it. That makes it extra exciting.
As in the example above, consumer data is becoming more important for selling targeted ads. But this is a much wider trend, across many industries. Adding various statistical data to the regular data could reveal unexpected causal links and patterns. Much effort has gone into making such data accessible, which now makes this actually possible.
Here's an example concerning Small Data. Studies on what people do in their last hour at the office have resulted in many tips for managers on how to get more efficient colleagues. But as you might have expected, efficiency decreases creativity – a reason to add everybody's tasks of the past years to the data stream. And don't forget to add the mix of personality types in the office. Combining all these data could give an unexpected answer on how to interpret an increased demand for more structure in a certain department. You must understand that this combination of various semi-unrelated data has hardly been done before – the data was not made accessible, as interdisciplinary research is still quite new.
Post-Emergent Markets
As with many technologies, the new world is catching on. To quote from a Dutch trade mission document to China:
In lively and open discussions, it became clear that China’s role in scientific and commercial developments is increasingly important.
From what I read, I can only conclude that also for the compute part of Big Data, the new world is gaining traction faster than the old world anticipates. I don't want to go into it too much, as Exascale marketing already does that.
Advances in Big Computational Power
Where the original list mentions drives and storage, I have replaced it in this list by computation power. Adding GPUs and other new architectures to servers has increased the computational power per Watt dramatically. 100 GFLOPS under 5 Watts is becoming a new standard – that is a fast Intel Core 2 Duo using as much energy as a compact fluorescent lamp. Of course you'll need technologies like OpenCL to harvest this potential.
Need to Re-engineer ETL (extract, transform, load)
The first bottleneck in Big Data is data transfer and bandwidth. Many systems are not engineered to handle so much data.
The first thing we do here at StreamHPC is find the limits of the network, the databases and the load/save statements in the software. A few examples of bottlenecks we've seen:
- Transferring 2 TB of data from SSD drives takes over an hour if done serially, and around five hours from older drives (see the short calculation after this list).
- Older software that assumes servers with 1 GB of memory instead of 32 GB or more can cache data inefficiently.
- Databases located at a different site than the software servers.
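To give an idea of the first bottleneck, here is a back-of-the-envelope calculation in Java; the throughput numbers are assumptions, roughly 500 MB/s for a SATA SSD and 120 MB/s for an older spinning disk:

```java
public class TransferTime {
    public static void main(String[] args) {
        double terabytes = 2.0;
        double dataMB = terabytes * 1_000_000;   // 2 TB expressed in MB

        double ssdMBps = 500;   // assumed sequential read speed of a SATA SSD
        double hddMBps = 120;   // assumed sequential read speed of an older disk

        System.out.printf("SSD, serial: %.1f hours%n", dataMB / ssdMBps / 3600);
        System.out.printf("HDD, serial: %.1f hours%n", dataMB / hddMBps / 3600);
    }
}
```

With these assumed speeds that comes out at roughly 1.1 hours for the SSD and 4.6 hours for the older drive, in line with the numbers above.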
Once the data transfer is no longer the problem, we can focus on the T-part: transform the data using faster algorithms, more efficient software and better hardware.
Mobile Proliferation
Mobile phones (and gadgets) are a growth market and a driver behind cloud computing, as the amount of to-be-processed data increases every minute.
There are two sides here: data-processing and data-gathering.
For processing, the trade-off is between handling the data locally on the device or in the cloud. Computing the data of millions of mobile phones is an enormous task when done centrally. The device or service that can offer more than the mobile processor itself has a competitive advantage, which makes this off-loading an important market.
Gathering usage data on smartphones is very common nowadays – whether we like it or not. All those millions of phones generate so much data that storing it is already a challenge. Just as Facebook does with finding connections between all of its users, this connection graph can be computed for phone users too – and with each extra phone, the computation gets more demanding.
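A minimal sketch of why this gets harder so quickly: finding which users share contacts by naive pairwise comparison grows quadratically with the number of phones. The contact data below is randomly generated purely for illustration.

```java
import java.util.*;

public class ContactGraph {
    public static void main(String[] args) {
        int users = 2_000;
        Random rnd = new Random(42);

        // Randomly generated contact lists, for illustration only.
        List<Set<Integer>> contacts = new ArrayList<>();
        for (int u = 0; u < users; u++) {
            Set<Integer> c = new HashSet<>();
            for (int k = 0; k < 50; k++) c.add(rnd.nextInt(100_000));
            contacts.add(c);
        }

        // Naive pairwise comparison: users*(users-1)/2 checks.
        long comparisons = 0, edges = 0;
        for (int a = 0; a < users; a++) {
            for (int b = a + 1; b < users; b++) {
                comparisons++;
                if (!Collections.disjoint(contacts.get(a), contacts.get(b))) edges++;
            }
        }
        System.out.println(comparisons + " comparisons, " + edges + " shared-contact edges");
    }
}
```

Doubling the number of users roughly quadruples the number of comparisons, which is exactly the kind of workload that asks for parallel hardware.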
Big Data is Driving Big Data
To quote SunGard here:
Big data initiatives are driving increased demand for algorithms to process data, as well as emphasizing challenges around data security and access control, and minimizing impact on existing systems.
As it becomes possible to compute through a few terabytes of data in much less time, new ideas come up for more time-consuming algorithms. So in a sense, the need for more processing power is growing faster than the need for more data handling. The only limits are the 24 hours in a day and the money we can spend on it.
Big Data at StreamHPC
We chose MapReduce via Hadoop to bring GPGPU to clusters. We found that problems that are solved well with OpenCL on GPUs also map well onto Hadoop/Storm. As Hadoop and Storm are Java-based, they have the added advantage of integrating easily with Aparapi.
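As a minimal sketch of that integration (assuming the newer com.aparapi package naming; the array sizes and the computation itself are just placeholders), this is how a piece of Java code inside a mapper or bolt could be expressed as an Aparapi kernel, which Aparapi translates to OpenCL at run time:

```java
import com.aparapi.Kernel;
import com.aparapi.Range;

public class AparapiSketch {
    public static void main(String[] args) {
        final int n = 1_000_000;
        final float[] input = new float[n];
        final float[] output = new float[n];
        for (int i = 0; i < n; i++) input[i] = i;

        // The run() method is translated to an OpenCL kernel by Aparapi;
        // the computation here is only a placeholder.
        Kernel kernel = new Kernel() {
            @Override
            public void run() {
                int i = getGlobalId();
                output[i] = input[i] * input[i] + 1.0f;
            }
        };

        kernel.execute(Range.create(n));   // one work-item per array element
        kernel.dispose();

        System.out.println(output[10]);    // 101.0
    }
}
```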
Could Google's algorithm be made faster using GPUs? We think it can – or they have already done it.