The Power Of Open Source To Solve The Data Fragmentation Challenge

Includes: AMZN, CLOUD, CRM, HDP, TWTR
by: Tomasz Tunguz

Summary

Most modern data architectures employ many different data stores and processing engines.

Apache Arrow is a new open-source project that helps data analysts wrestle diverse data sets into a single format.

Apache Arrow is a collaborative effort that spans many of the largest providers and users of data infrastructure today.

Most modern data architectures employ many different data stores and processing engines: Hadoop, Cassandra, HBase, Spark, Storm and Phoenix, to name a few. Data analysts looking to unearth insights within these data stores must move data back and forth between different systems and different data formats. As the number of new open source projects continues to grow geometrically, this data fragmentation is likely to get worse.

Apache Arrow is a new open-source project that helps data analysts wrestle diverse data sets into a single format. Apache Arrow is a collaborative effort that spans many of the largest providers and users of data infrastructure today, including Amazon (NASDAQ:AMZN), Cloudera (Private:CLOUD), Databricks, DataStax, Dremio, Hortonworks (NASDAQ:HDP), MapR, Salesforce.com (NYSE:CRM), Trifacta and Twitter (NYSE:TWTR). That so many different companies can collaborate on one initiative to improve data analysis industry-wide is a testament to the power of open source to inspire and engender great change.

I'm really excited about this project. I write many of the analyses for this blog in R, and I've seen this data fragmentation problem both firsthand and across many different companies. It's one of the major reasons Redpoint invested in Dremio: to solve fragmentation for data engineers. As Wes McKinney, author of Pandas, the most widely distributed Python data analysis toolkit, says, "Arrow will enable Python and R to become first class languages across the entire Big Data Stack."

Arrow promises data engineers three things. First, data engineers can access data across the many different stores within their infrastructure through Arrow's common format. Second, engineers' analyses will finish faster because Arrow takes advantage of innovations in CPUs to parallelize computation. Third, polyglot support: engineers can analyze data in whichever language they prefer, whether R, Python, Julia, JavaScript, C++, or Java. In short, faster data analysis in many different languages across more data.
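To make the first two promises a little more concrete, here is a minimal sketch of the idea behind a shared columnar layout, which is how Arrow organizes data in memory. It uses only the Python standard library and is not Arrow's API; the records, field names and the `to_columns` helper are hypothetical and exist only for illustration.

```python
from array import array

# Hypothetical sample records, shaped the way a row-oriented store
# might hand them back (one dict per row).
rows = [
    {"user_id": 1, "latency_ms": 12.5},
    {"user_id": 2, "latency_ms": 7.25},
    {"user_id": 3, "latency_ms": 20.0},
]


def to_columns(records):
    """Pivot row dicts into one contiguous, typed buffer per field."""
    return {
        "user_id": array("q", (r["user_id"] for r in records)),        # 64-bit ints
        "latency_ms": array("d", (r["latency_ms"] for r in records)),  # 64-bit floats
    }


columns = to_columns(rows)

# An aggregate over one field scans a single contiguous buffer
# instead of touching every field of every row.
total_latency = sum(columns["latency_ms"])
mean_latency = total_latency / len(columns["latency_ms"])

# memoryview exposes the same buffer without copying it, which hints at
# how tools that agree on a layout can share data rather than convert it.
view = memoryview(columns["latency_ms"])

print(columns["user_id"].tolist(), mean_latency, view.nbytes)
```

Each column here sits in one typed, contiguous block of memory, which is the kind of layout that lets a CPU process many values per instruction. Arrow goes much further by standardizing that layout across languages and engines.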

Apache Arrow is one of the fastest projects to attain Top Level Project status, a fact that underscores the need for the technology, the strength and breadth of the coalition to support it, and the potential to change the way data analysts work today.