Author: Doug Downing
I recently attended the Gartner Data & Analytics (D&A) Summit, a great event with lots of interesting sessions on the future of data and analytics.
The ideas and sessions I found most interesting covered data storage in DNA, cloud computing in outer space, D&A use cases, the growing importance of data to human trust and purpose, predictions about D&A infrastructure technology and usage, and a head-to-head comparison of three leading data visualization tools.
While this was all great information and experience, two points stood out to me:
First, Big Data keeps growing in size and complexity. Many datasets are now so large and complex that they cannot be easily managed, processed, or analyzed with traditional data processing tools and techniques. The sheer volume, velocity, and variety of data collected from different sources make it challenging to handle.
Second, the predominant approach to managing Big Data seems to be through data repositories, such as data warehouses and data lakes. These repositories involve moving data from disparate sources to a central storage area, where it can be accessed for value extraction.
However, these two points appear to be in conflict, and increasingly so.
Let me explain.
The sheer speed at which Big Data is being produced is staggering. For example, every minute in 2021, Facebook users shared 150,000 messages, Twitter users posted 456,000 tweets, YouTube users uploaded 500 hours of video content, and WhatsApp users exchanged 41.6 million messages. Furthermore, it is projected that by 2025 the total amount of data created and replicated worldwide will reach 181 zettabytes, the number of connected IoT devices worldwide will reach 30.9 billion, and the Big Data Healthcare market will reach $69.96 billion.
Data repositories have been seen as critical to managing Big Data. They offer benefits such as centralized data storage and organization, data integration for a unified enterprise view, data analytics and reporting capabilities, historical data preservation, scalability, performance, data governance, security, and data collaboration. In practice, however, implementing and maintaining data repositories can be expensive and time-consuming. Additionally, a significant percentage of data repository projects fail to meet expectations, and the speed of data ingestion into repositories has not kept up with the increasing speed at which new data arrives. Moreover, once data is copied from its source to a repository, the copy starts to go out of date as the source keeps changing, so it requires constant reconciliation.
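The staleness problem described above can be illustrated with a minimal Python sketch. Everything in it is hypothetical: the `source` and `repository` dictionaries stand in for a real source system and a real warehouse or lake, and the `drift` helper is an illustrative name, not part of any product.

```python
import copy

# Hypothetical source system: order id -> current status.
source = {"order-1": "pending", "order-2": "pending"}

# Batch load: the repository receives a point-in-time copy of the source.
repository = copy.deepcopy(source)

# The source keeps changing after the load has finished.
source["order-1"] = "shipped"
source["order-3"] = "pending"


def drift(source, repository):
    """Return source keys that are missing or different in the repository.

    Simplification: keys deleted from the source are not detected here.
    """
    return sorted(k for k in source if repository.get(k) != source[k])


stale = drift(source, repository)
print(stale)  # ['order-1', 'order-3']

# Reconciliation: the copy has to be brought back in line with the source,
# and the same work is needed again after the next change.
for key in stale:
    repository[key] = source[key]
print(drift(source, repository))  # []
```

The point of the sketch is only that a copy needs this reconciliation loop for as long as the source keeps changing; real ingestion pipelines are far more involved.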
Considering these challenges, relying on data repositories as the sole solution for managing Big Data seems ineffective.
While data repository technology is undoubtedly valuable, the saying often attributed to Albert Einstein, “The definition of insanity is doing the same thing over and over and expecting different results,” prompts us to question the sanity of persistently pursuing a repository-only strategy. Hence, alternative solutions need to be explored.
One such alternative is to access data directly from its original sources, an approach known as Data From Source. It involves connecting to multiple disparate sources, selecting the relevant data, transforming it on an in-memory platform, applying MDM (Master Data Management) and data governance oversight, and consuming the transformed data directly or through integrated tools or algorithms. Accessing data directly from source systems offers benefits such as real-time data availability, data integrity and consistency, simplified architecture, cost efficiency, granular control and flexibility, enhanced data security, and reduced latency.
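To make these steps concrete, here is a minimal Python sketch of the idea, not a description of any particular Data From Source product. It assumes two hypothetical source systems (a CRM and a billing system), each simulated with an in-memory SQLite database; the table names, the `master_key` rule, and the `customer_view` function are all invented for illustration.

```python
import sqlite3


def make_crm_source():
    """Hypothetical source system 1: a CRM database."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE customers (email TEXT, name TEXT, phone TEXT)")
    conn.executemany(
        "INSERT INTO customers VALUES (?, ?, ?)",
        [("Ann@Example.com", "Ann Lee", "555-0100"),
         ("bob@example.com", "Bob Roy", "555-0101")],
    )
    return conn


def make_billing_source():
    """Hypothetical source system 2: a billing database."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE invoices (customer_email TEXT, amount REAL)")
    conn.executemany(
        "INSERT INTO invoices VALUES (?, ?)",
        [("ann@example.com", 120.0), ("ann@example.com", 80.0),
         ("BOB@example.com", 50.0)],
    )
    return conn


def master_key(email):
    """Toy MDM rule (an assumption): one canonical key per customer."""
    return email.strip().lower()


def customer_view(crm, billing):
    """Build a unified customer view at query time, without a stored copy."""
    # 1. Access the sources and select only the relevant columns.
    #    Toy governance rule: the phone column is never read.
    customers = crm.execute("SELECT email, name FROM customers").fetchall()
    invoices = billing.execute(
        "SELECT customer_email, amount FROM invoices").fetchall()

    # 2. Transform in memory: total the invoices per master key.
    totals = {}
    for email, amount in invoices:
        key = master_key(email)
        totals[key] = totals.get(key, 0.0) + amount

    # 3. Hand the result to the consumer (a report, tool, or algorithm).
    view = [{"customer": master_key(email), "name": name,
             "total_billed": totals.get(master_key(email), 0.0)}
            for email, name in customers]
    return sorted(view, key=lambda row: row["customer"])


crm, billing = make_crm_source(), make_billing_source()
print(customer_view(crm, billing))

# A new invoice written at the source shows up on the next read;
# there is no separate load or reconciliation step.
billing.execute("INSERT INTO invoices VALUES (?, ?)", ("bob@example.com", 25.0))
print(customer_view(crm, billing))
```

Because nothing is copied into a central store, the second call reflects the new invoice immediately. The trade-off, as discussed next, is that every read now depends on the performance and availability of the source systems.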
Concerns do exist about the Data From Source approach, including performance, data quality and consistency, security and privacy, and dependencies on source systems. These concerns are being addressed through new technologies, integrated toolsets, data source connections that recognize and update changes, and different approaches to extracting value from Big Data.
To date, Data From Source has shown promising results compared to Data Repositories.
It is faster to implement and maintain, less expensive, and more efficient, and it provides higher-quality results because it works with real-time data. Given these benefits, it is surprising that more companies are not experimenting with and adopting this approach.
Big Data management cannot be solved by Data Repositories or Data From Source alone; instead, a combination of approaches and the continuous introduction of new technologies will shape the future of managing Big Data.