In the era of big data, gathering vast amounts of information from a diverse range of sources holds immense potential to drive business insights, enhance decision-making, and fuel innovation. However, the sheer volume and complexity of this data pose significant challenges in terms of storage, management, and, most importantly, utilization. Up to 85% of data repository projects fail to meet expectations (Casteleijn, 2021). Data warehouses, data lakes, data fabrics, and other data repositories can very quickly become data swamps.
A data swamp can be defined as a poorly managed data repository that has become cluttered, inaccessible, and unreliable. It is characterized by a lack of organization, inadequate data governance, and poor data quality. As well as wasting running costs, data swamps can pose serious threats to businesses. They can hinder a business’s ability to extract value from its data, leading to financial losses, reputational damage, and missed opportunities. Even worse, they can be a source of ‘hallucinations’ if a GenAI system is using them as its data source, resulting in the presentation of misinformation as strategic insights.
How Data Repositories Became Data Swamps
The data swamp risk is a natural progression of the ‘Copy Everything’ IT strategy. Like the ‘Integrate Everything’ strategy that went before it, what seems like a great idea sows the seeds of its own destruction.
When businesses tried to integrate everything using ERP systems such as SAP and Oracle, they found out the hard way that the law of diminishing returns applied. Some data costs more to integrate permanently, in real time, than the value it delivers. This, plus the need for businesses to customize their ERP systems, led to huge integration and upgrade costs and encouraged process-specific SaaS systems like Workday and Salesforce to appear.
A new way was needed.
To access data across legacy ERP systems, as well as the new SaaS solutions, we arrived at the ‘Copy Everything’ strategy. Starting with Data Warehouses, and then Data Fabrics and Data Lakes, the idea is that if all the data is in one big pool, then the business can combine it in any way it likes.
It is a good idea!
But a copy is by definition a duplicate that is always a little out of date. It also costs a lot to hold and maintain these seas of duplicate data. Quite often, different parts of the business create extra copies of the same data because they need to use it in different ways. For example, Finance may not know that Marketing is doing something similar, and unintentionally creates duplicate data sets and definitions. Over time, these exact and near duplicates accumulate in the Data Lake and are easily confused with one another, and someone has to pay to store and maintain them all.
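One way to spot the exact duplicates described above is to fingerprint each data set by its content. The following is a minimal sketch under stated assumptions, not a production tool: the data set names and the `dataset_fingerprint` and `find_duplicate_datasets` helpers are hypothetical, and each data set is assumed to fit in memory as a list of dictionaries. It catches exact copies regardless of row or column order, but not near duplicates whose definitions differ.

```python
import hashlib


def dataset_fingerprint(rows):
    """Return a content hash that ignores row order and column order."""
    # Normalize each row: sort its columns by name and stringify the values,
    # then sort the rows so that ordering differences do not matter.
    normalized = sorted(
        repr(sorted((str(key), str(value)) for key, value in row.items()))
        for row in rows
    )
    digest = hashlib.sha256()
    for line in normalized:
        digest.update(line.encode("utf-8"))
        digest.update(b"\n")
    return digest.hexdigest()


def find_duplicate_datasets(datasets):
    """Group data set names that share a fingerprint (exact copies only)."""
    groups = {}
    for name, rows in datasets.items():
        groups.setdefault(dataset_fingerprint(rows), []).append(name)
    return [sorted(names) for names in groups.values() if len(names) > 1]


# Hypothetical example: Finance and Marketing each hold a copy of the same data.
datasets = {
    "finance.customers": [{"id": 1, "name": "Acme"}, {"id": 2, "name": "Globex"}],
    "marketing.customers": [{"name": "Globex", "id": 2}, {"name": "Acme", "id": 1}],
    "sales.orders": [{"order_id": 10, "customer_id": 1}],
}
duplicates = find_duplicate_datasets(datasets)
print(duplicates)
```

Near duplicates, such as the same customers held under different column names or definitions, would need fuzzier matching and, realistically, a conversation between the teams that own them.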
Diving Deep Into Data Swamps, and How They Happen…
Let’s look at the factors contributing to data swamps:
- Lack of Data Governance: Data governance establishes policies and procedures for managing data throughout its lifecycle, from inception to disposal. Without proper data governance, data can become disorganized, duplicated, and inconsistent, making it difficult to trust and utilize.
- Inadequate Data Quality: Data quality refers to the accuracy, completeness, and consistency of data. Poor data quality can lead to unreliable insights, erroneous decisions, and a diminished understanding of business operations.
- Limited Data Accessibility: Data accessibility refers to the ease with which users can find and retrieve the data they need. In data swamps, data is often hidden in obscure locations, lacks documentation, and is difficult to query, hindering its utilization.
- Unrealistic Data Retention Policies: Data retention policies define how long data should be stored. Excessive data retention can lead to data proliferation, increasing storage costs and making it challenging to identify and manage relevant data.
- Inadequate Data Maintenance: Data maintenance involves cleaning, validating, and transforming data to ensure its quality and usability. Neglecting data maintenance can lead to data degradation and hinder its effective utilization.
Life Is Not Pleasant in the Data Swamp
The consequences of ending up in a Data Swamp are pretty ugly, and can include:
- Increased Costs: Data swamps incur significant storage, maintenance, and management costs. The unorganized nature of data in swamps makes it difficult to prune obsolete or irrelevant data, leading to unnecessary expenses.
- Poor Decision-Making: Unreliable and inaccessible data can lead to misguided decisions, hindering business growth and performance. Erroneous insights derived from poor data quality can result in missed opportunities and strategic missteps.
- Reputational Damage: Data swamps can pose security risks, making organizations vulnerable to data breaches and reputational damage. Inaccurate or misleading data can also erode customer trust and diminish brand value.
- Reduced Employee Productivity: Data swamps hinder employees’ ability to find and utilize the data they need, leading to wasted time and reduced productivity. The frustration of dealing with data swamps can also lead to employee dissatisfaction and attrition.
Draining the Swamp is Hard – You’re Better off Avoiding It
So what can be done to prevent the pristine Data Lake from silting up to become a Data Swamp? There are some important hygiene actions:
- Implement Data Governance: Establish clear data governance policies and procedures to ensure data quality, consistency, and accessibility. Define data ownership, data classification, and data access rules.
- Enforce Data Quality Standards: Define data quality metrics and implement processes to monitor and maintain data quality. Utilize data cleansing tools and techniques to identify and correct data errors and inconsistencies.
- Optimize Data Storage: Regularly review data retention policies and prune obsolete or irrelevant data to reduce storage costs and improve data accessibility. Implement efficient data compression techniques to optimize storage utilization.
- Enhance Data Accessibility: Create comprehensive data catalogs and documentation to make data easily discoverable and accessible to users. Implement data discovery tools and search functionalities to facilitate data retrieval.
- Promote Data Literacy: Educate employees on data governance principles, data quality standards, and data access procedures. Foster a data-driven culture within the organization to encourage responsible data usage.
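Two of the hygiene actions above, monitoring a data quality metric and pruning data past its retention period, can be sketched in a few lines. This is an illustrative sketch only: the `completeness` and `expired` helpers, the field names, and the three-year retention period are all assumptions made for the example, not recommendations.

```python
from datetime import date, timedelta


def completeness(rows, required_fields):
    """Fraction of required field values that are present (not None or empty)."""
    total = len(rows) * len(required_fields)
    if total == 0:
        return 1.0
    filled = sum(
        1
        for row in rows
        for field in required_fields
        if row.get(field) not in (None, "")
    )
    return filled / total


def expired(rows, today, retention_days, date_field="last_updated"):
    """Return the rows whose date field is older than the retention cutoff."""
    cutoff = today - timedelta(days=retention_days)
    return [row for row in rows if row[date_field] < cutoff]


# Hypothetical records; field names are assumptions for illustration.
rows = [
    {"id": 1, "email": "a@example.com", "last_updated": date(2024, 1, 10)},
    {"id": 2, "email": "", "last_updated": date(2020, 3, 5)},
    {"id": 3, "email": None, "last_updated": date(2023, 11, 20)},
    {"id": 4, "email": "d@example.com", "last_updated": date(2019, 7, 1)},
]

score = completeness(rows, ["id", "email"])
stale = expired(rows, today=date(2024, 6, 1), retention_days=3 * 365)

print(f"completeness: {score:.2f}")
print("candidates for pruning:", [row["id"] for row in stale])
```

In practice a metric like this would be tracked over time against an agreed threshold, and anything flagged as stale would be reviewed by the data owner before deletion rather than removed automatically.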
Avoiding Data Swamps by Avoiding Data Lakes
But is there a better way? How can we get out of the swamp to the open water of reliable live data that we can access from anywhere without letting it stagnate?
Yes, Data from the Source…
Data from Source Technologies give access to existing data sources, without rigid and expensive integration, and without creating data copies in repositories. Direct data access simplifies the data landscape, making it easier to manage, maintain, and navigate. This streamlined approach reduces the burden on IT teams and allows businesses to focus on extracting insights from their data instead of managing data. With direct data access, using a Virtual Data Layer, businesses can eliminate the unnecessary costs associated with data duplication, preparation, and storage. By accessing only the data they need, businesses can optimize their data infrastructure and save valuable resources.
The key benefit of ‘Data from Source’ tools is reduced latency. This has four dimensions:
- The data used is up-to-date live data.
- Bypassing the need for the Extract Transform Load [ETL] of traditional data replication gets to solutions faster.
- This speed and ease of access in turn enables solutions to be built iteratively and quickly.
- Operating costs are minimal, because significant storage and maintenance costs are avoided.
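The idea of a Virtual Data Layer that reads live data without copying it can be illustrated with a small sketch. Everything here is an assumption made for illustration: the `VirtualDataLayer` class is hypothetical and does not represent any particular product, and two in-memory SQLite databases stand in for real source systems such as a CRM and a billing platform.

```python
import sqlite3


class VirtualDataLayer:
    """Hypothetical facade that queries registered sources on demand."""

    def __init__(self):
        self._sources = {}

    def register(self, name, fetch):
        # 'fetch' is a callable that reads from the source system when called.
        self._sources[name] = fetch

    def query(self, name, **params):
        # Nothing is cached or copied: every call goes back to the source.
        return self._sources[name](**params)


# Stand-ins for two separate systems of record.
crm = sqlite3.connect(":memory:")
crm.execute("CREATE TABLE customers (id INTEGER, name TEXT)")
crm.executemany("INSERT INTO customers VALUES (?, ?)", [(1, "Acme"), (2, "Globex")])

billing = sqlite3.connect(":memory:")
billing.execute("CREATE TABLE invoices (customer_id INTEGER, amount REAL)")
billing.executemany(
    "INSERT INTO invoices VALUES (?, ?)", [(1, 100.0), (1, 50.0), (2, 75.0)]
)


def fetch_customers():
    return crm.execute("SELECT id, name FROM customers").fetchall()


def fetch_invoice_total(customer_id):
    row = billing.execute(
        "SELECT COALESCE(SUM(amount), 0) FROM invoices WHERE customer_id = ?",
        (customer_id,),
    ).fetchone()
    return row[0]


layer = VirtualDataLayer()
layer.register("customers", fetch_customers)
layer.register("invoice_total", fetch_invoice_total)


def revenue_by_customer(data_layer):
    """Combine data from both sources at query time, without storing a copy."""
    return {
        name: data_layer.query("invoice_total", customer_id=customer_id)
        for customer_id, name in data_layer.query("customers")
    }


before = revenue_by_customer(layer)
billing.execute("INSERT INTO invoices VALUES (?, ?)", (2, 25.0))  # source changes
after = revenue_by_customer(layer)
print(before)
print(after)
```

Because each query goes back to the source, the second result reflects the new invoice immediately, with no ETL run in between. The same property is why the sketch also hints at the trade-off: every query adds load to the source systems.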
However, ‘Data from Source’ is still quite a new approach. The tools are still in development, and the approach raises questions in the CTO/CIO community, often about security and the performance impact on source systems. Even so, early adopters are seeing data analytics project times improve by a factor of five. Considering this, Data from Source is an important approach that can enable early adopters to leapfrog the data swamp.
Conclusion – Don’t let your Data Lake stagnate. Try Going Around it Instead of Through it
Data repositories such as data warehouses, data lakes, and data fabrics are potentially valuable assets for organizations that need to store and manage vast amounts of data. However, if not properly managed, these repositories can quickly become data swamps, hindering data utilization and posing significant risks to businesses. In the first instance, it is important to implement data governance, enforce data quality standards, optimize data storage, enhance data accessibility, and promote data literacy. That way, organizations can hold off the stagnation of their data repositories and avoid data swamps. They can also move on to the next generation of Data from Source tools and leave the data lakes behind long before they become swamps.