As the CEO of a data science consulting company, I’ve noticed that many organizations fall short in effectively managing their data. A lot of this has to do with their systems: even when those systems are not technically flawed, they still fail to solve business problems and drive profitable decisions.
Companies in the initial stages of data maturity are more susceptible to the risks of bad data engineering, but even fully transformed, data-driven companies can run into trouble, often because of a lack of meaningful conversation between data engineers and stakeholders. Data for data’s sake is not what I would call insight, and failing to rectify this can lead to worse problems. To learn more about fixing data disorganization specifically, read this.
But the real urgency of these problems lies in our increasing dependence on AI and machine learning for data-driven decision-making. Business leaders should remember that the success of AI models hinges on the quality of the data fed to them, quality that comes from strong, smart, and scalable data engineering.
The Risks of Inadequate Data Engineering Today
The risks of bad data engineering are many, and the rise of generative AI has amplified them. Unity Technologies’ Q1 2022 report described a situation in which a large volume of bad customer data diminished the accuracy of the training sets for its predictive algorithms, leading to a loss of around $110 million. More recently, an IEEE paper found that automated data cleaning in machine learning-based systems can negatively impact the fairness of predictions affecting historically disadvantaged groups. Data engineering needs more attention and governance than ever. Here are the specifics:
It raises compliance and security issues
Inadequate data engineering can expose organizations to significant penalties and reputational damage. Data breaches are a common example, but non-compliance with regulations such as GDPR and HIPAA is just as dangerous. Facebook learned this the hard way in 2019, when the Federal Trade Commission fined it $5 billion for failing to protect users’ data.
Security risks have only grown over the last five years, as AI’s ability to extract personally identifiable information (PII) from datasets raises new ethical and legal concerns. To mitigate this, business leaders and data engineers should work together to implement comprehensive data governance frameworks that include data access controls, encryption, and regular security audits, while adhering to relevant industry standards.
These frameworks should encompass rigorous role-based access controls, encryption of data both at rest and in transit, a well-defined incident response plan, employee training on maintaining security and identifying threats, and data loss prevention (DLP) solutions that keep sensitive data from being accidentally or maliciously leaked.
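For illustration, here is a minimal Python sketch of two of those controls, a role-based access check and field-level encryption at rest, built on the open-source cryptography package. The roles, fields, and policy are hypothetical; a real deployment would load keys from a secrets manager and enforce policy at the platform level rather than in application code.

```python
# Illustrative sketch of two controls mentioned above: a role-based access
# check and field-level encryption at rest, using the `cryptography` package.
# The roles, fields, and policy below are hypothetical examples.
from cryptography.fernet import Fernet

# Hypothetical policy: which roles may read which customer fields.
ACCESS_POLICY = {
    "analyst": {"customer_id", "region", "lifetime_value"},
    "support": {"customer_id", "email"},
}

def can_read(role: str, field: str) -> bool:
    """Allow a read only if the role is explicitly granted the field."""
    return field in ACCESS_POLICY.get(role, set())

# Encrypt sensitive values before they are written to storage ("at rest").
key = Fernet.generate_key()   # in practice, load this from a secrets manager
cipher = Fernet(key)
encrypted_email = cipher.encrypt(b"jane.doe@example.com")

print(can_read("analyst", "email"))              # False: no PII for analysts
print(cipher.decrypt(encrypted_email).decode())  # readable only with the key
```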
It affects decision making
According to Gartner, poor data quality costs organizations an average of almost $13 million every year. This comprises data that is outdated, inaccurate, or incomplete; as a result, poor data engineering leads to misinformed strategies that can cost organizations millions. If generative AI systems are trained on or utilize flawed data, the outputs can be misleading or entirely incorrect, leading to erroneous business decisions and operational inefficiencies.
Poor data engineering can cause this without the organization even realizing it. Say a streaming service wants to improve retention and asks its data engineers to build a dashboard for monitoring watch time and churn. The data suggests a correlation between long watch times and retention, so managers conclude that adding content will boost retention. While this looks like a data-driven insight, it is an oversimplified conclusion that neglects binge-watching, viewer fatigue, and other factors. Investing in more content could therefore cost the streaming service more money while driving away customers.
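To make the trap concrete, here is a hypothetical pandas sketch of that scenario; the subscriber numbers and column names are invented. The aggregate view appears to support the “more content” conclusion, while a simple segmentation by binge behavior exposes the fatigue effect it hides.

```python
# Hypothetical illustration of the streaming-service example above: the toy
# numbers and column names are invented, not real subscriber data.
import pandas as pd

df = pd.DataFrame({
    "watch_hours":    [1, 2, 3, 10, 12, 14, 16, 18, 20, 38],
    "binge_sessions": [0, 0, 0,  1,  1,  1,  2,  2,  3,  9],
    "retained":       [0, 0, 0,  1,  1,  1,  1,  1,  1,  0],
})

# The dashboard view: retained subscribers watch more on average, which is
# what drives the "more content means more retention" conclusion.
print(df.groupby("retained")["watch_hours"].mean())

# Segmenting by binge behavior tells a different story: the heaviest
# binge-watcher churns despite very high watch time (viewer fatigue).
print(df[df["binge_sessions"] >= 5][["watch_hours", "retained"]])
```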
Ensuring good data quality is a significant part of your data engineers’ job, but they will miss the mark if they receive incomplete requirements, little business context, and misaligned objectives. Having a chief data officer helps here because they are best placed to translate business needs into technical requirements. Brownie points, of course, if they have experience with artificial intelligence.
It hinders organizational growth
When bad data engineering leads to bad data quality, organizations lose critical dollars needed for growth. A 2019 marketing survey by Forrester Consulting found that poor data quality wasted over 20% of a company’s media spend. Additionally, as your organization grows, the need for scalable data infrastructure becomes critical for success. Without this, you run the risk of experiencing performance bottlenecks and system failures.
By 2015, Twitter (now X) had grown to a point where its existing infrastructure was no longer sufficient, and the complexity of its network had become a major hurdle, impacting performance and reliability. The company recognized that its traffic growth was outpacing its handling capacity. To address this, Twitter adopted a strategy of incremental capacity additions rather than disruptive, wholesale replacements. This approach enabled it to scale its infrastructure efficiently and respond more effectively to the demands of a growing user base.
To future-proof your data infrastructure, you must adopt scalable, fault-tolerant, and reliable data engineering solutions. This may involve distributed computing frameworks, cloud-based storage, and containerization technologies that can easily scale to meet growing data demands. I also recommend monitoring two useful metrics: Recovery Point Objective (RPO) and Recovery Time Objective (RTO), the maximum tolerable data loss and system downtime, respectively. Establishing clear RPO and RTO targets helps your data team protect against data loss and minimize disruptions to organizational growth.
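As a rough illustration of how a data team might encode those two metrics, here is a minimal Python sketch that defines hypothetical RPO and RTO targets and checks them against backup and recovery timestamps. The 15-minute and 1-hour thresholds and the alerting step are assumptions for the example, not recommendations.

```python
# Illustrative sketch of RPO/RTO targets expressed as code. The thresholds,
# timestamps, and alerting step are hypothetical examples, not prescriptions.
from datetime import datetime, timedelta, timezone

RPO = timedelta(minutes=15)  # maximum tolerable data loss
RTO = timedelta(hours=1)     # maximum tolerable downtime

def meets_rpo(last_successful_backup: datetime) -> bool:
    """True if the newest backup is recent enough to satisfy the RPO."""
    return datetime.now(timezone.utc) - last_successful_backup <= RPO

def meets_rto(outage_start: datetime, service_restored: datetime) -> bool:
    """True if a recovery drill (or a real incident) stayed within the RTO."""
    return service_restored - outage_start <= RTO

# Example: a scheduler could run this check every few minutes and alert
# the on-call engineer whenever the RPO is breached.
last_backup = datetime.now(timezone.utc) - timedelta(minutes=42)
if not meets_rpo(last_backup):
    print("RPO breached: newest successful backup is older than 15 minutes")
```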
It degrades customer experience
As mentioned before, scalable data engineering plays a major role in customer experience. In 2022, for example, Netflix described slowness during peak viewing times and identified the need for a more scalable data workflow orchestrator, which it built in Maestro.
An enormous amount of customer data is collected every second; by IBM’s estimates, the world generates exabytes of new data every day. That volume cannot be managed through manual processing and requires smart data engineering. When I say smart, I don’t mean AI. Generative AI has the potential to revolutionize customer experience, but only if it is trained on high-quality, representative data. Poor data can lead to AI models generating irrelevant or even offensive content, damaging customer trust.
When customer data is fragmented, outdated, or inaccurate, it leads to ineffective personalization, underperforming marketing campaigns, and unsatisfactory customer service. If a company misidentifies a customer’s preferences because of outdated information, it may send promotions for products the customer has no interest in, leading to disengagement. On the other hand, even abundant, accurate data can cause problems when used carelessly, as in Target’s pregnancy prediction scandal, which reiterates my point that poor data engineering is not always just a technical issue but an executive-level decision-making issue.
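Returning to the fragmented and outdated data problem, here is a small, hypothetical pandas sketch of the kind of hygiene that helps: it deduplicates a toy customer table and flags preference data that has gone stale. The records, column names, and 180-day freshness window are all invented for illustration.

```python
# Hypothetical sketch of basic hygiene on a customer table before it feeds
# personalization: the records, columns, and 180-day window are invented.
import pandas as pd

customers = pd.DataFrame({
    "customer_id": [101, 101, 102, 103],
    "email": ["a@example.com", "a@example.com", "b@example.com", "c@example.com"],
    "preferences_updated": pd.to_datetime(
        ["2025-06-01", "2023-01-15", "2025-05-20", "2022-11-02"]
    ),
})

# Keep one record per customer, preferring the most recently updated one.
latest = (customers
          .sort_values("preferences_updated")
          .drop_duplicates(subset="customer_id", keep="last"))

# Flag preference data that has not been refreshed in the last 180 days;
# stale rows should not drive promotions until they are re-validated.
cutoff = pd.Timestamp.now() - pd.Timedelta(days=180)
latest["stale"] = latest["preferences_updated"] < cutoff
print(latest)
```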
Conclusion
Poor data engineering is no longer a mere inconvenience; it is a strategic threat that undermines decision-making, erodes customer trust, and exposes organizations to significant financial and reputational risks. “Insights” derived from improperly developed systems can lead to both wasted costs and missed revenue opportunities, hindering growth.
Thus, to thrive in a data-driven landscape shaped by generative AI, businesses must prioritize data engineering as a core competency. By investing in robust data infrastructure, cultivating data-literate teams, and giving data professionals a seat at the table, organizations can unlock the full potential of their data and reach greater milestones for years to come.