We have heard statements like “Data is the new oil” and “Data is the lifeblood of any business” for almost a decade now. “Big Data” was the buzzword not too long ago, followed by “IoT” and now AI/machine learning. Irrespective of the terminology, the fact is that data is the underlying factor that propels these emerging technologies and their business applications forward.
Back in 2019, the World Economic Forum predicted that the entire universe of data would grow to 44 ZB (1 zettabyte (ZB) = 1 million petabytes) by 2020. Rapid digitization, remote working and learning, and new applications that evolved during the pandemic catapulted that figure to roughly 94 ZB by the end of 2022. In other words, the world's data more than doubled in just 24 months.
AI and machine learning feed on large amounts of data to build and train models that deliver the inferences needed to drive meaningful outcomes. As adoption of AI continues to grow, the hunger for data only intensifies.
Related Blog: 10 Things You Should Know About Quantum Computing and AI
The Many Facets of Data
There used to be a time when data was almost always structured (i.e., tables, columns, forms, transactions, etc.) and often housed in databases and data warehouses, siloed across various functions within an organization or even within one of its branches. As the world of sensors, images, videos, voice and social media exploded over the past decade and a half, unstructured data (raw data captured as is, with no specified format) has become prevalent in many businesses. Organizations now have to cope with managing multiple data types and taming unstructured data, which is highly unpredictable. Additionally, new regulations and privacy laws (e.g., Europe’s GDPR, the California Consumer Privacy Act) further limit the use of the data that is collected and stored within an organization, not to mention the multiple copies of data used for various business applications.
More data also leads to strategic, governance and policy questions about who manages, stores, accesses and ultimately owns the data, and, importantly, for what purposes or analyses the data will be used. Poor data governance can also undermine compliance with regulatory requirements that may not have been foreseen when the data was first collected. Most mid-sized to large organizations today have data stewards such as CDOs (chief data or digital officers) in place to lead and be accountable for organizational data governance, certainly a far cry from the IT/CIO team that used to own all of this in the past.
There is No AI without Data
By its very definition, artificial intelligence doesn’t get created in a vacuum. It requires lots and lots of data for the algorithms to deliver a solution that is meaningful and more probabilistic than looking into a crystal ball. The foundational data feeding into AI models directly shapes the behavior and outcomes of those models. Models built on manipulated or inadvertently unattended data pose real risks when used for credit or loan decisioning, security bypass, benefits eligibility, medical imaging and diagnosis, product quality inspection, fraudulent insurance claims and the like.
The pandemic created an enormous challenge for most supply-chain pipelines as demand outstripped supply and transportation bottlenecks further caused material shortages causing price increases and inflation. None of this could have been predicted, no matter how much data organizations had in their repository. Since then, organizations have had sufficient data on consumer behavior to be able to reasonably predict demand, although snags at ports resulted in significant losses at major retailers.
AI is not a silver bullet for every use case. In the example above, however, multiple datasets spanning store inventory, point-of-sale (POS) analytics, store footfall, seasonal trends, shipment tracking, material and factory capacity, worker availability and more could be leveraged to build probabilistic AI models that smooth out the hiccups in the supply chain. Organizations are now looking to expand use cases beyond the supply chain into functional areas such as marketing, sales, finance and customer support, to name a few.
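As a minimal sketch of what blending such datasets can look like in practice, the toy example below combines two hypothetical feeds (weekly POS sales and a store footfall index) into a simple traffic-weighted demand forecast. All numbers, names and the forecasting rule are invented for illustration; a real pipeline would draw on far more sources and a proper forecasting model.

```python
# Hypothetical sketch: blending two datasets into a naive demand forecast.
# All data below is invented; real inputs would come from POS systems,
# shipment trackers and inventory databases.

weekly_sales = [120, 135, 128, 150, 160, 155]    # units sold per week (POS data)
footfall_index = [1.0, 1.1, 1.0, 1.2, 1.3, 1.2]  # store traffic vs. baseline

def forecast_next_week(sales, traffic, window=3):
    """Traffic-weighted moving average over the most recent weeks."""
    recent_sales = sales[-window:]
    recent_traffic = traffic[-window:]
    weighted = sum(s * t for s, t in zip(recent_sales, recent_traffic))
    return weighted / sum(recent_traffic)

demand = forecast_next_week(weekly_sales, footfall_index)
on_hand = 90  # current store inventory (hypothetical)
reorder = max(0, round(demand) - on_hand)
print(f"forecast={demand:.1f} units, reorder={reorder} units")
```

Even this crude blend shows the principle: each additional dataset (here, footfall) reweights the signal from the others, which is exactly what richer probabilistic models do at scale.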
Is Data the New Currency?
Since data is at the heart of generating insights and is often used for making business decisions, it is becoming more of a currency and a competitive differentiator for many organizations. Currency, by its very definition, is fluid and needs to be monitored through its cycles to ensure it doesn’t cause an imbalance within the market it operates in. While there are guidelines in place to manage currency in regulated environments, the same controls do not exist for non-regulated payment avenues or currency alternatives like crypto.
Of course, there are also many differences between data and currency: data carries relative, contextual valuations; it can be subject to simultaneous claims of ownership; it often lacks traceability; it requires repeated consents; more of it is not always better; and it is not (yet) a vehicle for loans, borrowing or leverage.
In some respects, data could also be considered the new "inventory” because of all the issues of spoilage, cost of warehousing, depreciation, liabilities, accounting, etc., that are starting to apply to data too.
Extrapolating any of these attributes to data within an enterprise is no small task, especially when there aren’t explicit guidelines on who owns the data, how to protect it and how to leverage it further for building models that drive business decisions. While industries like banking, health care, insurance, capital markets and government are often controlled by regulations on how data is handled, irrespective of the application, the same isn’t true for online e-commerce platforms or the retail, consumer and hospitality sectors. Of late, regulators like the Federal Trade Commission have cracked down on numerous companies in these industries to protect consumer rights and to set a precedent for other companies to follow.
The type of data also drives the associated privacy requirements. Machine data, which usually includes data logs, IT tickets and process anomalies, may not carry the same weight as human-generated data from body sensors, voice recordings, videos and images, and other biometric measures, which is often classified as personally identifiable information (PII).
Yes, You Need a Strategy for Data and AI
Before your organization embarks on the next AI project, you should start with a basic rubric of questions like:
- What data will you use?
- Do you have rights to use it?
- What sort of cleaning, processing and balancing does it require for your use case?
- Is the data sufficiently relevant and likely to contain the answers and value you seek?
- What data will you discard, and after how long?
- Will you use synthetic data to augment your current data environment?
- How will you measure data quality?
- What safeguards do you have in place to prevent model drift resulting from bad or biased data?
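To make the data-quality question in the checklist above concrete, here is a minimal sketch of automated checks for completeness, duplicate keys and out-of-range values. The record layout, field names and thresholds are hypothetical; real pipelines typically run checks like these against every batch before it reaches a training set.

```python
# Minimal data-quality checks: completeness, duplicates and range validity.
# Records, fields and the 0-120 age range are hypothetical examples.

records = [
    {"id": 1, "age": 34, "income": 52000},
    {"id": 2, "age": None, "income": 61000},  # missing value
    {"id": 2, "age": 29, "income": 61000},    # duplicate id
    {"id": 3, "age": 148, "income": 45000},   # out-of-range age
]

def quality_report(rows):
    """Summarize basic quality signals for a batch of records."""
    total = len(rows)
    missing = sum(1 for r in rows if any(v is None for v in r.values()))
    ids = [r["id"] for r in rows]
    duplicates = len(ids) - len(set(ids))
    bad_ages = sum(1 for r in rows
                   if r["age"] is not None and not 0 <= r["age"] <= 120)
    return {
        "completeness": 1 - missing / total,
        "duplicate_ids": duplicates,
        "out_of_range_ages": bad_ages,
    }

print(quality_report(records))
```

Tracking numbers like these over time is one practical way to answer "How will you measure data quality?" before a model ever sees the data.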
Andrew Ng (computer scientist, tech entrepreneur, co-founder of Google Brain, adjunct professor at Stanford and a leading AI evangelist) has floated the nascent idea of data-centric AI: to get the right results, you hold the model or code fixed and iteratively improve the quality of the data. This is a bold concept aimed at the well-known challenge that roughly 80% of an AI practitioner’s time is spent on data cleaning and wrangling before the model is even trained.
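The data-centric loop can be illustrated with a deliberately tiny toy example: the model (a midpoint-threshold classifier) stays fixed in form, and a single "data iteration" corrects a mislabeled training record, which alone improves holdout accuracy. Everything here is synthetic and invented for illustration; it is not Andrew Ng's actual methodology, just a sketch of the hold-the-model-fixed idea.

```python
# Toy data-centric AI loop: the model form never changes; only the labels do.
# All numbers are synthetic.

train = [(0.1, 0), (0.2, 0), (0.9, 1), (0.95, 1), (0.6, 0)]
# The (0.6, 0) record is deliberately mislabeled; it should be class 1.

def fit_threshold(data):
    """Fixed model form: threshold halfway between the two class means."""
    m0 = [x for x, y in data if y == 0]
    m1 = [x for x, y in data if y == 1]
    return (sum(m0) / len(m0) + sum(m1) / len(m1)) / 2

def accuracy(th, data):
    return sum((x > th) == bool(y) for x, y in data) / len(data)

holdout = [(0.1, 0), (0.3, 0), (0.55, 1), (0.9, 1)]

th_noisy = fit_threshold(train)
# One "data iteration": an audit flags the 0.6 record, so we relabel it.
cleaned = [(x, 1) if x == 0.6 else (x, y) for x, y in train]
th_clean = fit_threshold(cleaned)
print(accuracy(th_noisy, holdout), accuracy(th_clean, holdout))
```

The same code trained on cleaner labels scores higher on the holdout set, which is the essence of the data-centric argument: iterate on the data, not the model.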
Organizations will need to rethink how data is sourced to drive the most appropriate AI-driven business outcomes, coupled with innovative technologies to harness the data, while also ensuring they have the guardrails to govern the data lifecycle. This will truly differentiate the AI-driven organizations of the future!
Want to learn more about AI?
Join the Conversation in CompTIA's AI Technology Interest Group.