Analysts estimate that by 2025, 30% of generated data will be real-time data. That is 52 zettabytes (ZB) of real-time data per year, roughly the total amount of data produced in 2020. Because data volumes have grown so rapidly, 52 ZB is also three times the total amount of data produced in 2015. With this exponential growth, it is clear that conquering real-time data is the future of data science.
Over the past decade, technologies such as Materialize, Deephaven, Kafka and Redpanda have been developed to work with these streams of real-time data. They can transform, transmit and persist data streams on the fly and provide the basic building blocks needed to create applications for the new real-time reality. But to make such enormous volumes of data really useful, artificial intelligence (AI) is needed.
To keep up with the tidal wave of real-time data, enterprises need insightful technology that can create knowledge and understanding with minimal human intervention. The idea of applying AI algorithms to real-time data is still in its infancy. Specialized hedge funds and big-name AI players such as Google and Facebook use real-time AI, but few others have entered these waters.
To make real-time AI ubiquitous, supportive software must be developed. This software needs to provide:
- An easy way to transition from static to dynamic data
- An easy way to clean static and dynamic data
- An easy way to move from model creation and validation to production
- An easy way to manage the software as requirements, and the outside world, change
An easy way to transition from static to dynamic data
Developers and data scientists want to spend their time thinking about important AI problems, not worrying about time-consuming data plumbing. A data scientist should not have to care whether the data is a static table from pandas or a dynamic table from Kafka. Both are tables and should be treated the same way. Unfortunately, most current-generation systems treat static and dynamic data differently. The data is obtained differently, queried differently and used differently. This makes the transition from research to production costly and labor-intensive.
To get real value out of real-time AI, developers and data scientists need to be able to move seamlessly between static data and dynamic data within the same software environment. This requires common APIs and a framework that can process both static and real-time data in a UX-consistent way.
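As a rough illustration of what "treated the same way" could look like, the sketch below runs one piece of analysis logic over a static table and over a simulated stream. Everything in it is an assumption made for illustration: `static_source`, `streaming_source` and `running_average` are hypothetical names, and a Python generator stands in for a real streaming source such as a Kafka topic.

```python
from typing import Any, Dict, Iterable, Iterator, List

Row = Dict[str, Any]


def static_source() -> List[Row]:
    """Hypothetical static table: every row is available up front."""
    return [{"symbol": "A", "price": 10.0}, {"symbol": "B", "price": 20.0}]


def streaming_source() -> Iterator[Row]:
    """Hypothetical stream: rows arrive one at a time (simulated by a generator)."""
    for row in [{"symbol": "A", "price": 11.0}, {"symbol": "B", "price": 19.0}]:
        yield row


def running_average(rows: Iterable[Row]) -> Iterator[float]:
    """Yield the running average price after each row, whatever the row source is."""
    total = 0.0
    for count, row in enumerate(rows, start=1):
        total += row["price"]
        yield total / count


# The same function is applied unchanged to both kinds of data.
static_result = list(running_average(static_source()))
stream_result = list(running_average(streaming_source()))
```

The point of the design is that `running_average` only depends on "something that yields rows", so moving from research data to live data does not require rewriting the logic.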
An easy way to clean static and dynamic data
The most exciting work for AI engineers and data scientists is creating new models. Unfortunately, most of an AI engineer's or data scientist's time is spent being a data janitor. Datasets are inevitably dirty and must be cleaned and massaged into the right form. This is thankless and time-consuming work. With the rapidly growing flood of real-time data, this whole process must take less human labor and must work on both static and streaming data.
In practice, easy data cleaning means having a concise, powerful and expressive way to perform common data cleaning operations that works on both static and dynamic data. These operations include removing bad data, filling in missing values, merging multiple data sources and converting data formats.
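The four operations listed above can be sketched in a single cleaning function that is written once and applied to both a static list and an iterator that simulates a stream. This is a minimal sketch under stated assumptions, not any particular product's API: the field names, the `SECTORS` lookup table and the rule that a non-positive price is "bad data" are all hypothetical.

```python
from datetime import datetime
from typing import Any, Dict, Iterable, Iterator

Row = Dict[str, Any]

# Hypothetical second data source to merge in: symbol -> sector.
SECTORS = {"A": "tech", "B": "energy"}


def clean(rows: Iterable[Row], sectors: Dict[str, str], default_volume: int = 0) -> Iterator[Row]:
    """Clean rows one at a time, so the logic works for static and streaming input."""
    for row in rows:
        price = row.get("price")
        # Remove bad data: skip rows whose price is missing or non-positive (assumed rule).
        if price is None or price <= 0:
            continue
        volume = row.get("volume")
        yield {
            "symbol": row["symbol"],
            "price": float(price),
            # Fill in missing values with a default.
            "volume": default_volume if volume is None else volume,
            # Convert formats: ISO 8601 string to a datetime object.
            "timestamp": datetime.fromisoformat(row["timestamp"]),
            # Merge a second data source via a lookup.
            "sector": sectors.get(row["symbol"], "unknown"),
        }


raw = [
    {"symbol": "A", "price": 10.0, "volume": None, "timestamp": "2022-01-03T09:30:00"},
    {"symbol": "B", "price": -1.0, "volume": 5, "timestamp": "2022-01-03T09:30:01"},
    {"symbol": "C", "price": 7, "volume": 3, "timestamp": "2022-01-03T09:30:02"},
]

static_clean = list(clean(raw, SECTORS))        # static table
stream_clean = list(clean(iter(raw), SECTORS))  # iterator standing in for a stream
```

Because `clean` never needs the whole dataset at once, the identical function can serve research on historical data and production on live data.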
Currently, a few technologies allow users to implement data cleaning and manipulation logic just once and use it for both static and real-time data. Both Materialize and ksqlDB allow SQL queries of Kafka streams. These options are good choices for use cases with relatively simple logic or for SQL developers. Deephaven has a table-oriented query language that supports Kafka, Parquet, CSV and other common data formats. This kind of query language is suited to more complex and more mathematical logic, or to Python developers.
An easy way to move from model creation and validation to production
Many, perhaps even most, new AI models never make it from research to production. This hold-up occurs because research and production are usually carried out in very different software environments. The research environment is geared toward working with large static datasets, model calibration and model validation. The production environment, on the other hand, makes predictions on new events as they arrive. To increase the fraction of AI models that affect the world, the steps from research to production must be extremely simple.
Consider an ideal scenario. First, static and real-time data would be accessed and manipulated through the same API. This provides a consistent platform for building applications on static and/or real-time data. Second, data cleaning and manipulation logic would be implemented once and used in both static research and dynamic production cases. Duplicating this logic is costly, and it makes research and production more likely to diverge in unexpected and consequential ways. Third, AI models would be easy to serialize and deserialize. This allows production models to be switched out simply by changing a file path or URL. Finally, the system would make it easy to monitor, in real time, how well production AI models are performing in the wild.
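The third point, swapping a production model by changing a file path, can be sketched as follows. This is only an illustration under assumptions: `ThresholdModel`, `save_model` and `load_model` are hypothetical names, the "model" is a toy threshold rule, and JSON is used purely to keep the sketch dependency-free (real models would normally use their framework's own serialization format).

```python
import json
import tempfile
from pathlib import Path


class ThresholdModel:
    """Hypothetical stand-in for a trained model: flags values above a threshold."""

    def __init__(self, threshold: float) -> None:
        self.threshold = threshold

    def predict(self, value: float) -> bool:
        return value > self.threshold


def save_model(model: ThresholdModel, path: Path) -> None:
    """Serialize the model's parameters to a file."""
    path.write_text(json.dumps({"threshold": model.threshold}))


def load_model(path: Path) -> ThresholdModel:
    """Deserialize a model from a file."""
    params = json.loads(path.read_text())
    return ThresholdModel(params["threshold"])


# Research produces two model versions and writes them to disk.
model_dir = Path(tempfile.mkdtemp())
v1_path = model_dir / "model_v1.json"
v2_path = model_dir / "model_v2.json"
save_model(ThresholdModel(10.0), v1_path)
save_model(ThresholdModel(20.0), v2_path)

# Production only knows a path; switching models means pointing at a different file.
production_model = load_model(v1_path)
before = production_model.predict(15.0)

production_model = load_model(v2_path)
after = production_model.predict(15.0)
```

In a real deployment the path would typically come from configuration rather than code, so a new model can be rolled out, or rolled back, without touching the application.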
An easy way to manage the software as requirements, and the outside world, change
Change is inevitable, especially when dealing with dynamic data. In data systems, these changes can affect input data sources, requirements, team members and more. No matter how carefully a project is planned, it will be forced to adapt over time. Often this adaptation never happens. Accumulated technical debt and knowledge lost through staffing changes kill these efforts.
To handle a changing world, real-time AI infrastructure must make all phases of a project (from training to validation to production) understandable and modifiable by a very small team. And not just by the original team: it should also be understandable and modifiable by the new individuals who inherit existing production applications.
As the tidal wave of real-time data strikes, we will see significant innovations in real-time AI. Real-time AI will move beyond the Googles and Facebooks of the world and into the toolkit of all AI engineers. We will get better answers, faster and with less work. Engineers and data scientists will be able to spend more time focusing on interesting and important real-time solutions. Businesses will get higher-quality, timely answers from fewer employees, easing the challenge of hiring AI talent.
When we have software tools that make these four requirements easy to meet, we will finally be able to get real-time AI right.
Chip Kent is the leading data scientist at Deephaven Data Labs.