Discovering, accessing, and incorporating new datasets for use in data analytics, data science, and other data pipeline tasks is usually a slow process in large, complex organizations. Such organizations typically have hundreds of thousands of datasets actively managed internally in various data stores, and can access an order of magnitude more external datasets. Finding just the data relevant to a particular task is an overwhelming undertaking.
Even once the relevant data has been identified, it can take several months in practice to get through the approval, governance, and staging processes required before that data can actually be used. This is often a huge obstacle to organizational agility. Instead of being encouraged to use a wide range of datasets in their analyses, data scientists and analysts are forced to rely on pre-approved, pre-staged data found in centralized repositories such as data warehouses.
Furthermore, once data from new datasets becomes available for use in analytical tasks, the fact that it comes from different data sources generally means it has different data semantics, which makes these datasets challenging to integrate. For example, they may refer to the same real-world entities using different identifiers than existing datasets, or associate different attributes (and attribute types) with the real-world entities modeled in existing datasets. In addition, data about those entities is likely to have been collected in a different context than in existing datasets. Semantic differences across datasets make it difficult to integrate them into the same analytical task, reducing the ability to obtain a holistic view of the data.
Addressing the challenges of data integration
However, despite all these challenges, it is important for an organization's data analysts and scientists to succeed at these data discovery, integration, and staging tasks. Today this is usually accomplished through significant human effort: partly by the analysts themselves, but mostly by centralized teams, particularly for data integration, cleaning, and staging. The problem, of course, is that centralized teams become organizational bottlenecks, further hindering agility. The current status quo is acceptable to no one, and many proposals have emerged to solve this problem.
Two of the most prominent proposals are the “data fabric” and the “data mesh.” Rather than giving an overview of these ideas, this article focuses specifically on how the data fabric and the data mesh approach the problem of data integration, and on the challenge of overcoming reliance on an enterprise-wide central team to perform it.
Let’s take the example of an American car manufacturer that acquires another car manufacturer in Europe. The American manufacturer maintains a database of parts, with information on all the different parts needed for car production: supplier, price, warranty, inventory, and so on. This data is stored in a relational database, e.g., PostgreSQL. The European carmaker also maintains a parts database, stored as JSON documents in MongoDB. Obviously, integrating these two datasets would be very valuable, since it is much easier to work with a single parts database than with two separate ones, but there are many challenges. The datasets are stored in different formats (relational vs. nested) by different systems, and they use different terms and identifiers, as well as different units for various data attributes (e.g., feet vs. meters, dollars vs. euros). This integration is a lot of work, and if done by an enterprise-wide central team, it can take years to complete.
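To make the format and unit mismatch concrete, here is a minimal sketch in Python. All field names, identifiers, and the exchange rate are invented for illustration; the point is only that the nested European document must be flattened and its units converted before the two records are comparable.

```python
# Hypothetical American record: a flat relational row (PostgreSQL), feet/dollars.
us_part = {
    "part_id": "US-4417",
    "supplier": "Acme Radiators",
    "price_usd": 120.0,
    "length_ft": 2.5,
}

# Hypothetical European record: a nested JSON document (MongoDB), meters/euros.
eu_part = {
    "_id": "EU-9921",
    "supplier": {"name": "Acme Radiators GmbH", "address": "Berlin"},
    "price_eur": 110.0,
    "dimensions": {"length_m": 0.76},
}

FT_PER_M = 3.28084
EUR_TO_USD = 1.08  # assumed fixed rate, purely for illustration

def normalize_eu(doc):
    """Flatten the nested document and convert units to the US conventions."""
    return {
        "part_id": doc["_id"],
        "supplier": doc["supplier"]["name"],
        "price_usd": round(doc["price_eur"] * EUR_TO_USD, 2),
        "length_ft": round(doc["dimensions"]["length_m"] * FT_PER_M, 2),
    }
```

Even after this mechanical normalization, the two records still use different identifiers and supplier names for what may be the same real-world entities, which is exactly the harder, semantic part of the integration problem.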
Automation with the data fabric approach
The data fabric approach seeks to automate as much of the integration process as possible, minimizing human effort. For example, it uses machine learning (ML) techniques to detect overlap in the attributes of datasets (e.g., both contain supplier and warranty information) and in their values (e.g., many suppliers in one dataset also appear in the other) in order to flag these two datasets as candidates for integration in the first place.
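A simple version of this value-overlap signal can be sketched without any ML at all, using Jaccard similarity between the value sets of two columns. The supplier lists and the threshold below are invented for illustration; a real data fabric would combine many such signals.

```python
def jaccard(a, b):
    """Jaccard similarity between two collections of values."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if a | b else 0.0

# Hypothetical supplier columns drawn from each parts dataset.
us_suppliers = ["Acme Radiators", "Bolt & Co", "Gears Ltd"]
eu_suppliers = ["Acme Radiators", "Gears Ltd", "Bremsen AG"]

# Flag column pairs whose value overlap exceeds a threshold as
# candidates for integration, to be confirmed later.
THRESHOLD = 0.4
score = jaccard(us_suppliers, eu_suppliers)
if score >= THRESHOLD:
    print(f"candidate match: supplier columns overlap {score:.2f}")
```

Here the two supplier columns share two of four distinct values, giving a similarity of 0.5 and flagging the pair as an integration candidate.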
ML can also be used to convert the JSON dataset into a relational model: soft functional dependencies that exist in the JSON dataset are detected (e.g., whenever we see value X for supplier_name, we see value Y for supplier_address) and used to identify groups of attributes that likely correspond to independent semantic entities (e.g., a supplier entity), to create tables for these entities, and to add the associated foreign keys in the parent tables. Entities with overlapping domains can be merged, with the end result being a complete relational schema. (Much of this can actually be done without ML, as with the algorithm described in this SIGMOD 2016 research paper.)
This relational schema produced from the European dataset can then be integrated with the existing relational schema from the American dataset. ML can also be used in this process. For example, query history can be used to observe how analysts access these individual datasets alongside other datasets, and to find similarities in access patterns. These similarities can be used to jump-start the data integration process. Similarly, ML can be used for entity matching across the datasets. At some point, humans must be involved to finalize the data integration, but data fabric technologies can automate key steps within the process; the less work humans have to do, the less likely they are to become a bottleneck in the end.
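A toy version of the entity-matching step can be sketched with simple string similarity: normalize supplier names (stripping legal suffixes like "Inc" and "GmbH") and propose matches above a threshold for a human to confirm. The names, suffix list, and threshold are all invented for illustration; production systems would use trained matchers rather than character similarity.

```python
from difflib import SequenceMatcher

def similarity(a, b):
    """Character-level similarity of two normalized entity names."""
    def norm(s):
        s = s.lower()
        for suffix in ("gmbh", "inc", "ltd", "ag"):
            s = s.replace(suffix, "")
        return s.strip()
    return SequenceMatcher(None, norm(a), norm(b)).ratio()

us_entities = ["Acme Radiators Inc", "Gears Ltd"]
eu_entities = ["Acme Radiators GmbH", "Bremsen AG"]

# Propose cross-dataset entity matches above a similarity threshold,
# leaving the final confirmation to a human reviewer.
THRESHOLD = 0.85
proposed = [
    (u, e)
    for u in us_entities
    for e in eu_entities
    if similarity(u, e) >= THRESHOLD
]
```

In this toy example, "Acme Radiators Inc" and "Acme Radiators GmbH" normalize to the same string and are proposed as a match, while the unrelated names fall below the threshold. This mirrors the division of labor described above: automation proposes, humans finalize.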
Human-centered data mesh approach
The data mesh takes a completely different approach to this same data integration problem. Although ML and automated technologies are certainly not ruled out in a data mesh, fundamentally, humans still play the central role in the integration process. However, these humans are not a central team, but rather distributed teams of domain experts.
Each dataset is owned by a specific domain team that has expertise in that dataset. That team is responsible for making the dataset available to the rest of the enterprise as a data product. If another dataset comes along that, when integrated with an existing dataset, would increase the existing dataset's usability, then performing that integration increases the value of the original data product.
To the extent that these teams of domain experts are rewarded when the value of the data products they create increases, they will be motivated to do the hard work of data integration. Better yet, the integration is done by domain experts who understand car-parts data, rather than by a central team that does not know the difference between a radiator and a grille.
Changes in the role of humans in data management
In summary, the data fabric still requires a central human team that performs crucial functions in the overall orchestration of the fabric. In theory, however, this team is unlikely to become an organizational bottleneck, since much of its work is automated by the artificial intelligence processes within the fabric.
In contrast, in the data mesh, no central human team sits on the critical path of any task performed by data consumers or producers. There is much less emphasis on replacing humans with machines; instead, the emphasis is on shifting human effort to the distributed teams of domain experts who are the most competent to do the work.
In other words, the data fabric is basically about eliminating human effort, while the data mesh is about smarter and more efficient use of human effort.
At first glance, of course, it seems that eliminating human effort is always better than redirecting it. However, despite the remarkable recent advances in ML, we are still not at the point where we can fully rely on machines to perform the key data management and integration tasks that are performed by humans today.
As long as humans remain involved in the process, it is important to ask how to use them most efficiently. In addition, some ideas from the data fabric are complementary to the data mesh and can be used together (and vice versa). Thus, it is not clear that the choice of which to use (data mesh or data fabric) is even a question of one versus the other in the first place. Ultimately, the best solution will probably take the best ideas from each of these approaches.
Daniel Abadi is the Darnell-Kanal Professor of Computer Science at the University of Maryland, College Park, and chief scientist at Starburst.