Follow the information (the real data problem)

Step right up! Come one, come all! Welcome to the greatest game of Three-Card Monte the world has ever seen.

Deep learning is facing a data problem: the demand for labeled data is nearly insatiable, and, arguably, the lack of labeled data in the enterprise is the biggest obstacle to progress.

Let’s find out.

First, we get to choose from the astonishing number of technologies that have emerged over the last few years to address the data problem at the root of artificial intelligence. All of the cards are laid out in front of us, and, of course, under one of them lies the secret of the next crop of unicorns and decacorns.

Unsupervised Learning, Foundation Models, Weak Supervision, Transfer Learning, Ontologies, Representation Learning, Semi-Supervised Learning, Self-Supervised Learning, Synthetic Data, Knowledge Graphs, Physics Simulation, Symbolic Manipulation, Active Learning, and Zero-Shot Learning.

Just to name a few.

The concepts bob and weave, connecting and splitting in strange and unexpected ways. Not a single term in that long list has a universally agreed-upon definition. Powerful tools and overblown promises overlap, and the dizzying range of techniques and tools is enough to make even the most discerning customers and investors lose their footing.

So, which one would you choose?

All data, no information

The trick, of course, is that we should never have been looking at the cards in the first place. There was never any question of which magical buzzword was going to solve the data problem, because this problem was never really about data. At least, not exactly.

By itself, data is useless. In less than a hundred keystrokes, I can set my computer to generate enough random noise to feed a modern neural network until the heat death of the universe. With a little more effort and a picture from a 10-megapixel phone, I could black out every combination of three pixels and create more data than exists on the internet today.
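The first thought experiment really is about a hundred keystrokes. Here is a minimal sketch (the function name and shapes are illustrative, not from any real pipeline) of an endless stream of random "training data" that carries no information at all:

```python
import numpy as np

def noise_batches(batch_size=32, shape=(224, 224, 3), seed=0):
    """Endless stream of random 'training data' that carries no information."""
    rng = np.random.default_rng(seed)
    while True:
        # Random pixels: maximal data, zero usable signal for any real task.
        yield rng.integers(0, 256, size=(batch_size, *shape), dtype=np.uint8)

batches = noise_batches()
batch = next(batches)
print(batch.shape)  # (32, 224, 224, 3)
```

A network trained on this stream will happily consume terabytes and learn nothing, which is exactly the point: volume of data says nothing about the information it carries.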

Data is just a vehicle. Information is what it carries. It is important not to confuse the two.

In the examples above, there is plenty of data but almost no information. The problem also runs in reverse in complex, information-rich systems such as loan approvals, industrial supply chains, and social media analysis. Rivers of thought and galaxies of human expression are boiled down into reductive binaries. It is like trying to mine a mountain.

This is the core of the data problem. It is an incomprehensible wealth of information (a billion cars on the road) that is somehow both tangible and inaccessible. Thousands of people and billions of dollars have been poured into captcha tests and classification datasets, leaving behind low-grade loads of tailings and gravel.

That’s where the tsunami of buzzwords comes from. For all the complexity of the hundreds of papers and methods, the inspiration and key principle are simple. The best and simplest explanation is one I credit to Google’s underspecification paper.

Molding neural networks

Imagine every possible neural network as a vast, undifferentiated space. It can do almost anything, but frankly, it does nothing.

There is something we want this neural network to do, but we are not sure what yet. It is like unmolded clay, holding endless possibilities. It is an uncontrollable mess, bursting with Shannon entropy, the mathematical formalization of possibility: the amount of freedom left in the system. To eliminate those possibilities, we will need to add some quantity of information, and of work, to the system.

Today, we are primarily interested in imitating humans. So that information, and that work, must come from humans.

So, to make progress, decisions have to be made. That huge space must be winnowed down; the Shannon entropy must decrease. It is like finding a single drop of water in an ocean of possibilities, and it is exactly as impractical as you would imagine. More practically, it is like finding the right region of the ocean. This is the equivalence set: the infinite subset of the infinitely vast ocean where every option is equally best.

As far as you can tell.

Supervision, information captured in labels, is how we conquer the ocean. It is how we say: “Out of everything you could do, this is what you should do.” That is the key to eliminating the noise. There is no free lunch here, and in the blizzard of technology and math blowing toward you, the flow of information is what you need to focus on.
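The entropy arithmetic behind this picture is simple enough to work through. As a toy sketch (the hypothesis counts are made up for illustration, and real labels are far noisier than the perfectly informative bits assumed here): if all remaining models are equally plausible, the entropy is the log of their count, and each fully informative binary label halves the space.

```python
import math

def entropy_bits(n_hypotheses):
    """Shannon entropy (in bits) of a uniform distribution over n hypotheses."""
    return math.log2(n_hypotheses)

# Toy setup: 2**30 candidate models, all equally consistent with what we know.
hypotheses = 2 ** 30
print(entropy_bits(hypotheses))  # 30.0 bits of freedom left in the system

# Each perfectly informative binary label rules out half the remaining models.
labels_observed = 12
remaining = hypotheses / 2 ** labels_observed
print(entropy_bits(remaining))  # 18.0 bits: supervision shrank the space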

Where is the new information entering the system?

Nvidia’s Omniverse Replicator is a wonderful example. It is a synthetic data platform. In truth though, it tells you very little. It describes the data, but Information Physics is a simulation. It is completely different from other synthetic data platforms like statice.ai that focus on using generative models to convert information trapped in individually identifiable data into unrecognized synthetic data that contains similar information.

Another case study is Tesla’s unique active learning approach. In traditional active learning, the main source of information is the data scientist. By referring to an active learning strategy that is well suited for the task, the new training examples will reduce your set of equivalents even more than usual. In a recent conversation with Andrej Carpathian on the subject, he explains how Tesla has significantly improved this technology. Instead of having the best active learning strategies that data scientists have, they take advantage of many noisy strategies together and use more human selection to identify the most influential examples.

Unsurprisingly, they improve the performance of the entire system by adding additional human intervention. Traditionally this would be seen as regression. More intervention means less automation which is less good in conventional lenses. However, when viewed through an information lens, this approach makes perfect sense. You’ve dramatically improved the information bandwidth in the system, so the rate of improvement is faster.

This Is the name of the game. The eruption of buzzwords is frustrating, and without a doubt, the large number of people who co-opted those buzzwords have misunderstood their promise. However, buzzwords are an indicator of real progress. There are no magic pills, and we’ve been exploring these areas long enough to find out. However, each of these areas has led to advantages in its own right, and research has shown that there are still significant benefits to be gained by combining and integrating these monitoring patterns.

It is an age of incredible potential. Our ability to access information from previously unused sources is accelerating. The biggest problems we face right now are the confusion of money and the confusion of noise. When all of this seems overwhelming, and you’re having trouble sorting out facts from fiction, remember:

Follow the information.

Slater is the founder and CTO of Victorof Indico Data.

DataDecisionMakers

Welcome to the VentureBeat community!

DataDecisionMakers is where experts, including tech people working on data, can share data-related insights and innovations.

If you would like to read about the latest ideas and latest information, best practices and the future of data and data tech, join us at DataDecisionMakers.

You might even consider contributing to your own article!

Read more from DataDecisionMakers

Similar Posts

Leave a Reply

Your email address will not be published.