The importance of data audits when building AI

Artificial intelligence can do a lot to improve business practice, but AI algorithms can also introduce new avenues of risk. Consider, for example, Zillow's recent decision to wind down its home-buying business after its predictive models significantly overshot home values. When housing-price data changed unexpectedly, the company's machine-learning models did not adapt quickly enough to account for the volatility, resulting in significant losses. This type of data mismatch, or "concept drift," is what happens when data audits don't get the care and attention they deserve.
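
To make "concept drift" concrete, here is a minimal, hypothetical monitoring sketch (not Zillow's actual pipeline): compare the distribution of recent model errors against a baseline window and flag when they diverge enough to warrant a fresh data audit. The function name, thresholds, and toy data are all illustrative assumptions.

```python
# Hypothetical drift check -- an illustrative sketch, not any company's real system.
# Compare recent model errors against a baseline window and flag when the
# distributions diverge enough to warrant a data audit and retraining.
import numpy as np
from scipy import stats

def drift_flag(baseline_errors, recent_errors, alpha=0.01):
    """Return (flagged, statistic): flagged is True if recent errors look like
    they come from a different distribution than the baseline
    (two-sample Kolmogorov-Smirnov test)."""
    result = stats.ks_2samp(baseline_errors, recent_errors)
    return result.pvalue < alpha, result.statistic

# Toy data: errors behave as expected historically, then shift after the market moves.
rng = np.random.default_rng(0)
baseline = rng.normal(loc=0.0, scale=1.0, size=5_000)  # historical prediction errors
recent = rng.normal(loc=0.8, scale=1.5, size=1_000)    # errors after an unexpected shift

flagged, ks_stat = drift_flag(baseline, recent)
print(f"drift flagged: {flagged} (KS statistic = {ks_stat:.3f})")
```

A flag like this doesn't fix anything by itself; it is a prompt to go back to the data, re-audit it, and retrain before the losses compound.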

Zillow’s failure to properly audit its data hasn't just hurt the company; it can cause broader harm by scaring other businesses away from AI. Negative perceptions of a technology can slow its progress in the commercial world, especially for a field like AI that has already weathered several winters. Machine-learning pioneers like Andrew Ng recognize what hangs in the balance and have launched campaigns to emphasize the importance of data auditing, such as annual competitions that reward the best data-quality methods rather than simply picking winners by model performance.

In addition to my own work building AI, as host of The Robot Brain Podcast I have interviewed dozens of AI practitioners and researchers about how they approach auditing and maintaining high-quality data. Here are some of the best practices I’ve compiled from that work:

  • Beware of outsourcing your data curation and labeling. Data maintenance is not the sexiest task, and it is time-intensive. When time is short, as it is for most entrepreneurs, it is tempting to hand off the responsibility. But beware of the dangers that come with that. A third-party vendor may not be as closely acquainted with your product vision, may not know the nuances of the context, and may not have the same incentive to keep quality as high as you need. Andrej Karpathy, head of AI at Tesla, says he spends 50% of his time maintaining the vehicle data sets because it is that important.
  • If your data is incomplete, fill the gaps. If your data sources reveal gaps or potential areas for erroneous predictions, all is not lost. One source that is often problematic is demographic data. As we know, historical demographic data sources skew toward white men, and that can bias your whole model. Olga Russakovsky, a Princeton professor and co-founder of AI4ALL, created the REVISE tool, which surfaces patterns of (possibly spurious) correlations in visual data. You can use it to make your model insensitive to those patterns, or decide to collect more data that does not carry them. (The code is available online if you want to run it, and a minimal gap-check sketch appears after this list.) Demographic data is the example most often cited for this kind of situation (e.g., medical-history data traditionally contains a high percentage of information about Caucasian men), but the approach applies in any setting.
  • Understand the trade-offs of sacrificing speed for insight. Your data audit may motivate you to plug in a larger data set with more complete coverage. In theory this may seem like a great strategy, but it can actually be at odds with the business goal at hand. The larger the data set, the slower the analysis. Is that extra time justified by the value of the added insight?

    Financial services companies have often asked themselves this question, because the dollar amounts involved are large and the industry's technology keeps getting faster (think nanoseconds). Keep in mind that a more accurate model powered by more data often means a longer inference time at deployment, which may not meet your speed requirements. Conversely, if you are making longer-horizon decisions, you will be competing with others in the market who do incorporate very large amounts of data, so you may have to do the same to stay competitive. (A small latency sketch below illustrates the speed-versus-data-size trade-off.)
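
Tooling like REVISE aside, the most basic version of the gap check described in the second bullet is simply comparing how groups are represented in your training data against how you expect them to be represented. Below is a minimal sketch of that idea; the column name, reference shares, and threshold are illustrative assumptions, not part of any real schema or of the REVISE tool itself.

```python
# Minimal data-gap audit: flag groups that are badly under-represented relative
# to a reference distribution. All names and numbers here are illustrative.
import pandas as pd

def underrepresented_groups(df, column, reference_shares, min_ratio=0.5):
    """Return groups whose share of the data is below `min_ratio` times
    their expected share in `reference_shares`."""
    observed = df[column].value_counts(normalize=True)
    flags = {}
    for group, expected in reference_shares.items():
        share = float(observed.get(group, 0.0))
        if share < min_ratio * expected:
            flags[group] = {"observed": share, "expected": expected}
    return flags

# Toy example: a medical-history data set heavily skewed toward one group.
records = pd.DataFrame({"sex": ["M"] * 850 + ["F"] * 150})
print(underrepresented_groups(records, "sex", {"M": 0.5, "F": 0.5}))
# {'F': {'observed': 0.15, 'expected': 0.5}} -> collect more data or reweight first
```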
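
To make the speed trade-off in the last bullet tangible, here is a toy benchmark under stated assumptions: a brute-force nearest-neighbor predictor has to scan every stored row at inference time, so its per-prediction latency grows with the amount of reference data it carries. The sizes and the predictor are placeholders, not a real trading system.

```python
# Toy benchmark: per-prediction latency of a brute-force nearest-neighbor lookup
# grows with the size of the reference data it has to scan. Sizes are arbitrary.
import time
import numpy as np

rng = np.random.default_rng(42)
query = rng.normal(size=(1, 32)).astype(np.float32)  # one incoming example, 32 features

for n_rows in (10_000, 100_000, 1_000_000):
    reference = rng.normal(size=(n_rows, 32)).astype(np.float32)
    start = time.perf_counter()
    distances = np.linalg.norm(reference - query, axis=1)  # scan every stored row
    nearest = int(np.argmin(distances))
    elapsed_ms = (time.perf_counter() - start) * 1_000
    print(f"{n_rows:>9,} rows -> nearest index {nearest}, {elapsed_ms:6.1f} ms per prediction")
```

More data usually improves coverage, but if the deployment budget is measured in micro- or nanoseconds, those extra rows may cost more than the added insight is worth.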

Implementing AI models to solve commercial problems is becoming commonplace as the open-source community makes them freely available to all. The downside is that as AI-generated insights and predictions become routine, the less flashy task of data maintenance can be overlooked. That's like building a house on sand: it may look good at first, but over time the structure will crumble.

Pieter Abbeel is a professor at UC Berkeley, director of the Berkeley Robot Learning Lab, and co-director of the Berkeley Artificial Intelligence Research (BAIR) Lab. He has co-founded three companies: Covariant (AI for intelligent automation of warehouses and factories), Gradescope (AI to help teachers with grading homework and exams), and Berkeley Open Arms. He also hosts The Robot Brain Podcast.

