To optimize data curation for AI, Lightly turns to self-supervised learning

All machine learning models are bound by one crucial factor: the quality of the data on which the model is trained.

The challenge of curating data to improve the quality of machine learning and AI models is well understood. A 2021 MIT research study found systemic problems in how training data is labeled, leading to inaccurate results in AI systems. A study in the journal Quantitative Science Studies that analyzed 141 previous investigations of data labeling found that 41% of the models were using datasets that were labeled by humans.

One of the vendors trying to meet the challenge of optimizing data curation for AI is the Swiss startup Lightly. Founded in 2019, the company announced this week that it has raised $3 million in a seed round of funding. However, Lightly is not looking to become a data-labeling vendor. Instead, the company seeks to help curate data using a self-supervised machine learning model that could one day eliminate the need for data labeling operations.

“I keep wondering at how much of the work in machine learning is manual, very tedious and not automated,” Matthias Heller, co-founder of Lightly, told VentureBeat. “People have always believed that everything is very advanced with machine learning, but machine learning and deep learning, in particular, is such a young technology, and a lot of tooling and infrastructure is only being made available right now.”

Emerging market for data curation and data labeling

There is no shortage of money or vendors in the market to help optimize data for machine learning, be it data curation or data labeling.

For example, the company formerly known as DefinedCrowd, which rebranded in 2021, has raised $78 million so far to help advance its data curation vision.

Grand View Research predicts that the data labeling market will reach $8.2 billion by 2028, with an estimated annual growth rate of 24.6% between 2021 and 2028. VentureBeat’s own list of top data labeling software vendors includes Appen’s Figure Eight, Amazon SageMaker Ground Truth, SuperAnnotate, Dataloop and V7’s Darwin.

Other popular vendors include Labelbox and the open-source Label Studio, both of which can be integrated with Lightly’s technology. In general, Lightly plans an open approach, so that users can use the company’s technology with any labeling vendor.

How the self-supervised model works

Three years ago, Heller and his co-founder Igor Susmelj were working on a machine learning project that required them to label their data.

“We’ve always wondered if the data we’re labeling really helps improve the model,” Heller said.

That question led to Lightly, which includes a series of open-source projects. The primary project is the Lightly library, which provides a self-supervised approach to machine learning on images.

Heller explained that there are multiple approaches to training machine learning models on data. In supervised approaches, such as those common in computer vision, an image and an associated label supplied by a human are used to teach the model.

Unsupervised learning, on the other hand, is the opposite: there is no need for human interaction. The self-supervised model that Lightly enables falls somewhere in the middle, requiring minimal human interaction.

“You can use a self-supervised model to curate data because the model learns specific information, specific similarities, interrelationships and what is different,” Heller said.
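The curation idea Heller describes can be illustrated with a minimal sketch. It assumes a model has already turned each image into an embedding vector; the file names, the three-dimensional vectors and the 0.95 threshold below are made up for illustration, and the code is not taken from Lightly's library. It flags pairs of images whose embeddings are nearly identical, which are candidates for removal before any labeling is done.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # treat an all-zero vector as similar to nothing
    return dot / (norm_a * norm_b)


def find_near_duplicates(embeddings, threshold=0.95):
    """Return pairs of names whose embeddings are at least `threshold` similar.

    `threshold` is a hypothetical cut-off chosen for this example.
    """
    names = list(embeddings)
    pairs = []
    for i, first in enumerate(names):
        for second in names[i + 1:]:
            similarity = cosine_similarity(embeddings[first], embeddings[second])
            if similarity >= threshold:
                pairs.append((first, second))
    return pairs


# Toy embeddings standing in for the output of a self-supervised model.
embeddings = {
    "cat_001.jpg": [0.90, 0.10, 0.00],
    "cat_002.jpg": [0.88, 0.12, 0.01],
    "truck_001.jpg": [0.05, 0.20, 0.95],
}

near_duplicates = find_near_duplicates(embeddings)
print(near_duplicates)
```

In this toy data the two cat images are flagged as near-duplicates while the truck image is not. Comparing every pair is fine for a handful of images; a real dataset would likely need an approximate nearest-neighbor index instead.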

From open source to commercial solution

While Lightly can be used as an open-source technology for free, it still requires users to work hard to set up the right environment and manage the configuration.

Lightly’s commercial service provides a managed offering with infrastructure, tuned algorithms and learning frameworks tailored to users.

“Our main competition today is in-house tooling,” Heller said. “We use self-supervised learning to tell you what 1% of data you should label and use for model training.”
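One way to picture choosing a small fraction of data to label is greedy farthest-point sampling over embeddings: repeatedly pick the sample that is farthest from everything already chosen, so the labeled subset covers the dataset rather than one dense cluster. The sketch below is an assumption for illustration only. The article does not say which selection strategy Lightly uses, and the function names and toy vectors are hypothetical.

```python
import math


def euclidean(a, b):
    """Straight-line distance between two vectors of equal length."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def select_diverse_subset(embeddings, budget):
    """Greedy farthest-point sampling; returns indices in selection order."""
    count = len(embeddings)
    if budget <= 0 or count == 0:
        return []
    selected = [0]  # start from the first sample (an arbitrary choice)
    # Distance from every sample to its nearest already-selected sample.
    min_dist = [euclidean(e, embeddings[0]) for e in embeddings]
    while len(selected) < min(budget, count):
        remaining = [i for i in range(count) if i not in selected]
        next_index = max(remaining, key=lambda i: min_dist[i])
        selected.append(next_index)
        for i, e in enumerate(embeddings):
            min_dist[i] = min(min_dist[i], euclidean(e, embeddings[next_index]))
    return selected


# Toy embeddings: a tight cluster, a second cluster and one outlier.
embeddings = [
    [0.0, 0.0], [0.1, 0.0], [0.0, 0.1],
    [5.0, 5.0], [5.1, 5.0],
    [10.0, 0.0],
]

chosen = select_diverse_subset(embeddings, budget=3)
print(chosen)
```

With a budget of three, the sketch picks one sample from each region of the toy data instead of three near-identical points from the first cluster. For the 1% figure in the quote, the budget could be set to something like `max(1, round(0.01 * len(embeddings)))`.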

Looking ahead, Heller provocatively predicts that a day may come when data labeling will no longer be needed, as unsupervised machine learning continues to improve.

“I think the need for labels will decrease significantly over the next few years,” Heller said. “Maybe in the future, we won’t need labels anymore.”
