22 open source datasets to boost AI modeling


Some say, “Data is the new oil,” with an air of seriousness. And while the line captures certain truths about the modern digital economy, it misses the way bits can be copied over and over again at almost no cost. That ease of duplication removes scarcity and changes the economics of the whole game. One of the best ways to see this is to tap into the open source datasets spread across the internet. All are free to use, and one of them may be exactly what your project requires.

Why do people share them? Some do it for promotion, a kind of cheap advertising. Some cloud providers host datasets knowing that those who need them are more likely to buy computational power from the same company. If the data is already sitting next to the servers, why wait for it to be shipped across the country?

Some governments share data as part of a tradition of accountability. Taxpayers should get something for their money; in these cases, that something is transparency about what their funding produces.

Others understand that collaboration often wins. Datasets built from hundreds, thousands, or even millions of small contributions can be more accurate and useful than anything a single company could assemble on its own.

Still others share data because it is part of the scientific process. Maybe it was collected under a grant that requires sharing. Maybe the team responsible wants others to build on it. Either way, someone believes the scientific community can put it to use.

Undoubtedly, some of this information may not be as accurate as we would like. Sometimes carefully funded, proprietary data collection is the only way to get reliable information. But if your project can sustain the risk, and your calculations can tolerate some error in the data, it’s best not to look a gift horse in the mouth.

Here are 22 options for free data:

OpenStreetMap

They call it the “world map, created by you.” A browser-based editor makes it relatively easy for anyone to edit the locations of streets, buildings, landmarks and more. The results are bundled into a large tarball that anyone can download and use, including the companies that build maps and compute routes.
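The downloadable extracts are plain XML (or a compressed binary equivalent): nodes with coordinates, decorated with key/value tags. A minimal sketch of reading that format with the standard library, using a tiny invented snippet that mimics the real .osm schema:

```python
# Sketch: parsing a tiny OpenStreetMap-style XML extract.
# The snippet is invented, but the node/tag structure mirrors the real
# .osm format; planet-scale dumps use the same schema, just far larger.
import xml.etree.ElementTree as ET

osm_xml = """<osm version="0.6">
  <node id="1" lat="52.5163" lon="13.3777">
    <tag k="amenity" v="cafe"/>
    <tag k="name" v="Example Cafe"/>
  </node>
  <node id="2" lat="52.5200" lon="13.4050"/>
</osm>"""

root = ET.fromstring(osm_xml)

# Collect coordinates of every node tagged as a cafe.
cafes = [
    (node.get("lat"), node.get("lon"))
    for node in root.iter("node")
    if any(t.get("k") == "amenity" and t.get("v") == "cafe"
           for t in node.iter("tag"))
]
print(cafes)  # [('52.5163', '13.3777')]
```

Real extracts are large enough that a streaming parser (`ET.iterparse`) or a dedicated OSM library is usually the better choice.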

US Census

While the details of each census are kept secret by law for 72 years, the US Census Bureau shares aggregate statistics with everyone. It runs several portals that make it possible to download details about neighborhoods and cities. Fast food chains use the information to plan new locations. States use it to apportion funding to local governments. The bureau’s various data portals are the place to start.
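The bureau also exposes the statistics through a simple HTTP API. A sketch of composing such a request with the standard library; the endpoint path and variable names below (the 2020 decennial “pl” file and the total-population variable P1_001N) are illustrative, so check the bureau’s API documentation for the dataset you actually need:

```python
# Sketch: building (not sending) a Census Bureau API request URL.
# Dataset path and variable names are assumptions for illustration.
from urllib.parse import urlencode

base = "https://api.census.gov/data/2020/dec/pl"
params = {
    "get": "NAME,P1_001N",   # place name plus total population
    "for": "state:*",        # one row per state
}
url = f"{base}?{urlencode(params)}"
print(url)
```

Fetching the URL returns JSON arrays that load cleanly into a dataframe for further analysis.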


Kaggle

Kaggle is dedicated to data science, data science education and data itself. Its portal offers easy access to notebooks full of Python and R code, along with lessons and competitions for learning how to use them. It also hosts a huge collection of datasets ranging from the essential to the fanciful, from daily Omicron case counts tabulated by country to South Korean lottery winning numbers.


Data.gov

Governments run on data, and the US government sometimes shares it. Data.gov is a central clearinghouse that lists many data sources, such as the Integrated Postsecondary Education Data System, a collection of data about colleges, or the US Geological Survey’s topographic data covering every square mile of the country. Beyond the central list, it also points to the data hubs of individual agencies, bureaus and departments for further excavation.


Data Europa

Europe also believes in opening up data to the world, and Data Europa is a project run by the European Union that collects bytes from all member countries. At the time of writing, the collection contains 1,397,730 datasets covering a wide range of topics from agriculture to transportation. Traditional areas of government oversight, such as policing and the economy, are well represented, but there are plenty of strange and unexpected finds, such as a list of all the medieval manuscripts in Basel University’s library or a survey of internet users in Switzerland.


UK data

Brexit makes no difference here: the United Kingdom publishes its own list of public data sources. Some of the data comes from the central government and some from local authorities or other public bodies.


PLOS

The Public Library of Science was established in 2001 as an alternative to the expensive scientific journals that dominate the world of research. Along the way, it also created PLOS Open Data, a collection of open datasets associated with research published in its journals. If you have a question about an analysis, or you just want to re-run the numbers differently, there is a good chance the data will be available. It is also a crucial resource for scientists building meta-analyses, combining the data of multiple studies to discover larger patterns and effects.

Open Science

The Open Science Data Cloud is another venue where scientists from many different disciplines can share lab data with each other. Some of the largest projects include the Bookworm collection of books and other texts from Harvard’s Cultural Observatory and collections of biological and biomedical data for the study of cells.

University collections

Many disciplines and subdisciplines maintain their own data collections, often created by dedicated researchers with a deep understanding of the field and of what other researchers want to use. For example, the Machine Learning Group at UC Irvine maintains a collection of hundreds of datasets already set up for training machine learning algorithms. CERN, home to the world’s largest particle accelerator, shares petabytes upon petabytes of data for physicists.

City data

Many cities around the country have embraced open data with varying degrees of devotion. Tax databases and real estate information usually appear first. Some sprinkle data across their various websites, while others maintain directories full of pointers. See New York City, Baltimore, Miami or Orlando for starters. Many smaller places like Ithaca or Auburn are also online.


AWS

AWS offers a large collection of datasets and preloads some of them into services such as EMR, often for use as examples. Many are large government datasets, such as the NEXRAD weather radar archive or Landsat imagery. The company is also pushing environmental awareness in this area, so many collections focus on natural data as part of the Amazon Sustainability Data Initiative. In January, AWS updated the Orcasound bioacoustic collection with streaming audio from around Puget Sound.


Azure

Azure Open Datasets are curated and preprocessed to make them easier to use with Azure’s analytics and AI routines. Many large government sets, such as weather data, are polled regularly so that the latest information is available in the same location. Economists can track inflation with the details of the Producer Price Index compiled by the US Bureau of Labor Statistics. City planners may be interested in New York City’s yellow taxi cab records, which include pick-up and drop-off times but no personal information.
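Records like the taxi data lend themselves to quick feature engineering. A minimal sketch of computing trip durations from rows shaped like the published yellow-taxi schema; the field names mirror the real ones (tpep_pickup_datetime, tpep_dropoff_datetime), but the two rows here are made up for illustration:

```python
# Sketch: deriving trip durations from taxi-style records.
# Field names follow the NYC yellow-taxi schema; values are invented.
from datetime import datetime

trips = [
    {"tpep_pickup_datetime": "2022-01-01 09:00:00",
     "tpep_dropoff_datetime": "2022-01-01 09:12:00"},
    {"tpep_pickup_datetime": "2022-01-01 10:30:00",
     "tpep_dropoff_datetime": "2022-01-01 10:48:00"},
]

fmt = "%Y-%m-%d %H:%M:%S"
minutes = [
    (datetime.strptime(t["tpep_dropoff_datetime"], fmt)
     - datetime.strptime(t["tpep_pickup_datetime"], fmt)).total_seconds() / 60
    for t in trips
]
print(sum(minutes) / len(minutes))  # 15.0
```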


Google

Google’s cloud collects a wide variety of datasets from a range of government sources. The company has also explored ways to make the data easier to use directly, without downloading anything. The Public Data Explorer lets you drill into data and build interactive charts and graphs from sources such as the World Economic Forum’s Global Competitiveness Report. Google’s Colab Jupyter notebook interface offers a way to analyze open data, or your own private data, in R or Python.


IBM

For data scientists who need information, IBM operates the Data Asset eXchange (DAX), a collection of datasets gathered from major government and open data sources. IBM focuses on supporting machine learning and artificial intelligence in the industries where it builds its customer base. The oil reservoir dataset, for example, is packed with over 30,000 different simulations. The fashion dataset comes with 60,000 images of clothing curated for training machine learning algorithms.

Companies that want to create their own data repositories can also turn to Open Data for Industries, a hybrid collection of tools designed to break down data silos inside organizations while simplifying analysis, reporting and AI training.

FiveThirtyEight

The popular data journalism site FiveThirtyEight often shares the data that forms the basis for its analysis and writing. Its NHL predictions, for example, are based on thousands of simulations that are updated after each game. Political polling data is ready for your own statistical check of questions such as whether voters prefer a Republican or a Democrat on a generic ballot. And if you’re curious about which polls are more credible, FiveThirtyEight also distributes its meta-analysis of pollster ratings.
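The site publishes most of this as plain CSV files, which makes re-running the numbers straightforward. A sketch using the standard library on an invented poll table; the column names here are illustrative, not the exact headers in their files:

```python
# Sketch: crunching a small poll table shaped like a published CSV.
# Pollster names, columns and numbers below are all invented.
import csv
import io

raw = """pollster,dem,rep
Poll A,48,45
Poll B,49,47
Poll C,47,46
"""

rows = list(csv.DictReader(io.StringIO(raw)))
margins = [int(r["dem"]) - int(r["rep"]) for r in rows]
print(sum(margins) / len(margins))  # 2.0
```

The same pattern scales to the real files: download the CSV, feed it to `csv.DictReader` (or pandas), and aggregate whichever columns interest you.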

GitHub Security

Programmers who use GitHub to store versions of their code need to worry about security issues, and GitHub wants to help them. It gathers security advisories about flaws found in the various frameworks, libraries and other open source blocks of code that developers rely on. It has also opened the collection so that anyone can contribute.
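The advisory database is published as machine-readable JSON records in the OSV format, so scanning them programmatically is easy. A sketch of reading one such record; the advisory below is invented for illustration (real files carry concrete GHSA identifiers and full affected-version ranges):

```python
# Sketch: parsing one OSV-format advisory record.
# The record is fabricated; the field layout follows the OSV schema.
import json

advisory_json = """{
  "id": "GHSA-xxxx-xxxx-xxxx",
  "summary": "Example injection flaw in a hypothetical library",
  "affected": [
    {"package": {"ecosystem": "PyPI", "name": "examplelib"},
     "ranges": [{"type": "ECOSYSTEM",
                 "events": [{"introduced": "0"}, {"fixed": "1.2.3"}]}]}
  ]
}"""

adv = json.loads(advisory_json)
pkg = adv["affected"][0]["package"]
print(pkg["ecosystem"], pkg["name"])  # PyPI examplelib
```

A dependency checker would match each installed package name and version against the `affected` ranges of every advisory in its ecosystem.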

Autonomous car

A big challenge for the automobile industry is creating the autonomous car of everyone’s dreams. Many car companies share datasets collected by their cars or lab equipment so anyone can experiment with building the many software layers needed to make it all run smoothly. Notable sets include data from Audi, ApolloScape, Google, Motional, Oxford, and Waymo.


Yelp

As of this writing, Yelp distributes a subset of its vast collection of opinions about restaurants, shops and other establishments. The current batch has nearly 7 million reviews of more than 150,000 businesses across eleven major cities. Yelp expects the text and photos to be most useful for training natural language processing algorithms and other AI applications, but you may come up with a different idea.
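The download arrives as JSON Lines: one review object per line. A sketch of loading it with the standard library; the two records below are fabricated stand-ins using the dataset’s familiar fields (business_id, stars, text):

```python
# Sketch: reading JSON Lines review data, one object per line.
# The records are invented; the field names follow the Yelp dataset.
import json

lines = [
    '{"business_id": "b1", "stars": 5.0, "text": "Great tacos."}',
    '{"business_id": "b1", "stars": 2.0, "text": "Long wait."}',
]

reviews = [json.loads(line) for line in lines]
avg_stars = sum(r["stars"] for r in reviews) / len(reviews)
print(avg_stars)  # 3.5
```

For the real multi-gigabyte file, iterate over the open file handle line by line instead of reading it all into memory.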


DBpedia

Many datasets are fairly raw and unorganized. DBpedia is an attempt to create an open knowledge graph filled with ontological information that can be queried with SPARQL. The structure makes it possible to write queries that lean on the ontology’s relationships instead of relying solely on raw keyword matching to find an answer. Most of the information comes from the various Wikipedias.
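A sketch of composing (but not sending) a SPARQL request for DBpedia’s public endpoint. The query asks for people born in Basel; dbo:birthPlace and dbr:Basel are standard DBpedia prefixed terms, though it is worth verifying any property against the current ontology before relying on it:

```python
# Sketch: building a SPARQL GET request for a public endpoint.
# The query text and endpoint are shown as-is; nothing is fetched here.
from urllib.parse import urlencode

query = """
PREFIX dbo: <http://dbpedia.org/ontology/>
PREFIX dbr: <http://dbpedia.org/resource/>
SELECT ?person WHERE { ?person dbo:birthPlace dbr:Basel . } LIMIT 10
"""

endpoint = "https://dbpedia.org/sparql"
url = f"{endpoint}?{urlencode({'query': query, 'format': 'application/sparql-results+json'})}"
print(url[:40])
```

Fetching that URL returns the SPARQL results as JSON, with one binding per matching person.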


Facebook

Many bits of cultural flotsam wash up in Facebook’s social network, and one way to find them is through Meta’s Graph API. We are all just nodes in this huge data structure, and your code can poke around it through the API, seeing more or less the same things you can see when logged in.


GitHub data

While many people think of repositories like GitHub as places for code, many store data too, sometimes alongside the code and sometimes as a standalone resource. The approach brings all the built-in features for tracking the evolution of files over time, features that are often missing from plain data downloads. A few quick searches will often turn up repositories that hold just what you want. MIT’s course on deep learning, for example, collects sample material for class assignments such as autonomous car training. If you are studying NFTs, some repositories pair Python analysis code with the data. Thousands of treasures are just a search away.

Industry organizations

Many businesses rely on membership organizations to handle tasks such as publishing magazines, running conferences, sponsoring studies, lobbying governments and, sometimes, collecting datasets that everyone can now use. The British Film Institute, for example, has been tracking box office receipts for years and publishes the data in raw form and in statistical yearbooks. The American Iron and Steel Institute monitors crude steel production. Most big industries have someone collecting useful data.

VentureBeat’s mission is to be a digital town square for technical decision-makers to gain knowledge about transformative enterprise technology and transact. Learn more about membership.
