Data Sets

On the Data Sets page we intend to share any publicly available data sets that can be used for Machine Intelligence and Data Science purposes. If you know of any other data sets that we should share on the MIIA website or via our real-time community messaging platform (MIIA on Slack), please let us know either on Slack or via info@machineintelligenceafrica.org.

Covid-19 Data sets

Data.World has recently launched their COVID-19 Data Center, which compiles the best of the most trusted and up-to-date data in the fight against COVID. Here is a select list of some of the best data on their resource page:

You can find all of our COVID-19 datasets here.

Fueling the Gold Rush: The Greatest Public Datasets for AI

https://www.iflexion.com/blog/machine-learning-new-gold-rush/

https://medium.com/startup-grind/fueling-the-ai-gold-rush-7ae438505bc2#.c4xidbev8

It has never been easier to build AI or machine learning-based systems than it is today. The ubiquity of cutting edge open-source tools such as TensorFlow, Torch, and Spark, coupled with the availability of massive amounts of computation power through AWS, Google Cloud, or other cloud providers, means that you can train cutting-edge models from your laptop over an afternoon coffee.

Though not at the forefront of the AI hype train, the unsung hero of the AI revolution is data — lots and lots of labeled and annotated data, curated with the elbow grease of great research groups and companies who recognize that the democratization of data is a necessary step towards accelerating AI.

However, most products involving machine learning or AI rely heavily on proprietary datasets that are often not released, as this provides implicit defensibility.

With that said, it can be hard to sift through which public datasets are useful to look at, which are viable for a proof of concept, and which can serve as a product or feature validation step before you collect your own proprietary data.

It’s important to remember that good performance on a data set doesn’t guarantee that a machine learning system will perform well in real product scenarios. Most people in AI forget that the hardest part of building a new AI solution or product is not the AI or the algorithms, but the data collection and labeling. Standard datasets can be used as validation or as a good starting point for building a more tailored solution.

This week, a few machine learning experts and I were talking about all this. To make your life easier, we’ve collected an (opinionated) list of some open datasets that you can’t afford not to know about in the AI world.

Legend:

? Classic - these are some of the more famous, legacy, or storied datasets in AI. It’s hard to find a researcher or engineer who hasn’t heard of them.

? Useful - these are datasets that are about as close to real-world data as a curated, cleaned dataset can be. They are also often general enough to be useful in both the product and R&D worlds.

? Academic baseline - these are datasets that are commonly used on the academic side of Machine Learning and AI as benchmarks or baselines. For better or worse, people use these datasets to validate algorithms.

? Old - these datasets, irrespective of utility, have been around for a while.

Springboard

19 Free Public Data Sets For Your First Data Science Project:

Find Free Public Data Sets for Your Data Science Project

Computer Vision

? ? ? MNIST: most commonly used sanity check. Dataset of 28×28, centered, B&W handwritten digits. It is an easy task: just because something works on MNIST doesn’t mean it works.
? ? CIFAR 10 & CIFAR 100: 32×32 color images. Not commonly used anymore, though once again, can be an interesting sanity check.
? ? ? ImageNet: the de-facto image dataset for new algorithms. Many image API companies have labels from their REST interfaces that are suspiciously close to the 1000 category WordNet hierarchy from ImageNet.
LSUN: Scene understanding with many ancillary tasks (room layout estimation, saliency prediction, etc.) and an associated competition.
? PASCAL VOC: Generic image Segmentation / classification — not terribly useful for building real-world image annotation, but great for baselines.
? SVHN: House numbers from Google Street View. Think of this as recurrent MNIST in the wild.
MS COCO: Generic image understanding / captioning, with an associated competition.
? Visual Genome: Very detailed visual knowledge base with deep captioning of ~100K images.
? ? ? ? Labeled Faces in the Wild: Cropped facial regions (using Viola-Jones) that have been labeled with a name identifier. A subset of the people present have two images in the dataset — it’s quite common for people to train facial matching systems here.

Natural Language

? ? Text Classification Datasets (Google Drive Link) from Zhang et al., 2015: An extensive set of eight datasets for text classification. These are the most commonly reported baselines for new text classification methods. Sample sizes range from 120K to 3.6M, and the tasks range from binary to 14-class problems. Datasets from DBPedia, Amazon, Yelp, Yahoo!, Sogou, and AG.
? ? WikiText: large language modeling corpus from quality Wikipedia articles, curated by Salesforce MetaMind.
? Question Pairs: first dataset release from Quora containing duplicate / semantic similarity labels.
? ? SQuAD: The Stanford Question Answering Dataset — broadly useful question answering and reading comprehension dataset, where every answer to a question is posed as a span, or segment of text.
CMU Q/A Dataset: Manually-generated factoid question/answer pairs with difficulty ratings from Wikipedia articles.
? Maluuba Datasets: Sophisticated, human-generated datasets for stateful natural language understanding research.
? ? Billion Words: large, general purpose language modeling dataset. Often used to train distributed word representations such as word2vec or GloVe.
? ? Common Crawl: Petabyte-scale crawl of the web, most frequently used for learning word embeddings. Available for free from Amazon S3. Can also be useful as a network dataset for its crawl of the WWW.
? ? bAbI: synthetic reading comprehension and question answering dataset from Facebook AI Research (FAIR).
? The Children’s Book Test (download link): Baseline of (Question + context, Answer) pairs extracted from Children’s books available through Project Gutenberg. Useful for question-answering, reading comprehension, and factoid look-up.
? ? ? Stanford Sentiment Treebank: standard sentiment dataset with fine-grained sentiment annotations at every node of each sentence’s parse tree.
? ? 20 Newsgroups: one of the classic datasets for text classification, usually useful as a benchmark for either pure classification or as a validation of any IR / indexing algorithm.
? ? Reuters: older, purely classification based dataset with text from the newswire. Commonly used in tutorials.
? ? IMDB: an older, relatively small dataset for binary sentiment classification. Fallen out of favor for benchmarks in the literature in lieu of larger datasets.
? ? UCI’s Spambase: Older, classic spam email dataset from the famous UCI Machine Learning Repository. Due to details of how the dataset was curated, this can be an interesting baseline for learning personalized spam filtering.
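
The span-based answer format mentioned for SQuAD above can be illustrated with a small sketch. It assumes the SQuAD v1.1 JSON layout (a context string plus answers carrying "text" and "answer_start" fields); the record itself and the helper names extract_span and check_record are hypothetical and exist only for illustration.

```python
# Hypothetical record laid out like a SQuAD v1.1 paragraph entry.
record = {
    "context": "The library opens at nine and closes at five.",
    "qas": [
        {
            "question": "When does the library close?",
            "answers": [{"text": "five", "answer_start": 40}],
        }
    ],
}


def extract_span(context, answer):
    """Slice the answer out of the context using its character offset."""
    start = answer["answer_start"]
    end = start + len(answer["text"])
    return context[start:end]


def check_record(record):
    """Return (question, span) pairs, checking that each offset is consistent."""
    pairs = []
    for qa in record["qas"]:
        for answer in qa["answers"]:
            span = extract_span(record["context"], answer)
            if span != answer["text"]:
                raise ValueError(f"offset mismatch for {qa['question']!r}")
            pairs.append((qa["question"], span))
    return pairs


if __name__ == "__main__":
    for question, span in check_record(record):
        print(question, "->", span)
```

A consistency check like this is a cheap first step before training on any span-annotated data, since offsets can drift when text is re-encoded or cleaned.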

Speech

Most speech recognition datasets are proprietary: the data holds a lot of value for the company that curates it. Most datasets available in the field are quite old.

? ? 2000 HUB5 English: English-only speech data used most recently in the Deep Speech paper from Baidu.
? LibriSpeech: Audio books data set of text and speech. Nearly 500 hours of clean speech of various audio books read by multiple speakers, organized by chapters of the book containing both the text and the speech.
? ? VoxForge: Clean speech dataset of accented English, useful for instances in which you expect to need robustness to different accents or intonations.
? ? ? TIMIT: English-only speech recognition dataset.
? CHiME: Noisy speech recognition challenge dataset. The dataset contains real, simulated, and clean voice recordings. The real set consists of nearly 9000 recordings of 4 speakers across 4 noisy locations, the simulated set is generated by combining speech utterances with multiple noisy environments, and the clean set contains non-noisy recordings.
TED-LIUM: Audio transcription of TED talks. Audio recordings of 1495 TED talks, along with full text transcriptions of those recordings.

Recommendation and ranking systems

? ? Netflix Challenge: first major Kaggle-style data challenge. Only available unofficially, as privacy issues arose.
? ? ? MovieLens: various sizes of movie review data — commonly used for collaborative filtering baselines.
Million Song Dataset: large, metadata-rich, open source dataset on Kaggle that can be good for people experimenting with hybrid recommendation systems.
? Last.fm: music recommendation dataset with access to underlying social network and other metadata that can be useful for hybrid systems.
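
As a rough illustration of the kind of collaborative filtering baseline these rating datasets are used for, here is a minimal user-based sketch in plain Python. The ratings table is made up and tiny, the function names are hypothetical, and cosine similarity over co-rated items is just one of several reasonable choices; this is not the method of any particular paper or library.

```python
from math import sqrt

# Hypothetical user -> {movie: rating} table, standing in for real rating data.
ratings = {
    "u1": {"m1": 5.0, "m2": 3.0, "m3": 4.0},
    "u2": {"m1": 4.0, "m2": 2.0, "m3": 5.0, "m4": 4.0},
    "u3": {"m1": 1.0, "m2": 5.0, "m4": 2.0},
}


def cosine_similarity(a, b):
    """Cosine similarity between two users over the items both have rated."""
    shared = set(a) & set(b)
    if not shared:
        return 0.0
    dot = sum(a[i] * b[i] for i in shared)
    norm_a = sqrt(sum(a[i] ** 2 for i in shared))
    norm_b = sqrt(sum(b[i] ** 2 for i in shared))
    return dot / (norm_a * norm_b)


def predict(ratings, user, item):
    """Similarity-weighted average of other users' ratings for the item."""
    weighted_sum = 0.0
    weight_total = 0.0
    for other, other_ratings in ratings.items():
        if other == user or item not in other_ratings:
            continue
        similarity = cosine_similarity(ratings[user], other_ratings)
        weighted_sum += similarity * other_ratings[item]
        weight_total += similarity
    if weight_total == 0.0:
        return None  # nobody comparable has rated this item
    return weighted_sum / weight_total


if __name__ == "__main__":
    print(predict(ratings, "u1", "m4"))
```

In this toy table, u1 agrees more with u2 than with u3, so the prediction for m4 lands closer to u2's rating. On real data, a baseline of this sort mainly serves as a reference point against which hybrid or model-based recommenders are compared.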

Networks and Graphs

? Amazon Co-Purchasing and Amazon Reviews: crawled data from the “users who bought this also bought…” section of Amazon, as well as Amazon review data for related products. Good for experimenting with recommendation systems in networks.
Friendster Social Network Dataset: Before their pivot to a gaming website, Friendster released anonymized data in the form of friends lists for 103,750,348 users.
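
Friends lists like the ones described above are usually turned into an adjacency structure before any graph analysis. The sketch below assumes a hypothetical line format of "user:friend1,friend2" (the actual file layout of any given dataset may differ) and builds an undirected adjacency map from it.

```python
def build_adjacency(lines):
    """Build an undirected adjacency map from 'user:friend1,friend2' lines.

    The line format is an assumption made for this sketch, not the documented
    layout of any specific dataset.
    """
    adjacency = {}
    for line in lines:
        line = line.strip()
        if not line:
            continue
        user, _, friends = line.partition(":")
        adjacency.setdefault(user, set())  # keep users with no listed friends
        for friend in friends.split(","):
            if not friend:
                continue
            adjacency[user].add(friend)
            adjacency.setdefault(friend, set()).add(user)
    return adjacency


def degrees(adjacency):
    """Number of distinct neighbours per user."""
    return {user: len(neighbours) for user, neighbours in adjacency.items()}


if __name__ == "__main__":
    sample = ["1:2,3", "2:1", "3:", "4:"]
    print(degrees(build_adjacency(sample)))
```

For graphs with tens of millions of nodes, an in-memory dictionary of sets will likely be too large, so a streaming or sparse-matrix representation would be the more realistic choice.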

Geospatial data

? ? OpenStreetMap: Vector data for the entire planet under a free license. It includes (an older version of) the US Census Bureau’s TIGER data.
? Landsat 8: Satellite shots of the entire Earth surface, updated every several weeks.
? NEXRAD: Doppler radar scans of atmospheric conditions in the US.

❗️People often think solving a problem on one dataset is equivalent to having a well thought out product. Use these datasets as validation or proofs of concept, but don’t forget to test or prototype how the product will function and obtain new, more realistic data to improve its operation. Successful data-driven companies usually derive strength from their ability to collect new, proprietary data that improves their performance in a defensible way.

Awesome Data Science

Herewith a list from Awesome Data Science:

Academic Torrents
hadoopilluminated.com
data.gov – The home of the U.S. Government’s open data
United States Census Bureau
freebase.com
usgovxml.com
enigma.io – Navigate the world of public data – Quickly search and analyze billions of public records published by governments, companies and organizations.
datahub.io
aws.amazon.com/datasets
databib.org
datacite.org
quandl.com – Get the data you need in the form you want; instant download, API or direct to your app.
figshare.com
GeoLite Legacy Downloadable Databases
Quora’s Big Datasets Answer
Public Big Data Sets
Houston Data Portal
Kaggle Data Sources
Kaggle Datasets
A Deep Catalog of Human Genetic Variation
A community-curated database of well-known people, places, and things
Google Public Data
World Bank Data
NYC Taxi data
Open Data Philly Connecting people with data for Philadelphia
A list of useful sources A blog post includes many data set databases
grouplens.org Sample movie (with ratings), book and wiki datasets
UC Irvine Machine Learning Repository – contains data sets good for machine learning
research-quality data sets by Hilary Mason
National Climatic Data Center – NOAA
ClimateData.us (related: U.S. Climate Resilience Toolkit)
r/datasets
MapLight – provides a variety of data free of charge for uses that are freely available to the general public.
GHDx – Institute for Health Metrics and Evaluation – a catalog of health and demographic datasets from around the world and including IHME results
St. Louis Federal Reserve Economic Data – FRED
Dept. of Politics @ New York University
Open Data Sources
UNICEF Statistics and Monitoring
UNICEF Data
undata
NASA SocioEconomic Data and Applications Center – SEDAC
The GDELT Project
Sweden, Statistics
Github free data source list
StackExchange Data Explorer – an open source tool for running arbitrary queries against public data from the Stack Exchange network.
San Francisco Government Open Data
IBM Blog about open data
Open Data Index

From South Africa

Data Science Weekly

Herewith a list of data sets via Data Science Weekly:

Amazon Public Data Sets
Public Data Sets on AWS: centralized repository of public data sets that can be seamlessly integrated into AWS cloud-based applications
Wikipedia
Wikipedia offers free copies of all available content to interested users. These databases can be used for mirroring, personal use, informal backups, offline use or database queries
Freebase
A community-curated database of people, places and things
World Bank
DataBank is an analysis and visualization tool that contains collections of time series data on a variety of topics
Windows Azure Marketplace
Free datasets via Windows Azure Data Market including Academic data, Speech Recognition data, etc.
Machine Learning Repository
200+ Datasets from Center for ML & Intelligent Systems
Deep Learning Data Sets
Music, natural images, text, speech, faces, recommendation systems datasets for benchmarking algorithms
Stanford Large Network Dataset Collection
A collection of about 50 large network datasets from tens of thousands of nodes and edges to tens of millions of nodes and edges. It includes social networks, web graphs, road networks, internet networks, citation networks, collaboration networks, and communication networks.
Yahoo Datasets
We have various types of data available to share. They are categorized into Ratings, Language, Graph, Advertising and Market Data, Computing Systems and an appendix of other relevant data and resources available via the Yahoo! Developer Network.
And, if you are looking for something specific, you can always try your luck posting on reddit/r/datasets or on Open Data StackExchange