List of Lists

Dec 6, 2017

Data science, big data analytics, and machine learning are all fueled by datasets. Access to datasets is the first step in kicking off any project, and the quality and amount of data determine how successful the project will be. So where do you get datasets? Almost anyone in the data science profession will tell you they have had to acquire, collect, scrape, or aggregate data from other sources.

To make finding datasets easier, several websites have been created to aggregate datasets in publicly available lists. Datasets on these sites are largely available for free, although sites that offer datasets for purchase are just starting to pop up. As a co-founder of Data & Sons, a new dataset marketplace, I found myself spending a lot of time identifying as many of these lists as I could. I thought it would be useful to share this list of lists with others seeking the datasets they need.

Limitations. First, I stuck with broad listings and did not include ones that focused on a particular type of data (e.g. government, locations, mailing addresses, etc.). Second, I did not include lists that only allow access to certain groups or people (e.g. academics) or entities (e.g. businesses only).

In alphabetical order, here’s my list of dataset lists.

Amazon Web Services Decent-sized list of well-organized datasets you can search by several categories. A nice collection of government, scientific, and business data. However, it seems most of the data was posted back in 2015, with only a limited number of more recent datasets.

Analytics Bodhi provides data science/analytics training and resources. They’ve included a list of publicly available datasets in their resources with brief descriptions. They link to both datasets and other dataset lists…an early version of the project I am undertaking.

Awesome Public Datasets Authored by Xiaming Chen and posted on GitHub (caesar0301), Awesome Public Datasets is one of my favorites. Well organized into categories, the list identifies many publicly available scientific datasets. Simple, clean, elegant, a wonderful list for anyone in the sciences.

BigML The very smart folks at BigML have put together a list of data sources they have come across in their work. Some really interesting finds on here, well organized by types of datasets/data providers.

Data Circle allows people to buy, sell, and share datasets easily. The site is in beta with a limited number of datasets. I like the clean, happy interface, and am a big believer in their business model ;-)

Datahub Basic listing page with around 200 datasets and a description of each. I found there were more unique datasets (i.e. data not available on other lists) on here than in other dataset locations.

Data is Plural is a publicly available Google Docs spreadsheet created by Jeremy Singer-Vine and curated by a team of diligent humans. I believe this is one of the most extensive and well-maintained dataset lists available. As new sources of data are identified, they are added to the spreadsheet, which has entries from 2015 to the present.

Data & Sons enables people to buy, sell, and share datasets. Data seems to be available at two extremes: free and very expensive. With over 90% of data inaccessible, we thought creating a market would create a more equitable exchange of data leading to increased data accessibility.

Datausa.io does perhaps the best job of data summation and presentation I've seen. There is a lot of information about the USA on here, and they do a superb job of providing data-driven averages for a wide range of things in the US. They also provide many US-related datasets for download and a slick shopping cart interface. Nod to Daniel Shorstein for bringing Datausa.io to my attention.

Data.world gives individuals and businesses the ability to collaborate on data science projects with a suite of analytics tools, storage, and professional networking. They also provide free datasets and other data types. The only issue I have with data.world is that it requires a user account to view and access their datasets.

Google Public Data Datasets from governments around the world. Neat, clean, no fuss, but limited to around 130 sources.

Kaggle is without a doubt the center of the data science universe. Acquired by Google in March 2017, Kaggle provides data scientists a place to connect, learn, and earn some extra money through its competitions. Kaggle also provides perhaps the most extensive list of free datasets I have come across. Cleanly organized and always interesting to browse, Kaggle has over 5,000 datasets. Based on user profiles, it seems Kaggle has several employees engaged in creating new datasets. I think those Kaggle-generated datasets are some of the most interesting and well-put-together sets you can find for free.

OpenDataSoft might just be the largest and most diverse offering of datasets I have seen, with over 9,000 datasets from all over the world. Datasets range from the government, business, and social data we see on other lists to some really fun ones like workout data from Apple Watches. I was part of a conversation recently on whether English is the language of datasets and whether you need to be an English speaker to get into data science. I was happy to see that there are many datasets in other languages on OpenDataSoft and that over 4,000 of the datasets are in French. Thanks to Nicolas Terpolilli for directing my attention to OpenDataSoft.

Socrata has a basic interface searchable by category. Socrata also offers several types of data products, including datasets, documents, forms, etc. There is a massive number of data products on Socrata, making it one of the largest sources available. Many of the datasets seem to be from government sources.

ZeMiner primarily offers datasets on web sites, traffic, and usage from Amazon, Google, and Facebook. Although they are fairly focused in their data offerings, there are a lot of datasets here that can provide a great deal of insight into a broad range of Internet activities.

Please let me know of any dataset sources I may have missed and I will do my best to keep the list of lists updated!