Variety of curated data sources for data science
-
Kaggle: A data science site that hosts a variety of interesting, externally contributed datasets. You can find all kinds of niche datasets in its master list, from ramen ratings to basketball data and even Seattle pet licenses.
-
UCI Machine Learning Repository: One of the oldest sources of datasets on the web, and a great first stop when looking for interesting datasets.
-
VisualData: Discover computer vision datasets by category; supports search queries.
-
Government
-
Data.gov: This site makes it possible to download data from multiple US government agencies. Data can range from government budgets to school performance scores. Be warned though: much of the data requires additional research.
-
Food Environment Atlas: Contains data on how local food choices affect diet in the US.
-
School System Finances: A survey of the finances of school systems in the US.
-
Chronic Disease Data: Data on chronic disease indicators in areas across the US.
-
The US National Center for Education Statistics: Data on educational institutions and education demographics from the US and around the world.
-
The UK Data Service: The UK’s largest collection of social, economic and population data.
-
Data.gov.uk: There are datasets from all UK central departments and a number of other public sector and local authorities. It acts as a portal to all sorts of information on everything, including business and economy, crime and justice, defence, education, environment, government, health, society and transportation.
-
Data USA: A comprehensive visualization of US public data.
-
U.S. Census Bureau: Government statistics on the lives of US citizens, including population, economy, education, geography, and more.
-
The CIA World Factbook: Facts on every country in the world; focuses on history, government, population, economy, energy, geography, communications, transportation, military, and transnational issues of 267 countries.
-
FBI Crime Data: The FBI crime data is fascinating and one of the most interesting data sets on this list. If you’re interested in analyzing time series data, you can use it to chart changes in crime rates at the national level over a 20-year period. Alternatively, you can look at the data geographically.
-
Socrata: A software company whose platform is another interesting place to explore government-related data, with some visualization tools built in. Its data-as-a-service offering has been adopted by more than 1200 government agencies for open data, performance management and data-driven government.
-
European Union Open Data Portal: The single point of access to a growing range of data from the institutions and other bodies of the European Union. The portal aims to boost economic development within the EU and transparency within the EU institutions. Its data includes geographic, geopolitical and financial data, statistics, election results, legal acts, and data on crime, health, the environment, transport and scientific research. The data can be reused in different databases and reports, and is available in a variety of digital formats. The portal provides a standardised catalogue, a list of apps and web tools reusing these data, a SPARQL endpoint query editor and REST API access, and tips on how to make best use of the site.
-
Canada Open Data: A pilot project with many government and geospatial datasets. It can help you explore how the Government of Canada creates greater transparency and accountability, increases citizen engagement, and drives innovation and economic opportunities through open data, open information, and open dialogue.
-
Datacatalogs.org: Open government data catalogs from the US, EU, Canada, CKAN, and more.
-
U.S. National Center for Education Statistics: The National Center for Education Statistics (NCES) is the primary federal entity for collecting and analyzing data related to education in the U.S. and other nations.
-
UK Data Service: The UK Data Service collection includes major UK government-sponsored surveys, cross-national surveys, longitudinal studies, UK census data, international aggregate, business data, and qualitative data.
-
CDC Cause of Death: The Centers for Disease Control and Prevention maintains a database on cause of death. The data can be segmented in almost every way imaginable: age, race, year, and so on.
-
Bureau of Labor Statistics: Many important economic indicators for the United States (like unemployment and inflation) can be found on the Bureau of Labor Statistics website. Most of the data can be segmented both by time and by geography.
-
Bureau of Economic Analysis: The Bureau of Economic Analysis also has national and regional economic data, including gross domestic product and exchange rates.
-
-
Finance and Economics
-
Quandl: A good source for economic and financial data — useful for building models to predict economic indicators or stock prices.
-
World Bank Open Data: Datasets covering population demographics and a huge number of economic and development indicators from across the world.
-
IMF Data: The International Monetary Fund publishes data on international finances, debt rates, foreign exchange reserves, commodity prices and investments.
-
Financial Times Market Data: Up-to-date information on financial markets from around the world, including stock price indexes, commodities and foreign exchange.
-
Google Trends: Examine and analyze data on internet search activity and trending news stories around the world.
-
American Economic Association (AEA): A good source to find US macroeconomic data.
-
UN Comtrade Database: Free access to detailed global trade data with visualizations. UN Comtrade is a repository of official international trade statistics and relevant analytical tables. All data is accessible through API.
-
Global Financial Data: With data on over 60,000 companies covering 300 years, Global Financial Data offers a unique source to analyze the twists and turns of the global economy.
-
Google Finance: Real-time stock quotes and charts, financial news, currency conversions, and tracked portfolios.
-
Google Public Data Explorer: Google's Public Data Explorer provides public data and forecasts from a range of international organizations and academic institutions including the World Bank, OECD, Eurostat and the University of Denver. These can be displayed as line graphs, bar graphs, cross sectional plots or on maps.
-
U.S. Bureau of Economic Analysis: U.S. official macroeconomic and industry statistics, most notably reports about the gross domestic product (GDP) of the United States and its various units. They also provide information about personal income, corporate profits, and government spending in their National Income and Product Accounts (NIPAs).
-
Financial Data Finder at OSU: Plentiful links to anything related to finance, no matter how obscure, including World Development Indicators Online, World Bank Open Data, Global Financial Data, International Monetary Fund Statistical Databases, and EMIS Intelligence.
-
National Bureau of Economic Research: Macro data, industry data, productivity data, trade data, international finance data, and more.
-
U.S. Securities and Exchange Commission: Quarterly datasets of extracted information from exhibits to corporate financial reports filed with the Commission.
-
Visualizing Economics: Data visualizations about the economy.
-
Financial Times: The Financial Times provides a broad range of information, news and services for the global business community.
-
Dow Jones Weekly Returns: Predicting stock prices is a major application of data analysis and machine learning. One relevant data set to explore is the weekly returns of the Dow Jones Index from the Center for Machine Learning and Intelligent Systems at the University of California, Irvine.
-
Lending Club: Lending Club provides data about loan applications it has rejected as well as the performance of loans that it issued. The free data set lends itself both to classification (will a given loan default?) and to regression (how much will be paid back on a given loan?).
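As a rough illustration of the two framings mentioned in the Lending Club entry, the sketch below derives a classification label and a regression target from the same loan records. The records and the field names (`funded`, `repaid`, `defaulted`) are invented for illustration and are not Lending Club's actual schema.

```python
# Toy loan records; the field names are hypothetical, not Lending Club's schema.
loans = [
    {"funded": 10000.0, "repaid": 10000.0, "defaulted": False},
    {"funded": 5000.0, "repaid": 1250.0, "defaulted": True},
    {"funded": 8000.0, "repaid": 6000.0, "defaulted": True},
]


def classification_target(loan):
    """Binary label for classification: 1 if the loan defaulted, else 0."""
    return 1 if loan["defaulted"] else 0


def regression_target(loan):
    """Continuous target for regression: fraction of the funded amount repaid."""
    return loan["repaid"] / loan["funded"]


labels = [classification_target(loan) for loan in loans]
fractions = [regression_target(loan) for loan in loans]
print(labels)     # [0, 1, 1]
print(fractions)  # [1.0, 0.25, 0.75]
```

The same rows feed both tasks; only the target column changes, which is why the data set suits both kinds of model.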
-
-
Real Estate
-
Joint Center for Housing Studies of Harvard University - LIRA: The Leading Indicator of Remodeling Activity (LIRA) provides a short-term outlook of national home improvement and repair spending to owner-occupied homes. The indicator, measured as an annual rate-of-change of its components, is designed to project the annual rate of change in spending for the current quarter and subsequent four quarters, and is intended to help identify future turning points in the business cycle of the home improvement and repair industry. Produced quarterly since 2007, the LIRA is released by the Remodeling Futures Program at the Joint Center in the third week after each quarter's closing.
-
FRED Housing Starts: New Privately Owned Housing Units Started
-
American Housing Survey (AHS): The AHS is sponsored by the Department of Housing and Urban Development (HUD) and conducted by the U.S. Census Bureau. The survey is the most comprehensive national housing survey in the United States.
-
Castles: Castles are a successful, privately owned independent agency. Established in 1981, they offer a comprehensive service incorporating residential sales, letting and management, and surveys and valuations.
-
RealEstate.com: A resource for first-time home buyers, offering easy-to-understand tools and expert advice at every stage in the process.
-
Gumtree: Gumtree is the first site for free classifieds ads in the UK. You can buy and sell items, cars and properties, and find or offer jobs in your area.
-
James Hayward: It provides an innovative database approach to residential sales, lettings & management.
-
Lifull Home's: Japan’s property website.
-
Immobiliare.it: Italy’s property website.
-
Subito: Italy’s property website.
-
Immoweb: Belgium's leading property website.
-
-
Geospatial Data
-
Marketing and Social Media
-
Amazon API: Browse Amazon Web Services’ Public Data Sets by category for a huge wealth of information. Amazon API Gateway allows developers to securely connect mobile and web applications to APIs that run on AWS Lambda, Amazon EC2, or other publicly addressable web services hosted outside of AWS.
-
American Society of Travel Agents: ASTA is the world's largest association of travel professionals. It provides information on its members, which include travel agents and the companies whose products they sell, such as tours, cruises, hotels and car rentals.
-
Social Mention: Social Mention is a social media search and analysis platform that aggregates user-generated content from across the universe into a single stream of information.
-
Google Trends: Google Trends shows how often a particular search term is entered relative to the total search volume across various regions of the world in various languages.
-
Facebook API: Learn how to publish to and retrieve data from Facebook using the Graph API.
-
Twitter API: The Twitter Platform connects your website or application with the worldwide conversation happening on Twitter.
-
Instagram API: The Instagram API Platform can be used to build non-automated, authentic, high-quality apps and services.
-
Foursquare API: The Foursquare API gives you access to Foursquare's places database and the ability to interact with Foursquare users and merchants.
-
HubSpot: A large repository of marketing data, where you can find the latest marketing stats and trends. It also provides tools for social media marketing, content management, web analytics, landing pages and search engine optimization.
-
Moz: SEO insights covering keyword research, link building, site audits, and page optimization, to help companies understand their position on search engines and how to improve their ranking.
-
Content Marketing Institute: The latest news, studies, and research on content marketing.
-
Yelp API: Yelp maintains a free dataset for personal, educational, and academic use. It includes 6 million reviews spanning 189,000 businesses in 10 metropolitan areas. Students are welcome to participate in Yelp’s dataset challenge.
-
Reddit Comments: Reddit released a really interesting data set of every comment that has ever been made on the site. It’s over a terabyte of data uncompressed, so if you want a smaller data set to work with, Kaggle has hosted the comments from May 2015 on their site.
-
Airbnb: Inside Airbnb offers different data sets related to Airbnb listings in dozens of cities around the world.
-
Walmart: Walmart has released historical sales data for 45 stores located in different regions across the United States.
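The Google Trends entry above describes search interest as relative to total search volume rather than as a raw count. A minimal sketch of that idea follows, assuming per-period counts are available; the function name, the 0 to 100 scaling and the numbers are illustrative assumptions, not Google's actual pipeline.

```python
def relative_interest(term_counts, total_counts):
    """Scale a term's share of all searches so the peak period equals 100."""
    shares = [t / n for t, n in zip(term_counts, total_counts)]
    peak = max(shares)
    if peak == 0:
        return [0 for _ in shares]
    return [round(100 * s / peak) for s in shares]


# Hypothetical weekly counts: searches for one term, and all searches.
term = [50, 120, 90]
total = [1000, 2400, 900]
interest = relative_interest(term, total)
print(interest)  # [50, 50, 100]
```

Note that the week with the most raw searches for the term (120) is not the peak: once divided by total volume, the third week has the largest share.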
-
-
Journalism and Media
-
The New York Times Developer Network: Search Times articles from 1851 to today, retrieving headlines, abstracts and links to associated multimedia. You can also search book reviews, NYC event listings, movie reviews, top stories with images and more.
-
Associated Press API: The AP Content API allows you to search and download content using your own editorial tools, without having to visit AP portals. It provides access to AP-owned, member-owned and third-party images, and to videos produced by AP and selected third parties.
-
Google Books Ngram Viewer: An online search engine that charts the frequencies of any set of comma-delimited search strings, using a yearly count of n-grams found in sources printed between 1500 and 2008 in Google's text corpora.
-
Wikipedia Database: Wikipedia offers free copies of all available content to interested users.
-
FiveThirtyEight: A website that focuses on opinion poll analysis, politics, economics, and sports blogging. Its GitHub repository contains the data and code behind the stories and interactives at FiveThirtyEight.
-
Google Scholar: Google Scholar is a freely accessible web search engine that indexes the full text or metadata of scholarly literature across an array of publishing formats and disciplines. It includes most peer-reviewed online academic journals and books, conference papers, theses and dissertations, preprints, abstracts, technical reports, and other scholarly literature, including court opinions and patents.
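To make the "yearly count of n-grams" in the Google Books Ngram Viewer entry concrete, here is a small sketch that computes a phrase's relative frequency per year over a toy corpus. The corpus, the function names and the whitespace tokenization are assumptions for illustration; the real viewer works on Google's own corpora and processing.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return all n-grams in a token list as space-joined strings."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def yearly_frequency(corpus, phrase):
    """Share of same-length n-grams per year that match `phrase`."""
    n = len(phrase.split())
    result = {}
    for year, texts in sorted(corpus.items()):
        counts = Counter()
        for text in texts:
            counts.update(ngrams(text.lower().split(), n))
        total = sum(counts.values())
        result[year] = counts[phrase.lower()] / total if total else 0.0
    return result


# Toy corpus keyed by publication year (invented for illustration).
corpus = {
    1900: ["the steam engine", "the old steam engine"],
    2000: ["the search engine"],
}
print(yearly_frequency(corpus, "steam engine"))  # {1900: 0.4, 2000: 0.0}
```

Dividing by the total number of n-grams in each year keeps years with more printed text from dominating the chart.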
-
-
Business Directory and Review
-
LinkedIn: LinkedIn is a business- and employment-oriented social networking service that operates via websites and mobile apps. It has 500 million members in 200 countries and includes a business directory.
-
Open Corporates: OpenCorporates is the largest open database of companies and company data in the world, with in excess of 100 million companies in a similarly large number of jurisdictions. Its primary goal is to make information on companies more usable and more widely available for the public benefit, particularly to tackle the use of companies for criminal or anti-social purposes, for example corruption, money laundering and organised crime.
-
Yellow Pages: The original source to find and connect with local plumbers, handymen, mechanics, attorneys, dentists, and more.
-
Craigslist: Craigslist is an American classified advertisements website with sections devoted to jobs, housing, personals, for sale, items wanted, services, community, gigs, résumés, and discussion forums.
-
CertainTeed - Find a Pro: Find contractors, remodelers, installers or builders in the US or Canada for your residential or commercial project.
-
Companies in California: All information about companies in California.
-
Manta: Manta is one of the largest online resources that deliver products, services and educational opportunities. The Manta directory boasts millions of unique visitors every month who search its comprehensive database for individual businesses, industry segments and geographic-specific listings.
-
EU-Startups: Directory of startups in the EU.
-
Kansas Bar Association: Directory for lawyers. The Kansas Bar Association (KBA) was founded in 1882 as a voluntary association for dedicated legal professionals and has more than 7,000 members, including lawyers, judges, law students, and paralegals.
-
-
Other Portal Websites
-
Capterra: Directory of business software and reviews.
-
Monster: Data source for jobs and career opportunities.
-
Glassdoor: Directory of jobs, plus the inside scoop on companies, with employee reviews, personalized salary tools, and more.
-
The Good Garage Scheme: Directory for car servicing, MOT and car repair.
-
OSMOZ: Information about fragrance.
-
Octoparse: A free data extraction tool for collecting web data from sites such as those listed above.
-
UNICEF: If data about the lives of children around the world is of interest, UNICEF is the most credible source. The organization’s public data sets touch upon nutrition, immunization, and education, among others.
-
-
Machine Learning Datasets
-
Images
Labelme: A large dataset of annotated images.
ImageNet: The de facto image dataset for new algorithms, organized according to the WordNet hierarchy, in which hundreds or thousands of images depict each node of the hierarchy.
LSUN: Scene understanding with many ancillary tasks (room layout estimation, saliency prediction, etc.)
MS COCO: Generic image understanding and captioning.
COIL100: 100 different objects imaged at every angle in a 360-degree rotation.
Visual Genome: Very detailed visual knowledge base with captioning of ~100K images.
Google’s Open Images: A collection of 9 million URLs to images “that have been annotated with labels spanning over 6,000 categories” under Creative Commons.
Labeled Faces in the Wild: 13,000 labeled images of human faces, for use in developing applications that involve facial recognition.
Stanford Dogs Dataset: Contains 20,580 images and 120 different dog breed categories.
Indoor Scene Recognition: A very specific and very useful dataset, as most scene recognition models perform better on outdoor scenes. Contains 67 indoor categories and 15,620 images.
-
Sentiment Analysis
Multidomain sentiment analysis dataset: A slightly older dataset that features product reviews from Amazon.
IMDB reviews: An older, relatively small dataset for binary sentiment classification, featuring 25,000 movie reviews.
Stanford Sentiment Treebank: Standard sentiment dataset with sentiment annotations.
Sentiment140: A popular dataset of 1.6 million tweets with emoticons pre-removed.
Twitter US Airline Sentiment: Twitter data on US airlines from February 2015, classified as positive, negative, and neutral tweets.
-
Natural Language Processing
HotpotQA Dataset: Question answering dataset featuring natural, multi-hop questions, with strong supervision for supporting facts to enable more explainable question answering systems.
Enron Dataset: Email data from the senior management of Enron, organized into folders.
Amazon Reviews: Contains around 35 million reviews from Amazon spanning 18 years. Data include product and user information, ratings, and the plaintext review.
Google Books Ngrams: A collection of words from Google Books.
Blogger Corpus: A collection of 681,288 blog posts gathered from blogger.com. Each blog contains a minimum of 200 occurrences of commonly used English words.
Wikipedia Links data: The full text of Wikipedia. The dataset contains almost 1.9 billion words from more than 4 million articles. You can search by word, phrase or part of a paragraph itself.
Gutenberg eBooks List: Annotated list of ebooks from Project Gutenberg.
Hansards text chunks of Canadian Parliament: 1.3 million pairs of texts from the records of the 36th Canadian Parliament.
Jeopardy: Archive of more than 200,000 questions from the quiz show Jeopardy.
SMS Spam Collection in English: A dataset of 5,574 English SMS messages, each labeled as spam or legitimate.
Yelp Reviews: An open dataset released by Yelp, containing more than 5 million reviews.
UCI’s Spambase: A large spam email dataset, useful for spam filtering.
-
Self-Driving
Berkeley DeepDrive BDD100k: Currently the largest dataset for self-driving AI. Contains over 100,000 videos covering more than 1,100 hours of driving across different times of the day and weather conditions. The annotated images come from the New York and San Francisco areas.
Baidu Apolloscapes: Large dataset that defines 26 different semantic items such as cars, bicycles, pedestrians, buildings, streetlights, etc.
Comma.ai: More than 7 hours of highway driving. Details include car’s speed, acceleration, steering angle, and GPS coordinates.
Oxford’s Robotic Car: Over 100 repetitions of the same route through Oxford, UK, captured over a period of a year. The dataset captures different combinations of weather, traffic and pedestrians, along with long-term changes such as construction and roadworks.
Cityscapes Dataset: A large dataset that records urban street scenes in 50 different cities.
CSSAD Dataset: This dataset is useful for perception and navigation of autonomous vehicles. The dataset skews heavily on roads found in the developed world.
KUL Belgium Traffic Sign Dataset: More than 10,000 traffic sign annotations from thousands of physically distinct traffic signs in the Flanders region in Belgium.
MIT AgeLab: A sample of the 1,000+ hours of multi-sensor driving datasets collected at AgeLab.
LISA: Laboratory for Intelligent & Safe Automobiles, UC San Diego Datasets: This dataset includes traffic signs, vehicle detection, traffic lights, and trajectory patterns.
Bosch Small Traffic Light Dataset: Dataset for small traffic lights for deep learning.
LaRa Traffic Light Recognition: Another dataset for traffic lights, recorded in Paris.
WPI datasets: Datasets for traffic lights, pedestrian and lane detection.
-
Clinical
MIMIC-III: Openly available dataset developed by the MIT Lab for Computational Physiology, comprising de-identified health data associated with ~40,000 critical care patients. It includes demographics, vital signs, laboratory tests, medications, and more.
PubMed: PubMed, developed by the National Library of Medicine (NLM), provides free access to MEDLINE, a database of more than 11 million bibliographic citations and abstracts from nearly 4,500 journals in the fields of medicine, nursing, dentistry, veterinary medicine, pharmacy, allied health, health care systems, and pre-clinical sciences. PubMed also contains links to the full-text versions of articles at participating publishers' Web sites. In addition, PubMed provides access and links to the integrated molecular biology databases maintained by the National Center for Biotechnology Information (NCBI). These databases contain DNA and protein sequences, 3-D protein structure data, population study data sets, and assemblies of complete genomes in an integrated system. Additional NLM bibliographic databases, such as AIDSLINE, are being added to PubMed. PubMed includes "Old Medline." Old Medline covers 1950-1965. (Updated daily).
Medicare Hospital Quality: The Centers for Medicare & Medicaid Services maintains a database on quality of care at more than 4,000 Medicare-certified hospitals across the U.S., providing for interesting comparisons.
SEER Cancer Incidence: The U.S. government also has data about cancer incidence, again segmented by age, race, gender, year, and other factors. It comes from the National Cancer Institute’s Surveillance, Epidemiology, and End Results Program.
-
-
Date Time
-
Statista Infographics Bulletin Sources
-
[Forbes - Employer Ranking U.S.]
-
[Netflix]
-
[Priori Data]
Individual organizations' websites, press reports, media reports, company filings
-
-
Data for Sale
- CoreLogic: Nationwide Data. https://www.corelogic.com/products/corelogic-store.aspx