Every year, Stack Overflow conducts a massive survey of people on the site, covering topics such as programming languages, salary, and code style. This year, they amassed more than 64,000 responses from 213 countries.
The data is made up of two files:
1. survey_results_public.csv - CSV file with main survey results, one respondent per row and one column per answer
2. survey_results_schema.csv - CSV file with survey schema, i.e., the questions that correspond to each column name
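As a rough sketch of how the two files fit together, the snippet below maps each column in the results to its question text from the schema. The header names `Column` and `Question`, and all sample rows, are assumptions made for illustration; check the real files before relying on them.

```python
import csv
import io

# Tiny in-memory stand-ins for the two files.
# Header names and values are hypothetical, not taken from the real data.
SCHEMA_CSV = """Column,Question
Respondent,Respondent ID number
Country,In which country do you currently live?
"""

RESULTS_CSV = """Respondent,Country
1,Canada
2,Germany
"""

def load_schema(handle):
    """Map each column name to the question text it stands for."""
    return {row["Column"]: row["Question"] for row in csv.DictReader(handle)}

def load_results(handle):
    """Read one dict per respondent, keyed by column name."""
    return list(csv.DictReader(handle))

schema = load_schema(io.StringIO(SCHEMA_CSV))
results = load_results(io.StringIO(RESULTS_CSV))

# Show which question each results column answers.
for column in results[0]:
    print(f"{column}: {schema.get(column, '(no question text)')}")
```

With the real files, the `io.StringIO(...)` arguments would be replaced by `open("survey_results_schema.csv", newline="")` and `open("survey_results_public.csv", newline="")`.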
The data is taken directly from Stack Overflow and is licensed under the ODbL.
For the first time, Kaggle conducted an industry-wide survey to establish a comprehensive view of the state of data science and machine learning. The survey received over 17,000 responses and we learned a ton about who is working with data, what’s happening at the cutting edge of machine learning across industries, and how new data scientists can best break into the field.
To share some of the initial insights from the survey, we’ve worked with the folks from The Pudding to put together this interactive report. They’ve shared all of the kernels used in the report here.
The data includes 5 files:
schema.csv: a CSV file with survey schema. This schema includes the questions that correspond to each column name in both the multipleChoiceResponses.csv and freeformResponses.csv.
multipleChoiceResponses.csv: Respondents' answers to multiple choice and ranking questions. These responses are not randomized, so a single row corresponds to all of a single user's answers.
freeformResponses.csv: Respondents' freeform answers to Kaggle's survey questions. These responses are randomized within a column, so reading across a single row does not give a single user's answers.
conversionRates.csv: Currency conversion rates (to USD) as accessed from the R package "quantmod" on September 14, 2017.
RespondentTypeREADME.txt: This is a schema for decoding the responses in the "Asked" column of the schema.csv file.
Kernel Awards in November
In the month of November, we’re awarding $1000 a week for code and analyses shared on this dataset via Kaggle Kernels. Read more about this month’s Kaggle Kernels Awards and help us advance the state of machine learning and data science by exploring this one-of-a-kind dataset.
This survey received 16,716 usable responses from 171 countries and territories. If a country or territory had fewer than 50 respondents, we grouped its respondents into a group named “Other” for anonymity.
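The grouping rule can be sketched in a few lines. This is only an illustration of the stated threshold, not the code Kaggle used; the function name and sample values are hypothetical.

```python
from collections import Counter

def group_small_countries(countries, min_count=50, other_label="Other"):
    """Relabel any country with fewer than min_count respondents as other_label."""
    counts = Counter(countries)
    return [c if counts[c] >= min_count else other_label for c in countries]

# Hypothetical sample with a lowered threshold so the effect is visible.
sample = ["India", "India", "India", "Peru"]
grouped = group_small_countries(sample, min_count=2)
print(grouped)
```

One consequence for analysis: per-country statistics for the "Other" group mix many small countries and should not be read as describing any single one.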
We excluded respondents who were flagged by our survey system as “Spam” or who did not answer the question regarding their employment status (this question was the first required question, so not answering it indicates that the respondent did not proceed past the 5th question in our survey).
Most of our respondents were found through Kaggle channels, such as our email list, discussion forums, and social media channels.
The survey was live from August 7th to August 25th. The median response time for those who participated in the survey was 16.4 minutes. We allowed respondents to complete the survey at any time during that window.
We collected salary data by first asking respondents for their day-to-day currency, and then asking them to write in their total compensation.
We’ve provided a csv with an exchange rate to USD for you to calculate the salary in US dollars on your own.
The salary question was optional.
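A minimal sketch of the conversion follows. It assumes that conversionRates.csv has one row per currency, that the headers are `originCountry` and `exchangeRate`, and that the rate is the number of US dollars per one unit of the currency; all three are assumptions to verify against the file. Because compensation was written in by hand and the question was optional, the helper returns `None` for blank or unparseable values.

```python
import csv
import io

# Stand-in for conversionRates.csv; header names and rates are hypothetical.
RATES_CSV = """originCountry,exchangeRate
USD,1.0
EUR,1.25
"""

def load_rates(handle):
    """Map each currency code to its assumed USD-per-unit rate."""
    return {row["originCountry"]: float(row["exchangeRate"])
            for row in csv.DictReader(handle)}

def to_usd(amount_text, currency, rates):
    """Convert a written-in amount to USD, or return None if it cannot be converted."""
    if not amount_text or currency not in rates:
        return None  # question skipped, or no rate for this currency
    try:
        amount = float(amount_text.replace(",", ""))  # tolerate "40,000"
    except ValueError:
        return None  # free-text answer that is not a number
    return amount * rates[currency]

rates = load_rates(io.StringIO(RATES_CSV))
print(to_usd("40,000", "EUR", rates))
print(to_usd("", "USD", rates))
```

Returning `None` rather than zero keeps skipped answers out of averages and medians.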
Not every question was shown to every respondent. In an attempt to ask relevant questions to each respondent, we generally asked work-related questions to employed data scientists and learning-related questions to students. There is a column in the schema.csv file called "Asked" that describes who saw each question. You can learn more about the different segments we used in the schema.csv file and RespondentTypeREADME.txt in the data tab.
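One way to use the "Asked" column is to list the questions shown to a given segment before analyzing them, so that blanks from respondents who never saw a question are not mistaken for skipped answers. In this sketch the `Column` and `Question` headers, the sample rows, and the segment labels are hypothetical; the real labels are described in RespondentTypeREADME.txt.

```python
import csv
import io

# Stand-in for schema.csv; only the "Asked" header is named in the description.
SCHEMA_CSV = """Column,Question,Asked
Age,What is your age?,All
JobTitle,What is your current job title?,Workers
FirstCourse,Which course did you take first?,Learners
"""

def columns_for(schema_rows, segment):
    """Return the column names whose 'Asked' value matches the segment."""
    return [row["Column"] for row in schema_rows if row["Asked"] == segment]

schema_rows = list(csv.DictReader(io.StringIO(SCHEMA_CSV)))
print(columns_for(schema_rows, "Learners"))
```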
To protect the respondents’ identity, the answers to multiple choice questions have been placed in a separate data file from the open-ended responses. We do not provide a key to match up the multiple choice and free form responses. Further, the free form responses have been randomized column-wise, so the responses that appear on the same row did not necessarily come from the same survey-taker.
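To see what column-wise randomization implies, the sketch below shuffles each column of a small table independently. It is an illustration only and not the procedure Kaggle used; the column names and answers are made up.

```python
import random

def shuffle_columns(rows, seed=0):
    """Shuffle every column independently, breaking the link between cells in a row."""
    rng = random.Random(seed)
    # Pull each column out as its own list of values.
    columns = {name: [row[name] for row in rows] for name in rows[0]}
    for values in columns.values():
        rng.shuffle(values)
    # Reassemble rows from the independently shuffled columns.
    return [dict(zip(columns, combo)) for combo in zip(*columns.values())]

# Hypothetical free-form answers from three respondents.
original = [
    {"FavoriteTool": "pandas", "BiggestBarrier": "dirty data"},
    {"FavoriteTool": "ggplot2", "BiggestBarrier": "lack of time"},
    {"FavoriteTool": "spark", "BiggestBarrier": "no budget"},
]
shuffled = shuffle_columns(original)
for row in shuffled:
    print(row)
```

Each column keeps exactly the same set of answers, so per-question summaries remain valid, while comparisons across columns within a row are no longer meaningful.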