With data scientists in high demand, lots of training programs have started up to help people learn the skills necessary to enter the profession. I’m guessing many of you may have made a New Year’s resolution to learn or improve your data science skills, so I thought a post identifying the numerous programs would be helpful. I spent a fair amount of time in 2017 researching the various data science training programs available and have categorized these programs by price, payment structure (flat fee, subscription, per class), and curriculum (structured vs. à la carte). Please keep in mind I am not ranking these training programs, just categorizing them. For the complete list with basic details and pricing of each program, check out this dataset.

Data Science Training Programs

Gateway Drugs: Free, à la Carte Courses

These programs allow you to select and take free data science courses. Kaggle’s kernels are very useful for picking up specific data science skills, whereas Data 36 provides more generalized courses. Both provide an easy way to get familiar with basic data science skills. Udacity actually uses its free courses as a gateway to its paid nanodegree programs, and you can take quite a bit of the content for free.

Enter the Dojo: Free Training Programs with Structured Curriculum

Two variations of the free program with structured curriculum are available. The first is the general introduction to data science. These programs include Allison, Future Learn, edX, and Cognitive Class. All provide a structured sequence of courses you can take to learn various aspects of statistical analysis programming. Future Learn and edX charge for certifying completion of their programs, but are otherwise free. The second type is programs like Insight and Data Incubator, which target science Ph.D.s transitioning from their fields into data science.
Both offer highly intensive data science boot camps to build on the statistical skills acquired in a Ph.D. program. Both programs then serve as recruiters, placing their students with companies. They make their money on the program from the recruiter fees (remember, if it’s free, you’re the product). Cool win/win idea for recovering academics.

Training Buffets: Pay per Course

Several programs allow you to select data science training courses and pay per course. These programs often have a large catalog of courses and are reasonably priced ($7 - $10 per course). Data Origami, Udemy, and some Udacity content are available on a per-course basis. Good way to try a few classes to see if data science is right for you or to pick up some specific skills. Dataversity also follows this model, but at a much higher price point ($79 - $129 per course). District Data Labs also has a premium offering focused on corporate training ($25 per participant).

Subscribe Now! Structured Curriculum with Paid Subscription

These data science training programs provide a structured curriculum with multiple learning paths for a low monthly rate. Coursera, Data Camp, Dataquest, Lynda, and O’Reilly all follow this model and offer monthly subscriptions from $25 to $50 a month. These programs incentivize you to hustle: you learn at your own pace, so the faster you move through the content, the less you pay. Data Camp and Dataquest appear to be some of the more popular in the data science community.

Old School: Structured Curriculum with Upfront Tuition

Not surprisingly, this old school model is the most prevalent way to offer data science training. Udacity is perhaps the king of this space, offering the best value with good content in a structured course sequence ($499 - $699). From there, prices range from $699 (Simplilearn Data Science) to $8,500 (Thinkful).
Brain Station, General Assembly, K2 Data Science, Springboard, and The Institute for Statistics Education all fall somewhere between these two price points. These programs typically differentiate themselves from the Subscribe Now! programs by offering one-on-one mentors who can help you through the program. If the one-on-one matters to you, this is the way to go. Otherwise, I did not see a lot of difference between the tuition- and subscription-based models. In fact, Springboard actually uses Data Camp’s content.

I do not have university courses identified here. If you’ve got plenty of time and money to burn, check out this dataset listing all university data science programs.

If I’ve missed a program or there are any inaccuracies, please shoot me an email.
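To put rough numbers on the subscription-versus-tuition comparison above, here is a minimal sketch that uses only the price ranges quoted in this post ($25 to $50 a month for subscriptions, $499 as the low end of Udacity's upfront tuition). The function names are made up for illustration, and the calculation ignores everything except price, such as mentoring or the actual length of any program.

```python
import math


def subscription_cost(monthly_rate: float, months: int) -> float:
    """Total paid on a monthly subscription after a given number of months."""
    return monthly_rate * months


def break_even_months(tuition: float, monthly_rate: float) -> int:
    """Whole months after which a subscription costs at least as much as upfront tuition."""
    return math.ceil(tuition / monthly_rate)


if __name__ == "__main__":
    # Finishing a subscription program in three months at the low-end rate.
    print(subscription_cost(25, 3))
    # Months before each subscription rate catches up with $499 of tuition.
    print(break_even_months(499, 50))
    print(break_even_months(499, 25))
```

Under these assumptions, a subscription only becomes the more expensive option if you take the better part of a year or more to finish, which is the arithmetic behind "the faster you move through the content, the less you pay."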
December 2017 Newsletter

Happy New Year! Business aside, Greg and I want to thank everyone who has supported us through 2017 and on into 2018. We especially want to thank our very patient wives Bailey and Chaz, as they make us and Data & Sons much, much better in so many ways.

We are big believers that Data & Sons will enable millions to join the knowledge economy by selling data on our marketplace. The amount and diversity of information in a society are two of the biggest predictors of that society’s new knowledge development. By providing a more equitable way to acquire and transfer data, we genuinely believe making all of this new data available will speed innovation and social progress. Thanks to everyone who is making this a reality.

2017 was a major year for Data & Sons and we are proud to announce we finished up strong. In December, our site traffic was up over 1000% from November and we continue to see strong user growth with more datasets being added to our marketplace. Our initial sales and site traffic have resulted in our first two investors coming on board. Both investors are seasoned tech entrepreneurs and leaders and we are very excited about the rapid growth we can continue to foster with their capital and expertise!

In November, we anticipated completing a partnership agreement with a data brokerage to provide a greater variety of data. I am happy to announce we now have two such partnerships. The partnerships mean Data & Sons buyers will now have access to over 20 million business and customer contacts! We think this will be a win for all involved in the Data & Sons marketplace. Look for the new data to be added to the marketplace in early January.

We also announced our data request and affiliate partner management platform. We are pleased to report the data request feature is in full development and the affiliate management platform is now in beta.
The data request feature gives buyers the ability to request datasets at a specified price. This effectively creates demand in our marketplace, reducing some of the uncertainty around not knowing what data will sell and for what price. Our affiliate partner management system enables people who refer data sellers to Data & Sons to receive a commission whenever their affiliated seller sells any data. We think this will drive a lot of new content to the site and allow us to rapidly scale.

What’s next in 2018

January will see our first marketing campaign directed at driving buyers to our marketplace. We will focus on Lead Generation datasets since this has been the most active category on Data & Sons. Lead Generation datasets provide buyers the contact information of prospective customers they can use for direct (email, phone, mail) and social media (Facebook Audiences) marketing. We think the start of a new year is the perfect time to help businesses find new customers.

We will also continue to bring on investors and will be presenting on January 25th at the Wave Tampa Bay. Our presentation will provide a more detailed understanding of Data & Sons and our revolutionary business model. Please contact me if you would like to attend.

We anticipate continued investor funding will allow us to continue to grow Data & Sons. We will be adding several new team members in 2018. Our new team members will be focused on (1) growing specific data categories by finding data sellers and buyers underserved by how these types of data are currently developed, acquired, and transferred; (2) developing outstanding marketing content that educates buyers and sellers on how to make money on Data & Sons; and (3) strengthening our development team with more engineers.

Adding to our development team enables us to stay adaptive and continue to roll out great new features. We anticipate adding a bidding function to the marketplace. Buyers can make a bid on any dataset and the Seller can then take or counter the bid.
This will make our market more price efficient. We will also be adding a tutorial section. Buyers and Sellers can learn how they can make money on Data & Sons’ revolutionary marketplace by developing and/or acquiring data for sale.

We hope 2017 was as exciting a year for you as it was for us. Here’s to all of us being empowered, successful, and safe in 2018!
What is a Data Scientist?

I’ve been trying to answer this question for well over a year. As an academic turned entrepreneur, I was intrigued by the title data scientist. We were building Data & Sons at the time, and identifying core customers was a key part of the design process. It seemed that people who both developed and utilized data would be natural sellers and buyers on our marketplace. Sounded like data scientists might just be this type of person.

Answering that question took a surprisingly long time. After spending over a year reading, researching, and discussing with people across Fortune 500 companies, startups, data science centric social media, and data science training programs, I think I have a working solution. A prototype data scientist definition if you will. Given the number of posts on DataTau, Medium, and Reddit asking this same question, I think taking the time to put together a solid working definition is value added for lots of people in the data science field (profession, community, industry?), especially for people interested in joining the profession.

So first, what’s data science? As an organizational scientist, I learned and applied the traditional scientific method: review/observation, theory/hypotheses development, collect data, test hypotheses using statistical analysis, and hope to find something publishable. The idea is that the data you collected (your sample) was generalizable to the overall population. So if you found results that supported your hypothesis in a sample of 600 people, you would argue this would be the case in the greater population when you published the study.

Then along came big data. You would no longer need a sample because you could plausibly have the entire population. Instead of 600 people, you now had 4 million if you were Facebook. No need to mess with theory and hypothesis development; you simply ran statistical analysis of the population and the results told you everything you needed to know.
This is what led Wired Editor Chris Anderson to observe that Theory is Dead in 2008. It is in this context that Jim Gray coined the term data science. Data science was accumulating enough data that you could skip theory and hypotheses development and rely on the statistical relationships you found in the data. Data science is essentially a science hack.

So does that make data scientists science hackers? I’m going to say no for one primary reason: you need more skills for data science than you do for traditional science. So sure, it’s a hack of the scientific method, but it takes more dedicated learning, experience, and effort to be able to hack that process. Not a very good hack if it requires more effort. It is possessing these skills that I think makes someone a data scientist. Therefore:

Data scientists are professionals competent in statistical analysis, computer programming, and applied problem solving in their domain of interest.

The Venn diagram below illustrates how possessing different combinations of these skills makes people good at different data centric jobs. Because there are so many people running around calling themselves data scientists today, I think the diagram also does a good job of illustrating who is not a data scientist. Let’s review each.

Statistical Competence. I put this at the top of the Venn diagram because understanding statistics is at the core of data science (or really any other data centric role). The whole point is to skip theorizing and rely on statistical relationships. If you cannot find these relationships in your data, you cannot play in data science. This also means you will need to be proficient in R, SPSS, SAS, or Stata, and likely some of the method/model specific software packages.

Applied Problem Solving. I think there are lots of people out there who have statistical competence and/or computer programming skills with “data scientist” in their current job title. I would however argue that they are not data scientists. Why?
Remember the first part of the scientific process is review/observation, which is studying and trying to develop a basic understanding of some subject or phenomenon before you start asking your own research questions. What do people already know about this subject? What don’t we know yet? While big data may take away the need for developing new theory and hypotheses, you still need to know what it is you are studying. If you don’t, you’re going to spend a lot of time and resources to get obvious answers to stupid questions. There’s no faster way to get marginalized in an organization than making more money than most people in the room and presenting them with a detailed research project that tells them exactly what they already knew five years ago.

A data scientist has to know what questions to ask. This requires that you develop a thorough understanding of whatever you are examining with data science (e.g. business, public policy, educational outcomes, etc.). The practitioners (business people, policy wonks, educators, etc.) know their subject area, but they often do not understand the tools data scientists bring to the table and thus have no idea what to tell you to do. In the 2017 Kaggle State of Data Science Survey, the fifth most cited barrier at work (30.2% of respondents) was “Lack of a Clear Question to Answer.” If you don’t know what questions to ask, you cannot have scientist in your job title. Inquiry, whether done through thoughtful theory development or studying massive amounts of data, is at the heart of ALL science. All inquiry starts with asking the right questions.

Computer Programming. Large amounts of well organized, accurate, and authentic data are the world’s most valuable resource. This means you are unlikely to just come across such data anytime soon, so you’ll need to develop it yourself. You will also need to do this on a repeated basis (i.e. not a one time data collection). This may be a few times a year, once a day, or continuously in real time.
To collect and analyze data on a repeated basis, you’ll need to build a system that (1) acquires and updates data; (2) organizes that data from different sources into a coherent structure; (3) can pass that data into some sort of statistical analysis; (4) presents results in a clear manner (often as visualization); and (5) does all of this on an automated basis. It’s this last part (the automation) that separates people proficient in statistical analysis who can accomplish tasks 1-4 from data scientists. Most academic researchers (PhD types like me) are highly proficient in tasks 1-4, but are completely clueless when asked to repeat that process on an ongoing basis. Automating that process requires being able to tell a computer to do it, and that requires proficiency in Python, SQL, C++, and/or some other programming language. While strong in statistical analysis and applied problem solving, I would not identify as a data scientist until I had improved upon my current Python and SQL skills...unless of course you had a lot of money to throw at me.

Reality is the job market for data scientists is very, very hot right now. I realize there are and will continue to be more and more people calling themselves data scientists who do not possess all three of the skills identified. I do think the three skills provide a good educational progression for becoming a data scientist: start with stats, move to programming, and then gain a solid understanding of the area where you are going to apply your craft. Likely, you will be marketable with a solid statistics background (Data Incubator and Insight Data Science both exist to train you up on the programming side while getting you hired), you will be highly desired as someone with both statistics and programming skills, and once you have several years’ experience in a particular industry, you will be extremely sought after and courted as a full-fledged data scientist.
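The five-part system described earlier in this post (acquire, organize, analyze, present, automate) can be sketched in a few lines of Python. This is only a minimal illustration under stated assumptions: the two data sources, their field names, and the numbers are all made up, the "analysis" is a pair of simple averages, and the presentation step prints text where a real system would more likely draw a chart.

```python
import statistics
from typing import Callable


# Hypothetical stand-ins for real sources (an API, a database, a scraper).
def fetch_sales() -> list[dict]:
    return [{"region": "north", "sales": 120.0}, {"region": "south", "sales": 80.0}]


def fetch_visits() -> list[dict]:
    return [{"region": "north", "visits": 300}, {"region": "south", "visits": 250}]


def acquire(sources: list[Callable[[], list[dict]]]) -> list[list[dict]]:
    """Step 1: pull the latest records from every source."""
    return [source() for source in sources]


def organize(batches: list[list[dict]]) -> dict[str, dict]:
    """Step 2: merge records from different sources into one table keyed by region."""
    table: dict[str, dict] = {}
    for batch in batches:
        for record in batch:
            table.setdefault(record["region"], {}).update(record)
    return table


def analyze(table: dict[str, dict]) -> dict[str, float]:
    """Step 3: run a deliberately simple statistical analysis."""
    sales = [row["sales"] for row in table.values()]
    per_visit = [row["sales"] / row["visits"] for row in table.values()]
    return {
        "mean_sales": statistics.mean(sales),
        "mean_sales_per_visit": statistics.mean(per_visit),
    }


def present(results: dict[str, float]) -> str:
    """Step 4: format the results for a reader."""
    return "\n".join(f"{name}: {value:.2f}" for name, value in results.items())


def run_pipeline() -> str:
    """Step 5: a single entry point that can be run unattended."""
    return present(analyze(organize(acquire([fetch_sales, fetch_visits]))))


if __name__ == "__main__":
    print(run_pipeline())
```

The point of wrapping steps 1-4 in `run_pipeline` is that automation then becomes a scheduling question: a job scheduler such as cron can call the script daily without anyone touching it, which is the piece the post argues most researchers are missing.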
List of Lists

Data science, big data analytics, and machine learning are all fueled by datasets. Access to datasets is the first part of kicking off any project, and the quality and amount of data determine how successful these projects will be. So where do you get datasets? Almost any person in the data science profession will tell you they had to acquire, collect, scrape, or aggregate data from other sources.

In order to make finding datasets easier, several websites have been created to aggregate datasets in publicly available lists. Datasets on these sites are largely available for free, with sites that offer datasets for purchase just starting to pop up. As a co-founder of Data & Sons, a new dataset marketplace, I found myself spending a lot of time identifying as many of these as I could. I thought it would be useful to share this list of lists with others seeking to find the datasets they need.

Limitations. First, I stuck with broad listings and did not include ones that focused on a particular type of data (e.g. government, locations, mailing addresses, etc.). Second, I did not include lists that only allow access to certain groups of people (e.g. academics) or entities (e.g. businesses only).

In alphabetical order, here’s my list of dataset lists.

Amazon Web Services: Decent-sized list of well organized datasets you can search by several categories. A nice collection of government, scientific, and business data. However, it seems most of the data was posted back in 2015, with a limited number of more recent datasets.

Analytics Bodhi provides data science/analytics training and resources. They’ve included a list of publicly available datasets in their resources with brief descriptions. They link to both datasets and other dataset lists…an early version of the project I am undertaking.

Awesome Public Datasets: Authored by Xiaming Chen and posted on Github (caesar0301), Awesome Public Datasets is perhaps one of my favorites.
Well organized into categories, the list identifies many publicly available scientific datasets. Simple, clean, elegant, a wonderful list for anyone in the sciences.

BigML: The very smart folks at BigML have put together a list of data sources they have come across in their work. Some really interesting finds on here, well organized by types of datasets/data providers.

Data Circle allows people to buy, sell, and share datasets easily. The site is in beta with a limited number of datasets. I like the clean, happy interface, and am a big believer in their business model ;-)

Datahub: Basic listing page with around 200 datasets and a description of each. I found there were more unique datasets (i.e. data not available on other lists) on here than in other dataset locations.

Data is Plural is a publicly available Google Docs spreadsheet created by Jeremy Singer-Vine and curated by a team of diligent humans. I believe this is one of the most extensive, diligent, and well maintained dataset lists available. As new sources of data are identified, they get added to the spreadsheet, which has entries from 2015 to the present.

Data & Sons enables people to buy, sell, and share datasets. Data seems to be available at two extremes: free and very expensive. With over 90% of data inaccessible, we thought creating a market would create a more equitable exchange of data, leading to increased data accessibility.

Datausa.io does perhaps the best job of data summation and presentation I’ve seen. There is a lot of information about the USA on here and they do a superb job of providing data driven averages for a wide range of things in the US. They also provide many US related datasets for download and a slick shopping cart interface. Nod to Daniel Shorstein for bringing Datausa.io to my attention.

Data.world provides individuals and businesses the ability to collaborate on data science projects with a suite of analytics tools, storage, and professional networking.
They also provide free datasets and other data types. The only issue I have with data.world is that it requires a user account to view and access their datasets.

Google Public Data: Datasets from governments around the world. Neat, clean, no fuss, but limited to around 130 sources.

Kaggle is without a doubt the center of the data science universe. Acquired by Google in March 2017, Kaggle provides data scientists a place to connect, learn, and earn some extra money through their competitions. Kaggle also provides perhaps the most extensive list of free datasets I have come across. Cleanly organized and always interesting to browse, Kaggle has over 5,000 datasets. Based on user profiles, it seems Kaggle has several employees engaged in creating new datasets. I think those Kaggle generated datasets are some of the most interesting and well put together sets you can find for free.

OpenDataSoft might just be the largest and most diverse offering of datasets I have seen, with over 9,000 datasets from all over the world. Datasets range from the government, business, and social data we see on other lists to some really fun ones like workout data from Apple watches. I was part of a conversation recently on whether English was the language of datasets and whether you needed to be an English speaker to first get into data science. I was happy to see that there are many datasets in other languages on OpenDataSoft and that over 4K of the datasets were in French. Thanks to Nicolas Terpolilli for directing my attention to OpenDataSoft.

Socrata has a basic interface searchable by category. Socrata also offers several types of data products including datasets, documents, forms, etc. There is a massive number of data products on Socrata, making it one of the largest sources available. Many of the datasets seem to be from government sources.

ZeMiner primarily offers datasets on web sites, traffic, and usage from Amazon, Google, and Facebook.
Although they are fairly focused in their data offerings, there are a lot of datasets here that can provide a great deal of insight into a broad range of Internet activities.

Please let me know of any dataset sources I may have missed and I will do my best to keep the list of lists updated!
November Newsletter

November was a very momentous month for Data & Sons. We had five major accomplishments this month and a lot of good news to share.

(1) First, we completed an entire redesign of www.dataandsons.com. The new Data & Sons is designed with a more elegant interface to make the presentation of information on the site more structured and intuitive. Our number one goal is connecting data buyers and sellers and we think the ordering of information now makes that a much more fluid process. Easier to upload datasets, easier to find datasets, good for everyone. I want to recognize and thank Aaron Darr and his team in Perth, Australia for all the excellent work on making this an outstanding data marketplace.

(2) Our next major accomplishment was being accepted into the Tampa Bay Wave’s Accelerator program. Data & Sons was one of 13 companies selected from over 100 applicants this year. We have now moved offices to the Wave and have enjoyed a wealth of great information, meaningful connections, and some very good business leads. For anyone considering the Accelerator program, we give our highest endorsement.

Landing in the Wave also led to some media coverage for Data & Sons. We were mentioned along with the other Accelerator companies in Tampa Bay Business Journal, Tampa Bay Times, the Business Observer, tech.com, and niblets.com. We also want to thank Wave Founder and President Linda Olson and Accelerator Program Director Rich Heruska for all the work they put into the Wave and the Accelerator. A thanks and shout out to the Accelerator selection committee too!

(3) Our third major accomplishment was shattering all of our November growth goals. Our initial goal was 150 users and 100 datasets uploaded on the site. I’m ecstatic to report we now have over 200 users and 174 datasets uploaded! We look forward to continued user and content growth to finish out 2017 strong.
After completing the site redesign, we began our first marketing campaign after the Thanksgiving break. Beginning on 11/27, we launched Google AdWords campaigns to increase site traffic and brand recognition as well as to direct buyers to specific datasets for sale. We also added social media campaigns on Facebook and had a lot of site traffic from a Reddit user posting our complete listing of US Craft Breweries on the Datasets subreddit.

(4) Combined with the Wave related media mentions, our site traffic was up 443% for the last week of November. Not a bad way to shake off the turkey!

(5) Finally, we got our first sales. The increased site traffic led to two purchases of the US Craft Beer dataset. The first was from our good friend Tom Williams, Founder of eLease and St Pete Brewing, who saw the dataset on Facebook. Being revenue positive is a major accomplishment for any startup and we are extremely thankful to have crossed over into the world of money making ventures.

What’s next in December

Big things ahead for Data & Sons before the end of 2017. First, we anticipate finalizing terms with our first third party data provider. Data & Sons would provide a branded, custom gateway to their excellent data. This would significantly increase our data content while providing our partner with a new marketing and distribution channel. We think this will be a win for all involved in the Data & Sons marketplace.

Second, we will be adding a data request feature to our marketplace. The feature people asked for the most was the ability for buyers to request datasets at a specified price. You spoke, we listened. We will be adding this excellent idea to the market in December. This will be a great way for data sellers to understand what kind of demand exists and for buyers to readily get the kinds of data they need.

Third, we will be launching our affiliate partner management system. The system will create a unique weblink the partners can use in their marketing.
When a partner’s customer visits Data & Sons through the unique link and creates a user account, the customer is now affiliated with that partner. Anytime that customer sells data on Data & Sons, the partner receives a commission. This will allow us to reward individuals and businesses that drive content to the Data & Sons marketplace. It also allows a lot of people and businesses to get rewarded for creating new revenue streams for their customers. In keeping with our core values, we think our new affiliate system will be a major win-win-win-win for everyone on Data & Sons.

Finally, growth goals for year end: 2x users and content. Time to get to work!

Photo by Patrick Tomasso on Unsplash
We are very honored and proud to have joined the Tampa Bay Wave Accelerator. Data & Sons was selected with 12 other excellent startups from over 100 applications. Joining the Wave enables us to tap into the Wave’s excellent resources and gives us the opportunity to sit at the nexus of Tampa Bay’s technology ecosystem. This is a big step towards connecting data buyers and sellers from across the world! Special thanks to Wave President Linda Olson, Entrepreneur in Residence Rich Heruska, and the Accelerator Selection Committee.