Talking About Practice: Finding Datasets To Develop Your Data Science Skillset
It’s been almost 6 months since I began my data science education, and we are now at the point of having opportunities to apply our skills outside of classically assigned datasets. This means I get to learn the valuable skill of finding fitting datasets to practice coding on, not only for the fun of development, but as a “choose your own adventure” project. In this case, we are focusing specifically on applying classification modeling.
For those reading this from outside the data science space and asking
“but what does classification modeling actually do?”
my favorite way of explaining it is that I am assigned to create a simplified, non-image-based version of hotdog/not hotdog (Silicon Valley, anyone?).
Now, where to start finding that dataset…
As a base, we were given suggestions to look into available datasets on Kaggle, the UCI Machine Learning Repository, and the Awesome Datasets Repo on GitHub. Each of these proved to be a good starting place with a wealth of options ranging from agriculture to esports.
Beyond the guidance to specific websites, the best advice I was given was to approach the search with a “problem first” mindset rather than a “data first” one.
When it comes to the exercise of creating a model, it is an advantage to have contextual knowledge around the data. For instance, our last project focused on predicting housing prices in King County, Washington. Having semi-recently purchased a house, I had a general idea of what skewed prices higher (square footage, lot size, etc.) as well as the appraisal process, which compares your home to those around it to make sure the price is fair for buyer and seller. This tied together well with the data points available, including the square footage and lot size of other homes in a given radius. On the other hand, if I were a realtor, I would have been even better off: I would have the lived-experience data and intuition to understand the biggest levers on a price that a one-time homebuyer would not immediately think of, or better yet, to know what other data would be helpful to fold into the original dataset to create an even more predictive model.
Now it is my turn to identify my contextual expertise and let that drive the data selection experience. This led me back to my first post on my personal why behind learning data science. I figured if those remain my goals for applying these skills, let’s start connecting the dots and getting familiar with data in these spaces.
In that post, I focus on data science as a complementary skill to my nearly decade of experience in Sales and Marketing Operations and tech; and while going through this latest batch of learning around model development, I kept relating it back to the use case of lead scoring. Sales teams always want to be as efficient as possible with their time, and marketing teams can help them with that by bringing in leads. Sometimes, the number of leads being funneled to Sales teams is overwhelming and not all of them are high quality. This is where lead scoring comes in as a filter, passing only the “hottest” leads, those with the highest and most urgent interest, on to Sales. So, why not find a leads dataset with data points on those that converted or ended in a closed deal?
Through Kaggle, I found a couple of datasets simply by searching “lead score”, but only one of them had a high usability score (Kaggle’s way of saying the data is in a workable state for analysis and problem solving). The dataset was for a Brazilian equivalent of Amazon, where leads would come in from a landing page, be followed up with by a Sales team member, and, ideally, be closed as an opportunity by the Sales team. A closed won opportunity would mean the new customer would open an online store on the platform. This was perfect, as lead management has been a major focus for me over the years. I could play the expert realtor for this round of model development.
Of course this was the first attempt, and I ran into a problem. The dataset was separated into 2 files: one with the MQL (Marketing Qualified Lead) data and one with the closed won opportunity data. Once I merged the two sets, there was a small overlap of leads that Sales closed. My experience told me this was to be expected as you are lucky if 5–10% of your MQLs get to a closed Sales opportunity. This was not going to be the problem.
The major problem that nixed the idea of this dataset being a fit was that much of the lead data that should theoretically be available upon obtaining a lead, like lead source and industry, was only available on the leads that were connected to a closed won opportunity. I would not be able to build a well-functioning model with this, since technically the only feature that applied to the whole merged dataset was the landing page the lead converted on. This is not enough to work with. With the other features only available on leads with a closed won opportunity, the model wouldn’t be able to parse which of those feature values would be more or less predictive of a lead becoming a closed won opportunity. It’s like giving the model the answer key instead of training it to predict and generalize what leads to success.
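To make that “answer key” problem concrete, here is a minimal sketch in pandas of how the check could look. The two small tables and their column names (`lead_id`, `landing_page`, `lead_source`, `industry`) are made-up stand-ins, not the actual Kaggle files. The idea is to merge the leads with the closed deals, flag which leads converted, and then compare how often each feature is missing for converted versus non-converted leads.

```python
import pandas as pd

# Hypothetical stand-ins for the two files; names and values are assumptions.
mqls = pd.DataFrame({
    "lead_id": [1, 2, 3, 4, 5, 6],
    "landing_page": ["a", "b", "a", "c", "b", "a"],
})
closed = pd.DataFrame({
    "lead_id": [2, 5],
    "lead_source": ["paid_search", "referral"],
    "industry": ["retail", "pets"],
})

# A left join keeps every lead; the indicator column records
# whether a closed won opportunity matched that lead.
merged = mqls.merge(closed, on="lead_id", how="left", indicator=True)
merged["converted"] = (merged["_merge"] == "both").astype(int)
merged = merged.drop(columns="_merge")

# Share of missing values per feature, split by the target.
feature_cols = ["landing_page", "lead_source", "industry"]
missing_by_target = merged[feature_cols].isna().groupby(merged["converted"]).mean()

# A feature that is always missing for non-converted leads and never
# missing for converted ones simply restates the target.
leaky = [
    col for col in feature_cols
    if missing_by_target.loc[0, col] == 1.0 and missing_by_target.loc[1, col] == 0.0
]

print(missing_by_target)
print("Features that give away the answer:", leaky)
```

In this toy data, only `landing_page` is populated for every lead, which mirrors the situation described above: everything else is present exactly when the lead converted.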
My next stop was animal rescue. Austin is one of the leading cities for no-kill animal shelters, a mission I am passionate about and donate my time and finances to regularly, and the city has an open dataset on animal intakes and outcomes going all the way back to 2013. That told me there was more than enough data available, but what did it look like?
Upon some initial investigation, I found that the amount of cleansing both datasets needed, the extent of null values, and the instability of animals coming in and out of the shelter multiple times (sadly, this occurs more often than I’d like to think) added a level of complexity to building a good foundation dataset for the assignment. In service of my stress level for this project, I decided it was valuable to know this data exists, but the assignment was to focus on classification, not the ins and outs of cleansing and data wrangling. I’m telling myself now that I’ll be back for that dataset outside of school life.
With those two promising datasets seemingly not a fit, I was off to the next pillar, fitness. Quarantine has made me more of a spin person after my stint in half and full marathon running, and yes, I am the person in the Peloton memes. So, as any Peloton Road Warrior would, I went deep on finding a way to pull ride data beyond just my own rides, and even tried to find anonymized ride data from the company in places like Kaggle. The fortunate part is I am far from the first person who has tried, but the API documentation and the spread of the data made getting to a foundational set to start on intimidating. I decided again to put this under the “projects to revisit” outside of assignments.
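For anyone curious what that kind of initial investigation can look like, here is a minimal sketch, again in pandas. The tiny table and its column names (`animal_id`, `name`, `intake_type`) are made up for illustration and are not the city’s actual schema. It measures two things: the share of null values per column, and how many animals show up more than once.

```python
import pandas as pd

# Hypothetical intake records; names and values are assumptions.
intakes = pd.DataFrame({
    "animal_id": ["A1", "A2", "A1", "A3", "A2", "A1"],
    "name": ["Rex", None, "Rex", None, None, "Rex"],
    "intake_type": ["Stray", "Stray", "Owner Surrender", "Stray", "Stray", "Stray"],
})

# Share of null values in each column.
null_share = intakes.isna().mean()

# Number of intake records per animal, and the share of animals
# that came into the shelter more than once.
visits = intakes["animal_id"].value_counts()
repeat_share = (visits > 1).mean()

print(null_share)
print("Share of animals with repeat intakes:", repeat_share)
```

A high null share or a high repeat share does not rule a dataset out, but it is a quick signal of how much wrangling stands between you and a modeling-ready table.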
Back to Tech
After all this, where was I going to land? I generalized back up to my strengths and background in the tech industry and browsed Kaggle’s classification dataset options. This is where I found Google Play Store data that recorded apps with their price, app file size, last update, compatible Android versions, and more. It offered an initial 10k+ rows and enough features to give me some options in defining the best approach to modeling this data.
Part of this Google Play Store data was the app’s rating. To get more context, I researched further to find the scale and what was considered a highly rated app. From there I could define a target for my model.
The data itself is not perfect or standardized, so it leaves room for practice on cleansing and converting data into features that the model would find helpful in classifying whether an app will be highly rated or not.
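As a rough illustration of what that cleansing and target definition might look like, here is a minimal sketch in pandas. The rows, the column names, and especially the 4.5 cutoff for “highly rated” are assumptions for the example, not the actual dataset or the threshold I settled on.

```python
import pandas as pd

# Hypothetical rows shaped like app store data; names and values are made up.
apps = pd.DataFrame({
    "app": ["A", "B", "C", "D"],
    "rating": [4.7, 3.9, None, 4.5],
    "price": ["0", "$4.99", "0", "$0.99"],
    "size": ["19M", "8.7M", "Varies with device", "512k"],
})

HIGH_RATING_THRESHOLD = 4.5  # assumed cutoff, chosen only for this example


def size_to_mb(text):
    """Convert strings like '19M' or '512k' to megabytes; NaN if unparseable."""
    text = str(text).strip()
    if text.endswith("M"):
        return float(text[:-1])
    if text.endswith("k"):
        return float(text[:-1]) / 1024
    return float("nan")


# Turn text columns into numeric features the model can use.
apps["price_usd"] = apps["price"].str.replace("$", "", regex=False).astype(float)
apps["size_mb"] = apps["size"].map(size_to_mb)

# Apps without a rating have no label, so they are dropped before modeling.
labeled = apps.dropna(subset=["rating"]).copy()
labeled["highly_rated"] = (labeled["rating"] >= HIGH_RATING_THRESHOLD).astype(int)

print(labeled[["app", "price_usd", "size_mb", "highly_rated"]])
```

Where the cutoff lands changes how balanced the two classes are, so it is worth checking the class counts before committing to a threshold.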
The goal through all of this, both the data search and the project, is practice. You don’t need the perfect data or the perfect model. You need to practice giving your brain problems to solve and develop the intuition that comes through experience. Sure, you will have to refer to your notes and revisit some of the same StackOverflow threads multiple times over; but every time you do, the information gets more ingrained for the next time, and the time you spend gets you closer to that 10,000-hour expert level. Keep practicing.