Talking About Practice: Finding Datasets To Develop Your Data Science Skillset

It’s been almost 6 months since I began my data science education, and we are now at the point of having opportunities to apply our skills outside of classically assigned datasets. This means I get to learn the valuable skill of finding fitting datasets to practice coding on, not only for the fun of development, but as a “choose your own adventure” project. In this case, we are focusing specifically on applying classification modeling.

For those reading this from outside the data science space and asking

“but what does classification modeling actually do?”

my favorite way of explaining this is that I am assigned to create a simplified, non-image-based version of hotdog/not hotdog (Silicon Valley, anyone?).

Now, where to start finding that dataset…

Beyond the pointers to specific websites, the best advice I was given was to approach the search with a “problem first” mindset rather than a “data first” one.

When it comes to the exercise of creating a model, it is an advantage to have contextual knowledge of the data. For instance, our last project focused on predicting housing prices in King County, Washington. Having semi-recently purchased a house, I had a general idea of what skewed prices higher (square footage, lot size, etc.) as well as the appraisal process, which compares your home to those around it to make sure the price is fair for buyer and seller. This tied together well with the data points available, including the square footage and lot size of other homes in a given radius. On the other hand, if I were a realtor, I would have been even better off: I would have the lived experience and intuition to understand which levers move a price most that a one-time homebuyer would not immediately think of, or better yet, which other data would be helpful to fold into the original dataset to create an even more predictive model.

Now it is my turn to identify my contextual expertise and let that drive the data selection experience. This led me back to my first post on my personal why behind learning data science. I figured if those remain my goals for applying these skills, let’s start connecting the dots and getting familiar with data in these spaces.

Lead Scoring

Through Kaggle, I found a couple of datasets simply by searching “lead score”, but only one of them had a high usability score (Kaggle’s way of saying the data is in a workable state for analysis and problem solving). The dataset came from the Brazilian equivalent of an Amazon-style marketplace, where leads would come in from a landing page, be followed up with by a Sales team member, and the Sales team would then be able to close an opportunity with the lead. A closed won opportunity meant the new customer would open an online store on the platform. This was perfect, as lead management has been a major focus for me over the years. I could play the expert realtor for this round of model development.

Of course this was the first attempt, and I ran into a problem. The dataset was separated into 2 files: one with the MQL (Marketing Qualified Lead) data and one with the closed won opportunity data. Once I merged the two sets, there was a small overlap of leads that Sales closed. My experience told me this was to be expected as you are lucky if 5–10% of your MQLs get to a closed Sales opportunity. This was not going to be the problem.
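For anyone curious what that merge-and-check step can look like, here is a minimal pandas sketch. The two tiny DataFrames are made-up stand-ins for the real files, and the column names (mql_id, landing_page_id, won_date) are assumptions for illustration, not necessarily the dataset's actual schema.

```python
import pandas as pd

# Hypothetical stand-ins for the two files; names and values are invented.
mqls = pd.DataFrame({
    "mql_id": ["a1", "a2", "a3", "a4", "a5", "a6", "a7", "a8", "a9", "a10"],
    "landing_page_id": ["p1", "p2", "p1", "p3", "p2", "p1", "p3", "p1", "p2", "p1"],
})
closed_deals = pd.DataFrame({
    "mql_id": ["a3"],
    "won_date": ["2018-03-01"],
})

# A left join keeps every lead; indicator=True adds a "_merge" column
# that says whether a matching closed deal was found for each lead.
merged = mqls.merge(closed_deals, on="mql_id", how="left", indicator=True)

# Binary target: 1 if the lead also appears in the closed deals file.
merged["closed_won"] = (merged["_merge"] == "both").astype(int)

conversion_rate = merged["closed_won"].mean()
print(f"{merged['closed_won'].sum()} of {len(merged)} leads closed ({conversion_rate:.0%})")
```

A left join (rather than an inner join) matters here: it keeps the leads that never closed, which are exactly the negative examples a classifier needs.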

The major problem that nixed the idea of this dataset being a fit was that much of the lead data that should theoretically be available upon obtaining a lead, like lead source and industry, was only available on the leads that were connected to a closed won opportunity. I would not be able to build a well-functioning model with this, since technically the only feature that applied to the whole merged dataset was the landing page the lead converted on. That is not enough to work with. With the other features only populated on leads with a closed won opportunity, the model wouldn’t be able to learn which of those feature values are more or less predictive of a lead becoming a closed won opportunity; the mere presence of a value would give the outcome away. It’s like giving the model the answer key instead of training it to predict and generalize what leads to success.
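One quick way to spot this kind of answer-key problem is to compare how often each feature is missing within each class of the target. The sketch below uses an invented toy table; the column names and values are assumptions, not the real data.

```python
import pandas as pd

# Hypothetical merged lead table; values are invented for illustration.
leads = pd.DataFrame({
    "landing_page_id": ["p1", "p2", "p1", "p3", "p2", "p1"],
    "lead_source": [None, None, "organic", None, None, "paid"],
    "industry": [None, None, "retail", None, None, "pets"],
    "closed_won": [0, 0, 1, 0, 0, 1],
})

features = leads.drop(columns="closed_won")

# Share of missing values per feature, split by the target class.
missing_by_class = features.isna().groupby(leads["closed_won"]).mean()
print(missing_by_class)

# Features that are always missing for class 0 and never missing for class 1
# would let a model read the label straight off the missingness.
leaky = [
    col for col in missing_by_class.columns
    if missing_by_class.loc[0, col] == 1.0 and missing_by_class.loc[1, col] == 0.0
]
print(leaky)
```

In this toy example only landing_page_id survives the check, which mirrors the situation described above.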

Animal Rescue

Upon doing some initial investigation, I found that the extent of cleansing that needed to be done to both datasets, the extent of null values, and the instability of an animal coming in and out of the shelter multiple times (sadly this occurs more often than I’d like to think) added up to a level of complexity in getting a good foundation dataset for the assignment. In service of my stress level for this project, I decided it was valuable to know this data exists, but the assignment was to focus on classification, not the ins and outs of cleansing and data wrangling. I’m telling myself now, I’ll be back for that dataset outside of school life.


Back to Tech

Final problem

The data itself is not perfect or standardized, so it leaves room for practice on cleansing and converting data into features that the model would find helpful in classifying whether an app will be highly rated or not.
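As a rough sketch of what that cleansing and feature conversion might involve: the columns (installs, rating), their messy formats, and the 4.5 cutoff for "highly rated" below are all assumptions made for illustration, not the actual dataset or the threshold used in the project.

```python
import pandas as pd

# Hypothetical app data with the kind of non-standardized values
# a scraped dataset might contain.
apps = pd.DataFrame({
    "app": ["A", "B", "C", "D"],
    "installs": ["1,000+", "50,000+", "Free", "100+"],
    "rating": [4.7, 3.9, None, 4.5],
})

# Strip "+" and "," so the install counts can be parsed as numbers;
# anything that still is not numeric (like "Free") becomes NaN.
apps["installs_num"] = pd.to_numeric(
    apps["installs"].str.replace(r"[+,]", "", regex=True), errors="coerce"
)

# Drop rows where the rating or the cleaned install count is missing.
apps = apps.dropna(subset=["rating", "installs_num"]).copy()

# Assumed cutoff: treat a rating of 4.5 or above as "highly rated".
THRESHOLD = 4.5
apps["highly_rated"] = (apps["rating"] >= THRESHOLD).astype(int)
print(apps[["app", "installs_num", "highly_rated"]])
```

Where the cutoff lands changes the class balance, so it is worth checking the rating distribution before settling on one.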

The Learning

Note: if you have clicked on any and all of the practice links, I applaud you. They are all the same “talking about practice” Iverson video. :)

MarTech, CRM, automation and data nerd. Managing a small zoo of 3 cats and a dog in Austin, TX.