Coming from a manufacturing and third-party logistics (3PL) background, where Microsoft Excel is the go-to tool for gleaning insights from data, I've long felt that exploratory data work could be more robust, even within the tool's limitations. I came to data science by way of frustration with my tools. Cell formulas and VBA weren't cutting it anymore. I wanted more control over my visualizations, I really wanted to automate some of the boring out of my life, and I was tired of VBA syntax. So I started learning Python, devouring curriculum with a can't-sleep, can't-eat, is-this-mental-illness level of focus (not a distasteful joke; that's something I genuinely watch for).
However, Python is not always an option. If you work for a company, especially an old, slow, bureaucratic one with layers of security protocols in place, you probably won't control the decision about whether you're allowed to use Python. Towards Data Science was blocked SO FAST after I searched 'install Python no admin'. In this article, I'd like to lay out some strategies for looking at data in Excel the way a data scientist might approach a new data set. When you pull information from an ERP system to create reports for executives, here are some places to start. This is less about the tools and more about the way of working.
Check the size and shape.
How many rows, how many columns? How much information do you have?
When columns are selected, you can view some descriptive information, including the number of active rows, in the status bar at the bottom right of the screen.
Columns are a bit trickier: run COUNTA across the header row (for example, =COUNTA(1:1)) to quickly count how many you have.
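If you do eventually get your hands on Python, this whole check is a single attribute in pandas. A minimal sketch, using a tiny made-up table standing in for an ERP extract (the column names here are hypothetical):

```python
import pandas as pd

# Made-up sample data standing in for an ERP extract
df = pd.DataFrame({
    "order_id": [1001, 1002, 1003],
    "qty": [5, 12, 7],
    "warehouse": ["A", "B", "A"],
})

rows, cols = df.shape  # (row count, column count) in one attribute
print(rows, cols)  # 3 3
```

One attribute replaces the status-bar check and the COUNTA trick at once.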
Check the summary statistics.
It's helpful to immediately see the mean, median, percentiles, and standard deviation of the numerical data in each column. This general descriptive information helps you understand what you're looking at, statistically speaking. Excel's built-in formulas cover all of it (AVERAGE, MEDIAN, PERCENTILE.INC, STDEV.S), with a little copy/paste action to handle every column you need. Building a small table with this information, inspired by pandas' .describe() method, also makes it easy to scan.
Or make some histograms!
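For comparison, the .describe() method mentioned above produces that whole summary table in one call. A minimal sketch with made-up unit-cost numbers:

```python
import pandas as pd

# Made-up unit costs standing in for a numeric ERP column
df = pd.DataFrame({"unit_cost": [2.5, 3.1, 2.8, 10.0, 2.9]})

stats = df["unit_cost"].describe()  # count, mean, std, min, 25%, 50%, 75%, max
print(stats["mean"])  # roughly 4.26
print(stats["50%"])   # 2.9 (the median; note how the outlier pulls the mean up)
```

From there, a histogram is one more line: Series.hist() (which requires matplotlib).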
Find out how many null values you have.
Figuring out how to deal with null values is an important part of your data assessment and cleaning. It requires contextual critical thinking: I've written macros to deal with the zeros or null values, and I've scrolled through rows to get an idea of what I was dealing with, but the ability to see the amount of useful information immediately is extremely helpful. Using variants of the COUNT function in Excel, you can check for specific values, or the lack thereof. It's much less elegant than pandas' .value_counts() method, but these formulas work:
=COUNT(selection) counts only numbers
=COUNTA(selection) counts all filled cells in the selection
=COUNTBLANK(selection) counts the unfilled cells in the selection
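The pandas counterparts of those three formulas, for reference; .isna().sum() is the closest match for COUNTBLANK. A sketch with a small made-up table (column names are hypothetical):

```python
import pandas as pd

# Made-up table standing in for an ERP extract, with some blanks
df = pd.DataFrame({
    "qty": [5, None, 7, None],
    "site": ["A", "B", None, "B"],
})

print(df["qty"].count())   # 2: non-null cells in a column, like COUNT
print(df.isna().sum())     # blanks per column, like COUNTBLANK
print(df["site"].value_counts(dropna=False))  # frequencies, including NaN
```

Passing dropna=False makes value_counts() report the missing values alongside the real ones, which is exactly the at-a-glance view this step is after.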
What are the columns, and what do they represent?
Again, you can scroll across the spreadsheet to see what you have and identify the columns as you go. An important part of data wrangling is learning about what you're looking at. Column headings can be short, messy, or cryptic, and easy to misinterpret. Ask enough questions about the data you're given or pull; without knowing what a variable represents, it's difficult to know whether it's useful for your analysis. I like to create a reference document to keep close at hand.
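That reference document can itself be a tiny table: one row per column, filled in as you learn what each field means. In Excel it's just a second sheet; a sketch of the same idea in pandas, with invented ERP-style headings and invented meanings for illustration:

```python
import pandas as pd

# Hypothetical ERP-style headings that are easy to misread
df = pd.DataFrame({"ord_dt": ["2021-01-05"], "whse_cd": ["A1"], "qty_shp": [12]})

# A tiny data dictionary: the "meaning" entries are made up for illustration
data_dict = pd.DataFrame({
    "column": df.columns,
    "dtype": [str(t) for t in df.dtypes],
    "example": df.iloc[0].values,
    "meaning": ["order date", "warehouse code", "quantity shipped"],
})
print(data_dict)
```

Keeping an example value next to each heading makes cryptic codes much harder to misinterpret later.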
All of these techniques are likely already a part of your Excel workflow. The difference I’m proposing is that they be used at the very beginning of an analysis. Maybe you know exactly what you should do with your null values but don’t often check the centrality or dispersion of your data. This allows you to see more of the whole picture before you begin, and might lead to some new insights.
Happy data wrangling, friends. No matter how you do it.