Data Cleaning. Excel Tips and Tricks

Data Cleaning. Excel Tips and Tricks

Cover Picture Courtesy of BLOG

GIGO. Garbage In, Garbage Out. Those famous first and last words. I have been involved in several high visibility corporate projects where the project launch was contingent on having the correct data sets. Fancy schmancy systems like SAP are supposed to be configured properly with the underlying data tables so they spit out the right information with one click.

This post will not be for the semantics of getting data tables joined and configured for the purposes of feeding giant data systems like SAP. It will be about data for small and medium business (and some corporate), and some tips and tricks you can use in readily available sources for quick turnaround on data exactitude.

Our trusted friend Excel is always front and center for me when I think data. Every interface I have ever worked with either imports or exports to Excel. There is no escaping it. It's here to stay. Forever.

What you don't want to do as a business leader is get a fancy dashboard fed with dirty data. Making business decisions on a dashboard that reports with a dirty data feed will be disastrous.

What is clean data anyways? How does one know when the data is clean enough?

Some of that answer is subjective, but most of it is objective. Let me explain how to do it and give you some quick tips and tricks.

FIRST and FOREMOST

Off the cuff, the analysts that are crunching your data need to be well versed in your industry. Letting an analyst who has worked in the food service industry crunch data for an automobile retailer is not 100% effective. If said analyst or consulting team is willing to have some formal integration meetings and research the industry, then yes, hire them pronto.

Excel TIPS and TRICKS

Excel is an extremely powerful interface. With the new addition of a Data Mining add-in (needs and instance of SQL Server running on the back end) there is no limit to what you can investigate. Other powerful new features are Powerpivot which can join and manipulate endless amounts of databases in memory. New additions like PowerQuery and PowerMap along with the Microsoft Power BI reporting interface gives you a really powerful data reporting and visualization tool built right into homegrown world renowned Excel.

Those are pretty advanced. This post will use more simplistic Excel tools.

But what if the data we feed these systems is dirty? Redundancy, duplicates, syntax errors and anomalies, missing values, etc. Well, there are some simple excel tools you can use to look at your raw data sets and clean them up.

Here they are in no particular order:

1. LEFT and RIGHT functions

The "LEFT" and "RIGHT" functions in excel allow you to easily truncate data syntax in particular sets of columns with visual output. That is, you can take a text value in a cell and have Excel report the LEFT (or RIGHT) outermost (to whatever limit you want) characters. Here is an example:

The left column is a series of phone numbers of which I wanted to extract the area code only. I used the "LEFT" function to report in the column next to it just the first three characters of that target column.

 

 

 

 

 

 

 

The LEFT and RIGHT functions are also useful for capturing syntax errors in 'varchar' data. i.e. For spelling errors or user input errors.

2. BINARY switches

As the name describes binary is a qualifier, usually a "1" or a "0". This is not necessarily an Excel function per se, but more of an data anlytics technique via Excel. It can give you a quick overview of the qualifications of the data in a given column reported as a "1" or a "0" in another new column. Binary is the bread and butter of analytics.

You can create as many columns as you like with your binary switches (reporting a value of "1" or "0") depending on the attributes you wish to report on. Once you have your binary columns you can then apply a new SUM column to see different combinations of your attributes values as sums.

To continue with the same example of the area codes, here is a binary switch:

 

My final result wants to know which area "area codes" start with a 4 and end with a 1.

In the 3rd column I asked Excel with a combined IF and LEFT function to report columns from the area code list that started with the number 4. If it did, report with a 1, if not, report with a 0.

In the 4th column I asked Excel with a combined IF and RIGHT function to report columns from the area code list that ended with the number 1. If it did, report with a 1, if not, report with a 0.

The last column is just a sum of the previous two columns. And as you can see the binary reports properly by stating a 2 (both qualifiers reported) in two of the rows.

This example is very basic. You can use massively large data sets and lots of creativity with binary switches. Good Luck.

 3. CONCATENATE

Concatenate is great for building Unique Data Keys out of data sets that contain a lot of redundancy and duplicate values in large and deep data sets. I will just demonstrate a concatenate on the same above data set for the purpose of this blog post. Here is a concatenate of all the columns with an underscore between them:

Concatenate is a powerful function to join various data from different columns into one new fabricated column.

3. CONDITIONAL Formatting

In extremely large data sets Conditional Formatting with Excel can save you a lot of time. By using it to visually highlight anomalies in your data sets and then zooming out, you can get a pinpoint accurate shot of which cell is disrupting the spreadsheet ecosystem.

_________________________________________________________________

Those are my favorite Excel tips and tricks for cleaning data. Don't forget that you still need the fundamental business industry knowledge to do it correctly.

For those who want to experiment a little further with data cleaning and joining there is a free tool from Informatica called REV. It's a free interface (up to 20GB) where you can import and export data sets and manipulate them with aggregations, extractions, and combinations. It's easy to use and has some great learning videos built right into the interface. Follow the link:

Informatica REV

_________________________________________________________________

Thanks for reading, Comments, likes and ideas welcome.

To engage with us for your data analytics and digital transformation initiatives, send us a note. It will be our pleasure to share our deep industry knowledge experience to align your resources and give your project maximum ROI.

Please LIKE and SHARE this article with peers and industry colleagues. Send us an invite to connect on LinkedIN to stay abreast of any future articles. For updates to news articles that focus on data analytics in Business, Sports, Healthcare and other please visit the "Industry News" section on our website at www.thedataminingcompany.com.

[Read all our Posts - Link to The Data Mining Company BLOG]

To view or add a comment, sign in

Insights from the community

Others also viewed

Explore topics