Excel Removing Outliers from Pivot Table Data

So if you’re watching this video you
probably know how helpful pivot tables are for aggregating data and grouping
data by different categories. In this case part numbers. And you probably also
realize that sometimes you run into outliers and these can actually skew the
mean value. So what I’m gonna do in this video is kind of explain some different
ways that we can filter out those outliers in order to give more accurate
averages! 🙂 Hello everyone! And welcome to The Engineering Toolbox Channel, where I
give you the tools you need to solve real-world engineering problems. So what
I’ve done is I’ve created an example where I’m looking at production data for
given part numbers and we’re trying to find the mean value for how long it
takes to actually make a certain part so if I look at the raw data we can see
that we have multiple jobs where we created a certain quantity of a part and
we can see how long it took and then we just divide that out to see how long per
part it took for that particular run and then we can make a pivot table from that
to find the mean and just as an example and I’ll go through the step-by-step
afterwards but we can see that the outliers if I take out the outliers so I
just removed the outlier values our average for this particular part number
drops down to 0.125 hours per part which is this green line here and that’s a
more accurate representation of the average amount of time that it takes to
complete a part versus if I add this one outlier back in the value or the mean
value jumps back up to point 137 which is actually over here and you can see
just visually when we look at the density distribution that that mean
value is actually incorrect because this outlier is skewing the the average quick
side note about outliers I’m not advocating or saying that all Liars
should always be removed all lawyers often tell us a lot about the data we’re
analyzing but in some cases it’s necessary and actually more accurate to
remove them especially in cases like this where extreme values are more
likely due to human error than not method we’re going to use for finding
our outliers we’ll be using different quantiles and the interquartile range so
if you’ve ever made like a box and whisker plot you’re probably familiar
with this but in order to do it we kind of have to trick Excel a little bit or
use some special functions so let’s just go ahead and add some columns
here for a different quartiles so we’re gonna add q1 q2 and q3 and Excel has
some functions built-in called quantile and we’ll use inclusive and this is
going to be an array function so if you’re not familiar with that I’ll try
to break this down and we’re gonna say if our part range so all of our part
values equals the specific part number then we will feed the hours per part
value back into the quartile function and we don’t need a value there and then
we need to tell it what quartile we’re gonna be solving so we have to use ctrl
+ shift to end hit enter so hold ctrl + shift hit enter for an array function
and that should be our quartile 1 value and then we can just take this and copy
it over to here and change this to quartile 2 by the way that’s the same as
the median value and then for a quartile 3 would be the same thing again base
that over change it to 3 for the 3rd quartile and then control shift enter
for that so just to kind of break this down we use the quartile inclusive
function which means that it includes the values that are at the line of each
quartile kinda hard to explain do a quick Google search if you need more
information on that and then the reason we did this array function is we’re
actually aggregating the data for each part number right here in our table so
what we did is we said look at each of the part numbers or look at all the part
numbers and if that matches the specific role or the part number in this specific
role and we’re gonna feed all the values where that matches back into the core
title function so when we hit control shift enter notice that it adds these
brackets so that we know it’s an array function alright it’s all downhill from
here so once we have our quartiles we can calculate the interquartile range
and that’s just simply the third quartile value minus the first quartile
value and that gives us the range between those two quartiles
and then in order to find our outliers we need to determine the upper and lower
bounds so let’s go lower bound I’ll be four lower bound and you be four upper
bound so the lower bound is going to be quartile one minus one point five times
the interquartile range and then you can just copy that over and we’ll go switch
that Plus and switch this to quartile three for our upper bound and now that
we have our bounds determined we just need to write an if statement to
determine if the values are outside of those bounds so we’ll just add a column
for outlier and then we’ll say if our hours per part value is less than or
equal to our lower bound and we’ll have to make this or as well because we need
to do both bounds so if it’s less than or equal to the lower bound or this
value is greater than or equal to the upper bound close off the or value of
true we’ll say it true it is an outlier and if not we’ll say false
is not enough letter so if we just filter that real quick we’ll see that
there aren’t too many outliers but there are a few and as we saw before that is
enough to see our data so if we select our data and insert a pivot table and
then group by the part number and take our hours per part as the value select
average from our values and this will give us the mean I mean hours per part
and if we insert a slicer and select outliers there now we can actually
filter this to either include or exclude outliers so we can see a few of the
values are affected by outliers this one in particular is pretty significant the
other ones aren’t too significant this is a this is the worst one so that is
how you remove outliers from aggregated data and pivot tables alright I hope you
all enjoyed that one to get more videos as they released make sure to hit that
subscribe button if you liked this video make sure to like and share and if you
didn’t like it well then leave me a comment and let me know how I can
improve last but not least this channel is for you so if you have any
suggestions for topics you want me to cover make sure to let me know thanks
for watching we’ll see you next time

Posts Tagged with…

Reader Comments

  1. The Engineering Toolbox Channel

    Hey everyone! Just to clarify, I was using the terms "quartile" and "quantile" in the video without explaining… A quantile is a GENERAL term for the equal division of a distribution. A quartile is a SPECIFIC type of quantile that breaks the distribution into quarters. The function in excel is for quartiles (i accidentally said quantile). Hope this clearifies! Let me know if there are other questions! 🙂

  2. blakejustinsmith

    I'm using Excel 2016 but I can't seem to build the array like you are. What are your keystrokes at 2:36 through 2:50. I can't seem to display the ([Part #]=[@[Part #]],[HRS/Part]] to select my columns and build my array. Is this a VBA function? I'm a relatively intermediate user of Excel and need to impose your displayed calculation on a million rows.

Write a Comment

Your email address will not be published. Required fields are marked *