In Pareto lines on bar charts – an Excel fudge, **Alex Kerin** of **Data Driven Consulting** took data from a very badly distorted pie chart, and generated a Pareto chart. I busted the pie in Extra Distortion in a Pie Chart, but thought I’d chime in on the subject of Pareto Charts.

A Pareto Chart is a horizontal or vertical bar chart with its data sorted in descending order. The largest items (categories) in the chart are listed first for emphasis. Often a line is overlaid on the bar chart, showing cumulative sums or percentages of the total.

The Pareto Principle, named for Italian economist Vilfredo Pareto, is based on the observation that most of the effects of an action come from a small amount of the causes. This is often called the 80-20 rule, implying that 80% of the failures come from 20% of the types of defect, or that 80% of one’s sales come from 20% of one’s customers, or pretty much any 80-20 metaphor you can come up with.

The simplest Pareto chart is a plain column chart.

The percentages of each value are used as data labels. In this case, obviously the first three bars are substantially taller than any of the rest.The last four bars aren’t even as tall as the axis line is thick, so the bottom five categories can be combined into a category called “Other”.

To show the cumulative proportion of items (in this case, apps you can buy for your iPhone), a line can be added to show the percentage on a secondary axis. In Excel this is a simple column-line combination chart on two axes.

This shows that we didn’t reach 80% of the effect until about 50% of the categories were accounted for. It’s still disproportionate, but not as severely as the “80-20″ nickname would indicate.

An alternative approach plots the accumulated value line on the same scale as the individual bars, with data labels showing the cumulative percentages. In Excel it’s even simpler: a column-line combination chart on one axis.

The heights of all bars is reduced substantially, but lowering the resolution also stresses that none of the bars is a huge amount of the total.

Vertical bars seem to be more common in Pareto charts, but there are problems with this orientation. The category labels have to be listed vertically, making them hard to read. Data labels may overlap; the labels were placed in a distracting staggered layout to make them legible.

To remedy these legibility problems, the chart can be rotated. Categories and percentage labels in the following chart are perfectly legible.

There is no difference in our ability to interpret the data in this chart, compared to the first vertically oriented chart above.

As before, a line can be added to show the percentage on a secondary axis. In Excel this requires a bit of a trick, using a bar-XY combination chart on two axes. For the XY series, X is the percentage and Y indicates the order of the categories.

Again, the readability is the same for horizontally and vertically aligned charts.

Anything done vertically can also be done horizontally.

I prefer the horizontal bar charts for improved readability of the labels. In most cases I also prefer the one-axis version, where the line shows the accumulated totals on the same scale as the individual values.

Because our “20%” comes from a ranked list of causes or customers, we can usually find a way to expand the total list so that it really is 20%, making the 80/20 rule a culturally contrived rather than a natural one.

Your 80% came from about 10 categories, but I bet a sufficiently ingenious fan of Pareto’s aphorism could break the 20% remainder (especially the mysterious “Other”) out into a further 40 items, proving the rule yet again! :-)

Goodness…had to stop myself from writing a naughty comment on your claim that

Anything done vertically can also be done horizontally.I like the way you’ve colored ‘Other’ differently to make it stand out.

Is it just me, or do all your percentages add up to 100.506%?

Here’s another couple of ways you could handle this without a 2nd series.

This first one uses data labels to display not only the % of total, but the cumulative %. I think it’s a bit distracting myself, but maybe for some people it is less distracting than a 2nd series:

This 2nd one just shows where the first 50% of reciepts came from, where the next 25% came from, and where the final 25% came from. I prefer this approach. Often you don’t need to see the cumulative total for each series, but rather you just need a feel for how things are distributed.

Your thoughts?

Just noticed your new banner at the top of this page i.e. http://screencast.com/t/Zjg0ZjlhMT

Would you overlap data labels like that? ;-)

Off on a tangent now…I read a good business book a while back by Richard Koch called the 80/20 principle. I thought it was great. It was one of the books recommended in The 4-Hour Workweek by Tim Ferriss. Ferriss recommended focusing one’s attention on those 20% that contribute to 80% of the income. More notably, he also recommends firing those 20% of customers who take up the majority of one’s time and cause the most trouble.

Either placed horizontally or vertically: once the number of labels exceeds 20 or so, the focus of a Pareto-chart may get lost a little.

In cases like this it might be helpful to additionally use the opposite axis, representing the share of element count on a scale of 0% to 100%, in steps of 5% or 10% perhaps.

However, dispite many attempts so far, I did not succeed in constructing that chart.

Any supporting ideas from the PTS-community?

Regards,

hgl

Jeff –

Thanks for your many contributions to this post ;-) As I showed with the horizontally oriented bars, some things are easier horizontally.

I find your first alternative rather heavy in terms of labels. It requires extra concentration to make sense of so many labels. Your second alternative does away with the excess labels in a brain-friendly way.

The text in the header is for marketing and branding, not for data labels, so there is more flexibility.

H.G. –

Do you mean something like the following? I extrapolated between the bars to get the percentages for the labels.

Tim Ferriss’s book reminds me of the publishers who resolve only to publish bestsellers, producers who resolve to make only multi-Oscar-winning films, and consultants determined to find only lucrative contracts. Sadly, it rarely works that way in many businesses, where you have to take care of 80% of your customers, because that’s where your future 20% cash cows are going to come from.

There is a bubble chart style created by the Boston Consulting Group, commonly called a BCG Matrix, that tries to identify both the present money spinners and the future ones, suposedly allowing you to discard the waste-of-time prospects and the used-to-be-profitable lines that are played out. Even then, I suspect it’s more important to have a detailed knowledge of and an instinct for the customers, than to apply a mechanical rule.

I really like the addition of the labels to the Pareto line – I find myself always trying to see where that 80% position is. I suppose some very light gridlines for the percent axis could also work.

Typically, 80+% of the value of a Pareto chart is in the ordered bars, letting us visualize the vital few versus the trivial many.

The cumulative percentages tend to be much less valuable, so devoting a second series (especially in red!) to show them can be more distracting than helpful. Here, a reader might want to easy see that about 20% of the categories account for 50% of the sales, and about 50% of the categories account for 80% of the sales: a somewhat weak Pareto effect.

My suggestion would be a de-emphasized P-P plot in the background in gray using a secondary axes grid: an XY plot of cumulative % effect versus % of categories behind your simple ordered bar plot in real units.

I am intrigued by the compromise (1½ series?) approach you are exploring with H.G. In the spirit of 80/20, why not use quintiles instead of a box plot’s quartiles, so the cumulative percentages 20% and 80% are displayed. Or simply show the 0%, 20%, 50%, 80%, 100% cumulative effect tick marks.

Thank you for feed back, I’ll try to be more precise:

the example used here has 20 elements (from “Games” to “Others”) on the primary axis. Thus, the “count share” of a single item is 5%, all 20 labels together make up 100%.

If we could plot this “count share” on the secondary axis, we could (more) easily read the diagram as follows (for example):

20% of all elements (here: the first 4 elements from “Games” to “Travel”) make already 52% of the total turnover (or whatever the numbers represent).

(That is also the common way students learn about using Pareto analysis in economic policy: like: 5% of the richest people possess 40% of all wealth, etc. etc.)

I tried to make such a diagram with the 77 products of the MS NorthWind sample database, with all of the product names on the primary X-axis (bottom), and the “share count” on the secondary X-axis (top).

A single item here represents 1/77 = 1,3%, but of course it would be desirable to mark only steps of 5% or 10%.

I found it particularly hard (impossible, to be honest) to show this scale ranging from 0% to 100% on the secondary X-axis, while keeping the corresponding item names on the primary X-axis.

Kind regards,

hgl

Required Reading: Revising the Pareto Chart by Lee Wilkinson (Author of The Grammar of Graphics) in The American Statistician, November 2006, Vol. 60 #4, 332-334.

http://pubs.amstat.org/toc/tas/60/4

There is a charge to access this for those who are not subscribers.

Derek –

There’s a difference between marketing to the long tail and cutting out the few customers who provide the lowest (or negative) profit margin. I think the latter situation is what Ferriss is addressing.

Dale –

Good point, the choice of red, while conventional, lends inordinate importance to the cumulative percentages. Since the actual percentiles chosen are arbitrary (80-20, quintiles, quartiles), I’ll keep with the quartile labels. I think Jeff’s brackets in the margin are more effective than the percentages in my latest attempt.

H.G. –

Is your “share count” based on the percentage of categories, or on the percentage of accumulated values? The chart I proposed used the percentage of accumulated values.

Jon,

many thanks for your feed back.

“Share count” is simply each item’s individual share in %. With accumulation.

If there are 20 items, each single item’s share is 1/20 = 5%.

With 77 items, each item’s share would be 1/77 = 1,3%, etc.

To mark 77 times 1,3% on the secondary axis wouldn’t look very professional, though. But steps of 5% or 10% (on a scale from 0% – 100%) seem a practical solution.

That allows for a classic chart interpretation in immediate “Pareto-speak”, such as:

20% of all products make up 80% of all turnover.

(The 20% are simple “share count”, the 80% are accumulated sales).

Kind regards,

hgl

H.G. –

So your data pairs would be {5%, 17%}, {10%, 32%}, {15%, 44%}, {20%, 52%}, etc.? Does it make sense to keep it as a bar chart?

Thank you again for kind considerations and patience.

Apparently I couldn’t explain in English what I mean, so I’ll prepare a drawing on my website tomorrow. If you find the time, please have a look at: http://www.enterplan.de.

Thank you in advance,

best regards,

hgl

Jon,

I think my P-P plot and H.G.’s cumulative share v. share-count are the same idea. I’m suggesting a secondary axes X-Y chart as background to your bar chart.

Since Pareto uses an 80/20 metaphor, not a 75/25 metaphor, and some people try to take this literally, I still like quintile labeling better than quartile labeling in this particular instance.

H.G. –

Like this, then?

I think the secondary axis showing linear percentages across the top is redundant, since one can get a sense for the width of one quintile (see, Dale? quintile?) of data without labels.

I nearly posted in response to a previous comment of yours, and I think it’s as good as the chart above.

The horizontal axis should be labeled “Cumulative X” where X stands for Causes, Inputs, or the specific name of the item you’re counting. The vertical axis should be labeled “Cumulative Y”, wheere Y stands for Effects, Outputs, or the specific name of the variable that the categories are sorted by.

Happy to see it CAN be done (and indeed labels alongside curve are more convincing than plotted on primary X-axis).

Thank you!

(If only I knew how to do it…)

Regards,

hgl

H.G. –

This is actually rather easy. It is an XY chart, X = cumulative % of factors (5%, 10%, 15%, etc.), Y = cumulative % of total effect (17%, 32%, 45%, etc.). The labels came from a third column, and were applied using Rob Bovey’s free Chart Labeler, and were colored to match the data series and oriented to avoid overlap.

Bovey’s utility is actually called the ‘XY chart labeler’. But it’s just as handy when using other chart types. For instance, I used it on a bar chart in my comments above to add data labels that displayed both the % of total and the cumulative % of total. Without the utility, I would have had to manually change all the data labels one by one, or add one series per data point, or play around with VBA.

Bovey rocks!

Rob has always called it the “XY” Chart Labeler, so I guess he’ll never change. It works on all chart types that accept data labels (i.e., all except surface/contour plots), so I just call it the Chart Labeler.

Jon, I like your labeled quantile-quantile plot, but I think the first, MAX, item should be at 0% just as the last, MIN, item is at 100%. By being at 0% it will still be a part of the first, 0%-20%, quintile, but on the very boundary, as appropriate for its statuis as the highest value in the distribution.

Sorry, of course it’s not quantile-quantile, or else the first point would have to be at the origin (0%, 0%), which would be ridiculous. It’s a cumulative against quantile graph, and I was distracted by the fact that they’re both scaled as percentages. As a cumulative, it is right that the first value should have a position 18% or so up the vertical scale right from the start; it’s only on the horizontal scale that the first data point needs to start from 0%.

Derek –

I’ve always had issues with how the first and last points are considered on this kind of scale. In an Excel histogram (cough!) the first bin always has one point, unless multiple points have the minimum of the data set. In fact, some of the different definitions of percentile (or in general, N-tile) came about because of different ways to do the math. (i-1/2)/N for example, which is like measuring to the middle of each bar in a bar chart, where the bars are one unit apart along X. But other methodologies use different algorithms; I remember using something like (i-j)/(N-k) in a Weibull package we used for reliability determination, which gave very similar positioning along X.

Using cumulatives allows me to skip out on the whole percentile/bin definition thing ;-)

Interesting discussion. So are you going to follow Naomi’s advice? Apparently, Lee Wilkinson proposed something called a Pareto dot plot. I’m sure that you must be curious to know what this is and to be the first to demonstrate publicly how to build one in Excel :)

Colin –

I’ve downloaded Wilkinson’s article and applied its techniques to this data set. Once I understood the algorithms and ideas, making the chart was straightforward. I wish I had more time to keep up with the discussion, but I expect to post a follow-up article next week.

Excellent! I should have guessed that you were already on the case :)

Jon – did you mean to say

Derek…I’ve always had issues…?Jeff – Yeah, fixed it. Sorry. I’d just written a response to you and forgot whose comment the next one was addressing.

Derek,

Pareto charts often include a P-P (Percent-Percent) plot such as Jon’s red curves. This represents a cumulative distribution of effect (# of apps) against the cumulative ordered distribution of categories (for Pareto).

It’s not at all ridiculous to show that such a curve passes through the (0,0) origin as well as the (100%,100%) point. In fact, at the risk of showing the obvious, that’s how I’d prefer to draw it. For example, QIMacros draws its Pareto chart in Excel that way, although it doesn’t explicitly put cumulative % on the category count as Jon finally did under duress.

If you check Wikipedia, you’ll find a distinction between P-P and Q-Q plots, and it is P-P plots that go through (0,0) and (1,1) at their edges, while Quantile-Quantile plots may go to infinity (for continuous variables) at their hypothetical vanishing edges (not generally plotted!).

Hello,

Why doesn’t the cumulative percentage line start at zero? How do you do the calculations so that it does?

This really is a fairly very good guide to the subject.