How do you display a lopsided distribution?
by Jon Peltier
Peltier Technical Services, Inc., Copyright © 2009.
Licensed under a Creative Commons Attribution-Noncommercial-Share Alike 3.0 Unported License.
Kaiser wrote about a chart in The Economist in Dealing with skew. Kaiser’s discussion centered on how unclear The Economist’s chart was in even describing what it was showing. Even the title of the chart, “Income distribution by decile in selected OECD countries”, was deceptive. We won’t get into the junky background.

The bars seem to show decile ranges, until you count the segments and realize there are only nine per bar. It turns out that each bar connects ten values, the average incomes within each decile. This strikes me as a little odd: I think if you had to leave off the extremes, perhaps taking the 5th, 15th, 25th, . . . and 95th percentile values might be a bit more natural. But maybe it’s a matter of taste.
The data originated with the folks who bring us the OECD Factblog, by the way. The data is available in an Excel workbook at Growing Unequal? Income Distribution and Poverty in OECD Countries.
Data in hand, I decided to slice it and dice it. My first impulse was to look at cumulative distributions. I assumed that the value for each decile was the value for the middle of that decile, so I used 5% for the first decile, 15% for the second, etc. There were a lot of lines cluttered together, so I made myself an Interactive Multiple Line Chart with a listbox to highlight one country at a time. I have highlighted the US in each of these charts.
I did not include the protocol for creating this interactive chart, but if I get more than two comments asking for it, I’ll be glad to oblige.
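The midpoint assumption is easy to sketch in code. This is only an illustration in Python rather than Excel; the function name and the income figures are made up for the example and are not the OECD data.

```python
def decile_midpoints(n_groups=10):
    """Cumulative fraction at the middle of each group: 0.05, 0.15, ..., 0.95."""
    return [(i + 0.5) / n_groups for i in range(n_groups)]

# Hypothetical average incomes per decile for one made-up country (NOT OECD data).
incomes = [6000, 11000, 15000, 19000, 23000, 27000, 32000, 38000, 47000, 80000]

# Each point of the cumulative distribution pairs an income with its midpoint fraction.
cumulative_points = list(zip(incomes, decile_midpoints()))

for income, fraction in cumulative_points:
    print(f"{income:>7,d}  {fraction:.0%}")
```

Plotting income on the horizontal axis and the fraction on the vertical axis gives one S-shaped line per country.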

Using linear values on the horizontal scale tends to compress the lower values together. I plotted the log of the income data to spread out the data. On a log scale, an equal distance on the chart relates to an equal percentage difference of the values.
The rightmost gray line is Luxembourg. It is steeper than the line for the US, meaning there is less disparity between the lowest earners and the highest. Most of the other curves have a similar slope (similar percentage disparity between low and high). The US and a few others are less steep, meaning they have a greater difference between the poorest and richest earners.
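To see why a log scale spreads out the low end, here is a small sketch. It assumes base-10 logs (the base only changes the units, not the comparison), and the income pairs are invented for illustration.

```python
import math

def log_distance(a, b):
    """Distance between two values along a base-10 logarithmic axis."""
    return math.log10(b) - math.log10(a)

# Two hypothetical pairs of incomes, each pair differing by a factor of 2.
low_pair = log_distance(5_000, 10_000)
high_pair = log_distance(40_000, 80_000)

# On a linear axis these gaps are 5,000 and 40,000; on a log axis they match.
print(low_pair, high_pair)
```

Both pairs sit the same distance apart on the log axis because each represents a doubling, which is what makes slopes comparable across rich and poor countries.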

I converted the percentile values on the vertical axis to their Z-scores. This straightens out most of the curvature at the ends of these S-shaped curves.
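The conversion can be sketched with Python's standard library, as a stand-in for whatever worksheet function does the same job. The helper name `percentile_to_z` is just a label for this example.

```python
from statistics import NormalDist

def percentile_to_z(p):
    """Z-score of the standard normal distribution at cumulative fraction p (0 < p < 1)."""
    return NormalDist().inv_cdf(p)

# Decile midpoints: 5%, 15%, ..., 95%.
midpoints = [(i + 0.5) / 10 for i in range(10)]
z_scores = [percentile_to_z(p) for p in midpoints]

for p, z in zip(midpoints, z_scores):
    print(f"{p:.0%}  {z:+.3f}")
```

Using the decile midpoints keeps every fraction strictly between 0 and 1, which matters because the Z-score is undefined at exactly 0% or 100%.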

I took this one step further, and computed linear regression coefficients for each country’s distribution using Z-scores as the independent variable and log income as the dependent variable. The computed intercept provides a rough approximation of the median income, while the computed slope is related to the disparity between highest and lowest income.
The median income is shown below left and the income disparity below right. If you believe that higher income and less income disparity is good, then it’s better to be located toward the top of each of these charts. Luxembourg is close to the top of both charts; Mexico, Turkey, and Poland are near the bottom of both; and the US is mixed, near the top of median income and near the bottom of income disparity.
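A rough sketch of the regression, written as plain least squares in Python. The base-10 log, the helper name `fit_line`, and the income figures are all assumptions of this example, not the values behind the charts.

```python
import math
from statistics import NormalDist

def fit_line(x, y):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope

# Z-scores at the decile midpoints (5%, 15%, ..., 95%).
z = [NormalDist().inv_cdf((i + 0.5) / 10) for i in range(10)]

# Hypothetical average incomes per decile for one made-up country (NOT OECD data).
incomes = [6000, 11000, 15000, 19000, 23000, 27000, 32000, 38000, 47000, 80000]
log_income = [math.log10(v) for v in incomes]

intercept, slope = fit_line(z, log_income)

# Z = 0 is the 50th percentile, so the intercept is the fitted log of the median.
approx_median = 10 ** intercept
print(f"approximate median: {approx_median:,.0f}   slope: {slope:.3f}")
```

A larger slope means log income changes more per unit of Z, that is, a wider spread between the low and high deciles.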

I like XY charts, so I plotted Income Disparity against Median Income. The data shows a negative trend: in the bottom right corner is Luxembourg, while in the top left are Mexico, Turkey, and Poland. The US is the furthest above a diagonal, meaning its combination of high median income and high income disparity is unmatched. To reduce clutter, I removed the labels from the data points in the middle of the chart.

While this analysis is interesting, none of these charts gives a really good overview of all of the data at once. The cumulative distributions give a sense of the data, but it is not realistic to try to label all countries on a single chart. Ironically I drifted back to the stacked bar representation that The Economist showed. With less distracting background, of course.
In Kaiser’s post, Chris Jackson mentioned his paper in The American Statistician, Displaying Uncertainty With Shading, which essentially applies a gradient to a bar that relates to the density of data at each point along the bar. I decided to apply this approach, but without a gradient that fades too severely at its ends (too reminiscent of the conditional formatting data bars in Excel 2007). I adapted his approach by applying lighter shades to the outer deciles and darker shades to the central ones. This worked reasonably well, considering that I made no effort to develop a color palette for this purpose.
Here is my own rendition of The Economist’s chart.

This has the same problem with the lower deciles being compressed. But having just plotted log income in my cumulative distribution plots, I decided it would be worthwhile to apply logs to this stacked bar chart.

This is a more effective display of income disparity than the linear version put forth by The Economist, but it assumes that the audience understands Logarithmic Axis Scales. The median income is clearly identified by the lateral position of the bars, and the disparity of income is directly related to the total width of the bars.
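The geometry of such a bar can be sketched as follows. This is a minimal Python illustration, not the Excel construction; the function names, the gray levels, and the income figures are all invented for the example.

```python
import math

def log_bar_segments(decile_values):
    """(start, width) pairs in log10 units for the segments joining consecutive values."""
    logs = [math.log10(v) for v in decile_values]
    return [(lo, hi - lo) for lo, hi in zip(logs, logs[1:])]

def segment_shades(n_segments, lightest=0.85, darkest=0.35):
    """Gray level per segment (0 = black, 1 = white): light at the ends, dark in the middle."""
    center = (n_segments - 1) / 2
    shades = []
    for i in range(n_segments):
        distance = abs(i - center) / center if center else 0.0
        shades.append(darkest + (lightest - darkest) * distance)
    return shades

# Hypothetical average incomes per decile for one made-up country (NOT OECD data).
incomes = [6000, 11000, 15000, 19000, 23000, 27000, 32000, 38000, 47000, 80000]

segments = log_bar_segments(incomes)    # ten values give nine segments
shades = segment_shades(len(segments))
total_width = sum(width for _, width in segments)

for (start, width), shade in zip(segments, shades):
    print(f"start={start:.3f}  width={width:.3f}  gray={shade:.2f}")
print(f"total width = {total_width:.3f} log10 units")
```

The total width equals the log of the ratio between the highest and lowest decile averages, which is why bar width reads directly as disparity on the log scale.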
Posted: Wednesday, November 19th, 2008 under General.
Comments: 6
Comments
Comment from fabrice
Time: Wednesday, November 19, 2008, 1:03 pm
Boxplots could work, losing some granularity…
Comment from Ran Barton
Time: Wednesday, November 19, 2008, 1:37 pm
“I did not include the protocol for creating this interactive chart, but if I get more than two comments asking for it, I’ll be glad to oblige.”
Count me in as one vote for a write up, please and thank you.
Comment from Jon Peltier
Time: Wednesday, November 19, 2008, 1:58 pm
Fabrice - I did in fact think of that, but decided to leave it for another day.
Ran - That’s 1….
Comment from Jon Peltier
Time: Wednesday, November 19, 2008, 4:46 pm
Fabrice -
Nice. The boxplots lose resolution in much the same way as the stacked bars against a linear income scale: everything at the lower end is compressed. You could apply a log scale, and the disparity in income would be easy to compare.
Comment from Jon Peltier
Time: Thursday, November 20, 2008, 9:22 am
Fabrice has shared his box plot above and a second one that uses a log scale in How to use BoxPlot charts.