Recently in the Flowing Data blog, Nathan Yau wrote U.S. Census Bureau’s 2008 Statistical Abstract – Looking at America’s Data. Interesting, now I can look up anything and take on Cliff Clavin.
Nathan included a chart showing how the price of first class postage stamps had increased over the years. In a comment I noted that a step chart would have been a better choice to show this data, as the nature of stamp prices is not continuous in nature, but is characterized by constantprice segments with jumps from one price level to the next.
Another commenter disagrees, saying that a step chart makes the trend difficult to see. I thought this was a good excuse to make a comparison between line and step charts for this type of data.
I went to the Census Bureau’s Statistical Abstract to get the stamp price data. Strangely enough, searching a number of keywords were unsuccessful, so I visited my favorite source for all knowledge, and found what I wanted in three seconds.
In USPS Stamp Prices On The Rise – Historical Prices, Chuck posted the following table of USPS first class rates, and another line chart. And I ask parenthetically, who says Excel 2007 makes better looking charts?
Date  Rate  % Increase 
14Sep75  $0.10  
31Dec75  $0.13  30.00% 
29May78  $0.15  15.38% 
22Mar81  $0.18  20.00% 
1Nov81  $0.20  11.11% 
17Feb85  $0.22  10.00% 
3Apr88  $0.25  13.64% 
3Feb91  $0.29  16.00% 
1Jan95  $0.32  10.34% 
10Jan99  $0.33  3.13% 
7Jan01  $0.34  3.03% 
30Jun02  $0.37  8.82% 
8Jan06  $0.39  5.41% 
14May07  $0.41  5.13% 
12May08  $0.42  2.44% 
I took Chuck’s data and created my own line chart, and a step chart, of the USPS rate data.
My reasons for preferring the step chart are that it more accurately shows the actual daytoday prices of a stamp. The trend is pretty obvious, as this data is not very complicated: every several months or years, the price goes up by a few cents. The line chart smudges the data, reducing our ability to (a) discern the intermittent pattern of price changes, (b) compare the duration between price changes, and (c) compare the magnitude of price changes. The line chart does show the increasing trend, but no more effectively than the step chart.
Making a Microsoft Excel Step Chart
Excel has no native step chart, but creating one is only as difficult as properly preparing the data. I wrote about this in my main web site in Step Charts in Microsoft Excel. I showed how to use an XY series with no markers or lines, but with error bars to show the horizontal and vertical segments of the steps. I also showed how to selectively repeat ranges in the source data of a line chart using defined names, but I can tell from feedback on that page that a lot of people had problems conceptualizing that technique. Since I’m a strong proponent of the “Keep It Simple Stupid” philosophy, I’ve come up with an alternative way to prepare data for a step chart line chart.
The sequence of ranges below shows how to do this in about three minutes. Start by copying the original data (yellow), and pasting it below the original data (green), shown in the first block below. The second block shows two cells to be deleted: the first cell of the original dates and the last cell of the original rates. Delete these two cells, shifting cells up to fill the gap; the arrangement is shown in the third block. This arrangement is sufficient for creating a step chart if the X data are dates, because the line chart sorts the data internally before plotting it. However, to make the table more humanfriendly, sort it by date, as shown in the fourth block below.




To create a step chart, select the data (processed as in the fourth block above), and create a line chart using the lines without markers option.
It’s possible to modify the data protocol in order to generate a step chart that has only the horizontal segments, without the vertical segments (risers). Derek pointed me toward this in one of the comments, and I incorporated it into a new blog post, Step Chart Without Risers.
The step chart does make a better reference of the actual stamp prices. I’m by no means an Excel wizard so nice to see some other folks analyzing this data.
Jon, in general the line chart is the wrong choice when time intervals are unequal (what *really* happens during the missing periods?). In this specific case however, the trend is predictable (postal rates don’t go down or stay constant for very long periods), which also means that what happens during missing periods is predictable. But this predictability is also the problem. What do you learn from the line chart that you didn’t already expect to see?
In contrast, the step chart illustrates both the sizes of the changes and the discrete nature of the changes (stays constant from one change to the next and not slope upwards as the line chart shows) . The line chart totally obscures these characteristics.
@Colin: “in general the line chart is the wrong choice when time intervals are unequal”
When date intervals are unequal, a line chart is not necessarily the wrong chart type. If Excel recognizes the X values as dates, it applies a datescale* axis (if Excel does not recognize the X values as dates, but they are numerical, you can manually apply a datescale axis). The date scale axis spaces points according to the number of days (whole numbers) between them.
My charts in this post all use a datescale axis, with a base unit of days.
* The datescale axis was inaccurately called a timescale axis prior to Excel 2007. This axis type only looks at the integer value of a datetime, meaning is recognizes dates, but all times on a given date are rendered at midnight. Some day I’ll blog about this axis type.
“it more accurately shows the actual daytoday prices of a stamp”
It also more accurately shows the pricetoprice intervals between days, in my experience. In effect, it doubles the information visible on the chart, with almost no disadvantages, which is why it’s a favorite for one of my regular graphs at work. I only maintain the line version (which also has a datescale axis) against the day when I’ll hand the spreadsheet over to someone else to maintain, and they inevitably break the step chart.
Jon, the date scale axis is wrong for line charts, although it’s often used for that purpose. I use it mostly for formatting dates on the chart different from that of the source date format. Consider this: I have sales data for Jan, Feb, May, Jun, Sept, Nov and Dec. Now plot this data using a date axis. The date axis accurately plots the unequal dates (unlike the normal category axis which arranges the months equally on the axis). Now does the date axis based line chart represent the true trend over the year? It can’t, since we have missing intervals. If Feb sales are higher than May’s, it would appear that sales declined between Feb and May. However, sales might have been continuing on an upward trend in March and April but we have no way of knowing this. In essence, both the date line and category line charts are wrong if intervals are missing. Step charts work well with the date axis because intervals don’t have to be even (it’s more “transactional” in nature).
Colin –
I don’t think a date scale is wrong for a line chart. It’s often better than the category scale, and while it treats the dates in a similar manner to how an XY chart does, there is more flexibility in formatting the scale.
Your example describes a problem with missing data. Any misunderstanding in that example chart is the fault of the missing data, not of the chart. If you wanted to properly convey the missing data, it would be better to leave the cell(s) blank, so that a gap remained in the lines connecting points in the chart. In the case of missing data, the step chart would also be as wrong as any other representation.
Jon, good points. In the line chart, the data is assumed to be missing because equal interval data is supposed to be there to make the chart meaningful. In the step chart, the intervals aren’t expected to be equal, so there’s no need to make any assumptions about any data that should be there for the chart to make sense.
Coming back to the line chart, if you leave cells blank where there’s no data for a month, then you’re back to equal intervals on the category axis, so except for flexibility in formatting the date axis, there no difference to the plotted points if you use date or text axis. Now, you have a chart that accurately puts the months where they should be and shows gaps where’s there’s no data. But what can you do with this chart? Wait for missing data, I suppose. You can’t perform any meaningful analysis otherwise.
I think we agree in a roundabout way. I was originally making the point that folks often suggest using the date axis when date intervals are uneven, but plotting a line between uneven points on a datescale makes no sense, since the purpose of a line chart is to show the trend of equally spaced, related data. Stephen Few has written about this issue abundantly, but to be honest it’s bothered me for a long time.
Here’s a recent example of the problem. You may recall a post on the Microsoft Excel blog some time ago where someone at Microsoft explained how to use an XY chart to create a time axis (hours, min, sec). http://blogs.msdn.com/excel/archive/2007/12/17/intradaytimeseriescharts.aspx
The issue with the final chart is that you can’t really tell what’s going on because the time intervals are uneven. For example, what happens between 1:11pm and 3:11pm (an entire 2hr period out of the total 6)? Were there no calls waiting? Viewed over the entire day, the chart is meaningless.
Sorry for the longwinded post.
I don’t understand the obsession with an equal data interval. A line chart need not show the trend of only evenlyspaced data. Suppose I am observing temperatures, and I decide for simplicity that where the temperature hasn’t changes, or where it has been changing steadily, I do not need to record every value. Overnight after the temperature has dropped, I can characterize my temperature profile with one point per hour. As the sun rises, i may need more frequent recordings to capture the morning warm up. Then the clouds blow over, it starts to rain, then it clears up again; I may need minutebyminute data points to track this. When i make my plot, is it any less relevant because the spacing of the data ranges from minutes to hours?
What is intermittent in the charts in this post is the spacing of time between rate changes. In between changes, the rate is constant. I could make a chart with one data point per day, which would have equal data intervals. I suppose I could argue that the charts here have equal event intervals, one data point per event.
Regards to missing data, the date should be kept in the data range and only the data cell left bank, if you know there’s a date for which there’s data but you don’t have it. This maintains integrity between time duration on the chart axis and in the data.
The problem with the chart in the Microsoft blog (I’ll bet they wished it was never posted) isn’t the irregular intervals, it’s that there were times that had values of zero, and these times were not indicated in the data nor in the chart. A line chart may have been indicated if the chart plotted the zeros. In general, since there isn’t a strong relationship between people waiting at one minute and those waiting at the next (unless some are kept waiting for several minutes), perhaps a column chart would have been a better choice, with one column every minute (including zeroheight ones where nobody was waiting).
Here’s the chart Microsoft posted:
Here’s the chart they should have posted:
I simulated a column chart by making an XY chart, hiding lines and markers, and using negative vertical error bars with a percentage value of 100%.
I like your column chart rework. But aren’t we saying the same thing? I completely agree that had the zero values been shown for times where they were zero, the line chart would more accurately show what’s happening. From 1:11pm tod 3:11pm the line should be on the xaxis. The line shown between these two times doesn’t represent reality. The problem with the temperature example you mention is that if the points for which the temperature doesn’t change aren’t plotted, how do you reflect this constancy on the line chart to the next interval where the change happens (unless you’re plotting a step chart)? I think that I might have missed your point.
Colin –
My point is that a temperature change is continuous in nature: if it’s constant for ten hours, I need only record the beginning and end of that constant range; if it’s changing at a constant rate, I need only record the beginning and end of that steady state region. The change in postal rate is a step change. A person waiting for a table is either there or not. You seemed hung up on unequal time intervals, which are not an issue.
Funny story about continuous and discontinuous variables. My wife teaches 8th grade algebra. In one unit the students learn about continuous vs. discrete variables. Measurements of water are continuous. Counting the cars on the New Jersey Turnpike? that’s continuous. They were plotting pet store sales: one dog is $50, two dogs are $100. She asked a kid if it was continuous. He said it was, for a butcher!
Jon, if you have a constant rate from one point to the next or plot the ends of a steady temperature interval then yes, the line chart works, and for these cases interval isn’t an issue. Most of the incorrect plots I see with unequal time periods (like the Microsoft chart) don’t conform to the cases you mention though, and that’s probably why I overgeneralized the argument.
Phew! I’m glad this is over :)
<<They were plotting pet store sales: one dog is $50, two dogs are $100. She asked a kid if it was continuous. He said it was, for a butcher!<<
LOL!
Perhaps it’s the “risers” on the steps that make people uncomfortable, and make them focus on the change point. The idea behind the step chart is to remove the change points’ privileged position and focus on the periods of unchanging price in between.
Here is a step graph without risers, which I think solves some of the problem. Unfortunately in Excel it has to be a scatter graph to have the gaps, and it has to be a line graph to have the date scale, so there are two series there, one an invisible dummy.
Tufte, in VDQI, recommends, not absent risers but hairline ones. But in this tiny lores picture the hairline risers looked barely thinner than the steps, so I made them invisible again.
Derek –
That’s a very good point about the risers. To get the gaps, of course, you can use a line chart by itself (without thee XY chart and its horizontal error bars), by inserting extra spaces into the data. I was working on another post, one showing a chart that connects corresponding points of two XY series, and realized that the data preparation for that chart is not much different than for a regular chart, and it is exactly what is needed for a step chart without risers.
So visit this new post: Line Chart Without Risers to see the adjusted protocol.
Why not postediting your post and add a link to
http://peltiertech.com/Excel/ChartsHowTo/StepChart.html ?
Regards,
Michel
Michel – Good idea. I’ve done that, and entered links to the related blog posts to the web page.
Berkeley economist Brad DeLong rightly praises the Wall Street Journal’s step graph of Obama and Clinton’s delegate and superdelegate pledges over the course of the 2008 Democratic primary campaign. (and wrongly saves the original GIF as a JPG)
Dear mr. Peltier,
thanks for explaning the step chart procedure. I incorporated it in a Excel2003 macro.
Right now, I’m in the process of converting that macro to Excel2007 and that causes some problems. I put a post about this on the mrexcel forum:
http://www.mrexcel.com/forum/showthread.php?p=1723162&posted=1#post1723162
Hope you can take a look there, sinceyou seem to be expert on this matter.
Thanks, best regards,
Geert.
2007 uses existing VBA code successfully 99.5% of the time.
This is part of the other 0.5%.
In Excel 2003, the error bar value for the unused error bar direction was optional. In 2007 it is supposedly optional, but you do have to explicitly enter a zero for the unused direction.
Dear mr. Peltier,
Thank you the reply. I tried your solution and it works perfectly! Those 0.5% cases give me a lot of trouble… Well, thanks again for the quick and helpful response.
Best regards,
Geert.
Jon, the instructions were very useful and I obtained a beautiful line graph. Thank you!
However, I’d rather have a numbers on the xaxis and I don’t get that solved. For me, it is important at which day since the start of an experiment an event occurs, not the exact date. Does anyone have advice on how to do that? I know it is possible to do it with the error bars or just by copying data, but those are just not that elegant and involve producing a lot more data.
Thank you in advance!
Hi Bessie –
What kind of numbers do you need along the X axis? If it’s just integer values, you can use these in place of the dates, then tell Excel to use a date scale instead of a category scale. Dates are whole numbers, after all. Then you’ll have to apply a different number format, probably General, since Excel will format your whole numbers as dates.
If you need real numbers which may include fractions, you can use the XY chart with error bars, as described in Step Charts in Microsoft Excel.
Very useful – used your process to create this bank of england base rate chart:
Many thanks!
Brilliant! I’ve been looking for a way to plot a survival distribution (step graph) in Excel and this is hands down the simplest work around I’ve come across in my searching. Thanks!
Very nice demonstration on the contrast between line chart and step chart. I like it a lot.
Is there any method that you can suggest in constructing a line chart with repeated ranges whose values in the xaxis are numbers instead of dates?
Paul –
By repeated ranges, do you mean the trick to get the extra data into the chart to draw the vertical and horizontal segments of the steps?
You could go to the extra step of sorting the data (not required for a line chart using dates) and make an XY chart with lines and no markers.
Dear Mr. Peltier,
How can I create a step chart which has a no date based, but a numerical X axis. (Similar to a merit order curve)?
Thanks,
P.