We can forecast even when no historical data exists, if we use our experience and judgment. In Part 1 of our probabilistic forecasting series we looked at how uncertainty is presented; in Part 2 we looked at how uncertainty is calculated. Both of those parts presumed historical data was available.
Although estimating without historical data makes many people uncomfortable, acting responsibly often requires us to do it. Fear of being wrong may cause us to avoid making any forecast at all, leaving someone else to make uninformed decisions. Forecasting helps us make better decisions by reducing uncertainty, even when there is little information. Probabilistic forecasting may involve experts expressing their guesses as a range. Wider value ranges in their “guesses” may indicate more uncertain inputs.
We recommend adopting these practices to get good estimates from experts:
Estimate as a group to uncover risks that may expand the range of uncertainty (use Planning Poker or other anchor-bias reducing mechanisms to help expose differences).
Estimate using a range, not a single value
Coach experts to estimate using ranges to combat their particular biases toward optimism and pessimism
Range estimates must be wide enough that everyone in the group feels that the real value is within the range, as in “95 times out of 100 this task should take between 5 and 35 days.”
People can learn to be good estimators. Most people perform estimation poorly when faced with uncertainty (see “Risk Intelligence: How to Live with Uncertainty” by Dylan Evans and “How to Measure Anything” by Douglas Hubbard). Both authors found that practicing picking a range that most likely contains the actual value of known problems (the wingspan of a Boeing 747, or the miles between New York and London, for example), then giving experts feedback on the actual answer, increased estimation accuracy. Practice helps resolve personal pessimistic and optimistic biases.
When estimating how long IT work will take, teams should provide a lower and upper bound. When a project of sequential stories needs forecasting, it’s simple: the project forecast range is between the sum of the lower bounds and the sum of the upper bounds. However, few large projects involve completing stories strictly in sequence. If you have multiple teams, people working in parallel or complex dependencies, a simple sum doesn’t work (not to mention the unlikely luck of every piece of work being at the lower bound or the upper bound). Most projects need a more powerful technique for accurate forecasting.
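The sequential case described above can be sketched in a few lines. This is a minimal illustration with made-up range estimates, not data from any real project:

```python
# Sketch: forecasting a strictly sequential project from range estimates.
# Each story estimate is a (lower, upper) pair in days; values are illustrative.
stories = [(2, 5), (3, 8), (1, 4)]

# For sequential work, the project range is simply the sum of the bounds.
lower_total = sum(lo for lo, _ in stories)
upper_total = sum(hi for _, hi in stories)
print(f"Project forecast: between {lower_total} and {upper_total} days")
```

As the text notes, this only holds when stories complete strictly one after another; parallel work and dependencies break the simple sum.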
Monte Carlo simulation can responsibly forecast complex projects, even if the only data you have is expert opinion. When Monte Carlo simulation is performed properly, we can propagate the uncertainty of different components into a responsible project forecast. For example, a statement like “We have an 85% chance of finishing on or before 7th August 2014” is mathematically supportable.
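As a sketch of the idea, the simulation below samples each story’s duration from its expert range many times and reads off a percentile. The ranges, trial count and uniform sampling are illustrative assumptions, not the authors’ actual model:

```python
import random

# Hypothetical expert range estimates (days) per work item -- illustrative only.
story_ranges = [(5, 35), (2, 10), (8, 20), (3, 15)]

def simulate_project(ranges, trials=10_000, seed=7):
    """Sample one duration from each range per trial and sum them."""
    rng = random.Random(seed)
    totals = [sum(rng.uniform(lo, hi) for lo, hi in ranges)
              for _ in range(trials)]
    return sorted(totals)

totals = simulate_project(story_ranges)
# The 85th percentile answers: "85% of simulated outcomes finish by this duration."
p85 = totals[int(0.85 * len(totals)) - 1]
print(f"85% chance the project takes {p85:.1f} days or less")
```

A real model would also encode parallel work, dependencies and non-uniform distributions, which is where Monte Carlo earns its keep over a simple sum.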
In the next part of our probabilistic forecasting series, we will look at the likelihood of values within a range, how that can help narrow our forecast risk, and why work estimate ranges follow predictable patterns that help us be more certain.
We love supporting the community, especially in our home town. Come see us talk about agile metrics, risk reduction and cost of delay at the Modern Management Methods Conference in San Francisco, May 5th to 8th. Troy Magennis and Dan Greening are both speaking on the Risk Management and Metrics track; click here for more details. Register for the main conference (Wednesday and Thursday) to attend our talks. Sign up for the 4-day option to attend our interactive tutorials.
Get 15% off conference registration by using the discount code LKSPEAK when registering through the website.
Risk-Reduction Metrics for Agile Organizations Dr. Dan Greening
Wednesday, May 7 • 2:20pm – 3:00pm
Agile and lean processes make it easier for organizations to measure company and team performance, assess risk and opportunity, and adapt. My colleagues and I have used delivery rate, concept-to-cash lead-time, architectural foresight, specialist dependency, forecast horizon and experiment invalidation rate to identify risk, and focus risk-reduction and learning efforts. With greater knowledge, we can eliminate low-opportunity options early and more deeply explore higher-opportunity options to maximize value. We’ve used these metrics to diagnose agility problems in teams and organizations, to motivate groups to improve, to assess coaching contributions, and to decide where to spend coaching resources. We face many problems in using measurement and feedback to drive change. Manager misuse or misunderstanding of metrics can lead organizations to get worse. Teams or people that mistrust or misunderstand managers often game metrics. And yet, what we can’t measure, we can’t manage. So part of a successful metrics program must involve creating and sustaining a collaborative, trusting and trustworthy culture.
Understanding Risk, Impediments and Dependency Impact:
Applying Cost of Delay and Real Options in Uncertain Environments Troy Magennis
Wednesday, May 7 • 4:20pm – 5:00pm
Many teams spend considerable time designing and estimating the effort involved in developing features, but relatively little time understanding what can delay or invalidate their plans. This session outlines a way to model and visualize the impact of delays and risks in a way that leads to good mitigation decisions. Understanding which risks and events are causing the most impact is the first step to identifying which mitigation efforts give the biggest bang for the buck. It’s not until we put a dollar value on a risk or dependency delay that action is taken with vigor.
Most people have heard of Cost of Delay and Real Option theory but struggle to apply them in risky and uncertain portfolios of software projects. This session offers some easy approaches to incorporate uncertainty, technical risk and market risks into software portfolio planning in order to maximize value delivered under different risk tolerance profiles.
Topics explored include
how to get teams to identify and estimate impact of risks and delays
how to identify risk and delays in historical data to determine impact and priority to resolve
how risks and delays compound and impact delivery forecasts, and what this means to forecasting staff and delivery dates
how to calculate and extend Cost of Delay prioritization of portfolio items considering risk and possible delays
how Real Options can be applied to portfolio planning of risky software projects and how this can change the bottom line profitability
Capturing and Analyzing “Clean” Cycle Time, Lead Time and Throughput Metrics Troy Magennis
Thursday, May 8 • 11:00am – 12:30pm
On the surface, capturing cycle time and throughput metrics seems easy in a Kanban system or tool. For accurate forecasting and decision-making with this data, we had better be sure it is captured accurately and free of contaminated samples. For example, the cycle time or throughput rate for a project team working nights and weekends may not be the best data for forecasting the next project. Another choice we have to make is how to handle large and small outlier samples (extreme highs or lows). These extreme values may influence a forecast in a positive or negative direction, but which way?
This interactive session will look for the factors attendees have seen that impair data sample integrity and look for ways to identify, minimize and compensate for these errors. The outcome for this session is to understand the major contaminants and to build better intuition and techniques so we have high confidence in our historical data.
We’re really looking forward to this conference and hope to see you there!
— Troy and Dan
In Part 1 of this series we discussed how probabilistic forecasting retains each estimate’s uncertainty throughout the forecast. We looked at how weather forecasters present uncertainty in their predictions, and how people accept that the future cannot be predicted perfectly and life still continues. We need this realization in IT forecasts!
In Part 2 we look at the approach taken in the field of probabilistic forecasting, continuing our weather prediction analogy.
We can observe the present with certainty. Meteorologists have been recording various input measures for years, and evidence suggests ancient cultures understood the seasons well enough to know what food items to plant and when. These observations, and how they played out over time, form the basis for tomorrow’s weather forecast. Modern forecasters combine today’s actual weather conditions with historical observations and trends, using computer models.
This is the first article in a series that will introduce alternative ways to forecast date, cost and staff needs for software projects. It is not a religious journey; we plan to discuss estimation and forecasting like adults and understand how and when different techniques are appropriate given context.
Stakeholders often ask engineers to estimate the work for a project or feature. Engineers then arrive at a number of story points or a date and present the result as a single number. They rarely share uncertainty or risk with those estimates. Stakeholders, happy to get “one number”, then characterize engineer estimates as commitments, and make confident plans that depend on achieving the estimate. Problems arise when uncertainty and risks start unfolding and dates shift. Failure to communicate engineering uncertainty is a key difference between estimation and forecasting.
I recently observed that most of a client’s backlog items had 5-point estimates. A colleague tweeted it, and a Twitter storm of 180+ messages followed. This article explains the phenomenon, and why no one should call the police.
To manage capacity, forecast delivery dates and motivate architectural thinking, we often ask development teams to estimate proposed work. We were curious whether group dynamics cause statistical patterns in estimates. If so, can we reduce effort estimating detailed backlog items, especially in early forecasting, before we are really certain about what we might do?
Here we examine one organization’s data to look for signals and patterns. Our analysis is too anecdotal to provide conclusive recommendations for other groups and other individual teams, but it is highly likely that these patterns exist elsewhere. We illuminate biases and patterns in estimation behavior. We seek greater discussion on estimation behavior, with the goal of better forecasting with less effort.
This article has three parts
Introduction to Story Point Estimation and other techniques currently used by teams
A presentation of data and patterns seen by the author as part of a recent client engagement
Discussion and initial thoughts on causes of the observed patterns
Story Point Estimation
For Agile teams who estimate, story points have become a popular unit to measure estimated effort. Agilists invented an artificial unit of “story point” to distance effort estimation from time estimation, and to render comparisons between teams more obtuse. For its primary purpose, allowing people to compare the estimated team effort of different items, story points work fine.
The story point metric also provides a way to limit work to avoid overwhelming the team’s capacity, and to forecast future team progress. The story points completed over time provide “velocity”, which usually correlates with the team’s near-term capacity. Velocity can then be used for medium or long term projections, with diminishing accuracy as time frames increase. Our case study used story points for both purposes.
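A minimal sketch of a velocity-based projection, using invented sprint numbers rather than the case study’s data:

```python
import math

# Illustrative inputs: recent sprint velocities and remaining backlog size.
recent_velocities = [21, 18, 25, 20]   # story points completed per sprint
remaining_points = 130                 # story points left in the backlog

# Project forward using the average of recent velocity samples.
avg_velocity = sum(recent_velocities) / len(recent_velocities)
sprints_needed = math.ceil(remaining_points / avg_velocity)
print(f"~{sprints_needed} sprints remaining at average velocity {avg_velocity:.1f}")
```

As the text cautions, such a point projection loses accuracy as the time frame grows; a range or simulation over the velocity samples communicates the uncertainty better.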
How Teams Estimate using Points
No standard method dictates how teams assign story points to a work item. Estimation generally occurs in specific meetings between the team and the “Product Owner” (the author of the work items). Having the Product Owner accessible during estimation helps the team quickly clarify work intent and acceptance criteria before estimating. Often these discussions turn to splitting a large work item into smaller work items that incrementally build functionality and spread the invention risk.
To estimate new work items consistently with historical work, teams often compare them against previously estimated and completed work items (“reference stories”). Teams can estimate faster by quickly finding reference stories that bracket the new work item’s estimated effort. Reference stories help normalize estimates and learn from history. Some teams take comparative estimation further by spreading a large number of stories on a table and clustering them in groups of similar effort, assigning a point estimate value to all stories in a “pile.” This process, called “Bulk Estimation” (see http://agilecanon.wpengine.com/bulk-estimation/), helps teams focus on which work is larger or smaller in effort relative to other work. For prioritization trade-off decisions, this early information helps teams do high-value low-effort work before high-effort low-value work.
Agile and Lean Startup methods and practitioners promote smaller work items. Teams work hard to split work into acceptably small items. This helps the team deliver smaller increments (with shorter lead time) and allows deferring complex features that may change after customers try the simpler features. In my experience, stories are often split into smaller units but rarely combined into larger units for work being attempted in a sprint cycle.
All teams in this article were biased toward smaller pieces of work and estimates, estimating in the presence of the Product Owner, who helped split work if it was too big.
To avoid team members getting too detailed in their estimation with an open-ended story point scale to choose from, some teams restrict the story point options to a subset. The most commonly chosen subset of estimation values follows the Fibonacci scale. The logic is that each “next” sizing option is roughly 50% larger than the previous, and this is about as good a guess as is likely possible given the information on hand and the uncertainty surrounding the technical and system risks.
Pure Fibonacci is where the next number is the sum of the previous two numbers in the sequence, such as: 1, 2, 3, 5, 8, 13, 21, 34. Software teams often modify this to 1, 2, 3, 5, 8, 13, 20, 40 for “simplicity,” though this author is not exactly sure why it’s perceived as simpler; it is just different.
By restricting to a smaller set of options, any continuous pattern in allocation is replaced by discrete stepped patterns. When people estimate using these options, they are guessing which size option the story is closest to. This satisfies the intended purpose of discerning which pieces of work are more effort than others, and by how much, in discrete steps.
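The “closest option” step can be sketched directly. The scale values are taken from the modified sequence discussed above; the helper name and the tie-breaking rule (ties go to the smaller option) are my own assumptions:

```python
# Sketch: snapping a raw effort guess to the nearest modified-Fibonacci option.
SCALE = [1, 2, 3, 5, 8, 13, 20, 40, 100]

def nearest_option(raw_guess, scale=SCALE):
    """Return the scale value closest to the raw guess (ties go to the smaller)."""
    return min(scale, key=lambda v: (abs(v - raw_guess), v))

print(nearest_option(6))   # -> 5
print(nearest_option(11))  # -> 13
```

This makes the discretization visible: a continuous spread of raw guesses collapses onto a few stepped values, which is exactly why the histograms later in the article show discrete spikes rather than a smooth curve.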
Modified Fibonacci scales were used by all teams in this article, although Team C used a mix of traditional and modified Fibonacci values in its story point estimates, presumably in error.
Planning Poker is a game to get team members to discuss different estimation perspectives. Each team member estimates privately, then all reveal their estimates simultaneously. This reduces anchor bias. Planning poker was used by all teams in this article.
The Product Owner provides a work item description
The team discusses the work item with the author to clarify and refine
When the team is ready, team members who may work on the item reveal their estimates in unison (on the count of three). Revealing in unison avoids one estimator anchoring the group. Teams achieve this unison vote in various ways: pre-printed playing cards with Fibonacci numbers on them, fingers in the air, text in a chat window, even just shouting
The group discusses the differences. For example, “Joe, you say 20 but everyone else says 5; do you know something we don’t about that code?” Discussion continues until everyone in the group agrees on a number, or the facilitator chooses the highest remaining estimate if consensus is taking too long
Planning Poker is fast when team members agree and generates discussion when they disagree; it avoids anchor bias among team members. Newer team members learn during these discussions why such a range of estimates was initially suggested.
I anticipated patterns arising from these estimation techniques, which would appear in the histograms and summary statistics of story point estimates assigned by teams.
The curve will be left-weighted (have a longer right tail).
The Mode would be to the left of the range center.
The rate of decay after the Mode would be exponential.
All teams under study used the modified Fibonacci sequence of 1, 2, 3, 5, 8, 13, 20, 40, 60, 100. The values 40, 60 and 100 were used only in special cases; for most purposes, 20 was the highest estimate given by teams, and anything above it meant “we don’t need to know yet, but it’s big.” Given this backdrop, the following pattern was expected –
a. Median would be 5 points.
b. Mode would be 3 or 5 points (the central option or the first below the middle).
c. Average (Mean) would be 5 story points.
My reasoning for these expected patterns is based upon observing the subtle pressure for smaller pieces of work versus larger items. It was rare that any effort was applied to understanding work greater than 20 points; in fact, a 100-point estimate mainly signaled work that would later be split, adding density to the lower-estimate groups.
Use of a modified Fibonacci scale also influenced the patterns. Just by limiting the possible values to a subset, estimators are forced to make a sequence of binary decisions rather than an absolute estimate. When the scale isn’t linear, the distributions should not be expected to be linear either.
Observed Estimation Patterns
This article describes three different teams chosen from an organizational unit of approximately 80 teams. The teams chosen are from different disciplines: a management process change team (Team A), a typical software development team (Team B), and a team that performs estimation using a relative bulk estimation technique (Team C).
The summary of the Mean, Standard Deviation and Median of the three teams is –
Sampling Method and Statistics
Estimates were pulled from an electronic tracking system. One data field for proposed and completed work held story point estimates; this data was already numeric. In addition to the estimate field, we captured the date the item was created, the date the item was marked as resolved (if it was resolved), and a field indicating what type of work item the entry represented. This data was captured in Excel for statistical analysis and analyzed with EasyFit from Mathwave to obtain summary statistics and histograms.
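The kinds of summary statistics reported below can also be computed with the Python standard library instead of a dedicated tool. The sample values here are illustrative, not the teams’ actual data:

```python
# Sketch: summary statistics for a list of story point estimates.
from statistics import mean, median, mode, stdev

estimates = [1, 2, 3, 3, 5, 5, 5, 8, 8, 13, 20]  # illustrative sample

print(f"Mean:   {mean(estimates):.1f}")
print(f"Median: {median(estimates)}")
print(f"Mode:   {mode(estimates)}")
print(f"StDev:  {stdev(estimates):.1f}")
```

Even this toy sample shows the shape discussed in the article: a Median and Mode of 5 with a Mean dragged slightly higher by a long right tail.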
Team A is a process change management team of approximately six people that delivered no technical code; it produced material to assist the adoption of Scrum within the organization. Team A had 165 samples in total. One sample was removed because its value of 55 appeared to be a typographical or transcription error. 164 samples were analyzed; one had a value of zero.
Figure 1 – Statistics for Team A Story Point Estimates
Figure 2 – Histogram for Team A Story Point Estimates
Team A showed the common traits expected for teams estimating using a Fibonacci sequence. Median and Mode values were in fact both 5 points, and the higher values of 8, 13 and 20 decayed exponentially when plotted on a linear scale.
Team B is a software development team with little UI coding, approximately 12 people broken into two smaller groups. Data for this team can be analyzed in more detail: whereas Team A recorded no defects (defects occurred, but new work items were created to fix errors and omissions), Team B did record work item types in categories. This allows analysis of “work” and “defects” separately and combined.
Figure 3 – Statistics for Team B Story Point Estimates – All (work & defects)
Figure 4 – Histogram for Team B Story Point Estimates – All (work & defects)
Figure 3 shows the expected result, with Mode, Median and Mean values all closest to 5 story points, but the histogram shown in Figure 4 was not expected. Although the 5 story point result was as anticipated, curiosity demanded understanding why this team’s distribution wasn’t following the expected decay rate. One hypothesis was that defects are often smaller fixes, which may explain the high occurrence rate of the lower values, especially 1 story point.
Figure 5 – Histogram for Team B Story Point Estimates – Defects Only
Figure 5 demonstrates that defects follow a distinct pattern in which 1 story point is the Mode. Also notable is the occurrence rate of 8, 13 and 20 story point defects: while slightly left-weighted, the defect distribution is more uniform. The Mean value for defects is 4.3 points and the Median is 3, as shown in Figure 6.
Figure 6 – Summary Statistics for Team B Story Point Estimates – Defects Only
For completeness, Figures 7 and 8 show the summary statistics and histogram of the planned work. In a forecasting situation, this work plays the major role in most cases, as it is the only work known in advance of the project. For this work there is a sizable number of 20-point estimates. Team B rarely undertook 20-point stories in its sprints, but carried a larger-than-expected number of 20-point stories in the backlog, splitting them closer to when they were going to be started. This pattern is encouraged from a process perspective because some of this work may never be attempted, and premature splitting is considered a form of waste.
Figure 7 – Descriptive Statistics for Team B Story Point Estimates – Work only
Figure 8 – Histogram for Team B Story Point Estimates – Work only
Team C is a software development team of approximately 10 people that builds internal tooling. Team C also recorded defects and work. The defect and work statistics followed the same pattern as Team B: defects had a higher Mean value (6.7) but shared the Median of 5 story points. The Mode for Team C was 3, lower than the expected value of 5 points. Exponential decay was present after the Median for the next option of 8 points, but estimates higher than that failed to follow the same exponential pattern; Team C carries higher story point estimates in its backlog.
Figure 9 – Descriptive Statistics for Team C Story Point Estimates – All work
Figure 10 – Histogram for Team C Story Point Estimates – All work
It is no surprise that story point estimates follow particular patterns. The patterns are influenced not only by the work effort, but also by the process and the estimation techniques used. If these patterns prove predictable, this knowledge can be used for practical purposes –
Checking the validity of future estimation sessions: failure to follow a similar pattern may indicate reduced forecast utility
Medium to long term forecasts may be possible without complete estimation of future work; early forecasts could be achieved by replicating the historical pattern seen. Future work will be undertaken to correlate different types of teams and different types of projects, to determine whether the patterns seen here exist in a wider population.
Given that the Median and Mean for various types of teams are stable around 5 points, it might appear that estimating makes little sense. If forecasting were the only intent, that statement would have merit; however, story point estimation plays a big role in selecting what work to attempt in a sprint. Teams B and C, for example, carried many 20-point stories, but rarely attempted them in sprints in their original state; these higher estimates were always split into multiple smaller stories before the work started. Estimation by the team played a key role in managing risk and allowed the team to make important trade-off decisions as late as possible. This is potentially valuable.
The patterns found were expected mainly due to the human bias of central tendency when faced with a fixed set of options, in this case Fibonacci values. Faced with the set 1, 2, 3, 5, 8, 13 and 20, groups cluster around the center values. Given the tendency to prefer smaller work items and the ability to split work, erring on the smaller side was, in the author’s mind, an inevitable outcome. Of course, there will always be outliers, but for the purposes of early forecasting, the most likely value to assume on average is 5 story points. Assuming 5 points would have produced roughly a 10% error (actually 7% in this data) against the eventual Mean values that emerged over time.
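The “assume 5 points” shortcut can be checked with a few lines. The backlog values here are invented for illustration, not the teams’ actual data:

```python
# Sketch: early forecasting by assuming 5 points per backlog item,
# compared with the estimates that eventually emerged (illustrative data).
eventual_estimates = [1, 3, 5, 5, 8, 3, 5, 13, 2, 8]

assumed_total = 5 * len(eventual_estimates)   # early forecast: items x 5 points
actual_total = sum(eventual_estimates)        # what detailed estimation produced
error_pct = abs(assumed_total - actual_total) / actual_total * 100
print(f"Assumed {assumed_total}, actual {actual_total}, error {error_pct:.0f}%")
```

If the error stays within the tolerances quoted above, item counts alone may be good enough for an early forecast, deferring detailed estimation until it informs sprint selection.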
To Explore Further
We intend to provide a webinar on this topic in the future. If you would like to be invited, sign up below for “Webinars”.