A discussion of various benchmarking measures and the related metrics for systems development projects
Systems Development Benchmarking Exercise
Executive Summary:
This report defines ten benchmarking measures and related metrics for systems development projects, and establishes a model for the acquisition and comparison of the relevant data. There are two purposes for undertaking this exercise: performance improvement and the identification of outsourcing opportunities. The ten benchmarks have been identified across the Design, Build, Test and Deploy stages of the software lifecycle, with two measures applying to the project as a whole.
Each measure – with its related strengths, weaknesses and acquisition process – is discussed individually. The ten measures and metrics are:
Serial No. | Measure | Associated Metric | Phase | Type | Metric Type
1 | Strategic Alignment | Alignment Index | Design | Subjective | Performance
2 | Productivity | Average lines of code produced per hour per project | Build | Empirical | Performance
3 | Accuracy | Bugs per KLOC | Testing | Empirical | Performance
4 | Defect Removal Efficiency | Bugs removed before deployment / total bugs found before and after deployment | Testing | Empirical | Performance
5 | User friendliness | Training time per new user | Deployment | Empirical | Time
6 | Resolution time | Average time in hours to resolve highest-priority problems | Deployment | Empirical | Time
7 | Project Viability | Return on Investment | Design/Project | Empirical | Cost
8 | Average labour cost | Average cost per FTE per project | Project | Empirical | Cost
9 | Cost of defects | Total cumulative cost of defects since operational commencement | Deployment | Partly Subjective | Cost
10 | Percentage re-work time per stage | (Re-work time per cycle / Total time) * 100 | Project | Partly Subjective | Time
Introduction:
This exercise concentrates on taking a high-level view of software projects. Some of the measures, such as resolution time, strategic alignment, project viability and average labour cost, are not software-specific and may be applied to any managed project. The reason for maintaining a balance between software-specific and generic measures is to keep within the limit of ten measures; benchmarking software development specifically and effectively would require far more than ten measures.
There has also been an attempt to divide the measures equally between cost, time and performance measurements.
The key criterion considered in developing the measures was simplicity. Many benchmarking exercises are known to fail or are faced with complex challenges (Dervitsiotis, 2000) as a result of complex data and the absence of a standard methodology (Dattakumar and Jagadeesh, 2003).
Each of the measures has a set of associated challenges that have been discussed. Some of the measures, such as the productivity measure (in terms of lines of code), are not commonly used in contemporary benchmarking practices but have been retained for the sake of simplicity.
An assumption made while undertaking the exercise is that the project does not end at deployment; instead, the maintenance and operation of the software constitute the deployment phase.
The format maintained below is ‘measure (metric)’ for each of the measures.
Measures and Metrics:
Strategic Alignment (Alignment Index)
When designing the project, an important consideration is ensuring that it is aligned with the strategic direction of the organisation. Keeping the organisation's IT projects aligned with its strategy requires that the IT team managers who design those projects understand the organisational strategy. This effectively averts one of the ten pitfalls of benchmarking described by DeToro (1995): not positioning benchmarking within a larger strategy.
Acquisition/Comparison:
Strategic alignment is a soft measure and involves collecting data through carefully designed surveys targeted at the senior management members involved in the strategy formulation process. The results are used to build an ‘alignment index’ for each project, graded on a scale of 1 to 10, with 1 representing the highest level of strategic alignment.
The required data need be collected only once every three years for each project, or in the event of a shift in strategy resulting from an event such as a management reshuffle.
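As an illustration only, the sketch below shows one way survey responses might be rolled up into the alignment index; the scoring model (a simple average of question scores, rounded to the nearest whole number) is an assumption rather than a prescribed method.

```python
# Hypothetical sketch: aggregate survey responses into an alignment index.
# Assumes each respondent scores each question from 1 (fully aligned) to
# 10 (not aligned); the index is the average score rounded to an integer.

def alignment_index(responses):
    """responses: list of per-respondent lists of question scores (1-10)."""
    scores = [score for respondent in responses for score in respondent]
    return round(sum(scores) / len(scores))

# Example: three senior managers answering four questions each.
survey = [
    [2, 3, 1, 2],
    [4, 2, 3, 3],
    [1, 2, 2, 1],
]
print(alignment_index(survey))  # -> 2 (closer to 1 means better aligned)
```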
Strengths:
The data for this process is relatively easy to obtain, especially if internal benchmarking is used.
This exercise effectively involves the top management in the IT project. Lack of top management involvement and support is said to be one of the main reasons for IT project failure (Whittaker, 1999).
Weaknesses:
The construction of appropriate questionnaires to elicit the required information can be a very challenging task and must be done with care. Additionally, maintaining consistency of measurement over a period can be a challenge (Nicholson, 2006).
The alignment index fails to reflect specific areas of a given project that may be misaligned, if the other parameters are sufficiently well-aligned to negate this impact. Therefore, a breakdown-analysis must also be undertaken.
Benchmarking against other firms is also susceptible to subjective variability since different management members will be involved in each exercise.
Project Viability (Return on Investment)
The return on investment (ROI) of each individual project will be calculated in order to determine the viability of the project. An ROI below the industry standard would open the project up to further investigation and to the possibility of outsourcing certain areas. Although ROI may appear to be a metric that is subjective to each company, in practice it varies chiefly with the nature of the IT project.
Acquisition/Comparison:
The ROI will be calculated according to the standard formula ROI = (Payback – Investment)/Investment. Since distinct IT projects can have quite variable ROIs, this metric is not suitable for internal benchmarking. The ROI is to be benchmarked against similar projects.
The ROI will be projected prior to project commencement and subsequently calculated on a quarterly basis from the point of commencement (taking into account payback and investment fluctuations) of the project to track the flow of returns.
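A worked sketch of the quarterly ROI calculation described above is given below; the project figures are illustrative assumptions.

```python
# Illustrative sketch: quarterly ROI tracking using
# ROI = (payback - investment) / investment. All figures are hypothetical.

def roi(payback, investment):
    return (payback - investment) / investment

# Cumulative payback and investment (in pounds) at the end of each quarter.
quarters = [
    ("Q1", 20_000, 100_000),
    ("Q2", 55_000, 110_000),
    ("Q3", 95_000, 115_000),
    ("Q4", 150_000, 120_000),
]

for label, payback, investment in quarters:
    print(f"{label}: ROI = {roi(payback, investment):.2%}")
# Q4: ROI = 25.00% -- the project moves into positive return in the final quarter.
```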
Strengths:
The cost of the benchmarking exercise involving the calculation of the ROI is low. ROI is a simple management metric that requires only basic financial data for its calculation.
The ROI provides a very high-level, basic view of project viability.
Weaknesses:
The projected ROI depends on estimates of future payback that can be highly uncertain early in a project, and the figure does not capture intangible benefits or costs.
Productivity (average lines of code produced per hour per project)
One way to track productivity over the build stage is to measure the physical lines of code produced each hour per project. However, there are caveats to this, as described in the acquisition/comparison section below.
Acquisition/Comparison:
The data is to be collected manually, and the lines of code can be counted only after some code normalisation has been performed. Chunks of generated code (developed using code generators), redundant code (semantically redundant code that performs the same function as a different piece of code) and re-used code are to be discounted when calculating the line count, for better benchmarking efficiency (Pizka & Mas y Parareda, 2007).
This measure is to be used to compare only projects coded in the same language; otherwise, major disparities may arise.
The measurement is to be made on a daily basis and would require individual reporting by each employee, to be verified by the respective team leader.
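The sketch below illustrates one possible way of counting effective lines before computing the hourly rate; the exclusion rules (blank lines and comment-only lines are ignored, with generated, redundant and re-used code assumed to have been stripped out beforehand) are assumptions for the purpose of the example.

```python
# Hypothetical sketch of the productivity metric (physical lines of code per hour).
# Assumes generated, redundant and re-used code has already been removed, and
# that blank lines and comment-only lines are not counted.

def count_effective_lines(source: str) -> int:
    lines = (line.strip() for line in source.splitlines())
    return sum(1 for line in lines if line and not line.startswith("#"))

def loc_per_hour(source: str, hours_worked: float) -> float:
    return count_effective_lines(source) / hours_worked

sample = """
# utility module
def add(a, b):
    return a + b

def sub(a, b):
    return a - b
"""
print(loc_per_hour(sample, hours_worked=0.5))  # 4 effective lines / 0.5 h = 8.0
```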
Strengths:
Measuring this metric is relatively cost-effective and the data is readily available. Owing to the simplicity of data, the exercise is immune to any risks that could arise from data that requires context in order to be interpreted.
Weaknesses:
The measure does not reflect the quality of the code produced, which can be a significant factor if the quality is extremely low (a high error rate) but the volume is high.
It also fails to differentiate critical sections of code, which can take far longer to produce than large volumes of routine code.
Coding styles vary in verbosity, so the same functionality can be implemented in widely differing numbers of lines (Pizka & Mas y Parareda, 2007).
Measuring productivity solely in terms of the coding rate fails to account for soft factors.
Accuracy (bugs per KLOC)
Tracking the number of bugs occurring for every KLOC (thousand lines of code) written is a basic measure of the quality of the work being produced. Measuring accuracy also puts the productivity measure into context.
Acquisition/Comparison:
The acquisition of data for this metric is a very simple process. The employees involved in testing would be required to keep track of the number of bugs detected, which is already standard practice in many organisations. The data is to be acquired on a daily basis.
It would be required that bugs be classified according to predetermined levels of criticality.
The data obtained would be benchmarked against selected projects coded in the same language in other organisations. It is important to maintain the language specificity because different languages have highly variable line counts.
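A minimal sketch of the calculation, reported per criticality class as described above, is shown below; the class names and counts are illustrative assumptions.

```python
# Hypothetical sketch: bugs per KLOC, reported per criticality class.
# The criticality classes and all counts below are illustrative.

def bugs_per_kloc(bug_count: int, lines_of_code: int) -> float:
    return bug_count / (lines_of_code / 1000)

total_loc = 42_000
bugs_by_class = {"critical": 3, "major": 17, "minor": 64}

for criticality, count in bugs_by_class.items():
    print(f"{criticality}: {bugs_per_kloc(count, total_loc):.2f} bugs/KLOC")
# e.g. major: 17 bugs over 42 KLOC ~ 0.40 bugs/KLOC
```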
Strengths:
In the case of projects with relatively low coding complexity, tracking the bugs per KLOC is a very simple metric and it is insulated from any of the risks that complex data brings with it.
Measuring bugs per KLOC and correlating the results with the lines-of-code-per-hour metric removes the incentive for programmers to produce large volumes of code simply in order to appear productive.
Weaknesses:
It can be difficult (expensive) to track the chunks of code to the individual programmer responsible for building it, and so this measure only provides the context for the overall productivity of individual projects.
Several applications use multiple programming languages and so counting lines can be very difficult (Jones, 2008).
Also, as there is no industry-accepted standard for counting lines of code, the benchmarking exercise can be complex (Westfall, n.d.).
Defect Removal Efficiency (DRE) = A / (A + B)
Where,
A – number of bugs found (and fixed) before deployment
B – number of defects found (or existing) after deployment
The defect-removal efficiency (DRE) is in effect a measure of the quality (efficiency) of the testing process. The ideal value of DRE is 1 – when all bugs have been identified and removed prior to deployment.
The DRE is important because the cost per defect tends to be neglected. Every overlooked/unfixed bug affects the overall project quality and introduces additional costs.
Acquisition/Comparison:
The acquisition process involves pre-classification, according to criticality, of the common type of defects. Any bugs found before and after the testing phase must be categorised accordingly so that the DRE may be calculated separately for each class.
The DRE does not require the regular collection of data; however, the figure must be updated every time a new defect is found during operation.
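A short sketch of the DRE calculation, computed separately for each pre-classified defect class as described above, follows; the classes and counts are illustrative assumptions.

```python
# Hypothetical sketch: DRE = A / (A + B) per criticality class, where
# A = bugs found before deployment and B = defects found after deployment.

def dre(found_before: int, found_after: int) -> float:
    return found_before / (found_before + found_after)

defects = {
    # class: (found before deployment, found after deployment)
    "critical": (12, 1),
    "major": (85, 9),
    "minor": (240, 40),
}

for criticality, (before, after) in defects.items():
    print(f"{criticality}: DRE = {dre(before, after):.2f}")
# critical: 12 / 13 ~ 0.92, where the ideal value would be 1.0
```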
Strengths:
If used efficiently, the metric can help in choosing the series of defect-removal operations that maximises efficiency while minimising costs and schedule time (Jones, n.d.).
When the average cost per defect is high (with a correspondingly high number of defects), the DRE measure assumes greater importance in the project.
Weaknesses:
DRE is a measure that can be fairly determined only after the project is complete, so any resulting improvements can only be applied subsequently to similar projects, without the guarantee that the DRE will be similar.
Resolution time (Average time in hours to resolve the most important problems)
The measure captures system maintenance efficiency in terms of mission-critical problem-resolution time. Measuring the resolution time is useful for two reasons: maintenance is an activity that can easily be outsourced if performance is found to be below par, and a long resolution time can introduce costs that are difficult to trace.
Acquisition/Comparison:
This exercise would require the pre-classification of commonly occurring post-deployment problems into levels of importance.
The time for resolution would be calculated from the point the problem is formally discovered to the point it is formally resolved. The metric is a ratio of the total time for resolution to the total number of problems (of highest importance).
The measure must be tracked on a weekly basis from the point of system deployment.
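The sketch below shows how the figure might be computed from a simple log of highest-priority incidents; the timestamps are illustrative assumptions.

```python
# Hypothetical sketch: average resolution time (hours) for highest-priority
# problems, measured from formal discovery to formal resolution.
from datetime import datetime

incidents = [
    # (formally discovered, formally resolved) -- illustrative timestamps
    (datetime(2011, 4, 4, 9, 0), datetime(2011, 4, 4, 13, 30)),
    (datetime(2011, 4, 6, 14, 0), datetime(2011, 4, 7, 10, 30)),
]

total_hours = sum(
    (resolved - discovered).total_seconds() / 3600
    for discovered, resolved in incidents
)
print(f"Average resolution time: {total_hours / len(incidents):.1f} hours")
# (4.5 h + 20.5 h) / 2 = 12.5 hours
```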
Strengths:
This measure gives a good indication of the quality of the maintenance activities and also of the nature of the errors occurring. A slow problem-resolution time can hamper the project’s success in terms of cost.
Weaknesses:
This measure ignores problems of lower levels of priority, which exposes the exercise to the risk of inaccuracy in cases where there is a large imbalance between the numbers of low-priority and high-priority problems. However, the raw data needed to measure resolution times for other priority levels is also available.
Measuring the resolution time from the point at which a problem is formally reported to its resolution can introduce inaccuracies as it can differ from the actual time that the problem occurred. What is being measured is the responsiveness, not the quality of fault monitoring.
User friendliness (Training time per new user)
The user friendliness of the system is an important determinant of project success. A project performing well on the time and budget scales before deployment may still suffer from high user-training time. User-training time may be high when the system, although well designed, is not user-friendly. Such a system can carry several additional costs arising from poor employee engagement with the software even after training.
Acquisition/Comparison:
This exercise requires the most basic data – the training time for each new system user must be recorded.
Users are to be classified as fresh employees or experienced employees. Experienced employees are those who have previous experience in working with software.
User friendliness is calculated as the ratio of total user training time to total number of users for each category of users.
The data acquisition and measurement is a one-off process for each project.
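A minimal sketch of the per-category calculation is shown below; the category labels match the classification above and the training times are illustrative assumptions.

```python
# Hypothetical sketch: average training time per new user, split into the
# two user categories described above. Training times (hours) are illustrative.

training_hours = {
    "fresh": [12.0, 10.5, 14.0],
    "experienced": [4.0, 6.0, 3.5, 4.5],
}

for category, hours in training_hours.items():
    print(f"{category}: {sum(hours) / len(hours):.1f} hours per user")
# fresh: 36.5 / 3 ~ 12.2 hours; experienced: 18.0 / 4 = 4.5 hours
```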
Strengths:
It is a very simple metric which gives a fair indication of the user-friendliness of the software. Time and cost invested in user training often contribute a significant chunk to the total resources, which is why it is important to benchmark this metric.
Weaknesses:
The training programme itself may be poorly designed (for instance, it might be drawn out and be longer than required), in which case the measure does not actually measure the user friendliness of the system.
Cost of defects (Total cumulative cost of defects from operational commencement)
This metric seeks to capture all costs and losses arising as a result of software malfunction during operation. Many of the costs of business downtime are not obvious, which can result in an underestimation of the impact of software-related malfunction incidents. Measuring the total cost of defects presents the management with a solid figure that accounts for all losses.
Acquisition/Comparison:
This exercise requires setting up a team with access to all company data. For each incident, the cost will be calculated as the sum of lost sales, business interruption costs, lost goodwill (to include loss of customers), cost of work-arounds, litigation costs and other business losses.
This metric will be updated every time an incident occurs.
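A minimal sketch of the per-incident cost roll-up is shown below; the cost categories follow the list above and all figures are illustrative assumptions.

```python
# Hypothetical sketch: cumulative cost of defects since operational commencement.
# Each incident cost is the sum of the categories listed above; figures are illustrative.

incidents = [
    {"lost_sales": 12_000, "interruption": 3_500, "goodwill": 5_000,
     "work_arounds": 800, "litigation": 0, "other": 400},
    {"lost_sales": 2_000, "interruption": 1_200, "goodwill": 0,
     "work_arounds": 300, "litigation": 0, "other": 0},
]

incident_costs = [sum(incident.values()) for incident in incidents]
print(f"Cumulative cost of defects: £{sum(incident_costs):,}")
# £21,700 + £3,500 = £25,200
```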
Strengths:
Costs and losses arising from business downtime can often be quite high. Unless this figure is specifically calculated, losses such as lost goodwill and the cost of work-arounds tend to be neglected.
This measure is more robust than the cost per defect, which is a misleading value that tends to be low if there are a high number of defects.
Weaknesses:
The metric requires the quantification of certain intangible factors such as loss of goodwill, which is subject to subjective variation in the absence of a well-defined model for calculation.
Percentage re-work time per stage ((Re-work time per cycle / Total time) * 100)
The time spent on re-working in each stage is to be calculated as a percentage of the total project time. According to Twentyman (2005), one of the best ways to reduce software development costs is to reduce re-work.
Acquisition/Comparison:
Any activity being repeated (where repetition is not on the schedule of intended activities) is to be considered re-work.
Acquiring this data would require specialised reporting by employees whereby the number of hours spent on any re-work activity would form a separate column on the daily time-sheet. Since this measure depends wholly on subjective reporting, employees are to be made aware, with simple documents, of what constitutes re-work.
Since the acquisition of data is a daily activity, this metric may be updated on a daily basis.
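The sketch below shows the percentage calculation from daily time-sheet totals for a single stage; the hours are illustrative assumptions.

```python
# Hypothetical sketch: percentage re-work time for one stage, computed from
# daily time-sheet entries. All hours are illustrative.

timesheets = [
    # (total hours logged, hours logged as re-work)
    (8.0, 1.5),
    (7.5, 0.0),
    (8.0, 2.0),
]

total_hours = sum(total for total, _ in timesheets)
rework_hours = sum(rework for _, rework in timesheets)
print(f"Re-work: {rework_hours / total_hours * 100:.1f}% of stage time")
# 3.5 / 23.5 * 100 ~ 14.9%
```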
Strengths:
‘About 80 percent of avoidable re-work comes from 20 percent of the defects’ (Boehm & Basili, 2001). This means that reduction of re-work time (if required) can be achieved by focusing on a small number of defects.
Calculating the percentage time spent on re-work as opposed to the percentage cost spent on re-work is useful for two reasons: it is easier to track accurately and schedule overrun is said to be a more common cause of project failure (Whittaker, 1999).
Weaknesses:
Even if re-work is defined with substantial rigour, the residual ambiguity surrounding what constitutes re-work is open to abuse in reporting.
It is difficult to accurately track certain types of intangible re-work unrelated to core IT processes; an example is a team meeting that essentially goes over the same content as a previous meeting.
Average labour cost (average cost per FTE per project)
The average cost per full-time equivalent (FTE) employee is an efficient way of measuring the cost per employee on a project.
Acquisition/Comparison:
The average labour cost is measured, in pounds, as the ratio of total cost to total FTEs.
The total cost in this context is all expenses associated with an FTE. These include administrative costs, wages, training costs, food, transport, other overhead costs and any variable costs.
The data is to be used strictly for benchmarking against projects of similar scope and nature since the FTE costs vary quite significantly as a function of the nature of the project.
The metric will be tracked on a monthly basis for ease of calculation relating to the monthly financial cycle.
If the employee-base includes contractors, the cost/contracted FTE must be calculated separately.
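A minimal sketch of the monthly calculation, with contracted FTEs reported separately as described above, is given below; all figures are illustrative assumptions.

```python
# Hypothetical sketch: average monthly cost per FTE, with contracted FTEs
# reported separately. All figures (in pounds) are illustrative.

def cost_per_fte(total_cost: float, fte_count: float) -> float:
    return total_cost / fte_count

# Total cost = wages plus administrative, training, food, transport,
# other overhead and variable costs attributable to the project's FTEs.
permanent = {"total_cost": 96_000, "ftes": 18.5}   # part-timers count as fractional FTEs
contracted = {"total_cost": 30_000, "ftes": 4.0}

print(f"Permanent:  £{cost_per_fte(permanent['total_cost'], permanent['ftes']):,.0f} per FTE")
print(f"Contracted: £{cost_per_fte(contracted['total_cost'], contracted['ftes']):,.0f} per FTE")
# Permanent: 96,000 / 18.5 ~ £5,189 per FTE; Contracted: 30,000 / 4 = £7,500 per FTE
```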
Strengths:
As Schwalbe (2009) notes, employee overheads form a significant portion of the total project costs, so measuring the cost per FTE gives a strong indication of areas that may require improvement, or of opportunities to outsource the human resource and payroll functions. The exercise would require the aggregation of certain HR and accounting data.
Weaknesses:
The total FTE cost can be difficult (or costly) to calculate in cases involving employees working on multiple projects and in cases where there are multiple variable costs involved.
This benchmarking exercise requires cross-departmental coordination, which can introduce an added set of challenges.
Methodology
The methodology for comparison has partly been described for individual measures under the heading of acquisition/comparison. However, the overarching methodology is briefly dealt with in this section.
Strategic alignment is the one measure best suited to internal benchmarking: organisations may be reluctant to divulge strategic information, and the metric itself is completely subjective. It can therefore be used to compare the organisation's projects against each other and against successfully implemented projects (and thus arrive at a suitable minimum index that all projects would have to maintain), with external benchmarking as a less desirable alternative.
All the other measures are suitable for external benchmarking and, in fact, benefit from it, mainly because of their simplicity (and the related ease of data acquisition). In order to make comparisons, it would be advisable to be a member of an organisation such as the ISBSG, which would help in making comparisons against anonymised data. Alternatively, since the measures do not require data that is very sensitive in nature, the possibility of setting up a co-operative benchmarking group of friendly companies may be explored.
References:
Boehm, B., & Basili, V. (2001, January). Software Defect Reduction Top 10 List. Computer, 34(1), 135-137.
Dattakumar, R., & Jagadeesh, R. (2003). A review of literature on benchmarking. Benchmarking: An International Journal, 10(3), 176-209.
Dervitsiotis, K. (2000). Benchmarking and business paradigm shifts. Total Quality Management, Vol. 11.
DeToro, I. (1995, January). The 10 Pitfalls of Benchmarking. Quality Progress, Vol. 28, No. 1.
Jones, C. (2008, March 1). Measuring Defect Potentials and Defect Removal Efficiency. Retrieved from http://www.rbcs-us.com/images/documents/Measuring-Defect-Potentials-and-Defect-Removal-Efficiency.pdf
Jones, C. (n.d.). Software defect-removal efficiency. Retrieved April 10, 2011, from Software Productivity Research: http://www.spr.com/
Nicholson, L. (2006, December). Project Management Benchmarking for Measuring Capability. PM World Today, pp. 2-6.
Pizka, M., & Mas y Parareda, B. (2007). Measuring Productivity Using the Infamous Lines of Code Metric. Information Processing Society of Japan (IPSJ), Tokyo.
Schwalbe, K. (2009). Information Technology Project Management.
Twentyman, J. (2005, June 15). The crippling costs of IT project rework. Inside Knowledge, 2(1).
Westfall, L. (n.d.). 12 Steps to Useful Software Metrics. Retrieved April 9, 2011, from http://www.westfallteam.com/Papers/12_steps_paper.pdf
Whittaker, B. (1999). What went wrong? Unsuccessful information technology projects. Information Management & Computer Security, 23-30.