VBA Monte Carlo risk analysis spreadsheet with correlation using the Iman-Conover method. Part 5

Welcome to the fifth and final part of my series of blog posts about how to build a VBA application in Microsoft Excel that performs Monte Carlo simulation. It follows on from my post on 11 February, which assembled into a working prototype the parts developed in earlier posts. (If you haven’t read any of the other posts in this series, it would probably be best to read those first, starting with Post 1.)

Now that the code is complete, all that remains is to explain how to use it. Using it involves the following steps:

  1. Set up a spreadsheet model.

  2. Import the modules developed in the previous posts.

  3. Insert a new empty module.

  4. In the new module, write a script to perform the simulation by calling functions and methods from the other modules.

The best way to explain the details of this is by way of a practical example.

A practical example

The first thing to do is to choose a model to apply it to.

To be a good explanatory example, the model needs to be large enough to have at least a dozen uncertain input variables, but simple enough that its development doesn’t hog the limelight at the expense of the Monte Carlo application. This requirement doesn’t, of course, apply generally. My original plan was for an economic analysis of a wind farm, but it soon became clear that most of the text would be about how to analyse the economics of a wind farm rather than about how to apply Monte Carlo to a spreadsheet model. Maybe I’ll do that one on another occasion.

The model that I have decided to use is a costing of a construction project. This is just a simple inventory of materials and labour, where the quantity and cost of each item can be estimated as a probability distribution.

Setting up the model

Inputs

Even a small building project will need many hundreds of different materials and components. For this exercise we only want about a dozen, so it will have to be very small: say, a small brick building with a flat roof, no windows, no water and no electricity, such as a garden outhouse for storing tools. Let’s assume that we have carefully designed it and established that we need the materials and labour listed in Table 5.1 below. (These are not, of course, what you would actually need to build such a thing; I just made them up at random, so don’t try to build it!)

We are going to apply Monte Carlo to the unit prices, using triangular distributions for everything. To do this, every price must have a lower, middle and upper estimate. The middle estimates were obtained from suppliers’ web sites, and the upper and lower ones were just guessed.

We are going to assume that we have done our design carefully enough that the required quantities of materials are correct and that only their prices are subject to uncertainty. This does not apply to the labour items, however, as the amount of effort needed to accomplish a particular task could vary depending on a large number of factors. Hours of effort are assumed to be uniformly distributed between lower and upper bounds, as shown in the last five rows of Table 5.1.
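For reference, the script later in this post calls an inverse triangular distribution function, TriangularInv, that was developed earlier in the series. A minimal sketch of what these inverse CDFs compute, with a = lower, c = middle (the mode) and b = upper, is shown below. This is my reconstruction for illustration and may differ in detail from the series’ actual code; the name UniformInv is likewise only illustrative.

' Sketch of the inverse CDFs assumed in this example (illustrative only)
' Triangular distribution with lower bound a, mode c and upper bound b
Public Function TriangularInv(u As Double, a As Double, c As Double, b As Double) As Double
    If u <= (c - a) / (b - a) Then
        TriangularInv = a + Sqr(u * (b - a) * (c - a))
    Else
        TriangularInv = b - Sqr((1 - u) * (b - a) * (b - c))
    End If
End Function

' Uniform distribution on [a, b] (used here for the labour hours)
Public Function UniformInv(u As Double, a As Double, b As Double) As Double
    UniformInv = a + u * (b - a)
End Function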

Table 5.1: Bill of inputs
Materials (unit prices estimated as triangular distributions):

| No | Description | Unit | No of units | Unit price, lower | Unit price, middle | Unit price, upper |
|----|-------------|------|-------------|-------------------|--------------------|--------------------|
| 1 | Sand | 850kg bag | 5 | £35.00 | £39.99 | £47.00 |
| 2 | Aggregate | 850kg bag | 4 | £33.52 | £39.99 | £51.22 |
| 3 | Bricks | Pallet of 400 | 7 | £173.21 | £192.00 | £294.00 |
| 4 | Cement | 25kg bag | 42 | £3.52 | £4.19 | £5.71 |
| 5 | Mortar mix | 25kg bag | 45 | £4.00 | £6.78 | £7.79 |
| 6 | Plasterboard, sheet, 2400mm × 1200mm × 12.5mm | Pack of 5 | 3 | £22.78 | £28.45 | £37.03 |
| 7 | Plaster | 25kg bag | 4 | £4.75 | £5.73 | £8.92 |
| 8 | Mineral wool insulation, slab, 455mm × 1200mm × 100mm | Pack of 8 | 2 | £17.50 | £20.89 | £27.25 |
| 9 | Timber, untreated, 38mm × 63mm × 2400mm | Pack of 20 | 1 | £60.00 | £70.00 | £92.50 |
| 10 | Timber, treated kiln dried, 45mm × 95mm × 3600mm | Single | 5 | £8.50 | £9.47 | £13.65 |
| 11 | Timber, planed square edge, 18mm × 94mm × 2400mm | Pack of 7 | 2 | £28.75 | £31.99 | £39.25 |
| 12 | Oriented strand board (OSB), 18mm × 1220mm × 2440mm | Sheet | 4 | £15.00 | £17.49 | £25.00 |
| 13 | Aerated breeze blocks, 440mm × 215mm × 100mm | Pallet of 100 | 3 | £105.50 | £118.00 | £155.00 |
| 14 | Roofing felt, roll, 6m × 1m | Single | 5 | £28.50 | £48.29 | £61.99 |
| 15 | Nails | Tub, 2kg | 1 | £11.99 | £12.99 | £14.99 |

Labour (hours of effort estimated as uniform distributions, hourly rates as triangular distributions):

| No | Description | Unit | Hours, lower | Hours, upper | Rate, lower | Rate, middle | Rate, upper |
|----|-------------|------|--------------|--------------|-------------|--------------|-------------|
| 16 | Bricklayer | Hour | 32 | 40 | £9.91 | £10.91 | £13.41 |
| 17 | Plasterer | Hour | 4 | 8 | £11.32 | £12.32 | £14.82 |
| 18 | Carpenter | Hour | 8 | 10 | £10.29 | £11.29 | £13.79 |
| 19 | Roofer | Hour | 4 | 6 | £9.48 | £10.48 | £12.98 |
| 20 | General labourer | Hour | 32 | 40 | £6.82 | £7.82 | £10.32 |

To set this up as a spreadsheet model, first open a new workbook and save it, with a filename of your choice, as a ‘macro-enabled workbook’ with the file extension .xlsm. If you save it as an ordinary workbook (.xlsx), Excel will prevent any macros from running.

Next, paste Table 5.1 into a blank worksheet and rename the sheet to something meaningful. Figure 5.1 below shows a screenshot. Add two extra columns (I and J) to the right of the table to contain the actual quantities and costs, a third column (K) to contain the product of quantity and cost, and finally a grand total underneath (cell K25) containing a formula that sums column K.

Figure 5.1: Screen shot of spreadsheet model
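The formulas themselves are straightforward. For example, assuming the first item sits in row 3 (as the script below also assumes) and adjusting the ranges to match your own layout, the formulas could be:

K3:  =I3*J3          (copied down column K for every item row)
K25: =SUM(K3:K23)    (grand total)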

It may seem strange that we need columns I and J when the same numbers are also contained in columns C and F. Why can’t we just use those? The answer is that columns C to G are not actually part of the spreadsheet model, they are parameters of the distributions. The code that generates the random variables will read the contents of these cells and use them as arguments of the distribution functions. In fact, we could have written these numbers directly into the script and not had them in the worksheet at all. They are there to make it easier to change them and re-run the program to see what happens. If we were to make the code write results into these cells then the parameters of the distributions would change randomly every iteration. The results would be very peculiar and certainly not correct.

The numbers in columns I and J are not actually used in the calculations. The program stores their values in temporary variables, performs the Monte Carlo run and then puts them back again so it looks like nothing has happened. We could write any numbers in these cells or even leave them blank.
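To illustrate the idea, the save-and-restore pattern inside ‘RunModel’ amounts to something like the following hypothetical fragment (this is not the actual code from Post 4, and the value written in is just a placeholder draw):

' Hypothetical sketch of RunModel's save-and-restore of one input cell
Dim savedValue As Variant, modelResult As Double
With ThisWorkbook.Sheets("Materials and labour")
    savedValue = .Range("J3").Value      ' remember whatever was there before
    .Range("J3").Value = 41.27           ' write one Monte Carlo draw (placeholder number)
    modelResult = .Range("K25").Value    ' read the recalculated model output
    .Range("J3").Value = savedValue      ' put the original contents back
End With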

Correlation matrices

The program is structured so that the input and output variables of the model can be divided into subsets each with a different correlation structure—i.e. each subset either has its own correlation matrix or is uncorrelated. The subsets are grouped into collections, one for input variables and one for output variables. These collections can be passed as single entities to the graph-plotting function, enabling it to handle any permutation of different numbers and sizes of subsets without having to be modified by hand. This gives maximum flexibility in how the program can be applied. If you want, you can put all the input variables in a single subset and apply one big correlation matrix—this is just a particular choice of subset structure.

The bill of materials and labour in Table 5.1 can be straightforwardly divided into four correlated subsets and an independent subset.

Subset 1
Sand
Aggregate
Subset 2
Bricks
Cement
Plasterboard
Plaster
Mineral wool insulation
Subset 3
All wood-based items
Subset 4
Hourly labour rates
Subset 5 (uncorrelated)
Everything else

To construct correlation matrices, we need to know the correlation coefficients. In an ideal world we would get hold of a big set of historical data and calculate them. In these days of so-called ‘big data’ this is becoming easier for some kinds of data, but not in this case, so we will have to guess. I have assumed that any pair of correlated variables has a correlation coefficient of 0.5. I have no particular reason to choose this value.

Each matrix goes into a separate sheet, titled ‘Correlation matrix 1’, ‘Correlation matrix 2’ and so on.

Subset 1 contains materials that are dug from the Earth or dredged from the seabed. It seems reasonable to assume that the factors that affect costs in the aggregates industry will tend to drive the prices of all of the industry’s products together and so they will be correlated with each other. Its correlation matrix is:

Figure 5.2: Correlation matrix, subset 1, aggregate industry products
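In the worksheet, the matrix could be laid out with the variable names in row 1 and the coefficients immediately below, so that the coefficient block for this two-variable subset occupies cells A2:B3. (This mirrors the style the script assumes later, when it passes a range such as "A2:E6" for the five-variable labour subset.) Schematically:

Row 1:   Sand    Aggregate
Row 2:   1       0.5
Row 3:   0.5     1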

Subset 2 contains materials that are made in high-temperature processes, the economics of which are sensitive to the price of energy.

Figure 5.3: Correlation matrix, subset 2, high-temperature industry products

As with aggregates, wood products are produced in similar ways, using similar machinery to process similar starting materials, and so Subset 3 will be correlated.

Figure 5.4: Correlation matrix, subset 3, timber products.

Subset 4 contains all the labour rates. These tend to move together in response to upturns and downturns in general economic conditions.

Figure 5.5: Correlation matrix, subset 4, labour rates.

Importing the modules and creating a new blank module

The modules can be downloaded in this zip file. If you’ve bothered to read these articles you are probably familiar with VBA and know how to import modules and add new ones. If not, a description can be found here.

Writing the script

Now at last we’ve arrived at the actual implementation of the program. To do this we need to write a script that tells the program how we want it to perform its Monte Carlo simulation.

The script has to do the following things. This is quite a long list, and consequently the script will also be quite long. However, all it does is perform input and output, and call functions from other modules. It doesn’t perform any calculations itself. Its length is a consequence of the nature of the problem we are applying the method to, not of the method itself.

  1. Declare variables that will be needed in the script, principally
    1. integer counters for loops and so on,
    2. the uniform random array, U (see Post 2),
    3. variables needed to hold values temporarily during the calculations (not always necessary).
  2. Create a collection for input variables and a collection for output variables (although the lines of code that do this are the same for every application, I haven’t found a way of hiding them in another module).
  3. Set the number of iterations.
  4. For each subset of variables
    1. create a new subset object,
    2. add the object to the appropriate collection (input or output),
    3. pass various parameters to the object, such as
      • the name of the subset,
      • the number of variables in the subset and
      • the number of iterations.
    4. ‘ReDim’ the object’s internal arrays by calling its ‘Size’ method.
  5. For each input variable
    1. pass the variable’s cell reference into the appropriate subset object,
    2. pass a descriptive name into the appropriate subset object,
    3. generate a uniform random array U using the function ‘unifun(n)’ that was developed in Post 2, and
    4. calculate the inverse distribution function of each element of U and pass it into the appropriate subset object to be stored in the object’s ‘ordered array’ (see Post 4 for an explanation of how the program uses ‘ordered’ and ‘unordered’ arrays).
  6. For each subset of correlated input variables
    1. pass the cell references of the correlation matrix to the subset object,
    2. generate the ‘unordered sample’ as a correlated array inside the subset object.
  7. For the subset of independent input variables
    1. generate the ‘unordered sample’ as an independent sample array inside the subset object
  8. For each output variable
    1. pass the variable’s cell reference into the appropriate subset object and
    2. pass a descriptive name into the appropriate subset object
  9. Call the ‘RunModel’ function
  10. Call the ‘Graphs’ function on the input variables
  11. Call the ‘Graphs’ function on the output variables

Having established the structure of the script, we can now put flesh on the bones and translate it into actual code.

Before proceeding further, I would just like to mention that all code and algorithms presented in this and other articles in this series are offered without any form of warranty, not even for merchantability or fitness for any particular purpose. You are free to use them as you wish but it is your responsibility to test them and make sure they are fit for the use to which you intend to put them.

The first item in the list is the declaration of the variables that will be needed in the script: integer counters for loops and so on, the uniform random array U, and variables needed to hold values temporarily during the calculations.

The code looks like this:

' GENERAL DECLARATIONS

' Counters
Dim i As Long, j As Long, k As Long, l As Long, m As Long, n As Long

' Uniform array
Dim U() As Double

' Temporary storage variables
Dim left As Double, middle As Double, right As Double
Dim tmpName As String, tmpSheet As String, tmpRange As String

where left, middle and right hold the parameters of the triangular distribution when they are read in from the worksheet, before being passed to the inverse distribution function, and tmpName, tmpSheet and tmpRange temporarily hold strings specifying the location of the cells that contain each input variable and its descriptive name. These variables are needed because in this example the parameters of the distributions are held in worksheet cells, rather than written as literals in the script. Not every implementation will require temporary variables like these.

The lines of code that create the input and output collections are:

' Create collections to hold sets of variables
Dim InputVariables As Collection, OutputVariables As Collection
Set InputVariables = New Collection
Set OutputVariables = New Collection

To set the number of iterations, simply write

' Set number of "iterations"
n = 1000
ReDim U(1 To n)

and to create a new subset object, put

' INPUT VARIABLES, SUBSET 1: Aggregate products

' Create new subset object
Dim AggProds As ClsRandomVariableSubset
Set AggProds = New ClsRandomVariableSubset

In this case the subset object is called AggProds and contains the prices of aggregate industry products as described above. The other subsets are defined similarly.

Next, add the object to the collection:

' Add subset object to input variables collection
InputVariables.Add AggProds

and then pass parameters to the object and redim arrays:

' Pass parameters to subset object
With AggProds
    .SubsetName = "Sand and aggregate"
    .NumVars = 2
    .NumIters = n
' Redim arrays inside subset object
    .Size
End With

To set up the input variables, the following lines of code populate the ordered array of the first member variable of the AggProds subset object.

'First variable in subset: price of sand
k = 1
tmpSheet = "Materials and labour"
With ThisWorkbook.Sheets(tmpSheet)
    left = .Range("E3").Value
    middle = .Range("F3").Value
    right = .Range("G3").Value
End With
With AggProds
    .VariableName(k) = "Sand"
    .VariableSheet(k) = tmpSheet
    .VariableRange(k) = "J3"
    U = unifun(n)
    For i = 1 To n
        .OrderedSample(i, k) = TriangularInv(U(i), left, middle, right)
    Next i
End With

The first With block reads the values of the parameters of the triangular distribution from the cells in the worksheet where they are written. The first three lines of the second With block pass the variable’s cell reference and descriptive name. The statement U = unifun(n) creates a uniform random array, and the For loop calculates the inverse distribution function of the triangular distribution for each element of U and stores the result in the ordered array inside the AggProds object.

In some circumstances it is easier to write the arguments of the inverse distribution function directly into the function as literal numbers. If you decide to do it that way, the first With block can be omitted, and the code looks like this:

'First variable in subset: price of sand
k = 1
With AggProds
    .VariableName(k) = "Sand"
    .VariableSheet(k) = "Materials and labour"
    .VariableRange(k) = "J3"
    U = unifun(n)
    For i = 1 To n
        .OrderedSample(i, k) = TriangularInv(U(i), 35.00, 39.99, 47.00)
    Next i
End With

which is a bit shorter, but harder to update if you want to change some of the prices and also makes it harder to copy the block of code to create new variables in the script.

One such block of code is needed for every variable in the model that we are going to apply Monte Carlo to.

The above is an example of how to create input variables one at a time. In this spreadsheet model, however, some of the subsets have all their members in a contiguous series of rows in the worksheet. In such cases it is possible to create the entire subset by looping through the rows. This is much more economical in terms of lines of code. It is done like this:

m = 5  'Start-row of block
tmpSheet = "Materials and labour"
For k = 1 To 5
    With ThisWorkbook.Sheets(tmpSheet)
        tmpName = .Cells(k - 1 + m, 1).Value
        left = .Cells(k - 1 + m, 5).Value
        middle = .Cells(k - 1 + m, 6).Value
        right = .Cells(k - 1 + m, 7).Value
        tmpRange = .Cells(k - 1 + m, 10).Address
    End With
    U = unifun(n)
    With HighTempProds
        .VariableName(k) = tmpName
        .VariableSheet(k) = tmpSheet
        .VariableRange(k) = tmpRange
        For i = 1 To n
            .OrderedSample(i, k) = TriangularInv(U(i), left, middle, right)
        Next i
    End With
Next k

This code performs the same tasks as the previous listing, but does so for the variables described in the five rows of the worksheet starting at row 5. (Here the subset object HighTempProds has been created, added to the collection and sized in the same way as AggProds.)

Having created all the member variables of an input-variable subset object, the next thing to do is to create its ‘unordered array’. As Post 4 explained, every subset object has two arrays that contain the same data. One of them is ‘ordered’ and the other is ‘unordered’. The former is created by calling the inverse distribution function on the members of the uniform random array U, and so all its members are in increasing order.

If the subset is statistically independent, then the unordered array is created by taking a copy of the ordered array and randomly shuffling it. In the script this is accomplished by calling the ‘.GenerateIndependentSample’ method on the subset object. It looks like this:

' Generate unordered array
MiscProds.GenerateIndependentSample
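For the curious, the shuffle inside GenerateIndependentSample can be pictured as a standard Fisher–Yates pass over each column of the array. The following is only an illustrative sketch of one such pass, not the code developed in Post 4:

' Illustrative Fisher-Yates shuffle of a one-dimensional array arr(1 To n)
Dim i As Long, j As Long, tmp As Double
Randomize                       ' seed VBA's random number generator
For i = n To 2 Step -1
    j = Int(Rnd() * i) + 1      ' random index between 1 and i
    tmp = arr(i)                ' swap elements i and j
    arr(i) = arr(j)
    arr(j) = tmp
Next i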

If the subset is correlated then we also need to import the correlation matrix into the subset object before generating the unordered sample. The code is therefore a bit longer. The cell address of the correlation matrix is passed as strings to the .CorrelationMatrixSheet and .CorrelationMatrixRange properties of the subset object, and the unordered (correlated) array is generated by calling the .GenerateCorrelatedSample method.

With LabourRates
' Pass location of correlation matrix to subset object
    .CorrelationMatrixSheet = "Correlation matrix 4"
    .CorrelationMatrixRange = "A2:E6"
' Create correlated sample array inside subset object
    .GenerateCorrelatedSample
End With
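As a reminder of what happens inside GenerateCorrelatedSample, the Iman-Conover re-ordering developed earlier in the series proceeds broadly as follows (a descriptive outline only, not the actual code):

' Outline of the Iman-Conover re-ordering:
' 1. Build an n-row "scores" matrix, one column per variable, whose
'    columns are (approximately) uncorrelated.
' 2. Compute the Cholesky factor of the target correlation matrix.
' 3. Multiply the scores by the transposed Cholesky factor, giving
'    columns with (approximately) the target rank correlations.
' 4. Rank the values in each column of the result.
' 5. Re-order each column of the ordered sample so that its ranks match
'    those from step 4; the result is the unordered, correlated sample.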

Next, we need to look at the output variables of the spreadsheet model. All we need to do with these is to pass their cell references to appropriate fields within the subset object.

' OUTPUT VARIABLES

Dim OutVar As ClsRandomVariableSubset
Set OutVar = New ClsRandomVariableSubset
OutputVariables.Add OutVar

OutVar.NumVars = 1
OutVar.NumIters = n
OutVar.Size

OutVar.VariableName(1) = "Total cost"
OutVar.VariableSheet(1) = "Materials and labour"
OutVar.VariableRange(1) = "K25"

Now we have set up all the various arrays needed to run the Monte Carlo simulation. To actually run it, we just call the functions ‘RunModel’ and ‘Graphs’, the latter twice—once for the input variables and once for the output variables.

Call RunModel(InputVariables, OutputVariables)
Call Graphs(InputVariables, NumBins:=20, NumPoints:=100, SheetTitle:="Input variables")
Call Graphs(OutputVariables, NumBins:=20, NumPoints:=100, SheetTitle:="Output variables")

And that’s all there is to it!

A full printout of the entire script module can be found here. This also includes a few lines of code to time how long it takes to run the program and display the answer in a message box.
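If you want to time a run yourself, VBA’s Timer function is one simple way to do it. Something like the following sketch (my assumption as to how it could be done, not necessarily the exact code in the printout) will display the elapsed time:

Dim t0 As Double
t0 = Timer      ' seconds elapsed since midnight
Call RunModel(InputVariables, OutputVariables)
Call Graphs(InputVariables, NumBins:=20, NumPoints:=100, SheetTitle:="Input variables")
Call Graphs(OutputVariables, NumBins:=20, NumPoints:=100, SheetTitle:="Output variables")
MsgBox "Run time: " & Format(Timer - t0, "0.0") & " seconds"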

The output looks like this:

Figure 5.6: Output of the program

A copy of the workbook with all the modules in it can be downloaded here.

With 1000 iterations, this took 50s to execute on my laptop (a Toshiba Satellite T210 with a 1.2GHz dual-core processor and 2.8GB of RAM, running Windows 7). However, on another laptop in our house, which has a 2GHz dual-core processor and 4GB of RAM, exactly the same model took 11.4s to run. The different processor and memory don’t seem to me to be enough to explain such a large difference.

Anyway, that’s the end of this series of posts. If you have got this far, I hope you have found it useful. I will now finish off with a cautionary tale about what can go wrong when using this kind of analysis.

A cautionary tale

An interesting real-life example can be found in the technical feasibility study for the development of the Cornish Wave Hub. If you don’t know what that is, see http://www.wavehub.co.uk/ and https://en.wikipedia.org/wiki/Wave_Hub. Page 69 of this report contains the graph shown in Figure 5.7 below, which is a cumulative probability distribution of the cost of building the facility.

Figure 5.7: Projected cost of the Cornish Wave Hub

This shows that

  • the expected cost is £12.42M,
  • the probability of the cost exceeding £13.45M is 5% and
  • the probability of the cost exceeding £15M is vanishingly small, possibly zero (it’s difficult to tell from the graph).

Because the project was EU funded, data are available on what it actually cost when it was built. The following report, http://ec.europa.eu/regional_policy/index.cfm/en/projects/best-practices/united-kingdom/2458, says that it cost €40.504M (equivalent to £33.2M at the exchange rate quoted in the report). This is more than two and a half times what Figure 5.7 says was the expected value, and more than twice the value whose probability of exceedance it says is vanishingly small, if not zero.

There is not enough information in either of these reports, or anywhere else, to enable us to draw conclusions about the causes of this discrepancy. There are several possible factors that could lead to similar results, though, all of which are related to inadequate input data. They include:

  • accidentally leaving some items out (this is very easy to do while its opposite, accidentally putting something in that shouldn’t be there, is almost impossible);
  • using data outside their range of validity (e.g. using construction cost data that refer to onshore installations in a costing for an offshore installation);
  • assuming that all items cost within +/- x% of their central value, when in fact the confidence intervals of different items are likely to differ from each other, and some of them could be very large. (If you assume that all the input variables are +/- x%, where x is always the same, then the output will automatically be +/- x% and there would be no need to do a Monte Carlo simulation!);
  • costing a design that is different from what actually gets built;
  • and so on …

As was mentioned in Post 1, the presentation of an entire probability distribution makes the results seem much more authoritative than they really are, potentially leading to unwarranted confidence in their accuracy. The use of triangular distributions can make this effect much worse, because the probability density of a triangular random variable is zero outside the extent of the triangle. This means that a costing made up of triangularly distributed input costs will have an upper limit beyond which its ‘probability’ is zero. If you present such a result to a client you are telling them that a higher cost than this is impossible! Although it is difficult to tell by looking at it, the plot in Figure 5.7 may possibly display this effect. In such cases, using distributions with long asymptotic tails that extend to infinity, of which there are many, could provide a useful fig leaf!
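For example, one could swap the inverse triangular function for a lognormal one, pinning the middle estimate to the median and treating the upper estimate as, say, the 95th percentile. The sketch below shows one possible parameterisation (my own assumption, using Excel’s built-in LogInv worksheet function; the function name LogNormalInv is hypothetical):

' Sketch: a long-tailed alternative to TriangularInv
' middle is taken as the median, upper as the 95th percentile
Public Function LogNormalInv(u As Double, middle As Double, upper As Double) As Double
    Dim mu As Double, sigma As Double
    mu = Log(middle)                            ' VBA's Log is the natural logarithm
    sigma = (Log(upper) - Log(middle)) / 1.645  ' 1.645 = standard normal 95th percentile
    LogNormalInv = Application.WorksheetFunction.LogInv(u, mu, sigma)
End Function

With a choice like this the simulated cost can, with small probability, exceed any given value, so the cumulative curve no longer has a hard upper cut-off beyond which higher costs are presented as impossible.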

That’s it for today. Stay tuned for future posts on a wide variety of interesting topics!

© Copyright 2015 Howard J. Rudd all rights reserved.

2 thoughts on “VBA Monte Carlo risk analysis spreadsheet with correlation using the Iman-Conover method. Part 5”

  • 9 October 2015 at 05:04

    Dear Howard Rudd
    First of all, sorry for my poor English.
    Could you send me your xls file of the VBA Monte Carlo risk analysis spreadsheet with correlation using the Iman-Conover method? I need this solution very much, but I don’t speak English well enough to understand your explanation completely.
    My e-mail is mgrifo@ig.com.br (from Brazil)
    Thanks so much for your consideration

    • 10 October 2015 at 08:02

      Copies of the excel file can be downloaded from the following hyperlinks in the blog. These are:

      https://www.howardrudd.net/files/MC_modules_v2(40).xlsm

      which contains a placeholder model and all five modules, and

      https://www.howardrudd.net/files/Example(02).xlsm

      which contains the “building project” model discussed in the last article in the series. You might have missed these because the WordPress theme I use highlights links in a colour that differs only slightly from that of the normal text. I must find a way of changing that.

      These are all “.xlsm” files, written for Excel 2007. I have not tried to run them on Excel 2003 or earlier versions, which use the “.xls” extension, as I don’t have a copy of those. If you need to run them on a version of Excel earlier than 2007 they may work, but I can’t guarantee it.
