As this is my first blog post, I thought I’d try something slightly more ambitious than just spouting opinions. Unfortunately it’s turned into something of a treatise. If I’d known how big it was going to be I probably wouldn’t have started but, as Magnus Magnusson would have said, “I’ve started so I’ll finish”! I will feel it has been worthwhile if someone somewhere makes use of it.
Because there is quite a lot of material, this is going to be a series of five posts. It will describe how to build a module, or rather a group of interconnected modules, in Microsoft Excel that will perform risk analysis using ‘Monte Carlo’ simulation.
As with most things in life, the basic idea is simple but the practice is considerably more complicated. We begin with a mathematical model, implemented in a spreadsheet, that calculates one or more characteristics of some activity or endeavour with an uncertain outcome—such as the cost of manufacturing a new product or the rate of return on a project.
The model can be thought of as a ‘black box’ into which we feed numbers and out of which come other numbers. It is ‘deterministic’ in the sense that the same inputs always give exactly the same outputs. If all the inputs are certain then so are the outputs. If, on the other hand, the input variables are uncertain then the output variables will also be uncertain. Figure 1 below shows the idea schematically.
The problem is to find the probability distributions of the output variables given the probability distributions of the input variables.
Before the computer age this could only be done using complicated formulae, and then only for simple models such as the sum or product of the input variables. (Remember ‘propagation of errors’ in school physics?) But now we can use brute force, feeding thousands or even tens of thousands of values into models of arbitrary complexity, producing entire frequency distributions of the output variables that can be analysed statistically and plotted as histograms and other sorts of graphs.
The object of the exercise is not the model but the activity of feeding into it large quantities of randomly generated numbers and analysing and plotting the distributions of the outputs. The code that does this can be applied over and over again to all sorts of different spreadsheet models, the precise nature of which is irrelevant.
This type of analysis is useful in situations where the thing being modelled is in the future and the things it depends on are not known with certainty. If, for example, we are planning to build a wind farm we may have a reasonable idea how much the turbines will cost, how much electricity the farm will generate, how much we will get paid for each unit of electricity and how much the wind farm will cost to maintain. But, until the facility is built and operating, we don’t know any of these things for certain. When we come to build it we might find that the turbines are more expensive than we had anticipated, the wind speed and electricity price are lower and so on. If we took the worst-case values for all of these inputs we might find that the project would make a loss. Does that mean we shouldn’t do it? Using Monte Carlo risk analysis we can produce output like this:
Anyone looking at Figure 2 would probably say it’s reasonable to go ahead, even though there’s a small probability that it might lose money. (In case anyone out there has done a costing of a wind farm with Monte Carlo, I should point out that I made this example up to illustrate this point.)
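To make the idea concrete, here is a toy version of such a calculation, sketched in Python rather than VBA purely for compactness. As with Figure 2, every distribution and figure below is invented for illustration:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 100_000  # number of Monte Carlo iterations

# Invented input distributions (all figures are made up, as in the text)
capex  = rng.normal(10.0, 1.0, n)    # capital cost, £m
energy = rng.normal(35.0, 6.0, n)    # output, GWh/year
price  = rng.normal(0.05, 0.01, n)   # price, £m per GWh (i.e. £50/MWh)
opex   = rng.normal(0.4, 0.1, n)     # maintenance cost, £m/year

# Crude, undiscounted 15-year net position for each trial
npv = -capex + 15 * (energy * price - opex)

# The full output distribution is now available for histograms and statistics
p_loss = (npv < 0).mean()
print(f"probability of a loss: {p_loss:.1%}")
```

With these made-up inputs the bulk of the distribution sits comfortably above zero, with a small left tail below it, which is exactly the shape of result that makes a go/no-go decision more informed than a single worst-case number.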
There are, however, dangers with this approach. Probability distributions may be an improvement over single-valued predictions, but they are still predictions. While it is sometimes possible to derive them from historical data, usually it isn’t and we have to use ‘expert judgement’, i.e. guesswork. If the input data are wrong then the answer will be wrong. Furthermore, the presentation of an entire probability distribution makes the results look much more authoritative than they really are, potentially leading to unwarranted confidence in their accuracy. Commercial vendors’ promotional literature often gives the impression that this kind of analysis allows you to see into the future, but this is not true, of course. All it does is present what you already know in a way that is more insightful and useful for making decisions. In Post 5 I describe a real-life cautionary tale that illustrates this point.
Add-in packages are available from several vendors that will enable you to do this without having to write a single line of code, but rolling your own has advantages including:
You can give a working model to your clients and they can use it as much as they like on as many machines as they like without having to pay upwards of £800 per user.
You have more control over what goes on ‘under the bonnet’.
You don’t need to include features that won’t be used in your model, thereby making it smaller and lighter.
There are, however, a few disadvantages. These include:
It’s fiddlier to use.
Knowledge of the underlying mathematics is needed. (Actually, you could take this on trust, as you would if you used a commercial package.)
Writing your own code means you can also write your own bugs.
One way of speeding up the module would be to use functions from an external library for some of the operations that work on large arrays, such as random number generation, matrix multiplication, sorting, ranking and reordering. Using an established library that has been around a long time would also give added confidence that it produces the right answers. Some of these, such as the NAG library, are ready to use out of the box but cost money. Others, such as the GNU Scientific Library, are free but would need to be compiled into a ‘dynamic link library’, or DLL, that works with Excel. This is a complicated and fiddly process and requires additional tools. Furthermore, part of the reason for doing this is that it is a learning exercise. For these reasons I have chosen to write all the code in VBA. You are free to do it any way you like, however. Generally, the module will execute in a reasonable length of time (less than a minute) with ten input variables and two output variables.

Statistical dependence
If there is only one uncertain input variable, or if there are several but they are all statistically independent of each other, then the problem is trivially simple. However, in real life, random variables are often correlated—that is, the probability of one depends on the value of one or more of the others. This could occur, for example, in the costing of a product where many of the components are made from the same material, say steel. A proportion of the variability of the price of each component will be due to the variability of the price of steel, and the prices of these components will tend to move together to a greater or lesser extent. In the wind farm example above, a mild winter might mean that the price of electricity falls due to lower demand and the output of the wind farm falls due to lower wind speeds.
If such phenomena are not taken into account, the resulting output distribution will be incorrect. We therefore need to be able to generate sets of random numbers that are correlated with each other in a way that corresponds to the correlation structure of the real-world variables we are trying to simulate. This is where it starts to get difficult. In fact, this aspect of the problem accounts for the majority of the code!
This series of five posts will cover the following areas:
Post 1 (this post) sets out the functional requirements of the module, its overall structure and how its various parts interact with each other. It explains how it works and sets the scene for subsequent posts where the individual parts are developed.
Post 2 goes into the generation of non-uniform random numbers from different distributions and explains how to write functions in VBA that will output them. At the end of this post we will have the required functions to generate independent random arrays according to any specified distribution.
Post 3 describes the theory and practice of how to generate correlated random variables. This is the most difficult part of the whole exercise. At the end of this post we will have functions that will induce correlation onto our random arrays.
Post 4 takes the mathematics described in the previous two posts and implements it. This involves inputting the random variables into the spreadsheet model, reading the output variables into an array and producing charts and statistics from them. At the end of this post we will have a complete working tool.
Post 5 explains how to use the resulting spreadsheet tool in various typical situations, with examples. This post functions as a concise user manual.
The functional requirements of the module are:
All information required to be entered by the user, such as the cell references of the input and output variables, the probability distributions to be applied to them and the cell references of the correlation matrix, should:
be located in one place;
be presented in a way that is easy to read and that makes it obvious where values, cell references and distributions should be written;
make it easy to change these values, references etc.;
make it easy to add new variables;
minimise the risk of accidentally changing other bits of code while entering user data.
The module should perform the following sequence of tasks:
Set up an array that will hold the input sample;
Set up an array that will hold the output sample;
Populate the input array with suitably chosen numbers that conform to the desired set of probability distributions;
Induce the desired correlation structure onto the contents of this array;
For i = 1 to n
Input the ith row of the input array into the input cells of the spreadsheet model.
Read the resulting values from the output cells of the spreadsheet model into the ith row of the output array.
Calculate statistics and plot histograms from the two arrays.
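The sequence above, with a trivial stand-in for the spreadsheet model, can be sketched like this (in Python for compactness here; the series implements everything in VBA, and the model and distributions below are hypothetical):

```python
import numpy as np

rng = np.random.default_rng(42)
n = 10_000  # sample size

def model(unit_cost, volume):
    # Stand-in for the 'black box' spreadsheet model (hypothetical)
    return unit_cost * volume

# Steps 1-2: arrays to hold the input and output samples
inputs = np.column_stack([
    rng.normal(50.0, 5.0, n),       # step 3: populate each column with draws
    rng.uniform(900.0, 1100.0, n),  #         from its chosen distribution
])
outputs = np.empty(n)

# (Step 4, inducing the correlation structure, is omitted here; it is the
# subject of Post 3.)

# Step 5: feed each row of the input array into the model and record the result
for i in range(n):
    outputs[i] = model(inputs[i, 0], inputs[i, 1])

# Step 6: statistics of the resulting output distribution
mean, sd = outputs.mean(), outputs.std()
```

In the real module, the loop body writes the row into the model’s input cells, lets Excel recalculate, and reads the output cells back, but the overall shape is the same.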
The module will also have to do a number of more boring things such as inserting sheets into the workbook to contain the results, writing the results and explanatory text into those sheets and plotting graphs.
Design choices

Worksheet interaction
There are two ways in which we could arrange for our randomly generated numbers to be put into the cells of the spreadsheet model. We could:
write them as worksheet functions that will output a different random number every time the worksheet is recalculated (e.g. when F9 is pressed);
write them as VBA functions that are specified within a module.
Most commercial add-in packages use the former approach, but that would be too difficult to implement in a VBA module unless all the random variables were independent. I have therefore chosen to use the second approach. This has other advantages too: for example, the process is more transparent and easier to understand.

Random number generation
Step 4 requires the generation of random numbers from user-chosen distributions. Excel has only one built-in random number generator (which, annoyingly, is called Rand() if used in a worksheet formula and Rnd() if used in VBA code). This returns numbers that are uniformly distributed between 0 and 1. We need to be able to convert numbers generated by the Rnd() function into ones distributed according to an arbitrarily chosen distribution. There are many ways of doing this. In this exercise we will use the inverse transform method, which will be the subject of tomorrow’s post.

Sampling strategy
There are two alternative sampling strategies that are commonly used in Monte Carlo simulation. These are ‘true’ Monte Carlo, where the numbers are randomly chosen from the entire range of the distribution, and so-called ‘Latin hypercube’ sampling, which is a fancy way of saying that the range of the variable is divided into bins of equal probability and a value is chosen randomly from each bin. The latter approach has the following advantages:
It makes the set of generated numbers converge to the desired distribution more rapidly, meaning that fewer instances are needed to get a good fit to the distribution.
The set of numbers can be produced directly in increasing order, which is a requirement of our chosen method of inducing correlation (see below), and so removes the need to apply a sorting algorithm to the input array.
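A minimal one-dimensional sketch of Latin hypercube sampling combined with the inverse transform (in Python for brevity; an exponential distribution is used because its inverse CDF has a simple closed form):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1_000

# Latin hypercube sampling in one dimension: divide [0, 1) into n bins of
# equal probability and draw one uniform value from each bin, in bin order
u = (np.arange(n) + rng.random(n)) / n

# Inverse transform: push the uniforms through the inverse CDF of the target
# distribution, here Exponential with mean 2, where F^-1(u) = -2 ln(1 - u)
x = -2.0 * np.log(1.0 - u)

# Because u was generated in increasing order, x comes out already sorted,
# and the sample mean converges to 2 much faster than with plain sampling
```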
‘Latin hypercube’ sampling requires us to use the inverse transform method for random number generation (which will be explained in tomorrow’s post). This is not necessarily the fastest method, especially for distributions whose cumulative distribution function is computationally expensive to calculate, such as the normal. Whether the above advantages are sufficient to offset this disadvantage, I don’t know. I suspect the two approaches end up roughly similar.

Inducing correlation
To completely specify the behaviour of a set of correlated random variables we would need a multivariate distribution function. This would require a great deal of information to fix its functional form. In the type of simulation we are concerned with here, which is commonly used in the fields of business and management, there is never enough information. The best we can hope for is to know which pairs of variables are correlated with each other and which are not, and to have some vague idea about the relative strengths of such correlations. This means that all we have to work with is a table, or matrix, of pairwise correlation coefficients. Generally this is insufficient to determine how the members of a set of random variables are distributed together, but we just have to put up with this and accept that the correlation structure of our variables will be underdetermined.
There is, however, one multivariate distribution, the multivariate normal, which is completely specified by a correlation matrix. By using such a matrix we are implicitly assuming that the correlation structure of our set of variables is the same as that of the multivariate normal, even if it isn’t.
It is possible to generate sets of random numbers that conform to a multivariate normal distribution with arbitrarily chosen parameters. This is quick and easy to do but the resulting marginal distributions (i.e. the one-dimensional distributions of the individual random variables) are all univariate normal. We want to be able to choose these arbitrarily, and so we need another approach.
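Generating such a multivariate normal sample is indeed quick and easy; a sketch in Python (the series does this in VBA, and the target matrix below is illustrative):

```python
import numpy as np

rng = np.random.default_rng(7)

# Target correlation matrix for three variables (illustrative values)
target = np.array([[1.0, 0.6, 0.2],
                   [0.6, 1.0, 0.0],
                   [0.2, 0.0, 1.0]])

# Cholesky factorisation: target = L @ L.T
L = np.linalg.cholesky(target)

# Independent standard normals, transformed so the columns have the target
# correlation; each marginal is still univariate normal, as noted in the text
z = rng.standard_normal((50_000, 3)) @ L.T
```

This is precisely the limitation the text describes: the correlation comes out right, but every marginal is normal, so a further step is needed to impose arbitrary marginals.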
There are two commonly used ways of generating a multivariate random sample with the correlation structure of the multivariate normal but with arbitrary marginal distributions. These are the Normal Copula method and the Iman-Conover method.
For this exercise, I have chosen to use the Iman-Conover method. This is based on the idea of rank-order correlation. It involves the following steps:
derive a ‘dummy’ multivariate sample with exactly the required correlation matrix;
calculate the rank orders of each column of the dummy array;
re-order the columns of our array of actual distributions so as to have exactly the same rank orders as the dummy array.
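These three steps can be sketched as follows (in Python for brevity; this is a simplified version, since the full Iman-Conover method, covered in Post 3, adjusts the dummy sample so that its sample correlation is exact, and the function and variable names here are my own):

```python
import numpy as np

def iman_conover(samples, corr, rng):
    """Reorder the columns of `samples` (one column per variable, each drawn
    independently from its marginal) so that their rank-order correlation
    approximates the target matrix `corr`."""
    n, k = samples.shape
    # Step 1: dummy multivariate normal sample with (approximately) the
    # required correlation matrix
    L = np.linalg.cholesky(corr)
    dummy = rng.standard_normal((n, k)) @ L.T
    out = np.empty_like(samples)
    for j in range(k):
        # Step 2: rank order of the j-th dummy column (0 = smallest)
        ranks = dummy[:, j].argsort().argsort()
        # Step 3: reorder the sorted marginal to have the same rank order
        out[:, j] = np.sort(samples[:, j])[ranks]
    return out
```

Note that the marginal distributions are untouched, since the values are merely shuffled; only their joint ordering changes.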
This is discussed in Post 3.

Dividing the inputs into subsets whose correlation structure is treated separately
Not all of the input variables will be correlated with each other. It is likely that there will be a subset of them that is statistically independent. It will be more efficient if this independent subset is treated separately from the others. This will reduce the number of arguments that need to be fed into the Iman-Conover algorithm, which will therefore run faster because it will be handling fewer variables.
Furthermore, in some cases, those variables that are correlated could be divided into subsets where the members of each subset are correlated only with the other members of the same subset. If this is the case, then a further improvement in efficiency could be had by applying the Iman-Conover algorithm, with a different correlation matrix, to each subset separately. This efficiency gain occurs because, even though the same number of variables are being treated, all the matrix functions will be handling smaller matrices. Most matrix functions require an amount of computational effort that increases rapidly with the size of the matrix, so there are benefits from keeping the matrices as small as possible.
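The gain from splitting can be seen with a block-diagonal correlation matrix: factorising each block separately gives exactly the same result as factorising the whole matrix, at lower cost, since Cholesky factorisation is roughly cubic in the matrix dimension. A sketch in Python with illustrative values:

```python
import numpy as np

# Four variables: 0-1 correlated with each other, 2-3 correlated with each
# other, and no correlation across the two subsets (illustrative values)
whole = np.array([[1.0, 0.5, 0.0, 0.0],
                  [0.5, 1.0, 0.0, 0.0],
                  [0.0, 0.0, 1.0, 0.8],
                  [0.0, 0.0, 0.8, 1.0]])

# Factorising the two 2x2 blocks separately...
L_top    = np.linalg.cholesky(whole[:2, :2])
L_bottom = np.linalg.cholesky(whole[2:, 2:])

# ...reproduces the corresponding blocks of the full 4x4 factorisation
L_whole = np.linalg.cholesky(whole)
```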
To achieve these requirements, the module is structured as follows:
Don’t worry if this doesn’t make much sense at the moment. It will become clear over the course of the next four posts. Briefly, the ‘main’ subroutine is the heart of the module. It basically does two things:
It contains the information that the user provides, i.e.
names, cell addresses and distributions of the input variables,
names and cell addresses of the output variables,
worksheet range(s) of the correlation matrix(ices) and
specifications of the probability distributions and their parameters.
It contains commands that run the overall module.
The latter commands call the various other functions that are shown in Figure 3. VBA’s limited object-oriented capabilities allow us to make these as simple as possible by hiding all the code in a class module.
The plotting and worksheet functions are the least interesting part of the whole exercise, but make up rather a lot of lines of code, simply because of the large number of fiddly little things that need to be specified when plotting graphs and writing data into worksheets.
That’s the end of today’s post. Join me tomorrow for an exploration of how to generate arrays of non-uniform random numbers.
© Copyright 2015 Howard J. Rudd all rights reserved.