VBA Monte Carlo risk analysis spreadsheet with correlation using the Iman-Conover method. Part 2

Welcome back! This is number two in a series of five blog posts that describe how to construct a Monte Carlo risk analysis application in Excel VBA. It follows on from yesterday’s, where I gave an overview of the problem and how I propose to tackle it. Today’s post describes how to generate random numbers according to arbitrarily chosen probability distributions using the inverse transform method.

The object of the exercise is to generate an array of random numbers that are distributed according to a given frequency distribution. Excel and many other applications natively provide a random function that returns values that are uniformly distributed on the range [0, 1). Out there ‘in the wild’ random variables occur in all shapes and sizes, not just the standard ones. As we are trying to develop a tool that is of fairly general applicability, we need a method that will generate samples distributed according to any distribution we may choose.

Ways of generating non-uniform random variables

We want to convert a random variable that is uniformly distributed on the range [0, 1), as supplied by VBA’s Rnd() function, into a random variable distributed according to any desired distribution. There are many ways of doing this. Most are specific to particular distributions with, as you might expect, the normal having the most. Other methods are more general and can produce random numbers conforming to many different distributions. The two most general methods, in the sense that they can be made to return distributions with any possible functional form, are the acceptance-rejection method and the inverse transform method. The latter has the further advantage that it can easily be made to return values from equal-probability subdivisions of the range of our desired random variable, which is exactly what we need for ‘latin hypercube’ sampling. Consequently, the inverse transform method is the one I have chosen to use in this project.

The inverse transform method

The inverse transform method is based on the idea that if you take a random variable that is uniformly distributed on the interval (0, 1) and put it into the inverse distribution function of a desired distribution, the resulting random variable will have that distribution. In symbols, if \(U\) is uniformly distributed on the interval (0, 1) then the random variable \(X\), defined by

\[ \begin{equation} X=F_{X}^{-1}\left(U\right) \end{equation} \]

will have the cumulative distribution function \(F_{X}\left(x\right)\). This means that, provided we can write a function that returns \(F_{X}^{-1}\left(\cdot\right)\), we can generate a random number with the distribution \(F_{X}\left(x\right).\) It is possible to prove this rigorously, but the best explanation is visual. Figure 4 below should be self-explanatory.
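For completeness, the rigorous argument is only one line. Assuming \(F_{X}\) is continuous and strictly increasing (so that \(F_{X}^{-1}\) is well defined), and using the fact that \(P\left(U\leq u\right)=u\) for any \(u\) in (0, 1):

\[ \begin{equation} P\left(X\leq x\right)=P\left(F_{X}^{-1}\left(U\right)\leq x\right)=P\left(U\leq F_{X}\left(x\right)\right)=F_{X}\left(x\right) \end{equation} \]

which is precisely the statement that \(X\) has distribution function \(F_{X}\left(x\right)\).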

InverseTransform
Figure 4: The principle of the inverse transform method

This method has the following advantages:

  1. It is very general and can produce random variables distributed according to any desired functional form.

  2. The same code works with discrete and continuous random variables.

  3. It is easy to specify the distribution and to add new ones to our module when needed with minimal effort.

  4. It can produce a set of random numbers that are natively in increasing order, which is needed by our chosen method of inducing correlation. If this wasn’t the case it would be necessary to sort them, which would be computationally expensive.

  5. It is the only method that will work with the so-called ‘latin hypercube’ sampling strategy.

Its one disadvantage is that the inverse distribution function can be computationally expensive to calculate, some distributions requiring considerably more effort than others.

Particular random variables

Overview

Excel already has a number of built-in functions that return the inverses of various distributions. These are listed in Table 1 below. It makes sense to use them whenever possible. As well as saving effort they are also likely to be faster than home-made equivalents.

Excel doesn’t have all the ones we need, however, so we will need to write some. The most important are the general uniform and the triangular. Other common distributions that Excel doesn’t natively provide include the exponential and the ones named after nineteenth-century French mathematicians (Cauchy, Poisson, Laplace etc). It is left to the reader to write functions for these if needed. However, the ones we already have should cover the vast majority of cases.
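As an illustration of how little effort is involved, here is a sketch of an inverse function for the exponential distribution just mentioned. Since \(F\left(x\right)=1-e^{-\lambda x}\), the inverse is \(F^{-1}\left(y\right)=-\ln\left(1-y\right)/\lambda\). The function name and the parameter name lambda are my own choices, so test before relying on it:

```vba
Public Function ExponentialInv(y, lambda) As Double

    ' Returns the quantile function of an exponentially distributed random
    ' variable with rate lambda. Note that VBA's Log() is the natural logarithm.

    ExponentialInv = -Log(1 - y) / lambda

End Function
```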

It would also make sense for any functions we write to use the same naming convention as the built-in ones, which is to use an abbreviated form of the name of the distribution followed by “inv”. Unfortunately, Microsoft has recently introduced a lot of new inverse distribution functions with slightly different names. Versions 2007 and earlier use one set and versions 2010 and later use another. The former are ‘still available for backward compatibility’ but ‘may not be available in future versions of Excel’. Table 1 lists them.

Distribution | Versions up to 2007 | Version 2010 & later
Beta | BETAINV(probability, alpha, beta, [A], [B]) | BETA.INV(probability, alpha, beta, [A], [B])
Binomial | Not available | BINOM.INV(trials, probability_s, alpha)
Chi-squared | CHIINV(probability, deg_freedom) | CHISQ.INV(probability, deg_freedom)
F | FINV(probability, deg_freedom1, deg_freedom2) | F.INV(probability, deg_freedom1, deg_freedom2)
Gamma | GAMMAINV(probability, alpha, beta) | GAMMA.INV(probability, alpha, beta)
Log-normal | LOGINV(probability, mean, standard_dev) | LOGNORM.INV(probability, mean, standard_dev)
Normal | NORMINV(probability, mean, standard_dev) | NORM.INV(probability, mean, standard_dev)
Standard normal | NORMSINV(probability) | NORM.S.INV(probability)
Student’s t (two-tailed inverse) | TINV(probability, deg_freedom) | T.INV.2T(probability, deg_freedom)
Student’s t (left-tailed inverse) | Not available | T.INV(probability, deg_freedom)

Table 1: Microsoft inverse distribution functions

The stated reasons for introducing the new functions are that they ‘may provide improved accuracy’ and that their ‘names better reflect their usage’. This doesn’t make sense to me. I suspect the real reason is to force people to upgrade.

General uniform random variable

VBA’s built-in Rnd() function generates random numbers uniformly distributed on the range [0, 1). We sometimes need a random variable that is uniformly distributed on a different range, which we here call [a, b).

The inverse distribution function of this random variable is the straight line that passes through the points (0, a) and (1, b). This has slope (b – a) and intercept a, so it can be written as:

\[ \begin{equation} F_{X}^{-1}\left( y \right) = \left( b - a \right) y + a \end{equation} \]

A random variable uniformly distributed on the interval [a,  b) can therefore be generated using the formula:

\[ \begin{equation} \mathtt{X=\left(b-a\right)*Rnd()+a} \end{equation} \]

If we were writing this function for general use, it would be a good idea to:

  1. include some tests to check that the input data are suitable (i.e. that the correct number of arguments has been supplied, that they are of the right type, and that \(b > a\));

  2. decide what to do if this isn’t the case;

  3. implement whatever that is.

However, we don’t want to do that here because we wish to maximise the function’s speed. We may need to run it, say, \(10^{6}\) times, in which case every statement in it will be executed \(10^{6}\) times. Even a small amount of extra execution time could add up to something noticeable. As we are in full control of what gets put into this function, there shouldn’t be a problem with nonconformist arguments. This applies to any inverse distribution functions we write.

Before proceeding any further, I would just like to mention that all code and algorithms presented in this and other articles in this series is offered without any form of warranty. You are free to use them as you wish but it is your responsibility to test them and make sure they are fit for the use to which you intend to put them.

The code to implement this function looks like this:

Public Function UniformInv(y, left, right) As Double

    ' Returns the quantile function of a random variable uniformly distributed on
    ' the range [left, right).
    
    UniformInv = (right - left) * y + left

End Function
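For example, an array of draws uniformly distributed on [5, 10) could be generated like this (the bounds and sample size are purely illustrative):

```vba
Dim x() As Double, i As Long, n As Long
n = 1000
ReDim x(1 To n)
For i = 1 To n
    x(i) = UniformInv(Rnd(), 5, 10)    ' each x(i) lies in [5, 10)
Next i
```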

Triangular random variable

Like most random variables, this one is defined by its probability density function which, as you might expect, is a triangle.

P.d.f. of the triangular random variable
Figure 5: P.d.f. of the triangular random variable

Its inverse distribution function is:

\[ \begin{equation} F^{-1}\left(y\right)=\begin{cases} \textrm{undefined} & y\leqslant0\\ a+\sqrt{y\left(c-a\right)\left(b-a\right)} & 0<y\leqslant\frac{b-a}{c-a}\\ c-\sqrt{\left(1-y\right)\left(c-a\right)\left(c-b\right)} & \frac{b-a}{c-a}\leqslant y<1\\ \textrm{undefined} & y\geqslant1 \end{cases} \end{equation} \]

This distribution is so unnatural that it is difficult to imagine any real random phenomenon conforming to it. However, it is widely used because it fits in with the idea of ‘three point estimation’, where we have a central ‘best guess’ together with upper and lower limits. The code to implement it is as follows:

Public Function TriangularInv(y, left, middle, right) As Double

    ' Returns the quantile function of a random variable distributed according to
    ' the triangular distribution with lower limit left, mode middle and upper
    ' limit right (a, b and c respectively in the equation above).
    
    ' TESTS GO HERE
    ' Rhubarb, rhubarb
    
    If y <= (middle - left) / (right - left) Then
        TriangularInv = left + Sqr(y * (right - left) * (middle - left))
    Else
        TriangularInv = right - Sqr((1 - y) * (right - left) * (right - middle))
    End If

End Function
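A quick sanity check in the Immediate window, using a symmetric triangle on [0, 2] with its mode at 1 (so the median equals the mode):

```vba
Debug.Print TriangularInv(0.125, 0, 1, 2)    ' prints 0.5
Debug.Print TriangularInv(0.5, 0, 1, 2)      ' prints 1 (median = mode by symmetry)
Debug.Print TriangularInv(0.875, 0, 1, 2)    ' prints 1.5
```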

Adding new distributions

Another advantage of the inverse transform method is that it is very easy to add new distributions. All you need to do is write a function that returns the inverse of the chosen distribution and add it to the module. You could, for example, write a function for, say, the inverse of the Weibull distribution and call it weibullinv(y, a, b). Once the function has been written and incorporated into the module, then it can be used straight away to generate random samples.
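For instance, the hypothetical weibullinv(y, a, b) might look something like the sketch below, assuming a is the shape parameter and b the scale parameter, so that \(F\left(x\right)=1-e^{-\left(x/b\right)^{a}}\) and hence \(F^{-1}\left(y\right)=b\left(-\ln\left(1-y\right)\right)^{1/a}\). As always, test it yourself before use:

```vba
Public Function WeibullInv(y, a, b) As Double

    ' Returns the quantile function of a Weibull random variable with shape a
    ' and scale b: F^-1(y) = b * (-Log(1 - y)) ^ (1 / a).

    WeibullInv = b * (-Log(1 - y)) ^ (1 / a)

End Function
```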

The 'latin hypercube' method

In this exercise I have decided to use the so-called ‘latin hypercube’ sampling strategy. With this approach we divide up the range of the random variable into sub-ranges of equal probability and choose a number randomly from each sub-range. This has the effect of making the output sample converge much more rapidly to the desired distribution as the size of the sample increases. Figure 6 below illustrates this. It shows an array of histograms, each with 100 bins, plotted from a set of random numbers. The plots in the left hand column were generated randomly from the entire range of the standard normal distribution and those in the second column were generated from the same distribution by ‘latin hypercube’. Graphs in the same row have the same number of points: \(10^{3}\) in row 1, \(10^{4}\) in row 2 and so on.

Comparison of random and 'latin hypercube' datasets
Figure 6: Comparison of random and 'latin hypercube' datasets

The left hand plots in Figure 6 look much less ‘normal’ than the corresponding right hand ones. Not until we reach \(10^{6}\) points do the random data look as good as the latin hypercube data do with \(10^{4}\) points.

The name ‘latin hypercube’ appears to have been chosen because, if more than three variables are involved, it can be regarded as a multidimensional analogue of what, in the field of statistical sampling, though not in other branches of statistics, is called a ‘latin square’. This is different from the usual meaning of that term. According to Wikipedia:

'In the context of statistical sampling [emphasis mine], a square grid containing sample positions is a latin square if and only if there is only one sample in each row and each column.'
The usual definition of latin square, which is what most people understand by the term, is
'an n × n array filled with n different symbols, each occurring exactly once in each row and exactly once in each column'

which is clearly not the same thing, though the former could be regarded as an instance of the latter with only one symbol. I suspect the term was invented to make the idea seem more difficult than it really is. You can rest assured that no multidimensional geometry is involved!

Most methods of generating random variables return a value from the entire range of the distribution. That is, on invoking the function you could get any value the function is capable of producing. This behaviour precludes ‘latin hypercube’ because you can't make the output lie in a specified sub-range of the distribution.

The inverse transform method, on the other hand, can do this. A set of \(n\) equal-probability subdivisions of a non-uniform distribution is simply the image, under the inverse transform, of a set of \(n\) equal-probability subdivisions of the uniform distribution. It is easy to divide up the uniform distribution into equal-probability subdivisions—we just make them all the same width.

Actually, we don't even need to do that. All we need to do is choose values randomly from the sub-ranges of the uniform distribution using the function:

\[\begin{equation}y_{i}=\left(\mathtt{Rnd()}+i-1\right)/n\end{equation}\]

which is derived by applying the general uniform inverse distribution function, as obtained above, to the \(i\)th subdivision of [0, 1), as explained in Figure 7 below.

The formula for 'latin hypercube' uniform array
Figure 7: The formula for 'latin hypercube' uniform array

We then feed this into the inverse distribution function of our desired distribution, so that the \(i\)th component of a random sample of \(X\) with distribution function \(F_{X}\left(x\right)\) is given by:

\[ \begin{equation} X_{i}=F_{X}^{-1}\left(\left(\mathtt{Rnd()}+i-1\right)/n\right) \end{equation} \]

Furthermore, if the index \(i\) goes from \(1\) to \(n\) in increasing order, which will happen automatically if generated in a FOR loop, the \(X_{i}\)s will also be in increasing order.

To do this, we first need to write a function to output an array of uniformly distributed values corresponding to the y(i) in Figure 7, which I have decided to call unifun(n).

The function unifun(n)

VBA's built-in Rnd() function returns values in the range [0, 1), in other words 0 ≤ Rnd() < 1. This means that it can return a zero but never a one. To check this, I wrote a short routine that called the function \(9.38 \times 10^{9}\) times and counted the number of ones and zeroes produced. It produced 560 zeroes and no ones.

This is good, as it means that there is no overlap between consecutive bins in our sequence. If it had returned values in the range [0, 1] then the probability density at bin boundaries would have been double that elsewhere and there would have been a chance that two successive calls to Rnd() within the function would result in the same value being returned twice. At least this is the case for Excel 2007, which is what I am using. If you are using a different version then it would be wise to check that yours behaves the same. Microsoft cannot necessarily be relied on not to change things. If your version behaves differently it may be necessary to modify the code.

As well as avoiding potential double counting on bin boundaries, we also need to make sure that unifun(n) doesn't return a zero in the first bin or a one in the last bin. This is because some inverse distribution functions are infinite at 0 and 1 and so would make the program crash if fed either of these values. The behaviour of Rnd() noted above prevents a 1 being returned in the last bin, but we need to take measures to prevent a zero being returned in the first bin. The best way to do this is to generate the value from the first bin separately, using a loop like:

Do
    U = Rnd()
Loop Until U <> 0

and the remaining values using a for loop from 2 to n. The code to implement the function goes something like this:

Public Function unifun(ByVal n As Long) As Double()

    ' Usage: y = unifun(n)
    '
    ' Returns an array of n random numbers in ascending order, the ith member of
    ' which is uniformly distributed on the range (i - 1)/n <= x < i/n, except
    ' for the 1st member, which is uniformly distributed on the range
    ' 0 < x < 1/n.
    
    Dim U As Double, y() As Double, i As Long
    ReDim y(1 To n)
    
    ' First member of the sequence
    Do
        U = Rnd()
    Loop Until U <> 0
    y(1) = U / n
        
    ' Remaining members of the sequence
    For i = 2 To n
        y(i) = (Rnd() + i - 1) / n
    Next i
        
    unifun = y

End Function
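As a quick check, the following snippet (with an illustrative n of 4) prints one value from each quarter of [0, 1), in increasing order:

```vba
Dim y() As Double, i As Long
y = unifun(4)
For i = 1 To 4
    Debug.Print y(i)    ' y(i) lies in [(i - 1) / 4, i / 4)
Next i
```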

Putting it together

Having written functions to generate ‘latin hypercube’ arrays and some inverse distribution functions, we can now put the two together to generate the uncorrelated sample arrays (tomorrow’s post will explain how to make them correlated).

Firstly, we need a way for the user to specify the input variables. Commercial packages use specially designed “random” worksheet functions entered into cells to identify the cells concerned and to specify the distribution to be used. For a project like this such an approach would be too difficult to implement. Instead we arrange for this information to be contained in a VBA module.

As far as possible we want this module to contain only user-specified information with a minimum of other code cluttering it up. The best way to do this is to make use of VBA’s limited object-oriented capabilities, which will be explained more fully in Post 4.

Briefly, we write a class module called “ClsRandomVariableSubset” to contain the arrays that will hold the numbers we generate, as well as much of the code that operates on them and which would otherwise have to be put in the user-input module. We can create an instance of the class for each of the subsets that will be required. This enables us to divide the variables into an independent subset and one or more correlated subsets whose members are correlated only with other members of the same subset, as mentioned in yesterday's post. Using a class module also enables us to gather the subsets together into collections that can be passed to and from functions as a single entity.

A “ClsRandomVariableSubset” object contains, among other things, an ordered sample array, that holds the output of the latin hypercube routine described above, and a random sample array that is produced from the ordered array either by randomly shuffling it, in the case of an independent set of input variables, or performing the Iman-Conover method on it, in the case of correlated sets (this will be explained in tomorrow’s post). The object also contains arrays that hold the name and cell references of the variables.

After creating an instance of the class, called InVarA in the example below, we can then populate its ordered array using the following block of code, which should be fairly self-explanatory:

'1st input variable in A
k = 1
InVarA.Name(k) = "Input 1"
InVarA.Sheet(k) = "Sheet1"
InVarA.Range(k) = "B2"
U = unifun(n)
For i = 1 To n
    InVarA.OrderedSample(i, k) = WorksheetFunction.GammaInv(U(i), 2.032, 4.117)
Next i

In this snippet, Name(k), Sheet(k), Range(k) and OrderedSample(i, k) are arrays, and the index k refers to the kth random variable in the subset. Post 4 will go into the internals of this in much more detail. Here I'm only trying to illustrate that we can add as many variables as we want simply by copying this block, changing the value of k and writing appropriate values on the right hand side of each assignment statement.
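For example, a hypothetical second variable in the same subset, using the triangular distribution developed earlier, might look like this (the name, cell reference and distribution parameters are invented for illustration):

```vba
'2nd input variable in A
k = 2
InVarA.Name(k) = "Input 2"
InVarA.Sheet(k) = "Sheet1"
InVarA.Range(k) = "B3"
U = unifun(n)
For i = 1 To n
    InVarA.OrderedSample(i, k) = TriangularInv(U(i), 1, 4, 9)
Next i
```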

That’s the end of today’s post. Tomorrow’s post will cover how to make our variables correlated.

© 2015 Howard J. Rudd all rights reserved.