Understanding M. E. O'Neill’s PCG generator. Part 3

This is the third and final part of a three-part article about my attempt to understand the PCG generator. If you haven't read the previous one then it may be a good idea to do so (assuming you want to read this one, that is). The previous part looked at the structure of the various output functions that are described in O'Neill’s report and paper. This part looks at the so-called ‘named generators’ that are provided by the C++ implementation and tries to reconstruct one of them.

ONeill (2014) says:

‘… although the generators are presented with mnemonic names based on the permutations they perform, users of the PCG library should rarely select family members by these mnemonics. The library provides named generators based on their properties, not their underlying implementations …’

The file ‘/include/pcg_random.hpp’ in the archive pcg-cpp-0.98.zip contains 42 such ‘named generators’, but a single article only has enough space to to look at one. The question is, which one?

One clue might be to look for the ones with the shortest names, since it is usually the case that unqualified names denote the most frequently used varieties of whatever it is that the word in question refers to. There are two of these, they are called PCG32 and PCG64.

Python's Numpy package includes two PC generators, called ‘PCG64’ and ‘PCG64 DXSM’. The former is described as:

‘PCG-64 is a 128-bit implementation of O’Neill’s permutation congruential generator ([1], [2]). PCG-64 has a period of \(2^{128}\) and supports advancing an arbitrary number of steps as well as \(2^{127}\) streams. The specific member of the PCG family that we use is PCG XSL RR 128/64 as described in the paper ([2]).

[1] This reference is to the front page of the PCG website: http://www.pcg-random.org/

[2] This is what I have called in this article ‘the report’: O'Neill (2014), but which which O'Neill confusingly refers to as ‘the paper’ The content of both documents is identical, however.

The dqrng package for the ‘R’ programming language includes one PCG variant, which it calls:

pcg64 The default 64 bit variant from the PCG family’.

The word ‘default’ appears once in ONeill (2014), on page 42, where it says:

‘PCG-XSL-RR … should be used when a fast general-purpose generator is needed …. It is also the default generator for 64-bit output.’

There are 16 typedefs in the file ‘pcg_random.hpp’ whose definitions contain the sequence of letters ‘xsl_rr’ and all but two of these are named ‘pcg64_something’.

It seems reasonable, therefore, to assume that the unadorned PCG64 is the ‘default’ and to choose it as the one to explore in detail.

Working out the structure of pcg64

The first stage in the chain of typedefs that will eventually lead us to discover what ‘pcg64’ actually does appears on Line 1675 of the file pcg_random.hpp. It says:

typedef pcg_engines::setseq_xsl_rr_128_64    pcg64;

This is just a simple renaming. The identifier setseq_xsl_rr_128_64 is not a template and no arguments or parameters are passed to it to transform it from something general to something more specialised. It is defined at Line 1592:

typedef setseq_base<uint64_t, pcg128_t, xsl_rr_mixin>  setseq_xsl_rr_128_64;

So setseq_base<uint64_t, pcg128_t, xsl_rr_mixin> could have been typedeffed directly to pcg64. In this statement the three types that are passed to setseq_base are:

  • uint64_t. From the C++ standard library header <cstdint> (C++11).

  • pcg128_t. If a native 128-bit unsigned integer type is available on the system, it typedefs to that, otherwise it typedefs to a 128-bit integer class defined in the file pcg_uint128.hpp. That file is (conditionally) #included in the file ‘pcg_extras.hpp’, which itself is #included in ‘pcg_random.hpp’. Figure 5 lists the pre-processor macro that chooses between these options.

  • xsl_rr_mixin. This implements the output function. It is listed in Figure 10 in Part 2. Figure 7 in Part 2 illustrates schematically how the degree of overlap is calculated in the XSL part of the function, and Figure 11, also in Part 2, the calculation of the degree of rotation.

The templated class setseq_base appears at Line 727:

template <typename xtype, typename itype,
         template<typename XT,typename IT> class output_mixin,
         bool output_previous = (sizeof(itype) <= 8)>
using setseq_base = engine<xtype, itype,
                         output_mixin<xtype, itype>, output_previous,
                         specific_stream<itype> >;

This is a ‘templated using statement’, also known as an ‘alias template’ that renames the templated class ‘engine’. The statement also involves the use of a template template parameter, which enables the use of types that themselves require their own template parameters. The fact that the parameters xtype and itype are passed to the template template parameter output_mixin means that when a class is substituted for output_mixin in the parameter list of the template setseq_base the compiler can deduce its parameters, and so only the unqualified name need be used. This is called template parameter deduction.

The class engine is a bit too large to dissect in detail in a short article, as it spans 218 lines of text from beginning to end made up as follows:

-------------------------------------------------------------------------------
Language                     files          blank        comment           code
-------------------------------------------------------------------------------
C/C++                            1             31             26            161
-------------------------------------------------------------------------------

The opening lines of the class are:

template <typename xtype, typename itype,
          typename output_mixin,
          bool output_previous = true,
          typename stream_mixin = oneseq_stream<itype>,
          typename multiplier_mixin = default_multiplier<itype> >
class engine : protected output_mixin,
               public stream_mixin,
               protected multiplier_mixin {

Figure 1: The opening lines of the class engine.

The way in which setseq_xsl_rr_128_64 feeds into setseq_base and how that in turn feeds into engine is best explored graphically. Figure 2 below shows this.

Figure 2: Hierarchy of typedefs.

so we need to investigate oneseq_stream and default_multiplier. The class oneseq_stream appears at line 254 of the file pcg_random.hpp and looks like:

template <typename itype>
class oneseq_stream : public default_increment<itype> {
protected:
    static constexpr bool is_mcg = false;

    // Is never called, but is provided for symmetry with specific_stream
    void set_stream(...)
    {
        abort();
    }

public:
    typedef itype state_type;

    static constexpr itype stream()
    {
         return default_increment<itype>::increment() >> 1;
    }

    static constexpr bool can_specify_stream = false;

    static constexpr size_t streams_pow2()
    {
        return 0u;
    }

protected:
    constexpr oneseq_stream() = default;
};

Figure 3: The class oneseq_stream.

The principal function in oneseq_stream is called stream() and returns the value of the increment inherited from the class default_increment divided by two. I am not interested in that, however, since I just want to use PCG64 as a simple RNG. Apart from that, oneseq_stream merely regurgitates things inherited from default_increment. That and default_multiplier are actually structs that are defined as a set of template specialisations implemented via a preprocessor macro:

#define PCG_DEFINE_CONSTANT(type, what, kind, constant) \
        template <>                                     \
        struct what ## _ ## kind<type> {                \
            static constexpr type kind() {              \
                return constant;                        \
            }                                           \
        };

PCG_DEFINE_CONSTANT(uint8_t,  default, multiplier, 141U)
PCG_DEFINE_CONSTANT(uint8_t,  default, increment,  77U)

PCG_DEFINE_CONSTANT(uint16_t, default, multiplier, 12829U)
PCG_DEFINE_CONSTANT(uint16_t, default, increment,  47989U)

PCG_DEFINE_CONSTANT(uint32_t, default, multiplier, 747796405U)
PCG_DEFINE_CONSTANT(uint32_t, default, increment,  2891336453U)

PCG_DEFINE_CONSTANT(uint64_t, default, multiplier, 6364136223846793005ULL)
PCG_DEFINE_CONSTANT(uint64_t, default, increment,  1442695040888963407ULL)

PCG_DEFINE_CONSTANT(pcg128_t, default, multiplier,
        PCG_128BIT_CONSTANT(2549297995355413924ULL,4865540595714422341ULL))
PCG_DEFINE_CONSTANT(pcg128_t, default, increment,
        PCG_128BIT_CONSTANT(6364136223846793005ULL,1442695040888963407ULL))

Figure 4: How the default_multiplier and default_increment structs are created by a pre-processor macro.

It's not obvious to me what the advantages are of defining these by way of a macro, since the amount of typing saved compared with writing the template specialisations out longhand is very small, and what is gained in keyboard strokes is lost in readability.

Figure 5 below reproduces the pre-processor macro that defines PCG_128BIT_CONSTANT (Line 91), which is needed if the internal state variable is 128 bits wide, and also the typedefs that define the type pcg128_t.

#if __SIZEOF_INT128__
    namespace pcg_extras {
        typedef __uint128_t pcg128_t;
    }
    #define PCG_128BIT_CONSTANT(high,low) \
            ((pcg128_t(high) << 64) + low)
#else
    #include "pcg_uint128.hpp"
    namespace pcg_extras {
        typedef pcg_extras::uint_x4<uint32_t,uint64_t> pcg128_t;
    }
    #define PCG_128BIT_CONSTANT(high,low) \
        pcg128_t(high,low)
    #define PCG_EMULATED_128BIT_MATH 1
#endif

Figure 5: The macro PCG_128BIT_CONSTANT (Line 91), preceded by the typedef that defines the type pcg128_t.

If the compiler is GCC on a 64-bit processor then the type __uint128_t will be available. In that case pcg128_t typedefs to that, otherwise it typedefs to the bespoke 128-bit type provided in the file pcg_uint128.hpp. In this article I will from now on assume that the GNU type is available.

Running the pre-processor over the above macro and definitions produces the following output for the 128-bit constants:

template <>
struct default_multiplier<pcg128_t>
{
    static constexpr pcg128_t multiplier()
    {
        return ((pcg128_t(2549297995355413924ULL) << 64) + 4865540595714422341ULL); 
    }
};

template <>
struct default_increment<pcg128_t>
{
    static constexpr pcg128_t increment()
    {
        return ((pcg128_t(6364136223846793005ULL) << 64) + 1442695040888963407ULL);
    }
};

Figure 6: Pre-processor output from Figure 4.

This means that in the 128-bit case the multiplier is 327738287884841127335028083622016905945 and the increment is 266050134430354249485913702517209329001. These values have to be created as a pair of smaller (in this case 19-digit) literals because the type __uint128_t, which is a GNU ‘extension’ and not part of the C++ Standard, wont accept literals that can't be stored in 64 bits or less. If in the future the C++ Standard introduces a native 128-bit type then this restriction may no longer apply. I cannot find on the internet any clue as to where these numbers come from. Interestingly, the increment and the multiplier are multiplicative inverses of each other modulo \(2^{128}\). That is

\begin{equation} 327738287884841127335028083622016905945 \times 266050134430354249485913702517209329001 \equiv 1 \quad \left(\operatorname{mod} 2^{128} \right) \end{equation}

I don't know whether this fact has implications for the performance of the generator. None of the other multiplier – increment pairs listed in pcg_random.hpp have this property.

The class engine contains 26 member functions. However, I am only interested in the basic function that outputs the next member of the ‘random’ sequence. That is operator()(). There are two forms of this:

result_type operator()()
{
    if (output_previous)
        return this->output(base_generate0());
    else
        return this->output(base_generate());
}

result_type operator()(result_type upper_bound)
{
    return bounded_rand(*this, upper_bound);
}

where result_type is a typedef for the type that was called xtype when it was passed to the class as a template parameter (there aren't enough typedefs in this code!), output() is from the class output_mixin which is passed to the engine class as a template parameter and then inherited as a parent class.

output_previous is a boolean ‘non-type-template-parameter’ passed to the engine class when it is instantiated. In the case we are studying, it was passed to engine via setseq_base, which is a re-naming of engine with specific values of template parameters. In this case it was assigned the value

bool output_previous = (sizeof(itype) <= 8)

That is, if the width of the internal state variable is 8 bits or fewer then the function base_generate0() is called, otherwise base_generate() is called. There's no point in exploring what base_generate0() does, since we are never going to use an 8-bit state variable, so the next stop in our tour is base_generate(). This can be found at Line 392 and goes like this:

itype base_generate()
{
    return state_ = bump(state_);
}

So base_generate() is just a wrapper around another function, called bump(), so we haven't got to the actual mathematics yet, but I feel we may not be far off. The function bump() appears at Line 387, immediately before base_generate(). It looks like this:

itype bump(itype state)
{
    return state * multiplier() + increment();
}

Yay! we have finally got there! The engine class doesn't contain explicit functions called multiplier() or increment(), they are inherited from the classes multiplier_mixin and stream_mixin, which are passed to engine as template parameters and then inherited from. In the case of PCG64, they are not specified directly but are left to take on the default values specified in the definition of the engine class (see Figure 4). The default value specified for multiplier_mixin is

typename multiplier_mixin = default_multiplier<itype>

which is just the struct default_multiplier<itype> shown in Figure 6 for the case where itype is pcg128_t. This exports the required function multiplier() which is thereby made available inside the engine class.

stream_mixin, on the other hand, is not quite as straightforward. Its default value specified in the parameter list is:

typename stream_mixin = oneseq_stream<itype>

which isn't any of the structs created by the pre-processor macro shown in Figure 4:

The class oneseq_stream was listed in Figure 3 and contains no explicit function called increment() in its body. However, it does inherit the class default_increment<itype> and so exports that class's increment() function. So stream_mixin does in fact provide the increment() function, just inherited from one generation further back.

The following using statements appear in a protected block of the engine class:

using stream_mixin::increment;
using multiplier_mixin::multiplier;

As far as I can see, their only function is to change the access level of the two functions from public (since they are declared in a struct) to protected.

The function output() is from the class output_mixin which is passed to engine as a template parameter by way of the wrapper template setseq_base. Figure 2 shows that output_mixin is xsl_rr_mixin. We have already analysed that class in Part 2, in the section on random rotation. Figure 10 in Part 2 shows the code of xsl_rr_mixin. The class's output() function combines a what O'Neill calls a ‘fixed XORshift to low bits’ with a random rotation. The former is discussed in Part 2's section on XSL and the latter in the section on random rotation.

So the the function operator()() of the class PCG64 carries out the following steps:

  1. The 128-bit internal state variable is incremented using the equation \begin{equation} x_{i} = ax_{i-1} + c \end{equation} where \(a = 327738287884841127335028083622016905945\) and \(c = 266050134430354249485913702517209329001\).

  2. \(x_i\) is fed into the output() function of the class xsl_rr_mixin.

  3. That function performs two tasks sequentially:

    1. A ‘fixed XORshift to low bits’ as described in the section entiled XSL. To summarise; let \(w_{int}\) be the width of the internal ‘state’ variable and \(w_{ext}\) the width of the external ‘result’ variable, then the ‘fixed XORshift to low bits’ consist of:

      1. Right shift the internal ‘state’ variable by \(\frac{1}{2}w_{int}\)
      2. XOR the result with the un-shifted version
      3. Return the least significant \(w_{ext}\) bits of the result (by casting to the output type).
    2. Apply a ‘random’ circular shift to the \(w_{ext}\)-bit wide result of the above procedure. The number of places through which the variable is to be rotated is equal to the integer interpretation of the most significant \(k = \left\lfloor \log_2 \left(w_{ext}\right) \right\rfloor\) bits of the internal ‘state’ variable. If \(w_{ext} = 64\) then \(k = 6\).

The next thing we need to know is how to ‘seed’ the generator. Neither the website https://www.pcg-random.org/ nor the file pcg_random.hpp contain explicit instructions about how to do that. However, the generator gives the appearance of trying to be a drop-in replacement for the standard library header <random> (which has been available since C++11), or the Boost random library. Those libraries pass the seed to the constructor when instantiating an instance of the generator class. Various constructors are provided to enable seeding using different types of seed, either a single integer or something called a ‘seed-sequence’. Unsurprisingly, PCG is no different. The engine class contains four constructors:

engine(itype state = itype(0xcafef00dd15ea5e5ULL))
    : state_(this->is_mcg ? state|state_type(3U)
                          : bump(state + this->increment()))
{
    // Nothing else to do.
}

// This function may or may not exist.  It thus has to be a template
// to use SFINAE; users don't have to worry about its template-ness.

template <typename sm = stream_mixin>
engine(itype state, typename sm::stream_state stream_seed)
    : stream_mixin(stream_seed),
      state_(this->is_mcg ? state|state_type(3U)
                          : bump(state + this->increment()))
{
    // Nothing else to do.
}

template<typename SeedSeq>
engine(SeedSeq&& seedSeq, typename std::enable_if<
              !stream_mixin::can_specify_stream
           && !std::is_convertible<SeedSeq, itype>::value
           && !std::is_convertible<SeedSeq, engine>::value,
           no_specifiable_stream_tag>::type = {})
    : engine(generate_one<itype>(std::forward<SeedSeq>(seedSeq)))
{
    // Nothing else to do.
}

template<typename SeedSeq>
engine(SeedSeq&& seedSeq, typename std::enable_if<
               stream_mixin::can_specify_stream
           && !std::is_convertible<SeedSeq, itype>::value
           && !std::is_convertible<SeedSeq, engine>::value,
    can_specify_stream_tag>::type = {})
    : engine(generate_one<itype,1,2>(seedSeq),
             generate_one<itype,0,2>(seedSeq))
{
    // Nothing else to do.
}

Figure 7: The constructors of the engine class.

Let's go through them in order of appearance:

Constructor 1

This constructor takes a single numerical value as the seed (with a flippant default that obviously has no theoretical basis). The default value has 16 hex characters and so is 64 bits wide, but if state_ is narrower it is truncated. The fact that the argument is called ‘state’ would at first sight appear to suggest that it becomes the internal state variable of the generator, but in fact its value is tweaked before being assigned to the actual state variable. The nature of the tweak depends on the value of this->is_mcg, which is a boolean variable that indicates whether the underlying linear congruential generator is multiplicative. If this->is_mcg is true, then the argument is bitwise-ORed with the number 3, which has the effect of setting the least significant two bits to 1 while leaving all the other bits unchanged. If this->is_mcg is false, then the bump() function is applied to the sum of the variable ‘state’ (which is actually the seed) and the increment, and then assigned to the actual state variable. The rationale for these tweaks is not explained. This constructor is not templated like the others in Figure 7.

Constructor 2

This takes two numerical arguments, the first being called state and the second stream_seed. The argument state is treated in the same was as in Constructor 1 and the argument stream_seed is used to initialise stream_mixin via its constructor. In these articles I haven't gone into what ‘stream mixins’ actually do, as the objective is to reconstruct the simplest possible generator and including ‘streams’ would be an unnecessary complication. ‘Streams’ are where a generator can be made to behave like multiple independent generators whose outputs are uncorrelated. This can be useful in situations where we want to generate the next member of each of multiple sequences one after the other, instead of generating each sequence in its entirety before starting on the next. Constructor 2 only works when the engine class's template parameter stream_mixin is set to the class specific_stream, since that is the only kind of ‘stream mixin’ that contains a type called stream_state, though that is just a typedef of itype.

Constructor 3

This calls constructor 1 and passes to it the output of a call to the function generate_one which itself has been passed as its argument a SeedSeq object.

Constructor 4

This calls constructor 2 and passes to that the output of two calls to the function generate_one which itself has been passed as its argument a SeedSeq object.

So there are really only two constructors, numbers 1 and 2, the difference between them being that constructor 2 initialises a generator that has ‘streams’, whereas constructor 1 doesn't have them. The named generator ‘PCG64’ that we are looking at here doesn't do streams, however, so from now on we can ignore constructors 2 and 4.

Constructors 3 & 4 both take as a template parameter an object of type std::enable_if<Condition, Type>::type. This is created if and only if Condition is true, in which case is a synonym for Type, otherwise it doesn't exist. This means that if Condition is false compilation will fail, but in C++ an invalid substitution of template parameters is not in itself an error and instead of stopping with a compilation error the compiler removes the potential overload from the candidate set. In other words, if the programmer calls engine<SeedSeq>(seedSeq, T) and Condition is true, then that constructor may be called otherwise the compiler will treat the constructor as if it doesn't exist. The two constructors have mutually exclusive conditions, the first being

!stream_mixin::can_specify_stream
&& !std::is_convertible<SeedSeq, itype>::value
&& !std::is_convertible<SeedSeq, engine>::value,

and the second being

stream_mixin::can_specify_stream
&& !std::is_convertible<SeedSeq, itype>::value
&& !std::is_convertible<SeedSeq, engine>::value

So only one of these constructors can get called because only one of stream_mixin::can_specify_stream and its negation can be true. For the condition to be true it is also necessary that the type SeedSeq is not implicitly convertible to either itype or engine. If it were convertible to itype then the first constructor should be called that just takes a numerical value. If SeedSeq were convertible to engine then the implicit copy constructor should be called—which provides a simple way of using the output of one generator as the seed to another.

Constructors 3 & 4 both call a function called generate_one. That is in the file pcg_extras.hpp and goes something like this:

/* generate_one, produce a value of integral type using a SeedSeq
 * (optionally, we can have it produce more than one and pick which one
 * we want)
 */
 
template <typename UInt, size_t i = 0UL, size_t N = i+1UL, typename SeedSeq>
inline UInt generate_one(SeedSeq&& generator)
{
    UInt result[N];
    generate_to<N>(std::forward<SeedSeq>(generator), result);
    return result[i];
}

which creates a C-style array and passes it to the function generate_to<N>(), which does soemthing to the array, and then the array is returned. The size of the array defaults to 1 and the index of the member that is returned defaults to zero. generate_to<N>() looks like:

template <size_t size, typename SeedSeq, typename DestIter>
inline void generate_to(SeedSeq&& generator, DestIter dest)
{
    typedef typename std::iterator_traits<DestIter>::value_type dest_t;
    constexpr bool IS_32BIT = sizeof(dest_t) == sizeof(uint32_t);

    generate_to_impl<size>(std::forward<SeedSeq>(generator), dest,
                           std::integral_constant<bool, IS_32BIT>{});
}

which itself is a wrapper around generate_to_impl<N>(), of which there are two versions.

template <size_t size, typename SeedSeq, typename DestIter>
inline void generate_to_impl(SeedSeq&& generator, DestIter dest,
                             std::true_type)
{
    generator.generate(dest, dest+size);
}

template <size_t size, typename SeedSeq, typename DestIter>
void generate_to_impl(SeedSeq&& generator, DestIter dest,
                      std::false_type)
{
    typedef typename std::iterator_traits<DestIter>::value_type dest_t;
    constexpr auto DEST_SIZE = sizeof(dest_t);
    constexpr auto GEN_SIZE  = sizeof(uint32_t);

    constexpr bool GEN_IS_SMALLER = GEN_SIZE < DEST_SIZE;
    constexpr size_t FROM_ELEMS =
        GEN_IS_SMALLER
            ? size * ((DEST_SIZE+GEN_SIZE-1) / GEN_SIZE)
            : (size + (GEN_SIZE / DEST_SIZE) - 1)
                / ((GEN_SIZE / DEST_SIZE) + GEN_IS_SMALLER);
                        //  this odd code ^^^^^^^^^^^^^^^^^ is work-around for
                        //  a bug: http://llvm.org/bugs/show_bug.cgi?id=21287

    if (FROM_ELEMS <= 1024) {
        uint32_t buffer[FROM_ELEMS];
        generator.generate(buffer, buffer+FROM_ELEMS);
        uneven_copy(buffer, dest, dest+size);
    } else {
        uint32_t* buffer = (uint32_t*) malloc(GEN_SIZE * FROM_ELEMS);
        generator.generate(buffer, buffer+FROM_ELEMS);
        uneven_copy(buffer, dest, dest+size);
        free(buffer);
    }
}

Both versions of this are essentially the same, i.e. they both invoke a function called generator.generate(begin, end). The object generator is of type SeedSeq, so next we need to investigate what that does.

It would seem that O'Neill intends that the type SeedSeq should walk and quack like the class std::seed_seq from the C++ standard library. That class is not what it would seem from its name. It isn't a sequence of seeds, it's a random number generator in its own right. It has a function called generate() that takes as an argument an iterable container, such as std::vector, and fills it up with randomly generated integers. Since std::seed_seq is an RNG, it needs to be seeded itself, which is done by way of a variable-length argument list passed to its constructor. Another way of looking at it is that std::seed_seq ‘improves’ a seed-value or sequence of seed-values.

All of the functions in this chain take their SeedSeq argument as an r-value reference, presumably so that it can be passed directly as either a literal or an expression, but once inside the function it becomes an l-value so std::forward<SeedSeq>(generator) is used to convert it back into an r-value ready for the next function in the chain. I guess all of the functions are made to require r-values because sometimes they may be used individually and not always in a chain like this.

To summarise, an object of a type that is trying to walk and quack like std::seed_seq, such as our type SeedSeq, must provide a function called generate(begin, end) that takes an iterator derived from a container and fills the container with integers. The chain of functions called by constructors 3 and 4 take as a template parameter an object called generator, of type SeedSeq, which is first passed to generate_one and then along the chain of functions to generate_to_impl. The first function in the chain, generate_one, creates an empty C-style array which eventually finds it way to the function generate(begin, end), where begin and end are raw pointers to the first byte and one past the last byte of the array.

Finally, we just need to revisit the content of xsl_rr_mixin, which we studied in Part 2. This consists of what O'Neill calls a ‘fixed XOR-shift to low bits’, followed by a ‘random rotation’. In Part 2 we boiled down the many lines of algebra in the ‘fixed XOR-shift to low bits’ to the following:

  1. XORshift the internal state variable right by half its width, then
  2. cast to the width of the output variable.

and the random rotation, though quite complicated in the general case, is very simple when the number of ‘spare bits’, i.e. the difference between the width of the internal state variable and the external result variable, is more than we need to perform a circular shift on the output variable by up to its full width. In the case of PCG64, the internal state variable is 128 bits wide, and the external result variable is 64 bits, so we have 64 ‘spare’ bits but we only need 6 bits to perform a circular shift on a 64 bit variable by up to its full width. However, in this case, those 64 bits are not really spare since the XORshift operation in xsl_rr_mixin folds the whole of the state variable back on itself so all bits in the state variable participate in the construction of the output. So, if we had wanted to exclude the top six bits we would have had to XORshift by \(64 - 6 = 58\) bits. However, the objective of the exercise is to reproduce what O'Neill did, so we will stick to the original plan.

So the overall procedure becomes:

  1. Extract the most significant 6 bits from the internal state variable and assign the result to a variable called r, then
  2. XORshift the internal state variable right by half its width, then
  3. cast the result to the width of the output variable, then
  4. apply a circular shift to the result by an amount equal to the numerical value of r.

So now we have enough information to construct our own implementation of PCG64. Figure 8 shows the implementation I have come up with, which I have called pcg64_hjr. (Because it's all about me!) I know it's really naff to include license boilerplate in something like this but the code will actually work if copied and pasted and it's possible that someone may tempted to do that and I'm paranoid about being sued if something goes wrong, and also it would be annoying if someone else then copyrighted it and prevented me from using it myself.

#include <cstdint>
#include <bit>

class pcg64_hjr
{

/* Copyright 2024 Howard J Rudd all rights reserved. The author grants the user 
a royalty free license to use this software for non-commercial purposes. This 
software is WITHOUT WARRANTY OF ANY KIND, either express or implied, not even 
for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. It is the user's 
responsibility to ensure it is fit for the use to which they intend to put it. 
The author accepts no liability for anything that may go wrong. */

private:

  __uint128_t m; // LCG multiplier
  __uint128_t c; // LCG increment
  __uint128_t x_; // Internal 'state' variable
  uint64_t z; // Intermediate output variable, after XORshift but before rotation.
  uint64_t r; // Degree of rotation

  __uint128_t bump (__uint128_t x)
  {
    return x * m + c;
  }

  uint64_t output (__uint128_t x)
  {
    z = static_cast<uint64_t> (x ^ (x >> 64));
    // 64 is half the width of the internal 'state' variable.
    
    r = static_cast<uint64_t> ((x >> 122) & 63);
    // 122 is the width of the internal 'state' variable (128) minus
    // the number of bits (6) needed to encode an integer in the range
    // 0, ..., 63. The final 63 is a bitmask in which the least-significant
    // 6 bits are 1 and all others are zero. In theory it shouldn't be
    // necessary to apply this but O'Neill does, so I've done it too.
    
    return std::rotr (z, r);
  }

public:

  pcg64_hjr (__uint128_t seed)
  {
    m = static_cast<__uint128_t> (2549297995355413924);
    m = (m << 64) + 4865540595714422341;

    c = static_cast<__uint128_t> (6364136223846793005);
    c = (c << 64) + 1442695040888963407;

    x_ = bump (seed + c);
  }

  uint64_t operator() ()
  {
    x_ = bump (x_);
    return output(x_);
  }

};

Figure 8: The reconstructed pgc64 generator class.

The code uses the constructor to ‘seed’ the generator, and operator()() to produce output. Following O'Neill, the initial LCG step is performed in a private function called ‘bump()’ and the output function is in another private function called output(). It also uses the new built-in circular shift function that was introduced in C++ 20, so we don't need to bother pfaffing around to construct our own and then worrying about whether it is optimal or not.

Figure 9 below shows a main() function that compares the output of pcg6_hjr with the pcg64 from a freshly downloaded copy of O'Neill's library. It constructs the 128-bit seed from two calls to std::random_device.

<pre class="prettyprint lang-c">
#include "pcg_random.hpp"
#include "pcg64_hjr.h"
#include <random>
#include <iostream>
#include <cmath>

int main ()
{
  std::random_device rd;

  uint64_t seed_upper = rd ();
  uint64_t seed_lower = rd ();

  __uint128_t seed = seed_upper;
  seed = (seed << 64) + seed_lower;

  pcg64_hjr gen_hjr (seed);
  pcg64 gen (seed);

  std::cout << "New implementation\tOriginal library" << std::endl;
  for (int i = 0; i < 10; i++)
  {
    std::cout << gen_hjr () << "\t" << gen () << std::endl;
  }

  uint64_t variance { 0 };
  uint64_t diff { 0 };
  for (int i = 0; i < 2e9; i++)
  {
    diff = gen_hjr () - gen ();
    variance += abs (diff);
  }

  std::cout << "Variance = " << variance << std::endl;

  return 0;
}

Figure 9: Main function to compare the new implementation of the PCG64 implementation shown in Figure 8 above, with a freshly downloaded copy of O'Neill's

This prints out the following:

New implementation    Original library
135472737322341482    135472737322341482
9329035867628166090   9329035867628166090
793768133254339388    793768133254339388
16024268172283510654  16024268172283510654
2300653954511388943   2300653954511388943
11638691889812511789  11638691889812511789
14051190976459833412  14051190976459833412
3617772801392788695   3617772801392788695
5962226159752129915   5962226159752129915
10115094317008627567  10115094317008627567
Variance = 0

Which allows us to visually compare the first ten outputs from both generators, and also shows that the two sequences are identical for their first \(10^9\) members. I think therefore that we are justified in concluding that our concise reconstruction is indeed mathematically identical to the typedef pcg64 from O'Neill's library.

References

M. E. O’Neill. PCG: A Family of Simple Fast Space-Efficient Statistically Good Algorithms for Random Number Generation. Technical Report HMC-CS-2014-0905, Harvey Mudd College, Claremont, CA, September 2014. URL https://www.cs.hmc.edu/tr/hmc-cs-2014-0905.pdf.

M. E. O’Neill. PCG: A Family of Simple Fast Space-Efficient Statistically Good Algorithms for Random Number Generation. ACM Transactions on Mathematical Software (submitted), 2014a. URL https://www.pcg-random.org/pdf/toms-oneill-pcg-family-v1.02.pdf. Accessed 2023-11-01.

© Copyright 2024 Howard J. Rudd all rights reserved.

Leave a Reply

Your email address will not be published. Required fields are marked *