5  Pseudo-random number generators ✎ Very rough draft

True randomness is impossible for a deterministic algorithm to achieve. The “random” number generators in R and other software are actually pseudo-random number generators (PRNGs). Computer scientists and mathematicians spend lots of time trying to make PRNG output harder to distinguish from true randomness, because predictable patterns in the numbers can bias any models built on them. Pseudo-random numbers are at the core of simulation studies, as they allow us to (pseudo-)randomly sample simulated data from known population distributions.

5.1 Sampling from uniform distributions

A uniform distribution is one in which every value in a given range is as likely as every other value. E.g., “pick a number between 1 and 10”, where the picker is just as likely to say “1” as any other number in that range.

runif() is a random generation function (the “r” part) for a uniform distribution (the “unif” part). Read it as “r-unif”, not “run if”, which confuses some people.

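So this code, which generates a random number between 1 and 10, will generate 3s just as often as it generates 7s. (Because the continuous value is rounded to a whole number, the endpoints 1 and 10 each come up about half as often as the numbers in between.) You can rerun this code yourself many times to see it generate different numbers between 1 and 10.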

library(roundwork)

runif(n = 1, 
      min = 1,
      max = 10) |>
  round_up(0)
[1] 6
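
One way to check how evenly the rounded values are spread is to draw a large number of them and tabulate the proportions. The sketch below is illustrative only: it uses base R's floor(x + 0.5) as a stand-in for round_up(0) (an assumption, made so the example runs without roundwork), and the seed and number of draws are arbitrary choices.

```r
set.seed(1)  # arbitrary seed, used only so this sketch is repeatable

n_draws <- 100000

# Continuous uniform draws, rounded half up to whole numbers.
# floor(x + 0.5) stands in for round_up(x, 0) (assumption: x is positive).
rounded <- floor(runif(n = n_draws, min = 1, max = 10) + 0.5)

# Proportion of draws landing on each whole number from 1 to 10
prop_rounded <- table(factor(rounded, levels = 1:10)) / n_draws
round(prop_rounded, 3)

# Drawing whole numbers directly gives all ten values the same chance
direct <- sample(1:10, size = n_draws, replace = TRUE)
prop_direct <- table(factor(direct, levels = 1:10)) / n_draws
round(prop_direct, 3)
```

In the first table the values 2 to 9 should each sit near 1/9, while 1 and 10 sit near 1/18, because only half a unit of the continuous range rounds to each endpoint. In the second table every value should sit near 1/10, so sample() is the better fit when you want whole numbers that are all equally likely.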

5.2 Random number generators are not truly random

You don’t need to understand how PRNGs work, but you do need to know that these “random” numbers can be predictably reproduced. The ‘seed’ value from which a PRNG starts can be set to control which random numbers are generated.

set.seed(43) # set the starting seed value for generating the random numbers

runif(n = 1, 
      min = 1,
      max = 10) |>
  round_up(0)
[1] 5
set.seed(43) # reset the seed to the same starting value

runif(n = 1, 
      min = 1,
      max = 10) |>
  round_up(0)
[1] 5

Note that if you run the function a second time without resetting to a known seed, the second value will be different to the first one.

set.seed(43) # set the starting seed value for generating the random numbers

runif(n = 1, 
      min = 1,
      max = 10) |>
  round_up(0)
[1] 5
runif(n = 1, 
      min = 1,
      max = 10) |>
  round_up(0)
[1] 9

This is because the Nth value in the sequence from a given seed is fixed, whether the values are generated in one call or across several calls.

set.seed(43) # set the starting seed value for generating the random numbers

# generate both of the above numbers in one function call
runif(n = 2, # generate two numbers rather than one
      min = 1,
      max = 10) |>
  round_up(0)
[1] 5 9
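
To see why the sequence from a given seed is fixed, it can help to look at a toy generator. The sketch below is a simple multiplicative congruential generator written for illustration; toy_prng() is a hypothetical helper, and this is not the algorithm R uses (R's default generator is Mersenne-Twister). Each value is computed from the previous state by plain arithmetic, so the starting seed determines everything that follows.

```r
# Toy multiplicative congruential generator, for illustration only.
# NOT the algorithm behind runif(); R's default is Mersenne-Twister.
# seed should be a whole number between 1 and 2^31 - 2.
toy_prng <- function(seed, n) {
  a <- 16807        # multiplier
  m <- 2^31 - 1     # modulus
  state <- seed
  out <- numeric(n)
  for (i in seq_len(n)) {
    state <- (a * state) %% m  # next state depends only on the current state
    out[i] <- state / m        # scale the state to a value between 0 and 1
  }
  out
}

toy_prng(seed = 43, n = 3)
toy_prng(seed = 43, n = 3)  # same seed: the same three values again
toy_prng(seed = 44, n = 3)  # different seed: a different sequence
```

Asking for five values instead of three returns the same first three followed by two more, which mirrors what happens when runif() is called once with n = 2 rather than twice with n = 1.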

5.3 Randomness and reproducibility

You might have heard people talking about “seeds” before when using R. Put simply, a seed is what makes random output reproducible.

For example, look at the outputs of two calls to the rnorm() function below, which samples random values from a normal distribution with a mean of 0 and a standard deviation of 1:

output1 <- rnorm(n = 5, mean = 0, sd = 1)
output2 <- rnorm(n = 5, mean = 0, sd = 1)

output1 
[1] -1.5746044 -0.4859675  0.4651862 -0.9040981 -0.2774328
output2
[1]  0.38643441 -0.06040412 -0.68617976 -1.90613679  1.80375975

You can see that the outputs of the two calls are different. But sometimes we might not want this: for example, when we want other people to be able to reproduce our simulation results exactly. If we use truly random numbers, then every time we re-run the simulation we’ll get slightly different results. What we really want are pseudo-random numbers: numbers which behave as if they were sampled randomly, but where the specific random sampling can be reconstructed and reproduced. This is exactly what seeds are for: if you set a seed to a specific value, then all subsequent calls to functions which use randomisation will reproduce the same random samples in future runs.

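Put simply: if you start from the same seed, you get the same sequence of “random” values. Check out the example below, now using set.seed(). The two outputs still differ from each other, because the second call continues the sequence where the first one stopped, but rerunning the whole block reproduces both of them exactly.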

set.seed(123)

output1 <- rnorm(n = 5, mean = 0, sd = 1)
output2 <- rnorm(n = 5, mean = 0, sd = 1)

output1 
[1] -0.56047565 -0.23017749  1.55870831  0.07050839  0.12928774
output2
[1]  1.7150650  0.4609162 -1.2650612 -0.6868529 -0.4456620
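
To confirm that the seed is what makes these values repeatable, one option is to reset it and compare two runs directly. This is a minimal sketch; first_run and second_run are names chosen here for illustration.

```r
set.seed(123)
first_run <- rnorm(n = 5, mean = 0, sd = 1)

set.seed(123)  # reset to the same seed before drawing again
second_run <- rnorm(n = 5, mean = 0, sd = 1)

# Compare the full, unrounded values from the two runs
identical(first_run, second_run)  # TRUE
```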

In your own work, you’ll often set a seed once at the top of a script so you (and others) can reproduce your results.

Don’t put set.seed() inside a data-generating process (DGP) function. If you do, every call will restart the random number generator in the same place, and you will keep generating the same dataset. This becomes a problem when we get to Step 4 (repeat many times), since you just end up repeating the exact same process every time.

bad_generate <- function() {
  set.seed(42)
  rnorm(3)
}

bad_generate()
[1]  1.3709584 -0.5646982  0.3631284
bad_generate()
[1]  1.3709584 -0.5646982  0.3631284

# set the seed once, outside the function
set.seed(42)

good_generate <- function() {
  rnorm(3)
}

good_generate()
[1]  1.3709584 -0.5646982  0.3631284
good_generate()
[1]  0.6328626  0.4042683 -0.1061245
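
The effect on a repeated simulation can be sketched as follows. The function names, the sample size of 20, and the 1000 repetitions are all hypothetical choices made for illustration; the point is only to count how many distinct results each approach produces.

```r
# Hypothetical DGP function that (wrongly) resets the seed on every call
bad_generate_data <- function(n) {
  set.seed(42)
  rnorm(n = n, mean = 0, sd = 1)
}

# Hypothetical DGP function that leaves the seed alone
good_generate_data <- function(n) {
  rnorm(n = n, mean = 0, sd = 1)
}

# Repeat each process 1000 times, keeping the mean of every dataset
bad_means <- replicate(1000, mean(bad_generate_data(n = 20)))

set.seed(42)  # set once, before the repetitions start
good_means <- replicate(1000, mean(good_generate_data(n = 20)))

length(unique(bad_means))   # 1: every repetition is the same dataset
length(unique(good_means))  # number of distinct means across repetitions
```

With the seed set once outside the function, the repetitions differ from each other, yet rerunning the whole script from set.seed(42) still reproduces the same set of means.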

5.4 TODO

  • Discuss the apparent contradiction of predictable unpredictability.