23  Exercises for ‘Data Generating Process functions’ chapter

TODO: write a few lines on why we’re writing wrapper functions; in the book, talk about wrapping functions, e.g., rnorm_multi().

These exercises accompany the Data Generating Process functions chapter.

You can complete these exercises in your local version of the .qmd file. Either download a copy of the whole book from GitHub (see the introduction), or download this .qmd file using the download button at the top right of the page.

Write functions for each of the following. Remember that you can write pseudocode first if it helps.

23.1 Normally distributed data

23.1.1 Sample data from a normal distribution and return a tibble

Generate data for a study in which a single sample is drawn from a normally distributed population. The function’s arguments should specify the population mean, the population SD, and the number of participants sampled.

Use rnorm().

23.1.2 Sample data from a normal distribution, round to the nearest whole number, and return a tibble

Again, sample for a single group.

Rounding the data to the nearest whole number makes the data look a bit more like Likert data. Note that properly simulating realistic Likert data is harder than this.

23.1.3 Sample data from a normal distribution, calculate its average, and return a tibble

Again, sample for a single group.

Calculating its average is an arbitrary choice, suggested only to practice writing functions.

23.1.4 Sample data from two normal distributions, return a tibble with two columns called intervention and control

This looks more like data from an RCT.

The arguments should be mean_intervention, mean_control, sd, and n_per_condition.

23.1.5 Sample data from two normal distributions, return a tibble with two columns called intervention and control but more flexibly

This time, write a more flexible function that has separate arguments for each condition’s M, SD and N.

23.1.6 Sample data from a normal distribution, add outliers, and return a tibble

Like the previous exercise, this exercise asks you to generate two tibbles of data and then combine them. This time, both datasets are for the same single unnamed condition. One tibble represents the ‘legitimate’ data. The second tibble represents the outliers. Both should draw from normal distributions. The arguments should be a) the population mean of the legitimate data, b) the SD of the legitimate data, c) the difference between the legitimate mean and the outlier mean (i.e., outlier mean = legitimate mean + difference), and d) the SD of the outliers.

Note that this is not intended to be a realistic DGP for how outliers occur, other than there being two separate DGPs for legitimate vs. outlier data. It is an arbitrary example designed to help you build your function writing skills.

23.1.7 Sample data from two normal distributions, return a tibble in long format with a ‘score’ and ‘condition’ column

Use bind_rows().

23.1.8 Cohen’s d simulator

Write a function like the two-group RCT function we did before, but it should only take the arguments cohens_d and n_per_group (i.e., not means and SDs). Whereas a previous exercise had you practice writing a more flexible function with more arguments, this practices a less flexible one with fewer arguments.

rnorm() can produce Cohen’s d-like data when we fix the SDs to 1, one mean to 0, and the second mean to the desired population Cohen’s d value.
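The reasoning is that Cohen’s d is the mean difference divided by the pooled SD, so when both SDs are 1 the mean difference is d itself. The following is a minimal sketch that checks this numerically; it is not the exercise solution, and the names population_d, n_per_group and observed_d are chosen here purely for illustration.

```r
# Illustrative check only: with both SDs fixed to 1, the population mean
# difference equals the population Cohen's d.
set.seed(42)

population_d <- 0.5     # assumed target effect size (hypothetical value)
n_per_group  <- 100000  # very large n so the estimate sits close to the population value

control      <- rnorm(n = n_per_group, mean = 0, sd = 1)
intervention <- rnorm(n = n_per_group, mean = population_d, sd = 1)

# Cohen's d = mean difference / pooled SD.
# With equal group sizes, the pooled variance is the simple average of the two variances.
pooled_sd  <- sqrt((var(control) + var(intervention)) / 2)
observed_d <- (mean(intervention) - mean(control)) / pooled_sd

observed_d  # should be close to population_d, differing only by sampling error
```

With realistic sample sizes the observed d will vary much more around the population value, which is exactly what later simulations can explore.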

23.2 Other probability density functions

23.2.1 Generate a random number between 0 and 9

Use runif(). Look up the help page with ?runif if you need to.

23.2.2 Generate two random numbers between 0 and 9, return only the larger one

Use runif(). Look up the help page with ?runif if you need to.

There are multiple ways to do this. Don’t reach immediately for help from an AI: think about the logic of how you might do it, and write out increasingly detailed pseudocode.

23.2.3 Calculate the probability of n ‘heads’ among k coin flips

Use rbinom(). Look up the help page with ?rbinom if you need to.
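One possible way to think about this, sketched under assumptions rather than offered as the intended solution: simulate many sets of k flips with rbinom(), take the proportion of sets with exactly n heads, and compare that estimate with the exact binomial probability from dbinom(). The variable names and the values 3 and 10 are hypothetical. Note the naming clash: in rbinom(), the argument n is the number of simulated draws and size is the number of flips per draw.

```r
# Illustrative sketch: estimate P(exactly n_heads in k_flips) by simulation
# and compare with the exact binomial probability.
set.seed(123)

n_heads <- 3       # hypothetical number of heads of interest
k_flips <- 10      # hypothetical number of coin flips
n_sims  <- 100000  # number of simulated sets of k_flips flips

# Each element is the number of heads in one set of k_flips fair coin flips
heads_per_sim <- rbinom(n = n_sims, size = k_flips, prob = 0.5)

# Proportion of simulated sets with exactly n_heads heads
simulated_prob <- mean(heads_per_sim == n_heads)

# Exact probability from the binomial probability mass function
exact_prob <- dbinom(x = n_heads, size = k_flips, prob = 0.5)

c(simulated = simulated_prob, exact = exact_prob)
```

The two values should agree closely, and the gap shrinks as n_sims grows.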

FEEDBACK- questionable value, dbinom() instead? TODO

23.2.4 Binary outcomes with a treatment effect

Rewrite generate_data_two_group_binary() so that it returns id, condition (as a factor), and score. Use rbinom().

Simulate one dataset with prob_control = 0.20 and prob_intervention = 0.35 and verify the observed proportions with group_by() and summarise().

TODO: an in-between question is needed before this one to scale the difficulty; also, this doesn’t change much from the in-class code - @jamie

23.2.5 Missing data

Modify generate_data_two_group() so that a proportion of score values are set to NA completely at random.

Use [suggestion needed].

Add an argument missing_rate between 0 and 1. Test with missing_rate = 0, 0.10, and 0.40.
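As a hedged sketch of one possible approach (not necessarily the one the chapter intends), each value can be flagged as missing by an independent Bernoulli draw with probability missing_rate. The helper add_missing_at_random() below is hypothetical and works on a plain numeric vector rather than on the output of generate_data_two_group().

```r
# Hypothetical helper, for illustration only: set each element of x to NA
# independently with probability missing_rate (missing completely at random).
add_missing_at_random <- function(x, missing_rate) {
  stopifnot(missing_rate >= 0, missing_rate <= 1)
  # One Bernoulli draw per value: TRUE means the value becomes NA
  is_missing <- rbinom(n = length(x), size = 1, prob = missing_rate) == 1
  x[is_missing] <- NA
  x
}

set.seed(1)
score <- rnorm(n = 10000, mean = 0, sd = 1)

# Observed proportion of missing values at each tested rate
observed_rates <- sapply(
  c(0, 0.10, 0.40),
  function(rate) mean(is.na(add_missing_at_random(score, missing_rate = rate)))
)
observed_rates
```

Because each value is dropped independently, the observed proportion of NA values varies around missing_rate from run to run rather than matching it exactly.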

23.3 Multivariate normal distributions

I.e., correlated data.

23.3.1 Three time points

Write a function that returns three time points (pre, mid, post) with correlations:

  • cor(pre, mid) = 0.7
  • cor(mid, post) = 0.7
  • cor(pre, post) = 0.5

Assume the same mean and SD at each time point.

Use rnorm_multi().
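Before simulating, it can help to write the three correlations out as a full correlation matrix and check that it is valid, i.e., symmetric with all eigenvalues positive. The sketch below uses base R only; the object names are illustrative. My understanding is that the r argument of faux::rnorm_multi() can take such a matrix, but treat that as an assumption and check ?faux::rnorm_multi.

```r
# Illustrative sketch: build the 3 x 3 correlation matrix for pre, mid and post
# and check that it is a valid (positive definite) correlation matrix.
time_points <- c("pre", "mid", "post")

cor_matrix <- matrix(
  c(1.0, 0.7, 0.5,
    0.7, 1.0, 0.7,
    0.5, 0.7, 1.0),
  nrow = 3, byrow = TRUE,
  dimnames = list(time_points, time_points)
)

# A correlation matrix must be symmetric ...
is_symmetric <- isSymmetric(cor_matrix)

# ... and positive definite, i.e., all of its eigenvalues must be greater than zero
eigenvalues <- eigen(cor_matrix, symmetric = TRUE)$values
is_positive_definite <- all(eigenvalues > 0)

cor_matrix
c(symmetric = is_symmetric, positive_definite = is_positive_definite)
```

Not every combination of pairwise correlations passes this check, so it is worth running whenever you change the values.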

23.3.2 Scale items with rnorm_multi()

Using faux::rnorm_multi():

  • generate 8 items with inter-item correlation r=0.3
  • add an id column
  • compute a total score as a row-mean or row-sum

23.4 Reflection

  • Which features of the DGPs commonly assumed for psychological data are unrealistic for real psychological data? These are assumptions that we a) violate all the time, and yet b) usually don’t understand the implications of violating (i.e., whether we need to care). We can examine them in simulations in later chapters.