Notes - Automatic Feedback for R Exercises Using Shiny

I recently co-created a shiny application for the admiral hackathon in February 2023. The concept of the app seems quite promising for teaching data-manipulation in general, so I decided to publish the code on GitHub and write this short post for anyone interested.

But from the beginning: The admiral hackathon was an event designed to make statistical programmers from the pharmaceutical industry more comfortable with the admiral R package. This package allows users to efficiently transform data from one data standard (sdtm) to another (adam).

Hackathon participants formed groups of up to five people and were then tasked to create R-scripts that map the sdtm data to adam according to specifics defined in the metadata.

The purpose of the shiny app was threefold:

Allow participants to easily register as a team.
Let participants upload their R-scripts, so that the admiral team can see how the admiral package is being used – and how the functions and the documentation could be improved.
Run participants R-scripts on the cloud and compare the produced file to a solution file. Let participants autonomously get feedback on their work.

In this blog post I want to highlight some of the thoughts that went into this application. Please keep in mind that this work was done under tight time restraints.

The hackathon application is still online (although data-upload is switched off) and the GitHub repository is publicly available. I have also uploaded to GitHub a .zip file of the workspace to which hackathon participants had access via posit cloud. For more context you can watch recordings of the hackathon-meetings.

Permanent Data

The biggest challenge you have to consider for this app is the permanent data storage. Shiny apps run on a server. Although we can write files on this server, whenever the app restarts, the files are lost. Therefore, a persistent data storage solution is required.

Google drive

I decided to leverage Google drive using the googledrive package. This allowed me to save structured data (the team registry and the submission scores) as well as unstructured data (their R-script files).

Authentication

To access Google drive using the googledrive package we need to authenticate. This can be done interactively using the command googledrive::drive_auth() which takes you to the Google login page. After login you receive an authentication token requested by R.

For non-interactive authentication this token must be stored locally. In our case where the shiny app must access the token once deployed, the token must be stored on the project level.

I have included the authentication procedure I followed in the R folder in google_init.R. You can find more extensive documentation of the non-interactive authentication.

The initial concept was: Each team gets their own folder including the most recent submission for each task, and a .csv file containing team information. To keep track of the submissions and the respective scores we wrote a .csv file in the mock-hackathon folder, so one folder above the team folders.

Saving the team info as a .csv file worked fine as each team received their own file which – once created – was not touched anymore. As each upload for every team should simply add a row to the submissions.csv file, appending the file would be ideal. This was not possible using the googledrive package. Instead, for each submission, the submissions file was downloaded, appended, and uploaded again. Unfortunately, this lead to a data loss, as the file was continuously overwritten, especially when two teams would submit simultaneously.

Recover the Lost Data

Whenever the submissions.csv file was uploaded, the previous version was sent to the Google drive bin. We ended up with over 3000 submissions.csv files containing a lot of redundant information. I had to write the following chunk to first get the unique file IDs of the 3000 submissions.csv files, create an empty submissions data-frame, and then download each file and add its information to the submisisons data-frame. To keep the data-frame as light as possible, after each append I deleted all duplicate submissions.

# get all task_info.csv ID's
# each row identifies one file in the trash
task_info_master <- drive_find(
  pattern = "task_info.csv",
  trashed = TRUE
)

# set up empty df to store all submissions
origin <- tibble(
  score = numeric(),
  task = character(),
  team = character(),
  email = character(),
  time = character()
)

# downloads, reads, and returns one csv file given a file id
get_file <- function(row) {
  tf <- tempfile()
  row %>%
    as_id() %>%
    drive_download(path = tf)
  new <- read_csv(tf) %>%
    select(score, task, team) %>%
    distinct()
}

# quick and dirty for loop to subsequently download each file, extract information
#  merge with previous information and squash it (using distinct()).
for (i in 1:nrow(task_info_master)) {
  origin <- rbind(origin, get_file(row = task_info_master[i, ])) %>%
    distinct()

  # save progress in a separate file after every 100 downloaded and merged sheets
  if (i %% 100 == 0) {
    print(i)
    write_csv(origin, paste("prog_data/task_info_prog_", i, ".csv", sep = ""))
    # update on progress
    message(i / nrow(task_info_master) * 100)
  }
}

If you want to stay in the Google framework, I recommend using the googlesheets4 package for structured data. googlesheets4 allows appending new information to an already existing sheet without the need to download the file first. As both packages follow the same style, going from one to the other is really simple. googlesheets4 requires authentication as well. However, you can reuse the cached token from the googledrive authentication by setting gs4_auth(token = drive_token()).

Security Concerns

Connecting a public shiny app to your Google account introduces a security vulnerability in general. Especially so, because we implemented the upload of files to the Google drive. And even more problematic: We run a user generated script and display some of its output. A malicious party might be able to extract the authentication token of our Google account or could upload malware to the drive.

To reduce risk, I simply created an un-associated Google account to host the drive. There are certainly better options available, but this seemed a reasonable solution for basically zero effort.

Register Team

We wanted to allow users to sign up as teams using the shiny app. The app provides a simple interface where users could input a team name and the number of members. This in turn would open two fields for each user to input their name and email address.

We do simple checks to make sure at least one valid email address is supplied, and that the group name is acceptable. The group name cannot be empty, already taken, or contain vulgar words.

The team registration itself was adding the team information to the Google sheets file event_info into the sheet teams and to create a team folder in which to store the uploaded R files.

The checks and registration is implemented in the register_team() function stored in interact_with_google.R.

Upload & Source Script

To upload a script, participants had to select their team first. The input options were based on the existing folders on the Google-drive in the mock_hackathon folder. To upload a particular script participants had to also select the task to be solved. The uploaded script is then uploaded to the team folder following a standardised script naming convention.

There are different aspects to be aware of when sourcing scripts on a shiny server. For example, you have to anticipate the packages users will include in their uploaded scripts, as their scripts will load but not install packages. Further, you should keep the global environment of your shiny app separate from the environment in which the script is sourced. This is possible by supplying an environment to the source() function, e.g: source(path_to_script, local = new.env())

Another thing we had to consider was to replicate the exact folder-structure on the shiny server that participants were working with when creating the scripts, as they were required to source some scripts and to save their file into a specific folder. This was relatively straight forward as we provided participants with a folder structure in the posit cloud instance they were using. They had access to the sdtm folder in which the data was stored, and the adam folder into which they saved their solutions. The structure also included a folder with metadata which was also available on the shiny server. For some tasks, participants required some adam-datasets stored in the adam folder, essentially the output from previous tasks. On the shiny server this was solved by copying the required adam dataset from the key folder (where our solutions to the tasks were stored) to the adam folder. After sourcing the content in the adam folder was deleted.

Compare to Solution File

We want to compare the file created by participants with our solution (key) file stored in the key folder. The diffdf::diffdf() function allows for easy comparison of two data-frames and directly provides extensive feedback for the user:

library(dplyr)


Attaching package: 'dplyr'

The following objects are masked from 'package:stats':

    filter, lag

The following objects are masked from 'package:base':

    intersect, setdiff, setequal, union

df1 <- tibble(
  numbers = 1:10,
  letters = LETTERS[1:10]
)
df2 <- tibble(
  numbers = 1:10,
  letters = letters[1:10]
)

Score

To compare submissions between participants we implemented a simple scoring function (score_f()) based on the table comparison by diffdf(). The function can be found in the compare_dfs.R file:

score_f <- function(df_user, df_key, keys) {
  score <- 10
  diff <- diffdf::diffdf(df_user, df_key, keys = keys)
  if (!diffdf::diffdf_has_issues(diff)) {
    return(score)
  }

  if (!diffdf::diffdf_has_issues(diffdf::diffdf(df_user,
    df_key,
    keys = keys,
    strict_numeric = FALSE,
    strict_factor = FALSE
  ))) {
    return(score - 1)
  }
  return(round(min(max(score - length(diff) / 3, 1), 9), 2))
}

Every comparison starts with a score of 10. We then subtract the length of the comparison object divided by a factor of 3. The length of the comparison object is a simplified way to represent the difference between the two data-frames by one value. Finally, the score is bounded by min(score, 1).

The score is not a perfect capture of the quality of the script uploaded but: 1. helped participants get an idea of how close their data-frame is to the solution file 2. allowed us to raffle prizes based on the merit of submitted r-scripts

Last updated

2023-04-04 07:58:00 UTC

Details

source code, R environment

Reuse

https://creativecommons.org/licenses/by/4.0/

Citation

BibTeX citation:

@online{thoma2023,
  author = {Stefan Thoma},
  editor = {},
  title = {Automatic {Feedback} for {R} {Exercises} {Using} {Shiny}},
  date = {2023-03-14},
  url = {https://stefanthoma.github.io/quarto-blog//posts/2023-03-14_shiny_script_upload},
  langid = {en}
}

For attribution, please cite this work as:

Stefan Thoma. 2023. “Automatic Feedback for R Exercises Using Shiny.” March 14, 2023. https://stefanthoma.github.io/quarto-blog//posts/2023-03-14_shiny_script_upload.