Using Docker to Build Data Science Environments with RStudio

I have been using Docker to create environments for data science work. With Docker, I was able to create the environments painlessly and with a high degree of accuracy and consistency. After using Docker for environment creation, it is hard to imagine doing it any other way.

For more information, I encourage you to check out the Rocker images and this blog post about using RStudio and Docker.

Step 1: Create VM and update OS as necessary

I created the virtual machine on VMware using CentOS 7 and made it accessible through bridged networking. I used the CentOS minimal installation, since the VM needs only the basic components to run Docker. We will need SSH access to the VM, with port 8787 open for the RStudio Server instance.
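If the VM runs firewalld (the default firewall service on CentOS 7), port 8787 may need to be opened explicitly. The following is a minimal sketch under that assumption; the function name open_rstudio_port is only an illustrative wrapper and is not part of any tool.

```shell
# Sketch: open a TCP port for RStudio Server, assuming firewalld is in use.
# open_rstudio_port is a hypothetical helper name; the port defaults to 8787.
open_rstudio_port() {
  port="${1:-8787}"
  # Persist the rule across reboots, then reload so it takes effect now.
  sudo firewall-cmd --permanent --add-port="${port}/tcp" &&
    sudo firewall-cmd --reload
}
```

Calling open_rstudio_port with no argument opens 8787; running "sudo firewall-cmd --list-ports" afterwards should list the port if the rule was applied.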

Step 2: Access the VM via SSH (as a non-root user called docker_admin) and install Git with the sudo command.

Step 3: Install Docker for the non-root docker_admin user. Verify the installation with the command “docker image ls”.

More information on installing Docker CE can be found here and here.

It boils down to:

sudo yum install -y yum-utils device-mapper-persistent-data lvm2
sudo yum-config-manager --add-repo
sudo yum -y install docker-ce
sudo systemctl start docker && sudo systemctl enable docker
sudo usermod -aG docker docker_admin
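One detail worth checking after the usermod command: the new group membership applies only to new login sessions, so docker_admin usually has to log out and reconnect over SSH before running docker without sudo. A small sketch for confirming that the current session has picked up the docker group (session_has_docker_group is a hypothetical helper name):

```shell
# Sketch: succeed only if the current login session belongs to the docker group.
# session_has_docker_group is a hypothetical helper name.
session_has_docker_group() {
  # With no user argument, id reports the groups of the current session.
  # Put one group per line so grep -x can match the exact name.
  id -nG | tr ' ' '\n' | grep -qx docker
}
```

For example, "session_has_docker_group && docker image ls" runs the verification command only once the group is active in the session.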

Step 4: For my environments, I need to clone some R template scripts. This step is optional.

git clone examples

For my environments, I also need to make some environment variables accessible to the scripts. Again, this step may not be mandatory for your installation.

scp .Renviron docker_admin@<IP_Address>:/home/docker_admin

Step 5: Create the Dockerfile or use the one from the template directory

FROM rocker/verse
LABEL com.dainesanalytics.rstudio.version=v1.0
RUN Rscript -e "install.packages(c('knitr', 'tidyverse', 'caret', 'corrplot', 'mailR', 'DMwR', 'ROCR', 'Hmisc', 'randomForest', 'e1071', 'elasticnet', 'gbm', 'xgboost'))"
COPY --chown=rstudio:rstudio .Renviron /home/rstudio
COPY --chown=rstudio:rstudio examples/ /home/rstudio

My environments require many machine learning packages, but they may not all be necessary for your installation.

Step 6: Build the Docker image with the command:

docker image build -t rstudio/nonroot:v1 .
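To confirm that the build produced the tagged image before trying to run it, something like the following could be used. This is a sketch; image_exists is a hypothetical helper name, and it relies only on the -q option of "docker image ls", which prints image IDs.

```shell
# Sketch: succeed if an image with the given repository:tag exists locally.
# image_exists is a hypothetical helper name.
image_exists() {
  # -q prints only image IDs, so empty output means no matching image.
  [ -n "$(docker image ls -q "$1")" ]
}
```

For example, "image_exists rstudio/nonroot:v1 && echo built" prints "built" only when the image from this step is present.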

Step 7: Run the Docker container with the command:

docker container run --rm -e PASSWORD=rserver -p 8787:8787 --name rstudio-server rstudio/nonroot:v1

The password can be any string; RStudio Server just requires that one be set.
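As a variation, the password can be read from the shell environment instead of being typed into the command, and the container can be started in the background with -d. The sketch below assumes a variable named RSTUDIO_PASSWORD, which is my own naming and not something the image defines; run_rstudio is likewise a hypothetical helper name.

```shell
# Sketch: start the container detached, taking the password from the environment.
# run_rstudio and RSTUDIO_PASSWORD are hypothetical names used for illustration.
run_rstudio() {
  # Refuse to start without a password, since RStudio Server requires one.
  if [ -z "${RSTUDIO_PASSWORD:-}" ]; then
    echo "RSTUDIO_PASSWORD is not set" >&2
    return 1
  fi
  # Same options as the command above, plus -d to run in the background.
  docker container run --rm -d \
    -e PASSWORD="$RSTUDIO_PASSWORD" \
    -p 8787:8787 \
    --name rstudio-server \
    rstudio/nonroot:v1
}
```

After "RSTUDIO_PASSWORD=rserver run_rstudio", RStudio Server should be reachable in a browser at http://<IP_Address>:8787, logging in as the rstudio user with that password.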

Step 8: After we are done with the container and/or the virtual machine, we can shut down the container with the command:

docker container stop [container ID]
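Because the container in Step 7 was started with --name rstudio-server, it can also be stopped by name rather than by ID. A minimal sketch that stops the container only if it is running (stop_rstudio is a hypothetical helper name):

```shell
# Sketch: stop the named container if it is currently running.
# stop_rstudio is a hypothetical helper name; the name defaults to rstudio-server.
stop_rstudio() {
  name="${1:-rstudio-server}"
  # List running containers whose name contains $name; -q prints only IDs.
  if [ -n "$(docker container ls -q --filter "name=${name}")" ]; then
    docker container stop "$name"
  else
    echo "no running container named ${name}"
  fi
}
```

Since the container was started with --rm, Docker removes it automatically once it stops, so there is nothing further to clean up.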

The templates (R and Docker) can be found here on GitHub.