SE Can't Code

A Tokyo based Software Engineer. Not System Engineer :(

Building a data analysis environment with Docker

If you want to start analyzing data or using machine learning, you need an environment with R, Python, and the libraries for data analysis and ML installed. Data analysts often work in RStudio, Jupyter Notebook, or Spyder (a Python IDE similar to RStudio), so there are several good environments to choose from. If you work on a data analysis team, I recommend Docker. Docker is container software that packages everything needed to run an application: code, runtime, system tools, and system libraries, that is, anything that can be installed on a server. It helps most when you analyze data together with other analysts, because it makes results highly reproducible across environments.


With Docker, you can easily build an environment for data analysis and avoid reproducibility problems. Docker images that bundle the libraries needed for data analysis are especially useful. I'm going to introduce some of these images and show how to use them.

Reproducibility in data analysis

Sometimes you can't reproduce another analyst's results even when you have the same input data they used, because your environments differ. In software development, developers generally share the same environment, meaning the same libraries and tools at the same versions, often by using virtual environments to avoid such differences. Data analysis is no different. Analysts should use tooling to build a common environment so that everyone works with the same libraries, tools, and versions.

Using Docker for data analysis

Docker can solve the problem above. These days, software engineers use Docker in many situations, from building development environments to running production servers. For data analysis, Docker is a very useful tool because it has the following characteristics:

  • It is more lightweight than a hypervisor-based virtual machine.
  • An environment is easy to share among analysts through a Dockerfile.
  • An analysis environment is easy to build from a Docker image that already includes the necessary libraries.

Docker images for data analysis

Do you know Docker Hub? If you want to find an image that suits your purpose, go to Docker Hub and search for it there. Docker Hub is a cloud-based registry service that hosts Docker images. You can look for images you want to use, and you can also publish your own images and distribute them to other users for team collaboration.
There are Docker images for data analysis that include IDEs (RStudio, Jupyter Notebook, etc.) and libraries (NumPy, SciPy, pandas, scikit-learn, etc.) that are often used in data analysis. Generally, when analysts build a data analysis environment, they have to install each tool or library individually, such as RStudio and Anaconda (a Python distribution that includes NumPy, SciPy, pandas, and so on). But some Docker images already include these tools and libraries, which makes them very useful. I recommend building your analysis environment on top of such an image.

When using Python, you sometimes want to switch between Python 2 and Python 3. With the Jupyter images, you can easily switch between these versions in Jupyter Notebook. This matters a lot when you analyze data with Python.

How to use Docker

Before you start, you must install a Docker client. I think Docker for Mac and Docker for Windows are good choices.

I'm going to explain how to use Docker with jupyter/datascience-notebook as an example. First, pull the Docker image you want to use, then check that it has been downloaded by typing the commands below.

$ docker pull jupyter/datascience-notebook
$ docker images jupyter/datascience-notebook
REPOSITORY                     TAG                 IMAGE ID            CREATED             VIRTUAL SIZE
jupyter/datascience-notebook   latest              7e23bef2recd        1 hours ago         4.592 GB
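
Because the latest tag can point to a different build over time, teammates who pull on different days may not end up with the same image. One way to record exactly which image was pulled is to read its digest. The following is only a minimal sketch; the helper name image_digest is made up for this example.

```shell
# Hypothetical helper: print the repository digest of a locally pulled image.
# Usage: image_digest jupyter/datascience-notebook
image_digest() {
  # RepoDigests holds "name@sha256:..." entries for images pulled from a registry.
  docker inspect --format '{{index .RepoDigests 0}}' "$1"
}
```

A teammate can then pass the printed name@sha256:... value to docker pull to fetch the identical image.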

Next, let's run a container and publish port 8888. You can confirm that the container is running by opening Jupyter Notebook in your browser (with Docker for Mac or Windows, typically at http://localhost:8888).

$ docker run -d --name notebook -p 8888:8888 jupyter/datascience-notebook
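
If the notebook page does not load, or asks for a token, it can help to check the container from the command line. A rough sketch follows. It assumes the container name notebook from the command above, and the function names are only for illustration. Recent Jupyter images print a login URL containing a token to the container log, though the exact log format depends on the image version.

```shell
# Hypothetical helpers for checking the container started above.

# Show the name and status of the container called "notebook", if it is running.
notebook_status() {
  docker ps --filter "name=notebook" --format '{{.Names}}: {{.Status}}'
}

# Print the container's log; Jupyter writes its startup messages
# (including any login URL) here.
notebook_logs() {
  docker logs notebook
}
```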

Jupyter Notebook is a very useful tool when you analyze data and build machine learning models.
When you need to install a new library or tool into the Docker image, you can create your own analysis environment by writing a Dockerfile.

For example, if you want to install networkx, a well-known Python library for drawing and analyzing network graphs, write the pip command in a Dockerfile like this:

FROM <image>
RUN pip install networkx

After writing your own Dockerfile, build a Docker image from it as follows:

$ docker build -t jupyter/datascience-notebook [path to the directory containing the Dockerfile]
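
As a sketch of the whole flow, the helper below builds the image under its own tag and then checks that networkx can be imported inside it. The tag my-datascience-notebook and the function name are hypothetical; using a tag different from the base image keeps the original jupyter/datascience-notebook image untouched on your machine. It also assumes the FROM line of the Dockerfile points at a Jupyter image that has python on its PATH.

```shell
# Hypothetical helper: build a custom image and verify that networkx is installed.
# $1 is the directory that contains the Dockerfile.
build_and_check() {
  docker build -t my-datascience-notebook "$1" &&
    docker run --rm my-datascience-notebook \
      python -c "import networkx; print(networkx.__version__)"
}
```

For example, calling build_and_check . from the directory that holds the Dockerfile should print the installed networkx version if the build succeeded.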

Docker has other useful commands and options. For example, you can mount a local directory into a container by adding the -v option to the docker run command. I always include it in the command I use.
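
A minimal sketch of a docker run command with -v is shown here as a small shell function. It assumes the Jupyter images use /home/jovyan/work as the notebook working directory (check the image documentation for your version). The function name and the default of mounting the current directory are choices made for illustration.

```shell
# Hypothetical helper: start the notebook container with a local directory mounted.
# $1 is the host directory to share; it defaults to the current directory.
run_notebook() {
  host_dir="${1:-$PWD}"
  # -v host_path:container_path makes the host directory visible inside the container.
  docker run -d --name notebook -p 8888:8888 \
    -v "$host_dir":/home/jovyan/work \
    jupyter/datascience-notebook
}
```

With a mount like this, notebooks and data saved under the mounted directory stay on the host even after the container is removed.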


On my data analysis team at my company, we (though not every member) have been analyzing data with Jupyter Notebook on Docker. I think it is a good tool for building environments because we can quickly and easily set up the same environment for every analyst. If you want to start analyzing data or studying machine learning, I recommend using Docker. You shouldn't spend your time building an environment; you should spend it analyzing data and studying statistics and machine learning. Docker will help you get to the actual problems of data analysis sooner.