If you want to start analyzing data or using machine learning, you first need an environment with R or Python installed, along with the libraries used for data analysis and ML. Data analysts often work in RStudio, Jupyter Notebook, or Spyder (a Python IDE similar to RStudio), so there are several useful environments to choose from. If you belong to a data analysis team, I recommend using Docker, container software that wraps a piece of software together with everything it needs to run: code, runtime, system tools, system libraries, anything that can be installed on a server. It works well when you analyze data alongside other analysts, because it makes results highly reproducible across environments.
Using Docker, you can easily build an environment for data analysis and avoid reproducibility problems. Docker images that include the libraries needed for data analysis are especially useful. I'm going to introduce some of these images and explain how to use them.
Reproducibility in data analysis
There are cases where you cannot reproduce another analyst's results even with the same input data, because your environments differ. In software development, teams generally share the same environment, meaning the same libraries and tools at the same versions, often by using virtual environments to avoid such differences. The same applies to data analysis: analysts should build identical environments so that everyone uses the same libraries, tools, and versions.
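One way to see how far two environments have drifted apart is to compare their installed package lists. The sketch below assumes each analyst has saved the output of `pip freeze` to a text file; the function name `compare_envs` and the file names are hypothetical and only serve as illustration.

```shell
# Print the packages whose pinned versions differ between two
# `pip freeze` outputs. Lines unique to the first file appear in the
# first column, lines unique to the second file are indented by a tab.
compare_envs() {
    mine="$1"
    theirs="$2"
    tmp_a=$(mktemp)
    tmp_b=$(mktemp)
    # comm needs sorted input; use the C locale so sort and comm agree.
    LC_ALL=C sort "$mine" > "$tmp_a"
    LC_ALL=C sort "$theirs" > "$tmp_b"
    # -3 suppresses lines common to both files, leaving only differences.
    LC_ALL=C comm -3 "$tmp_a" "$tmp_b"
    rm -f "$tmp_a" "$tmp_b"
}
```

After running `pip freeze > mine.txt` on one machine and `pip freeze > theirs.txt` on another, `compare_envs mine.txt theirs.txt` prints nothing when both environments pin the same versions.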
Using Docker for data analysis
We can solve the problem above with Docker. Software engineers now use Docker in many situations, from building development environments to running production servers. For data analysis, Docker is a very useful tool because of the following characteristics:
- It is more lightweight than a hypervisor-based virtual machine.
- It is easy to share an environment between analysts.
- It is easy to build an analysis environment from a Docker image that includes the libraries you need.
Docker images for data analysis
Do you know Docker Hub? If you want to find an image that suits your purpose, go to Docker Hub and search for it there. Docker Hub is a cloud-based registry service that hosts Docker images. You can search for the image you want to use, and you can also publish your own Docker image and distribute it to other users for team collaboration.
There are Docker images for data analysis that include IDEs (RStudio, Jupyter Notebook, etc.) and libraries (NumPy, SciPy, pandas, scikit-learn, etc.) often used in data analysis. Generally, when analysts build an analysis environment, they have to install each tool or library individually, such as RStudio and Anaconda (a Python distribution that includes NumPy, SciPy, pandas, and so on). But there are Docker images that already include these tools and libraries, which is very useful. I recommend building your analysis environment on one of the images below:
- jupyter/all-spark-notebook: Jupyter Notebook with Scala, R, Spark, and the Mesos stack.
- jupyter/pyspark-notebook: similar to the image above.
- jupyter/datascience-notebook: includes the Python tools needed for data analysis.
- jupyter/notebook: a simple image for running Jupyter Notebook.
- rocker/hadleyverse: RStudio, LaTeX, and a set of R packages.
- rocker/rstudio: a simple image for running RStudio.
When using Python, there are times when you want to switch between Python 2 and Python 3. With the jupyter images, you can easily switch between these versions in Jupyter Notebook. This is very important when you analyze data with Python.
How to use Docker
I'm going to explain how to use Docker, using jupyter/datascience-notebook as an example. First, pull the Docker image you want to use and check that it has been downloaded by typing the commands below.
$ docker pull jupyter/datascience-notebook
$ docker images jupyter/datascience-notebook
REPOSITORY                     TAG      IMAGE ID       CREATED       VIRTUAL SIZE
jupyter/datascience-notebook   latest   7e23bef2recd   1 hours ago   4.592 GB
Next, let's run a container and publish port 8888. You can confirm that the container is running by opening Jupyter Notebook in your browser.
$ docker run -d --name notebook -p 8888:8888 jupyter/datascience-notebook
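Besides opening the notebook in a browser, the container's state can also be checked from the command line. The following is a minimal sketch built on `docker ps`; the helper name `is_running` is hypothetical.

```shell
# Succeed (exit status 0) if a container with exactly this name is running.
is_running() {
    name="$1"
    # The name filter of `docker ps` matches substrings, so grep -x is
    # used to require an exact match on the printed container names.
    docker ps --filter "name=$name" --format '{{.Names}}' | grep -Fqx "$name"
}
```

For example, `is_running notebook && echo "notebook is up"` prints the message only while the container started above is running. If the page does not load, `docker logs notebook` shows the server's startup output, which on newer images may include a login token.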
Jupyter Notebook is a very useful tool when you analyze data and build machine learning models.
When you have to install a new library or tool into the Docker image, you can create your own analysis environment by writing a Dockerfile. For example, to install networkx, a well-known Python library for drawing and analyzing network graphs, add a pip command to the Dockerfile as below:
FROM <image>
MAINTAINER <name>
RUN pip install networkx
After you have written your own Dockerfile, build the image:
$ docker build -t jupyter/datascience-notebook [path to the directory containing the Dockerfile]
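After a build, it is worth checking that the new library can actually be imported inside the image. The sketch below is one possible way to do that, not the only one; the function name `build_and_check` and the tag `myteam/datascience-notebook` used in the usage note are hypothetical, and giving your image its own tag keeps it distinct from the upstream one.

```shell
# Build an image from the Dockerfile in the given directory, then start
# a throwaway container from it that imports networkx and prints its version.
build_and_check() {
    tag="$1"      # name for the new image (hypothetical, choose your own)
    context="$2"  # directory that contains the Dockerfile
    docker build -t "$tag" "$context" &&
    # --rm removes the container again once the check has finished.
    docker run --rm "$tag" python -c 'import networkx; print(networkx.__version__)'
}
```

Calling `build_and_check myteam/datascience-notebook .` from the directory that holds the Dockerfile should end by printing the installed networkx version; an ImportError instead means the `RUN pip install` step did not take effect.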
Docker has some useful commands and options. For example, you can mount a local directory into a Docker container by adding the -v parameter to the docker run command. It is part of the command I always use.
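As a rough illustration of the -v parameter, the sketch below starts the same notebook container as before with a host directory mounted into it. The function name `start_notebook` is hypothetical, and `/home/jovyan/work` is assumed to be the notebook working directory of the jupyter images; check the image documentation for the version you use.

```shell
# Start the notebook container with a host directory mounted into it,
# so notebooks and data stay on the host when the container is removed.
start_notebook() {
    host_dir="$1"  # absolute path on the host, e.g. "$PWD"
    # -v HOST:CONTAINER mounts the host directory at the container path.
    # /home/jovyan/work is an assumption about the image layout.
    docker run -d --name notebook -p 8888:8888 \
        -v "$host_dir:/home/jovyan/work" \
        jupyter/datascience-notebook
}
```

Running `start_notebook "$PWD"` from a project directory makes the files in that directory visible inside Jupyter Notebook, and anything saved there from the notebook is written back to the host.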
In my team, the data analysis team at my company, we analyze data using Jupyter Notebook on Docker (though not every member does). I think it is a good tool for building environments because we can quickly and easily build the same environment for every analyst. If you want to start analyzing data or studying machine learning, I recommend using Docker. You shouldn't spend much time building an environment; you should spend more time analyzing data and studying statistics and machine learning. Docker will help you get to the actual problems of data analysis sooner.