Set up a data science development environment

ipython notebook screenshot

This article explains how to set up a data science development environment using Python as the programming language and Ubuntu as the operating system. For this setup I will assume that you want to use a virtual machine. I am going to use the latest version of Ubuntu, which at the time of writing is Ubuntu 13.10.

You can get the iso from here: http://www.ubuntu.com/download/desktop

If you want a headless version of Ubuntu you can use Ubuntu Server for Cloud. This is a good choice if you want to set up IPython notebook and access it remotely.

Setting up remote access

After you have completed the install, open up a terminal. The first thing I like to do is set up SSH so that I can get remote access to the machine.

sudo apt-get update
sudo apt-get install -y openssh-server

Now you should be able to remotely access the virtual machine from your desktop console. For this setup the virtual machine has the IP address 192.168.1.3:

Remote access

ssh mark@192.168.1.3
mark@192.168.1.3's password: *****
Welcome to Ubuntu 13.10 (GNU/Linux 3.11.0-14-generic i686)

     * Documentation:  https://help.ubuntu.com/

165 packages can be updated.
19 updates are security updates.

Last login: Thu Dec 12 14:14:36 2013 from 192.168.1.5
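If the ssh command hangs or is refused, it helps to know whether the SSH port on the virtual machine is reachable at all. The following is a minimal sketch of such a check using only the standard library. The function name port_is_open is my own choice for illustration, and the default timeout is an arbitrary assumption.

```python
import socket


def port_is_open(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        # create_connection resolves the host and performs the TCP handshake.
        sock = socket.create_connection((host, port), timeout=timeout)
    except (socket.error, socket.timeout):
        # Refused, unreachable, or timed out: treat all of these as "not open".
        return False
    sock.close()
    return True
```

With the example address used in this article, calling port_is_open("192.168.1.3", 22) from the desktop machine should return True once openssh-server is running.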

Ubuntu already has Python preinstalled. We can check the version number at the console:

$ python --version
Python 2.7.5+

For the moment it is better to use version 2 of Python, because not all the libraries we need work with version 3 yet.
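The same check can be made from inside Python, which is useful in scripts that should refuse to run on the wrong interpreter. This is a small sketch. The function name interpreter_summary is hypothetical, and it accepts the version tuple as a parameter only so that it can be tried with different values.

```python
import sys


def interpreter_summary(version_info=sys.version_info):
    """Return ((major, minor), is_python2) for the given version tuple."""
    major, minor = version_info[0], version_info[1]
    return (major, minor), major == 2
```

On the Ubuntu 13.10 install described here, which reports Python 2.7.5+, interpreter_summary() would be expected to give ((2, 7), True).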


[Optional] Increasing terminal console resolution

When I log in to a virtual machine directly (not via SSH), I like to increase the resolution of the terminal console. Open the /etc/default/grub file and update the GRUB_CMDLINE_LINUX_DEFAULT line:

GRUB_CMDLINE_LINUX_DEFAULT="quiet splash vga=795"

The vga=795 argument sets the console resolution to 1280×1024 with 24-bit colours. The ConsoleFramebuffer link has a list of vga codes for different screen resolutions.
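If you prefer to script this change rather than edit the file by hand, the edit can be expressed as a pure text transformation. The sketch below is one possible approach, not a standard tool: add_kernel_arg is a name I made up, and it only works on the file contents passed in as a string, so reading and writing /etc/default/grub (which needs root) is left to you.

```python
import re


def add_kernel_arg(grub_text, arg, key="GRUB_CMDLINE_LINUX_DEFAULT"):
    """Return grub_text with arg added to the quoted value of key.

    If an argument with the same name (the part before '=') is already
    present, it is replaced rather than duplicated.
    """
    name = arg.split("=", 1)[0]
    # Match a whole line of the form KEY="...". Only spaces and tabs are
    # allowed after the closing quote so that newlines are left untouched.
    pattern = re.compile(
        r'^(%s=")([^"]*)(")[ \t]*$' % re.escape(key), re.MULTILINE
    )

    def rewrite(match):
        kept = [a for a in match.group(2).split()
                if a.split("=", 1)[0] != name]
        return match.group(1) + " ".join(kept + [arg]) + match.group(3)

    return pattern.sub(rewrite, grub_text)
```

Calling add_kernel_arg(text, "vga=795") on the contents of the file leaves every other line unchanged, and calling it again with a different vga value swaps the code instead of appending a second one.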

Next we need to regenerate the grub config file and reboot the machine.

sudo update-grub # wrapper that runs grub-mkconfig -o /boot/grub/grub.cfg
sudo shutdown -r now

Setting up a virtual environment for Python

The next step is to set up a virtual environment. This is a safer approach because each environment keeps its own set of packages, which helps avoid version conflicts between different applications.

Install virtual environment and pip, a python package manager

sudo apt-get install -y python-virtualenv python-pip
sudo apt-get build-dep python-numpy python-scipy

Now we need to create and activate the virtual environment.

Setting up the virtual environment

virtualenv $HOME/system
source $HOME/system/bin/activate
echo 'source $HOME/system/bin/activate' >> ~/.bashrc
pip install -U pip # update pip to the latest version

The bash prompt should now be prefixed with (system) to show that we are in the virtual environment. We also add the source statement to the end of the .bashrc file so that every time we open a new terminal window the virtual environment is activated.
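Besides looking at the prompt, you can ask the interpreter itself whether it is running inside a virtual environment. This is a minimal sketch, and the function name is my own. It relies on sys.real_prefix, which the classic virtualenv tool sets, and falls back to comparing sys.base_prefix with sys.prefix, which is how environments created by the standard-library venv module and newer virtualenv releases can be recognised.

```python
import sys


def in_virtual_environment():
    """Return True if the running interpreter belongs to a virtual environment."""
    # Classic virtualenv sets sys.real_prefix to the system Python's prefix.
    if hasattr(sys, "real_prefix"):
        return True
    # venv-style environments make sys.base_prefix differ from sys.prefix.
    # Interpreters without base_prefix fall back to "not in an environment".
    return getattr(sys, "base_prefix", sys.prefix) != sys.prefix
```

Running this in a fresh terminal after the .bashrc change is a quick way to confirm that the activation line took effect.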


Install python libraries

Now we are ready to install the libraries we need. First we install numpy:

Install numpy

pip install -U numpy

After the install we check that all the tests run correctly.

pip install nose # needed to run tests
python -c "import numpy; numpy.test()"

Running unit tests for numpy
NumPy version 1.8.0
NumPy is installed in /home/mark/system/local/lib/python2.7/site-packages/numpy
Python version 2.7.5+ (default, Sep 19 2013, 13:49:51) [GCC 4.8.1]
nose version 1.3.0
...
Ran 4977 tests in 43.910s

OK (KNOWNFAIL=5, SKIP=7)

Now we can install scipy.

Install scipy

pip install -U scipy

After the install we check that all the tests run correctly.

python -c "import scipy; scipy.test()"
Running unit tests for scipy
NumPy version 1.8.0
NumPy is installed in /home/mark/system/local/lib/python2.7/site-packages/numpy
SciPy version 0.13.2
SciPy is installed in /home/mark/system/local/lib/python2.7/site-packages/scipy
Python version 2.7.5+ (default, Sep 19 2013, 13:49:51) [GCC 4.8.1]
nose version 1.3.0
...
Ran 8938 tests in 87.570s

OK (KNOWNFAIL=115, SKIP=209)

We now have the two major libraries installed but there are still plenty more libraries that are useful.

Installing other libraries

pip install scikit-learn
pip install pandas
pip install patsy # required by statsmodels
pip install statsmodels
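After installing several packages it is worth confirming that each one can actually be imported from the virtual environment. The sketch below reports the version of each module, or None when the import fails. The function name installed_versions is hypothetical, and note that the import name is not always the pip name: scikit-learn is imported as sklearn.

```python
import importlib


def installed_versions(module_names):
    """Map each module name to its version string, or None if it cannot be imported."""
    versions = {}
    for name in module_names:
        try:
            module = importlib.import_module(name)
        except ImportError:
            versions[name] = None
        else:
            # Not every module defines __version__, so fall back to a marker.
            versions[name] = getattr(module, "__version__", "unknown")
    return versions
```

For the libraries installed so far you could call installed_versions(["numpy", "scipy", "sklearn", "pandas", "patsy", "statsmodels"]) and look for any None values.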

We are now able to do plenty of data science programming but if we want to visualize the results then we still have work to do.

We want to install matplotlib but first we should install some graphical and networking libraries.

Install the libraries needed for matplotlib

sudo apt-get install libpng-dev libjpeg8-dev libfreetype6-dev
sudo apt-get install software-properties-common
sudo apt-get install python-software-properties
sudo add-apt-repository -y ppa:chris-lea/zeromq
sudo add-apt-repository -y ppa:chris-lea/libpgm
sudo apt-get install -y libzmq1
sudo apt-get install -y libzmq-dev
sudo apt-get install -y libpgm-5.1-0
sudo apt-get install python-dev
pip install pyzmq

Now matplotlib should install correctly. At this time we also install ipython (an interactive python shell).

Install matplotlib

pip install -U distribute # on Ubuntu 12.04 the installed version is too old, so update it first
pip install matplotlib
pip install jinja2
pip install ipython
pip install pygments

Running ipython notebook on the server

To start the notebook on the server, listening on all network interfaces on port 5555:

ipython notebook --ip=0.0.0.0 --port=5555
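Once the notebook server is running you can open it in a browser on your desktop, or check from a script that it answers HTTP requests. The following is a rough sketch of such a check, written to run on either Python 2 or Python 3. The function name notebook_responds is made up for illustration, and the choice to ignore proxy settings is my assumption, on the grounds that the virtual machine sits on the local network.

```python
try:
    from urllib.request import ProxyHandler, build_opener  # Python 3
except ImportError:
    from urllib2 import ProxyHandler, build_opener  # Python 2


def notebook_responds(url, timeout=5.0):
    """Return True if an HTTP GET to url ends in a 2xx response."""
    # An empty ProxyHandler disables proxy settings taken from the environment.
    opener = build_opener(ProxyHandler({}))
    try:
        response = opener.open(url, timeout=timeout)
    except (IOError, ValueError):
        # Connection errors, HTTP error statuses, and malformed URLs.
        return False
    try:
        return 200 <= response.getcode() < 300
    finally:
        response.close()
```

With the example address from earlier in this article, the call would be notebook_responds("http://192.168.1.3:5555/").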
