Back in 2015 we were contacted by the VUB Department of Hydrology and Hydraulic Engineering (HYDR for short) to see if we could run SWAT (Soil and Water Assessment Tool) and Python scripts in a cloud-based environment. The main goal was to have raw computing power at hand whenever a complex model had to be calculated.
We extended the original request for setting up a cloud hosting environment for SWAT and started thinking about going to the next level: getting more computing power from cloud-based systems.
It was not until early 2017 that HYDR and Kinamo revived the original idea and started rethinking how we could make this work effectively and move beyond the “run some scripts on your laptop or server” approach. Unlocking cloud CPUs would put high-performance computing (HPC) power at our fingertips and deliver major gains in speed and efficiency.
Introducing Jupyter Notebook
Jupyter Notebook started out as IPython Notebook and provides an interactive notebook environment for working with your data in the programming language of your choice. While it originally supported only Python, it has been extended to R, F#, … basically any language for which a Jupyter kernel is available. Check out the extensive (and growing) list of Jupyter kernels on GitHub.
The initial idea for using Jupyter Notebook came from Ann. It was uncertain, however, whether SWAT (Soil & Water Assessment Tool) would run in such an environment, since it is called from within the scripting languages used, and above all we aimed to use multiple CPU cores. Let’s face it, what would be the advantage of running an HPC cluster if you only use one core…
The SWAT question aside, the choice for Jupyter was pretty straightforward. It allows:
- Independent usage of programming languages (as mentioned, the available Jupyter kernels determine which languages you can use in the notebook)
- Sharing of notebooks, and thus of data and results
- Integration with Apache Spark, but also spawning Docker containers, using Kubernetes… basically modern-age computing!
Our first tests used a basic Notebook installation and mainly focused on getting SWAT to run in the Jupyter environment. It proved not to be straightforward, but with the help of the HYDR department we managed to sort out the requirements.
The main requirement: since SWAT is mainly used on Windows machines and is based on .NET, setting it up on Linux demands a Mono installation (Mono provides a cross-platform implementation of the .NET Framework; more information is available on the mono-project website).
The people from the HYDR department got the models up and running, but the next hurdle was the number of CPU cores. Our primary choice for scripting was R, due to our initial idea to head in the Mosix direction (an idea we ditched later on). By default any R script will use only one core; you have to tell R explicitly to use multiple cores.
Going Parallel in R
If you want R to use multiple cores, you must load additional libraries. We ended up using the “parallel” library in R, but you do need a fairly recent R version for it. We will add details on how to make the script use multiple cores effectively. How much you gain depends on your code, however: only work that can be split into independent pieces can be spread over several cores.
In the end, we were able to run a single Notebook, with R as the programming language, on 8 cores. Silly as it is, that big block of CPU usage actually made us and the HYDR team very happy!
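For readers who want something concrete in the meantime, here is a minimal sketch of the general pattern, written in Python (the other scripting language from the original request) rather than R. It is not the script we used. It assumes the model is an external executable that is started once per run; the command below is a hypothetical stand-in that merely squares a number, and the names `run_one` and `run_all` are ours, not part of SWAT or Jupyter. Each run is a separate operating-system process, so several runs can occupy several cores at once while the notebook itself only waits for the results.

```python
import os
import subprocess
import sys
from concurrent.futures import ThreadPoolExecutor

# Hypothetical stand-in for the model executable: a tiny Python one-liner
# that squares its argument. In a real setup this would be the actual
# model command line (an assumption, not taken from the project).
STAND_IN_COMMAND = [sys.executable, "-c", "import sys; print(int(sys.argv[1]) ** 2)"]


def run_one(run_id):
    """Launch one external run and return its standard output as text."""
    completed = subprocess.run(
        STAND_IN_COMMAND + [str(run_id)],
        capture_output=True,  # collect stdout/stderr instead of printing them
        text=True,            # decode the output to str
        check=True,           # raise CalledProcessError on a non-zero exit code
    )
    return completed.stdout.strip()


def run_all(run_ids, workers=None):
    """Run the external command once per run_id, several runs at a time.

    Threads are enough here: each thread only waits for its child process,
    and the child processes are what keep the CPU cores busy.
    """
    workers = workers or os.cpu_count() or 1
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # pool.map returns the results in the same order as run_ids
        return list(pool.map(run_one, run_ids))


if __name__ == "__main__":
    print(run_all(range(8), workers=8))
```

In a real setup the stand-in command would be replaced by the actual model command line, and the number of workers would typically match the number of cores available to the notebook.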
This all sounds very academic (and it is), but this post will be continued in a second part. We’ll dive deeper into our reasons for extending the (very basic) Jupyter Notebook installation with JupyterHub, which allows us to move from a simple, single instance to what we are really aiming for: Jupyter hosting with centralized deployment and excellent data integration. In other words, every data scientist’s dream, and within reach.
Watch this space!
… curious about our Jupyter experiments? Contact us, we’d be happy to brainstorm!