From Science to Production in Minimum Effort

Elior Cohen
October 11, 2018

Hotlinks:
- Denzel Deployment Framework

Over the past year, working through numerous data science projects at the organization I work for, I’ve spent hundreds of hours collaborating with fellow data scientists.
Since the data science field is quite young, it is not surprising that my co-workers, as well as data scientists in general, come from various fields.
Mathematics, statistics, physics and even psychology: this diversity is quite positive, as each person brings strengths and perspectives from their respective field.
I myself come from software engineering, which naturally makes me inspect processes from an engineer’s perspective.

The gap between science and engineering

Working through a data science project requires many skills.
One should have mathematical understanding and intuition about the nature of the algorithms, statistical knowledge to form and test hypotheses about the problem, familiarity with algorithms and state-of-the-art solutions, and much more.
Engineering and programming skills, however, don’t have to be at an expert level for someone to be a good data scientist.
It is absolutely possible to be a top field researcher without ever knowing what HTTP verbs are, how to manage task queues with workers and brokers, or how to build an in-browser monitoring dashboard.
Production systems, on the other hand, require quite advanced engineering skills.

From my engineering point of view, I’ve noticed that this gap between science and engineering manifests itself once a data science project has to go from the research stage into production.
I’ve seen data scientists struggle with frameworks and spend unnecessary time learning tools that are outside their required skill scope.
On the other hand, I’ve seen production teams handed projects from the data science team that they neither understand nor have any idea how to incorporate into their organization’s production cycle.
Since this problem was real and recurring, I decided to take a step toward closing this gap with an open source Python package.

Meet Denzel

Before I opened my Dracula-themed editor to start coding a framework, I sat down to define the requirements:

  1. Data scientist first. The framework should have as little learning overhead as possible for the data scientist and abstract away the heavy lifting.
  2. Lean and fast. Development within the framework should be quick, and the result should be as lean as possible.
  3. Production ready. The framework should use production-grade tools and practices, and its output should be something that can easily be handed to a production engineer to operate.

Three requirements, dozens of hours of coding, and months of testing and real production use later, the denzel package was born, and it is now open sourced for public use.

Denzel is a minimal framework for deploying trained machine learning models.
Once you have a trained model persisted to disk, all you need to do is implement four simple Python functions (sketched after the list below), and denzel will give you:

  1. A Docker-containerized project
  2. An exposed API for end users to interact with your model
  3. A task management system to queue up and execute predictions
  4. An in-browser UI dashboard for monitoring your deployment
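To give a feel for what those four functions might look like, here is a minimal sketch for a scikit-learn model persisted with pickle. The function names, the request schema and the file path used here are illustrative assumptions on my part; the exact interface denzel expects is spelled out in its documentation.

```python
# A minimal sketch, assuming a scikit-learn model persisted with pickle.
# The function names, the request schema and the file path below are
# illustrative; consult the denzel documentation for the exact interface.
import pickle


def load_model():
    """Load the persisted model from disk once, when the deployment starts."""
    with open('./app/assets/model.pkl', 'rb') as model_file:
        return pickle.load(model_file)


def verify_input(json_data):
    """Reject malformed requests before they reach the model."""
    if 'data' not in json_data:
        raise ValueError("Request payload must contain a 'data' field")
    return json_data


def process(model, json_data):
    """Turn the raw request payload into the feature rows the model expects."""
    return [[row['feature_1'], row['feature_2']]
            for row in json_data['data'].values()]


def predict(model, processed_data):
    """Run the model and return a JSON-serializable response."""
    predictions = model.predict(processed_data)
    return {'predictions': predictions.tolist()}
```

With functions along these lines in place, the heavy lifting, meaning the containerization, the API layer, the task queue and the monitoring dashboard, is generated for you.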

One of the important aspects here is that the result is Docker-containerized. This means that anyone who knows how to manage Docker containers in production can apply any production tool (like Kubernetes, Rancher, etc.) and has limitless possibilities for scaling and deploying the project wherever needed. Also, all the major cloud providers support Docker deployment. All this magic comes without requiring the data scientist to learn Docker, API building, task management or UI design.

This project is fully supported by my employer, Data Science Group Ltd., and is currently in use in a number of different deployments.
It saves our data scientists a lot of unnecessary time and gives our clients a convenient way to integrate our solutions into their production systems.

Denzel is currently in its alpha stage, which means features will still be added to it, and open sourcing it is the best way to see what the data science community really needs from it.

To use it, check out the documentation and especially the tutorial; an hour or so and roughly 60 lines of code later, you’ll have your model deployed.

I hope you enjoyed this read and that denzel will serve you well.

Happy deploying :)