You are a data scientist or engaged in a data science project in your organization. Congratulations! You have one of the most interesting, influential, and intellectually stimulating jobs on the market. You've mastered statistics and machine learning, become a programming wizard, an expert in visualization, a big data evangelist, and a math god.
So, is this enough? Right?
Over the last three years, our group has led numerous data science projects across diverse verticals, including ad tech, fintech, health tech, cloud computing, security, and the telecom industry. Surprisingly, many of our projects share similar attributes despite originating from different domains. Trivial commonalities are evident in the employed algorithms, platforms, and tools. But more important similarities lie in the life cycles of data science projects, from inception to production. I would claim that a successful data science project likely stems not from the technical skills mentioned above, but from something far more fundamental: classic scientific methods and research design.
It's all about defining the theoretical research problem and its operationalization. The world is not Kaggle. If this process is not done correctly, you will invariably find yourself answering the wrong question, or not answering any question at all.
The theoretical research problem, or business problem, must not be defined in vague terms, but rather in concise and coherent terms that are easily comprehensible to non-technical people. This seems trivial but is a point often neglected. Never agree to do a project in which you are asked to "tell us something about the data". This will almost certainly lead to project failure, dissatisfaction among your clients or bosses, and overall frustration. Try to lead your team to identify the exact definition of the theoretical research/business question. For example, in a classification task of bad vs. good clients, try to define which theoretical aspects are of interest to this definition – their payment history, engagement level, location in a social network, etc.
All of your theoretical variables need to be measurable. The process of converting conceptual theoretical problems into measurable variables is called operationalization. For example, if the theoretical research problem is defined as classifying good vs. bad users in terms of their usage patterns, the operationalization process needs to address the exact quantifiable definition of the "goodness" of users. One way of establishing this would be to count the number of clicks a user makes in a session relative to other users in the same geographical location, during the same day and hour. This definition must be known and acceptable to all people involved in the project.
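To make this concrete, here is a minimal sketch of such an operationalization. The data, the threshold of 1.0, and the assumption of one session per user are all hypothetical, chosen only to illustrate the idea of turning "goodness" into a measurable quantity:

```python
from collections import defaultdict
from statistics import mean

# Hypothetical session records: (user, location, day, hour, clicks).
sessions = [
    ("alice", "NYC", "2024-01-01", 10, 42),
    ("bob",   "NYC", "2024-01-01", 10, 7),
    ("carol", "NYC", "2024-01-01", 10, 21),
    ("dave",  "SF",  "2024-01-01", 10, 5),
]

def click_ratios(sessions):
    """Each user's clicks relative to the mean clicks of all sessions
    in the same (location, day, hour) group."""
    groups = defaultdict(list)
    for user, loc, day, hour, clicks in sessions:
        groups[(loc, day, hour)].append(clicks)
    return {
        user: clicks / mean(groups[(loc, day, hour)])
        for user, loc, day, hour, clicks in sessions
    }

ratios = click_ratios(sessions)
# Hypothetical cutoff: "good" users click at least as much as their peers.
good_users = {user for user, ratio in ratios.items() if ratio >= 1.0}
```

The point is not this particular metric; it is that everyone on the project can read the definition, compute it, and agree on it.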
The approach you choose matters, whether you implement XGBoost, random forests, HMMs, CNNs, or RNNs. Moreover, you can engineer clever features, produce amazing visualizations, and optimize scalable code. Nevertheless, if you are not addressing a concisely defined business problem and its operationalization, you are essentially pushing water uphill with a rake.
Defining the theoretical research problem and its operationalization is a continuous, dynamic process that tends to change over time. This is fine, as long as you and the other people involved in the project make these changes consciously, in a rigorous fashion, and with everyone in sync.
A vague definition of the research question and its operationalization leads, at best, to lost time, and at worst derails the project entirely. Therefore, ask yourself and your team these two questions every day:
· What is the theoretical research goal of the project?
· How can I operationalize this research goal?
The core of a data science project is the research question and its operationalization. Consistently addressing, debating, and thinking about these questions will keep your project on course, and increase your chances of success.