“Full-Stack Data Scientist”: The Four False Pillars
We’ll explain what many claim to be a “Full-Stack Data Scientist”. Then we’ll discuss the four main arguments of those who defend the notion that a Data Scientist can only be successful in the industry by following “the generalist” ideology.
In Hollywood, it’s known that sequels are rarely better than the original movie. Batman: The Dark Knight Trilogy is a notable exception. We believe that this post is yet another.
(I) Root Cause Analysis is end-to-end — so, a Data Scientist should be too!
Root Cause Analysis is something strategy consultants have been doing for decades. One of the main pro-generalist arguments is that the causes for the low performance of Machine Learning models lie outside the modeling stage.
Now, two questions arise:
- Should RCA be done by data scientists? And, if yes,
- Would it help them to be more effective in finding the root causes if they own the entire process?
So far, our data scientists would answer “yes” and “yes!”.
But quickly, another question arises:
What would it be like if data scientists owned this project end-to-end?
Well, the most likely answer would involve delays in project completion, lower user acceptance, and trouble with scalability (i.e., more root causes to find and a bigger problem). Why? Because they are not experts in UX design, DevOps, or data engineering!
When you try to do something that is difficult and you aren’t an expert in it, it’ll take you longer to do and you’re more likely to make mistakes.
What’s the problem that usually takes place when a specialized DS tries to run RCA? They run into other teams’ walls. Why? Three classic reasons are:
- lack of ownership, aka Social Loafing,
- lack of communication, and
- lack of data-driven awareness.
So…what happens if other teams and/or team members are:
- unavailable to step up for their technical/communication mistakes;
- not aware of what the data science modules do (input/output);
- ignoring the possible consequences that their (bad) work may have?
Well, then our answer is: your organization has a problem of Culture, Values, and, ultimately, Leadership. If that happens, you can have whichever kind of Data Scientists you want, as you’ll always be closer to failure than to… well, pretty much anything else.
(II) Multiple Roles bring Communication Overhead
If you develop a project in a multidisciplinary team, you need to communicate. If you work alone, you do not need to communicate…but labeling communication as overhead is beyond comprehension.
Some well-intentioned individuals mention that Data Science is an immature field and that, consequently, it’s difficult to create the blueprint of the final product beforehand, as we typically can in Software Engineering.
Consequently, rapid (re)build-try-fail iteration is key for success. Naturally, that speed boost would come from full project ownership, aka taking the generalist way. And, oops…there is a fallacy here.
In software engineering, you usually have multiple teams in place, organized by tribes, technologies, or functions. They work together to accomplish a single goal…but, in that process, they have competing agendas of their own…consequently, alignment and communication are key for their success.
Naturally, imposing a large number of meetings, stand-ups and other rituals on all levels of technical team members may be tough…and thus, also naturally, the role of the Agile Product Owner was born.
Data Science is no different. And, although this is a recent trend, it’s a popular one. To have a real impact in an organization, every Data Science team needs to have an (Agile) Product Owner in place. If your team doesn’t have one, or you believe your team is too small, this (the lack of a Product Owner for Data Science and/or a Team Leader/Head) might be where your real problem started in the first place.
(III) Learning by doing is the key!
Nowadays, there is a plethora of online courses for Machine Learning and Data Science following the early success of Coursera. Usually, these courses come with a promise like “Zero-to-hero: How to become a data scientist in 6 months.”
In these courses, it’s often said that Data Science is an empirical discipline where learning by doing is key. Like in most other jobs, we’d agree that experience in actually doing stuff (vs. only theorizing about it) helps. A lot.
The problem is that this expression implies that everybody can become a Data Scientist if they try hard enough. Without reducing the merit of people who studied other STEM disciplines (like Physics or Production Engineering) and then, with a lot of effort and persistence, made a difficult transition to Data Science careers…it’s a fact that they will struggle to become Professional Data Scientists as they lack foundational knowledge.
In other words, they know when and how to use the available off-the-shelf tools/libraries/methods up to a certain point, but they don’t know exactly how they work. Please read more about Luis Moreira-Matias’ thoughts on Professional vs. Citizen Data Scientists from a keynote that he gave last year at the IT Arena 2019 (Lviv, Ukraine), where he shared the stage with the likes of Microsoft, Spotify, and Uber.
What are the consequences of such ignorance? Firstly, it prevents them from really tailoring their methodology to their business application (a classical step in the CRISP-DM methodology). This means that the result will not have as much impact on your business. This makes leaders wonder how they can extract more value from these Data Scientists who are delivering below expectations…and the answer invariably turns out to be “getting them more stuff to do”.
And that stuff would be DevOps, data engineering, or analytics dashboards…does this ring a bell? Of course…it is a Full-Stack Data Scientist! 🙂
“Any fool can know. The point is to understand.”
Albert Einstein
We have been hiring for positions in the all-data-things space (data engineers, data scientists of different flavors, and machine learning researchers) for the last 4 years. For the latter two roles, our interviews always contain a technical Q&A about the candidate’s past data science projects which covers, among other things, the foundations of the methods used. Here are some of the pearls that we’ve heard during these interviews in recent years:
How does the performance of models trained with Random Forests vary with the number of trees used?
Models trained with Random Forests get better as the number of trees gets larger;
What is the difference between Logistic and Linear Regression?
One provides linear models…and the other doesn’t;
Why are you using Huber Loss when your evaluation metric is RMSE?
In Data Science, we need to try everything first to see what works the best. (He literally tried all the linear regression methods available in scikit-learn).
What does the AUROC metric stand for?
AUROC stands for…Accuracy.
Which hyperparameter values did you use to train your SVM model?
The model hyperparameters used are the default ones. This is a fair comparison as the performance uplift after tuning those would always be small.
How can you create a model from a training set with 500 examples?
Use a CNN in Keras. Always. Preferably, a pre-trained one.
If you didn’t find anything wrong with any of the above responses, data science may not be your best career path.
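To make one of these answers concrete: AUROC is the Area Under the ROC curve, and it measures how well a model ranks positives above negatives, which is not the same thing as accuracy. The snippet below is a minimal sketch on hand-made, hypothetical data (the labels, scores, and helper names are ours, for illustration only). It uses the rank interpretation of AUROC: the probability that a randomly chosen positive gets a higher score than a randomly chosen negative.

```python
def accuracy(y_true, y_pred):
    """Fraction of labels predicted correctly."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def auroc(y_true, scores):
    """AUROC via its rank interpretation: P(score of a random positive >
    score of a random negative), with ties counted as 0.5."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def threshold(scores, cutoff=0.5):
    """Turn scores into hard 0/1 predictions."""
    return [1 if s >= cutoff else 0 for s in scores]


# Hypothetical data: 2 positives among 10 examples (imbalanced).
y_true = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]

# Model A gives every example the same score: it cannot rank anything.
scores_a = [0.1] * 10
# Model B ranks both positives above every negative,
# although all of its scores stay below the 0.5 cutoff.
scores_b = [0.05, 0.10, 0.10, 0.15, 0.20, 0.20, 0.25, 0.30, 0.40, 0.45]

acc_a = accuracy(y_true, threshold(scores_a))  # 0.8
acc_b = accuracy(y_true, threshold(scores_b))  # 0.8
auc_a = auroc(y_true, scores_a)                # 0.5
auc_b = auroc(y_true, scores_b)                # 1.0

print(f"Model A: accuracy={acc_a}, AUROC={auc_a}")
print(f"Model B: accuracy={acc_b}, AUROC={auc_b}")
```

Both models reach the same accuracy at the 0.5 cutoff, yet one ranks perfectly and the other no better than chance. Someone who equates the two metrics would not be able to tell them apart.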
Working hard is mandatory…but to work hard & smart is even better!
Following this line of thought, another argument is that owning the whole development process brings a higher sense of satisfaction. There is nothing wrong with working hard – it’s crucial to any career, especially Data Science! But working smart must be equally important. Investing time in things you are strong at just makes you…stronger! If you doubt it, check out how seriously Cristiano Ronaldo trains to score goals…but he certainly does not practice much goalkeeping.
(IV) ±1% of accuracy does not impact my business
It’s often said that +1% is not a performance uplift worth investing time in. It does not move the needle. Consequently, what you actually need are some guys (generalists) to use some data to build any model that works.
That can be achieved quick-and-dirty with some build-try-fail iterations and a lot of copy-paste from online knowledge exchange boards such as StackOverflow. Make an MVP to run a POC. Have a spike…a proof point to show our investors that “yes, we can!”. All good fellas, aren’t they?
Do you have a data strategy & governance roadmap? No? Well, that’s bad…
Today, one of the problems organizations face is the lack of a data strategy/roadmap and/or data governance policies. Often, senior leadership (in both corporations and start-ups) is not aware of what creating and scaling a data-driven business can actually mean in practice.
They have a business plan, sure, and naturally, they like the scalability vs. reduced OPEX that automation brings into that. And of course, the competitive advantage that doing that fast can bring to their business. But they often forget to see the darker side of it. We like to call it “becoming a data hostage.”
Copy-paste code: $1. To know which code to copy-paste: $10,000.
There is a series of implications of automating your business in a data-driven way: you become dependent on the data used, which means that, if a data provider goes bankrupt and/or a regulatory change forbids using that data in the future…well, you’re in trouble.
You also become dependent on the pipeline dependencies (packages, development language, versions), even if they run in any type of virtual container. Finally, you also become dependent on your model: usually, your model performance changes with the scale (accept 20% of customers instead of 5%, recommend 10 products instead of 3, and so on). Here, you expect to maintain your business performance (driven by your model’s performance) after scaling up, or even to improve it.
But there isn’t any model capable of that. And by then it’s too late: you already have a team of citizen data scientists, you already promoted the guys for their great work (after all, you did the POC, you got the new funding round, you convinced the investors) and you already have a series of processes (dependencies) that rely on a group of people who are far from being specialized in the tasks they need to do…now, at scale.
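The scale effect mentioned above (accepting 20% of customers instead of 5%) can be sketched with a toy credit-risk example. Everything here is an assumption made for illustration: the 20 applicants, their risk scores, their default flags, and the function name are all synthetic and hypothetical, not real data. The sketch accepts the lowest-risk share of applicants and reports the default rate among those accepted.

```python
def default_rate_at_acceptance(risk_scores, defaulted, acceptance_pct):
    """Accept the lowest-risk `acceptance_pct` percent of applicants and
    return the share of accepted applicants who defaulted."""
    ranked = sorted(zip(risk_scores, defaulted))  # lowest risk first
    n_accept = max(1, len(ranked) * acceptance_pct // 100)
    accepted = ranked[:n_accept]
    return sum(flag for _, flag in accepted) / n_accept


# Synthetic applicants: risk scores 0.05, 0.10, ..., 1.00 (model output).
risk_scores = [i / 20 for i in range(1, 21)]
# Synthetic outcomes (1 = defaulted), listed in the same order as the scores.
# The model is good but not perfect: defaults get denser as risk grows.
defaulted = [0, 0, 1, 0, 0, 1, 0, 1, 1, 0,
             1, 1, 0, 1, 1, 1, 1, 1, 1, 1]

rate_5 = default_rate_at_acceptance(risk_scores, defaulted, 5)    # 0.0
rate_20 = default_rate_at_acceptance(risk_scores, defaulted, 20)  # 0.25
rate_50 = default_rate_at_acceptance(risk_scores, defaulted, 50)  # 0.4

for pct, rate in [(5, rate_5), (20, rate_20), (50, rate_50)]:
    print(f"accept {pct:>2}% of applicants -> default rate {rate:.2f}")
```

The same model, with the same scores, produces very different business outcomes depending on how deep into the ranking you go. A model that only looked good at the POC’s acceptance rate tells you little about how it behaves at scale.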
Do you want to scale your business one day? If yes, you need professional DS now.
If your business wants to scale (or may one day want to) and to have a predictive or prescriptive analytics engine at its core, you need professional DSs now. If you have an automated data-driven business in place, the impact of +1% in your model’s performance is tremendous…regardless of the industry.
Two examples illustrate this impact well: Start-up/Credit Risk (where +0.01 in AUROC translates into +1M USD) and Corp/E-Commerce (where -1% in MSE can easily translate into +3M USD in a single quarter).
A Data Scientist’s true obstacles
“Generalists” base their arguments about Data Scientists on the 4 false pillars discussed in this article. These pillars become very problematic for any Data Scientist for a variety of reasons, but to understand why this discussion even exists, and what the true obstacles are for large-scale adoption of data science in the industry, we’d need to dig deeper. And we will.