This post is by Instacart VP Data Science Jeremy Stanley, and technical advisor and former LinkedIn data leader Daniel Tunkelang. Previously, Jeremy wrote the most comprehensive manual we’ve ever seen for hiring data scientists.
It's hard to believe that "data scientist" only became a bona fide job title in 2008. Jeff Hammerbacher at Facebook and DJ Patil at LinkedIn coined the term to capture the emerging need for interdisciplinary skills across analytics, engineering, and product. Today, the demand for data scientists has blossomed, and with it the need to better understand how to grow these teams for success.
The two of us have seen our share of the good, the bad, and the ugly, leading and advising teams at a variety of companies in different industries and at different stages of maturity. We've seen the challenges of not only hiring top data scientists, but also making effective use of them and retaining them in a hypercompetitive market for talent.
In this article, we've summarized the advice we give to founders who are interested in building data science teams. We explain why data science is so important for many startups, when companies should begin investing in it, where to put data science in their organization and how to build a culture where data science thrives.
Data science serves two important but distinct sets of goals: improving the products your customers use, and improving the decisions your business makes.
Data products use data science and engineering to improve product performance, typically in the form of better search results, recommendations and automated decisions.
Decision science uses data to analyze business metrics — such as growth, engagement, profitability drivers, and user feedback — to inform strategy and key business decisions.
This distinction may sound straightforward, but it’s an important one to keep in mind as you establish and grow your data science team. Let’s take a closer look at these two areas.
Using Data Science to Build Better Products
Data products leverage data science to improve product performance. They rely on a virtuous cycle where products collect usage data that becomes the fodder for algorithms, which in turn offer users a better experience.
What happens before you’ve collected that data? The first version of your product has to address what data science calls the “cold start” problem — it has to provide a "good enough" experience to initiate the virtuous cycle of data collection and data-driven improvement. It’s up to product managers and engineers to implement that good enough solution.
For example, when an Instacart user visits the site, the application shows recently purchased groceries under a “buy it again” header. It’s a feature that delights users, but it hardly requires data science — or much data. The data science kicks in when we want to show recommendations for products they haven’t purchased before. Doing so requires analyzing all users’ purchasing behavior, determining which users are similar to each other, and ultimately recommending items based on what similar users have purchased in the past. That's where data science uses data to create value, enabling customers to easily discover new products they might not have found on their own.
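To make the "similar users" idea concrete, here is a minimal sketch of user-based collaborative filtering. It is an illustration only, not Instacart's actual system: the purchase data, the `similarity` and `recommend` function names, and the choice of cosine similarity over sets are all assumptions made for the example.

```python
from collections import defaultdict
from math import sqrt

# Hypothetical purchase history: user -> set of purchased items.
purchases = {
    "ana": {"milk", "eggs", "bread"},
    "ben": {"milk", "eggs", "coffee"},
    "cam": {"tofu", "rice"},
}

def similarity(a, b):
    """Cosine similarity between two sets of purchased items."""
    if not a or not b:
        return 0.0
    return len(a & b) / sqrt(len(a) * len(b))

def recommend(user, purchases, top_n=3):
    """Rank items the user has not bought by the similarity of users who did."""
    own = purchases[user]
    scores = defaultdict(float)
    for other, items in purchases.items():
        if other == user:
            continue
        sim = similarity(own, items)
        if sim == 0.0:
            continue  # users with nothing in common contribute no signal
        for item in items - own:
            scores[item] += sim
    # Highest score first; ties broken alphabetically for stable output.
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [item for item, _ in ranked[:top_n]]

print(recommend("ana", purchases))
```

Note that a user with no overlap with anyone else gets no recommendations at all, which is the cold start problem in miniature.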
In order to improve products, data scientists must collaborate closely and constantly with engineers. You also need to decide concretely whether data scientists implement product enhancements themselves or partner with engineers who implement them. Either approach can work, but it’s important to formalize it and establish shared expectations across the organization. Otherwise, you’ll struggle to get improvements into production, and you’ll lose talented data scientists who feel unproductive and undervalued.
Using Data Science to Make Better Decisions
Decision science uses data analysis and visualization to inform business and product decisions. The decision-maker may be anywhere in the organization — from a product manager determining how to set priorities on a road map to the executive team making bet-the-company strategic decisions.
Decision science problems span a wide range, but they tend to have several characteristics. They're novel problems that the organization has not needed to solve before. They’re often subjective, requiring data scientists to deal with unknown variables and missing context. They’re complex, with many moving parts that lack clear causal relationships. At the same time, decision science problems are measurable and impactful — the result of making the decision is concrete and significant for the business.
The above may sound a lot like data analytics, and indeed the difference between analytics and decision science isn’t always clear. Still, decision science should do more than produce reports and dashboards. Data scientists shouldn't be doing work that can be delivered using off-the-shelf business intelligence tools.
At LinkedIn, the executive team used decision science to make a critical business decision about the visibility of member profiles in search results. Historically, only paid users could see full profiles for everyone in their extended (third-degree) network. The visibility rules were complex, and LinkedIn wanted to simplify them — but not in a way that would undermine its revenue. The stakes were enormous.
The proposed visibility model was a monthly use limit for unpaid users, with a cut-off based on usage. LinkedIn’s decision scientists simulated the effects of this change, using historical behavior to predict the impact on revenue and engagement. The analysis had to extrapolate past behavior on one model to forecast behavior on a radically different one. Nonetheless, the analysis was sufficient to move forward.
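As a rough illustration of this kind of simulation, the sketch below replays historical usage under a few candidate monthly limits. It is not LinkedIn's analysis: the usage numbers, the candidate caps, and the `simulate_cap` function are hypothetical, and the sketch assumes users would behave exactly as they did in the past, which is precisely the extrapolation risk described above.

```python
# Hypothetical historical data: profile views in one month, per unpaid user.
monthly_views = [3, 12, 0, 45, 7, 120, 18, 2, 60, 9]

def simulate_cap(views, cap):
    """Replay historical usage under a monthly cap and summarize the effect."""
    capped_users = sum(1 for v in views if v > cap)
    blocked_views = sum(max(v - cap, 0) for v in views)
    total_views = sum(views)
    return {
        "cap": cap,
        "share_of_users_hitting_cap": capped_users / len(views),
        "share_of_views_blocked": blocked_views / total_views if total_views else 0.0,
    }

# Compare a few candidate limits side by side.
for cap in (10, 30, 50):
    print(simulate_cap(monthly_views, cap))
```

A real analysis would go further, for example by estimating how many capped users would upgrade and how many would simply use the product less.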
The result was not only positive for the business, but also delighted millions of users and eliminated a source of complexity that had been an albatross for product development. Some people complained about the limits — but those were precisely the people that LinkedIn felt should be paying to use the platform. The project was a success, thanks to the decision science that informed it.
Not all decisions require the big guns of decision science. Some decisions are too small to justify the investment. Other decisions may be important, but the business could lack the data to meaningfully analyze them. In those cases, businesses need to rely on intuition and experimentation. Good decision scientists know their own limitations and recognize when their efforts would be wasteful or counterproductive.
While decision science and data products call for some of the same skills, it’s rare for data scientists to excel at both. Decision science depends on business and product sense, systems thinking, and strong communication skills. Data products require machine learning knowledge and production-level engineering skills. If you have a small data science team, you may need to find the rare superstars who can do both. But you’ll benefit from specialization as you scale your team.
Data science isn’t right for everyone. Invest in it only if it'll be critical to your success; otherwise, it'll just be an expensive distraction.
Before you invest in building a data science team, you should ask yourself these four questions:
1. Are you committed to using data science to either inform strategic decisions or build data products?
If you’re not committed to using data science toward one of these goals, then don’t hire data scientists.
They can help you make strategic decisions, but only if you’re committed to a culture of data-driven decision making. You may not need them on day one, but it takes time for you to hire the right people — and time for them to get to know your data and your business. You’ll need all that to happen before they can apply data science to drive decision making.
Data products can create value and delight users through improved optimization, relevance, etc. If these are on your product roadmap, you should bring data scientists in early to make the design decisions that will set you up for long-term success. Data scientists can make key decisions about product design, data collection, and systems architecture that are critical foundations for building magical-seeming products.
2. Will you be able to collect the data you need and act on it?
A founding engineer can create an MVP product with a small amount of product and design guidance. Data science requires data, which comes only with measurement and scale. Recommender systems rely on instrumenting your product to track user behavior. Optimizing business decisions depends on fine-grained measures of key activities and outputs.
But collecting data isn’t enough. Data science only matters if data drives action.
Data should inform product changes and drive the organization’s key performance indicators (KPIs).
Instrumentation requires a commitment across the organization to identify what data each product needs to collect and establish the infrastructure and processes for collecting and maintaining that data. To be successful, instrumentation requires collaboration among data scientists, engineers, and product managers, which in turn requires executive commitment.
Similarly, data-driven decision making requires a top-down commitment. From the CEO down, the organization has to commit to making decisions using data, rather than based on the highest paid person's opinion (or HiPPO).
3. Will you have enough signal in your data to derive meaningful insights?
Many people equate big data with data science, but size isn’t everything. Data science is about separating the signal in data from the noise.
The available signal depends not only on data volume, but also on the signal-to-noise ratio.
For example, an ad product may collect data from billions of impression events, but the data only carries signal in the rare cases where users interact with the ads. Hence, the large volume of data only yields a small amount of signal. No amount of data science will tease deep insights out of a big data set unless there's a critical mass of signal.
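A back-of-the-envelope calculation shows why. The sketch below uses the standard normal approximation for a proportion to estimate how precisely a click-through rate (CTR) can be measured. The impression counts and the 0.1% CTR are hypothetical, and `ctr_interval` is a helper written just for this illustration.

```python
from math import sqrt

def ctr_interval(impressions, clicks, z=1.96):
    """Return the CTR and the half-width of its approximate 95% interval."""
    rate = clicks / impressions
    half_width = z * sqrt(rate * (1 - rate) / impressions)
    return rate, half_width

# Hypothetical numbers: in both cases the true CTR is 0.1%.
for impressions in (100_000, 10_000_000):
    clicks = impressions // 1000
    rate, half_width = ctr_interval(impressions, clicks)
    print(f"{impressions:>10,} impressions, {clicks:>6,} clicks: "
          f"CTR {rate:.4%} +/- {half_width / rate:.1%} relative")
```

For rare events, the relative uncertainty is driven almost entirely by the number of clicks rather than the number of impressions, so a hundredfold increase in raw data only tightens the estimate by about a factor of ten.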
4. Do you need data science to be a core competency, or can you outsource it?
Building a data science team is hard and expensive. If you can get away with outsourcing your data science needs, then you probably should. One option is to make judicious use of consultants. A better one is to use an off-the-shelf solution for your domain that uses APIs to ingest data, build models, automate actions, and report on key analytics. There may not be a solution perfectly tailored for your needs, but it’s often worth compromising to accelerate your business and keep your core team focused on the areas where it can add the most value.
When do you need data science to be a core competency? If data science is solving problems that are critical to your success, then you can’t afford to outsource it. Also, off-the-shelf solutions tend to be rigid. If your business is taking a unique approach to a problem (e.g. collecting new kinds of data or using the results in novel ways), it’s unlikely that an off-the-shelf solution will be flexible enough to adapt to it.
Data science requires data to science, and most companies don’t have much data on day one.
Don’t hire a head of data or build a team until you have work for them to do. At the same time, ensure you’re collecting key data early on so that team can have an impact once you’re ready.
If you don't have data yet, then who will answer the questions of what data to acquire and when to acquire it? That person doesn't necessarily have to be a data scientist. But it had better be someone who understands the potential of different data sets and can make tough decisions about your data investment strategy. If you already know that you're going to spend a lot of money and time on data acquisition, then it's probably time for you to make at least a minimal investment in hiring a first data scientist.
It's possible that you need data right away because your business is all about delivering data products. But it's more likely that your minimum viable product (MVP) won't be data-driven. Rather, you'll be betting on an instinct and seeing if the market validates that instinct. In that case, prematurely investing in data acquisition and data science will cost you precious money and time that should go toward bringing your MVP to market.
Once you have (or will soon have) data for data scientists to work with, and are ready to commit significant product, engineering and business resources to support your data science efforts, you should begin building a team quickly.
It’s never too early to instill a culture that values data. Business decisions, from acquisitions to product launches, should be based on data rather than opinion. One of the advantages of introducing data science into an organization sooner rather than later is that doing so helps instill data as a first-class asset.
But don’t rush into hiring just because data science is sexy. Given the buzz around this functional area, many people feel a sense of urgency around building a data science team. Companies with petascale ambition are eager to hire the folks who will derive insight from all that data. But building a team too early is an expensive distraction that will demotivate your talent and may have lasting negative cultural implications.
If we had to offer one overarching recommendation, it’s this: After you've validated your MVP, it's time to think about investing in data science.
A successful product launch should generate enough data to learn from, and you'll need to keep up with that data stream by having people on board who can extract value and insight from it.
Where you introduce data science into your org structure matters a lot — for the team, for your other functions, and for the overall success of your business. There are three common approaches: a standalone team, an embedded model, and integrated teams. Each has trade-offs, so let's walk through a few possibilities.
Going It Alone
In the standalone model, your data science team acts as an autonomous unit parallel to engineering. The head of data science is a key leader and typically reports to the head of product or engineering — or even directly to the CEO.
The advantage of the standalone model is autonomy. This type of data science team is well positioned to tackle whatever problems it deems most valuable. There's also a symbolic advantage to a standalone data science team: It demonstrates that the company sees data as a first-class asset, which will help them attract world-class talent.
The standalone model works particularly well for decision science teams. Even though decision scientists collaborate closely with product teams, their independence helps them to make hard calls, like telling PMs that their product’s metrics aren’t good enough to justify a launch. Decision scientists also benefit a lot from cross-pollination, both to understand how different product metrics depend on one another and to share more general learnings about experimentation and data analysis.
The flip-side of autonomy is the risk of marginalization. As companies grow and organize into product teams, they often prefer to be self-sufficient. Even when they could benefit from collaboration with data scientists, product teams simply don’t want to depend on resources they don't control. Instead, they rely on themselves — even hiring their own data scientists under other names like "research engineers" — to get things done. If product teams refuse to work with the standalone data science team, then that team becomes marginalized and ineffective. Again, that's when you start losing good talent.
The original data science team at LinkedIn was a standalone team, and the team’s autonomy allowed it to make key contributions across LinkedIn’s products, in areas ranging from improving the quality of “people you may know” to detecting fraudulent accounts. But as LinkedIn grew, it became increasingly difficult for a standalone team to collaborate effectively with product teams, especially as those teams hired their own engineers with similar skill sets. Eventually LinkedIn decided there was no longer a need for its standalone team. This is a common outcome for standalone teams as a company scales.
The Virtues of Embedding
In an embedded model, the data science team brings in talented people and farms them out to the rest of the company. There’s still a head of data science, but he or she is mostly a hiring manager and coach.
The embedded model is the polar opposite of the standalone model: It gives up autonomy to ensure utility. In the best case, data scientists join the product teams that most need their services, and get to work on a wide variety of problems throughout the organization.
The downside of the embedded model is that not all data scientists are happy giving up autonomy (in fact, many are not good at it at all). Data scientist job descriptions emphasize creativity and initiative, and embedded roles often require them to defer to the leadership of the teams in which they are embedded.
There’s a risk that your data scientists will feel like second-class citizens as embedded team members: their product leads may not feel responsible for their growth and happiness, while their managers may not feel directly vested in their work.
We’ve seen some companies embed data science managers, but this approach only works once you have a fairly large data science team.
At LinkedIn, Daniel experienced the pros and cons of the embedded model. Actually the decision science team there has long thrived with its embedded model. Decision scientists ensure that product teams make decisions — particularly launch decisions — informed by data. At the same time, having a centralized organization facilitates knowledge sharing and career development. But, as mentioned earlier, the standalone data products team was not as successful as the organization scaled. Ultimately, LinkedIn decided to fold product data science into engineering, and Daniel himself moved into an engineering role to lead an integrated team responsible for search quality, an area that requires super tight collaboration between engineers and data scientists.
In an integrated model, there's no separate data science team at all. Instead, product teams hire and manage their own data scientists.
This optimizes for organizational alignment. By making data scientists first-class members of their product teams, it addresses the downsides of the standalone and embedded models. To the extent that data scientists, software engineers, designers, and product managers work on shared product goals, the integrated model instills collective team ownership of those goals. This is how you avoid the breakdowns that can occur when narrowly focused functional teams diverge in their goals and end up mired in dependencies that are too often ignored or delayed.
The downside of the integrated model is that it dilutes the identity of data science. Individual data scientists identify with their associated product teams, rather than a centralized data science team. You also sacrifice the flexibility of the embedded model, since it's harder to move people around based on their skills and interests. Finally, the integrated model can create challenges for scientists’ career growth, since the manager of an integrated team may not be in the best position to value or reward their accomplishments.
At Instacart, data science is fully integrated into product teams. There are 15 of these teams, and each owns its product domain, which could be the real-time order fulfillment engine, the application used by shoppers when picking groceries or the search and recommendation services.
Each is a mixture of engineers, data scientists, designers and product managers, and the engineers and data scientists both report into a technical lead, who may be either an engineer or a data scientist. This structure ensures that the engineers and data scientists collaborate closely and are empowered to do whatever's needed to achieve their team’s objectives. As the VP of Data Science, Jeremy is a mentor and coach to the data scientists and their team leads. He brings the team together into a community that spans product teams. And he leads organization-wide data science initiatives.
Each of the three models has its pros and cons, and you have to figure out which one is best for your organization — plus how you want it to evolve. Be ready to adapt as your needs change. Sometimes the best approach isn’t a single model but a hybrid. As Andy Grove wrote in High Output Management:
Good management is a reconciliation of centralization and decentralization — a balancing act to get the best combination of responsiveness and leverage.
As your organization and ambitions continue to grow, you'll inevitably want to hire even more data scientists (Jeremy wrote another popular article focused purely on this). Build a company culture early that makes it a great place to practice data science, and you’ll reap dividends when they matter most.
Many organizations claim to be data-driven. They collect a lot of data, invest money in data engineering, and frequently reference data-rich dashboards. But they fall short.
Actions speak louder than words, and data science will only feel valued in an organization that makes decisions based on data.
Companies must build the backbone, and the credibility, to make decisions based on data even when those decisions run counter to popular wisdom or lead to significant shifts in power in the organization. These are the opportunities where data science can have the greatest impact.
Data scientists, like everyone else, want their work to have recognizable and celebrated impact. Making this happen creates a positive feedback loop where data scientists remain motivated to tackle big problems and ensure that their solutions are measurable.
Recognizing the contributions of data scientists can be difficult — especially when they’re in integrated teams. Your data science leader needs to remain a champion of excellence and impact, and the senior executives at the company should seek to understand and appreciate the impact that data scientists are having on a regular basis. Not just every now and again.
In many ways, data science takes a village — a data scientist in a vacuum can achieve nothing.
Unless they collaborate closely with product managers, engineers and designers, they will not create amazing products, and unless leaders and operators value their insights, their recommendations may never effect change.
When Jeremy first joined as a data leader at Sailthru, its engineering organization had a neutral perception of data science. In order to increase everyone’s buy-in, he spent 30% of his time in his first 2 months creating and teaching a class to the engineering organization on statistical learning.
By making all of his examples use Sailthru data, and engaging the engineers in the process of building data-driven products, the class rapidly turned the organization’s perception of data science around. That investment of time was costly, especially in those formative months. But having engineers who were excited about the potential of data science as collaborators was well worth the investment.
Despite its name, this discipline can be as much an art as a science. Not everything can be measured, and we're limited by our algorithms, our computational resources, and our ingenuity.
Over time, the impact that a data science team has will be far higher if you build a diverse team with extremely different backgrounds, skill-sets, and world views.
This will ensure they think as holistically as possible about their domain, and will encourage creativity and innovation over time.
Finally, focus early on hiring data scientists who reflect your company ideals. To be effective, data scientists must be trusted by their teams, the users of their products, and the decision makers they influence. As you build your team, hire for and reward individuals with integrity — who share the values of your organization. Their impact is tremendous, and, for better or worse, they'll make many decisions that will shape your company’s future.