The Different Roles Around Data

Juan Pablo Penagos
Sep 28, 2022

If you have worked in the tech world, you may have heard different names for data roles, such as Data Architect, Data Engineer, Data Analyst, Data Scientist, and DataOps. So the questions here are: why do so many different data roles exist, and what is the difference between them? In this post, I will try to explain this in words that are easy to understand.

I will be a little more technical in the text written between “[…]”, and I will use “(…)” for real-world examples. You can skip the text inside “[…]” and “(…)” if you find it confusing or difficult to read.

If you are an expert in some of the roles named above, you will surely say that this article is incomplete or not accurate enough, but it is written for people who are not experts in these roles. Additionally, to keep the explanations simple, we will not talk about newer concepts like “Data Lakehouse” or “Reverse ETL”.

I like to explain topics with analogies because they make abstract ideas easier to understand. So, before answering the questions about data roles, let’s imagine that we are in an Italian restaurant with a friend who only wants to eat a cake, while we are the customers who wish to eat a delicious chocolate dessert (Dashboard) and a Rigatoni pasta (Machine Learning Model).

While our dishes are being prepared, let’s think about the different employees in the restaurant who make them possible. First, we may think of the person who gathers and collects the different foods (foods = data) from the market. We call this person the “Data Architect”, and this is employee number 1 (E1), because we are simply following the order of the steps that make our dishes possible. E1 brings the food into one of the restaurant’s cellars. Let’s tag this space “Stage 1” (S1); in tech words, this place can be called a “Data Lake”. All the food in S1 stays unordered and is not shelved in any specific way, so finding a specific food here, like a potato, is not easy.
[A real-world example: in e-commerce, at a company like Amazon.com, E1 captures data about the customers’ experience on the web page and saves it in a cloud resource like S3 on AWS (S3 is the Data Lake, “Stage 1”). The core skills of E1 are: a strong software engineering background, defining the data architecture framework, and gathering and processing raw data.]

Next comes Step 2, where employee number 2 appears: the “Data Engineer” (E2). This person understands the job of the Data Architect (E1), knows the different food types (or data types), and has the great power to organize all the food on shelves, but in another cellar. This place is “Stage 2” (S2) or, in tech words, the “Data Warehouse”. This employee puts all the food in a specific order, following an established model, so that the next employees can easily find a food such as the potato.
[Following our Amazon.com example, E2 takes the data in S1 (Data Lake) and puts it in S2 (Data Warehouse) using ETL techniques (Extract, Transform, Load). Think of “ETL” as the process that moves data from a messy space (Data Lake) to a tidy space (Data Warehouse). The resource for S2 can be Amazon Redshift (an AWS resource) or BigQuery (a GCP resource), among many others. The core skills of E2 are: creating the roadmap for data management systems, using ETL tools, conceptualizing and visualizing the data framework at an enterprise level, data modeling, and data administration.]
It’s easy to confuse the roles of E1 and E2, so I will share this link to complement these concepts: https://www.executivelevels.com/how-to-tell-the-difference-between-a-data-architect-data-engineer/
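
[For readers who like to see code, here is a minimal sketch of the ETL idea in Python. It is only an illustration under my own assumptions: the raw events, the `purchases` table, and the function names are hypothetical, and an in-memory SQLite database stands in for a real Data Warehouse such as Redshift or BigQuery.

```python
import json
import sqlite3

# Hypothetical raw events, standing in for messy files in the Data Lake (S1).
RAW_EVENTS = [
    '{"customer": " Ana ", "product": "potato", "price": "1.50"}',
    '{"customer": "Luis", "product": "Pasta", "price": "7"}',
    '{"customer": "Ana", "product": "cake"}',  # no price: dropped in transform
]


def extract(lines):
    """Extract: parse each raw JSON line into a dictionary."""
    return [json.loads(line) for line in lines]


def transform(records):
    """Transform: drop incomplete records and normalize names and types."""
    rows = []
    for rec in records:
        if "price" not in rec:
            continue
        rows.append(
            (rec["customer"].strip(), rec["product"].lower(), float(rec["price"]))
        )
    return rows


def load(rows, conn):
    """Load: write the tidy rows into a table with a fixed schema (S2)."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS purchases "
        "(customer TEXT, product TEXT, price REAL)"
    )
    conn.executemany("INSERT INTO purchases VALUES (?, ?, ?)", rows)
    conn.commit()


def run_etl(lines, conn):
    """Run the three steps in order: Extract, then Transform, then Load."""
    load(transform(extract(lines)), conn)


# SQLite in memory plays the role of the Data Warehouse in this sketch.
conn = sqlite3.connect(":memory:")
run_etl(RAW_EVENTS, conn)
```

After `run_etl` finishes, the `purchases` table holds only the complete, cleaned rows, which is the “tidy cellar” that the next employees will search.]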

Following the journey of our dishes, here come two more employees: the pastry chef (Data Analyst) and the chef (Data Scientist). Let’s call them E3A and E3B (both prepare food, but each is a specialist in one type of food). E3A has the skills to prepare dishes that are easier to digest, like dessert, cake, or ice cream. Let’s say that this type of prepared food is called “Insights”. [The core skills of E3A are building dashboards and writing queries (SQL), without forgetting the ability to understand the business. The tool stack for this person includes visualization tools like Tableau, Power BI, Looker, etc., and query tools like MySQL, PostgreSQL, Snowflake, etc.]
E3B has the skills to prepare heavier dishes like the Rigatoni pasta. [The core skills of E3B are building Machine Learning (ML) models, forecasts, and queries. The tool stack of this person includes programming in Python and R and knowledge of cloud services like GCP, AWS, or Azure. For this person, it’s vital to have strong knowledge of statistics and mathematics, because building ML models is not only about writing some lines of Python and running the model. This is where the true Data Scientist differs from the fake “Data Scientist”.]

These people (E3A and E3B) understand the data in the cellar that E2 put in order (the Data Warehouse), and they are able to find foods there. They search with a tool called SQL (Structured Query Language). So E3A and E3B find the foods they need to prepare the dishes.
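
[Here is a small sketch of what “finding a food with SQL” can look like. Everything in it is an assumption made for illustration: the `pantry` table, its columns, and the helper functions are hypothetical, and SQLite in memory again stands in for the Data Warehouse.

```python
import sqlite3

# A tiny, ordered "cellar" (S2). Table and column names are made up.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE pantry (food TEXT, category TEXT, shelf INTEGER, kilos REAL)"
)
conn.executemany(
    "INSERT INTO pantry VALUES (?, ?, ?, ?)",
    [
        ("potato", "vegetable", 1, 12.0),
        ("tomato", "vegetable", 1, 8.5),
        ("rigatoni", "pasta", 2, 20.0),
        ("chocolate", "dessert", 3, 4.0),
    ],
)


def find_food(conn, name):
    """Look up one food by name, the way E3A or E3B would search the cellar."""
    return conn.execute(
        "SELECT food, shelf, kilos FROM pantry WHERE food = ?", (name,)
    ).fetchone()


def kilos_by_category(conn):
    """Summarize stock per category, the kind of number a dashboard shows."""
    return conn.execute(
        "SELECT category, SUM(kilos) FROM pantry "
        "GROUP BY category ORDER BY category"
    ).fetchall()
```

Calling `find_food(conn, "potato")` returns the shelf and quantity for the potato in a single query, which is only possible because the food was stored following a model. `kilos_by_category(conn)` is a first step toward an “Insight”.]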

And last but not least comes employee E4, “DataOps”. This person is like a head of food quality (or data quality), so this role cuts across all the roles named before. This person guarantees speed and quality in all the processes involving data, from data collection to the final insights, ML models, or any other request. This role is relatively new and comes from the core idea of the DevOps role in software engineering. Here is a link to complement the DataOps idea: https://en.wikipedia.org/wiki/DataOps

After eating our delicious dishes, our friend, who ate only a chocolate dessert, asked me an interesting question: what type of food sells more in the restaurant, the desserts or the main dishes? Or, in other words: what type of data request is asked for most, dashboards or ML models? I will give you a moment to think about the answer, but first I would like you to think about this easier question: what type of car sells best, sports cars like Ferraris or everyday cars like a Toyota? (This comparison is exaggerated, but it is only meant to make the idea easier to explain.)
So, let’s go back to the first question. You will note that dashboards are likely requested more often than ML models, because more customers understand dashboards than ML models. However, ML models are more expensive and more difficult to develop (the same as Ferrari vs. Toyota). Here I refer only to ML models for internal consumption in companies, because many companies like Amazon.com also run ML or Deep Learning models that recommend products based on customers’ experience. Those models do not require customers to understand ML; in fact, many customers do not know that this happens.

And lastly, I wish to tell you that the desserts are very delicious, but … be careful with the sugar. Many people ask for many dashboards, and this is not a good idea, because with many dashboards it’s more difficult to make decisions. If you have many metrics, insights, plots, etc., it’s harder for your brain to decide. So I recommend being selective with dashboards: you should have 4 or 5 dashboards at most. If you have doubts about this, imagine that you had to select one car from the top 50 cars in the world, or the most beautiful woman or man from a list of the top 50 candidates. If, instead of such a long list, you had to choose from a list of 3 options, it would be much easier (think of a long list as having many metrics and plots).

If you have any comments or different opinions, don't forget to share your ideas. Thanks.

Juan Pablo Penagos

Professional Statistician passionate about Data Science and Technology.