Let us learn how to build DBT Models using Apache Spark on AWS EMR Cluster using denormalized JSON Dataset.

Here is the high-level agenda for this session.

  • DBT for ELT (Extract, Load and Transformation)
  • Overview of DBT CLI and DBT Cloud
  • Setting up EMR Cluster with Thrift Server using Bootstrapping
  • Overview of Semi Structured Data Set used for the Demo
  • Develop required queries using Spark SQL on AWS EMR
  • Develop the Spark Application on AWS EMR using DBT Cloud
  • Run the Spark Application on AWS EMR using DBT Cloud
  • Overview of Orchestration using Tools like Airflow

DBT for ELT (Extract, Load and Transformation)

First let us understand what ELT is and where DBT come into play.

  • ELT stands for Extract, Load and Transformation.
  • DBT is the tool which is used purely for Transformation leveraging target database resources to process the data.

Based on the requirements and design we need to modularize and develop models using DBT. Once the models are developed and run using DBT, the models will be compiled into SQL Queries and run using target database.

The open source community of DBT have developed adapters for all leading databases such as Spark, Databricks, Redshift, Snowflake, etc.

Overview of DBT CLI and DBT Cloud

DBT CLI and DBT Cloud can be used to develop DBT Models based on the requirements.

  • DBT CLI is completely open source and can be setup on Windows or Mac or Linux based desktops.
  • As part of DBT CLI installation we can take care of installing dbt-core along with the relevant adapters based on the target database.

Setting up EMR Cluster with Thrift Server using Step

As we are not processing significantly large amount of Data, we will setup single node EMR Cluster using latest version. If you are not familiar about AWS EMR, you sign up to this course on Udemy.

DBT Internally uses JDBC to connect to target Database and hence we need to ensure the Spark Thrift Server is also started as the EMR Cluster comes up with Spark. At the time of configuring single node cluster make sure to add step with command-runner.jar and sudo /usr/lib/spark/sbin/start-thriftserver.sh so that Spark Thrift Server is started after the cluster is started.

Here is the screenshot to configure the step.

Overview of Semi Structured Data Set used for the Demo

Here are the details of Semi Structured Data Set used for the Demo. The data set have 5 columns.

  1. order_id which is of type integer
  2. order_date which is string representation of the date
  3. order_customer_id which is of type integer
  4. order_status which is of type string
  5. order_items which is of type string. But the string is valid JSON Array.

We can covert string which contain JSON Array to Spark Metastore array<struct> using from_json function of Spark SQL. However, we need to make sure to specify the schema as second argument while invoking from_json on top of order_items column in our data set.

Develop required queries using Spark SQL on AWS EMR

Here are the queries to process the semi-structured JSON Data using Spark SQL.

Spark SQL have the feature of providing the path of files using SELECT Query.

SELECT *
FROM JSON.`s3://airetail/order_details`

The column order_items is of type string which have JSON Array stored in it. We can convert to Spark Metastore Array using from_json as below.

SELECT order_id, order_date, order_customer_id, order_status,
    explode_outer(from_json(order_items, 'array<struct<order_item_id:INT, order_item_order_id:INT, order_item_product_id:INT, order_item_quantity:INT, order_item_subtotal:FLOAT, order_item_product_price:FLOAT>>')) AS order_item
FROM order_details

Here is the final query which have the core logic to compute monthly revenue considering COMPLETE or CLOSED orders.

WITH order_details_exploded AS (
    SELECT order_id, order_date, order_customer_id, order_status,
        explode_outer(from_json(order_items, 'array<struct<order_item_id:INT, order_item_order_id:INT, order_item_product_id:INT, order_item_quantity:INT, order_item_subtotal:FLOAT, order_item_product_price:FLOAT>>')) AS order_item
    FROM order_details
) SELECT date_format(order_date, 'yyyy-MM') AS order_month,
    round(sum(order_item.order_item_subtotal), 2) AS revenue
FROM order_details
WHERE order_status IN ('COMPLETE', 'CLOSED')
GROUP BY 1
ORDER BY 1

Develop the DBT Models using Spark on AWS EMR

Let us go ahead and setup the project to develop required DBT Models to compute monthly revenue. We’ll break the overall logic to compute monthly revenue into 2 dependent DBT Models.

Here are the steps that are involved to complete the development process.

  1. Setup DBT Project using Spark Adapter
  2. Run Example Models and confirm if project is setup successfully
  3. Develop Required DBT Models with core logic
  4. Update Project File (change project name and also make required changes related to the models)

Here is the code for the first model order_details_exploded.sql where we will be preserving the logic for exploded order details in the form of a view.

{{ config(materialized='view') }}

SELECT order_id, order_date, order_customer_id, order_status,
    explode_outer(from_json(order_items, 'array<struct<order_item_id:INT, order_item_order_id:INT, order_item_product_id:INT, order_item_quantity:INT, order_item_subtotal:FLOAT, order_item_product_price:FLOAT>>')) AS order_item
FROM JSON.`s3://airetail/order_details`

Here is the code for the second model monthly_revenue.sql where we will be preserving the results in a table in specified s3 location. The configurations related to create the table pointing to specific s3 location can be specified in either this model or at the project level by updating dbt_project.yml.

{{ 
    config(
        materialized='table',
        location_root='s3://airetail/monthly_revenue'
    ) 
}}

SELECT date_format(order_date, 'yyyy-MM') AS order_month,
    round(sum(order_item.order_item_subtotal), 2) AS revenue
FROM {{ ref('order_details_exploded') }}
WHERE order_status IN ('COMPLETE', 'CLOSED')
GROUP BY 1
ORDER BY 1

Run the DBT Models using Spark on AWS EMR

As the development of DBT Model using Spark Adapter is done let us see how to run and validate the same.

  1. Run the DBT Project with 2 models
  2. Login into the EMR Cluster and launch Spark SQL
  3. Run query pointing to the target location in which the monthly revenue data is preserved.

Overview of Orchestration using Tools like Airflow

DBT Applications are primarily developed to take care of implementing required transformation logic using ELT pattern. The overall pipeline might require beyond the transformation logic. We need to make sure the entire pipeline is supposed to be orchestrated.

One of the ways we can orchestrated the pipeline by using orchestration tools such as AWS Step Functions, Airflow, etc.

Here is one of the common design when it comes to building end to end pipeline in which DBT play a critical role.

As part of this workshop we will go through the real time simulated project to process Salesforce Data using AWS Data Analytics Services.

The video also cover few aspects related to our Guided Program, but majority of the content is related to the project.

Here is the program link related to Data Engineering using AWS Analytics — https://itversity.com/bundle/data-engineering-using-aws-analytics

For sales inquiries: support@itversity.com

YouTube Video – Salesforce Data Processing using AWS Data Analytics Services

Few Important Links

Few important links to stay connected with us.

As we have good Udemy Follower base let me make a special note to our Udemy Customers before going through the details related to the Workshop on Salesforce Data Processing using AWS Data Analytics Services.

Special Note to Udemy Customers

  • Thank You for our esteemed customer
  • Make sure to rate us and also provide feedback as demonstrated. Your rating and feedback is very important for our community success.
  • If you are existing Udemy Customer and not familiar about ITVersity Courses in Udemy, feel free to visit this page.
Now let us get into the problem statement related to the Data Engineering Sample Project using Simulated Salesforce Data.

Data Engineering Sample Project — Problem Statement

Here are the details about the workshop or demos. It is primarily based on running analytics on top of Salesforce leads/sales data using Data Lake and Data Engineering Pipelines.

  • Get the data ingested from Salesforce into the bronze layer. Data will be in semi-structured format.
  • Apply required transformations and store the data into Data Warehouse
  • After applying required transformations, the Data should be well structured and should be ready for reporting.
As we have come up with the problem statement related to the project, now it is time for us to get into design. We will go through the design using both AWS Native Approach as well as Databricks on AWS Approach. But we will perform the tasks using AWS Native Approach.

Data Engineering Sample Project — Design using AWS Native

Here is the design using AWS Native Data Analytics Services using Salesforce, AWS s3, AWS Lambda or AWS EMR with Spark SQL, Redshift and some reporting tool such as AWS Quicksight.

  • Salesforce — Source which generate leads data
  • Appflow — AWS Fully Managed Service which can be used to ingest Salesforce Data periodically into Data Lake (which is nothing but AWS s3).
  • S3 — AWS Cloud Storage which can act as Data Lake for Data Analytics Applications.
  • Lambda — Used to build light weight Data Processing applications. Lambda can be deployed using programming languages such as Python.
  • EMR with Spark SQL — Used to build large scale Data Processing applications. Alternatively, we can also process data using Spark Data Frame APIs on EMR. EMR Stands for Elastic Map Reduce.
  • Redshift Serverless — Can be used for Ad-hoc Analysis of Data and also to build required Data Marts for reporting.

Data Engineering Sample Project — Design using Databricks on AWS

Here is the design leveraging Cloud Agnostic Data Lake Platforms such as Databricks. The sample project is built using Salesforce, DBFS on AWS s3, AWS Lambda or Databricks Runtme, Databricks SQL based on Spark SQL, and some reporting tool such as AWS Quicksight.

We can also use Databricks on other prominent cloud platforms such as Azure and GCP.

  • Salesforce — Source which generate leads data
  • Appflow — AWS Fully Managed Service which can be used to ingest Salesforce Data periodically into Data Lake (which is nothing but AWS s3).
  • DBFS — Databricks Abstraction on top of Cloud Storage which can act as Data Lake for Data Analytics Applications. DBFS Stands for Databricks File System.
  • Lambda — Used to build light weight Data Processing applications. Lambda can be deployed using programming languages such as Python.
  • Databricks Runtime with Spark SQL — Used to build large scale Data Processing applications using Distributed Engine built based on Spark.
  • Databricks SQL — Can be used for Ad-hoc Analysis of Data and also to build required Data Marts for reporting.

The above design is for AWS Databricks. We can also build similar solution on Azure where AWS s3 can be replaced with ADLS, Appflow can be replaced with ADF and AWS Lambda can be replaced with Azure Functions.

Data Engineering Sample Project — Implementation Plan

Here the high level details related to the implementation of Data Engineering using Data from Salesforce.

  • Overview of Salesforce and AWS Appflow
  • Review CSV Data with semi structured or JSON Fields
  • Upload files to Bronze Layer of Data Lake (AWS s3) simulating Data Ingestion from Salesforce via AWS Appflow
  • Overview of Approach — AWS Lambda vs. AWS EMR using Spark SQL
  • Implement the solution to get the data from Bronze Layer to Gold Layer
  • Ad-hoc Analysis using Redshift Spectrum and Load Data into Redshift reporting tables
  • Overview of Reporting using Redshift Tables
As we have gone through the details related to the implementation plan, let us focus on execution of those.

Overview of Salesforce and AWS Appflow

First let us go through the details related to Saleforce and Appflow, then we will get into the implementation.

  • Salesforce is globally renowned Cloud based CRM Application. It is primarily used to streamline leads, sales and Customer Relationship Management.
  • AWS Appflow is fully managed Data Flow Service from standard applications like Salesforce. It can be used to not only get the data from Salesforce but also from other platforms such as ServiceNow.

Typically we can ingest data from Salesforce using AWS Appflow in the form of CSV Files. Let us review the data first, before coming up with required logic to process the data.

Review CSV Data with semi structured or JSON Fields

Before starting on building the applications on CSV Data, we need to understand the characteristics of the data.

  • Whether the CSV Files which contain the data have header or not.
  • Delimiter or Separator. Even though CSV stands for Comma-Separated Variable format, we might end up using other Separator or Delimiter as well.
  • Enclosing Character
  • Escape Character

Details about Enclosing Character and Escape Character are covered as part of the workshop or lecture in detail. Once we understand the characteristics of the data, we can take care of copying the data to s3.

Data Ingestion from Salesforce into s3 using Appflow

As we do not have active Salesforce Account and integration with Appflow, for the sake of Sample Project or Workshop data is already uploaded to AWS s3.

  • Many times as part of the projects, the integration might take time. As long as we know the structure of the data, we can generate sample data and can build rest of the data pipeline.
  • This approach might result in some rework, but will definitely increase productivity of the team.
As we are ready with the data in s3 let us go through the criteria to decide between AWS Lambda and AWS EMR with Spark SQL.

AWS Lambda vs. AWS EMR with Spark SQL

Even though AWS EMR with Spark SQL is effective for large scale data sets, for smaller to medium sized data sets, it might not be viable solution.

  • Instead of using AWS EMR with Apache Spark, it is better to use AWS Lambda Functions for light weight Data Processing.
  • AWS Lambda Functions take significantly lesser time to start and also fast in performance when it comes to light weight Data Processing.
As we understood the criteria to choose between solutioning using AWS Lambda as well as Spark SQL, let us see both in action.

Solution using AWS Lambda

Here is the solution to transform semi-structured data to structured data using AWS Lambda Functions.

Solution using Spark SQL on EMR

Here is the solution to process Semi-Structured Data from Salesforce using Spark SQL on EMR.

We can build Orchestrated Pipeline using either of the approach leveraging AWS Step Functions.

Analysis or Reporting using Amazon Redshift

Here are Redshift Queries used as part of the Sample Project to take care of Analysis as well as Reporting the data using Amazon Redshift.

Here are the steps involved to run queries the data in AWS s3.

  • Create Glue Crawler pointing to s3 locations
  • Create Database and Tables by running Glue Crawler
  • Create External Schema in Redshift. It is also known as Redshift Spectrum.
  • Run relevant queries to perform Ad-hoc Analysis. For reporting purposes, it is recommended to ensure data is copied to Redshift tables.

Reports built against Redshift internal tables perform better compared to Redshift External Tables.

We are looking for a Data Engineer to join our growing team of Data engineering experts. The hire will be responsible for expanding and optimizing our data and data pipeline architecture, as well as optimizing data flow and collection for cross functional teams. The ideal candidate is an experienced data pipeline builder and data wrangler who enjoys optimizing data systems and building them from the ground up. The Data Engineer will support our software developers, database architects, data analysts and data scientists on data initiatives and will ensure optimal data delivery architecture is consistent throughout ongoing projects. They must be self-directed and comfortable supporting the data needs of multiple teams, systems and products. The right candidate will be excited by the prospect of optimizing or even re-designing our company’s data architecture to support our next generation of products and data initiatives.

Responsibilities for Data Engineer
  • Create and maintain optimal data pipeline architecture,
  • Assemble large, complex data sets that meet functional / non-functional business requirements.
  • Identify, design, and implement internal process improvements: automating manual processes, optimizing data delivery, re-designing infrastructure for greater scalability, etc.
  • Build the infrastructure required for optimal extraction, transformation, and loading of data from a wide variety of data sources using SQL and AWS/ Azure ‘big data’ technologies.
  • Build analytics tools that utilize the data pipeline to provide actionable insights into customer acquisition, operational efficiency and other key business performance metrics.
  • Work with stakeholders including the Executive, Product, Data and Design teams to assist with data-related technical issues and support their data infrastructure needs.
  • Keep our data separated and secure across national boundaries through multiple data centers and AWS/ Azure regions.
  • Create data tools for analytics and data scientist team members that assist them in building and optimizing our product into an innovative industry leader.
  • Work with data and analytics experts to strive for greater functionality in our data systems.
Qualifications for Data Engineer
  • Advanced working SQL knowledge and experience working with relational databases, query authoring (SQL) as well as working familiarity with a variety of databases.
  • Experience building and optimizing ‘big data’ data pipelines, architectures and data sets.
  • Experience performing root cause analysis on internal and external data and processes to answer specific business questions and identify opportunities for improvement.
  • Strong analytic skills related to working with unstructured datasets.
  • Build processes supporting data transformation, data structures, metadata, dependency and workload management.
  • A successful history of manipulating, processing and extracting value from large disconnected datasets.
  • Working knowledge of message queuing, stream processing, and highly scalable ‘big data’ data stores.
  • Strong project management and organizational skills.
  • Experience supporting and working with cross-functional teams in a dynamic environment.
  • We are looking for a candidate with 3- 5 years of experience in a Data Engineer role, who has attained a Graduate degree in Computer Science, Statistics, Informatics, Information Systems or another quantitative field. They should also have experience using the following software/tools:
    • Experience with big data tools: Hadoop, Spark, Kafka, etc.
    • Experience with relational SQL and NoSQL databases, including Postgres and Cassandra.
    • Experience with data pipeline and workflow management tools: Azkaban, Luigi, Airflow, etc.
    • Experience with AWS cloud services: EC2, EMR, RDS, Redshift
    • Experience with stream-processing systems: Storm, Spark-Streaming, etc.
    • Experience with object-oriented/object function scripting languages: Python, Java, C++, Scala, etc.
    • Experience on Cloudera/ Databricks/snowflakes platforms will be added advantage
Qualifications

Minimum :: Bachelors or Equivalent in engineering ( IT, Software, ECE)

We are looking for self-motivated Cloudera Data Platform/Databricks solution architect with 8+ years experience to drive innovations on technology interoperable platforms with Cloudera for enterprise customers in diverse industry verticals. Key focus areas would be Cloudera Data engineering Platform/Databricks with Real time analytics, Next Best Action, and Internet of Everything.

The ideal candidate will be the one who have solid working experience in Cloudera.

Responsibilities
  • Design and implement solutions to address business problems in various industry verticals.
  • Drive Proof of Concept (POC) and Proof of Technology(POT) evaluation on interoperable technology platforms
  • Train and mentor developers towards solution development and POC/POT execution
  • Support presales engineering activities for Cloudera Hadoop based RFPs
  • Planning, designing, developing and deploying solutions on various platforms for clients.
  • Working closely with the internal teams to design the data flow.
  • Managing stakeholders to understand their data needs.
  • Providing technical expertise to developers in implementation and other technical aspects.
  • Coordinating with architects to make sure the data is being used properly.
Technical

Design and implement Cloudera Hadoop based big data solutions, including leadership role in design to develop shared/reusable components.

  • Experience in Data Engineering Technologies
  • File Systems: HDFS, ADLS/Blob, AWS s3, etc.
  • Distributed Computing Frameworks: YARN/MR, Spark 2.x or later
  • SQL on Big Data: Hive, Spark, Athena, Presto, etc
  • Integrations with different types of Databases – Cassandra, MongoDB, etc
  • Data Warehouses such as Snowflake, Redshift, Synapse, etc
  • Streaming Technologies such as Kafka, Kinesis, etc
  • Good working knowledge of Pig Scripting, Oozie workflow and HBASE.
  • Define and develop client specific best practices around data management within a Hadoop environment.
  • Hands on experience on Core Java, Python in Linux environment.
  • Work with functional team/ Frameworks team to understand requirements and translate them to technical specifications.
  • Cloudera certified 

Non-Technical:

  • Strong analytical and problem solving skills.
  • Strong written and verbal communication skills.
  • Ability to work effectively under pressure with constantly changing priorities and deadlines.
  • Familiarity with project management and systems development life cycle processes, tools, concepts and methodologies is plus. 

Ability to work independently and as a team member.

Qualifications

Minimum :: Bachelors or Equivalent in Engineering ( IT, Software, ECE)