As part of this workshop, we will go through a real-time simulated project to process Salesforce data using AWS Data Analytics Services.

The video also covers a few aspects related to our Guided Program, but the majority of the content is related to the project.

Here is the program link related to Data Engineering using AWS Analytics — https://itversity.com/bundle/data-engineering-using-aws-analytics

For sales inquiries: support@itversity.com

YouTube Video – Salesforce Data Processing using AWS Data Analytics Services

Few Important Links

Here are a few important links to stay connected with us.

As we have a good Udemy follower base, let me make a special note to our Udemy customers before going through the details of the Workshop on Salesforce Data Processing using AWS Data Analytics Services.

Special Note to Udemy Customers

  • Thank you for being our esteemed customer.
  • Make sure to rate us and also provide feedback as demonstrated. Your rating and feedback are very important for the success of our community.
  • If you are an existing Udemy customer and not familiar with ITVersity courses on Udemy, feel free to visit this page.
Now let us get into the problem statement related to the Data Engineering Sample Project using Simulated Salesforce Data.

Data Engineering Sample Project — Problem Statement

Here are the details about the workshop or demo. It is primarily based on running analytics on top of Salesforce leads/sales data using a Data Lake and Data Engineering Pipelines.

  • Get the data ingested from Salesforce into the bronze layer. The data will be in a semi-structured format.
  • Apply the required transformations and store the data in the Data Warehouse.
  • After applying the required transformations, the data should be well structured and ready for reporting.
As we have come up with the problem statement for the project, it is now time to get into the design. We will go through the design using both the AWS Native approach and the Databricks on AWS approach, but we will perform the tasks using the AWS Native approach.

Data Engineering Sample Project — Design using AWS Native

Here is the design using AWS Native Data Analytics Services: Salesforce, AWS S3, AWS Lambda or AWS EMR with Spark SQL, Redshift, and a reporting tool such as AWS QuickSight. A minimal sketch of the Data Lake layout is shown after the list.

  • Salesforce — Source which generates the leads data.
  • AppFlow — Fully managed AWS service which can be used to ingest Salesforce data periodically into the Data Lake (which is nothing but AWS S3).
  • S3 — AWS Cloud Storage which can act as the Data Lake for Data Analytics applications.
  • Lambda — Used to build lightweight Data Processing applications. Lambda functions can be deployed using programming languages such as Python.
  • EMR with Spark SQL — Used to build large-scale Data Processing applications. Alternatively, we can also process data using Spark Data Frame APIs on EMR. EMR stands for Elastic MapReduce.
  • Redshift Serverless — Can be used for ad-hoc analysis of data and also to build the required Data Marts for reporting.
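
To make the design a little more concrete, here is a minimal sketch, assuming a medallion-style layout in S3, of how the Data Lake layers could be organized and inspected with boto3. The bucket name and prefixes are hypothetical placeholders and are not part of the original workshop.

```python
import boto3

# Hypothetical bucket and medallion-style prefixes for the Data Lake layers
DATA_LAKE_BUCKET = "itv-salesforce-data-lake"   # placeholder bucket name
BRONZE_PREFIX = "bronze/salesforce/leads/"      # raw data landed by AppFlow
GOLD_PREFIX = "gold/salesforce/leads/"          # curated data ready for Redshift

s3 = boto3.client("s3")

# List the files that AppFlow (or a manual upload) has landed in the bronze layer
response = s3.list_objects_v2(Bucket=DATA_LAKE_BUCKET, Prefix=BRONZE_PREFIX)
for obj in response.get("Contents", []):
    print(obj["Key"], obj["Size"])
```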

Data Engineering Sample Project — Design using Databricks on AWS

Here is the design leveraging cloud-agnostic Data Lake platforms such as Databricks. The sample project is built using Salesforce, DBFS on AWS S3, AWS Lambda or the Databricks Runtime, Databricks SQL based on Spark SQL, and a reporting tool such as AWS QuickSight.

We can also use Databricks on other prominent cloud platforms such as Azure and GCP.

  • Salesforce — Source which generates the leads data.
  • AppFlow — Fully managed AWS service which can be used to ingest Salesforce data periodically into the Data Lake (which is nothing but AWS S3).
  • DBFS — Databricks abstraction on top of cloud storage which can act as the Data Lake for Data Analytics applications. DBFS stands for Databricks File System.
  • Lambda — Used to build lightweight Data Processing applications. Lambda functions can be deployed using programming languages such as Python.
  • Databricks Runtime with Spark SQL — Used to build large-scale Data Processing applications using the distributed engine built on Spark.
  • Databricks SQL — Can be used for ad-hoc analysis of data and also to build the required Data Marts for reporting.

The above design is for Databricks on AWS. We can also build a similar solution on Azure, where AWS S3 can be replaced with ADLS, AppFlow with ADF, and AWS Lambda with Azure Functions.
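
To illustrate the Databricks flavour of the design, here is a minimal PySpark sketch that reads the bronze-layer CSV files through DBFS and writes a curated Delta table that Databricks SQL can query. The mount path, table name, and the assumption that the target schema already exists are all hypothetical.

```python
# Minimal sketch for a Databricks notebook; `spark` is provided by the Databricks Runtime.
# The mount path and table name below are hypothetical placeholders.
bronze_path = "dbfs:/mnt/data-lake/bronze/salesforce/leads/"
gold_table = "sales_gold.leads"   # assumes the sales_gold schema already exists

# Read the semi-structured CSV files landed by AppFlow
leads_bronze = (
    spark.read
    .option("header", "true")
    .csv(bronze_path)
)

# Write the curated data as a Delta table that Databricks SQL can query for reporting
(
    leads_bronze.write
    .format("delta")
    .mode("overwrite")
    .saveAsTable(gold_table)
)
```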

Data Engineering Sample Project — Implementation Plan

Here are the high-level details related to the implementation of Data Engineering using data from Salesforce.

  • Overview of Salesforce and AWS Appflow
  • Review CSV Data with semi-structured or JSON fields
  • Upload files to Bronze Layer of Data Lake (AWS s3) simulating Data Ingestion from Salesforce via AWS Appflow
  • Overview of Approach — AWS Lambda vs. AWS EMR using Spark SQL
  • Implement the solution to get the data from Bronze Layer to Gold Layer
  • Ad-hoc Analysis using Redshift Spectrum and Load Data into Redshift reporting tables
  • Overview of Reporting using Redshift Tables
As we have gone through the details related to the implementation plan, let us focus on executing those steps.

Overview of Salesforce and AWS Appflow

First, let us go through the details related to Salesforce and AppFlow, then we will get into the implementation.

  • Salesforce is a globally renowned cloud-based CRM application. It is primarily used to streamline leads, sales, and customer relationship management.
  • AWS AppFlow is a fully managed data flow service for ingesting data from standard applications like Salesforce. It can be used to get data not only from Salesforce but also from other platforms such as ServiceNow (see the sketch after this list).
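
For illustration, here is a minimal sketch of triggering an already configured AppFlow flow on demand using boto3. The flow name is a hypothetical placeholder; the flow itself (Salesforce connection, S3 destination, and schedule) is normally set up through the AWS console.

```python
import boto3

appflow = boto3.client("appflow")

# Trigger an on-demand run of an already configured Salesforce-to-S3 flow.
# "salesforce-leads-to-s3" is a hypothetical flow name.
response = appflow.start_flow(flowName="salesforce-leads-to-s3")
print(response["flowStatus"], response.get("executionId"))
```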

Typically, we can ingest data from Salesforce using AWS AppFlow in the form of CSV files. Let us review the data first, before coming up with the required logic to process it.

Review CSV Data with semi structured or JSON Fields

Before we start building applications on top of the CSV data, we need to understand the characteristics of the data.

  • Whether the CSV files which contain the data have a header or not.
  • Delimiter or separator. Even though CSV stands for Comma-Separated Values, we might end up using another separator or delimiter as well.
  • Enclosing Character
  • Escape Character

Details about the enclosing character and escape character are covered in depth as part of the workshop or lecture. Once we understand the characteristics of the data, we can take care of copying the data to S3.
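
As an example of how these characteristics translate into code, here is a hedged sketch using pandas. The file name, the `lead_details` JSON column, and the exact delimiter, quote, and escape settings are assumptions for illustration.

```python
import json

import pandas as pd

# Hypothetical bronze-layer file with a JSON column holding semi-structured lead details
leads = pd.read_csv(
    "leads_2024_01.csv",   # placeholder file name
    sep=",",               # delimiter or separator
    header=0,              # the files are assumed to have a header row
    quotechar='"',         # enclosing character around fields containing the delimiter
    escapechar="\\",       # escape character used inside enclosed fields
)

# Parse the semi-structured JSON field into proper columns
lead_details = leads["lead_details"].apply(json.loads).apply(pd.Series)
leads_structured = pd.concat([leads.drop(columns=["lead_details"]), lead_details], axis=1)
print(leads_structured.head())
```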

Data Ingestion from Salesforce into s3 using Appflow

As we do not have an active Salesforce account and integration with AppFlow, for the sake of the sample project or workshop, the data is already uploaded to AWS S3 (see the upload sketch after this list).

  • Many times, as part of such projects, the integration might take time. As long as we know the structure of the data, we can generate sample data and build the rest of the data pipeline.
  • This approach might result in some rework, but it will definitely increase the productivity of the team.
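
Here is a minimal sketch of simulating the AppFlow ingestion by uploading locally generated sample files to the bronze layer. The bucket, prefix, and file names are hypothetical.

```python
import boto3

s3 = boto3.client("s3")

# Hypothetical bucket, prefix, and sample files that simulate AppFlow landing data
bucket = "itv-salesforce-data-lake"
bronze_prefix = "bronze/salesforce/leads/"
sample_files = ["leads_2024_01.csv", "leads_2024_02.csv"]

for file_name in sample_files:
    # upload_file(local_path, bucket, key)
    s3.upload_file(file_name, bucket, bronze_prefix + file_name)
    print(f"Uploaded {file_name} to s3://{bucket}/{bronze_prefix}{file_name}")
```
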
As we are ready with the data in S3, let us go through the criteria to decide between AWS Lambda and AWS EMR with Spark SQL.

AWS Lambda vs. AWS EMR with Spark SQL

Even though AWS EMR with Spark SQL is effective for large-scale data sets, it might not be a viable solution for small to medium-sized data sets.

  • Instead of using AWS EMR with Apache Spark, it is better to use AWS Lambda functions for lightweight Data Processing.
  • AWS Lambda functions take significantly less time to start and are also fast when it comes to lightweight Data Processing.
Now that we understand the criteria to choose between AWS Lambda and Spark SQL, let us see both in action.

Solution using AWS Lambda

Here is the solution to transform semi-structured data to structured data using AWS Lambda Functions.
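
The working solution is demonstrated in the video. As a reference point, here is a minimal sketch of what such a Lambda function could look like using only the standard library and boto3; the bucket, prefixes, and the `lead_details` JSON column are assumptions for illustration.

```python
import csv
import io
import json

import boto3

s3 = boto3.client("s3")

# Hypothetical bucket and gold-layer prefix
BUCKET = "itv-salesforce-data-lake"
GOLD_PREFIX = "gold/salesforce/leads/"


def lambda_handler(event, context):
    """Flatten the JSON column of a bronze-layer CSV file and write it to the gold layer."""
    # Assumes the function is triggered by an S3 event for a newly landed bronze file
    record = event["Records"][0]["s3"]
    bucket = record["bucket"]["name"]
    key = record["object"]["key"]

    body = s3.get_object(Bucket=bucket, Key=key)["Body"].read().decode("utf-8")
    reader = csv.DictReader(io.StringIO(body))

    flattened_rows = []
    for row in reader:
        # Parse the semi-structured JSON field and merge it into the flat record
        details = json.loads(row.pop("lead_details", "{}"))
        flattened_rows.append({**row, **details})

    if not flattened_rows:
        return {"status": "no data", "input_key": key}

    # Write the structured output back to the gold layer as CSV
    output = io.StringIO()
    writer = csv.DictWriter(output, fieldnames=flattened_rows[0].keys())
    writer.writeheader()
    writer.writerows(flattened_rows)

    gold_key = GOLD_PREFIX + key.split("/")[-1]
    s3.put_object(Bucket=BUCKET, Key=gold_key, Body=output.getvalue().encode("utf-8"))
    return {"status": "ok", "records": len(flattened_rows), "output_key": gold_key}
```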

Solution using Spark SQL on EMR

Here is the solution to process Semi-Structured Data from Salesforce using Spark SQL on EMR.
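
Again, the complete walkthrough is in the video; the sketch below shows the general shape of a Spark SQL job that could run as an EMR step. The S3 paths and column names are hypothetical.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("salesforce-bronze-to-gold").getOrCreate()

# Hypothetical bronze and gold locations in the Data Lake
bronze_path = "s3://itv-salesforce-data-lake/bronze/salesforce/leads/"
gold_path = "s3://itv-salesforce-data-lake/gold/salesforce/leads/"

# Expose the semi-structured CSV data as a temporary view for Spark SQL
spark.read.option("header", "true").csv(bronze_path).createOrReplaceTempView("leads_bronze")

# Use get_json_object to pull structured attributes out of the JSON column
leads_gold = spark.sql("""
    SELECT
        lead_id,
        created_date,
        get_json_object(lead_details, '$.company')     AS company,
        get_json_object(lead_details, '$.lead_source') AS lead_source,
        get_json_object(lead_details, '$.status')      AS status
    FROM leads_bronze
""")

# Write the structured data to the gold layer in a columnar format
leads_gold.write.mode("overwrite").parquet(gold_path)
```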

We can build an orchestrated pipeline using either of these approaches by leveraging AWS Step Functions.
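
For illustration, here is a minimal sketch that registers a two-state Step Functions state machine with boto3, where the Lambda transformation runs first and a Glue crawler refreshes the catalog afterwards. All names and ARNs are hypothetical placeholders.

```python
import json

import boto3

sfn = boto3.client("stepfunctions")

# Hypothetical two-step pipeline: run the Lambda transformation, then crawl the gold layer
definition = {
    "StartAt": "TransformBronzeToGold",
    "States": {
        "TransformBronzeToGold": {
            "Type": "Task",
            "Resource": "arn:aws:lambda:us-east-1:123456789012:function:leads-bronze-to-gold",
            "Next": "CrawlGoldLayer",
        },
        "CrawlGoldLayer": {
            "Type": "Task",
            "Resource": "arn:aws:states:::aws-sdk:glue:startCrawler",
            "Parameters": {"Name": "leads-gold-crawler"},
            "End": True,
        },
    },
}

response = sfn.create_state_machine(
    name="salesforce-leads-pipeline",
    definition=json.dumps(definition),
    roleArn="arn:aws:iam::123456789012:role/step-functions-pipeline-role",
)
print(response["stateMachineArn"])
```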

Analysis or Reporting using Amazon Redshift

Here are the Redshift queries used as part of the sample project to take care of analysis as well as reporting on the data using Amazon Redshift.

Here are the steps involved to run queries against the data in AWS S3 (a hedged sketch of the key statements follows the list).

  • Create Glue Crawler pointing to s3 locations
  • Create Database and Tables by running Glue Crawler
  • Create an external schema in Redshift. Querying S3 data through an external schema is also known as Redshift Spectrum.
  • Run relevant queries to perform Ad-hoc Analysis. For reporting purposes, it is recommended to ensure data is copied to Redshift tables.
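
Here is a hedged sketch of the key statements, submitted through the Redshift Data API against a Redshift Serverless workgroup. The workgroup, database, Glue Catalog database, IAM role, and table names are hypothetical.

```python
import boto3

redshift_data = boto3.client("redshift-data")

# Hypothetical Redshift Serverless workgroup and database
WORKGROUP = "itv-analytics"
DATABASE = "dev"

# 1. Expose the Glue Catalog database built by the crawler as an external schema (Spectrum)
create_external_schema = """
    CREATE EXTERNAL SCHEMA IF NOT EXISTS leads_ext
    FROM DATA CATALOG
    DATABASE 'salesforce_gold'
    IAM_ROLE 'arn:aws:iam::123456789012:role/redshift-spectrum-role'
"""

# 2. Copy the curated data into an internal table for reporting
create_reporting_table = """
    CREATE TABLE leads_reporting AS
    SELECT lead_source, status, COUNT(*) AS lead_count
    FROM leads_ext.leads
    GROUP BY lead_source, status
"""

# execute_statement is asynchronous; a real pipeline would poll describe_statement between steps
for sql in (create_external_schema, create_reporting_table):
    response = redshift_data.execute_statement(
        WorkgroupName=WORKGROUP, Database=DATABASE, Sql=sql
    )
    print(response["Id"])
```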

Reports built against Redshift internal tables perform better compared to Redshift External Tables.