As part of this workshop, we will go through a simulated real-time project to process Salesforce data using AWS Data Analytics Services.
The video also covers a few aspects related to our Guided Program, but the majority of the content is related to the project.
Here is the program link related to Data Engineering using AWS Analytics — https://itversity.com/bundle/data-engineering-using-aws-analytics
For sales inquiries: support@itversity.com
YouTube Video – Salesforce Data Processing using AWS Data Analytics Services
Few Important Links
Here are a few important links to stay connected with us.
- Visit our website — https://www.itversity.com
- Subscribe to our YouTube Channel — https://www.youtube.com/itversityin?sub_confirmation=1
- Follow our LinkedIn Page — https://www.linkedin.com/company/itversity
- Sign up for our Newsletter — https://forms.gle/mwVYMRzAdv89rxRf8
- Udemy Profile — https://www.udemy.com/user/it-versity/
- For Inquiries: support@itversity.com
Special Note to Udemy Customers
- Thank you for being our esteemed customer.
- Make sure to rate us and also provide feedback as demonstrated. Your rating and feedback are very important for the success of our community.
- If you are an existing Udemy customer and are not familiar with ITVersity courses on Udemy, feel free to visit this page.
Data Engineering Sample Project — Problem Statement
Here are the details about the workshop or demo. It is primarily based on running analytics on top of Salesforce leads/sales data using a Data Lake and Data Engineering Pipelines.
- Get the data ingested from Salesforce into the bronze layer. The data will be in a semi-structured format.
- Apply the required transformations and store the data in the Data Warehouse.
- After applying the required transformations, the data should be well structured and ready for reporting.
Data Engineering Sample Project — Design using AWS Native
Here is the design using AWS native Data Analytics Services: Salesforce, AWS S3, AWS Lambda or AWS EMR with Spark SQL, Redshift, and a reporting tool such as AWS QuickSight.
- Salesforce — Source which generates the leads data.
- AppFlow — Fully managed AWS service which can be used to periodically ingest Salesforce data into the Data Lake (which is nothing but AWS S3).
- S3 — AWS cloud storage which can act as the Data Lake for Data Analytics applications.
- Lambda — Used to build lightweight Data Processing applications. Lambda functions can be deployed using programming languages such as Python.
- EMR with Spark SQL — Used to build large-scale Data Processing applications. Alternatively, we can also process data using Spark Data Frame APIs on EMR. EMR stands for Elastic MapReduce.
- Redshift Serverless — Can be used for ad-hoc analysis of data and also to build the required Data Marts for reporting.
Data Engineering Sample Project — Design using Databricks on AWS
Here is the design leveraging cloud-agnostic Data Lake platforms such as Databricks. The sample project is built using Salesforce, DBFS on AWS S3, AWS Lambda or Databricks Runtime, Databricks SQL based on Spark SQL, and a reporting tool such as AWS QuickSight.
We can also use Databricks on other prominent cloud platforms such as Azure and GCP.
- Salesforce — Source which generates the leads data.
- AppFlow — Fully managed AWS service which can be used to periodically ingest Salesforce data into the Data Lake (which is nothing but AWS S3).
- DBFS — Databricks abstraction on top of cloud storage which can act as the Data Lake for Data Analytics applications. DBFS stands for Databricks File System.
- Lambda — Used to build lightweight Data Processing applications. Lambda functions can be deployed using programming languages such as Python.
- Databricks Runtime with Spark SQL — Used to build large-scale Data Processing applications using the distributed engine built on top of Spark.
- Databricks SQL — Can be used for ad-hoc analysis of data and also to build the required Data Marts for reporting.
The above design is for Databricks on AWS. We can also build a similar solution on Azure, where AWS S3 can be replaced with ADLS, AppFlow can be replaced with ADF, and AWS Lambda can be replaced with Azure Functions.
Data Engineering Sample Project — Implementation Plan
Here are the high-level details related to the implementation of Data Engineering using data from Salesforce.
- Overview of Salesforce and AWS AppFlow
- Review CSV Data with semi-structured or JSON fields
- Upload files to the Bronze Layer of the Data Lake (AWS S3), simulating Data Ingestion from Salesforce via AWS AppFlow
- Overview of Approach — AWS Lambda vs. AWS EMR using Spark SQL
- Implement the solution to get the data from the Bronze Layer to the Gold Layer
- Ad-hoc Analysis using Redshift Spectrum, and loading data into Redshift reporting tables
- Overview of Reporting using Redshift Tables
Overview of Salesforce and AWS AppFlow
First, let us go through the details related to Salesforce and AppFlow, and then we will get into the implementation.
- Salesforce is a globally renowned cloud-based CRM application. It is primarily used to streamline leads, sales, and Customer Relationship Management.
- AWS AppFlow is a fully managed Data Flow service to ingest data from standard applications like Salesforce. It can be used to get data not only from Salesforce but also from other platforms such as ServiceNow.
Typically, we can ingest data from Salesforce using AWS AppFlow in the form of CSV files. Let us review the data first, before coming up with the required logic to process it.
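As a reference, here is a minimal sketch of triggering an existing AppFlow flow on demand using boto3. The flow name is a hypothetical placeholder; the flow itself would be configured in the AppFlow console with Salesforce as the source and S3 as the destination.

```python
import boto3

# A minimal sketch: trigger an existing AppFlow flow on demand.
# "salesforce-leads-to-s3" is a hypothetical flow name.
appflow = boto3.client("appflow")

response = appflow.start_flow(flowName="salesforce-leads-to-s3")
print(response["flowStatus"])
```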
Review CSV Data with semi-structured or JSON Fields
Before starting to build applications on the CSV data, we need to understand its characteristics.
- Whether the CSV files which contain the data have a header or not
- Delimiter or Separator — even though CSV stands for Comma-Separated Values, we might end up using other separators or delimiters as well
- Enclosing Character
- Escape Character
Details about the Enclosing Character and Escape Character are covered in depth as part of the workshop or lecture. Once we understand the characteristics of the data, we can take care of copying the data to S3.
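To make these characteristics concrete, here is a minimal sketch of reading such a file with Python's csv module. The file name and the delimiter, enclosing (quote), and escape characters are assumptions; adjust them to match the actual files.

```python
import csv

# A minimal sketch, assuming a local file named leads.csv; the delimiter,
# enclosing (quote) character, and escape character are assumptions.
with open("leads.csv", newline="") as f:
    reader = csv.reader(
        f,
        delimiter=",",      # separator between fields
        quotechar='"',      # enclosing character for fields containing the delimiter
        escapechar="\\",    # escape character for embedded special characters
    )
    header = next(reader)   # first row, assuming the file has a header
    print(header)
    print(next(reader))     # first data row, for a quick sanity check
```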
Data Ingestion from Salesforce into S3 using AppFlow
As we do not have an active Salesforce account and an integration with AppFlow, the data is already uploaded to AWS S3 for the sake of the Sample Project or Workshop (the upload can be simulated as shown in the sketch after this list).
- Many times as part of projects, the integration might take time. As long as we know the structure of the data, we can generate sample data and build the rest of the data pipeline.
- This approach might result in some rework, but it will definitely increase the productivity of the team.
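Here is a minimal sketch of simulating the ingestion by uploading a locally generated sample file to the bronze layer using boto3. The bucket name and prefix are hypothetical.

```python
import boto3

# A minimal sketch: simulate AppFlow ingestion by uploading a locally
# generated sample file. The bucket name and prefix are hypothetical.
s3 = boto3.client("s3")
s3.upload_file(
    Filename="leads.csv",                     # locally generated sample data
    Bucket="itv-data-lake",                   # hypothetical bucket
    Key="bronze/salesforce/leads/leads.csv",  # bronze layer prefix
)
```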
AWS Lambda vs. AWS EMR with Spark SQL
Even though AWS EMR with Spark SQL is effective for large-scale data sets, it might not be a viable solution for small to medium-sized data sets.
- Instead of using AWS EMR with Apache Spark, it is better to use AWS Lambda functions for lightweight Data Processing.
- AWS Lambda functions take significantly less time to start and are also faster when it comes to lightweight Data Processing.
Solution using AWS Lambda
Here is the solution to transform semi-structured data into structured data using AWS Lambda functions.
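The full function is walked through in the workshop; as a reference, here is a minimal sketch of such a Lambda function. The bucket, prefixes, and the semi-structured JSON column named LeadDetails are all hypothetical assumptions.

```python
import csv
import io
import json

import boto3

s3 = boto3.client("s3")

# Hypothetical bucket and prefixes for the bronze and gold layers
BUCKET = "itv-data-lake"
BRONZE_KEY = "bronze/salesforce/leads/leads.csv"
GOLD_KEY = "gold/salesforce/leads/leads.csv"


def lambda_handler(event, context):
    # Read the semi-structured CSV from the bronze layer
    body = s3.get_object(Bucket=BUCKET, Key=BRONZE_KEY)["Body"].read().decode("utf-8")
    reader = csv.DictReader(io.StringIO(body))

    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=["id", "company", "status"])
    writer.writeheader()

    count = 0
    for row in reader:
        # Flatten the assumed JSON field into regular, well-structured columns
        details = json.loads(row["LeadDetails"])
        writer.writerow({
            "id": row["Id"],
            "company": details.get("Company"),
            "status": details.get("Status"),
        })
        count += 1

    # Write the structured output to the gold layer
    s3.put_object(Bucket=BUCKET, Key=GOLD_KEY, Body=out.getvalue())
    return {"records": count}
```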
Solution using Spark SQL on EMR
Here is the solution to process semi-structured data from Salesforce using Spark SQL on EMR.
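The full job is walked through in the workshop; as a reference, here is a minimal sketch using PySpark with Spark SQL, assuming hypothetical S3 paths and the same hypothetical LeadDetails JSON column. It would be submitted to EMR via spark-submit or as an EMR step.

```python
from pyspark.sql import SparkSession

# A minimal sketch, assuming hypothetical S3 paths and a JSON column
# named LeadDetails in the ingested CSV files.
spark = SparkSession.builder.appName("SalesforceLeads").getOrCreate()

spark.read.csv(
    "s3://itv-data-lake/bronze/salesforce/leads/",
    header=True,
    quote='"',
    escape='"',
).createOrReplaceTempView("leads_raw")

# Spark SQL to flatten the semi-structured JSON field into columns
leads = spark.sql("""
    SELECT
        Id AS id,
        get_json_object(LeadDetails, '$.Company') AS company,
        get_json_object(LeadDetails, '$.Status') AS status
    FROM leads_raw
""")

leads.write.mode("overwrite").parquet("s3://itv-data-lake/gold/salesforce/leads/")
```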
We can build an orchestrated pipeline using either of the approaches by leveraging AWS Step Functions.
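As a reference, here is a minimal sketch of creating a single-task state machine that invokes the transformation Lambda function; the Lambda ARN and IAM role are hypothetical placeholders.

```python
import json

import boto3

# A minimal sketch: a single-task state machine that invokes the
# transformation Lambda. The Lambda ARN and IAM role are hypothetical.
sfn = boto3.client("stepfunctions")

definition = {
    "StartAt": "TransformLeads",
    "States": {
        "TransformLeads": {
            "Type": "Task",
            "Resource": "arn:aws:lambda:us-east-1:123456789012:function:transform-leads",
            "End": True,
        }
    },
}

sfn.create_state_machine(
    name="salesforce-leads-pipeline",
    definition=json.dumps(definition),
    roleArn="arn:aws:iam::123456789012:role/stepfunctions-role",  # hypothetical
)
```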
Analysis or Reporting using Amazon Redshift
Here are the Redshift queries used as part of the Sample Project to take care of analysis as well as reporting on the data using Amazon Redshift.
Here are the steps involved in running queries against the data in AWS S3.
- Create a Glue Crawler pointing to the S3 locations
- Create the Database and Tables by running the Glue Crawler
- Create an External Schema in Redshift. Querying S3 data through external schemas and tables is known as Redshift Spectrum.
- Run relevant queries to perform ad-hoc analysis. For reporting purposes, it is recommended to ensure the data is copied into Redshift tables, as shown in the sketch after this list.
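The actual queries are covered in the workshop; as a reference, here is a minimal sketch using the Redshift Data API against a serverless workgroup. The workgroup, database, IAM role, schema, and table names are all hypothetical.

```python
import boto3

# A minimal sketch using the Redshift Data API; the workgroup, database,
# IAM role, schema, and table names are all hypothetical.
client = boto3.client("redshift-data")

statements = [
    # External schema over the Glue Data Catalog (queried via Redshift Spectrum)
    """
    CREATE EXTERNAL SCHEMA IF NOT EXISTS leads_ext
    FROM DATA CATALOG DATABASE 'salesforce_db'
    IAM_ROLE 'arn:aws:iam::123456789012:role/redshift-spectrum-role'
    """,
    # Ad-hoc analysis directly against the data in S3
    "SELECT status, count(1) FROM leads_ext.leads GROUP BY status",
    # Copy the data into an internal table for reporting
    "CREATE TABLE leads_reporting AS SELECT * FROM leads_ext.leads",
]

for sql in statements:
    # execute_statement is asynchronous; in practice, poll describe_statement
    # to confirm each statement finishes before submitting the next one
    client.execute_statement(
        WorkgroupName="itv-redshift-wg",  # hypothetical serverless workgroup
        Database="dev",
        Sql=sql,
    )
```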
Reports built against Redshift internal tables perform better than those built against Redshift external tables.