AWS EMR Migration: Your Organization’s Need 

Is your organization running into barriers with on-premises Apache Hadoop or Apache Spark deployments? Is your team over-provisioning resources to cope with workload variability? Do you spend too much time keeping up with rapidly changing open-source software?

If so, your organization is not alone. Migrating big data and machine learning workloads to AWS and Amazon EMR offers many advantages over on-premises deployments. These include greater agility, separation of compute and storage, persistent and resilient storage, and managed services that provide up-to-date, easy-to-use environments for developing and operating big data applications.

Migrating big data and analytics workloads from on-premises to the cloud calls for careful decision-making. Many customers have successfully migrated their big data from on-premises to Amazon EMR with our help. This article lays out a step-by-step EMR migration process based on those successful case studies, which also showcases best practices for:

  • Migrating data, applications, and catalogs
  • Using persistent and transient resources
  • Configuring security policies, access controls, and audit logs
  • Estimating and minimizing costs, while maximizing value
  • Leveraging the AWS Cloud for high availability (HA) and disaster recovery (DR)

Organizations everywhere are aware of the power of data analytics and processing frameworks such as Apache Hadoop. Still, implementing and operating these frameworks in on-premises data lake environments is challenging.

AWS EMR migration helps organizations move their Hadoop deployments and big data workloads within budget and timeline estimates.

Upgrading and scaling on-premises hardware to accommodate growing workloads causes significant downtime and is not financially viable. This has led organizations to re-architect with AWS EMR and build a modern system that is high-performing, secure, sustainable, and cost-effective.

Challenges of On-Premises Hadoop Data Lakes

Scalability is the biggest challenge for organizations running Hadoop on-premises, because scaling requires purchasing additional hardware. Such deployments are not elastic and keep clusters running for long periods of time. The cost of workloads gradually increases because of the ‘always on’ infrastructure, while data recovery and high availability have to be managed manually.

Critical Needs of Organizations

  • Flexible, easily scalable infrastructure that can be provisioned quickly based on requirements.
  • Reduced dependency on administrators through a fully managed service.
  • Cost optimization with the ability to switch the infrastructure on and off based on workload demand.
  • Innovative approaches for improved return on investment (ROI) in the long term.
  • Evaluating new open-source technologies by spinning up sandboxes in real time.
  • Integrating cloud security with the Hadoop ecosystem.

AWS Data Analytics – Reference Architecture Diagram:

Benefits of AWS EMR Migration?

Organizations' primary considerations are data-driven insights and cost optimization, with the aim of near-zero idle capacity and quick business value. Below are some key strengths of AWS EMR migration designs that help organizations achieve these goals.

  • Decoupling of the storage and compute systems.
  • A seamless data lake environment with Amazon S3.
  • A stateless compute infrastructure.
  • Cluster capabilities that are both persistent and transient.
  • Cluster segmentation by business unit for improved isolation, customization, and cost allocation.

Enterprise Data Lake on AWS

The following technologies can be used during migration:

  1. Low-cost storage using S3
  2. Ad-hoc analysis using Athena or Redshift
  3. Streaming pipelines using Kinesis or Kafka
  4. Operational Data Store using DynamoDB
  5. Data warehouse using Redshift
  6. Visualizations using QuickSight
  7. Orchestration using Step Functions or Airflow
  8. Other key AWS services such as Lambda, API Gateway, and SQS

Key Strategies for Hadoop to EMR Migration

Retire

Taking stock of everything in your environment that could be migrated lets you estimate the value of each product, application, or service. Identify the users of each migration candidate and check what is actually being used and what is not. Finding out what you can retire saves money on elements that should already have been phased out.

Retain

A few elements of your environment may not migrate and are kept as they are. There are several reasons for retaining an in-house element: you may want to ride out its remaining depreciation, the cost of migration may be too high, or your company may get more value from keeping the application or service where it is. Keeping some IT assets on-premises is a common choice in a hybrid cloud setup.

Lift and Shift

This strategy helps companies complete their Hadoop to EMR migration quickly and speed up the shutdown of their on-premises data center. It lets organizations avoid cost-intensive hardware upgrades. The lift and shift strategy helps organizations keep their existing Hadoop data segregated and classified by using Amazon S3. It also helps them decouple resources and keep code changes to a bare minimum. With the lift and shift approach, code is simply moved to the cloud environment as it is.

Re-Platforming

The re-platform strategy for Hadoop to EMR migration lets organizations maximize the benefits of cloud migration by making use of the full set of features offered by AWS EMR. With this strategy, organizations can tune their workloads and infrastructure for cost-effectiveness, scalability, and performance. It also lets organizations integrate their Hadoop ecosystem with cloud monitoring and security. Although re-platforming resembles the lift and shift approach, it offers comparatively fewer optimizations than a full re-architecture when it comes to cloud features and offerings.

Refactor and Re-Architecting

The re-architecting strategy helps organizations re-imagine their insights ecosystem in the cloud with Hadoop on AWS EMR. It helps them open up their data to a larger pool of users while lowering the time-to-insight. This can be attributed to streaming analytics capabilities that let organizations meet their own demand through self-service while building greater capabilities.

The re-architecting strategy addresses the challenges organizations face, from assessing business priorities to building a cloud-based data platform. It involves changing the architecture with cloud-native services to increase performance, provision scalable solutions, and improve the cost-effectiveness of the infrastructure.

Summary

Apache Hadoop to AWS EMR migration is a good match for organizations with long-term goals. With this migration, organizations can re-architect their infrastructure around existing AWS cloud services such as S3, Athena, Lake Formation, Redshift, and the Glue Data Catalog. Organizations looking for simple, quick scalability and elasticity with better cluster utilization should consider AWS EMR migration. It also helps in achieving cost efficiency and implementing a well-architected solution.

Let us learn how to build DBT models using Apache Spark on an AWS EMR cluster with a denormalized JSON data set.

Here is the high-level agenda for this session.

  • DBT for ELT (Extract, Load, and Transform)
  • Overview of DBT CLI and DBT Cloud
  • Setting up EMR Cluster with Thrift Server using a Step
  • Overview of Semi Structured Data Set used for the Demo
  • Develop required queries using Spark SQL on AWS EMR
  • Develop the Spark Application on AWS EMR using DBT Cloud
  • Run the Spark Application on AWS EMR using DBT Cloud
  • Overview of Orchestration using Tools like Airflow

DBT for ELT (Extract, Load, and Transform)

First, let us understand what ELT is and where DBT comes into play.

  • ELT stands for Extract, Load, and Transform.
  • DBT is a tool used purely for the transformation step; it leverages the target database's resources to process the data.

Based on the requirements and design, we need to modularize and develop models using DBT. When the models are run, DBT compiles them into SQL queries and executes those queries on the target database.

The open source community of DBT has developed adapters for all leading data platforms such as Spark, Databricks, Redshift, Snowflake, etc.

Overview of DBT CLI and DBT Cloud

DBT CLI and DBT Cloud can be used to develop DBT Models based on the requirements.

  • DBT CLI is completely open source and can be set up on Windows, Mac, or Linux desktops.
  • As part of the DBT CLI installation, we install dbt-core along with the relevant adapter for the target database.

Setting up EMR Cluster with Thrift Server using a Step

As we are not processing a significantly large amount of data, we will set up a single node EMR cluster using the latest version. If you are not familiar with AWS EMR, you can sign up for this course on Udemy.

DBT connects to the target database through the Spark Thrift Server (Spark's JDBC/ODBC endpoint), so we need to make sure the Thrift Server is started when the EMR cluster comes up with Spark. When configuring the single node cluster, add a step that uses command-runner.jar with the command sudo /usr/lib/spark/sbin/start-thriftserver.sh so that the Spark Thrift Server is started after the cluster is started.

Here is the screenshot to configure the step.
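
For readers who prefer to script the cluster instead of using the console, the same step can be expressed as a plain dictionary. This is a minimal sketch, not the exact configuration used in the demo: the step name and the ActionOnFailure value are assumptions, while command-runner.jar and the script path come from the text above.

```python
import json

# Path of the Thrift Server launcher on EMR, as quoted in the text above.
THRIFT_SERVER_SCRIPT = "/usr/lib/spark/sbin/start-thriftserver.sh"


def thrift_server_step(name="Start Spark Thrift Server"):
    """Build an EMR step definition that starts the Spark Thrift Server."""
    return {
        "Name": name,  # hypothetical display name
        "ActionOnFailure": "CONTINUE",  # assumption: keep the cluster up if the step fails
        "HadoopJarStep": {
            "Jar": "command-runner.jar",
            "Args": ["sudo", THRIFT_SERVER_SCRIPT],
        },
    }


step = thrift_server_step()
print(json.dumps(step, indent=2))
```

A dictionary of this shape is what the boto3 EMR client expects in the Steps list of run_job_flow or add_job_flow_steps, so it can be reused when the cluster is created from code.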

Overview of Semi Structured Data Set used for the Demo

Here are the details of the semi-structured data set used for the demo. The data set has 5 columns.

  1. order_id which is of type integer
  2. order_date which is string representation of the date
  3. order_customer_id which is of type integer
  4. order_status which is of type string
  5. order_items which is of type string, but the string is a valid JSON array.

We can convert a string that contains a JSON array to a Spark array<struct> using the from_json function of Spark SQL. However, we need to specify the schema as the second argument when invoking from_json on the order_items column in our data set.
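
To get a feel for what from_json followed by explode_outer does to a single record, here is a plain-Python analogue. It is only a sketch and does not use Spark: the sample record is made up, and unlike from_json it does not enforce a schema or cast field types.

```python
import json


def explode_outer_items(order):
    """Yield one row per order item; yield one row with None if there are no items."""
    raw = order.get("order_items")
    items = json.loads(raw) if raw else []
    # Copy every column except the raw JSON string.
    base = {key: value for key, value in order.items() if key != "order_items"}
    if not items:
        # Like explode_outer, keep the order even when the array is null or empty.
        yield {**base, "order_item": None}
        return
    for item in items:
        yield {**base, "order_item": item}


# Made-up sample record with two items stored as a JSON array in a string.
sample = {
    "order_id": 1,
    "order_date": "2013-07-25",
    "order_customer_id": 11599,
    "order_status": "CLOSED",
    "order_items": json.dumps([
        {"order_item_id": 1, "order_item_order_id": 1, "order_item_subtotal": 299.98},
        {"order_item_id": 2, "order_item_order_id": 1, "order_item_subtotal": 100.0},
    ]),
}

rows = list(explode_outer_items(sample))
for row in rows:
    print(row["order_id"], row["order_item"])
```

Each item in the array becomes its own row carrying the order-level columns, which is the shape the later revenue query relies on.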

Develop required queries using Spark SQL on AWS EMR

Here are the queries to process the semi-structured JSON Data using Spark SQL.

Spark SQL lets us query files directly by specifying their path in a SELECT query.

SELECT *
FROM JSON.`s3://airetail/order_details`

The column order_items is of type string and holds a JSON array. We can convert it to a Spark array using from_json as below (the query assumes the data is available as a table or view named order_details).

SELECT order_id, order_date, order_customer_id, order_status,
    explode_outer(from_json(order_items, 'array<struct<order_item_id:INT, order_item_order_id:INT, order_item_product_id:INT, order_item_quantity:INT, order_item_subtotal:FLOAT, order_item_product_price:FLOAT>>')) AS order_item
FROM order_details

Here is the final query, which has the core logic to compute monthly revenue considering COMPLETE or CLOSED orders.

WITH order_details_exploded AS (
    SELECT order_id, order_date, order_customer_id, order_status,
        explode_outer(from_json(order_items, 'array<struct<order_item_id:INT, order_item_order_id:INT, order_item_product_id:INT, order_item_quantity:INT, order_item_subtotal:FLOAT, order_item_product_price:FLOAT>>')) AS order_item
    FROM order_details
) SELECT date_format(order_date, 'yyyy-MM') AS order_month,
    round(sum(order_item.order_item_subtotal), 2) AS revenue
FROM order_details_exploded
WHERE order_status IN ('COMPLETE', 'CLOSED')
GROUP BY 1
ORDER BY 1
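
As a quick sanity check of the aggregation logic, the same filter, group, and sum can be reproduced in plain Python on a handful of rows. This is a sketch with made-up sample data, assuming the rows are already exploded and that order_date starts with the year and month in yyyy-MM form.

```python
from collections import defaultdict


def monthly_revenue(rows):
    """Sum order_item_subtotal per month for COMPLETE or CLOSED orders."""
    totals = defaultdict(float)
    for row in rows:
        item = row.get("order_item")
        # Rows without an item are skipped here; in SQL, a month that has only
        # such rows would still appear, with a NULL revenue.
        if row["order_status"] not in ("COMPLETE", "CLOSED") or item is None:
            continue
        month = row["order_date"][:7]  # assumes the string starts with yyyy-MM
        totals[month] += item["order_item_subtotal"]
    return sorted((month, round(total, 2)) for month, total in totals.items())


# Made-up exploded rows; the PENDING order must not count towards revenue.
sample_rows = [
    {"order_date": "2013-07-25", "order_status": "CLOSED",
     "order_item": {"order_item_subtotal": 299.98}},
    {"order_date": "2013-07-26", "order_status": "COMPLETE",
     "order_item": {"order_item_subtotal": 100.0}},
    {"order_date": "2013-07-27", "order_status": "PENDING",
     "order_item": {"order_item_subtotal": 50.0}},
    {"order_date": "2013-08-01", "order_status": "COMPLETE",
     "order_item": {"order_item_subtotal": 129.99}},
]

revenue = monthly_revenue(sample_rows)
print(revenue)
```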

Develop the DBT Models using Spark on AWS EMR

Let us go ahead and set up the project to develop the DBT models required to compute monthly revenue. We’ll break the overall logic to compute monthly revenue into 2 dependent DBT models.

Here are the steps involved in the development process.

  1. Set up the DBT project using the Spark adapter
  2. Run the example models to confirm that the project is set up successfully
  3. Develop Required DBT Models with core logic
  4. Update Project File (change project name and also make required changes related to the models)

Here is the code for the first model, order_details_exploded.sql, where we preserve the logic for exploded order details in the form of a view.

{{ config(materialized='view') }}

SELECT order_id, order_date, order_customer_id, order_status,
    explode_outer(from_json(order_items, 'array<struct<order_item_id:INT, order_item_order_id:INT, order_item_product_id:INT, order_item_quantity:INT, order_item_subtotal:FLOAT, order_item_product_price:FLOAT>>')) AS order_item
FROM JSON.`s3://airetail/order_details`

Here is the code for the second model, monthly_revenue.sql, where we preserve the results in a table at the specified S3 location. The configuration for creating the table at a specific S3 location can be specified either in this model or at the project level in dbt_project.yml.

{{ 
    config(
        materialized='table',
        location_root='s3://airetail/monthly_revenue'
    ) 
}}

SELECT date_format(order_date, 'yyyy-MM') AS order_month,
    round(sum(order_item.order_item_subtotal), 2) AS revenue
FROM {{ ref('order_details_exploded') }}
WHERE order_status IN ('COMPLETE', 'CLOSED')
GROUP BY 1
ORDER BY 1

Run the DBT Models using Spark on AWS EMR

Now that the development of the DBT models using the Spark adapter is done, let us see how to run and validate them.

  1. Run the DBT project with 2 models
  2. Log in to the EMR cluster and launch Spark SQL
  3. Run a query against the target location where the monthly revenue data is stored.

Overview of Orchestration using Tools like Airflow

DBT applications are primarily developed to implement the required transformation logic using the ELT pattern. The overall pipeline might require more than the transformation logic, so we need to make sure the entire pipeline is orchestrated.

One way to orchestrate the pipeline is to use orchestration tools such as AWS Step Functions, Airflow, etc.
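
To make the idea concrete, here is a minimal sketch of a sequential pipeline written as an AWS Step Functions state machine definition (Amazon States Language). The stage names and Lambda ARNs are hypothetical placeholders, and a real pipeline would add retries and error handling.

```python
import json


def linear_state_machine(tasks):
    """Chain (state_name, resource_arn) pairs into a sequential state machine."""
    states = {}
    for index, (name, resource) in enumerate(tasks):
        state = {"Type": "Task", "Resource": resource}
        if index + 1 < len(tasks):
            state["Next"] = tasks[index + 1][0]  # hand over to the following stage
        else:
            state["End"] = True  # the last stage terminates the execution
        states[name] = state
    return {"StartAt": tasks[0][0], "States": states}


# Hypothetical stages: load raw data, run the DBT models, validate the output.
pipeline = linear_state_machine([
    ("IngestOrders", "arn:aws:lambda:us-east-1:123456789012:function:ingest-orders"),
    ("RunDbtModels", "arn:aws:lambda:us-east-1:123456789012:function:run-dbt-models"),
    ("ValidateRevenue", "arn:aws:lambda:us-east-1:123456789012:function:validate-revenue"),
])
print(json.dumps(pipeline, indent=2))
```

The DBT run sits in the middle of the chain: extract and load happen before it, and validation or downstream consumers follow it.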

Here is one of the common designs for building an end-to-end pipeline in which DBT plays a critical role.