The author is a CEO and Co-Founder in Eucloid. For any queries, reach out to us at: contact@eucloid.com
LinkedInA²G Blueprint: Architecting (Athena-Airflow-GlueCatalog) for eTL
Why?
Airflow is an beautiful orchestrator that can integrate seamlessly with wide variety of connectors in the market, while MWAA (Amazon Managed Workflows for Apache Airflow ) makes it easier to setup and operate end-to-end data pipelines in the cloud at scale. With Most companies in the market are driving towards multi-cloud architecture , it is important to choose an orchestrator service that can talk to pretty much every cloud provider (aws, azure, gcp, oracle, etc. ). So we chose airflow as our frontliner.
Athena is an interactive query service that makes it easy to analyze data in Amazon S3 using standard SQL. The core of athena is prestodb which was designed and developed by facebook then became community driven under apache license. The beauty of athena is it supports wide variety of standard data formats, including CSV, JSON, ORC, Avro, and Parquet and many compression types. Athena is ideal for interactive querying and can also handle complex analysis, including large joins, window functions, and arrays. In this article Athena will be the heavy lifter who takes care of all transformations.
One important reason for choosing Athena over glue spark is the ease of code and maintenance were data engineer can built the transformation in ansi sql which can be maintained by analytical engineer or data analysts.
With Athena sharing Glue catalog , other services of AWS such as redshift spectrum/ EMR presto/EMR hive/ Glue spark etc can talk to transformed s3 dataset.
How?
I will not talk about Mwaa setup and role , policy creation in this article, it is pretty simple and can refer to aws article https://docs.aws.amazon.com/mwaa/latest/userguide/create-environment.html.
In this article lets clean and dedupe the s3 json data in our transformation layer. The source is uncleaned , duplicate, complex nested json data and dumped at daily folders in s3.
S3 Source Folder structure:
📁 day=20220322/ | Folder |
📁 day=20220323/ | Folder |
📁 day=20220324/ | Folder |
📁 day=20220325/ | Folder |
📁 day=20220326/ | Folder |
On top of the s3 data with day level partition , create an Athena table with json serde
CREATE EXTERNAL TABLE `schema.table1`(
`col1` string,
`col2` timestamp,
`col3` map❮string,string❯ COMMENT ‘from deserializer’,
`col4` array❮string❯ COMMENT ‘from deserializer’
)
PARTITIONED BY (
`day` string)
ROW FORMAT SERDE
‘org.apache.hive.hcatalog.data.JsonSerDe’
WITH SERDEPROPERTIES (
‘case.insensitive’=’true’,
‘paths’=’col1,col2,col3,col4')
STORED AS INPUTFORMAT
‘org.apache.hadoop.mapred.TextInputFormat’
OUTPUTFORMAT
‘org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat’
LOCATION
‘s3://bucket_name/prefix_path/’
TBLPROPERTIES (
‘classification’=’json’)
Source table is all set , please refer to https://docs.aws.amazon.com/athena/latest/ug/data-types.html for better understanding on map, struct and array datatypes.
Now lets talk about orchestration of our transformation using airflow, awsathenaoperator is used to fire athena transformation queries, where the cleansed , deduped and flattened data is dumped to transformed parquet table ,
from airflow import DAG
from airflow.providers.amazon.aws.operators.athena import AWSAthenaOperator
with DAG(
‘ATHENA-AIRFLOW-ETL’,
) as dag:
Athena_task1 = AWSAthenaOperator(
task_id=’task1',
query=”””
insert into schema1.table2 select * from schema1.table1
“””,
database=”{schema}”,
output_location=’s3://bucket_name/prefix_folder_path’,
sleep_time=30,
max_tries=None,
aws_conn_id=AWS_CONN_ID,
dag=dag
)
Limitations & points to consider:
When you drop an external table, the underlying data remains intact because all tables in Athena are EXTERNAL. So we should use s3operator in airflow to delete underlying s3 files.
The max dynamic partition limit for athena is 100 , refer https://docs.aws.amazon.com/athena/latest/ug/ctas-insert-into.html
Stored procedures are not supported.
Refer to aws site for other limitations and be vary of https://docs.aws.amazon.com/athena/latest/ug/other-notable-limitations.html
Cost Benefits with this approach is pretty significant when compared to other enterprise etl tools in the market . mwaa is highly scalable with min workers from 1 till max 20. and athena is a pay as you go solution which charge just about 5$ for 1TB data scanned.
Looking for Data solutions for your business? We can help. Reach out at Contact@eucloid.com
Posted on : April 22, 2022
Category : Data Engineering