The author is the CEO and Co-Founder of Eucloid. For any queries, reach out at: firstname.lastname@example.org
How and Why is ETL making way for ELT?
What is ETL?
ETL is the integration of a data source into a data warehouse: pulling information out (Extract), converting it into defined formats (Transform), and loading it into databases (Load). It was originally designed to work with relational databases, which historically dominated the market. It follows a batch-driven approach, developed and executed with an ETL tool.
Using an ETL tool, the source data is extracted to a staging area, where a transformation process organizes and cleans all data types. This transformation makes the now structured data compatible with the target data storage system, into which it is then loaded.
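The extract-transform-load sequence described above can be sketched in a few lines of Python, with an in-memory SQLite database standing in for the warehouse. The source records, field names, and cleanup rules here are illustrative assumptions, not any particular tool's behavior:

```python
import sqlite3

# Hypothetical source records, standing in for data extracted from a source system.
source_rows = [
    {"id": 1, "name": " Alice ", "signup": "2022/01/05"},
    {"id": 2, "name": "Bob", "signup": "2022-02-10"},
]

def transform(row):
    # Staging-area step: trim whitespace and normalise dates into the target format.
    return (row["id"], row["name"].strip(), row["signup"].replace("/", "-"))

warehouse = sqlite3.connect(":memory:")
warehouse.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT, signup TEXT)")
# Load: only transformed, schema-conformant rows reach the target system.
warehouse.executemany("INSERT INTO customers VALUES (?, ?, ?)",
                      [transform(r) for r in source_rows])
warehouse.commit()
```

Note that the raw source rows never reach the warehouse; only their cleaned form does, which is exactly the property the limitations section below turns on.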
A few well-known ETL tools are Informatica, Talend, Microsoft SSIS, and Google Dataflow.
Uses of ETL
ETL synchronizes several data environments and migrates data from legacy systems. It is primarily used by organizations that need to synchronize data from several sources, for example when an organization merges ventures that share consumers, suppliers, and partners. Historically, this data was stored in separate data repositories and formatted differently. With ETL, the data is transformed into a unified format before being loaded into the target location.
ETL also finds usage in organizations that need to migrate and update their data from legacy systems. Here, ETL is required to transform the data into a compatible format with the new structure of the target database.
Limitations of ETL
With all its utilities, ETL has a major disadvantage as well: it takes longer to implement because of the many steps in the transformation stage. The upside is cleaner data, i.e., data that is well polished and free from irregularities. The process is therefore best suited to smaller target data repositories that require less frequent updating, where the time cost stays manageable.
Moreover, the world of data is constantly evolving with the emergence of newer datasets including sensor data, video feeds, mobile geolocation data, product usage data, social media data, and log files. This change in data volume, type, and incoming speed requires a platform that can efficiently process this high volume in its context and transform it at a real-time streaming rate. ETL focuses on relational processing, and as the data grows over time, its performance starts deteriorating. Furthermore, it uses the Schema-on-Write approach that has dominated database architectures over the years: the schema is defined first, and the data is then transformed and loaded to fit it. This means you must know beforehand how you are going to use the data.
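The Schema-on-Write constraint can be seen in a small sketch, again using SQLite purely as a stand-in, with invented table and field names: the schema is fixed before any data arrives, and a record carrying a field the schema never anticipated cannot be loaded as-is.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Schema-on-Write: the table structure must be fixed before any data is loaded.
conn.execute("CREATE TABLE events (user TEXT, action TEXT)")
conn.execute("INSERT INTO events VALUES ('a1', 'click')")

# A later record carries a field the upfront schema never anticipated.
try:
    conn.execute("INSERT INTO events (user, action, device) VALUES ('b2', 'view', 'mobile')")
    rejected = False
except sqlite3.OperationalError:
    rejected = True  # the predefined schema has no 'device' column
```

Accommodating the new field would mean migrating the schema first, which is exactly the upfront planning burden described above.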
The major disadvantage of ETL is that one cannot update back-dated data for derived or transformed fields. This happens because no copy of the raw source data is loaded into the warehouse, which limits data accessibility.
For example, until 2020, a certain company used the number of hours spent in the office and the hours engaged in group activities as the factors to determine an indicator for employee productivity. After the pandemic, this definition was substantially changed to include new factors such as work-from-home hours and mental health leaves, among others. It is simple to calculate productivity in the post-COVID era using this new equation, but if the organization wants to apply the new equation to the pre-COVID era, it cannot do so unless it goes back to the previous source systems that contained the raw data; and those source systems may not be under the control of the team calculating productivity.
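A rough sketch of this situation, with entirely hypothetical field names and weights: a warehouse that stored only the derived score (ETL-style) can never restate 2019 figures under the new definition, while a warehouse that kept the raw rows can recompute them trivially.

```python
# Illustrative pre-COVID raw records; fields and weights are invented.
raw_2019 = [
    {"office_hours": 40, "group_hours": 8, "wfh_hours": 0, "health_leaves": 1},
]

def productivity_old(r):
    # Pre-2020 definition: office time plus group activities.
    return r["office_hours"] + 2 * r["group_hours"]

def productivity_new(r):
    # Revised definition: also credits work-from-home, deducts for health leaves.
    return r["office_hours"] + 2 * r["group_hours"] + r["wfh_hours"] - 5 * r["health_leaves"]

# ETL-style warehouse kept only the derived score; the raw inputs are gone,
# so these figures can never be restated under the new formula.
etl_warehouse = [productivity_old(r) for r in raw_2019]

# ELT-style warehouse kept the raw rows, so history can be recomputed.
elt_restated = [productivity_new(r) for r in raw_2019]
```

The stored score (56 here) and the restated score (51) differ, and only the raw-data copy makes the second number obtainable at all.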
This leads to very restricted usage and requires a lot of pre-planning of the data model, which reduces traceability and makes troubleshooting difficult.
The emergence of an alternative
With the advent of semi-structured, poly-structured, and unstructured data, the Schema-on-Read approach has started gaining precedence. It loads data as it is and gives value right away. This has transformed the 'Extract-Transform-Load' approach into an 'Extract-Load-Transform' approach. The Hadoop Distributed File System is the classical example of the Schema-on-Read approach. With this, you read and store data in its original form. So, instead of loading the data into a predefined schema once you receive it, you defer the modeling to the moment you need it. Thus, ELT starts a new branch of data handling.
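A minimal Schema-on-Read sketch, assuming JSON-lines events as the raw store (the sample records and field names are invented): data is kept exactly as received, and the schema is imposed only at read time.

```python
import json

# Raw events stored untouched, as on a distributed file system; a record
# with a missing field is perfectly acceptable at load time.
raw_store = [
    '{"user": "a1", "event": "click", "ts": 1660000000}',
    '{"user": "b2", "event": "view"}',
]

def read_with_schema(lines):
    # The schema is imposed here, at read time, not when the data was stored.
    for line in lines:
        rec = json.loads(line)
        yield (rec["user"], rec["event"], rec.get("ts"))  # tolerate absent fields

rows = list(read_with_schema(raw_store))
```

Because modeling is deferred, a different consumer could read the same raw store with a different schema tomorrow, without reloading anything.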
ELT as a successor of ETL
ELT extracts data from one or multiple remote sources and then loads it into the target data warehouse without any further formatting. This unstructured, extracted data is made available to business intelligence systems, with no need for a data staging area. In an ELT process, data transformation happens within the target database. The process is simplified and does not require 'keys' (identifiers) to transfer and use the data. ELT typically requires only periodic updates of information rather than real-time updates.
It leverages the data warehouse to perform basic data transformations, such as validation or removal of duplicate data, over large amounts of raw data. This approach enables faster implementation than the ETL process: because the transformation occurs after the load step, it avoids the slowdown that transformation can otherwise impose on data migration.
As an entire copy of the data source is replicated, ELT offers data flexibility. Having raw data in a data platform increases the chances to explore it and thus, increases data accessibility. It also enables enhanced traceability and thus makes it easy to troubleshoot.
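The load-then-transform flow can be sketched as follows, with an in-memory SQLite database standing in for the warehouse and invented table names: raw rows are loaded untouched, and validation and de-duplication happen inside the target database itself.

```python
import sqlite3

warehouse = sqlite3.connect(":memory:")
# Load: raw rows land in the warehouse with no staging-area cleanup,
# duplicates, stray strings, and cancelled records included.
warehouse.execute("CREATE TABLE raw_orders (id, amount, status)")
warehouse.executemany(
    "INSERT INTO raw_orders VALUES (?, ?, ?)",
    [(1, "10.50", "ok"), (1, "10.50", "ok"), (2, "7.00", "cancelled")],
)
# Transform: validation and de-duplication run inside the target database.
warehouse.execute("""
    CREATE TABLE orders AS
    SELECT DISTINCT id, CAST(amount AS REAL) AS amount
    FROM raw_orders
    WHERE status = 'ok'
""")
```

Because `raw_orders` is retained alongside the derived `orders` table, the transformation can be rerun with different rules at any time, which is the traceability advantage described above.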
Some popular ELT tools in the market are Airflow, Fivetran, Hevo Data, Matillion, etc.
Biggest Challenge to ELT
ELT is a newer process that comes with its own set of challenges. Since it involves loading data in full, ELT can quickly overwhelm local processing and storage capacity, requiring expensive hardware upgrades and scheduled downtime while fixes are deployed. Organizations may be forced to install additional processing power that remains mostly idle once ELT has been implemented. However, this challenge is well addressed by the evolution of cloud architecture.
The cloud provides scalability through its infrastructure and hosted services like iPaaS and SaaS, which give organizations the ability to expand their resources. It provides the compute time and storage space necessary for even massive data transformation tasks. ELT tools leverage open-source cloud platforms to collaboratively push improvements, security, and compliance across the enterprise.
Earlier, with ETL, expanded capability meant expanded costs. Increasing processing and storage resources was a one-way expense, requiring big upfront investments in hardware. Organizations had to pay for the most power they would ever require, while day-to-day operations used only a fraction of that capability. In a cloud-based ELT setup, you pay as you use: computing and storage costs drop to a minimum when the environment is under less pressure. Averaged annually, this results in a much lower total cost of ownership than ETL.
The gradual shift from ETL to ELT is just one example of how the traditional data pipelining environment is changing. To build a better ELT for the future, we need to focus on the data streams rather than the tools. Real-time latency, version control, schema evolution, continuous integration, and deployment are the key areas to improve.
ELT in the coming days will be equipped with a data management framework: a comprehensive and hybrid approach for managing big data. It will encompass not only data integration but also data governance, data quality, and data security - making the entire process of data handling scalable, versatile, high-performance, and flexible.
Looking for Data solutions for your business? We can help. Reach out at Contact@eucloid.com
Posted on : August 26, 2022
Category : Data Engineering