By Kim Furman, Synthesis Marketing Manager
Your steaming cup of coffee is warming your hand. You slowly sit down to start your workday and then suddenly your neighbour’s piercing car alarm goes off, again and again. Your good mood, concentration and the heat of your coffee seem to evaporate.
Why does this throw you off while the sound of classical music soothes you? Noise and music are both sound waves. What is the difference?
Music is ordered sound, whereas noise is disordered sound. The same applies to data, and it is the reason you are likely hearing the terms ETL and ELT with increasing frequency.
Today’s problem is that there is too much data to handle and make sense of – noise. Data with insight drives competitive edge and customer satisfaction, and achieving this requires data centralisation: a single source of truth, bringing data together to create harmony. Music over noise.
Suddenly, the whole business is singing from the same source sheet and that is a competitive edge many companies are lacking.
I sat down with Archana Arakkal, Lead Machine Learning Engineer at Synthesis Software Technologies, to get a deeper understanding of ETL and ELT. Within five minutes, I had a comprehensive crash course and so can you.
Quick summary:
This is a process where data is extracted from different sources, transformed into usable resources and loaded onto a single system. ETL tools break down data silos and make it easier for data scientists to access and analyse data and turn it into business intelligence.
Now, let’s get technical. What is ETL?
Loosely defined, ETL stands for Extract, Transform and Load.
ETL consists of the following lifecycles before data can be consumed by its end user. This could be a data analyst, data scientist or business owner – basically, anyone who needs to understand the data in a safe and secure manner:
- The first step “Extract” is the process of aggregating data from its original source. This original source could range from hard-copy documents and Excel spreadsheets all the way to a database cluster.
- The second step “Transform” consists of processing the data from its original source into a format that can be consumed for further analysis. This could entail anything from standard data-cleaning techniques all the way to creating combined datasets that serve a specific use case with the data at hand.
- The last step “Load” requires the data to be stored in a database within a consumable infrastructure layer – this means the database adheres to all necessary security standards, ensuring the data is used under the correct policies. The transformed/pre-processed data then needs to be migrated or moved into this target database.
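The three steps above can be sketched in a few lines of Python using only the standard library. The file name, column names and SQLite target are illustrative assumptions, not a prescribed stack:

```python
import csv
import sqlite3

# Extract: read rows from a source file (a hypothetical "sales.csv")
def extract(path):
    with open(path, newline="") as f:
        return list(csv.DictReader(f))

# Transform: basic cleaning - trim whitespace, normalise case, drop empty rows
def transform(rows):
    cleaned = []
    for row in rows:
        if not row.get("product"):
            continue
        cleaned.append({
            "product": row["product"].strip().lower(),
            "amount": float(row["amount"]),
        })
    return cleaned

# Load: write the cleaned rows into a target database (SQLite stands in
# for the organisation's warehouse here)
def load(rows, db_path):
    con = sqlite3.connect(db_path)
    con.execute("CREATE TABLE IF NOT EXISTS sales (product TEXT, amount REAL)")
    con.executemany(
        "INSERT INTO sales (product, amount) VALUES (:product, :amount)", rows
    )
    con.commit()
    con.close()
```

In a real pipeline each step would typically be scheduled and monitored by an orchestration tool, but the shape stays the same: extract, then transform, then load.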
What is ELT?
The letters in ELT stand for the same as ETL; however, ELT swaps “Load” and “Transform.” The reason is that the data is loaded into the target system first, and only then is it transformed.
The purpose behind this approach is that in many cases data sets can’t necessarily be standardised with a single transformation, and decision makers often need to look at the raw data sources before understanding which transformations their exact dataset requires.
This allows for more flexibility in loosely defined use cases that need further exploratory data analysis before the final use case is defined.
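An ELT flow can be sketched the same way, with the order reversed: raw rows are landed untouched, and the transformation is expressed afterwards as a query inside the store. SQLite stands in for the warehouse below, and the table names are made up for illustration:

```python
import sqlite3

# ELT sketch: land the raw payloads first, transform later inside the store.
def load_raw(rows, con):
    con.execute("CREATE TABLE IF NOT EXISTS raw_events (payload TEXT)")
    con.executemany(
        "INSERT INTO raw_events (payload) VALUES (?)", [(r,) for r in rows]
    )
    con.commit()

def transform_in_store(con):
    # Transformation happens after loading, as SQL over the raw table -
    # analysts can inspect raw_events first and decide what cleaning is needed
    con.execute("""
        CREATE TABLE IF NOT EXISTS events_clean AS
        SELECT TRIM(LOWER(payload)) AS payload
        FROM raw_events
        WHERE payload IS NOT NULL AND TRIM(payload) <> ''
    """)
    con.commit()
```

Because the raw table is preserved, a new transformation can be written later without re-extracting anything from the original sources.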
When to use one over the other?
A key differentiator between an ETL and an ELT process is that in an ELT process the central datastore is often a data lake, reservoir or data stream, ensuring the rawest form of data is transported in near real time from the data sources.
Why use them?
ETLs should be used in situations where there is a data source or multiple data sources that need to be centrally manipulated for the final end user to consume.
This would typically suit teams whose business units are not interested in managing data and would prefer to consume clean data for reporting purposes.
The central data team would be responsible for delivering high-quality data, with all the necessary data manipulations applied prior to final consumption.
Should every company be preparing to use ETL or ELT?
Yes, they should. ETLs or ELTs are useful in any situation that requires data analysis, data storage, data migration, data engineering, data strategy, data consolidation and so on.
How hard is it to implement?
The complexity would differ based on the organisation’s existing standards and systems. In the event that no data strategy has been implemented and no central data team exists, this could be a lengthy process.
If a team already exists with an adequate data strategy and direction, it would depend on the quality of the data as well as the complexity of the data combinations that are required.
What does ETL or ELT solve for?
All data projects that require some form of data analysis require an ETL process to further prepare the data.
The key point an ETL process solves for is the ability to work with data in a reliable data store or warehouse where unified data policies and standards have been adhered to – this improves the quality of data inferences and extrapolation by ensuring that data consolidation is uniform across the organisation.
Another way of representing ETL processes is as a refinery applied across multiple data sources, consolidating all the data into a single centralised location that personnel can access with ease.
While ETL processes may provide a clean overview of data from the very start, ELT processes provide data consumers with the opportunity to consciously decide which parts of their data sets are useful or not.
How do ETLs and ELTs improve data and information within an organisation?
ETLs improve information across the organisation in the following ways:
- A centralised store allows for easier access to further consolidate data and perform richer analyses across multiple sources (with the comfort that all data is of a high standard).
- Data security and data integrity are maintained because ETL processes cleanse the data that will be consumed by end users.
- ETL processes can be automated, which saves time.
- ETL processes further simplify transforming complex data sets by ensuring that data formats conform to standards that are agreed to by the organisation.
- ETL processes ensure a drastic reduction in human error, since an ETL process can be used as a means to validate data before the end user consumes it for analytics purposes.
- Data quality can be held to higher standards, since automating processes and reducing errors significantly improves the output of data.
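The validation point above can be made concrete with a small pre-load check of the kind an ETL pipeline might run. The required fields and rules here are illustrative assumptions, not a standard:

```python
# Hypothetical per-row validation run during the "Transform" step,
# before any data reaches the target store.
REQUIRED = {"product", "amount"}

def validate(row):
    """Return a list of problems found in a single row (empty means valid)."""
    errors = []
    missing = REQUIRED - row.keys()
    if missing:
        errors.append(f"missing fields: {sorted(missing)}")
    try:
        if float(row.get("amount", "nan")) < 0:
            errors.append("amount must be non-negative")
    except ValueError:
        errors.append("amount is not numeric")
    return errors
```

Rows that fail validation can be routed to a quarantine table for review instead of silently corrupting downstream reports.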
Bottom line, what value will this provide?
An example of how an ETL process could improve the overall efficiency of the business is the following:
Suppose there are close to 1,000 different sources originating from different locations in different formats, and each format has multiple permutations of naming conventions for the product types that need to be standardised.
Utilising an ETL process ensures that the “Transform” step standardises the names of all the product types in an automated manner, saving the data analyst down the line from having to consolidate 1,000 sources manually. The benefit of a faster turnaround and higher accuracy is that decisions can be made at a faster pace in an ever-changing market that requires businesses to act fast.
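That standardising “Transform” step can be sketched as a simple lookup. The product-name variants and canonical names below are made up for illustration; a real pipeline would build the mapping from the organisation’s own naming conventions:

```python
# Hypothetical mapping from known naming variants to one canonical name.
CANONICAL = {
    "sav acct": "savings_account",
    "savings": "savings_account",
    "savings account": "savings_account",
    "chq": "cheque_account",
    "cheque acc": "cheque_account",
}

def standardise(name):
    """Map a raw product name onto its canonical form."""
    key = name.strip().lower()
    # Fall back to a normalised form of the raw name if no mapping exists
    return CANONICAL.get(key, key.replace(" ", "_"))
```

Applied across all 1,000 sources, every variant of the same product type collapses to a single name, so downstream reports can group and compare products without manual reconciliation.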