ETL

ETL stands for Extract, Transform, and Load. It is the process of collecting data from various sources into a large, central data repository called a data warehouse. ETL applies a set of business rules to clean, organize, and prepare raw data for data analytics, reporting, and machine learning. We can then define specific business intelligence requirements through data analytics.

Why ETL

  • Organizations store both structured and unstructured data from various sources. ETL tools provide different approaches for extracting this data.
  • Transformation logic can be defined according to business requirements.
  • Large amounts of analytical data can be stored in a data warehouse for later use by the business.
  • The business uses this analytical data to generate reports, define goals, understand its needs, perform data mining, mitigate risks, and so on.

Extract

The first stage in the ETL process is data extraction, in which we extract data from various sources such as transactional systems, spreadsheets, CSV files, and other flat files. Data is read from the different source systems. The files can be historical files, full loads, or incremental files, made available depending on the source system's request.
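
As a rough illustration of the extract stage, the sketch below reads a flat file and a transactional table using only the Python standard library. The `orders` table, its columns, the sample rows, and the helper names `extract_csv` and `extract_table` are all hypothetical. The `since` parameter is one assumed way to tell a full load from an incremental load, not a prescribed method.

```python
import csv
import io
import sqlite3


def extract_csv(text):
    """Read CSV text into a list of row dictionaries."""
    return list(csv.DictReader(io.StringIO(text)))


def extract_table(conn, since=None):
    """Read rows from the (hypothetical) orders table.

    since=None performs a full load; otherwise only rows updated
    after the given timestamp are read (an incremental load).
    """
    query = "SELECT id, amount, updated_at FROM orders"
    params = ()
    if since is not None:
        query += " WHERE updated_at > ?"
        params = (since,)
    query += " ORDER BY id"
    return conn.execute(query, params).fetchall()


# Hypothetical source system, built in memory for illustration only.
source = sqlite3.connect(":memory:")
source.execute("CREATE TABLE orders (id INTEGER, amount REAL, updated_at TEXT)")
source.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [(1, 10.0, "2024-01-01"), (2, 25.5, "2024-01-02"), (3, 7.25, "2024-01-03")],
)

csv_rows = extract_csv("id,name\n1,Alice\n2,Bob\n")
full_load = extract_table(source)
incremental = extract_table(source, since="2024-01-01")
```

Here `full_load` contains every row, while `incremental` contains only the rows changed after the given date. In a real pipeline the timestamp would typically come from the previous successful run.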

Transform

In the Transform stage, the extracted data is converted into a defined format suitable for loading into the data warehouse. The extracted data is cast to the required data types, cleansed, validated, and then loaded into the respective intermediate tables. This may involve cleaning and validating the data, converting data types, combining data from multiple sources, and creating new data fields. We also filter the data on certain attributes before loading it into the data warehouse.
The data cleansing part involves defining proper data types, removing null values, and so on. We may have to join multiple attributes into one, or split a single attribute into multiple attributes.
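
The transformations described above can be sketched in a few lines of plain Python. The field names (`full_name`, `city`, `country`, `amount`), the sample rows, and the rule "keep only positive amounts" are assumptions made for this example. Real business rules will differ.

```python
def transform(rows):
    """Clean, validate, and reshape raw rows (a list of dicts)."""
    result = []
    for row in rows:
        # Cleansing: skip rows whose amount is missing.
        if row.get("amount") in (None, ""):
            continue
        # Type conversion: the amount arrives as text.
        amount = float(row["amount"])
        # Filtering: keep only positive amounts (assumed business rule).
        if amount <= 0:
            continue
        # Split a single attribute into two attributes.
        first, _, last = row["full_name"].strip().partition(" ")
        result.append({
            "first_name": first,
            "last_name": last,
            # Join two attributes into one.
            "location": f"{row['city']}, {row['country']}",
            "amount": amount,
        })
    return result


# Hypothetical raw rows, as they might arrive from the extract stage.
raw = [
    {"full_name": "Alice Smith", "city": "Pune", "country": "India", "amount": "10.50"},
    {"full_name": "Bob Jones", "city": "Lyon", "country": "France", "amount": ""},
    {"full_name": "Carol White", "city": "Oslo", "country": "Norway", "amount": "-3"},
]
clean = transform(raw)
```

Only the first row survives: the second has a missing amount and the third fails the filter.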

Load

Once the data is transformed, it is loaded into the data warehouse. This step involves creating the physical design of the data warehouse, which includes tables such as facts, dimensions, and reporting tables, as well as views.
Load strategies are defined separately for each source system, because business requirements may vary from one source system to another. There is no generic approach; we have to work from the business requirements.
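
To make the physical design concrete, the following sketch loads a few records into a tiny star schema with one dimension table, one fact table, and one reporting view. It uses an in-memory SQLite database as a stand-in for a data warehouse. The table, view, and column names are hypothetical, and the insert-if-new strategy for the dimension is just one possible load strategy.

```python
import sqlite3


def load(conn, records):
    """Load transformed records into a small star schema."""
    conn.executescript("""
        CREATE TABLE IF NOT EXISTS dim_customer (
            customer_id INTEGER PRIMARY KEY,
            name TEXT UNIQUE
        );
        CREATE TABLE IF NOT EXISTS fact_sales (
            customer_id INTEGER REFERENCES dim_customer(customer_id),
            amount REAL
        );
        CREATE VIEW IF NOT EXISTS v_sales_by_customer AS
            SELECT d.name, SUM(f.amount) AS total
            FROM fact_sales f JOIN dim_customer d USING (customer_id)
            GROUP BY d.name;
    """)
    for rec in records:
        # Insert the dimension row only if this customer is new.
        conn.execute(
            "INSERT OR IGNORE INTO dim_customer (name) VALUES (?)", (rec["name"],)
        )
        (customer_id,) = conn.execute(
            "SELECT customer_id FROM dim_customer WHERE name = ?", (rec["name"],)
        ).fetchone()
        # Every record becomes one fact row.
        conn.execute(
            "INSERT INTO fact_sales VALUES (?, ?)", (customer_id, rec["amount"])
        )
    conn.commit()


# In-memory database standing in for the warehouse (illustration only).
warehouse = sqlite3.connect(":memory:")
load(warehouse, [
    {"name": "Alice", "amount": 10.5},
    {"name": "Bob", "amount": 4.0},
    {"name": "Alice", "amount": 2.5},
])
totals = warehouse.execute(
    "SELECT name, total FROM v_sales_by_customer ORDER BY name"
).fetchall()
```

Querying the view returns one total per customer, which is the kind of reporting object the load step prepares.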

ETL Tools

Commonly used ETL tools include SSIS (SQL Server Integration Services), ADF (Azure Data Factory), Sybase, Hevo, and Oracle Warehouse Builder.

Data Warehouse

Commonly used data warehouses include Azure Synapse Analytics (formerly Azure SQL Data Warehouse), Snowflake, Google BigQuery, and Amazon Redshift.

Advantages

  • The ETL process integrates data from multiple source systems, making it accessible and usable.
  • It helps ensure that data in the data warehouse is consistent, up to date, accurate, and complete.
  • It supports security by restricting data warehouse access to authorized users.
  • It provides the scalability to manage and analyze huge amounts of data.
  • It supports automation: daily or regular-interval jobs can be scheduled to run automatically, reducing manual intervention and effort.

Disadvantages

  • Implementation and maintenance costs for the ETL process are high.
  • It has limited flexibility for handling real-time data.
  • Design and implementation are complex and require expert resources.