3.83 out of 5 (204 reviews on Udemy)

ETL Framework for Data Warehouse Environments

The non-functional ETL requirements
Instructor:
Sid Inf
1,854 students enrolled
English [Auto-generated]
This course provides a high-level approach to implementing an ETL framework in typical Data Warehouse environments. The approach suits a new application that needs to design and implement a highly reusable ETL solution with data loading, error handling, audit handling, job scheduling, and restartability features. The framework helps reduce delivery time and improve quality through high reusability and consistent design standards.
  • Metadata Categories: the commonly used types of metadata in a real project and how they differ from the Business and Technical viewpoints.
  • ETL Framework process flow: the process flow and the different activities that must be addressed during the ETL framework implementation, from file (source data) validations through Exception handling and Audit Control.
  • Data Sourcing: the different types of Data Sourcing possible in a Data Warehouse environment, the mechanisms through which data sourcing can happen (Scheduled events, Change Data Capture, Pub/Sub, Web services/API connectivity), and their classification.
  • Scripts for Data Sourcing: the commonly required/used scripts, the validations to be performed, and the functionality to be included in the scripts (shell/bat).
  • File Validation: the file validation process, post-validation steps, and file validation failure notifications.
  • Staging Layer: the need for a staging layer, Reference Data, audit columns for Staging and Reference tables, data retention in the staging layer, partitions, and DB standards.
  • Business Validation Layer: the different situations possible during data processing, the concurrent workflow process, and partitions in the staging and business validation layers.
  • Data Warehouse Layer: Dimension Load, Fact Load types/process, Fact partitions, Fact Summary Load, and Source File Management/Archival.
  • Exception Handling/Error Handling: the data model for exception handling, Error Category, Error Code, and the different possible solutions for exception handling.
  • Sample Project Setup: steps to download the project setup, executing the DDLs for the metadata, the project explanation, and importing the code base into Informatica.
  • Extending the Operational Metadata's Data Model for exception handling with additional supporting tables.
  • Error Handling Data Model: the framework for the data model design.
  • Using PMREP tables for exception handling.
  • Audit, Balance and Control: the need, the different technology components involved, the table structure and data model, and a workflow example.
  • Configuration Management: software change management; identification, tracking, and management of all the assets/objects of a project; one of the standard project management processes; the formal way of managing software changes; and the process for deploying code from development to testing to production.

This course provides a high-level approach to implementing an ETL framework in any typical Data Warehouse environment. The practical approaches can be used for a new application that needs to design and implement a highly reusable ETL solution with different data loading strategies, error/exception handling, audit balance and control handling, a bit of job scheduling, and restartability features, and they can also be applied to any existing ETL implementation. For existing implementations, the framework needs to be embedded into the existing environment, jobs, and business requirements, and it might even mean redesigning the mappings/mapplets and workflows (ETL jobs) from scratch, which is often a good decision given the benefits of high reusability and improved design standards.

This course is a combination of standard and practical approaches to designing and implementing a complete ETL solution, detailing the guidelines, standards, developer/architect checklists, and the benefits of reusable code. It also teaches the best practices and standards to follow when implementing an ETL solution.

Though this course covers ETL design principles and solutions based on Informatica 10.x and Oracle 11g, they can be applied to any ETL tool on the market, such as IBM DataStage, Pentaho, Talend, Ab Initio, etc.

Multiple reusable code bundles from the marketplace, checklists, and the material required to get started with basic UNIX commands and shell scripting will be provided.

Getting Started

1
Introduction

In this lecture, we start with the need for an ETL framework and why it is important for any Data Warehouse project.

2
What are we getting into?

A quick run-through of what will be covered in this course at a very high level.

3
Quick note on the commonly asked questions.

Answers to the commonly asked questions by the participants from other courses (Data Warehouse Concepts, Informatica Developer, Informatica Administrator).

4
The Architecture which will be used for this course

In this lecture, we will talk about the architecture we will use for this course.

5
Different stages of the Architecture

Let's take a look at the different stages of the architecture in this lecture.

Metadata Categories

1
Business Metadata

Business metadata includes definitions of data files and attributes in business terms. It may also contain definitions of business rules that apply to these attributes, data owners and stewards, data quality metrics, and similar information that helps business users to navigate the information of the enterprise.

2
Technical Metadata

Technical metadata is the most common form of metadata. This type of metadata is created and used by the tools and applications that create, manage, and use data. 

3
Operational Metadata

Operational metadata contains information that is available in operational systems and run-time environments. It may contain data file sizes, the date and time of the last load, updates and backups, and the names of the operational procedures and scripts used to create, update, and restore data.

ETL Framework - Process Flow

1
ETL Framework - The Process Flow

This section covers the process flow and the different activities that must be addressed during the ETL framework implementation, from file validations through Exception handling and Audit Control.

Data Sourcing

1
What is Data Sourcing?

Sourcing, extracting, or pulling the data from source systems in the form of flat files, XML files, relational databases, message queues, web service calls, APIs, cloud-based connectivity, application connections, etc. is called Data Sourcing. In this lecture, we understand the Data Sourcing process.

2
What are the different events of Data Sourcing?

The different methods/events by which Data Sourcing can be done are the following:

  • Trigger files --> This event is discussed in detail in this lecture.
  • Based on Schedule
  • Change Data Capture (CDC)
  • Pub/Sub (Publish and Subscribe)
  • Web Services and APIs
3
Scheduled Events

The different methods/events by which Data Sourcing can be done are the following:

  • Trigger files 
  • Based on Schedule --> This event is discussed in detail in this lecture.
  • Change Data Capture (CDC)
  • Pub/Sub (Publish and Subscribe)
  • Web Services and APIs
4
CDC - Change Data Capture Events

The different methods/events by which Data Sourcing can be done are the following:

  • Trigger files
  • Based on Schedule
  • Change Data Capture (CDC) --> This event is discussed in detail in this lecture.
  • Pub/Sub (Publish and Subscribe)
  • Web Services and APIs
5
Pub - Sub Events

The different methods/events by which Data Sourcing can be done are the following:

  • Trigger files
  • Based on Schedule
  • Change Data Capture (CDC)
  • Pub/Sub (Publish and Subscribe) --> This event is discussed in detail in this lecture.
  • Web Services and APIs
6
WebServices/API Events

The different methods/events by which Data Sourcing can be done are the following:

  • Trigger files
  • Based on Schedule
  • Change Data Capture (CDC)
  • Pub/Sub (Publish and Subscribe)
  • Web Services and APIs --> This event is discussed in detail in this lecture.


Data Sourcing - Classification

1
Push and Pull

In the earlier section, we saw the different Data Sourcing events/methods. In this section, we will learn the different Data Sourcing mechanisms and their classification as push and pull operations.

2
Architectural Classification

The architectural classification for the Push/Pull mechanism is discussed in this lecture. 

Script Requirements for Data Sourcing

1
What functionality should be part of the Scripts for Data Sourcing?

In this lecture, the different logical areas to be covered as part of the code are discussed. 

Functionality to be included

  • Access rights to be verified (can be a one-time activity)
  • File Polling
  • File Stability Check
  • FTP/SFTP
  • Log file output
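
To show how these script outcomes can tie back to the framework's metadata, here is a minimal Oracle SQL sketch of a file arrival log table that the sourcing scripts could populate. The table and column names are illustrative assumptions, not the course's actual model.

  -- Hypothetical table the sourcing scripts populate after each polling/
  -- transfer attempt; all names and sizes are illustrative only.
  CREATE TABLE file_arrival_log (
      arrival_id           NUMBER        NOT NULL,  -- surrogate key
      source_system        VARCHAR2(50)  NOT NULL,  -- system sending the feed
      file_name            VARCHAR2(255) NOT NULL,  -- file as received
      file_size_bytes      NUMBER,                  -- captured during the stability check
      trailer_record_count NUMBER,                  -- count declared in the trailer record
      transfer_mode        VARCHAR2(10),            -- e.g. 'FTP' or 'SFTP'
      status               VARCHAR2(20)  NOT NULL,  -- e.g. 'RECEIVED', 'FAILED'
      log_file_path        VARCHAR2(500),           -- location of the script's log output
      arrival_ts           TIMESTAMP     DEFAULT SYSTIMESTAMP,
      CONSTRAINT pk_file_arrival_log PRIMARY KEY (arrival_id)
  );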

File Validation

1
File Validation Process

Standard file validations:

  • Physical availability
  • New/non-duplicate file
  • Not a 0-byte file
  • Trailer record count
  • Record length
  • File name
  • Orphan records
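
While the course implements these checks in shell scripts, the trailer record count check can equally be expressed in SQL once the file is staged. A minimal sketch, assuming a hypothetical staging table STG_SALES keyed by file name and the FILE_ARRIVAL_LOG table sketched earlier:

  -- Compare the row count actually staged with the count declared in the
  -- file's trailer record; a mismatch fails the validation step.
  SELECT a.file_name,
         a.trailer_record_count,
         COUNT(s.file_name) AS staged_row_count,
         CASE WHEN COUNT(s.file_name) = a.trailer_record_count
              THEN 'PASS' ELSE 'FAIL'
         END AS validation_status
  FROM   file_arrival_log a
  LEFT JOIN stg_sales s
         ON s.file_name = a.file_name
  WHERE  a.file_name = :p_file_name
  GROUP BY a.file_name, a.trailer_record_count;
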
2
Post File Validation Steps

This lecture covers the steps performed once the standard file validations listed above have passed.

3
File Validation Failure Notifications

On validation failure:

  • Stakeholder notification
  • Log file output

The Staging Layer

1
What is the need for a Staging layer?

In this lecture, we talk about the need for the staging layer and how it helps Data Warehouse projects perform data validations and business validations.

2
What is Reference Data, and should it be stored in the Staging Layer?

Should Reference Data be stored in the Staging layer? Let's find out.

3
Different real-world examples of Reference Data setup and usage.

Different examples of the Reference data are discussed in this lecture.

4
Are Audit columns required for the Staging and Reference Data?

In this lecture, we will see if the Staging layer should have the Audit attributes included or not.

5
How many days of data should be stored in the staging area?

In this lecture, we talk about the number of days data should be kept in the Staging layer and how this is a business-driven decision.

6
Do we need to set up the Partitions based on the Data Retention?

In this lecture, we talk about the best practices for creating partitions in the Staging layer.
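
As an illustrative sketch (the table name and the retention window are assumptions, not from the course), an Oracle 11g staging table can be interval-partitioned by load date, so that data retention becomes a cheap partition drop instead of a bulk DELETE:

  -- Daily interval partitions on the load date; Oracle creates new
  -- partitions automatically as data arrives.
  CREATE TABLE stg_sales (
      file_name  VARCHAR2(255),
      load_date  DATE NOT NULL
      -- ... business columns ...
  )
  PARTITION BY RANGE (load_date)
  INTERVAL (NUMTODSINTERVAL(1, 'DAY'))
  (PARTITION p_initial VALUES LESS THAN (DATE '2020-01-01'));

  -- Retention: a purge script drops partitions older than the agreed
  -- window; the literal date below would be derived at run time.
  ALTER TABLE stg_sales DROP PARTITION FOR (DATE '2020-01-05');

Dropping a partition avoids the redo/undo cost of deleting rows one by one, which is why tying the partition scheme to the retention policy pays off.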

7
What kind of DB Standards should be followed for the Staging area setup?

We will quickly review the basic database standards to be followed for the Staging layer.

Business Validation Layer

1
What is the Business Validation Layer?

In this lecture, we talk about the Business Validation Layer. All the Business Critical Aggregations and Calculations happen in this layer. 

2
Different situations possible during the data processing at this layer

In this lecture, we will talk about the different situations possible for the file loads.

  • Single source, one location, one landing area
  • Secured database connectivity
  • Secured web service connectivity
  • Multiple sources, different locations, same format
  • Multiple sources, different locations, same format with different landing areas

3
Indirect Data Load Process - Using Informatica

In Informatica PowerCenter, there are two ways to load flat files into a Target definition:

Direct and Indirect.

We set the Source File type to ‘Direct’ when loading a single flat file. The ‘Indirect’ Source File type is used when you need to load multiple flat files through one mapping, driven by a file list.

4
Concurrent Workflow Process

In large data integration projects, it is quite common to source data from multiple systems, sources, regions, etc. As the number of data sources increases, the ETL load time also increases because of the growing data volume. One way to reduce the load time is to run different ETL processes in parallel. Informatica PowerCenter's capability to run workflows concurrently can be used in such scenarios to reduce the ETL load time.

5
Partitions in Stage Layer and Business Validation Layer

Are partitions really required for Staging and Business validation layer? Let's find out in this lecture. 

Data Warehouse Layer

1
Dimension Load

Once the Stage load is complete, the Dimension load is triggered. For each business context, there would be a separate dimension table.
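
The course implements the load as Informatica mappings; as a language-neutral illustration of the underlying logic, here is a minimal Type 1 (overwrite) dimension load sketch in Oracle SQL, with hypothetical table and column names:

  -- Type 1 dimension load: insert new business keys, overwrite
  -- changed attributes for existing ones.
  MERGE INTO dim_customer d
  USING (SELECT DISTINCT customer_no, customer_name, city
         FROM   stg_customer) s
  ON (d.customer_no = s.customer_no)
  WHEN MATCHED THEN UPDATE
       SET d.customer_name = s.customer_name,
           d.city          = s.city
  WHEN NOT MATCHED THEN INSERT
       (d.customer_key, d.customer_no, d.customer_name, d.city)
       VALUES (seq_dim_customer.NEXTVAL, s.customer_no,
               s.customer_name, s.city);

A Type 2 (history-keeping) load would instead expire the current row and insert a new version rather than updating in place.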

2
Fact Load Process
3
Fact Partitions
4
Fact Summary Load
5
Source File Management/Archival

Source File Management or Archival is another important requirement which is usually ignored during the inception of the project or the requirement scope. This is as important as the data processing requirements. In this lecture, we understand how this will be done. 

Exception Handling/Error Handling

1
Data Model for Exception Handling

In this lecture, we go through the data model of the exception table and the related supporting tables.
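
For orientation before the walkthrough, here is a minimal sketch of what an exception table can look like; the actual model presented in the lectures differs, and all names below are illustrative.

  -- Illustrative exception/error log table, with columns pointing at the
  -- supporting lookup tables (error category, error code).
  CREATE TABLE error_log (
      error_log_id   NUMBER        NOT NULL,
      run_id         NUMBER        NOT NULL,   -- ties the error to a job run
      error_code     VARCHAR2(20)  NOT NULL,   -- FK to the error code table
      error_category VARCHAR2(30),             -- e.g. 'DATA', 'SYSTEM'
      source_table   VARCHAR2(128),            -- where the bad record came from
      source_row_key VARCHAR2(200),            -- natural key of the bad record
      error_message  VARCHAR2(4000),
      created_ts     TIMESTAMP     DEFAULT SYSTIMESTAMP,
      CONSTRAINT pk_error_log PRIMARY KEY (error_log_id)
  );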

2
Error Category
3
Error Code
4
Different Solutions for Exception handling

Project Setup

1
Steps to download the sample project and the codebase

In this lecture, we walk through the steps to download the sample project from Informatica and the import options using the Repository Manager.

2
Importing and Creating the Database metadata and the sample data

Beyond the Informatica code, we also have to create the metadata and the data for the sample Data Warehouse schema at the database level. In this session, we execute the provided scripts and create the sources and targets.

3
Project explanation and aligning to the architecture

In this lecture, we understand how the sample data warehouse project maps to the architecture covered in this course.

4
Review of the project and the code base

This continues the previous lecture's review of the project and how it aligns with the architecture defined in this course.

Extending the Operational Metadata's Data Model

1
Supporting Operational Metadata for Exception Handling
2
Additional tables required to support the Exception Handling

In this lecture, we understand the additional tables required to handle the exceptions. 

3
Why do we need so many tables?

In this lecture, we understand the need for so many tables from the requirements perspective.

Error Handling Data Model

1
System Stage Table

This table will hold the information about the different stages in the application.

2
Data Source Metadata Table

This table holds the granular information of the source feeds.

3
Data Source Detail Table

This table holds more granular information about the Data Sources.

4
Data Source System Table

This table holds the information about the system (internal/external) from which the feeds are being sent. 

5
Job Details Table

This table holds all the information related to the Jobs and their respective stages.
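
To make the relationships concrete, here is a hedged sketch of how two of these tables might relate; the lectures define the actual columns, so treat everything below as an assumption.

  -- Stages defined for the application (e.g. sourcing, staging, DW load).
  CREATE TABLE system_stage (
      stage_id   NUMBER       NOT NULL,
      stage_name VARCHAR2(50) NOT NULL,
      CONSTRAINT pk_system_stage PRIMARY KEY (stage_id)
  );

  -- Jobs and the stage each one runs in.
  CREATE TABLE job_details (
      job_id   NUMBER        NOT NULL,
      job_name VARCHAR2(100) NOT NULL,
      stage_id NUMBER        NOT NULL,
      CONSTRAINT pk_job_details PRIMARY KEY (job_id),
      CONSTRAINT fk_job_stage FOREIGN KEY (stage_id)
          REFERENCES system_stage (stage_id)
  );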

6
Changes to the Error Log Table

In this lecture, we look at the changes made to the ERROR_LOG table compared to what we discussed earlier.

Mapping examples

1
Modify the existing mapping to load the wrong data into the ERROR_LOG Table

In this lecture, we modify the existing mapping to load the wrong data into the ERROR_LOG Table.

2
Other Possible ways of implementing the Error handling

In this lecture, we talk about the different possible ways of getting the Exception handling done. 

3
Using PMREP Tables for Error Handling

In this lecture, we see the option of using the PMREP tables to perform the Error Handling. 
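
As a sketch of the idea, the PowerCenter repository exposes run statistics through MX views such as REP_SESS_LOG; the query below pulls recent session failures. Column names vary across PowerCenter versions, so verify them against your repository before relying on this.

  -- Recent session runs that rejected rows, read straight from the
  -- repository's MX view (names may differ by version).
  SELECT session_name,
         successful_rows,
         failed_rows,
         first_error_code,
         first_error_msg
  FROM   rep_sess_log
  WHERE  failed_rows > 0
  ORDER BY session_timestamp DESC;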

Audit, Balance and Control

1
Metadata Management

This lecture covers the Metadata Management and different types of Metadata. 

2
Need for Operational Metadata

This lecture covers the need for Operational Metadata.

3
Different technical components involved for Audit Balance and Control

This lecture covers the different components involved in managing Audit, Balance and Control.

4
Table Structure for Audit Balance and Control

Let's review the table structure for Audit, Balance and Control.
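
As a rough sketch before the lecture's version (all names below are illustrative), an Audit, Balance and Control process log usually carries at least the run identifier, the job, timestamps, and the record counts used for balancing:

  CREATE TABLE etl_process_log (
      run_id           NUMBER       NOT NULL,  -- unique per job run
      job_id           NUMBER       NOT NULL,  -- FK to the job details table
      start_ts         TIMESTAMP    NOT NULL,
      end_ts           TIMESTAMP,
      src_record_count NUMBER,                 -- rows read from the source
      tgt_record_count NUMBER,                 -- rows written to the target
      error_count      NUMBER,                 -- rows routed to the error log
      status           VARCHAR2(20) NOT NULL,  -- 'RUNNING', 'SUCCESS', 'FAILED'
      CONSTRAINT pk_etl_process_log PRIMARY KEY (run_id)
  );

Balancing then reduces to checking that src_record_count equals tgt_record_count plus error_count for each run.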

5
Structure of the Workflow with the re-usable sessions

In this lecture, let's review the structure of the workflow and how the reusable sessions can be repurposed. 

6
RUN_ID, the Unique Attribute

This lecture covers the different options available for the RUN_ID creation. 

7
The Re-usable Stored Procedure

In this lecture, we will discuss the reusable stored procedure which will be used for the Audit control.
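
A minimal sketch of what such a reusable procedure can look like, written against the hypothetical ETL_PROCESS_LOG table sketched earlier; the procedure shown in the lecture will differ in detail.

  -- Called from a reusable session at the start and end of each run:
  -- 'START' opens a RUNNING row, 'END' closes it with counts and status.
  CREATE OR REPLACE PROCEDURE sp_audit_log (
      p_action    IN VARCHAR2,              -- 'START' or 'END'
      p_run_id    IN NUMBER,
      p_job_id    IN NUMBER,
      p_src_count IN NUMBER   DEFAULT NULL,
      p_tgt_count IN NUMBER   DEFAULT NULL,
      p_status    IN VARCHAR2 DEFAULT NULL
  ) AS
  BEGIN
      IF p_action = 'START' THEN
          INSERT INTO etl_process_log (run_id, job_id, start_ts, status)
          VALUES (p_run_id, p_job_id, SYSTIMESTAMP, 'RUNNING');
      ELSE
          UPDATE etl_process_log
          SET    end_ts           = SYSTIMESTAMP,
                 src_record_count = p_src_count,
                 tgt_record_count = p_tgt_count,
                 status           = p_status
          WHERE  run_id = p_run_id;
      END IF;
      COMMIT;
  END sp_audit_log;
  /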

8
Sequence Generator with Trigger

This lecture shows how to use a database-level Sequence Generator and Trigger to create the unique RUN_ID automatically.
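
A minimal sketch of the database-side approach, again assuming the hypothetical ETL_PROCESS_LOG table; the sequence and trigger names are illustrative.

  -- Sequence supplying RUN_ID values.
  CREATE SEQUENCE seq_run_id START WITH 1 INCREMENT BY 1;

  -- Trigger assigning the next RUN_ID automatically on insert, so the
  -- ETL jobs never have to supply one themselves.
  CREATE OR REPLACE TRIGGER trg_etl_process_log_run_id
  BEFORE INSERT ON etl_process_log
  FOR EACH ROW
  WHEN (NEW.run_id IS NULL)
  BEGIN
      SELECT seq_run_id.NEXTVAL INTO :NEW.run_id FROM dual;
  END;
  /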

9
Workflow Setup

Let's look at the example of setting up the workflow for ETL Process Log.

10
Assignments and What else is left?

In this lecture, we will talk about the rest of the attributes and the assignments possible for the Audit, Balance and Control. 


Configuration Management

1
What is Configuration Management?

Configuration Management (CM) deals with version control on software components and getting them ready for release. Releasing a set of components together into production and being able to track down software versions to a release is release management.

The Configuration Items (CIs) identified typically include:

  • Requirement Specification, Functional Specification, Design, Test Plans, Test Scripts, Installation Scripts, Reusable components, Test Cases and User Manuals
  • Source code, Support Software, Data dictionaries, Header files, Executable code
  • Standards and customized checklists
  • Any other tool that can interpret the Source code differently in building the executable code
  • Customer supplied items (Documents, Code, tools, Test Scripts etc.)
2
What are the different subject areas under Configuration Management?

In this lecture, we will start with understanding the different aspects of Configuration Management and the different subject areas. 

3
Incident Management

Incident management is an area of IT Service Management (ITSM) that involves returning service to normal as quickly as possible after an incident, in a way that has little to no negative impact on the business. In practice, incident management often relies upon temporary workarounds to ensure services are up and running while the incident is investigated, the root problem identified and a permanent fix put in place.

4
Change Management

Change management is the process that guides how we prepare, equip and support applications and systems to successfully adopt new business processes and changes to the existing system in order to drive organizational success and operational goals.

5
Release Management

Release Management is the process responsible for planning, scheduling, and controlling the build, in addition to testing and deploying Releases.

6
Capacity Management

Capacity management is a process used to manage information technology (IT). Its primary goal is to ensure that IT resources are right-sized to meet current and future business requirements in a cost-effective manner.
7
Service Level Management

A Service Level Agreement (SLA) is an agreement between an IT service provider and a customer. The SLA describes the IT service, documents service level targets, and specifies the responsibilities of the IT service provider and the customer.

8
Disaster Recovery and Availability Management

Availability Management aims to define, analyze, plan, measure and improve all aspects of the availability of IT services.

Bonus Section

1
Direct links to other courses

This lecture will provide you with the coupons for other courses.


Includes

  • 7 hours on-demand video
  • Full lifetime access
  • Access on mobile and TV
  • Certificate of Completion