Preparing for an interview is tricky. You need a good understanding of new features and you must revise the concepts you used in your preparation. This course helps you prepare for ETL Testing interviews with hands-on code examples covering 200+ interview questions and answers on a wide range of topics.
Discover not just what the interview questions are, but how to answer them so that you get the job as an ETL Testing or Data Warehouse Testing professional. Every question is answered with a detailed explanation and a narration of what happens practically in real-time projects.
What will you learn?
Understand what kind of questions are asked in ETL/DWH/BI Testing Interviews.
Answer questions on Data Warehouse concepts, ETL, BI and various other practical scenarios from real-time projects.
Understand New Features of ETL Tools.
Understand Basic Testing Concepts.
Understand Advanced practical DWH/ETL/BI Testing Concepts.
Answer questions on Data Validations, Test Data Creation and Supporting the business in UAT.
And much more….
What is required for you to start with this course?
Familiarity with RDBMS concepts and the basics of ETL Testing.
In the course, we use Informatica 9x/10x and Oracle 11g to demonstrate examples.
Introduction
Course Pre-Requisite and Introduction
Interview Questions on ETL and ETL Concepts
ETL stands for Extract, Transform, and Load. It is an important concept in Data Warehousing systems. Extraction stands for extracting data from different data sources such as transactional systems or applications. Transformation stands for applying the conversion rules on data so that it becomes suitable for analytical reporting. The loading process involves moving the data into the target system, normally a data warehouse.
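As a simple illustration, a single transform-and-load step can often be expressed in SQL. Below is a minimal sketch, assuming hypothetical stg_orders and dw_orders tables in Oracle:

INSERT INTO dw_orders (order_id, order_date, amount_usd)
SELECT order_id,
       TO_DATE(order_dt_str, 'YYYY-MM-DD'),  -- conversion rule applied in the Transform step
       amount * exchange_rate                -- derived measure for analytical reporting
FROM   stg_orders                            -- data previously extracted from the source system
WHERE  status = 'COMPLETE';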
ETL testing helps to
- Verify that the data is transformed as defined in the business requirements
- Verify that the projected and expected data is loaded into the data warehouse without any truncation or data loss (a sample reconciliation query follows this list)
- Make sure that the ETL application identifies and reports invalid data and replaces it with default values wherever defined
- Make sure that data loads within the expected time window, to ensure scalability and performance
- Make sure that data flows across the multiple layers defined in the architecture and adheres to the technical and business rules.
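As an example of the completeness checks above, a tester typically reconciles source and target with simple queries. A minimal sketch, assuming hypothetical stg_customer (source/staging) and dw_customer (target) tables:

-- Row-count reconciliation: both counts should match
SELECT COUNT(*) AS src_count FROM stg_customer;
SELECT COUNT(*) AS tgt_count FROM dw_customer;

-- Completeness check: rows present in the source but missing in the target
SELECT customer_id FROM stg_customer
MINUS
SELECT customer_id FROM dw_customer;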
Many ETL tools were originally developed to make the task of the data warehouse developer easier and more fun. Developers are spared the arduous task of handwriting SQL code, replacing it with easy drag-and-drop development of a data warehouse.
Today, the top ETL tools in the market have vastly expanded their functionality beyond data warehousing and ETL. They now contain extended functionalities for data profiling, data cleansing, Enterprise Application Integration (EAI), Big Data processing, data governance and master data management.
The popular ETL tools available in the market are −
- Informatica − PowerCenter
- IBM − WebSphere DataStage (formerly known as Ascential DataStage)
- SAP − Business Objects Data Services (BODS)
- IBM − Cognos Data Manager (formerly known as Cognos Decision Stream)
- Microsoft − SQL Server Integration Services (SSIS)
- Oracle − Data Integrator (ODI) (formerly known as Sunopsis Data Conductor)
- SAS − Data Integration Studio
- Oracle − Warehouse Builder
- Ab Initio
- CloverETL (open source)
Interview questions on Data Warehouse Concepts
The concept of data warehousing is not hard to understand. The notion is to create a permanent storage space for the data needed to support reporting, analysis, and other BI functions. In this lecture, we look at the main reasons for creating a data warehouse and the benefits it brings.
This long list of benefits is what makes data warehousing an essential management tool for businesses that have reached a certain level of complexity.
There are different terms used whenever data warehouse architecture is discussed. In this lecture, we will understand which architectures are possible and which are practically implemented.
The Data Warehouse Staging Area is a temporary location where data from the source systems is copied. A staging area is mainly required in a Data Warehousing architecture for timing reasons: in short, all required data must be available before it can be integrated into the Data Warehouse.
Due to varying business cycles, data processing cycles, hardware and network resource limitations and geographical factors, it is not feasible to extract all the data from all Operational databases at exactly the same time.
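In practice, data lands in staging tables first and is only merged into the warehouse once all required sources have arrived. A minimal Oracle sketch, assuming hypothetical stg_customer and dim_customer tables:

MERGE INTO dim_customer d
USING stg_customer s
ON (d.customer_id = s.customer_id)
WHEN MATCHED THEN
  UPDATE SET d.customer_name = s.customer_name   -- refresh existing rows
WHEN NOT MATCHED THEN
  INSERT (d.customer_id, d.customer_name)        -- load new rows
  VALUES (s.customer_id, s.customer_name);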
A Data Mart is a subset of a data warehouse that can provide data for reporting and analysis on a section, unit or a department like Sales Dept, HR Dept, etc.
We can either structure the warehouse for analysis in which case there need be no data marts. However, data marts are often employed and typically contain subsets of data from the warehouse, organized to provide a particular analytical view and delivered to groups or individuals in an organization. For example, a mart of marketing data could be provided for the advertising department and one of staff data to the HR department.
So, even if you have a warehouse you don't have to use data marts. Equally, it is possible to create marts without a warehouse. You can put the data from a particular system, a finance system for example, into a data mart and structure the data for analytical purposes.
Once you understand the problems that warehouses and marts are designed to address, it is much easier to decide which you need.
If there is a requirement to have data marts and the data warehouse together, what would that combined architecture look like?
OLAP stands for Online Analytical Processing. It uses database tables (Fact and Dimension tables) to enable multidimensional viewing, analysis and querying of large amounts of data.
OLTP stands for Online Transaction Processing. Apart from data warehouse databases, most other databases are OLTP systems. OLTP databases use a normalized schema structure and are designed to record the daily operations and transactions of a business.
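The difference is easy to see in the kind of SQL each system runs. A hedged sketch with hypothetical table names:

-- OLTP: record a single business transaction
INSERT INTO orders (order_id, customer_id, order_date, amount)
VALUES (1001, 57, SYSDATE, 250.00);

-- OLAP: analyse large volumes of historical data across dimensions
SELECT d.calendar_year, c.region, SUM(f.sales_amount) AS total_sales
FROM   fact_sales f
JOIN   dim_date d     ON d.date_key = f.date_key
JOIN   dim_customer c ON c.customer_key = f.customer_key
GROUP BY d.calendar_year, c.region;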
Dimensions are categories by which summarized data can be viewed. For example, a Profit fact table can be viewed by a Time dimension.
Dimensions which are reusable and fixed in nature are known as Conformed Dimensions, for example the Customer, Time and Geography dimensions.
A Fact Table is a table that contains summarized numerical data (facts) and historical data. The Fact Table has a foreign key-primary key relation with the dimension tables. The Fact Table maintains the information in 3rd normal form.
The types of Facts are as follows.
1. Additive Facts: A Fact which can be summed up across any of the dimensions available in the fact table.
2. Semi-Additive Facts: A Fact which can be summed up across a few dimensions but not across all dimensions available in the fact table.
3. Non-Additive Facts: A Fact which cannot be summed up across any of the dimensions available in the fact table.
The grain of a fact is the level of detail at which the fact information is stored. It is also known as Fact Granularity.
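The fact type and grain matter when writing validation queries. A minimal sketch, assuming hypothetical fact_sales and fact_account_balance tables:

-- Additive fact: sales_amount can be summed across every dimension
SELECT SUM(sales_amount) FROM fact_sales;

-- Semi-additive fact: account_balance can be summed across accounts
-- but not across time, so pick a single snapshot date
SELECT SUM(account_balance)
FROM   fact_account_balance
WHERE  snapshot_date = DATE '2015-12-31';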
A Star schema is defined as a logical database design in which a centrally located fact table is surrounded by one or more dimension tables. This design is best suited for a Data Warehouse or Data Mart.
In a Snowflake design, the dimension table (a de-normalized table) is further divided into one or more dimensions (normalized tables) to organize the information in a better structural format. To design a Snowflake schema, we first design the Star schema.
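A minimal DDL sketch of the two designs, with hypothetical table and column names:

-- Star schema: the category is kept inside the de-normalized dimension
CREATE TABLE dim_product (
  product_key   NUMBER PRIMARY KEY,
  product_name  VARCHAR2(100),
  category_name VARCHAR2(100)
);

-- Snowflake schema: the category is normalized into its own table
CREATE TABLE dim_category (
  category_key  NUMBER PRIMARY KEY,
  category_name VARCHAR2(100)
);
CREATE TABLE dim_product_sf (
  product_key   NUMBER PRIMARY KEY,
  product_name  VARCHAR2(100),
  category_key  NUMBER REFERENCES dim_category (category_key)
);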
An Operational Data Store (ODS) is a collection of integrated databases designed to support operational monitoring. Unlike OLTP databases, the data in the ODS is integrated, subject-oriented and enterprise-wide.
Interview questions related to tables and database in the project
FUNCTION | DATABASE TESTING | ETL TESTING |
---|---|---|
Primary Goal | Data validation and integration | Data extraction, transformation and loading for BI reporting |
Applicable System | Transactional system where the business flow occurs | System containing historical data, outside the business flow environment |
Common Tools in market | QTP, Selenium, etc. | QuerySurge, Informatica, etc. |
Business Need | It is used to integrate data from multiple applications; defects have a severe impact. | It is used for analytical reporting, information and forecasting. |
Modeling | ER method | Multidimensional |
Database Type | It is normally used in OLTP systems | It is applied to OLAP systems |
Data Type | Normalized data with more joins | De-normalized data with fewer joins, more indexes and aggregations |
What type of schema was used in your project and why?
One of the common questions asked in any interview is how many tables are there in your project or data model?
What were/are the tables used in your project?
Which were/are the columns in the tables?
How many records do your sources and targets have?
Explain the different queries used in the project.
Have you performed Type 2 validations in your project?
What are the challenges faced during testing?
How many fact, dimension and reference tables are there in your project?
Can you name some columns used in the fact and dimension tables?
What is Data Sensitivity?
How to identify Fact Dimensions Reference Tables based on data?
What is Reconciliation?
How are the exit and entry criteria defined?
Aggregate functions are used to group multiple rows of a single column into a more significant measurement. They are also used for performance optimization when we store aggregated tables in the data warehouse.
Common Aggregate functions are −
MIN | returns the smallest value in a given column |
MAX | returns the largest value in a given column |
SUM | returns the sum of the numeric values in a given column |
AVG | returns the average value of a given column |
COUNT | returns the total number of values in a given column |
COUNT(*) | returns the number of rows in a table |
Example
SELECT AVG(salary) FROM employee WHERE title = 'developer';
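A GROUP BY variant is just as common when aggregated tables are built or validated; the department column is an assumption:

SELECT department, COUNT(*) AS headcount, AVG(salary) AS avg_salary
FROM   employee
GROUP BY department;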
A Surrogate Key is a sequence-generated number with no business meaning, used only to identify a row uniquely. It is not visible to users or applications. It is also referred to as an artificial or synthetic key.
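In Oracle 11g a surrogate key is typically populated from a sequence. A minimal sketch with hypothetical object names:

CREATE SEQUENCE customer_sk_seq START WITH 1 INCREMENT BY 1;

INSERT INTO dim_customer (customer_key, customer_id, customer_name)
VALUES (customer_sk_seq.NEXTVAL, 'C1001', 'Acme Ltd');
-- customer_key carries no business meaning; it only identifies the row uniquely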
Interview questions on Data Validations
Where do the data validations happen?
What is Data Cleansing and Data Scrubbing?
What are Data Type and Data Length Validations?
What are the approaches followed for new data file validations?
What validations are done in Incremental Loads?
Interview questions related to project/application
What areas are covered under finance domain for ETL testing projects?
Is the test data created and loaded by the development team?
What are the areas where there is a possibility of defects in the Data Warehouse?
What are the documents used in a project and where are the validations defined?
Who is responsible for generating the different kinds of documents in a project?
What is a test case document?
How are Bugs or defects managed in your project?
What were the details in each project document?
What was the complete process followed in your project?
Can you explain about the number of downstream systems to your project?
How to test end to end integration of system?
Interview questions on RCA (Root Cause Analysis) and UAT Support
Did you perform any Root Cause Analysis (RCA) for issues in your project?
What are the steps to prepare test data for UAT (Production like)?
How did you support UAT?
Interview questions related to BI/Reporting
An ETL tool is used to extract data from different data sources, transform the data, and load it into a Data Warehouse. A BI tool, on the other hand, is used to generate interactive, ad-hoc or on-demand reports for end users, dashboards for senior management, and data visualizations for monthly, quarterly, and annual board meetings.
Most common ETL tools include − SAP BO Data Services (BODS), Informatica, Microsoft – SSIS, Oracle Data Integrator ODI, Talend Open Studio, Clover ETL Open source, etc.
Most common BI tools include − SAP Business Objects, SAP Lumira, IBM Cognos, JasperSoft, Microsoft BI Platform, Tableau, Oracle Business Intelligence Enterprise Edition, etc.
How many downstream systems are there for your application/project?
What is Ad-Hoc Reporting?
Interview questions on Data Modelling concepts
What are some of the basic concepts to consider during Data Modelling?
Interview questions on Master Data Management (MDM)
What is MDM and what is the need for it in an enterprise and how is it setup?
Interview questions on Slowly Changing Dimensions (SCD)
Have you performed SCD Type 2 Validations in your project?
How is the SCD setup done at ETL layer?
What are the challenges faced with respect to SCD?
What is ETL automation and what are the areas in which you have done this?
Interview questions on Informatica Power Center Tool
A transformation is a repository object that generates, modifies, or passes data. Transformations in a mapping represent the operations the PowerCenter Server performs on the data. Data passes into and out of transformations through ports that you link in a mapping or mapplet. Transformations can be active or passive. An active transformation can change the number of rows that pass through it. A passive transformation does not change the number of rows that pass through it.
A transformation is a set of rules that generates, modifies, or passes data. Transformations can be of two types − Active and Passive.
Transformation | Type | Description |
Aggregator | Active / Connected | Performs aggregate calculations |
Application Source Qualifier | Active / Connected | Represents the rows that the PowerCenter Server reads from an application, such as an ERP source, when it runs a session. |
Custom | Active or Passive / Connected | Calls a procedure in a shared library or DLL. |
Expression | Passive / Connected | Calculates a value |
External Procedure | Passive / Connected or Unconnected | Calls a procedure in a shared library or in the COM layer of Windows. |
Filter | Active / Connected | Filters data |
Input | Passive / Connected | Defines mapplet input rows. Available in the Mapplet Designer |
Joiner | Active / Connected | Joins data from different databases or flat file systems. |
Lookup | Passive / Connected or Unconnected | Looks up values |
Normalizer | Active / Connected | Source qualifier for COBOL sources. Can also use in the pipeline to normalize data from relational or flat file sources. |
Output | Passive / Connected | Defines mapplet output rows. Available in the Mapplet Designer. |
Rank | Active / Connected | Limits records to a top or bottom range. |
Router | Active / Connected | Routes data into multiple transformations based on group conditions. |
Sequence Generator | Passive / Connected | Generates primary keys. |
Sorter | Active / Connected | Sorts data based on a sort key. |
Source Qualifier | Active / Connected | Represents the rows that the PowerCenter Server reads from a relational or flat file source when it runs a session. |
Stored Procedure | Passive / Connected or Unconnected | Calls a stored procedure. |
Transaction Control | Active / Connected | Defines commit and rollback transactions. |
Union | Active / Connected | Merges data from different databases or flat file systems. |
Update Strategy | Active / Connected | Determines whether to insert, delete, update, or reject rows. |
XML Generator | Active / Connected | Reads data from one or more input ports and outputs XML through a single output port. |
XML Parser | Active / Connected | Reads XML from one input port and outputs data to one or more output ports. |
XML Source Qualifier | Active / Connected | Represents the rows that the PowerCenter Server reads from an XML source when it runs a session. |
In an active transformation, the number of rows produced as output can change as the transformation is applied. This does not happen in a passive transformation: the data passes through with the same number of rows it was given as input.
The Lookup transformation allows you to access data from relational tables that are not defined as sources in the mapping. It is commonly used when loading slowly changing dimension tables to determine whether a record already exists in the target.
Connected Lookup | Unconnected Lookup |
Receives input values directly from the pipeline in the mapping | It is used when the lookup is called like a function (for example from an Expression transformation) instead of being connected in the mapping flow |
Can return multiple output ports | Only returns one output port |
Supports a static or dynamic lookup cache | Supports a static cache only |
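From a testing point of view, the effect of a lookup is often re-created in SQL with an outer join to confirm whether incoming records already exist in the target. A hedged sketch with hypothetical tables:

SELECT s.customer_id,
       CASE WHEN t.customer_id IS NULL THEN 'INSERT' ELSE 'UPDATE' END AS expected_action
FROM   stg_customer s
LEFT JOIN dim_customer t ON t.customer_id = s.customer_id;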
A Worklet is a group of sessions. To execute a worklet, we have to create a workflow and place the worklet in it.
A Workflow is a set of instructions on how to execute tasks such as sessions, email notifications and shell commands. A Workflow is created in the Workflow Manager.
- A Mapplet defines the transformation rules.
- A Session is a set of instructions that tells the server how and when to move data from the source to the target system.
- A Workflow is a set of instructions that instructs the server on task execution.
- A Mapping defines the movement of data from the source to the destination.
Tracing level determines the amount of information the Integration Service writes to the session log files. Broadly, tracing is either summary-level (Normal) or row-level (Verbose); the full set of levels is described below.
None: Integration Service uses the tracing level set in the mapping.
Terse: Integration Service logs initialization information, error messages, and notification of rejected data.
Normal: Integration Service logs initialization and status information, errors encountered, and skipped rows due to transformation row errors. Summarizes session results, but not at the level of individual rows.
Verbose Initialization: In addition to normal tracing, the Integration Service logs additional initialization details, names of index and data files used, and detailed transformation statistics.
Verbose Data: In addition to verbose initialization tracing, the Integration Service logs each row that passes into the mapping. Also notes where the Integration Service truncates string data to fit the precision of a column and provides detailed transformation statistics.
When you configure the tracing level to verbose data, the Integration Service writes row data for all rows in a block when it processes a transformation.
Interview questions on UNIX commands
Where is UNIX used in a Data Warehouse Project?
What are the situations where you had to use the UNIX commands in ETL testing?
This lecture shows how to connect to UNIX/LINUX in real time projects.
What are the commands to navigate between folders in UNIX?
In this lecture, we talk about the common mistakes beginners make in LINUX with file creation and the naming conventions of files.
Understanding the default layout of the ls command and permissions
What is the file command and how to use it?
What is the less command and how to use it?
What are wildcards? How to use them?