Data warehousing (DW) is the practice of gathering data from many sources and analysing it to gain useful business insights. A data warehouse typically integrates corporate data from those sources and sits at the heart of a business intelligence (BI) system, which is designed to analyse and report on data.
A data warehouse is a collection of technologies and components that supports a company’s data strategy. It is the electronic storage of a large volume of company data intended for query and analysis rather than transaction processing. Data warehousing converts data into information and makes it available to users in a timely manner so that they can act on it. The warehouse is built by combining data from a variety of disparate sources to support analytical reporting, structured and ad hoc queries, and decision-making. Cleaning, integrating, and consolidating data are all part of data warehousing.
The data warehouse is kept separate from the company’s operational database. It is an environment rather than a product: an architectural construct of an information system that gives users access to current and historical decision-support data that is difficult to access or present in a standard operational data store.
For example, a report on current inventory may involve more than 12 join conditions, which can make the query and the report slow to respond. A data warehouse uses a different design that helps to improve query performance and minimise response time for reporting and analytics.
The following are some alternative names for the data warehouse system:
Decision Support System (DSS).
Management Information System.
Executive Information System.
Analytic Application.
Business Intelligence Solution.
Data Warehouse Interview Questions for Freshers
1. What do you mean by data mining? Differentiate between data mining and data warehousing.
Data mining is the process of analysing large amounts of data to find patterns, trends, and usable information that help a company make data-driven decisions. In other words, it examines hidden patterns in data from various perspectives and categorises them into useful information. The data is typically gathered and assembled in areas such as data warehouses so that it can be analysed efficiently with data mining algorithms, which supports decision-making and ultimately helps to cut costs and increase revenue. Data mining automatically examines enormous amounts of data for patterns and trends that go beyond simple analysis, and it uses advanced mathematical algorithms to segment the data and estimate the probability of future events.
Following are the differences between data warehousing and data mining:

| Data Warehousing | Data Mining |
| --- | --- |
| A data warehouse is a database system that is intended for analytical rather than transactional purposes. | Data mining is the technique of examining data for patterns. |
| Data is loaded and stored periodically. | Data is analysed regularly. |
| Data warehousing is carried out mainly by engineers. | Data mining is carried out by business users with the assistance of technologists. |
| Data warehousing is the process of bringing all relevant data together. | Data mining is the process of extracting information from large datasets. |
| Data warehousing supplies the integrated data that data mining works on. | Data mining is one of the analyses that can be run on top of a data warehouse. |
2. What do you mean by OLAP in the context of data warehousing? What guidelines should be followed while selecting an OLAP system?
OLAP stands for On-Line Analytical Processing. It is a category of software technology that lets analysts, managers, and executives gain insight into information through fast, consistent, interactive access to data that has been transformed from raw data to reflect the real dimensionality of the enterprise as its users understand it. OLAP supports multidimensional analysis of corporate data as well as complex calculations, trend analysis, and sophisticated data modelling. It is increasingly the foundation for intelligent solutions such as business performance management, planning, budgeting, forecasting, financial reporting, analysis, simulation modelling, knowledge discovery, and data warehouse reporting. End users can use OLAP to perform ad hoc analysis of records in several dimensions, giving them the information and understanding they need to make better decisions.
The following guidelines should be followed when selecting an OLAP system:
Multidimensional Conceptual View: This is one of an OLAP system’s most important capabilities. A multidimensional view is what makes operations such as slice and dice possible.
Transparency: The technology, the underlying data repository, the computing operations, and the disparate nature of the source data should be transparent to users, so that they do not have to deal with them directly. This transparency improves users’ efficiency and productivity.
Accessibility: The OLAP system should access only the data that is actually needed for the analysis and present users with a single, coherent, and consistent view. It must map its own logical schema to the disparate physical data stores and perform any required transformations.
Consistent Reporting Performance: Users should not experience any substantial drop in reporting performance as the number of dimensions or the size of the database grows.
Client/Server Architecture: The server component of the OLAP tool should be intelligent enough that various clients can be attached with minimal effort and integration code. The server should be able to map and consolidate data from disparate databases.
Generic Dimensionality: Every dimension in an OLAP system should be equivalent in both structure and operational capabilities. Selected dimensions may be granted additional operational capabilities, but it should be possible to grant such capabilities to any dimension.
Dynamic Sparse Matrix Handling: The system should adapt the physical schema to the specific analytical model being built and loaded so that sparse matrices are handled optimally. When it encounters a sparse matrix, it must be able to deduce the distribution of the data dynamically and adjust storage and access to achieve and maintain a consistent level of performance.
Multiuser Support: OLAP tools must allow several users to access data at the same time while maintaining data integrity and security.
Unrestricted Cross-dimensional Operations: The system should recognise dimensional hierarchies and perform roll-up and drill-down operations within and across dimensions.
Intuitive Data Manipulation: Reorientation (pivoting), drill-down, roll-up, and other manipulations should be carried out intuitively and directly on the cells of the analytical model using point-and-click and drag-and-drop actions, without the need for menus or multiple trips through the user interface.
Flexible Reporting: Business users should be able to arrange columns, rows, and cells in a way that makes the data easy to manipulate, analyse, and synthesise.
Unlimited Dimensions and Aggregation Levels: There should be no limit on the number of data dimensions. Within any given consolidation path, each dimension should allow a practically unlimited number of user-defined aggregation levels.
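The roll-up and slice operations mentioned in these guidelines can be illustrated with a minimal sketch. It models a tiny cube as a Python dictionary; the dimension names, the sales figures, and the helper functions `roll_up` and `slice_cube` are all hypothetical and are not the API of any OLAP product.

```python
from collections import defaultdict

# Hypothetical sales cube: each cell is keyed by (year, region, product).
DIMENSIONS = ("year", "region", "product")
cube = {
    (2023, "North", "Laptop"): 120,
    (2023, "North", "Phone"): 200,
    (2023, "South", "Laptop"): 80,
    (2024, "North", "Laptop"): 150,
    (2024, "South", "Phone"): 90,
}


def roll_up(cells, keep):
    """Aggregate the measure over every dimension not listed in `keep`."""
    positions = [DIMENSIONS.index(name) for name in keep]
    totals = defaultdict(int)
    for key, value in cells.items():
        totals[tuple(key[p] for p in positions)] += value
    return dict(totals)


def slice_cube(cells, dimension, member):
    """Fix one dimension to a single member and keep only the matching cells."""
    position = DIMENSIONS.index(dimension)
    return {key: value for key, value in cells.items() if key[position] == member}


# Roll up to yearly totals, then slice on region and roll up by product.
sales_by_year = roll_up(cube, keep=("year",))
north_only = slice_cube(cube, "region", "North")
north_by_product = roll_up(north_only, keep=("product",))

print(sales_by_year)      # {(2023,): 400, (2024,): 240}
print(north_by_product)   # {('Laptop',): 270, ('Phone',): 200}
```

Drill-down is the reverse of roll-up: calling `roll_up` with more dimensions in `keep` returns the finer-grained cells again.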
3. What do you understand about a fact table in the context of a data warehouse? What are the different types of fact tables?
In a data warehouse system, a fact table is a table that holds the facts, or business measurements, that can be exposed to reporting and analysis when needed. It stores the fields that represent the facts themselves as well as the foreign keys that connect the fact table to the dimension tables in the data warehouse. Depending on the model used to build it, a data warehouse can have one or more fact tables. The main types of fact table are:
Transactional Fact Table: This is the most basic and fundamental view of business operations. Each row records an event that occurred at a specific point in time, and its measures are valid only for that moment and that event. The grain of a transaction table is typically “one row per line in a transaction”. Because it holds data at the most detailed (atomic) level, it usually has a large number of dimensions associated with it, which gives users extensive dimensional grouping, roll-up, and drill-down reporting. It can be dense or sparse, and it can grow very large, depending on the number of events (transactions) that have occurred.
Snapshot Fact Table: A periodic snapshot depicts the state of things at a specific point in time, sometimes called a “picture of the moment”. It usually contains more non-additive and semi-additive facts. It helps to review the company’s overall performance at regular, predictable intervals. Unlike the transaction fact table, which adds a new row for each occurrence of an event, the snapshot records the performance of an activity at the end of each day, week, month, or other interval. Periodic snapshots are usually derived from the transaction fact table, which remains the place to retrieve detailed data. Periodic snapshot tables are typically large and take up a lot of space.
Accumulating Fact Table: This type depicts the activity of a process that has a well-defined beginning and end. Accumulating snapshots commonly contain multiple date stamps, which reflect the predictable stages or events that occur over the lifespan of the process. There is sometimes an extra date column that indicates when the row was last updated.
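The three fact table types can be contrasted with a small sketch. The inventory movements, the order record, and every field name below are made up for illustration; a real warehouse would hold these rows in database tables rather than Python lists.

```python
from collections import defaultdict

# Transactional fact rows: one row per inventory movement (hypothetical data).
transactions = [
    {"date": "2024-01-01", "product": "widget", "quantity": 50},
    {"date": "2024-01-01", "product": "widget", "quantity": -10},
    {"date": "2024-01-02", "product": "widget", "quantity": -5},
    {"date": "2024-01-03", "product": "widget", "quantity": 20},
]

# Periodic snapshot: one row per product per day with the end-of-day stock level.
daily_change = defaultdict(int)
for row in transactions:
    daily_change[(row["product"], row["date"])] += row["quantity"]

snapshot = []
on_hand = defaultdict(int)
for product, date in sorted(daily_change):
    on_hand[product] += daily_change[(product, date)]
    snapshot.append({"date": date, "product": product, "on_hand": on_hand[product]})

# Accumulating snapshot: one row per order, updated as each milestone is reached.
order = {"order_id": 1001, "ordered": "2024-01-01", "shipped": None, "delivered": None}
order["shipped"] = "2024-01-03"     # the same row is revisited, not re-inserted
order["delivered"] = "2024-01-06"

print([row["on_hand"] for row in snapshot])   # [40, 35, 55]
```

The `on_hand` value is semi-additive: it can be summed across products for a single day, but summing it across days gives a meaningless number, which is why snapshot tables tend to hold more semi-additive facts than transaction tables.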
4. What do you mean by dimension table in the context of data warehousing? What are the advantages of using a dimension table?
A dimension table is a table in the star schema of a data warehouse. Data warehouses are built using dimensional data models, which are made up of fact tables and dimension tables. Dimension tables describe dimensions and contain dimension keys, values, and attributes. A dimension table is usually small, with anywhere from a few rows to several thousand. It describes the objects referred to in the fact table and, more generally, holds a group of data related to a measurable event. Dimension tables serve as the foundation for dimensional modelling. Each one includes a primary key column that uniquely identifies every dimension row, and it is linked to the fact tables through this key. When the table is built, a system-generated key called the surrogate key is typically used for this purpose.
Following are the advantages of using a dimension table:
It has a straightforward design.
It is simple to explore and understand.
It stores denormalised data.
It helps to preserve historical data for any dimension.
It is simple to retrieve information from it.
It is simple to build and implement.
It provides the context for any business process.
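The link between a dimension table and a fact table described in this answer can be sketched with Python’s built-in `sqlite3` module. The table names `dim_product` and `fact_sales`, their columns, and the sample rows are assumptions made for the example and do not come from any particular warehouse.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE dim_product (
        product_key  INTEGER PRIMARY KEY,  -- surrogate key generated by the warehouse
        product_code TEXT,                 -- natural key from the source system
        product_name TEXT,
        category     TEXT
    );
    CREATE TABLE fact_sales (
        product_key INTEGER REFERENCES dim_product(product_key),
        sale_date   TEXT,
        quantity    INTEGER,
        amount      REAL
    );
""")

# SQLite assigns the surrogate keys 1 and 2 because product_key is omitted.
conn.executemany(
    "INSERT INTO dim_product (product_code, product_name, category) VALUES (?, ?, ?)",
    [("P-100", "Laptop", "Computers"), ("P-200", "Mouse", "Accessories")],
)
conn.executemany(
    "INSERT INTO fact_sales VALUES (?, ?, ?, ?)",
    [(1, "2024-01-05", 2, 2000.0), (2, "2024-01-05", 5, 100.0), (1, "2024-01-06", 1, 1000.0)],
)

# The dimension supplies the grouping label; the fact table supplies the measures.
rows = conn.execute("""
    SELECT d.category, SUM(f.quantity), SUM(f.amount)
    FROM fact_sales AS f
    JOIN dim_product AS d ON d.product_key = f.product_key
    GROUP BY d.category
    ORDER BY d.category
""").fetchall()
print(rows)   # [('Accessories', 5, 100.0), ('Computers', 3, 3000.0)]
```

Keeping the source system’s `product_code` separate from the surrogate `product_key` is what allows the warehouse to version a product later without disturbing the fact rows that already reference it.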
5. What are the different types of dimension tables in the context of data warehousing?
Following are the different types of dimension tables in the context of data warehousing:
Slowly Changing Dimensions (SCD): Slowly changing dimensions are dimensions whose attributes change slowly and unpredictably over time rather than on a regular schedule. For example, a customer’s address and phone number may change, but not on a regular basis. Consider a person who moves between several countries and whose address must be updated each time. Such a change can be handled in one of three ways:
Type 1: Overwrite the previous value. This strategy is simple to implement and saves space, which keeps costs down. However, the history is lost.
Type 2: Insert a new row containing the new value. This method preserves the full history and allows it to be accessed at any time. However, it takes up more space, which raises costs.
Type 3: Add a new column that holds the previous value alongside the current one. This preserves a limited amount of history (typically only the prior value) without adding rows.
Junk Dimension: A junk dimension is a collection of low-cardinality attributes. It groups miscellaneous attributes, such as flags and indicators, that are unrelated to one another. It can also be used to hold rapidly changing dimension (RCD) attributes such as flags and weights.
Conformed Dimension: A conformed dimension is shared by multiple subject areas or data marts. It can be used in a variety of projects without requiring any changes, which keeps reporting consistent across them. Dimensions that are exactly the same as, or a proper subset of, another dimension are known as conformed dimensions.
Role-Playing Dimension: A role-playing dimension is a dimension table that has several relationships with the fact table. In other words, the same dimension key, with all of its associated attributes, is referenced by more than one foreign key in the fact table, so the dimension plays several roles within the same database. For example, a single date dimension may be referenced as both an order date and a shipping date.
Degenerate Dimension: Degenerate dimension attributes are those that are stored in the fact table itself rather than in a separate dimension table. Examples include a ticket number, an invoice number, or a transaction number.
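The three ways of handling a slowly changing dimension listed above can be compared with a minimal sketch. The customer rows, the column names (`valid_from`, `valid_to`, `is_current`, `previous_city`), and the three helper functions are hypothetical choices for the example; real warehouses usually implement the same logic in SQL or an ETL tool.

```python
from datetime import date


def scd_type1(rows, customer_id, new_city):
    """Type 1: overwrite in place; the old value is lost."""
    for row in rows:
        if row["customer_id"] == customer_id:
            row["city"] = new_city


def scd_type2(rows, customer_id, new_city, changed_on):
    """Type 2: close the current row and append a new version of it."""
    for row in rows:
        if row["customer_id"] == customer_id and row["is_current"]:
            row["valid_to"] = changed_on
            row["is_current"] = False
    rows.append({
        "customer_key": max(r["customer_key"] for r in rows) + 1,  # new surrogate key
        "customer_id": customer_id,
        "city": new_city,
        "valid_from": changed_on,
        "valid_to": None,
        "is_current": True,
    })


def scd_type3(rows, customer_id, new_city):
    """Type 3: move the old value into an extra column, then overwrite."""
    for row in rows:
        if row["customer_id"] == customer_id:
            row["previous_city"] = row["city"]
            row["city"] = new_city


type1 = [{"customer_id": "C1", "city": "Pune"}]
scd_type1(type1, "C1", "Berlin")

type2 = [{"customer_key": 1, "customer_id": "C1", "city": "Pune",
          "valid_from": date(2020, 1, 1), "valid_to": None, "is_current": True}]
scd_type2(type2, "C1", "Berlin", date(2024, 6, 1))

type3 = [{"customer_id": "C1", "city": "Pune", "previous_city": None}]
scd_type3(type3, "C1", "Berlin")

print(len(type1), len(type2), len(type3))   # 1 2 1
```

Only the Type 2 table grows: it now holds both the Pune row and the Berlin row, each with its own surrogate key, so older fact rows can keep pointing at the address that was valid when they were recorded.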
6. Differentiate between fact table and dimension table.
A record in a fact table combines measures with keys to several dimension tables. The fact table helps users examine the business measures that support decision-making and so improve the business. Dimension tables, in turn, supply the dimensions along which the measures in the fact table are analysed.
The following table lists the differences between a fact table and a dimension table:

| Fact Table | Dimension Table |
| --- | --- |
| It contains the measurements, facts, or metrics of a business process. | It is the companion table that contains the descriptive attributes by which the facts are analysed. |
| It is defined by its grain (the most atomic level at which facts are recorded). | It is descriptive, complete, and wordy. |
| It contains the measures used for analysis and decision-making. | It contains descriptive information about the business and its processes. |
| It mostly contains numeric measures, along with keys. | It mostly contains textual attributes. |
| It has foreign keys that reference the primary keys of the dimension tables; together they often form its composite primary key. | It has a primary key that the fact table references through a foreign key. |
| It holds the atomic data, organised by dimension keys. | It stores the filter domains and report labels. |
| It does not have a hierarchy. | It has a hierarchy. |
| It has fewer attributes than a dimension table. | It has more attributes than a fact table. |
| It has more records than a dimension table. | It has fewer records than a fact table. |
| It grows vertically. | It grows horizontally. |
| It is created after the corresponding dimension tables have been created. | It is created before the fact table. |
7. What are the advantages of a data warehouse?
Following are the advantages of using a data warehouse:
Helps you save time:
To stay ahead of competitors in today’s fast-paced market, your company’s ability to make sound decisions quickly is critical.
A data warehouse gives you quick access to all of your essential data in one place, so you and your staff spend less time gathering it under deadline pressure. Once your data model is deployed, most warehousing solutions let you start collecting and exploring data without writing sophisticated queries or using machine learning.
With data warehousing, your company does not have to rely on a technical professional being available around the clock to troubleshoot data retrieval issues, which saves a lot of time.
Improves data quality:
High-quality data ensures that your company’s policies are founded on accurate information about your operations.
With data warehousing, you can convert data from numerous sources into a shared structure, which helps to ensure the consistency and integrity of your company’s data. This allows you to spot and eliminate duplicate data, inaccurately reported data, and misinformation.
Implementing a data quality management programme can be both costly and time-consuming for your firm. A data warehouse can reduce many of these problems while saving money and increasing the overall productivity of your company.
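The idea of converting data from several sources into a shared structure and removing duplicates can be sketched as follows. The two source extracts, their field names, and the quality rule (a non-empty email address) are invented for this example; a production pipeline would apply many more rules.

```python
# Two hypothetical source extracts with different field names and formats.
crm_rows = [
    {"CustomerEmail": "Ana@Example.com ", "FullName": "Ana Silva"},
    {"CustomerEmail": "bob@example.com", "FullName": "Bob Lee"},
]
billing_rows = [
    {"email": "ana@example.com", "name": "Ana Silva"},
    {"email": "", "name": "Unknown"},
]


def normalise(email, name, source):
    """Map a source record onto the shared structure used by the warehouse."""
    return {"email": email.strip().lower(), "name": name.strip(), "source": source}


staged = [normalise(r["CustomerEmail"], r["FullName"], "crm") for r in crm_rows]
staged += [normalise(r["email"], r["name"], "billing") for r in billing_rows]

customers = {}   # conformed records, keyed by email so duplicates collapse
rejected = []    # rows that fail the quality rule
for row in staged:
    if not row["email"]:
        rejected.append(row)
    elif row["email"] not in customers:
        customers[row["email"]] = row

print(sorted(customers))   # ['ana@example.com', 'bob@example.com']
print(len(rejected))       # 1
```

Here the first record seen for each email wins, so the CRM copy of Ana’s record is kept and the billing copy is dropped as a duplicate. Choosing which source to trust is a design decision that depends on the business.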
Enhances Business Intelligence (BI):
You can use a data warehouse to gather, assimilate, and derive data from any source across your business activities. Because data from several sources can be consolidated easily, your BI improves substantially.
Data standardization and Consistency are achieved:
Uniformity of large volumes of data is another key benefit of a central data repository; a data mart can benefit your company in a similar way. Because data warehousing stores data from various sources, such as transactional systems, in a consistent manner, results derived from each source stay in sync with those from the others. This makes the data more uniform and of higher quality. As a result, you and your team can have more confidence in your data, leading to better-informed corporate decisions.
Improves data security:
A data warehouse improves security by building security features into its design. Consumer data is a vital resource for any business. By adopting a warehousing solution, you can keep all of your data sources integrated and properly protected, which reduces the risk of a data breach.
Ability to store historical data:
Because a data warehouse can hold enormous amounts of historical data from operational systems, you can readily study different time periods and trends that could be game-changing for your business. With the right facts in hand, you can make better decisions about your business plans.
8. What are the disadvantages of using a data warehouse?
Following are the disadvantages of using a data warehouse:
Data loading time is underestimated: The time it takes to extract, clean, and load data into the warehouse is frequently underestimated. Although tools exist to reduce the time and effort involved, this step can take up a significant share of the overall development time.
Hidden source system flaws: Hidden problems in the source systems that feed the data warehouse may only be discovered after years of going unnoticed. For example, some fields may accept nulls when new property information is entered, so staff end up entering incomplete property data even when the data was available and relevant.
Homogenisation of data: Data warehousing forces data from diverse sources into similar formats. Some important detail may be lost as a result.
9. What are the different types of data warehouse?
Following are the different types of data warehouse:
Enterprise Data Warehouse:
An enterprise data warehouse is a database that brings together the various functional areas of an organisation in a cohesive manner. It is a centralised location where all corporate data from various sources and applications can be accessed. Once the data has been stored, it can be used for analytics by everyone in the organisation. The data can be categorised by subject, and access is granted according to each division’s needs. The tasks of extracting, transforming, and conforming data are taken care of in an enterprise data warehouse.
The purpose of an enterprise data warehouse is to provide a comprehensive overview of any object in the data model. This is done by finding and wrangling the data from different systems and then loading it into a consistent, conformed model. The enterprise data warehouse gives access to a single place where various tools can be used to run analytical functions and generate predictions. Research teams can identify new trends or patterns, which the company can then focus on to help it grow.
Operational Data Store:
An operational data store (ODS) is used in place of an operational decision support system application. It facilitates data access directly from the database and also supports transaction processing. The data in the operational data store can be cleansed by checking it against the associated business rules, and any redundancy found can be corrected. It also helps to integrate disparate data from many sources so that business operations, analysis, and reporting can be carried out quickly and effectively while the business processes are still running.
Most current operational data is stored here before being migrated to the data warehouse for long-term storage. The ODS is particularly useful for simple queries and small amounts of data. It functions as short-term or temporary memory, storing recent data, whereas the data warehouse keeps data for a long time and holds information that is generally permanent.
Data Mart:
A data mart is a pattern for serving data to a particular group of users in a data warehouse environment. It is a structure specific to the data warehouse that is used by one business domain or team. Each department can have its own data mart, which is kept in the data warehouse repository. Dependent, independent, and hybrid data marts are the three types. Dependent data marts take data that has already been prepared in a data warehouse, independent data marts collect data directly from operational or external sources, and hybrid data marts combine both. Data marts can be thought of as logical subsets of a data warehouse.
10. What are the different types of data marts in the context of data warehousing?
Following are the different types of data mart in data warehousing:
Dependent Data Mart: A dependent data mart is built from a central data warehouse, which in turn is loaded from operational sources, external sources, or both. It lets the organisation’s source data be accessed through a single data warehouse. Because all data is centralised, it is easier to develop further data marts.
Independent Data Mart: This data mart does not need a central data warehouse. It is typically set up for smaller groups within a company and has no connection to the enterprise data warehouse or any other data warehouse. Its data is self-contained and can be used and analysed independently. When several independent data marts exist, keeping their data consistent becomes critical, because there is no centralised repository that all users share.
Hybrid Data Mart: As the name implies, a hybrid data mart combines inputs from a data warehouse with inputs from other sources. It is useful when a user requires ad hoc integration, or when an organisation needs several database environments and a quick implementation. It requires the least data cleansing effort, and it can accommodate large storage structures. A data mart is most effective when smaller, data-centric applications are used.
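A dependent data mart, as a logical subset of a central warehouse, can be sketched with a database view. The example uses Python’s built-in `sqlite3` module; the `warehouse_sales` table, the `mart_marketing` view, and the sample rows are assumptions made for illustration, and many real data marts are materialised as separate tables or databases instead of views.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Central warehouse table shared by every department (hypothetical schema).
    CREATE TABLE warehouse_sales (
        department TEXT,
        region     TEXT,
        amount     REAL
    );
    INSERT INTO warehouse_sales VALUES
        ('marketing', 'EU', 100.0),
        ('marketing', 'US', 250.0),
        ('finance',   'EU', 900.0);

    -- Dependent data mart: a department-specific subset defined on the warehouse.
    CREATE VIEW mart_marketing AS
        SELECT region, amount
        FROM warehouse_sales
        WHERE department = 'marketing';
""")

mart_rows = conn.execute(
    "SELECT region, SUM(amount) FROM mart_marketing GROUP BY region ORDER BY region"
).fetchall()
print(mart_rows)   # [('EU', 100.0), ('US', 250.0)]
```

Because the view reads from the warehouse table, the marketing team always sees data that is consistent with every other mart built on the same warehouse, which is the main advantage of the dependent approach.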