Data Warehouse
- Before speaking about the data warehouse, let us discuss something generic so that beginners get a clear understanding and do not assume that a data warehouse is some new technology.
- So what is a warehouse? The answer is simple: it is a building where raw materials or manufactured goods are stored prior to their distribution for sale. In simple words, if you want to build a warehouse you have to construct a building. Now, what is the difference between the construction of your home and a warehouse? It is only a difference in structure and usage. If you construct a building with a hall, kitchen, bedroom, etc., it is meant for living; if the same construction is changed to one huge hall, it can be called a warehouse where raw materials or manufactured goods can be stored.
- Now I will put it plainly: a data warehouse is nothing but a database.
- There may be many databases built for different purposes; for example, company ABC may build a database to store data entered from its front end. Likewise, databases are built based on purpose and necessity.
- If an organization needs to "store historical business data where it can be used to analyze historical, current and predictive views of business operations", then we can call it a data warehouse.
Business Intelligence
- In order to make good business decisions, people at multiple levels of an organization may need historical, current, and predictive views of business operations in the form of data. To add value to business data, one should have interactive reporting, and such reports can be built on top of well-structured data.
- So Business Intelligence applications need data stored in a well-structured format. Data is collected from different sources (transactional data) and stored in a data mart/data warehouse, which is a relational database. This data mart/data warehouse is the source for an OLAP/multi-dimensional database (cubes). Finally, we can generate reports either on top of the data mart/data warehouse or on the OLAP/multi-dimensional database (cubes).
Data mart
- A data mart is a subset of the data warehouse (though not always).
- Let us say an organization has sales data which is maintained in its retail shop systems and its online systems.
- The data pulled from the retail shops can be called the "Retail Sales Data mart" and the data from the online systems the "Online Sales Data mart".
- The combination of the "Retail Sales Data mart" and the "Online Sales Data mart" gives the complete "Sales Data warehouse".
- In our example, the "Sales Data warehouse" is the superset formed from the "Retail Sales Data mart" and "Online Sales Data mart" subsets.
- In certain situations, if we have only retail sales, then the "Retail Sales Data mart" is both the data mart and the data warehouse, which justifies the statement "a data mart may or may not be a subset of a data warehouse."
OLTP
- OLTP stands for Online Transaction Processing.
- OLTP deals with processing of data from transaction systems.
- For example, an application that maintains IRCTC (rail) reservation data is an OLTP system. An OLTP system is designed mainly keeping in mind the performance of the end application. It comprises the application, the database, and the reporting system that works directly on this database. The database in an OLTP system is designed to improve application efficiency, thereby reducing the processing time of the application. The database of such an application would be designed mainly for faster inserts of the travel-related data, and also for faster retrieval of the reservation availability information.
OLAP
- OLAP stands for Online Analytical Processing.
- OLAP systems are mainly built using data in a warehouse. OLAP deals with older data, and it is necessary to store this historical data in a format that gives good performance for reporting. OLAP systems provide fast and efficient reporting.
Data Modelling
- Data modelling is the conceptual representation of the objects (tables) required for a database. It represents the nature of the data, the business rules, and the logical and physical design of the database.
- Data modelling is achieved by a top-down or a bottom-up approach.
- In the top-down approach, data models are created by understanding and analyzing the business requirements.
- In the bottom-up approach, the data model is derived from an existing database.
- Data modelling involves highly normalized tables since it deals with OLTP systems.
Dimensional Modelling
- Dimensional modelling is the conceptual representation of the dimension and fact tables required for a data warehouse.
- It is the process of splitting dimension data into dimension tables and fact data into fact tables.
- Dimensional modelling involves de-normalized tables since it deals with OLAP systems.
Schema
- Schema is the logical design of a database / data warehouse. Based on the relationship between the fact and dimension tables, a data warehouse schema falls under two categories (see the T-SQL sketch after the fact table section below).
- Star schema: if every dimension table is directly related to the fact table in a data warehouse, it is said to be a star schema.
- Snowflake schema: if some dimension tables are directly related and some are indirectly related to the fact table in a data warehouse, it is said to be a snowflake schema.
Dimension table
A dimension table is a table which contains textual descriptions (attributes) that give context to the facts stored in a fact table.
Fact table
A fact table is a table which contains numeric values (measures) that speak about the performance of the organization.
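For illustration only, here is a minimal T-SQL sketch of a dimension table and a fact table arranged as a star schema; the table and column names are made up for this example and are not from the post.

```sql
-- Hypothetical dimension table: textual descriptions of products
CREATE TABLE DimProduct (
    ProductKey      INT IDENTITY(1,1) PRIMARY KEY,
    ProductName     VARCHAR(100),
    ProductCategory VARCHAR(50)
);

-- Hypothetical fact table: numeric measures plus keys pointing to the dimensions
CREATE TABLE FactSales (
    SalesKey      INT IDENTITY(1,1) PRIMARY KEY,
    ProductKey    INT REFERENCES DimProduct(ProductKey),
    OrderDateKey  INT,            -- would reference a DimDate dimension table
    SalesQuantity INT,
    SalesAmount   DECIMAL(18,2)
);
```

Because every dimension here relates directly to the fact table, this is a star schema; if DimProduct were split further (say a separate DimProductCategory referenced by DimProduct rather than by the fact table), the design would become a snowflake schema.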
ETL
ETL stands for Extraction, Transformation and Loading. ETL
can extract and transform data from a wide variety of sources such as XML data
files, flat files, and relational data sources, and then load the data into one
or more destinations.
SSIS
SSIS stands for SQL Server Integration Services.
SQL Server Integration Services is a component of SQL Server which can be used to perform a wide range of
data migration and ETL operations.
It is a platform for integration and workflow applications, known for fast and flexible OLTP and OLAP
extensions used for data extraction, transformation and loading. The tool can also be used to automate
maintenance of SQL Server databases and multidimensional data sets.
Package
Package
is an organized collection of connections, control flow elements, data flow
elements, event handlers, variables, parameters, and configurations that you
assemble using the graphical design tools.
Connection Manager
A connection manager connects the package to the world outside of SSIS and points to different data sources.
It can be used to connect to a source of data or to a destination.
Integration Services includes a variety of connection managers for connecting to different data sources, such as relational databases,
Analysis Services databases, and files in CSV and XML formats.
Control Flow
Control flow manages the flow of tasks in a package. A package consists of one control flow and, optionally, one or more data flows. SSIS provides three different types of control flow elements:
containers that provide structure in packages, tasks that provide functionality,
and precedence constraints that connect the executables, containers, and tasks into an ordered control flow.
Precedence constraint
In SSIS, tasks are linked by precedence constraints. A task will only execute if the condition
set by the precedence constraint is met. By using these constraints, it is possible to
ensure different execution paths depending on the success or failure of other
tasks. This means that you can use tasks with precedence constraints to
determine the workflow of an SSIS package.
Data Flow
Data flow moves data from a source to a destination. SSIS provides three different
types of data flow components: sources, transformations, and destinations.
Sources extract data from data stores such as tables and views in relational databases, files, and
Analysis Services databases.
Transformations modify, summarize and clean data. Destinations load data into data stores.
Data flow sources
- ADO .NET / DataReader Source
The DataReader object in ADO.NET provides stream-based, forward-only, read-only retrieval of query results from a data source, and it does not update the data. A DataReader cannot be created directly from code; it can be created only by calling the ExecuteReader method of a command object.
- Excel Source
Use the Excel Source and the Excel Connection Manager when reading from workbooks created in
Excel 2003 and earlier. In Excel, a worksheet or range is the equivalent of a table
or view. The list of available tables in the Excel Source and Destination
editors displays the existing worksheets (identified by the $ sign appended to the
worksheet name, such as Sheet1$).
- Flat File Source
A Flat File source is a data flow component that uses metadata defined by a Flat File Connection
Manager to specify the format and structure of the data to be extracted from the flat file.
The flat file can be a plain text file or a binary file, and the source is configured to extract
data from a single flat file by using the file format definition provided by the Flat File
Connection Manager.
- OLE DB Source
The OLE DB source extracts data from a variety of OLE DB
relational databases by using a database table, a view, or an SQL Command.
For example the OLE DB source can
extract data from tables in SQL server databases.
- Raw File Source
The Raw File source reads raw data from a file. Because the representation of the data is native to the source, the data requires no translation and almost no parsing. This means that the Raw File source can read data more quickly than other sources such as the Flat File and OLE DB sources.
- XML Source
The XML Source reads an XML data file and populates the
columns in the source output with the data. The data in XML files frequently
includes hierarchical relationships. For example, an XML data file can
represent catalogs and items in catalogs. Before the data can enter the
dataflow, the relationships of the elements in XML data file must be
determined, and an output must be generated for each element in the file.
Data Transformations
- Aggregate
The Aggregate transformation is used to perform aggregate operations/functions. The available
aggregate operations are Group by, Sum, Average, Count, Count distinct, Minimum, and Maximum.
The Aggregate transformation has one input and one or more outputs.
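As a rough analogy, the Aggregate transformation does in the data flow what a GROUP BY query does in T-SQL. A minimal sketch, reusing the hypothetical tables from the star schema example above:

```sql
-- Group by + aggregate functions, similar to what the Aggregate transformation produces
SELECT
    p.ProductCategory,
    COUNT(*)                    AS OrderCount,
    COUNT(DISTINCT f.ProductKey) AS DistinctProducts,
    SUM(f.SalesAmount)          AS TotalSales,
    AVG(f.SalesAmount)          AS AvgSales,
    MIN(f.SalesAmount)          AS MinSale,
    MAX(f.SalesAmount)          AS MaxSale
FROM FactSales AS f
JOIN DimProduct AS p ON p.ProductKey = f.ProductKey
GROUP BY p.ProductCategory;
```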
- Audit
The Audit transformation helps users build a package that requires information about the environment
in which it runs. It is a simple component that adds the values of certain system variables as new
columns to the data flow. A single system variable can be added as many times as needed; the
available variables are Execution instance GUID, Package ID, Package name, Version ID, Execution
start time, Machine name, User name, Task name, and Task ID.
This component is mostly used for creating log entries or adding metadata to error traps.
- Character Map
The Character Map transformation is used to make changes to string-type columns only, for example to
change lower case to upper case or upper case to lower case.
The operations available in the Character Map transformation include Lowercase, Uppercase, Byte
reversal, Full width, Half width, etc.
- Conditional Split
The Conditional Split transformation splits data rows to different outputs depending on whatever criteria of the data you specify.
- Copy Column
The Copy Column transformation helps you create new columns by copying input
columns and adding the new columns to the transformation output. In simple terms, it creates
alias columns of the input columns. You can create any number of new columns by
copying the same input column.
- Data Conversion
The Data Conversion transformation is used to change the data type of a column. For example, you may
have a Unicode string format in the source database while the same column is defined as non-Unicode
in the destination. In this case we need to convert the data before it reaches the destination, and
we use the Data Conversion transformation.
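The same idea expressed in T-SQL is an explicit CAST before the data reaches the destination; this is only an analogy, and the table and column names are assumed for the example:

```sql
-- Convert a Unicode (NVARCHAR) source column to the non-Unicode (VARCHAR) type
-- expected by the destination
SELECT
    CustomerID,
    CAST(CustomerName AS VARCHAR(100)) AS CustomerName   -- NVARCHAR -> VARCHAR
FROM dbo.SourceCustomer;
```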
- Data Mining Query
The Data Mining Query transformation performs prediction queries against data mining models. This
transformation contains a query builder for creating Data Mining Extensions (DMX) queries. It uses
an Analysis Services connection manager to connect to the Analysis Services project or the instance
of Analysis Services that contains the mining structure and mining models. This transformation has
one input and one output; it does not support an error output.
- Derived Column
The Derived Column transformation is used to derive a new column at run time. Let's say we have
bonus and salary as two fields and we need to calculate the gross salary; we can do this using the
Derived Column transformation.
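In T-SQL the equivalent derived expression might look like the following (the SSIS expression language syntax differs slightly, and the table and column names are assumed):

```sql
-- Gross salary derived from two existing columns
SELECT
    EmployeeID,
    Salary,
    Bonus,
    Salary + ISNULL(Bonus, 0) AS GrossSalary   -- the derived column
FROM dbo.Employee;
```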
- Export Column
The Export column transformation reads data in a dataflow and
inserts the data in a file. For example, if the data flow contains product
information, such as a picture of each product, Use the export column
transformation to save the images to files.
- Fuzzy Grouping
Fuzzy
Grouping transformation helps you group rows of data together based on the
similarity of the values in a specified column. It is used as a data cleaning
component, which identifies rows that are possible duplicates and then selects
a specific row of data to use in standardizing the data. This is useful for
performing aggregations on this data after fuzzy grouping has been done.
- Fuzzy Lookup
The Fuzzy Lookup transformation performs data cleaning tasks such as standardizing data,
correcting data, and providing missing values.
It uses fuzzy matching to return one or more close matches from the reference table. Fuzzy lookups
are commonly used to standardize addresses and names.
- Import Column
The Import Column transformation is used to import data from a file into the data flow, perform some
manipulations, and then forward the data to the destination; here the data can be a binary file, an
image, media, or any sort of document. If we want to move a huge file from one location to another
using the package, we use the Import Column transformation.
- Lookup
- The Lookup transformation acts like an equi join in SQL Server. It performs a lookup by joining data in the input columns with columns in a reference dataset.
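In T-SQL terms, a Lookup behaves roughly like the equi join below against a reference table; the table and column names are illustrative only:

```sql
-- Input rows joined to a reference dataset on equal keys,
-- similar to what the Lookup transformation does row by row
SELECT
    s.OrderID,
    s.ProductCode,
    r.ProductKey,          -- value returned from the reference dataset
    r.ProductName
FROM dbo.StagingOrders AS s
INNER JOIN dbo.RefProduct AS r
    ON r.ProductCode = s.ProductCode;   -- equi-join condition
```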
- Merge
The Merge transformation merges two sorted datasets and outputs a sorted dataset. It accepts at most
two inputs, and the inputs must be sorted. The Merge transformation is similar to the Union All
transformation; use the Union All transformation instead of Merge when the transformation has more
than two inputs or when the inputs are not sorted.
- Merge Join
The Merge Join transformation provides an output that is generated by joining two sorted datasets
using a FULL, LEFT, or INNER join. This is similar to a database join operation. This transformation
has two inputs and one output.
- Multicast
The Multicast transformation distributes its input to multiple outputs; it basically creates copies
of an input dataset. This transformation has one input and multiple outputs.
- OLE DB Command
The OLE DB Command transformation is often used with the Slowly Changing Dimension (SCD)
transformation, which is designed so that it performs insert operations for new rows and update
operations through the OLE DB Command transformation.
- Percentage Sampling
The Percentage Sampling transformation is used to split the dataset into separate outputs based on a
percentage. Let's say I have 2000 records in my source table and I connect the source to the
Percentage Sampling transformation with the percentage set to 20; then 20% of the total records will
flow out of the transformation.
- Pivot
The Pivot transformation in SSIS is used to convert rows to columns, to visualize the data in a different way.
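A comparable T-SQL PIVOT, turning year rows into year columns; the table, columns, and values are assumed for this sketch:

```sql
-- Rows per (Product, Year) turned into one row per product with a column per year
SELECT ProductName, [2001], [2002], [2003], [2004]
FROM (
    SELECT ProductName, SalesYear, SalesAmount
    FROM dbo.YearlySales
) AS src
PIVOT (
    SUM(SalesAmount) FOR SalesYear IN ([2001], [2002], [2003], [2004])
) AS pvt;
```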
- Row Count
The Row Count transformation is used to track how many records are loaded.
Ex: We receive a flat file that we need to load into our SQL Server table. Besides loading the data
from the flat file into our destination table, we also want to track how many records were loaded
from the flat file.
- Row Sampling
The Row Sampling transformation limits the records by passing an integer value. Row Sampling is
similar to Percentage Sampling; Percentage Sampling limits the records by passing a percentage.
Ex: I have 1000 records in my source; if we use the Row Sampling transformation with 10 as the
value, then the output is 10 records.
Ex: In the same example, if we use the Percentage Sampling transformation with 10 as the value, then
10% of 1000, i.e. 100 records, will be the output.
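In T-SQL terms the difference is similar to TOP n versus TOP n PERCENT; the sampling transformations pick rows randomly, so this is only an analogy, and the table name is assumed:

```sql
-- Fixed number of rows (like Row Sampling with value 10)
SELECT TOP (10) * FROM dbo.SourceTable;

-- Percentage of rows (like Percentage Sampling with value 10)
SELECT TOP (10) PERCENT * FROM dbo.SourceTable;
```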
- Script Component
The Script Component executes a script for each row; it applies a custom transformation to each data
row it reads. The Script Component works in the Data Flow task, and breakpoints are not supported
when using the Script Component.
- Slowly Changing Dimension
The Slowly Changing Dimension transformation in the data flow task of SSIS is used for logical
groupings of data such as geographical location, customer or product information. With slowly
changing dimensions, data changes slowly rather than on a time-based, regular schedule.
There are different types of slowly changing dimensions:
Type 1: In a Type 1 SCD, new information overwrites the original information. No history is kept.
SupplierID | SupplierName | Address
S001 | ABC Company | USA
ABC Company moved from USA to UK, so the new information replaces the old record:
SupplierID | SupplierName | Address
S001 | ABC Company | UK
In Type 1 it is not possible to track history.
Type 2: In a Type 2 SCD, a new record is added to the table to represent the new information. Both
the original and the new record will be present.
SupplierID | SupplierName | Address
S001 | ABC Company | USA
S001 | ABC Company | UK
Type 2 keeps the history information.
Type 3: In a Type 3 SCD, there are two columns to indicate the original value and the current value,
and another column that indicates when the current value became active.
SupplierID | SupplierName | OriginalAddress | CurrentAddress | EffectiveDate
S001 | ABC Company | USA | UK | 10-Oct-2012
Type 3 is rarely used and allows keeping only some part of the history.
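As a hedged sketch of the Type 2 behaviour that the SCD transformation automates, the T-SQL below expires the current row and inserts the new version; the table, the IsCurrent flag, and the date columns are assumptions for this example:

```sql
-- Type 2 SCD: close the existing record, then add the new version of the row
UPDATE dbo.DimSupplier
SET    IsCurrent = 0,
       EndDate   = GETDATE()
WHERE  SupplierID = 'S001'
  AND  IsCurrent = 1;

INSERT INTO dbo.DimSupplier (SupplierID, SupplierName, Address, StartDate, EndDate, IsCurrent)
VALUES ('S001', 'ABC Company', 'UK', GETDATE(), NULL, 1);
```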
- Sort
The Sort transformation is used to sort the data in either ascending or descending order. It is
mainly used before the Merge or Merge Join transformations. The Sort transformation is an
asynchronous, blocking transformation.
- Term Extraction
The Term Extraction transformation extracts terms from the text in a transformation input column,
and then writes the terms to a transformation output column. The transformation works only with
English text. It analyzes free-flowing text in your data source and enables you to pull out the
nouns and noun phrases found in it, along with the frequencies with which they occur.
- Term Lookup
Term Lookup is one of the powerful transformations in SSIS. It basically helps to analyze a set of
text records: the Term Lookup transformation looks for a set of user-predefined nouns or noun
phrases in the transformation input text column and provides the count of occurrences as an output,
on a row-by-row basis.
- Union All
The Union All transformation can accept two or more datasets as input. This transformation does not require sorted datasets.
- Unpivot
The Unpivot transformation transforms values in multiple columns into a single column.
Ex: Sales information for each year is stored in year columns, i.e. before the unpivot, the
year-wise sales appear as columns (2001, 2002, 2003, 2004); after applying the Unpivot
transformation, the year-wise columns are transformed into a single column (Year).
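The equivalent T-SQL UNPIVOT for the example above, with an assumed table and columns:

```sql
-- Year columns (2001..2004) folded into a single (SalesYear, SalesAmount) pair per row
SELECT ProductName, SalesYear, SalesAmount
FROM dbo.SalesByYear
UNPIVOT (
    SalesAmount FOR SalesYear IN ([2001], [2002], [2003], [2004])
) AS unpvt;
```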
Data flow destinations
- Data Mining Model Training
The Data Mining Model Training destination trains data mining models by passing the data that the
destination receives through the data mining model algorithms. Multiple data mining models can be
trained by one destination if the models are built on the same data mining structure.
- DataReader Destination
The DataReader Destination exposes the data in a data flow by
using the ADO.NET DataReader interface. The data can then be consumed by other
applications. For example, you can configure the data source of a Reporting
services report to use the result of
running a Microsoft SQL Server Integration Services package. To do this,
you create a data flow that implements the DataReader destination.
- Dimension Processing
The Dimension Processing destination loads and processes an
SQL Server Analysis Services dimension.
- Excel Destination
The Excel Destination is a data flow destination in SSIS. You can send data from various sources,
i.e. a flat file or an OLE DB source, into Excel by using the Excel destination.
- Flat File Destination
The Flat File destination is a data flow destination from the SSIS toolbox, and it writes data to a
text file through a Flat File connection manager. When you configure the Flat File connection
manager, you specify information about the file and about each column in the file: the characters
that delimit columns and rows, and the data type and length of each column.
- OLE DB Destination
The OLE DB Destination is used to load data coming from a variety of sources such as flat files,
Excel, and OLE DB sources. This destination uses an OLE DB connection manager to connect to a data
source, and the connection manager specifies the OLE DB provider to use.
The OLE DB Destination provides five different data access modes for loading data:
A table or view: you can specify an existing table or view, or create a new table.
A table or view using fast-load options: you can specify an existing table or create a new table.
A table or view specified in a variable.
A table or view specified in a variable using fast-load options.
The result of a SQL statement.
- Partition Processing
- The Partition Processing destination loads and processes an SQL Server Analysis Services partition.
- Raw File Destination
The Raw File Destination writes raw data to a file. Because
the format of the data is native to the destination, the data requires no
translation and little parsing. This
means that the Raw file destination can write data more quickly than other
destinations such as the Flat File and the OLE DB destinations.
- Recordset Destination
The Recordset Destination creates and populates an in-memory ADO recordset. The shape of the
recordset is defined by the input to the Recordset destination at design time.
- SQL Server Destination
The SQL Server Destination connects to a local SQL Server database and bulk loads data into SQL
Server tables and views. Compared to the OLE DB Destination, the SQL Server Destination is quicker
at loading data from source to destination. You cannot use the SQL Server destination in packages
that access a SQL Server database on a remote server.
- SQL Server Mobile Destination
The SQL Server Mobile destination writes data to SQL Server Compact (Mobile) databases.
Variable
Variables can store values and share them with one or more other tasks in the package. Variables
can be used to populate parameter values for T-SQL statements at run time; for example, a package
can run an Execute SQL task and then use variables to dynamically set the parameters of a T-SQL
statement.
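For example, an Execute SQL task using an OLE DB connection maps package variables to the positional "?" parameter markers of a statement like the one below; the table and the parameter mapping are assumptions for this sketch:

```sql
-- Parameters are supplied at run time from package variables
-- (with an OLE DB connection the markers are positional '?';
--  with an ADO.NET connection they would be named, e.g. @LastLoadDate)
UPDATE dbo.LoadControl
SET    LastLoadDate = ?
WHERE  PackageName  = ?;
```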
Connection Manager
Connection managers play a vital role in SSIS packages. SSIS packages typically interact with a
variety of data sources, and there is a connection managers area on the design surface where you can
specify each data source that you will access.
Ex: My source is Excel and I have to load data from Excel to a SQL Server table, so I require two
connection managers: one for the Excel spreadsheet and another for the database.
Event Handler
As the name says, based on a certain event we would like to take some action (handle the event). For
example, we may send an email in case of a success or failure in the package.
SSIS provides a number of event handlers:
1) OnError 2) OnInformation 3) OnPostExecute 4) OnPostValidate 5) OnPreExecute.
Logging
Logging helps us log the events that happen in the package. For example, we may need to know which
tasks executed successfully or failed, what the error messages were, etc. We have various options
for saving SSIS logging: we can save the logs to the Windows Event Log, a text file, an XML file, a
SQL Server table, or SQL Profiler. Choose the one you feel is right for your need. If you do not
need to log the entire package, you can select only the executables for which you want to enable
logging.
SSAS
SQL Server Analysis Services (SSAS) is the technology from the Microsoft BI stack used to develop
Online Analytical Processing (OLAP) solutions. In simple terms, you can use SSAS to create cubes
using data from data marts/data warehouses for deeper and faster analysis.
Data source
A data source contains the connection information: the provider name (e.g. Native OLE DB\SQL Server
Native Client 10.0), server name, authentication, database, and impersonation information (e.g. use
the service account) needed to get the data from the data mart/data warehouse into Analysis
Services.
Data source view
A data source view contains a persistent set of tables, i.e. the required dimension and measure
group tables used to create a cube. A data source view is a single, unified view of the metadata
from the specified tables and views that the data source defines in the project. It provides
several features: named queries, named calculations, defining logical primary keys, etc.
Named Query
A named query in the DSV is used to include only the required fields from multiple tables. It is
very similar to a view in the database. There may be a scenario where only a few fields are required
out of the many fields spread across multiple tables; for example, only three fields are important
for product sales analysis, product name, product subcategory and product category, but these are
spread across three tables: Product, ProductSubcategory and ProductCategory. Here we make use of a
named query instead of referring to these tables directly in the DSV, which also improves
performance.
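A named query for the scenario above might be a simple join that returns only the three needed columns; the AdventureWorksDW-style table and column names below are assumed for illustration:

```sql
-- Named query in the DSV: only the fields needed for product sales analysis
SELECT
    p.EnglishProductName             AS ProductName,
    ps.EnglishProductSubcategoryName AS ProductSubcategory,
    pc.EnglishProductCategoryName    AS ProductCategory
FROM dbo.DimProduct AS p
JOIN dbo.DimProductSubcategory AS ps
    ON ps.ProductSubcategoryKey = p.ProductSubcategoryKey
JOIN dbo.DimProductCategory AS pc
    ON pc.ProductCategoryKey = ps.ProductCategoryKey;
```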
Named Calculation
A named calculation is a SQL expression represented as a calculated column. This expression behaves
like a column in the table. A named calculation lets you extend the relational schema of existing
tables or views in a data source view without modifying the tables or views in the underlying data
source. You cannot add a named calculation to a named query.
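For example, the expression part of the query below (not the SELECT itself) is the kind of SQL expression you would enter as a named calculation; the table and column names are assumed:

```sql
-- A FullName column added to a customer table in the DSV as a named calculation
SELECT
    CustomerKey,
    FirstName,
    LastName,
    FirstName + ' ' + LastName AS FullName   -- the named calculation expression
FROM dbo.DimCustomer;
```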
Dimension
Dimensions are business objects or entities which contain a list of attributes that describe the
object (the dimension itself), and they are usually derived from some code tables. These attributes
appear as attribute hierarchies and can be organized into user-defined hierarchies, or can be
defined as parent-child hierarchies based on columns in the underlying dimension table.
Attribute
The columns in the dimension table are represented as attributes. Using these attributes we create
attribute hierarchies; by default, every attribute in a dimension has an attribute hierarchy.
Hierarchy
A hierarchy is a very important part of any OLAP engine and allows users to drill down from summary
levels to detail levels. There are different types of user-defined hierarchies that can be created
based on the relationship of the attribute members with each other.
Natural Hierarchy
When the levels in a hierarchy are linked by a natural relationship such as one-to-one or
many-to-one, such hierarchies are known as natural hierarchies. For example, in a calendar
hierarchy, the day level is related to the month level, the month level to the quarter level, and so
on. Natural hierarchies are also known as balanced hierarchies.
UnNatural Hierarchy
In some types of hierarchies the natural relationship between the hierarchy levels is lacking and
the branches of the hierarchy descend to different levels. Such hierarchies are known as unnatural
or unbalanced hierarchies. For example, a Marital Status > Gender hierarchy is an unnatural
hierarchy, as the marital status and gender attributes do not have a relationship to each other.
Ragged Hierarchy
Ragged hierarchies are unbalanced hierarchies in which the logical parent member of at least one
member is not in the level immediately above that member. In this case the hierarchy descends to
different levels for different drill-down paths. Take the example of a geography hierarchy with the
levels Country > State/Province > City: some countries, such as Greece, do not have provinces, thus
creating a ragged hierarchy.
Attribute relationship
Attribute relationships define how the attributes in a dimension are organized, based on the
relationships defined between the attributes in the dimension table.
A dimension model is based on either a star schema or a snowflake schema. In a star schema, all the
dimension attributes belong to the same relational table, and hence an attribute relationship is
automatically defined between the key attribute and each non-key attribute of the dimension. In a
snowflake schema, the dimension attributes are derived from more than one related table; for
example, a Product dimension may have attributes derived from three dimension tables:
DimProductCategory, DimProductSubcategory and DimProduct.
Cube
The basic unit of storage and analysis in Analysis Services is the cube. A cube is a collection of
data that has been aggregated to allow queries to return data quickly. Cubes are organized into
dimensions and measures. The data for a cube comes from a set of data warehouse tables: the
dimensions in the cube come from dimension tables in the data warehouse database, while the measures
come from fact tables.
Measure group
A Measure group is a
collection of measures which belong to the same underlying fact table.
Measure
A measure is any numeric value that represents a metric aligned to an organization's goals. This is
the value which business users are interested in viewing from different angles and at different
granularity levels. A measure is also commonly called a fact.
Calculated measure
A calculation is nothing but an MDX expression or script which is used to define a calculated member
in a cube. Calculated members can be defined for dimensions or measures; the members defined on the
Measures dimension are called calculated measures. Calculated members are defined using the data
available in the cube and can be made complex by combining data with arithmetic operators, numbers
and functions.
Named set
A named set is a reusable set of dimension members, or a set expression defined using
Multidimensional Expressions (MDX). You define named sets by combining cube data, arithmetic
operators, numbers and functions. For example, you can define a named set that returns the top ten
products based on the highest values of the Sales measure.
Dimension usage
Dimension usage is used to
define the relationships between dimensions and measure groups, and the
granularity of each dimension within each measure group. If you use multiple
fact tables, you might have to identify whether measures do not apply to one or
more dimensions.
Calculations
Use this tab to examine calculations
that are defined for the cube, to define new calculations for the whole cube or
for a subcube, to reorder existing calculations, and to debug calculations step
by step by using breakpoints. Calculations let you define new members and
measures based on existing values, such as a profit calculation, and to define
named sets.
KPIs
Use this tab to create, edit,
and modify the Key Performance Indicators (KPIs) in a cube. KPIs enable the
designer to quickly determine useful information about a value, such as whether
the defined value exceeds a goal or falls short of the goal, or whether the
trend for the defined value is getting better or worse.
Actions
Use this tab to create or modify
drillthrough, reporting, and other actions for the selected cube. Actions
provide to client applications context-sensitive information, commands, and
reports that end users can access.
Partitions
Partitions let you store
sections of a cube in different locations with different properties, such as
aggregation definitions.
Perspectives
Use this tab to create and
manage the perspectives in a cube. A perspective is a defined subset of a cube,
and is used to reduce the perceived complexity of a cube to the business user.
Translations
Use this tab to create and
manage translated names for cube objects, such as month or product names.
Happy Business Intelligence:-)
సునీల్ ఆనిపాకుల / Sunil Anipakula.