Wednesday, August 27, 2014

Basics & Components Of Microsoft Business Intelligence


Data Warehouse

  • Before talking about the data warehouse itself, let us start with something generic so that beginners get a clear picture and do not assume that a data warehouse is some brand-new technology. 
  • So what is a warehouse? The answer is simple: it is a building where raw materials or manufactured goods are stored prior to their distribution for sale. In simple words, if you want to build a warehouse you have to construct a building. Now, what is the difference between the construction of your home and of a warehouse? Only the structure and the usage differ. If you construct a building with a hall, kitchen, bedrooms, etc., it is meant for living; if the same construction changes to one huge hall, it may be called a warehouse where raw materials or manufactured goods can be stored. 
  • Keep one thing in mind here: a data warehouse is nothing but a database.
  • Many databases are built for different purposes. For example, the ABC company may build a database to store the data entered from its front-end application. Likewise, every database is built based on a purpose and a necessity. 
  • If an organization needs to "store historical business data where it can be used to analyze historical, current and predictive views of business operations", then we can call that database a data warehouse.

Business Intelligence
  • In order to make good business decisions, people at multiple levels of an organization need historical, current, and predictive views of business operations in the form of data. To add value to business data you need interactive reporting, and such reports can only be built on top of well-structured data.
  • So Business Intelligence applications need data that is stored in a well-structured format. Data is collected from different sources (transactional data) and stored in a data mart/data warehouse, which is a relational database. This data mart/data warehouse becomes the source for the OLAP/multi-dimensional database (cubes). Finally, we can generate reports either on top of the data mart/data warehouse or on top of the OLAP/multi-dimensional database (cubes).
Data mart
  • A data mart is a subset of a data warehouse (but not always).
  • Let us say an organization has sales data that is maintained in its retail shop systems and in its online systems. 
  • The data pulled from the retail shops can be called the "Retail Sales Data mart" and the data from the online systems the "Online Sales Data mart".
  • Now the combination of "Retail Sales Data mart" + "Online Sales Data mart" gives the complete "Sales Data warehouse". 
  • In the above example the "Sales Data warehouse" is the superset formed from the "Retail Sales Data mart" and "Online Sales Data mart" subsets. 
  • In certain situations, if we have only retail sales, then the "Retail Sales Data mart" is both the data mart and the data warehouse, which justifies the definition "a data mart may or may not be a subset of a data warehouse."
OLTP
  • OLTP stands for Online Transaction Processing. 
  • OLTP deals with processing of data from transaction systems. 
  • For example, an application that maintains IRCTC (rail) reservation data is an OLTP system. An OLTP system is designed mainly with the performance of the end application in mind. It comprises the application, the database, and the reporting system that works directly on this database. The database in an OLTP system is designed to improve the efficiency of the application and thereby reduce its processing time. The database of such an application would be designed mainly for fast inserts of the travel-related data, and also for fast retrieval of the reservation-availability information.

OLAP
  • OLAP stands for Online Analytical Processing.
  • OLAP systems are mainly developed using data in a warehouse. They deal with older data, and that historical data needs to be stored in a format that gives good performance for reporting. OLAP systems therefore provide fast and efficient reporting.
Data Modelling
  • Data modelling is the conceptual representation of the objects (tables) required for a database. It captures the nature of the data, the business rules, and the logical and physical design of the database. 
  • Data modelling is done with either a top-down or a bottom-up approach. 
  • In the top-down approach, the data model is created by understanding and analyzing the business requirements. 
  • In the bottom-up approach, the data model is derived from an existing database. 
  • Data modelling involves highly normalized tables since it deals with OLTP systems.
Dimensional Modelling
  • Dimensional modelling is the conceptual representation of the dimension and fact tables required for a data warehouse.
  • It is the process of splitting dimension data into dimension tables and fact data into fact tables.
  • Dimensional modelling involves de-normalized tables since it deals with OLAP systems.
Schema
  • A schema is the logical design of a database / data warehouse. Based on the relationship between the fact and dimension tables, a data warehouse schema falls under two categories.
  • Star schema: if every dimension table is directly related to the fact table, the data warehouse is said to use a star schema.
  • Snowflake schema: if some dimension tables are directly related and others are indirectly related to the fact table, the data warehouse is said to use a snowflake schema (see the sketch below).
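
The difference is easiest to see in table definitions. Below is a minimal T-SQL sketch (all table and column names are made up for illustration): in the star layout every dimension joins straight to the fact table, while in the snowflake layout the product dimension is normalized into a second table that reaches the fact table only through DimProduct.

    -- Star schema: FactSales references each dimension directly.
    CREATE TABLE DimProduct (
        ProductKey      INT PRIMARY KEY,
        ProductName     VARCHAR(100),
        SubcategoryName VARCHAR(100),   -- de-normalized into the same table
        CategoryName    VARCHAR(100)
    );

    CREATE TABLE DimDate (
        DateKey      INT PRIMARY KEY,
        CalendarDate DATE,
        MonthName    VARCHAR(20),
        CalendarYear INT
    );

    CREATE TABLE FactSales (
        DateKey     INT REFERENCES DimDate(DateKey),
        ProductKey  INT REFERENCES DimProduct(ProductKey),
        SalesAmount DECIMAL(18, 2),
        OrderQty    INT
    );

    -- Snowflake schema: the product dimension is normalized, so
    -- DimProductSubcategory relates to the fact table only indirectly.
    CREATE TABLE DimProductSubcategory (
        SubcategoryKey  INT PRIMARY KEY,
        SubcategoryName VARCHAR(100),
        CategoryName    VARCHAR(100)
    );

    CREATE TABLE DimProductSnowflake (
        ProductKey     INT PRIMARY KEY,
        ProductName    VARCHAR(100),
        SubcategoryKey INT REFERENCES DimProductSubcategory(SubcategoryKey)
    );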
Dimension table
A dimension table is a table that contains the textual descriptions (attributes) of the business entities whose facts are stored in a fact table.

Fact table
A fact table is a table that contains the numeric values (measures) which describe the performance of the organization, along with keys to the related dimension tables.
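
Putting the two together, a typical warehouse query pulls the descriptive columns from the dimension tables and aggregates the numeric columns from the fact table. A small example against the hypothetical tables sketched above:

    -- Textual context comes from the dimension tables, numbers from the fact table.
    SELECT  d.CalendarYear,
            p.CategoryName,
            SUM(f.SalesAmount) AS TotalSales,
            SUM(f.OrderQty)    AS TotalQuantity
    FROM FactSales AS f
    JOIN DimDate    AS d ON d.DateKey    = f.DateKey
    JOIN DimProduct AS p ON p.ProductKey = f.ProductKey
    GROUP BY d.CalendarYear, p.CategoryName;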

ETL
          ETL stands for Extraction, Transformation and Loading. ETL can extract and transform data from a wide variety of sources such as XML data files, flat files, and relational data sources, and then load the data into one or more destinations.
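
SSIS performs these steps graphically, but conceptually each load boils down to something like the following T-SQL sketch, which moves rows from a hypothetical StagingSales table into the warehouse fact table while cleaning and converting them:

    -- Extract has already landed raw rows in StagingSales (hypothetical table).
    -- Transform: trim text, convert types, derive the date key.
    -- Load: insert the cleaned rows into the warehouse fact table.
    INSERT INTO FactSales (DateKey, ProductKey, SalesAmount, OrderQty)
    SELECT  CONVERT(INT, CONVERT(CHAR(8), s.OrderDate, 112)) AS DateKey,
            p.ProductKey,
            CAST(s.Amount AS DECIMAL(18, 2)),
            s.Quantity
    FROM StagingSales AS s
    JOIN DimProduct   AS p ON p.ProductName = LTRIM(RTRIM(s.ProductName));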

SSIS
          SSIS stands for SQL Server Integration Services.
SQL Server Integration Services is a component of SQL Server which can be used to perform a wide range of data migration and ETL operations.
It is a platform for integration and workflow applications, known as a fast and flexible tool for data Extraction, Transformation and Loading between OLTP and OLAP systems. The tool can also be used to automate the maintenance of SQL Server databases and multidimensional data sets. 

Package
Package is an organized collection of connections, control flow elements, data flow elements, event handlers, variables, parameters, and configurations that you assemble using the graphical design tools.

Connection Manager
A connection manager connects the package to the world outside SSIS by pointing to different data sources.
It can be used to connect to a source of data or to a destination.
Integration Services includes a variety of connection managers for connecting to different data sources, such as relational databases, Analysis Services databases, and files in CSV and XML formats.

Control Flow
   Control flow manages the flow of tasks in a package. A package has a single control flow and may contain one or more data flows. SSIS provides three different types of control flow elements:
containers that provide structure in packages, tasks that provide functionality,
and precedence constraints that connect the executables into an ordered workflow.

Precedence constraint
  In SSIS, tasks are linked by precedence constraints. A task will only execute if the condition that is set by the precedence constraint is met. By using these constraints, it is possible to follow different execution paths depending on the success or failure of other tasks. This means that you can use tasks with precedence constraints to determine the workflow of an SSIS package.

Data Flow
 A data flow most frequently moves data from a source to a destination. SSIS provides three different types of data flow components: sources, transformations, and destinations.
Sources extract data from data stores such as tables and views in relational databases, files, and Analysis Services databases.
Transformations modify, summarize and clean data. Destinations load data into data stores.

Data flow sources

  • ADO .NET / DataReader Source
The DataReader object in ADO.NET provides stream-based, forward-only, read-only retrieval of query results from the data source; it does not update the data. A DataReader cannot be created directly from code; it can be created only by calling the ExecuteReader method of a command object.
  • Excel Source
Use the Excel Source and the Excel connection manager when reading from workbooks created in Excel 2003 and earlier. In Excel, a worksheet or range is the equivalent of a table or view. The list of available tables in the Excel Source and Destination editors displays the existing worksheets (identified by the $ sign appended to the worksheet name, such as Sheet1$).
  • Flat File Source
                      A Flat File source is a data flow component that uses metadata defined by a Flat File connection manager to specify the format and structure of the data to be extracted from the flat file. A flat file can be a plain text file or a binary file; the source is configured to extract data from a single flat file by using the file format definition provided by the Flat File connection manager.
  • OLE DB Source
The OLE DB source extracts data from a variety of OLE DB relational databases by using a database table, a view, or a SQL command. For example, the OLE DB source can extract data from tables in SQL Server databases.
  • Raw File Source
The Raw File source reads raw data from a file. Because the representation of the data is native to the source, the data requires no translation and almost no parsing. This means that the Raw File source can read data more quickly than other sources such as the Flat File and OLE DB sources.
  • XML Source
The XML source reads an XML data file and populates the columns in the source output with the data. The data in XML files frequently includes hierarchical relationships; for example, an XML data file can represent catalogs and items in catalogs. Before the data can enter the data flow, the relationships of the elements in the XML data file must be determined, and an output must be generated for each element in the file.
Data Transformations
  • Aggregate
The Aggregate transformation is used to perform aggregate operations/ functions. The available aggregate functions are Group by, Sum, Average, Count, Count distinct, Minimum, Maximum. The aggregate transformation has one input and one or more outputs.
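Configured with a Group by on a category column and Sum/Count operations, the transformation produces roughly what this T-SQL would (hypothetical table and column names):

    SELECT  CategoryName,
            SUM(SalesAmount)           AS SumSalesAmount,
            COUNT(*)                   AS RowCnt,
            COUNT(DISTINCT CustomerID) AS DistinctCustomers
    FROM SalesInput
    GROUP BY CategoryName;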
  • Audit
The Audit transformation helps you build a package that needs information about the environment in which it runs. It is a simple component that adds the values of certain system variables as new columns to the data flow. A single system variable can be added any number of times; the available variables are Execution instance GUID, Package ID, Package name, Version ID, Execution start time, Machine name, User name, Task name and Task ID.
   This component is mostly used for creating log entries or adding metadata to error rows. 
  • Character Map
The Character Map transformation is used to make changes to string-type columns only, for example to change lower case to upper case and upper case to lower case.
 The list of operations available in the Character Map transformation includes Lowercase, Uppercase, Byte reversal, Full width, Half width, etc.  
  • Conditional Split
The Conditional Split transformation routes data rows to different outputs depending on whatever criteria of the data you specify.
  • Copy Column
The Copy Column transformation helps you create new columns by copying input columns and adding the new columns to the transformation output. In simple terms, it creates alias columns of the input columns. You can create any number of new columns by copying the same input column.
  • Data Conversion
The Data Conversion transformation is used to change the data type of a column. For example, a column has a Unicode string format in the source database and the same column is defined as non-Unicode in the destination; in this case we need to convert the data before it reaches the destination, so we use the Data Conversion transformation.
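
The same idea expressed in T-SQL, with hypothetical column names: a Unicode (NVARCHAR) column is cast to a non-Unicode (VARCHAR) column before loading:

    SELECT  CustomerID,
            CAST(CustomerName AS VARCHAR(100))   AS CustomerName,  -- DT_WSTR -> DT_STR
            CONVERT(DECIMAL(18, 2), SalesAmount) AS SalesAmount    -- string -> decimal
    FROM SourceCustomers;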

  • Data Mining Query
The Data Mining Query transformation performs prediction queries against data mining models. This transformation contains a query builder for creating Data Mining Extensions (DMX) queries. It uses an Analysis Services connection manager to connect to the Analysis Services project or the instance of Analysis Services that contains the mining structure and mining models. This transformation has one input and one output; it does not support an error output.
  • Derived Column
The Derived Column transformation is used to derive a new column at run time. Let's say we have Bonus and Salary as two input fields and we need to calculate the gross salary; we can do that with a Derived Column transformation (see the sketch below).
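Inside the transformation the derivation is written in the SSIS expression language (for example Salary + Bonus); the equivalent in T-SQL, with hypothetical table and column names, would be:

    -- Deriving GrossSalary from two existing columns.
    SELECT  EmployeeID,
            Salary,
            Bonus,
            Salary + Bonus AS GrossSalary
    FROM Payroll;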
  • Export Column
The Export Column transformation reads data in a data flow and inserts the data into a file. For example, if the data flow contains product information, such as a picture of each product, you can use the Export Column transformation to save the images to files.
  • Fuzzy Grouping
Fuzzy Grouping transformation helps you group rows of data together based on the similarity of the values in a specified column. It is used as a data cleaning component, which identifies rows that are possible duplicates and then selects a specific row of data to use in standardizing the data. This is useful for performing aggregations on this data after fuzzy grouping has been done.
  • Fuzzy Lookup
The Fuzzy Lookup transformation performs data-cleaning tasks such as standardizing data, correcting data, and providing missing values.
It uses fuzzy matching to return one or more close matches from a reference table. Fuzzy lookups are commonly used to standardize addresses and names.

  • Import Column
The Import Column transformation is used to import data from a file into the data flow, do some manipulation, and then forward the data to the destination; here the data can be a binary file, an image, a media file, or any sort of document. If we want to move a large file from one location to another using the package, we use the Import Column transformation.
  • Lookup
The Lookup transformation acts like an equi-join in SQL Server: it performs the lookup by joining data in input columns with columns in a reference dataset (see the T-SQL sketch below).
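A rough T-SQL picture of what the Lookup does, assuming a hypothetical staging table matched against a reference table on a shared key:

    -- Each input row is matched to the reference table on a key,
    -- and the reference columns are added to the row.
    SELECT  s.OrderID,
            s.ProductCode,
            r.ProductKey,        -- column returned by the lookup
            r.StandardCost
    FROM StagingOrders AS s
    JOIN DimProduct    AS r ON r.ProductCode = s.ProductCode;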

  • Merge
The Merge transformation merges two sorted datasets and outputs a single sorted dataset. It accepts at most two inputs, and the inputs must be sorted. The Merge transformation is similar to the Union All transformation; use Union All instead of Merge in some situations:
when the transformation needs more than two inputs, or when the inputs are not sorted, go for the Union All transformation.
  • Merge Join
The Merge Join transformation provides an output that is generated by joining two sorted datasets using a FULL, LEFT, or INNER join. This is similar to a database join operation. This transformation has two inputs and one output.
  • Multicast
The Multicast transformation distributes its input to multiple outputs; it basically creates copies of an input dataset. This transformation has one input and multiple outputs.
  • OLE DB Command
The OLE DB Command transformation runs a SQL statement for each row in the data flow. It is often used with the SCD (Slowly Changing Dimension) wizard, which is designed so that insert operations are performed for new rows while update operations are performed through an OLE DB Command transformation.
  • Percentage Sampling
The Percentage Sampling transformation is used to split the dataset into separate outputs based on a percentage. Let's say I have 2,000 records in my source table and I connect the source to a Percentage Sampling transformation with the percentage set to 20; then 20% of the total records will flow out of the sampled output.
  • Pivot
The Pivot transformation in SSIS is used to convert rows to columns so that the data can be visualized in a different way.
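A rough T-SQL equivalent, turning one row per (year, amount) pair into one column per year (hypothetical table and values):

    SELECT  ProductName, [2001], [2002], [2003], [2004]
    FROM (
        SELECT ProductName, SalesYear, SalesAmount
        FROM YearlySales
    ) AS src
    PIVOT (
        SUM(SalesAmount) FOR SalesYear IN ([2001], [2002], [2003], [2004])
    ) AS p;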
  • Row Count
The Row Count transformation is used to track how many records are loaded, storing the count in a variable.
Ex: We receive a flat file that we need to load into our SQL Server table. Besides loading the data from the flat file into the destination table, we also want to track how many records were loaded from the flat file.
  • Row Sampling
The Row Sampling transformation limits the records by passing an integer value. Row Sampling is similar to Percentage Sampling;
Percentage Sampling limits the records by passing a percentage.
Ex: I have 1,000 records in my source and use the Row Sampling transformation; if I give 10 as the value then the output is 10 records.
Ex: With the same source, if we use the Percentage Sampling transformation
and give 10 as the value, then 10% of 1,000, i.e. 100 records, will be the output.
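The two samplings can be approximated in T-SQL as follows (the SSIS transformations select rows randomly, so TOP with ORDER BY NEWID() is only a rough stand-in):

    -- Row Sampling: a fixed number of rows.
    SELECT TOP (10) * FROM SourceTable ORDER BY NEWID();         -- 10 random rows

    -- Percentage Sampling: a fixed percentage of rows.
    SELECT TOP (10) PERCENT * FROM SourceTable ORDER BY NEWID(); -- ~100 of 1,000 rows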
  • Script Component
The Script Component executes a script for each row; it applies a custom transformation to each data row it reads. The Script Component works inside the Data Flow task, and breakpoints are not supported when using the Script Component.

  • Slowly Changing Dimension
The Slowly Changing Dimension transformation in the data flow task of SSIS is used to load dimensions, i.e. logical groupings of data such as geographical locations, customers or product information. In a slowly changing dimension the data changes slowly over time, rather than on a regular, time-based schedule.
There are different types of slowly changing dimensions:
Type 1: In a Type 1 SCD the new information overwrites the original information. No history is kept.
SupplierID             SupplierName                 Address
S001                     ABC Company                USA
ABC Company moved from USA to UK, so the new information replaces the old record:
SupplierID             SupplierName                 Address
S001                     ABC Company                UK
In Type 1 it is not possible to track history.
Type 2: In a Type 2 SCD a new record is added to the table to represent the new information. Both the original and the new record will be present.
SupplierID             SupplierName                 Address
S001                     ABC Company                USA
S001                     ABC Company                UK
Type 2 keeps the full history of the information.
Type 3: In a Type 3 SCD there are two columns to indicate the original value and the current value, and another column that indicates when the current value became active.

SupplierID   SupplierName   OriginalAddress   CurrentAddress   EffectiveDate
S001            ABC Company         USA                            UK              10-Oct-2012
      Type 3 is rarely used and allows you to keep only part of the history (see the T-SQL sketch below).
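In T-SQL terms the difference between Type 1 and Type 2 looks roughly like this, assuming a hypothetical DimSupplier table with StartDate/EndDate/IsCurrent housekeeping columns for Type 2:

    -- Type 1: overwrite in place, no history.
    UPDATE DimSupplier
    SET    Address = 'UK'
    WHERE  SupplierID = 'S001';

    -- Type 2: expire the current row and insert a new one, keeping history.
    UPDATE DimSupplier
    SET    EndDate = '2012-10-10', IsCurrent = 0
    WHERE  SupplierID = 'S001' AND IsCurrent = 1;

    INSERT INTO DimSupplier (SupplierID, SupplierName, Address, StartDate, EndDate, IsCurrent)
    VALUES ('S001', 'ABC Company', 'UK', '2012-10-10', NULL, 1);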
  • Sort
The Sort transformation is used to sort the data in either ascending or descending order. It is mainly used before Merge or Merge Join transformations, which require sorted inputs. The Sort transformation is an asynchronous, blocking transformation.
  • Term Extraction
The Term Extraction transformation extracts terms from the text in a transformation input column and then writes the terms to a transformation output column. The transformation works only with English text. It analyzes free-flowing text in your data source and enables you to pull out the nouns and noun phrases found in it, along with the frequencies with which they occur.
  • Term Lookup
Term Lookup is one of the powerful transformations in SSIS. It basically helps you analyze a set of text records: the Term Lookup transformation looks for a set of user-defined nouns or noun phrases in the transformation's input text column and provides the count of occurrences as an output on a row-by-row basis.
  • Union All
The Union All transformation can accept two or more datasets as input; this transformation does not require sorted datasets.
  • Unpivot
The Unpivot transformation turns values spread across multiple columns into a single column.
Ex: Sales information for each year is stored in a separate year column, i.e. before the unpivot the year-wise sales appear as columns (2001, 2002, 2003, 2004); after applying the Unpivot transformation the year-wise columns are transformed into a single column (Year), as in the T-SQL sketch below.
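A rough T-SQL equivalent of the same unpivot, assuming a hypothetical wide table with one column per year:

    -- The four year columns become (SalesYear, SalesAmount) pairs.
    SELECT  ProductName, SalesYear, SalesAmount
    FROM YearlySalesWide
    UNPIVOT (
        SalesAmount FOR SalesYear IN ([2001], [2002], [2003], [2004])
    ) AS u;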


Data flow destinations
  • Data Mining Model Training
The Data Mining Model Training destination trains data mining models by passing the data that the destination receives through the data mining model algorithms. Multiple data mining models can be trained by one destination if the models are built on the same data mining structure.
  • DataReader Destination
The DataReader Destination exposes the data in a data flow by using the ADO.NET DataReader interface. The data can then be consumed by other applications. For example, you can configure the data source of a Reporting services report to use the result of  running a Microsoft SQL Server Integration Services package. To do this, you create a data flow that implements the DataReader destination.
  • Dimension Processing
The Dimension Processing destination loads and processes an SQL Server Analysis Services dimension.
  • Excel Destination
The Excel destination is a data flow destination in SSIS. You can load data from various sources, e.g. a Flat File or OLE DB source, into Excel by using the Excel destination.
  • Flat File Destination
The Flat File destination is a data flow destination from the SSIS toolbox. For example, to write output data to a text file you send it to a Flat File destination through a Flat File connection manager. When you configure the Flat File connection manager, you specify information about the file and about each column in the file: the characters that delimit columns and rows, and the data type and length of each column.
  • OLE DB Destination
The OLE DB destination is used to load data coming from a variety of sources such as Flat File, Excel, and OLE DB sources. This destination uses an OLE DB connection manager to connect to a data source, and the connection manager specifies the OLE DB provider to use.
The OLE DB destination provides five different data access modes for loading data:
A table or view: you can specify an existing table or view, or you can create a new table.
A table or view using fast-load options: you can specify an existing table or create a new table.
A table or view specified in a variable.
A table or view specified in a variable using fast-load options.
The result of a SQL statement.
  • Partition Processing
The Partition Processing destination loads and processes a SQL Server Analysis Services partition.
  • Raw File Destination
The Raw File Destination writes raw data to a file. Because the format of the data is native to the destination, the data requires no translation  and little parsing. This means that the Raw file destination can write data more quickly than other destinations such as the Flat File and the OLE DB destinations.
  • Recordset Destination
The Recordset destination creates and populates an in-memory ADO recordset. The shape of the recordset is defined by the input to the Recordset destination at design time.
  • SQL Server Destination
The SQL Server destination connects to a local SQL Server database and bulk loads data into SQL Server tables and views. Compared to the OLE DB destination, the SQL Server destination is quicker at loading data from source to destination, but you cannot use the SQL Server destination in packages that access a SQL Server database on a remote server.
  • SQL Server Mobile Destination
The SQL Server Mobile destination writes data to tables in SQL Server Mobile (Compact Edition) databases.
Variable
Variables can store values and share them with one or more other tasks in the package. Variables are also used to populate parameter values for T-SQL statements at run time; for example, a package can run an Execute SQL task and then use variables to dynamically set the parameters of the T-SQL statement.

Connection Manager
Connection managers play a vital role in SSIS packages. SSIS packages typically interact with a variety of data sources. There is a connection managers area on the design surface where you can specify each data source that you will access.
Ex: My source is Excel and I have to load data from Excel into a SQL Server table; I require two connection managers, one for the Excel spreadsheet and another for the database.

Event Handler
As the name says, based on a certain event we would like to take some action (handle the event). For example, we may send an email in case of a success or failure in the package.
SSIS provides a number of event handlers:
1) OnError 2) OnInformation 3) OnPostExecute 4) OnPostValidate 5) OnPreExecute.

Logging
Logging helps you record the events that happen in the package. For example, we need to know which tasks executed successfully or failed, what the error messages were, etc. We have various options for saving SSIS logging: we can save the logs to the Windows Event log, a text file, an XML file, a SQL Server table or SQL Profiler. Choose the one you feel is right for your need. You do not need to enable logging for the entire package; you can select only the executables for which you want to enable logging.

SSAS

SQL Server Analysis Services (SSAS) is the technology from the Microsoft BI stack for developing Online Analytical Processing (OLAP) solutions. In simple terms, you can use SSAS to create cubes using data from data marts / data warehouses for deeper and faster analysis.

Data source
A data source contains the connection information, i.e. the provider name (Native OLE DB \ SQL Server Native Client 10.0), server name, authentication, database, and impersonation information (use service account), used to get the data from the data mart/data warehouse into Analysis Services.

Data source view
A data source view contains a persistent set of tables, meaning you pull in only the dimension and measure-group tables required to create a cube. A data source view is a single, unified view of the metadata from the specified tables and views that the data source defines in the project. It offers several features: named queries, named calculations, defining logical primary keys, etc.

Named Query
A named query in a DSV lets you add only the required fields from multiple tables. It is very similar to a view in the database. There may be scenarios where only a few fields out of n fields across multiple tables are required; for example, only three fields are important for product sales analysis, such as product name, product subcategory and product category, but these are spread across three tables: Product, ProductSubcategory and ProductCategory. Here we make use of a named query instead of referring to these tables directly in the DSV, which also improves performance (see the sketch below).
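A named query is just a SELECT statement stored in the DSV. For the product example it might look like the following (AdventureWorks-style table and column names assumed for illustration):

    SELECT  p.EnglishProductName             AS ProductName,
            ps.EnglishProductSubcategoryName AS ProductSubcategory,
            pc.EnglishProductCategoryName    AS ProductCategory
    FROM DimProduct            AS p
    JOIN DimProductSubcategory AS ps ON ps.ProductSubcategoryKey = p.ProductSubcategoryKey
    JOIN DimProductCategory    AS pc ON pc.ProductCategoryKey    = ps.ProductCategoryKey;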

Named Calculation
A named calculation is a SQL expression represented as a calculated column. The expression behaves like a column in the table. A named calculation lets you extend the relational schema of existing tables or views in a data source view without modifying the tables or views in the underlying data source. You cannot add a named calculation to a named query.
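A named calculation holds only the expression, not a full query. A typical (hypothetical) example concatenates two existing columns of a customer table; conceptually SSAS then treats the table as if it were defined with that extra column:

    -- Expression stored in the DSV as a named calculation, e.g. FullName on DimCustomer:
    FirstName + ' ' + LastName

    -- Conceptually equivalent view of the table after the calculation is added:
    SELECT  CustomerKey,
            FirstName,
            LastName,
            FirstName + ' ' + LastName AS FullName
    FROM DimCustomer;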

Dimension
Dimensions are business objects or entities that contain a list of attributes describing the object (the dimension itself); they are usually derived from code tables. These attributes appear as attribute hierarchies and can be organized into user-defined hierarchies, or can be defined as parent-child hierarchies based on columns in the underlying dimension table.

Attribute
The columns in a dimension table are represented as attributes. Using these attributes we create attribute hierarchies; by default every attribute in a dimension has an attribute hierarchy.

Hierarchy
 A hierarchy is a very important part of any OLAP engine and allows users to drill down from summary levels to detail levels. There are different types of user-defined hierarchies that can be created based on the relationship of the attribute members with each other.

Natural Hierarchy
When the levels in a hierarchy are linked in a natural relationship such as one-to-one or many-to-one, the hierarchy is known as a natural hierarchy. For example, in a calendar hierarchy the Day level is related to the Month level, the Month level is related to the Quarter level, and so on. Natural hierarchies are also known as balanced hierarchies.

UnNatural Hierarchy
In some hierarchies the natural relationship between the hierarchy levels is lacking and the branches of the hierarchy descend to different levels. Such hierarchies are known as unnatural or unbalanced hierarchies. For example, a Marital Status -> Gender hierarchy is an unnatural hierarchy, as the marital status and gender attributes have no relationship to each other.

Ragged Hierarchy
Ragged hierarchies are unbalanced hierarchies in which the logical parent member of at least one member is not in the level immediately above that member. In this case the hierarchy descends to different levels for different drill-down paths. Take the example of a geography hierarchy with the levels Country -> State/Province -> City: some countries, like Greece, do not have provinces, thus creating a ragged hierarchy.  

Attribute relationship
Attribute relationships are the relationships set between the attributes of a dimension; the attributes are organized based on the relationships defined between them in the dimension table.
A dimension model is based on either a star schema or a snowflake schema. In a star schema, all the dimension attributes belong to the same relational table, and hence an attribute relationship is automatically defined between the key attribute and each non-key attribute of the dimension. In a snowflake schema the dimension attributes are derived from more than one related table; for example, the Product dimension has attributes derived from three dimension tables: DimProductCategory, DimProductSubcategory and DimProduct.

Cube
The basic unit of storage and analysis in Analysis Services is the cube. A cube is a collection of data that has been aggregated to allow queries to return data quickly. Cubes are organized into dimensions and measures. The data for a cube comes from a set of data warehouse tables: the dimensions in the cube come from the dimension tables in the data warehouse database, while the measures come from the fact tables.

Measure group
A Measure group is a collection of measures which belong to the same underlying fact table.

Measure
A measure is any numeric value that represents a metric aligned to the organization's goals. This is the value the business users are interested in viewing from different angles and at different granularity levels. A measure is also commonly called a fact.

Calculated measure
A calculation is an MDX expression or script used to define a calculated member in a cube. Calculated members can be defined for dimensions or for measures; the members defined on the Measures dimension are called calculated measures. Calculated members are defined using the data available in the cube and can be made complex by combining that data with arithmetic operators, numbers and functions.

Named set
A named set is a reusable set of dimension members, or a set expression defined using Multidimensional Expressions (MDX). You define named sets by combining cube data, arithmetic operators, numbers and functions. For example, you can define a named set that returns the top ten products based on the highest values of the Sales measure, as sketched below.
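Both calculated measures and named sets are just MDX added to the cube's calculation script. A hypothetical example (measure and attribute names assumed) might look like this:

    -- Calculated measure: profit derived from two existing measures.
    CREATE MEMBER CURRENTCUBE.[Measures].[Profit]
        AS [Measures].[Sales Amount] - [Measures].[Total Cost],
        FORMAT_STRING = "Currency";

    -- Named set: the ten products with the highest sales.
    CREATE SET CURRENTCUBE.[Top Ten Products]
        AS TopCount([Product].[Product].[Product].Members, 10, [Measures].[Sales Amount]);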

Dimension usage
Dimension usage is used to define the relationships between dimensions and measure groups, and the granularity of each dimension within each measure group. If you use multiple fact tables, you might have to identify whether measures do not apply to one or more dimensions. 

Calculations
Use this tab to examine calculations that are defined for the cube, to define new calculations for the whole cube or for a subcube, to reorder existing calculations, and to debug calculations step by step by using breakpoints. Calculations let you define new members and measures based on existing values, such as a profit calculation, and to define named sets.

KPIs
Use this tab to create, edit, and modify the Key Performance Indicators (KPIs) in a cube. KPIs enable the designer to quickly determine useful information about a value, such as whether the defined value exceeds a goal or falls short of the goal, or whether the trend for the defined value is getting better or worse.

Actions 
Use this tab to create or modify drillthrough, reporting, and other actions for the selected cube. Actions provide to client applications context-sensitive information, commands, and reports that end users can access.
 Partitions
Partitions let you store sections of a cube in different locations with different properties, such as aggregation definitions.
Perspectives
Use this tab to create and manage the perspectives in a cube. A perspective is a defined subset of a cube, and is used to reduce the perceived complexity of a cube to the business user.
Translations
Use this tab to create and manage translated names for cube objects, such as month or product names.

Happy Business Intelligence:-)
సునీల్ ఆనిపాకుల / Sunil Anipakula.
