Showing posts with label SSIS. Show all posts

Wednesday, December 12, 2012

Alternatives to handling null values in aggregate functions or How to avoid the "Warning: Null value is eliminated by an aggregate or other SET operation."

Most SQL developers have written many queries that perform calculations using aggregate functions such as SUM() or COUNT().  Many have not thought deeply about what is happening within that aggregate function, and fewer still about the handling of NULL values.  Typically, the developer should not have to think about such internals, but there is one situation where they might rise to your attention: when you see the "Warning: Null value is eliminated by an aggregate or other SET operation".

In most querying activities, the "Warning: Null value is eliminated by an aggregate or other SET operation" wouldn't bubble up to your attention.  In SQL Server Management Studio, you might only notice it if you click over to the 'Messages' tab:
This appears on an otherwise working query. 

There is one situation where this demands developer attention: when it is encountered in a SQL Server Integration Services package.  A Microsoft Connect-reported issue describes the scenario as "An SSIS package which runs fine but sporadically fails with 'Warning: Null value is eliminated by an aggregate or other SET operation'" (Bug 483175).

This can occur when the query for a Source Adapter is written in such a way that it aggregates potentially Null values, either directly or through dependent objects.

Microsoft has weighed in with "The real problem is why the underlying OLEDB provider returns a failure code then. Probably the warning is related but not necessary. The specific warning can happen to be there."  Agree or disagree with that resolution, the reality is that something about the query in the Source Adapter will have to change.   The following is a very simple example of the scenario, and two alternative ways to change the query so that the OLE DB warning no longer occurs.

The requirement for the script is a result set that counts the non-null values.

The first step in the script is to set up a very simple table variable.  Without specifying anything for the column beyond the name and data type, it will allow Null values.  The script then inserts two legitimate values, one Null value, and one zero.  This is sufficient to demonstrate that different logical results are produced by counting rows, counting a column, and summing.

DECLARE @tst TABLE
(
Value int
)

INSERT INTO @tst (Value) VALUES (1)
INSERT INTO @tst (Value) VALUES (7)
INSERT INTO @tst (Value) VALUES (NULL)
INSERT INTO @tst (Value) VALUES (0)

SELECT * FROM @tst

The first test is the query most developers would naturally write.  You might even have built it through the Query Designer:

SELECT COUNT(*) AS CountOfRows, COUNT(value) AS CountOfValues, SUM(value) AS SumOfValues FROM @tst


The results meet the requirement: 'CountOfValues' clearly returns the 3 legitimate values.  'CountOfRows' and 'SumOfValues' are true for what they are, but neither can be used to gain any insight into the number of non-null values.  Back on the 'Messages' tab, though, the warning about null values appears - this is the type of query that will cause an SSIS Source Adapter to raise an exception.

What's happening here?   Basically, this query is relying on the database management system's handling of null values, instead of having the query itself specify how nulls should be handled.  The ANSI SQL-92 standard specifies that a warning be raised when an aggregate function relies on the elimination of null values before completing.  (Gotcha! SQL Aggregate Functions and NULL has a deep discussion of this topic.)  The implication is that a given database management system might handle null values differently, or that the handling could be set through configuration.  There are schools of thought holding that the result of the aggregate function should itself be a Null value, since at least one input value is null.  Perhaps so from a purist point of view, but not particularly useful for business reporting or many common SQL querying scenarios.
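To see the standard's behavior at its extreme, consider summing a column that contains only Null values: the result is NULL, not 0, and the same warning is raised.  A minimal sketch (the table variable name here is just for illustration):

```sql
-- Illustration: SUM over a set of all-NULL values yields NULL, not 0,
-- and the 'Messages' tab shows the null-elimination warning.
DECLARE @nulls TABLE
(
Value int
)

INSERT INTO @nulls (Value) VALUES (NULL)
INSERT INTO @nulls (Value) VALUES (NULL)

-- Returns a single row with SumOfValues = NULL
SELECT SUM(Value) AS SumOfValues FROM @nulls
```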

So how do you avoid this warning?

Alternative 1:
Simply add the statement 'SET ANSI_WARNINGS OFF' before your script.  As you might expect, this suppresses the ANSI-specified warnings, such as the one about the handling of null values in aggregate functions.   In the following sample code, ANSI_WARNINGS is turned off and the same query is run.   No warning appears on the Messages tab.  In this sample, ANSI_WARNINGS is turned back on so the rest of the script can be demonstrated, but that is not strictly required: the setting only lasts through the session.

--Alternative 1: turn warnings off, rerun
SET ANSI_WARNINGS OFF
SELECT COUNT(*) AS CountOfRows, COUNT(value) AS CountOfValues, SUM(value) AS SumOfValues FROM @tst

--turn warnings back on, for this session
SET ANSI_WARNINGS ON


Disadvantages I have found so far: from a testability view, this suppression for the remainder of the session could hide other legitimate warnings.  The main practical consequence I have seen is that if the query is maintained with a graphical editor such as the Query Designer, the SET ANSI_WARNINGS statement will not be included in the edited query, and could be lost inadvertently in a copy-paste operation.

Alternative 2:
Rewrite the query to make the handling of Null values explicit.  In the rewritten version, the query does not rely on the database management system's handling of Nulls or on the ANSI standard; the handling is explicitly stated.   Basically, the Nullable column is converted to a non-nullable boolean-style flag (the bit data type in SQL Server, conceptually) based on whether the original value was populated.  That non-nullable flag can then be summed to meet the original requirement of a count of non-null values.   In the sample script, this is all done in a single statement.  If the query were the definition of an object such as a view, you would expect the column to carry a not-null constraint.

--Alternative 2: rewrite to provide a logical handling of the NULL value
SELECT COUNT(*) AS CountOfRows, SUM(CASE WHEN Value IS NOT NULL THEN 1 ELSE 0 END) AS CountOfValues, SUM(COALESCE(value,0)) AS SumOfValues FROM @tst



The main disadvantage is adjusting to the idea of using the SUM() aggregate function to produce what one might naturally write using COUNT().  It is fully supported, but less readable in a graphical query designer.
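If the query feeding the SSIS Source Adapter lives in a view, the same rewrite applies at the view definition.  A sketch, where the view name and underlying table (dbo.SomeTable, mirroring the @tst example) are hypothetical:

```sql
-- Hypothetical view and table names, for illustration only.
-- The CASE expression makes the null handling explicit, so the
-- null-elimination warning is not raised when the view is queried.
CREATE VIEW dbo.vw_ValueSummary
AS
SELECT
    COUNT(*) AS CountOfRows,
    SUM(CASE WHEN Value IS NOT NULL THEN 1 ELSE 0 END) AS CountOfValues,
    SUM(COALESCE(Value, 0)) AS SumOfValues
FROM dbo.SomeTable;
```

An SSIS Source Adapter can then simply SELECT from dbo.vw_ValueSummary, and every caller gets the explicit null handling for free.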

Conclusion:
In most cases, a query that produces "Warning: Null value is eliminated by an aggregate or other SET operation" should barely register with you.  At most, it might prompt you to verify that the nullability of the columns you are querying matches your expectations.  However, if the query is to be used in the Source Adapter of a SQL Server Integration Services package, action must be taken.

If a quick work-around is needed, and there is strong protection against inadvertent modification of the query, you might consider adding the statement 'SET ANSI_WARNINGS OFF' before the beginning of the query.

If a more resilient solution is called for, for example if the query is to be accessed by multiple callers in the form of a database view, consider doing the work of rewriting the query to explicitly handle null values.

Entire Script:
DECLARE @tst TABLE
(
Value int
)

INSERT INTO @tst (Value) VALUES (1)
INSERT INTO @tst (Value) VALUES (7)
INSERT INTO @tst (Value) VALUES (NULL)
INSERT INTO @tst (Value) VALUES (0)

SELECT * FROM @tst

--works but triggers the 'Warning: Null value is eliminated by an aggregate or other SET operation' warning
--SELECT COUNT(*) AS CountOfRows, COUNT(value) AS CountOfValues, SUM(value) AS SumOfValues FROM @tst

--Alternative 1: turn warnings off, rerun
SET ANSI_WARNINGS OFF
SELECT COUNT(*) AS CountOfRows, COUNT(value) AS CountOfValues, SUM(value) AS SumOfValues FROM @tst

--turn warnings back on, for this session
SET ANSI_WARNINGS ON

--Alternative 2: rewrite to provide a logical handling of the NULL value
SELECT COUNT(*) AS CountOfRows, SUM(CASE WHEN Value IS NOT NULL THEN 1 ELSE 0 END) AS CountOfValues, SUM(COALESCE(value,0)) AS SumOfValues FROM @tst



Wednesday, November 14, 2012

Anatomy of an SSIS Package InfoPath Import

Notes: This is the draft of a sample project and code article being developed to document a solution for collecting disparate values via a Microsoft Office InfoPath form deployed in a SharePoint server.   The data model may remind you of structures referred to as 'key-value pairs' or 'entity-attribute-value', although the focus here was on meeting a business need rather than strict adherence to a pattern.

This has been core to my work, but I believe the generic pattern you see here may be of some use to others.  I could take this topic in many directions:
  • upstream, to the mechanics of deploying this form in a SharePoint server
  • downstream, to the database structures for working with this data, all the way to incorporating it into performance metrics and a data warehouse, or
  • across, to the management of multiple InfoPath forms using this model
Since I'm not yet decided where to start the sample project, this is an anatomy of the 'middle': the place where the data crosses from the user-interface-oriented structures of InfoPath to the raw data where it can be used in further database processes.

Preparation Step
Creates the Working folder and Archive folder if they don't already exist

For Each Form
The 'Prepare XML File' container prepares the XML file generated by submitting an InfoPath form for access as a tabular data set, suitable for importing into a relational database.   The XML file is progressively transformed, and the output of each step is stored in the working folder, which enables instrumentation and troubleshooting.

The steps include:
"Copy File Into Working Space" [File System Task]: Copying the current file being processed as is into the working directory
"Strip Namespaces" [XML Task]: Removes to a custom namespace generated by the InfoPath form submittal called 'my'.  This is accomplished through a short XSLT script.
"Wrap in a table element" [XML Task]: Encapsulates the default 'myFields' node in a node called 'myTable'.   This allows the later data flow task to see the file as a dataset with each InfoPath field as a column.  Without this, each column appears as a separate data set. This is accomplished through a short XSLT script.
"Pivot to Key-Values" [XML Task]: In this solution, the exact names and quantity of the fields vary over time or are unknown at design time.  This step converts every InfoPath field into a 'field' record in the file, making the true InfoPath field name into an attribute.  It then can be imported as data rather than being part of the metadata.

The 'Import Raw Values to Output' container actually imports from the file into a data destination, such as a relational database, in the 'Import Raw Values' [Data Flow Task].   This is how the values from multiple InfoPath forms reach a single location against which further queries can be written.  Since the form was pivoted to key-values earlier, the InfoPath field name is one of the columns available to import.

The last step, 'Archive the File' [File System Task], completes the processing of that form.  The form is moved to a subdirectory of the directory being processed and renamed with a timestamp, so the contents of the file at that point in time can be examined later.  That subdirectory is compressed or archived in a separate process.

Connection Managers

The connection managers for the package support the InfoPath XML translation.  The output of each step, and the final transformed copy, is written to the file system to ease troubleshooting.  Each XSLT script that transforms the file is a file connection, which allows the XSLT to be housed in a shared source directory.

For the sample code, the 'output' and 'outputtext' connection managers exist to demonstrate exporting multiple InfoPath forms to flat files.  In production, this would be a relational database such as SQL Server.



Tuesday, January 4, 2011

Recommended SSIS blog

I've been following a blog series on SSIS, with the recent topic:
31 Days of SSIS - Folder Structures

My response:
I can tell you use an organized layout, and some of the concepts under WorkFiles I hadn't considered before.

What do I do differently?
Well, the concept of WorkFiles is outside of the 'SSIS' hierarchy; in fact, by the time it gets to production, it is on an entirely different drive. One reason is that the WorkFiles live on even if another developer were to follow me and structure packages differently… or, for that matter, if a different ETL tool were used, or imports even reverted to being done 'manually'. The second reason is the source control and permissions alluded to. Non-developers in my environment need access to the WorkFiles on a regular basis to drop off files and, in rarer situations, to see what files have been imported. It confuses me too much to have this mixed in with folders that need source code control.

Under my equivalent to WorkFiles/Project, there is a 'Samples' folder which contains samples (i.e. templates) of what the files are 'supposed' to look like. Especially with Excel files, I've found a groove where the package uses a file under 'WorkFiles/Project/Samples' at design time, then loops through 'WorkFiles/Project' at execution time. This can be challenging, though, if the 'Samples' get out of sync with the package.

Wednesday, October 1, 2008

SQL Server Integration Services – Beyond the Wizard – Part 1

In a method not entirely unusual for me, and I'm sure many of you, I'm picking up a technology 'on-the-fly': SQL Server Integration Services (SSIS) 2005. Since I have a fairly solid understanding of Extract-Transform-Load and other integration topics, and experience with the predecessor tool, Data Transformation Services (DTS), most of the learning curve has been the nuances of the tool.

I'm finding I'm learning it in essentially a 'refactoring' approach. I've been able to get done the work I need to, but not always in the most elegant way.

This blog series will highlight some topics of interest to the intermediate developer - you've been able to build a basic data flow, import/export, or other common routine, possibly only what the Import/Export Wizard in SQL Server Management Studio generates, and perhaps a number of them following a similar pattern. The next task is to polish them to be more dynamic and maintainable on a long-term basis.

One of the first issues I noticed with my SSIS project was that the connection strings were stored in each package - this is what you get if the results of the Import/Export Wizard are saved as an SSIS package. This example shows a routine import of a CSV file into an AdventureWorks database on the local SQL Server.




I didn't see an obvious way to move the 'Connection Managers' from the package to the project level. The connection string property of a Connection Manager seemed like the next best bet, but I couldn't find a way to make it dynamic. And while I had seen the Data Sources container under the project, I couldn't figure out how to reference those inside a package.

To change an SSIS package from using a per-package 'Connection Manager' to using a per-project 'Data Source' (useful if many SSIS packages use the same database) involves:


  • Creating a new project-level data source

  • Associating each Control Flow and Data Flow Task that uses the package's Connection Manager with the new Data Source

  • Deleting the package-level connection manager

1. Create a Project-Level Data Source


a. Right-Click the Data Sources folder under the project

b. Choose 'New Data Source'



c. Click 'Next ->'



d. Select the appropriate Server, Database, and Authentication



e. Choose 'OK'

2. Associate each Control Flow and Data Flow Task that uses the Package's Connection Manager


a. Where it now shows 'DestinationConnectionOLEDB' (the default name)



b. Choose the drop-down -> New Connection…



c. Select the data source created in Step 1 above



d. Repeat for any task objects that reference 'DestinationConnectionOLEDB'

3. Delete the Connection Manager


To tidy things up, from the Connection Managers tab, delete 'DestinationConnectionOLEDB'



Summary



These steps demonstrated converting from a per-package connection manager to a data source. The Data Source can now be referenced in other packages, providing a single home for the Server Name, Database Name, and Authentication.


In this case, only the target data source (AdventureWorks) was created. The source file could have been created as a data source as well, but there is less value when one side of the integration is relatively unique to the package. Where I am doing multiple connections between the same databases, it makes sense to create both as Data Sources.