Wednesday, December 12, 2012

Alternatives to handling null values in aggregate functions or How to avoid the "Warning: Null value is eliminated by an aggregate or other SET operation."

Most SQL developers have written many queries that perform calculations using aggregate functions such as SUM() or COUNT().  Many have not thought deeply about what is happening within that aggregate function, and fewer still about the handling of NULL values.  While typically the developer should not have to think deeply about such internals, there is a situation where it might rise to your attention: when you see the "Warning: Null value is eliminated by an aggregate or other SET operation."

In most querying activities, the "Warning: Null value is eliminated by an aggregate or other SET operation" wouldn't bubble up to your attention.  In SQL Server Management Studio, you might notice it only if you click over to the 'Messages' tab, where it appears on an otherwise working query.

There is a situation where this demands developer attention: when it is encountered in a SQL Server Integration Services package.  A Microsoft Connect-reported issue (Bug 483175) describes the scenario: an SSIS package that runs fine but sporadically fails with "Warning: Null value is eliminated by an aggregate or other SET operation."

This can occur when the query for a Source Adapter is written in such a way that it aggregates potentially Null values, either directly or through dependent objects.

Microsoft has weighed in with "The real problem is why the underlying OLEDB provider returns a failure code then. Probably the warning is related but not necessary. The specific warning can happen to be there."  Agree or disagree with that resolution, the reality is that something about the query in the Source Adapter will have to change.   The following is a very simple example of the scenario, and two alternative ways to change the query so that the OLEDB warning no longer occurs.

The requirement for the script is a result set that counts the non-null values.

The first step in the script is to set up a very simple table variable.  Without specifying anything for the column beyond the name and data type, it will allow Null values.  The script then inserts two legitimate values, one Null value, then one zero.  This is sufficient to demonstrate that different logical results are gathered by counting rows, by counting a column, and by summing.


DECLARE @tst TABLE (Value int)

INSERT INTO @tst (Value) VALUES (1)
INSERT INTO @tst (Value) VALUES (7)
INSERT INTO @tst (Value) VALUES (NULL)
INSERT INTO @tst (Value) VALUES (0)


The first test is the query most developers would naturally write.  You might have even built this through the Query Designer:

SELECT COUNT(*) AS CountOfRows, COUNT(value) AS CountOfValues, SUM(value) AS SumOfValues FROM @tst

The results meet the requirement: 'CountOfValues' returns 3, the number of legitimate values.  'CountOfRows' and 'SumOfValues' are true for what they are, but neither can be used to gain any insight into the number of non-null values.  Back on the 'Messages' tab, though, the warning about null values appears; this is the type of query that will cause an SSIS Source Adapter to raise an exception.

What's happening here?   Basically, this query relies on the database management system's handling of null values, instead of having the query itself specify how nulls should be handled.  The ANSI SQL-92 standard specifies that warnings be raised when aggregate functions rely on the elimination of null values before completing. (Gotcha! SQL Aggregate Functions and NULL has a deep discussion of this topic.)  The implication is that a given database management system might handle null values differently, or the handling could be set through configuration.  There are schools of thought holding that the result of the aggregate function should be a Null value, since at least one input is a null value.  Perhaps defensible from a purist point of view, but not particularly useful for business reporting or many common SQL querying scenarios.
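To make the distinction concrete, here is a small illustrative snippet (not part of the main script) contrasting how NULL propagates through a scalar expression versus being eliminated by an aggregate:

```sql
-- In a scalar expression, NULL propagates: this returns NULL
SELECT 1 + NULL AS ScalarResult

-- In an aggregate, NULL rows are eliminated first: this returns 8 (1 + 7 + 0)
-- and raises the "Null value is eliminated..." warning
SELECT SUM(v) AS AggregateResult
FROM (VALUES (1), (7), (NULL), (0)) AS t(v)
```

The aggregate's answer of 8 is usually what a business report wants, but the standard requires the engine to warn that it silently discarded a row to get there.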

So how do you avoid this warning?

Alternative 1:
Simply add the statement 'SET ANSI_WARNINGS OFF' before your script.  As you might expect, this suppresses any of the ANSI-specified warnings, such as the handling of null values in aggregate functions.   In the following sample code, ANSI_WARNINGS is turned off and the same query is run.   No warning appears on the Messages tab.  In this sample, ANSI_WARNINGS is turned back on so the rest of the script can be demonstrated, but that is not strictly required: the setting only lasts for the session.

--Alternative 1: turn warnings off, rerun
SET ANSI_WARNINGS OFF
SELECT COUNT(*) AS CountOfRows, COUNT(value) AS CountOfValues, SUM(value) AS SumOfValues FROM @tst

--turn warnings back on, for this session
SET ANSI_WARNINGS ON
There are disadvantages.  From a testability view, suppressing warnings for the remainder of the session could suppress other legitimate warnings.  The only practical consequence I have seen so far is that if the query is designed with a graphical editor such as the Query Designer, the SET ANSI_WARNINGS statement will not be included in the edited query, and could be inadvertently dropped in a copy-paste operation.

Alternative 2:
Rewrite the query to make the handling of Null values explicit.  In the rewritten version, the query won't rely on the database management system's handling of Nulls or on the ANSI standard; the handling is explicitly stated.   Basically, the Nullable column is converted to a non-nullable 0-or-1 flag based on whether the original value was populated.  That flag can then be summed to meet the original requirement of a count of non-null values.   In the sample script, this is all done in a single statement.  If the query were the definition of an object such as a view, you would expect the column to have a not-null constraint.

--Alternative 2: rewrite to provide a logical handling of the NULL value
SELECT COUNT(*) AS CountOfRows, SUM(CASE WHEN Value IS NOT NULL THEN 1 ELSE 0 END) AS CountOfValues, SUM(COALESCE(value,0)) AS SumOfValues FROM @tst

The disadvantage seems to be adjusting to the idea of using the SUM() aggregate function to produce what one might naturally write using the COUNT() aggregate function.  It is supported, but less readable in a graphical query designer.

In most cases, seeing "Warning: Null value is eliminated by an aggregate or other SET operation" while writing a query should barely register with you.  At most, it might prompt you to verify that the null-ability of the columns you are querying matches your expectations.  However, if the query is to be used in the Source Adapter of a SQL Server Integration Services package, action must be taken.

If a quick work-around is needed, and there is strong protection against the inadvertent modification of the query, you might consider adding the statement 'SET ANSI_WARNINGS OFF' before the beginning of the query.

If a more resilient solution is called for, for example if the query is to be accessed by multiple callers in the form of a database view, you may consider doing the work of rewriting the query to explicitly handle null values.

Entire Script:

DECLARE @tst TABLE (Value int)

INSERT INTO @tst (Value) VALUES (1)
INSERT INTO @tst (Value) VALUES (7)
INSERT INTO @tst (Value) VALUES (NULL)
INSERT INTO @tst (Value) VALUES (0)


--works but triggers the 'Warning: Null value is eliminated by an aggregate or other SET operation' warning
--SELECT COUNT(*) AS CountOfRows, COUNT(value) AS CountOfValues, SUM(value) AS SumOfValues FROM @tst

--Alternative 1: turn warnings off, rerun
SET ANSI_WARNINGS OFF
SELECT COUNT(*) AS CountOfRows, COUNT(value) AS CountOfValues, SUM(value) AS SumOfValues FROM @tst

--turn warnings back on, for this session
SET ANSI_WARNINGS ON

--Alternative 2: rewrite to provide a logical handling of the NULL value
SELECT COUNT(*) AS CountOfRows, SUM(CASE WHEN Value IS NOT NULL THEN 1 ELSE 0 END) AS CountOfValues, SUM(COALESCE(value,0)) AS SumOfValues FROM @tst

Friday, December 7, 2012

Reporting Services Report Optimized for the 'Select All' option of a Multi-Select Parameter

A SQL Server Reporting Services (SSRS) report was performing dramatically slowly when the user selected all values in a multi-select parameter list of 30-40 locations.  The report performed within 5-10 seconds when one location was selected.  Selecting all locations would eventually time out, exceeding what I believe is a 5-minute default connection timeout for Reporting Services in SharePoint integrated mode.

Below are the steps used to improve this performance to an acceptable level:

·         Removed the WHERE clause that was filtering by the location parameter

·         Changed it to a DataSet filter

When the parameter was passed in the query, SQL Server was trying to incorporate it into the query plan.  Because the location is a string value that itself needs to be parsed, this made for a complex loop through the passed-in locations.  This might be optimized in the future by 'closing' the view (there are some requirements on the structure of the view to do so).
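As a sketch of the change (the view and column names here are illustrative, not the actual report's), the original dataset query filtered by the multi-value parameter, while the revised query returns all rows and leaves the location filtering to the report's DataSet filter:

```sql
-- Before: the multi-value @Location parameter is expanded into the query,
-- forcing SQL Server to parse and match the location string for every row
SELECT LocationCode, MetricName, MetricValue
FROM dbo.vwLocationMetrics
WHERE LocationCode IN (@Location)

-- After: no WHERE clause; the report's DataSet filter on the location field
-- does the filtering on the report server instead
SELECT LocationCode, MetricName, MetricValue
FROM dbo.vwLocationMetrics
```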

The query plans for the filtered and unfiltered queries were similar.  Both actually rely on table scans, because some text parsing is at play to get at key attributes.  The unfiltered query actually reports more expensive operations in the form of a hash match and is reported as slightly larger in cost.

In practice, though, the unfiltered query returns its 100-200 rows in a second, while the filtered query returns in 1-5 minutes depending on how many locations are filtered for.

DataSet filters are conceptually applied 'client-side'.  In the case of Reporting Services, the 'client side' is the report server.  It has to go out to the database server, pull all data matching the query, then process and display only the rows that match the filter.  So more data is transferred between the two processes, and more work is done in the report server process.  Usually that is not ideal.

In this situation, it was faster to get all rows, unfiltered, out of the database, make them available for reporting, and have the report decide which to display.

Thursday, November 22, 2012

How is pitching a movie script like pitching a software solution?

A couple of nights ago, I listened to a podcast featuring Max Landis, in which he talked about the screenwriting process.  He covers his screenwriting experience and then... in about the last half-hour, talks about three approaches to pitching a script to movie producers.  Basically, getting his script funded or purchased.  Thinking about it further, I wondered: could his approaches be used to pitch software solutions or alternatives to clients or internal customers?  How is pitching a movie script like pitching a software solution?

Max Landis' bona fides include just a few past screenwriting credits, then the cult movie Chronicle and the YouTube video The Death and Return of Superman.  Apparently he has something like twelve scripts sold and in production.  In an unconventional profession, Landis is a bit unconventional-er... (more unconventional) than many.  He cleanly acknowledges this and spends solid parts of the podcast describing what goes on in his mind as he is creating.  Midway through the podcast, he starts weaving a story idea about a British sailor in the late 18th century.  The technique he uses to 'pitch' this story, he then reveals, is his favorite of three approaches to pitching a script to movie producers.

The first approach Landis describes is to pitch the script by talking through the plot scene-by-scene.  The second approach, which he uses in the British sailor story, is to talk about the script as if you just saw the film and want to talk your friends into seeing it with you.  The third approach he mentions is to start with an idea, a potentially big and abstract idea, and flesh it out with possible scenarios that would center on the idea.  Now, can we pitch, or sell, or present, a software solution using these approaches?

The scene-by-scene approach is traditionally used in pitching software solutions.  It is what I learned at university in courses on the Software Development Life Cycle (SDLC).  In this approach, you gather the requirements and present them back as data flows or process charts.  Waterfall models, stereotypically, exude this step-by-step sequential feel.  Even agile techniques such as use cases and user stories can have a fairly rote sequentiality.  The process starts at point A, goes through these steps, and ends when point B is reached.  Some demos and slide decks follow this approach: this is how you log in, this is how you search, this is how you view, this is how you edit.

This can work.  Just like the example in the podcast of a script for a movie where the studio already has the idea.  It is a sequel, or a remake, or just the idea has been developed independently.  This can be a situation where the solution is routine or the customer already knows what they want - but doesn't know if you can do it.

I've presented quite a few solutions in this style, and most were to an extent already initiated or simply accepted matter-of-factly.  They may have been required for regulatory compliance, and my solution was pretty much the only alternative available and feasible.

On the other hand, I have pitched at least two projects using this approach unsuccessfully.  At least one I wish had been successful.  In that case, there was an opportunity to provide some automation support for California prison staff who traveled around the state negotiating transfers of prisoners between prisons.  They had to take into account a number of factors in each prisoner's criminal history, manually cross-reference availability at other locations, and then print appropriate documents to authorize transfers.  I led a team that both did site visits to see this process firsthand and had subject-matter experts visit and review documentation.  I presented it back to leadership as mostly flowcharts and diagrams, to sort of... prove that we understood the need and were technically competent.  But none of the documentation really resonated, and the project didn't go anywhere at the time.  This customer may or may not have been willing to do the work if I had presented it more vibrantly.

Another project, I will confess, I was glad was not selected.  It was for a subset of results from a specialized type of audit.  I had taken the approach of presenting a brief but thorough scope document that mostly focused on the structure that would be needed to provide reliable, trust-able numbers, while most of the stakeholders focused on the high-level summary output.  This was largely to tease out the customer's willingness to commit their resources to doing detailed quality work such that the reporting would be reliable and sustainable over time.  They weren't... and so the 'scene-by-scene' technique was effective in bearing that out.

Talk about the Solution as if you have Seen it
This technique can definitely be used to pitch a software solution, though it is definitely more difficult.  Classic techniques such as storyboards and prototypes are probably the most obvious tools used in this approach.  It's not just the use of them, though, I've found.  The sequence and pacing and story being told through the purely visual artifacts is important.  In this technique, it helps to talk in the present tense: you're talking about the business process as it exists with your solution in place.  The client wants to get to this place and should start talking in the same present tense: "Yes, and we have..." or "Exactly, then we do..."

Landis, in his script-pitching process, will describe the script in a 'loose' way, leaving a lot of details to the imagination and not necessarily telling the audience the entire ending in one sitting.  When using this approach to pitch a software solution, I tend to skip over things like logging in, key entry, and menu navigation.  Not because they're not needed, but because they don't directly help the end state be visualized.  They exist 'around' the goal.  To the extent possible, I would even argue, as a philosophy, for minimizing these tertiary features in the eventual solution.  With the movement towards 'apps' versus applications, 'chromeless' windows, and single sign-on (SSO), this seems to be a trend throughout the software industry.

The 'pitch' of the software solution need not be complete and comprehensive.  If the customer is intrigued by the start of a solution, showing them only one logical path to an end point shuts down the creative process.  My interpretation of this is to leave out a lot of features.  I try to be careful not to imply we missed them or don't know how to deliver a feature, but instead make it clear that there is a lot of, perhaps infinite, unexplored ground.  For example, a grid view in an application might have a related print-friendly report.  Then, when you show the next grid view, the customer expects to see a print-friendly report, or an email link button.  Given the capacity of modern development environments and platforms for reuse and patterns, filling in the 'story' to meet the customer's expectation could require only modest effort.

The project that comes to mind that I have 'pitched' this way involved a need to bring data from electronically filed court documents into a custom application.  This started with a call from a client, I vividly remember, on a day I was home sick.  I made a special effort to take his call because I knew he was concerned about our ability to service his needs on some new types of court cases he would be involved in very quickly.  I described to him what he would see on the court's website, then where it would show up in his application.  All the while I was being honest and realistic about what could be done.  I even described to him data that might not exist in the electronically filed documents, or fields that might be unreliable, where he would have to manually edit and fix entries.  The next opportunity to 'pitch' this was with a prototype using a pilot court case.  My prototype made a point of showing the court's website in one window and the custom application's data in a second window.  The visibility wasn't a technical requirement, but it showed the process.

If I recall, the prototype was more functional than it needed to be.  The pilot court case was known well in advance, and many features could have been hard-coded.  We had already started down the road of using a bevy of patterns and practices that would eventually be required.  Since multiple courts had their own formats, interfaces were used: files were imported from the web, translated to XML, then deposited into a data layer.  But it was the narrative told through the prototype, not the detailed flowcharts, that helped the client visualize his business operating with my software solution in place and gave him confidence in our ability to support him.

The Abstract Idea
The third approach was just starting with an idea, a sort of central theory around which a need or opportunity exists.  Starting with the abstract idea is really hit-or-miss.  I think you inherently have to take a lot of swings and expect just a few to gain any traction.  How many back-of-the-napkin ideas have you discussed with a client or customer?  How many of those gained any traction?  Of those, how many ended up being funded projects?

My experience is that of the many of these I get out there, most come to nothing.  In other cases, I may not be the one selected to pursue the idea.  A good attitude to take when you are involved in this idea generation, if the customer or client pursues it some other way, is to think 'great for them; they'll probably come to me when they want to do X project or have Y need'.

One idea that I pitched and pursued was for an organization that had a ton of headaches around employees transferring between work assignments.  One of the headaches was that the employees' user names contained an acronym indicating the location at which they worked.  To keep this manageable, the user names had to be changed upon transfer.  This disrupted the employee's access to any other systems to which they needed rights. (SSO was some ways off.)  I was the supervisor of a technical support unit for one of the locations, so I had some ability to influence.  The idea was: what if the employee's account name was just their first and last name?  With my team, we filled in the gaps around the idea: here's what we do if two employees have the same name; here's how we handle employees with special accounts.  Since we controlled at least the user names of new employees at our location, we put it in place with local management support, and it later became a standard for the entire organization.  Now, this isn't that 'big' of an idea, and undoubtedly I wasn't the first or only one to have it, but it should show how starting your 'pitch' around an abstract idea can actually be put in motion.

Is pitching a movie script like pitching a software solution?

(Note: in the Nerdist podcast, the specific discussion about the pitch techniques starts at about 1:26:00.  The British sailor story pitch starts around 45:00)

Wednesday, November 14, 2012

Anatomy of an SSIS Package InfoPath Import

Notes: This is the draft of a sample project and code article being developed to document a solution for collecting disparate values via a Microsoft Office InfoPath form deployed on a SharePoint server.   The data model may remind you of structures referred to as 'key-value pairs' or 'entity-attribute-value', although the focus here was on meeting a business need rather than strict adherence to a pattern.

This has been core to my work, but I believe the generic pattern you see may have some use to others.  I could take this topic in many directions:
·         upstream to the mechanics of deploying this form in a SharePoint server
·         downstream to the database structures for working with this data all the way to incorporating it into performance metrics and a data warehouse, or
·        across to the management of multiple InfoPath forms using this model
Since I'm not yet decided where to start the sample project, this is an anatomy of the 'middle': the place where the data crosses from the user-interface-oriented structures of InfoPath to the raw data that can be used in further database processes.

Preparation Step
Creates the Working folder and Archive folder if they don't already exist

For Each Form
The 'Prepare XML File' container prepares the XML file generated by submitting an InfoPath form for access as a tabular data set, suitable for importing into a relational database.   The XML file is progressively transformed, and the output of each step is stored in the working folder.  This enables instrumentation and troubleshooting.

The steps include:
"Copy File Into Working Space" [File System Task]: Copies the current file being processed, as is, into the working directory.
"Strip Namespaces" [XML Task]: Removes the custom namespace, called 'my', generated by the InfoPath form submittal.  This is accomplished through a short XSLT script.
"Wrap in a table element" [XML Task]: Encapsulates the default 'myFields' node in a node called 'myTable'.   This allows the later data flow task to see the file as a dataset with each InfoPath field as a column.  Without this, each column appears as a separate data set. This is accomplished through a short XSLT script.
"Pivot to Key-Values" [XML Task]: In this solution, the exact names and quantity of the fields vary over time or are unknown at design time.  This step converts every InfoPath field into a 'field' record in the file, making the true InfoPath field name into an attribute.  It then can be imported as data rather than being part of the metadata.
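As an illustration of the 'Strip Namespaces' step, here is a sketch of the kind of short XSLT involved (the package's actual script may differ): it re-creates each element by its local name, dropping the 'my' namespace:

```xml
<?xml version="1.0"?>
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="xml" indent="yes"/>
  <!-- Re-create every element by local name only, dropping the 'my' namespace -->
  <xsl:template match="*">
    <xsl:element name="{local-name()}">
      <xsl:apply-templates select="@*|node()"/>
    </xsl:element>
  </xsl:template>
  <!-- Copy attributes through by local name; text nodes pass through by default -->
  <xsl:template match="@*">
    <xsl:attribute name="{local-name()}"><xsl:value-of select="."/></xsl:attribute>
  </xsl:template>
</xsl:stylesheet>
```

The 'Wrap in a table element' and 'Pivot to Key-Values' steps follow the same shape: each is a small single-purpose XSLT applied by an XML Task, with its output written to the working folder.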

The 'Import Raw Values to Output' container actually imports from the file into a data destination such as a relational database in the 'Import Raw Values' [Data Flow Task].   This is how the values from multiple InfoPath forms get to a single location against which further queries can be written.  Since the form was earlier pivoted to key values, the InfoPath field name is one of the columns that is available to import.

The last step 'Archive the File' [File System Task] completes the processing of that form.  The form is moved to a subdirectory of the Directory being processed, renamed with a timestamp that can be used to examine the contents of the file at that point-in-time.  That subdirectory is compressed or archived in a separate process.

Connection Managers

The connection managers for the package support the InfoPath XML translation.  The output of each step, and then the final transformed copy is output to the file system to ease troubleshooting.  Each of the separate XSLT scripts that does some transformation to the file is a file connection.  This allows the XSLT to be housed in a shared source directory. 

For the sample code, the 'output' and 'outputtext' connection managers exist to demonstrate exporting multiple InfoPath forms to flat files.  In production, this would be a relational database such as SQL Server.

Tuesday, November 6, 2012

SQL Server Agent Job - Failed Stats due to negative duration

I have some jobs that collect statistics from SQL Server Agent job history, looking for spikes and trends in data integration.  Over the weekend, those stat-collection jobs failed.
It turns out there were jobs that reported a dramatic 'negative' duration.
The cause: those jobs had started at 2:00 AM Sunday morning, and then the Daylight Saving Time change kicked in.  A job finishes 4 minutes later, but by then the wall clock reads ~56 minutes earlier than the start time.  For some reason this gets reported as ~-3976 days, 21 hours…   Then an hour later, it's 2:00 AM again, and the job runs normally.
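The stat-collection query itself isn't shown here, but a minimal sketch of the kind of guard that avoids the failure, assuming the statistics come from msdb.dbo.sysjobhistory (a real table; the decoding and filter below are illustrative, not my actual job):

```sql
-- run_duration in sysjobhistory is an integer encoded as hhmmss.
-- Decode it to seconds and skip rows distorted by the clock change.
SELECT j.name AS job_name,
       h.run_date,
       (h.run_duration / 10000) * 3600
         + ((h.run_duration / 100) % 100) * 60
         + (h.run_duration % 100)          AS duration_seconds
FROM msdb.dbo.sysjobhistory AS h
JOIN msdb.dbo.sysjobs AS j
  ON j.job_id = h.job_id
WHERE h.step_id = 0            -- step 0 is the job outcome row
  AND h.run_duration >= 0;     -- guard against the DST anomaly described above
```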
Fun stuff!

Tuesday, January 3, 2012

Maxim - Communicate in the 2nd person

Communicate in the 2nd person.
As the diagram reads, this reminder to communicate in the 2nd person so you focus on the recipient of the message.  What will the reader want to know? What's important to them?  Not 100% recalling how this was arrived at, it could be a tangential application of any number of Dale Carnegie' topics, but actually is more likely due to a pet peeve of seeing emails overwhelmed with "I's.