Automatic SQL Server statistics creation
When it is necessary to manually add statistics objects to tables and indexes in Microsoft SQL Server.
Statistics can be created and updated automatically, depending on conditions set at the database level. Sometimes, however, you also need to add statistics objects to tables and indexes in Microsoft SQL Server manually.
In the first article on distribution statistics, I gave an overview of what statistics are, why they are important, and which database-level settings control the automatic creation and updating of statistics. Now it is time to see what the automatic operations actually produce and when it makes sense to add statistics objects to tables and indexes in Microsoft SQL Server manually.
Why do we need statistics?
The previous article explained that statistics provide metadata about data values. This metadata is used to compile optimal execution plans for the queries end users send to the database. If you run into performance problems on a SQL Server instance, many of them can be traced to the query optimizer making poor choices when building execution plans, and those plans are poor because they were built without an accurate understanding of the data values.
Statistics describe what values exist in the data domain, how those values are distributed, how large they are, and how many of them there are. This information helps the query optimizer (QO) choose the most effective seeks, scans, lookups, and join types, and determine how much memory to grant for processing results.
Statistics can be created and updated automatically, depending on conditions set at the database level. They are also generated automatically when indexes are created or rebuilt (but not when they are reorganized). The first article in this series covers those settings and the thresholds that determine when statistics are updated so that their distribution metadata stays current. This article looks at two related questions: how to see the objects that are created automatically, and when it makes sense to create statistics objects manually.
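As a quick point of reference, the database-level settings discussed in the first article can be checked with a query along these lines (a minimal sketch; Tech_Outbound is the database used throughout this series):
SELECT [name]
, is_auto_create_stats_on
, is_auto_update_stats_on
, is_auto_update_stats_async_on
FROM sys.databases
WHERE [name] = 'Tech_Outbound';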
Identifying statistics objects in the database
There are many ways to query SQL Server for the properties of existing statistics objects and the distribution information they contain, but we will focus on a simple identification process. You only need two simple queries for this: one against the sys.stats catalog view and one against the dynamic management function sys.dm_db_stats_properties.
The queries described below return basic "exists/does not exist" information along with other useful details about the statistics objects in your database. First, let's look at the approach using sys.stats, shown in Listing 1.
I join to another system catalog view, sys.objects, to resolve the name of the object the statistics are linked to (via object_id) and to filter the results down to user-created objects (using the is_ms_shipped column in sys.objects).
At this stage, no objects have been created yet in the Tech_Outbound database used in this article series, so no results are returned. I want a few base tables to examine statistics against, so first I create a numbers table using code prepared by Itzik Ben-Gan. Ben-Gan is the author of numerous articles, including many published in this journal, and this code is borrowed from his toolbox (Listing 2).
If you rerun the discovery query from Listing 1, the results now include a row produced by the creation of the primary key; the statistics name follows the automatic naming convention for the primary key. Creating the primary key also builds a clustered index by default, which in turn creates the underlying statistics for that index. Because every page must be read during the index build, the statistics are based on a full scan, which is the most complete picture of your data distribution you can get. Statistics are normally built from a sample, but since a full scan is performed anyway when an index is created or rebuilt, it is a convenient opportunity to get full-scan statistics (or to update statistics with a full scan when conditions allow).
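If you want to refresh statistics with a full scan yourself, outside of an index rebuild, the command is a one-liner; here is a sketch against the dbo.Numbers table created in Listing 2:
-- Update the statistics on dbo.Numbers by reading every row rather than a sample
UPDATE STATISTICS dbo.Numbers WITH FULLSCAN;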
You can query sys.dm_db_stats_properties to get more information than sys.stats provides. This dynamic management function (DMF) takes two parameters, object_id and stats_id, which are known from the results returned earlier by sys.stats. Using those values, and joining to sys.objects for descriptive information, we get the query shown in Listing 3.
The dynamic management function sys.dm_db_stats_properties lets us see when the statistics were last updated, along with the base row count of the object (in this case the 500,000 rows of dbo.Numbers), the number of rows sampled when the statistics were created or updated, the number of steps in the statistics histogram, the number of unfiltered rows that come into play with filtered statistics, and finally the modification_counter for the statistics object.
The modification_counter tracks the number of modifications made to the leading column of the object the statistics describe; in this case that is the primary key column of dbo.Numbers. For multi-column statistics, this value (like the step metrics) applies only to the leading column, because only the leading column of a multi-column statistics object has a histogram associated with it.
The results back up what was said earlier about the full scan performed when an index is created or rebuilt: compare the rows and rows_sampled columns from sys.dm_db_stats_properties. Since they are equal, 100% of the rows were read to build the statistics for the primary key of this table.
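The object_id value 565577053 in Listing 3 is specific to my copy of the database. A more portable variant (my own sketch, not one of the original listings) resolves the table by name and uses CROSS APPLY so that every statistics object on the table is returned without hard-coding IDs:
SELECT O.[name] AS [the_object_name]
, SP.stats_id
, SP.last_updated
, SP.rows
, SP.rows_sampled
, SP.steps
, SP.unfiltered_rows
, SP.modification_counter
FROM sys.stats AS S
INNER JOIN sys.objects AS O
ON S.[object_id] = O.[object_id]
CROSS APPLY sys.dm_db_stats_properties(S.[object_id], S.stats_id) AS SP
WHERE O.[name] = 'Numbers';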
Information on creating statistics
In the next couple of examples, I will show what happens to statistics creation when other commands related to creating tables and querying values are run. In the first article I went into detail about enabling AUTO_CREATE_STATISTICS at the database level, which allows SQL Server to create statistics objects when they are needed but do not yet exist. We will look at that behavior next.
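As a reminder, the setting itself is enabled with a single statement (shown here for the Tech_Outbound database used in this series; it is on by default in new databases):
ALTER DATABASE Tech_Outbound SET AUTO_CREATE_STATISTICS ON;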
I'm going to use the dbo.Numbers table to build another table that will be used later in this series for a deeper look at statistics.
If you create the table below and then rerun the sys.stats query (the first discovery query), you get the results shown in Screen 1, which confirm that simply creating a table does not generate any new statistics objects:
CREATE TABLE T0 (c1 INT NOT NULL,
c2 NCHAR(200) NOT NULL DEFAULT '#');
Screen 1. Results of rerunning the sys.stats query
We will use the dbo.Numbers table to load 100,000 rows with the value 1000 into column c1, plus one row with the value 2000. This table will come up again later when we consider the impact of cardinality, a measure of how unique the data values are, on execution plans and statistics (Listing 4).
At this point we still have not created any new statistics objects. Now, using the code in Listing 5, I create a nonclustered index on column c1.
Rerunning the sys.stats discovery query now shows an additional statistics object in the database (Screen 2).
Screen 2. Additional statistics object
This is to be expected; the same thing happened with the dbo.Numbers table. But what about the other side of automatic statistics generation? It comes into play when statistics are needed to work out how joins and search predicates should be processed while an execution plan is being built. At this point, distribution statistics exist only for the dbo.Numbers.n column (its primary key column) and for dbo.T0.c1, the column on which a nonclustered index was just created manually. To demonstrate the behavior, we need one more table:
CREATE TABLE T1 (c1 INT NOT NULL,
c2 NCHAR(200) NOT NULL DEFAULT 'SQL Cruise');
As shown earlier, this statement does not create any associated statistics objects, and there is no index on table T1 when I add a row using Listing 6.
But what happens if I run the following query?
SELECT T1.c1
FROM T0
INNER JOIN T1 ON T1.c1 = T0.c1;
The single result (2000) is returned, but what matters here is what happens to statistics. No statistics object existed for T1 until the query was executed (more precisely, until just before it was executed): the query optimizer determined that it had no understanding of the data needed to perform the join, so it created statistics while building the execution plan. The output now returned by the sys.stats discovery query is shown in Screen 3.
Screen 3. Output of the sys.stats discovery query
With AUTO_CREATE_STATISTICS set to ON, you will see statistics created for join columns whenever no statistics already exist with that column as the leading column.
The same behavior applies to filtering through a WHERE clause predicate. The two queries in Listing 7 do not create any statistics, because the query optimizer does not need any additional distribution information for either of them. For the first query, that is because we already have statistics thanks to the automatic creation triggered by the join we just executed, which produced a _WA_Sys_… object (the default naming convention for automatically generated statistics). The second query does not generate statistics because the optimizer can determine the result from the table definition alone, since the column does not allow NULL values.
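If you want to see only the automatically created objects and the columns they cover, a simple sketch against sys.stats (filtering on the auto_created flag) looks like this:
SELECT OBJECT_NAME(S.[object_id]) AS [the_object_name]
, S.[name] AS [the_stats_name]
, C.[name] AS [leading_column]
FROM sys.stats AS S
INNER JOIN sys.stats_columns AS SC
ON S.[object_id] = SC.[object_id]
AND S.stats_id = SC.stats_id
AND SC.stats_column_id = 1
INNER JOIN sys.columns AS C
ON SC.[object_id] = C.[object_id]
AND SC.column_id = C.column_id
WHERE S.auto_created = 1
ORDER BY [the_object_name];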
But what happens when the query optimizer needs more information about the distribution of values for a predicate? We get the automatic generation of statistics, as shown below:
SELECT T0.c2
FROM dbo.T0
WHERE c2 = '1';
This results in the creation of the statistics object highlighted in row 4 of Screen 4.
Screen 4. Automatic generation of statistics
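All of the examples so far have relied on automatic creation. When one of the situations mentioned at the start of the article calls for adding a statistics object by hand, the basic tool is CREATE STATISTICS; a minimal sketch against dbo.T0 (the name st_T0_c2 is just an illustration) might look like this:
-- Manually create single-column statistics on dbo.T0(c2), reading every row
CREATE STATISTICS st_T0_c2 ON dbo.T0(c2) WITH FULLSCAN;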
So far in this look at distribution statistics in Microsoft SQL Server, we have covered what statistics are, why they are important, and how to get basic information about the statistics objects in your database. You also now know when statistics are created automatically and when they are not.
In the next article we will look at the statistics histogram and at additional ways to collect detailed statistics metadata through sys.stats, sys.dm_db_stats_properties, and the DBCC SHOW_STATISTICS command, which we have not covered yet.
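For reference only (the details are left for the next article), the command takes a table and an index or statistics name, so against the index from Listing 5 a call would look something like:
DBCC SHOW_STATISTICS ('dbo.T0', ix_T0_1);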
Listing 1. Query using sys.stats
USE Tech_Outbound;
GO
SELECT O.[name] AS [the_object_name]
, S.object_id
, S.name AS [the_stats_name]
, S.stats_id
, S.auto_created
, S.user_created
, S.no_recompute
, S.has_filter
, S.filter_definition
, S.is_temporary
, S.is_incremental
FROM sys.stats AS S
INNER JOIN sys.objects AS O
ON S.[object_id] = O.[object_id]
WHERE O.is_ms_shipped = 0
ORDER BY O.[name];
Listing 2. Creating a table of numbers
--NUMBER TABLE
CREATE TABLE Numbers (n INT NOT NULL PRIMARY KEY);
GO
INSERT INTO dbo.Numbers(n)
SELECT rn FROM
(
SELECT ROW_NUMBER()
OVER(ORDER BY CURRENT_TIMESTAMP) AS rn
FROM sys.trace_event_bindings AS b1
, sys.trace_event_bindings AS b2
) AS rd
WHERE rn <= 500000;
SELECT * FROM dbo.Numbers;
Listing 3. Query against the dynamic management function sys.dm_db_stats_properties
SELECT O.[name] AS [the_object_name]
, S.object_id
, S.stats_id
, S.last_updated
, S.rows
, S.rows_sampled
, S.steps
, S.unfiltered_rows
, S.modification_counter
FROM sys.dm_db_stats_properties(565577053,1) AS S
INNER JOIN sys.objects AS O
ON S.[object_id] = O.[object_id]
WHERE O.is_ms_shipped = 0
AND O.[name] != 'foo'
ORDER BY O.[name];
Listing 4. Loading rows into the table
--INSERT 100,000 rows. All rows contain the value 1000 in column c1
INSERT INTO T0(c1)
SELECT 1000 FROM dbo.Numbers
WHERE n <= 100000;
--INSERT 1 row with the value 2000
INSERT INTO T0(c1)
VALUES (2000);
Listing 5. Creating a nonclustered index on column c1
--Create a nonclustered index on column c1
CREATE NONCLUSTERED INDEX ix_T0_1 ON T0(c1);
Listing 6. Adding a row to table T1
INSERT INTO T1(c1)
VALUES (2000)
Listing 7. Two queries that do not create statistics
SELECT T1.c2
FROM dbo.T1
WHERE c1 = 2000;
SELECT T0.c2
FROM dbo.T0
WHERE c2 IS NULL;