Which columns should you choose for indexing?
To decide on the columns of an index you have to consider the most frequent queries running against your tables. You have to examine the WHERE and JOIN clauses (ORDER BY and GROUP BY also benefit from indexes), since SQL query optimizer will try to find the best way to return the results by using a complex mechanism to evaluate available indexes and statistics. An over simplified diagram of the query optimizer’s decision making process is displayed below.
Given that query optimizer decides on the fate of your query execution plan, it is important to maintain useful indexes and current statistics. It is not necessary for the query optimizer to use your indexes, if it finds them costly, even if those indexes are on columns stated in WHERE, JOIN, ORDER BY and GROUP BY statements. Consider the following example; Table APPLICATION has a total of 13,235 rows and a clustered index on app_id column and a non-cluster index on computer_id column.
SELECT computer_id, app_id, name FROM application
The above statement would cause the query optimizer to use a clustered index scan to return the results, since there are no filtering criteria. Consider a WHERE clause on the clustered index key (app_id)
SELECT computer_id,app_id,name FROM application WHERE app_id > 300
Since there is a WHERE clause in the statement, based on the clustered index key, the query optimizer decides that a clustered index seek is better than a clustered index scan to return a result set of 1,729 rows out of 13,235. A WHERE clause on the non-clustered index computer_id would behave differently.
SELECT computer_id,app_id,name FROM application WHERE computer_id = 5000025
Although a non-clustered index exists on the column computer_id defined by the WHERE clause, the query optimizer decided not to use it to return 428 rows out of 13,235 because its bookmark lookup behavior would cause jumps from page-to-page and would be costlier than the scanning of the entire clustered index. Consider the same query with a different value after the equation.
SELECT computer_id,app_id,name FROM application WHERE computer_id = 6000000
Query optimizer and statistics
While query optimizer asses the fate of the execution plans based on statistics, it is quite logical for SQL server to create statistics automatically on the key columns of an index when it is created. The index statistics creation is mandatory and not to be confused with the AUTO_CREATE_STATISTICS option (ON by default) which instructs the query optimizer to automatically create statistics on single non-indexed columns defined on query predicates (JOIN and WHERE clauses). Nevertheless the update of those statistics can be configured to not be updated automatically by using an ALTER DATABASE statement with the AUTO_UPDATE_STATISTICS option. If automatic statistics update are left to the default setting (ON), the query optimizer, in order to optimize CPU utilization, will update statistics based on the following conditions: a table without rows gets a row, a table has fewer than 500 rows and updates 500+ rows, a table has more than 500 rows and updates 500+20% of the number of total rows.
Statistics can be configured to automatically updated either synchronously (default) or asynchronously. In synchronous updates, the query optimizer is always compiles and executes the queries with up-to-date statistics as opposed to asynchronous where only subsequent queries will take benefit of the up-to-date results that completed asynchronously. As a best practice leave statistics creations and updates ON and to SYNC mode. Consider asynchronous updates when your application experiences request time outs waiting for updated statistics or when your application frequently uses the same or similar queries and cached query plans.
Although it is not recommended, you can turn off the automatic statistics update and creation process. In such cases you can use the CREATE STATISTICS and UPDATE STATISTICS statements to maintain statistics as needed.
All information about statistics objects are contained in the sysindexes table and in a statistics binary large object (statblob). Statistics information include the number of rows in the table or index, the number of pages occupied, number of rows modified after last statistics updates, time of statistics collection, number of rows used to produce the histogram, average key length, single-column histogram with the number of steps, estimated number of rows matching the filter, etc.
A histogram is a set of up to 200 values of a column. The sample or the complete set of column values are sorted and divided into up to 199 intervals so that the most statistically significant information is captured. The information on the range of index key values between two consecutive samples is called a step and contains information about:
- RANGE_HI_KEY: the most significant value of a step
- EQ_ROWS: the number of rows equal to RANGE_HI_KEY
- RANGE_ROWS: the range of rows between the previous RANGE_HI_KEY and the current RANGE_HI_KEY
- AVG_RANGE_ROWS: The average number of rows per distinct value within RANGE_ROWS
- DISTINCT_RANGE_ROWS: the number of distinct rows within RANGE_ROWS
Why query optimizer behaved differently in the above example?
Lets analyze the statistics of the application table to understand the different behavior. To find out the available statistics within application table:
The results indicate that two objects have been created; one for my clustered index CIX_APP_ID on column app_id and another for my non-clustered index IX_COMPUTER_ID on column computer_id. To gather some more information about these indexes:
SELECT name, indid, minlen, xmaxlen, dpages, reserved, used, rowmodctr,rows FROM sysindexes WHERE name='CIX_APP_ID' OR name ='IX_COMPUTER_ID'
The results are indicating that CIX_APP_ID is a clustered index (indid=1) while IX_COMPUTER_ID is a non clustered (indid>1). Also there are some useful statistics regarding the minimum and maximum length of a row (minle