SQL Indexing and SQL Performance Part 4: Design Considerations
In the previous part of the blog post series you’ve seen that the query optimizer examines query predicates in order to analyze statistics for useful indexes in the execution plan decision process. Since indexes are quite important to the query optimization process, the first design tip is to consider WHERE, JOIN, MERGE, ORDER BY and GROUP BY clauses in your application queries to determine the most suitable columns for indexing in your tables.
Use narrow and arithmetic data type indexes
Taking into consideration the way that SQL server stores and retrieves data, it is quite reasonable to maximize your performance if you can fit as many rows as possible to a single index page. This would improve IO performance, DB caching and storage allocation space since fewer pages would have to be read, stored and cached by the system.
To achieve fewer pages per index you have to decrease index row size, meaning that you would use as fewer columns as possible and as smaller column types as possible. An index page in MS SQL is always 8KB long. If you use a single column index of INT type (4 bytes), an index page could store more than 2000 rows as opposed to a CHAR(200) type (200 bytes) that could less than 40 rows. In a situation were a table has 3500 rows the INT column index would allocate approximately 2 index pages while the CHAR(200) would allocate 88!
System CPU performs arithmetic and not string operations, thus it needs to convert strings to something understandable in order to process it. Have this in mind when designing your index keys strategy since an arithmetic type index key like INT is usually faster than a string type key like CHAR, although you can configure CHAR(4) to have the same size as INT.
Column order matters
In a composite index key, column order matters. When you create an index on two columns (A,B), the index is sorted on column A and then sub-sorted on column B. If you execute a query with a predicate on column B the query optimizer won’t benefit from the index much, because the index has been constructed with column A, as the first key column, although column B is part of the key.
Because useful indexes are the ones with high selectivity, it is best to set the most distinctive column first, followed by the next distinctive column, etc.
To better understand the above arguments, consider the following example; Table application consists of two columns: app_id and computer_id. The app_id column has a higher selectivity than computer_id and a non-clustered index IX_APP_ID_COMPUTER_IX has been created on app_id (first) and computer_id.
GOCREATEINDEX IX_APP_ID_COMPUTER_ID ON
(app_id ASC, computer_id ASC);
GOGOSELECT [app_id],[computer_id] FROM [application]
WHERE [computer_id] = 5000025;
The WHERE clause caused the scan of the entire index because the first column defined as index key was app_id and not computer_id. If you filter on a subset of app_id column you get an index seek instead of an index scan
SELECT [app_id],[computer_id] FROM [application]
WHERE [app_id] > 241
Examine the selectivity of your indexes
For an index to be useful, it needs to return as fewer rows as possible in a query predicate. If your query returns a large number of rows the query optimizer might not use the index at all, resulting in a costly query execution plan as explained in the previous part of the series. Therefore consider columns with high percentage of distinct values (high selectivity) for your index keys. You can determine the selectivity of a column if you divide the COUNT of DISTINCT values of a column with the COUNT of values of the same column.
SELECTCOUNT(app_id) AS APPcount,
COUNT(DISTINCT app_id) AS D_APPcount,
COUNT(computer_id) AS COMPUTERcount,
COUNT(DISTINCT computer_id) AS D_COMPUTERcount,
The higher the resulting number, the higher the selectivity of the column, thus a better candidate for an index key. In the above example the app_id column is a better candidate than computer_id since APPSelectivity value is higher than COMPSelectivity value.
What about wildcard filtering and OR operators in predicates?
Wildcard predicates on string filters and OR operators can be quite tricky in terms of index usage. An OR operator in a WHERE clause could render an index useless if one of the filters caused a larger result set, even if both predicates are on indexed columns. Consider the following example were application table has three columns: app_id, computer_id and name, a clustered index on computer_idCIX_COMPUTER_ID, a non-clustered index on app_id IX_APP_ID and a non-clustered index on name IX_NAME.
SELECT [app_id],[name] FROM [application]
WHERE [app_id] = 19
SELECT [name],[app_id] FROM [application]
WHERE [name] LIKE ('perl%')
SELECT [app_id],[name] FROM [application]
WHERE [app_id] = 19 OR [name] LIKE ('perl%')
Although both WHERE clauses individually caused an index seek operation, when combined together with an OR operator, they caused a clustered index scan, therefore rendering both non-clustered indexes useless.
LIKE operators with wildcards on string indexes can also become tricky. If you set a wildcard character in the beginning of a string, for example LIKE(‘%perl’), query optimizer will probably perform a scan and not an index seek, since it does not know from where to start seeking in the sorted columns(anything can match before the perl string). A common example of such wildcard uses are email addresses. In such cases you may need to filter all @gmail.com addresses thus use a LIKE (‘%@gmail.com’) filter. To successfully utilize an index in such cases you would have to create a REVERSE INDEX on the affected column, a feature not currently supported by MS SQL SERVER (right now is at 2008 R2 version). Nevertheless, filters like string% or string%string% may utilize index more appropriately, since query optimizer knows from where to start seeking in the index keys.
Check if you can avoid bookmark lookup cost with covering indexes
As explained in part 2 of the series, a non-clustered index has an overhead associated when retrieving results that include columns not part of the index, because of the lookup that is required to the table RID or to the clustered key value. There is a special situation though, where the bookmark lookup operation won’t be utilized. If you manage to include all columns (as key or as included columns), that are returned by a query, to your non-clustered index, the query optimizer will return the results from the index itself without any further lookup. This technique is called “covering index”.
A covering index can also be produced by the join of two non-clustered indexes. For example, if you setup a non-clustered index on column A and another non-clustered index on column B and execute a select statement returning these two rows, the query optimizer is smart enough to combine these two indexes and return the results like in a “covering index”.
Examine opportunities to create filtered indexes
There is a special type of non-clustered index, the filtered index. To create a filtered index, you have to specify, along with your columns, a WHERE clause with the filtering criteria. Filtered indexes are quite useful because they can increase the selectivity of a key column that may had not good selectivity otherwise. A classic example is a column with many NULL values. Creating an index by filtering out all the NULL values, could increase the efficiency of the index. Another example could be a column with distinct ranges of values; In such case you could create multiple indexes with filters on these categories.
CREATENONCLUSTEREDINDEX IX_DESCRIPTION_FILTERED ON [application]
(description ASC) WHERE description ISNOT NULL
CREATENONCLUSTEREDINDEX IX_COMPUTER_ID_FILTERED ON [application]
(computer_id ASC, category_id ASC) INCLUDE (description) WHERE [category_id] >= 1 AND [category_id] <= 15
Since filtered indexes are operating in a subset of data, they have some significant advantages over full-table indexes:
Improved query performance because of their smaller size
May cost less on INSERT, DELETE and UPDATE statements. If a DML statement alters data that does not affect the range of the filtered index, then that index needs not to be updated.
Because of their smaller size they occupy less storage space
Decide on the index type
You have to decide on the type of the indexes you want to create and there are some best practices you can follow. As explained in part 2 of the series, a table can have only one clustered index and multiple non-clustered. A clustered index is best suited when you need to return large data result sets or presorted results, since table’s rows are physically sorted within the clustered index key order. Frequent changes to the index key columns can be quite problematic for clustered indexes, since DB engine needs to keep the data values of a row in physical order. This can produce page splits and significant IO overhead, especially in high-volume transaction processing systems, causing a serious performance hit. Note that clustered indexes are NOT the best candidates for wide index keys, since they affect the size of all non-clustered indexes in the table, because non-clustered indexes contain the clustered index key value as their row locator.
Non-clustered indexes, from the other side, can be used on columns that undergo frequent updates. Wider keys can also be used, although not recommended, since their keys do not affect other indexes in the table. They are best suited for queries that do not return large data sets and especially in queries that frequently filter data that return exact matches (like WHERE clauses). Covering and filtered indexes are excellent candidates for query performance improvements and should be thoroughly examined.
Consider Index fill factor
An option you need to consider when creating or rebuilding indexes is the fill factor. The fill factor option sets the percentage of free space of leaf index pages in the affected index. For example, a fill factor of 70 means that 30% of each leaf level page will be left empty. Fill factor is a method to minimize page splits caused by intermediate row inserts and row updates. As explained in part 1 of the series, in a full-filled index page, an intermediate update would cause a page split in order to make room for the updated rows as DB engine has to maintain the rows order. If indexes are prepopulated with some free space, these page splits can be reduced, thus improving performance.
You need to take into account though, that since fill factor reserves some free space to the end of the leaf level index pages, more pages may need to be allocated to store index columns, thus decreasing read performance. You need to examine carefully the frequency of intermediate rows update in order to figure out an appropriate fill factor. On an IDENTITY column were the key for new rows is always increasing, therefore the index rows are always added to the end of the index, page splits may still occur when existing rows are updated with data that lengthens the existing rows. In such cases it is wise to use a fill factor less than 100 to minimize page splits effect.
WITH (FILLFACTOR = 80, PAD_INDEX = ON)