So, why do you need to index your tables?
Because without an index the SQL server has to scan the entire table to return the requested data. It is like the index page in a book. You check within the index for the keyword you want to learn about. From that point forward, you jump directly to the page where the content belongs, instead of scanning page by page for the material you want to read.
Similarly, a table’s index allows you to locate the data without the need to scan the entire table. You create instead indexes on one or more columns in a table, to aid in the query process. Consider the following example:
SELECT [computer_id],[nic_device_id],[nic_vendor_id],[nic_desc] FROM [eXpress].[dbo].[nics]
You are about to retrieve all computers with computer_id > 5100. The SQL server will have to scan the entire table in order to return the results, without the presence of an index. An index on the computer_id column will speed up this process, by sorting the column’s values.
Now, if you want to return all data where computer_id > 5100, the SQL server will know how to locate the first value greater than 5100. Why? Because the computer_id column is sorted, eliminating the need to scan the entire table, thus improving performance.
SQL Index Types
There are two main index types: Clustered index and Non-Clustered index.
A clustered index alters the way that the rows are physically stored. When you create a clustered index on a column (or a number of columns), the SQL server sorts the table’s rows by that column(s).
It is like a dictionary, where all words are sorted in an alphabetical order. Note, that only one clustered index can be created per table. It alters the way the table is physically stored, it couldn’t be otherwise.
In the example below, all rows are sorted by computer_id, as a clustered index on the computer_id column has been created.
CREATE CLUSTERED INDEX [IX_CLUSTERED_COMPUTER_ID] ON [dbo].[nics] ([computer_id] ASC)
A non-clustered index, on the other hand, does not alter the way the rows are stored in the table. Instead, it creates a completely different object within the table, that contains the column(s) selected for indexing and a pointer back to the table’s rows containing the data.
It is like an index in the last pages of a book. All keywords are sorted and contain a reference back to the appropriate page number. A non-clustered index on the computer_id column, in the previous example, would look like the table below:
CREATE NONCLUSTERED INDEX [IX_NONCLUSTERED_COMPUTER_ID] ON [dbo].[nics] ([computer_id] ASC)
Note, that SQL server sorts the indexes efficiently by using a B-tree structure. This is a tree data structure that allows SQL Server to keep data sorted, to allow searches, sequential access, insertions and deletions, in a logarithmic amortized time. This methodology minimizes the number of pages accessed, in order to locate the desired index key, therefore resulting in an improved performance.
Relation between clustered and non-clustered indexes
As explained above, a non-clustered index contains a pointer back to the rowID (RID), of the table, in order to relate the index’s column with the rest of the columns in a row.
But this is not always the case:
If a clustered index already exists in the table, the non-clustered index uses the clustered index’s key as the row locator, instead of the RID reference.
In the example below, when a non-clustered index is created on the computer_id column and a clustered index already exists on the nic_desc, the non-clustered index would look like the table below.
|5000025||VMware Accelerated AMD PCNet Adapter|
|5000031||Intel(R) PRO/100 VE Network Connection|
|5000045||Broadcom 440x 10/100 Integrated Controller|
|5000046||Broadcom NetXtreme 57xx Gigabit Controller|
|5000055||VMware Accelerated AMD PCNet Adapter|
|5000059||Broadcom 440x 10/100 Integrated Controller|
Index Benefits and Side Effects
A table without a clustered-index is called a “heap table”. A heap table has not its data sorted. The SQL server has to scan the entire table in order to locate the data, in a process called a “scan”.
In the case of a clustered index, the data are sorted on the key values (columns) of the index. The SQL server is now able to locate the data by navigating down from the root node, to the branch and finally to the leaf nodes of the B-tree structure of the index. This process called a “seek”. The later approach is much faster, when you want to filter or sort the data you want to retrieve.
A non-clustered index, on the other hand, is a completely different object in the table. It contains only a subset of the columns. It also contains a row locator looking back to the table’s rows, or to the clustered index’s key.
Because of its smaller size (subset of columns), a non-clustered index can fit more rows in an index page, therefore resulting to an improved I/O performance. Furthermore a non-clustered index can be allocated to a different FileGroup, which can utilize a different physical storage in order to improve performance even more.
The side effects of indexes are related to the cost of INSERT, UPDATE, MERGE and DELETE statements. Such statements can take longer to execute, in the presence of indexes, as it alters the data in the table, thus to the indexes too.
Imagine the situation of an INSERT statement. It has to add new rows in a table with a clustered index. In such case the table rows may need to be re-positioned! Remember…? The clustered index needs to order the data pages themselves! This will cause overhead.
So, it is crucial to take into account the overhead of INSERT, UPDATE and DELETE statements before designing your indexing strategy. Although there is an overhead to the above statements, you have to take into account, that many times, an UPDATE or DELETE statement will execute in a subset of data. This subset can be defined by a WHERE clause, where indexing may outweigh the additional cost of index updates, because the SQL server will have to find the data before updating them.
As explained above, a non-clustered index includes the clustered index’s key as its row locator, in the presence of a clustered index in the table.
This comes with a cost and a benefit:
The cost has to do with the non-clustered index bookmark lookup. What if a query has to return more columns that the ones hosted in the index itself? In the case of a HEAP table, the SQL server would have to check the RID of the non-clustered index, in order to navigate directly to the row, where the rest of the columns belong.
In the case of a clustered index, the SQL server would have to check the row locator of the non-clustered index, in order to do an additional navigation to the B-tree structure of the clustered index, to retrieve the desired row. You see, the row locator does not contain the RID, but the clustered-index key.
On the other hand, there is a benefit. It has to do with the clustered index updates. Imagine the following situation: Two new rows with index key values of A2 and A3 have to be added in the clustered index below.
Because this is a clustered index page, its physical structure has to be reallocated in order to fit A2 and A3 between A1 and A4. It has to maintain index’s order. Since there is no free space in the index page to accommodate these changes, a page split will occur. Now, there is enough space to fit A2 and A3 between A1 and A4.
The goal achieved and the order maintained within the index. But imagine what would have happened if the non-clustered index was looking at the RID, instead of the clustered index’s key? It would have to change its row locators to reflect the changes. This could have been a huge performance hit! Especially, in the case of large clustered indexes.
Instead of the RID, the row locators now point at the clustered index key. Meaning, that there is no longer needed to change its values. This is quite a benefit if you think of the large clustered indexes, that are usually maintained in many tables.
Clustered indexes VS non-clustered indexes
|PROS|| || |
|CONS|| || |