Indexes in databases are very similar to indexes in libraries. Indexes allow locating information within a database fast, much like they do in libraries. If all books in a library are indexed alphabetically then you don’t need to browse the whole library to find particular book. Instead you’ll simply get the first letter from the book title and you’ll find this letter’s section in the library starting your search from there, which will narrow down your search significantly.
Indexes are created on columns in tables or views. The index provides a fast way to look up data based on the values within those columns. For example, if you create an index on the primary key and then search for a row of data based on one of the primary key values, SQL Server first finds that value in the index, and then uses the index to quickly locate the entire row of data. Without the index, a table scan would have to be performed in order to locate the row, which can have a significant effect on performance.
You can create indexes on most columns in a table or a view. The exceptions are primarily those columns configured with large object (LOB) data types, such as image, text, and varchar(max). You can also create indexes on XML columns, but those indexes are slightly different from the basic index and are beyond the scope of this article. Instead, I’ll focus on those indexes that are implemented most commonly in a SQL Server database.
Another definition from CodeProject.com:
Index is a database object, which can be created on one or more columns (16 Max column combination). When creating the index will read the column(s) and forms a relevant data structure to minimize the number of data comparisons. The index will improve the performance of data retrieval and adds some overhead on data modification such as create, delete and modify. So it depends on how much data retrieval can be performed on table versus how much of DML (INSERT, DELETE and UPDATE
Binary Tree (B-Tree):
A B-tree consist of a root node that contains a single page of data, zero or more intermediate levels containing additional pages and leaf level.
We can also say that the index is made up of a set of pages (index nodes) that are organized in a B-tree structure. This structure is hierarchical in nature, with the root node at the top of the hierarchy and the leaf nodes at the bottom … as shown in the next figure:
Figure No. 1:
When a query is issued against an indexed column, the query engine starts at the root node and navigates down through the intermediate nodes, with each layer of the intermediate level more granular than the one above. The query engine continues down through the index nodes until it reaches the leaf node. For example, if you’re searching for the value 123 in an indexed column, the query engine would first look in the root level to determine which page to reference in the top intermediate level. In this example, the first page points the values 1-100, and the second page, the values 101-200, so the query engine would go to the second page on that level. The query engine would then determine that it must go to the third page at the next intermediate level. From there, the query engine would navigate to the leaf node for value 123. The leaf node will contain either the entire row of data or a pointer to that row, depending on whether the index is clustered or non-clustered.
The Query Optimizer: Query Optimizer indexes to reduce operations of disk input-output and using of system resources when we fire query on data. Data manipulation Query statements (like SELECT, DELETE OR UPDATE) need indexes for maximization of the performance. When Query fires the most efficient method for retrieval of the data is evaluated among available methods. It uses table scans or index scans.
Table scans uses many Input-output operations, it also uses large number of resources as all rows from the table are scanned.
Index scan used for searching index key columns to find storage location.
Indexes may be either Clustered or Non-Clustered !!
A clustered index stores the actual data rows at the leaf level of the index. Returning to the example above, that would mean that the entire row of data associated with the primary key value of 123 would be stored in that leaf node. An important characteristic of the clustered index is that the indexed values are sorted in either ascending or descending order. As a result, there can be only one clustered index on a table or view. In addition, data in a table is sorted only if a clustered index has been defined on a table.
Note: A table that has a clustered index is referred to as a clustered table. A table that has no clustered index is referred to as a heap.
Unlike a clustered indexed, the leaf nodes of a non-clustered index contain only the values from the indexed columns and row locators that point to the actual data rows, rather than contain the data rows themselves. This means that the query engine must take an additional step in order to locate the actual data.
A row locator’s structure depends on whether it points to a clustered table or to a heap. If referencing a clustered table, the row locator points to the clustered index, using the value from the clustered index to navigate to the correct data row. If referencing a heap, the row locator points to the actual data row.
Non-clustered indexes cannot be sorted like clustered indexes; however, you can create more than one non-clustered index per table or view. SQL Server 2005 supports up to 249 non-clustered indexes, and SQL Server 2008 support up to 999. This certainly doesn’t mean you should create that many indexes. Indexes can both help and hinder performance, as I explain later in the article.
In addition to being able to create multiple non-clustered indexes on a table or view, you can also add included columns to your index. This means that you can store at the leaf level not only the values from the indexed column, but also the values from non-indexed columns. This strategy allows you to get around some of the limitations on indexes. For example, you can include non-indexed columns in order to exceed the size limit of indexed columns (900 bytes in most cases).
In addition to an index being clustered or nonclustered, it can be configured in other ways:
- Composite index: An index that contains more than one column. In both SQL Server 2005 and 2008, you can include up to 16columns in an index, as long as the index doesn’t exceed the 900-byte limit. Both clustered and non-clustered indexes can be composite indexes.
- Unique Index: An index that ensures the uniqueness of each value in the indexed column. If the index is a composite, the uniqueness is enforced across the columns as a whole, not on the individual columns. For example, if you were to create an index on the FirstName and LastName columns in a table, the names together must be unique, but the individual names can be duplicated.
A unique index is automatically created when you define a primary key or unique constraint:
- Primary key: When you define a primary key constraint on one or more columns, SQL Server automatically creates a unique, clustered index if a clustered index does not already exist on the table or view. However, you can override the default behavior and define a unique, non-clustered index on the primary key.
- Unique: When you define a unique constraint, SQL Server automatically creates a unique, non-clustered index. You can specify that a unique clustered index be created if a clustered index does not already exist on the table.
- Covering index: A type of index that includes all the columns that are needed to process a particular query. For example, your query might retrieve the FirstName and LastName columns from a table, based on a value in the ContactID column. You can create a covering index that includes all three columns.
How index works generally ???
The columns specified in the CREATE INDEX T-SQL command taken by the database engine and sorts the values in Binary Tree (B-Tree) data structure. B-Tree structure supports faster search with minimum dist reads, and allows the database engine to find quick start and end point for the stated query.
The database takes the columns specified in a CREATE INDEX command and sorts the values into a special data structure known as a B-tree. A B-tree structure supports fast searches with a minimum amount of disk reads, allowing the database engine to quickly find the starting and stopping points for the query we are using.
Conceptually, every index entry has the index key. Each entry also includes a references to the table rows which share that particular value and from which we can retrieve the required information.
It is much similar to the back of a book helps us to find keywords quickly, so the database is able to quickly narrow the number of records it must examine to a minimum by using the sorted list of Key values stored in the index. Thus we avoid a table scan to fetch the query results. Following some of the scenarios where indexes offer a benefit.
[Copied from CodeProject.com]
Create the following tables by run this T-SQL command:
CREATE TABLE Student (StudID smallint, StudName varchar(50), Class tinyint);
CREATE TABLE TotalMarks (StudentID smallint, TotalMarks smallint);
How clustered index works ??
When creating the clustered index, SQL server 2005 reads the StudID column and forms a Binary tree on it. This binary tree information is then stored separately in the disc. Expand the table Student and then expand the Indexes. You will see the following index created for you when the primary key is created:
With the use of the binary tree, now the search for the student based on the StudID decreases the number of comparisons to a large amount. Let us assume that you had entered the following data in the table student:
The index will form the below specified binary tree. Note that for a given parent, there are only one or two Childs. The left side will always have a lesser value and the right side will always have a greater value when compared to parent. The tree can be constructed in the reverse way also. That is, left side higher and right side lower.
SELECT * FROM student WHERE studid = 103;
SELECT * FROM student WHERE studid = 107;
Execution without index will return value for the first query after third comparison.
Execution without index will return value for the second query at eights comparison.
Execution of first query with index will return value at first comparison.
Execution of second query with index will return the value at the third comparison. Look below:
Compare 107 vs 103 : Move to right node
Compare 107 vs 106 : Move to right node
Compare 107 vs 107 : Matched, return the record
If numbers of records are less, you cannot see a different one. Now apply this technique with a Yahoo email user accounts stored in a table called say YahooLogin. Let us assume there are 33 million users around the world that have Yahoo email id and that is stored in the YahooLogin. When a user logs in by giving the user name and password, the comparison required is 1 to 25, with the binary tree that is clustered index.
Look at the above picture and guess yourself how fast you will reach into the level 25. Without Clustered index, the comparison required is 1 to 33 millions.
Got the usage of Clustered index? Let us move to Non-Clustered index.
How does a Non-Clustered index work?
A table can have more than one Non-Clustered index. But, it should have only one clustered index that works based on the Binary tree concept. Non-Clustered column always depends on the Clustered column on the database.
This can be easily explained with the concept of a book and its index page at the end. Let us assume that you are going to a bookshop and found a big 1500 pages of C# book that says all about C#. When you glanced at the book, it has all beautiful color pages and shiny papers. But, that is not only the eligibility for a good book right? One you are impressed, you want to see your favorite topic of Regular Expressions and how it is explained in the book. What will you do? I just peeped at you from behind and recorded what you did as below:
- You went to the Index page (it has total 25 pages). It is already sorted and hence you easily picked up Regular Expression that comes on page Number 17.
- Next, you noted down the number displayed next to it which is 407, 816, 1200–1220.
- Your first target is Page 407. You opened a page in the middle, the page is greater than 500.
- Then you moved to a somewhat lower page. But it still reads 310.
- Then you moved to a higher page. You are very lucky you exactly got page 407. [Yes man you got it. Otherwise I need to write more. OK?]
- That’s all, you started exploring what is written about Regular expression on that page, keeping in mind that you need to find page 816 also.
In the above scenario, the Index page is Non-Clustered index and the page numbers are clustered index arranged in a binary tree. See how you came to the page 407 very quickly. Your mind actually traversed the binary tree way left and right to reach the page 407 quickly.
Here, the class column with distinct values 1,2,3..12 will store the clustered index columns value along with it. Say for example; Let us take only class value of 1. The Index goes like this:
1: 100, 104, 105