Tuesday, October 6, 2009

Index on a Foreign Key?

About two weeks ago, I had a discussion on Twitter and via email with Jeremiah Peschka (@peschkaj) about placing indexes on Foreign Key columns.  I made the comment that I while indexing a foreign key is not required that I thought they made good candidates.  I also shared this blog post by Greg Low with him. 

Jeremiah mentioned that he had never had SQL Server recommend an index on a foreign key which I thought was strange so I decided to run some tests.  I ran the tests against my copy of AdventureWorks to which I have added data.  Here is the query I ran:

SELECT
SOD.SalesOrderDetailID,
SOD.OrderQty,
SOD.ProductID,
SOD.UnitPrice,
SOD.UnitPriceDiscount,
SOD.LineTotal,
SOD.UnitPrice2,
SOH.SalesOrderID,
SOH.RevisionNumber,
SOH.OrderDate,
SOH.DueDate,
SOH.ShipDate,
SOH.Status,
SOH.SalesOrderNumber,
SOH.PurchaseOrderNumber,
SOH.AccountNumber,
SOH.CustomerID,
SOH.SubTotal,
SOH.TaxAmt,
SOH.Freight,
SOH.TotalDue,
SOH.Comment
FROM
Sales.SalesOrderHeader AS SOH JOIN
Sales.SalesOrderDetail AS SOD ON
SOH.SalesOrderID = SOD.SalesOrderID
WHERE
SOH.OrderDate >= '12/1/2007'


In my copy of AdventureWorks I have 131465 rows in SalesOrderHeader and 1121317 rows in SalesOrderDetail.  The Max(SalesOrderHeader.OrderDate) is 12/30/2007 so I just queried for the last month (1095 rows).  In this query SalesOrderDetail.SalesOrderID is a foreign key referencing SalesOrderHeader.SalesOrderID and before modification the Clustered Primary Key on SalesOrderDetail is on SalesOrderID, SalesOrderDetailID.  Because SalesOrderID is the lead column in the clustering key the optimizer uses a clustered index seek to access SalesOrderDetail as shown here:



imageNow, to see if the Optimizer would recommend an index on SalesOrderDetail.SalesOrderID, I altered the clustered primary key on SalesOrderDetail, removing SalesOrderID with the following code:



USE [AdventureWorks]
GO

IF EXISTS (SELECT * FROM sys.indexes WHERE object_id = OBJECT_ID(N'[Sales].[SalesOrderDetail]') AND name = N'PK_SalesOrderDetail_SalesOrderID_SalesOrderDetailID')
ALTER TABLE [Sales].[SalesOrderDetail] DROP CONSTRAINT [PK_SalesOrderDetail_SalesOrderID_SalesOrderDetailID]
GO

USE
[AdventureWorks]
GO

ALTER TABLE [Sales].[SalesOrderDetail] ADD CONSTRAINT [PK_SalesOrderDetail_SalesOrderID_SalesOrderDetailID] PRIMARY KEY CLUSTERED
(
[SalesOrderDetailID] ASC
)WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, SORT_IN_TEMPDB = OFF, IGNORE_DUP_KEY = OFF, ONLINE = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON) ON [PRIMARY]
GO



I then re-ran my first query and now you can see that I there is a clustered index scan, a hash match join, and, in green at the top, a index recommended on SalesOrderDetail.SalesOrderID:



imageNext I added the suggested index (remember that you should not just add indexes to production without checking other queries as well):



USE [AdventureWorks]
GO
CREATE NONCLUSTERED INDEX
SalesOrderDetails_SalesOrderId
ON [Sales].[SalesOrderDetail] ([SalesOrderID])
INCLUDE ([SalesOrderDetailID],[OrderQty],
[ProductID],[UnitPrice],
[UnitPriceDiscount],[LineTotal],
[UnitPrice2])
GO



Then I re-ran my query, and here’s the execution plan:



image



Here are the Statistics IO results from the 3 queries as well:



Results with SalesOrderID as lead column in Primary Key (Original State)



Table 'SalesOrderDetail'. Scan count 147, logical reads 516, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

Table 'SalesOrderHeader'. Scan count 1, logical reads 461, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.



Results with SalesOrderID removed from Primary Key and not added as an index.



Table 'Worktable'. Scan count 0, logical reads 0, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

Table 'SalesOrderDetail'. Scan count 1, logical reads 12584, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.


Table 'SalesOrderHeader'. Scan count 1, logical reads 461, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.



Results with the suggested index on SalesOrderID.



Table 'SalesOrderDetail'. Scan count 147, logical reads 504, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

Table 'SalesOrderHeader'. Scan count 1, logical reads 461, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.



Conclusion



In this case you can see that adding an index on the foreign key, whether the lead column in the Clustered Primary Key or in a non-clustered index, you get much improved performance particularly regarding reads, and that the non-clustered index actual provides a slightly better access path than being part of the clustered primary key.  Certainly you need to test for your individual situations but, I would say that the majority of the time a foreign key will benefit from an index since it is used as a filter in the JOIN criteria.



AdventureWorks can be downloaded from codeplex.  Execution Plans and T-SQL can be downloaded from here.

5 comments:

  1. Hey Jack, this is quite a good piece of info. Acutally, it is a common misconception that foreign key columns need not to be indexed. In fact, indexing them as you said can improve performance by simply giving the optimizer more choices. It may be good to have index on foreign key columns just to give a chance to use merge join or nested loops join instead of hash join. This saves memory and CPU ticks big time.

    ReplyDelete
  2. Thanks Piotr. There are definitely some misconceptions around this subject.

    ReplyDelete
  3. Hello i would like to ask about the actual purpose of the composite key in table SalesOrderDetail?

    I think that the field SalesOrderDetailId is actual primary key by itself

    ReplyDelete
  4. While SalesOrderDetailID could be a primary key by it self, I think the AdventureWorks design team chose to make it a copmosite key on SalesOrderID and SalesOrderDetailID because it is also the clustered key and they anticipate the insertion to be in ascending order by SalesOrderID and it doesn't make it necessary to have second index on SalesOrderID.

    ReplyDelete

So what do you think I am?