Wednesday, March 21, 2012

Dynamic lookup with periodic (yearly) data using JOIN

Hi,

I am generating a dynamic SQL query to perform a lookup based on criteria entered by the user. The query is generated by a C# application and the target database is SQL Server 2005.

Basically I have 2 tables:

--Cie-- (Over 40k companies, 1 row per company, 40k rows total)
(
CieId int not null,
Symbol varchar,
CieName varchar
) Primary key(CieId)

--YearlyData-- (Over 40k companies, 20 years of data per company, 800k rows total)
(
CieId int not null,
Year int not null,
Revenue money,
NetIncome money,
[...over 170 columns]
) Primary key(CieId, Year)

//Sample Case:
The user wants to get the list of all the companies (CieId) where
YearlyData.NetIncome > 25k for YearlyData.Year 2005
and YearlyData.Revenue > 100k for YearlyData.Year 2005
and YearlyData.Revenue > 95k for YearlyData.Year 2004
and YearlyData.Revenue > 90k for YearlyData.Year 2003
and YearlyData.Revenue of YearlyData.Year 2005 > YearlyData.Revenue of YearlyData.Year 2004

Currently, the SQL query that would be generated looks like this:

SELECT Cie.CieId
FROM Cie
JOIN (select CieId, Revenue, NetIncome from YearlyData where Year = 2005) Y2005 ON Y2005.CieId = Cie.CieId
JOIN (select CieId, Revenue, NetIncome from YearlyData where Year = 2004) Y2004 ON Y2004.CieId = Cie.CieId
JOIN (select CieId, Revenue, NetIncome from YearlyData where Year = 2003) Y2003 ON Y2003.CieId = Cie.CieId
WHERE 1=1
AND Y2005.NetIncome > 25000
AND Y2005.Revenue > 100000
AND Y2004.Revenue > 95000
AND Y2003.Revenue > 90000
AND Y2005.Revenue > Y2004.Revenue

I don't feel good about all those JOINs, especially since there are up to 20 years of data and the user could use all of them in his query. The query becomes slower with each new join.

Does anyone see a better way to do this, maybe a PIVOT or something?
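
To give an idea of the PIVOT direction I mean, something like this untested sketch (Revenue only; the NetIncome condition and the other columns would presumably need a second PIVOT or conditional aggregates):

SELECT p.CieId
FROM (SELECT CieId, [Year], Revenue FROM YearlyData) AS src
PIVOT (MAX(Revenue) FOR [Year] IN ([2003], [2004], [2005])) AS p
WHERE p.[2005] > 100000
AND p.[2004] > 95000
AND p.[2003] > 90000
AND p.[2005] > p.[2004]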

Thanks

Math:

I mocked this up, creating 40k Cie entries and 800k yearly records. When I ran your query I got the expected execution plan and the expected performance. When the data is not cached it ran in about 15 seconds, and with the data cached it ran in essentially zero seconds. I got, of course, 3 table (clustered index) scans as it read in the yearly data, and I got the merge join I expected.
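
For anyone who wants to reproduce numbers like these, the usual setup is roughly the following (a sketch, not the exact script used here):

SET STATISTICS IO ON;   -- reports logical reads per table
SET STATISTICS TIME ON; -- reports CPU and elapsed time
CHECKPOINT;
DBCC DROPCLEANBUFFERS;  -- empties the buffer cache for a cold run; test servers only
-- run the query once for the uncached timing, then again for the cached timing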

My view is that (1) this is a report and (2) it doesn't figure to be run often, and therefore performance for this is not critical. Now, I did show about 10,000 logical IOs as it scanned the tables; at 8 KB per page that roughly amounts to a grab of about 80 MB of buffer cache whenever this query runs. For most modern servers used for SQL Server 2005 this should not be a problem. Since (1) this is a report and doesn't figure to run often and (2) it seems to provide very adequate performance for a report, I would not take any time to try to improve its performance; I don't figure you will get a good return on time invested in this report unless you are having some specific problems.

One thing that can be done to provide a marginal performance boost for this particular query is to build a covering index for the specific columns you are using for this report; however, I don't feel like this is a good investment either. The modest boost does not figure to be a good tradeoff for the cost of maintaining the necessary covering index.

Are you having some specific problems? If so, what are they?


Dave

|||

Dave, thank you for the reply.

I don't have any specific problems other than performance; I am just trying to see if there is a better way to construct this kind of query (i.e. without 1 join per year). This is not really a report, it is more like a screener, a tool used when looking for companies to invest in. The screener is used frequently, so performance is an issue. The data is read-only with only 1 update at the end of each day, so we indexed every column (grouped logically).

The query in the OP was only a sample case; the user can really compare (a few generated-predicate examples follow this list):
"columnX of yearX" with "columnX of yearY"
"columnX of yearX" with "columnY of yearY"
"columnX of yearX" with "columnY of yearX"
"columnX of yearX" with a static value (like 250.95)
and more
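
With the current join-per-year structure these end up as generated predicates such as (values and years purely illustrative):

AND Y2005.Revenue > Y2004.Revenue     -- columnX of yearX vs columnX of yearY
AND Y2005.NetIncome > Y2004.Revenue   -- columnX of yearX vs columnY of yearY
AND Y2005.NetIncome > Y2005.Revenue   -- columnX of yearX vs columnY of yearX
AND Y2005.Revenue > 250.95            -- columnX of yearX vs a static value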

Do you think the query can be restructured so it performs better (like removing the "join per year")?

Thanks

|||

Would partitioning by year be helpful? It seems to be a common filter.

|||

You can use the query below instead; there is no reason to join with the Cie table if you only want CieId values and there is a FK relationship between YearlyData and Cie on the CieId column.

select y.CieId
from YearlyData as y
where y.Year in (2003, 2004, 2005)
group by y.CieId
having max(case y.Year when 2005 then NetIncome end) > 25000
and max(case y.Year when 2005 then Revenue end) > 100000
and max(case y.Year when 2004 then Revenue end) > 95000
and max(case y.Year when 2003 then Revenue end) > 90000
and max(case y.Year when 2005 then Revenue end) > max(case y.Year when 2004 then Revenue end)

This query will provide the best performance if there is an index on Year or (Year, CieId). You seem to have a primary key on (CieId, Year), which will not help for this particular query and will result in a table/index scan.
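
For example, something along these lines (the index name is just illustrative, and the INCLUDE list would have to carry whichever columns a given screen actually reads):

CREATE NONCLUSTERED INDEX IX_YearlyData_Year_CieId
ON YearlyData ([Year], CieId)
INCLUDE (Revenue, NetIncome);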

|||

Math:

Both ideas help. When I ran Umachandar's query it reduced the logical IOs from about 10,000 to about 3,200, reduced alpha read time to about 4,600 ms, and reduced beta read time to around 420 ms. When I then partitioned the data on year, logical IO was further reduced from 3,200 to about 500, alpha read time from 4,600 ms to about 3,200 ms, and beta read time from 420 ms to about 380 ms.
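
Roughly what a year-partitioning setup looks like (a sketch only; boundary values and names are illustrative, and a real setup lists one boundary per stored year):

CREATE PARTITION FUNCTION pfYear (int)
AS RANGE RIGHT FOR VALUES (2003, 2004, 2005);  -- sketch: real boundaries cover every year

CREATE PARTITION SCHEME psYear
AS PARTITION pfYear ALL TO ([PRIMARY]);

-- the YearlyData clustered index (CieId, Year) is then rebuilt on psYear([Year]);
-- Year is already part of the key, so the index can be aligned to the scheme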


Dave

|||

Both ideas look good. I will see if this query structure is compatible with all my test cases and keep you informed.

Thank you for your help.
