Saturday, February 25, 2012

Best Data Type for a Tracking ID?

Hello,
We're in the table design process and could really use outside thoughts on
our options.
The table in question will grow to several million records. This table will
update and insert depending on what the tracking ID does as it lives out it
day long life span.
Example Stored Proc:
UPDATE tbTracking
SET myCount = myCount +1
WHERE trackingID = @.trackingID AND productID = @.productID
IF (@.@.rowcount = 0)
BEGIN
INSERT INTO tbTracking (...) VALUES (...) ;
END
Everyday at midnight the table will be truncate just afer many other queries
generated reports and store cumulative data into other tables.
Our goal is extreme performance, with that in mind what is the best data
type for trackingID and productID? We have the luxury to make them whatever
is best, as this is a completely new project.
Internally debated NVARCHR(50), INT, Binary, UNIQUEIDENTIFIER and the like,
but we're in need of advise from those with more experience with these
matters.
Thank you for your time.
MarkMark S. wrote:

> Hello,
> We're in the table design process and could really use outside thoughts on
> our options.
> The table in question will grow to several million records. This table wil
l
> update and insert depending on what the tracking ID does as it lives out i
t
> day long life span.
> Example Stored Proc:
> UPDATE tbTracking
> SET myCount = myCount +1
> WHERE trackingID = @.trackingID AND productID = @.productID
> IF (@.@.rowcount = 0)
> BEGIN
> INSERT INTO tbTracking (...) VALUES (...) ;
> END
> Everyday at midnight the table will be truncate just afer many other queri
es
> generated reports and store cumulative data into other tables.
> Our goal is extreme performance, with that in mind what is the best data
> type for trackingID and productID? We have the luxury to make them whateve
r
> is best, as this is a completely new project.
> Internally debated NVARCHR(50), INT, Binary, UNIQUEIDENTIFIER and the like
,
> but we're in need of advise from those with more experience with these
> matters.
> Thank you for your time.
> Mark
This doesn't make much sense to me. As described it isn't clear whether
trackingid is a surrogate key or not. If it isn't then I don't
understand what its purpose is. What is the business meaning of
trackingid? Why isn't the datatype predetermined? What key or keys
exist in this table?
Maybe we are talking surrogate keys here, in which case see:
http://www.aspfaq.com/show.asp?id=2504

> Everyday at midnight the table will be truncate just afer many other queri
es
> generated reports and store cumulative data into other tables.
A guiding in principle in data warehouse applications is usually to
capture the data at the finest possible level of granularity. I expect
you've considered this, but I just thought it worth restating for the
benefit of all.
David Portas
SQL Server MVP
--|||Who or what determines the value of TrackingID?
If this key is to uniquely identify an event (such as a support call), then
perhaps (8 byte) datetime would be the logical choice. It has inherent
meaning, and unless you have hundreds of support calls coming in per minute,
it is unlikely that this key value would be duplicated, and retry logic on
the insert procedure or statement can handle the unlikely event that does.
There may even be a need to purge the tbTracking table based on < a specific
date/time rather than truncating the entire table at the end of the day.
As for ProductID, this should be an existing attribute in your database.
"Mark S." <marks@.yahoo.com> wrote in message
news:Ow6CKQ9CGHA.4004@.tk2msftngp13.phx.gbl...
> Hello,
> We're in the table design process and could really use outside thoughts on
> our options.
> The table in question will grow to several million records. This table
> will update and insert depending on what the tracking ID does as it lives
> out it day long life span.
> Example Stored Proc:
> UPDATE tbTracking
> SET myCount = myCount +1
> WHERE trackingID = @.trackingID AND productID = @.productID
> IF (@.@.rowcount = 0)
> BEGIN
> INSERT INTO tbTracking (...) VALUES (...) ;
> END
> Everyday at midnight the table will be truncate just afer many other
> queries generated reports and store cumulative data into other tables.
> Our goal is extreme performance, with that in mind what is the best data
> type for trackingID and productID? We have the luxury to make them
> whatever is best, as this is a completely new project.
> Internally debated NVARCHR(50), INT, Binary, UNIQUEIDENTIFIER and the
> like, but we're in need of advise from those with more experience with
> these matters.
> Thank you for your time.
> Mark
>|||In my experiences, int data type is the best performer especially when it
comes to indexing, joins, etc. (This comment relates to 32-bit O/S and SQL
Server 2000.)
I would make a meaningless key:
TrackingRowId int, identity, primary key, clustered
ProductRowId int, identity, primary key, clustered
SQL Server takes care of assigning the next number. The clustered index
will always insert data at the end of the data pages which avoids page split
s.
If you want to reset the tracking id each midnight, simply change the reseed
the tables as part of your processing.
Personally...
I would NOT use a character string for a key because its slower to compare
strings.
I would NOT use a UNIQUEIDENTIFIER for a key because it is a pain in the
butt when you have to write a manual query to pull data.
Finally, I have to reiterate David's point of capturing data at the lowest
level. Are you sure you want to summarize and truncate on a nightly basis.
What if a bug is discovered? You won't have the original data to recreate
the totals. What if a new way of summarizing is added? You won't have the
original data to drill into or summarize.
And, to reiterate JT's point of having a datetime column that can be very
handy. It is debateable whether to use it as the key or not. If you're
truncating the tables each night, it may be a waste.
Just my two cents,
Joe
"Mark S." wrote:

> Hello,
> We're in the table design process and could really use outside thoughts on
> our options.
> The table in question will grow to several million records. This table wil
l
> update and insert depending on what the tracking ID does as it lives out i
t
> day long life span.
> Example Stored Proc:
> UPDATE tbTracking
> SET myCount = myCount +1
> WHERE trackingID = @.trackingID AND productID = @.productID
> IF (@.@.rowcount = 0)
> BEGIN
> INSERT INTO tbTracking (...) VALUES (...) ;
> END
> Everyday at midnight the table will be truncate just afer many other queri
es
> generated reports and store cumulative data into other tables.
> Our goal is extreme performance, with that in mind what is the best data
> type for trackingID and productID? We have the luxury to make them whateve
r
> is best, as this is a completely new project.
> Internally debated NVARCHR(50), INT, Binary, UNIQUEIDENTIFIER and the like
,
> but we're in need of advise from those with more experience with these
> matters.
> Thank you for your time.
> Mark
>
>|||Gentleman thank you for your feedback.
Joe, other than the UniqueIndentifier being unfriendly for humans, do you
find it faster than INTs?
As to th other questions and caveats, all that has been considered
previously, and wasn't mentioned in my question in the interests of brevity
and an attempt to focus my question on a single point. Regardless, thank you
for your full consideration.
Cheers,
Mark
"Joe from WI" <JoefromWI@.discussions.microsoft.com> wrote in message
news:50A0324B-1446-4044-A850-706B59A04169@.microsoft.com...
> In my experiences, int data type is the best performer especially when it
> comes to indexing, joins, etc. (This comment relates to 32-bit O/S and
> SQL
> Server 2000.)
> I would make a meaningless key:
> TrackingRowId int, identity, primary key, clustered
> ProductRowId int, identity, primary key, clustered
> SQL Server takes care of assigning the next number. The clustered index
> will always insert data at the end of the data pages which avoids page
> splits.
> If you want to reset the tracking id each midnight, simply change the
> reseed
> the tables as part of your processing.
> Personally...
> I would NOT use a character string for a key because its slower to compare
> strings.
> I would NOT use a UNIQUEIDENTIFIER for a key because it is a pain in the
> butt when you have to write a manual query to pull data.
> Finally, I have to reiterate David's point of capturing data at the lowest
> level. Are you sure you want to summarize and truncate on a nightly
> basis.
> What if a bug is discovered? You won't have the original data to recreate
> the totals. What if a new way of summarizing is added? You won't have
> the
> original data to drill into or summarize.
> And, to reiterate JT's point of having a datetime column that can be very
> handy. It is debateable whether to use it as the key or not. If you're
> truncating the tables each night, it may be a waste.
> Just my two cents,
> Joe
> "Mark S." wrote:
>|||The last line of the article David suggested
http://www.aspfaq.com/show.asp?id=2504
Says alot about GUI() not being optimized, if anyone disagrees, feel free to
speak up:
"the wider datatype leads to a drop in index performance (if clustered, each
insert almost guaranteed to 'dirty' a different page), and an increase in
storage requirements; " and five other cons.
Thank you.|||To maximize INSERT performance: use INT IDENTITY(1, 1) PRIMARY KEY CLUSTERED
WITH FILLFACTOR = 100
This causes every new row to be added at the end of the table--minimizing
index maintenance and eliminating page splits. Use caution, however: since
SQL Server doesn't automatically reorganize indexes, a high volume of DELETE
activity will cause the index to become sparse, which can reduce SELECT
performance, but it will not affect INSERT performance.
The INT datatype matches the word size of most Intel processors (32-bit), so
comparisons require fewer CPU cycles.
The stored procedure is an example of what not to do. It's a recipie for
primary key constraint violations. There's nothing that prevents two
transactions from trying to INSERT the same information at the same time.
Most of the time what will happen is that one connection will succeed with
the INSERT and the other will UPDATE the newly inserted row, but a collision
will occur if identical UPDATE statements occur simultaneously on two
unbound connections followed by (since the row doesn't yet exist)
simultaneous identical INSERT statements.
The correct method is to use something like this:
BEGIN TRAN
IF EXISTS (SELECT WITH(UPDLOCK, HOLDLOCK))
UPDATE
ELSE
INSERT
COMMIT
Some people use the following instead, but I prefer the above method since
it is easier to read and understand
BEGIN TRAN
INSERT...SELECT...WHERE NOT EXISTS(SELECT WITH(UPDLOCK, HOLDLOCK))
IF @.@.ROWCOUNT = 0
UPDATE
COMMIT
Note that there isn't any marked reduction in performance or concurrency
between this and your sample, because UPDLOCK doesn't block SELECTs, and any
blocking that does occur is necessary to maintain integrity. Without the
EXISTS clause, the INSERT or UPDATE will be applying an exclusive lock
anyway which involves reading the index page into memory. With the EXISTS
clause, the SELECT reads the index page and applies an update lock, and the
INSERT or UPDATE simply transition from an update lock to an exclusive lock
in memory--no additional physical read is necessary.
"Mark S." <marks@.yahoo.com> wrote in message
news:Ow6CKQ9CGHA.4004@.tk2msftngp13.phx.gbl...
> Hello,
> We're in the table design process and could really use outside thoughts on
> our options.
> The table in question will grow to several million records. This table
> will update and insert depending on what the tracking ID does as it lives
> out it day long life span.
> Example Stored Proc:
> UPDATE tbTracking
> SET myCount = myCount +1
> WHERE trackingID = @.trackingID AND productID = @.productID
> IF (@.@.rowcount = 0)
> BEGIN
> INSERT INTO tbTracking (...) VALUES (...) ;
> END
> Everyday at midnight the table will be truncate just afer many other
> queries generated reports and store cumulative data into other tables.
> Our goal is extreme performance, with that in mind what is the best data
> type for trackingID and productID? We have the luxury to make them
> whatever is best, as this is a completely new project.
> Internally debated NVARCHR(50), INT, Binary, UNIQUEIDENTIFIER and the
> like, but we're in need of advise from those with more experience with
> these matters.
> Thank you for your time.
> Mark
>|||Thank you very much.
"Brian Selzer" <brian@.selzer-software.com> wrote in message
news:%23oG2PTDDGHA.3980@.TK2MSFTNGP12.phx.gbl...
> To maximize INSERT performance: use INT IDENTITY(1, 1) PRIMARY KEY
> CLUSTERED WITH FILLFACTOR = 100
> This causes every new row to be added at the end of the table--minimizing
> index maintenance and eliminating page splits. Use caution, however:
> since SQL Server doesn't automatically reorganize indexes, a high volume
> of DELETE activity will cause the index to become sparse, which can reduce
> SELECT performance, but it will not affect INSERT performance.
> The INT datatype matches the word size of most Intel processors (32-bit),
> so comparisons require fewer CPU cycles.
> The stored procedure is an example of what not to do. It's a recipie for
> primary key constraint violations. There's nothing that prevents two
> transactions from trying to INSERT the same information at the same time.
> Most of the time what will happen is that one connection will succeed with
> the INSERT and the other will UPDATE the newly inserted row, but a
> collision will occur if identical UPDATE statements occur simultaneously
> on two unbound connections followed by (since the row doesn't yet exist)
> simultaneous identical INSERT statements.
> The correct method is to use something like this:
> BEGIN TRAN
> IF EXISTS (SELECT WITH(UPDLOCK, HOLDLOCK))
> UPDATE
> ELSE
> INSERT
> COMMIT
> Some people use the following instead, but I prefer the above method since
> it is easier to read and understand
> BEGIN TRAN
> INSERT...SELECT...WHERE NOT EXISTS(SELECT WITH(UPDLOCK, HOLDLOCK))
> IF @.@.ROWCOUNT = 0
> UPDATE
> COMMIT
> Note that there isn't any marked reduction in performance or concurrency
> between this and your sample, because UPDLOCK doesn't block SELECTs, and
> any blocking that does occur is necessary to maintain integrity. Without
> the EXISTS clause, the INSERT or UPDATE will be applying an exclusive lock
> anyway which involves reading the index page into memory. With the EXISTS
> clause, the SELECT reads the index page and applies an update lock, and
> the INSERT or UPDATE simply transition from an update lock to an exclusive
> lock in memory--no additional physical read is necessary.
> "Mark S." <marks@.yahoo.com> wrote in message
> news:Ow6CKQ9CGHA.4004@.tk2msftngp13.phx.gbl...
>

No comments:

Post a Comment