Showing posts with label dwh.

Wednesday, February 15, 2012

DWH problem: updating a table with every 1000 records a checkpoint

Hi,

Currently we're building a metadata-driven data warehouse in SQL
Server 2000. We're investigating how to update tables with an
enormous number of updates and inserts, and the use of checkpoints
(for the SIMPLE recovery model, and BACKUP LOG for the FULL recovery model).

On several websites people mention the transaction log filling up because
its growth can't keep up with the updates. Therefore we want to
create a script which flushes the dirty pages to disk. It's not
quite clear to me how this works. The questions we have are:
* How does the process of updating, inserting and deleting work in SQL
Server 2000 with respect to the log cache, log file, buffer cache, commit,
checkpoint, etc.? What happens when?
* As far as I can see now, I'm thinking of creating chunks of
1000 records with a checkpoint after each query. SQL Server has implicit
transactions by default, so it will not need a commit.
Something like this?
* How do I create chunks of 1000 records automatically without
creating an identity field or something? Is there something like SELECT
NEXT 1000?

Greetz,

Hennie

> * How does the process of updating, insert and deleting works with SQL
> Server 2000 with respect to log cache, log file, buffer cache, commit,
> checkpoint, etc?
> What happens when?

There's a discussion of this in the Books Online
<architec.chm::/8_ar_sa_8unn.htm>. In summary, all data modifications are
written to both the buffer cache and transaction log buffer cache. The log
buffer is written first. Modified buffer cache pages are written to disk by
either worker threads, the lazy writer or the checkpoint process. Worker
threads write data when they need to wait for a read i/o anyway. The lazy
writer process wakes up periodically and writes data in order to maintain a
reasonable number of free cache buffers. The checkpoint process writes all
dirty pages to disk periodically (about a minute by default) in order to
reduce startup recovery time.

Log buffers are written as they become full, when you issue a COMMIT and
during a checkpoint.
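
As a rough illustration of where those writes happen (the table and column names below are invented):

BEGIN TRAN
UPDATE dbo.DimCustomer          -- the page is changed in the buffer cache;
SET City = 'Rotterdam'          -- the log records go to the log buffer first
WHERE CustomerKey = 42
COMMIT TRAN                     -- the log buffer is hardened to the log file here
CHECKPOINT                      -- dirty data pages are flushed to the data file(s)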

> * As far as i can see now: i'm thinking of creating chunks of data of
> 1000 records with a checkpoint after the Query. SQL server has the
> default of implicit transactions and so it will not need a commit.
> Something like this?
> * How do i create chunks of 1000 records automatically without
> creating a identity field or something. Is there something like SELECT
> NEXT 1000?

I'm not sure I understand what you are asking here. In the default
autocommit mode, each statement is a separate transaction.
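
To illustrate the difference between autocommit and the implicit transactions mentioned in the question (a sketch only; dbo.FactStage and its columns are made-up names):

-- Autocommit (the default): each statement is its own transaction and
-- commits by itself, so no COMMIT is needed.
UPDATE dbo.FactStage SET Processed = 1 WHERE BatchNo = 1

-- With SET IMPLICIT_TRANSACTIONS ON, SQL Server opens a transaction
-- automatically and keeps it (and its log space) open until you COMMIT.
SET IMPLICIT_TRANSACTIONS ON
UPDATE dbo.FactStage SET Processed = 1 WHERE BatchNo = 2
COMMIT TRAN
SET IMPLICIT_TRANSACTIONS OFF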

It seems you are mostly concerned with transaction log space management.
The main consideration is to keep your transactions reasonably sized since
only committed data can be removed from the log during a log backup (FULL or
BULK_LOGGED recovery model) or checkpoint (SIMPLE model).
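
For example, under the FULL or BULK_LOGGED model a long load can be split into reasonably sized transactions with a log backup in between, so the committed part of the log becomes reusable (a sketch only; the database, table and path names are invented):

BEGIN TRAN
UPDATE dbo.FactSales SET LoadFlag = 1 WHERE LoadDate = '20040209'
COMMIT TRAN

BACKUP LOG MyDWH TO DISK = 'E:\Backup\MyDWH_log.trn'   -- frees committed log space for reuse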

--
Hope this helps.

Dan Guzman
SQL Server MVP

"Hennie de Nooijer" <hdenooijer@.hotmail.com> wrote in message
news:191115aa.0402090341.23275aae@.posting.google.c om...
> Hi,
> Currently we're a building a metadatadriven datawarehouse in SQL
> Server 2000. We're investigating the possibility of the updating
> tables with enormeous number of updates and insert and the use of
> checkpoints (for simple recovery and Backup Log for full recovery).
> On several website people speak about full transaction log and the
> pace of growing can't keep up with the update. Therefore we want to
> create a script which flushes the dirty pages to the disk. It's not
> quite clear to me how it works. Questions we have is:
> * How does the process of updating, insert and deleting works with SQL
> Server 2000 with respect to log cache, log file, buffer cache, commit,
> checkpoint, etc?
> What happens when?
> * As far as i can see now: i'm thinking of creating chunks of data of
> 1000 records with a checkpoint after the Query. SQL server has the
> default of implicit transactions and so it will not need a commit.
> Something like this?
> * How do i create chunks of 1000 records automatically without
> creating a identity field or something. Is there something like SELECT
> NEXT 1000?
> Greetz,
> Hennie|||Hennie de Nooijer (hdenooijer@.hotmail.com) writes:
> * How does the process of updating, insert and deleting works with SQL
> Server 2000 with respect to log cache, log file, buffer cache, commit,
> checkpoint, etc?
> What happens when?
> * As far as i can see now: i'm thinking of creating chunks of data of
> 1000 records with a checkpoint after the Query. SQL server has the
> default of implicit transactions and so it will not need a commit.
> Something like this?
> * How do i create chunks of 1000 records automatically without
> creating a identity field or something. Is there something like SELECT
> NEXT 1000?

I hope Dan's reply shed some light on how the log works.

As for breaking up in chunks, there is nothing built-in like:

UPDATE tbl
SET yadayada
...
WITH COMMIT EACH 1000 ROW

For DML statements you will have to work out how best to do the batching
from your data. Note that there is no reason to keep to a strict batch size;
if you have some column that divides the data nicely, that can be used.
For instance, if you are handling order data, you can take one day or
month at a time.
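
A rough sketch of that idea, assuming a fact table dbo.FactOrders with an OrderDate column (both invented names) and committing one month per iteration in autocommit mode:

DECLARE @from datetime
SET @from = '20030101'

WHILE @from < '20040101'
BEGIN
    UPDATE dbo.FactOrders
    SET    Processed = 1
    WHERE  OrderDate >= @from
    AND    OrderDate <  DATEADD(month, 1, @from)   -- each UPDATE is its own transaction

    SET @from = DATEADD(month, 1, @from)
END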

The one case where you can tell SQL Server to commit by each n row is
when you bulk load with BCP or BULK INSERT.
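
For completeness, the commit-every-n-rows option on a bulk load looks like this (the file path and table name are placeholders):

BULK INSERT dbo.FactSales
FROM 'D:\load\factsales.dat'
WITH (BATCHSIZE = 100000, TABLOCK)      -- commits after every 100000 rows

-- bcp equivalent: bcp MyDWH.dbo.FactSales in factsales.dat -c -T -b 100000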

I would also like to point out that selecting the batch size is a trade-off
between log size and speed. Particularly if the search condition that
identifies a batch takes time to execute, too small a batch size can be
costly. It depends a little on row size, but 100000 rows at a time is not
an unreasonable number. 1000 is far too small.

--
Erland Sommarskog, SQL Server MVP, sommar@.algonet.se

Books Online for SQL Server SP3 at
http://www.microsoft.com/sql/techin.../2000/books.asp

> > * How does the process of updating, insert and deleting works with SQL
> > Server 2000 with respect to log cache, log file, buffer cache, commit,
> > checkpoint, etc?
> > What happens when?
> There's a discussion of this in the Books Online
> <architec.chm::/8_ar_sa_8unn.htm>.
How does this work?

> In summary, all data modifications are
> written to both the buffer cache and transaction log buffer cache. The log
> buffer is written first. Modified buffer cache pages are written to disk by
> either worker threads, the lazy writer or the checkpoint process. Worker
> threads write data when they need to wait for a read i/o anyway. The lazy
> writer process wakes up periodically and writes data in order to maintain a
> reasonable number of free cache buffers. The checkpoint process writes all
> dirty pages to disk periodically (about a minute by default) in order to
> reduce startup recovery time.
OK, clear.

> Log buffers are written as they become full, when you issue a COMMIT and
> during a checkpoint.
I thought that a commit is not necessarily a write to disk. A
commit is just an end mark of a transaction, as far as I can see in the
documentation. A checkpoint will write the transactions to the
database.

> > * As far as i can see now: i'm thinking of creating chunks of data of
> > 1000 records with a checkpoint after the Query. SQL server has the
> > default of implicit transactions and so it will not need a commit.
> > Something like this?
> > * How do i create chunks of 1000 records automatically without
> > creating a identity field or something. Is there something like SELECT
> > NEXT 1000?
> I'm not sure I understand what you are asking here. In the default
> autocommit mode, each statement is a separate transaction.
> It seems you are mostly concerned with transaction log space management.
> The main consideration is to keep your transactions reasonably sized since
> only committed data can be removed from the log during a log backup (FULL or
> BULK_LOGGED recovery model) or checkpoint (SIMPLE model).
Yup, this is what I meant. The only question is: what is a reasonable
chunk? Perhaps I will create a parameter for this so I can tune it.

> I hope Dan's reply shed some light on how the log works.
Yup

> As for breaking up in chucks there is nothing built-in like:
> UPDATE tbl
> SET yadayada
> ...
> WITH COMMIT EACH 1000 ROW
> For DML statements you will have to find out how to do the batching best
> from your data. Note there is no reason to keep a strict batch size, but
> if you have some column that divides the data nicely that can be used.
> For instance, if you are handling order data, you can take one day or
> month at a time.
Yeah, well, I have an identity column which I could use for chunking the
records. The disadvantage is that the implementation then depends on the
functionality of a column in the table. It's not what I want, but I'll
have to deal with it.
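
If that is the route taken, a sketch of chunking on the identity column could look like this (table and column names are made up; the batch size is a variable so it can be tuned later):

DECLARE @BatchSize int, @lo int, @hi int
SET @BatchSize = 100000

SELECT @lo = MIN(RowID), @hi = MAX(RowID) FROM dbo.FactStage

WHILE @lo <= @hi
BEGIN
    UPDATE dbo.FactStage
    SET    Processed = 1
    WHERE  RowID BETWEEN @lo AND @lo + @BatchSize - 1

    SET @lo = @lo + @BatchSize
END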

> The one case where you can tell SQL Server to commit by each n row is
> when you bulk load with BCP or BULK INSERT.
Well, this is not possible. I'm reading and inserting/updating from other
tables in SQL Server, and BCP reads only from files.

> I also like to point out that selecting the batch size, is a trade-off
> between log size and speed. Particularly if the search condition to
> identify a batch takes time to execute, too small batch sizes can be
> costly. A little depending on row size, but 100000 rows at time is not
> an unreasonable number. 1000 is by far to small.
Well, 1000 was just a hunch (is that the right word?). I'm thinking of
creating a parameter with which we can tune the chunk size.

Thanks to you all for your answers.

Greetz,
Hennie

> > There's a discussion of this in the Books Online
> > <architec.chm::/8_ar_sa_8unn.htm>.
> How does this works?

From the Books Online, you can go directly to the referenced topic by
clicking Go-->URL and pasting in the URL 'architec.chm::/8_ar_sa_8unn.htm'.

> > Log buffers are written as they become full, when you issue a COMMIT and
> > during a checkpoint.
> It thought that a commit is not neccessary a write to the disk. A
> commit is just a end mark of a transaction as far i can see in the
> documentation. A checkpoint will write the transactions to the
> database.

A COMMIT ensures all committed data are written to the log file so that
committed data are permanently persisted to the database. Modified data may
or may not have been written to data files because dirty data pages are
written separately by worker threads, the lazy writer or checkpoint process.
This is transparent to your application because data are always retrieved
from buffer cache.

Only committed data before the oldest uncommitted transaction can be removed
from the log. This occurs during a log backup in the FULL or BULK_LOGGED
recovery model or during a checkpoint in the SIMPLE model. Rather than
trying to take control of the checkpoint frequency, all you really need to
do is ensure your transactions are reasonably short and small.

> Only the question is what is a reasonable
> chunk? Perhaps i will create parameter for this so i can tune this.

If you are running in the SIMPLE recovery model and have many small
insert/update/delete transactions, I suggest you simply let SQL Server
truncate the log automatically. However, if you modify a lot of data (e.g.
millions of rows) in a single statement or transaction, you can use the
techniques Erland mentioned to keep the log size reasonable. In my
experience, the optimal amount is somewhat less than the amount of memory
allocated to SQL Server.
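
For a very large delete in the SIMPLE model, one common SQL Server 2000 idiom is SET ROWCOUNT in a loop (a sketch; the table and predicate are invented):

SET ROWCOUNT 100000                      -- limit each DELETE to 100000 rows

DELETE FROM dbo.FactStage WHERE Processed = 1
WHILE @@ROWCOUNT > 0
BEGIN
    CHECKPOINT                           -- SIMPLE model: lets the inactive log portion be reused
    DELETE FROM dbo.FactStage WHERE Processed = 1
END

SET ROWCOUNT 0                           -- back to unlimited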

--
Hope this helps.

Dan Guzman
SQL Server MVP

"Hennie de Nooijer" <hdenooijer@.hotmail.com> wrote in message
news:191115aa.0402110050.7797d139@.posting.google.c om...
> > > * How does the process of updating, insert and deleting works with SQL
> > > Server 2000 with respect to log cache, log file, buffer cache, commit,
> > > checkpoint, etc?
> > > What happens when?
> > There's a discussion of this in the Books Online
> > <architec.chm::/8_ar_sa_8unn.htm>.
> How does this works?
> > In summary, all data modifications are
> > written to both the buffer cache and transaction log buffer cache. The
log
> > buffer is written first. Modified buffer cache pages are written to
disk by
> > either worker threads, the lazy writer or the checkpoint process.
Worker
> > threads write data when they need to wait for a read i/o anyway. The
lazy
> > writer process wakes up periodically and writes data in order to
maintain a
> > reasonable number of free cache buffers. The checkpoint process writes
all
> > dirty pages to disk periodically (about a minute by default) in order to
> > reduce startup recovery time.
> Ok clear..
> > Log buffers are written as they become full, when you issue a COMMIT and
> > during a checkpoint.
> It thought that a commit is not neccessary a write to the disk. A
> commit is just a end mark of a transaction as far i can see in the
> documentation. A checkpoint will write the transactions to the
> database.
> > > * As far as i can see now: i'm thinking of creating chunks of data of
> > > 1000 records with a checkpoint after the Query. SQL server has the
> > > default of implicit transactions and so it will not need a commit.
> > > Something like this?
> > > * How do i create chunks of 1000 records automatically without
> > > creating a identity field or something. Is there something like SELECT
> > > NEXT 1000?
> > I'm not sure I understand what you are asking here. In the default
> > autocommit mode, each statement is a separate transaction.
> > It seems you are mostly concerned with transaction log space management.
> > The main consideration is to keep your transactions reasonably sized
since
> > only committed data can be removed from the log during a log backup
(FULL or
> > BULK_LOGGED recovery model) or checkpoint (SIMPLE model).
> Yup this is what i meant. Only the question is what is a reasonable
> chunk? Perhaps i will create parameter for this so i can tune this.|||[posted and mailed, please reply in news]

Hennie de Nooijer (hdenooijer@.hotmail.com) writes:
> Yeah well i have identity column which i could use for chunking the
> records. Disadvantage is that it is an implementation depending on the
> functionality of a column in the table. It's not what i want but i'll
> have to deal with it.

One thing that I should have added is that you should see if the
clustered index is good for chunking. That can speed up the time
it takes for SQL Server to locate the rows quite a bit.
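
For example (again with invented names), if the chunking column is also the clustered index key, each batch becomes a seek over a contiguous range of pages instead of a scan of the whole table:

CREATE CLUSTERED INDEX IX_FactStage_RowID ON dbo.FactStage (RowID)

-- ...WHERE RowID BETWEEN @lo AND @lo + @BatchSize - 1 is now a range seek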

--
Erland Sommarskog, SQL Server MVP, sommar@.algonet.se

Books Online for SQL Server SP3 at
http://www.microsoft.com/sql/techin.../2000/books.asp

dwh problem

Hi,
I have to choose between
a) using one single fact table
b) using one fact table and several dimension tables
I know that if you have no dimension tables (i.e. you store all the
information in one single fact table) your SQL will not include any joins. I
understand joins penalize performance, so it would make sense to go for this
option. Can someone tell me if a star schema approach with surrogate keys on
the fact table pointing to each dimension table would be less performant than
the single fact table approach? What would be more efficient if I want to use
Business Objects as the reporting tool? What are the pros and cons of each
option?
If anyone has previously had to make a similar decision I would very much
appreciate the help.
Thanks a lot,
Roberto

Hi rufian,
I can't completely answer your question, but my experience so far has been
that using surrogate keys is worth doing. I have built fact tables before
that stored everything in one table, and it didn't significantly improve
performance.
Analysis Services 2005 also seems to be happier with the star schema, but I
don't know much about it.
One of the biggest benefits of the star schema is that you can add more
attributes and dimensions much more easily.
If you know how big your fact table will be, then the decision may be made
for you. As soon as your fact table starts getting big, you should use
dimension tables. If you have 1 million rows in the fact table, there is no
way I would skip the dimension tables. Even for 100,000 I probably wouldn't
skip them.
I hope that helps.
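
As a rough sketch of what the two options look like (all names invented), the star schema keeps the descriptive attributes in the dimension tables and only surrogate keys and measures in the fact table:

-- Star schema: narrow fact table joined to small dimension tables on surrogate keys
CREATE TABLE DimProduct (ProductKey int PRIMARY KEY, ProductName varchar(50), Category varchar(30))
CREATE TABLE DimDate    (DateKey int PRIMARY KEY, CalendarDate datetime, MonthName varchar(20))

CREATE TABLE FactSales (
    ProductKey int REFERENCES DimProduct (ProductKey),
    DateKey    int REFERENCES DimDate (DateKey),
    Quantity   int,
    Amount     money
)

-- Reporting query: one join per dimension
SELECT d.MonthName, p.Category, SUM(f.Amount) AS Amount
FROM   FactSales f
JOIN   DimProduct p ON p.ProductKey = f.ProductKey
JOIN   DimDate d    ON d.DateKey    = f.DateKey
GROUP BY d.MonthName, p.Category

-- The single flat fact table alternative repeats ProductName, Category, CalendarDate
-- and MonthName on every row: no joins, but much wider rows and harder to extend.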
"rufian" wrote:

> Hi,
> I have to choose between
> a) using one single fact table
> b) using one fact table and several dimension tables
> I know that if you have no dimension tables (i.e. you store all the
> information on one single fact table) your sql will not include any joins. I
> understand joins penalize performance so it would make sense to go for this
> option. Can someone tell me if a star schema approach with surrogate keys on
> the fact table pointing to each dimension table would be less performant than
> the single fact table approach? What would be more efficient if I want to use
> Business Objects as the reporting tool? What are the pros and cons of each
> option?
> If anyone has previously had to make a similar decision I would very much
> appreciate the help.
> Thanks a lot,
> Roberto
>
|||I completely agree: the single fact table is not easy to manage in the
future; moreover, it is less performant due to the very high
denormalization (you create a higher I/O load for SQL Server).
Marco Russo
http://www.sqlbi.eu
http://www.sqljunkies.com/weblog/sqlbi

DWH Hardware Configuration

We are currently in the process of looking into setting up a data warehouse
in SQL 2000 so that we can use Cognos to report on the data.
I am looking at what hardware to run SQL 2000 on just for the DWH. On all of
our other SQL 2000 servers we have a very, VERY simple hardware config of a
single RAID controller with 2 mirrored drives for the OS and then 3+ drives
in RAID 5 for the SQL data and logs (on the same logical drive).
Obviously, I want to make sure that I configure the new machine around a DWH
environment, so I was wondering if anyone had any suggestions on the best way
of doing this.
The hardware I am looking at is an IBM x255 with a couple of 18 GB drives
mirrored for the OS and then separate logical drives for the log and data.
Should I use separate RAID controllers for the SQL data and log?
Is RAID 5 the best way to go with regards to the log and data?
Anything else I should be looking for?
Thanks.

What are the requirements? How much data, how well is it modeled, how will it
be accessed, how many concurrent queries and what types of queries? Any end-user
access or just cube builds? Any ad-hoc access?
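
Whatever RAID level is chosen, the separate logical drives for data and log come together when the database is created; a minimal sketch with invented drive letters, paths and sizes:

CREATE DATABASE DWH
ON PRIMARY
    (NAME = DWH_data, FILENAME = 'E:\SQLData\DWH_data.mdf', SIZE = 20000MB, FILEGROWTH = 1000MB)
LOG ON
    (NAME = DWH_log,  FILENAME = 'F:\SQLLog\DWH_log.ldf',  SIZE = 4000MB,  FILEGROWTH = 500MB)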
"Peter Shankland" <aopz10@.dsl.pipex.com> wrote in message
news:OUyPjwAwDHA.2448@.TK2MSFTNGP12.phx.gbl...
quote:

> We are currently in the process of looking into setting up a Data

WareHouse
quote:

> in SQL2000 so that we can use Cognos to report on the data.
> I am looking at what hardware to run SQL2000 on just for the DWH. On all

of
quote:

> our other SQL2000 servers we have a very VERY simply hardware config of a
> single RAID controller with 2 mirrored drives for the OS and then 3+

drives
quote:

> in RAID5 for the SQL Data and Logs (on same logical drive).
> Obviously, I want to make sure that I configure the new machine around a

DWH
quote:

> environment so was wondering if anyone had any suggestions of the best way
> of doing this.
> The hardware I am looking at is an IBM x255 a couple of 18Gb drive

mirrored
quote:

> for the OS and then seperate logical drives for the log and data.
> Should I use seperate RAID controllers for the SQL data and log?
> Is RAID 5 the best way to go with regards to the log and data?
> Anything else I should be looking for?
> Thanks.
>