Sunday, March 25, 2012

Best practise/architecture question

I need to load a lot of Excel, CSV, ... etc. files. These files have hundreds of columns and I need to validate the data. Some are simple range type checking, some are more complex checking involve multiple columns.

There may have several hundreds of such rules. And I may need to let the program to automatically correct some invalid data in the future.

Where to implement it in SSIS?
Or just load the files without any checking (all type to text), and checking using T-SQL?

(BTW, I don't have biztalk server).

Thanks in advance.

Read more >> Options >>

SSIS has seevral components in the data flow that can help you in the data cleansing/transformation: Derived column, script tasks, etc; so if you have to meve the data from point a to point B; you could apply all the transformation rules as a part of that process.

Based on the litle information I got, I would try with first with SSIS.

|||

I will use SSIS problem. However, I don't want to "hard code" all these rules using SSIS component because

1. There are so many rules. And rules may need to be updated.

2. User will manage the rules. It's not possible to teach them to use VS.Net to update the SSIS package

Maybe a script component to call a C# assembly, which maintain parse rules in a, for example, XML file....

|||

We have a similar issue - we have about 6-8 flat file types, with 50-80 columns each and one-lots of rules for each column. The approach I'm taking is a hybrid. I'm first loading into a catch-all (all characters) staging database, and then using isdate, isnumeric, etc. in TSQL to then pull the data back out into SSIS to do some of the things that SSIS is good at. Where isdate & isnumeric fails on the extract, I'm putting a generic error value in using a CASE in my select (either -1, or '9999-12-31' for dates) so I can at least get the data cleaned before it heads back into SSIS & the production database. From there I'm doing my multiple column validation, or other "softer" error handling. My guess is you will not be able to give the users something to configure absolutely every validation. For us it's going to be a trade-off - I'm going to enforce the high-level/blatant validations up front and do things that aren't going to change often, while giving the user the ability to configure range validations, etc.

|||

Yes, I was thinking doing the same thing. However, it sounds it ruins the purpose of using SSIS as an ETL tool.

But it sounds not easy to use SSIS as a configurable ETL tool.

|||

ydbn wrote:

Yes, I was thinking doing the same thing. However, it sounds it ruins the purpose of using SSIS as an ETL tool.

But it sounds not easy to use SSIS as a configurable ETL tool.

I think your scenario represent the same challenge regardless the ETL tool you choose. Do you have an example of your ideal solution using a different tool? If you shared that with us, I am pretty sure we could help in getting something similar in SSIS terms.

BTW, what about creating a custom task that gets the rules from a table per every file type? You could store the transformations rules in a table where they can be easily updated.

There are more than one way to tackle this problem; but flexibility won't come for free; if you want something robust an reusable, get ready to spend some time in the design table.

sql

No comments:

Post a Comment