Sunday, June 12, 2011
I'm reading: Planning integration of source systems in your data warehousing solutionTweet this !
Any ETL designer would think of integration with source systems, with ideas like, ETL would connect using X driver, X database, stage to X server, load to X destination etc. But an architect cannot take anything as provided and would have to validate even the basic questions. Some of this basic questions are listed below, and when you start figuring out answers to these questions, you would feel a need to start planning the integration from an infrastructure and integration perspective, even before ETL designers can think of the feasibility of an ETL solution.
1) Where are the source systems hosted, on a physical cluster or a VMWare based server in a virtualization environment ? Are load balancers placed on these servers ? You would require to collect Virtual IPs to connect to those servers and keep in mind how your ETL solution would read data from these servers.
2) What is the classification of data in source systems and is this in sync with the zone in which your ETL server is hosted ? Generally in any data center based hosting or enterprise class IT environments, source systems / data is classified as RESTRICTED / UNRESTRICTED / CONFIDENTIAL etc, and different data traversing policies revolve around this classification. Data can generally be exchanged only within systems in the same zone of data classification.
3) Is your connectivity uni-directional or bi-directional ? Many systems just allow uni-directional connectivity from other applications to the database server. So database server might be able to connect to other system, but other systems would not be able to connect back to the DB server i.e. connectivity can be initiated only from DB server and network policy itself would impose this restriction.
4) What is the type of connectivity allowed with the source system database server - Windows integrated or SQL based ? Mostly enterprise class IT environments allow only windows integrated connectivity, as policies can be enforced on windows id.
5) Where is your scheduling server located ? Generally application specific scheduling agents are not allowed in any enterprise class IT environment. You need to locate your scheduling batch server, and check whether you would be able to trigger execution from that server, or you would need an agent installed on your DB server and trigger execution from DB server. If your ETL execution is triggered from a scheduling batch server, connectivity needs to be available to source and destination servers from the batch server. And if execution is going to be triggered from your DB server, connectivity to source system should be available from this server.
Unless and until your solution environment has no governing policies, and you have complete discretion and governance over source and destination systems, and your solution is just hosted on application specific servers and not in any data center based hosting environments, you need to plan your integration at a much deeper level. I hope the above points would provide a few hints in this direction.