This blog is a work in progress, and will be regularly updated. If you have any feedback, thoughts or things you think should be included, I would be glad to hear it.
Being able to design, manage and maintain data pipelines is big business these days. Every major vendor offers a suite of tools to manage ETL and data pipelines: Microsoft, IBM, Oracle, SAP, SAS all have their own flavours. But I wonder how many people are tempted to hack it together themselves using their favourite programming language?
- What are the hidden dangers with custom solutions?
- What are the limitations of off-the-shelf products like SSIS or IBM Infosphere?
Efficiency, Productivity and Reliability
I think a lot of us are tempted to throw together a quick and dirty script to move and clean data. Especially if it is a relatively small or trivial task. A fully featured tool like SSIS can be daunting and seem like over kill. I have been forming my own opinions on this, but Andreas de Ruiter states is very nicely in his blog:
An ETL process is a program that periodically runs on a server and orchestrates the refresh of the data in the BI system. SQL Server Integration Services (SSIS) is a development tool and runtime that is optimized for building ETL processes. Learning SSIS involves a steep learning curve and if you have a software development background like I do, you might first be inclined to build your ETL program from scratch using a general purpose programming language such as C#. However, once you master SSIS you’ll be able to write very efficient ETL processes much more quickly. This is because SSIS lets you design ETL processes in a graphical way (but if needed you can write parts using VB or C#). The SSIS components are highly optimized for ETL type tasks and the SSIS run-time executes independent tasks in parallel where possible. If you’re a programmer you’ll find it amazingly difficult to write your own ETL process using a general purpose language and make it run more efficient than one developed in SSIS.
(source: de Ruiter A, Designing an ETL process with SSIS: two approaches to extracting and transforming data, MSDN Blogs (2012). http://blogs.msdn.com/b/andreasderuiter/archive/2012/12/05/designing-an-etl-process-with-ssis-two-approaches-to-extracting-and-transforming-data.aspx)
A large benefit to using commercial tools, is that they can help lower the cost of maintaining your ETL solution. Particularly with major vendors like Microsoft, the underlying functionality is maintained by them and does not often change although they may add new features and functions. This means you only have to worry about your logic and your data flow, not the implementation behind it.
This can sometimes be over looked. If you work in a team, it makes sense to work with tools that the entire team knows and understands. If you collaborate across teams, it is critical that something written in one office will work the same way in another office, in another city, perhaps even on the other side of the world. Of course this doesn’t limit you to just commercial software – there is nothing wrong with a collective decision to use Python + MongoDB as your core tech; but it does detract from custom-made or hacked together solutions which are unlikely to be cohesive across an organisation.
Non-relational data and SSIS?
I am working on an interesting ETL project at the moment using climate data. I need to be able to retrieve, clean, standardise and transform this data on its way to the final data store. However the ‘raw-data’ is not relational. The measurements are arranged in a rectangular matrix where the position maps to a given (longitude, latitude) point. I need to be able to filter this data based on its position in the matrix, which is inherently non-relational. I don’t know of a way to do this within SQL Server, and I am assuming (I need to confirm this) that I won’t be able to design an appropriate filter with SQL Data Tools. In this case, I think I will have to call and external R script to clean the data and write it back to file before passing it into SSIS.