Having worked on the Apache stack for some time, I decided to look at the Azure Big Data stack. My starting point is data ingest. For most big data projects the journey starts with ingesting data, then cleaning and transforming it so that it is ready for analysis; at a very basic level, that is the data lifecycle in big data projects. Azure Data Factory is Microsoft Azure's cloud-based data integration service, which automates the movement and transformation of data.
Azure Data Factory has the following constructs:
- Linked Services – These define where the data is sourced from and where it is written to. They connect the Data Factory to external data sources.
- Pipelines and Activities – A pipeline is a logical group of activities that together perform the job of moving data from a source to a destination.
- Datasets – A dataset is a representation of the data held in the data store that a linked service points to.
Linked Services provide the interfaces to external sources. Support is currently limited to Azure services, databases, file-based stores, Salesforce, and OData; a complete list can be found here.
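To make the relationship between the three constructs concrete, here is a minimal sketch in plain Python. It is a toy model, not the Azure Data Factory SDK or its JSON schema: every class, field, and name below (`LinkedService`, `Dataset`, `CopyActivity`, `Pipeline`, `unresolved_references`, and the sample values) is hypothetical and chosen only for illustration. It shows that an activity refers to datasets by name, and each dataset refers to the linked service that holds its data.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class LinkedService:
    """Connection to an external store (illustrative, not an Azure type)."""
    name: str
    kind: str  # hypothetical label such as "blob-storage" or "sql-database"


@dataclass(frozen=True)
class Dataset:
    """Named representation of data inside the store of a linked service."""
    name: str
    linked_service: str  # name of the LinkedService holding the data
    location: str        # hypothetical path or table name


@dataclass(frozen=True)
class CopyActivity:
    """One unit of work: move data from an input dataset to an output dataset."""
    name: str
    input_dataset: str
    output_dataset: str


@dataclass
class Pipeline:
    """Logical group of activities."""
    name: str
    activities: list = field(default_factory=list)


def unresolved_references(pipeline, datasets, linked_services):
    """Return the names a pipeline depends on that are not defined anywhere."""
    datasets_by_name = {d.name: d for d in datasets}
    service_names = {s.name for s in linked_services}
    missing = []
    for activity in pipeline.activities:
        for dataset_name in (activity.input_dataset, activity.output_dataset):
            dataset = datasets_by_name.get(dataset_name)
            if dataset is None:
                missing.append(dataset_name)
            elif dataset.linked_service not in service_names:
                missing.append(dataset.linked_service)
    return missing


# Sample definitions (all values are made up for the example).
services = [
    LinkedService("blob-store", "blob-storage"),
    LinkedService("warehouse", "sql-database"),
]
datasets = [
    Dataset("raw-events", "blob-store", "events/raw"),
    Dataset("clean-events", "warehouse", "dbo.events"),
]
pipeline = Pipeline(
    "ingest-events",
    [CopyActivity("copy-raw-to-warehouse", "raw-events", "clean-events")],
)

print(unresolved_references(pipeline, datasets, services))  # prints []
```

The point of the check is the dependency chain: a pipeline is only runnable when every dataset it names exists and every one of those datasets points at a defined linked service.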
From a customization standpoint, one can create custom activities. I found the linked services limited, though. In contrast, Apache NiFi seems to do better in several ways:
- Intuitive UI - the NiFi designer. Dataflows can become quite complex. Being able to visualize those flows and express them visually helps greatly to reduce that complexity and to identify areas that need to be simplified. NiFi enables not only the visual construction of dataflows, but does so in real time. Rather than design-and-deploy, it is much more like molding clay.
- Better support for external sources - comparing linked services in Azure Data Factory with processors in NiFi, I have seen NiFi come out ahead; the list can be found here.
- NiFi is highly fault tolerant
- Superior Exception Handling – finer details here.