Most applications will find the need to use big data in the near future. With data storage on the cloud very cheap compared to what an RDBMS on premise or in the cloud would cost, moving application data to NoSQL or Hadoop is a natural progression.
Having built a large enterprise architecture on Windows Azure, I did dwell on the need for Hadoop. The debates stretched from RDBMS versus cheap storage to the new complexity and tribulations of adopting Hadoop, and the extensive experimentation has been a vertical climb.
Below are the experiences, learnings and mistakes from the past three months of the Hadoop journey, and we are still not quite there.
High Level Architecture
There have been a lot of architectures whiteboarded, documented and sent to the shredder, and this one too may be up for correction. Below is what works as of today and can still change. The working architecture will be explained in Part II; Part 1 of the post talks about the explorations.
Why does one look at Hadoop as an option?
Our starting point for looking at Hadoop was the following factors (some of them):
- Our current and historical databases across different line-of-business applications stretch well beyond 10 TB.
- Querying and reporting, especially on the historical databases, was expensive in terms of cost and time.
- Predictive analysis is more in people's heads than in practice.
- The complete enterprise was moving to the cloud, so relooking at the storage strategy for a cheaper option was an alternative.
- The current BI stack is an expensive SKU.
With a lot of buzz, song and dance, we initially prepped to do a Proof of Concept on Hadoop. Below are the findings.
The Story so far….. a Very Big One…..
Since the complete enterprise architecture was drawn up on the Microsoft stack of Windows Azure, the natural choice was to look at HDInsight. Our initial learnings revealed that HDInsight was an incomplete fit.
Azure HDInsight Story – Why is this not suitable?
We started experimenting on Azure HDInsight and soon realized the limitations. Hadoop is predominantly distributed storage, which is what it is best at. The HDInsight story around analysis (static and interactive) begins to fall apart because HDInsight does not support any of the interactive stack such as Impala, Drill, Storm or Spark. If one googles for Azure HDInsight, Microsoft's own website mentions the limited support around Impala and the like, which is something of a sore point for MSFT.
Getting your hands dirty on Hadoop – For the sake of learning we started with the free sandbox provided by Hortonworks and its tutorials, which are simply amazing. Download the single-box sandbox and the tutorials will drive you to a basic understanding of Hadoop. Find the same here.
Hadoop querying on HDFS is slow….. Querying on HDFS is extremely slow, and it is meant to be that way. Hive querying on HDFS is designed for batch throughput rather than interactive response: the Hive query gets converted into MapReduce, with split creation, record generation and mapper generation happening before any results come back.
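As a minimal illustration (the customer table here mirrors the one mapped later in this post), even a trivial count like the one below is compiled by Hive into a full MapReduce job, so it pays the job start-up cost before returning a single row:
-- even this simple query launches a MapReduce job in Hive
select count(*) from customer;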
What are the alternatives? – There are multiple in-memory options, stretching from Drill and Spark to Impala and many others, and they work over HDFS. Impala is a distributed querying engine that can scale out and run on multiple datanodes. It presents a SQL-like querying engine over tables mapped onto HDFS data, with the table schema stored in the Hive metastore. The query process is fairly simple:
The user selects a certain impalad in the cluster and registers a query using the impala shell or ODBC.
The impalad that received the query from the user carries out the following pre-tasks:
- It brings the table schema from the Hive metastore and judges the appropriateness of the query statement.
- It collects the data blocks and location information required to execute the query from the HDFS namenode.
- Based on the latest update of the Impala metadata, it sends the information required to perform the query to all impalads in the cluster.
- All the impalads that received the query and metadata read the data blocks they should process from the local directory and execute the query.
- Once all the impalads complete the task, the impalad that received the query from the user collects the results and delivers them to the user.
Impalad is the Impala daemon. For a detailed read, refer here.
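For illustration, a query like the one below, submitted through impala-shell (or ODBC) against the same customer table, walks through exactly the steps above but is executed by the impalad daemons in memory rather than as MapReduce (the region column is an assumption for the example):
-- executed by the impalad daemons across the datanodes, not as MapReduce
select region, count(*) from customer group by region;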
The decision to go with Impala was based on multiple aspects: the support, the ease of use and the general availability of the same from Cloudera.
So far so good, but getting data into Hadoop is not so straightforward.
This is where most Microsoft developers and architects will have to start wearing newer hats and embrace the Linux and Java world.
Getting data into Hadoop was an uphill task. The data sources were many, the data formats variable, and on top of that the data had to be transformed in multiple places. ETL (Extract, Transform & Load) with the good old SQL Server was not quite happening here. Hadoop does provide multiple ways to load data: Sqoop, Pig and hand-written batch jobs. Complex transformations from multiple data sources do ask for a reasonably good-sized team to code them, and the ETL jobs into Hadoop were looking complex.
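As one sketch of the Sqoop route (the server, credentials, table and target directory below are placeholders, and the SQL Server JDBC driver has to be available to Sqoop):
# pull a SQL Server table into HDFS as pipe-delimited files using 4 parallel mappers
sqoop import \
  --connect "jdbc:sqlserver://dbserver:1433;databaseName=Sales" \
  --username loader --password '********' \
  --table Orders \
  --target-dir /staging/orders \
  --fields-terminated-by '|' \
  -m 4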
After some research and evaluation we zeroed in on Pentaho's Data Integration stack. Spoon is a reasonably good tool if one is not sure about the number of data sources, transformations and complex logic involved. The transformations and jobs hosted in the Pentaho Data Integration Server were nothing short of a boon considering the time one would spend writing code for the same.
Note: Documentation on integrating Pentaho Spoon with Cloudera Hadoop is very limited, and one will need to plan some time to get the integration working. The implementation we had was a complete cluster in Windows Azure, and opening up the ports is a religious exercise.
The tutorials on Pentaho Spoon are pretty elaborate.
For more reference please refer here.
Getting the Cloudera cluster on Windows Azure
Deploying a Cloudera cluster on Windows Azure Virtual Machines is a time-consuming activity. Running Ubuntu 12.04 on a Windows Azure virtual machine requires elementary knowledge of DNS BIND and of virtual networking on Azure. Please find a complete post on how to set up a Cloudera – Impala cluster on Windows Azure here.
A special mention: this learning helps one move back to Linux, and it is fun…
Impala Tables Mapping to HDFS….
The data coming into HDFS is in the following supported formats: csv, dat, txt, etc.
The Impala CREATE DDL statement can point directly to a pre-existing HDFS location:
drop table if exists customer;
-- the column list and HDFS path below are illustrative
create external table customer (customer_id int, customer_name string, region string)
row format delimited fields terminated by '|'
stored as textfile
location '/user/etl/customer';
The tutorials on Impala cover most of the scenarios one needs to know; find the same here.
Building Models on Impala
Processes in this phase include feature selection, sampling and aggregation; variable transformation; model estimation; model refinement; and model benchmarking.
The models are based on the Impala table set. The models in the current case are static and predictive. Static models are based on data pulled out of the Impala data set; they are very much in line with the standard OLAP dimensions and measures, and the reports are defined based on the same.
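A static model here is essentially a dimensions-and-measures aggregate over an Impala table, along these lines (the fact table and columns are assumptions for illustration):
-- a typical static report: measures rolled up by dimensions
select region, order_year, sum(order_amount) as total_sales
from sales_fact
group by region, order_year;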
This is a tricky area. One has to get into machine learning, and the learning curve is lengthy. The starting point is the Coursera courses on machine learning, which give a fair understanding of algorithms such as:
- linear regression, using gradient descent for linear regression
- logistic regression
- neural networks
- anomaly detection
In the next part of this post there will be detailed sections on predictive modelling and how it fits into the overall scheme of things. The programming starts off with Octave, MATLAB, R and Mahout.
The debate on which programming language to use is still on.
The goal at this phase is creating a predictive model that is powerful, robust, comprehensible and implementable. The key requirements for data scientists at this phase are speed, flexibility, productivity, and reproducibility. These requirements are critical in the context of big data: a data scientist will typically construct, refine and compare dozens of models in the search for a powerful and robust real-time algorithm.
This is one part we are still trying to get our hands around.
Integration – API
The reports and models are expected to be exposed for consumption via an API layer.
End of Part 1.. Next Part WIP