Tuesday, November 05, 2019

ELT is passé; STL (Stream-Transform-Load) is the new name for ETL

ETL: Extract Transform Load
ELT: Extract Load Transform
STL: Stream Transform Load

Everybody is interested in talking about how technology can make a difference in real-time, context-aware application experiences, but is anyone actually doing anything about it? Those who are doing something about it are the market leaders in their domain.

In the past, most of the banking and retail industries ran batch jobs to move the day's transaction data to a data warehouse at midnight (also called end-of-day processing). A standard precondition for batch processing was that no transactions should be happening in the source system while the batch job was running.

These days most banks and retail companies run 24x7 and cannot afford a minute of downtime on their systems; i.e., a customer could log on to the mobile application or website and make a transaction at midnight. They are even more concerned about setting up a disaster recovery (DR) site far away from the main production site, so that if a catastrophe hits their data center, their business does not stop. After the September 2001 attacks, one of the first things most enterprises did was build a DR site with real-time data replication technology. Now that they cannot have even a minute of downtime in their data systems and must allow customers to transact 24x7, they want to repurpose their ETL systems as STL systems, getting real-time ETL functionality with new big data technologies that can scale and process data for real-time systems.

Here is a high-level comparison of traditional ETL, CDC, and STL:

Traditional ETL
  • Batch extraction: high latency, low throughput over a large processing window
  • High load on sources; usable only during certain times of day
  • On-disk transformations
CDC (Change Data Capture)
  • Real-time replication: high throughput over a large window
  • Low load on sources; systems fully utilized
  • High load on the target to absorb high volumes
Stream-Transform-Load (a.k.a. Streaming ETL)
  • Best parts of ETL and CDC (low latency, low load on sources, overall higher throughput)
  • In-memory transformations
  • Reduced load on target systems
  • Reduced garbage-in, since bad records can be filtered in-stream (a minimal sketch of the pattern follows this list)
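To make this concrete, here is a minimal, self-contained Python sketch of the stream-transform-load pattern. It is illustrative only: the event source is simulated in code, the field names are hypothetical, and in a real deployment the stream would be a CDC feed or a message queue and the sink a data lake or warehouse table.

import json
import time
from typing import Dict, Iterator, List

def read_stream() -> Iterator[Dict]:
    """Simulate an endless stream of transaction events (hypothetical fields)."""
    sample = [
        {"txn_id": 1, "amount": "250.00", "currency": "usd"},
        {"txn_id": 2, "amount": "99.99", "currency": "eur"},
    ]
    for event in sample:
        yield event
        time.sleep(0.1)  # pretend events arrive over time

def transform(event: Dict) -> Dict:
    """In-memory transformation: cleanse and enrich each event as it arrives."""
    return {
        "txn_id": event["txn_id"],
        "amount": float(event["amount"]),
        "currency": event["currency"].upper(),
        "processed_at": time.time(),
    }

def load(batch: List[Dict]) -> None:
    """Load a small micro-batch into the target (stubbed here as a print)."""
    print(f"loading {len(batch)} records -> {json.dumps(batch)}")

def run(batch_size: int = 2) -> None:
    batch: List[Dict] = []
    for event in read_stream():
        batch.append(transform(event))  # transform before landing anywhere
        if len(batch) >= batch_size:
            load(batch)                 # frequent small loads, low target load
            batch = []
    if batch:
        load(batch)

if __name__ == "__main__":
    run()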


Some of the most common use cases of data stream processing include:

  • Industrial Automation
  • Log Analytics
  • Building Real-time data lakes
  • IoT (Wearables & devices) Analytics
  • Smart homes and cities

Industry-specific use-cases for Stream Processing:

  • Financial Services
    • Fraud Detection
    • Real-time analysis of commodity prices
    • Real-time analysis of currency exchange data
    • Risk Management
  • Retail
    • Markdown optimization 
    • Dynamic pricing, forecasting, and trends
    • Real-time Personalized Offers
    • Shopping cart defections
    • Better store and shelf management
  • Transportation
    • Tracking Containers, Delivery Vehicles, and Other Assets
    • Vehicle Management 
    • Passenger Alerts
    • Logistics and Route Optimization
  • Telecom
    • Wifi Off-Loading
    • Video Analytics
    • Network Management
    • Security Operations
    • Geolocation Marketing
    • Mobile Data Processing
  • Health Care
    • Medical Device Monitoring
    • In-home Patient Monitoring
    • Medical Fraud Detection
    • Safer Cities
  • Utilities, Oil and Gas
    • Outage Intelligence
    • Workforce Management
    • Real-time Drilling Analysis
    • Telemetry on critical assets
  • Manufacturing
    • Smart Inventory
    • Quality Control
    • Building Management
    • Logistics and Route Optimization

Monday, May 13, 2019

Data Science to Wisdom Science

The hottest technology course for undergraduate studies in the 1980s was Computer Engineering,
then it was Computer Science in the 1990s,
then it was Information Science in the 2000s; MIS in particular was the hottest thing post Y2K (1st Jan 2000), when everyone was designing systems to create dashboards and reports for executives.


In 2010, we all heard about Big Data Analytics, Cloud Computing, and Social data (including Natural Language Processing) as the hottest trends.

In 2020 we will see that the hottest degree will be Data Science. Everyone is looking for insights from data, hence the name Data Science. Some of the latest areas in Data Science are how to reduce (discard unnecessary) data for computing and how to deliver fast (real-time) data to everyone.


So Computer Science and Engineering had a 20-year lifetime, Information Science and Technology had a 20-year lifetime, and for the next 10 years it will be Data Science and Engineering.

Following the above trend, what would be the hottest degree course in 2040? Would it be below Data or above Knowledge?

My take: I don't think they would go below the Data layer; I think experts will soon call it WISDOM SCIENCE 😄




Thursday, February 28, 2019

Machine Learning: Which clustering algorithm should I use?

Clustering is the process of grouping data into classes, or clusters, so that objects within a cluster have high similarity to one another but are very dissimilar to objects in other clusters. There are several clustering methods/algorithms that could be used, and it is often confusing to decide when to use which. Here is a quick guide to how they are categorized and how to decide which one fits your data set.

There are mainly 7 types of basic clustering methods. They are:

1) Partitioning
  • Finds mutually exclusive clusters of spherical shape
  • Based on distance
  • Represents each cluster using a mean or medoid
  • Good for small- to medium-sized data sets
  • Methods:
    • k-means
    • k-medoids
    • CLARA - Clustering LARge Applications
2) Hierarchical
  • Hierarchical decomposition of the data
  • Based on agglomerative or divisive decomposition (a small agglomerative sketch follows this list of method types)
  • Cannot correct erroneous merges or splits
  • Methods:
    • AGNES - AGglomerative NESting
    • DIANA - DIvisive ANAlysis
    • BIRCH - Balanced Iterative Reducing and Clustering using Hierarchies
    • ROCK - RObust Clustering using linKs
    • CHAMELEON - Multiphase hierarchical clustering using dynamic modeling
3) Density-based methods
  • Good for arbitrary shapes
  • Based on a density or neighborhood concept
  • Possible to filter out outliers
  • Methods:
    • DBSCAN - Density-based clustering based on connected regions with high density
    • OPTICS - Ordering Points To Identify the Clustering Structure
    • DENCLUE - DENsity-based CLUstEring
4) Grid-based methods
  • Uses multi-resolution grid data structure
  • Fast in processing time irrespective of the number of data objects
5) Model-based methods
  • Hypothesizes a model for each of the clusters and finds the best fit of the data to the given model
  • Takes "noise" or outliers into account, contributing to the robustness of the approach
6) High-dimensional data methods
  • As dimensionality increases, the data usually become increasingly sparse because the data points are likely located in different dimensional subspaces
  • Frequent pattern-based clustering is one such methodology; it extracts distinct frequent patterns among subsets of dimensions that occur frequently
  • Methods: CLIQUE, PROCLUS, frequent pattern-based clustering, pCluster
7) Constraint-based methods
  • Use user-specified or application-oriented constraints. A constraint can express a user's expectation or describe "properties" of the desired clustering results, and provides an effective means of communicating with the clustering process.
  • Constraint-based methods are used in spatial clustering for clustering with obstacle objects (e.g., considering obstacles such as rivers and highways when planning the placement of automated banking machines) and user-constrained cluster analysis (e.g., considering specific constraints regarding customer groups when determining the best location for a new service station, such as "must serve at least 100 high-value customers").
  • Semi-supervised clustering employs, for example, pairwise constraints (such as pairs of instances labeled as belonging to the same or different clusters) in order to improve the quality of the resulting clustering.
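As a small illustration of the hierarchical (agglomerative) family above, here is a hedged scikit-learn sketch on synthetic data. scikit-learn and make_blobs are not mentioned in the source; the library choice and parameter values are assumptions for demonstration only.

from sklearn.cluster import AgglomerativeClustering
from sklearn.datasets import make_blobs

# Synthetic data with three well-separated groups.
X, _ = make_blobs(n_samples=200, centers=3, random_state=0)

# Ward linkage repeatedly merges the pair of clusters that least increases
# total within-cluster variance; once merged, clusters are never split again
# (which is why erroneous merges cannot be corrected later).
agg = AgglomerativeClustering(n_clusters=3, linkage="ward")
labels = agg.fit_predict(X)

print(labels[:20])  # cluster assignment of the first 20 points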
Based on the above types, here are some of the most commonly used clustering algorithms and their characteristics:

(a) K-means: Centroid based
  • Clustering Type: Partitioning
  • Detectable Shapes: Spherical-shaped clusters
  • Input parameter: the number of clusters
  • Sensitive to noise and outliers
  • Works well with small data sets only
  • Running time: O(kni), where k is the number of clusters, n the number of objects, and i the number of iterations (a short sketch follows this entry)
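A minimal k-means sketch using scikit-learn on synthetic spherical blobs; the library choice and k=3 are assumptions for illustration, not recommendations from the source.

from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Roughly spherical blobs: the kind of structure k-means is suited for.
X, _ = make_blobs(n_samples=500, centers=3, cluster_std=0.8, random_state=42)

# k (n_clusters) must be supplied up front; 3 is chosen to match the data above.
km = KMeans(n_clusters=3, n_init=10, random_state=42).fit(X)

print(km.cluster_centers_)  # the k centroids
print(km.inertia_)          # within-cluster sum of squared distances
print(km.labels_[:10])      # cluster assignments of the first few points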

(b) K-medoids: Representative-object based
  • Clustering Type: Partitioning
  • Detectable Shapes: Spherical-shaped clusters
  • Input parameter: the number of clusters
  • Small data sets (not scalable)
  • Running time: O(k(n-k)^2)
(c) CLARA
  • Clustering Type: Partitioning
  • Detectable Shapes: Spherical-shaped clusters
  • Input parameter: the number of clusters
  • Effective but sensitive to the selection of the initial samples
  • Running time: O(ks^2 + k(n-k)), where k is the number of clusters, n the number of objects, and s the sample size
  • Another variant, CLARANS (Clustering LArge Applications based upon RANdomized Search), uses a randomized search

(d) BIRCH
  • Clustering Type: Hierarchical
  • Detectable Shapes: Spherical-shaped clusters
  • Designed for large numbers of N d-dimensional data points
  • Uses a Clustering Feature (CF) tree; because a CF tree node can hold only a limited number of entries, the resulting clusters do not always correspond to what a user would consider natural clusters
  • Running time: O(n), where n is the number of objects to be clustered (a short sketch follows this entry)
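A hedged BIRCH sketch with scikit-learn; the threshold, branching_factor, and n_clusters values are illustrative assumptions, not tuned recommendations.

from sklearn.cluster import Birch
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=2000, centers=4, cluster_std=0.6, random_state=1)

# threshold and branching_factor control the CF tree; n_clusters applies a
# global clustering step to the CF-tree leaf subclusters.
birch = Birch(threshold=0.5, branching_factor=50, n_clusters=4).fit(X)

print(len(birch.subcluster_centers_))  # number of CF-tree leaf subclusters
print(birch.labels_[:10])              # final cluster label for each point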

(e) ROCK
  • Clustering Type: Hierarchical
  • Detectable Shapes: Arbitrary shape
  • N d-dimensional categorical data points
  • Designed for categorical data, emphasizes interconnectivity, ignores closeness between clusters.
  • Running time: O(n * Mm * Ma), where Mm and Ma are the maximum and average number of neighbors for a point; worst case O(Ma * n^2)
(f) CHAMELEON 
  • Clustering Type: Hierarchical
  • Detectable Shapes: Arbitrary shape
  • N d-dimensional categorical points
  • Running time: O(n^2) worst case

(g) DBSCAN
  • Clustering Type: Density
  • Detectable Shape: Arbitrary shape
  • Input parameters: the maximum radius within which a point is considered density-reachable, and the minimum number of points required to form a cluster
  • Running time: O(n log n) with a spatial index; O(n^2) worst case (a short sketch follows this entry)
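A minimal DBSCAN sketch with scikit-learn on a non-spherical synthetic dataset; the eps and min_samples values are illustrative guesses for this particular data, not general defaults.

import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons

# Two interleaving half-moons: a shape k-means handles poorly but DBSCAN finds.
X, _ = make_moons(n_samples=300, noise=0.06, random_state=0)

# eps is the maximum neighborhood radius, min_samples the MinPts threshold.
db = DBSCAN(eps=0.2, min_samples=5).fit(X)

labels = db.labels_  # -1 marks noise/outlier points
n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
print(f"clusters found: {n_clusters}, noise points: {np.sum(labels == -1)}")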
(h) OPTICS
  • Clustering Type: Density
  • Detectable Shape: Arbitrary shape
  • Outputs a cluster ordering, which is a linear list of all objects under analysis; unlike DBSCAN, it does not require the user to provide a specific density threshold (a short sketch follows this entry)
  • Running time: O(n log n) with a spatial index; O(n^2) worst case
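A short OPTICS sketch with scikit-learn, showing that no global density threshold is needed and that a reachability ordering is produced; min_samples is an arbitrary illustrative choice.

import numpy as np
from sklearn.cluster import OPTICS
from sklearn.datasets import make_blobs

# Blobs of different densities, which a single DBSCAN eps would struggle with.
X, _ = make_blobs(n_samples=300, centers=3, cluster_std=[0.4, 1.5, 0.4],
                  random_state=7)

optics = OPTICS(min_samples=10).fit(X)

print(optics.labels_[:20])                      # cluster labels (-1 = noise)
reach = optics.reachability_[optics.ordering_]  # reachability in cluster order
print(np.round(reach[:10], 3))                  # a slice of the ordering output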
(i) DENCLUE
  • Clustering Type: Density
  • Detectable Shape: Arbitrary shape
  • Kernel density estimation is used
  • Running time: O(n + h*n), where h is the average hill-climbing time for an object; O(n^2) worst case

Reference: Data Mining: Concepts and Techniques (3rd Edition) by J. Han, M. Kamber, and J. Pei

Thursday, February 07, 2019

General Myths about Big Data

Even though Big Data has existed in the technology world for more than 10 years, I still see many people with misunderstandings and confusion about what it is and how it is used.

Myth 1: Big Data is one thing - Hadoop, and Hadoop is just HDFS
  • Big data is a concept of handling large data sets whereas Hadoop is a framework. 
  • Hadoop is an open source distributed processing framework that manages data processing and storage for big data applications running in a clustered system. 
  • Hadoop's architecture consists of more applications than just HDFS (the Hadoop Distributed File System). For example, Flume is a streaming ingestion application, whereas Sqoop is a batch loading tool. Similarly, HBase is a columnar NoSQL database and Hive is a SQL query engine. Other components such as Oozie, Pig, Mahout, ZooKeeper, Ambari, YARN/MapReduce, and R are also part of the Hadoop ecosystem.
  • Similar to Hadoop, many other NoSQL databases such as Cassandra and MongoDB, which can also scale across multiple nodes, are considered Big Data systems.
  • Likewise, many cloud services such as Amazon EMR, S3, Oracle Object Storage, and Azure Data Lake are considered big data systems.

Myth 2: Big Data will replace my Relational Database

  • Many database administrators fear that Big Data or Hadoop systems will replace their relational database systems. To understand the difference between an RDBMS and other Big Data systems, we need to understand the CAP (Consistency, Availability, and Partition tolerance) theorem.
  • According to the CAP theorem, a distributed data system can be optimized for only two of these three properties, so it is virtually impossible today for a data store to provide all three at once. Refer to the diagram below:

[Image: CAP theorem diagram - each class of data store optimizes for two of Consistency, Availability, and Partition tolerance]

Myth 3: Big Data can be used in a similar way like a Relational Database

  • Relational Database systems are built around the ACID rules. In computer science, ACID (Atomicity, Consistency, Isolation, Durability) is a set of properties of database transactions intended to guarantee validity even in the event of errors, power failures, etc. 
  • Big data systems are not built with ACID Rules in mind. They are built around functionalities such as scalability, performance, large clustered deployments of data stores and use-cases such as Business Analytics, Log Analytics, Data warehouse offloading, Recommendation Engines, 360-degree view of the customer, Sentiment Analysis, Fraud detection etc.
  • Some Big Data systems rely on eventual consistency across the deployed nodes, which means a query may return different results depending on which node answers it. These data stores typically follow the BASE model: Basically Available, Soft state, Eventually consistent.
  • The majority of Big Data systems were also designed for non-transactional data, such as user session data, chat messages, videos, images, and documents.
  • If you move data from an RDBMS into a Big Data system, you cannot expect to query it the same way you would in the database, for the reasons above.
    For example, suppose you are moving change data from Oracle into HDFS. What you will see is a set of files containing the incremental data captured from the Oracle database. To get an experience similar to running SELECT * FROM MYTABLE, you would need to define a Hive table or view and run a query against it, which spins up a MapReduce job in the background. This is rarely the most efficient way to query a small dataset, since Hive is optimized for very large data sets (a hedged sketch follows below).
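As a rough illustration of the extra machinery involved, here is a hedged PySpark sketch that queries such data through Spark SQL with Hive support (a common setup, shown here instead of the Hive CLI the text mentions). The table name mytable is hypothetical and assumes a Hive table has already been defined over the HDFS files.

from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("hive-query-example")
    .enableHiveSupport()  # use the Hive metastore for table definitions
    .getOrCreate()
)

# Rough equivalent of SELECT * FROM MYTABLE: a distributed scan over HDFS files.
df = spark.sql("SELECT * FROM mytable")
df.show(10)

# Even a simple count triggers a distributed job, which is why small,
# interactive lookups are often slower here than in an RDBMS.
print(df.count())

spark.stop()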

Myth 4:  Big Data Systems are cheap
  • Typically, Big Data systems run on commodity hardware, which is considered cheap compared to engineered commercial systems.
  • If you are looking for computing power similar to that of a big data warehousing appliance, the total cost might be comparable once you factor in space, power, the number of people required to manage it, and other items such as software support and warranty.
  • It also depends on where you run it: on-premises, Hadoop as a Service, cloud-hosted, or managed services like Amazon EMR.


Myth 5: Big Data can solve problems faster than other conventional IT systems

Big Data systems are often slower than conventional systems for smaller datasets:
  • Big Data is just a concept, and Big Data systems were not originally designed for speed; they were designed to handle very large datasets spread across many machines.
  • For example, a Hive query can take a long time because it compiles into a MapReduce program, which is a batch style of processing.
  • Data may be distributed across multiple systems, so access across the network can take a long time.
  • Most of the storage uses inexpensive disks on commodity hardware, which can be slow to respond.

Myth 6: Big Data projects can be delivered faster than a conventional 2-tier or 3-tier system design

Implementing Big Data systems can take a long time, considering the following:
  • Most commercial applications or systems are ready to use like packaged software, whereas a Big Data system needs to be designed from scratch.
  • Often the biggest challenge is identifying the right system and application for the use case.
  • The next biggest challenge is changing requirements that force you to replace the system entirely; for example, you start the project on Hive and then realize you need to move to HBase, or you start with Cassandra and then have to move to another NoSQL database.
  • Data security is another major concern, since most Big Data applications are relatively nascent and might not have closed all the security loopholes.

References:
  1. https://searchdatamanagement.techtarget.com/definition/Hadoop 
  2. https://www.quora.com/Why-are-ACID-properties-not-followed-in-Big-Data