
Sunday, March 08, 2020

Comparing GoldenGate Kafka Handlers

GoldenGate differentiates itself in the market by offering three different types of adapters for Kafka:
1) Kafka Generic Handler (Pub/Sub)
2) Kafka Connect Handler
3) Kafka REST Proxy Handler

Each of these interfaces has its own advantages and disadvantages. The table below contrasts the three handlers:
| Functional Area | Kafka Handler (Pub/Sub) | Kafka Connect Handler | Kafka REST Proxy Handler |
| --- | --- | --- | --- |
| Available in open-source Apache version | Yes | Yes | No* |
| Schema Registry integration | No | Yes* | Yes |
| Formatting to JSON | Yes, with the GoldenGate formatter | Yes, with the Kafka Connect framework | Yes |
| Formatting to Avro | Yes, with the GoldenGate formatter | Yes* | Yes* |
| Designed for high-volume throughput | Yes | Yes | No |
| Transactions and operations | Yes, both (note: transactions have specific challenges, hence not recommended) | Operations only | Operations only |
| Run-time mapping of key and topic | Yes | Yes | Yes |
| Connect protocol | Kafka Producer API | Kafka Producer API | HTTPS, REST |
| Web proxy support | No | No | Yes |
| Synchronous (blocking) and asynchronous modes | Yes, both (note: synchronous has very low throughput) | No, asynchronous only | No, asynchronous only |
| Kafka Interceptor support | Yes | Yes | No |

* Available with Confluent Community License and Enterprise License
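
As a rough illustration, here is a minimal sketch of selecting and configuring the generic Kafka handler in a GoldenGate for Big Data properties file. The property names follow the documented Kafka handler configuration, but the file names and values here are illustrative, not a complete working setup:

```properties
# Handler selection: 'kafka' is the generic pub/sub handler; the other
# two handlers are selected with 'kafkaconnect' and 'kafkarestproxy'.
gg.handlerlist=kafkahandler
gg.handler.kafkahandler.type=kafka

# Standard Kafka producer settings (bootstrap servers, acks, etc.)
# live in a separate producer properties file (name is illustrative).
gg.handler.kafkahandler.kafkaProducerConfigFile=custom_kafka_producer.properties

# Run-time mapping of topic and message key from the captured table.
gg.handler.kafkahandler.topicMappingTemplate=${fullyQualifiedTableName}
gg.handler.kafkahandler.keyMappingTemplate=${primaryKeys}

# Operation mode ('op') rather than transaction mode ('tx'), per the
# recommendation in the table above.
gg.handler.kafkahandler.mode=op

# Format the payload as JSON using the GoldenGate formatter.
gg.handler.kafkahandler.format=json

# false = asynchronous (non-blocking) sends; true = synchronous,
# which has very low throughput, as noted above.
gg.handler.kafkahandler.BlockingSend=false
```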

For detailed information, refer to the GoldenGate for Big Data documentation.

Monday, May 13, 2019

Data Science to Wisdom Science

The hottest technology course for undergraduate studies in the '80s was Computer Engineering. In the '90s it was Computer Science, and in the 2000s it was Information Science; MIS in particular was the hottest thing post Y2K (1 Jan 2000), when everyone was designing systems to create dashboards and reports for executives.


Around 2010, Big Data analytics, cloud computing, and social data (including natural language processing) became the hottest trends.

In 2020 we will see that the hottest degree is Data Science. Everyone is looking for insights from data, hence the name. Some of the latest areas in Data Science are how to reduce (discard unnecessary) data for computing and how to get fast (real-time) data to everyone.


So Computer Science and Engineering had a 20-year run, Information Science and Technology had a 20-year run, and the next 10 years belong to Data Science and Engineering.

Following this trend, what will be the hottest degree in 2040? Will it be below Data or above Knowledge in the data-information-knowledge-wisdom hierarchy?

My take: I don't think we will go below the Data layer; I think experts will soon call it WISDOM SCIENCE 😄




Thursday, February 07, 2019

General Myths about Big Data

Even though Big Data has existed in the technology world for more than 10 years, I still see many people with misunderstandings and confusion about how it is used.

Myth 1: Big Data is one thing (Hadoop), and Hadoop is just HDFS
  • Big data is a concept of handling large data sets whereas Hadoop is a framework. 
  • Hadoop is an open source distributed processing framework that manages data processing and storage for big data applications running in a clustered system. 
  • The Hadoop architecture consists of many applications beyond HDFS (the Hadoop Distributed File System). Flume, for example, is a streaming ingestion application, whereas Sqoop is a batch loading application. Similarly, HBase is a columnar NoSQL database and Hive is a SQL query engine. Other applications like Oozie, Pig, Mahout, ZooKeeper, Ambari, YARN/MapReduce, and R are also part of the Hadoop ecosystem.
  • Beyond Hadoop, NoSQL databases like Cassandra and MongoDB that can scale across multiple nodes are also considered Big Data systems.
  • Likewise, cloud services like Amazon EMR, S3, Oracle Object Storage, and Azure Data Lake are considered Big Data systems.

Myth 2: Big Data will replace my Relational Database

  • Many database administrators fear that Big Data or Hadoop systems will replace their relational database systems. To understand the difference between an RDBMS and other Big Data systems, we need to understand the CAP (Consistency, Availability, and Partition Tolerance) theorem.
  • According to the CAP theorem, a distributed data system can guarantee at most two of these three properties, so it is virtually impossible for a single data store to offer all three today. Refer to the figure below:

Fig: CAP theorem (Consistency, Availability, Partition Tolerance)
Myth 3: Big Data can be used the same way as a Relational Database

  • Relational Database systems are built around the ACID rules. In computer science, ACID (Atomicity, Consistency, Isolation, Durability) is a set of properties of database transactions intended to guarantee validity even in the event of errors, power failures, etc. 
  • Big data systems are not built with ACID Rules in mind. They are built around functionalities such as scalability, performance, large clustered deployments of data stores and use-cases such as Business Analytics, Log Analytics, Data warehouse offloading, Recommendation Engines, 360-degree view of the customer, Sentiment Analysis, Fraud detection etc.
  • Some Big Data systems provide only eventual consistency across the deployed nodes, which means a query may return stale data depending on which node serves it. These data stores typically follow the BASE model: Basically Available, Soft state, Eventually consistent.
  • The majority of Big Data systems were also designed for non-transactional data, such as user session data, chat messages, videos, images, and documents.
  • If you move data from an RDBMS into a Big Data system, you cannot expect to work with it the same way you would in a database, for the reasons above.
    For example, suppose you are moving change data from Oracle into HDFS. What you will see is a set of files containing the incremental changes captured from the Oracle database. To get an experience similar to running SELECT * FROM MYTABLE, you would need to define a Hive table or view over those files and run a Hive query, which spawns a MapReduce job in the background (see the sketch below). This is not the most efficient way to run a small query, since Hive is optimized for very large datasets rather than small ones.
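
Here is a rough HiveQL sketch of that workflow. The table name, columns, and HDFS path are hypothetical, and the change-record layout is simplified compared to what a real GoldenGate-to-HDFS feed produces:

```sql
-- Hypothetical external table over change files landed in HDFS.
CREATE EXTERNAL TABLE mytable_cdc (
  op_type STRING,   -- I/U/D: insert, update, or delete
  op_ts   STRING,   -- operation timestamp
  id      INT,
  name    STRING
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION '/user/ogg/mytable';

-- Reconstruct the current state: keep the latest operation per key,
-- then drop deletes.
CREATE VIEW mytable AS
SELECT id, name
FROM (
  SELECT id, name, op_type,
         ROW_NUMBER() OVER (PARTITION BY id ORDER BY op_ts DESC) AS rn
  FROM mytable_cdc
) t
WHERE rn = 1 AND op_type <> 'D';

-- Behaves like SELECT * FROM MYTABLE, but compiles to a batch job.
SELECT * FROM mytable;
```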

Myth 4:  Big Data Systems are cheap
  • Big Data systems typically run on commodity hardware, which is considered cheap compared to engineered commercial systems.
  • However, if you are looking for computing power similar to a large data warehousing appliance, the total cost may be comparable once you account for space utilization, power utilization, the number of people required to manage the system, and factors such as software support and warranty.
  • Cost also depends on where you run it: on-premises, Hadoop as a Service, cloud-hosted, or specialty services like Amazon EMR.


Myth 5: Big Data can solve problems faster than other conventional IT systems

Big Data systems can actually be slower than conventional systems for smaller datasets:
  • Big Data is just a concept, and it was not originally designed for speed; it was designed to handle very large datasets spread across different systems.
  • For example, a Hive query can take a long time because it compiles to a MapReduce program, which is a batch-style job.
  • Data may be distributed across multiple systems, so access across the network can take a long time.
  • Most of the storage uses cheap disks, since the system runs on commodity hardware, which can make responses slow.

Myth 6: Big Data projects can be delivered faster than a conventional 2-tier or 3-tier system design

Implementing Big Data systems can take a really long time considering the following:
  • Most commercial applications and systems are ready to use as packaged software, whereas a Big Data system needs to be designed from scratch.
  • Often the biggest challenge is identifying the right system and application for the use case.
  • The next biggest challenge is changing requirements that force you to completely replace the system; for example, you start the project on Hive and then realize you need to move to HBase, or you start with Cassandra and then have to move to another NoSQL database.
  • Another major concern is data security, since most Big Data applications are relatively nascent and may not have closed all the security loopholes.

References:
  1. https://searchdatamanagement.techtarget.com/definition/Hadoop 
  2. https://www.quora.com/Why-are-ACID-properties-not-followed-in-Big-Data 

Thursday, September 14, 2017

New Age of Data Mining with Machine Learning, Deep Learning, Natural Language Processing, Artificial Intelligence and Robotics technologies

We are currently in the middle of a perfect storm: an incredible time for transformative technology, and an evolutionary step for humanity. In the last two decades, computing power has become much cheaper and data has become widely available for computing to use. What will make the difference is the ability of machines to think cognitively, in a way similar to how the human mind thinks. Machines are able to identify, gather, store, correlate, reason, and predict based on the historical data and patterns available to them. They are far more efficient at storing and computing all the possibilities, including permutations and combinations, much faster than humans can.

We now get recommendations for products and services related to what we have bought or searched for earlier on the internet (e.g., Amazon). We can talk to personal digital assistants and get valuable work done through them (e.g., Apple Siri, Amazon Alexa, and Google Assistant). Our personal email can correlate events and turn them into actions (e.g., an email in Gmail with a flight schedule, and Google Maps automatically mapping that location).

This is the new age of data mining. Data mining as a branch of computer science has existed for many years; it was the next logical step after processing and storing data in data warehouses. Data mining is defined as the process of discovering hidden, valuable knowledge by analyzing large amounts of data stored in databases or data warehouses, using techniques such as machine learning, artificial intelligence (AI), and statistics. Knowledge Discovery in Databases, or KDD for short, describes the broader process of finding knowledge in data and emphasizes the "high-level" application of particular data mining methods.
 
Fig: KDD Process
The various components of a data mining algorithm are as follows:
1) Model representation: determine the nature and structure of the representation to be used.
2) Score function: measure how well different representations fit the data.
3) Search/optimization method: an algorithm to optimize the score function.
4) Data management: decide what principles of data management are required to implement the algorithm efficiently.

The data mining process included classification, clustering, and regression as its fundamental techniques. A typical data mining algorithm used to look like this:

Fig: Data Mining Algorithm
Here are some of the regression models that used to be represented with various algorithmic approaches:

Fig: Regression models
During data mining model definition, different datasets were used:
a) "training data", used to train the model;
b) "validation data", used to calculate an estimate of the squared error score for model selection; and
c) "test data", used to calculate an unbiased estimate of the squared error score of the selected model.
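
Here is a minimal Python sketch of that protocol (assuming NumPy and scikit-learn; the synthetic data and the candidate models are purely illustrative). It ties together the components above: a linear model as the representation, squared error as the score function, least squares as the search method, and a train/validation/test split for selection and unbiased evaluation.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic regression data, for illustration only.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 3))
y = X @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.1, size=1000)

# 60% training, 20% validation, 20% test.
X_train, X_rest, y_train, y_rest = train_test_split(X, y, test_size=0.4, random_state=0)
X_val, X_test, y_val, y_test = train_test_split(X_rest, y_rest, test_size=0.5, random_state=0)

# Fit candidate models on the training data (the search/optimization step).
candidates = {"linear": LinearRegression(), "ridge": Ridge(alpha=1.0)}
for model in candidates.values():
    model.fit(X_train, y_train)

# Use the validation squared error (the score function) to select a model.
best_name = min(candidates,
                key=lambda n: mean_squared_error(y_val, candidates[n].predict(X_val)))

# Report an unbiased estimate of squared error on the held-out test data.
test_mse = mean_squared_error(y_test, candidates[best_name].predict(X_test))
print(f"selected={best_name}, test MSE={test_mse:.4f}")
```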

I kept hearing new jargon such as Data Science, Machine Learning, and Deep Learning used by amateurs and even experts in the IT field, often without understanding what the words mean. So I decided to take a course on the subject from experts, including some hands-on labs. Once I took the course, I realized that these buzzwords are largely the same old data mining, statistics, and neural networks: a typical case of old wine in a new bottle, like "Big Data" and "Infocomm" before them.

Many entrepreneurs are also building technology businesses around Artificial Intelligence and Robotics in various business domains, including biology. For example: www.claralabs.com, www.vicarious.com, www.zymergen