Thursday, September 24, 2020

What will be the main AI projects in enterprises for next few years?

Before I answer what the enterprise projects around AI will be, let's first look at the consumer-facing projects that use AI:

- Shopping recommendation on Amazon.com

- Song recommendation on Spotify

- Movies/Video recommendation on Netflix

- Face recognition on Facebook for tagging

- Image editing using FaceApp or similar

- Talking to Chat applications on company websites  

- Google Assistant/Alexa/Siri conversational system

- Sentiment Analysis for social media customer service

- Self-driving cars

- Route planning on maps

- Facial recognition

- Object recognition

- Video surveillance (intrusion detection and object tracking)

- Audience voice mix in IPL matches in September 2020 (a COVID season with no physical audience: https://www.khaleejtimes.com/sports/ipl-2020/ipl-2020-can-do-without-the-plastic-noise)


In a similar manner, I expect that enterprise systems and applications will also start to implement similar projects in specific industry verticals, as follows:

- Fraud detection in financial transactions (e.g., money laundering, credit card fraud)

- Identifying threatening and ransom calls from Telecom call data

- Real-time Retail offers for customers 

- Online Exam proctoring

- Utility (Gas/Power/Water) demand-based generation

- Improved Customer support systems (much better than today's dumb chatbots)

- Proactive medical detection from health-data trackers

- Context-sensitive and time-sensitive advertisements

- Autonomous server elasticity optimizing billing and performance

- Genetic Algorithms in Drug discovery

- Genome analysis for early detection of diseases (like cancer), with warning and treatment while the disease is still in its developing stages


Now let's look into generic platforms that might come up in the next few years:

- AutoML Platform: An AutoML platform can take any given data, prepare it, and run multiple ML algorithms with additional tuning to yield the best result. This would be an AI beginner's platform and will see a lot of adoption between 2020 and 2023.
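To make the idea concrete, here is a deliberately tiny, self-contained sketch of that AutoML loop in Python. The candidate "algorithms" (a mean predictor and a least-squares line) and all the names are illustrative stand-ins, not any particular AutoML product:

```python
# Minimal sketch of the AutoML idea: try several candidate models on the same
# prepared data, score each on a held-out split, and keep the best performer.
# The "models" here are deliberately trivial stand-ins for real ML algorithms.

def mean_model(train_y):
    """Predict the training mean for every input (a baseline 'algorithm')."""
    mean = sum(train_y) / len(train_y)
    return lambda x: mean

def linear_model(train_x, train_y):
    """Least-squares fit y = a*x + b (another candidate 'algorithm')."""
    n = len(train_x)
    mx, my = sum(train_x) / n, sum(train_y) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(train_x, train_y))
    var = sum((x - mx) ** 2 for x in train_x)
    a = cov / var
    b = my - a * mx
    return lambda x: a * x + b

def mse(model, xs, ys):
    return sum((model(x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

def auto_select(train_x, train_y, test_x, test_y):
    """Run every candidate, score on held-out data, return the best (name, score)."""
    candidates = {
        "mean": mean_model(train_y),
        "linear": linear_model(train_x, train_y),
    }
    scores = {name: mse(m, test_x, test_y) for name, m in candidates.items()}
    best = min(scores, key=scores.get)
    return best, scores[best]

train_x, train_y = [1, 2, 3, 4], [2.1, 3.9, 6.0, 8.1]   # roughly y = 2x
best, score = auto_select(train_x, train_y, [5, 6], [10.0, 12.2])
print(best)  # the linear candidate wins on this data
```

A real AutoML platform adds data preparation, hyperparameter search, and cross-validation on top of exactly this select-the-best-scorer loop.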

- RPA Platform: RPA (Robotic Process Automation) is more of a solution than a problem; it is the automation of tasks with AI. Most of the existing manual systems will be replaced by it.

- NLP platform: Voice to words (like writing meeting minutes, emails, etc.) and words to voice. This would be for specialized enterprise applications, including call-center platform automation or meeting schedulers (say "YES" instead of pressing 1).

- Semantic modeling: The majority of data is textual and needs accurate correlation to drive useful decisions, and this is where semantic modeling helps.

- Prognostics Platform: Predictive analytics, root-cause analysis, and solution recommendation, useful for enterprises to lower costs and improve productivity. This could also power system and application monitoring/management tools.

- Optimization Platform: Optimization is a huge problem everywhere, from supply chain to manufacturing, sales to marketing. Most of these would still be industry-vertical solution platforms for the next 2-3 years.

- Data Quality Platform: Data quality is a huge problem in enterprises: a lot of data is missing or inaccurate (especially in manufacturing), and there are too many data sources to search for information.

- Real-time Online ML Platform: An online machine learning platform that learns in real time from ever-changing data. It's not a single learning algorithm; in fact, many algorithms can learn online. An example would be stock-market price/volume data, which might never follow a pattern from the past.
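A minimal sketch of the online-learning idea, assuming a single numeric feature; the class and learning rate below are illustrative, not any specific platform's API:

```python
# Sketch of online (incremental) learning: the model updates one observation at
# a time as the stream arrives, instead of retraining on a stored batch.
# Here: a single-feature linear predictor y ≈ w*x + b trained by SGD.

class OnlineLinear:
    def __init__(self, lr=0.01):
        self.w, self.b, self.lr = 0.0, 0.0, lr

    def predict(self, x):
        return self.w * x + self.b

    def learn_one(self, x, y):
        """Update weights from a single (x, y) example, then discard it."""
        err = self.predict(x) - y
        self.w -= self.lr * err * x
        self.b -= self.lr * err

model = OnlineLinear(lr=0.05)
# Simulated stream following y = 2x + 1; each record is seen once.
for x, y in ((x, 2 * x + 1) for x in [1, 2, 3, 0.5, 1.5] * 200):
    model.learn_one(x, y)
print(model.predict(4))  # close to 2*4 + 1 = 9 once converged
```

The point is the `learn_one` interface: nothing is ever re-read, so the model keeps tracking the stream even when the underlying pattern drifts.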

- Digital Twin Platform: A digital twin is a digital representation of a physical object or system. The technology behind digital twins has expanded to include large items such as buildings, factories, cities, people, and processes.
An example would be to simulate a cricket bowler's actions and bowling style.

However, I think it will take a longer time for such platforms to evolve. On the ground, it is not simple to build different solutions from a common platform, unlike the standard platforms that exist today (say, a data quality tool that could be used across different verticals). Tons of variations exist, and manually connecting the dots is sometimes not possible because the people who hold the knowledge move on. With many new systems being introduced, continuity is lost and correlations are not clear at all. This is why a software company cannot build generic tools/platforms for some of the real-world cases.


Sunday, March 29, 2020

Should you invest in Natural Language Processing (NLP) technology?

What is the need for Natural Language Processing (NLP) technology in software products?

I think NLP as a basic technology will be needed in all software products, especially in end-user interfaces (e.g., websites, mobile apps, Amazon Echo apps, etc.). The NLP implementation requirement will be standard functionality, very similar to monitoring functionality or a multi-language (I18N & L10N) feature in a standard product.

Why do you say that NLP would be an integral part of every IT product? Can you explain a few NLP use-cases in IT products?

To answer the need for integral usage of NLP, let's first look at the various use-cases of NLP that would be useful in your product and how they can add value to it.

  1. Chatbots - Every website would need this real-time customer-interaction tool to filter a lot of customer queries before routing them to the sales and support teams.
  2. Search Engine - Customers and internal teams should be able to issue human-like queries against the knowledge base.
  3. Spam Filtering - Standard filtering, organizing, and prioritizing of incoming job tasks or email.
  4. Transcription of Audio/Video - The option of automatically scripting all human speech (speech to text). In the past this used to be a major task in health-care transcription for legal purposes, but today many tools also want to create the text automatically.
    For example, a YouTube-like application should generate subtitles on upload, or let you search someone's speech for certain words and point out where such a term was referenced.
  5. Content or News Curation - This is a very interesting use-case for industry-specific business scenarios, such as identifying some content and then processing it, in domains like fake news, advertising, market intelligence, recruitment, social media monitoring, etc.
  6. Sentiment Analysis - Organizations can find out the emotion of a customer or end-user from the terms the customer is using and can take appropriate action based on that emotion.
  7. Intelligent Conversational Systems for voice-driven applications (voice bots) - Human-to-machine interface over voice. Examples are Amazon Alexa, Google Assistant ("OK Google"), and Apple Siri.
  8. Automatic Machine Translation - This can automatically translate languages, like Google Translate, and also provide computer-assisted coding of standard business rules, or even translation of code from one language to another.
  9. Cognitive Assistant - A personal assistant that stores all your information, including your schedules, reminds you of your activities, and also recommends improvement activities: time to stand up, sip some water, or cool down so as not to raise your blood pressure/heart rate, after integrating with your heart-rate/BP monitoring application.
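As a toy illustration of use-case 6, here is a lexicon-based sentiment scorer. The word lists are made up for the example; production systems would use trained models, but the idea of scoring a customer's term usage is the same:

```python
# Toy sketch of lexicon-based sentiment analysis: count positive and negative
# terms in a message and map the balance to a label.

POSITIVE = {"great", "love", "excellent", "happy", "thanks"}
NEGATIVE = {"terrible", "hate", "awful", "angry", "broken"}

def sentiment(text):
    words = text.lower().split()
    score = sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"

print(sentiment("I love this product, excellent support"))        # positive
print(sentiment("The update is terrible and the app is broken"))  # negative
```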

Why should you invest in NLP?  

  • As a Product Owner - If you do not invest right in NLP technology, your products will lag behind your competitors' solutions. If a novice developer has simply copied some public NLP code/library, the customer experience might be really poor, and customers might think your NLP interface is idiotic. (I have personally felt this very often with many NLP applications.)
  • As a Developer - This is a great opportunity for getting a new job, giving you a competitive advantage over developers who do not know NLP technology. If you know the fundamentals of this technology, you can tune your NLP application to the business use-case and provide a valuable contribution to your product, thereby giving a good customer experience and a competitive advantage.

Sunday, March 08, 2020

Comparing GoldenGate Kafka Handlers

GoldenGate is the only product in the market differentiated by having three different types of adapters to Kafka. The three connectors to Kafka are:
1) Kafka Generic Handler (Pub/Sub)
2) Kafka Connect Handler
3) Kafka REST Proxy Handler

Each of these interfaces has its own advantages and disadvantages. The table below compares and contrasts the differences between the above three handlers:
FUNCTIONAL AREA | KAFKA HANDLER (PUB/SUB) | KAFKA CONNECT HANDLER | KAFKA REST PROXY HANDLER
Available in open-source Apache version | Yes | Yes | No*
Schema Registry integration | No | Yes* | Yes
Formatting to JSON | Yes, with GoldenGate Formatter | Yes, with Kafka Connect framework | Yes
Formatting to Avro | Yes, with GoldenGate Formatter | Yes* | Yes*
Designed for high-volume throughput | Yes | Yes | No
Transactions and operations | Yes, both (note: transactions have specific challenges, hence not recommended) | Operations only | Operations only
Run-time mapping of key and topic | Yes | Yes | Yes
Connect protocol | Kafka Producer API | Kafka Producer API | HTTPS, REST
Web proxy support | No | No | Yes
Synchronous (blocking) and asynchronous modes | Yes, both (note: synchronous has very low throughput) | No, asynchronous only | No, asynchronous only
Kafka Interceptor support | Yes | Yes | No

* Available with Confluent Community License and Enterprise License
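As an illustration, a minimal properties sketch for the generic Kafka (Pub/Sub) handler might look like the following. The property names come from the GoldenGate for Big Data handler configuration; the values here are placeholders, not a tested configuration:

```properties
# Select the generic Kafka handler (Pub/Sub)
gg.handlerlist=kafkahandler
gg.handler.kafkahandler.type=kafka
# Standard Kafka producer settings (bootstrap servers, serializers, ...)
gg.handler.kafkahandler.KafkaProducerConfigFile=custom_kafka_producer.properties
# Run-time topic mapping, e.g. one topic per source table
gg.handler.kafkahandler.topicMappingTemplate=${tableName}
# Output format produced by the GoldenGate formatter
gg.handler.kafkahandler.format=json
# Operation mode (op) rather than transaction mode (tx)
gg.handler.kafkahandler.mode=op
```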

For detailed information, refer to the GoldenGate for Big Data documentation.

Should developers write additional code which is not given in specifications?


The question is: can developers write additional code that is not given in the specifications?

The answer is self-explained by the image below:

A developer looking at the code would give two solutions to fix the above problem:
i)  Instead of the assignment operator, the developer should have used a comparison operator: "isCrazyMurderingRobot == true"
ii) Use the final keyword so that it is unalterable: "static final bool isCrazyMurderingRobot = false;"

But I think the above two solutions are not the right ones. The real problem is that the whole routine was an unnecessary piece of code: it specifies an option to kill(humans), which was never expected to be programmed as per the functional specification. A programmer who tried to act smart, but made a syntactical error, created the whole mess.

When I was a software developer, I remember asking a product manager whether it was acceptable to write some additional functions (or methods) in the code for extra validation that was not in the specification. The answer I got was an absolute "NO", and he said it would be considered a "SIN" in a developer's job. When I asked him why, he explained: it might be easy for a developer to add a new feature into the product, but it is hugely difficult to drop a feature that is already in the code. So as a developer, it might take a day to write a few hundred lines of code, but it takes years to maintain them and eventually remove them.

Let's take the example of writing connector code: a simple program that connects to MongoDB and, as per the proposed certification matrix, should support versions 3.5 and 3.6. As a developer, you might have been proactive and added an additional check of the MongoDB version in the code. What happens with this additional check is that if the customer chooses to upgrade MongoDB to 4.0, your code will stop working and will require a patch to run. If the check was not there, a simple sanity test in your automation suite would have certified the same old code with MongoDB 4.0 as well.
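The MongoDB scenario can be sketched without any real driver; the function names and the "certified versions" set below are hypothetical stand-ins for the connector logic:

```python
# Illustration of why the unspecified version check is harmful. The connector
# below is a stand-in (no real MongoDB driver involved); 'server_version' is
# whatever the server reports at connect time.

CERTIFIED = {"3.5", "3.6"}  # versions in the certification matrix

def connect_with_check(server_version):
    """The 'proactive' connector: hard-fails on anything not in the matrix."""
    if server_version not in CERTIFIED:
        raise RuntimeError(f"Unsupported MongoDB version {server_version}")
    return "connected"

def connect_without_check(server_version):
    """The connector as specified: just connect; certification is a test concern."""
    return "connected"

print(connect_without_check("4.0"))   # connected - old code still works
try:
    connect_with_check("4.0")         # customer upgraded: now needs a patch
except RuntimeError as e:
    print(e)
```

The unspecified check turns a future compatibility question (answerable by re-running the test suite) into a guaranteed field failure.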

If you have a strong urge to write that code, write it in a private branch or a commented section, as proactive code that may be required in the future.

In summary, it is a professional cardinal "SIN" to add additional code into a product mainline without a Product Manager's approval. 

Tuesday, November 05, 2019

ELT is passé; STL(Stream-Transform-Load) is the new name for ETL

ETL: Extract Transform Load
ELT: Extract Load Transform
STL: Stream Transform Load

Everybody is interested in talking about how technology can make a difference in real-time and context-based application experiences, but is anyone actually doing anything about it? Those who are, are the market leaders in their domain.

In the past, most banking and retail companies ran batch jobs at midnight to move the day's transaction data to a data warehouse (also called end-of-day processing). One of the standard conditions for batch processing was that no transactions should be happening in the source data system while the batch was running.

These days most banks and retail companies run 24x7 and cannot afford a minute of downtime on their systems: a customer could log on to the mobile application or website and do a transaction at midnight. They are even more concerned about setting up a disaster recovery (DR) site far away from the main production site, so that if a catastrophe hits their data center, their business does not stop. After the September 2001 attacks, one of the first things most enterprises did was to create a DR site with real-time data replication technology. Now that they cannot have even a minute of downtime in their data systems and must allow customers to transact 24x7, they want to repurpose their ETL systems as STL systems, gaining real-time ETL functionality with the new big data technologies that can scale and process data even for real-time systems.

Here is a high-level comparison between ETL, CDC, and STL.

Traditional ETL
  • Batch mode of extraction, high latency, low throughput on large window
  • High load on sources, usage only during certain times of day
  • On-disk transformations
CDC (Change Data Capture)
  • Real-time Replication, high throughput in a large window
  • Low load on sources, fully utilized systems
  • High load on target to load high volumes
Stream-Transform-Load (a.k.a. Streaming ETL)
  • Best parts of ETL and CDC (low latency, low load on sources, overall higher throughput)
  • In-memory transformations
  • Reduced load to target systems
  • Reduce Garbage-in
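The Stream-Transform-Load flow above can be sketched with Python generators; the record shapes and field names are invented for illustration:

```python
# Sketch of the Stream-Transform-Load idea with generators: records are
# transformed in memory as they arrive and loaded one at a time, instead of
# being staged on disk and batch-processed at midnight.

def stream(records):
    """Stand-in for a CDC/stream source emitting change records."""
    yield from records

def transform(records):
    """In-memory transformation; also drops garbage before it reaches the target."""
    for rec in records:
        if rec.get("amount") is None:      # reduce garbage-in
            continue
        yield {"account": rec["account"], "amount_cents": int(rec["amount"] * 100)}

def load(records, target):
    """Incremental load: the target sees each record with low latency."""
    for rec in records:
        target.append(rec)

source = [
    {"account": "A1", "amount": 12.5},
    {"account": "A2", "amount": None},     # bad record, filtered out
    {"account": "A3", "amount": 3.0},
]
warehouse = []
load(transform(stream(source)), warehouse)
print(warehouse)  # two clean records, loaded as they streamed through
```

Because each stage pulls one record at a time, nothing waits for a batch window, which is exactly the latency advantage STL claims over traditional ETL.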


Some of the most common use-cases of data stream processing include:

  • Industrial Automation
  • Log Analytics
  • Building Real-time data lakes
  • IoT (Wearables & devices) Analytics
  • Smart Homes and Cities

Industry-specific use-cases for Stream Processing:

  • Financial Services
    • Fraud Detection
    • Real-time analysis of commodity prices
    • Real-time analysis of currency exchange data
    • Risk Management
  • Retail
    • Markdown optimization 
    • Dynamic pricing and forecasting and Trends
    • Real-time Personalized Offers
    • Shopping cart defections
    • Better store and shelf management
  • Transportation
    • Tracking Containers, Delivery Vehicles, and Other Assets
    • Vehicle Management 
    • Passenger Alerts
    • Logistics and Route Optimization
  • Telecom
    • Wifi Off-Loading
    • Video Analytics
    • Network Management
    • Security Operations
    • Geolocation Marketing
    • Mobile Data Processing
  • Health Care
    • Medical Device Monitoring
    • In-home Patient Monitoring
    • Medical Fraud Detection
  • Safer Cities
  • Utilities, Oil and Gas
    • Outage Intelligence
    • Workforce Management
    • Real-time Drilling Analysis
    • Telemetry on critical assets
  • Manufacturing
    • Smart Inventory
    • Quality Control
    • Building Management
    • Logistics and Route Optimization

Monday, May 13, 2019

Data Science to Wisdom Science

The hottest technology course for undergraduate studies in the 80s was Computer Engineering,
then it was Computer Science in the 90s,
then it was Information Science in the 2000s; MIS in particular was the hottest thing post Y2K (1st Jan 2000), and everyone was designing systems to create dashboards and reports for executives.


In 2010, we all heard about Big Data Analytics, Cloud Computing, and social data (including Natural Language Processing) as the hottest trends.

In 2020 we will see that the hottest graduation degree is Data Science. Everyone is looking for insights from data, hence the name Data Science. Some of the latest areas in Data Science are how to reduce (discard unnecessary) data for computing and how to get fast (real-time) data for everyone.


So Computer Science and Engineering had a 20-year lifetime, Information Science and Technology had a 20-year lifetime, and for the next 10 years it will be Data Science and Engineering.

Following the above trend, what would be the hottest graduation course in 2040? Would it be below Data or above Knowledge?

My take: I don't think they would go below the Data layer; I think experts will soon call it WISDOM SCIENCE 😄




Thursday, February 28, 2019

Machine Learning: Which clustering algorithm should I use?

Clustering is the process of grouping data into classes, or clusters, so that objects within a cluster have high similarity to one another but are very dissimilar to objects in other clusters. There are several clustering methods/algorithms that can be used, and it is often confusing to decide when to use which. Here is a quick tip on how to categorize them and decide which one is best for your data set.

There are mainly 7 types of basic clustering methods. They are:

1) Partitioning
  • Find mutually exclusive cluster of spherical shape
  • Based on distance
  • Represent cluster using mean or medoid
  • Good for small to medium sized clusters
  • Methods:
    • k-means
    • k-medoids
    • CLARA - Clustering LARge Applications
2) Hierarchical
  • Hierarchical decomposition method
  • Based on agglomerative or divisive approaches
  • Cannot correct erroneous merges or splits
  • Methods: 
    • BIRCH - Balanced Iterative Reducing and Clustering using Hierarchies
    • ROCK - RObust Clustering using linKs
    • CHAMELEON - Multiphase hierarchical clustering using dynamic modeling
    • AGNES - AGglomerative NESting
    • DIANA - DIvisive ANAlysis
3) Density-based methods
  • Good for arbitrary shapes
  • Based on density or neighborhood concepts
  • Possible to filter outliers
  • Methods:
    • DBSCAN - Density-Based Spatial Clustering of Applications with Noise
    • OPTICS - Ordering Points To Identify the Clustering Structure
    • DENCLUE - DENsity-based CLUstEring
4) Grid based methods
  • Uses a multi-resolution grid data structure
  • Fast processing time, irrespective of the number of data objects
5) Model based methods

  • Hypothesizes a model for each cluster and finds the best fit of the data to that model
  • Takes "noise" or outliers into account, contributing to the robustness of the approach
6) High Dimensional data methods
  • As dimensionality increases, the data usually become increasingly sparse because the data points are likely located in different dimensional subspaces
  • Frequent pattern-based clustering is another clustering methodology, which extracts distinct frequent patterns among subsets of dimensions that occur frequently.
  • Methods: CLIQUE, PROCLUS, frequent-pattern-based clustering, pCluster
7) Constraint based methods
  • User-specified or application-oriented constraints. A constraint can express a user's expectation or describe "properties" of the desired clustering results, and provides an effective means of communicating with the clustering process.
  • Constraint-based methods are used in spatial clustering for clustering with obstacle objects (e.g., considering obstacles such as rivers and highways when planning the placement of automated banking machines) and user-constrained cluster analysis (e.g., considering specific constraints regarding customer groups when determining the best location for a new service station, such as "must serve at least 100 high-value customers").
  • Semi-supervised clustering employs, for example, pairwise constraints (such as pairs of instances labeled as belonging to the same or different clusters) to improve the quality of the resulting clustering.
Based on the above types, here are some of the most commonly used clustering methods:

(a) K-means: Based on centroids
  • Clustering Type: Partitioning
  • Detectable Shapes: Spherical-shaped clusters
  • Input parameter: the number of clusters
  • Sensitive to noise and outliers
  • Works well with small data sets only
  • Running time: O(kni), where k is the number of clusters, n the number of objects, and i the number of iterations
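To see where the O(kni) cost comes from, here is a bare-bones k-means in plain Python (illustrative only; real workloads should use a library implementation):

```python
# Bare-bones k-means: each of the i iterations assigns every one of the n
# objects to its nearest of the k means, then recomputes the means.
import math
import random

def kmeans(points, k, iters=20, seed=0):
    rng = random.Random(seed)
    means = rng.sample(points, k)                      # initial centroids
    for _ in range(iters):                             # i iterations
        clusters = [[] for _ in range(k)]
        for p in points:                               # n objects ...
            j = min(range(k), key=lambda c: math.dist(p, means[c]))  # ... times k distances
            clusters[j].append(p)
        means = [tuple(sum(xs) / len(xs) for xs in zip(*c)) if c else means[j]
                 for j, c in enumerate(clusters)]
    return means

# Two well-separated blobs: the means land near the two blob centers.
pts = [(0, 0), (1, 0), (0, 1), (10, 10), (11, 10), (10, 11)]
print(sorted(kmeans(pts, 2)))
```

The nested loops make the sensitivity to outliers visible too: a single far-away point drags its cluster's mean toward it on every iteration.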

(b) K-medoids: Based on representative objects
  • Clustering Type: Partitioning
  • Detectable Shapes: Spherical-shaped clusters
  • Input parameter: the number of clusters
  • Small data sets (not scalable)
  • Running time: O(k(n-k)^2)
(c) CLARA
  • Clustering Type: Partitioning
  • Detectable Shapes: Spherical-shaped clusters
  • Input parameter: the number of clusters
  • Effectiveness is sensitive to the selection of initial samples
  • Running time: O(ks^2 + k(n-k)), where k is the number of clusters and s is the size of a sample
  • Another version, CLARANS (Clustering LArge Applications based upon RANdomized Search), uses a randomized algorithm

(d) BIRCH
  • Clustering Type: Hierarchical
  • Detectable Shapes: Spherical-shaped clusters
  • Input: a large number N of d-dimensional data points
  • Uses a Clustering Feature (CF) tree; because a CF tree can hold only a limited number of entries due to its size, it does not always correspond to what a user may consider a natural cluster
  • Running time: O(n), where n is the number of objects to be clustered

(e) ROCK
  • Clustering Type: Hierarchical
  • Detectable Shapes: Arbitrary shape
  • Input: N d-dimensional categorical data points
  • Designed for categorical data; emphasizes interconnectivity, ignores closeness between clusters
  • Running time: O(n * Mm * Ma), where Ma and Mm are the average and maximum numbers of neighbors for a point; worst case O(Ma * N^2)
(f) CHAMELEON
  • Clustering Type: Hierarchical
  • Detectable Shapes: Arbitrary shape
  • Input: N d-dimensional categorical points
  • Running time: O(n^2) worst case

(g) DBSCAN
  • Clustering Type: Density
  • Detectable Shapes: Arbitrary shape
  • Input parameters: the maximum distance for a point to be considered density-reachable, and the minimum number of points in a cluster
  • Running time: O(n log n) with a spatial index; O(n^2) worst case
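A minimal DBSCAN sketch in plain Python, using the two input parameters above (a neighborhood radius eps and a minimum number of points); it runs in the O(n^2) worst case because it has no spatial index:

```python
# Minimal DBSCAN sketch: a point with at least min_pts neighbors within eps is
# a core point; clusters grow by connecting density-reachable points, and
# anything unreached is labeled noise (-1).
import math

def dbscan(points, eps, min_pts):
    labels = [None] * len(points)          # None = unvisited
    def neighbors(i):
        return [j for j in range(len(points))
                if math.dist(points[i], points[j]) <= eps]
    cluster = 0
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        seeds = neighbors(i)
        if len(seeds) < min_pts:
            labels[i] = -1                 # noise (may be reclaimed later)
            continue
        labels[i] = cluster
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster        # border point, reclaimed from noise
            if labels[j] is not None:
                continue
            labels[j] = cluster
            nb = neighbors(j)
            if len(nb) >= min_pts:         # j is core: keep expanding through it
                queue.extend(nb)
        cluster += 1
    return labels

pts = [(0, 0), (0.5, 0), (0, 0.5), (5, 5), (5.5, 5), (5, 5.5), (20, 20)]
print(dbscan(pts, eps=1.0, min_pts=3))  # two clusters plus one noise point
```

Note that no cluster count is supplied, which is the practical difference from the partitioning methods above: density alone decides how many clusters emerge.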
(h) OPTICS
  • Clustering Type: Density
  • Detectable Shapes: Arbitrary shape
  • Outputs a cluster ordering, a linear list of all objects under analysis; does not require the user to provide a specific density threshold like DBSCAN
  • Running time: O(n log n) with a spatial index; O(n^2) worst case
(i) DENCLUE
  • Clustering Type: Density
  • Detectable Shapes: Arbitrary shape
  • Uses kernel density estimation
  • Running time: O(n + h*n), where h is the average hill-climbing time for an object; O(n^2) worst case

Reference: Data Mining: Concepts and Techniques (3rd Edition) by J. Han, M. Kamber, and J. Pei