Showing posts with label machine learning.

Thursday, September 24, 2020

What will be the main AI projects in enterprises for the next few years?

Before I answer what the enterprise projects around AI will be, let's first look at the consumer-facing projects that use AI.

- Shopping recommendation on Amazon.com

- Song recommendation on Spotify

- Movies/Video recommendation on Netflix

- Face recognition on Facebook for tagging

- Image editing using FaceApp or similar

- Talking to Chat applications on company websites  

- Google Assistant/Alexa/Siri conversational system

- Sentiment Analysis for social media customer service

- Self-driving cars

- Route planning on maps

- Facial recognition

- Object recognition

- Video surveillance (intrusion detection and object tracking)

- Audience voice mix in IPL matches in September 2020 (COVID season with no physical audience: https://www.khaleejtimes.com/sports/ipl-2020/ipl-2020-can-do-without-the-plastic-noise)


In a similar manner, I expect that enterprise systems and applications will also start to implement similar projects in specific industry verticals, as follows:

- Fraud detection in financial transactions (e.g., money laundering, credit card fraud)

- Identifying threatening and ransom calls from Telecom call data

- Real-time Retail offers for customers 

- Online Exam proctoring

- Utility (Gas/Power/Water) demand-based generation

- Improved Customer support systems (much better than today's dumb chatbots)

- Proactive detection of medical conditions from health data trackers

- Context-sensitive and time-sensitive advertisements

- Autonomous server elasticity optimizing billing and performance

- Genetic Algorithms in Drug discovery

- Genome analysis: early detection of diseases (like cancer), with warnings and cures still in developing stages


Now let's look into generic platforms that might come up in the next few years:

- AutoML Platform: An AutoML platform can take any given data, prepare it, and run multiple ML algorithms along with additional tuning to yield the best result. This would be an AI beginner's platform and will see a lot of adoption between 2020 and 2023.
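
To make the idea concrete, here is a minimal sketch of what such a platform automates, using scikit-learn; the candidate models, parameter grids, and dataset are illustrative assumptions, and a real AutoML product would also automate data preparation and use smarter search:

```python
# Sketch of the AutoML loop: try several algorithms with hyperparameter
# tuning and keep the best performer. The candidates below are illustrative.
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

X, y = load_breast_cancer(return_X_y=True)  # stand-in for "any given data"

candidates = [
    (LogisticRegression(max_iter=5000), {"C": [0.1, 1.0, 10.0]}),
    (RandomForestClassifier(random_state=0), {"n_estimators": [50, 200]}),
]

best_score, best_model = -1.0, None
for model, grid in candidates:
    search = GridSearchCV(model, grid, cv=5)  # tune each candidate
    search.fit(X, y)
    if search.best_score_ > best_score:
        best_score, best_model = search.best_score_, search.best_estimator_

print(best_model, best_score)  # the "best result" the platform would surface
```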

- RPA Platform: RPA (Robotic Process Automation) is more of a solution than a problem: the automation of tasks with AI. Most of the existing manual systems will be replaced by it.

- NLP Platform: Voice to words (e.g., writing meeting minutes, emails) and words to voice. This would serve specialized enterprise applications, including call-center automation and meeting schedulers (say YES instead of pressing 1).

- Semantic modeling: The majority of data is textual and needs accurate correlation to support useful decisions; this is where semantic modeling helps.

- Prognostics Platform: Predictive analytics, root-cause analysis, and solution recommendation, useful for enterprises to lower costs and improve productivity. This could also power system and application monitoring/management tools.

- Optimization Platform: Optimization is a huge problem everywhere, from supply chain to manufacturing and sales to marketing. Most of these would still be industry-vertical solution platforms for the next 2-3 years.

- Data Quality Platform: Data quality is a huge problem in enterprises: lots of missing and inaccurate data (especially in manufacturing) and too many data sources to search for information.

- Real-time Online ML Platform: An online machine learning platform that learns from real-time, ever-changing data. It's not a single learning algorithm; in fact, lots of algorithms can learn online. An example would be stock market price/volume data, which might never follow the patterns of the past.
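
As a rough sketch of what "learning online" means, here is a minimal example using scikit-learn's SGDClassifier, whose partial_fit() updates the model one batch at a time; the synthetic stream below is a stand-in for a real feed such as market data:

```python
# Online (incremental) learning: update the model batch by batch as data
# streams in, instead of retraining on the full history each time.
import numpy as np
from sklearn.linear_model import SGDClassifier

model = SGDClassifier()
classes = np.array([0, 1])  # all labels must be declared up front for partial_fit

rng = np.random.default_rng(0)
for _ in range(100):                    # stand-in for a never-ending stream
    X_batch = rng.normal(size=(32, 4))  # hypothetical feature batch
    y_batch = (X_batch[:, 0] > 0).astype(int)
    model.partial_fit(X_batch, y_batch, classes=classes)

print(model.predict(rng.normal(size=(1, 4))))
```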

- Digital Twin Platform: A digital twin is a digital representation of a physical object or system. The technology behind digital twins has expanded to cover large systems such as buildings, factories, and cities, and even people and processes.
An example would be simulating a cricket bowler's actions and bowling style.

However, I think it will take longer for such platforms to evolve. On the ground, it is not simple to derive different solutions from a common platform, unlike the standard platforms that exist today (say, a data quality tool that could be used across different verticals). Tons of variations exist, and manually connecting the dots is sometimes impossible because the people who hold the knowledge move on. With many new systems being introduced, continuity is lost and correlations are not clear at all. This is why a software company cannot build generic tools/platforms for some of these real-world cases.



Thursday, February 28, 2019

Machine Learning: Which clustering algorithm should I use?

Clustering is the process of grouping data into classes, or clusters, so that objects within a cluster have high similarity to one another but are very dissimilar to objects in other clusters. There are several clustering methods/algorithms to choose from, and it is often confusing to decide when to use which. Here is a quick tip on how to categorize them and decide which one is best for your data set.

There are mainly 7 types of basic clustering methods. They are:

1) Partitioning
  • Finds mutually exclusive clusters of spherical shape
  • Based on distance
  • Represents each cluster using a mean or medoid
  • Good for small to medium-sized clusters
  • Methods:
    • k-means
    • k-medoids
    • CLARA (Clustering LARge Applications)
2) Hierarchical
  • Hierarchical decomposition method
  • Based on agglomerative or divisive approaches
  • Cannot correct erroneous merges or splits
  • Methods:
    • BIRCH (Balanced Iterative Reducing and Clustering using Hierarchies)
    • ROCK (RObust Clustering using linKs)
    • CHAMELEON (multiphase hierarchical clustering using dynamic modeling)
    • AGNES (AGglomerative NESting)
    • DIANA (DIvisive ANAlysis)
3) Density-based methods
  • Good for arbitrary shapes
  • Based on density or neighborhood concepts
  • Possible to filter outliers
  • Methods:
    • DBSCAN (Density-Based Spatial Clustering of Applications with Noise)
    • OPTICS (Ordering Points To Identify the Clustering Structure)
    • DENCLUE (DENsity-based CLUstEring)
4) Grid based methods
  • Uses multi-resolution grid data structure
  • Fast in processing time irrespective of the number of data objects
5) Model based methods
  • Hypothesizes a model for each of the clusters and finds the best fit of the data to the given model
  • Takes "noise" or outliers into account, contributing to the robustness of the approach
6) High Dimensional data methods
  • As dimensionality increases, the data usually become increasingly sparse because the data points are likely located in different dimensional subspaces
  • Frequent pattern-based clustering is another clustering methodology, which extracts distinct frequent patterns among subsets of dimensions that occur frequently.
  • Methods: CLIQUE, PROCLUS, frequent pattern-based clustering, pCluster
7) Constraint based methods
  • User-specified or application-oriented constraints. A constraint can express a user's expectation or describe "properties" of the desired clustering results, and provides an effective means of communicating with the clustering process.
  • Constraint-based methods are used in spatial clustering for clustering with obstacle objects (e.g., considering obstacles such as rivers and highways when planning the placement of automated banking machines) and user-constrained cluster analysis (e.g., considering specific constraints regarding customer groups when determining the best location for a new service station, such as "must serve at least 100 high-value customers").
  • Semi-supervised clustering employs, for example, pairwise constraints (such as pairs of instances labeled as belonging to the same or different clusters) in order to improve the quality of the resulting clustering.
Based on the above types, there are a lot of commonly used clustering methods. Here are some of them (a short code sketch comparing two of them follows the list):

(a) K-means: Centroid based
  • Clustering Type: Partitioning
  • Detectable Shapes: Spherical-shaped clusters
  • Input: the number of clusters
  • Sensitive to noise and outliers
  • Works well with small data sets only
  • Running time: O(kni), where k is the number of clusters, n is the number of objects, and i is the number of iterations

(b) K-medoids: Representative object based
  • Clustering Type: Partitioning
  • Detectable Shapes: Spherical-shaped clusters
  • Input: the number of clusters
  • Small data sets (not scalable)
  • Running time: O(k(n-k)^2)
(c) CLARA (Clustering LARge Applications)
  • Clustering Type: Partitioning
  • Detectable Shapes: Spherical-shaped clusters
  • Input: the number of clusters
  • Effective, but sensitive to the selection of initial samples
  • Running time: O(ks^2 + k(n-k)), where k is the number of clusters and s is the size of each sample
  • Another version called CLARANS (Clustering LArge Applications based upon RANdomized Search) uses a randomized algorithm

(d) BIRCH (Balanced Iterative Reducing and Clustering using Hierarchies)
  • Clustering Type: Hierarchical
  • Detectable Shapes: Spherical-shaped clusters
  • Input: N d-dimensional data points (large N)
  • Uses a Clustering Feature (CF) tree; because a CF tree can hold only a limited number of entries due to its size, it does not always correspond to what a user may consider a natural cluster
  • Running time: O(n), where n is the number of objects to be clustered

(e) ROCK (RObust Clustering using linKs)
  • Clustering Type: Hierarchical
  • Detectable Shapes: Arbitrary shape
  • Input: N d-dimensional categorical data points
  • Designed for categorical data; emphasizes interconnectivity while ignoring closeness between clusters
  • Running time: O(n * Mm * Ma), where Ma and Mm are the average and maximum numbers of neighbors for a point; O(Ma * n^2) in the worst case
(f) CHAMELEON
  • Clustering Type: Hierarchical
  • Detectable Shapes: Arbitrary shape
  • Input: N d-dimensional categorical points
  • Running time: O(n^2) in the worst case

(g) DBSCAN (Density-Based Spatial Clustering of Applications with Noise)
  • Clustering Type: Density
  • Detectable Shapes: Arbitrary shape
  • Input: the maximum distance for a point to be considered density-reachable (eps) and the minimum number of points in a cluster (MinPts)
  • Running time: O(n log n) with a spatial index; O(n^2) in the worst case
(h) OPTICS (Ordering Points To Identify the Clustering Structure)
  • Clustering Type: Density
  • Detectable Shapes: Arbitrary shape
  • Outputs a cluster ordering: a linear list of all objects under analysis. Unlike DBSCAN, it does not require the user to provide a specific density threshold
  • Running time: O(n log n) with a spatial index; O(n^2) in the worst case
(i) DENCLUE (DENsity-based CLUstEring)
  • Clustering Type: Density
  • Detectable Shapes: Arbitrary shape
  • Uses kernel density estimation
  • Running time: O(n + h*n), where h is the average hill-climbing time for an object; O(n^2) in the worst case
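
As promised above, here is a quick sketch contrasting two of these methods on scikit-learn's two-moons toy data; the eps and min_samples values are illustrative, not tuned:

```python
# k-means vs. DBSCAN on non-spherical data: k-means assumes spherical
# clusters and splits the interleaved half-moons incorrectly, while
# DBSCAN recovers the arbitrary shapes from density alone.
from sklearn.cluster import DBSCAN, KMeans
from sklearn.datasets import make_moons

X, _ = make_moons(n_samples=300, noise=0.05, random_state=42)

km_labels = KMeans(n_clusters=2, n_init=10, random_state=42).fit_predict(X)
db_labels = DBSCAN(eps=0.3, min_samples=5).fit_predict(X)  # eps: max density-reachable distance

print("k-means clusters:", sorted(set(km_labels)))
print("DBSCAN clusters: ", sorted(set(db_labels)))  # -1, if present, marks noise/outliers
```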

Reference: J. Han, M. Kamber, and J. Pei, Data Mining: Concepts and Techniques, 3rd Edition.

Thursday, September 14, 2017

New Age of Data Mining with Machine Learning, Deep Learning, Natural Language Processing, Artificial Intelligence and Robotics technologies

We are currently in the middle of a perfect storm: an incredible time for transformative technology and an evolutionary step for humanity. In the last two decades, computing power has become much cheaper and data has become widely available for computing to utilize. It is the ability of machines to think in a cognitive way, similar to how the human mind thinks, that will make the difference. Machines can identify, gather, store, correlate, think, and predict based on the historical data and patterns available to them. They are far more efficient than humans at storing and computing all the possibilities, including all the permutations and combinations.

We have started getting recommendations for products and services related to what we have bought or searched for earlier on the internet (e.g., Amazon). We can talk to our personal digital assistants and get valuable work done through them (e.g., Apple Siri, Amazon Alexa, and Google Assistant). Our personal email can correlate events and derive actions from them (e.g., an email in Gmail with a flight schedule, and Google Maps automatically mapping that location).

This is the new age of Data Mining. Data Mining as a branch of computer science has been in existence for many years; it was the next logical step after processing and storing data in data warehouses. Data mining is defined as the process of discovering hidden, valuable knowledge by analyzing large amounts of data stored in databases or data warehouses, using techniques from machine learning, artificial intelligence (AI), and statistics. Knowledge Discovery in Databases, or KDD for short, defines the broad process of finding knowledge in data and emphasizes the "high-level" application of particular data mining methods.
 
Fig: KDD Process
The various components of data mining algorithms are as follows:
1) Model Representation: Determine the nature and structure of the representation to be used
2) Score Function: Measure how well different representations fit the data
3) Search/Optimization Method: An algorithm to optimize the score function
4) Data Management: Decide what principles of data management are required to implement the algorithm efficiently

The data mining process included classification, clustering, and regression as its fundamental steps. A typical data mining algorithm used to look like this:

Fig: Data Mining Algorithm
Here are some of the regression models that used to be represented with various algorithmic approaches (figures omitted).
Different sets of data used to exist during data mining model definition:
a) a set of data called "training data" used to train the model,
b) a set of data called "validation data" used to calculate an estimate of the squared error score, and
c) a set of data called "test data" used to calculate an unbiased estimate of the squared error score of the selected model.
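
A small sketch of that three-way split with scikit-learn; the dataset and split ratios are illustrative:

```python
# Fit on training data, use validation data to estimate the squared error
# score for model selection, and keep test data for an unbiased estimate.
from sklearn.datasets import load_diabetes
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

X, y = load_diabetes(return_X_y=True)
X_train, X_rest, y_train, y_rest = train_test_split(X, y, test_size=0.4, random_state=0)
X_val, X_test, y_val, y_test = train_test_split(X_rest, y_rest, test_size=0.5, random_state=0)

model = LinearRegression().fit(X_train, y_train)              # (a) training data
val_mse = mean_squared_error(y_val, model.predict(X_val))     # (b) validation data
test_mse = mean_squared_error(y_test, model.predict(X_test))  # (c) test data
print(f"validation MSE = {val_mse:.1f}, test MSE = {test_mse:.1f}")
```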

I happened to hear a lot of new jargon such as Data Science, Machine Learning, and Deep Learning used by many amateurs, and even experts, in the IT field without understanding what the terms mean. So I decided to take a course on the subject from the experts, including some hands-on labs. Once I took the course, I realized that these terms are essentially the same old data mining, statistics, and neural networks: the typical old wine in a new bottle, like "Big Data" and "Infocomm" before them.

There are a lot of other entrepreneurs building technology businesses around Artificial Intelligence and Robotics in various business domains, including Biology. For example: www.claralabs.com, www.vicarious.com, www.zymergen