Tuesday, August 04, 2015

Simplify Data Wrangling Problem

I happened to read a newspaper article on the data wrangling problem and how difficult it is to prepare the right data for the right analytics or reporting. Many companies are trying to solve this issue by providing glorified spreadsheets to simplify the data wrangling work. Companies such as Trifacta, Paxata, Informatica Rev, and Tamr have been building such glorified spreadsheets in the browser to simplify the life of data scientists. However, I am not sure that staring at these glorified spreadsheets and doing data munging work should be the end goal for a data analyst.

I might have a different opinion. I think it is best to optimize the data at the source: filter it there with the right filter scheme, then categorize it into the right categories, and then merge it automatically and intelligently. So in summary, I think the classification and categorization of data should happen in separate steps, rather than dumping all the data into a big data lake or pushing it to a cloud-based glorified spreadsheet.
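The filter-at-source, then categorize, then merge idea can be sketched in a few lines. This is a pure-Python illustration with made-up record shapes and function names, not any product's API:

```python
# Illustrative records only -- real sources would be CRM/ERP/log feeds.
RAW_RECORDS = [
    {"source": "crm", "type": "customer", "value": "alice"},
    {"source": "crm", "type": "customer", "value": "bob"},
    {"source": "logs", "type": "noise", "value": "heartbeat"},
    {"source": "erp", "type": "order", "value": "order-17"},
]

def filter_at_source(records, wanted_types):
    """Drop irrelevant records before they ever reach the data lake."""
    return [r for r in records if r["type"] in wanted_types]

def categorize(records):
    """Bucket records by type so downstream merges are targeted."""
    buckets = {}
    for r in records:
        buckets.setdefault(r["type"], []).append(r["value"])
    return buckets

filtered = filter_at_source(RAW_RECORDS, {"customer", "order"})
categories = categorize(filtered)
# categories == {"customer": ["alice", "bob"], "order": ["order-17"]}
```

The point of the sketch is ordering: the noise never travels downstream, and the merge step only ever sees pre-categorized buckets.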

Ref:

1. http://www.nytimes.com/2014/08/18/technology/for-big-data-scientists-hurdle-to-insights-is-janitor-work.html


Saturday, February 22, 2014

Past, Current and Future of Hadoop

Hadoop arrived as a big data revolution. People interpreted Hadoop as Big Data itself: if someone had to work on Big Data, the answer was only Hadoop.

What is Hadoop?
Hadoop aims to be open-source software for reliable, scalable, distributed computing. The Apache Hadoop software library is a framework that allows for the distributed processing of large data sets across clusters of computers using simple programming models. It is designed to scale up from single servers to thousands of machines, each offering local computation and storage. Rather than relying on hardware to deliver high availability, the library itself is designed to detect and handle failures at the application layer, thereby delivering a highly available service on top of a cluster of computers, each of which may be prone to failure. So in summary, Hadoop is a low-cost server farm with cheap servers and cheap software, ready for high scalability.
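The "simple programming models" phrase refers to MapReduce. A single-process Python sketch of the model is below; it is an illustration of the map and reduce phases only, not actual Hadoop, which would shard the map work across the cluster and shuffle the pairs between nodes:

```python
from collections import defaultdict

def map_phase(lines):
    """Map: emit (word, 1) pairs from each input line."""
    for line in lines:
        for word in line.split():
            yield (word.lower(), 1)

def reduce_phase(pairs):
    """Reduce: sum the counts for each word key."""
    totals = defaultdict(int)
    for word, count in pairs:
        totals[word] += count
    return dict(totals)

lines = ["Hadoop scales out", "Hadoop handles failures"]
counts = reduce_phase(map_phase(lines))
# counts["hadoop"] == 2
```

What makes the model attractive is that the programmer writes only these two pure functions; the framework handles distribution, retries, and failure recovery.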

Hadoop is distributed by many companies such as Amazon Web Services, Cloudera, CloudSpace, Datameer, Datamine, Datasalt, Datastax, Debian, Greenplum, Hortonworks, Hstream, IBM, Impetus, Intel, MapR, etc.

Is Hadoop technology completely new?
I do not think so. Several proprietary enterprise applications already existed to solve exactly this kind of reliable, scalable, distributed computing. For web processing, it used to be called load balancing: servers running scale-out technology ever since the internet reached scale. Similarly, in enterprise applications such as SAP, it used to be called the Application Server and the Master Gateway Server, which spanned out the processing capability. Many software companies built solutions to run on the GRID.

History of Linux
I have been in the IT industry since the mid 1990's. This reminded me of an old wave that came with Linus Torvalds back in the mid 90's, billed as the open source paradigm shift that would compete with the Microsoft monopoly. Tens of companies jumped on the Linux bandwagon, including Debian, Fedora, Suse, Gentoo, Slackware, etc. Some of them carried commercial names such as Redhat, Suse, Mandrake, Ubuntu, Oracle, etc. So who benefited from this? It was the hardware companies such as IBM, HP, Dell, Intel, AMD, etc., who could sell more chips and new hardware to customers who did not like the monopoly of Microsoft. Companies such as IBM, HP, and Oracle (earlier Sun), even though they had their own operating systems (AIX, HP-UX, and Solaris respectively), started supporting Linux to expand their business. So let's see who made money with Linux and who lost money with it. The hardware companies and the Linux distribution companies made money, whereas the enterprises and software companies lost money: their existing software solutions, already built on AIX, HP-UX, Solaris, and Windows, had to be ported to yet another 32- and 64-bit Linux OS. The cost of porting and certification was very high, since each of these distributions (RHEL 5.1, 5.2, 5.3, and similarly Suse 9.1, 9.2, Mandrake xx..yy) had to be tested and certified. So many customers who picked one of the less common Linux distributions could not get such certified enterprise software.
 
Relationship between Linux and Hadoop
I find a close relationship between Linux and Hadoop. I believe that Hadoop is the new Linux of the 2010's, ready to break away the proprietary GRID software solutions by offering the same capability as free software. After 15 years of Linux history, only a few companies are now considered successful in the enterprise OS space: Redhat and Suse.


Future of Hadoop
So what is the future of Hadoop? It will be the same future that befell Linux: a few of the distribution companies will make money, and the majority will die. As I see it, one of the Hadoop distribution (distro) companies is already following the Linux path: Hortonworks. The other strong contenders are Cloudera and MapR, in addition to the direct software vendors such as Greenplum and IBM.





References:
1) History of Linux http://en.wikipedia.org/wiki/History_of_Linux
2) Linux Distributions http://en.wikipedia.org/wiki/Linux_distribution
3) Hadoop Distributions http://wiki.apache.org/hadoop/Distributions%20and%20Commercial%20Support
4) Hortonworks partnership with SAP
5) Hortonworks partnership with Redhat and SAP
6) Enterprise Hadoop Market in 2013: Reflections and Directions http://hortonworks.com/blog/enterprise-hadoop-market-in-2013-reflections-and-directions/
7) How to get over your inaction on big data http://blogs.hbr.org/2014/02/how-to-get-over-your-inaction-on-big-data-2/
8) The decline and fall of big data http://www.infoworld.com/article/2845926/big-data/the-decline-and-fall-of-big-data.html 

Monday, February 03, 2014

Why is Social Media Data integration important for enterprises?

I happened to read a couple of articles last week that talk about social media data analytics for the banking industry. According to the research,
  • Only 46% of banks can analyze external data about customers.
  • Only 32% can analyze social media activity.
  • Data volume and analytics complexity are the most common challenges cited by respondents.
  • Cloud and predictive analytics technologies will be "extremely valuable" to around 60% of respondents' strategies in the next 24 months (Jan 2014 - Dec 2015).
It is very interesting to find that 3 of the private banks in India are in the top 10 list.

Ranking for Top 20 banks

#    Bank            Area   Facebook 'Likes'   Twitter Followers   All-Time YouTube Views   Power 100 Score (Q4 2013)
1    BofA            USA    1,481,401          245,207             14,439,707               2,568
2    Chase           USA    3,754,905          22,276              123,247                  2,564
3    Capital One     USA    2,917,273          84,721              2,647,251                2,372
4    ICICI Bank      IN     2,608,044          11,257              1,163,370                1,905
5    Wells Fargo     USA    604,795            69,268              17,828,813               1,836
6    Citi            USA    927,904            219,615             6,002,953                1,577
7    HDFC            IN     1,920,504          11,622              409,799                  1,432
8    Axis            IN     1,926,401          9,724               1,196,610                1,421
9    GT Bank         NIG    1,527,777          86,229              293,565                  1,252
10   E*TRADE Bank    USA    73,421             11,942              8,882,374                1,206
11   USAA            USA    611,867            58,953              7,331,717                1,175
12   Credit Suisse   CH     63,026             37,976              10,622,427               897
13   Scotiabank      CAN    318,223            29,431              7,502,312                795
14   Barclays        UK     479,668            21,541              4,032,704                706
15   Commonwealth    AUS    523,860            28,985              2,542,623                688
16   IDBI            IN     795,247            9,912               167,598                  688
17   FNB             SA     446,063            34,154              1,358,596                652
18   Natwest         UK     143,594            29,401              3,583,634                506
19   HSBC            UK     153,704            7,288               3,143,298                477
20   Kotak Mahindra  IN     150,206            87,549              2,398,734                467

According to one of the latest publications from McKinsey (Jan 2014), COOs should lead social-media-based customer service. According to one survey, nearly 71% of consumers who had a good social-media service experience with a brand are likely to recommend it to others. The IT investments are comparatively much lower than for other complex support systems, because such a social media based support infrastructure reuses the existing social media channels.


This would mean that social media data integration is moving into the mainstream of use cases for any enterprise with direct end customers. Making it mainstream means that such integrations need to be automated away from manual, BPO-based operations that require a huge workforce. One could use tools like Informatica's PowerExchange for Social Media to automate the ingestion of such social media data into the enterprise's standard data integration point.
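The core of such an ingestion step is normalizing a semi-structured social feed into rows for a relational staging table. Here is a minimal sketch; the payload shape is invented for illustration (real social APIs differ, and a tool like PowerExchange has its own configuration model, not shown here):

```python
import json

# Hypothetical feed payload -- the field names are assumptions, not any
# vendor's actual API response.
raw_feed = json.dumps([
    {"user": "@acme_fan", "text": "Great support!", "likes": 12},
    {"user": "@critic", "text": "Long wait times", "likes": 3},
])

def normalize(feed_json):
    """Flatten a social feed into (user, text, likes) staging rows."""
    rows = []
    for post in json.loads(feed_json):
        rows.append((post["user"], post["text"], post.get("likes", 0)))
    return rows

staging_rows = normalize(raw_feed)
# staging_rows[0] == ("@acme_fan", "Great support!", 12)
```

Once the feed is flattened into fixed-width rows like these, the rest of the pipeline is ordinary data integration into the warehouse or CRM.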



References:
1) http://thefinancialbrand.com/36081/power-100-2013-q4-bank-rankings/
2) http://www.informationweek.com/big-data/big-data-analytics/data-management-key-to-banking-analytics-/d/d-id/1113646?
3) https://community.informatica.com/solutions/powerexchange_for_twitter
4) https://community.informatica.com/solutions/extract_data_from_linkedin_facebook_and_twitter
5) http://www.youtube.com/watch?v=aGU6K0wgGSk&list=PLmi6HWWEAjKqq068jimUILXqr_C8fyQiW&feature=share
6) https://www.youtube.com/watch?v=Wng1M8sEYpw&
7) http://scn.sap.com/community/business-trends/blog/2014/01/06/social-media-is-now-everyone-s-business?source=email-apj-sapflash-newsletter-20140127&lf1=821703689d274213763641e16240131 
8) http://www.mckinsey.com/insights/marketing_sales/Why_the_COO_should_lead_social_media_customer_service?cid=other-eml-alt-mkq-mck-oth-1401
9) http://marketingland.com/5-social-media-trends-to-kill-in-2014-69190  
10) http://thefinancialbrand.com/35160/banking-consumer-social-media-sentiment-report-2013-q4/

Tuesday, October 15, 2013

How can you secure and backup your data hosted on Cloud SaaS services ?

Some of my memories started kicking in when I happened to read an article on the internet "What To Do When Your Cloud Service Fades Away"  by @sergeykandaurov


I have been using cloud service offerings for more than a decade, since around 1998-2000. I remember using some of the so-called dotcom SaaS companies that offered on-premises tool services online back then (now called cloud services). Some of them were box.com, usa.net, and freeservers.com. I used to have accounts on all of these free SaaS offerings, when storage capacities were 1-10 MB and the standard data transfer mechanism was still the 1.44 MB floppy disk. These websites offered free storage, and it was a mechanism for backing up or storing data even over a 64 kbps dialup line in those days.

Most of these Y2K-era companies died during 2002-2005, and probably the only surviving one is freeservers.com. These companies were good enough to notify their users and give them adequate time to take a backup of their data from the website before it was shut down. So I too made sure to take a backup copy of my old data that was on the cloud.

Currently, there are many companies that offer such services to a typical internet user, such as Microsoft's SkyDrive, Google Drive, Dropbox.com, Soundcloud.com, etc. As an individual, I have an account on all of these cloud services. Similarly, many corporates and enterprises use cloud SaaS services such as Outlook.com (the Microsoft Exchange server on the cloud), Salesforce.com, workday.com, netsuite.com, ... and the list goes on.


Have you thought about how you would take a backup, or migrate, when one of the cloud SaaS companies decides to move on by closing its business, or when you feel insecure holding your data with them? This is one of the places where companies like Informatica can provide tools to integrate between the cloud systems and the on-premises ones. To read more, visit www.informatica.com and www.informaticacloud.com.
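The essence of any such backup is walking the provider's file listing and mirroring it to local storage. A minimal sketch follows; the in-memory listing is a stand-in, since every vendor's export or REST API is different:

```python
import pathlib

# Stand-in for a remote listing; a real backup would page through the
# SaaS vendor's export/REST API instead of a hard-coded dict.
remote_files = {
    "notes/todo.txt": b"ship backups",
    "photos/cat.jpg": b"\x89binary...",
}

def backup(files, dest_dir):
    """Mirror {remote_path: content} pairs into dest_dir on local disk."""
    dest = pathlib.Path(dest_dir)
    for name, content in files.items():
        target = dest / name
        target.parent.mkdir(parents=True, exist_ok=True)
        target.write_bytes(content)
    # Return what was written, as a simple verification manifest.
    return sorted(p.relative_to(dest).as_posix()
                  for p in dest.rglob("*") if p.is_file())
```

Returning a manifest of what actually landed on disk is the cheap insurance step: compare it against the remote listing before trusting the backup.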




Friday, September 27, 2013

Databases in the 21st Century: Can the CIO dictate for a single database within an enterprise?

Can the CIO dictate and normalize only one single Database vendor within an organization?

Historically, a database was the standard method for storing data in row format in a predefined manner. Many enterprises would have been associated with IBM DB2/UDB, Sybase, Informix, Oracle, or Microsoft SQL Server for their database requirements, along with the other software they needed. In those days, the requirements for databases were very simple: the basic expectation was to store the data and back a 2-tier or 3-tier application, either for recording transactions or for reporting.

During our college days, most of us would have learned the lineage of DBMS, RDBMS, and OODBMS, plus the databases for really large volumes of data called data warehouses. We now also have other DBMSs such as document and NoSQL databases. Very recently, some of us would have heard about cloud databases, columnar databases, device databases, and in-memory databases. So where does that leave us? Can we still depend on only one database vendor for the enterprise, say IBM or Microsoft or SAP (after its merger with Sybase) or Oracle?

The high level classification of NoSQL DB are as follows:

Data Model           Performance  Scalability      Flexibility  Complexity  Functionality
Relational Database  variable     variable         low          moderate    relational algebra
Key-value Stores     high         high             high         none        variable (none)
Graph Database       variable     variable         high         high        graph theory
Document Store       high         variable (high)  high         low         variable (low)
Column Store         high         high             moderate     low         minimal

For the list of all the various databases currently available on various hardware and systems from various vendors, you can visit http://en.wikipedia.org/wiki/NoSQL


The answer is a simple 'NO'. The reason is that IT systems have expanded in the 21st century and created many more use-case scenarios for databases, data warehouses, and data storage optimization.

Reference
http://www.computerworld.com/s/article/9246543/IBM_buys_NoSQL_cloud_provider_Cloudant

Data connectivity to third party applications

I often get this query from my sales team: "How can I connect to this xyz application?" This falls into the long tail of the connectivity problem. There are multiple ways of connecting to third-party applications, which may fall into any of the following categories:


  1. Standard 2-tier or 3-tier application: The majority of custom business applications deployed at an enterprise customer site have a database behind the application. Typically, connecting directly to the database using either our native or ODBC drivers is the easiest data integration point for a custom application that does not expose any other standard application interface.
  2. Cloud hosted application: Many of the new cloud-based application vendors provide a standard Web-Services/REST interface for connecting to the application.
  3. On-premise or cloud application exposing CLIs to integrate: Use the standard CLI functions exposed by the application and write a custom program to integrate with the flat files generated as the output of the CLI.
  4. No DB connection, no Web Services or CLI interface, only a programming API: If none of the above is possible, build an exclusive connection to the application using its programming API (C, C++, Java).
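The decision order above can be expressed as a tiny dispatcher. This is an illustrative sketch only; the capability flags are invented, not any product's metadata:

```python
def pick_connector(app):
    """Mirror connectivity categories 1-4: prefer the cheapest
    integration point the application actually exposes."""
    if app.get("database"):            # category 1
        return "native/ODBC driver to the backing database"
    if app.get("rest_api"):            # category 2
        return "Web-Services/REST interface"
    if app.get("cli"):                 # category 3
        return "CLI export to flat files"
    return "custom program against the vendor's C/C++/Java API"  # category 4

choice = pick_connector({"database": "oracle"})
# choice starts with "native/ODBC"
```

The ordering encodes the cost argument in the text: a direct database connection is usually the cheapest integration, and a bespoke API program the most expensive.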
In addition to the base connectivity decision areas mentioned above, we would need to look at the customer's data integration use cases in each of the following areas before defining the appropriate connectivity solution:
      a) Volume of data & scalability (partitioning, bulk interfaces, CDC, etc.)
      b) Velocity (performance: bulk or real time)
      c) Security (authentication, authorizations, data staging concerns, etc.)
      d) Variety (mapping application data types to your data types, or transforming specially encoded data such as JSON or EDIFACT)
      e) Validity (history snapshots or real time)

Thursday, June 27, 2013

Marriage of Onpremises and Cloud Software - Problems and concerns


Social media or cloud adapters break too often
Immature cloud vendors change an API, interface, or authentication scheme very quickly, with or without informing the data integrators, or the DI vendors are not ready to make the changes in that time frame.
e.g. Microsoft Dynamics CRM Live authentication, Facebook API changes, Twitter API v1 obsolescence.
Delays in delivering fixes and pains in implementing fixes
A change in the cloud application's back-end system, API, or metadata requires a patch to be shipped and installed quickly to keep data integration running. Usually, the overhead of shipping such on-premise installers is measured in days instead of hours, and the customer base could be thousands.
Older versions of adapters do not support newer cloud features
Older versions of cloud adapters were built and certified some time back. Typically R&D does not have the bandwidth to re-certify the old adapter versions every time the cloud vendor makes a change.
Reduced synergy between the cloud and on-premise product teams
Typically the in-house cloud product teams do not use the latest on-premise software versions, or vice versa (in case the cloud software is a flavor of the on-premise deployment).

Tuesday, March 05, 2013

Big Data Integration: Where are we heading to?

I happened to hear that Splunk.com was ranked the 4th most innovative company on planet earth in 2013 by an agency a couple of weeks back. So what do they do? They provide a software platform for real-time operational intelligence. In simple terms, they are the enterprise's Google-equivalent search tool for machine-generated data such as log files and alerts from operational monitoring tools. At first glance, what is so big about that? The key is in the words "big data": the volumes are really big, and someone is trying to find a needle in the haystack.


What does Big Data mean to you? Here is an interpretation: Big Data is data so big that you cannot handle it in your standard databases or data warehouses, where the majority of the data does not make sense for you directly, but there is a possibility that you could find something interesting in it if you run the right analytic rules on it. That said, we have been assuming that the data comes in a structured format, just as Splunk handles it. What about unstructured data, or format/schema-less data? How would your machine read and interpret such data?

Immediately you think about using some fuzzy neural logic that could do natural language processing (NLP) across various languages. Is that it? There is another dimension between NLP and structured data: data that comes with its own metadata, such as an XML without a schema, a JSON without a schema, or a NoSQL database entry where every entry has its own structure. This is the next immediate problem to solve: the structure arrives in a free-flowing format that carries its own metadata or descriptors along with it. We need to solve this format before moving on to NLP.
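That "carries its own metadata" property means the structure of a schema-less document can be recovered from the document itself. A small sketch of inferring a structural descriptor from a schema-less JSON value (illustrative only, nothing production-grade):

```python
import json

def infer_schema(value):
    """Recursively describe the shape of a schema-less JSON value,
    replacing leaf values with their type names."""
    if isinstance(value, dict):
        return {k: infer_schema(v) for k, v in value.items()}
    if isinstance(value, list):
        # Assume homogeneous lists; describe the first element's shape.
        return [infer_schema(value[0])] if value else []
    return type(value).__name__

doc = json.loads('{"id": 7, "tags": ["a", "b"], "geo": {"lat": 1.5}}')
schema = infer_schema(doc)
# schema == {"id": "int", "tags": ["str"], "geo": {"lat": "float"}}
```

This is exactly what makes self-describing data easier than free text: a machine can derive a working schema per record, whereas NLP has no such descriptors to lean on.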

-still in the works...keep watching

Tuesday, February 05, 2013

Social Media Trends in 2013

These are the Social Media trends in 2013:
  • Big data will get social: big data will augment transactional records with customer behavior, their social graph behavior on the web, and location and device/mobile-generated information.
  • Social CRM: social data will be added to CRM and marketing tools to find trends in the sentiment, behavior, and individual preferences of customers.
  • Social media integration with other marketing: people spend increasing amounts of time on social networking sites, and marketers want to tap into the social media distribution channel.
  • Social media monitoring tools: marketing ROI will be counted by measuring results; monitoring tools combined with business metrics will lead to a better understanding of the value of social interaction.
  • Social media budgets will be much bigger: by finding new ways to interact with and engage consumers, spending is expected to double by the end of 2016.