Tuesday, August 04, 2015

Simplifying the Data Wrangling Problem

I happened to read a newspaper article on the data wrangling problem and how difficult it is to prepare the right data for the right analytics or reporting. Many companies are trying to solve this issue by providing glorified spreadsheets that simplify the data wrangling work. Companies such as Trifacta, Paxata, Informatica Rev, and Tamr have been building such glorified spreadsheets in the browser to simplify the life of data scientists. However, I am not sure that looking at these glorified spreadsheets and doing data munging work should be the end goal for a data analyst.

I might have a different opinion. I think it is best to optimize the data at the source: filter it there with the right filtering scheme, then classify it into the right categories, and then merge it automatically and intelligently. In summary, I think the classification and categorization of data should happen as distinct steps along the way, rather than just dumping all the data into a big data lake or pushing it to a cloud-based glorified spreadsheet.
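To make the idea a bit more concrete, here is a minimal sketch of the three steps (filter at the source, categorize, merge) in plain Python. Everything in it is an assumption made up for illustration: the record fields, the `KNOWN_KINDS` set, and the function names are hypothetical and do not come from any of the products mentioned above.

```python
from collections import defaultdict

# Hypothetical raw records as they might arrive from a source system.
RAW_RECORDS = [
    {"id": 1, "kind": "order", "customer": "acme", "amount": 120.0},
    {"id": 2, "kind": "order", "customer": "acme", "amount": None},
    {"id": 3, "kind": "refund", "customer": "acme", "amount": 20.0},
    {"id": 4, "kind": "order", "customer": "globex", "amount": 75.5},
    {"id": 5, "kind": "debug", "customer": "globex", "amount": 0.0},
]

# Assumed list of categories that are worth keeping for analysis.
KNOWN_KINDS = {"order", "refund"}


def filter_at_source(records):
    """Step 1: drop incomplete records and kinds nobody will analyze."""
    return [
        r for r in records
        if r["amount"] is not None and r["kind"] in KNOWN_KINDS
    ]


def categorize(records):
    """Step 2: group the surviving records by their 'kind' field."""
    groups = defaultdict(list)
    for r in records:
        groups[r["kind"]].append(r)
    return dict(groups)


def merge_by_customer(groups):
    """Step 3: merge the categories into a net total per customer."""
    totals = defaultdict(float)
    for r in groups.get("order", []):
        totals[r["customer"]] += r["amount"]
    for r in groups.get("refund", []):
        totals[r["customer"]] -= r["amount"]
    return dict(totals)


def run_pipeline(records):
    """Chain the three steps so only clean, merged data moves downstream."""
    return merge_by_customer(categorize(filter_at_source(records)))


if __name__ == "__main__":
    print(run_pipeline(RAW_RECORDS))
```

The point of the sketch is only that each step is small and separate, so by the time data reaches an analyst it has already been filtered and categorized instead of being cleaned by hand in a spreadsheet.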

Ref:

1. http://www.nytimes.com/2014/08/18/technology/for-big-data-scientists-hurdle-to-insights-is-janitor-work.html