Data quality, completeness, and accuracy are critical factors for any analytical system. Analytics run over erroneous data can lead to flawed business decisions, which may result in monetary loss to the organisation. Data quality is therefore a critical factor, and organisations must be able to trust the data residing in their data lake. Data scientists typically spend around 60% of their effort cleaning data before they can build effective data science models on it.
The Jumbune Data Analysis framework gives users much-needed visibility into the quality and profiles of the data present on the Hadoop Distributed File System. Users can assess the quality of a dataset over a period of time for consistency and logic, and can profile it to quickly categorise the values it contains. The user does not need to write any code, and the entire functionality is carried out without any data movement.
Data Validation and Quality Timelines
Gain deep insight into the quality of the data present in your data lake. HDFS data validation is a generic data validation framework that checks the data for anomalies and reports them with details such as line number and file name. It validates the data on the DFS against a custom-defined set of business rules. The rules can take the form of null checks, data-type checks, and/or regular expressions describing the expected shape of the data. Data validation tasks can be scheduled, and Data Quality Timelines can then be used to infer the health of the data over a period of time.
The framework supports multiple data formats, including plain text, JSON, and XML documents.
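To make the rule-based validation concrete, the sketch below shows how null checks, data-type checks, and regular-expression rules can be applied to delimited records, reporting each anomaly with its line number. This is an illustrative example only; the rule names, rule structure, and field names are assumptions for the sketch, not Jumbune's actual rule configuration format.

```python
import re

# Hypothetical rule set: field name -> validation rules.
# (Assumed structure for illustration; not Jumbune's configuration schema.)
RULES = {
    "user_id": {"null_check": True, "data_type": int},
    "email":   {"null_check": True, "regex": r"^[\w.+-]+@[\w-]+\.[\w.]+$"},
}

def validate_line(line_no, line, delimiter=","):
    """Check one delimited record against RULES; return a list of anomalies
    as (line_number, field_name, reason) tuples."""
    anomalies = []
    fields = dict(zip(RULES.keys(), line.rstrip("\n").split(delimiter)))
    for field, rules in RULES.items():
        value = fields.get(field, "")
        if rules.get("null_check") and value == "":
            anomalies.append((line_no, field, "null value"))
            continue  # no point applying further checks to an empty value
        if rules.get("data_type") is int and not value.lstrip("-").isdigit():
            anomalies.append((line_no, field, "not an integer"))
        pattern = rules.get("regex")
        if pattern and not re.match(pattern, value):
            anomalies.append((line_no, field, "regex mismatch"))
    return anomalies

# Sample records: the second has a null user_id, the third fails two rules.
records = ["101,alice@example.com", ",bob@example.com", "abc,not-an-email"]
report = [a for i, rec in enumerate(records, 1) for a in validate_line(i, rec)]
```

Reporting the line number alongside the failed rule, as above, is what lets anomalies be traced back to their exact position in the source file.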
HDFS data can also be profiled, with or without a set of rules, to obtain quick insights into the ingested data using the Data Profiling tool. The resulting profiles can be used later to get a high-level view of the data values.
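The kind of high-level view a profile provides can be sketched as a simple per-field summary: total count, null percentage, number of distinct values, and the most frequent value. The field names and metrics below are assumptions for the example, not the actual output format of Jumbune's Data Profiling tool.

```python
from collections import Counter

def profile(records, field_names, delimiter=","):
    """Build a simple per-field profile over delimited records:
    total count, null percentage, distinct values, most common value."""
    stats = {name: {"total": 0, "nulls": 0, "values": Counter()}
             for name in field_names}
    for line in records:
        for name, value in zip(field_names, line.rstrip("\n").split(delimiter)):
            s = stats[name]
            s["total"] += 1
            if value == "":
                s["nulls"] += 1
            else:
                s["values"][value] += 1
    # Summarise into a high-level view of the data values per field.
    return {name: {"total": s["total"],
                   "null_pct": 100.0 * s["nulls"] / s["total"] if s["total"] else 0.0,
                   "distinct": len(s["values"]),
                   "top": s["values"].most_common(1)}
            for name, s in stats.items()}

# Sample ingested data: one record is missing its amount.
data = ["IN,100", "US,200", "IN,", "IN,300"]
summary = profile(data, ["country", "amount"])
```

A summary like this is enough to spot skewed categories or fields with unexpectedly high null ratios at a glance, before any deeper analysis is run.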