File and Table compare utility for HDFS and Hive

Posted by: Informatica Enterprise Data Integration

The utility compares two files or folders in the Hadoop file system (HDFS). If the files or folders differ and the input is text, the differences are logged.

Overview

Purpose: Even though HDFS claims to be as simple as a regular file system, few tools exist for comparing data on it. This tool allows users to compare data on HDFS or in Hive tables. The purposes of the tool are:
  • To verify data on HDFS or compare Hive tables.
  • To integrate with testing, so that data verification stays in line with the rest of our testing.
  • To give the user control over the level of verification (for example, checksum verification only, or a more detailed analysis, chosen per comparison).
How is the comparison done?
Compare two tables in Hive:
  • Drivers used: Hive JDBC
  • Description: The utility compares two Hive tables row by row. Rather than comparing every column of every row individually, it compares checksums of the row objects. This way even complex or binary types can be compared.
  • The utility can reside outside of Hadoop; the comparison happens by connecting to the requisite Hive servers.
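The row-checksum idea described above can be illustrated with a minimal Python sketch. It is not the utility's actual code: the function names `row_checksum` and `compare_rows`, the choice of SHA-256, and the encoding of column values are all assumptions made for illustration. Rows are plain tuples here, as any database cursor might return them. The sketch also ignores row order, on the assumption that Hive does not guarantee result order without an ORDER BY clause.

```python
import hashlib
from collections import Counter


def row_checksum(row):
    """Return a hex digest for one row (a sequence of column values).

    Hypothetical helper: the encoding scheme below is an assumption,
    not the format used by the utility.
    """
    digest = hashlib.sha256()
    for value in row:
        # Tag each value with a kind marker and its length so that
        # ("ab", "c") and ("a", "bc") do not hash to the same bytes.
        if value is None:
            tag, encoded = b"N", b""
        elif isinstance(value, (bytes, bytearray)):
            tag, encoded = b"B", bytes(value)
        else:
            # Complex values (lists, dicts) are reduced to their repr.
            tag, encoded = b"V", repr(value).encode("utf-8")
        digest.update(tag + str(len(encoded)).encode("ascii") + b":" + encoded)
    return digest.hexdigest()


def compare_rows(source_rows, target_rows):
    """Compare two iterables of rows by checksum, ignoring row order.

    Returns two Counters keyed by checksum: rows missing from the
    target, and rows present only in the target.
    """
    source = Counter(row_checksum(r) for r in source_rows)
    target = Counter(row_checksum(r) for r in target_rows)
    return source - target, target - source
```

Because only digests are compared, the same routine handles binary and nested column values without type-specific comparison logic, at the cost of not reporting which column differs.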
Note: This can be extended to any DB comparison. All our DB verification is done via a similar mechanism.
Compare two files in HDFS:
  • Binary (non-text and non-flat, for example images) files: The utility compares the checksums of the files. The expected baseline is a set of checksums, and the comparison matches each baseline checksum against the checksum of the target file. Please check the flow chart for more details.
Execution details: Please see the attached document in the additional resources section for details.
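As a rough illustration of the baseline checksum comparison for binary files, here is a minimal Python sketch. The names `stream_checksum` and `verify_against_baseline`, the SHA-256 algorithm, and the `open_target` callback are hypothetical; the listing does not state which checksum the utility uses or how it opens HDFS files. Any callable that returns a readable binary stream for a path would fit, so in-memory streams stand in for HDFS here.

```python
import hashlib
import io


def stream_checksum(stream, chunk_size=1 << 20):
    """Hash a binary stream in chunks so large files need not fit in memory."""
    digest = hashlib.sha256()  # algorithm choice is an assumption
    for chunk in iter(lambda: stream.read(chunk_size), b""):
        digest.update(chunk)
    return digest.hexdigest()


def verify_against_baseline(baseline, open_target):
    """Check target files against a baseline of expected checksums.

    baseline:    dict mapping path -> expected hex checksum
    open_target: hypothetical callable, path -> readable binary stream
    Returns the sorted list of paths whose checksum does not match.
    """
    mismatches = []
    for path, expected in sorted(baseline.items()):
        with open_target(path) as stream:
            if stream_checksum(stream) != expected:
                mismatches.append(path)
    return mismatches


if __name__ == "__main__":
    # In-memory stand-ins for files that would live on HDFS.
    files = {"a.png": b"\x89PNG sample bytes", "b.bin": b"\x00" * 10}
    baseline = {p: stream_checksum(io.BytesIO(d)) for p, d in files.items()}
    files["b.bin"] = b"\x01" * 10  # simulate a changed target file
    print(verify_against_baseline(baseline, lambda p: io.BytesIO(files[p])))
```

Storing only checksums in the baseline keeps it small and lets the check run without access to the original files.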

Features

  • Ability to compare two files on HDFS.
  • Ability to compare two directories on HDFS.
  • Ability to compare two Hive tables.
  • Ability to compare two files, one on HDFS and the other on a remote server (Windows, Linux, etc.).

Support