File and Table compare utility for HDFS and Hive
Posted by: Informatica Enterprise Data Integration
The utility compares two files or folders in the Hadoop file system (HDFS). If the files or folders differ and the input is text, the differences are logged.
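As a rough illustration of this behaviour, the Python sketch below compares two files by checksum and produces a line-level diff only when both are text. It is not the utility's own code. It assumes the file contents have already been read from HDFS into bytes, and the names `checksum` and `compare_contents` and the `detailed` flag are hypothetical.

```python
import difflib
import hashlib


def checksum(data: bytes) -> str:
    """Return the SHA-256 hex digest of a file's bytes."""
    return hashlib.sha256(data).hexdigest()


def compare_contents(left: bytes, right: bytes, detailed: bool = True) -> list:
    """Return a list of difference lines; an empty list means the inputs match.

    Assumption: the caller has already read both files from HDFS into bytes.
    """
    if checksum(left) == checksum(right):
        return []
    if not detailed:
        # Checksum-only verification: report the mismatch and stop.
        return ["checksum mismatch"]
    try:
        left_lines = left.decode("utf-8").splitlines()
        right_lines = right.decode("utf-8").splitlines()
    except UnicodeDecodeError:
        # Binary input: only the checksum mismatch can be reported.
        return ["checksum mismatch (binary content, no line diff)"]
    # Text input: produce a unified diff that can be written to a log.
    return list(
        difflib.unified_diff(left_lines, right_lines, "left", "right", lineterm="")
    )
```

Calling `compare_contents(a, b, detailed=False)` stops at the checksum, which is cheaper for large files. The default also reports which lines differ.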
Overview
Purpose: Although HDFS is presented as being as simple to use as a regular file system, few tools exist for comparing the data stored on it. This tool lets users compare data on HDFS or in Hive tables. Its goals are:
- Verify data on HDFS or compare Hive tables.
- Integrate into the test process, consistent with the rest of our testing.
- Let the user control the level of verification (for example, checksum verification only, or a more detailed analysis per comparison).
- Drivers used: Hive JDBC
- Description: The utility can compare two Hive tables row by row. Rather than comparing every column of every row individually, it compares the checksum of each row object, so even complex or binary types can be compared.
- The utility can run outside the Hadoop cluster; the comparison is performed by connecting to the relevant Hive servers.
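The checksum-based row comparison described above might look roughly like the following Python sketch. It is not the utility's implementation. It assumes the rows have already been fetched from each Hive table as tuples (for example through a JDBC or DB-API cursor), and it compares them as unordered collections, which is an assumption because Hive does not guarantee row order without an explicit sort. The names `row_checksum` and `compare_rows` are hypothetical.

```python
import hashlib
from collections import Counter


def row_checksum(row: tuple) -> str:
    """Checksum a whole row; repr() gives bytes and nested lists a text form."""
    return hashlib.sha256(repr(row).encode("utf-8")).hexdigest()


def compare_rows(left_rows, right_rows) -> dict:
    """Count rows whose checksum appears on one side but not the other.

    Assumption: rows are tuples already fetched from the two Hive tables.
    Counter keeps duplicates, so repeated rows are compared by count as well.
    """
    left = Counter(row_checksum(r) for r in left_rows)
    right = Counter(row_checksum(r) for r in right_rows)
    return {
        "only_in_left": sum((left - right).values()),
        "only_in_right": sum((right - left).values()),
    }
```

Because only checksums are compared, a mismatch shows that a row differs but not which column changed. A more detailed analysis would need to fetch and inspect the mismatched rows.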