Dataframe Comparison and Testing
Welcome to compare-datasets! This library is designed to help you compare two datasets and identify the differences between them at a very granular level. The basic use case is validation and testing, but it can be used for any other purpose where you need to compare two datasets and analyse how they differ.
Why?
Ikigai (Japanese: 生き甲斐, lit. 'a reason for being')
- The most natural question is: why do you need this library when pandas.testing.assert_frame_equal() already exists?
- Other existing tools only match rows by position (row number), which often forces you to rearrange and wrangle the data before you can compare it. This library automates that grunt work for you.
- Where other tools return only a boolean, this library generates a comprehensive report that helps you identify the differences between the two datasets. It answers questions like:
- Is the schema of the two datasets the same? If not, what are the differences?
- What rows are missing in the other dataset?
- Is there a pattern in the missing rows?
- The comparison report is written for humans to read and understand. For computers, the library also returns a boolean value, which can be used in testing pipelines. You can even configure the conditions under which the datasets are considered equal. For example, the datasets can be treated as equal if they contain the same rows, regardless of row order.
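
To make the motivation concrete, here is a minimal sketch that uses only pandas. It does not show the compare-datasets API. It illustrates the manual work described above: a positional comparison fails on reordered rows, so you have to sort and re-index by hand, and finding missing rows takes a separate merge. The sample frames and the helper names `frames_equal` and `normalize` are hypothetical and exist only for this illustration.

```python
import pandas as pd
from pandas.testing import assert_frame_equal

# Two frames with the same rows in a different order (made-up sample data).
expected = pd.DataFrame({"id": [1, 2, 3], "name": ["a", "b", "c"]})
actual = pd.DataFrame({"id": [3, 1, 2], "name": ["c", "a", "b"]})


def frames_equal(left, right):
    """Return True if assert_frame_equal passes, False if it raises."""
    try:
        assert_frame_equal(left, right)
        return True
    except AssertionError:
        return False


def normalize(df, key):
    """Sort by a key column and reset the index so row order is ignored."""
    return df.sort_values(key).reset_index(drop=True)


# Positional comparison: fails because the rows are in a different order.
positional_match = frames_equal(expected, actual)

# Manual wrangling: sort both frames by the key before comparing.
order_insensitive_match = frames_equal(
    normalize(expected, "id"), normalize(actual, "id")
)

# Finding missing rows needs yet another step: an outer merge with an indicator.
partial = actual[actual["id"] != 2]  # simulate a dataset that lost a row
merged = expected.merge(partial, on=["id", "name"], how="outer", indicator=True)
missing_ids = merged.loc[merged["_merge"] == "left_only", "id"].tolist()

print(positional_match, order_insensitive_match, missing_ids)
```

Each question in the list above (schema differences, missing rows, row order) needs its own hand-written step in this approach, which is the kind of grunt work the library aims to take over.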