Pandas and Dask test helper methods with beautiful error messages.

Last update: Nov 28, 2022

beavis

Test helpers

These test helper methods are meant to be used in test suites. They provide descriptive error messages to allow for a seamless development workflow.

The test helpers are inspired by chispa and spark-fast-tests, popular test helper libraries for the Spark ecosystem.

Pandas ships its own testing methods, but their error messages are harder to parse. The following sections compare the built-in Pandas output with the Beavis output, so you can judge for yourself.

Column comparisons

The built-in assert_series_equal method makes it hard to tell which rows are equal and which are different, so it is difficult to fix a failing test quickly and stay in flow.

Here's the built-in error message when comparing series that are not equal.

import pandas as pd
df = pd.DataFrame({"col1": [1042, 2, 9, 6], "col2": [5, 2, 7, 6]})
# Rows 0 and 2 differ, so this raises an AssertionError.
pd.testing.assert_series_equal(df["col1"], df["col2"])

>   ???
E   AssertionError: Series are different
E
E   Series values are different (50.0 %)
E   [index]: [0, 1, 2, 3]
E   [left]:  [1042, 2, 9, 6]
E   [right]: [5, 2, 7, 6]

The equivalent beavis call produces an error message that aligns the rows and highlights the mismatches in red.

import beavis

beavis.assert_pd_column_equality(df, "col1", "col2")
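
To make "aligns rows" concrete, the sketch below builds a row-by-row report with plain pandas. The helper name format_column_mismatches and its output layout are hypothetical, chosen only for illustration. This is not how beavis is implemented, and it flags mismatches with a text marker rather than color.

```python
import pandas as pd


def format_column_mismatches(df, col1, col2):
    """Return one report line per row, flagging rows where the columns differ.

    Hypothetical helper for illustration only; it is not part of beavis.
    """
    # Header line with right-aligned column names.
    lines = [f"{col1:>8} | {col2:>8} |"]
    # Walk both columns in step so each report line pairs the values of one row.
    for left, right in zip(df[col1], df[col2]):
        flag = "" if left == right else " <-- mismatch"
        lines.append(f"{left:>8} | {right:>8} |{flag}")
    return lines


df = pd.DataFrame({"col1": [1042, 2, 9, 6], "col2": [5, 2, 7, 6]})
report = format_column_mismatches(df, "col1", "col2")
print("\n".join(report))
```

Printing the values side by side, one row per line, is what lets a reader spot the failing rows at a glance instead of comparing two flat lists by position.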

You can also compare columns in a Dask DataFrame.

import dask.dataframe as dd

ddf = dd.from_pandas(df, npartitions=2)
beavis.assert_dd_column_equality(ddf, "col1", "col2")

The assert_dd_column_equality error message is similarly descriptive.

DataFrame comparisons

The built-in pandas.testing.assert_frame_equal method doesn't output an error message that's easy to understand. See this example:

df1 = pd.DataFrame({'col1': [1, 2], 'col2': [3, 4]})
df2 = pd.DataFrame({'col1': [5, 2], 'col2': [3, 4]})
pd.testing.assert_frame_equal(df1, df2)

E   AssertionError: DataFrame.iloc[:, 0] (column name="col1") are different
E
E   DataFrame.iloc[:, 0] (column name="col1") values are different (50.0 %)
E   [index]: [0, 1]
E   [left]:  [1, 2]
E   [right]: [5, 2]

beavis provides a nicer error message.

beavis.assert_pd_equality(df1, df2)

DataFrame comparison options:

check_index (default True)
check_dtype (default True)
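
The option names suggest what each flag relaxes. The sketch below uses only pandas to show the two kinds of difference involved: frames whose values match but whose index labels or dtypes do not. The helper frames_equal is hypothetical, and how beavis applies check_index and check_dtype internally is an assumption here, as is passing them as keyword arguments.

```python
import pandas as pd


def frames_equal(a, b, **kwargs):
    """Return True if pandas' assert_frame_equal passes, else False.

    Hypothetical wrapper for illustration; it is not part of beavis.
    """
    try:
        pd.testing.assert_frame_equal(a, b, **kwargs)
        return True
    except AssertionError:
        return False


# Same values, different index labels.
left = pd.DataFrame({"col1": [1, 2]}, index=[0, 1])
right = pd.DataFrame({"col1": [1, 2]}, index=[10, 11])
index_sensitive = frames_equal(left, right)
# Dropping the index on both sides compares values by position only,
# which is the kind of check that check_index=False presumably allows.
index_ignored = frames_equal(
    left.reset_index(drop=True), right.reset_index(drop=True)
)

# Same values, different dtypes (int64 vs float64).
ints = pd.DataFrame({"col1": [1, 2]})
floats = pd.DataFrame({"col1": [1.0, 2.0]})
dtype_sensitive = frames_equal(ints, floats)
dtype_ignored = frames_equal(ints, floats, check_dtype=False)

print(index_sensitive, index_ignored, dtype_sensitive, dtype_ignored)
```

Leaving both checks on by default is the stricter choice: a test then fails when a pipeline silently changes an index or casts integers to floats, and you relax a check only where that difference does not matter.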

Let's convert the Pandas DataFrames to Dask DataFrames and use the assert_dd_equality function to check they're equal.

ddf1 = dd.from_pandas(df1, npartitions=2)
ddf2 = dd.from_pandas(df2, npartitions=2)
beavis.assert_dd_equality(ddf1, ddf2)

These DataFrames differ in the first row of col1, so the assertion fails with an error message that's easy to debug.

Development

Install Poetry and run poetry install to create a virtual environment with all the Beavis dependencies on your machine.

Other useful commands:

poetry run pytest tests runs the test suite
poetry run black . formats the code
poetry build packages the library in a wheel file
poetry publish releases the library to PyPI (requires valid credentials)

Owner

Matthew Powers
