Polars vs Pandas - Quantile Method

I set out this weekend to port the data processing pipeline for the BrewBots IoT data from Pandas to Polars. Specifically, this code processes the accelerometer data from the BrewBots and calculates the tilt angles.

We’ve recently been seeing some heavy memory spikes and I’ve heard the promise of Polars being much faster and more memory efficient than Pandas.

I wanted to see if this was true.

After a lot of learning about DataFrames in general, and plenty of fits and starts, I got something very close: comparing the output side by side, everything matched except the max angle, which was off for a decent number of tilts.

I zeroed in on one specific tilt’s data to step through it.

Up until this point, Polars and Pandas were producing the same results (with the exception that my computed tilt IDs were 1-based in Pandas and 0-based in Polars).

┌────────────┬───────┬──────┬─────────┬───────────┐
│ tstamp     ┆ angle ┆ tilt ┆ tilt_id ┆ angle_bin │
│ ---        ┆ ---   ┆ ---  ┆ ---     ┆ ---       │
│ i64        ┆ f64   ┆ bool ┆ u32     ┆ cat       │
╞════════════╪═══════╪══════╪═════════╪═══════════╡
│ 1723809945 ┆ 44.6  ┆ true ┆ 110     ┆ 40_50     │
│ 1723809946 ┆ 63.3  ┆ true ┆ 110     ┆ 60_70     │
│ 1723809947 ┆ 38.5  ┆ true ┆ 110     ┆ 30_40     │
│ 1723809948 ┆ 150.3 ┆ true ┆ 110     ┆ 90_inf    │
│ 1723809949 ┆ 68.2  ┆ true ┆ 110     ┆ 60_70     │
│ 1723809950 ┆ 31.8  ┆ true ┆ 110     ┆ 30_40     │
│ 1723809951 ┆ 46.15 ┆ true ┆ 110     ┆ 40_50     │
│ 1723809952 ┆ 44.7  ┆ true ┆ 110     ┆ 40_50     │
│ 1723809953 ┆ 53.1  ┆ true ┆ 110     ┆ 50_60     │
│ 1723809954 ┆ 68.3  ┆ true ┆ 110     ┆ 60_70     │
│ 1723809955 ┆ 43.5  ┆ true ┆ 110     ┆ 40_50     │
│ 1723809956 ┆ 19.75 ┆ true ┆ 110     ┆ 10_20     │
│ 1723809957 ┆ 26.1  ┆ true ┆ 110     ┆ 20_30     │
│ 1723809958 ┆ 24.1  ┆ true ┆ 110     ┆ 20_30     │
│ 1723809959 ┆ 43.75 ┆ true ┆ 110     ┆ 40_50     │
│ 1723809960 ┆ 33.9  ┆ true ┆ 110     ┆ 30_40     │
│ 1723809961 ┆ 43.8  ┆ true ┆ 110     ┆ 40_50     │
│ 1723809962 ┆ 103.5 ┆ true ┆ 110     ┆ 90_inf    │
│ 1723809963 ┆ 50.2  ┆ true ┆ 110     ┆ 50_60     │
│ 1723809964 ┆ 37.9  ┆ true ┆ 110     ┆ 30_40     │
│ 1723809965 ┆ 34.95 ┆ true ┆ 110     ┆ 30_40     │
│ 1723809966 ┆ 36.7  ┆ true ┆ 110     ┆ 30_40     │
│ 1723809967 ┆ 43.5  ┆ true ┆ 110     ┆ 40_50     │
│ 1723809968 ┆ 39.85 ┆ true ┆ 110     ┆ 30_40     │
│ 1723809969 ┆ 36.8  ┆ true ┆ 110     ┆ 30_40     │
└────────────┴───────┴──────┴─────────┴───────────┘

We collapse this data down to a single BotTilt Django model instance by recording the max angle and, in a JSON field, the counts of angles binned by the angle_bin column’s categories.

The bins between the two libraries were the same but the max angle was 150.3 in Polars and 127.8 in Pandas.

The Pandas calculation was:

angle_max = df["angle"].quantile(0.98)

While the Polars calculation, after a naive port by yours truly, was:

angle_max = df.select(pl.col("angle").quantile(0.98)).item()

Diving a bit deeper, it seems that Pandas’ quantile method defaults to “linear” interpolation while Polars defaults to “nearest”.

I was able to update the Polars calculation to the following and get all the data matching up exactly:

angle_max = df.select(
  pl.col("angle").quantile(0.98, interpolation="linear")
).item()

I haven’t done any full benchmarking yet, but on some one-off payloads I’m seeing around 20% better memory usage and noticeably faster runtimes. Admittedly, about half of that improvement comes from fine-tuning the data types I’m explicitly casting to within Polars.