fix out of bounds exception when calling aggregate and distinct#146

Open
mcoady wants to merge 5 commits into DataHaskell:main from mcoady:main

mcoady (Contributor) commented Feb 14, 2026

…on a dataframe with no rows. Additionally, split out decodeSeparated from readSeparated

Description

A pipeline I was using was failing with some inputs. The proposed fix would make the behaviour the same as in polars and pandas (which is to return the dataframe with no rows).

I suspect there's a better fix at a lower level, but it wasn't immediately obvious from what I can see, so I'd have to dig into things a bit more.

I noticed a similar out-of-bounds issue when calling distinct on a dataframe with no rows; the fix is similar.
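
For illustration, the general shape of this kind of fix can be sketched with Data.Vector.Unboxed, the module named in the stack trace below. This is not the code changed in this PR: sumByKey is a hypothetical helper, and it assumes the keys are already sorted so that equal keys are adjacent and that both vectors have the same length. The point is only the VU.null guard that runs before anything indexes position 0.

```haskell
import qualified Data.Vector.Unboxed as VU

-- Hypothetical helper, NOT the dataframe library's implementation.
-- Sums values per run of equal adjacent keys.
-- Assumes keys and values have the same length.
sumByKey :: VU.Vector Int -> VU.Vector Int -> [(Int, Int)]
sumByKey keys values
  -- Guard: without this, (keys VU.! 0) below throws
  -- "index out of bounds (0,0)" on empty input.
  | VU.null keys = []
  | otherwise = go (keys VU.! 0) 0 0
  where
    n = VU.length keys
    -- k: key of the current run, acc: running sum, i: next row index
    go k acc i
      | i >= n = [(k, acc)]
      | (keys VU.! i) == k = go k (acc + (values VU.! i)) (i + 1)
      | otherwise = (k, acc) : go (keys VU.! i) (values VU.! i) (i + 1)
```

With the sample data from this description, sumByKey (VU.fromList [1,1,2]) (VU.fromList [10,10,50]) gives [(1,20),(2,50)], and empty input gives [] instead of an exception.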

The other change splits decodeSeparated out from readSeparated so that a ByteString can be handled directly (that's what I'm looking at for my use case).
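
As a rough sketch of that kind of split (the names decodeFields and readFields are hypothetical and are not the library's signatures), the pure decoder works on a ByteString and the file reader just delegates to it. It assumes a naive format with no quoting or escaping.

```haskell
import qualified Data.ByteString.Char8 as BS

-- Hypothetical names for illustration; not the dataframe API.
-- Pure step: split raw bytes into rows, then each row into fields.
-- Naive: does not handle quoted fields or escaped separators.
decodeFields :: Char -> BS.ByteString -> [[BS.ByteString]]
decodeFields sep = map (BS.split sep) . BS.lines

-- IO step: read the file, then reuse the pure decoder.
readFields :: Char -> FilePath -> IO [[BS.ByteString]]
readFields sep path = decodeFields sep <$> BS.readFile path
```

Callers that already hold the bytes (from a network response, say) can call the decoder directly without going through a file.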

Behaviour before

Doing a basic groupBy and aggregate -

dataframe> df <- D.readCsv "test.csv"
dataframe> df
----------------
category | value
---------|------
  Int    |  Int
---------|------
1        | 10
1        | 10
2        | 50
dataframe> df |> D.groupBy ["category"] |> D.aggregate [F.sum @Int (F.col "value") `D.as` "sum(value)"]
---------------------
category | sum(value)
---------|-----------
  Int    |    Int
---------|-----------
2        | 50
1        | 20

So far so good.

The issue happens if the initial dataframe has no rows. groupBy succeeds, but aggregate fails -

dataframe> df |> D.drop 3 |> D.groupBy ["category"] |> D.aggregate [F.sum @Int (F.col "value") `D.as` "sum(value)"]
*** Exception: index out of bounds (0,0)
CallStack (from HasCallStack):
  error, called at src\Data\Vector\Internal\Check.hs:103:12 in vector-0.13.2.0-e4bb32128741c830a7d55f593a2ff07c218135aa:Data.Vector.Internal.Check
  checkError, called at src\Data\Vector\Internal\Check.hs:109:17 in vector-0.13.2.0-e4bb32128741c830a7d55f593a2ff07c218135aa:Data.Vector.Internal.Check
  check, called at src\Data\Vector\Internal\Check.hs:122:5 in vector-0.13.2.0-e4bb32128741c830a7d55f593a2ff07c218135aa:Data.Vector.Internal.Check
  checkIndex, called at src\Data\Vector\Generic.hs:249:12 in vector-0.13.2.0-e4bb32128741c830a7d55f593a2ff07c218135aa:Data.Vector.Generic
  !, called at src\Data\Vector\Unboxed.hs:315:7 in vector-0.13.2.0-e4bb32128741c830a7d55f593a2ff07c218135aa:Data.Vector.Unboxed

Behaviour after

dataframe> df |> D.drop 3 |> D.groupBy ["category"] |> D.aggregate [F.sum @Int (F.col "value") `D.as` "sum(value)"]
---------------------
category | sum(value)
---------|-----------
  Int    |    Int
---------|-----------

daikonradish (Contributor) commented

@mcoady great catch. Thanks for reporting the bug and fixing it!

Would it be possible to add a couple of tests? The examples that you give are pretty much good to go, just add them here: https://github.com/DataHaskell/dataframe/blob/48aa6cccd650864313489250a359eddf7e75b0d9/tests/Operations/GroupBy.hs

Just a case like groupByProducesNoRows :: Test would do.
