From 747cfa382e7186d162e8431a0dbf0f1fc4e91223 Mon Sep 17 00:00:00 2001 From: Adam Gutglick Date: Wed, 25 Feb 2026 21:08:08 +0000 Subject: [PATCH 1/6] WIP: Variant RFC Signed-off-by: Adam Gutglick --- proposals/0015-variant-type.md | 81 ++++++++++++++++++++++++++++++++++ 1 file changed, 81 insertions(+) create mode 100644 proposals/0015-variant-type.md diff --git a/proposals/0015-variant-type.md b/proposals/0015-variant-type.md new file mode 100644 index 0000000..7dc05a0 --- /dev/null +++ b/proposals/0015-variant-type.md @@ -0,0 +1,81 @@ +- Start Date: 2025-02-25 +- RFC PR: [vortex-data/rfcs#0000](https://github.com/vortex-data/rfcs/pull/15) +- Tracking Issue: [vortex-data/vortex#0000](https://github.com/vortex-data/vortex/issues/0000) + +## Summary + +Vortex currently requires a strict schema, but real world data is often only semi-structured. Logs, traces and user-generated data often try to capture generally sparse data, which requires some processing to make it useful for most analytical systems. + +This proposal introduces a new type - `Variant`, which can capture data with row-level schema, while storing it in a columnar form that can compress well while being available for efficient analysis in a columnar format. + +## Design + +We'll start with a rough description of the variant type, as many different systems define in different ways (see the [Prior Art] section at the bottom of the page). + +The variant can be commonly described as the following rust type: + +```rust +enum Variant { + Value(Scalar), + List(Vec), + Object(BTreeMap), // Usually sorted to allow efficent key finding +} +``` + +Different systems have different variations of this idea, but at its core its a type that can hold nested data with either a flexible or no schema. In addition to this "catch all" column, most system include the concept of "shredding", extracting a key with a specific type out of this column, and storing it in a dense way. 
This design can make commonly access subfields perform like first-class columns, while keeping the overall schema flexible. + +I propose a new dtype - `DType::Variant`. The variant type is always nullable, and its canonical encoding is just an array with a single child array, which is encoded in some specialized variant type. + +In addition to a new canonical encoding, we'll need a few more pieces to make variant columns useful: + +1. A set of new expressions, which extract children of variant arrays with a combination of path (similarly to `GetExpr`) and a dtype. +2. Extending the compressor to support writing variant columns, and making choices like "which columns should be shredded" either automatically based on a set of heuristics, or by user-provided configuration. +3. As different systems support different variations of this idea, we'll probably end up with multiple potential encodings. The most obvious one to start with is the `parquet-variant` arrow encoding, which is now a canonical Arrow extension type. + +## Prior Art + +### Parquet/Arrow + +The full details can be found in the [encoding](https://github.com/apache/parquet-format/blob/master/VariantEncoding.md) and [shredding](https://github.com/apache/parquet-format/blob/master/VariantShredding.md) specification, but I'll try and capture it here to the best of my understanding. + +#### Un-shredded columns + +Parquet represents the column as a group with two binary fields - `metadata` and `value`. The `metadata` array contains type information for arrays and objects, including field names and offsets. The `value` array contains the serialized values, each prefaced with a 1-byte header containing basic type information. +In Parquet - the variant type has its [own type system](https://github.com/apache/parquet-format/blob/master/VariantEncoding.md#encoding-types), as it doesn't have a "scalar" concept, and that type system is also used when it's loaded into Arrow to save on serialization. 
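To make the 1-byte header concrete, here is an illustrative decode under my reading of the Parquet spec: the low 2 bits select the basic type, and the remaining 6 bits carry type-specific information (for example, a short string's length). This is a sketch for the RFC, not code from any of the libraries discussed here:

```rust
/// Basic types selected by the low 2 bits of the Parquet variant
/// value header, per my reading of the VariantEncoding spec.
#[derive(Debug, PartialEq)]
enum BasicType {
    Primitive,   // 0
    ShortString, // 1
    Object,      // 2
    Array,       // 3
}

/// Split the 1-byte value header into (basic type, type-specific bits).
/// For a short string, the high 6 bits are the string's length.
fn decode_header(byte: u8) -> (BasicType, u8) {
    let basic = match byte & 0b11 {
        0 => BasicType::Primitive,
        1 => BasicType::ShortString,
        2 => BasicType::Object,
        _ => BasicType::Array,
    };
    (basic, byte >> 2)
}
```

The actual meaning of the upper 6 bits depends on the basic type; the spec linked above is authoritative.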
+ +#### Shredded columns + +When shredding columns, the data is stored in an optional `typed_value` column, which can be any type (including a `Variant`). +Depending on the level of nesting of the data, there are many cases that need to be considered to differentiate between null and missing values and support for various types. They are all described in the [Variant Shredding](https://github.com/apache/parquet-format/blob/master/VariantShredding.md) specification. + +#### Statistics + +Statistics are only stored for the shredded columns, at the file/row group or page level. + +#### In-Memory + +When loaded into memory, Arrow has defined a [canonical extension type](https://arrow.apache.org/docs/format/CanonicalExtensions.html#parquet-variant) to support Parquet's variant type. Its stored as a struct array, which contains a mandatory `metadata` binary child, an optional binary `value` child, and an optional `typed_value` which can be a "variant primitive", list or struct, allowing for nested shredding. + +### Clickhouse + +As described in [this](https://clickhouse.com/blog/a-new-powerful-json-data-type-for-clickhouse#building-block-2---dynamic-type) fantastic blogpost, Clickhouse offers multiple features that build on top of each other to support similar data: +1. [Variant](https://clickhouse.com/docs/sql-reference/data-types/variant) - Allows for arbitrary nesting of types. Variant can contains ints, strings and arrays of ints, strings or another variant type (note the lack of the "object" variant). Each leaf (`col_x.str` vs `col_x.int32`) column is stored separately with some additional metadata which points to which one is used by each row. Types have to be declared in advance. +2. [Dynamic](https://clickhouse.com/docs/sql-reference/data-types/dynamic) - Like variant, but types don't have to be declared in advance. Shreds a limited number of columns +3. 
[JSON](https://clickhouse.com/docs/sql-reference/data-types/newjson) - Builds on top of `Dynamic`, with a few specialized features - allowing users to specify known "typed paths", how many dynamic paths and types to support for untyped paths, and some JSON-specific configuration allowing skipping specific JSON paths on insert. +The full blogpost is worth reading, but Clickhouse's on-disk model roughly mirrors the arrow in-memory format, and they store some metadata outside of the array. + +### Others + +- Iceberg seems to support the variant type (as described in [this](https://docs.google.com/document/d/1sq70XDiWJ2DemWyA5dVB80gKzwi0CWoM0LOWM7VJVd8/edit?tab=t.0) proposal), but the docs are minimal. +- Datafusion's variant support is being developed [here](https://github.com/datafusion-contrib/datafusion-variant), it's unclear to me how much effort is going into it and whether it's going to be merged upstream. +- DuckDB doesn't support a variant type. It does have a [Union](https://docs.google.com/document/d/1sq70XDiWJ2DemWyA5dVB80gKzwi0CWoM0LOWM7VJVd8/edit?tab=t.0) type, but its basically a struct. It also seems to have support for Parquet's shredding, but I can't find any docs and seems like PRs are being merged as I'm looking through their issues. +- Databricks supports some specialized [variant functions](https://docs.databricks.com/gcp/en/sql/language-manual/sql-ref-functions-builtin#variant-functions). + +## Unresolved Questions + +- Do we want a JSON extension type that automatically compresses as variant? +- How do variant expressions operate over different variant encodings? + +## Future Possibilities + +What natural extensions or follow-on work does this enable? This is a good place to note related ideas that are out of scope for this RFC but worth capturing. 
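The `enum Variant` in the RFC elides its type parameters. A fully typed sketch, with a simple path lookup over the sorted object representation, could look like the following. `Scalar` is stubbed as `i64` purely for illustration; in Vortex it would be the existing scalar type, and `get_path` is a hypothetical helper, not a proposed API:

```rust
use std::collections::BTreeMap;

// Stand-in for the Vortex Scalar type, for illustration only.
type Scalar = i64;

enum Variant {
    Value(Scalar),
    List(Vec<Variant>),
    // Sorted keys give O(log n) key finding, as the RFC notes.
    Object(BTreeMap<String, Variant>),
}

/// Walk a path of object keys, returning None if any step is missing
/// or the current node is not an object.
fn get_path<'a>(v: &'a Variant, path: &[&str]) -> Option<&'a Variant> {
    path.iter().try_fold(v, |node, key| match node {
        Variant::Object(fields) => fields.get(*key),
        _ => None,
    })
}
```

Shredding, discussed above, is essentially the observation that for hot paths this tree walk can be replaced by a direct read from a dense, typed column.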
From 31a231daa1da669943bc5d6c86c0c37d8d054ab4 Mon Sep 17 00:00:00 2001 From: Adam Gutglick Date: Thu, 26 Feb 2026 12:26:02 +0000 Subject: [PATCH 2/6] More things Signed-off-by: Adam Gutglick --- proposals/0015-variant-type.md | 50 ++++++++++++++++++++++++++++------ 1 file changed, 41 insertions(+), 9 deletions(-) diff --git a/proposals/0015-variant-type.md b/proposals/0015-variant-type.md index 7dc05a0..1a8f5bc 100644 --- a/proposals/0015-variant-type.md +++ b/proposals/0015-variant-type.md @@ -1,7 +1,9 @@ - Start Date: 2025-02-25 -- RFC PR: [vortex-data/rfcs#0000](https://github.com/vortex-data/rfcs/pull/15) +- RFC PR: [vortex-data/rfcs#15](https://github.com/vortex-data/rfcs/pull/15) - Tracking Issue: [vortex-data/vortex#0000](https://github.com/vortex-data/vortex/issues/0000) +# Variant Type + ## Summary Vortex currently requires a strict schema, but real world data is often only semi-structured. Logs, traces and user-generated data often try to capture generally sparse data, which requires some processing to make it useful for most analytical systems. @@ -18,7 +20,7 @@ The variant can be commonly described as the following rust type: enum Variant { Value(Scalar), List(Vec), - Object(BTreeMap), // Usually sorted to allow efficent key finding + Object(BTreeMap), // Usually sorted to allow efficient key finding } ``` @@ -26,11 +28,40 @@ Different systems have different variations of this idea, but at its core its a I propose a new dtype - `DType::Variant`. The variant type is always nullable, and its canonical encoding is just an array with a single child array, which is encoded in some specialized variant type. -In addition to a new canonical encoding, we'll need a few more pieces to make variant columns useful: +### Nullability + +In order to support data with a changing or unexpected schema, Variant arrays are always nullable. Even for a specific key/path, its value might change type between items, which will cause null values in shredded children. 
+ +Combined with shredding, handling nulls can be complex and is encoding dependent (Like this [parquet example](https://github.com/apache/parquet-format/blob/master/VariantShredding.md#arrays) for handling arrays). + +### Expressions + +Variant columns are commonly accessed through a combination of column, path and the desired type, which are all required to extract a column with a known type. Our current `GetItem` has two issues: + +1. It assumes the input can be executed into a struct array. +2. Access is only based on name. + +I suggest we add a new expression - `get_variant_element(path, dtype)` (name TBD) which will support flexible paths and allow extracting children from variants. + +Every variant encoding will need to be able to dispatch these behaviors, returning arrays of the expected type. + +### Arrow representation -1. A set of new expressions, which extract children of variant arrays with a combination of path (similarly to `GetExpr`) and a dtype. -2. Extending the compressor to support writing variant columns, and making choices like "which columns should be shredded" either automatically based on a set of heuristics, or by user-provided configuration. -3. As different systems support different variations of this idea, we'll probably end up with multiple potential encodings. The most obvious one to start with is the `parquet-variant` arrow encoding, which is now a canonical Arrow extension type. +Arrow now has a new [canonical extension type](https://arrow.apache.org/docs/format/CanonicalExtensions.html#parquet-variant) to represent Parquet's variant type. I think supporting this encoding will be a good start, but it requires supporting Arrow extension types. + +Supporting extension types requires replacing the target `DataType` and nullability with a `Field`, which also includes metadata like a desired extension type. I belive this change is desired, as Vortex DTypes include more information than a plain `arrow::DataType`. 
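The point about `Field` versus a bare `DataType` can be illustrated with a stand-in struct (this mirrors the shape of an Arrow field but is not the arrow-rs API; the extension name string used below is a placeholder, while `ARROW:extension:name` is the metadata key Arrow actually defines for tagging extension types):

```rust
use std::collections::HashMap;

/// Illustrative stand-in for an Arrow field. Unlike a bare DataType,
/// it carries nullability plus key/value metadata, which is where
/// Arrow records extension-type tags.
struct Field {
    name: String,
    data_type: String, // stand-in for arrow::DataType
    nullable: bool,
    metadata: HashMap<String, String>,
}

impl Field {
    /// The Arrow-defined metadata key for an extension type's name.
    fn extension_name(&self) -> Option<&str> {
        self.metadata.get("ARROW:extension:name").map(String::as_str)
    }
}
```

This is why exporting a variant column needs the whole field, not just the storage type: the extension tag lives in the field metadata.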
+ +### Scalar + +While there has been talk for a long time of converting the Vortex scalar system from an enum to length 1 arrays, I do believe the current system actually works very well for variants, and the Variant scalar can just be some version of the type described above. + +Just like when extracting child arrays, Variant's need to support an additional expression, `get_variant_scalar(idx, path, dtype)` that will indicate the desired dtype. + +### Constructing and writing scalars + +The API for creating variant arrays is complex, as shredding decisions need to be made either before hand based on data-specific knowledge, or on the fly during writes. + +In the medium/long term, I believe the compressor should support a JSON extension type, which will take JSON formatted UTF8 column, and parse it gradually into a binary formatted and typed variant encoding. ## Prior Art @@ -54,15 +85,16 @@ Statistics are only stored for the shredded columns, at the file/row group or pa #### In-Memory -When loaded into memory, Arrow has defined a [canonical extension type](https://arrow.apache.org/docs/format/CanonicalExtensions.html#parquet-variant) to support Parquet's variant type. Its stored as a struct array, which contains a mandatory `metadata` binary child, an optional binary `value` child, and an optional `typed_value` which can be a "variant primitive", list or struct, allowing for nested shredding. +When loaded into memory, Arrow has defined a [canonical extension type](https://arrow.apache.org/docs/format/CanonicalExtensions.html#parquet-variant) to support Parquet's variant type. Its stored as a struct array, which contains a mandatory `metadata` binary child, an optional binary `value` child, and an optional `typed_value` which can be a "variant primitive", list or struct, allowing for nested shredding. 
### Clickhouse As described in [this](https://clickhouse.com/blog/a-new-powerful-json-data-type-for-clickhouse#building-block-2---dynamic-type) fantastic blogpost, Clickhouse offers multiple features that build on top of each other to support similar data: -1. [Variant](https://clickhouse.com/docs/sql-reference/data-types/variant) - Allows for arbitrary nesting of types. Variant can contains ints, strings and arrays of ints, strings or another variant type (note the lack of the "object" variant). Each leaf (`col_x.str` vs `col_x.int32`) column is stored separately with some additional metadata which points to which one is used by each row. Types have to be declared in advance. + +1. [Variant](https://clickhouse.com/docs/sql-reference/data-types/variant) - Allows for arbitrary nesting of types. Variant can contains ints, strings and arrays of ints, strings or another variant type (note the lack of the "object" variant). Each leaf (`col_x.str` vs `col_x.int32`) column is stored separately with some additional metadata which points to which one is used by each row. Types have to be declared in advance. 2. [Dynamic](https://clickhouse.com/docs/sql-reference/data-types/dynamic) - Like variant, but types don't have to be declared in advance. Shreds a limited number of columns 3. [JSON](https://clickhouse.com/docs/sql-reference/data-types/newjson) - Builds on top of `Dynamic`, with a few specialized features - allowing users to specify known "typed paths", how many dynamic paths and types to support for untyped paths, and some JSON-specific configuration allowing skipping specific JSON paths on insert. -The full blogpost is worth reading, but Clickhouse's on-disk model roughly mirrors the arrow in-memory format, and they store some metadata outside of the array. + The full blogpost is worth reading, but Clickhouse's on-disk model roughly mirrors the arrow in-memory format, and they store some metadata outside of the array. 
### Others From 4ee291ee5a023016ee9a64a6a16b8e8385b4bc55 Mon Sep 17 00:00:00 2001 From: Adam Gutglick Date: Thu, 26 Feb 2026 12:43:48 +0000 Subject: [PATCH 3/6] more Signed-off-by: Adam Gutglick --- proposals/0015-variant-type.md | 9 ++++++--- 1 file changed, 6 insertions(+), 3 deletions(-) diff --git a/proposals/0015-variant-type.md b/proposals/0015-variant-type.md index 1a8f5bc..a0910bd 100644 --- a/proposals/0015-variant-type.md +++ b/proposals/0015-variant-type.md @@ -57,11 +57,14 @@ While there has been talk for a long time of converting the Vortex scalar system Just like when extracting child arrays, Variant's need to support an additional expression, `get_variant_scalar(idx, path, dtype)` that will indicate the desired dtype. -### Constructing and writing scalars +### Path to usefulness -The API for creating variant arrays is complex, as shredding decisions need to be made either before hand based on data-specific knowledge, or on the fly during writes. +A key component of making variants useable will be making sure the experience of writing and using them , without forcing them to go through complex builders or serialization (unless they require it). -In the medium/long term, I believe the compressor should support a JSON extension type, which will take JSON formatted UTF8 column, and parse it gradually into a binary formatted and typed variant encoding. +I can see multiple things we can do: + +1. The compressor should support compressing arrays with the JSON extension type into variant columns, initially with a pre-configured policy and potentially with more complex heuristics, as seen in the [JSON Tiles paper](https://db.in.tum.de/~durner/papers/json-tiles-sigmod21.pdf). +2. Add expressions to convert UTF-8 arrays formatted as JSON into variants, and vice versa. This can also include some other parsing and utilities to handle JSON. 
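A pre-configured shredding policy of the kind item 1 describes can be very simple. The sketch below shreds any key that appears in at least a given fraction of rows; it is a toy stand-in for the much richer frequency-and-type analysis in JSON Tiles, and the row representation is illustrative:

```rust
use std::collections::BTreeMap;

/// Decide which keys to shred: any key appearing in at least
/// `min_frac` of rows gets its own dense column. A real policy would
/// also check that the key's type is stable across rows.
fn keys_to_shred(rows: &[BTreeMap<&str, &str>], min_frac: f64) -> Vec<String> {
    let mut counts: BTreeMap<&str, usize> = BTreeMap::new();
    for row in rows {
        for key in row.keys().copied() {
            *counts.entry(key).or_insert(0) += 1;
        }
    }
    counts
        .into_iter()
        .filter(|(_, n)| *n as f64 >= min_frac * rows.len() as f64)
        .map(|(k, _)| k.to_string())
        .collect()
}
```

Everything below the threshold would stay in the catch-all binary-encoded column.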
## Prior Art From 6591115cab836103ade0f5651d3c66729911a2d8 Mon Sep 17 00:00:00 2001 From: Adam Gutglick Date: Thu, 26 Feb 2026 13:06:33 +0000 Subject: [PATCH 4/6] more Signed-off-by: Adam Gutglick --- proposals/0015-variant-type.md | 26 +++++++++++++++++--------- 1 file changed, 17 insertions(+), 9 deletions(-) diff --git a/proposals/0015-variant-type.md b/proposals/0015-variant-type.md index a0910bd..7ebd3f8 100644 --- a/proposals/0015-variant-type.md +++ b/proposals/0015-variant-type.md @@ -6,7 +6,7 @@ ## Summary -Vortex currently requires a strict schema, but real world data is often only semi-structured. Logs, traces and user-generated data often try to capture generally sparse data, which requires some processing to make it useful for most analytical systems. +Vortex currently requires a strict schema, but real world data is often only semi-structured and deeply hierarchical. Logs, traces and user-generated data often take the form of many sparse fields. This proposal introduces a new type - `Variant`, which can capture data with row-level schema, while storing it in a columnar form that can compress well while being available for efficient analysis in a columnar format. @@ -24,7 +24,7 @@ enum Variant { } ``` -Different systems have different variations of this idea, but at its core its a type that can hold nested data with either a flexible or no schema. In addition to this "catch all" column, most system include the concept of "shredding", extracting a key with a specific type out of this column, and storing it in a dense way. This design can make commonly access subfields perform like first-class columns, while keeping the overall schema flexible. +Different systems have different variations of this idea, but at its core its a type that can hold nested data with either a flexible or no schema. 
In addition to this "catch all" column, most system include the concept of "shredding", extracting a key with a specific type out of this column, and storing it in a dense way. This design can make commonly accessed subfields perform like first-class columns, while keeping the overall schema flexible. I propose a new dtype... @@ -41,15 +41,19 @@ Variant columns are commonly accessed through a combination of column, path and 1. It assumes the input can be executed into a struct array. 2. Access is only based on name. -I suggest we add a new expression - `get_variant_element(path, dtype)` (name TBD) which will support flexible paths and allow extracting children from variants. +I suggest we add a new expression - `get_variant_element(path, dtype)` (name TBD) which will support flexible paths and allow extracting children from variants. I use the `path` argument in this document loosely, but a subset of JSONPath might be appropriate here; see the [prior art](#prior-art) section for how other systems handle it. Every variant encoding will need to be able to dispatch these behaviors, returning arrays of the expected type. +### Stats and pushdown + +Statistics will only be collected for shredded children of the variant array. As all variant expressions are typed, this will allow us to not only use the same type of pushdown we have currently have, but for row ranges where specific key might exist but with an unexpected type, we could skip it completely. + ### Arrow representation Arrow now has a new [canonical extension type](https://arrow.apache.org/docs/format/CanonicalExtensions.html#parquet-variant) to represent Parquet's variant type. I think supporting this encoding will be a good start, but it requires supporting Arrow extension types. 
-Supporting extension types requires replacing the target `DataType` and nullability with a `Field`, which also includes metadata like a desired extension type. I belive this change is desired, as Vortex DTypes include more information than a plain `arrow::DataType`. +Supporting extension types requires replacing the target `DataType` and nullability with a `Field`, which also includes metadata like a desired extension type. I believe this change is desired, as Vortex DTypes include more information than a plain `arrow::DataType`. ### Scalar @@ -59,15 +63,19 @@ Just like when extracting child arrays, Variant's need to support an additional ### Path to usefulness -A key component of making variants useable will be making sure the experience of writing and using them , without forcing them to go through complex builders or serialization (unless they require it). +A key component of making variants useable will be making sure the experience of writing and using them, without forcing them to go through complex builders or serialization (unless they require it). I can see multiple things we can do: -1. The compressor should support compressing arrays with the JSON extension type into variant columns, initially with a pre-configured policy and potentially with more complex heuristics, as seen in the [JSON Tiles paper](https://db.in.tum.de/~durner/papers/json-tiles-sigmod21.pdf). +1. The compressor should support compressing arrays with the JSON extension type into variant columns, initially with a pre-configured policy and with more complex heuristics in the future, as seen in the [JSON Tiles paper](https://db.in.tum.de/~durner/papers/json-tiles-sigmod21.pdf). 2. Add expression to convert UTF-8 arrays formatted as JSON into variants, and vice versa. This can also include some other parsing and utilities to handle JSON. 
+It's important to note that while I suggest the canonical encoding will be basically opaque with regards to the specific encoding of the child array, we could still compress the children using our hierarchical compressor. + ## Prior Art +Many systems have a `Variant` type or similar concept, and they generally differ from each other both in implementation and meaning. I've tried to summarize some of the common ones, but I suggest reading the linked sources, especially [Clickhouse's](#clickhouse) blogpost about their variant, dynamic and JSON types. + ### Parquet/Arrow The full details can be found in the [encoding](https://github.com/apache/parquet-format/blob/master/VariantEncoding.md) and [shredding](https://github.com/apache/parquet-format/blob/master/VariantShredding.md) specification, but I'll try and capture it here to the best of my understanding. @@ -94,7 +102,7 @@ When loaded into memory, Arrow has defined a [canonical extension type](https:// As described in [this](https://clickhouse.com/blog/a-new-powerful-json-data-type-for-clickhouse#building-block-2---dynamic-type) fantastic blogpost, Clickhouse offers multiple features that build on top of each other to support similar data: -1. [Variant](https://clickhouse.com/docs/sql-reference/data-types/variant) - Allows for arbitrary nesting of types. Variant can contains ints, strings and arrays of ints, strings or another variant type (note the lack of the "object" variant). Each leaf (`col_x.str` vs `col_x.int32`) column is stored separately with some additional metadata which points to which one is used by each row. Types have to be declared in advance. +1. [Variant](https://clickhouse.com/docs/sql-reference/data-types/variant) - Allows for arbitrary nesting of types. Variant can contain integers, strings and arrays of integers, strings or another variant type (note the lack of the "object" variant). 
Each leaf (`col_x.str` vs `col_x.int32`) column is stored separately with some additional metadata which points to which one is used by each row. Types have to be declared in advance. 2. [Dynamic](https://clickhouse.com/docs/sql-reference/data-types/dynamic) - Like variant, but types don't have to be declared in advance. Shreds a limited number of columns 3. [JSON](https://clickhouse.com/docs/sql-reference/data-types/newjson) - Builds on top of `Dynamic`, with a few specialized features - allowing users to specify known "typed paths", how many dynamic paths and types to support for untyped paths, and some JSON-specific configuration allowing skipping specific JSON paths on insert. The full blogpost is worth reading, but Clickhouse's on-disk model roughly mirrors the arrow in-memory format, and they store some metadata outside of the array. @@ -103,7 +111,7 @@ As described in [this](https://clickhouse.com/blog/a-new-powerful-json-data-type - Iceberg seems to support the variant type (as described in [this](https://docs.google.com/document/d/1sq70XDiWJ2DemWyA5dVB80gKzwi0CWoM0LOWM7VJVd8/edit?tab=t.0) proposal), but the docs are minimal. - Datafusion's variant support is being developed [here](https://github.com/datafusion-contrib/datafusion-variant), it's unclear to me how much effort is going into it and whether it's going to be merged upstream. -DuckDB doesn't support a variant type. It does have a [Union](https://docs.google.com/document/d/1sq70XDiWJ2DemWyA5dVB80gKzwi0CWoM0LOWM7VJVd8/edit?tab=t.0) type, but its basically a struct. It also seems to have support for Parquet's shredding, but I can't find any docs and seems like PRs are being merged as I'm looking through their issues. +DuckDB doesn't support a variant type. It does have a [Union](https://duckdb.org/docs/stable/sql/data_types/union) type, but it's basically a struct. 
It also seems to have support for Parquet's shredding, but I can't find any docs and it seems like PRs are still being merged as I look through their issues. - Databricks supports some specialized [variant functions](https://docs.databricks.com/gcp/en/sql/language-manual/sql-ref-functions-builtin#variant-functions). ## Unresolved Questions @@ -113,4 +121,4 @@ As described in [this](https://clickhouse.com/blog/a-new-powerful-json-data-type ## Future Possibilities -What natural extensions or follow-on work does this enable? This is a good place to note related ideas that are out of scope for this RFC but worth capturing. +In the future, we could add a Vortex-native encoding, but at this point in time it seems like 3rd-party integration is a more useful target. From e4449684b574e38fb7fa2953a98af74ea65d77ce Mon Sep 17 00:00:00 2001 From: Adam Gutglick Date: Thu, 26 Feb 2026 13:15:52 +0000 Subject: [PATCH 5/6] more editing Signed-off-by: Adam Gutglick --- proposals/0015-variant-type.md | 32 ++++++++++++++++---------------- 1 file changed, 16 insertions(+), 16 deletions(-) diff --git a/proposals/0015-variant-type.md b/proposals/0015-variant-type.md index 7ebd3f8..57ad801 100644 --- a/proposals/0015-variant-type.md +++ b/proposals/0015-variant-type.md @@ -8,13 +8,13 @@ Vortex currently requires a strict schema, but real world data is often only semi-structured and deeply hierarchical. Logs, traces and user-generated data often take the form of many sparse fields. -This proposal introduces a new type - `Variant`, which can capture data with row-level schema, while storing it in a columnar form that can compress well while being available for efficient analysis in a columnar format. +This proposal introduces a new dtype - `Variant`, which can capture data with row-level schema, while storing it in a columnar form that compresses well and remains available for efficient analysis. 
## Design -We'll start with a rough description of the variant type, as many different systems define in different ways (see the [Prior Art] section at the bottom of the page). +We'll start with a rough description of the variant type, as many different systems define it in different ways (see the [Prior Art](#prior-art) section at the bottom of the page). -The variant can be commonly described as the following rust type: +The variant type can be commonly described as the following rust type: ```rust enum Variant { Value(Scalar), List(Vec), Object(BTreeMap), // Usually sorted to allow efficient key finding } ``` -Different systems have different variations of this idea, but at its core its a type that can hold nested data with either a flexible or no schema. In addition to this "catch all" column, most system include the concept of "shredding", extracting a key with a specific type out of this column, and storing it in a dense way. This design can make commonly accessed subfields perform like first-class columns, while keeping the overall schema flexible. +Different systems have different variations of this idea, but at its core it's a type that can hold nested data with either a flexible or no schema. -I propose a new dtype - `DType::Variant`. The variant type is always nullable, and its canonical encoding is just an array with a single child array, which is encoded in some specialized variant type. +Variant types are usually stored in two ways - values that aren't accessed often in some system-specific binary encoding, and some number of "shredded" columns, where a specific key is extracted from the variant and stored in a dense format with a specific type, allowing for much more performant access. This design can make commonly accessed subfields perform like first-class columns, while keeping the overall schema flexible. Shredding policies differ by system, and can be pre-determined or inferred from the data itself or from usage patterns.
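The two-way storage just described (a typed shredded column plus a binary residual) also determines how reads distinguish a typed hit, a fallback value, and a genuinely missing key. The sketch below is illustrative only; the residual is stubbed as a string where a real encoding would hold the binary variant bytes:

```rust
/// One shredded key. Rows matching the shredded type live in
/// `typed_value`; rows where the key exists with another type fall
/// back to `residual`; rows where both are None simply lack the key.
struct ShreddedField {
    typed_value: Vec<Option<i64>>,
    residual: Vec<Option<String>>, // stand-in for binary variant bytes
}

#[derive(Debug, PartialEq)]
enum FieldValue {
    Typed(i64),
    Fallback(String),
    Missing,
}

fn read(field: &ShreddedField, row: usize) -> FieldValue {
    match (&field.typed_value[row], &field.residual[row]) {
        (Some(v), _) => FieldValue::Typed(*v),
        (None, Some(raw)) => FieldValue::Fallback(raw.clone()),
        (None, None) => FieldValue::Missing,
    }
}
```

The Parquet shredding spec linked in the prior art section handles many more cases (nested objects, arrays, null-vs-missing at each level), but this is the core shape.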
+ +### Arrow representation + +Arrow now has a new [canonical extension type](https://arrow.apache.org/docs/format/CanonicalExtensions.html#parquet-variant) to represent Parquet's variant type. I think supporting this encoding will be a good start, but it requires supporting Arrow extension types. + +Supporting extension types requires replacing the target `DataType` and nullability with a `Field`, which also includes metadata like a desired extension type. I believe this change is desired, as Vortex DTypes include more information than a plain `arrow::DataType`. ### Nullability @@ -45,25 +51,19 @@ I suggest we add a new expression - `get_variant_element(path, dtype)` (name TBD Every variant encoding will need to be able to dispatch these behaviors, returning arrays of the expected type. -### Stats and pushdown - -Statistics will only be collected for shredded children of the variant array. As all variant expressions are typed, this will allow us to not only use the same type of pushdown we have currently have, but for row ranges where specific key might exist but with an unexpected type, we could skip it completely. - -### Arrow representation - -Arrow now has a new [canonical extension type](https://arrow.apache.org/docs/format/CanonicalExtensions.html#parquet-variant) to represent Parquet's variant type. I think supporting this encoding will be a good start, but it requires supporting Arrow extension types. - -Supporting extension types requires replacing the target `DataType` and nullability with a `Field`, which also includes metadata like a desired extension type. I believe this change is desired, as Vortex DTypes include more information than a plain `arrow::DataType`. - ### Scalar While there has been talk for a long time of converting the Vortex scalar system from an enum to length 1 arrays, I do believe the current system actually works very well for variants, and the Variant scalar can just be some version of the type described above. 
Just like when extracting child arrays, Variants need to support an additional expression, `get_variant_scalar(idx, path, dtype)` that will indicate the desired dtype. +### Stats and pushdown + +Statistics will only be collected for shredded children of the variant array. As all variant expressions are typed, this will allow us to not only use the same type of pushdown we currently support, but for row ranges where a specific key might exist but with an unexpected type, we could skip it completely. + ### Path to usefulness -A key component of making variants useable will be making sure the experience of writing and using them, without forcing them to go through complex builders or serialization (unless they require it). +A key component of making variants usable will be making sure the experience of writing and using them is as straightforward as possible, without forcing users to go through complex builders or serialization (unless they require it). I can see multiple things we can do: From fcd2808f96d49a9fbec966353ccd7b91e7b94d35 Mon Sep 17 00:00:00 2001 From: Adam Gutglick Date: Thu, 26 Feb 2026 13:50:38 +0000 Subject: [PATCH 6/6] . Signed-off-by: Adam Gutglick --- proposals/0015-variant-type.md | 4 ++++ 1 file changed, 4 insertions(+) diff --git a/proposals/0015-variant-type.md b/proposals/0015-variant-type.md index 57ad801..7f0a989 100644 --- a/proposals/0015-variant-type.md +++ b/proposals/0015-variant-type.md @@ -122,3 +122,7 @@ As described in [this](https://clickhouse.com/blog/a-new-powerful-json-data-type ## Future Possibilities In the future, we could add a Vortex-native encoding, but at this point in time it seems like 3rd-party integration is a more useful target. + +As mentioned above, I believe starting with a simple shredding policy in the compressor is the best way forward, but exploring things like JSON Tiles could prove to be useful. 
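The stats-and-pushdown idea above (statistics only on shredded children, with typed expressions letting us skip ranges where a key never appears with the expected type) reduces to ordinary zone-map pruning plus one extra counter. A minimal sketch, with names of my own invention:

```rust
/// Per-chunk statistics for one shredded child. `typed_count` is how
/// many rows in the chunk actually matched the shredded type.
struct ZoneStats {
    min: i64,
    max: i64,
    typed_count: usize,
}

/// A chunk can be skipped for the predicate `key > threshold` when no
/// typed value can satisfy it, or when no row in the chunk has the key
/// with the expected type at all.
fn can_skip_gt(stats: &ZoneStats, threshold: i64) -> bool {
    stats.typed_count == 0 || stats.max <= threshold
}
```

The `typed_count == 0` branch is the variant-specific part: a range where the key exists only with an unexpected type contributes no matches to a typed predicate, so it prunes exactly like a range where the key is absent.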
+ +Integration with query engines will be an ongoing effort, depending on what features they support and how expressive they are.