
Variant RFC #15

Open
AdamGS wants to merge 6 commits into develop from adamg/variant

Conversation

Collaborator

@AdamGS AdamGS commented Feb 25, 2026

Still WIP, but starting to extract my local notes into a publicly shareable form.


In addition to a new canonical encoding, we'll need a few more pieces to make variant columns useful:

1. A set of new expressions, which extract children of variant arrays using a combination of a path (similar to `GetExpr`) and a dtype.
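As a rough illustration of what such an expression could look like, here is a minimal sketch. All names here (`VariantGetExpr`, `PathSegment`, `RequestedDType`) are hypothetical placeholders, not the actual Vortex API:

```rust
// Sketch of an expression that extracts a child of a variant array by
// path and requests a specific result dtype. All names are illustrative
// placeholders, not the real Vortex API.

/// One step into a nested variant value: an object key or a list index.
#[derive(Debug, Clone, PartialEq)]
enum PathSegment {
    Field(String),
    Index(usize),
}

/// Stand-in for the requested result type.
#[derive(Debug, Clone, PartialEq)]
enum RequestedDType {
    Utf8,
    I64,
    F64,
}

/// Analogous to `GetExpr`, but paired with a target dtype: rows where
/// the path is missing, or the value cannot be read as `dtype`,
/// would evaluate to null.
#[derive(Debug, Clone)]
struct VariantGetExpr {
    path: Vec<PathSegment>,
    dtype: RequestedDType,
}

impl VariantGetExpr {
    fn new(path: Vec<PathSegment>, dtype: RequestedDType) -> Self {
        Self { path, dtype }
    }

    /// Human-readable form, e.g. `variant_get($.user.id AS I64)`.
    fn display(&self) -> String {
        let mut s = String::from("variant_get($");
        for seg in &self.path {
            match seg {
                PathSegment::Field(f) => {
                    s.push('.');
                    s.push_str(f);
                }
                PathSegment::Index(i) => s.push_str(&format!("[{i}]")),
            }
        }
        s.push_str(&format!(" AS {:?})", self.dtype));
        s
    }
}

fn main() {
    let expr = VariantGetExpr::new(
        vec![
            PathSegment::Field("user".into()),
            PathSegment::Field("id".into()),
        ],
        RequestedDType::I64,
    );
    // Prints: variant_get($.user.id AS I64)
    println!("{}", expr.display());
}
```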
Contributor

Maybe also worth mentioning expressions that convert to/from other variant-like data, e.g. JSON as a `DType::Utf8` can be parsed into a `DType::Variant`.

I wonder if our JSON extension type has storage type `DType::Utf8`, or storage type `DType::Variant`?

Collaborator Author

In my mind the JSON type is “string verified as JSON”, like a PG column.
So far my impression is that there’s no consistent naming, and any choice we make will end up conflicting with something.

Contributor

@gatesn gatesn Feb 26, 2026

So we would just also implement the variant expressions over a JSON extension type array?


Different systems have different variations of this idea, but at its core it's a type that can hold nested data with either a flexible or no schema. In addition to this "catch all" column, most systems include the concept of "shredding": extracting a key with a specific type out of this column and storing it in a dense way. This design can make commonly accessed subfields perform like first-class columns, while keeping the overall schema flexible.

I propose a new dtype - `DType::Variant`. The variant type is always nullable, and its canonical encoding is just an array with a single child array, which is encoded in some specialized variant type.
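To make the "always nullable" property concrete, here is a minimal sketch of what the dtype addition might look like. The enum shape and names below are illustrative only, not the actual Vortex `DType` definition:

```rust
// Illustrative sketch: a simplified dtype enum with a `Variant` member,
// not the real Vortex `DType`.

#[derive(Debug, Clone, PartialEq)]
enum Nullability {
    NonNullable,
    Nullable,
}

#[derive(Debug, Clone, PartialEq)]
enum DType {
    Bool(Nullability),
    Utf8(Nullability),
    // ... other existing dtypes elided ...
    /// `Variant` carries no nullability parameter: per the proposal,
    /// a variant value is always nullable.
    Variant,
}

impl DType {
    fn nullability(&self) -> Nullability {
        match self {
            DType::Bool(n) | DType::Utf8(n) => n.clone(),
            // A variant column always admits nulls.
            DType::Variant => Nullability::Nullable,
        }
    }
}

fn main() {
    assert_eq!(DType::Variant.nullability(), Nullability::Nullable);
    assert_eq!(
        DType::Bool(Nullability::NonNullable).nullability(),
        Nullability::NonNullable
    );
    println!("Variant is always nullable");
}
```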
Contributor

How should we do `execute_arrow` for these? Using the Parquet Variant, or a union?

Collaborator Author

@AdamGS AdamGS Feb 26, 2026

Added some thoughts on this point; it might require some pretty big changes to, or an extension of, our Arrow exporting logic.

Signed-off-by: Adam Gutglick <adam@spiraldb.com>
@AdamGS AdamGS changed the title from WIP: Variant RFC to Variant RFC on Feb 26, 2026
Contributor

gatesn commented Feb 26, 2026

I think this all makes sense, but it should explicitly call out what changes you want to make to the DType enum / Scalar enum / Canonical enum / etc.

Contributor

@connortsui20 connortsui20 left a comment

For something as complicated as the design space of variant, I think it would be worth putting together a few diagrams (could literally just be some text trees) that show the different kinds of variant and shredding designs as well as some concrete examples.

I'm happy to add this myself as well!

Vortex currently requires a strict schema, but real-world data is often only semi-structured and deeply hierarchical. Logs, traces, and user-generated data often take the form of many sparse fields.

This proposal introduces a new dtype - `Variant` - which can capture data with a row-level schema while storing it in a columnar form that compresses well and remains available for efficient analysis.

Contributor

I know this is pedantic, but I think it would be good to have a motivation section here. Do we care about supporting all possible variant types, or just JSON, for example? It might also be good to say that we want this because other formats support it and people like that.


Different systems have different variations of this idea, but at its core it's a type that can hold nested data with either a flexible or no schema.

Variant types are usually stored in two parts: values that aren't accessed often are kept in a system-specific binary encoding, alongside some number of "shredded" columns, where a specific key is extracted from the variant and stored in a dense format with a specific type, allowing for much more performant access. This design can make commonly accessed subfields perform like first-class columns, while keeping the overall schema flexible. Shredding policies differ by system, and can be pre-determined or inferred from the data itself or from usage patterns.
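The shredded layout described above can be sketched as follows. This is a hypothetical illustration under assumed names (`ShreddedColumn`, `typed_value`, `untyped`), not Vortex's actual design; the key `user.id` is assumed shredded as an `i64`:

```rust
// Sketch of shredding: one key is pulled out into a dense, typed
// column; rows whose value doesn't fit that type fall back to an
// opaque, system-specific binary variant encoding. Illustrative only.

/// A shredded variant column for a single extracted key.
struct ShreddedColumn {
    /// Densely stored typed values; `None` where the key is absent
    /// or its value didn't match the shredded type.
    typed_value: Vec<Option<i64>>,
    /// Fallback: the full row in a binary variant encoding, for rows
    /// the typed column can't represent.
    untyped: Vec<Option<Vec<u8>>>,
}

impl ShreddedColumn {
    /// Reading the shredded key is a plain columnar access: the fast
    /// path goes through `typed_value`, and `untyped` is consulted
    /// only on a miss.
    fn get_typed(&self, row: usize) -> Option<i64> {
        self.typed_value.get(row).copied().flatten()
    }
}

fn main() {
    let col = ShreddedColumn {
        // Rows 0 and 2 shredded cleanly; row 1 held a string "id",
        // so it lives only in the binary fallback.
        typed_value: vec![Some(42), None, Some(7)],
        untyped: vec![None, Some(b"{\"id\":\"abc\"}".to_vec()), None],
    };
    assert_eq!(col.get_typed(0), Some(42));
    assert_eq!(col.get_typed(1), None);
    assert!(col.untyped[1].is_some());
}
```

The trade-off shown here is the core of shredding: dense typed storage for the common case, with per-row flexibility preserved by the binary fallback.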
Contributor

I think "accessed" is the wrong word here? You could motivate this by giving an example of JSON data where the majority of values have the same type (string) but occasionally a different type (int) appears.

