# API design golars borrows the shape of polars' API. The surface is polars-style expressions with Go naming conventions. This document records the conventions so the API stays consistent as it grows. ## Naming * Exported identifiers are `PascalCase`. Unexported are `camelCase`. * Acronyms are capitalized consistently: `CSV`, `JSON`, `URL`, `ID`. * Methods prefer verbs: `Select`, `Filter`, `Join`, `Sort`. Polars uses snake\_case verbs; Go uses PascalCase verbs. Direct translation. * Boolean predicates are prefixed `Is`: `IsNull`, `IsNotNull`, `IsUnique`, `IsInList`. * Getters do not use `Get`: `df.Schema()`, `s.DType()`, `s.Len()`. This matches Go convention. ## Expressions Polars in Python and Rust overloads operators. Go does not. We use method chaining: | Polars (Python) | golars | | --------------------------------- | ----------------------------------- | | `pl.col("a") + pl.col("b")` | `expr.Col("a").Add(expr.Col("b"))` | | `pl.col("a") * 2` | `expr.Col("a").MulLit(2)` | | `pl.col("a") > 3` | `expr.Col("a").GtLit(3)` | | `pl.col("a").is_null()` | `expr.Col("a").IsNull()` | | `pl.col("a").alias("b")` | `expr.Col("a").Alias("b")` | | `pl.when(c).then(a).otherwise(b)` | `expr.When(c).Then(a).Otherwise(b)` | | `pl.col("a").sum().over("g")` | `expr.Col("a").Sum().Over("g")` | Arithmetic with literals uses a `Lit` suffix to keep method signatures concrete. Arithmetic between two expressions is the plain verb. This avoids the type-switching overhead of accepting `any`. ## Scalar kernels (`compute.*Lit`) The `compute` package provides imperative kernels that work directly on `*series.Series`. 
The `*Lit` variants compare against, or do arithmetic with, a scalar literal and skip the allocation a broadcast Series would require:

| Method                       | Behaviour           |
| ---------------------------- | ------------------- |
| `compute.GtLit(ctx, s, 5)`   | mask where s > 5    |
| `compute.LtLit(ctx, s, 0)`   | mask where s \< 0   |
| `compute.EqLit(ctx, s, 42)`  | mask where s == 42  |
| `compute.GeLit(ctx, s, 0.5)` | mask where s >= 0.5 |

These accept any numeric Go literal type and coerce to the series dtype. Fast paths exist for int64 and float64; other dtypes fall back to a broadcast Series internally. Use them in hot loops where the expression compiler's overhead would dominate.

## Top-level re-exports

The root `golars` package re-exports the commonly used names so most user code imports a single package:

```go
import "github.com/Gaurav-Gosain/golars"

df, err := golars.ReadCSV(ctx, "data.csv")
out := df.Filter(golars.Col("x").GtLit(0)).
	GroupBy("k").
	Agg(golars.Col("v").Sum().Alias("v_sum"))
```

Deeper packages (`expr`, `lazy/plan`) are still importable for users who need them; `internal/*` packages (such as `internal/hash`) are sealed off by Go's internal-package rule and are not part of the API.

## Errors

* Every IO-bound or parse-bound operation returns `(T, error)`. No panics on user input.
* Pure-data operations that cannot fail given valid inputs return `T`. Invalid inputs (wrong dtype, nonexistent column) produce a descriptive error wrapped with `fmt.Errorf("golars: ...: %w", inner)`.
* We define sentinel errors for the common cases: `ErrColumnNotFound`, `ErrDTypeMismatch`, `ErrShapeMismatch`. User code can `errors.Is` on them.
* Expression build errors are deferred to `Collect()`. `expr.Col("a").Add(expr.Col("b"))` never fails at construction time, even if "a" does not exist in the target frame. The error surfaces when the plan is resolved.

## Context

Every operation that does IO, runs a plan, or might take non-trivial time accepts a `context.Context` as the first argument. Pure-data operations on already-materialized data do not. Examples:

* `golars.ReadCSV(ctx, path)` takes ctx.
* `df.Filter(mask)` does not. Filter is in-process and fast. * `lf.Collect(ctx)` takes ctx. Collect runs the plan. * `lf.GroupBy("k")` does not. GroupBy is a plan builder. The rule is: if a call can run arbitrary user-supplied IO, a plan, or a potentially long-running compute stage, it takes a context. Builder calls do not. ## Options For operations with many optional parameters (ReadCSV, Join, GroupByDynamic), we use functional options: ```go df, err := golars.ReadCSV(ctx, "data.csv", golars.WithDelimiter(','), golars.WithHasHeader(true), golars.WithNullValues([]string{"", "NA"}), ) ``` Options are functions with typed constructors. Option types are scoped to the operation (CSVOption, JoinOption) so the compiler enforces correct combinations. ## IO packages Each supported file format lives in its own package so programs that only need one format don't pull the rest: ``` io/csv // RFC 4180 CSV io/parquet // Apache Parquet via pqarrow io/ipc // Arrow IPC (feather) io/json // JSON array-of-objects, object-of-arrays, and NDJSON io/sql // database/sql bridge (any pure-Go driver) ``` Each package exposes `Read` (from `io.Reader`), `ReadFile`, and where it makes sense `ReadURL` (net/http-backed) and `ReadString`. Writers are symmetric: `Write`, `WriteFile`. The URL loaders accept `WithHTTPClient` so tests can inject a custom transport and production code can wire retry/auth middleware. JSON type inference promotes numeric columns like polars: mixed int/float → float64; mixed anything/string → string. NaN and Inf round-trip. Nulls in input become null bitmap entries. `io/sql` is the pragmatic pure-Go path for databases: plug in any `database/sql`-compatible driver (pgx, modernc.org/sqlite, go-sql-driver/mysql, go-mssqldb) and get a typed DataFrame. `ReadSQL` is eager; `NewReader` streams `WithBatchSize(n)` rows for result sets that exceed memory. Null values are preserved via the arrow validity bitmap. 
Apache ADBC is deliberately not wrapped: its drivers require cgo, which breaks the pure-Go invariant; the nested demo in `examples/sql/` shows the integration pattern with SQLite. ## Scripting (`script/` + `.glr` files) `script.Runner` runs a tiny pipe-style language against any `Executor`. One statement per line, `#` for comments, the leading `.` on each command is optional: ```bash # examples/script/demo.glr load data/trades.csv filter volume > 100 groupby symbol amount:sum:total sort total desc show ``` `cmd/golars` is the reference host: `golars run path.glr` runs a file one-shot and exits, `.source path.glr` runs one inline from the REPL. Third-party programs plug in via `script.ExecutorFunc`. Multi-source: `load PATH as NAME` stages a frame in a registry without promoting it to focus; `use NAME` promotes it, parking the prior focus under its own name for later reuse; `join PATH|NAME on KEY` consumes a staged frame by name before trying it as a path. See the full language reference in [`docs/scripting.md`](scripting.md). A Tree-sitter grammar + highlight queries ship at [`editors/tree-sitter-golars/`](../editors/tree-sitter-golars/) for editor integrations. ## Nullability and zero values Go has no null. Arrow has validity bitmaps. The API presents null the same way polars does: * `s.Get(i)` returns `(value, valid bool)` for primitive dtypes. * `s.GetStr(i)` returns `("", false)` for null. * `s.IsNull()` returns a boolean mask Series. * Aggregations skip nulls by default. `Sum` over `[1, null, 2]` is 3. * Comparison operators produce null when either side is null. This matches SQL and polars. ## Iteration golars does not encourage row-wise iteration. The idiomatic shape of a program is: ```go result := df. Filter(golars.Col("price").GtLit(0)). WithColumns( golars.Col("price").Mul(golars.Col("qty")).Alias("total"), ). GroupBy("region"). 
Agg(golars.Col("total").Sum()) ``` Row iteration is available as `df.Rows()` returning a `RowIter` for cases where it is truly needed (writing to a non-columnar sink, debugging), but it is slow by design and documented as such. ## Versioning We follow semver. Before v1.0.0 the API is unstable by convention (minor version bumps may break). After v1.0.0 we commit to semver strictly. Deprecations are marked with `// Deprecated:` comments and persist for at least one minor version before removal. ## What we do not export * Concrete struct fields on `DataFrame`, `Series`, `LazyFrame`. All access is through methods. * The plan and physical plan node types. They live under `lazy/plan` and `lazy/physical` but the API surface is what you build through the fluent `LazyFrame` API. Direct plan construction is not supported. * Internal hash table and pool types. # golars / polars API surface Last synced: 2026-04-24. Authoritative map between polars' Python API (`pl.*`) and golars' Go surface. Status column values: * `done` shipped and covered by tests * `partial` shipped for a common subset; documented gaps * `todo` not yet The package-level facade at the repo root (`import "github.com/Gaurav-Gosain/golars"`) re-exports the most commonly needed symbols so casual users do not need to know which sub-package each name lives in. 
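The mechanics behind a facade like this are plain Go: type aliases and function-value bindings re-export deeper names under a single import path. A minimal sketch of the pattern, using stdlib types as stand-ins for golars' sub-packages (the names here are illustrative, not the actual re-export list):

```go
package main

import (
	"bytes"
	"fmt"
)

// A type alias is identical to the aliased type, just reachable from a
// new import path — callers never need to name the deep package.
type Buffer = bytes.Buffer

// Functions are re-exported as package-level variables bound to the
// deep package's function values.
var NewBuffer = bytes.NewBufferString

func main() {
	var b Buffer
	b.WriteString("hi")
	fmt.Println(b.String())                   // "hi"
	fmt.Println(NewBuffer("facade").String()) // "facade"
}
```

Because `type X = deep.X` is an alias rather than a new defined type, values flow between the facade and the sub-packages without any conversion.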
## Top-level functions (`pl.*` / `golars.*`) | polars | golars | Status | Notes | | ------------------------------------------------------------------------------------- | --------------------------------------- | ------ | -------------------------------------------------------------------------------------------------------- | | `pl.DataFrame({...})` | `golars.FromMap(...)` | done | slice-of-go map constructor | | `pl.from_arrow(tbl)` | `dataframe.FromArrowTable(tbl)` | done | | | `pl.concat([...])` | `dataframe.Concat(...)` | done | vertical | | `pl.col(name)` | `golars.Col(name)` | done | also typed: `expr.C[T]`, `expr.Int`, `expr.Float`, `expr.Str`, `expr.Bool`, `expr.Int32`, `expr.Float32` | | `pl.lit(v)` | `golars.Lit(v)` | done | plus `LitInt64`/`LitFloat64`/`LitString`/`LitBool`; generic `expr.LitOf[T]` | | `pl.when(p).then(a).otherwise(b)` | `golars.When(p).Then(a).Otherwise(b)` | done | executor uses compute.Where with dtype promotion | | `pl.sum(col)`, `mean`, `min`, `max`, `count`, `first`, `last`, `median`, `std`, `var` | `golars.Sum(col)` and friends | done | col-scoped agg sugar | | `pl.read_csv` | `golars.ReadCSV` | done | | | `pl.read_parquet` | `golars.ReadParquet` | done | | | `pl.read_ipc` / `read_arrow` | `golars.ReadIPC` | done | | | `pl.read_json` / `read_ndjson` | `golars.ReadJSON` / `golars.ReadNDJSON` | done | | | `pl.read_database` | `io/sql.ReadSQL` | done | | | `pl.read_clipboard` | `io/clipboard.Read` | done | CSV transport | | `pl.read_avro` | | todo | defer | | `pl.read_excel` | | todo | needs xuri/excelize dep | | `pl.read_delta` / `read_iceberg` | | todo | RFC required | | `pl.scan_csv` | `io/csv.Scan` | done | LazyFrame source | | `pl.scan_parquet` | `io/parquet.Scan` | done | | | `pl.scan_ipc` | `io/ipc.Scan` | done | | | `pl.scan_ndjson` | `io/json.ScanNDJSON` | done | | ## DataFrame methods | polars | golars | Status | | | ------------------------------------------------------------------------------ | 
------------------------------------------------------------------------- | --------------------- | --------------------------------------------------------------------------- | | `df.shape` | `df.Shape()` | done | | | `df.height` / `.width` | `df.Height()` / `.Width()` | done | | | `df.columns` | `df.ColumnNames()` | done | | | `df.dtypes` | `df.DTypes()` | done | | | `df.schema` | `df.Schema()` | done | | | `df.is_empty` | `df.IsEmpty()` | done | | | `df.estimated_size` | `df.EstimatedSize()` | done | | | `df.equals` | `df.Equals(other)` | done | | | `df.glimpse` | `df.Glimpse(n)` | done | | | `df.head` / `.tail` / `.limit` | `df.Head` / `.Tail` / `.Limit` | done | | | `df.slice` | `df.Slice(offset, length)` | done | | | `df.select(exprs)` | `df.Select(names...)` / `golars.SelectExpr(ctx, df, exprs...)` | done | | | `df.with_columns(exprs)` | `df.WithColumns(series...)` / `golars.WithColumnsExpr(ctx, df, exprs...)` | done | | | `df.with_column` | `df.WithColumn(series)` | done | | | `df.rename` | `df.Rename(old, new)` | done | | | `df.drop` | `df.Drop(names...)` | done | | | `df.filter(mask)` | `df.Filter(ctx, mask)` | done | | | `df.sort(by)` | `df.Sort` / `.SortBy` | done | | | `df.reverse` | `df.Reverse(ctx)` | done | | | `df.sample(n)` | `df.Sample(ctx, n, replacement, seed)` | done | | | `df.shuffle` | `df.Shuffle(ctx, seed)` | done | | | `df.clone` | `df.Clone()` | done | | | `df.clear` | `df.Clear()` | done | | | `df.group_by(...).agg(...)` | `df.GroupBy(...).Agg(ctx, [expr...])` | done | | | `df.join(other, on, how)` | `df.Join(ctx, right, on, how)` | done | | | `df.vstack` / `.hstack` | `df.VStack(other)` / `df.HStack(other)` | done | | | `df.concat` | top-level `dataframe.Concat(...)` | done | | | `df.describe` | `df.Describe(ctx)` | done | | | `df.null_count` | `df.NullCount()` | done | | | `df.row(i)` | `df.Row(i)` | done | | | `df.rows()` | `df.Rows()` | done | | | `df.to_dict` | `df.ToMap()` | done | | | `df.to_arrow` / `to_pandas` | `df.ToArrow()` 
/ `df.ToArrowTable()` | done (n/a for pandas) | | | `df.write_csv` / `write_parquet` / `write_ipc` / `write_json` / `write_ndjson` | `io/*.WriteFile(ctx, path, df, ...)` | done | | | `df.gather(indices)` | `df.Gather(ctx, indices)` | done | | | `df.pivot` | `df.Pivot(ctx, index, on, values, PivotAgg)` | done | | | `df.unpivot` (melt) | `df.Unpivot(ctx, idVars, valueVars)` | done | | | `df.transpose` | `df.Transpose(ctx, headerCol, prefix)` | done (numeric/bool) | | | `df.partition_by` | `df.PartitionBy(ctx, keys...)` | done | | | `df.top_k` / `df.bottom_k` | `df.TopK(ctx, k, col)` / `df.BottomK(ctx, k, col)` | done | | | `df.pipe(fn)` | `df.Pipe(fn)` | done | | | `df.corr` / `df.cov` | `df.Corr(ctx)` / `df.Cov(ctx, ddof)` | done | | | `df.explode` | `df.Explode(ctx, col)` | done | list-typed column; null/empty lists become a single null row | | `df.unnest` | `df.Unnest(ctx, col)` | done | struct-typed column; field names must not collide with existing cols | | `df.upsample` | `df.Upsample(ctx, col, every)` | done | timestamp col must be sorted; intervals: `ns`/`us`/`ms`/`s`/`m`/`h`/`d`/`w` | ## Series methods ### Inspection | polars | golars | Status | | -------------- | ------------------- | ------ | | `s.dtype` | `s.DType()` | done | | `s.name` | `s.Name()` | done | | `s.len` | `s.Len()` | done | | `s.null_count` | `s.NullCount()` | done | | `s.has_nulls` | `s.HasNulls()` | done | | `s.is_empty` | `s.IsEmpty()` | done | | `s.n_chunks` | `s.NumChunks()` | done | | `s.n_unique` | `s.NUnique()` | done | | `s.is_sorted` | `s.IsSorted(order)` | done | ### Math (scalar) | polars | golars | Status | | ----------------------- | ------------------------- | ------ | | `s.abs` | `s.Abs()` | done | | `s.sqrt` | `s.Sqrt()` | done | | `s.exp` | `s.Exp()` | done | | `s.log` | `s.Log()` | done | | `s.log2` / `log10` | `s.Log2` / `.Log10` | done | | `s.sin` / `cos` / `tan` | `s.Sin` / `.Cos` / `.Tan` | done | | `s.round(d)` | `s.Round(d)` | done | | `s.floor` / `ceil` | `s.Floor` 
/ `.Ceil` | done | | `s.sign` | `s.Sign()` | done | | `s.clip(lo, hi)` | `s.Clip(lo, hi)` | done | | `s.pow(exp)` | `s.Pow(exp)` | done | ### Aggregations (scalar-returning) | polars | golars | Status | | --------------- | -------------------- | ------ | | `s.sum` | `s.Sum()` | done | | `s.mean` | `s.Mean()` | done | | `s.min` / `max` | `s.Min()` / `.Max()` | done | | `s.median` | `s.Median()` | done | | `s.std` / `var` | `s.Std()` / `.Var()` | done | | `s.quantile(q)` | `s.Quantile(q)` | done | | `s.any` / `all` | `s.Any()` / `.All()` | done | | `s.product` | `s.Product()` | done | ### Position / argsort | polars | golars | Status | | ------------------------- | --------------------------- | ------ | | `s.arg_min` / `arg_max` | `s.ArgMin()` / `.ArgMax()` | done | | `s.arg_sort` | `s.ArgSort()` | done | | `s.top_k(k)` / `bottom_k` | `s.TopK(k)` / `.BottomK(k)` | done | ### Transform | polars | golars | Status | | ---------------------- | ------------------------------------------- | ------ | | `s.head` / `tail` | `s.Head` / `.Tail` | done | | `s.slice` | `s.Slice` | done | | `s.reverse` | `s.Reverse` | done | | `s.sample` | `s.Sample` | done | | `s.shuffle` | `s.Shuffle` | done | | `s.sort` | `compute.Sort(ctx, s, opts)` | done | | `s.unique` | `s.Unique()` | done | | `s.value_counts` | `s.ValueCounts(sort)` | done | | `s.rename` / `alias` | `s.Rename(name)` | done | | `s.clone` | `s.Clone()` | done | | `s.cast` | `compute.Cast(ctx, s, dt)` | done | | `s.rechunk` / `chunks` | `s.Rechunk()` / `s.Chunk(i)` / `.Chunked()` | done | ### Null handling | polars | golars | Status | | ---------------------------------------- | --------------------------------------------- | ------ | | `s.is_null` / `is_not_null` | `s.IsNull()` / `.IsNotNull()` | done | | `s.is_nan` / `is_finite` / `is_infinite` | `s.IsNaN()` / `.IsFinite()` / `.IsInfinite()` | done | | `s.fill_null(v)` | `s.FillNull(v)` | done | | `s.drop_nulls` | `s.DropNulls()` | done | ### Cumulative | polars | golars | 
Status | | ----------------------- | -------------------------- | ------ | | `s.cum_sum` | `s.CumSum()` | done | | `s.cum_min` / `cum_max` | `s.CumMin()` / `.CumMax()` | done | | `s.cum_prod` | `s.CumProd()` | done | | `s.cum_count` | `s.CumCount()` | done | | `s.diff(periods)` | `s.Diff(periods)` | done | | `s.shift(periods)` | `s.Shift(periods)` | done | | `s.pct_change(periods)` | `s.PctChange(periods)` | done | | `s.mode` | `s.Mode()` | done | ### Rolling / windowed | polars | golars | Status | | | ------------------------------------------------------------------------------------------------ | -------------------------------------------------------- | ------ | ------------------------------------------------ | | `s.rolling_sum` / `rolling_mean` / `rolling_min` / `rolling_max` / `rolling_std` / `rolling_var` | matching methods with `RollingOptions` | done | | | `s.ewm_mean` / `s.ewm_var` / `s.ewm_std` | `s.EWMMean(alpha)` / `.EWMVar(alpha)` / `.EWMStd(alpha)` | done | adjusted form; integer inputs promote to float64 | ### String namespace (`s.str.*`) | polars | golars | Status | | -------------------------------------------------------------- | ------------------------------------------------------------------------------------------------------------- | ------ | | `str.len_bytes` / `len_chars` | `.Str().LenBytes()` / `.LenChars()` | done | | `str.to_uppercase` / `to_lowercase` | `.Str().Upper()` / `.Lower()` | done | | `str.title` | `.Str().Title()` | done | | `str.contains` | `.Str().Contains(sub)` | done | | `str.starts_with` / `ends_with` | `.Str().StartsWith()` / `.EndsWith()` | done | | `str.replace` / `replace_all` | `.Str().Replace()` / `.ReplaceAll()` | done | | `str.strip_chars` | `.Str().Trim()` / `.LStrip()` / `.RStrip()` | done | | `str.strip_prefix` / `strip_suffix` | `.Str().StripPrefix()` / `.StripSuffix()` | done | | `str.pad_start` / `pad_end` / `zfill` | `.Str().PadStart()` / `.PadEnd()` / `.ZFill()` | done | | `str.reverse` | 
`.Str().Reverse()` | done | | `str.slice` | `.Str().Slice(start, len)` | done | | `str.count_matches` | `.Str().CountMatches()` | done | | `str.concat` | `.Str().Concat()` / `.Str().Prefix()` | done | | `str.extract` | `.Str().Extract(pattern, group)` | done | | `str.contains_regex` / `count_matches_regex` / `replace_regex` | matching `.Str().*Regex` | done | | `str.split_exact` | `.Str().SplitExact(sep)` (List\) / `.Str().SplitN(sep, idx)` / `.Str().SplitExactNullShort(sep, idx)` | done | ### Arrow interop | polars | golars | Status | | -------------------- | -------------------------------------------- | ------ | | `s.to_arrow` | `s.ToArrow()` / `s.ToArrowChunked()` | done | | `pl.from_arrow(arr)` | `series.FromArrowArray` / `FromArrowChunked` | done | | `df.to_arrow` | `df.ToArrow()` / `df.ToArrowTable()` | done | | `pl.from_arrow(tbl)` | `dataframe.FromArrowTable(tbl)` | done | ## Expr methods | polars | golars | Status | | -------------------------------------------------------- | --------------------------------------- | --------------------------------------------------------- | | `pl.col(x)` with `+ - * /` | `Col(x).Add/Sub/Mul/Div(...)` | done | | `== != < <= > >=` | `Eq/Ne/Lt/Le/Gt/Ge` | done | | `and / or / not` | `And / Or / Not` | done | | `.alias(name)` | `.Alias(name)` | done | | `.cast(dt)` | `.Cast(dt)` | done | | `.is_null` / `.is_not_null` | `.IsNull()` / `.IsNotNull()` | done | | `.sum .mean .min .max .count .first .last .null_count` | matching methods | done | | `.median .std .var .any .all .product .quantile` | matching methods | done | | `.abs .sqrt .exp .log .log2 .log10 .sin .cos .tan .sign` | matching methods | done | | `.round(d) .floor .ceil .clip(lo, hi) .pow(x)` | matching methods | done | | `.fill_null(v)` | `.FillNull(v)` / `.FillNullExpr(e)` | done | | `.reverse .head(n) .tail(n) .slice(off, len) .shift(p)` | matching methods | done | | `.between(lo, hi)` | `.Between(lo, hi)` | done | | `.is_in([...])` | `.IsIn(v...)` | done | | 
`.over(partition)` | `.Over(keys...)` | done (scalar-agg fast path + generic gather-eval-scatter) | | `.sort` | `.Sort(desc)` (fluent via FunctionNode) | done | | `.rank(method)` | `.Rank(method)` | done | | `.rolling_sum` / `.rolling_mean` / ... | matching `.Rolling*(size, minPeriods)` | done | | `.ewm_mean` / `.ewm_var` / `.ewm_std` | matching `.EWM*(alpha)` | done | ## LazyFrame methods | polars | golars | Status | | | ----------------------------------------------------- | ------------------------------------------------------ | ------ | ---------------------------------------------------------------------------- | | `lf.filter` / `.select` / `.with_columns` | `lf.Filter / .Select / .WithColumns` | done | | | `lf.with_column` | `lf.WithColumn(expr)` | done | | | `lf.sort / .group_by / .agg / .join` | matching | done | | | `lf.slice / .head / .limit / .tail` | matching | done | | | `lf.reverse` | `lf.Reverse()` | done | | | `lf.unique` | `lf.Unique()` | done | | | `lf.drop / .rename` | matching | done | | | `lf.collect` / `.collect_unoptimized` | matching | done | | | `lf.explain / .show_graph` | `lf.Explain()` / `lf.ExplainTree()` / `lf.ShowGraph()` | done | `ShowGraph` emits Mermaid; the REPL `.graph` command adds lipgloss colouring | | `lf.sink_csv / sink_parquet / sink_ipc / sink_ndjson` | `lf.Sink(ctx, writer)` plus `io/*.WriteFile` in writer | done | | ## dtype | polars | golars | Status | | ----------------------------------------------------------------------- | ----------------------------------------------------------------- | ---------------------- | | Bool, Int8, Int16, Int32, Int64, UInt8 through UInt64, Float32, Float64 | `dtype.Bool()`, `dtype.Int64()` and friends | done | | String / Utf8 | `dtype.String()` | done | | Binary | `dtype.Binary()` | done | | Null | `dtype.Null()` | done | | Date / Datetime(unit, tz) / Duration / Time | `dtype.Date / Datetime / Duration / Time` + `series.FromTime` | done | | List(inner) / Array / Struct / Field | 
`dtype.List / FixedList / Struct` + `series.{ListOps, StructOps}` | done | | Categorical / Enum | | todo | | Decimal / Float16 / Int128 | | todo | | Object / Unknown | | todo (polars-internal) | See [roadmap.md](roadmap.md) for the perf side of the same picture (throughput ratios vs polars 1.39 on the bench suite). # Architecture overview golars is a layered query engine over Apache Arrow memory. The user writes code against the `DataFrame` (eager) or `LazyFrame` (lazy) facades. Lazy code flows through an expression AST, a logical plan, an optimizer, and a physical plan before reaching the executor. Both eager and lazy paths converge on the same compute kernels and Series primitives. ## Layered component map ## Boundary between golars and arrow-go golars uses `apache/arrow-go/v18` as its memory and IO substrate. The line is: **arrow-go owns:** * Array implementations (`arrow.Array`, all typed arrays) * Memory allocation, buffers, reference counts (`memory.Allocator`, `memory.Buffer`) * Arrow IPC reader and writer * Parquet reader and writer * CSV reader (we wrap it) * Schema primitives at the physical level (`arrow.Schema`, `arrow.Field`) **golars owns:** * A logical dtype model on top of arrow dtypes, carrying polars semantics (for example logical dates, categoricals, enums, and the `Null` dtype) * `Series` as a named, chunked, nullable column with dtype-aware methods * `DataFrame` composition and transformation operations * Expression AST, logical and physical plan, optimizer * Streaming executor * Group-by, join, sort, and pivot algorithms * SQL frontend When a polars feature exists in arrow-go with the right semantics, we wrap rather than reimplement. When polars' semantics differ from arrow's (null handling edge cases, dtype promotion, string comparisons), we implement in golars and document the choice. 
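The validity-bitmap representation golars inherits from Arrow can be sketched in a few lines of plain Go. This is a toy model — not the real `series.Series` — but it shows why aggregations can skip nulls without a sentinel value in the data buffer:

```go
package main

import "fmt"

// nullableInt64 is a toy nullable column: one data slot per row plus a
// validity bitmap with one bit per row (1 = valid, 0 = null).
type nullableInt64 struct {
	values   []int64
	validity []byte // len = ceil(len(values)/8)
}

func (c nullableInt64) isValid(i int) bool {
	return c.validity[i/8]&(1<<(i%8)) != 0
}

// sum skips null slots, matching the documented semantics:
// Sum over [1, null, 2] is 3.
func (c nullableInt64) sum() int64 {
	var s int64
	for i, v := range c.values {
		if c.isValid(i) {
			s += v
		}
	}
	return s
}

func main() {
	// Logical column [1, null, 2]: the slot behind the null holds an
	// arbitrary value (0 here) and is masked out by bit pattern 0b101.
	c := nullableInt64{values: []int64{1, 0, 2}, validity: []byte{0b101}}
	fmt.Println(c.sum()) // 3
}
```

The real implementation stores the bitmap in Arrow buffers and works chunk-by-chunk, but the valid-bit check is the same idea.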
## Data flow in an eager pipeline ## Data flow in a lazy pipeline The streaming executor reads morsels (record batches of bounded row count) from sources, pipes them through operator goroutines over buffered channels, and terminates at a sink. Each operator stage scales horizontally with `GOMAXPROCS`. ## Conformance strategy We use py-polars as the behavioral oracle during development. The `internal/testutil` package holds helpers that: * Generate fixture DataFrames from JSON or parquet files under `testdata/`. * Compare golars output to a golden file produced by a py-polars script committed alongside the fixture. * Fail with a human-readable diff on drift. This keeps us honest about semantics without pulling Python into CI. Python is only needed when regenerating golden files. # Cookbook End-to-end recipes for common tasks. Every snippet compiles and assumes `import "github.com/Gaurav-Gosain/golars"` plus whatever sub-package a particular line needs. ## Typed columns for compile-time literal checks The `expr` package ships a typed facade (`expr.C[T]`, `expr.Int`, `expr.Float`, `expr.Str`, `expr.Bool`) that lets Go infer literal types from method arguments, eliminating the `expr.Lit(int64(...))` boilerplate: ```go import "github.com/Gaurav-Gosain/golars/expr" qty := expr.Int("qty") price := expr.Float("price") out, _ := lazy.FromDataFrame(df). Filter(expr.All(qty.Gt(2), price.Lt(50))). WithColumns( price.MulCol(qty.CastFloat64()).As("total").Expr, qty.Between(2, 5).Alias("in_range"), ). Collect(ctx) ``` The runtime plan is identical to the untyped `expr.Col("qty"). GtLit(int64(2))` form. Passing a string literal to an int-typed column fails at build time rather than panicking at evaluation. See `examples/*/generic/` in the repository for side-by-side comparisons. 
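The compile-time check works because the typed column carries its element type as a type parameter, so Go infers literal types from method arguments. A toy sketch of the mechanism — the names are illustrative, not golars' real `expr.C[T]` implementation:

```go
package main

import "fmt"

// TypedCol is a toy typed column: T fixes the element type at the
// call site, so literal arguments are checked by the compiler.
type TypedCol[T int64 | float64 | string] struct{ name string }

func C[T int64 | float64 | string](name string) TypedCol[T] {
	return TypedCol[T]{name: name}
}

// Gt accepts only values of the column's element type; an untyped
// constant like 2 is inferred as T without an int64(...) conversion.
func (c TypedCol[T]) Gt(v T) string {
	return fmt.Sprintf("(col(%q) > %v)", c.name, v)
}

func main() {
	qty := C[int64]("qty")
	fmt.Println(qty.Gt(2))
	// qty.Gt("x") // does not compile: string is not int64
}
```

A wrong literal type is a build error at the `Gt` call, which is the "fails at build time rather than panicking at evaluation" property described above.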
## List and struct namespaces

Expression-level helpers mirror polars' `.list.*` and `.struct.*`:

```go
import "github.com/Gaurav-Gosain/golars/expr"

lazy.FromDataFrame(df).Select(
	expr.Col("tags").List().Len().Alias("tag_count"),
	expr.Col("payload").Struct().Field("x").Alias("x"),
	expr.Col("csv").Str().SplitExact(",").List().Get(0).Alias("first"),
).Collect(ctx)
```

Supported list reducers: `Len`, `Sum`, `Mean`, `Min`, `Max`, `First`, `Last`, `Get(idx)`, `Contains(needle)`, `Join(sep)` (string lists). Supported struct ops: `Field(name)`.

## Unnest / explode / upsample

Unnest a struct column:

```go
out, _ := df.Unnest(ctx, "payload")
// struct {x:i64, y:str} becomes two top-level cols `x` and `y`.
```

Explode a list column (null and empty lists become a single null row):

```go
out, _ := df.Explode(ctx, "tags")
// [[a, b, c], [], NULL, [d]] produces 3 + 1 + 1 + 1 = 6 rows.
```

Upsample a sorted timestamp column to a dense grid:

```go
out, _ := df.Upsample(ctx, "ts", "1d")
// gaps wider than 1d are filled with new rows whose non-timestamp
// columns are null.
```

Accepted intervals: `ns`, `us`, `ms`, `s`, `m`, `h`, `d`, `w`. Calendar units (`mo`, `y`) are rejected.

## Pretty-print a logical plan

```go
fmt.Println(lazy.ExplainTree(plan.Plan()))
// SORT [total desc]
// └── AGG keys=[dept] aggs=[...]
//     └── FILTER (col("salary") > 75)
//         └── SCAN df
```

`lazy.ExplainTreeASCII` swaps the box-drawing glyphs for ASCII fallbacks. `lf.ExplainTree()` is the full three-section report (logical, optimiser, optimised) rendered as a tree.

## Read a CSV, filter, write Parquet

```go
df, _ := golars.ReadCSV(ctx, "trades.csv")
defer df.Release()

out, _ := golars.Lazy(df).
	Filter(golars.Col("volume").GtLit(int64(100))).
	Collect(ctx)
defer out.Release()

golars.WriteParquet(out, "heavy_trades.parquet")
```

## Group + aggregate multiple columns in one pass

```go
agg, _ := golars.Lazy(df).
	GroupBy("symbol").
	Agg(
		golars.Sum("qty"),
		golars.Mean("price").Alias("avg_price"),
		golars.Max("price").Alias("hi"),
	).
	Sort("qty_sum", true).
	Collect(ctx)
```

golars fuses aggregations that share the same group keys into a single hash pass through `groupby_fused.go`, so the three aggregations above cost one scan.

## Join a CSV against a Parquet lookup table

```go
trades, _ := golars.ReadCSV(ctx, "trades.csv")
defer trades.Release()
lookup, _ := golars.ReadParquet(ctx, "symbols.parquet")
defer lookup.Release()

out, _ := golars.Lazy(trades).
	Join(golars.Lazy(lookup), []string{"symbol"}, golars.InnerJoin).
	Collect(ctx)
```

## Scan (lazy I/O) plus predicate pushdown

```go
import iocsv "github.com/Gaurav-Gosain/golars/io/csv"

// iocsv.Scan returns a LazyFrame that opens the file only when
// Collect runs. Combined with Filter + Select, the optimiser pushes
// the projection down through the scan.
lf := iocsv.Scan("/tmp/huge.csv").
	Filter(golars.Col("region").EqLit("us")).
	Select(golars.Col("symbol"), golars.Col("price"))

for batch, err := range lf.IterBatches(ctx) {
	if err != nil {
		log.Fatal(err)
	}
	// stream-process each batch here, then release it; a defer inside
	// the loop would pin every batch until the function returns.
	batch.Release()
}
```

## Null handling: drop, fill, or flag

```go
clean, _ := golars.Lazy(df).DropNulls("price", "qty").Collect(ctx)
filled, _ := golars.Lazy(df).FillNull(int64(0)).Collect(ctx)

mask, _ := df.AnyNullMask(ctx)
defer mask.Release()
// `mask` is a boolean Series you can plug back into Filter to flag
// bad rows without dropping them.
```

## Select by dtype or name predicate

```go
import "github.com/Gaurav-Gosain/golars/selector"

numericOnly, _ := df.SelectBy(selector.Numeric())
noTimes := df.DropBy(selector.EndsWith("_ts"))

// Combinators: intersect, union, minus.
usdCols, _ := df.SelectBy(selector.Intersect(
	selector.Float(),
	selector.StartsWith("price_usd"),
))
```

## Cross-language with Arrow

```go
rec := df.ToArrow()      // arrow.RecordBatch
tbl := df.ToArrowTable() // arrow.Table (multi-chunk)
roundtrip, _ := dataframe.FromArrowTable(tbl)
```

Both sides are Arrow IPC format-compatible.
Write with `io/ipc.Write` and read in PyArrow, pola.rs, DuckDB, or any other Arrow-aware tool without format conversion. ## String munging ```go out, _ := df.Apply(func(s *series.Series) (*series.Series, error) { if s.Name() != "email" { return s.Clone(), nil } return s.Str().Before("@") }) ``` `.Str().Before` / `.After` / `.SplitNth` cover the common parsing cases; `.SplitWide` returns multiple Series so you can stitch them into a DataFrame with extra columns. ## Cache an intermediate pipeline ```go base := golars.Lazy(df). Filter(golars.Col("active").EqLit(true)). Cache() // Two downstream pipelines share the same filtered base. top, _ := base.Sort("score", true).Head(10).Collect(ctx) flag, _ := base.Filter(golars.Col("score").LtLit(0.5)).Collect(ctx) ``` Cache memoises the first Collect result; subsequent collects reuse it. The cached frame is released automatically when the cache's LazyFrame handle is garbage-collected. ## When / then / otherwise ```go out, _ := golars.Lazy(df). Select(golars.When(golars.Col("age").Gt(golars.Lit(18))). Then(golars.Lit("adult")). Otherwise(golars.Lit("minor")). Alias("category")). Collect(ctx) ``` Mixed numeric dtypes are promoted (int then + float otherwise -> float64 out). Null cond values are treated as false (polars semantics). ## Rolling operations ```go // Rolling sum/mean/min/max/std/var with a fixed window. out, _ := golars.Lazy(df). Select( golars.Col("price").RollingMean(30, 1).Alias("ma30"), golars.Col("price").RollingStd(30, 5).Alias("vol30"), ).Collect(ctx) ``` Second argument is `min_periods` (0 = require full window). Int64 inputs with no nulls take a SIMD-friendly O(n) slide (two-phase warmup + 4-way unrolled step). ## Regex on strings ```go // Boolean mask for regex hits. mask, _ := series.FromString("s", []string{"a1", "xx", "b22"}, nil). Str().ContainsRegex(`\d+`) // Extract first capture group. ids, _ := emails.Str().Extract(`@([a-z.]+)$`, 1) // Count matches per row. 
counts, _ := tokens.Str().CountMatchesRegex(`\w+`) ``` ## Pivot (long -> wide) ```go // Mirror of polars' df.pivot(index="id", on="cat", values="v"). wide, _ := df.Pivot(ctx, []string{"id"}, "cat", "v", dataframe.PivotSum) ``` Aggregators: `PivotFirst`, `PivotSum`, `PivotMean`, `PivotMin`, `PivotMax`, `PivotCount`. ## Window functions with `.Over(...)` ```go // Per-group total broadcast back to every row. out, _ := golars.Lazy(df). Select(golars.Col("revenue").Sum().Over("region").Alias("region_total")). Collect(ctx) // Per-group rank. ranked, _ := golars.Lazy(df). Select(golars.Col("score").Rank("dense").Over("cohort").Alias("rank_in_cohort")). Collect(ctx) ``` ## Forward / backward fill + NaN ```go // Replace every NaN with 0 in float columns (integer cols pass through). filled, _ := golars.Lazy(df).FillNan(0).Collect(ctx) // Carry the last non-null value forward through consecutive nulls. // Pass limit=3 to stop after three consecutive fills; limit=0 means unlimited. ff, _ := golars.Lazy(df).ForwardFill(0).Collect(ctx) bf, _ := golars.Lazy(df).BackwardFill(0).Collect(ctx) // Per-column variant via Expr: out, _ := golars.Lazy(df). WithColumns(golars.Col("price").ForwardFill(0).Alias("price")). Collect(ctx) ``` ## Reshape: transpose, unpivot, partition ```go // Transpose: each input column becomes a row; output value type is // the promoted float64 (polars' object-dtype fallback is not yet // supported, so non-numeric inputs error). wide, _ := df.Transpose(ctx, "column", "row") // Unpivot (melt): turn value columns into two long-form columns: // variable (the original column name) + value. long, _ := df.Unpivot(ctx, []string{"id"}, nil /* default: all non-id */) // Partition: one DataFrame per distinct key tuple; ordered by first // appearance. Caller releases each partition. 
parts, _ := df.PartitionBy(ctx, "region", "symbol") for _, p := range parts { defer p.Release() } ``` ## Top-K / Bottom-K / Pipe ```go topSellers, _ := df.TopK(ctx, 10, "revenue") worstLatency, _ := df.BottomK(ctx, 5, "p99_ms") // Pipe keeps chained code flat: out, _ := df.Pipe(func(d *DataFrame) (*DataFrame, error) { return d.Filter(ctx, mask) }) ``` ## Stats: skew, kurtosis, corr, cov, approx\_n\_unique ```go sk, _ := col.Skew() // polars default (biased) sku, _ := col.SkewUnbiased() // scipy bias=False kk, _ := col.Kurtosis() // excess kurtosis c, _ := a.PearsonCorr(b) // Pearson r cov, _ := a.Covariance(b, 1) // ddof=1 corrMat, _ := df.Corr(ctx) // k-by-k frame covMat, _ := df.Cov(ctx, 1) approx, _ := col.ApproxNUnique() // HLL estimate ``` ## Extra math helpers ```go // Trig + hyperbolic family: atan2, cbrt, sinh/cosh/tanh, log1p, expm1, // radians/degrees, arccos/arcsin/arctan, cot, arcsinh/arccosh/arctanh. radians, _ := col.Radians() y, _ := colY.Arctan2(colX) ``` ## Coalesce, concat\_str, ones/zeros/int\_range ```go picked, _ := golars.Lazy(df). Select(golars.Coalesce(golars.Col("primary"), golars.Col("fallback"))). Collect(ctx) joined, _ := golars.Lazy(df). Select(golars.ConcatStr("-", golars.Col("sym"), golars.Col("year"))). Collect(ctx) // Build Series out of thin air: ids, _ := golars.Lazy(df). WithColumns(golars.IntRange(0, 100, 1).Alias("idx")). Collect(ctx) ``` ## Arrow IPC streaming (cross-language) ```go sw, _ := golars.NewIPCStreamWriter(conn, firstBatch) for batch := range batches { sw.Write(ctx, batch) } sw.Close() // Consumer side (also polars/pyarrow/DuckDB-compatible): sr, _ := golars.NewIPCStreamReader(conn) defer sr.Close() for batch, err := range sr.Iter(ctx) { if err != nil { log.Fatal(err) } process(batch) batch.Release() } ``` ## Row-wise (horizontal) reductions ```go // Append a column that sums three others on a per-row basis. withTotal, _ := golars.Lazy(df). SumHorizontal("total", "q1", "q2", "q3"). 
Collect(ctx) defer withTotal.Release() // Or compute the reduction directly as a standalone Series. total, _ := golars.SumHorizontal(ctx, df, "q1", "q2", "q3") defer total.Release() ``` Variants: `SumHorizontal`, `MeanHorizontal`, `MinHorizontal`, `MaxHorizontal`, `AllHorizontal`, `AnyHorizontal`. Omit the column list to span every numeric (or boolean) column. Null handling defaults to `IgnoreNulls`; pass `dataframe.PropagateNulls` on the frame-level method for polars' strict semantics. ## One-row frame-level aggregates ```go sums, _ := df.SumAll(ctx) // one row, one column per numeric input means, _ := df.MeanAll(ctx) counts, _ := df.CountAll(ctx) // counts non-nulls for every column nulls, _ := df.NullCountAll(ctx) ``` These mirror polars' `df.sum()`, `df.mean()`, `df.count()`. They are convenient for dashboards, describe-style summaries, and streamed ETL checkpoints. ## Lazy scans (pushdown-friendly I/O) ```go lf := golars.ScanCSV("huge.csv"). Filter(golars.Col("region").EqLit("us")). Select(golars.Col("symbol"), golars.Col("price")) out, _ := lf.Collect(ctx) ``` Every format has a scan entry point: `ScanCSV`, `ScanParquet`, `ScanIPC`, `ScanJSON`, `ScanNDJSON`. Compared to `Read*`, a scan defers opening the file until `Collect` so the optimiser can push projections + filters into the reader. ## Rank and percent-change ```go r, _ := golars.Lazy(df). Select( golars.Col("score").Rank("dense").Alias("rank"), golars.Col("score").PctChange(1).Alias("delta"), ). 
Collect(ctx)
```

## Apply a custom Go function

```go
out, _ := df.Apply(func(s *series.Series) (*series.Series, error) {
    switch s.DType().String() {
    case "i64":
        return s.ApplyInt64(func(v int64) int64 { return v * 2 })
    case "str":
        return s.Str().Upper()
    }
    return s.Clone(), nil
})
```

## REPL / scripting quickies

Inside the `golars` REPL or in a `.glr` file:

```
load data/trades.csv
filter volume > 100
with_row_index row
cast price f64
fill_null 0
rename volume as qty
sum qty
write out.parquet
```

Scalar-only prints (`.sum COL`, `.mean COL`, `.min COL`, etc.) write one-line results instead of a table, convenient for quick spot checks.

## Per-language equivalents

| polars (Python) | golars (Go) |
| ----------------------------------- | ---------------------------------------------------------------- |
| `pl.read_csv(p)` | `golars.ReadCSV(ctx, p)` |
| `df.filter(pl.col("x") > 5)` | `df.Filter(ctx, mask)` or `lazy` with `golars.Col("x").GtLit(5)` |
| `df.group_by("k").agg(pl.sum("v"))` | `df.GroupBy("k").Agg(ctx, []expr.Expr{expr.Col("v").Sum()})` |
| `df.unique()` | `df.Unique(ctx)` |
| `df.sample(n=10)` | `df.Sample(ctx, 10, false, seed)` |
| `df.with_row_index()` | `df.WithRowIndex("index", 0)` |
| `pl.col("s").str.to_uppercase()` | `s.Str().Upper()` |
| `df.select(cs.numeric())` | `df.SelectBy(selector.Numeric())` |
| `pl.sum_horizontal("a", "b")` | `golars.SumHorizontal(ctx, df, "a", "b")` |
| `df.fill_nan(0)` | `lf.FillNan(0)` |
| `df.fill_null(strategy="forward")` | `lf.ForwardFill(0)` |
| `df.fill_null(strategy="backward")` | `lf.BackwardFill(0)` |
| `df.top_k(10, by="x")` | `df.TopK(ctx, 10, "x")` |
| `df.transpose()` | `df.Transpose(ctx, "column", "row")` |
| `df.unpivot(index=["id"])` | `df.Unpivot(ctx, []string{"id"}, nil)` |
| `df.partition_by("k")` | `df.PartitionBy(ctx, "k")` |
| `df.corr()` | `df.Corr(ctx)` |
| `pl.coalesce(...)` | `golars.Coalesce(...)` |
| `pl.concat_str(..., sep)` | `golars.ConcatStr(sep, ...)` |
| `pl.int_range(0, n)` | `golars.IntRange(0, int64(n), 1)` |
| `pl.when(p).then(a).otherwise(b)` | `golars.When(p).Then(a).Otherwise(b)` |
| `pl.col("x").rolling_sum(w)` | `golars.Col("x").RollingSum(w, 0)` |
| `pl.col("x").sum().over("k")` | `golars.Col("x").Sum().Over("k")` |
| `s.str.extract(pattern, i)` | `s.Str().Extract(pattern, i)` |
| `s.str.contains_regex(p)` | `s.Str().ContainsRegex(p)` |
| `df.pivot(index=, on=, values=)` | `df.Pivot(ctx, index, on, values, agg)` |
| `df.sum()` | `df.SumAll(ctx)` |
| `pl.scan_csv(p)` | `golars.ScanCSV(p)` |
| `pl.scan_parquet(p)` | `golars.ScanParquet(p)` |

See `docs/api-surface.md` for the full cross-reference.

# Getting Started

## Install as a library

```sh
go get github.com/Gaurav-Gosain/golars@latest
```

## Install the CLI

```sh
go install github.com/Gaurav-Gosain/golars/cmd/golars@latest
```

Type `golars help` to see every subcommand:

```
golars                      start interactive REPL
golars run SCRIPT           execute a .glr script
golars fmt [-w] FILE        canonicalize a .glr script
golars lint FILE            report common .glr mistakes
golars schema FILE          print column names + dtypes
golars stats FILE           print describe() stats
golars head FILE [N]        print first N rows (default 10)
golars diff A B             show row-level diff between two files
golars sql QUERY [FILE...]  run a SQL query against files
golars browse FILE          interactive TUI table viewer
golars explain SCRIPT       print the lazy plan
```

## Your first query

Programmatically:

```go
package main

import (
    "context"
    "fmt"
    "log"

    "github.com/Gaurav-Gosain/golars/dataframe"
    "github.com/Gaurav-Gosain/golars/expr"
    "github.com/Gaurav-Gosain/golars/lazy"
    "github.com/Gaurav-Gosain/golars/series"
)

func main() {
    ctx := context.Background()

    dept, _ := series.FromString("dept", []string{"eng", "eng", "sales", "ops"}, nil)
    salary, _ := series.FromInt64("salary", []int64{100, 120, 80, 70}, nil)
    df, _ := dataframe.New(dept, salary)
    defer df.Release()

    plan := lazy.FromDataFrame(df).
        Filter(expr.Col("salary").Gt(expr.Lit(int64(75)))).
        GroupBy("dept").
Agg(expr.Col("salary").Sum().Alias("total")). Sort("total", true) out, err := plan.Collect(ctx) if err != nil { log.Fatal(err) } defer out.Release() fmt.Println(out) } ``` From the shell: ```sh # Read a CSV, describe it golars stats trades.csv # Run SQL against it golars sql "SELECT symbol, SUM(volume) AS vol FROM trades GROUP BY symbol ORDER BY vol DESC LIMIT 5" trades.csv # Interactively browse it golars browse trades.csv ``` ## The REPL Run `golars` with no arguments to open the interactive REPL: ``` golars » load trades.csv ok loaded trades.csv (1,234,567 × 6) golars » filter volume > 100 ok added FILTER to pipeline: col("volume") > 100 golars » groupby symbol amount:sum:vol ok added GROUP BY [symbol] with 1 aggs golars » sort vol desc ok added SORT vol desc to pipeline golars » head 10 ``` The REPL ships with inline ghost-text completions, command history, and tab completion for paths and column names. ## Next # Introduction **golars** is a pure-Go DataFrame library modeled on [polars](https://github.com/pola-rs/polars) and built directly on arrow-go. No cgo. Single `go build` cross-compiles. ```go import ( "context" "fmt" "github.com/Gaurav-Gosain/golars/compute" "github.com/Gaurav-Gosain/golars/dataframe" "github.com/Gaurav-Gosain/golars/series" ) ctx := context.Background() names, _ := series.FromString("name", []string{"ada", "brian", "carl"}, nil) ages, _ := series.FromInt64("age", []int64{27, 34, 19}, nil) df, _ := dataframe.New(names, ages) defer df.Release() mask, _ := compute.GtLit(ctx, ages, int64(20)) adults, _ := df.Filter(ctx, mask) defer adults.Release() fmt.Println(adults) ``` ## Highlights * **Eager + lazy execution.** Build pipelines as logical plans, let the optimizer fuse projections/filters, then `Collect(ctx)`. * **Streaming engine.** Morsel-driven execution for datasets that don't fit in memory. * **Polars-grade performance.** Matches or beats polars 1.39 on most polars-compare workloads. 
* **I/O included.** CSV, Parquet, IPC, JSON, NDJSON readers/writers; `io/sql` bridge for any `database/sql` driver. * **Scripting + REPL.** `.glr` scripts run via `golars run my.glr` or inside the interactive REPL with inline ghost-text completions. * **LLM-native.** MCP server exposes golars tools to Claude Desktop, Cursor, Windsurf, and other MCP hosts. ## Install ```sh go get github.com/Gaurav-Gosain/golars@latest ``` The CLI ships separately: ```sh go install github.com/Gaurav-Gosain/golars/cmd/golars@latest ``` ## Where next # MCP: golars as a tool for your LLM host `golars-mcp` is a Model Context Protocol server that exposes a read-only subset of golars as tools an LLM host can invoke. Works with Claude Desktop, Cursor, Windsurf, and any other MCP-aware client. ## Install ```sh go install github.com/Gaurav-Gosain/golars/cmd/golars-mcp@latest ``` The binary lives in `$GOBIN` (or `$HOME/go/bin` by default). ## Configure Claude Desktop Edit `~/Library/Application Support/Claude/claude_desktop_config.json` (or `%APPDATA%\Claude\claude_desktop_config.json` on Windows; the Linux equivalent lives under `~/.config/Claude/`) and add a `mcpServers` entry: ```json { "mcpServers": { "golars": { "command": "/absolute/path/to/golars-mcp" } } } ``` Restart the Claude Desktop app. You should see a hammer icon in the conversation pane letting you enable the `golars` server. ## Configure Cursor / Windsurf Both editors read the same JSON format. Add the same snippet to your workspace MCP config (`~/.cursor/mcp.json` for Cursor). No restart needed in Cursor; the server picks up automatically. 
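For debugging without a host, you can drive the server by hand: the transport is newline-delimited JSON-RPC over stdio. A typical MCP handshake followed by a tool listing looks roughly like this (the message shapes follow the MCP specification; the client name and version here are illustrative):

```json
{"jsonrpc": "2.0", "id": 1, "method": "initialize", "params": {"protocolVersion": "2025-06-18", "capabilities": {}, "clientInfo": {"name": "probe", "version": "0.1.0"}}}
{"jsonrpc": "2.0", "method": "notifications/initialized"}
{"jsonrpc": "2.0", "id": 2, "method": "tools/list"}
```

Piping those three lines into `golars-mcp` on stdin should produce the initialize response followed by the tool catalogue, one JSON object per line.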
## Available tools

| Tool | What it does |
| ------------- | ------------------------------------------------------- |
| `schema` | Return column names + dtypes for a data file |
| `head` | Return the first N rows (CSV/Parquet/Arrow/JSON/NDJSON) |
| `describe` | Return describe()-style summary stats |
| `sql` | Run a SQL query against one or more files |
| `row_count` | Cheap "how many rows × cols" probe |
| `null_counts` | Per-column null counts |

Every tool returns *both* a plain-text fallback (for hosts that only render text) and a `structuredContent` payload with `columns` + `rows` arrays so richer UIs can render a table.

## Example session

After configuring the server, ask your LLM host something like:

> "What's the schema of `~/data/trades.csv`? If any column has more
> than 10% nulls, summarise it with describe."

The host picks up the tool catalogue from `tools/list`, calls `schema`, then `null_counts`, then `describe`, and the model answers using the structured results.

## Protocol notes

* Protocol version: `2025-06-18`. We only implement the tools capability; resources and prompts are not served (yet).
* Transport: stdio JSON-RPC 2.0, one object per line.
* No authentication: `golars-mcp` reads files the user running the host process has read access to.

## Security

The MCP server is **read-only**. It cannot write files, start subprocesses, or reach the network. The tools only accept a path string and execute a query against its contents; SQL is compiled to a lazy plan with a whitelist of operators (no arbitrary expressions or DDL). That said, it *will* read any file the caller names: don't point a host LLM at secrets.

## Extending

`cmd/golars-mcp/tools.go` registers tools into a flat slice. Add a new `Tool{Name, Description, InputSchema, Run}` entry and the server picks it up. Keep tools pure (no state outside the local session) so concurrent calls are safe.

# Memory model

A columnar analytics library lives or dies by how it manages memory.
Go's garbage collector is capable but unforgiving of allocation-heavy hot loops. This document describes the rules golars follows to keep memory predictable.

## Buffer ownership

All column data ultimately sits in `memory.Buffer` (from arrow-go). A buffer is a reference-counted, opaque handle to a contiguous byte region. We never copy buffer contents when we can share them.

Reference counting rules:

* Creating a `Series` from an `arrow.Array` retains the array's buffers.
* Cloning a `Series` shares buffers rather than copying them.
* Slicing a `Series` shares buffers with an offset and length.
* A `DataFrame` holds a retain on every Series it owns.
* Release happens through `Series.Release()` and `DataFrame.Release()`.

Without an explicit release, the GC collects the wrapper and the underlying arrow buffer release runs via a finalizer. Finalizers are a safety net, not the intended path. Release explicitly in tight loops.

## Allocator

Every Series, array, and kernel output is allocated through a `memory.Allocator`. The default is `memory.DefaultAllocator`. For benchmarks and tests we use `memory.NewCheckedAllocator`, which tracks outstanding allocations so a test can fail on unreleased buffers. Every test that constructs Series must use a checked allocator and verify zero leaks at teardown.

Allocator choice flows through the `context.Context` in plan execution. Expression evaluation takes the allocator from the surrounding execution context, not from a global.

## Immutability

`Series` and `DataFrame` are immutable to the user. Every mutating-looking method returns a new value. Under the hood we exploit ref-counted buffer sharing so that `df.Rename("a", "b")` is O(1) and does not copy column data.

This buys us a few things:

* Safe concurrent reads without locks.
* Easier reasoning about plan transformations.
* The optimizer can reorder and eliminate subplans without worrying about side effects.

## The chunked model

A Series is a sequence of chunks. Each chunk is an `arrow.Array` of the same dtype.
Properties: * All chunks share one dtype and one validity bitmap format. * Total length is the sum of chunk lengths. * Chunk boundaries are an implementation detail. Kernels must not depend on specific chunk sizes for correctness, only for scheduling. Why chunks: * Natural unit of parallelism. * Natural unit of streaming (a morsel is a DataFrame-shaped collection of chunks, one per column). * Enables append-without-copy: appending two Series concatenates chunk lists instead of copying. The downside is that kernels must iterate over chunks. We mitigate this with a `series.Iter()` helper that yields `(chunk arrow.Array, offset int)` pairs. ## Null masks Arrow's validity bitmap is one bit per row, ones for valid, zeros for null. golars never materializes a null to a sentinel value. All kernels operate on `(data, bitmap)` pairs. Aggregations skip null positions. Comparisons propagate null per polars' semantics: `null == null` is null, `null < 1` is null, and so on. Bitmaps are shared via buffer refcounts just like data buffers. ## Hot-loop allocation rules Performance work follows a small set of rules: 1. **No allocation in inner loops.** Pre-allocate result buffers sized to the input. Use `memory.Allocator.Allocate(n)` once per chunk, not per row. 2. **No interface boxing in inner loops.** Hot kernels dispatch on dtype once at the outer level and then work on concrete `[]T` slices. We rely on generated code (`go generate`) to produce dtype-specialized kernels rather than paying interface dispatch cost per row. 3. **No map operations in inner loops.** Hash tables used by groupby and join are dedicated open-addressing implementations under `internal/hash`. No `map[K]V` in aggregation critical paths. 4. **Reuse buffers across morsels.** The streaming executor keeps a free-list of `memory.Buffer` per operator and reuses them across morsels where size permits. 5. **Bounded per-operator memory.** Operators declare a memory budget and spill to disk when they exceed it. 
## Cross-operator sharing Projection pushdown and common subexpression elimination mean that the same underlying column appears in multiple operator outputs. We never copy: the output Series of a projection shares buffers with the input. Refcounting ensures correctness. ## GC pressure management Go's GC is concurrent and low-latency, but allocation pressure still drives pause frequency and throughput cost. golars keeps pressure low by: * Working in large `[]T` slices instead of many small objects. * Using `sync.Pool` for short-lived per-morsel scratch buffers (hash temp arrays, partition index buffers). * Avoiding string allocation on the hot path. String columns are kept in arrow's native offset-plus-buffer layout and operated on as byte slices. * Keeping the `Chunk` struct small (a few pointers) so that slices of chunks fit in cache. We run the test suite under `GODEBUG=gctrace=1` in CI and watch for surprise allocation. ## A note on off-heap We do not use off-heap memory (mmap backed by anonymous regions) by default. arrow-go's `memory.GoAllocator` returns Go-managed slices. We switch to `memory.CgoArrowAllocator` only if profiling shows GC overhead is a problem on real workloads, and only if we decide to relax the no-cgo constraint. For now, staying on-heap is simpler and fast enough. Spill-to-disk for OOC is different and uses mmap on regular files. That is not off-heap allocation; that is swapping to disk. # Parallelism model Go's concurrency primitives fit analytical query execution well. Goroutines are cheap, channels provide back-pressure for free, and `select` handles cancellation cleanly. This document describes how golars uses them. ## Two kinds of parallelism golars distinguishes two parallelism patterns: 1. **Data parallelism inside a single operator.** A filter over a column with 64 chunks can run all chunks in parallel. The result preserves chunk order. This is what the eager executor uses for every kernel. 2. 
**Pipeline parallelism across operators.** A groupby-agg over a parquet file runs the scan, the hash partition, the partial aggregate, and the final merge as separate stages, each in its own goroutine, communicating via channels. This is what the streaming executor uses. Both patterns compose. A single stage in the streaming executor may itself run data-parallel kernels internally. ## Worker pool A single process-wide worker pool (default size `GOMAXPROCS`, overridable) dispatches chunk-level work for the eager executor. Benefits over spawning goroutines ad hoc: * Predictable concurrency. We cap simultaneous work, which keeps memory bounded. * Cheap cancellation. Closing the pool's context stops all in-flight chunks without leaking goroutines. * Natural place to plug instrumentation (rows processed, time per chunk). The pool lives in `internal/pool`. Its public type is `*Pool` with methods `Submit(ctx, func(ctx) error)` and `Wait()`. Internally it uses a bounded `chan func()` fed by a fixed set of worker goroutines. We do not use `sync.WaitGroup` directly in user-facing code. `errgroup.Group` from `golang.org/x/sync/errgroup` handles error propagation and cancellation in the usual Go idiom. ## Morsel-driven streaming The streaming executor borrows the morsel-driven model from HyPer and DuckDB and adapts it to Go. The implementation lives in the `stream` package; `lazy.WithStreaming()` compiles streaming-friendly plan prefixes into a `stream.Pipeline`. Primitives currently available: * `Source`, `Stage`, `Sink` as plain function types. * `DataFrameSource` slices an in-memory frame into morsels. A row-partitioned parallel source is the next delivery. * `FilterStage`, `ProjectStage`, `WithColumnsStage`, `RenameStage`, `DropStage`, `SliceStage` (state-carrying, tracks the running row counter across morsels). * `ParallelMapStage` is the combinator that turns a per-morsel function into an order-preserving fan-out. 
It tags morsels on ingress with a sequence number, dispatches to a small worker pool, and uses a reorder buffer on egress so downstream stages see input order regardless of worker count. `ParallelFilterStage`, `ParallelProjectStage`, `ParallelWithColumnsStage` are thin wrappers. * `CollectSink` concatenates morsel chunks column-wise into a single DataFrame. Hybrid execution: `lazy.Collect(ctx, lazy.WithStreaming())` runs the longest streaming-friendly prefix through the pipeline executor. When a blocker node (Sort, Aggregate, Join) appears above that prefix, the upstream DataFrame is materialized first and the blocker runs eagerly. This keeps the surface simple (one `Collect` call) while letting streaming pay off for scan + filter + project chains and not regress for blockers. **Morsel.** A morsel is an `arrow.Record` with a bounded number of rows (default 64K, tuned by workload). All inter-stage communication is in morsels. **Channel back-pressure.** Every inter-stage channel has a small buffer (default 4). When a downstream stage is slow, its input channel fills, blocking the producer. This is the back-pressure mechanism, and it costs no allocation. **Exchange.** Partition-parallel operators (hash groupby, hash join) insert a hash exchange: a stage that takes one input channel and fans out to N output channels, one per partition, by hashing the keys. Downstream workers own their partition end-to-end. **Pipeline breakers.** Sort and groupby-agg with no suitable partition key are pipeline breakers. They buffer, compute, and then emit. The streaming executor tracks breakers explicitly so planners can decide when spilling is necessary. **Cancellation.** Every stage takes a `context.Context`. When the context cancels (user abort, downstream error, sink closed), stages drain their input channels, release references to any morsels they hold, and return. ## Why goroutines over thread pools polars uses Rayon, which is ideal for CPU-bound data-parallel loops in Rust. 
Go's goroutine scheduler does the same job for our workloads with less ceremony: * Goroutines are cheap enough that we do not need a join-on-completion primitive. We just launch them. * Channels are zero-allocation queues (for values up to channel element size). We do not reinvent bounded queues. * `select` handles timeouts, cancellation, and multi-source reads in one construct. No event loop needed. The tradeoff is that Go does not give us work-stealing the way Rayon does. In practice this matters less than it seems because our work units (morsels) are large enough that simple FIFO dispatch from the pool's channel is rarely the bottleneck. If profiling proves otherwise, we introduce work-stealing at the pool level without changing the operator interface. ## Determinism Parallel execution must not change results. Rules: * Chunked operations preserve chunk order in the output. A parallel filter produces chunks in the same order as input, even if they finished out of order. * Aggregation results are deterministic modulo the associativity of the aggregation. Sum, min, max, count are exact. Mean, std, var use a numerically stable parallel algorithm (Welford-Chan) and are reproducible. * Sort is stable. * Row order in a DataFrame is preserved across operations unless an operation explicitly reorders (sort, join, groupby). Tests cover determinism explicitly: the same input run N times produces byte-identical output. ## Scheduling heuristics The planner picks chunk size and partition count based on: * Estimated row count of the input * Number of group-by keys (for partition count) * `GOMAXPROCS` If estimates are unavailable (lazy input from a scan with no statistics), we default to a morsel size of 64K rows and a partition count of `2 * GOMAXPROCS`. These defaults are tunable per-session. ## Profiling and observability Every operator records rows in, rows out, bytes processed, and wall time. 
These metrics are available via `df.Profile()` on eager calls and `lf.Profile()` on lazy calls. Under the hood, the pool exposes `expvar` counters so long-running programs can scrape them. # golars scripting language (.glr) `.glr` files are a tiny, line-oriented language for pipeline-style DataFrame work. Designed to feel like "your REPL session in a file", nothing more. Every REPL command you know is also a script statement. ```bash # trades-daily.glr load data/trades.csv as trades load data/symbols.csv as symbols use trades filter volume > 100 groupby symbol amount:sum:total join symbols on symbol sort total desc limit 10 show ``` Run it: ```sh golars run trades-daily.glr # one-shot ``` From inside the REPL: ``` golars » .source trades-daily.glr ``` *** ## Grammar ```text program = { statement NL } ; statement = empty | comment | command ; comment = "#" { any-char-until-NL } ; command = [ "." ] identifier { arg } ; arg = identifier | number | string | operator | "as" | "on" ; string = '"' { any-char } '"' ; identifier = ( letter | "_" ) { letter | digit | "_" | "." | "-" | "/" | ":" } ; number = [ "-" ] digit { digit } [ "." digit { digit } ] ; operator = "==" | "!=" | "<=" | ">=" | "<" | ">" | "and" | "or" ; ``` Statements are line-terminated. A trailing `\` (after any trailing whitespace) continues onto the next physical line, useful for long filter predicates: ```bash filter salary > 100000 \ and dept == "eng" \ and tenure_years >= 2 ``` The leading `.` on every command is optional. A `#` inside a `"..."` string is treated as a literal - only unquoted `#` starts a comment. Typos get a `did you mean?` hint from the runner. 
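The quoted-`#` rule means comment stripping has to be quote-aware rather than a plain `strings.Index`. A minimal sketch of that scan (illustrative only, not the actual lexer; the grammar's strings have no escape sequences, so a simple quote toggle matches it):

```go
package main

import "fmt"

// stripComment drops an unquoted trailing '#' comment, honouring the
// rule that '#' inside a "..." string is a literal character.
func stripComment(line string) string {
	inString := false
	for i, r := range line {
		switch r {
		case '"':
			inString = !inString
		case '#':
			if !inString {
				return line[:i] // comment starts here
			}
		}
	}
	return line
}

func main() {
	fmt.Println(stripComment(`filter note == "#1 pick" # keep top picks`))
}
```

Line continuation composes with this: join continued physical lines first, then strip the comment from the logical line.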
*** ## Statement reference | Statement | What it does | | ------------------------------------------------------------------------------------- | ---------------------------------------------------------------------------------------------------------------------------------------- | | `load PATH` | Focus a new frame (csv/tsv/parquet/ipc/arrow). | | `load PATH as NAME` | Stage a frame under `NAME` without touching focus. | | `use NAME` | Switch focus to a clone of `NAME`. `NAME` stays staged so repeated `use` branches off the same base; prior focus is discarded. | | `stash NAME` | Materialize the focus and save it under `NAME`; focus continues with the snapshot. | | `frames` | List loaded frames. The focused one is marked `*`. | | `drop_frame NAME` | Release `NAME` from the registry. | | `save PATH` | Materialize the focused pipeline and write to disk. | | `select COL [, COL...]` | Project columns (lazy). | | `drop COL [, COL...]` | Drop columns (lazy). | | `filter PRED` | Add a filter predicate (lazy). See [predicate grammar](#predicate-grammar). | | `sort COL [asc\|desc]` | Sort by one column (lazy). | | `limit N` | Keep the first N rows (lazy). | | `head [N]` | Collect and print first N rows (default 10). | | `tail [N]` | Collect and print last N rows. | | `show` | Alias for `head 10`. | | `ishow` / `browse` | Open the focused pipeline in the interactive browse TUI on the alt screen. Quit with `q` to return to the REPL. | | `schema` | Print column names + dtypes. | | `describe` | count/null\_count/mean/std/min/25%/50%/75%/max per column. | | `groupby KEYS AGG [AGG...]` | Group + aggregate. KEYS is comma-separated. AGG is `col:op[:alias]`; op is `sum`/`mean`/`min`/`max`/`count`/`null_count`/`first`/`last`. | | `join PATH\|NAME on KEY [TYPE]` | Join the focus with a file or named frame. TYPE ∈ `inner`/`left`/`cross` (default inner). | | `explain` | Print logical plan, optimiser trace, optimised plan. 
| | `explain_tree` / `tree` | Same three-section report rendered as a box-drawn tree. | | `graph` / `show_graph` | Styled plan tree with lipgloss colour coding. | | `mermaid` | Emit the plan as a Mermaid flowchart. Pipe into `mmdc` for PNG/SVG. | | `collect` | Materialize the pipeline back into the focused frame's source. | | `reset` | Discard the lazy pipeline; keep the source. | | `source PATH` | Run another `.glr` file inline. | | `reverse` | Reverse row order of the focus. | | `sample N [seed]` | Uniform-random sample of N rows without replacement. | | `shuffle [seed]` | Randomly reorder every row. | | `unique` | Drop duplicate rows across every column. | | `null_count` | Per-column null count as a 1-row frame. | | `glimpse [N]` | Compact peek at the first N rows (default 5). | | `size` | Estimated Arrow byte size of the pipeline result. | | `timing` | Toggle per-statement timing. | | `info` | Runtime info: Go version, heap, uptime, row counts. | | `clear` | Clear the screen. | | `exit` / `quit` | Quit the REPL (no-op in `golars run` mode). | | `cast COL TYPE` | Cast COL to `i64`/`i32`/`f64`/`f32`/`bool`/`str`. | | `fill_null VALUE` | Replace nulls across compatible columns with VALUE. | | `drop_null [COL...]` | Drop rows with nulls in any (or the listed) columns. | | `rename OLD as NEW` | Rename one column. | | `sum COL` / `mean COL` / `min COL` / `max COL` / `median COL` / `std COL` | Print one scalar for COL. | | `write PATH` | Alias for `save`. Supported sinks: `.csv`, `.tsv`, `.parquet`, `.arrow`, `.ipc`, `.json`, `.ndjson`/`.jsonl`. | | `with_row_index NAME [OFFSET]` | Prepend an int64 row index. | | `sum_horizontal OUT [COL...]` | Append a row-wise sum column (nulls ignored). | | `mean_horizontal OUT [COL...]` | Append a row-wise mean column. | | `min_horizontal OUT [COL...]` | Row-wise min. | | `max_horizontal OUT [COL...]` | Row-wise max. | | `all_horizontal OUT [COL...]` | Row-wise boolean AND. | | `any_horizontal OUT [COL...]` | Row-wise boolean OR. 
| `sum_all` / `mean_all` / `min_all` / `max_all` / `std_all` / `var_all` / `median_all` | One-row per-column aggregate over every numeric column. |
| `count_all` / `null_count_all` | One-row per-column (null-)count. |
| `scan_csv PATH [as NAME]` | Register a lazy CSV scan (push-down friendly). |
| `scan_parquet PATH [as NAME]` | Lazy Parquet scan. |
| `scan_ipc PATH [as NAME]` | Lazy Arrow IPC scan. |
| `scan_json PATH [as NAME]` | Lazy JSON scan. |
| `scan_ndjson PATH [as NAME]` | Lazy NDJSON scan. |
| `scan_auto PATH [as NAME]` | Infer the scan format from the file extension. |
| `fill_nan VALUE` | Replace NaN with VALUE in every float column. |
| `forward_fill [LIMIT]` | Forward-fill nulls per column (LIMIT=0 is unlimited). Leading nulls stay null. |
| `backward_fill [LIMIT]` | Backward-fill nulls per column. Trailing nulls stay null. |
| `top_k K COL` | Keep K rows with the largest values in COL. |
| `bottom_k K COL` | Keep K rows with the smallest values in COL. |
| `transpose [HEADER_COL] [PREFIX]` | Transpose the focus (numeric/bool columns). |
| `unpivot IDS [VALS]` | Wide-to-long reshape. IDS/VALS are comma-separated lists. |
| `partition_by KEYS` | Print a summary of per-key-combination row counts. |
| `skew COL` / `kurtosis COL` | Scalar skewness / excess kurtosis. |
| `approx_n_unique COL` | HyperLogLog estimate of distinct-value count. |
| `corr COL1 COL2` / `cov COL1 COL2` | Pair-wise Pearson corr / sample cov. |
| `pivot INDEX ON VALUES [AGG]` | Long-to-wide pivot. AGG: first/sum/mean/min/max/count. |
| `pwd` / `ls [PATH]` / `cd [PATH]` | Working-directory helpers. |
| `with NAME = EXPR` | Append a derived column. EXPR is a real expression: arithmetic, comparisons, logical ops, string methods, aggregates, rolling windows. |
| `unnest COL` | Project fields of a struct-typed column as top-level columns. |
| `explode COL` | Fan out each element of a list-typed column into its own row. |
| `upsample COL EVERY` | Interpolate a sorted timestamp column at `ns`/`us`/`ms`/`s`/`m`/`h`/`d`/`w` intervals. |

### String operations

String-column ops are reachable via `col(x).str.<op>()` on the expression API, and inside `.filter` via keyword operators that desugar to the same Exprs. Every op below is backed by the series kernel of the same name; the expression layer is a thin dispatch on the function name.

| Filter keyword     | Expr method                    | Notes                                        |
| ------------------ | ------------------------------ | -------------------------------------------- |
| `contains "sub"`   | `col(x).str.contains("sub")`   | Literal substring, no regex                  |
| `starts_with "p"`  | `col(x).str.starts_with("p")`  | Byte-prefix                                  |
| `ends_with "s"`    | `col(x).str.ends_with("s")`    | Byte-suffix                                  |
| `like "%pat%"`     | `col(x).str.like("%pat%")`     | SQL wildcards: `%` any, `_` one, `\\` escape |
| `not_like "%pat%"` | `col(x).str.not_like("%pat%")` | Negation fused into the kernel               |

Non-predicate string ops available on Expr (no filter-grammar sugar, used via `.select`, `.with_column`, or in aggregations): `str.to_lower`, `str.to_upper`, `str.trim`, `str.strip_prefix(p)`, `str.strip_suffix(s)`, `str.replace(o, n)`, `str.replace_all(o, n)`, `str.len_bytes`, `str.len_chars`, `str.count_matches(s)`, `str.find(s)`, `str.head(n)`, `str.tail(n)`, `str.slice(start, length)`, `str.contains_regex(pat)`.

### Expression grammar (for `with`)

`with NAME = EXPR` accepts a full expression tree, not just the filter DSL. Bare identifiers resolve to column references; string methods hang off `.str.*`; aggregates, rolling, and EWM are reachable via fluent method calls; `col(...)`, `lit(...)`, `sum(...)`, and `coalesce(...)` are available as top-level functions.
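As an illustration of that resolution order, here is a minimal, hypothetical classifier. The names and rules here are assumptions for exposition only, not the actual parser:

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// classify sketches one plausible resolution order for a bare token in a
// `with` expression: known top-level functions first, then quoted strings,
// then numeric literals, and finally column references as the fallback.
func classify(tok string) string {
	topLevel := map[string]bool{"col": true, "lit": true, "sum": true, "coalesce": true}
	switch {
	case topLevel[tok]:
		return "top-level function"
	case strings.HasPrefix(tok, `"`):
		return "string literal"
	default:
		if _, err := strconv.ParseFloat(tok, 64); err == nil {
			return "numeric literal"
		}
		return "column reference"
	}
}

func main() {
	for _, tok := range []string{"amount", "1000", `"AAPL"`, "coalesce"} {
		fmt.Printf("%s → %s\n", tok, classify(tok))
	}
}
```

The fallback-to-column-reference rule is what lets `amount > 1000` read naturally without a `col(...)` wrapper.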
Examples:

```bash
with bulk = amount > 1000
with name_upper = name.str.upper()
with revenue = price * qty
with trend = amount.rolling_mean(7, 1)
with score = coalesce(primary, backup).str.trim()
with ewm = value.ewm_mean(0.3)
```

Supported string methods: `upper`, `lower`, `trim`, `reverse`, `contains`, `contains_regex`, `starts_with`, `ends_with`, `like`, `not_like`, `replace`, `replace_all`, `strip_prefix`, `strip_suffix`, `len_bytes`, `len_chars`, `slice`, `head`, `tail`, `find`, `count_matches`, `split_exact`.

Supported aggregates: `sum`, `mean`, `min`, `max`, `count`, `null_count`, `first`, `last`, `median`, `std`, `var`, `quantile`, `skew`, `kurtosis`, `n_unique`, `approx_n_unique`.

Supported shape ops: `abs`, `neg`, `not`, `round`, `floor`, `ceil`, `sqrt`, `exp`, `log`, `log2`, `log10`, `sign`, `reverse`, `shift`, `diff`, `cum_sum`, `cum_min`, `cum_max`, `fill_null`, `alias`, `cast`, `between`, `forward_fill`, `backward_fill`.

Supported windows: `rolling_sum`, `rolling_mean`, `rolling_min`, `rolling_max`, `rolling_std`, `rolling_var`, `ewm_mean`, `ewm_std`, `ewm_var`.

### Predicate grammar

For `filter`:

```
col op value [and|or col op value]...
```

* No parentheses; evaluation is strictly left to right.
* Ops: `==`, `!=`, `<`, `<=`, `>`, `>=`, `is_null`, `is_not_null`.
* Values: integers, floats, double-quoted strings, `true`, `false`.

Examples:

```bash
filter age >= 21 and salary > 50000
filter symbol == "AAPL"
filter is_active and created_at > 1704067200000000
filter note is_null
```

***

## Multi-source workflows

Scripts regularly need N frames. The `as NAME` / `use NAME` / `.frames` trio is the whole story; there is no hidden namespace:

```bash
# Stage every input up front. None of these promote themselves to
# focus, so we can read them in any order.
load data/trades.csv as trades
load data/symbols.csv as symbols
load data/users.csv as users

# Work on one, stash it, work on the next.
use trades
filter volume > 100
groupby user_id amount:sum:total_bought
stash trade_totals

# `use` is non-consuming: trade_totals stays staged, and so does the
# original trades frame; we could `use trades` again to branch off a
# different filter.
use users
filter region == "US"
join trade_totals on user_id
join symbols on symbol
sort total_bought desc
show
```

`stash` is the "save into a variable" move: it materializes whatever lazy pipeline is on the focus and parks a copy under `NAME`, so a later `use NAME` gives you that snapshot. The focus itself keeps going from the snapshot, so the idiomatic branching pattern is:

```bash
load data/trades.csv
filter volume > 100
stash base
filter side == "buy"
stash buys
use base
filter side == "sell"
stash sells
use buys
join sells on symbol
```

When `.join` sees a name that exists in the frame registry, it consumes that frame (keeping it in the registry for reuse) instead of treating the argument as a path. Paths win only when no frame matches.

### Anonymous `load PATH`

The short form `load PATH` (no `as`) is equivalent to `use NAME` where `NAME` is empty. It's the "single-frame script" ergonomic:

```bash
load data/trades.csv
filter volume > 100
show
```

No registry, no juggling: just pipe.

***

## Transpile to Go

`golars transpile SCRIPT.glr [-o OUT.go] [--package NAME]` emits a standalone Go program that reproduces the pipeline through the lazy API. The generated source is piped through `go/format` and has its imports pruned by `go/ast`, so the output is always gofmt'd and free of unused imports.
```sh
golars transpile examples/script/pipeline.glr -o main.go --package main
go run main.go
```

Mapping:

| `.glr`                                   | Go                                                        |
| ---------------------------------------- | --------------------------------------------------------- |
| `load PATH`                              | `golars.ReadCSV(PATH, csv.WithNullValues(""))`            |
| `load PATH as NAME`                      | stashes the LazyFrame in an internal map for later `use`  |
| `use NAME`                               | retargets `focus` onto the stashed frame                  |
| `filter EXPR`                            | `.Filter(EXPR)`                                           |
| `with NAME = EXPR`                       | `.WithColumns(EXPR.Alias("NAME"))`                        |
| `groupby KEY COL:OP[:ALIAS] ...`         | `.GroupBy(KEY).Agg(...)`                                  |
| `sort COL [desc]`                        | `.Sort(COL, desc)`                                        |
| `limit N`                                | `.Limit(N)`                                               |
| `head N`                                 | `.Limit(N)` + `.Collect` + `fmt.Println`                  |
| `join NAME on KEY [inner\|left\|cross]`  | `.Join(other, []string{KEY}, dataframe.InnerJoin)`        |
| `show`                                   | `.Head(10).Collect` + `fmt.Println`                       |
| `save PATH`                              | `golars.WriteCSV(df, PATH)` (or matching writer)          |

Commands without a direct lazy equivalent (`.tree`, `.graph`, `.mermaid`, `.reset`, `.frames`) emit a `TODO(glr):` comment so the file still compiles. If no `show` / `head` / `collect` / `save` appears in the script, transpile adds an implicit final `Collect` + `fmt.Println` so the generated binary prints something instead of exiting silently.

See [`examples/script/transpiled/`](https://github.com/Gaurav-Gosain/golars/tree/main/examples/script/transpiled) for a transpiled copy of every bundled `.glr` example.

***

## Interop with code

Anything that implements `script.Executor` can host the language. `cmd/golars` is the reference, but the package ships a generic runner:

```go
import "github.com/Gaurav-Gosain/golars/script"

r := script.Runner{
	Exec:          script.ExecutorFunc(func(line string) error { /* … */ return nil }),
	Trace:         func(line string) { fmt.Println(">", line) },
	ContinueOnErr: true,
	ErrOut:        os.Stderr,
}
if err := r.RunFile("pipeline.glr"); err != nil {
	log.Fatal(err)
}
```

* `Trace` receives every normalised statement just before execution.
* `ContinueOnErr` + `ErrOut` emit errors inline and keep running.
* `script.Normalize(raw)` is exported so third parties can apply the same parsing rules (comment stripping, leading `.` insertion).

***

## Editor support

Tree-sitter grammar + highlight queries live at [`editors/tree-sitter-golars/`](../editors/tree-sitter-golars/). Install notes for Neovim (`nvim-treesitter`) and VS Code are in that directory's README.

### LSP

*golars-lsp in Neovim: inlay hints showing the frame shape after every `.glr` statement.*

[`cmd/golars-lsp`](../cmd/golars-lsp/) is a minimal Language Server that ships:

* **Inline completions** for commands, staged-frame names, file paths, and column names read from loaded CSV files.
* **Inlay hints** showing each pipeline step's output shape: `→ 5 rows × 3 cols` appears at the end of every shape-changing statement. Row counts propagate as upper bounds: `limit N` clamps to `N`, left joins preserve the left side's count, and filters and inner joins mark rows `?`.
* **Hover docs** with a signature and long description on any command token.
* **Diagnostics** for unknown commands and files that don't resolve.

### `# ^?` probe: live table previews

Drop `# ^?` on its own line anywhere in a script and the Neovim plugin renders the focused frame's current table as virtual text below the comment (via a `golars --preview` subprocess). This is the scripting equivalent of Twoslash/Quokka probes: a live peek at the data at that pipeline position:

```bash
load data/trades.csv
filter volume > 100
sort amount desc
limit 5
# ^?
```

The preview updates on save and on debounced text changes; configure it via `require("golars").setup({ preview_cmd = { "/path/to/golars" }, preview_rows = 20, preview_timeout_ms = 3000 })`. Set `preview = false` to disable.
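Returning to `script.Normalize` from the interop section: a rough approximation of the documented rules (comment stripping, blank-line removal, leading `.` insertion) might look like the following sketch. This is an illustration only; the exported function's exact behaviour may differ:

```go
package main

import (
	"fmt"
	"strings"
)

// normalize approximates script.Normalize's documented rules: strip
// trailing '#' comments and blank lines, then prefix each remaining
// statement with "." if it does not already have one.
func normalize(raw string) []string {
	var stmts []string
	for _, line := range strings.Split(raw, "\n") {
		if i := strings.Index(line, "#"); i >= 0 {
			line = line[:i] // drop the comment (probes like "# ^?" too)
		}
		line = strings.TrimSpace(line)
		if line == "" {
			continue
		}
		if !strings.HasPrefix(line, ".") {
			line = "." + line
		}
		stmts = append(stmts, line)
	}
	return stmts
}

func main() {
	fmt.Println(normalize("load data.csv  # comment\n\nfilter x > 1"))
}
```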
### `golars --preview`

Invoke the preview pipeline from any editor, or manually:

```sh
golars --preview path/to/script.glr
golars --preview path/to/script.glr --preview-rows 25
```

Runs the script silently (no banner, no trace, no success chrome) and prints exactly one rendered table: the focused pipeline's head. Exit code 0 on success, non-zero on script error (message on stderr).

***

## What this language is NOT

* No variables beyond the named-frame registry. If you need branching or reusable expressions, write a Go program that drives `script.Runner` with your own logic.
* No control flow (no `if`, no loops). The idiom for conditional runs is shell scripting around `golars run`, or a Go host with an `Executor` that dispatches.
* No expression language beyond the filter predicate DSL and groupby agg spec. Polars-style `pl.col("a") + pl.col("b")` is a `cmd/golars` feature we might add later, but the base language stays small.

The design target is "drop a day of REPL work into a file and have it run again tomorrow." Everything else is out of scope.

# SQL frontend

`golars.sql` exposes a subset SQL frontend that compiles into the same lazy plan as the Go API. Run queries from the shell, from the REPL, or programmatically.

## From the CLI

```sh
golars sql "SELECT dept, SUM(amount) AS total FROM sales GROUP BY dept ORDER BY total DESC" sales.csv
```

Each file becomes a table whose name is the filename stem.
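The stem derivation is the standard basename-minus-extension move. A sketch (the helper name `tableName` is hypothetical, for illustration):

```go
package main

import (
	"fmt"
	"path/filepath"
	"strings"
)

// tableName derives the registered table name from a file path:
// the basename with its extension removed (the "filename stem").
func tableName(path string) string {
	base := filepath.Base(path)
	return strings.TrimSuffix(base, filepath.Ext(base))
}

func main() {
	fmt.Println(tableName("data/sales.csv")) // sales
}
```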
Multiple files register as separate tables, one per file, so a single invocation can query any of them:

```sh
golars sql "SELECT symbol, name FROM symbols" trades.csv symbols.csv
```

## From Go

```go
import "github.com/Gaurav-Gosain/golars/sql"

session := sql.NewSession()
defer session.Close()

session.Register("people", df)
out, err := session.Query(ctx, "SELECT name FROM people WHERE age > 25")
if err != nil {
	log.Fatal(err)
}
defer out.Release()
```

## Grammar

```
SELECT [DISTINCT] projection_list
FROM table_name
[WHERE predicate]
[GROUP BY col_list]
[ORDER BY col_list [ASC|DESC]]
[LIMIT n]
```

`projection_list`:

* `*`
* `col[, col...]` (optionally each with `AS name`)
* `agg(col)[, ...]` (agg: `SUM`, `MIN`, `MAX`, `AVG`, `MEAN`, `COUNT`, `FIRST`, `LAST`)
* any mix of the above when there is a `GROUP BY`

`predicate`:

* `col OP value [AND|OR col OP value]...`
* `OP` is one of `=`, `!=`, `<`, `<=`, `>`, `>=`

`value`:

* integer literal (`42`)
* float literal (`3.14`)
* single-quoted or double-quoted string (`'us'`, `"ops"`)
* `true`, `false`

## Limitations

* No JOIN clause yet (use the `df.Join(...)` API directly).
* No window functions (use `Expr.Over(keys...)` in Go).
* No subqueries.
* No arithmetic in SELECT expressions (use `WithColumns` in Go).

All of these come free with golars' Go API; the SQL frontend focuses on what ad-hoc shell queries need.

## From the MCP server

`golars-mcp` exposes the same SQL compiler as a tool named `sql`. Host LLMs invoke it with `{query, files}` arguments. See [MCP integration](/docs/mcp).