# API design golars borrows the shape of polars' API. The surface is polars-style expressions with Go naming conventions. This document records the conventions so the API stays consistent as it grows. ## Naming * Exported identifiers are `PascalCase`. Unexported are `camelCase`. * Acronyms are capitalized consistently: `CSV`, `JSON`, `URL`, `ID`. * Methods prefer verbs: `Select`, `Filter`, `Join`, `Sort`. Polars uses snake\_case verbs; Go uses PascalCase verbs. Direct translation. * Boolean predicates are prefixed `Is`: `IsNull`, `IsNotNull`, `IsUnique`, `IsInList`. * Getters do not use `Get`: `df.Schema()`, `s.DType()`, `s.Len()`. This matches Go convention. ## Expressions Polars in Python and Rust overloads operators. Go does not. We use method chaining: | Polars (Python) | golars | | --------------------------------- | ----------------------------------- | | `pl.col("a") + pl.col("b")` | `expr.Col("a").Add(expr.Col("b"))` | | `pl.col("a") * 2` | `expr.Col("a").MulLit(2)` | | `pl.col("a") > 3` | `expr.Col("a").GtLit(3)` | | `pl.col("a").is_null()` | `expr.Col("a").IsNull()` | | `pl.col("a").alias("b")` | `expr.Col("a").Alias("b")` | | `pl.when(c).then(a).otherwise(b)` | `expr.When(c).Then(a).Otherwise(b)` | | `pl.col("a").sum().over("g")` | `expr.Col("a").Sum().Over("g")` | Arithmetic with literals uses a `Lit` suffix to keep method signatures concrete. Arithmetic between two expressions is the plain verb. This avoids the type-switching overhead of accepting `any`. ## Scalar kernels (`compute.*Lit`) The `compute` package provides imperative kernels that work directly on `*series.Series`. 
The `*Lit` variants compare against, or do arithmetic with, a scalar literal and skip the allocation a broadcast Series would require:

| Method                       | Behaviour           |
| ---------------------------- | ------------------- |
| `compute.GtLit(ctx, s, 5)`   | mask where s > 5    |
| `compute.LtLit(ctx, s, 0)`   | mask where s \< 0   |
| `compute.EqLit(ctx, s, 42)`  | mask where s == 42  |
| `compute.GeLit(ctx, s, 0.5)` | mask where s >= 0.5 |

These accept any numeric Go literal type and coerce to the series dtype. Fast paths exist for int64 and float64; other dtypes fall back to a broadcast Series internally. Use them in hot loops where the expression compiler's overhead would dominate.

## Top-level re-exports

The root `golars` package re-exports the commonly used names so most user code imports a single package:

```go
import "github.com/Gaurav-Gosain/golars"

df, err := golars.ReadCSV(ctx, "data.csv")
out := df.Filter(golars.Col("x").GtLit(0)).
	GroupBy("k").
	Agg(golars.Col("v").Sum().Alias("v_sum"))
```

Deeper packages (`expr`, `lazy/plan`) are still importable for users who need them; `internal/*` packages (such as `internal/hash`) are sealed off by Go's internal-package rule and are not part of the API.

## Errors

* Every IO-bound or parse-bound operation returns `(T, error)`. No panics on user input.
* Pure-data operations that cannot fail given valid inputs return `T`. Invalid inputs (wrong dtype, nonexistent column) produce a descriptive error wrapped with `fmt.Errorf("golars: ...: %w", inner)`.
* We define sentinel errors for the common cases: `ErrColumnNotFound`, `ErrDTypeMismatch`, `ErrShapeMismatch`. User code can `errors.Is` on them.
* Expression build errors are deferred to `Collect()`. `expr.Col("a").Add(expr.Col("b"))` never fails at construction time, even if "a" does not exist in the target frame. The error surfaces when the plan is resolved.

## Context

Every operation that does IO, runs a plan, or might take non-trivial time accepts a `context.Context` as the first argument. Pure-data operations on already-materialized data do not. Examples:

* `golars.ReadCSV(ctx, path)` takes ctx.
* `df.Filter(mask)` does not. Filter is in-process and fast. * `lf.Collect(ctx)` takes ctx. Collect runs the plan. * `lf.GroupBy("k")` does not. GroupBy is a plan builder. The rule is: if a call can run arbitrary user-supplied IO, a plan, or a potentially long-running compute stage, it takes a context. Builder calls do not. ## Options For operations with many optional parameters (ReadCSV, Join, GroupByDynamic), we use functional options: ```go df, err := golars.ReadCSV(ctx, "data.csv", golars.WithDelimiter(','), golars.WithHasHeader(true), golars.WithNullValues([]string{"", "NA"}), ) ``` Options are functions with typed constructors. Option types are scoped to the operation (CSVOption, JoinOption) so the compiler enforces correct combinations. ## IO packages Each supported file format lives in its own package so programs that only need one format don't pull the rest: ``` io/csv // RFC 4180 CSV io/parquet // Apache Parquet via pqarrow io/ipc // Arrow IPC (feather) io/json // JSON array-of-objects, object-of-arrays, and NDJSON io/sql // database/sql bridge (any pure-Go driver) ``` Each package exposes `Read` (from `io.Reader`), `ReadFile`, and where it makes sense `ReadURL` (net/http-backed) and `ReadString`. Writers are symmetric: `Write`, `WriteFile`. The URL loaders accept `WithHTTPClient` so tests can inject a custom transport and production code can wire retry/auth middleware. JSON type inference promotes numeric columns like polars: mixed int/float → float64; mixed anything/string → string. NaN and Inf round-trip. Nulls in input become null bitmap entries. `io/sql` is the pragmatic pure-Go path for databases: plug in any `database/sql`-compatible driver (pgx, modernc.org/sqlite, go-sql-driver/mysql, go-mssqldb) and get a typed DataFrame. `ReadSQL` is eager; `NewReader` streams `WithBatchSize(n)` rows for result sets that exceed memory. Null values are preserved via the arrow validity bitmap. 
Apache ADBC is deliberately not wrapped: its drivers require cgo, which breaks the pure-Go invariant; the nested demo in `examples/sql/` shows the integration pattern with SQLite. ## Scripting (`script/` + `.glr` files) `script.Runner` runs a tiny pipe-style language against any `Executor`. One statement per line, `#` for comments, the leading `.` on each command is optional: ```bash # examples/script/demo.glr load data/trades.csv filter volume > 100 groupby symbol amount:sum:total sort total desc show ``` `cmd/golars` is the reference host: `golars run path.glr` runs a file one-shot and exits, `.source path.glr` runs one inline from the REPL. Third-party programs plug in via `script.ExecutorFunc`. Multi-source: `load PATH as NAME` stages a frame in a registry without promoting it to focus; `use NAME` promotes it, parking the prior focus under its own name for later reuse; `join PATH|NAME on KEY` consumes a staged frame by name before trying it as a path. See the full language reference in [`docs/scripting.md`](scripting.md). A Tree-sitter grammar + highlight queries ship at [`editors/tree-sitter-golars/`](../editors/tree-sitter-golars/) for editor integrations. ## Nullability and zero values Go has no null. Arrow has validity bitmaps. The API presents null the same way polars does: * `s.Get(i)` returns `(value, valid bool)` for primitive dtypes. * `s.GetStr(i)` returns `("", false)` for null. * `s.IsNull()` returns a boolean mask Series. * Aggregations skip nulls by default. `Sum` over `[1, null, 2]` is 3. * Comparison operators produce null when either side is null. This matches SQL and polars. ## Iteration golars does not encourage row-wise iteration. The idiomatic shape of a program is: ```go result := df. Filter(golars.Col("price").GtLit(0)). WithColumns( golars.Col("price").Mul(golars.Col("qty")).Alias("total"), ). GroupBy("region"). 
Agg(golars.Col("total").Sum()) ``` Row iteration is available as `df.Rows()` returning a `RowIter` for cases where it is truly needed (writing to a non-columnar sink, debugging), but it is slow by design and documented as such. ## Versioning We follow semver. Before v1.0.0 the API is unstable by convention (minor version bumps may break). After v1.0.0 we commit to semver strictly. Deprecations are marked with `// Deprecated:` comments and persist for at least one minor version before removal. ## What we do not export * Concrete struct fields on `DataFrame`, `Series`, `LazyFrame`. All access is through methods. * The plan and physical plan node types. They live under `lazy/plan` and `lazy/physical` but the API surface is what you build through the fluent `LazyFrame` API. Direct plan construction is not supported. * Internal hash table and pool types. # golars / polars API surface Last synced: 2026-04-24. Authoritative map between polars' Python API (`pl.*`) and golars' Go surface. Status column values: * `done` shipped and covered by tests * `partial` shipped for a common subset; documented gaps * `todo` not yet The package-level facade at the repo root (`import "github.com/Gaurav-Gosain/golars"`) re-exports the most commonly needed symbols so casual users do not need to know which sub-package each name lives in. 
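The mechanics behind a facade like this are plain Go: type aliases and function-value bindings re-export deeper names under a single import path. A minimal sketch of the pattern, using stdlib types as stand-ins for golars' sub-packages (the names here are illustrative, not the actual re-export list):

```go
package main

import (
	"bytes"
	"fmt"
)

// A type alias is identical to the aliased type, just reachable from a
// new import path — callers never need to name the deep package.
type Buffer = bytes.Buffer

// Functions are re-exported as package-level variables bound to the
// deep package's function values.
var NewBuffer = bytes.NewBufferString

func main() {
	var b Buffer
	b.WriteString("hi")
	fmt.Println(b.String())                   // "hi"
	fmt.Println(NewBuffer("facade").String()) // "facade"
}
```

Because `type X = deep.X` is an alias rather than a new defined type, values flow between the facade and the sub-packages without any conversion.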
## Top-level functions (`pl.*` / `golars.*`) | polars | golars | Status | Notes | | ------------------------------------------------------------------------------------- | --------------------------------------- | ------ | -------------------------------------------------------------------------------------------------------- | | `pl.DataFrame({...})` | `golars.FromMap(...)` | done | slice-of-go map constructor | | `pl.from_arrow(tbl)` | `dataframe.FromArrowTable(tbl)` | done | | | `pl.concat([...])` | `dataframe.Concat(...)` | done | vertical | | `pl.col(name)` | `golars.Col(name)` | done | also typed: `expr.C[T]`, `expr.Int`, `expr.Float`, `expr.Str`, `expr.Bool`, `expr.Int32`, `expr.Float32` | | `pl.lit(v)` | `golars.Lit(v)` | done | plus `LitInt64`/`LitFloat64`/`LitString`/`LitBool`; generic `expr.LitOf[T]` | | `pl.when(p).then(a).otherwise(b)` | `golars.When(p).Then(a).Otherwise(b)` | done | executor uses compute.Where with dtype promotion | | `pl.sum(col)`, `mean`, `min`, `max`, `count`, `first`, `last`, `median`, `std`, `var` | `golars.Sum(col)` and friends | done | col-scoped agg sugar | | `pl.read_csv` | `golars.ReadCSV` | done | | | `pl.read_parquet` | `golars.ReadParquet` | done | | | `pl.read_ipc` / `read_arrow` | `golars.ReadIPC` | done | | | `pl.read_json` / `read_ndjson` | `golars.ReadJSON` / `golars.ReadNDJSON` | done | | | `pl.read_database` | `io/sql.ReadSQL` | done | | | `pl.read_clipboard` | `io/clipboard.Read` | done | CSV transport | | `pl.read_avro` | | todo | defer | | `pl.read_excel` | | todo | needs xuri/excelize dep | | `pl.read_delta` / `read_iceberg` | | todo | RFC required | | `pl.scan_csv` | `io/csv.Scan` | done | LazyFrame source | | `pl.scan_parquet` | `io/parquet.Scan` | done | | | `pl.scan_ipc` | `io/ipc.Scan` | done | | | `pl.scan_ndjson` | `io/json.ScanNDJSON` | done | | ## DataFrame methods | polars | golars | Status | | | ------------------------------------------------------------------------------ | 
------------------------------------------------------------------------- | --------------------- | --------------------------------------------------------------------------- | | `df.shape` | `df.Shape()` | done | | | `df.height` / `.width` | `df.Height()` / `.Width()` | done | | | `df.columns` | `df.ColumnNames()` | done | | | `df.dtypes` | `df.DTypes()` | done | | | `df.schema` | `df.Schema()` | done | | | `df.is_empty` | `df.IsEmpty()` | done | | | `df.estimated_size` | `df.EstimatedSize()` | done | | | `df.equals` | `df.Equals(other)` | done | | | `df.glimpse` | `df.Glimpse(n)` | done | | | `df.head` / `.tail` / `.limit` | `df.Head` / `.Tail` / `.Limit` | done | | | `df.slice` | `df.Slice(offset, length)` | done | | | `df.select(exprs)` | `df.Select(names...)` / `golars.SelectExpr(ctx, df, exprs...)` | done | | | `df.with_columns(exprs)` | `df.WithColumns(series...)` / `golars.WithColumnsExpr(ctx, df, exprs...)` | done | | | `df.with_column` | `df.WithColumn(series)` | done | | | `df.rename` | `df.Rename(old, new)` | done | | | `df.drop` | `df.Drop(names...)` | done | | | `df.filter(mask)` | `df.Filter(ctx, mask)` | done | | | `df.sort(by)` | `df.Sort` / `.SortBy` | done | | | `df.reverse` | `df.Reverse(ctx)` | done | | | `df.sample(n)` | `df.Sample(ctx, n, replacement, seed)` | done | | | `df.shuffle` | `df.Shuffle(ctx, seed)` | done | | | `df.clone` | `df.Clone()` | done | | | `df.clear` | `df.Clear()` | done | | | `df.group_by(...).agg(...)` | `df.GroupBy(...).Agg(ctx, [expr...])` | done | | | `df.join(other, on, how)` | `df.Join(ctx, right, on, how)` | done | | | `df.vstack` / `.hstack` | `df.VStack(other)` / `df.HStack(other)` | done | | | `df.concat` | top-level `dataframe.Concat(...)` | done | | | `df.describe` | `df.Describe(ctx)` | done | | | `df.null_count` | `df.NullCount()` | done | | | `df.row(i)` | `df.Row(i)` | done | | | `df.rows()` | `df.Rows()` | done | | | `df.to_dict` | `df.ToMap()` | done | | | `df.to_arrow` / `to_pandas` | `df.ToArrow()` 
/ `df.ToArrowTable()` | done (n/a for pandas) | | | `df.write_csv` / `write_parquet` / `write_ipc` / `write_json` / `write_ndjson` | `io/*.WriteFile(ctx, path, df, ...)` | done | | | `df.gather(indices)` | `df.Gather(ctx, indices)` | done | | | `df.pivot` | `df.Pivot(ctx, index, on, values, PivotAgg)` | done | | | `df.unpivot` (melt) | `df.Unpivot(ctx, idVars, valueVars)` | done | | | `df.transpose` | `df.Transpose(ctx, headerCol, prefix)` | done (numeric/bool) | | | `df.partition_by` | `df.PartitionBy(ctx, keys...)` | done | | | `df.top_k` / `df.bottom_k` | `df.TopK(ctx, k, col)` / `df.BottomK(ctx, k, col)` | done | | | `df.pipe(fn)` | `df.Pipe(fn)` | done | | | `df.corr` / `df.cov` | `df.Corr(ctx)` / `df.Cov(ctx, ddof)` | done | | | `df.explode` | `df.Explode(ctx, col)` | done | list-typed column; null/empty lists become a single null row | | `df.unnest` | `df.Unnest(ctx, col)` | done | struct-typed column; field names must not collide with existing cols | | `df.upsample` | `df.Upsample(ctx, col, every)` | done | timestamp col must be sorted; intervals: `ns`/`us`/`ms`/`s`/`m`/`h`/`d`/`w` | ## Series methods ### Inspection | polars | golars | Status | | -------------- | ------------------- | ------ | | `s.dtype` | `s.DType()` | done | | `s.name` | `s.Name()` | done | | `s.len` | `s.Len()` | done | | `s.null_count` | `s.NullCount()` | done | | `s.has_nulls` | `s.HasNulls()` | done | | `s.is_empty` | `s.IsEmpty()` | done | | `s.n_chunks` | `s.NumChunks()` | done | | `s.n_unique` | `s.NUnique()` | done | | `s.is_sorted` | `s.IsSorted(order)` | done | ### Math (scalar) | polars | golars | Status | | ----------------------- | ------------------------- | ------ | | `s.abs` | `s.Abs()` | done | | `s.sqrt` | `s.Sqrt()` | done | | `s.exp` | `s.Exp()` | done | | `s.log` | `s.Log()` | done | | `s.log2` / `log10` | `s.Log2` / `.Log10` | done | | `s.sin` / `cos` / `tan` | `s.Sin` / `.Cos` / `.Tan` | done | | `s.round(d)` | `s.Round(d)` | done | | `s.floor` / `ceil` | `s.Floor` 
/ `.Ceil` | done | | `s.sign` | `s.Sign()` | done | | `s.clip(lo, hi)` | `s.Clip(lo, hi)` | done | | `s.pow(exp)` | `s.Pow(exp)` | done | ### Aggregations (scalar-returning) | polars | golars | Status | | --------------- | -------------------- | ------ | | `s.sum` | `s.Sum()` | done | | `s.mean` | `s.Mean()` | done | | `s.min` / `max` | `s.Min()` / `.Max()` | done | | `s.median` | `s.Median()` | done | | `s.std` / `var` | `s.Std()` / `.Var()` | done | | `s.quantile(q)` | `s.Quantile(q)` | done | | `s.any` / `all` | `s.Any()` / `.All()` | done | | `s.product` | `s.Product()` | done | ### Position / argsort | polars | golars | Status | | ------------------------- | --------------------------- | ------ | | `s.arg_min` / `arg_max` | `s.ArgMin()` / `.ArgMax()` | done | | `s.arg_sort` | `s.ArgSort()` | done | | `s.top_k(k)` / `bottom_k` | `s.TopK(k)` / `.BottomK(k)` | done | ### Transform | polars | golars | Status | | ---------------------- | ------------------------------------------- | ------ | | `s.head` / `tail` | `s.Head` / `.Tail` | done | | `s.slice` | `s.Slice` | done | | `s.reverse` | `s.Reverse` | done | | `s.sample` | `s.Sample` | done | | `s.shuffle` | `s.Shuffle` | done | | `s.sort` | `compute.Sort(ctx, s, opts)` | done | | `s.unique` | `s.Unique()` | done | | `s.value_counts` | `s.ValueCounts(sort)` | done | | `s.rename` / `alias` | `s.Rename(name)` | done | | `s.clone` | `s.Clone()` | done | | `s.cast` | `compute.Cast(ctx, s, dt)` | done | | `s.rechunk` / `chunks` | `s.Rechunk()` / `s.Chunk(i)` / `.Chunked()` | done | ### Null handling | polars | golars | Status | | ---------------------------------------- | --------------------------------------------- | ------ | | `s.is_null` / `is_not_null` | `s.IsNull()` / `.IsNotNull()` | done | | `s.is_nan` / `is_finite` / `is_infinite` | `s.IsNaN()` / `.IsFinite()` / `.IsInfinite()` | done | | `s.fill_null(v)` | `s.FillNull(v)` | done | | `s.drop_nulls` | `s.DropNulls()` | done | ### Cumulative | polars | golars | 
Status | | ----------------------- | -------------------------- | ------ | | `s.cum_sum` | `s.CumSum()` | done | | `s.cum_min` / `cum_max` | `s.CumMin()` / `.CumMax()` | done | | `s.cum_prod` | `s.CumProd()` | done | | `s.cum_count` | `s.CumCount()` | done | | `s.diff(periods)` | `s.Diff(periods)` | done | | `s.shift(periods)` | `s.Shift(periods)` | done | | `s.pct_change(periods)` | `s.PctChange(periods)` | done | | `s.mode` | `s.Mode()` | done | ### Rolling / windowed | polars | golars | Status | | | ------------------------------------------------------------------------------------------------ | -------------------------------------------------------- | ------ | ------------------------------------------------ | | `s.rolling_sum` / `rolling_mean` / `rolling_min` / `rolling_max` / `rolling_std` / `rolling_var` | matching methods with `RollingOptions` | done | | | `s.ewm_mean` / `s.ewm_var` / `s.ewm_std` | `s.EWMMean(alpha)` / `.EWMVar(alpha)` / `.EWMStd(alpha)` | done | adjusted form; integer inputs promote to float64 | ### String namespace (`s.str.*`) | polars | golars | Status | | -------------------------------------------------------------- | ------------------------------------------------------------------------------------------------------------- | ------ | | `str.len_bytes` / `len_chars` | `.Str().LenBytes()` / `.LenChars()` | done | | `str.to_uppercase` / `to_lowercase` | `.Str().Upper()` / `.Lower()` | done | | `str.title` | `.Str().Title()` | done | | `str.contains` | `.Str().Contains(sub)` | done | | `str.starts_with` / `ends_with` | `.Str().StartsWith()` / `.EndsWith()` | done | | `str.replace` / `replace_all` | `.Str().Replace()` / `.ReplaceAll()` | done | | `str.strip_chars` | `.Str().Trim()` / `.LStrip()` / `.RStrip()` | done | | `str.strip_prefix` / `strip_suffix` | `.Str().StripPrefix()` / `.StripSuffix()` | done | | `str.pad_start` / `pad_end` / `zfill` | `.Str().PadStart()` / `.PadEnd()` / `.ZFill()` | done | | `str.reverse` | 
`.Str().Reverse()` | done | | `str.slice` | `.Str().Slice(start, len)` | done | | `str.count_matches` | `.Str().CountMatches()` | done | | `str.concat` | `.Str().Concat()` / `.Str().Prefix()` | done | | `str.extract` | `.Str().Extract(pattern, group)` | done | | `str.contains_regex` / `count_matches_regex` / `replace_regex` | matching `.Str().*Regex` | done | | `str.split_exact` | `.Str().SplitExact(sep)` (List\) / `.Str().SplitN(sep, idx)` / `.Str().SplitExactNullShort(sep, idx)` | done | ### Arrow interop | polars | golars | Status | | -------------------- | -------------------------------------------- | ------ | | `s.to_arrow` | `s.ToArrow()` / `s.ToArrowChunked()` | done | | `pl.from_arrow(arr)` | `series.FromArrowArray` / `FromArrowChunked` | done | | `df.to_arrow` | `df.ToArrow()` / `df.ToArrowTable()` | done | | `pl.from_arrow(tbl)` | `dataframe.FromArrowTable(tbl)` | done | ## Expr methods | polars | golars | Status | | -------------------------------------------------------- | --------------------------------------- | --------------------------------------------------------- | | `pl.col(x)` with `+ - * /` | `Col(x).Add/Sub/Mul/Div(...)` | done | | `== != < <= > >=` | `Eq/Ne/Lt/Le/Gt/Ge` | done | | `and / or / not` | `And / Or / Not` | done | | `.alias(name)` | `.Alias(name)` | done | | `.cast(dt)` | `.Cast(dt)` | done | | `.is_null` / `.is_not_null` | `.IsNull()` / `.IsNotNull()` | done | | `.sum .mean .min .max .count .first .last .null_count` | matching methods | done | | `.median .std .var .any .all .product .quantile` | matching methods | done | | `.abs .sqrt .exp .log .log2 .log10 .sin .cos .tan .sign` | matching methods | done | | `.round(d) .floor .ceil .clip(lo, hi) .pow(x)` | matching methods | done | | `.fill_null(v)` | `.FillNull(v)` / `.FillNullExpr(e)` | done | | `.reverse .head(n) .tail(n) .slice(off, len) .shift(p)` | matching methods | done | | `.between(lo, hi)` | `.Between(lo, hi)` | done | | `.is_in([...])` | `.IsIn(v...)` | done | | 
`.over(partition)` | `.Over(keys...)` | done (scalar-agg fast path + generic gather-eval-scatter) | | `.sort` | `.Sort(desc)` (fluent via FunctionNode) | done | | `.rank(method)` | `.Rank(method)` | done | | `.rolling_sum` / `.rolling_mean` / ... | matching `.Rolling*(size, minPeriods)` | done | | `.ewm_mean` / `.ewm_var` / `.ewm_std` | matching `.EWM*(alpha)` | done | ## LazyFrame methods | polars | golars | Status | | | ----------------------------------------------------- | ------------------------------------------------------ | ------ | ---------------------------------------------------------------------------- | | `lf.filter` / `.select` / `.with_columns` | `lf.Filter / .Select / .WithColumns` | done | | | `lf.with_column` | `lf.WithColumn(expr)` | done | | | `lf.sort / .group_by / .agg / .join` | matching | done | | | `lf.slice / .head / .limit / .tail` | matching | done | | | `lf.reverse` | `lf.Reverse()` | done | | | `lf.unique` | `lf.Unique()` | done | | | `lf.drop / .rename` | matching | done | | | `lf.collect` / `.collect_unoptimized` | matching | done | | | `lf.explain / .show_graph` | `lf.Explain()` / `lf.ExplainTree()` / `lf.ShowGraph()` | done | `ShowGraph` emits Mermaid; the REPL `.graph` command adds lipgloss colouring | | `lf.sink_csv / sink_parquet / sink_ipc / sink_ndjson` | `lf.Sink(ctx, writer)` plus `io/*.WriteFile` in writer | done | | ## dtype | polars | golars | Status | | ----------------------------------------------------------------------- | ----------------------------------------------------------------- | ---------------------- | | Bool, Int8, Int16, Int32, Int64, UInt8 through UInt64, Float32, Float64 | `dtype.Bool()`, `dtype.Int64()` and friends | done | | String / Utf8 | `dtype.String()` | done | | Binary | `dtype.Binary()` | done | | Null | `dtype.Null()` | done | | Date / Datetime(unit, tz) / Duration / Time | `dtype.Date / Datetime / Duration / Time` + `series.FromTime` | done | | List(inner) / Array / Struct / Field | 
`dtype.List / FixedList / Struct` + `series.{ListOps, StructOps}` | done | | Categorical / Enum | | todo | | Decimal / Float16 / Int128 | | todo | | Object / Unknown | | todo (polars-internal) | See [roadmap.md](roadmap.md) for the perf side of the same picture (throughput ratios vs polars 1.39 on the bench suite). # Architecture overview golars is a layered query engine over Apache Arrow memory. The user writes code against the `DataFrame` (eager) or `LazyFrame` (lazy) facades. Lazy code flows through an expression AST, a logical plan, an optimizer, and a physical plan before reaching the executor. Both eager and lazy paths converge on the same compute kernels and Series primitives. ## Layered component map ## Boundary between golars and arrow-go golars uses `apache/arrow-go/v18` as its memory and IO substrate. The line is: **arrow-go owns:** * Array implementations (`arrow.Array`, all typed arrays) * Memory allocation, buffers, reference counts (`memory.Allocator`, `memory.Buffer`) * Arrow IPC reader and writer * Parquet reader and writer * CSV reader (we wrap it) * Schema primitives at the physical level (`arrow.Schema`, `arrow.Field`) **golars owns:** * A logical dtype model on top of arrow dtypes, carrying polars semantics (for example logical dates, categoricals, enums, and the `Null` dtype) * `Series` as a named, chunked, nullable column with dtype-aware methods * `DataFrame` composition and transformation operations * Expression AST, logical and physical plan, optimizer * Streaming executor * Group-by, join, sort, and pivot algorithms * SQL frontend When a polars feature exists in arrow-go with the right semantics, we wrap rather than reimplement. When polars' semantics differ from arrow's (null handling edge cases, dtype promotion, string comparisons), we implement in golars and document the choice. 
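The validity-bitmap representation golars inherits from Arrow can be sketched in a few lines of plain Go. This is a toy model — not the real `series.Series` — but it shows why aggregations can skip nulls without a sentinel value in the data buffer:

```go
package main

import "fmt"

// nullableInt64 is a toy nullable column: one data slot per row plus a
// validity bitmap with one bit per row (1 = valid, 0 = null).
type nullableInt64 struct {
	values   []int64
	validity []byte // len = ceil(len(values)/8)
}

func (c nullableInt64) isValid(i int) bool {
	return c.validity[i/8]&(1<<(i%8)) != 0
}

// sum skips null slots, matching the documented semantics:
// Sum over [1, null, 2] is 3.
func (c nullableInt64) sum() int64 {
	var s int64
	for i, v := range c.values {
		if c.isValid(i) {
			s += v
		}
	}
	return s
}

func main() {
	// Logical column [1, null, 2]: the slot behind the null holds an
	// arbitrary value (0 here) and is masked out by bit pattern 0b101.
	c := nullableInt64{values: []int64{1, 0, 2}, validity: []byte{0b101}}
	fmt.Println(c.sum()) // 3
}
```

The real implementation stores the bitmap in Arrow buffers and works chunk-by-chunk, but the valid-bit check is the same idea.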
## Data flow in an eager pipeline ## Data flow in a lazy pipeline The streaming executor reads morsels (record batches of bounded row count) from sources, pipes them through operator goroutines over buffered channels, and terminates at a sink. Each operator stage scales horizontally with `GOMAXPROCS`. ## Conformance strategy We use py-polars as the behavioral oracle during development. The `internal/testutil` package holds helpers that: * Generate fixture DataFrames from JSON or parquet files under `testdata/`. * Compare golars output to a golden file produced by a py-polars script committed alongside the fixture. * Fail with a human-readable diff on drift. This keeps us honest about semantics without pulling Python into CI. Python is only needed when regenerating golden files. # Cookbook End-to-end recipes for common tasks. Every snippet compiles and assumes `import "github.com/Gaurav-Gosain/golars"` plus whatever sub-package a particular line needs. ## Typed columns for compile-time literal checks The `expr` package ships a typed facade (`expr.C[T]`, `expr.Int`, `expr.Float`, `expr.Str`, `expr.Bool`) that lets Go infer literal types from method arguments, eliminating the `expr.Lit(int64(...))` boilerplate: ```go import "github.com/Gaurav-Gosain/golars/expr" qty := expr.Int("qty") price := expr.Float("price") out, _ := lazy.FromDataFrame(df). Filter(expr.All(qty.Gt(2), price.Lt(50))). WithColumns( price.MulCol(qty.CastFloat64()).As("total").Expr, qty.Between(2, 5).Alias("in_range"), ). Collect(ctx) ``` The runtime plan is identical to the untyped `expr.Col("qty"). GtLit(int64(2))` form. Passing a string literal to an int-typed column fails at build time rather than panicking at evaluation. See `examples/*/generic/` in the repository for side-by-side comparisons. 
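The compile-time check works because the typed column carries its element type as a type parameter, so Go infers literal types from method arguments. A toy sketch of the mechanism — the names are illustrative, not golars' real `expr.C[T]` implementation:

```go
package main

import "fmt"

// TypedCol is a toy typed column: T fixes the element type at the
// call site, so literal arguments are checked by the compiler.
type TypedCol[T int64 | float64 | string] struct{ name string }

func C[T int64 | float64 | string](name string) TypedCol[T] {
	return TypedCol[T]{name: name}
}

// Gt accepts only values of the column's element type; an untyped
// constant like 2 is inferred as T without an int64(...) conversion.
func (c TypedCol[T]) Gt(v T) string {
	return fmt.Sprintf("(col(%q) > %v)", c.name, v)
}

func main() {
	qty := C[int64]("qty")
	fmt.Println(qty.Gt(2))
	// qty.Gt("x") // does not compile: string is not int64
}
```

A wrong literal type is a build error at the `Gt` call, which is the "fails at build time rather than panicking at evaluation" property described above.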
## List and struct namespaces

Expression-level helpers mirror polars' `.list.*` and `.struct.*`:

```go
import "github.com/Gaurav-Gosain/golars/expr"

lazy.FromDataFrame(df).Select(
	expr.Col("tags").List().Len().Alias("tag_count"),
	expr.Col("payload").Struct().Field("x").Alias("x"),
	expr.Col("csv").Str().SplitExact(",").List().Get(0).Alias("first"),
).Collect(ctx)
```

Supported list reducers: `Len`, `Sum`, `Mean`, `Min`, `Max`, `First`, `Last`, `Get(idx)`, `Contains(needle)`, `Join(sep)` (string lists). Supported struct ops: `Field(name)`.

## Unnest / explode / upsample

Unnest a struct column:

```go
out, _ := df.Unnest(ctx, "payload")
// struct {x:i64, y:str} becomes two top-level cols `x` and `y`.
```

Explode a list column (null and empty lists become a single null row):

```go
out, _ := df.Explode(ctx, "tags")
// [[a, b, c], [], NULL, [d]] produces 3 + 1 + 1 + 1 = 6 rows.
```

Upsample a sorted timestamp column to a dense grid:

```go
out, _ := df.Upsample(ctx, "ts", "1d")
// gaps wider than 1d are filled with new rows whose non-timestamp
// columns are null.
```

Accepted intervals: `ns`, `us`, `ms`, `s`, `m`, `h`, `d`, `w`. Calendar units (`mo`, `y`) are rejected.

## Pretty-print a logical plan

```go
fmt.Println(lazy.ExplainTree(plan.Plan()))
// SORT [total desc]
// └── AGG keys=[dept] aggs=[...]
//     └── FILTER (col("salary") > 75)
//         └── SCAN df
```

`lazy.ExplainTreeASCII` swaps the box-drawing glyphs for ASCII fallbacks. `lf.ExplainTree()` is the full three-section report (logical, optimiser, optimised) rendered as a tree.

## Read a CSV, filter, write Parquet

```go
df, _ := golars.ReadCSV(ctx, "trades.csv")
defer df.Release()

out, _ := golars.Lazy(df).
	Filter(golars.Col("volume").GtLit(int64(100))).
	Collect(ctx)
defer out.Release()

golars.WriteParquet(out, "heavy_trades.parquet")
```

## Group + aggregate multiple columns in one pass

```go
agg, _ := golars.Lazy(df).
	GroupBy("symbol").
	Agg(
		golars.Sum("qty"),
		golars.Mean("price").Alias("avg_price"),
		golars.Max("price").Alias("hi"),
	).
	Sort("qty_sum", true).
	Collect(ctx)
```

golars fuses aggregations that share the same group keys into a single hash pass through `groupby_fused.go`, so the three aggregations above cost one scan.

## Join a CSV against a Parquet lookup table

```go
trades, _ := golars.ReadCSV(ctx, "trades.csv")
defer trades.Release()
lookup, _ := golars.ReadParquet(ctx, "symbols.parquet")
defer lookup.Release()

out, _ := golars.Lazy(trades).
	Join(golars.Lazy(lookup), []string{"symbol"}, golars.InnerJoin).
	Collect(ctx)
```

## Scan (lazy I/O) plus predicate pushdown

```go
import iocsv "github.com/Gaurav-Gosain/golars/io/csv"

// iocsv.Scan returns a LazyFrame that opens the file only when
// Collect runs. Combined with Filter + Select, the optimiser pushes
// the projection down through the scan.
lf := iocsv.Scan("/tmp/huge.csv").
	Filter(golars.Col("region").EqLit("us")).
	Select(golars.Col("symbol"), golars.Col("price"))

for batch, err := range lf.IterBatches(ctx) {
	if err != nil {
		log.Fatal(err)
	}
	// stream-process each batch here, then release it; a defer inside
	// the loop would pin every batch until the function returns.
	batch.Release()
}
```

## Null handling: drop, fill, or flag

```go
clean, _ := golars.Lazy(df).DropNulls("price", "qty").Collect(ctx)
filled, _ := golars.Lazy(df).FillNull(int64(0)).Collect(ctx)

mask, _ := df.AnyNullMask(ctx)
defer mask.Release()
// `mask` is a boolean Series you can plug back into Filter to flag
// bad rows without dropping them.
```

## Select by dtype or name predicate

```go
import "github.com/Gaurav-Gosain/golars/selector"

numericOnly, _ := df.SelectBy(selector.Numeric())
noTimes := df.DropBy(selector.EndsWith("_ts"))

// Combinators: intersect, union, minus.
usdCols, _ := df.SelectBy(selector.Intersect(
	selector.Float(),
	selector.StartsWith("price_usd"),
))
```

## Cross-language with Arrow

```go
rec := df.ToArrow()      // arrow.RecordBatch
tbl := df.ToArrowTable() // arrow.Table (multi-chunk)
roundtrip, _ := dataframe.FromArrowTable(tbl)
```

Both sides are Arrow IPC format-compatible.
Write with `io/ipc.Write` and read in PyArrow, pola.rs, DuckDB, or any other Arrow-aware tool without format conversion. ## String munging ```go out, _ := df.Apply(func(s *series.Series) (*series.Series, error) { if s.Name() != "email" { return s.Clone(), nil } return s.Str().Before("@") }) ``` `.Str().Before` / `.After` / `.SplitNth` cover the common parsing cases; `.SplitWide` returns multiple Series so you can stitch them into a DataFrame with extra columns. ## Cache an intermediate pipeline ```go base := golars.Lazy(df). Filter(golars.Col("active").EqLit(true)). Cache() // Two downstream pipelines share the same filtered base. top, _ := base.Sort("score", true).Head(10).Collect(ctx) flag, _ := base.Filter(golars.Col("score").LtLit(0.5)).Collect(ctx) ``` Cache memoises the first Collect result; subsequent collects reuse it. The cached frame is released automatically when the cache's LazyFrame handle is garbage-collected. ## When / then / otherwise ```go out, _ := golars.Lazy(df). Select(golars.When(golars.Col("age").Gt(golars.Lit(18))). Then(golars.Lit("adult")). Otherwise(golars.Lit("minor")). Alias("category")). Collect(ctx) ``` Mixed numeric dtypes are promoted (int then + float otherwise -> float64 out). Null cond values are treated as false (polars semantics). ## Rolling operations ```go // Rolling sum/mean/min/max/std/var with a fixed window. out, _ := golars.Lazy(df). Select( golars.Col("price").RollingMean(30, 1).Alias("ma30"), golars.Col("price").RollingStd(30, 5).Alias("vol30"), ).Collect(ctx) ``` Second argument is `min_periods` (0 = require full window). Int64 inputs with no nulls take a SIMD-friendly O(n) slide (two-phase warmup + 4-way unrolled step). ## Regex on strings ```go // Boolean mask for regex hits. mask, _ := series.FromString("s", []string{"a1", "xx", "b22"}, nil). Str().ContainsRegex(`\d+`) // Extract first capture group. ids, _ := emails.Str().Extract(`@([a-z.]+)$`, 1) // Count matches per row. 
counts, _ := tokens.Str().CountMatchesRegex(`\w+`) ``` ## Pivot (long -> wide) ```go // Mirror of polars' df.pivot(index="id", on="cat", values="v"). wide, _ := df.Pivot(ctx, []string{"id"}, "cat", "v", dataframe.PivotSum) ``` Aggregators: `PivotFirst`, `PivotSum`, `PivotMean`, `PivotMin`, `PivotMax`, `PivotCount`. ## Window functions with `.Over(...)` ```go // Per-group total broadcast back to every row. out, _ := golars.Lazy(df). Select(golars.Col("revenue").Sum().Over("region").Alias("region_total")). Collect(ctx) // Per-group rank. ranked, _ := golars.Lazy(df). Select(golars.Col("score").Rank("dense").Over("cohort").Alias("rank_in_cohort")). Collect(ctx) ``` ## Forward / backward fill + NaN ```go // Replace every NaN with 0 in float columns (integer cols pass through). filled, _ := golars.Lazy(df).FillNan(0).Collect(ctx) // Carry the last non-null value forward through consecutive nulls. // Pass limit=3 to stop after three consecutive fills; limit=0 means unlimited. ff, _ := golars.Lazy(df).ForwardFill(0).Collect(ctx) bf, _ := golars.Lazy(df).BackwardFill(0).Collect(ctx) // Per-column variant via Expr: out, _ := golars.Lazy(df). WithColumns(golars.Col("price").ForwardFill(0).Alias("price")). Collect(ctx) ``` ## Reshape: transpose, unpivot, partition ```go // Transpose: each input column becomes a row; output value type is // the promoted float64 (polars' object-dtype fallback is not yet // supported, so non-numeric inputs error). wide, _ := df.Transpose(ctx, "column", "row") // Unpivot (melt): turn value columns into two long-form columns: // variable (the original column name) + value. long, _ := df.Unpivot(ctx, []string{"id"}, nil /* default: all non-id */) // Partition: one DataFrame per distinct key tuple; ordered by first // appearance. Caller releases each partition. 
parts, _ := df.PartitionBy(ctx, "region", "symbol") for _, p := range parts { defer p.Release() } ``` ## Top-K / Bottom-K / Pipe ```go topSellers, _ := df.TopK(ctx, 10, "revenue") worstLatency, _ := df.BottomK(ctx, 5, "p99_ms") // Pipe keeps chained code flat: out, _ := df.Pipe(func(d *DataFrame) (*DataFrame, error) { return d.Filter(ctx, mask) }) ``` ## Stats: skew, kurtosis, corr, cov, approx\_n\_unique ```go sk, _ := col.Skew() // polars default (biased) sku, _ := col.SkewUnbiased() // scipy bias=False kk, _ := col.Kurtosis() // excess kurtosis c, _ := a.PearsonCorr(b) // Pearson r cov, _ := a.Covariance(b, 1) // ddof=1 corrMat, _ := df.Corr(ctx) // k-by-k frame covMat, _ := df.Cov(ctx, 1) approx, _ := col.ApproxNUnique() // HLL estimate ``` ## Extra math helpers ```go // Trig + hyperbolic family: atan2, cbrt, sinh/cosh/tanh, log1p, expm1, // radians/degrees, arccos/arcsin/arctan, cot, arcsinh/arccosh/arctanh. radians, _ := col.Radians() y, _ := colY.Arctan2(colX) ``` ## Coalesce, concat\_str, ones/zeros/int\_range ```go picked, _ := golars.Lazy(df). Select(golars.Coalesce(golars.Col("primary"), golars.Col("fallback"))). Collect(ctx) joined, _ := golars.Lazy(df). Select(golars.ConcatStr("-", golars.Col("sym"), golars.Col("year"))). Collect(ctx) // Build Series out of thin air: ids, _ := golars.Lazy(df). WithColumns(golars.IntRange(0, 100, 1).Alias("idx")). Collect(ctx) ``` ## Arrow IPC streaming (cross-language) ```go sw, _ := golars.NewIPCStreamWriter(conn, firstBatch) for batch := range batches { sw.Write(ctx, batch) } sw.Close() // Consumer side (also polars/pyarrow/DuckDB-compatible): sr, _ := golars.NewIPCStreamReader(conn) defer sr.Close() for batch, err := range sr.Iter(ctx) { if err != nil { log.Fatal(err) } process(batch) batch.Release() } ``` ## Row-wise (horizontal) reductions ```go // Append a column that sums three others on a per-row basis. withTotal, _ := golars.Lazy(df). SumHorizontal("total", "q1", "q2", "q3"). 
Collect(ctx) defer withTotal.Release() // Or compute the reduction directly as a standalone Series. total, _ := golars.SumHorizontal(ctx, df, "q1", "q2", "q3") defer total.Release() ``` Variants: `SumHorizontal`, `MeanHorizontal`, `MinHorizontal`, `MaxHorizontal`, `AllHorizontal`, `AnyHorizontal`. Omit the column list to span every numeric (or boolean) column. Null handling defaults to `IgnoreNulls`; pass `dataframe.PropagateNulls` on the frame-level method for polars' strict semantics. ## One-row frame-level aggregates ```go sums, _ := df.SumAll(ctx) // one row, one column per numeric input means, _ := df.MeanAll(ctx) counts, _ := df.CountAll(ctx) // counts non-nulls for every column nulls, _ := df.NullCountAll(ctx) ``` These mirror polars' `df.sum()`, `df.mean()`, `df.count()`. They are convenient for dashboards, describe-style summaries, and streamed ETL checkpoints. ## Lazy scans (pushdown-friendly I/O) ```go lf := golars.ScanCSV("huge.csv"). Filter(golars.Col("region").EqLit("us")). Select(golars.Col("symbol"), golars.Col("price")) out, _ := lf.Collect(ctx) ``` Every format has a scan entry point: `ScanCSV`, `ScanParquet`, `ScanIPC`, `ScanJSON`, `ScanNDJSON`. Compared to `Read*`, a scan defers opening the file until `Collect` so the optimiser can push projections + filters into the reader. ## Rank and percent-change ```go r, _ := golars.Lazy(df). Select( golars.Col("score").Rank("dense").Alias("rank"), golars.Col("score").PctChange(1).Alias("delta"), ). 
Collect(ctx)
```

## Apply a custom Go function

```go
out, _ := df.Apply(func(s *series.Series) (*series.Series, error) {
    switch s.DType().String() {
    case "i64":
        return s.ApplyInt64(func(v int64) int64 { return v * 2 })
    case "str":
        return s.Str().Upper()
    }
    return s.Clone(), nil
})
```

## REPL / scripting quickies

Inside the `golars` REPL or in a `.glr` file:

```
load data/trades.csv
filter volume > 100
with_row_index row
cast price f64
fill_null 0
rename volume as qty
sum qty
write out.parquet
```

Scalar-only prints (`.sum COL`, `.mean COL`, `.min COL`, etc.) write one-line results instead of a table, convenient for quick spot checks.

## Per-language equivalents

| polars (Python) | golars (Go) |
| ----------------------------------- | ---------------------------------------------------------------- |
| `pl.read_csv(p)` | `golars.ReadCSV(ctx, p)` |
| `df.filter(pl.col("x") > 5)` | `df.Filter(ctx, mask)` or `lazy` with `golars.Col("x").GtLit(5)` |
| `df.group_by("k").agg(pl.sum("v"))` | `df.GroupBy("k").Agg(ctx, []expr.Expr{expr.Col("v").Sum()})` |
| `df.unique()` | `df.Unique(ctx)` |
| `df.sample(n=10)` | `df.Sample(ctx, 10, false, seed)` |
| `df.with_row_index()` | `df.WithRowIndex("index", 0)` |
| `pl.col("s").str.to_uppercase()` | `s.Str().Upper()` |
| `df.select(cs.numeric())` | `df.SelectBy(selector.Numeric())` |
| `pl.sum_horizontal("a", "b")` | `golars.SumHorizontal(ctx, df, "a", "b")` |
| `df.fill_nan(0)` | `lf.FillNan(0)` |
| `df.fill_null(strategy="forward")` | `lf.ForwardFill(0)` |
| `df.fill_null(strategy="backward")` | `lf.BackwardFill(0)` |
| `df.top_k(10, by="x")` | `df.TopK(ctx, 10, "x")` |
| `df.transpose()` | `df.Transpose(ctx, "column", "row")` |
| `df.unpivot(index=["id"])` | `df.Unpivot(ctx, []string{"id"}, nil)` |
| `df.partition_by("k")` | `df.PartitionBy(ctx, "k")` |
| `df.corr()` | `df.Corr(ctx)` |
| `pl.coalesce(...)` | `golars.Coalesce(...)` |
| `pl.concat_str(..., sep)` | `golars.ConcatStr(sep, ...)` |
| `pl.int_range(0, n)` | `golars.IntRange(0, int64(n), 1)` |
| `pl.when(p).then(a).otherwise(b)` | `golars.When(p).Then(a).Otherwise(b)` |
| `pl.col("x").rolling_sum(w)` | `golars.Col("x").RollingSum(w, 0)` |
| `pl.col("x").sum().over("k")` | `golars.Col("x").Sum().Over("k")` |
| `s.str.extract(pattern, i)` | `s.Str().Extract(pattern, i)` |
| `s.str.contains_regex(p)` | `s.Str().ContainsRegex(p)` |
| `df.pivot(index=, on=, values=)` | `df.Pivot(ctx, index, on, values, agg)` |
| `df.sum()` | `df.SumAll(ctx)` |
| `pl.scan_csv(p)` | `golars.ScanCSV(p)` |
| `pl.scan_parquet(p)` | `golars.ScanParquet(p)` |

See `docs/api-surface.md` for the full cross-reference.

# Getting Started

## Install as a library

```sh
go get github.com/Gaurav-Gosain/golars@latest
```

## Install the CLI

```sh
go install github.com/Gaurav-Gosain/golars/cmd/golars@latest
```

Type `golars help` to see every subcommand:

```
golars                      start interactive REPL
golars run SCRIPT           execute a .glr script
golars fmt [-w] FILE        canonicalize a .glr script
golars lint FILE            report common .glr mistakes
golars schema FILE          print column names + dtypes
golars stats FILE           print describe() stats
golars head FILE [N]        print first N rows (default 10)
golars diff A B             show row-level diff between two files
golars sql QUERY [FILE...]  run a SQL query against files
golars browse FILE          interactive TUI table viewer
golars explain SCRIPT       print the lazy plan
```

## Your first query

Programmatically:

```go
package main

import (
    "context"
    "fmt"
    "log"

    "github.com/Gaurav-Gosain/golars/dataframe"
    "github.com/Gaurav-Gosain/golars/expr"
    "github.com/Gaurav-Gosain/golars/lazy"
    "github.com/Gaurav-Gosain/golars/series"
)

func main() {
    ctx := context.Background()

    dept, _ := series.FromString("dept", []string{"eng", "eng", "sales", "ops"}, nil)
    salary, _ := series.FromInt64("salary", []int64{100, 120, 80, 70}, nil)
    df, _ := dataframe.New(dept, salary)
    defer df.Release()

    plan := lazy.FromDataFrame(df).
        Filter(expr.Col("salary").Gt(expr.Lit(int64(75)))).
        GroupBy("dept").
Agg(expr.Col("salary").Sum().Alias("total")). Sort("total", true) out, err := plan.Collect(ctx) if err != nil { log.Fatal(err) } defer out.Release() fmt.Println(out) } ``` From the shell: ```sh # Read a CSV, describe it golars stats trades.csv # Run SQL against it golars sql "SELECT symbol, SUM(volume) AS vol FROM trades GROUP BY symbol ORDER BY vol DESC LIMIT 5" trades.csv # Interactively browse it golars browse trades.csv ``` ## The REPL Run `golars` with no arguments to open the interactive REPL: ``` golars » load trades.csv ok loaded trades.csv (1,234,567 × 6) golars » filter volume > 100 ok added FILTER to pipeline: col("volume") > 100 golars » groupby symbol amount:sum:vol ok added GROUP BY [symbol] with 1 aggs golars » sort vol desc ok added SORT vol desc to pipeline golars » head 10 ``` The REPL ships with inline ghost-text completions, command history, and tab completion for paths and column names. ## Next # Introduction **golars** is a pure-Go DataFrame library modeled on [polars](https://github.com/pola-rs/polars) and built directly on arrow-go. No cgo. Single `go build` cross-compiles. ```go import ( "context" "fmt" "github.com/Gaurav-Gosain/golars/compute" "github.com/Gaurav-Gosain/golars/dataframe" "github.com/Gaurav-Gosain/golars/series" ) ctx := context.Background() names, _ := series.FromString("name", []string{"ada", "brian", "carl"}, nil) ages, _ := series.FromInt64("age", []int64{27, 34, 19}, nil) df, _ := dataframe.New(names, ages) defer df.Release() mask, _ := compute.GtLit(ctx, ages, int64(20)) adults, _ := df.Filter(ctx, mask) defer adults.Release() fmt.Println(adults) ``` ## Highlights * **Eager + lazy execution.** Build pipelines as logical plans, let the optimizer fuse projections/filters, then `Collect(ctx)`. * **Streaming engine.** Morsel-driven execution for datasets that don't fit in memory. * **Polars-grade performance.** Matches or beats polars 1.39 on most polars-compare workloads. 
* **I/O included.** CSV, Parquet, IPC, JSON, NDJSON readers/writers; `io/sql` bridge for any `database/sql` driver. * **Scripting + REPL.** `.glr` scripts run via `golars run my.glr` or inside the interactive REPL with inline ghost-text completions. * **LLM-native.** MCP server exposes golars tools to Claude Desktop, Cursor, Windsurf, and other MCP hosts. ## Install ```sh go get github.com/Gaurav-Gosain/golars@latest ``` The CLI ships separately: ```sh go install github.com/Gaurav-Gosain/golars/cmd/golars@latest ``` ## Where next # MCP: golars as a tool for your LLM host `golars-mcp` is a Model Context Protocol server that exposes a read-only subset of golars as tools an LLM host can invoke. Works with Claude Desktop, Cursor, Windsurf, and any other MCP-aware client. ## Install ```sh go install github.com/Gaurav-Gosain/golars/cmd/golars-mcp@latest ``` The binary lives in `$GOBIN` (or `$HOME/go/bin` by default). ## Configure Claude Desktop Edit `~/Library/Application Support/Claude/claude_desktop_config.json` (or `%APPDATA%\Claude\claude_desktop_config.json` on Windows; the Linux equivalent lives under `~/.config/Claude/`) and add a `mcpServers` entry: ```json { "mcpServers": { "golars": { "command": "/absolute/path/to/golars-mcp" } } } ``` Restart the Claude Desktop app. You should see a hammer icon in the conversation pane letting you enable the `golars` server. ## Configure Cursor / Windsurf Both editors read the same JSON format. Add the same snippet to your workspace MCP config (`~/.cursor/mcp.json` for Cursor). No restart needed in Cursor; the server picks up automatically. 
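For debugging without a host, you can drive the server by hand: the transport is newline-delimited JSON-RPC over stdio. A typical MCP handshake followed by a tool listing looks roughly like this (the message shapes follow the MCP specification; the client name and version here are illustrative):

```json
{"jsonrpc": "2.0", "id": 1, "method": "initialize", "params": {"protocolVersion": "2025-06-18", "capabilities": {}, "clientInfo": {"name": "probe", "version": "0.1.0"}}}
{"jsonrpc": "2.0", "method": "notifications/initialized"}
{"jsonrpc": "2.0", "id": 2, "method": "tools/list"}
```

Piping those three lines into `golars-mcp` on stdin should produce the initialize response followed by the tool catalogue, one JSON object per line.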
## Available tools

| Tool | What it does |
| ------------- | ------------------------------------------------------- |
| `schema` | Return column names + dtypes for a data file |
| `head` | Return the first N rows (CSV/Parquet/Arrow/JSON/NDJSON) |
| `describe` | Return describe()-style summary stats |
| `sql` | Run a SQL query against one or more files |
| `row_count` | Cheap "how many rows × cols" probe |
| `null_counts` | Per-column null counts |

Every tool returns *both* a plain-text fallback (for hosts that only render text) and a `structuredContent` payload with `columns` + `rows` arrays so richer UIs can render a table.

## Example session

After configuring the server, ask your LLM host something like:

> "What's the schema of `~/data/trades.csv`? If any column has more
> than 10% nulls, summarise it with describe."

The host picks up the tool catalogue from `tools/list`, calls `schema`, then `null_counts`, then `describe`, and the model answers using the structured results.

## Protocol notes

* Protocol version: `2025-06-18`. We only implement the tools capability; resources and prompts are not served (yet).
* Transport: stdio JSON-RPC 2.0, one object per line.
* No authentication: `golars-mcp` reads files the user running the host process has read access to.

## Security

The MCP server is **read-only**. It cannot write files, start subprocesses, or reach the network. The tools only accept a path string and execute a query against its contents; SQL is compiled to a lazy plan with a whitelist of operators (no arbitrary expressions or DDL). That said, it *will* read any file the caller names: don't point a host LLM at secrets.

## Extending

`cmd/golars-mcp/tools.go` registers tools into a flat slice. Add a new `Tool{Name, Description, InputSchema, Run}` entry and the server picks it up. Keep tools pure (no state outside the local session) so concurrent calls are safe.

# Memory model

A columnar analytics library lives or dies by how it manages memory.
Go's garbage collector is capable but unforgiving of allocation-heavy hot loops. This document describes the rules golars follows to keep memory predictable.

## Buffer ownership

All column data ultimately sits in `memory.Buffer` (from arrow-go). A buffer is a reference-counted, opaque handle to a contiguous byte region. We never copy buffer contents when we can share them.

Reference counting rules:

* Creating a `Series` from an `arrow.Array` retains the array's buffers.
* Cloning a `Series` shares buffers rather than copying them.
* Slicing a `Series` shares buffers with an offset and length.
* A `DataFrame` holds a retain on every Series it owns.
* Release happens through `Series.Release()` and `DataFrame.Release()`.

Without an explicit release, the GC collects the wrapper and the underlying arrow buffer release runs via a finalizer. Finalizers are a safety net, not the intended path. Release explicitly in tight loops.

## Allocator

Every Series, array, and kernel output is allocated through a `memory.Allocator`. The default is `memory.DefaultAllocator`. For benchmarks and tests we use `memory.NewCheckedAllocator`, which tracks outstanding allocations so a test can fail on unreleased buffers. Every test that constructs Series must use a checked allocator and verify zero leaks at teardown.

Allocator choice flows through the `context.Context` in plan execution. Expression evaluation takes the allocator from the surrounding execution context, not from a global.

## Immutability

`Series` and `DataFrame` are immutable to the user. Every mutating-looking method returns a new value. Under the hood we exploit ref-counted buffer sharing so that `df.Rename("a", "b")` is O(1) and does not copy column data.

This buys us a few things:

* Safe concurrent reads without locks.
* Easier reasoning about plan transformations.
* The optimizer can reorder and eliminate subplans without worrying about side effects.

## The chunked model

A Series is a sequence of chunks. Each chunk is an `arrow.Array` of the same dtype.
Properties: * All chunks share one dtype and one validity bitmap format. * Total length is the sum of chunk lengths. * Chunk boundaries are an implementation detail. Kernels must not depend on specific chunk sizes for correctness, only for scheduling. Why chunks: * Natural unit of parallelism. * Natural unit of streaming (a morsel is a DataFrame-shaped collection of chunks, one per column). * Enables append-without-copy: appending two Series concatenates chunk lists instead of copying. The downside is that kernels must iterate over chunks. We mitigate this with a `series.Iter()` helper that yields `(chunk arrow.Array, offset int)` pairs. ## Null masks Arrow's validity bitmap is one bit per row, ones for valid, zeros for null. golars never materializes a null to a sentinel value. All kernels operate on `(data, bitmap)` pairs. Aggregations skip null positions. Comparisons propagate null per polars' semantics: `null == null` is null, `null < 1` is null, and so on. Bitmaps are shared via buffer refcounts just like data buffers. ## Hot-loop allocation rules Performance work follows a small set of rules: 1. **No allocation in inner loops.** Pre-allocate result buffers sized to the input. Use `memory.Allocator.Allocate(n)` once per chunk, not per row. 2. **No interface boxing in inner loops.** Hot kernels dispatch on dtype once at the outer level and then work on concrete `[]T` slices. We rely on generated code (`go generate`) to produce dtype-specialized kernels rather than paying interface dispatch cost per row. 3. **No map operations in inner loops.** Hash tables used by groupby and join are dedicated open-addressing implementations under `internal/hash`. No `map[K]V` in aggregation critical paths. 4. **Reuse buffers across morsels.** The streaming executor keeps a free-list of `memory.Buffer` per operator and reuses them across morsels where size permits. 5. **Bounded per-operator memory.** Operators declare a memory budget and spill to disk when they exceed it. 
## Cross-operator sharing Projection pushdown and common subexpression elimination mean that the same underlying column appears in multiple operator outputs. We never copy: the output Series of a projection shares buffers with the input. Refcounting ensures correctness. ## GC pressure management Go's GC is concurrent and low-latency, but allocation pressure still drives pause frequency and throughput cost. golars keeps pressure low by: * Working in large `[]T` slices instead of many small objects. * Using `sync.Pool` for short-lived per-morsel scratch buffers (hash temp arrays, partition index buffers). * Avoiding string allocation on the hot path. String columns are kept in arrow's native offset-plus-buffer layout and operated on as byte slices. * Keeping the `Chunk` struct small (a few pointers) so that slices of chunks fit in cache. We run the test suite under `GODEBUG=gctrace=1` in CI and watch for surprise allocation. ## A note on off-heap We do not use off-heap memory (mmap backed by anonymous regions) by default. arrow-go's `memory.GoAllocator` returns Go-managed slices. We switch to `memory.CgoArrowAllocator` only if profiling shows GC overhead is a problem on real workloads, and only if we decide to relax the no-cgo constraint. For now, staying on-heap is simpler and fast enough. Spill-to-disk for OOC is different and uses mmap on regular files. That is not off-heap allocation; that is swapping to disk. # Parallelism model Go's concurrency primitives fit analytical query execution well. Goroutines are cheap, channels provide back-pressure for free, and `select` handles cancellation cleanly. This document describes how golars uses them. ## Two kinds of parallelism golars distinguishes two parallelism patterns: 1. **Data parallelism inside a single operator.** A filter over a column with 64 chunks can run all chunks in parallel. The result preserves chunk order. This is what the eager executor uses for every kernel. 2. 
**Pipeline parallelism across operators.** A groupby-agg over a parquet file runs the scan, the hash partition, the partial aggregate, and the final merge as separate stages, each in its own goroutine, communicating via channels. This is what the streaming executor uses. Both patterns compose. A single stage in the streaming executor may itself run data-parallel kernels internally. ## Worker pool A single process-wide worker pool (default size `GOMAXPROCS`, overridable) dispatches chunk-level work for the eager executor. Benefits over spawning goroutines ad hoc: * Predictable concurrency. We cap simultaneous work, which keeps memory bounded. * Cheap cancellation. Closing the pool's context stops all in-flight chunks without leaking goroutines. * Natural place to plug instrumentation (rows processed, time per chunk). The pool lives in `internal/pool`. Its public type is `*Pool` with methods `Submit(ctx, func(ctx) error)` and `Wait()`. Internally it uses a bounded `chan func()` fed by a fixed set of worker goroutines. We do not use `sync.WaitGroup` directly in user-facing code. `errgroup.Group` from `golang.org/x/sync/errgroup` handles error propagation and cancellation in the usual Go idiom. ## Morsel-driven streaming The streaming executor borrows the morsel-driven model from HyPer and DuckDB and adapts it to Go. The implementation lives in the `stream` package; `lazy.WithStreaming()` compiles streaming-friendly plan prefixes into a `stream.Pipeline`. Primitives currently available: * `Source`, `Stage`, `Sink` as plain function types. * `DataFrameSource` slices an in-memory frame into morsels. A row-partitioned parallel source is the next delivery. * `FilterStage`, `ProjectStage`, `WithColumnsStage`, `RenameStage`, `DropStage`, `SliceStage` (state-carrying, tracks the running row counter across morsels). * `ParallelMapStage` is the combinator that turns a per-morsel function into an order-preserving fan-out. 
It tags morsels on ingress with a sequence number, dispatches to a small worker pool, and uses a reorder buffer on egress so downstream stages see input order regardless of worker count. `ParallelFilterStage`, `ParallelProjectStage`, `ParallelWithColumnsStage` are thin wrappers. * `CollectSink` concatenates morsel chunks column-wise into a single DataFrame. Hybrid execution: `lazy.Collect(ctx, lazy.WithStreaming())` runs the longest streaming-friendly prefix through the pipeline executor. When a blocker node (Sort, Aggregate, Join) appears above that prefix, the upstream DataFrame is materialized first and the blocker runs eagerly. This keeps the surface simple (one `Collect` call) while letting streaming pay off for scan + filter + project chains and not regress for blockers. **Morsel.** A morsel is an `arrow.Record` with a bounded number of rows (default 64K, tuned by workload). All inter-stage communication is in morsels. **Channel back-pressure.** Every inter-stage channel has a small buffer (default 4). When a downstream stage is slow, its input channel fills, blocking the producer. This is the back-pressure mechanism, and it costs no allocation. **Exchange.** Partition-parallel operators (hash groupby, hash join) insert a hash exchange: a stage that takes one input channel and fans out to N output channels, one per partition, by hashing the keys. Downstream workers own their partition end-to-end. **Pipeline breakers.** Sort and groupby-agg with no suitable partition key are pipeline breakers. They buffer, compute, and then emit. The streaming executor tracks breakers explicitly so planners can decide when spilling is necessary. **Cancellation.** Every stage takes a `context.Context`. When the context cancels (user abort, downstream error, sink closed), stages drain their input channels, release references to any morsels they hold, and return. ## Why goroutines over thread pools polars uses Rayon, which is ideal for CPU-bound data-parallel loops in Rust. 
Go's goroutine scheduler does the same job for our workloads with less ceremony: * Goroutines are cheap enough that we do not need a join-on-completion primitive. We just launch them. * Channels are zero-allocation queues (for values up to channel element size). We do not reinvent bounded queues. * `select` handles timeouts, cancellation, and multi-source reads in one construct. No event loop needed. The tradeoff is that Go does not give us work-stealing the way Rayon does. In practice this matters less than it seems because our work units (morsels) are large enough that simple FIFO dispatch from the pool's channel is rarely the bottleneck. If profiling proves otherwise, we introduce work-stealing at the pool level without changing the operator interface. ## Determinism Parallel execution must not change results. Rules: * Chunked operations preserve chunk order in the output. A parallel filter produces chunks in the same order as input, even if they finished out of order. * Aggregation results are deterministic modulo the associativity of the aggregation. Sum, min, max, count are exact. Mean, std, var use a numerically stable parallel algorithm (Welford-Chan) and are reproducible. * Sort is stable. * Row order in a DataFrame is preserved across operations unless an operation explicitly reorders (sort, join, groupby). Tests cover determinism explicitly: the same input run N times produces byte-identical output. ## Scheduling heuristics The planner picks chunk size and partition count based on: * Estimated row count of the input * Number of group-by keys (for partition count) * `GOMAXPROCS` If estimates are unavailable (lazy input from a scan with no statistics), we default to a morsel size of 64K rows and a partition count of `2 * GOMAXPROCS`. These defaults are tunable per-session. ## Profiling and observability Every operator records rows in, rows out, bytes processed, and wall time. 
These metrics are available via `df.Profile()` on eager calls and `lf.Profile()` on lazy calls. Under the hood, the pool exposes `expvar` counters so long-running programs can scrape them. # golars scripting language (.glr) `.glr` files are a tiny, line-oriented language for pipeline-style DataFrame work. Designed to feel like "your REPL session in a file", nothing more. Every REPL command you know is also a script statement. ```bash # trades-daily.glr load data/trades.csv as trades load data/symbols.csv as symbols use trades filter volume > 100 groupby symbol amount:sum:total join symbols on symbol sort total desc limit 10 show ``` Run it: ```sh golars run trades-daily.glr # one-shot ``` From inside the REPL: ``` golars » .source trades-daily.glr ``` *** ## Grammar ```text program = { statement NL } ; statement = empty | comment | command ; comment = "#" { any-char-until-NL } ; command = [ "." ] identifier { arg } ; arg = identifier | number | string | operator | "as" | "on" ; string = '"' { any-char } '"' ; identifier = ( letter | "_" ) { letter | digit | "_" | "." | "-" | "/" | ":" } ; number = [ "-" ] digit { digit } [ "." digit { digit } ] ; operator = "==" | "!=" | "<=" | ">=" | "<" | ">" | "and" | "or" ; ``` Statements are line-terminated. A trailing `\` (after any trailing whitespace) continues onto the next physical line, useful for long filter predicates: ```bash filter salary > 100000 \ and dept == "eng" \ and tenure_years >= 2 ``` The leading `.` on every command is optional. A `#` inside a `"..."` string is treated as a literal - only unquoted `#` starts a comment. Typos get a `did you mean?` hint from the runner. 
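The quoted-`#` rule means comment stripping has to be quote-aware rather than a plain `strings.Index`. A minimal sketch of that scan (illustrative only, not the actual lexer; the grammar's strings have no escape sequences, so a simple quote toggle matches it):

```go
package main

import "fmt"

// stripComment drops an unquoted trailing '#' comment, honouring the
// rule that '#' inside a "..." string is a literal character.
func stripComment(line string) string {
	inString := false
	for i, r := range line {
		switch r {
		case '"':
			inString = !inString
		case '#':
			if !inString {
				return line[:i] // comment starts here
			}
		}
	}
	return line
}

func main() {
	fmt.Println(stripComment(`filter note == "#1 pick" # keep top picks`))
}
```

Line continuation composes with this: join continued physical lines first, then strip the comment from the logical line.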
*** ## Statement reference | Statement | What it does | | ------------------------------------------------------------------------------------- | ---------------------------------------------------------------------------------------------------------------------------------------- | | `load PATH` | Focus a new frame (csv/tsv/parquet/ipc/arrow). | | `load PATH as NAME` | Stage a frame under `NAME` without touching focus. | | `use NAME` | Switch focus to a clone of `NAME`. `NAME` stays staged so repeated `use` branches off the same base; prior focus is discarded. | | `stash NAME` | Materialize the focus and save it under `NAME`; focus continues with the snapshot. | | `frames` | List loaded frames. The focused one is marked `*`. | | `drop_frame NAME` | Release `NAME` from the registry. | | `save PATH` | Materialize the focused pipeline and write to disk. | | `select COL [, COL...]` | Project columns (lazy). | | `drop COL [, COL...]` | Drop columns (lazy). | | `filter PRED` | Add a filter predicate (lazy). See [predicate grammar](#predicate-grammar). | | `sort COL [asc\|desc]` | Sort by one column (lazy). | | `limit N` | Keep the first N rows (lazy). | | `head [N]` | Collect and print first N rows (default 10). | | `tail [N]` | Collect and print last N rows. | | `show` | Alias for `head 10`. | | `ishow` / `browse` | Open the focused pipeline in the interactive browse TUI on the alt screen. Quit with `q` to return to the REPL. | | `schema` | Print column names + dtypes. | | `describe` | count/null\_count/mean/std/min/25%/50%/75%/max per column. | | `groupby KEYS AGG [AGG...]` | Group + aggregate. KEYS is comma-separated. AGG is `col:op[:alias]`; op is `sum`/`mean`/`min`/`max`/`count`/`null_count`/`first`/`last`. | | `join PATH\|NAME on KEY [TYPE]` | Join the focus with a file or named frame. TYPE ∈ `inner`/`left`/`cross` (default inner). | | `explain` | Print logical plan, optimiser trace, optimised plan. 
| | `explain_tree` / `tree` | Same three-section report rendered as a box-drawn tree. | | `graph` / `show_graph` | Styled plan tree with lipgloss colour coding. | | `mermaid` | Emit the plan as a Mermaid flowchart. Pipe into `mmdc` for PNG/SVG. | | `collect` | Materialize the pipeline back into the focused frame's source. | | `reset` | Discard the lazy pipeline; keep the source. | | `source PATH` | Run another `.glr` file inline. | | `reverse` | Reverse row order of the focus. | | `sample N [seed]` | Uniform-random sample of N rows without replacement. | | `shuffle [seed]` | Randomly reorder every row. | | `unique` | Drop duplicate rows across every column. | | `null_count` | Per-column null count as a 1-row frame. | | `glimpse [N]` | Compact peek at the first N rows (default 5). | | `size` | Estimated Arrow byte size of the pipeline result. | | `timing` | Toggle per-statement timing. | | `info` | Runtime info: Go version, heap, uptime, row counts. | | `clear` | Clear the screen. | | `exit` / `quit` | Quit the REPL (no-op in `golars run` mode). | | `cast COL TYPE` | Cast COL to `i64`/`i32`/`f64`/`f32`/`bool`/`str`. | | `fill_null VALUE` | Replace nulls across compatible columns with VALUE. | | `drop_null [COL...]` | Drop rows with nulls in any (or the listed) columns. | | `rename OLD as NEW` | Rename one column. | | `sum COL` / `mean COL` / `min COL` / `max COL` / `median COL` / `std COL` | Print one scalar for COL. | | `write PATH` | Alias for `save`. Supported sinks: `.csv`, `.tsv`, `.parquet`, `.arrow`, `.ipc`, `.json`, `.ndjson`/`.jsonl`. | | `with_row_index NAME [OFFSET]` | Prepend an int64 row index. | | `sum_horizontal OUT [COL...]` | Append a row-wise sum column (nulls ignored). | | `mean_horizontal OUT [COL...]` | Append a row-wise mean column. | | `min_horizontal OUT [COL...]` | Row-wise min. | | `max_horizontal OUT [COL...]` | Row-wise max. | | `all_horizontal OUT [COL...]` | Row-wise boolean AND. | | `any_horizontal OUT [COL...]` | Row-wise boolean OR. 
| `sum_all` / `mean_all` / `min_all` / `max_all` / `std_all` / `var_all` / `median_all` | One-row per-column aggregate over every numeric column. |
| `count_all` / `null_count_all` | One-row per-column (null-)count. |
| `scan_csv PATH [as NAME]` | Register a lazy CSV scan (push-down friendly). |
| `scan_parquet PATH [as NAME]` | Lazy Parquet scan. |
| `scan_ipc PATH [as NAME]` | Lazy Arrow IPC scan. |
| `scan_json PATH [as NAME]` | Lazy JSON scan. |
| `scan_ndjson PATH [as NAME]` | Lazy NDJSON scan. |
| `scan_auto PATH [as NAME]` | Infer the scan format from the file extension. |
| `fill_nan VALUE` | Replace NaN with VALUE in every float column. |
| `forward_fill [LIMIT]` | Forward-fill nulls per column (LIMIT=0 is unlimited). Leading nulls stay null. |
| `backward_fill [LIMIT]` | Backward-fill nulls per column. Trailing nulls stay null. |
| `top_k K COL` | Keep K rows with the largest values in COL. |
| `bottom_k K COL` | Keep K rows with the smallest values in COL. |
| `transpose [HEADER_COL] [PREFIX]` | Transpose the focus (numeric/bool columns). |
| `unpivot IDS [VALS]` | Wide-to-long reshape. IDS/VALS are comma-separated lists. |
| `partition_by KEYS` | Print a summary of per-key-combination row counts. |
| `skew COL` / `kurtosis COL` | Scalar skewness / excess kurtosis. |
| `approx_n_unique COL` | HyperLogLog estimate of distinct-value count. |
| `corr COL1 COL2` / `cov COL1 COL2` | Pair-wise Pearson corr / sample cov. |
| `pivot INDEX ON VALUES [AGG]` | Long-to-wide pivot. AGG: first/sum/mean/min/max/count. |
| `pwd` / `ls [PATH]` / `cd [PATH]` | Working-directory helpers. |
| `with NAME = EXPR` | Append a derived column. EXPR is a real expression: arithmetic, comparisons, logical ops, string methods, aggregates, rolling windows. |
| `unnest COL` | Project fields of a struct-typed column as top-level columns. |
| `explode COL` | Fan out each element of a list-typed column into its own row. |
| `upsample COL EVERY` | Interpolate a sorted timestamp column at `ns`/`us`/`ms`/`s`/`m`/`h`/`d`/`w` intervals. |

### String operations

String-column ops are reachable via `col(x).str.<op>()` on the expression API, and inside `.filter` via keyword operators that desugar to the same Exprs. Every op below is backed by the series kernel of the same name; the expression layer is a thin dispatch on the function name.

| Filter keyword     | Expr method                    | Notes                                        |
| ------------------ | ------------------------------ | -------------------------------------------- |
| `contains "sub"`   | `col(x).str.contains("sub")`   | Literal substring, no regex                  |
| `starts_with "p"`  | `col(x).str.starts_with("p")`  | Byte-prefix                                  |
| `ends_with "s"`    | `col(x).str.ends_with("s")`    | Byte-suffix                                  |
| `like "%pat%"`     | `col(x).str.like("%pat%")`     | SQL wildcards: `%` any, `_` one, `\\` escape |
| `not_like "%pat%"` | `col(x).str.not_like("%pat%")` | Negation fused into the kernel               |

Non-predicate string ops available on Expr (no filter-grammar sugar, used via `.select`, `.with_column`, or in aggregations): `str.to_lower`, `str.to_upper`, `str.trim`, `str.strip_prefix(p)`, `str.strip_suffix(s)`, `str.replace(o, n)`, `str.replace_all(o, n)`, `str.len_bytes`, `str.len_chars`, `str.count_matches(s)`, `str.find(s)`, `str.head(n)`, `str.tail(n)`, `str.slice(start, length)`, `str.contains_regex(pat)`.

### Expression grammar (for `with`)

`with NAME = EXPR` accepts a full expression tree, not just the filter DSL. Bare identifiers resolve to column references; string methods hang off `.str.*`; aggregates, rolling, and EWM are reachable via fluent method calls; `col(...)`, `lit(...)`, `sum(...)`, and `coalesce(...)` are available as top-level functions.
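As an illustration of that resolution order, here is a minimal, hypothetical classifier. The names and rules here are assumptions for exposition only, not the actual parser:

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// classify sketches one plausible resolution order for a bare token in a
// `with` expression: known top-level functions first, then quoted strings,
// then numeric literals, and finally column references as the fallback.
func classify(tok string) string {
	topLevel := map[string]bool{"col": true, "lit": true, "sum": true, "coalesce": true}
	switch {
	case topLevel[tok]:
		return "top-level function"
	case strings.HasPrefix(tok, `"`):
		return "string literal"
	default:
		if _, err := strconv.ParseFloat(tok, 64); err == nil {
			return "numeric literal"
		}
		return "column reference"
	}
}

func main() {
	for _, tok := range []string{"amount", "1000", `"AAPL"`, "coalesce"} {
		fmt.Printf("%s → %s\n", tok, classify(tok))
	}
}
```

The fallback-to-column-reference rule is what lets `amount > 1000` read naturally without a `col(...)` wrapper.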
Examples:

```bash
with bulk = amount > 1000
with name_upper = name.str.upper()
with revenue = price * qty
with trend = amount.rolling_mean(7, 1)
with score = coalesce(primary, backup).str.trim()
with ewm = value.ewm_mean(0.3)
```

Supported string methods: `upper`, `lower`, `trim`, `reverse`, `contains`, `contains_regex`, `starts_with`, `ends_with`, `like`, `not_like`, `replace`, `replace_all`, `strip_prefix`, `strip_suffix`, `len_bytes`, `len_chars`, `slice`, `head`, `tail`, `find`, `count_matches`, `split_exact`.

Supported aggregates: `sum`, `mean`, `min`, `max`, `count`, `null_count`, `first`, `last`, `median`, `std`, `var`, `quantile`, `skew`, `kurtosis`, `n_unique`, `approx_n_unique`.

Supported shape ops: `abs`, `neg`, `not`, `round`, `floor`, `ceil`, `sqrt`, `exp`, `log`, `log2`, `log10`, `sign`, `reverse`, `shift`, `diff`, `cum_sum`, `cum_min`, `cum_max`, `fill_null`, `alias`, `cast`, `between`, `forward_fill`, `backward_fill`.

Supported windows: `rolling_sum`, `rolling_mean`, `rolling_min`, `rolling_max`, `rolling_std`, `rolling_var`, `ewm_mean`, `ewm_std`, `ewm_var`.

### Predicate grammar

For `filter`:

```
col op value [and|or col op value]...
```

* No parentheses; evaluation is strictly left to right.
* Ops: `==`, `!=`, `<`, `<=`, `>`, `>=`, `is_null`, `is_not_null`.
* Values: integers, floats, double-quoted strings, `true`, `false`.

Examples:

```bash
filter age >= 21 and salary > 50000
filter symbol == "AAPL"
filter is_active and created_at > 1704067200000000
filter note is_null
```

***

## Multi-source workflows

Scripts regularly need N frames. The `as NAME` / `use NAME` / `.frames` trio is the whole story; there is no hidden namespace:

```bash
# Stage every input up front. None of these promote themselves to
# focus, so we can read them in any order.
load data/trades.csv as trades
load data/symbols.csv as symbols
load data/users.csv as users

# Work on one, stash it, work on the next.
use trades
filter volume > 100
groupby user_id amount:sum:total_bought
stash trade_totals

# `use` is non-consuming: trade_totals stays staged, and so does the
# original trades frame; we could `use trades` again to branch off a
# different filter.
use users
filter region == "US"
join trade_totals on user_id
join symbols on symbol
sort total_bought desc
show
```

`stash` is the "save into a variable" move: it materializes whatever lazy pipeline is on the focus and parks a copy under `NAME`, so a later `use NAME` gives you that snapshot. The focus itself keeps going from the snapshot, so the idiomatic branching pattern is:

```bash
load data/trades.csv
filter volume > 100
stash base
filter side == "buy"
stash buys
use base
filter side == "sell"
stash sells
use buys
join sells on symbol
```

When `.join` sees a name that exists in the frame registry, it consumes that frame (keeping it in the registry for reuse) instead of treating the argument as a path. Paths win only when no frame matches.

### Anonymous `load PATH`

The short form `load PATH` (no `as`) is equivalent to `use NAME` where `NAME` is empty. It's the "single-frame script" ergonomic:

```bash
load data/trades.csv
filter volume > 100
show
```

No registry, no juggling: just pipe.

***

## Transpile to Go

`golars transpile SCRIPT.glr [-o OUT.go] [--package NAME]` emits a standalone Go program that reproduces the pipeline through the lazy API. The generated source is piped through `go/format` and has its imports pruned by `go/ast`, so the output is always gofmt'd and free of unused imports.
```sh
golars transpile examples/script/pipeline.glr -o main.go --package main
go run main.go
```

Mapping:

| `.glr`                                   | Go                                                        |
| ---------------------------------------- | --------------------------------------------------------- |
| `load PATH`                              | `golars.ReadCSV(PATH, csv.WithNullValues(""))`            |
| `load PATH as NAME`                      | stashes the LazyFrame in an internal map for later `use`  |
| `use NAME`                               | retargets `focus` onto the stashed frame                  |
| `filter EXPR`                            | `.Filter(EXPR)`                                           |
| `with NAME = EXPR`                       | `.WithColumns(EXPR.Alias("NAME"))`                        |
| `groupby KEY COL:OP[:ALIAS] ...`         | `.GroupBy(KEY).Agg(...)`                                  |
| `sort COL [desc]`                        | `.Sort(COL, desc)`                                        |
| `limit N`                                | `.Limit(N)`                                               |
| `head N`                                 | `.Limit(N)` + `.Collect` + `fmt.Println`                  |
| `join NAME on KEY [inner\|left\|cross]`  | `.Join(other, []string{KEY}, dataframe.InnerJoin)`        |
| `show`                                   | `.Head(10).Collect` + `fmt.Println`                       |
| `save PATH`                              | `golars.WriteCSV(df, PATH)` (or matching writer)          |

Commands without a direct lazy equivalent (`.tree`, `.graph`, `.mermaid`, `.reset`, `.frames`) emit a `TODO(glr):` comment so the file still compiles. If no `show` / `head` / `collect` / `save` appears in the script, transpile adds an implicit final `Collect` + `fmt.Println` so the generated binary prints something instead of exiting silently.

See [`examples/script/transpiled/`](https://github.com/Gaurav-Gosain/golars/tree/main/examples/script/transpiled) for a transpiled copy of every bundled `.glr` example.

***

## Interop with code

Anything that implements `script.Executor` can host the language. `cmd/golars` is the reference, but the package ships a generic runner:

```go
import "github.com/Gaurav-Gosain/golars/script"

r := script.Runner{
	Exec:          script.ExecutorFunc(func(line string) error { /* … */ return nil }),
	Trace:         func(line string) { fmt.Println(">", line) },
	ContinueOnErr: true,
	ErrOut:        os.Stderr,
}
if err := r.RunFile("pipeline.glr"); err != nil {
	log.Fatal(err)
}
```

* `Trace` receives every normalised statement just before execution.
* `ContinueOnErr` + `ErrOut` emit errors inline and keep running.
* `script.Normalize(raw)` is exported so third parties can apply the same parsing rules (comment stripping, leading `.` insertion).

***

## Editor support

Tree-sitter grammar + highlight queries live at [`editors/tree-sitter-golars/`](../editors/tree-sitter-golars/). Install notes for Neovim (`nvim-treesitter`) and VS Code are in that directory's README.

### LSP

*golars-lsp in Neovim: inlay hints showing the frame shape after every `.glr` statement.*

[`cmd/golars-lsp`](../cmd/golars-lsp/) is a minimal Language Server that ships:

* **Inline completions** for commands, staged-frame names, file paths, and column names read from loaded CSV files.
* **Inlay hints** showing each pipeline step's output shape: `→ 5 rows × 3 cols` appears at the end of every shape-changing statement. Row counts propagate as upper bounds: `limit N` clamps to `N`, left joins preserve the left side's count, and filters and inner joins mark rows `?`.
* **Hover docs** with a signature and long description on any command token.
* **Diagnostics** for unknown commands and files that don't resolve.

### `# ^?` probe: live table previews

Drop `# ^?` on its own line anywhere in a script and the Neovim plugin renders the focused frame's current table as virtual text below the comment (via a `golars --preview` subprocess). This is the scripting equivalent of Twoslash/Quokka probes: a live peek at the data at that pipeline position:

```bash
load data/trades.csv
filter volume > 100
sort amount desc
limit 5
# ^?
```

The preview updates on save and on debounced text changes; configure it via `require("golars").setup({ preview_cmd = { "/path/to/golars" }, preview_rows = 20, preview_timeout_ms = 3000 })`. Set `preview = false` to disable.
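Returning to `script.Normalize` from the interop section: a rough approximation of the documented rules (comment stripping, blank-line removal, leading `.` insertion) might look like the following sketch. This is an illustration only; the exported function's exact behaviour may differ:

```go
package main

import (
	"fmt"
	"strings"
)

// normalize approximates script.Normalize's documented rules: strip
// trailing '#' comments and blank lines, then prefix each remaining
// statement with "." if it does not already have one.
func normalize(raw string) []string {
	var stmts []string
	for _, line := range strings.Split(raw, "\n") {
		if i := strings.Index(line, "#"); i >= 0 {
			line = line[:i] // drop the comment (probes like "# ^?" too)
		}
		line = strings.TrimSpace(line)
		if line == "" {
			continue
		}
		if !strings.HasPrefix(line, ".") {
			line = "." + line
		}
		stmts = append(stmts, line)
	}
	return stmts
}

func main() {
	fmt.Println(normalize("load data.csv  # comment\n\nfilter x > 1"))
}
```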
### `golars --preview`

Invoke the preview pipeline from any editor, or manually:

```sh
golars --preview path/to/script.glr
golars --preview path/to/script.glr --preview-rows 25
```

Runs the script silently (no banner, no trace, no success chrome) and prints exactly one rendered table: the focused pipeline's head. Exit code 0 on success, non-zero on script error (message on stderr).

***

## What this language is NOT

* No variables beyond the named-frame registry. If you need branching or reusable expressions, write a Go program that drives `script.Runner` with your own logic.
* No control flow (no `if`, no loops). The idiom for conditional runs is shell scripting around `golars run`, or a Go host with an `Executor` that dispatches.
* No expression language beyond the filter predicate DSL and groupby agg spec. Polars-style `pl.col("a") + pl.col("b")` is a `cmd/golars` feature we might add later, but the base language stays small.

The design target is "drop a day of REPL work into a file and have it run again tomorrow." Everything else is out of scope.

# SQL frontend

`golars.sql` exposes a subset SQL frontend that compiles into the same lazy plan as the Go API. Run queries from the shell, from the REPL, or programmatically.

## From the CLI

```sh
golars sql "SELECT dept, SUM(amount) AS total FROM sales GROUP BY dept ORDER BY total DESC" sales.csv
```

Each file becomes a table whose name is the filename stem.
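The stem derivation is the standard basename-minus-extension move. A sketch (the helper name `tableName` is hypothetical, for illustration):

```go
package main

import (
	"fmt"
	"path/filepath"
	"strings"
)

// tableName derives the registered table name from a file path:
// the basename with its extension removed (the "filename stem").
func tableName(path string) string {
	base := filepath.Base(path)
	return strings.TrimSuffix(base, filepath.Ext(base))
}

func main() {
	fmt.Println(tableName("data/sales.csv")) // sales
}
```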
Multiple files register as separate tables, one per file, so a single invocation can query any of them:

```sh
golars sql "SELECT symbol, name FROM symbols" trades.csv symbols.csv
```

## From Go

```go
import "github.com/Gaurav-Gosain/golars/sql"

session := sql.NewSession()
defer session.Close()

session.Register("people", df)
out, err := session.Query(ctx, "SELECT name FROM people WHERE age > 25")
if err != nil {
	log.Fatal(err)
}
defer out.Release()
```

## Grammar

```
SELECT [DISTINCT] projection_list
FROM table_name
[WHERE predicate]
[GROUP BY col_list]
[ORDER BY col_list [ASC|DESC]]
[LIMIT n]
```

`projection_list`:

* `*`
* `col[, col...]` (optionally each with `AS name`)
* `agg(col)[, ...]` (agg: `SUM`, `MIN`, `MAX`, `AVG`, `MEAN`, `COUNT`, `FIRST`, `LAST`)
* any mix of the above when there is a `GROUP BY`

`predicate`:

* `col OP value [AND|OR col OP value]...`
* `OP` is one of `=`, `!=`, `<`, `<=`, `>`, `>=`

`value`:

* integer literal (`42`)
* float literal (`3.14`)
* single-quoted or double-quoted string (`'us'`, `"ops"`)
* `true`, `false`

## Limitations

* No JOIN clause yet (use the `df.Join(...)` API directly).
* No window functions (use `Expr.Over(keys...)` in Go).
* No subqueries.
* No arithmetic in SELECT expressions (use `WithColumns` in Go).

All of these come free with golars' Go API; the SQL frontend focuses on what ad-hoc shell queries need.

## From the MCP server

`golars-mcp` exposes the same SQL compiler as a tool named `sql`. Host LLMs invoke it with `{query, files}` arguments. See [MCP integration](/docs/mcp).