Storage, Quotas, and Isolation

Datasets are a separate pipeline inside Loguro.

They do not share queues, storage paths, or DuckDB query instances with logs.

Storage path

Dataset records are written as Parquet files under a project- and dataset-scoped path:

/datasets/project_<internal>/dataset_<datasetId>/split/YYYY-MM-DD/L0/*.parquet

Compaction moves files through the same tier model, with tiers L0, L1, and L2:

/datasets/project_<internal>/dataset_<datasetId>/split/YYYY-MM-DD/L0/*.parquet
/datasets/project_<internal>/dataset_<datasetId>/split/YYYY-MM-DD/L1/*.parquet
/datasets/project_<internal>/dataset_<datasetId>/split/YYYY-MM-DD/L2/*.parquet
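The path layout above can be expressed as a small helper. This is a minimal sketch, not Loguro's implementation: the function name `partition_dir`, the `project_ref` argument (standing in for the `<internal>` project reference), and the validation of tier names are assumptions made for illustration.

```python
from datetime import date

# Tier directory names as they appear in the documented paths.
TIERS = ("L0", "L1", "L2")


def partition_dir(project_ref: str, dataset_id: int, day: date, tier: str = "L0") -> str:
    """Build the directory that holds one day's Parquet files for one tier."""
    if tier not in TIERS:
        raise ValueError(f"unknown tier: {tier}")
    return (
        f"/datasets/project_{project_ref}/dataset_{dataset_id}"
        f"/split/{day.isoformat()}/{tier}"  # isoformat() yields YYYY-MM-DD
    )
```

For example, `partition_dir("abc", 13, date(2024, 1, 5), "L1")` yields a path ending in `split/2024-01-05/L1`, where `abc` is a placeholder project reference.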

The dataset-specific query API reads only:

/datasets/project_<internal>/dataset_<datasetId>/split

It does not read /logs or other datasets in the same project.

Legacy default-dataset files can still exist under:

/datasets/project_<internal>/split

Default dataset queries can fall back to this legacy path for backwards compatibility.
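One way to picture the fallback is as an ordered list of candidate roots. The sketch below is hypothetical: `candidate_roots`, `project_ref`, and the `is_default` flag are illustrative names, and whether the legacy path is read in addition to or instead of the dataset-specific path is not stated here, so that decision is left to the caller.

```python
from typing import List


def candidate_roots(project_ref: str, dataset_id: int, is_default: bool) -> List[str]:
    """Return query roots in order of preference for one dataset."""
    roots = [f"/datasets/project_{project_ref}/dataset_{dataset_id}/split"]
    if is_default:
        # Only the default dataset may see the legacy, pre-dataset layout.
        roots.append(f"/datasets/project_{project_ref}/split")
    return roots
```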

Physical columns

Every Parquet row includes standard columns:

Column	Meaning
`id`	generated row id
internal project reference	Loguro-owned project reference
`dataset_id`	Loguro dataset id, nullable for legacy rows
`timestamp`	event timestamp from the payload
`ingested_at`	when Loguro processed the record
`context`	JSON object stored as JSON text
`__traceId`	request trace id

Declared schema fields become typed Parquet columns.

Example schema:

{
  "fields": {
    "country": "string",
    "amount": "number",
    "active": "boolean",
    "signup_at": "timestamp"
  }
}

Produces columns:

id
internal project reference
dataset_id
timestamp
ingested_at
context
__traceId
country
amount
active
signup_at
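The column list above can be derived mechanically from the schema. The following is a sketch under assumptions: `parquet_columns` is a hypothetical helper, and because the name of the internal project reference column is not given here, the caller passes it in as `project_ref_column`.

```python
import json
from typing import List


def parquet_columns(schema_json: str, project_ref_column: str) -> List[str]:
    """Standard columns first, then declared schema fields in declaration order."""
    fields = json.loads(schema_json)["fields"]
    standard = [
        "id",
        project_ref_column,  # placeholder: the real column name is not documented here
        "dataset_id",
        "timestamp",
        "ingested_at",
        "context",
        "__traceId",
    ]
    return standard + list(fields)
```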

Queues

Dataset ingest uses dataset-only queue payloads. Messages include:

{
  "datasetId": 13,
  "records": []
}
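A producer might serialize such a message as sketched below. `build_dataset_message` is a hypothetical helper, and real messages may carry fields beyond the two shown in the example above.

```python
import json
from typing import Any, Dict, List


def build_dataset_message(dataset_id: int, records: List[Dict[str, Any]]) -> str:
    """Serialize a dataset-only queue payload with the two documented fields."""
    # bool is a subclass of int in Python, so reject it explicitly.
    if isinstance(dataset_id, bool) or not isinstance(dataset_id, int):
        raise TypeError("dataset_id must be an integer")
    return json.dumps({"datasetId": dataset_id, "records": records})
```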

Logs and web events use different queue namespaces.

Schema cache

The ingest service reads schema contracts from Redis.

Dataset-specific key:

datasets:schema:project_<internal>:dataset_<datasetId>

Legacy default key:

datasets:schema:project_<internal>

Registering a dataset schema through the Loguro API writes the schema to PostgreSQL and syncs the dataset-specific Redis key for ingest.
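Both key shapes can be produced by one function. This is an illustrative sketch: `schema_cache_key` and `project_ref` are assumed names, and the function only formats keys; it does not talk to Redis.

```python
from typing import Optional


def schema_cache_key(project_ref: str, dataset_id: Optional[int] = None) -> str:
    """Return the dataset-specific key, or the legacy default key when no id is given."""
    base = f"datasets:schema:project_{project_ref}"
    if dataset_id is None:
        return base
    return f"{base}:dataset_{dataset_id}"
```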

Quotas

Datasets have two relevant quota types:

Quota	Meaning
`maxDatasets`	how many dataset metadata records a user can create across projects
`maxDatasetsPerMonth`	how many dataset records can be ingested per month

Dataset record usage is metered separately:

quota_limit:datasets:<userId>
usage_total:datasets:<userId>:YYYY-MM
usage_delta:datasets:<userId>:YYYY-MM

Quota values:

Value	Meaning
`0`	no dataset record quota configured
positive integer	max records for the month
`-1`	unlimited

Dataset record quota does not consume log quota or web analytics quota.
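The quota values and monthly keys above could be interpreted as in the sketch below. Several parts are assumptions: the names `QuotaDecision`, `usage_total_key`, and `check_quota` are hypothetical, a batch that would cross the limit is treated as exceeding it as a whole, and what the service does when no quota is configured (`0`) is left to the caller.

```python
from enum import Enum


class QuotaDecision(Enum):
    NOT_CONFIGURED = "not_configured"  # limit == 0
    ALLOWED = "allowed"
    EXCEEDED = "exceeded"


def usage_total_key(user_id: int, year: int, month: int) -> str:
    """Format the monthly usage key, e.g. usage_total:datasets:42:2024-03."""
    return f"usage_total:datasets:{user_id}:{year:04d}-{month:02d}"


def check_quota(limit: int, used: int, incoming: int) -> QuotaDecision:
    """Classify an ingest of `incoming` records against the monthly limit."""
    if limit == -1:
        return QuotaDecision.ALLOWED  # unlimited
    if limit == 0:
        return QuotaDecision.NOT_CONFIGURED
    if limit < -1:
        raise ValueError(f"invalid quota value: {limit}")
    if used + incoming <= limit:
        return QuotaDecision.ALLOWED
    return QuotaDecision.EXCEEDED
```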

Query isolation

Dataset queries use a separate DuckDB service and repository from logs.

That means:

dataset queries only scan /datasets/project_<internal>/dataset_<datasetId>/split
dataset queries do not scan sibling datasets by default
log queries do not see dataset records
web analytics queries do not see dataset records
dataset memory/cache/temp settings can be tuned separately
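The scan-scope rule in the list above can be illustrated with a path check. This is a hypothetical sketch, not the query service's actual guard: `in_scope` and `project_ref` are assumed names. It compares whole path components rather than string prefixes, so `dataset_130` is not mistaken for `dataset_13`; it does not resolve `..` segments.

```python
from pathlib import PurePosixPath


def dataset_root(project_ref: str, dataset_id: int) -> PurePosixPath:
    """Root directory a dataset query is allowed to scan."""
    return PurePosixPath(f"/datasets/project_{project_ref}/dataset_{dataset_id}/split")


def in_scope(file_path: str, project_ref: str, dataset_id: int) -> bool:
    """True only if file_path lies somewhere below the dataset's own root."""
    root = dataset_root(project_ref, dataset_id)
    # parents holds every ancestor directory, compared component by component.
    return root in PurePosixPath(file_path).parents
```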

Operational notes

The dataset worker re-validates schema before writing. This protects storage if a bad message is pushed to Redis directly.
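A worker-side re-validation step might look like the sketch below. It is illustrative only: `validate_record` is a hypothetical name, and the rules it applies are assumptions (declared fields are optional, `timestamp` values are ISO 8601 strings, and `number` excludes booleans). The actual validation rules may differ.

```python
from datetime import datetime
from typing import Any, Callable, Dict, List


def _is_number(value: Any) -> bool:
    # bool is a subclass of int, so exclude it explicitly.
    return isinstance(value, (int, float)) and not isinstance(value, bool)


def _is_timestamp(value: Any) -> bool:
    if not isinstance(value, str):
        return False
    try:
        # Older Python versions reject a trailing "Z", so normalize it first.
        datetime.fromisoformat(value.replace("Z", "+00:00"))
        return True
    except ValueError:
        return False


CHECKS: Dict[str, Callable[[Any], bool]] = {
    "string": lambda v: isinstance(v, str),
    "number": _is_number,
    "boolean": lambda v: isinstance(v, bool),
    "timestamp": _is_timestamp,
}


def validate_record(record: Dict[str, Any], fields: Dict[str, str]) -> List[str]:
    """Return a list of type errors; an empty list means the record passes."""
    errors = []
    for name, type_name in fields.items():
        if name not in record:
            continue  # assumption: declared fields may be absent
        check = CHECKS.get(type_name)
        if check is None:
            errors.append(f"{name}: unknown type {type_name}")
        elif not check(record[name]):
            errors.append(f"{name}: expected {type_name}")
    return errors
```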

Deleting dataset metadata removes the schema metadata and Redis schema key, but it does not delete existing Parquet files from storage.

If a record cannot be processed after retries, it moves to the dataset-specific failed path or queue used by the ingest worker. Failed dataset messages do not enter the logs failed queue.
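The retry-then-fail routing can be sketched as follows. Everything here is an assumption for illustration: the function name `process_with_retries`, the default of three attempts, and the in-memory list standing in for the dataset failed queue.

```python
from typing import Any, Callable, Dict, List


def process_with_retries(
    message: Dict[str, Any],
    handler: Callable[[Dict[str, Any]], None],
    dataset_failed: List[Dict[str, Any]],
    max_attempts: int = 3,
) -> bool:
    """Try the handler up to max_attempts times; on final failure, route the
    message to the dataset failed queue (never the logs failed queue)."""
    last_error = None
    for _ in range(max_attempts):
        try:
            handler(message)
            return True
        except Exception as exc:  # a real worker would catch narrower errors
            last_error = exc
    dataset_failed.append(
        {"message": message, "error": str(last_error), "attempts": max_attempts}
    )
    return False
```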