Storage, Quotas, and Isolation
Datasets are a separate pipeline inside Loguro.
They do not share queues, storage paths, or DuckDB query instances with logs.
Storage path
Dataset records are written as Parquet files under a project- and dataset-scoped path:
/datasets/project_<internal>/dataset_<datasetId>/split/YYYY-MM-DD/L0/*.parquet
Compaction moves files through the same tier model:
/datasets/project_<internal>/dataset_<datasetId>/split/YYYY-MM-DD/L0/*.parquet
/datasets/project_<internal>/dataset_<datasetId>/split/YYYY-MM-DD/L1/*.parquet
/datasets/project_<internal>/dataset_<datasetId>/split/YYYY-MM-DD/L2/*.parquet
The dataset-specific query API reads only:
/datasets/project_<internal>/dataset_<datasetId>/split
It does not read /logs or other datasets in the same project.
Legacy default-dataset files can still exist under:
/datasets/project_<internal>/split
Default dataset queries can fall back to this legacy path for backwards compatibility.
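The layout above can be sketched as a small path builder. This is only an illustration: the function names and the `project_ref` argument (standing in for the `<internal>` placeholder) are hypothetical, and the legacy helper returns just the root, because the directory structure beneath the legacy path is not described here.

```python
from datetime import date

DATASETS_ROOT = "/datasets"
TIERS = ("L0", "L1", "L2")  # tier directory names taken from the paths above


def split_root(project_ref: str, dataset_id: int) -> str:
    """Root that the dataset-specific query API reads from."""
    return f"{DATASETS_ROOT}/project_{project_ref}/dataset_{dataset_id}/split"


def legacy_split_root(project_ref: str) -> str:
    """Legacy default-dataset root (no dataset_<datasetId> segment)."""
    return f"{DATASETS_ROOT}/project_{project_ref}/split"


def tier_glob(project_ref: str, dataset_id: int, day: date, tier: str) -> str:
    """Glob for one day's Parquet files in a given compaction tier."""
    if tier not in TIERS:
        raise ValueError(f"unknown tier: {tier}")
    return f"{split_root(project_ref, dataset_id)}/{day:%Y-%m-%d}/{tier}/*.parquet"
```

For example, `tier_glob("abc", 13, date(2024, 1, 5), "L1")` yields `/datasets/project_abc/dataset_13/split/2024-01-05/L1/*.parquet`.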
Physical columns
Every Parquet row includes standard columns:
| Column | Meaning |
|---|---|
| id | generated row id |
| internal project reference | Loguro-owned project reference |
| dataset_id | Loguro dataset id, nullable for legacy rows |
| timestamp | event timestamp from the payload |
| ingested_at | when Loguro processed the record |
| context | JSON object stored as JSON text |
| __traceId | request trace id |
Declared schema fields become typed Parquet columns.
Example schema:
{
"fields": {
"country": "string",
"amount": "number",
"active": "boolean",
"signup_at": "timestamp"
}
}
Produces columns:
id
internal project reference
dataset_id
timestamp
ingested_at
context
__traceId
country
amount
active
signup_at
Queues
Dataset ingest uses dataset-only queue payloads. Messages include:
{
"datasetId": 13,
"records": []
}
Logs and web events use different queue namespaces.
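As a rough sketch of the payload shape shown above, a producer could serialise messages like this. The helper name is hypothetical, and the sketch emits only the two fields listed here; real messages may carry more.

```python
import json
from typing import Any, Dict, List


def build_dataset_message(dataset_id: int, records: List[Dict[str, Any]]) -> str:
    """Serialise a dataset-only queue payload with the two documented fields."""
    # bool is a subclass of int in Python, so reject it explicitly.
    if not isinstance(dataset_id, int) or isinstance(dataset_id, bool):
        raise TypeError("dataset_id must be an integer")
    return json.dumps({"datasetId": dataset_id, "records": list(records)})
```

Calling `build_dataset_message(13, [])` produces a JSON string equivalent to the example payload.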
Schema cache
The ingest service reads schema contracts from Redis.
Dataset-specific key:
datasets:schema:project_<internal>:dataset_<datasetId>
Legacy default key:
datasets:schema:project_<internal>
Registering a dataset schema through the Loguro API writes the schema to PostgreSQL and syncs the dataset-specific Redis key for ingest.
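The two key formats can be expressed as small helpers. This sketch makes several assumptions: the helper names are hypothetical, `project_ref` stands in for the `<internal>` placeholder, a plain mapping stands in for Redis, and the cached value is assumed to be JSON text.

```python
import json
from typing import Any, Dict, Mapping, Optional


def dataset_schema_key(project_ref: str, dataset_id: int) -> str:
    """Dataset-specific schema cache key."""
    return f"datasets:schema:project_{project_ref}:dataset_{dataset_id}"


def legacy_schema_key(project_ref: str) -> str:
    """Legacy default-dataset schema cache key."""
    return f"datasets:schema:project_{project_ref}"


def load_schema(
    store: Mapping[str, str], project_ref: str, dataset_id: int
) -> Optional[Dict[str, Any]]:
    """Look up the dataset-specific key; return None when nothing is cached."""
    raw = store.get(dataset_schema_key(project_ref, dataset_id))
    return json.loads(raw) if raw is not None else None
```

The lookup deliberately does not fall back to the legacy key, since whether ingest does so is not stated here.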
Quotas
Datasets have two relevant quota types:
| Quota | Meaning |
|---|---|
| maxDatasets | how many dataset metadata records a user can create across projects |
| maxDatasetsPerMonth | how many dataset records can be ingested per month |
Dataset record usage is metered separately:
quota_limit:datasets:<userId>
usage_total:datasets:<userId>:YYYY-MM
usage_delta:datasets:<userId>:YYYY-MM
Quota values:
| Value | Meaning |
|---|---|
| 0 | no dataset record quota configured |
| positive integer | max records for the month |
| -1 | unlimited |
Dataset record quota does not consume log quota or web analytics quota.
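The quota values and monthly keys above could be interpreted along these lines. The function names are hypothetical. How ingest treats a value of 0 is not stated here, so the sketch only classifies it and leaves that policy to the caller.

```python
from datetime import date
from typing import Optional, Union


def usage_total_key(user_id: Union[int, str], month: date) -> str:
    """Monthly usage counter key in the documented YYYY-MM form."""
    return f"usage_total:datasets:{user_id}:{month:%Y-%m}"


def describe_quota(limit: int) -> str:
    """Map a stored quota value to the meaning given in the table above."""
    if limit == -1:
        return "unlimited"
    if limit == 0:
        return "not configured"
    if limit > 0:
        return "limited"
    raise ValueError(f"unexpected quota value: {limit}")


def remaining_records(limit: int, used: int) -> Optional[int]:
    """Records left this month; None when unlimited. Only positive limits are counted."""
    kind = describe_quota(limit)
    if kind == "unlimited":
        return None
    if kind == "not configured":
        raise ValueError("no dataset record quota configured")
    return max(limit - used, 0)
```

Returning `None` for the unlimited case keeps it distinct from a limit that has been fully used, which returns `0`.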
Query isolation
Dataset queries use a separate DuckDB service and repository from logs.
That means:
- dataset queries only scan /datasets/project_<id>/dataset_<id>/split
- dataset queries do not scan sibling datasets by default
- log queries do not see dataset records
- web analytics queries do not see dataset records
- dataset memory/cache/temp settings can be tuned separately
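One way to picture the scan boundary described in this list is a path guard that accepts only files under a single dataset's split root. This is an illustrative sketch, not how Loguro's DuckDB service enforces isolation; the function names and the `project_ref` argument are hypothetical.

```python
from pathlib import PurePosixPath


def dataset_scope(project_ref: str, dataset_id: int) -> PurePosixPath:
    """Split root that a dataset query is allowed to scan."""
    return (
        PurePosixPath("/datasets")
        / f"project_{project_ref}"
        / f"dataset_{dataset_id}"
        / "split"
    )


def in_scope(path: str, project_ref: str, dataset_id: int) -> bool:
    """True only if path is the dataset's split root or lies beneath it."""
    candidate = PurePosixPath(path)
    if ".." in candidate.parts:
        return False  # refuse anything that could climb out of the root
    root = dataset_scope(project_ref, dataset_id)
    return candidate == root or root in candidate.parents
```

Comparing whole path components rather than string prefixes means `dataset_130` is not mistaken for a child of `dataset_13`.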
Operational notes
The dataset worker re-validates schema before writing. This protects storage if a bad message is pushed to Redis directly.
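A re-validation step of this kind might look roughly like the following, using the four field types from the example schema. This is a sketch under stated assumptions, not the worker's actual logic: missing fields are treated as allowed, and timestamps are assumed to be ISO 8601 strings that `datetime.fromisoformat` accepts.

```python
from datetime import datetime
from typing import Any, Callable, Dict, List


def _is_timestamp(value: Any) -> bool:
    # Assumption: timestamps arrive as ISO 8601 strings.
    if not isinstance(value, str):
        return False
    try:
        datetime.fromisoformat(value)
    except ValueError:
        return False
    return True


CHECKS: Dict[str, Callable[[Any], bool]] = {
    "string": lambda v: isinstance(v, str),
    # bool is a subclass of int in Python, so exclude it from "number".
    "number": lambda v: isinstance(v, (int, float)) and not isinstance(v, bool),
    "boolean": lambda v: isinstance(v, bool),
    "timestamp": _is_timestamp,
}


def validate_record(schema: Dict[str, Any], record: Dict[str, Any]) -> List[str]:
    """Return a list of type errors; an empty list means the record passes."""
    errors: List[str] = []
    for name, type_name in schema["fields"].items():
        check = CHECKS.get(type_name)
        if check is None:
            errors.append(f"{name}: unknown type {type_name!r}")
        elif name in record and not check(record[name]):
            errors.append(f"{name}: expected {type_name}")
    return errors
```

A worker built this way would skip writing any record for which `validate_record` returns a non-empty list.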
Deleting dataset metadata removes the schema metadata and Redis schema key, but it does not delete existing Parquet files from storage.
If a record cannot be processed after retries, it moves to the dataset failed path/queue used by the ingest worker. Failed dataset messages do not enter the logs failed queue.