Reference

Every flag, marker, fixture, CLI command, and public function. The CLI and Python API below are rendered live from the source; the pytest surface (flags, marker, fixture, blob schema) is curated here. For the narrative versions see Quickstart, Choosing a metric, Grouping by dims, Compare two runs, and Catch regressions in CI.

pytest command-line flags

The plugin adds these to any pytest run (alongside pytest-benchmark's own flags). This table is generated from the plugin's own --help text, so it can't drift from the code:

Flag	Default	What
`--benchmark-memory`	off	Record peak memory (a memray pass) for every benchmark() call, not just the benchmark_memory fixture — no test changes. Off by default; the fixture is always measured, with or without this flag.
`--benchmark-memory-repeats=N`	—	Force a fixed number of memray passes per benchmark, suite-wide; the reported peak is the min across them. Overridden per-test by @pytest.mark.benchmem(repeats=N). Default: adaptive — run passes until the min floor settles (≥2, cap 10). Set this for a fixed, reproducible count (e.g. CI gating against a saved baseline).
`--benchmark-memory-warmup=N`	`1`	Untracked dry-runs of the action before measuring, suite-wide, to shed one-time costs (lazy imports, first-touch caches) so the measured passes aren't inflated by cold start. Overridden per-test by @pytest.mark.benchmem(warmup=N). Default: 1; set 0 to disable.
`--benchmark-memory-max-time=SECONDS`	—	Wall-clock budget for the adaptive memory passes (the analogue of --benchmark-max-time): caps how long adaptive sampling spends per benchmark. Ignored when --benchmark-memory-repeats forces a fixed count. Default: no time bound — the pass cap alone bounds it.
`--benchmark-memory-compare=REF`	off	Compare this run's peak memory against a prior saved run (a pytest-benchmark storage ref like 0001, or the latest if no value is given); folds base and delta-peak columns into the table.
`--benchmark-memory-compare-fail=FIELD:THRESHOLD`	—	Fail the session on a memory regression, e.g. peak:10%, peak:5MiB, allocations:5% (repeatable). Fields: peak, allocated, allocations, rss (rss needs isolated runs). Implies --benchmark-memory-compare.
`--benchmark-memory-profile=DIR`	—	Save the memray profile (.bin) into DIR — render later with `memray flamegraph` (or tree/summary). Scope follows the gate: WITH --benchmark-memory-compare-fail only the regressing ids, otherwise EVERY measured benchmark. Off by default (disk cost).
`--benchmark-memory-profile-native`	off	Capture native (C/C++/Rust) stacks in the kept profile, so the flamegraph attributes memory inside extension code (polars/numpy/solver bindings) instead of one opaque `??? at ???` bucket. Only affects --benchmark-memory-profile runs; opt-in (slower, bigger .bin). Per-test override: @pytest.mark.benchmem(profile_native=True). Off by default.
`--benchmark-memory-table`	`combined`	Layout for the memory metrics: combined (default) folds them into pytest-benchmark's timing table; split prints a separate memory table.
`--benchmark-memory-columns=peak,allocated,allocations,rss`	—	Which memory metrics the table shows, comma-separated and in order: peak, allocated, allocations, rss (rss only shows for isolated runs). Default: peak only.
`--benchmark-memory-stats=min,mean,max`	—	With repeats > 1, the stats each shown metric spreads into: min, mean, max, median, stddev. A single pass stays one column. Default: min,mean,max.

Timing regressions still use pytest-benchmark's own --benchmark-compare / --benchmark-compare-fail; the --benchmark-memory-compare* flags are the memory mirror. Their baseline comes from pytest-benchmark's storage (.benchmarks/) — save one first with --benchmark-save=NAME or --benchmark-autosave, or the gate finds nothing and passes. See Gating without separate files.
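As a rough illustration of how a `FIELD:THRESHOLD` spec such as `peak:10%` or `peak:5MiB` can be read, the sketch below splits a spec and checks a base/new pair against it. `parse_threshold` and `regressed` are hypothetical helpers, not the plugin's code; the binary units are the ones listed for the marker's size strings, and the strict greater-than comparison is an assumption.

```python
import re

# Binary size units, as listed for the marker's size strings (B/KiB/MiB/GiB).
_UNITS = {"B": 1, "KiB": 1024, "MiB": 1024**2, "GiB": 1024**3}


def parse_threshold(spec: str) -> tuple[str, str, float]:
    """Split 'FIELD:THRESHOLD' into (field, kind, value). Hypothetical helper."""
    field, sep, raw = spec.partition(":")
    if not sep or not field or not raw:
        raise ValueError(f"expected FIELD:THRESHOLD, got {spec!r}")
    if raw.endswith("%"):
        return field, "percent", float(raw[:-1])
    match = re.fullmatch(r"(\d+(?:\.\d+)?)(B|KiB|MiB|GiB)?", raw)
    if match is None:
        raise ValueError(f"unrecognised threshold {raw!r}")
    number, unit = match.groups()
    return field, "absolute", float(number) * _UNITS[unit or "B"]


def regressed(base: float, new: float, kind: str, value: float) -> bool:
    """True if `new` grew past the threshold relative to `base` (assumed strict >)."""
    delta = new - base
    if kind == "percent":
        # Cross-multiplied so a zero baseline never divides by zero.
        return delta * 100 > value * base
    return delta > value
```

A percent threshold scales with the baseline, while an absolute one stays fixed, which is why the two are often combined on the same field.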

The `benchmem` marker

@pytest.mark.benchmem(repeats=3)
def test_build(benchmark_memory):
    ...

Kwarg	Default	What
`repeats`	auto	force a fixed `N` memray passes for this test (default: adaptive — see below). Every pass is kept (the blob stores the whole series); the headline `peak` is the minimum across them, and `--stat` reports any other. Overrides the suite-wide `--benchmark-memory-repeats`.
`warmup`	`1`	untracked dry-runs of the action before measuring, to shed one-time costs (lazy imports, first-touch caches). `0` disables. Ignored under `isolate=True` — see the note below. Overrides the suite-wide `--benchmark-memory-warmup`.
`isolate`	`False`	run each memray pass in a fresh process that calls the action once, and also record whole-process resident memory as the `rss` metric — the physical/OOM-relevant peak memray's logical heap can't give. Per-test only (no suite-wide flag): `rss` is a whole-job capacity number, meaningful only for build+operate benchmarks, so you mark the specific ones. Needs a top-level, picklable benchmarked function (see the whole-job warning below).
`calls`	`1`	how many invocations make up one measured pass, inside a single tracker. `peak` becomes the high-water across the whole sequence and `allocated` / `allocations` its totals. Pair with `isolate=True` to measure buildup — see below. Per-test only, like `isolate`: it redefines what the number means, so there's no suite-wide flag.
`profile_native`	`False`	on the `--benchmark-memory-profile` path, capture native (C/C++/Rust) stacks in the kept `.bin`, so a flamegraph attributes extension-code memory (polars/numpy/solver bindings) instead of one opaque `??? at ???` bucket. Opt-in (slower, bigger `.bin`). Overrides the suite-wide `--benchmark-memory-profile-native`.
`max_peak`	—	fail the test if the headline `peak` exceeds this absolute ceiling. A size string (`"100MiB"`, units `B`/`KiB`/`MiB`/`GiB`) or a bare int (bytes).
`max_allocated`	—	as `max_peak`, on `allocated` (total bytes).
`max_allocations`	—	as above, on the `allocations` count — a bare number (no unit).

Isolated rss measures the whole job — build the state inside the callable

The rss metric (isolate=True) runs the action in a fresh, empty process. Two consequences:

The build must happen inside the measured callable, and the callable must be a top-level, picklable function. The child starts with nothing, so it must construct whatever it operates on; and spawn serializes the call with standard pickle, so a lambda or closure is rejected (we don't use cloudpickle) — pass a module-level function plus lightweight args.

from functools import partial

# ✅ ships only the spec (~bytes); the child builds + writes cold = the whole job's RSS
benchmark_memory(build_and_write, spec, n)

# ❌ a lambda/closure is rejected: std pickle can't serialize it (even build-inside)
benchmark_memory(lambda: write(build(spec, n)))

# ❌ a top-level partial over a *pre-built* model pickles fine, but ships the model and
#    measures *deserializing* it, not building it; the build never re-runs in the child
model = build(spec, n)
benchmark_memory(partial(write, model))

You can't isolate a single sub-phase. Since the child must build before it can operate, isolated rss is a build-plus-operate capacity number by construction, never a per-phase figure (e.g. write-only). For per-phase memory, use the in-process peak metric, which can measure a write given an already-built model. So the rule is two-part: use a top-level function (no lambdas), and don't pass heavy pre-built state — build it inside.
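Both halves of that rule can be checked with the standard library alone. The sketch below is not part of pytest-benchmem: `is_picklable` is a hypothetical helper, and `spec` and `model` are stand-ins for a lightweight job description and heavy pre-built state.

```python
import pickle
from functools import partial


def is_picklable(call) -> bool:
    """True if the standard pickle module can serialize `call`."""
    try:
        pickle.dumps(call)
    except (pickle.PicklingError, AttributeError, TypeError):
        return False
    return True


spec = {"rows": 1_000, "cols": 8}   # lightweight description of the job
model = list(range(200_000))        # stand-in for heavy pre-built state

# A lambda has no importable name, so standard pickle refuses it.
lambda_ok = is_picklable(lambda: sum(model))

# A partial over an importable function pickles, but its arguments travel too.
spec_payload = len(pickle.dumps(partial(dict, spec)))    # ships only the spec
model_payload = len(pickle.dumps(partial(sum, model)))   # ships the whole model
```

Comparing `spec_payload` with `model_payload` shows why passing pre-built state is costly: everything bound into the call is serialized and rebuilt in the child before the action runs.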

warmup doesn't apply to isolated passes — they're one cold call each

An isolated pass is a fresh process that calls the action exactly once; both its peak and its rss describe that call. warmup is an in-process knob only — ru_maxrss is a monotonic whole-process high-water that can't be reset, so a warmup call inside the child would fold its own peak into rss. (repeats still applies: N repeats = N fresh processes.)

So isolated peak is a cold first-call number — one-time costs inside the action land in it — and rss carries the child's interpreter + memray floor (~25-40 MiB) plus any setup state. Both are capacity figures; compare them across runs of the same harness, not against the in-process peak for the same test.
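The monotonic behaviour of the resident high-water can be observed with the standard `resource` module (Unix only). This is a sketch for illustration, not the plugin's own reader; `process_high_water_bytes` is a hypothetical name.

```python
import resource
import sys


def process_high_water_bytes() -> int:
    """Whole-process resident high-water via getrusage (Unix only)."""
    raw = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    # ru_maxrss is reported in kilobytes on Linux and in bytes on macOS.
    return raw if sys.platform == "darwin" else raw * 1024


before = process_high_water_bytes()
buffer = bytearray(b"\x01") * (32 * 1024 * 1024)   # touch about 32 MiB
during = process_high_water_bytes()
del buffer                                          # release it again
after = process_high_water_bytes()
```

Freeing `buffer` does not lower the reading, so `after` is never below `during`. Any call made earlier in the same process, such as a warmup, stays folded into the figure.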

Measuring buildup — `calls`

repeats and calls both run the action more than once, but they answer different questions:

	What it does	Answers
`repeats=N`	N separate passes, each measured on its own	"how noisy is one call?" — the headline is the min across them
`calls=N`	N invocations inside one tracked pass	"what does N of these cost together?"

calls is how you catch memory a workload accumulates — a cache that grows, a leak — which a single call can't show and repeats can't either (those are independent passes; under isolate they're independent processes):

@pytest.mark.benchmem(isolate=True, calls=50)
def test_buildup(benchmark_memory):
    benchmark_memory(handle_request, payload)

rss then reports what the process holds after 50 requests. Compare against a calls=1 variant of the same benchmark and the difference is the buildup.

Why rss and not peak

memray only sees allocations made after its tracker starts, so in-process peak measures demand within the tracked window — it can't see state a previous call retained. The OS-level rss can, which is why buildup wants isolate=True. (Within a single calls=N pass, peak does see the accumulation, since all N calls are inside one tracker.)
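The parenthetical above can be tried out with the standard library's `tracemalloc` standing in for memray (a swapped-in tracker, not what pytest-benchmem uses). `handle_request` here is a toy handler that retains every payload, and `peak_over_calls` is a hypothetical helper that runs N calls inside one tracking window.

```python
import tracemalloc

_cache = []  # state that survives between calls: the "buildup"


def handle_request(payload: bytes) -> int:
    """Toy handler that retains a copy of every payload it sees."""
    _cache.append(bytearray(payload))
    return len(_cache)


def peak_over_calls(calls: int, payload: bytes) -> int:
    """Peak traced bytes across `calls` invocations inside one tracking window."""
    _cache.clear()
    tracemalloc.start()
    try:
        for _ in range(calls):
            handle_request(payload)
        _current, peak = tracemalloc.get_traced_memory()
    finally:
        tracemalloc.stop()
    return peak


payload = b"x" * 10_000
single = peak_over_calls(1, payload)
many = peak_over_calls(50, payload)
buildup = many - single   # memory retained by the 49 extra calls
```

Because all 50 calls share one window, the retained payloads add up in `many`. Run as 50 separate windows, each would only report its own call.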

Absolute ceilings — `max_peak` / `max_allocated` / `max_allocations`

@pytest.mark.benchmem(max_peak="100MiB", max_allocations=5000)
def test_build(benchmark_memory):
    benchmark_memory(build_model, 1000)

A baseline-free guardrail: the test fails if the measured metric exceeds the ceiling (test_build: peak 117 MiB exceeds max_peak 100 MiB). Thresholds are absolute only — there's no saved run to take a percent of; for relative gating against a prior run use --benchmark-memory-compare-fail or benchmem compare --fail-on. A ceiling is a worst-case budget, so with repeats > 1 (including adaptive sampling) the gate reads the worst pass — not the headline min — and fails if any pass breaches it; the two coincide for a single pass. The ceiling is enforced wherever memory is measured — the benchmark_memory fixture and the --benchmark-memory patch — but a plain benchmark() call without --benchmark-memory measures no memory, so the marker is a no-op there.
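The difference between the headline and what the ceiling reads is easy to see on a per-pass series. The numbers below are illustrative only, and `breaches_ceiling` is a hypothetical helper rather than the plugin's implementation.

```python
MIB = 1024**2


def breaches_ceiling(per_pass_values: list[int], ceiling: int) -> bool:
    """A ceiling is a worst-case budget, so compare it with the worst pass."""
    return max(per_pass_values) > ceiling


peaks = [96 * MIB, 99 * MIB, 117 * MIB]   # illustrative per-pass peaks
headline = min(peaks)                      # what the table reports as `peak`
failed = breaches_ceiling(peaks, 100 * MIB)
```

Here the headline (96 MiB) sits under the 100 MiB ceiling, yet the gate fails because one pass reached 117 MiB.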

Scope: the benchmarked action only

This gates the benchmarked action only (the isolated call pytest-benchmem measures), not the whole test. For a whole-test limit or leak check, use pytest-memray's limit_memory / limit_leaks markers; see the README's "With pytest-memray".

How many passes? By default pytest-benchmem samples adaptively — after an untracked warmup run, it runs the memray pass until the min floor settles (≥2 passes; capped at 10, or a --benchmark-memory-max-time budget). Deterministic code settles in ~3 passes; noisy code runs more. Set repeats=N (marker) or --benchmark-memory-repeats=N (suite) to force a fixed, reproducible count — what CI gating against a saved baseline wants. Full rationale and the noisy-workload guidance are in the guide: Repeats & adaptive sampling.
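The stopping rule can be sketched as a loop that tracks how many consecutive passes failed to lower the minimum. This is a minimal sketch under stated assumptions, not the plugin's code: `sample_adaptively` is a hypothetical name, the floor of 2 and cap of 10 come from the text above, and `patience=2` is an assumed value chosen for illustration.

```python
import time
from typing import Callable, Optional


def sample_adaptively(
    measure_once: Callable[[], int],
    min_passes: int = 2,
    max_passes: int = 10,
    patience: int = 2,                 # assumed; the real default is not stated here
    max_time: Optional[float] = None,
) -> list[int]:
    """Run passes until the minimum stops moving, within the given bounds."""
    samples: list[int] = []
    stale = 0                          # consecutive passes without a new minimum
    started = time.monotonic()
    while len(samples) < max_passes:
        value = measure_once()
        if not samples or value < min(samples):
            stale = 0
        else:
            stale += 1
        samples.append(value)
        if len(samples) >= min_passes and stale >= patience:
            break
        if max_time is not None and time.monotonic() - started >= max_time:
            break
    return samples


steady = sample_adaptively(iter([800, 800, 800, 800, 800]).__next__)
noisy_values = iter([900, 870, 860, 860, 850, 850, 850, 850, 850, 850])
noisy = sample_adaptively(noisy_values.__next__)
```

With these assumed settings a constant series stops after three passes, a series that keeps finding new lows runs longer, and one that never settles stops at the cap.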

The `benchmark_memory` fixture

Depends on pytest-benchmark's benchmark fixture; measures peak in a separate untimed pass, then times via pytest-benchmark.

Order — memory first (cold), then timing

Every call form runs the memray pass first, then pytest-benchmark's timing (calibration + all rounds). This matters: timing can run the function thousands of times, which grows and fragments the allocator's arenas, so measuring memory after timing would report the warm plateau, not the fresh-process floor the headline min is meant to be. Memory-first measures the cold cost (the warmup pass still sheds the one-time cold-start within it); timing then runs cleanly, with no memray hooks active. This holds for __call__, pedantic, and the --benchmark-memory patch alike. The standalone measure_peak / measure_memory have no timing phase at all; on measure_memory, warmup=0 skips the warmup, and on either, repeats=N forces a fixed count.

Call form

Measures memory, then times function(*args, **kwargs):

benchmark_memory(sorted, data)

Pedantic form

Explicit control, like pytest-benchmark's pedantic plus a memory pass:

benchmark_memory.pedantic(target, args=(), kwargs=None, setup=None,
                          rounds=1, warmup_rounds=0, iterations=1)

setup — a callable run untracked before each measured call; if it returns (args, kwargs), those supply the call's arguments. Used for both the timed rounds and each (adaptive) memory sample — one setup rebuilds fresh state for both — so a stateful action's memory samples stay independent. The same applies to benchmark.pedantic(setup=…) under --benchmark-memory: no extra changes.
rounds, warmup_rounds, iterations — as in pytest-benchmark.

Mostly memory, little timing? There's no memory-only switch — the entry rides pytest-benchmark's timing. To trim it: --benchmark-min-rounds=1 --benchmark-max-time=0 (no test changes), or pedantic(rounds=1, warmup_rounds=0) for a single call. For pure memory outside pytest, use measure_peak / measure_memory.

Attributes (available after a call):

Attribute	What
`extra_info`	pytest-benchmark's per-benchmark dict. Set scalars here to attach analysis dims; the memory blob lands here under the `benchmem` key.
`peak_bytes`	peak memory (bytes) from the last call, or `None` before any call.
`result`	the full `MemoryResult` from the last call, or `None`.

The `extra_info.benchmem` blob

Each measured benchmark stores this dict under extra_info["benchmem"]: three core flat per-repeat series, plus rss_bytes for isolated runs, with one entry per memray pass. Every reported number (headline peak = min, any --stat) derives from these on read:

Key	What
`peak_bytes`	per-repeat high-water of live bytes — the `peak` metric (headline = min)
`allocations`	per-repeat allocation count — the `allocations` metric
`total_bytes`	per-repeat total bytes allocated — the `allocated` metric (churn `peak` hides)
`rss_bytes`	per-repeat whole-process resident high-water (`VmHWM` on Linux, `ru_maxrss` on macOS) — the `rss` metric. Only present under `isolate=True` (each pass a fresh process making one cold call); absent otherwise. Includes the child's interpreter + memray floor, so it's a capacity figure, not a delta.

{"peak_bytes": [800000, 805000], "allocations": [12, 12], "total_bytes": [800000, 805000]}

See Choosing a metric for when to reach for each, and --stat for distributions.
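To make "derives from these on read" concrete, here is a minimal sketch that computes the headline figures from a blob, following the rules described on this page (headline peak is the min, allocations and allocated come from that same min-peak pass). `summarize` is a hypothetical helper; picking the first pass on a tie is an assumption.

```python
def summarize(blob: dict) -> dict:
    """Derive headline numbers from the per-repeat series of one blob."""
    peaks = blob["peak_bytes"]
    # Index of the min-peak pass; the first one wins on a tie (assumed).
    best = min(range(len(peaks)), key=peaks.__getitem__)
    rss = blob.get("rss_bytes")   # only present for isolated runs
    return {
        "repeats": len(peaks),
        "peak": peaks[best],
        "peak_max": max(peaks),
        "allocations": blob["allocations"][best],
        "allocated": blob["total_bytes"][best],
        "rss": min(rss) if rss else None,
    }


blob = {
    "peak_bytes": [800000, 805000],
    "allocations": [12, 12],
    "total_bytes": [800000, 805000],
}
summary = summarize(blob)
```

Taking the counts from the same pass as the minimum peak keeps the three numbers a coherent snapshot of one run.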

CLI — `benchmem`

Ships with the base install (sweep, flamegraph); compare and plot need pytest-benchmem[plot]. The full command tree and every option, captured live from the typer app as it actually renders in a terminal:

benchmem --help
Usage: benchmem [OPTIONS] COMMAND [ARGS]...

pytest-benchmem — plot and compare benchmark runs.

╭─ Options ────────────────────────────────────────────────────────────────────────────────────────╮
│ --install-completion Install completion for the current shell. │
│ --show-completion Show completion for the current shell, to copy it or customize the │
│ installation. │
│ --help Show this message and exit. │
╰──────────────────────────────────────────────────────────────────────────────────────────────────╯
╭─ Commands ───────────────────────────────────────────────────────────────────────────────────────╮
│ plot Render an interactive plotly view from one or more pytest-benchmark runs. │
│ compare Print a per-id table for one run, or compare two or more (and optionally gate CI). │
│ sweep Run a benchmark suite across several installed versions of a package. │
│ flamegraph Render a kept memory profile in one step — resolve the ``.bin`` for a test and run │
│ memray. │
╰──────────────────────────────────────────────────────────────────────────────────────────────────╯

benchmem plot --help
Usage: benchmem plot [OPTIONS] RUNS...

Render an interactive plotly view from one or more pytest-benchmark runs.

╭─ Arguments ──────────────────────────────────────────────────────────────────────────────────────╮
│ * runs... PATH pytest-benchmark JSON file(s). [required] │
╰──────────────────────────────────────────────────────────────────────────────────────────────────╯
╭─ Options ────────────────────────────────────────────────────────────────────────────────────────╮
│ --columns [time|peak|allocated|allocation Metric to plot: time | peak | │
│ s|rss] allocated | allocations | rss │
│ (rss = isolated runs only). One │
│ per figure (a plot has a single │
│ value axis) — same flag as │
│ `compare`; the spread shows as │
│ whiskers via --band. │
│ [default: time] │
│ --view TEXT compare | scatter | sweep | │
│ scaling (default: by count). │
│ --facet TEXT Dim to facet by. │
│ --pivot TEXT Comparison axis for --view │
│ compare/scatter: fold a single │
│ run along this dim instead of │
│ across run-files (param:NAME or │
│ a bare extra_info name); its │
│ values become the compared │
│ series. Like --group-by but it │
│ sets what's *compared*, not how │
│ rows cluster. Mutually │
│ exclusive with multiple runs. │
│ --x TEXT scaling: dim for the x-axis. │
│ --color TEXT scaling: dim to colour series │
│ by (default: run label, else │
│ inferred). │
│ --clip FLOAT Clamp the colour scale. │
│ --where TEXT Filter rows by dim: KEY=VALUE │
│ (repeatable, AND-combined). │
│ --free-axes [x|y|both] Free facet axes: x | y | both │
│ (needs --facet). │
│ --band [auto|minmax|none] scaling: spread whiskers on │
│ memory metrics — auto | minmax │
│ | none. │
│ [default: auto] │
│ --log-log --linear scaling: log-scale both axes │
│ (the default when the data is │
│ positive). Per-axis │
│ --log-x/--log-y override; │
│ --linear forces both linear. │
│ --log-x --linear-x scaling: force the x-axis log │
│ or linear. │
│ --log-y --linear-y scaling: force the y-axis log │
│ or linear. │
│ --y-zero --no-y-zero scaling: anchor a linear y-axis │
│ at 0 (auto: on whenever y is │
│ linear). │
│ --label -l TEXT Series label per run, in order │
│ (repeat). Default: stem. │
│ --output -o PATH Out file; .html is interactive, │
│ .png/.svg/.pdf/.jpg/.webp │
│ export a static image. │
│ --open --no-open [default: no-open] │
│ --help Show this message and exit. │
╰──────────────────────────────────────────────────────────────────────────────────────────────────╯

benchmem compare --help
Usage: benchmem compare [OPTIONS] RUNS...

Print a per-id table for one run, or compare two or more (and optionally gate CI).

╭─ Arguments ──────────────────────────────────────────────────────────────────────────────────────╮
│ * runs... PATH One or more pytest-benchmark runs, oldest → newest. One prints a plain │
│ table; two or more compare (a sweep is N). │
│ [required] │
╰──────────────────────────────────────────────────────────────────────────────────────────────────╯
╭─ Options ────────────────────────────────────────────────────────────────────────────────────────╮
│ --columns TEXT Comma list of metrics: time | peak | allocated | allocations | rss (rss │
│ = isolated runs only; e.g. peak or time,peak,rss). Default: time,peak. │
│ Each is shown across every --stat; a metric absent from every run is │
│ dropped. extra:NAME adds the numeric extra_info value NAME as a plain │
│ stat-less label column (e.g. extra:variables). │
│ --group-by TEXT Group rows into sub-tables: fullname | name | func | group | module | │
│ class | param:NAME (comma-composable). │
│ [default: fullname] │
│ --stat TEXT Which stat column(s) per metric: min | max | mean | median | stddev, or │
│ all (the default) for the full spread side by side. │
│ --sort TEXT Row order: name (id) | value (largest in the last run) | change. │
│ [default: name] │
│ --pivot TEXT Comparison axis: fold a single run along this dim instead of across │
│ run-files — param:NAME or a bare extra_info name. Rows differing only in │
│ it pair up and its values become the compared series. Like --group-by │
│ but it sets what's *compared*, not how rows cluster. Mutually exclusive │
│ with multiple runs. │
│ --format TEXT Output format: table (rich terminal, default) | md (GitHub-flavored │
│ markdown to stdout — redirect to a file or pipe into a PR comment / │
│ $GITHUB_STEP_SUMMARY). │
│ [default: table] │
│ --diff Collapsed baseline view: one row per benchmark × metric, showing the │
│ first run's value then a coloured signed Δ% per later run vs that │
│ baseline (flat in metric/run count — metrics go on the row axis). Needs │
│ at least two series (runs, or one run with --pivot over a multi-value │
│ dim) and defaults --stat to min. │
│ --csv PATH Also write the raw (unscaled) comparison to this CSV file. │
│ --fail-on TEXT Exit non-zero on a regression of the first run vs the last (or, with │
│ --pivot, the first dim value vs the last). FIELD:THRESHOLD, repeatable — │
│ e.g. --fail-on peak:10% --fail-on peak:5MiB --fail-on rss:10% (rss gates │
│ only isolated runs). │
│ --help Show this message and exit. │
╰──────────────────────────────────────────────────────────────────────────────────────────────────╯

benchmem sweep --help
Usage: benchmem sweep [OPTIONS] PACKAGE VERSIONS...

Run a benchmark suite across several installed versions of a package.

Provisions one fresh uv venv per version — with the pytest harness installed
alongside (pytest-benchmem with --memory, pytest-benchmark without) — copies the
suite into the venv's working directory, runs 'pytest <suite> --benchmark-only'
in each writing <out>/<version>.json, then prints the next step. --memory adds
the memory pass; forward any other pytest flag with --pytest-arg, e.g.
benchmem sweep mypkg 1.2.0 1.3.0 --suite benchmarks/ --memory --pytest-arg=-k.
Exits non-zero if any version fails to provision or yields no benchmark data.

╭─ Arguments ──────────────────────────────────────────────────────────────────────────────────────╮
│ * package TEXT Package under test; each plain version installs `<package>==<v>`. │
│ [required] │
│ * versions... TEXT Versions or pip specs to sweep, e.g. 1.2.0 1.3.0 │
│ git+https://github.com/me/pkg@main. │
│ [required] │
╰──────────────────────────────────────────────────────────────────────────────────────────────────╯
╭─ Options ────────────────────────────────────────────────────────────────────────────────────────╮
│ * --suite PATH Benchmark suite (dir or file) to run in each version's │
│ venv. │
│ [required] │
│ --out PATH Directory for the per-version JSON runs. │
│ [default: .benchmarks/sweep] │
│ --memory --no-memory Add --benchmark-memory to each pytest run. │
│ [default: no-memory] │
│ --pytest-arg TEXT Arg forwarded to pytest, one token each, repeatable │
│ (e.g. --pytest-arg=-k). │
│ --pin TEXT Extra pip spec installed alongside (repeatable). │
│ --as-of TEXT YYYY-MM-DD for uv --exclude-newer (reproducible │
│ resolve). │
│ --import-check TEXT Module asserted to resolve to the venv (isolation │
│ preflight). │
│ --copy-dir PATH Directory staged into each venv's cwd (default: the │
│ --suite dir). Point it at the repo root when the suite │
│ reads a conftest or data files outside the suite dir. │
│ --help Show this message and exit. │
╰──────────────────────────────────────────────────────────────────────────────────────────────────╯

benchmem flamegraph --help
Usage: benchmem flamegraph [OPTIONS] PROFILE_DIR [TEST_ID]

Render a kept memory profile in one step — resolve the ``.bin`` for a test and run memray.

Closes the "regressed → *where*?" loop after ``--benchmark-memory-profile``: instead of
finding the right ``.bin`` and remembering the memray subcommand, point at the profile dir
and name the test (or ``--worst peak`` to auto-pick the heaviest). Defaults to an HTML
flamegraph written next to the ``.bin``; ``--report tree|summary|stats`` prints to the
terminal instead.

╭─ Arguments ──────────────────────────────────────────────────────────────────────────────────────╮
│ * profile_dir PATH Directory of kept .bin profiles (--benchmark-memory-profile). │
│ [required] │
│ [test_id] TEXT Test id (exact, or a unique substring) to render; omit with --worst. │
╰──────────────────────────────────────────────────────────────────────────────────────────────────╯
╭─ Options ────────────────────────────────────────────────────────────────────────────────────────╮
│ --worst TEXT Auto-pick the heaviest: peak | allocated | allocations │
│ --report TEXT memray reporter: flamegraph | table | tree | summary | stats │
│ [default: flamegraph] │
│ --native --no-native Require the profile to carry native traces (captured via │
│ --benchmark-memory-profile-native); error if it doesn't. │
│ [default: no-native] │
│ --output -o PATH HTML out path (default: next to the .bin). │
│ --open --no-open Open the rendered HTML. [default: no-open] │
│ --force -f Overwrite an existing render. │
│ --help Show this message and exit. │
╰──────────────────────────────────────────────────────────────────────────────────────────────────╯

plot -o picks the writer from the suffix: .html (the default) is an interactive plotly page, while .png / .svg / .pdf / .jpg / .webp export a static image for a PR comment, README, or docs page. Static export needs kaleido — install pytest-benchmem[plot-static].

Public Python API

Light to import — pytest_benchmem re-exports only the engine and the readers; pytest_benchmem.plotting pulls plotly and pytest_benchmem.sweep shells to uv, so import those submodules directly.

Engine

measure_peak

measure_peak(
    action: Action, repeats: int | None = None
) -> int

Run action() under memray.Tracker and return peak bytes.

The bare one-liner for a REPL or notebook; measure_memory returns the full result (allocation count, spread). repeats behaves as there: None (default) samples adaptively, an int forces a fixed pass count.

Parameters:

Name	Type	Description	Default
`action`	`Action`	The zero-argument callable to measure.	required
`repeats`	`int \| None`	Fixed pass count, or `None` to sample adaptively.	`None`

Returns:

Type	Description
`int`	Peak bytes (the headline `peak` = min across passes, after warmup).

measure_memory

measure_memory(
    action: Action,
    repeats: int | None = None,
    *,
    warmup: int = _DEFAULT_WARMUP,
    isolate: bool = False,
    calls: int = 1,
    max_time: float | None = None,
    min_passes: int = _ADAPTIVE_MIN_PASSES,
    max_passes: int = _ADAPTIVE_MAX_PASSES,
    patience: int = _ADAPTIVE_PATIENCE,
    keep_bin: Path | None = None,
    native: bool = False,
    setup: Action | None = None,
) -> MemoryResult

Run action() under memray.Tracker and return a MemoryResult, one pass per repeat.

warmup untracked dry-runs run first to shed one-time costs; then each measured pass gets a fresh tracker. The headline is the min across passes (see MemoryResult); every pass's Measurement is kept for spread stats.

With isolate=True each measured pass runs in a fresh spawned process that calls the action calls times (once by default), and that child's whole-process resident high-water (see _peak_rss_bytes) is recorded as Measurement.rss_bytes, a physical-memory reading attributable to the action, which an in-process pass can't give. warmup does not apply (see below), and the action (and setup) must be picklable (a top-level callable, not a lambda/closure); keep_bin is ignored in this mode.

calls sets how many invocations make up one measured pass — they run inside a single tracker, so peak becomes the high-water across the whole sequence and the counts its totals. It's how you measure buildup: with isolate=True, calls=N answers "what does a process hold after N of these?", which repeats can't (those are N separate cold processes) and in-process peak can't either (memray only sees allocations made after its tracker starts, so state retained from earlier is invisible to it — the OS-level rss is what catches it).

Two modes, by repeats:

repeats=N (an int) — run exactly N passes. Fixed and reproducible; what CI gating and saved-baseline comparisons want.
repeats=None (default) — sample adaptively: keep running passes until the min stops moving (no new low for patience passes), bounded by min_passes (≥2), max_passes, and an optional max_time budget. Deterministic code settles in a few passes; noisy code runs more.

Parameters:

Name	Type	Description	Default
`action`	`Action`	The zero-argument callable to measure.	required
`repeats`	`int \| None`	Fixed pass count, or `None` to sample adaptively.	`None`
`warmup`	`int`	Untracked dry-runs (`setup` + `action`) before measuring; `0` disables. Ignored when `isolate=True` — every isolated pass is already a cold, fresh process, and a second call in it would inflate that child's resident high-water (a monotonic figure that can't be reset), tying `rss` to the warmup count.	`_DEFAULT_WARMUP`
`calls`	`int`	Invocations per measured pass, inside one tracker (default 1). `peak` is then the high-water across the sequence and `allocations` / `allocated` its totals. Use it with `isolate=True` to measure buildup — memory a workload accumulates across repeated calls (a cache that grows, a leak) shows up in `rss`.	`1`
`isolate`	`bool`	Run each pass in a fresh spawned process that calls the action `calls` times (once by default), and record that child's resident high-water as `Measurement.rss_bytes`. Requires a picklable `action`/`setup`; makes `peak` a cold first-call number rather than the warm steady state.	`False`
`max_time`	`float \| None`	Wall-clock budget (seconds) for adaptive sampling; `None` = no time bound.	`None`
`min_passes`	`int`	Minimum passes when sampling adaptively.	`_ADAPTIVE_MIN_PASSES`
`max_passes`	`int`	Hard ceiling on passes when sampling adaptively.	`_ADAPTIVE_MAX_PASSES`
`patience`	`int`	Stop adaptive sampling after this many consecutive passes with no new min.	`_ADAPTIVE_PATIENCE`
`keep_bin`	`Path \| None`	If set, the first pass's profile `.bin` is retained here (for a later `render_flamegraph`); the rest still go to temp dirs and are discarded.	`None`
`native`	`bool`	Capture native (C/C++/Rust) stacks in the kept `.bin` so a flamegraph can attribute memory inside extension code instead of an opaque native bucket. Costs runtime and disk; only meaningful with `keep_bin` (ignored otherwise).	`False`
`setup`	`Action \| None`	Optional zero-arg callable run untracked before each pass (and each warmup run) — its allocations are not measured. Use it to rebuild fresh state so a stateful `action` (one that caches on or mutates a carried-over object) gives independent samples instead of a decaying/accumulating series. Mirrors pytest-benchmark's `pedantic(setup=...)`. Under `isolate=True` it runs once in each child, before that child's single tracked call; its state is excluded from `peak` but is resident, so it counts towards `rss` — which is a whole-job figure by design.	`None`

Returns:

Type	Description
`MemoryResult`	A `MemoryResult` over every measured pass (warmup runs are not retained).

MemoryResult `dataclass`

A memory measurement across repeats passes, derived from the per-repeat samples.

The per-repeat samples are the single source of truth: that's all the blob stores (the series); everything else is derived from them on read.

The headline peak_bytes is the minimum peak across passes: the fresh-process floor, unbiased by the in-process warm plateau (repeated runs fragment/grow arenas and allocate more) that a central stat would report. allocations / total_bytes come from that same min-peak run (a coherent snapshot); peak_bytes_max is the worst peak, so the spread is visible. A warm-plateau / steady-state read is available via the mean / median --stat. A single pass collapses all of these to its own values.

repeats `property`

repeats: int

How many passes were measured.

representative `property`

representative: Measurement

The min-peak run — the one the headline peak/allocations/total_bytes come from.

peak_bytes `property`

peak_bytes: int

The headline peak — the minimum high-water across passes (the fresh-process floor).

peak_bytes_max `property`

peak_bytes_max: int

The worst peak across repeats (equals peak_bytes with one repeat).

allocations `property` ¶

allocations: int

Allocation count from the representative (min-peak) run.

total_bytes `property` ¶

total_bytes: int

Total bytes allocated by the representative (min-peak) run.

rss_bytes `property` ¶

rss_bytes: int | None

Headline whole-process RSS: the minimum resident high-water across isolated passes (the cold floor, like `peak_bytes`), or `None` if memory wasn't measured in isolation (an in-process run has no attributable process-global RSS).

series ¶

series(field: str) -> list[Any]

The per-repeat values of one series field (one of `SERIES_FIELDS`, or an optional series field).

as_dict ¶

as_dict() -> dict[str, Any]

The JSON blob stored under pytest-benchmark extra_info["benchmem"].

The three core per-repeat series, flat, plus any `OPTIONAL_SERIES_FIELDS` that were measured (all-or-nothing per result). No denormalized scalars and no `repeats` (it is the length of any series). Everything else derives on read.

from_blob `classmethod` ¶

from_blob(blob: Mapping[str, Any]) -> MemoryResult

Rebuild from a blob's per-repeat series. Core columns are required; any `OPTIONAL_SERIES_FIELDS` are read when present (else left `None`).
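
The blob shape described by `as_dict` and `from_blob` can be sketched as a plain round trip. This is an assumption-laden illustration: the core key names are taken from the `field` values documented for `memory_from_pytest_benchmark`, `rss_bytes` is assumed to be the optional series, and `to_blob` and `rows_from_blob` are hypothetical helpers, not the package's functions.

```python
CORE = ("peak_bytes", "allocations", "total_bytes")  # assumed core series names
OPTIONAL = ("rss_bytes",)  # assumed optional series name


def to_blob(passes):
    """Flatten per-pass dicts into one list per field; no scalars, no repeats."""
    blob = {field: [p[field] for p in passes] for field in CORE}
    for field in OPTIONAL:
        values = [p.get(field) for p in passes]
        if all(v is not None for v in values):  # all-or-nothing per result
            blob[field] = values
    return blob


def rows_from_blob(blob):
    """Rebuild per-pass dicts; core series are required, optional ones are not."""
    missing = [field for field in CORE if field not in blob]
    if missing:
        raise KeyError(f"blob lacks core series: {missing}")
    repeats = len(blob[CORE[0]])  # repeats is the length of any series
    present = [field for field in CORE + OPTIONAL if field in blob]
    return [{field: blob[field][i] for field in present} for i in range(repeats)]


passes = [
    {"peak_bytes": 5_200_000, "allocations": 1_310, "total_bytes": 9_800_000},
    {"peak_bytes": 4_900_000, "allocations": 1_204, "total_bytes": 9_100_000},
]
blob = to_blob(passes)
restored = rows_from_blob(blob)
```

Storing only the series means a reader can derive any statistic later without the writer having to anticipate it.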

Measurement `dataclass` ¶

One repeat's raw numbers: memray's peak high-water, allocation count, and total bytes allocated (cumulative churn, including temporaries that the GC later frees), plus an optional whole-process resident high-water.

`rss_bytes` is the whole-process resident high-water of an isolated pass (a fresh child process that calls the action once); see `_peak_rss_bytes`. It is `None` in-process, where a process-global RSS isn't attributable to the action. Because it is a whole-process high-water, it also carries the child's fixed floor (the interpreter, the memray import, and the tracker, roughly 25-40 MiB together, varying with the harness) plus any `setup` state, so it is a capacity figure, not a delta.

Readers & loader¶

from_pytest_benchmark reads timing (seconds, from stats); memory_from_pytest_benchmark reads memory (bytes, from extra_info.benchmem). load_samples is the unified reader; load_long_df stacks runs into the tidy frame the plots pivot. discover_runs collects saved runs from pytest-benchmark's .benchmarks/ storage, so you can hand the readers a directory instead of listing files.

from_pytest_benchmark ¶

from_pytest_benchmark(
    path: str | Path, *, metric: str = "min"
) -> tuple[str, list[Sample], str]

Read timing out of a pytest-benchmark file → (label, samples, "s").

Dims come from each benchmark's parametrize params and `extra_info`, plus the structural `node.*` dims (see `_node_dims`).

Parameters:

Name	Type	Description	Default
`path`	`str \| Path`	A pytest-benchmark JSON file.	required
`metric`	`str`	Which pytest-benchmark stat to read (`min` / `median` / …).	`'min'`

Returns:

Type	Description
`tuple[str, list[Sample], str]`	`(label, samples, unit)`: the run label, one `Sample` per benchmark, and the unit (`"s"`).

memory_from_pytest_benchmark ¶

memory_from_pytest_benchmark(
    path: str | Path,
    *,
    field: str = "peak_bytes",
    reduce: Callable[[list[float]], float] | None = None,
) -> tuple[str, list[Sample], str]

Read memory out of a pytest-benchmark file → (label, samples, unit).

The `benchmark_memory` fixture stores each run's memory blob under `extra_info["benchmem"]` (a flat per-repeat series per field), keyed by the same benchmark id pytest-benchmark uses. Benchmarks lacking the blob (timing-only tests) are skipped. Dims come from parametrize params and `extra_info`, plus the structural `node.*` dims (see `_node_dims`).

Parameters:

Name	Type	Description	Default
`path`	`str \| Path`	A pytest-benchmark JSON file.	required
`field`	`str`	Which series to read — `peak_bytes` (unit `B`), `allocations` (count), or `total_bytes`.	`'peak_bytes'`
`reduce`	`Callable[[list[float]], float] \| None`	Reduce the per-repeat series to one scalar. Default (`None`) derives the headline (peak = min, allocations/total_bytes = the min-peak run); pass a callable for a distribution stat over the series instead.	`None`

Returns:

Type	Description
`tuple[str, list[Sample], str]`	`(label, samples, unit)`: the run label, one `Sample` per benchmark with the blob, and the unit (`B` or count).
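
A `reduce` callable only has to map a per-repeat series to one number. The sketch below shows two candidates on a made-up series: the standard library's `statistics.median`, and a hypothetical `peak_spread` helper written for this example.

```python
import statistics


def peak_spread(series: list[float]) -> float:
    """Hypothetical reducer: gap between the worst and best pass."""
    return max(series) - min(series)


# Made-up per-repeat peak_bytes values for one benchmark.
series = [4_900_000.0, 5_200_000.0, 5_050_000.0]

median_peak = statistics.median(series)
spread = peak_spread(series)
```

Either callable matches the documented `Callable[[list[float]], float]` shape, so it could be passed as `reduce=statistics.median` or `reduce=peak_spread`; leaving `reduce` as `None` keeps the min-floor headline instead.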

load_samples ¶

load_samples(
    path: str | Path,
    *,
    metric: Metric = "time",
    stat: str | None = None,
) -> tuple[str, list[Sample], str]

Read one pytest-benchmark file for the chosen metric → (label, samples, unit).

The unified reader over `from_pytest_benchmark` (timing) and `memory_from_pytest_benchmark` (memory).

Parameters:

Name	Type	Description	Default
`path`	`str \| Path`	A pytest-benchmark JSON file.	required
`metric`	`Metric`	Which metric to read (`time` / `peak` / `allocated` / `allocations`).	`'time'`
`stat`	`str \| None`	Distribution stat over the metric's per-repeat series (`min` / `max` / `mean` / `median` / `stddev`); `None` reads the headline scalar. For `time` it selects the pytest-benchmark stat (default `min`).	`None`

Returns:

Type	Description
`tuple[str, list[Sample], str]`	`(label, samples, unit)` — the run label, its samples, and the metric's unit.

load_long_df ¶

load_long_df(
    runs: str | Path | Sequence[str | Path],
    *,
    metric: Metric = "time",
    stat: str | None = None,
    labels: Sequence[str] | None = None,
    pivot: str | None = None,
) -> tuple[pd.DataFrame, str]

Stack pytest-benchmark files (one path or a sequence) into one long frame → (df, unit).

One row per (run, id) for the chosen metric. Columns: `snapshot` (the series axis, described below), `id` (the pairing key), `value`, then one column per dim key seen (missing dims are NaN). Every plot view and the compare table pivots this frame, pairing rows on `id` and laying `snapshot` values side by side.

The series axis is just a dim. By default it is the run-file (`snapshot` = each file's label), which is why a run-file is a comparison axis: `compare a.json b.json` ranks one file against another. `pivot` re-points that axis at a real data dim instead: its values become `snapshot`, and it is lifted out of each row's identity (dropped from the dims and stripped from the `id`) so that rows differing only in it pair up. That lets one combined run be A/B'd along a config dim (`--pivot param:semantics`) exactly as two files are A/B'd today: the run-file is an external series axis, and `pivot` promotes an internal dim to the same role. The two are mutually exclusive (an A/B view has one series axis), so `pivot` with more than one run is an error.

Parameters:

Name	Type	Description	Default
`runs`	`str \| Path \| Sequence[str \| Path]`	One path or a sequence of pytest-benchmark JSON files.	required
`metric`	`Metric`	Which metric to read (`time` / `peak` / `allocated` / `allocations`).	`'time'`
`stat`	`str \| None`	Distribution stat over the per-repeat series; `None` reads the headline scalar.	`None`
`labels`	`Sequence[str] \| None`	Overrides the `snapshot` label per run (one per path, same order), decoupling the display name from the filename; defaults to each file's stem.	`None`
`pivot`	`str \| None`	Use this dim as the series axis instead of the run-file (`param:NAME` or a bare `extra_info` name). Requires a single run.	`None`

Returns:

Type	Description
`tuple[DataFrame, str]`	`(df, unit)` — the long-form frame and the metric's unit.
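
The pivot idea can be sketched on plain dictionaries, without pandas. Everything here is illustrative: `pivot_rows` and `pair` are hypothetical helpers, the `eager`/`lazy` values are invented, and row identity is modelled as the tuple of remaining dims because the real `id` format is not specified in this section.

```python
def pivot_rows(rows, dim):
    """Promote one dim to the series axis and drop it from each row's identity."""
    out = []
    for row in rows:
        dims = dict(row["dims"])
        series = dims.pop(dim)  # raises KeyError if a row lacks the dim
        out.append(
            {
                "snapshot": series,
                "id": tuple(sorted(dims.items())),  # identity without the pivot dim
                "value": row["value"],
            }
        )
    return out


def pair(rows):
    """Lay snapshot values side by side per id."""
    table = {}
    for row in rows:
        table.setdefault(row["id"], {})[row["snapshot"]] = row["value"]
    return table


# One combined run: every benchmark carries a 'semantics' dim.
rows = [
    {"dims": {"semantics": "eager", "n": "100"}, "value": 1.0},
    {"dims": {"semantics": "lazy", "n": "100"}, "value": 0.5},
    {"dims": {"semantics": "eager", "n": "1000"}, "value": 10.0},
    {"dims": {"semantics": "lazy", "n": "1000"}, "value": 4.0},
]
paired = pair(pivot_rows(rows, "semantics"))
```

After the pivot, rows that differed only in `semantics` share an identity, which is what lets a single run be compared the way two files would be.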

discover_runs ¶

discover_runs(
    root: str | Path = ".benchmarks",
) -> list[Path]

Return pytest-benchmark JSON files under root (for CLI suggestions).

Parameters:

Name	Type	Description	Default
`root`	`str \| Path`	Directory to search (default: pytest-benchmark's `.benchmarks` store).	`'.benchmarks'`

Returns:

Type	Description
`list[Path]`	The JSON file paths found under `root`.

Sample ¶

Bases: NamedTuple

One measured result: an opaque id, a value, and analysis dims.

Plotting — `pytest_benchmem.plotting`¶

Every plot_* returns (figure, n_ids). snapshots is a list of run JSON paths; labels names the series per run (defaults to the file stems) — the API behind plot's -l/--label. plot_compare's sort is "absolute" (native units) or "relative" (percent).

plot_scaling ¶

plot_scaling(
    snapshots: Snapshots,
    *,
    metric: Metric = "time",
    x: str | None = None,
    color: str | None = None,
    facet: str | None = None,
    log_log: bool | None = None,
    log_x: bool | Literal["auto"] = "auto",
    log_y: bool | Literal["auto"] = "auto",
    y_zero: bool | Literal["auto"] = "auto",
    band: Literal["auto", "minmax", "none"] = "auto",
    where: Mapping[str, str] | None = None,
    free_axes: FreeAxes | None = None,
    labels: Sequence[str] | None = None,
    log: bool | Literal["auto"] | None = None,
) -> tuple[Figure, int]

Cost vs a numeric dim, coloured/faceted by other dims.

`x`/`color`/`facet` default to inference from the dims (the lone numeric dim → `x`); pass them to override. Passing more than one run overlays them as series: the run-file becomes the default `color` (like `plot_compare`), so two builds' scaling curves sit on the same axes per facet.

Axis scaling resolves per axis as `log_x`/`log_y` (if set) → `log_log` (the shared shortcut, if set) → `"auto"` (log when that axis's data is strictly positive, i.e. log-log by default).

Parameters:

Name	Type	Description	Default
`snapshots`	`Snapshots`	Run JSON path(s). Multiple runs overlay as series (default `color` = run label).	required
`metric`	`Metric`	Which metric to plot (`time` / `peak` / `allocated` / `allocations`).	`'time'`
`x`	`str \| None`	Dim for the x-axis (default: the lone numeric dim).	`None`
`color`	`str \| None`	Dim to colour by (default: inferred).	`None`
`facet`	`str \| None`	Dim to split into subplots (default: inferred).	`None`
`log_log`	`bool \| None`	Shared shortcut — `True` log-scales both axes, `False` forces both linear, `None` (default) leaves each axis on its own `"auto"`. Overridden per axis by `log_x`/`log_y`.	`None`
`log_x`	`bool \| Literal['auto']`	`"auto"` log-scales x when it's numeric and strictly positive (the sweep decades); or force with a bool. Wins over `log_log`.	`'auto'`
`log_y`	`bool \| Literal['auto']`	`"auto"` log-scales y when the metric is strictly positive (log-log is the default); pass `False` for a linear cost axis where bar heights read truthfully. Wins over `log_log`; independent of `log_x`.	`'auto'`
`y_zero`	`bool \| Literal['auto']`	Anchor the linear y-axis at 0 so the between-run gap and the scaling slope aren't visually exaggerated. `"auto"` does this whenever the y-axis is linear (the only case where it applies); force with a bool.	`'auto'`
`band`	`Literal['auto', 'minmax', 'none']`	Spread whiskers (`min`…`max` of each point's per-pass series) on memory metrics — `"auto"` shows them where there's spread, `"minmax"` forces them on, `"none"` off. The line stays the headline (the min floor); whiskers reach up to the worst pass. Ignored for `time`.	`'auto'`
`where`	`Mapping[str, str] \| None`	Keep only rows matching these `dim=value` pairs.	`None`
`free_axes`	`FreeAxes \| None`	`"x"` / `"y"` / `"both"` — unmatch a faceted axis from the shared default (`"x"` for incommensurable sweeps, `"y"` when facets have different cost scales, e.g. per function).	`None`
`labels`	`Sequence[str] \| None`	Names the snapshot in the title (default: file stem).	`None`
`log`	`bool \| Literal['auto'] \| None`	Deprecated alias for `log_log` (tied both axes together). Pass `log_log` or the per-axis `log_x`/`log_y` instead.	`None`

Returns:

Type	Description
`tuple[Figure, int]`	`(figure, n_ids)` — the plotly figure and the number of ids plotted.
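
The per-axis precedence documented for `log_x`/`log_y`, `log_log`, and `"auto"` can be written as a small function. `resolve_log` is a hypothetical name for this sketch, and it assumes the axis values are already numeric; the library's own resolution code may differ.

```python
def resolve_log(axis_flag, log_log, values):
    """Decide whether one axis is log-scaled.

    axis_flag: True, False, or "auto" (the per-axis log_x / log_y setting)
    log_log:   True, False, or None (the shared shortcut)
    values:    the numeric data plotted on that axis
    """
    if axis_flag != "auto":
        return bool(axis_flag)  # an explicit per-axis setting wins
    if log_log is not None:
        return log_log  # then the shared shortcut
    # "auto": log only when there is data and all of it is strictly positive
    return bool(values) and all(v > 0 for v in values)


auto_positive = resolve_log("auto", None, [1, 10, 100])
auto_with_zero = resolve_log("auto", None, [0, 10])
shortcut_linear = resolve_log("auto", False, [1, 10])
axis_overrides_shortcut = resolve_log(True, False, [1, 10])
```

Resolving each axis independently is what allows, for example, a log x-axis over sweep decades with a linear cost axis.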

plot_scatter ¶

plot_scatter(
    snapshots: Snapshots,
    *,
    metric: Metric = "time",
    facet: str | None = None,
    clip: float | None = None,
    where: Mapping[str, str] | None = None,
    free_axes: FreeAxes | None = None,
    labels: Sequence[str] | None = None,
    pivot: str | None = None,
) -> tuple[Figure, int]

Baseline cost (log-x) vs candidate/baseline ratio (log-y).

Top-right = slow and slower (the regressed corner). The first series is the baseline; with 3+, the rest animate. Colour encodes the absolute Δ. The series axis is the run-file by default; `pivot` re-points it at a data dim, folding a single run so its dim-values are the series (the first being the baseline) instead of the files (see `load_long_df`).

Parameters:

Name	Type	Description	Default
`snapshots`	`Snapshots`	Run JSON path(s); the first is the baseline, extras animate (one run when `pivot` is set — the dim's values play that role).	required
`metric`	`Metric`	Which metric to plot (`time` / `peak` / `allocated` / `allocations`).	`'time'`
`facet`	`str \| None`	Dim to split into subplots.	`None`
`clip`	`float \| None`	Clamp the colour scale (default p95).	`None`
`where`	`Mapping[str, str] \| None`	Keep only rows matching these `dim=value` pairs.	`None`
`free_axes`	`FreeAxes \| None`	Give each facet its own axes instead of sharing.	`None`
`labels`	`Sequence[str] \| None`	Series names per run (default: file stems). Ignored when `pivot` is set.	`None`
`pivot`	`str \| None`	Use this dim as the series axis instead of the run-file (`param:NAME` or a bare `extra_info` name); requires a single run.	`None`

Returns:

Type	Description
`tuple[Figure, int]`	`(figure, n_ids)` — the plotly figure and the number of ids plotted.

plot_compare ¶

plot_compare(
    snapshots: Snapshots,
    *,
    metric: Metric = "time",
    sort: SortMode = "absolute",
    facet: str | None = None,
    clip: float | None = None,
    where: Mapping[str, str] | None = None,
    free_axes: FreeAxes | None = None,
    labels: Sequence[str] | None = None,
    pivot: str | None = None,
) -> tuple[Figure, int]

Bar chart of per-id delta, sorted by the chosen Δ (biggest regressions on top).

The first two series are compared; the first is the baseline. The series axis is the run-file by default (the first two files); `pivot` re-points it at a data dim, folding a single run so its first two dim-values become the A and B series instead (see `load_long_df`).

Parameters:

Name	Type	Description	Default
`snapshots`	`Snapshots`	Run JSON path(s); only the first two are used (one run when `pivot` is set).	required
`metric`	`Metric`	Which metric to plot (`time` / `peak` / `allocated` / `allocations`).	`'time'`
`sort`	`SortMode`	`absolute` plots `b - a` in the native unit; `relative` plots percent change.	`'absolute'`
`facet`	`str \| None`	Dim to split into subplots.	`None`
`clip`	`float \| None`	Clamp the colour scale (default symmetric p95).	`None`
`where`	`Mapping[str, str] \| None`	Keep only rows matching these `dim=value` pairs.	`None`
`free_axes`	`FreeAxes \| None`	Give each facet its own axes instead of sharing.	`None`
`labels`	`Sequence[str] \| None`	Series names for the two runs (default: file stems). Ignored when `pivot` is set (the series are the dim's values).	`None`
`pivot`	`str \| None`	Use this dim as the series axis instead of the run-file (`param:NAME` or a bare `extra_info` name); requires a single run.	`None`

Returns:

Type	Description
`tuple[Figure, int]`	`(figure, n_ids)` — the plotly figure and the number of ids plotted.
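
The two `sort` modes amount to different delta formulas over ids present in both series. The sketch below uses a hypothetical `deltas` helper and made-up values; it assumes non-zero baselines for the relative mode and says nothing about how the figure itself is drawn.

```python
def deltas(a, b, sort="absolute"):
    """Per-id delta of b against baseline a, biggest regression first."""
    rows = []
    for key in a.keys() & b.keys():  # only ids measured in both series
        diff = b[key] - a[key]
        if sort == "relative":
            diff = diff * 100.0 / a[key]  # percent change; assumes a[key] != 0
        rows.append((key, diff))
    return sorted(rows, key=lambda row: row[1], reverse=True)


baseline = {"x": 100.0, "y": 10.0, "z": 50.0}
candidate = {"x": 130.0, "y": 20.0, "z": 40.0, "new": 1.0}

by_absolute = deltas(baseline, candidate, sort="absolute")
by_relative = deltas(baseline, candidate, sort="relative")
```

The ordering can differ between the modes: a small benchmark that doubles ranks first by percent, while a large one with a modest percentage can rank first in native units.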

plot_sweep ¶

plot_sweep(
    snapshots: Snapshots,
    *,
    metric: Metric = "time",
    clip: float | None = None,
    where: Mapping[str, str] | None = None,
    labels: Sequence[str] | None = None,
) -> tuple[Figure, int]

Heatmap of per-id fold-change (log2 ratio) vs the first snapshot.

Parameters:

Name	Type	Description	Default
`snapshots`	`Snapshots`	Run JSON paths; columns in order, the first is the reference.	required
`metric`	`Metric`	Which metric to plot (`time` / `peak` / `allocated` / `allocations`).	`'time'`
`clip`	`float \| None`	Clamp the colour scale.	`None`
`where`	`Mapping[str, str] \| None`	Keep only rows matching these `dim=value` pairs.	`None`
`labels`	`Sequence[str] \| None`	Column (version) names (default: file stems).	`None`

Returns:

Type	Description
`tuple[Figure, int]`	`(figure, n_ids)` — the plotly figure and the number of ids plotted.
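
The quantity behind each heatmap cell, a log2 ratio against the first snapshot, can be sketched as follows. `fold_changes` is a hypothetical helper with made-up values; it assumes strictly positive measurements and skips ids missing from the reference.

```python
import math


def fold_changes(snapshots):
    """log2(value / reference) per id, one dict per snapshot, first is the reference."""
    reference = snapshots[0]
    return [
        {key: math.log2(snap[key] / reference[key]) for key in snap if key in reference}
        for snap in snapshots
    ]


snapshots = [
    {"a": 2.0, "b": 8.0},  # reference column
    {"a": 4.0, "b": 4.0},
    {"a": 1.0, "b": 8.0},
]
columns = fold_changes(snapshots)
```

On a log2 scale a doubling and a halving are symmetric (+1 and -1), and the reference column is all zeros.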

Sweeps — `pytest_benchmem.sweep`¶

See Cross-version sweeps for the narrative, the Venv object, and the provision parameters.

sweep ¶

sweep(
    versions: Sequence[str],
    run: Callable[[Venv], None],
    **provision_kwargs: object,
) -> list[str]

Provision a venv per version and call `run(venv)` in each.

`run` does whatever the consumer needs (for example, invoking pytest or a memory command with `venv.python` and `cwd=venv.cwd`). Returns the list of versions that failed to provision.
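
A `run` callback might look like the sketch below. It relies only on the two attributes named above, `venv.python` and `venv.cwd`, whose exact types are not stated here; `pytest_command` and the `run.json` output name are hypothetical, and a `SimpleNamespace` stands in for a real `Venv` so the command can be inspected without provisioning anything.

```python
import subprocess
from pathlib import Path
from types import SimpleNamespace


def pytest_command(venv, json_out):
    """Build a pytest invocation that runs under the venv's interpreter."""
    return [
        str(venv.python),
        "-m",
        "pytest",
        "--benchmark-memory",
        f"--benchmark-json={json_out}",  # pytest-benchmark's JSON output flag
    ]


def run(venv):
    """Callback for sweep(): run the suite inside one provisioned venv."""
    subprocess.run(pytest_command(venv, "run.json"), cwd=venv.cwd, check=True)


# Stand-in object for illustration only; a real Venv comes from sweep().
fake_venv = SimpleNamespace(python=Path("venv/bin/python"), cwd=Path("project"))
command = pytest_command(fake_venv, "run.json")
```

With the package installed, the callback would be handed to `sweep` together with a list of version strings, and the returned list checked for versions that failed to provision.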
