Skip to content

Reference

Various functions for multidimensional cluster generation in Python.

Note that:

  1. clugen() is the main function of the pyclugen package, and possibly the only function most users will need.
  2. Functions which accept rng as the last parameter are stochastic. Thus, in order to obtain the same result on separate invocations of these functions, pass them an instance of same pseudo-random number Generator initialized with the same seed.

Clusters

Bases: NamedTuple

Read-only container for results returned by clugen().

The symbols presented in the instances variable below have the following meanings:

  • \(n\) : Number of dimensions.
  • \(p\) : Number of points.
  • \(c\) : Number of clusters.
Source code in pyclugen/main.py
class Clusters(NamedTuple):
    r"""Read-only container for results returned by [`clugen()`][pyclugen.main.clugen].

    The symbols presented in the instances variable below have the following
    meanings:

    - $n$ : Number of dimensions.
    - $p$ : Number of points.
    - $c$ : Number of clusters.
    """

    points: NDArray
    r"""$p \times n$ matrix containing the generated points for all clusters."""

    clusters: NDArray
    r"""Vector of size $p$ indicating the cluster each point in `points`
    belongs to."""

    projections: NDArray
    r"""$p \times n$ matrix with the point projections on the cluster-supporting
    lines."""

    sizes: NDArray
    r"""Vector of size $c$ with the number of points in each cluster."""

    centers: NDArray
    r"""$c \times n$ matrix with the coordinates of the cluster centers."""

    directions: NDArray
    r"""$c \times n$ matrix with the direction of each cluster-supporting line."""

    angles: NDArray
    r"""Vector of size $c$ with the angles between the cluster-supporting lines and
    the main direction."""

    lengths: NDArray
    r"""Vector of size $c$ with the lengths of the cluster-supporting lines."""

angles instance-attribute

angles: NDArray

Vector of size \(c\) with the angles between the cluster-supporting lines and the main direction.

centers instance-attribute

centers: NDArray

\(c \times n\) matrix with the coordinates of the cluster centers.

clusters instance-attribute

clusters: NDArray

Vector of size \(p\) indicating the cluster each point in points belongs to.

directions instance-attribute

directions: NDArray

\(c \times n\) matrix with the direction of each cluster-supporting line.

lengths instance-attribute

lengths: NDArray

Vector of size \(c\) with the lengths of the cluster-supporting lines.

points instance-attribute

points: NDArray

\(p \times n\) matrix containing the generated points for all clusters.

projections instance-attribute

projections: NDArray

\(p \times n\) matrix with the point projections on the cluster-supporting lines.

sizes instance-attribute

sizes: NDArray

Vector of size \(c\) with the number of points in each cluster.

angle_btw

angle_btw(v1: NDArray, v2: NDArray) -> float

Angle between two \(n\)-dimensional vectors.

Typically, the angle between two vectors v1 and v2 can be obtained with:

arccos(dot(u, v) / (norm(u) * norm(v)))

However, this approach is numerically unstable. The version provided here is numerically stable and based on the AngleBetweenVectors Julia package by Jeffrey Sarnoff (MIT license), implementing an algorithm provided by Prof. W. Kahan in these notes (see page 15).

Examples:

>>> from numpy import array, degrees
>>> from pyclugen import angle_btw
>>> v1 = array([1.0, 1.0, 1.0, 1.0])
>>> v2 = array([1.0, 0.0, 0.0, 0.0])
>>> degrees(angle_btw(v1, v2))
60.00000000000001

Parameters:

Name Type Description Default
v1 NDArray

First vector.

required
v2 NDArray

Second vector.

required

Returns:

Type Description
float

Angle between v1 and v2 in radians.

Source code in pyclugen/helper.py
def angle_btw(v1: NDArray, v2: NDArray) -> float:
    r"""Angle between two $n$-dimensional vectors.

    Typically, the angle between two vectors `v1` and `v2` can be obtained with:

    ```python
    arccos(dot(u, v) / (norm(u) * norm(v)))
    ```

    However, this approach is numerically unstable. The version provided here is
    numerically stable and based on the
    [AngleBetweenVectors](https://github.com/JeffreySarnoff/AngleBetweenVectors.jl)
    Julia package by Jeffrey Sarnoff (MIT license), implementing an algorithm
    provided by Prof. W. Kahan in
    [these notes](https://people.eecs.berkeley.edu/~wkahan/MathH110/Cross.pdf)
    (see page 15).

    Examples:
        >>> from numpy import array, degrees
        >>> from pyclugen import angle_btw
        >>> v1 = array([1.0, 1.0, 1.0, 1.0])
        >>> v2 = array([1.0, 0.0, 0.0, 0.0])
        >>> degrees(angle_btw(v1, v2))
        60.00000000000001

    Args:
      v1: First vector.
      v2: Second vector.

    Returns:
      Angle between `v1` and `v2` in radians.
    """
    u1 = v1 / norm(v1)
    u2 = v2 / norm(v2)

    y = u1 - u2
    x = u1 + u2

    return 2 * arctan(norm(y) / norm(x))

angle_deltas

angle_deltas(
    num_clusters: int, angle_disp: float, rng: Generator = _default_rng
) -> NDArray

Get angles between average cluster direction and cluster-supporting lines.

Determine the angles between the average cluster direction and the cluster-supporting lines. These angles are obtained from a wrapped normal distribution ( \(\mu=0\), \(\sigma=\)angle_disp) with support in the interval \(\left[-\pi/2,\pi/2\right]\). Note this is different from the standard wrapped normal distribution, the support of which is given by the interval \(\left[-\pi,\pi\right]\).

Examples:

>>> from pyclugen import angle_deltas
>>> from numpy import degrees, pi
>>> from numpy.random import Generator, PCG64
>>> prng = Generator(PCG64(123))
>>> a_rad = angle_deltas(4, pi/8, rng=prng) # Angle dispersion of 22.5 degrees
>>> a_rad
array([-0.38842705, -0.14442948,  0.50576707,  0.07617358])
>>> degrees(a_rad) # Show angle deltas in degrees
array([-22.25523038,  -8.27519966,  28.97831838,   4.36442443])

Parameters:

Name Type Description Default
num_clusters int

Number of clusters.

required
angle_disp float

Angle dispersion, in radians.

required
rng Generator

Optional pseudo-random number generator.

_default_rng

Returns:

Type Description
NDArray

Angles between the average cluster direction and the cluster-supporting lines, given in radians in the interval \(\left[-\pi/2,\pi/2\right]\).

Source code in pyclugen/module.py
def angle_deltas(
    num_clusters: int, angle_disp: float, rng: Generator = _default_rng
) -> NDArray:
    r"""Get angles between average cluster direction and cluster-supporting lines.

    Determine the angles between the average cluster direction and the
    cluster-supporting lines. These angles are obtained from a wrapped normal
    distribution ( $\mu=0$, $\sigma=$`angle_disp`) with support in the interval
    $\left[-\pi/2,\pi/2\right]$. Note this is different from the standard
    wrapped normal distribution, the support of which is given by the interval
    $\left[-\pi,\pi\right]$.

    Examples:
        >>> from pyclugen import angle_deltas
        >>> from numpy import degrees, pi
        >>> from numpy.random import Generator, PCG64
        >>> prng = Generator(PCG64(123))
        >>> a_rad = angle_deltas(4, pi/8, rng=prng) # Angle dispersion of 22.5 degrees
        >>> a_rad
        array([-0.38842705, -0.14442948,  0.50576707,  0.07617358])
        >>> degrees(a_rad) # Show angle deltas in degrees
        array([-22.25523038,  -8.27519966,  28.97831838,   4.36442443])

    Args:
      num_clusters: Number of clusters.
      angle_disp: Angle dispersion, in radians.
      rng: Optional pseudo-random number generator.

    Returns:
      Angles between the average cluster direction and the cluster-supporting
        lines, given in radians in the interval $\left[-\pi/2,\pi/2\right]$.
    """
    # Get random angle differences using the normal distribution
    angles = angle_disp * rng.normal(size=num_clusters)

    # Reduce angle differences to the interval [-π, π]
    angles = arctan2(sin(angles), cos(angles))

    # Make sure angle differences are within interval [-π/2, π/2]
    return where(abs(angles) > pi / 2, angles - sign(angles) * pi / 2, angles)

clucenters

clucenters(
    num_clusters: int,
    clu_sep: NDArray,
    clu_offset: NDArray,
    rng: Generator = _default_rng,
) -> NDArray

Determine cluster centers using the uniform distribution.

The number of clusters (num_clusters) and the average cluster separation (clu_sep) are taken into account.

More specifically, let \(c=\)num_clusters, \(\mathbf{s}=\)clu_sep.reshape(-1,1), \(\mathbf{o}=\)clu_offset.reshape(-1,1), \(n=\)clu_sep.size (i.e., number of dimensions). Cluster centers are obtained according to the following equation:

\[ \mathbf{C}=c\mathbf{U} \cdot \operatorname{diag}(\mathbf{s}) + \mathbf{1}\,\mathbf{o}^T \]

where \(\mathbf{C}\) is the \(c \times n\) matrix of cluster centers, \(\mathbf{U}\) is an \(c \times n\) matrix of random values drawn from the uniform distribution between -0.5 and 0.5, and \(\mathbf{1}\) is an \(c \times 1\) vector with all entries equal to 1.

Examples:

>>> from pyclugen import clucenters
>>> from numpy import array
>>> from numpy.random import Generator, PCG64
>>> prng = Generator(PCG64(123))
>>> clucenters(3, array([30,10]), array([-50,50]), rng=prng)
array([[-33.58833231,  36.61463056],
       [-75.16761145,  40.53115432],
       [-79.1684689 ,  59.3628352 ]])

Parameters:

Name Type Description Default
num_clusters int

Number of clusters.

required
clu_sep NDArray

Average cluster separation ( \(n \times 1\) vector).

required
clu_offset NDArray

Cluster offsets ( \(n \times 1\) vector).

required
rng Generator

Optional pseudo-random number generator.

_default_rng

Returns:

Type Description
NDArray

A \(c \times n\) matrix containing the cluster centers.

Source code in pyclugen/module.py
def clucenters(
    num_clusters: int,
    clu_sep: NDArray,
    clu_offset: NDArray,
    rng: Generator = _default_rng,
) -> NDArray:
    r"""Determine cluster centers using the uniform distribution.

    The number of clusters (`num_clusters`) and the average cluster separation
    (`clu_sep`) are taken into account.

    More specifically, let $c=$`num_clusters`, $\mathbf{s}=$`clu_sep.reshape(-1,1)`,
    $\mathbf{o}=$`clu_offset.reshape(-1,1)`, $n=$`clu_sep.size` (i.e., number of
    dimensions). Cluster centers are obtained according to the following equation:

    $$
    \mathbf{C}=c\mathbf{U} \cdot \operatorname{diag}(\mathbf{s}) +
        \mathbf{1}\,\mathbf{o}^T
    $$

    where $\mathbf{C}$ is the $c \times n$ matrix of cluster centers,
    $\mathbf{U}$ is an $c \times n$ matrix of random values drawn from the
    uniform distribution between -0.5 and 0.5, and $\mathbf{1}$ is an $c \times
    1$ vector with all entries equal to 1.

    Examples:
        >>> from pyclugen import clucenters
        >>> from numpy import array
        >>> from numpy.random import Generator, PCG64
        >>> prng = Generator(PCG64(123))
        >>> clucenters(3, array([30,10]), array([-50,50]), rng=prng)
        array([[-33.58833231,  36.61463056],
               [-75.16761145,  40.53115432],
               [-79.1684689 ,  59.3628352 ]])

    Args:
      num_clusters: Number of clusters.
      clu_sep: Average cluster separation ( $n \times 1$ vector).
      clu_offset: Cluster offsets ( $n \times 1$ vector).
      rng: Optional pseudo-random number generator.

    Returns:
        A $c \times n$ matrix containing the cluster centers.
    """
    # Obtain a num_clusters x num_dims matrix of uniformly distributed values
    # between -0.5 and 0.5 representing the relative cluster centers
    ctr_rel = rng.random((num_clusters, clu_sep.size)) - 0.5

    return num_clusters * (ctr_rel @ diag(clu_sep)) + clu_offset

clugen

clugen(
    num_dims: int,
    num_clusters: int,
    num_points: int,
    direction: ArrayLike,
    angle_disp: float,
    cluster_sep: ArrayLike,
    llength: float,
    llength_disp: float,
    lateral_disp: float,
    allow_empty: bool = False,
    cluster_offset: Optional[ArrayLike] = None,
    proj_dist_fn: str | Callable[[float, int, Generator], NDArray] = "norm",
    point_dist_fn: str
    | Callable[
        [NDArray, float, float, NDArray, NDArray, Generator], NDArray
    ] = "n-1",
    clusizes_fn: Callable[[int, int, bool, Generator], NDArray]
    | ArrayLike = clusizes,
    clucenters_fn: Callable[[int, NDArray, NDArray, Generator], NDArray]
    | ArrayLike = clucenters,
    llengths_fn: Callable[[int, float, float, Generator], NDArray]
    | ArrayLike = llengths,
    angle_deltas_fn: Callable[[int, float, Generator], NDArray]
    | ArrayLike = angle_deltas,
    rng: int | Generator = _default_rng,
) -> Clusters

Generate multidimensional clusters.

Tip

This is the main function of the pyclugen package, and possibly the only function most users will need.

Examples:

>>> import matplotlib.pyplot as plt
>>> from pyclugen import clugen
>>> from numpy import pi
>>> out = clugen(2, 5, 10000, [1, 0.5], pi/16, [10, 40], 10, 1, 2, rng=321)
>>> out.centers # What are the cluster centers?
array([[ 20.02876212,  36.59611434],
       [-15.60290734, -26.52169579],
       [ 23.09775166,  91.66309916],
       [ -5.76816015,  54.9775074 ],
       [ -4.64224681,  78.40990876]])
>>> plt.scatter(out.points[:,0],
...             out.points[:,1],
...             c=out.clusters) # doctest: +SKIP
>>> plt.show() # doctest: +SKIP

clugen

Note

In the descriptions below, the terms "average" and "dispersion" refer to measures of central tendency and statistical dispersion, respectively. Their exact meaning depends on several optional arguments.

Parameters:

Name Type Description Default
num_dims int

Number of dimensions.

required
num_clusters int

Number of clusters to generate.

required
num_points int

Total number of points to generate.

required
direction ArrayLike

Average direction of the cluster-supporting lines. Can be a vector of length num_dims (same direction for all clusters) or a matrix of size num_clusters x num_dims (one direction per cluster).

required
angle_disp float

Angle dispersion of cluster-supporting lines (radians).

required
cluster_sep ArrayLike

Average cluster separation in each dimension (vector of size num_dims).

required
llength float

Average length of cluster-supporting lines.

required
llength_disp float

Length dispersion of cluster-supporting lines.

required
lateral_disp float

Cluster lateral dispersion, i.e., dispersion of points from their projection on the cluster-supporting line.

required
allow_empty bool

Allow empty clusters? False by default.

False
cluster_offset Optional[ArrayLike]

Offset to add to all cluster centers (vector of size num_dims). By default the offset will be equal to numpy.zeros(num_dims).

None
proj_dist_fn str | Callable[[float, int, Generator], NDArray]

Distribution of point projections along cluster-supporting lines, with three possible values:

  • "norm" (default): Distribute point projections along lines using a normal distribution (μ=line center, σ=llength/6).
  • "unif": Distribute points uniformly along the line.
  • User-defined function, which accepts three parameters, line length (float), number of points (int), and an instance of Generator, and returns an array containing the distance of each point projection to the center of the line. For example, the "norm" option roughly corresponds to lambda l, n, rg: l * rg.random((n, 1)) / 6.
'norm'
point_dist_fn str | Callable[[NDArray, float, float, NDArray, NDArray, Generator], NDArray]

Controls how the final points are created from their projections on the cluster-supporting lines, with three possible values:

  • "n-1" (default): Final points are placed on a hyperplane orthogonal to the cluster-supporting line, centered at each point's projection, using the normal distribution (μ=0, σ=lateral_disp). This is done by the clupoints_n_1() function.
  • "n": Final points are placed around their projection on the cluster-supporting line using the normal distribution (μ=0, σ=lateral_disp). This is done by the clupoints_n() function.
  • User-defined function: The user can specify a custom point placement strategy by passing a function with the same signature as clupoints_n_1() and clupoints_n().
'n-1'
clusizes_fn Callable[[int, int, bool, Generator], NDArray] | ArrayLike

Distribution of cluster sizes. By default, cluster sizes are determined by the clusizes() function, which uses the normal distribution (μ=num_points/num_clusters, σ=μ/3), and assures that the final cluster sizes add up to num_points. This parameter allows the user to specify a custom function for this purpose, which must follow clusizes() signature. Note that custom functions are not required to strictly obey the num_points parameter. Alternatively, the user can specify an array of cluster sizes directly.

clusizes
clucenters_fn Callable[[int, NDArray, NDArray, Generator], NDArray] | ArrayLike

Distribution of cluster centers. By default, cluster centers are determined by the clucenters() function, which uses the uniform distribution, and takes into account the num_clusters and cluster_sep parameters for generating well-distributed cluster centers. This parameter allows the user to specify a custom function for this purpose, which must follow clucenters() signature. Alternatively, the user can specify a matrix of size num_clusters x num_dims with the exact cluster centers.

clucenters
llengths_fn Callable[[int, float, float, Generator], NDArray] | ArrayLike

Distribution of line lengths. By default, the lengths of cluster-supporting lines are determined by the llengths() function, which uses the folded normal distribution (μ=llength, σ=llength_disp). This parameter allows the user to specify a custom function for this purpose, which must follow llengths() signature. Alternatively, the user can specify an array of line lengths directly.

llengths
angle_deltas_fn Callable[[int, float, Generator], NDArray] | ArrayLike

Distribution of line angle differences with respect to direction. By default, the angles between direction and the direction of cluster-supporting lines are determined by the angle_deltas() function, which uses the wrapped normal distribution (μ=0, σ=angle_disp) with support in the interval [-π/2, π/2]. This parameter allows the user to specify a custom function for this purpose, which must follow angle_deltas() signature. Alternatively, the user can specify an array of angle deltas directly.

angle_deltas
rng int | Generator

The seed for the random number generator or an instance of Generator for reproducible executions.

_default_rng

Returns:

Type Description
Clusters

The generated clusters and associated information in the form of a Clusters object.

Source code in pyclugen/main.py
 85
 86
 87
 88
 89
 90
 91
 92
 93
 94
 95
 96
 97
 98
 99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
def clugen(
    num_dims: int,
    num_clusters: int,
    num_points: int,
    direction: ArrayLike,
    angle_disp: float,
    cluster_sep: ArrayLike,
    llength: float,
    llength_disp: float,
    lateral_disp: float,
    allow_empty: bool = False,
    cluster_offset: Optional[ArrayLike] = None,
    proj_dist_fn: str | Callable[[float, int, Generator], NDArray] = "norm",
    point_dist_fn: str
    | Callable[[NDArray, float, float, NDArray, NDArray, Generator], NDArray] = "n-1",
    clusizes_fn: Callable[[int, int, bool, Generator], NDArray] | ArrayLike = clusizes,
    clucenters_fn: Callable[[int, NDArray, NDArray, Generator], NDArray]
    | ArrayLike = clucenters,
    llengths_fn: Callable[[int, float, float, Generator], NDArray]
    | ArrayLike = llengths,
    angle_deltas_fn: Callable[[int, float, Generator], NDArray]
    | ArrayLike = angle_deltas,
    rng: int | Generator = _default_rng,
) -> Clusters:
    """Generate multidimensional clusters.

    !!! tip
        This is the main function of the **pyclugen** package, and possibly the
        only function most users will need.

    ## Examples:

        >>> import matplotlib.pyplot as plt
        >>> from pyclugen import clugen
        >>> from numpy import pi
        >>> out = clugen(2, 5, 10000, [1, 0.5], pi/16, [10, 40], 10, 1, 2, rng=321)
        >>> out.centers # What are the cluster centers?
        array([[ 20.02876212,  36.59611434],
               [-15.60290734, -26.52169579],
               [ 23.09775166,  91.66309916],
               [ -5.76816015,  54.9775074 ],
               [ -4.64224681,  78.40990876]])
        >>> plt.scatter(out.points[:,0],
        ...             out.points[:,1],
        ...             c=out.clusters) # doctest: +SKIP
        >>> plt.show() # doctest: +SKIP

    ![clugen](https://user-images.githubusercontent.com/3018963/151056890-c83c9509-b40d-4ab2-a842-f2a4706344c6.png)

    !!! Note
        In the descriptions below, the terms "average" and "dispersion" refer to
        measures of central tendency and statistical dispersion, respectively.
        Their exact meaning depends on several optional arguments.

    Args:
      num_dims: Number of dimensions.
      num_clusters: Number of clusters to generate.
      num_points: Total number of points to generate.
      direction: Average direction of the cluster-supporting lines. Can be a
        vector of length `num_dims` (same direction for all clusters) or a
        matrix of size `num_clusters` x `num_dims` (one direction per cluster).
      angle_disp: Angle dispersion of cluster-supporting lines (radians).
      cluster_sep: Average cluster separation in each dimension (vector of size
        `num_dims`).
      llength: Average length of cluster-supporting lines.
      llength_disp: Length dispersion of cluster-supporting lines.
      lateral_disp: Cluster lateral dispersion, i.e., dispersion of points from their
        projection on the cluster-supporting line.
      allow_empty: Allow empty clusters? `False` by default.
      cluster_offset: Offset to add to all cluster centers (vector of size `num_dims`).
        By default the offset will be equal to `numpy.zeros(num_dims)`.
      proj_dist_fn: Distribution of point projections along cluster-supporting lines,
        with three possible values:

        - `"norm"` (default): Distribute point projections along lines using a normal
          distribution (μ=_line center_, σ=`llength/6`).
        - `"unif"`: Distribute points uniformly along the line.
        - User-defined function, which accepts three parameters, line length (`float`),
          number of points (`int`), and an instance of
          [`Generator`](https://numpy.org/doc/stable/reference/random/generator.html?highlight=generator#numpy.random.Generator),
          and returns an array containing the distance of each point projection to
          the center of the line. For example, the `"norm"` option roughly corresponds
          to `lambda l, n, rg: l * rg.random((n, 1)) / 6`.

      point_dist_fn: Controls how the final points are created from their projections
        on the cluster-supporting lines, with three possible values:

        - `"n-1"` (default): Final points are placed on a hyperplane orthogonal to
          the cluster-supporting line, centered at each point's projection, using the
          normal distribution (μ=0, σ=`lateral_disp`). This is done by the
          [`clupoints_n_1()`][pyclugen.module.clupoints_n_1] function.
        - `"n"`: Final points are placed around their projection on the
          cluster-supporting line using the normal distribution (μ=0,
          σ=`lateral_disp`). This is done by the
          [`clupoints_n()`][pyclugen.module.clupoints_n] function.
        - User-defined function: The user can specify a custom point placement
          strategy by passing a function with the same signature as
          [`clupoints_n_1()`][pyclugen.module.clupoints_n_1] and
          [`clupoints_n()`][pyclugen.module.clupoints_n].

      clusizes_fn: Distribution of cluster sizes. By default, cluster sizes are
        determined by the [`clusizes()`][pyclugen.module.clusizes] function, which
        uses the normal distribution (μ=`num_points`/`num_clusters`, σ=μ/3), and
        assures that the final cluster sizes add up to `num_points`. This parameter
        allows the user to specify a custom function for this purpose, which must
        follow [`clusizes()`][pyclugen.module.clusizes] signature. Note that custom
        functions are not required to strictly obey the `num_points` parameter.
        Alternatively, the user can specify an array of cluster sizes directly.
      clucenters_fn: Distribution of cluster centers. By default, cluster centers
        are determined by the [`clucenters()`][pyclugen.module.clucenters] function,
        which uses the uniform distribution, and takes into account the `num_clusters`
        and `cluster_sep` parameters for generating well-distributed cluster centers.
        This parameter allows the user to specify a custom function for this purpose,
        which must follow [`clucenters()`][pyclugen.module.clucenters] signature.
        Alternatively, the user can specify a matrix of size `num_clusters` x
        `num_dims` with the exact cluster centers.
      llengths_fn: Distribution of line lengths. By default, the lengths of
        cluster-supporting lines are determined by the
        [`llengths()`][pyclugen.module.llengths] function, which uses the folded
        normal distribution (μ=`llength`, σ=`llength_disp`). This parameter allows
        the user to specify a custom function for this purpose, which must follow
        [`llengths()`][pyclugen.module.llengths] signature. Alternatively, the user
        can specify an array of line lengths directly.
      angle_deltas_fn: Distribution of line angle differences with respect to
        `direction`. By default, the angles between `direction` and the direction of
        cluster-supporting lines are determined by the
        [`angle_deltas()`][pyclugen.module.angle_deltas] function, which uses the
        wrapped normal distribution (μ=0, σ=`angle_disp`) with support in the interval
        [-π/2, π/2]. This parameter allows the user to specify a custom function for
        this purpose, which must follow [`angle_deltas()`][pyclugen.module.angle_deltas]
        signature. Alternatively, the user can specify an array of angle deltas
        directly.
      rng: The seed for the random number generator or an instance of
        [`Generator`][numpy.random.Generator] for reproducible executions.

    Returns:
      The generated clusters and associated information in the form of a
        [`Clusters`][pyclugen.main.Clusters] object.
    """
    # ############### #
    # Validate inputs #
    # ############### #

    # Check that number of dimensions is > 0
    if num_dims < 1:
        raise ValueError("Number of dimensions, `num_dims`, must be > 0")

    # Check that number of clusters is > 0
    if num_clusters < 1:
        raise ValueError("Number of clusters, `num_clust`, must be > 0")

    # Convert given direction into a NumPy array
    arrdir: NDArray = asarray(direction)

    # Get number of dimensions in `direction` array
    dir_ndims = arrdir.ndim

    # Is direction a vector or a matrix?
    if dir_ndims == 1:
        # It's a vector, let's convert it into a row matrix, since this will be
        # useful down the road
        arrdir = arrdir.reshape((1, -1))
    elif dir_ndims == 2:
        # If a matrix was given (i.e. a main direction is given for each cluster),
        # check if the number of directions is the same as the number of clusters
        dir_size_1 = arrdir.shape[0]
        if dir_size_1 != num_clusters:
            raise ValueError(
                "Number of rows in `direction` must be the same as the "
                + f"number of clusters ({dir_size_1} != {num_clusters})"
            )
    else:
        # The `directions` array must be a vector or a matrix, so if we get here
        # it means we have invalid arguments
        raise ValueError(
            "`direction` must be a vector (1D array) or a matrix (2D array), "
            + f"but is {dir_ndims}D"
        )

    # Check that direction has num_dims dimensions
    dir_size_2 = arrdir.shape[1]
    if dir_size_2 != num_dims:
        raise ValueError(
            "Length of directions in `direction` must be equal to "
            + f"`num_dims` ({dir_size_2} != {num_dims})"
        )

    # Check that directions have magnitude > 0
    dir_magnitudes = apply_along_axis(norm, 1, arrdir)
    if any(isclose(dir_magnitudes, 0)):
        raise ValueError("Directions in `direction` must have magnitude > 0")

    # If allow_empty is false, make sure there are enough points to distribute
    # by the clusters
    if (not allow_empty) and num_points < num_clusters:
        raise ValueError(
            f"A total of {num_points} points is not enough for "
            + f"{num_clusters} non-empty clusters"
        )

    # Check that cluster_sep has num_dims dimensions
    cluster_sep = asarray(cluster_sep)
    if cluster_sep.size != num_dims:
        raise ValueError(
            "Length of `cluster_sep` must be equal to `num_dims` "
            + f"({cluster_sep.size} != {num_dims})"
        )

    # If given, cluster_offset must have the correct number of dimensions,
    # if not given then it will be a num_dims x 1 vector of zeros
    if cluster_offset is None:
        cluster_offset = zeros(num_dims)
    else:
        cluster_offset = asarray(cluster_offset)
        if cluster_offset.size != num_dims:
            raise ValueError(
                "Length of `cluster_offset` must be equal to `num_dims` "
                + f"({cluster_offset.size} != {num_dims})"
            )

    # If the user specified rng as an int, create a proper rng object
    rng_sel: Generator
    if isinstance(rng, Generator):
        rng_sel = cast(Generator, rng)
    elif isinstance(rng, int):
        rng_sel = Generator(PCG64(cast(int, rng)))
    else:
        raise ValueError(
            f"`rng` must be an instance of int or Generator, but is {type(rng)}"
        )

    # Check that proj_dist_fn specifies a valid way for projecting points along
    # cluster-supporting lines i.e., either "norm" (default), "unif" or a
    # user-defined function
    pointproj_fn: Callable[[float, int, Generator], NDArray]

    if callable(proj_dist_fn):
        # Use user-defined distribution; assume function accepts length of line
        # and number of points, and returns a number of points x 1 vector
        pointproj_fn = proj_dist_fn

    elif proj_dist_fn == "unif":
        # Point projections will be uniformly placed along cluster-supporting lines
        def pointproj_fn(length, n, rg):
            return length * rg.random(n) - length / 2

    elif proj_dist_fn == "norm":
        # Use normal distribution for placing point projections along cluster-supporting
        # lines, mean equal to line center, standard deviation equal to 1/6 of line
        # length such that the line length contains ≈99.73% of the points
        def pointproj_fn(length, n, rg):
            return (1.0 / 6.0) * length * rg.normal(size=n)

    else:
        raise ValueError(
            "`proj_dist_fn` has to be either 'norm', 'unif' or user-defined function"
        )

    # Check that point_dist_fn specifies a valid way for generating points given
    # their projections along cluster-supporting lines, i.e., either "n-1"
    # (default), "n" or a user-defined function
    pt_from_proj_fn: Callable[
        [NDArray, float, float, NDArray, NDArray, Generator], NDArray
    ]

    if num_dims == 1:
        # If 1D was specified, point projections are the points themselves
        def pt_from_proj_fn(projs, lat_disp, length, clu_dir, clu_ctr, rng=rng_sel):
            return projs

    elif callable(point_dist_fn):
        # Use user-defined distribution; assume function accepts point projections
        # on the line, lateral disp., cluster direction and cluster center, and
        # returns a num_points x num_dims matrix containing the final points
        # for the current cluster
        pt_from_proj_fn = point_dist_fn

    elif point_dist_fn == "n-1":
        # Points will be placed on a hyperplane orthogonal to the cluster-supporting
        # line using a normal distribution centered at their intersection
        pt_from_proj_fn = clupoints_n_1

    elif point_dist_fn == "n":
        # Points will be placed using a multivariate normal distribution
        # centered at the point projection
        pt_from_proj_fn = clupoints_n

    else:
        raise ValueError(
            "point_dist_fn has to be either 'n-1', 'n' or a user-defined function"
        )

    # ############################ #
    # Determine cluster properties #
    # ############################ #

    # Normalize main direction(s)
    arrdir = apply_along_axis(lambda a: a / norm(a), 1, arrdir)

    # If only one main direction was given, expand it for all clusters
    if dir_ndims == 1:
        arrdir = repeat(arrdir, num_clusters, axis=0)

    # Determine cluster sizes
    if callable(clusizes_fn):
        cluster_sizes = clusizes_fn(num_clusters, num_points, allow_empty, rng_sel)
    elif len(asarray(clusizes_fn)) == num_clusters:
        cluster_sizes = asarray(clusizes_fn)
    else:
        raise ValueError(
            "clusizes_fn has to be either a function or a `num_clusters`-sized array"
        )

    # Custom clusizes_fn's are not required to obey num_points, so we update
    # it here just in case it's different from what the user specified
    num_points = sum(cluster_sizes)

    # Determine cluster centers
    if callable(clucenters_fn):
        cluster_centers = clucenters_fn(
            num_clusters, cluster_sep, cluster_offset, rng_sel
        )
    elif asarray(clucenters_fn).shape == (num_clusters, num_dims):
        cluster_centers = asarray(clucenters_fn)
    else:
        raise ValueError(
            "clucenters_fn has to be either a function or a matrix of size "
            + "`num_clusters` x `num_dims`"
        )

    # Determine length of lines supporting clusters
    if callable(llengths_fn):
        cluster_lengths = llengths_fn(num_clusters, llength, llength_disp, rng_sel)
    elif len(asarray(llengths_fn)) == num_clusters:
        cluster_lengths = asarray(llengths_fn)
    else:
        raise ValueError(
            "llengths_fn has to be either a function or a `num_clusters`-sized array"
        )

    # Obtain angles between main direction and cluster-supporting lines
    if callable(angle_deltas_fn):
        cluster_angles = angle_deltas_fn(num_clusters, angle_disp, rng_sel)
    elif len(asarray(angle_deltas_fn)) == num_clusters:
        cluster_angles = asarray(angle_deltas_fn)
    else:
        raise ValueError(
            "angle_deltas_fn has to be either a function or a "
            + "`num_clusters`-sized array"
        )

    # Determine normalized cluster directions by applying the obtained angles
    cluster_directions = apply_along_axis(
        lambda v, a: rand_vector_at_angle(v, next(a), rng_sel),
        1,
        arrdir,
        iter(cluster_angles),
    )

    # ################################# #
    # Determine points for each cluster #
    # ################################# #

    # Aux. vector with cumulative sum of number of points in each cluster
    cumsum_points = concatenate((asarray([0]), cumsum(cluster_sizes)))

    # Pre-allocate data structures for holding cluster info and points
    point_clusters: NDArray = empty(
        num_points, dtype=int32
    )  # Cluster indices of each point
    point_projections = empty((num_points, num_dims))  # Point projections on
    #                                                  # cluster-supporting lines
    points = empty((num_points, num_dims))  # Final points to be generated

    # Loop through clusters and create points for each one
    for i in range(num_clusters):
        # Start and end indexes for points in current cluster
        idx_start = cumsum_points[i]
        idx_end = cumsum_points[i + 1]

        # Update cluster indices of each point
        point_clusters[idx_start:idx_end] = i

        # Determine distance of point projections from the center of the line
        ptproj_dist_fn_center = pointproj_fn(
            cluster_lengths[i], cluster_sizes[i], rng_sel
        )

        # Determine coordinates of point projections on the line using the
        # parametric line equation (this works since cluster direction is normalized)
        point_projections[idx_start:idx_end, :] = points_on_line(
            cluster_centers[i, :], cluster_directions[i, :], ptproj_dist_fn_center
        )

        # Determine points from their projections on the line
        points[idx_start:idx_end, :] = pt_from_proj_fn(
            point_projections[idx_start:idx_end, :],
            lateral_disp,
            cluster_lengths[i],
            cluster_directions[i, :],
            cluster_centers[i, :],
            rng_sel,
        )

    return Clusters(
        points,
        point_clusters,
        point_projections,
        cluster_sizes,
        cluster_centers,
        cluster_directions,
        cluster_angles,
        cluster_lengths,
    )

clumerge

clumerge(
    *data: NamedTuple | Mapping[str, ArrayLike],
    fields: tuple[str, ...] = ("points", "clusters"),
    clusters_field: str | None = "clusters"
) -> dict[str, NDArray]

Merges the fields (specified in fields) of two or more data sets.

Merges the fields (specified in fields) of two or more data sets (named tuples or dictionaries). The fields to be merged need to have the same number of columns. The corresponding merged field will contain the rows of the fields to be merged, and will have a common supertype.

The clusters_field parameter specifies a field containing integers that identify the cluster to which the respective points belongs to. If clusters_field is specified (by default it's specified as "clusters"), cluster assignments in individual datasets will be updated in the merged dataset so that clusters are considered separate. This parameter can be set to None, in which case no field will be considered as a special cluster assignments field.

This function can be used to merge data sets generated with the clugen() function, by default merging the points and clusters fields in those data sets. It also works with arbitrary data by specifying alternative fields in the fields parameter. It can be used, for example, to merge third-party data with clugen()-generated data.

Examples:

>>> from pyclugen import clugen, clumerge
>>> data1 = clugen(2, 5, 1000, [1, 1], 0.01, [20, 20], 14, 1.2, 1.5);
>>> data2 = clugen(2, 3, 450, [0.8, -0.3], 0, [25, 21], 6, 0.4, 3.5);
>>> data3 = clugen(2, 2, 600, [0, -0.7], 0.2, [15, 10], 1, 0.1, 5.2);
>>> data_merged = clumerge(data1, data2, data3)

Parameters:

Name Type Description Default
*data NamedTuple | Mapping[str, ArrayLike]

One or more cluster data sets whose fields are to be merged.

()
fields tuple[str, ...]

Fields to be merged, which must exist in the data set given in *data.

('points', 'clusters')
clusters_field str | None

Field containing the integer cluster labels. If specified, cluster assignments in individual datasets will be updated in the merged dataset so that clusters are considered separate.

'clusters'

Returns:

Type Description
dict[str, NDArray]

A dictionary, where keys correspond to field names, and values to the merged numerical arrays.

Source code in pyclugen/main.py
def clumerge(
    *data: NamedTuple | Mapping[str, ArrayLike],
    fields: tuple[str, ...] = ("points", "clusters"),
    clusters_field: str | None = "clusters",
) -> dict[str, NDArray]:
    r"""Merges the fields (specified in `fields`) of two or more `data` sets.

    Merges the fields (specified in `fields`) of two or more `data` sets (named
    tuples or dictionaries). The fields to be merged need to have the same
    number of columns. The corresponding merged field will contain the rows of
    the fields to be merged, and will have a common supertype.

    The `clusters_field` parameter specifies a field containing integers that
    identify the cluster to which the respective points belongs to. If
    `clusters_field` is specified (by default it's specified as `"clusters"`),
    cluster assignments in individual datasets will be updated in the merged
    dataset so that clusters are considered separate. This parameter can be set
    to `None`, in which case no field will be considered as a special cluster
    assignments field.

    This function can be used to merge data sets generated with the
    [`clugen()`][pyclugen.main.clugen] function, by default merging the
    `points` and `clusters` fields in those data sets. It also works with
    arbitrary data by specifying alternative fields in the `fields` parameter.
    It can be used, for example, to merge third-party data with
    [`clugen()`][pyclugen.main.clugen]-generated data.

    Examples:
        >>> from pyclugen import clugen, clumerge
        >>> data1 = clugen(2, 5, 1000, [1, 1], 0.01, [20, 20], 14, 1.2, 1.5);
        >>> data2 = clugen(2, 3, 450, [0.8, -0.3], 0, [25, 21], 6, 0.4, 3.5);
        >>> data3 = clugen(2, 2, 600, [0, -0.7], 0.2, [15, 10], 1, 0.1, 5.2);
        >>> data_merged = clumerge(data1, data2, data3)

    Args:
      *data: One or more cluster data sets whose `fields` are to be merged.
      fields: Fields to be merged, which must exist in the data set given in
        `*data`.
      clusters_field: Field containing the integer cluster labels. If specified,
        cluster assignments in individual datasets will be updated in the merged
        dataset so that clusters are considered separate.

    Returns:
      A dictionary, where keys correspond to field names, and values to the
        merged numerical arrays.
    """
    # Number of elements in each array the merged dataset
    numel: int = 0

    # Number of columns of values in each field
    fields_info: dict[str, _FieldInfo] = {}

    # Merged dataset to output, initially empty
    output: dict[str, NDArray] = {}

    # Create a fields set
    fields_set: MutableSet[str] = set(fields)

    # If a clusters field is given, add it
    if clusters_field is not None:
        fields_set.add(str(clusters_field))

    # Data in dictionary format with NDArray views on data
    ddata: MutableSequence[Mapping[str, NDArray]] = []
    for dt in data:
        # If dt is a named tuple, convert it into a dictionary
        ddt: Mapping[str, ArrayLike]
        if isinstance(dt, dict):
            ddt = cast(dict, dt)
        else:
            ntdt = cast(NamedTuple, dt)
            ddt = ntdt._asdict()

        # Convert dictionary values to NDArrays
        ddtnp: Mapping[str, NDArray] = {k: asarray(v) for k, v in ddt.items()}

        # Add converted dictionary to our sequence of dictionaries
        ddata.append(ddtnp)

    # Cycle through data items
    for dt in ddata:
        # Number of elements in the current item
        numel_i: int = -1

        # Cycle through fields for the current item
        for field in fields_set:
            if field not in dt:
                raise ValueError(f"Data item does not contain required field `{field}`")
            elif field == clusters_field and not can_cast(
                dt[clusters_field].dtype, int64
            ):
                raise ValueError(f"`{clusters_field}` must contain integer types")

            # Get the field value
            value: NDArray = dt[field]

            # Number of elements in field value
            numel_tmp = len(value)

            # Check the number of elements in the field value
            if numel_i == -1:
                # First field: get number of elements in value (must be the same
                # for the remaining field values)
                numel_i = numel_tmp

            elif numel_tmp != numel_i:
                # Fields values after the first must have the same number of
                # elements
                raise ValueError(
                    "Data item contains fields with different sizes "
                    + f"({numel_tmp} != {numel_i})"
                )

            # Get/check info about the field value type
            if field not in fields_info:
                # If it's the first time this field appears, just get the info
                fields_info[field] = _FieldInfo(value.dtype, _getcols(value))

            else:
                # If this field already appeared in previous data items, get the
                # info and check/determine its compatibility with respect to
                # previous data items
                if _getcols(value) != fields_info[field].ncol:
                    # Number of columns must be the same
                    raise ValueError(f"Dimension mismatch in field `{field}`")

                # Get the common supertype
                fields_info[field].dtype = promote_types(
                    fields_info[field].dtype, value.dtype
                )

        # Update total number of elements
        numel += numel_i

    # Initialize output dictionary fields with room for all items
    for field in fields_info:
        if fields_info[field].ncol == 1:
            output[field] = empty((numel,), dtype=fields_info[field].dtype)
        else:
            output[field] = empty(
                (numel, fields_info[field].ncol), dtype=fields_info[field].dtype
            )

    # Copy items from input data to output dictionary, field-wise
    copied: int = 0
    last_cluster: int = 0

    # Create merged output
    for dt in ddata:
        # How many elements to copy for the current data item?
        tocopy: int = len(dt[fields[0]])

        # Cycle through each field and its information
        for field in fields_info:
            # Copy elements
            if field == clusters_field:
                # If this is a clusters field, update the cluster IDs
                old_clusters = unique(dt[clusters_field])
                new_clusters = list(
                    range(last_cluster + 1, last_cluster + len(old_clusters) + 1)
                )
                old2new = zip(old_clusters, new_clusters)
                mapping = dict(old2new)
                last_cluster = new_clusters[-1]

                output[field][copied : (copied + tocopy)] = [
                    mapping[val] for val in dt[clusters_field]
                ]

            else:
                # Otherwise just copy the elements
                ncol: int = fields_info[field].ncol
                output[field].flat[copied * ncol : (copied + tocopy) * ncol] = dt[field]

        # Update how many were copied so far
        copied += tocopy

    # Return result
    return output

clupoints_n

clupoints_n(
    projs: NDArray,
    lat_disp: float,
    line_len: float,
    clu_dir: NDArray,
    clu_ctr: NDArray,
    rng: Generator = _default_rng,
) -> NDArray

Generate points from their \(n\)-D projections on a cluster-supporting line.

Each point is placed around its projection using the normal distribution ( \(\mu=0\), \(σ=\)lat_disp).

This function's main intended use is by the clugen() function, generating the final points when the point_dist_fn parameter is set to "n".

Examples:

>>> from pyclugen import clupoints_n, points_on_line
>>> from numpy import array, linspace
>>> from numpy.random import Generator, PCG64
>>> prng = Generator(PCG64(123))
>>> projs = points_on_line(array([5,5]),     # Get 5 point projections
...                        array([1,0]),     # on a 2D line
...                        linspace(-4,4,5))
>>> projs
array([[1., 5.],
       [3., 5.],
       [5., 5.],
       [7., 5.],
       [9., 5.]])
>>> clupoints_n(projs, 0.5, 1.0, array([1,0]), array([0,0]), rng=prng)
array([[0.50543932, 4.81610667],
       [3.64396263, 5.09698721],
       [5.46011545, 5.2885519 ],
       [6.68176818, 5.27097611],
       [8.84170227, 4.83880544]])

Parameters:

Name Type Description Default
projs NDArray

Point projections on the cluster-supporting line ( \(p \times n\) matrix).

required
lat_disp float

Standard deviation for the normal distribution, i.e., cluster lateral dispersion.

required
line_len float

Length of cluster-supporting line (ignored).

required
clu_dir NDArray

Direction of the cluster-supporting line.

required
clu_ctr NDArray

Center position of the cluster-supporting line (ignored).

required
rng Generator

Optional pseudo-random number generator.

_default_rng

Returns:

Type Description
NDArray

Generated points ( \(p \times n\) matrix).

Source code in pyclugen/module.py
def clupoints_n(
    projs: NDArray,
    lat_disp: float,
    line_len: float,
    clu_dir: NDArray,
    clu_ctr: NDArray,
    rng: Generator = _default_rng,
) -> NDArray:
    r"""Generate points from their $n$-D projections on a cluster-supporting line.

    Each point is placed around its projection using the normal distribution
    ( $\mu=0$, $σ=$`lat_disp`).

    This function's main intended use is by the [`clugen()`][pyclugen.main.clugen]
    function, generating the final points when the `point_dist_fn` parameter is
    set to `"n"`.

    Examples:
        >>> from pyclugen import clupoints_n, points_on_line
        >>> from numpy import array, linspace
        >>> from numpy.random import Generator, PCG64
        >>> prng = Generator(PCG64(123))
        >>> projs = points_on_line(array([5,5]),     # Get 5 point projections
        ...                        array([1,0]),     # on a 2D line
        ...                        linspace(-4,4,5))
        >>> projs
        array([[1., 5.],
               [3., 5.],
               [5., 5.],
               [7., 5.],
               [9., 5.]])
        >>> clupoints_n(projs, 0.5, 1.0, array([1,0]), array([0,0]), rng=prng)
        array([[0.50543932, 4.81610667],
               [3.64396263, 5.09698721],
               [5.46011545, 5.2885519 ],
               [6.68176818, 5.27097611],
               [8.84170227, 4.83880544]])

    Args:
      projs: Point projections on the cluster-supporting line ( $p \times n$ matrix).
      lat_disp: Standard deviation for the normal distribution, i.e., cluster
        lateral dispersion.
      line_len: Length of cluster-supporting line (ignored).
      clu_dir: Direction of the cluster-supporting line.
      clu_ctr: Center position of the cluster-supporting line (ignored).
      rng: Optional pseudo-random number generator.

    Returns:
      Generated points ( $p \times n$ matrix).
    """
    # Number of dimensions
    num_dims = clu_dir.size

    # Number of points in this cluster
    clu_num_points = projs.shape[0]

    # Get random displacement vectors for each point projection
    displ = lat_disp * rng.normal(size=(clu_num_points, num_dims))

    # Add displacement vectors to each point projection
    points = projs + displ

    return points

clupoints_n_1

clupoints_n_1(
    projs: NDArray,
    lat_disp: float,
    line_len: float,
    clu_dir: NDArray,
    clu_ctr: NDArray,
    rng: Generator = _default_rng,
) -> NDArray

Generate points from their \(n\)-D projections on a cluster-supporting line.

Each point is placed on a hyperplane orthogonal to that line and centered at the point's projection, using the normal distribution ( \(\mu=0\), \(σ=\)lat_disp).

This function's main intended use is by the clugen() function, generating the final points when the point_dist_fn parameter is set to "n-1".

Examples:

>>> from pyclugen import clupoints_n_1, points_on_line
>>> from numpy import array, linspace
>>> from numpy.random import Generator, PCG64
>>> prng = Generator(PCG64(123))
>>> projs = points_on_line(array([5,5]),     # Get 5 point projections
...                        array([1,0]),     # on a 2D line
...                        linspace(-4,4,5))
>>> projs
array([[1., 5.],
       [3., 5.],
       [5., 5.],
       [7., 5.],
       [9., 5.]])
>>> clupoints_n_1(projs, 0.5, 1.0, array([1,0]), array([0,0]), rng=prng)
array([[1.        , 5.49456068],
       [3.        , 5.18389333],
       [5.        , 5.64396263],
       [7.        , 5.09698721],
       [9.        , 5.46011545]])

Parameters:

Name Type Description Default
projs NDArray

Point projections on the cluster-supporting line ( \(p \times n\) matrix).

required
lat_disp float

Standard deviation for the normal distribution, i.e., cluster lateral dispersion.

required
line_len float

Length of cluster-supporting line (ignored).

required
clu_dir NDArray

Direction of the cluster-supporting line.

required
clu_ctr NDArray

Center position of the cluster-supporting line (ignored).

required
rng Generator

Optional pseudo-random number generator.

_default_rng

Returns:

Type Description
NDArray

Generated points ( \(p \times n\) matrix).

Source code in pyclugen/module.py
def clupoints_n_1(
    projs: NDArray,
    lat_disp: float,
    line_len: float,
    clu_dir: NDArray,
    clu_ctr: NDArray,
    rng: Generator = _default_rng,
) -> NDArray:
    r"""Generate points from their $n$-D projections on a cluster-supporting line.

    Each point is placed on a hyperplane orthogonal to that line and centered at
    the point's projection, using the normal distribution ( $\mu=0$,
    $σ=$`lat_disp`).

    This function's main intended use is by the [`clugen()`][pyclugen.main.clugen]
    function, generating the final points when the `point_dist_fn` parameter is
    set to `"n-1"`.

    Examples:
        >>> from pyclugen import clupoints_n_1, points_on_line
        >>> from numpy import array, linspace
        >>> from numpy.random import Generator, PCG64
        >>> prng = Generator(PCG64(123))
        >>> projs = points_on_line(array([5,5]),     # Get 5 point projections
        ...                        array([1,0]),     # on a 2D line
        ...                        linspace(-4,4,5))
        >>> projs
        array([[1., 5.],
               [3., 5.],
               [5., 5.],
               [7., 5.],
               [9., 5.]])
        >>> clupoints_n_1(projs, 0.5, 1.0, array([1,0]), array([0,0]), rng=prng)
        array([[1.        , 5.49456068],
               [3.        , 5.18389333],
               [5.        , 5.64396263],
               [7.        , 5.09698721],
               [9.        , 5.46011545]])

    Args:
      projs: Point projections on the cluster-supporting line ( $p \times n$ matrix).
      lat_disp: Standard deviation for the normal distribution, i.e., cluster
        lateral dispersion.
      line_len: Length of cluster-supporting line (ignored).
      clu_dir: Direction of the cluster-supporting line.
      clu_ctr: Center position of the cluster-supporting line (ignored).
      rng: Optional pseudo-random number generator.

    Returns:
      Generated points ( $p \times n$ matrix).
    """
    # No blank line allowed here

    # Define function to get distances from points to their projections on the
    # line (i.e., using the normal distribution)
    def dist_fn(clu_num_points, ldisp, rg):
        return ldisp * rg.normal(size=clu_num_points)

    # Use clupoints_n_1_template() to do the heavy lifting
    return clupoints_n_1_template(projs, lat_disp, clu_dir, dist_fn, rng=rng)

clupoints_n_1_template

clupoints_n_1_template(
    projs: NDArray,
    lat_disp: float,
    clu_dir: NDArray,
    dist_fn: Callable[[int, float, Generator], NDArray],
    rng: Generator = _default_rng,
) -> NDArray

Create \(p\) points from their \(n\)-D projections on a cluster-supporting line.

Each point is placed on a hyperplane orthogonal to that line and centered at the point's projection. The function specified in dist_fn is used to perform the actual placement.

This function is used internally by clupoints_n_1() and may be useful for constructing user-defined final point placement strategies for the point_dist_fn parameter of the main clugen() function.

Examples:

>>> from numpy import array, zeros
>>> from numpy.random import Generator, PCG64
>>> from pyclugen import clupoints_n_1_template, points_on_line
>>> ctr = zeros(2)
>>> dir = array([1, 0])
>>> pdist = array([-0.5, -0.2, 0.1, 0.3])
>>> rng = Generator(PCG64(123))
>>> proj = points_on_line(ctr, dir, pdist)
>>> clupoints_n_1_template(proj, 0, dir, lambda p, l, r: r.random(p), rng=rng)
array([[-0.5       ,  0.68235186],
       [-0.2       , -0.05382102],
       [ 0.1       ,  0.22035987],
       [ 0.3       , -0.18437181]])

Parameters:

Name Type Description Default
projs NDArray

Point projections on the cluster-supporting line ( \(p \times n\) matrix).

required
lat_disp float

Dispersion of points from their projection.

required
clu_dir NDArray

Direction of the cluster-supporting line (unit vector).

required
dist_fn Callable[[int, float, Generator], NDArray]

Function to place points on a second line, orthogonal to the first. The functions accepts as parameters the number of points in the current cluster, the lateral_disp parameter (the same passed to the clugen() function), and a random number generator, returning a vector containing the distance of each point to its projection on the cluster-supporting line.

required
rng Generator

An optional pseudo-random number generator for reproducible executions.

_default_rng

Returns:

Type Description
NDArray

Generated points ( \(p \times n\) matrix).

Source code in pyclugen/helper.py
def clupoints_n_1_template(
    projs: NDArray,
    lat_disp: float,
    clu_dir: NDArray,
    dist_fn: Callable[[int, float, Generator], NDArray],
    rng: Generator = _default_rng,
) -> NDArray:
    r"""Create $p$ points from their $n$-D projections on a cluster-supporting line.

    Each point is placed on a hyperplane orthogonal to that line and centered at
    the point's projection. The function specified in `dist_fn` is used to perform
    the actual placement.

    This function is used internally by
    [`clupoints_n_1()`][pyclugen.module.clupoints_n_1] and may be useful for
    constructing user-defined final point placement strategies for the `point_dist_fn`
    parameter of the main [`clugen()`][pyclugen.main.clugen] function.

    Examples:
        >>> from numpy import array, zeros
        >>> from numpy.random import Generator, PCG64
        >>> from pyclugen import clupoints_n_1_template, points_on_line
        >>> ctr = zeros(2)
        >>> dir = array([1, 0])
        >>> pdist = array([-0.5, -0.2, 0.1, 0.3])
        >>> rng = Generator(PCG64(123))
        >>> proj = points_on_line(ctr, dir, pdist)
        >>> clupoints_n_1_template(proj, 0, dir, lambda p, l, r: r.random(p), rng=rng)
        array([[-0.5       ,  0.68235186],
               [-0.2       , -0.05382102],
               [ 0.1       ,  0.22035987],
               [ 0.3       , -0.18437181]])

    Args:
      projs: Point projections on the cluster-supporting line ( $p \times n$ matrix).
      lat_disp: Dispersion of points from their projection.
      clu_dir: Direction of the cluster-supporting line (unit vector).
      dist_fn: Function to place points on a second line, orthogonal to the first.
        The functions accepts as parameters the number of points in the current
        cluster, the `lateral_disp` parameter (the same passed to the
        [`clugen()`][pyclugen.main.clugen] function), and a random number generator,
        returning a vector containing the distance of each point to its projection
        on the cluster-supporting line.
      rng: An optional pseudo-random number generator for reproducible executions.

    Returns:
      Generated points ( $p \times n$ matrix).
    """
    # Number of dimensions
    num_dims = clu_dir.size

    # Number of points in this cluster
    clu_num_points = projs.shape[0]

    # Get distances from points to their projections on the line
    points_dist = dist_fn(clu_num_points, lat_disp, rng)

    # Get normalized vectors, orthogonal to the current line, for each point
    orth_vecs = zeros((clu_num_points, num_dims))

    for j in range(clu_num_points):
        orth_vecs[j, :] = rand_ortho_vector(clu_dir, rng=rng).ravel()

    # Set vector magnitudes
    orth_vecs = abs(points_dist).reshape(-1, 1) * orth_vecs

    # Add perpendicular vectors to point projections on the line,
    # yielding final cluster points
    points = projs + orth_vecs

    return points

clusizes

clusizes(
    num_clusters: int,
    num_points: int,
    allow_empty: bool,
    rng: Generator = _default_rng,
) -> NDArray

Determine cluster sizes, i.e., the number of points in each cluster.

Cluster sizes are determined using the normal distribution ( \(\mu=\)num_points \(/\)num_clusters, \(\sigma=\mu/3\)), and then assuring that the final cluster sizes add up to num_points via the fix_num_points() function.

Examples:

>>> from numpy.random import Generator, PCG64
>>> from pyclugen import clusizes
>>> prng = Generator(PCG64(123))
>>> sizes = clusizes(4, 1000, True, rng=prng)
>>> sizes
array([166, 217, 354, 263])
>>> sum(sizes)
1000

Parameters:

Name Type Description Default
num_clusters int

Number of clusters.

required
num_points int

Total number of points.

required
allow_empty bool

Allow empty clusters?

required
rng Generator

Optional pseudo-random number generator.

_default_rng

Returns:

Type Description
NDArray

Number of points in each cluster (vector of size num_clusters).

Source code in pyclugen/module.py
def clusizes(
    num_clusters: int,
    num_points: int,
    allow_empty: bool,
    rng: Generator = _default_rng,
) -> NDArray:
    r"""Determine cluster sizes, i.e., the number of points in each cluster.

    Cluster sizes are determined using the normal distribution (
    $\mu=$`num_points` $/$`num_clusters`, $\sigma=\mu/3$), and then
    assuring that the final cluster sizes add up to `num_points` via the
    [`fix_num_points()`][pyclugen.helper.fix_num_points] function.

    Examples:
        >>> from numpy.random import Generator, PCG64
        >>> from pyclugen import clusizes
        >>> prng = Generator(PCG64(123))
        >>> sizes = clusizes(4, 1000, True, rng=prng)
        >>> sizes
        array([166, 217, 354, 263])
        >>> sum(sizes)
        1000

    Args:
      num_clusters: Number of clusters.
      num_points: Total number of points.
      allow_empty: Allow empty clusters?
      rng: Optional pseudo-random number generator.

    Returns:
      Number of points in each cluster (vector of size `num_clusters`).
    """
    # Determine number of points in each cluster using the normal distribution

    # Consider the mean an equal division of points between clusters
    mean = num_points / num_clusters
    # The standard deviation is such that the interval [0, 2 * mean] will contain
    # ≈99.7% of cluster sizes
    std = mean / 3

    # Determine points with the normal distribution
    clu_num_points = std * rng.normal(size=num_clusters) + mean

    # Set negative values to zero
    clu_num_points = where(clu_num_points > 0, clu_num_points, 0)

    # Fix imbalances, so that num_points is respected
    if sum(clu_num_points) > 0:  # Be careful not to divide by zero
        clu_num_points *= num_points / sum(clu_num_points)

    # Round the real values to integers since a cluster sizes is represented by
    # an integer
    clu_num_points = rint(clu_num_points).astype(int)

    # Make sure total points is respected, which may not be the case at this time due
    # to rounding
    fix_num_points(clu_num_points, num_points)

    # If empty clusters are not allowed, make sure there aren't any
    if not allow_empty:
        fix_empty(clu_num_points)

    return clu_num_points

fix_empty

fix_empty(clu_num_points: NDArray, allow_empty: bool = False) -> NDArray

Certifies that, given enough points, no clusters are left empty.

This is done by removing a point from the largest cluster and adding it to an empty cluster while there are empty clusters. If the total number of points is smaller than the number of clusters (or if the allow_empty parameter is set to true), this function does nothing.

This function is used internally by clusizes() and might be useful for custom cluster sizing implementations given as the clusizes_fn parameter of the main clugen() function.

Note that the array is changed in-place.

Examples:

>>> from numpy import array
>>> from pyclugen import fix_empty
>>> clusters = array([3, 4, 5, 0, 0])
>>> fix_empty(clusters)
array([3, 3, 4, 1, 1])
>>> clusters # Verify that the array was changed in-place
array([3, 3, 4, 1, 1])

Parameters:

Name Type Description Default
clu_num_points NDArray

Number of points in each cluster (vector of size \(c\)), where \(c\) is the number of clusters.

required
allow_empty bool

Allow empty clusters?

False

Returns:

Type Description
NDArray

Number of points in each cluster, after being fixed by this function (vector of size \(c\), which is the same reference than clu_num_points).

Source code in pyclugen/helper.py
def fix_empty(clu_num_points: NDArray, allow_empty: bool = False) -> NDArray:
    r"""Certifies that, given enough points, no clusters are left empty.

    This is done by removing a point from the largest cluster and adding it to an
    empty cluster while there are empty clusters. If the total number of points is
    smaller than the number of clusters (or if the `allow_empty` parameter is set
    to `true`), this function does nothing.

    This function is used internally by [`clusizes()`][pyclugen.module.clusizes]
    and might be useful for custom cluster sizing implementations given as the
    `clusizes_fn` parameter of the main [`clugen()`][pyclugen.main.clugen] function.

    Note that the array is changed in-place.

    Examples:
        >>> from numpy import array
        >>> from pyclugen import fix_empty
        >>> clusters = array([3, 4, 5, 0, 0])
        >>> fix_empty(clusters)
        array([3, 3, 4, 1, 1])
        >>> clusters # Verify that the array was changed in-place
        array([3, 3, 4, 1, 1])

    Args:
      clu_num_points: Number of points in each cluster (vector of size $c$),
        where $c$ is the number of clusters.
      allow_empty: Allow empty clusters?

    Returns:
      Number of points in each cluster, after being fixed by this function (vector
        of size $c$, which is the same reference than `clu_num_points`).
    """
    # If the allow_empty parameter is set to true, don't do anything and return
    # immediately; this is useful for quick `clusizes_fn` one-liners
    if not allow_empty:
        # Find empty clusters
        empty_clusts = [idx for idx, val in enumerate(clu_num_points) if val == 0]

        # If there are empty clusters and enough points for all clusters...
        if len(empty_clusts) > 0 and sum(clu_num_points) >= clu_num_points.size:
            # Go through the empty clusters...
            for i0 in empty_clusts:
                # ...get a point from the largest cluster and assign it to the
                # current empty cluster
                imax = argmax(clu_num_points)
                clu_num_points[imax] -= 1
                clu_num_points[i0] += 1

    return clu_num_points

fix_num_points

fix_num_points(clu_num_points: NDArray, num_points: int) -> NDArray

Certifies that the values in the clu_num_points array add up to num_points.

If this is not the case, the clu_num_points array is modified in-place, incrementing the value corresponding to the smallest cluster while sum(clu_num_points) < num_points, or decrementing the value corresponding to the largest cluster while sum(clu_num_points) > num_points.

This function is used internally by clusizes() and might be useful for custom cluster sizing implementations given as the clusizes_fn parameter of the main clugen() function.

Examples:

>>> from numpy import array
>>> from pyclugen import fix_num_points
>>> clusters = array([1, 6, 3])  # 10 total points
>>> fix_num_points(clusters, 12) # But we want 12 total points
array([3, 6, 3])
>>> clusters # Verify that the array was changed in-place
array([3, 6, 3])

Parameters:

Name Type Description Default
clu_num_points NDArray

Number of points in each cluster (vector of size \(c\)), where \(c\) is the number of clusters.

required
num_points int

The expected total number of points.

required

Returns:

Type Description
NDArray

Number of points in each cluster, after being fixed by this function (vector of size \(c\), which is the same reference than clu_num_points).

Source code in pyclugen/helper.py
def fix_num_points(clu_num_points: NDArray, num_points: int) -> NDArray:
    r"""Certifies that the values in the `clu_num_points` array add up to `num_points`.

    If this is not the case, the `clu_num_points` array is modified in-place,
    incrementing the value corresponding to the smallest cluster while
    `sum(clu_num_points) < num_points`, or decrementing the value corresponding to
    the largest cluster while `sum(clu_num_points) > num_points`.

    This function is used internally by [`clusizes()`][pyclugen.module.clusizes]
    and might be useful for custom cluster sizing implementations given as the
    `clusizes_fn` parameter of the main [`clugen()`][pyclugen.main.clugen] function.

    Examples:
        >>> from numpy import array
        >>> from pyclugen import fix_num_points
        >>> clusters = array([1, 6, 3])  # 10 total points
        >>> fix_num_points(clusters, 12) # But we want 12 total points
        array([3, 6, 3])
        >>> clusters # Verify that the array was changed in-place
        array([3, 6, 3])

    Args:
      clu_num_points: Number of points in each cluster (vector of size $c$),
        where $c$ is the number of clusters.
      num_points: The expected total number of points.

    Returns:
      Number of points in each cluster, after being fixed by this function (vector
        of size $c$, which is the same reference than `clu_num_points`).
    """
    while sum(clu_num_points) < num_points:
        imin = argmin(clu_num_points)
        clu_num_points[imin] += 1
    while sum(clu_num_points) > num_points:
        imax = argmax(clu_num_points)
        clu_num_points[imax] -= 1

    return clu_num_points

llengths

llengths(
    num_clusters: int,
    llength: float,
    llength_disp: float,
    rng: Generator = _default_rng,
) -> NDArray

Determine length of cluster-supporting lines.

Line lengths are determined using the folded normal distribution ( \(\mu=\)llength, \(\sigma=\)llength_disp).

Examples:

>>> from numpy.random import Generator, MT19937
>>> from pyclugen import llengths
>>> prng = Generator(MT19937(123))
>>> llengths(4, 20, 3.5, rng=prng)
array([19.50968733, 19.92482858, 25.99013804, 18.58029672])

Parameters:

Name Type Description Default
num_clusters int

Number of clusters.

required
llength float

Average line length.

required
llength_disp float

Line length dispersion.

required
rng Generator

Optional pseudo-random number generator.

_default_rng

Returns:

Type Description
NDArray

Lengths of cluster-supporting lines (vector of size num_clusters).

Source code in pyclugen/module.py
def llengths(
    num_clusters: int,
    llength: float,
    llength_disp: float,
    rng: Generator = _default_rng,
) -> NDArray:
    r"""Determine length of cluster-supporting lines.

    Line lengths are determined using the folded normal distribution (
    $\mu=$`llength`, $\sigma=$`llength_disp`).

    Examples:
        >>> from numpy.random import Generator, MT19937
        >>> from pyclugen import llengths
        >>> prng = Generator(MT19937(123))
        >>> llengths(4, 20, 3.5, rng=prng)
        array([19.50968733, 19.92482858, 25.99013804, 18.58029672])

    Args:
      num_clusters: Number of clusters.
      llength: Average line length.
      llength_disp: Line length dispersion.
      rng: Optional pseudo-random number generator.

    Returns:
      Lengths of cluster-supporting lines (vector of size `num_clusters`).
    """
    return abs(llength + llength_disp * rng.normal(size=num_clusters))

points_on_line

points_on_line(
    center: NDArray, direction: NDArray, dist_center: NDArray
) -> NDArray

Determine coordinates of points on a line.

Determine coordinates of points on a line with center and direction, based on the distances from the center given in dist_center.

This works by using the vector formulation of the line equation assuming direction is a \(n\)-dimensional unit vector. In other words, considering \(\mathbf{d}=\)direction.reshape(-1,1) ( \(n \times 1\) vector), \(\mathbf{c}=\)center.reshape(-1,1) ( \(n \times 1\) vector), and \(\mathbf{w}=\) dist_center.reshape(-1,1) ( \(p \times 1\) vector), the coordinates of points on the line are given by:

\[ \mathbf{P}=\mathbf{1}\,\mathbf{c}^T + \mathbf{w}\mathbf{d}^T \]

where \(\mathbf{P}\) is the \(p \times n\) matrix of point coordinates on the line, and \(\mathbf{1}\) is a \(p \times 1\) vector with all entries equal to 1.

Examples:

>>> from pyclugen import points_on_line
>>> from numpy import array, linspace
>>> points_on_line(array([5., 5.]),
...                array([1., 0.]),
...                linspace(-4, 4, 5)) # 2D, 5 points
array([[1., 5.],
       [3., 5.],
       [5., 5.],
       [7., 5.],
       [9., 5.]])
>>> points_on_line(array([-2, 0, 0., 2]),
...                array([0., 0, -1, 0]),
...                array([10, -10])) # 4D, 2 points
array([[ -2.,   0., -10.,   2.],
       [ -2.,   0.,  10.,   2.]])

Parameters:

Name Type Description Default
center NDArray

Center of the line ( \(n\)-component vector).

required
direction NDArray

Line direction ( \(n\)-component unit vector).

required
dist_center NDArray

Distance of each point to the center of the line ( \(p\)-component vector, where \(p\) is the number of points).

required

Returns:

Type Description
NDArray

Coordinates of points on the specified line ( \(p \times n\) matrix).

Source code in pyclugen/core.py
def points_on_line(
    center: NDArray, direction: NDArray, dist_center: NDArray
) -> NDArray:
    r"""Determine coordinates of points on a line.

    Determine coordinates of points on a line with `center` and `direction`,
    based on the distances from the center given in `dist_center`.

    This works by using the vector formulation of the line equation assuming
    `direction` is a $n$-dimensional unit vector. In other words, considering
    $\mathbf{d}=$`direction.reshape(-1,1)` ( $n \times 1$ vector),
    $\mathbf{c}=$`center.reshape(-1,1)` ( $n \times 1$ vector), and
    $\mathbf{w}=$ `dist_center.reshape(-1,1)` ( $p \times 1$ vector),
    the coordinates of points on the line are given by:

    $$
    \mathbf{P}=\mathbf{1}\,\mathbf{c}^T + \mathbf{w}\mathbf{d}^T
    $$

    where $\mathbf{P}$ is the $p \times n$ matrix of point coordinates on the
    line, and $\mathbf{1}$ is a $p \times 1$ vector with all entries equal to 1.

    Examples:
        >>> from pyclugen import points_on_line
        >>> from numpy import array, linspace
        >>> points_on_line(array([5., 5.]),
        ...                array([1., 0.]),
        ...                linspace(-4, 4, 5)) # 2D, 5 points
        array([[1., 5.],
               [3., 5.],
               [5., 5.],
               [7., 5.],
               [9., 5.]])
        >>> points_on_line(array([-2, 0, 0., 2]),
        ...                array([0., 0, -1, 0]),
        ...                array([10, -10])) # 4D, 2 points
        array([[ -2.,   0., -10.,   2.],
               [ -2.,   0.,  10.,   2.]])

    Args:
      center: Center of the line ( $n$-component vector).
      direction: Line direction ( $n$-component unit vector).
      dist_center: Distance of each point to the center of the line
        ( $p$-component vector, where $p$ is the number of points).

    Returns:
      Coordinates of points on the specified line ( $p \times n$ matrix).
    """
    return center.reshape(1, -1) + dist_center.reshape(-1, 1) @ direction.reshape(
        (1, -1)
    )

rand_ortho_vector

rand_ortho_vector(u: NDArray, rng: Generator = _default_rng) -> NDArray

Get a random unit vector orthogonal to u.

Note that u is expected to be a unit vector itself.

Examples:

>>> from pyclugen import rand_ortho_vector
>>> from numpy import isclose, dot
>>> from numpy.linalg import norm
>>> from numpy.random import Generator, PCG64
>>> rng = Generator(PCG64(123))
>>> r = rng.random(3) # Get a random vector with 3 components (3D)
>>> r = r / norm(r) # Normalize it
>>> r_ort = rand_ortho_vector(r, rng=rng) # Get random unit vector orth. to r
>>> r_ort
array([-0.1982903 , -0.61401512,  0.76398062])
>>> isclose(dot(r, r_ort), 0) # Check that vectors are indeed orthogonal
True

Parameters:

Name Type Description Default
u NDArray

Unit vector with \(n\) components.

required
rng Generator

Optional pseudo-random number generator.

_default_rng

Returns:

Type Description
NDArray

A random unit vector with \(n\) components orthogonal to u.

Source code in pyclugen/core.py
def rand_ortho_vector(u: NDArray, rng: Generator = _default_rng) -> NDArray:
    r"""Get a random unit vector orthogonal to `u`.

    Note that `u` is expected to be a unit vector itself.

    Examples:
        >>> from pyclugen import rand_ortho_vector
        >>> from numpy import isclose, dot
        >>> from numpy.linalg import norm
        >>> from numpy.random import Generator, PCG64
        >>> rng = Generator(PCG64(123))
        >>> r = rng.random(3) # Get a random vector with 3 components (3D)
        >>> r = r / norm(r) # Normalize it
        >>> r_ort = rand_ortho_vector(r, rng=rng) # Get random unit vector orth. to r
        >>> r_ort
        array([-0.1982903 , -0.61401512,  0.76398062])
        >>> isclose(dot(r, r_ort), 0) # Check that vectors are indeed orthogonal
        True

    Args:
      u: Unit vector with $n$ components.
      rng: Optional pseudo-random number generator.

    Returns:
      A random unit vector with $n$ components orthogonal to `u`.
    """
    # If 1D, just return a random unit vector
    if u.size == 1:
        return rand_unit_vector(1, rng=rng)

    # Find a random, non-parallel vector to u
    while True:
        # Find normalized random vector
        r = rand_unit_vector(u.size, rng=rng)

        # If not parallel to u we can keep it and break the loop
        if not isclose(abs(dot(u, r)), 1):
            break

    # Get vector orthogonal to u using 1st iteration of Gram-Schmidt process
    v = r - dot(u, r) / dot(u, u) * u

    # Normalize it
    v = v / norm(v)

    # And return it
    return v

rand_unit_vector

rand_unit_vector(num_dims: int, rng: Generator = _default_rng) -> NDArray

Get a random unit vector with num_dims components.

Examples:

>>> from pyclugen import rand_unit_vector
>>> rand_unit_vector(4)
array([ 0.48653889,  0.50753862,  0.05711487, -0.70881757])
>>> from pyclugen import rand_unit_vector
>>> from numpy.random import Generator, PCG64
>>> rng = Generator(PCG64(123))
>>> rand_unit_vector(2, rng=rng) # Reproducible
array([ 0.3783202 , -0.92567479])

Parameters:

Name Type Description Default
num_dims int

Number of components in vector (i.e. vector size).

required
rng Generator

Optional pseudo-random number generator.

_default_rng

Returns:

Type Description
NDArray

A random unit vector with num_dims components.

Source code in pyclugen/core.py
def rand_unit_vector(num_dims: int, rng: Generator = _default_rng) -> NDArray:
    r"""Get a random unit vector with `num_dims` components.

    Examples:
        >>> from pyclugen import rand_unit_vector
        >>> rand_unit_vector(4) # doctest: +SKIP
        array([ 0.48653889,  0.50753862,  0.05711487, -0.70881757])

        >>> from pyclugen import rand_unit_vector
        >>> from numpy.random import Generator, PCG64
        >>> rng = Generator(PCG64(123))
        >>> rand_unit_vector(2, rng=rng) # Reproducible
        array([ 0.3783202 , -0.92567479])

    Args:
      num_dims: Number of components in vector (i.e. vector size).
      rng: Optional pseudo-random number generator.

    Returns:
      A random unit vector with `num_dims` components.
    """
    r = rng.random(num_dims) - 0.5
    r = r / norm(r)
    return r

rand_vector_at_angle

rand_vector_at_angle(
    u: NDArray, angle: float, rng: Generator = _default_rng
) -> NDArray

Get a random unit vector which is at angle radians of vector u.

Note that u is expected to be a unit vector itself.

Examples:

>>> from pyclugen import rand_vector_at_angle
>>> from numpy import arccos, array, degrees, pi, dot
>>> from numpy.linalg import norm
>>> from numpy.random import Generator, PCG64
>>> rng = Generator(PCG64(123))
>>> u = array([ 1.0, 0, 0.5, -0.5 ]) # Define a 4D vector
>>> u = u / norm(u) # Normalize the vector
>>> v = rand_vector_at_angle(u, pi/4, rng=rng) # Get a vector at 45 degrees
>>> v
array([ 0.633066  , -0.50953554, -0.10693823, -0.57285705])
>>> degrees(arccos(dot(u, v) / norm(u) * norm(v))) # Angle between u and v
45.0

Parameters:

Name Type Description Default
u NDArray

Unit vector with \(n\) components.

required
angle float

Angle in radians.

required
rng Generator

Optional pseudo-random number generator.

_default_rng

Returns:

Type Description
NDArray

Random unit vector with \(n\) components which is at angle radians with vector u.

Source code in pyclugen/core.py
def rand_vector_at_angle(
    u: NDArray, angle: float, rng: Generator = _default_rng
) -> NDArray:
    r"""Get a random unit vector which is at `angle` radians of vector `u`.

    Note that `u` is expected to be a unit vector itself.

    Examples:
        >>> from pyclugen import rand_vector_at_angle
        >>> from numpy import arccos, array, degrees, pi, dot
        >>> from numpy.linalg import norm
        >>> from numpy.random import Generator, PCG64
        >>> rng = Generator(PCG64(123))
        >>> u = array([ 1.0, 0, 0.5, -0.5 ]) # Define a 4D vector
        >>> u = u / norm(u) # Normalize the vector
        >>> v = rand_vector_at_angle(u, pi/4, rng=rng) # Get a vector at 45 degrees
        >>> v
        array([ 0.633066  , -0.50953554, -0.10693823, -0.57285705])
        >>> degrees(arccos(dot(u, v) / norm(u) * norm(v))) # Angle between u and v
        45.0

    Args:
      u: Unit vector with $n$ components.
      angle: Angle in radians.
      rng: Optional pseudo-random number generator.

    Returns:
      Random unit vector with $n$ components which is at `angle` radians
        with vector `u`.
    """
    if isclose(abs(angle), pi / 2) and u.size > 1:
        return rand_ortho_vector(u, rng=rng)
    elif -pi / 2 < angle < pi / 2 and u.size > 1:
        v = u + rand_ortho_vector(u, rng=rng) * tan(angle)
        return v / norm(v)
    else:
        # For |θ| > π/2 or the 1D case, simply return a random vector
        return rand_unit_vector(u.size, rng=rng)