Reference

Various functions for multidimensional cluster generation in Python.

Note that:

clugen() is the main function of the pyclugen package, and possibly the only function most users will need.
Functions which accept rng as the last parameter are stochastic. Thus, in order to obtain the same result on separate invocations of these functions, pass them an instance of same pseudo-random number Generator initialized with the same seed.

Clusters ¶

Bases: NamedTuple

Read-only container for results returned by clugen().

The symbols presented in the instances variable below have the following meanings:

\(n\) : Number of dimensions.
\(p\) : Number of points.
\(c\) : Number of clusters.

Source code in pyclugen/main.py

class Clusters(NamedTuple):
    r"""Read-only container for results returned by [`clugen()`][pyclugen.main.clugen].

    The symbols presented in the instances variable below have the following
    meanings:

    - $n$ : Number of dimensions.
    - $p$ : Number of points.
    - $c$ : Number of clusters.
    """

    points: NDArray
    r"""$p \times n$ matrix containing the generated points for all clusters."""

    clusters: NDArray
    r"""Vector of size $p$ indicating the cluster each point in `points`
    belongs to."""

    projections: NDArray
    r"""$p \times n$ matrix with the point projections on the cluster-supporting
    lines."""

    sizes: NDArray
    r"""Vector of size $c$ with the number of points in each cluster."""

    centers: NDArray
    r"""$c \times n$ matrix with the coordinates of the cluster centers."""

    directions: NDArray
    r"""$c \times n$ matrix with the direction of each cluster-supporting line."""

    angles: NDArray
    r"""Vector of size $c$ with the angles between the cluster-supporting lines and
    the main direction."""

    lengths: NDArray
    r"""Vector of size $c$ with the lengths of the cluster-supporting lines."""

angles `instance-attribute` ¶

angles: NDArray

Vector of size \(c\) with the angles between the cluster-supporting lines and the main direction.

centers `instance-attribute` ¶

centers: NDArray

\(c \times n\) matrix with the coordinates of the cluster centers.

clusters `instance-attribute` ¶

clusters: NDArray

Vector of size \(p\) indicating the cluster each point in points belongs to.

directions `instance-attribute` ¶

directions: NDArray

\(c \times n\) matrix with the direction of each cluster-supporting line.

lengths `instance-attribute` ¶

lengths: NDArray

Vector of size \(c\) with the lengths of the cluster-supporting lines.

points `instance-attribute` ¶

points: NDArray

\(p \times n\) matrix containing the generated points for all clusters.

projections `instance-attribute` ¶

projections: NDArray

\(p \times n\) matrix with the point projections on the cluster-supporting lines.

sizes `instance-attribute` ¶

sizes: NDArray

Vector of size \(c\) with the number of points in each cluster.

angle_btw ¶

angle_btw(v1: NDArray, v2: NDArray) -> float

Angle between two \(n\)-dimensional vectors.

Typically, the angle between two vectors v1 and v2 can be obtained with:

arccos(dot(u, v) / (norm(u) * norm(v)))

However, this approach is numerically unstable. The version provided here is numerically stable and based on the AngleBetweenVectors Julia package by Jeffrey Sarnoff (MIT license), implementing an algorithm provided by Prof. W. Kahan in these notes (see page 15).

Examples:

>>> from numpy import array, degrees
>>> from pyclugen import angle_btw
>>> v1 = array([1.0, 1.0, 1.0, 1.0])
>>> v2 = array([1.0, 0.0, 0.0, 0.0])
>>> float(degrees(angle_btw(v1, v2)))
60.00000000000001

Parameters:

Name	Type	Description	Default
`v1`	`NDArray`	First vector.	required
`v2`	`NDArray`	Second vector.	required

Returns:

Type	Description
`float`	Angle between `v1` and `v2` in radians.

Source code in pyclugen/helper.py

def angle_btw(v1: NDArray, v2: NDArray) -> float:
    r"""Angle between two $n$-dimensional vectors.

    Typically, the angle between two vectors `v1` and `v2` can be obtained with:

    ```python
    arccos(dot(u, v) / (norm(u) * norm(v)))
    ```

    However, this approach is numerically unstable. The version provided here is
    numerically stable and based on the
    [AngleBetweenVectors](https://github.com/JeffreySarnoff/AngleBetweenVectors.jl)
    Julia package by Jeffrey Sarnoff (MIT license), implementing an algorithm
    provided by Prof. W. Kahan in
    [these notes](https://people.eecs.berkeley.edu/~wkahan/MathH110/Cross.pdf)
    (see page 15).

    Examples:
        >>> from numpy import array, degrees
        >>> from pyclugen import angle_btw
        >>> v1 = array([1.0, 1.0, 1.0, 1.0])
        >>> v2 = array([1.0, 0.0, 0.0, 0.0])
        >>> float(degrees(angle_btw(v1, v2)))
        60.00000000000001

    Args:
      v1: First vector.
      v2: Second vector.

    Returns:
      Angle between `v1` and `v2` in radians.
    """
    u1 = v1 / norm(v1)
    u2 = v2 / norm(v2)

    y = u1 - u2
    x = u1 + u2

    return 2 * np.arctan(norm(y) / norm(x))

angle_deltas ¶

angle_deltas(
    num_clusters: int, angle_disp: float, rng: Generator = _default_rng
) -> NDArray

Get angles between average cluster direction and cluster-supporting lines.

Determine the angles between the average cluster direction and the cluster-supporting lines. These angles are obtained from a wrapped normal distribution ( \(\mu=0\), \(\sigma=\)angle_disp) with support in the interval \(\left[-\pi/2,\pi/2\right]\). Note this is different from the standard wrapped normal distribution, the support of which is given by the interval \(\left[-\pi,\pi\right]\).

Examples:

>>> from pyclugen import angle_deltas
>>> from numpy import degrees, pi
>>> from numpy.random import Generator, PCG64
>>> prng = Generator(PCG64(123))
>>> a_rad = angle_deltas(4, pi/8, rng=prng) # Angle dispersion of 22.5 degrees
>>> a_rad
array([-0.38842705, -0.14442948,  0.50576707,  0.07617358])
>>> degrees(a_rad) # Show angle deltas in degrees
array([-22.25523038,  -8.27519966,  28.97831838,   4.36442443])

Parameters:

Name	Type	Description	Default
`num_clusters`	`int`	Number of clusters.	required
`angle_disp`	`float`	Angle dispersion, in radians.	required
`rng`	`Generator`	Optional pseudo-random number generator.	`_default_rng`

Returns:

Type	Description
`NDArray`	Angles between the average cluster direction and the cluster-supporting lines, given in radians in the interval \(\left[-\pi/2,\pi/2\right]\).

Source code in pyclugen/module.py

def angle_deltas(
    num_clusters: int, angle_disp: float, rng: Generator = _default_rng
) -> NDArray:
    r"""Get angles between average cluster direction and cluster-supporting lines.

    Determine the angles between the average cluster direction and the
    cluster-supporting lines. These angles are obtained from a wrapped normal
    distribution ( $\mu=0$, $\sigma=$`angle_disp`) with support in the interval
    $\left[-\pi/2,\pi/2\right]$. Note this is different from the standard
    wrapped normal distribution, the support of which is given by the interval
    $\left[-\pi,\pi\right]$.

    Examples:
        >>> from pyclugen import angle_deltas
        >>> from numpy import degrees, pi
        >>> from numpy.random import Generator, PCG64
        >>> prng = Generator(PCG64(123))
        >>> a_rad = angle_deltas(4, pi/8, rng=prng) # Angle dispersion of 22.5 degrees
        >>> a_rad
        array([-0.38842705, -0.14442948,  0.50576707,  0.07617358])
        >>> degrees(a_rad) # Show angle deltas in degrees
        array([-22.25523038,  -8.27519966,  28.97831838,   4.36442443])

    Args:
      num_clusters: Number of clusters.
      angle_disp: Angle dispersion, in radians.
      rng: Optional pseudo-random number generator.

    Returns:
      Angles between the average cluster direction and the cluster-supporting
        lines, given in radians in the interval $\left[-\pi/2,\pi/2\right]$.
    """
    # Get random angle differences using the normal distribution
    angles = angle_disp * rng.normal(size=num_clusters)

    # Reduce angle differences to the interval [-π, π]
    angles = np.arctan2(np.sin(angles), np.cos(angles))

    # Make sure angle differences are within interval [-π/2, π/2]
    return np.where(
        np.abs(angles) > np.pi / 2, angles - np.sign(angles) * np.pi / 2, angles
    )

clucenters ¶

clucenters(
    num_clusters: int,
    clu_sep: NDArray,
    clu_offset: NDArray,
    rng: Generator = _default_rng,
) -> NDArray

Determine cluster centers using the uniform distribution.

The number of clusters (num_clusters) and the average cluster separation (clu_sep) are taken into account.

More specifically, let \(c=\)num_clusters, \(\mathbf{s}=\)clu_sep.reshape(-1,1), \(\mathbf{o}=\)clu_offset.reshape(-1,1), \(n=\)clu_sep.size (i.e., number of dimensions). Cluster centers are obtained according to the following equation:

\[ \mathbf{C}=c\mathbf{U} \cdot \operatorname{diag}(\mathbf{s}) + \mathbf{1}\,\mathbf{o}^T \]

where \(\mathbf{C}\) is the \(c \times n\) matrix of cluster centers, \(\mathbf{U}\) is an \(c \times n\) matrix of random values drawn from the uniform distribution between -0.5 and 0.5, and \(\mathbf{1}\) is an \(c \times 1\) vector with all entries equal to 1.

Examples:

>>> from pyclugen import clucenters
>>> from numpy import array
>>> from numpy.random import Generator, PCG64
>>> prng = Generator(PCG64(123))
>>> clucenters(3, array([30,10]), array([-50,50]), rng=prng)
array([[-33.58833231,  36.61463056],
       [-75.16761145,  40.53115432],
       [-79.1684689 ,  59.3628352 ]])

Parameters:

Name	Type	Description	Default
`num_clusters`	`int`	Number of clusters.	required
`clu_sep`	`NDArray`	Average cluster separation ( \(n \times 1\) vector).	required
`clu_offset`	`NDArray`	Cluster offsets ( \(n \times 1\) vector).	required
`rng`	`Generator`	Optional pseudo-random number generator.	`_default_rng`

Returns:

Type	Description
`NDArray`	A \(c \times n\) matrix containing the cluster centers.

Source code in pyclugen/module.py

def clucenters(
    num_clusters: int,
    clu_sep: NDArray,
    clu_offset: NDArray,
    rng: Generator = _default_rng,
) -> NDArray:
    r"""Determine cluster centers using the uniform distribution.

    The number of clusters (`num_clusters`) and the average cluster separation
    (`clu_sep`) are taken into account.

    More specifically, let $c=$`num_clusters`, $\mathbf{s}=$`clu_sep.reshape(-1,1)`,
    $\mathbf{o}=$`clu_offset.reshape(-1,1)`, $n=$`clu_sep.size` (i.e., number of
    dimensions). Cluster centers are obtained according to the following equation:

    $$
    \mathbf{C}=c\mathbf{U} \cdot \operatorname{diag}(\mathbf{s}) +
        \mathbf{1}\,\mathbf{o}^T
    $$

    where $\mathbf{C}$ is the $c \times n$ matrix of cluster centers,
    $\mathbf{U}$ is an $c \times n$ matrix of random values drawn from the
    uniform distribution between -0.5 and 0.5, and $\mathbf{1}$ is an $c \times
    1$ vector with all entries equal to 1.

    Examples:
        >>> from pyclugen import clucenters
        >>> from numpy import array
        >>> from numpy.random import Generator, PCG64
        >>> prng = Generator(PCG64(123))
        >>> clucenters(3, array([30,10]), array([-50,50]), rng=prng)
        array([[-33.58833231,  36.61463056],
               [-75.16761145,  40.53115432],
               [-79.1684689 ,  59.3628352 ]])

    Args:
      num_clusters: Number of clusters.
      clu_sep: Average cluster separation ( $n \times 1$ vector).
      clu_offset: Cluster offsets ( $n \times 1$ vector).
      rng: Optional pseudo-random number generator.

    Returns:
        A $c \times n$ matrix containing the cluster centers.
    """
    # Obtain a num_clusters x num_dims matrix of uniformly distributed values
    # between -0.5 and 0.5 representing the relative cluster centers
    ctr_rel = rng.random((num_clusters, clu_sep.size)) - 0.5

    return num_clusters * (ctr_rel @ np.diag(clu_sep)) + clu_offset

clugen ¶

clugen(
    num_dims: int,
    num_clusters: int,
    num_points: int,
    direction: ArrayLike,
    angle_disp: float,
    cluster_sep: ArrayLike,
    llength: float,
    llength_disp: float,
    lateral_disp: float,
    allow_empty: bool = False,
    cluster_offset: Optional[ArrayLike] = None,
    proj_dist_fn: str | Callable[[float, int, Generator], NDArray] = "norm",
    point_dist_fn: (
        str
        | Callable[
            [NDArray, float, float, NDArray, NDArray, Generator], NDArray
        ]
    ) = "n-1",
    clusizes_fn: (
        Callable[[int, int, bool, Generator], NDArray] | ArrayLike
    ) = clusizes,
    clucenters_fn: (
        Callable[[int, NDArray, NDArray, Generator], NDArray] | ArrayLike
    ) = clucenters,
    llengths_fn: (
        Callable[[int, float, float, Generator], NDArray] | ArrayLike
    ) = llengths,
    angle_deltas_fn: (
        Callable[[int, float, Generator], NDArray] | ArrayLike
    ) = angle_deltas,
    rng: int | integer | Generator = _default_rng,
) -> Clusters

Generate multidimensional clusters.

Tip

This is the main function of the pyclugen package, and possibly the only function most users will need.

Examples:¶

>>> import matplotlib.pyplot as plt
>>> from pyclugen import clugen
>>> from numpy import pi
>>> out = clugen(2, 5, 10000, [1, 0.5], pi/16, [10, 40], 10, 1, 2, rng=321)
>>> out.centers # What are the cluster centers?
array([[ 20.02876212,  36.59611434],
       [-15.60290734, -26.52169579],
       [ 23.09775166,  91.66309916],
       [ -5.76816015,  54.9775074 ],
       [ -4.64224681,  78.40990876]])
>>> plt.scatter(out.points[:,0],
...             out.points[:,1],
...             c=out.clusters) # doctest: +SKIP
>>> plt.show() # doctest: +SKIP

clugen

Note

In the descriptions below, the terms "average" and "dispersion" refer to measures of central tendency and statistical dispersion, respectively. Their exact meaning depends on several optional arguments.

Parameters:

Name	Type	Description	Default
`num_dims`	`int`	Number of dimensions.	required
`num_clusters`	`int`	Number of clusters to generate.	required
`num_points`	`int`	Total number of points to generate.	required
`direction`	`ArrayLike`	Average direction of the cluster-supporting lines. Can be a vector of length `num_dims` (same direction for all clusters) or a matrix of size `num_clusters` x `num_dims` (one direction per cluster).	required
`angle_disp`	`float`	Angle dispersion of cluster-supporting lines (radians).	required
`cluster_sep`	`ArrayLike`	Average cluster separation in each dimension (vector of size `num_dims`).	required
`llength`	`float`	Average length of cluster-supporting lines.	required
`llength_disp`	`float`	Length dispersion of cluster-supporting lines.	required
`lateral_disp`	`float`	Cluster lateral dispersion, i.e., dispersion of points from their projection on the cluster-supporting line.	required
`allow_empty`	`bool`	Allow empty clusters? `False` by default.	`False`
`cluster_offset`	`Optional[ArrayLike]`	Offset to add to all cluster centers (vector of size `num_dims`). By default the offset will be equal to `numpy.zeros(num_dims)`.	`None`
`proj_dist_fn`	`str \| Callable[[float, int, Generator], NDArray]`	Distribution of point projections along cluster-supporting lines, with three possible values: `"norm"` (default): Distribute point projections along lines using a normal distribution (μ=line center, σ=`llength/6`). `"unif"`: Distribute points uniformly along the line. User-defined function, which accepts three parameters, line length (`float`), number of points (`int`), and an instance of `Generator`, and returns an array containing the distance of each point projection to the center of the line. For example, the `"norm"` option roughly corresponds to `lambda l, n, rg: l * rg.random((n, 1)) / 6`.	`'norm'`
`point_dist_fn`	`str \| Callable[[NDArray, float, float, NDArray, NDArray, Generator], NDArray]`	Controls how the final points are created from their projections on the cluster-supporting lines, with three possible values: `"n-1"` (default): Final points are placed on a hyperplane orthogonal to the cluster-supporting line, centered at each point's projection, using the normal distribution (μ=0, σ=`lateral_disp`). This is done by the `clupoints_n_1()` function. `"n"`: Final points are placed around their projection on the cluster-supporting line using the normal distribution (μ=0, σ=`lateral_disp`). This is done by the `clupoints_n()` function. User-defined function: The user can specify a custom point placement strategy by passing a function with the same signature as `clupoints_n_1()` and `clupoints_n()`.	`'n-1'`
`clusizes_fn`	`Callable[[int, int, bool, Generator], NDArray] \| ArrayLike`	Distribution of cluster sizes. By default, cluster sizes are determined by the `clusizes()` function, which uses the normal distribution (μ=`num_points`/`num_clusters`, σ=μ/3), and assures that the final cluster sizes add up to `num_points`. This parameter allows the user to specify a custom function for this purpose, which must follow `clusizes()` signature. Note that custom functions are not required to strictly obey the `num_points` parameter. Alternatively, the user can specify an array of cluster sizes directly.	`clusizes`
`clucenters_fn`	`Callable[[int, NDArray, NDArray, Generator], NDArray] \| ArrayLike`	Distribution of cluster centers. By default, cluster centers are determined by the `clucenters()` function, which uses the uniform distribution, and takes into account the `num_clusters` and `cluster_sep` parameters for generating well-distributed cluster centers. This parameter allows the user to specify a custom function for this purpose, which must follow `clucenters()` signature. Alternatively, the user can specify a matrix of size `num_clusters` x `num_dims` with the exact cluster centers.	`clucenters`
`llengths_fn`	`Callable[[int, float, float, Generator], NDArray] \| ArrayLike`	Distribution of line lengths. By default, the lengths of cluster-supporting lines are determined by the `llengths()` function, which uses the folded normal distribution (μ=`llength`, σ=`llength_disp`). This parameter allows the user to specify a custom function for this purpose, which must follow `llengths()` signature. Alternatively, the user can specify an array of line lengths directly.	`llengths`
`angle_deltas_fn`	`Callable[[int, float, Generator], NDArray] \| ArrayLike`	Distribution of line angle differences with respect to `direction`. By default, the angles between `direction` and the direction of cluster-supporting lines are determined by the `angle_deltas()` function, which uses the wrapped normal distribution (μ=0, σ=`angle_disp`) with support in the interval [-π/2, π/2]. This parameter allows the user to specify a custom function for this purpose, which must follow `angle_deltas()` signature. Alternatively, the user can specify an array of angle deltas directly.	`angle_deltas`
`rng`	`int \| integer \| Generator`	The seed for the random number generator or an instance of `Generator` for reproducible executions.	`_default_rng`

Returns:

Type	Description
`Clusters`	The generated clusters and associated information in the form of a `Clusters` object.

Source code in pyclugen/main.py

def clugen(
    num_dims: int,
    num_clusters: int,
    num_points: int,
    direction: ArrayLike,
    angle_disp: float,
    cluster_sep: ArrayLike,
    llength: float,
    llength_disp: float,
    lateral_disp: float,
    allow_empty: bool = False,
    cluster_offset: Optional[ArrayLike] = None,
    proj_dist_fn: str | Callable[[float, int, Generator], NDArray] = "norm",
    point_dist_fn: (
        str | Callable[[NDArray, float, float, NDArray, NDArray, Generator], NDArray]
    ) = "n-1",
    clusizes_fn: Callable[[int, int, bool, Generator], NDArray] | ArrayLike = clusizes,
    clucenters_fn: (
        Callable[[int, NDArray, NDArray, Generator], NDArray] | ArrayLike
    ) = clucenters,
    llengths_fn: (
        Callable[[int, float, float, Generator], NDArray] | ArrayLike
    ) = llengths,
    angle_deltas_fn: (
        Callable[[int, float, Generator], NDArray] | ArrayLike
    ) = angle_deltas,
    rng: int | np.integer | Generator = _default_rng,
) -> Clusters:
    """Generate multidimensional clusters.

    !!! tip
        This is the main function of the **pyclugen** package, and possibly the
        only function most users will need.

    ## Examples:

        >>> import matplotlib.pyplot as plt
        >>> from pyclugen import clugen
        >>> from numpy import pi
        >>> out = clugen(2, 5, 10000, [1, 0.5], pi/16, [10, 40], 10, 1, 2, rng=321)
        >>> out.centers # What are the cluster centers?
        array([[ 20.02876212,  36.59611434],
               [-15.60290734, -26.52169579],
               [ 23.09775166,  91.66309916],
               [ -5.76816015,  54.9775074 ],
               [ -4.64224681,  78.40990876]])
        >>> plt.scatter(out.points[:,0],
        ...             out.points[:,1],
        ...             c=out.clusters) # doctest: +SKIP
        >>> plt.show() # doctest: +SKIP

    ![clugen](https://user-images.githubusercontent.com/3018963/151056890-c83c9509-b40d-4ab2-a842-f2a4706344c6.png)

    !!! Note
        In the descriptions below, the terms "average" and "dispersion" refer to
        measures of central tendency and statistical dispersion, respectively.
        Their exact meaning depends on several optional arguments.

    Args:
      num_dims: Number of dimensions.
      num_clusters: Number of clusters to generate.
      num_points: Total number of points to generate.
      direction: Average direction of the cluster-supporting lines. Can be a
        vector of length `num_dims` (same direction for all clusters) or a
        matrix of size `num_clusters` x `num_dims` (one direction per cluster).
      angle_disp: Angle dispersion of cluster-supporting lines (radians).
      cluster_sep: Average cluster separation in each dimension (vector of size
        `num_dims`).
      llength: Average length of cluster-supporting lines.
      llength_disp: Length dispersion of cluster-supporting lines.
      lateral_disp: Cluster lateral dispersion, i.e., dispersion of points from their
        projection on the cluster-supporting line.
      allow_empty: Allow empty clusters? `False` by default.
      cluster_offset: Offset to add to all cluster centers (vector of size `num_dims`).
        By default the offset will be equal to `numpy.zeros(num_dims)`.
      proj_dist_fn: Distribution of point projections along cluster-supporting lines,
        with three possible values:

        - `"norm"` (default): Distribute point projections along lines using a normal
          distribution (μ=_line center_, σ=`llength/6`).
        - `"unif"`: Distribute points uniformly along the line.
        - User-defined function, which accepts three parameters, line length (`float`),
          number of points (`int`), and an instance of
          [`Generator`](https://numpy.org/doc/stable/reference/random/generator.html?highlight=generator#numpy.random.Generator),
          and returns an array containing the distance of each point projection to
          the center of the line. For example, the `"norm"` option roughly corresponds
          to `lambda l, n, rg: l * rg.random((n, 1)) / 6`.

      point_dist_fn: Controls how the final points are created from their projections
        on the cluster-supporting lines, with three possible values:

        - `"n-1"` (default): Final points are placed on a hyperplane orthogonal to
          the cluster-supporting line, centered at each point's projection, using the
          normal distribution (μ=0, σ=`lateral_disp`). This is done by the
          [`clupoints_n_1()`][pyclugen.module.clupoints_n_1] function.
        - `"n"`: Final points are placed around their projection on the
          cluster-supporting line using the normal distribution (μ=0,
          σ=`lateral_disp`). This is done by the
          [`clupoints_n()`][pyclugen.module.clupoints_n] function.
        - User-defined function: The user can specify a custom point placement
          strategy by passing a function with the same signature as
          [`clupoints_n_1()`][pyclugen.module.clupoints_n_1] and
          [`clupoints_n()`][pyclugen.module.clupoints_n].

      clusizes_fn: Distribution of cluster sizes. By default, cluster sizes are
        determined by the [`clusizes()`][pyclugen.module.clusizes] function, which
        uses the normal distribution (μ=`num_points`/`num_clusters`, σ=μ/3), and
        assures that the final cluster sizes add up to `num_points`. This parameter
        allows the user to specify a custom function for this purpose, which must
        follow [`clusizes()`][pyclugen.module.clusizes] signature. Note that custom
        functions are not required to strictly obey the `num_points` parameter.
        Alternatively, the user can specify an array of cluster sizes directly.
      clucenters_fn: Distribution of cluster centers. By default, cluster centers
        are determined by the [`clucenters()`][pyclugen.module.clucenters] function,
        which uses the uniform distribution, and takes into account the `num_clusters`
        and `cluster_sep` parameters for generating well-distributed cluster centers.
        This parameter allows the user to specify a custom function for this purpose,
        which must follow [`clucenters()`][pyclugen.module.clucenters] signature.
        Alternatively, the user can specify a matrix of size `num_clusters` x
        `num_dims` with the exact cluster centers.
      llengths_fn: Distribution of line lengths. By default, the lengths of
        cluster-supporting lines are determined by the
        [`llengths()`][pyclugen.module.llengths] function, which uses the folded
        normal distribution (μ=`llength`, σ=`llength_disp`). This parameter allows
        the user to specify a custom function for this purpose, which must follow
        [`llengths()`][pyclugen.module.llengths] signature. Alternatively, the user
        can specify an array of line lengths directly.
      angle_deltas_fn: Distribution of line angle differences with respect to
        `direction`. By default, the angles between `direction` and the direction of
        cluster-supporting lines are determined by the
        [`angle_deltas()`][pyclugen.module.angle_deltas] function, which uses the
        wrapped normal distribution (μ=0, σ=`angle_disp`) with support in the interval
        [-π/2, π/2]. This parameter allows the user to specify a custom function for
        this purpose, which must follow [`angle_deltas()`][pyclugen.module.angle_deltas]
        signature. Alternatively, the user can specify an array of angle deltas
        directly.
      rng: The seed for the random number generator or an instance of
        [`Generator`][numpy.random.Generator] for reproducible executions.

    Returns:
      The generated clusters and associated information in the form of a
        [`Clusters`][pyclugen.main.Clusters] object.
    """
    # ############### #
    # Validate inputs #
    # ############### #

    # Check that number of dimensions is > 0
    if num_dims < 1:
        raise ValueError("Number of dimensions, `num_dims`, must be > 0")

    # Check that number of clusters is > 0
    if num_clusters < 1:
        raise ValueError("Number of clusters, `num_clust`, must be > 0")

    # Convert given direction into a NumPy array
    arrdir: NDArray = np.asarray(direction)

    # Get number of dimensions in `direction` array
    dir_ndims = arrdir.ndim

    # Is direction a vector or a matrix?
    if dir_ndims == 1:
        # It's a vector, let's convert it into a row matrix, since this will be
        # useful down the road
        arrdir = arrdir.reshape((1, -1))
    elif dir_ndims == 2:
        # If a matrix was given (i.e. a main direction is given for each cluster),
        # check if the number of directions is the same as the number of clusters
        dir_size_1 = arrdir.shape[0]
        if dir_size_1 != num_clusters:
            raise ValueError(
                "Number of rows in `direction` must be the same as the "
                + f"number of clusters ({dir_size_1} != {num_clusters})"
            )
    else:
        # The `directions` array must be a vector or a matrix, so if we get here
        # it means we have invalid arguments
        raise ValueError(
            "`direction` must be a vector (1D array) or a matrix (2D array), "
            + f"but is {dir_ndims}D"
        )

    # Check that direction has num_dims dimensions
    dir_size_2 = arrdir.shape[1]
    if dir_size_2 != num_dims:
        raise ValueError(
            "Length of directions in `direction` must be equal to "
            + f"`num_dims` ({dir_size_2} != {num_dims})"
        )

    # Check that directions have magnitude > 0
    dir_magnitudes = np.apply_along_axis(norm, 1, arrdir)
    if np.any(np.isclose(dir_magnitudes, 0)):
        raise ValueError("Directions in `direction` must have magnitude > 0")

    # If allow_empty is false, make sure there are enough points to distribute
    # by the clusters
    if (not allow_empty) and num_points < num_clusters:
        raise ValueError(
            f"A total of {num_points} points is not enough for "
            + f"{num_clusters} non-empty clusters"
        )

    # Check that cluster_sep has num_dims dimensions
    cluster_sep = np.asarray(cluster_sep)
    if cluster_sep.size != num_dims:
        raise ValueError(
            "Length of `cluster_sep` must be equal to `num_dims` "
            + f"({cluster_sep.size} != {num_dims})"
        )

    # If given, cluster_offset must have the correct number of dimensions,
    # if not given then it will be a num_dims x 1 vector of zeros
    if cluster_offset is None:
        cluster_offset = np.zeros(num_dims)
    else:
        cluster_offset = np.asarray(cluster_offset)
        if cluster_offset.size != num_dims:
            raise ValueError(
                "Length of `cluster_offset` must be equal to `num_dims` "
                + f"({cluster_offset.size} != {num_dims})"
            )

    # If the user specified rng as an int, create a proper rng object
    rng_sel: Generator
    if isinstance(rng, Generator):
        rng_sel = cast(Generator, rng)
    elif isinstance(rng, (int, np.integer)):
        rng_sel = Generator(PCG64(cast(int, rng)))
    else:
        raise ValueError(
            f"`rng` must be an instance of int or Generator, but is {type(rng)}"
        )

    # Check that proj_dist_fn specifies a valid way for projecting points along
    # cluster-supporting lines i.e., either "norm" (default), "unif" or a
    # user-defined function
    pointproj_fn: Callable[[float, int, Generator], NDArray]

    if callable(proj_dist_fn):
        # Use user-defined distribution; assume function accepts length of line
        # and number of points, and returns a number of points x 1 vector
        pointproj_fn = proj_dist_fn

    elif proj_dist_fn == "unif":
        # Point projections will be uniformly placed along cluster-supporting lines
        def pointproj_fn(length, n, rg):
            return length * rg.random(n) - length / 2

    elif proj_dist_fn == "norm":
        # Use normal distribution for placing point projections along cluster-supporting
        # lines, mean equal to line center, standard deviation equal to 1/6 of line
        # length such that the line length contains ≈99.73% of the points
        def pointproj_fn(length, n, rg):
            return (1.0 / 6.0) * length * rg.normal(size=n)

    else:
        raise ValueError(
            "`proj_dist_fn` has to be either 'norm', 'unif' or user-defined function"
        )

    # Check that point_dist_fn specifies a valid way for generating points given
    # their projections along cluster-supporting lines, i.e., either "n-1"
    # (default), "n" or a user-defined function
    pt_from_proj_fn: Callable[
        [NDArray, float, float, NDArray, NDArray, Generator], NDArray
    ]

    if num_dims == 1:
        # If 1D was specified, point projections are the points themselves
        def pt_from_proj_fn(projs, lat_disp, length, clu_dir, clu_ctr, rng=rng_sel):
            return projs

    elif callable(point_dist_fn):
        # Use user-defined distribution; assume function accepts point projections
        # on the line, lateral disp., cluster direction and cluster center, and
        # returns a num_points x num_dims matrix containing the final points
        # for the current cluster
        pt_from_proj_fn = point_dist_fn

    elif point_dist_fn == "n-1":
        # Points will be placed on a hyperplane orthogonal to the cluster-supporting
        # line using a normal distribution centered at their intersection
        pt_from_proj_fn = clupoints_n_1

    elif point_dist_fn == "n":
        # Points will be placed using a multivariate normal distribution
        # centered at the point projection
        pt_from_proj_fn = clupoints_n

    else:
        raise ValueError(
            "point_dist_fn has to be either 'n-1', 'n' or a user-defined function"
        )

    # ############################ #
    # Determine cluster properties #
    # ############################ #

    # Normalize main direction(s)
    arrdir = np.apply_along_axis(lambda a: a / norm(a), 1, arrdir)

    # If only one main direction was given, expand it for all clusters
    if dir_ndims == 1:
        arrdir = np.repeat(arrdir, num_clusters, axis=0)

    # Determine cluster sizes
    if callable(clusizes_fn):
        cluster_sizes = clusizes_fn(num_clusters, num_points, allow_empty, rng_sel)
    elif len(np.asarray(clusizes_fn)) == num_clusters:
        cluster_sizes = np.asarray(clusizes_fn)
    else:
        raise ValueError(
            "clusizes_fn has to be either a function or a `num_clusters`-sized array"
        )

    # Custom clusizes_fn's are not required to obey num_points, so we update
    # it here just in case it's different from what the user specified
    num_points = np.sum(cluster_sizes)

    # Determine cluster centers
    if callable(clucenters_fn):
        cluster_centers = clucenters_fn(
            num_clusters, cluster_sep, cluster_offset, rng_sel
        )
    elif np.asarray(clucenters_fn).shape == (num_clusters, num_dims):
        cluster_centers = np.asarray(clucenters_fn)
    else:
        raise ValueError(
            "clucenters_fn has to be either a function or a matrix of size "
            + "`num_clusters` x `num_dims`"
        )

    # Determine length of lines supporting clusters
    if callable(llengths_fn):
        cluster_lengths = llengths_fn(num_clusters, llength, llength_disp, rng_sel)
    elif len(np.asarray(llengths_fn)) == num_clusters:
        cluster_lengths = np.asarray(llengths_fn)
    else:
        raise ValueError(
            "llengths_fn has to be either a function or a `num_clusters`-sized array"
        )

    # Obtain angles between main direction and cluster-supporting lines
    if callable(angle_deltas_fn):
        cluster_angles = angle_deltas_fn(num_clusters, angle_disp, rng_sel)
    elif len(np.asarray(angle_deltas_fn)) == num_clusters:
        cluster_angles = np.asarray(angle_deltas_fn)
    else:
        raise ValueError(
            "angle_deltas_fn has to be either a function or a "
            + "`num_clusters`-sized array"
        )

    # Determine normalized cluster directions by applying the obtained angles
    cluster_directions = np.apply_along_axis(
        lambda v, a: rand_vector_at_angle(v, next(a), rng_sel),
        1,
        arrdir,
        iter(cluster_angles),
    )

    # ################################# #
    # Determine points for each cluster #
    # ################################# #

    # Aux. vector with cumulative sum of number of points in each cluster
    cumsum_points = np.concatenate((np.asarray([0]), np.cumsum(cluster_sizes)))

    # Pre-allocate data structures for holding cluster info and points
    point_clusters: NDArray = np.empty(
        num_points, dtype=np.int32
    )  # Cluster indices of each point
    point_projections = np.empty((num_points, num_dims))  # Point projections on
    #                                                  # cluster-supporting lines
    points = np.empty((num_points, num_dims))  # Final points to be generated

    # Loop through clusters and create points for each one
    for i in range(num_clusters):
        # Start and end indexes for points in current cluster
        idx_start = cumsum_points[i]
        idx_end = cumsum_points[i + 1]

        # Update cluster indices of each point
        point_clusters[idx_start:idx_end] = i

        # Determine distance of point projections from the center of the line
        ptproj_dist_fn_center = pointproj_fn(
            cluster_lengths[i], cluster_sizes[i], rng_sel
        )

        # Determine coordinates of point projections on the line using the
        # parametric line equation (this works since cluster direction is normalized)
        point_projections[idx_start:idx_end, :] = points_on_line(
            cluster_centers[i, :], cluster_directions[i, :], ptproj_dist_fn_center
        )

        # Determine points from their projections on the line
        points[idx_start:idx_end, :] = pt_from_proj_fn(
            point_projections[idx_start:idx_end, :],
            lateral_disp,
            cluster_lengths[i],
            cluster_directions[i, :],
            cluster_centers[i, :],
            rng_sel,
        )

    return Clusters(
        points,
        point_clusters,
        point_projections,
        cluster_sizes,
        cluster_centers,
        cluster_directions,
        cluster_angles,
        cluster_lengths,
    )

clumerge ¶

clumerge(
    *data: NamedTuple | Mapping[str, ArrayLike],
    fields: tuple[str, ...] = ("points", "clusters"),
    clusters_field: str | None = "clusters"
) -> dict[str, NDArray]

Merges the fields (specified in fields) of two or more data sets.

Merges the fields (specified in fields) of two or more data sets (named tuples or dictionaries). The fields to be merged need to have the same number of columns. The corresponding merged field will contain the rows of the fields to be merged, and will have a common supertype.

The clusters_field parameter specifies a field containing integers that identify the cluster to which the respective points belongs to. If clusters_field is specified (by default it's specified as "clusters"), cluster assignments in individual datasets will be updated in the merged dataset so that clusters are considered separate. This parameter can be set to None, in which case no field will be considered as a special cluster assignments field.

This function can be used to merge data sets generated with the clugen() function, by default merging the points and clusters fields in those data sets. It also works with arbitrary data by specifying alternative fields in the fields parameter. It can be used, for example, to merge third-party data with clugen()-generated data.

Examples:

>>> from pyclugen import clugen, clumerge
>>> data1 = clugen(2, 5, 1000, [1, 1], 0.01, [20, 20], 14, 1.2, 1.5);
>>> data2 = clugen(2, 3, 450, [0.8, -0.3], 0, [25, 21], 6, 0.4, 3.5);
>>> data3 = clugen(2, 2, 600, [0, -0.7], 0.2, [15, 10], 1, 0.1, 5.2);
>>> data_merged = clumerge(data1, data2, data3)

Parameters:

Name	Type	Description	Default
`*data`	`NamedTuple \| Mapping[str, ArrayLike]`	One or more cluster data sets whose `fields` are to be merged.	`()`
`fields`	`tuple[str, ...]`	Fields to be merged, which must exist in the data set given in `*data`.	`('points', 'clusters')`
`clusters_field`	`str \| None`	Field containing the integer cluster labels. If specified, cluster assignments in individual datasets will be updated in the merged dataset so that clusters are considered separate.	`'clusters'`

Returns:

Type	Description
`dict[str, NDArray]`	A dictionary, where keys correspond to field names, and values to the merged numerical arrays.

Source code in pyclugen/main.py

def clumerge(
    *data: NamedTuple | Mapping[str, ArrayLike],
    fields: tuple[str, ...] = ("points", "clusters"),
    clusters_field: str | None = "clusters",
) -> dict[str, NDArray]:
    r"""Merges the fields (specified in `fields`) of two or more `data` sets.

    Merges the fields (specified in `fields`) of two or more `data` sets (named
    tuples or dictionaries). The fields to be merged need to have the same
    number of columns. The corresponding merged field will contain the rows of
    the fields to be merged, and will have a common supertype.

    The `clusters_field` parameter specifies a field containing integers that
    identify the cluster to which the respective points belongs to. If
    `clusters_field` is specified (by default it's specified as `"clusters"`),
    cluster assignments in individual datasets will be updated in the merged
    dataset so that clusters are considered separate. This parameter can be set
    to `None`, in which case no field will be considered as a special cluster
    assignments field.

    This function can be used to merge data sets generated with the
    [`clugen()`][pyclugen.main.clugen] function, by default merging the
    `points` and `clusters` fields in those data sets. It also works with
    arbitrary data by specifying alternative fields in the `fields` parameter.
    It can be used, for example, to merge third-party data with
    [`clugen()`][pyclugen.main.clugen]-generated data.

    Examples:
        >>> from pyclugen import clugen, clumerge
        >>> data1 = clugen(2, 5, 1000, [1, 1], 0.01, [20, 20], 14, 1.2, 1.5);
        >>> data2 = clugen(2, 3, 450, [0.8, -0.3], 0, [25, 21], 6, 0.4, 3.5);
        >>> data3 = clugen(2, 2, 600, [0, -0.7], 0.2, [15, 10], 1, 0.1, 5.2);
        >>> data_merged = clumerge(data1, data2, data3)

    Args:
      *data: One or more cluster data sets whose `fields` are to be merged.
      fields: Fields to be merged, which must exist in the data set given in
        `*data`.
      clusters_field: Field containing the integer cluster labels. If specified,
        cluster assignments in individual datasets will be updated in the merged
        dataset so that clusters are considered separate.

    Returns:
      A dictionary, where keys correspond to field names, and values to the
        merged numerical arrays.
    """
    # Number of elements in each array the merged dataset
    numel: int = 0

    # Number of columns of values in each field
    fields_info: dict[str, _FieldInfo] = {}

    # Merged dataset to output, initially empty
    output: dict[str, NDArray] = {}

    # Create a fields set
    fields_set: MutableSet[str] = set(fields)

    # If a clusters field is given, add it
    if clusters_field is not None:
        fields_set.add(str(clusters_field))

    # Data in dictionary format with NDArray views on data
    ddata: MutableSequence[Mapping[str, NDArray]] = []
    for dt in data:
        # If dt is a named tuple, convert it into a dictionary
        ddt: Mapping[str, ArrayLike]
        if isinstance(dt, dict):
            ddt = cast(dict, dt)
        else:
            ntdt = cast(NamedTuple, dt)
            ddt = ntdt._asdict()

        # Convert dictionary values to NDArrays
        ddtnp: Mapping[str, NDArray] = {k: np.asarray(v) for k, v in ddt.items()}

        # Add converted dictionary to our sequence of dictionaries
        ddata.append(ddtnp)

    # Cycle through data items
    for dt in ddata:
        # Number of elements in the current item
        numel_i: int = -1

        # Cycle through fields for the current item
        for field in fields_set:
            if field not in dt:
                raise ValueError(f"Data item does not contain required field `{field}`")
            elif field == clusters_field and not np.can_cast(
                dt[clusters_field].dtype, np.int64
            ):
                raise ValueError(f"`{clusters_field}` must contain integer types")

            # Get the field value
            value: NDArray = dt[field]

            # Number of elements in field value
            numel_tmp = len(value)

            # Check the number of elements in the field value
            if numel_i == -1:
                # First field: get number of elements in value (must be the same
                # for the remaining field values)
                numel_i = numel_tmp

            elif numel_tmp != numel_i:
                # Fields values after the first must have the same number of
                # elements
                raise ValueError(
                    "Data item contains fields with different sizes "
                    + f"({numel_tmp} != {numel_i})"
                )

            # Get/check info about the field value type
            if field not in fields_info:
                # If it's the first time this field appears, just get the info
                fields_info[field] = _FieldInfo(value.dtype, _getcols(value))

            else:
                # If this field already appeared in previous data items, get the
                # info and check/determine its compatibility with respect to
                # previous data items
                if _getcols(value) != fields_info[field].ncol:
                    # Number of columns must be the same
                    raise ValueError(f"Dimension mismatch in field `{field}`")

                # Get the common supertype
                fields_info[field].dtype = np.promote_types(
                    fields_info[field].dtype, value.dtype
                )

        # Update total number of elements
        numel += numel_i

    # Initialize output dictionary fields with room for all items
    for field in fields_info:
        if fields_info[field].ncol == 1:
            output[field] = np.empty((numel,), dtype=fields_info[field].dtype)
        else:
            output[field] = np.empty(
                (numel, fields_info[field].ncol), dtype=fields_info[field].dtype
            )

    # Copy items from input data to output dictionary, field-wise
    copied: int = 0
    last_cluster: int = 0

    # Create merged output
    for dt in ddata:
        # How many elements to copy for the current data item?
        tocopy: int = len(dt[fields[0]])

        # Cycle through each field and its information
        for field in fields_info:
            # Copy elements
            if field == clusters_field:
                # If this is a clusters field, update the cluster IDs
                old_clusters = np.unique(dt[clusters_field])
                new_clusters = list(
                    range(last_cluster + 1, last_cluster + len(old_clusters) + 1)
                )
                old2new = zip(old_clusters, new_clusters)
                mapping = dict(old2new)
                last_cluster = new_clusters[-1]

                output[field][copied : (copied + tocopy)] = [
                    mapping[val] for val in dt[clusters_field]
                ]

            else:
                # Otherwise just copy the elements
                ncol: int = fields_info[field].ncol
                output[field].flat[copied * ncol : (copied + tocopy) * ncol] = dt[field]

        # Update how many were copied so far
        copied += tocopy

    # Return result
    return output

clupoints_n ¶

clupoints_n(
    projs: NDArray,
    lat_disp: float,
    line_len: float,
    clu_dir: NDArray,
    clu_ctr: NDArray,
    rng: Generator = _default_rng,
) -> NDArray

Generate points from their \(n\)-D projections on a cluster-supporting line.

Each point is placed around its projection using the normal distribution ( \(\mu=0\), \(σ=\)lat_disp).

This function's main intended use is by the clugen() function, generating the final points when the point_dist_fn parameter is set to "n".

Examples:

>>> from pyclugen import clupoints_n, points_on_line
>>> from numpy import array, linspace
>>> from numpy.random import Generator, PCG64
>>> prng = Generator(PCG64(123))
>>> projs = points_on_line(array([5,5]),     # Get 5 point projections
...                        array([1,0]),     # on a 2D line
...                        linspace(-4,4,5))
>>> projs
array([[1., 5.],
       [3., 5.],
       [5., 5.],
       [7., 5.],
       [9., 5.]])
>>> clupoints_n(projs, 0.5, 1.0, array([1,0]), array([0,0]), rng=prng)
array([[0.50543932, 4.81610667],
       [3.64396263, 5.09698721],
       [5.46011545, 5.2885519 ],
       [6.68176818, 5.27097611],
       [8.84170227, 4.83880544]])

Parameters:

Name	Type	Description	Default
`projs`	`NDArray`	Point projections on the cluster-supporting line ( \(p \times n\) matrix).	required
`lat_disp`	`float`	Standard deviation for the normal distribution, i.e., cluster lateral dispersion.	required
`line_len`	`float`	Length of cluster-supporting line (ignored).	required
`clu_dir`	`NDArray`	Direction of the cluster-supporting line.	required
`clu_ctr`	`NDArray`	Center position of the cluster-supporting line (ignored).	required
`rng`	`Generator`	Optional pseudo-random number generator.	`_default_rng`

Returns:

Type	Description
`NDArray`	Generated points ( \(p \times n\) matrix).

Source code in pyclugen/module.py

def clupoints_n(
    projs: NDArray,
    lat_disp: float,
    line_len: float,
    clu_dir: NDArray,
    clu_ctr: NDArray,
    rng: Generator = _default_rng,
) -> NDArray:
    r"""Generate points from their $n$-D projections on a cluster-supporting line.

    Each point is placed around its projection using the normal distribution
    ( $\mu=0$, $σ=$`lat_disp`).

    This function's main intended use is by the [`clugen()`][pyclugen.main.clugen]
    function, generating the final points when the `point_dist_fn` parameter is
    set to `"n"`.

    Examples:
        >>> from pyclugen import clupoints_n, points_on_line
        >>> from numpy import array, linspace
        >>> from numpy.random import Generator, PCG64
        >>> prng = Generator(PCG64(123))
        >>> projs = points_on_line(array([5,5]),     # Get 5 point projections
        ...                        array([1,0]),     # on a 2D line
        ...                        linspace(-4,4,5))
        >>> projs
        array([[1., 5.],
               [3., 5.],
               [5., 5.],
               [7., 5.],
               [9., 5.]])
        >>> clupoints_n(projs, 0.5, 1.0, array([1,0]), array([0,0]), rng=prng)
        array([[0.50543932, 4.81610667],
               [3.64396263, 5.09698721],
               [5.46011545, 5.2885519 ],
               [6.68176818, 5.27097611],
               [8.84170227, 4.83880544]])

    Args:
      projs: Point projections on the cluster-supporting line ( $p \times n$ matrix).
      lat_disp: Standard deviation for the normal distribution, i.e., cluster
        lateral dispersion.
      line_len: Length of cluster-supporting line (ignored).
      clu_dir: Direction of the cluster-supporting line.
      clu_ctr: Center position of the cluster-supporting line (ignored).
      rng: Optional pseudo-random number generator.

    Returns:
      Generated points ( $p \times n$ matrix).
    """
    # Number of dimensions
    num_dims = clu_dir.size

    # Number of points in this cluster
    clu_num_points = projs.shape[0]

    # Get random displacement vectors for each point projection
    displ = lat_disp * rng.normal(size=(clu_num_points, num_dims))

    # Add displacement vectors to each point projection
    points = projs + displ

    return points

clupoints_n_1 ¶

clupoints_n_1(
    projs: NDArray,
    lat_disp: float,
    line_len: float,
    clu_dir: NDArray,
    clu_ctr: NDArray,
    rng: Generator = _default_rng,
) -> NDArray

Generate points from their \(n\)-D projections on a cluster-supporting line.

Each point is placed on a hyperplane orthogonal to that line and centered at the point's projection, using the normal distribution ( \(\mu=0\), \(σ=\)lat_disp).

This function's main intended use is by the clugen() function, generating the final points when the point_dist_fn parameter is set to "n-1".

Examples:

>>> from pyclugen import clupoints_n_1, points_on_line
>>> from numpy import array, linspace
>>> from numpy.random import Generator, PCG64
>>> prng = Generator(PCG64(123))
>>> projs = points_on_line(array([5,5]),     # Get 5 point projections
...                        array([1,0]),     # on a 2D line
...                        linspace(-4,4,5))
>>> projs
array([[1., 5.],
       [3., 5.],
       [5., 5.],
       [7., 5.],
       [9., 5.]])
>>> clupoints_n_1(projs, 0.5, 1.0, array([1,0]), array([0,0]), rng=prng)
array([[1.        , 5.49456068],
       [3.        , 5.18389333],
       [5.        , 5.64396263],
       [7.        , 5.09698721],
       [9.        , 5.46011545]])

Parameters:

Name	Type	Description	Default
`projs`	`NDArray`	Point projections on the cluster-supporting line ( \(p \times n\) matrix).	required
`lat_disp`	`float`	Standard deviation for the normal distribution, i.e., cluster lateral dispersion.	required
`line_len`	`float`	Length of cluster-supporting line (ignored).	required
`clu_dir`	`NDArray`	Direction of the cluster-supporting line.	required
`clu_ctr`	`NDArray`	Center position of the cluster-supporting line (ignored).	required
`rng`	`Generator`	Optional pseudo-random number generator.	`_default_rng`

Returns:

Type	Description
`NDArray`	Generated points ( \(p \times n\) matrix).

Source code in pyclugen/module.py

def clupoints_n_1(
    projs: NDArray,
    lat_disp: float,
    line_len: float,
    clu_dir: NDArray,
    clu_ctr: NDArray,
    rng: Generator = _default_rng,
) -> NDArray:
    r"""Generate points from their $n$-D projections on a cluster-supporting line.

    Each point is placed on a hyperplane orthogonal to that line and centered at
    the point's projection, using the normal distribution ( $\mu=0$,
    $σ=$`lat_disp`).

    This function's main intended use is by the [`clugen()`][pyclugen.main.clugen]
    function, generating the final points when the `point_dist_fn` parameter is
    set to `"n-1"`.

    Examples:
        >>> from pyclugen import clupoints_n_1, points_on_line
        >>> from numpy import array, linspace
        >>> from numpy.random import Generator, PCG64
        >>> prng = Generator(PCG64(123))
        >>> projs = points_on_line(array([5,5]),     # Get 5 point projections
        ...                        array([1,0]),     # on a 2D line
        ...                        linspace(-4,4,5))
        >>> projs
        array([[1., 5.],
               [3., 5.],
               [5., 5.],
               [7., 5.],
               [9., 5.]])
        >>> clupoints_n_1(projs, 0.5, 1.0, array([1,0]), array([0,0]), rng=prng)
        array([[1.        , 5.49456068],
               [3.        , 5.18389333],
               [5.        , 5.64396263],
               [7.        , 5.09698721],
               [9.        , 5.46011545]])

    Args:
      projs: Point projections on the cluster-supporting line ( $p \times n$ matrix).
      lat_disp: Standard deviation for the normal distribution, i.e., cluster
        lateral dispersion.
      line_len: Length of cluster-supporting line (ignored).
      clu_dir: Direction of the cluster-supporting line.
      clu_ctr: Center position of the cluster-supporting line (ignored).
      rng: Optional pseudo-random number generator.

    Returns:
      Generated points ( $p \times n$ matrix).
    """
    # No blank line allowed here

    # Define function to get distances from points to their projections on the
    # line (i.e., using the normal distribution)
    def dist_fn(clu_num_points, ldisp, rg):
        return ldisp * rg.normal(size=clu_num_points)

    # Use clupoints_n_1_template() to do the heavy lifting
    return clupoints_n_1_template(projs, lat_disp, clu_dir, dist_fn, rng=rng)

clupoints_n_1_template ¶

clupoints_n_1_template(
    projs: NDArray,
    lat_disp: float,
    clu_dir: NDArray,
    dist_fn: Callable[[int, float, Generator], NDArray],
    rng: Generator = _default_rng,
) -> NDArray

Create \(p\) points from their \(n\)-D projections on a cluster-supporting line.

Each point is placed on a hyperplane orthogonal to that line and centered at the point's projection. The function specified in dist_fn is used to perform the actual placement.

This function is used internally by clupoints_n_1() and may be useful for constructing user-defined final point placement strategies for the point_dist_fn parameter of the main clugen() function.

Examples:

>>> from numpy import array, zeros
>>> from numpy.random import Generator, PCG64
>>> from pyclugen import clupoints_n_1_template, points_on_line
>>> ctr = zeros(2)
>>> dir = array([1, 0])
>>> pdist = array([-0.5, -0.2, 0.1, 0.3])
>>> rng = Generator(PCG64(123))
>>> proj = points_on_line(ctr, dir, pdist)
>>> clupoints_n_1_template(proj, 0, dir, lambda p, l, r: r.random(p), rng=rng)
array([[-0.5       ,  0.68235186],
       [-0.2       , -0.05382102],
       [ 0.1       ,  0.22035987],
       [ 0.3       , -0.18437181]])

Parameters:

Name	Type	Description	Default
`projs`	`NDArray`	Point projections on the cluster-supporting line ( \(p \times n\) matrix).	required
`lat_disp`	`float`	Dispersion of points from their projection.	required
`clu_dir`	`NDArray`	Direction of the cluster-supporting line (unit vector).	required
`dist_fn`	`Callable[[int, float, Generator], NDArray]`	Function to place points on a second line, orthogonal to the first. The functions accepts as parameters the number of points in the current cluster, the `lateral_disp` parameter (the same passed to the `clugen()` function), and a random number generator, returning a vector containing the distance of each point to its projection on the cluster-supporting line.	required
`rng`	`Generator`	An optional pseudo-random number generator for reproducible executions.	`_default_rng`

Returns:

Type	Description
`NDArray`	Generated points ( \(p \times n\) matrix).

Source code in pyclugen/helper.py

def clupoints_n_1_template(
    projs: NDArray,
    lat_disp: float,
    clu_dir: NDArray,
    dist_fn: Callable[[int, float, Generator], NDArray],
    rng: Generator = _default_rng,
) -> NDArray:
    r"""Create $p$ points from their $n$-D projections on a cluster-supporting line.

    Each point is placed on a hyperplane orthogonal to that line and centered at
    the point's projection. The function specified in `dist_fn` is used to perform
    the actual placement.

    This function is used internally by
    [`clupoints_n_1()`][pyclugen.module.clupoints_n_1] and may be useful for
    constructing user-defined final point placement strategies for the `point_dist_fn`
    parameter of the main [`clugen()`][pyclugen.main.clugen] function.

    Examples:
        >>> from numpy import array, zeros
        >>> from numpy.random import Generator, PCG64
        >>> from pyclugen import clupoints_n_1_template, points_on_line
        >>> ctr = zeros(2)
        >>> dir = array([1, 0])
        >>> pdist = array([-0.5, -0.2, 0.1, 0.3])
        >>> rng = Generator(PCG64(123))
        >>> proj = points_on_line(ctr, dir, pdist)
        >>> clupoints_n_1_template(proj, 0, dir, lambda p, l, r: r.random(p), rng=rng)
        array([[-0.5       ,  0.68235186],
               [-0.2       , -0.05382102],
               [ 0.1       ,  0.22035987],
               [ 0.3       , -0.18437181]])

    Args:
      projs: Point projections on the cluster-supporting line ( $p \times n$ matrix).
      lat_disp: Dispersion of points from their projection.
      clu_dir: Direction of the cluster-supporting line (unit vector).
      dist_fn: Function to place points on a second line, orthogonal to the first.
        The functions accepts as parameters the number of points in the current
        cluster, the `lateral_disp` parameter (the same passed to the
        [`clugen()`][pyclugen.main.clugen] function), and a random number generator,
        returning a vector containing the distance of each point to its projection
        on the cluster-supporting line.
      rng: An optional pseudo-random number generator for reproducible executions.

    Returns:
      Generated points ( $p \times n$ matrix).
    """
    # Number of dimensions
    num_dims = clu_dir.size

    # Number of points in this cluster
    clu_num_points = projs.shape[0]

    # Get distances from points to their projections on the line
    points_dist = dist_fn(clu_num_points, lat_disp, rng)

    # Get normalized vectors, orthogonal to the current line, for each point
    orth_vecs = np.zeros((clu_num_points, num_dims))

    for j in range(clu_num_points):
        orth_vecs[j, :] = rand_ortho_vector(clu_dir, rng=rng).ravel()

    # Set vector magnitudes
    orth_vecs = np.abs(points_dist).reshape(-1, 1) * orth_vecs

    # Add perpendicular vectors to point projections on the line,
    # yielding final cluster points
    points = projs + orth_vecs

    return points

clusizes ¶

clusizes(
    num_clusters: int,
    num_points: int,
    allow_empty: bool,
    rng: Generator = _default_rng,
) -> NDArray

Determine cluster sizes, i.e., the number of points in each cluster.

Cluster sizes are determined using the normal distribution ( \(\mu=\)num_points \(/\)num_clusters, \(\sigma=\mu/3\)), and then assuring that the final cluster sizes add up to num_points via the fix_num_points() function.

Examples:

>>> from numpy.random import Generator, PCG64
>>> from pyclugen import clusizes
>>> prng = Generator(PCG64(123))
>>> sizes = clusizes(4, 1000, True, rng=prng)
>>> sizes
array([166, 217, 354, 263])
>>> int(sum(sizes))
1000

Parameters:

Name	Type	Description	Default
`num_clusters`	`int`	Number of clusters.	required
`num_points`	`int`	Total number of points.	required
`allow_empty`	`bool`	Allow empty clusters?	required
`rng`	`Generator`	Optional pseudo-random number generator.	`_default_rng`

Returns:

Type	Description
`NDArray`	Number of points in each cluster (vector of size `num_clusters`).

Source code in pyclugen/module.py

def clusizes(
    num_clusters: int,
    num_points: int,
    allow_empty: bool,
    rng: Generator = _default_rng,
) -> NDArray:
    r"""Determine cluster sizes, i.e., the number of points in each cluster.

    Cluster sizes are determined using the normal distribution (
    $\mu=$`num_points` $/$`num_clusters`, $\sigma=\mu/3$), and then
    assuring that the final cluster sizes add up to `num_points` via the
    [`fix_num_points()`][pyclugen.helper.fix_num_points] function.

    Examples:
        >>> from numpy.random import Generator, PCG64
        >>> from pyclugen import clusizes
        >>> prng = Generator(PCG64(123))
        >>> sizes = clusizes(4, 1000, True, rng=prng)
        >>> sizes
        array([166, 217, 354, 263])
        >>> int(sum(sizes))
        1000

    Args:
      num_clusters: Number of clusters.
      num_points: Total number of points.
      allow_empty: Allow empty clusters?
      rng: Optional pseudo-random number generator.

    Returns:
      Number of points in each cluster (vector of size `num_clusters`).
    """
    # Determine number of points in each cluster using the normal distribution

    # Consider the mean an equal division of points between clusters
    mean = num_points / num_clusters
    # The standard deviation is such that the interval [0, 2 * mean] will contain
    # ≈99.7% of cluster sizes
    std = mean / 3

    # Determine points with the normal distribution
    clu_num_points = std * rng.normal(size=num_clusters) + mean

    # Set negative values to zero
    clu_num_points = np.where(clu_num_points > 0, clu_num_points, 0)

    # Fix imbalances, so that num_points is respected
    if np.sum(clu_num_points) > 0:  # Be careful not to divide by zero
        clu_num_points *= num_points / np.sum(clu_num_points)

    # Round the real values to integers since a cluster sizes is represented by
    # an integer
    clu_num_points = np.rint(clu_num_points).astype(int)

    # Make sure total points is respected, which may not be the case at this time due
    # to rounding
    fix_num_points(clu_num_points, num_points)

    # If empty clusters are not allowed, make sure there aren't any
    if not allow_empty:
        fix_empty(clu_num_points)

    return clu_num_points

fix_empty ¶

fix_empty(clu_num_points: NDArray, allow_empty: bool = False) -> NDArray

Certifies that, given enough points, no clusters are left empty.

This is done by removing a point from the largest cluster and adding it to an empty cluster while there are empty clusters. If the total number of points is smaller than the number of clusters (or if the allow_empty parameter is set to true), this function does nothing.

This function is used internally by clusizes() and might be useful for custom cluster sizing implementations given as the clusizes_fn parameter of the main clugen() function.

Note that the array is changed in-place.

Examples:

>>> from numpy import array
>>> from pyclugen import fix_empty
>>> clusters = array([3, 4, 5, 0, 0])
>>> fix_empty(clusters)
array([3, 3, 4, 1, 1])
>>> clusters # Verify that the array was changed in-place
array([3, 3, 4, 1, 1])

Parameters:

Name	Type	Description	Default
`clu_num_points`	`NDArray`	Number of points in each cluster (vector of size \(c\)), where \(c\) is the number of clusters.	required
`allow_empty`	`bool`	Allow empty clusters?	`False`

Returns:

Type	Description
`NDArray`	Number of points in each cluster, after being fixed by this function (vector of size \(c\), which is the same reference than `clu_num_points`).

Source code in pyclugen/helper.py

def fix_empty(clu_num_points: NDArray, allow_empty: bool = False) -> NDArray:
    r"""Certifies that, given enough points, no clusters are left empty.

    This is done by removing a point from the largest cluster and adding it to an
    empty cluster while there are empty clusters. If the total number of points is
    smaller than the number of clusters (or if the `allow_empty` parameter is set
    to `true`), this function does nothing.

    This function is used internally by [`clusizes()`][pyclugen.module.clusizes]
    and might be useful for custom cluster sizing implementations given as the
    `clusizes_fn` parameter of the main [`clugen()`][pyclugen.main.clugen] function.

    Note that the array is changed in-place.

    Examples:
        >>> from numpy import array
        >>> from pyclugen import fix_empty
        >>> clusters = array([3, 4, 5, 0, 0])
        >>> fix_empty(clusters)
        array([3, 3, 4, 1, 1])
        >>> clusters # Verify that the array was changed in-place
        array([3, 3, 4, 1, 1])

    Args:
      clu_num_points: Number of points in each cluster (vector of size $c$),
        where $c$ is the number of clusters.
      allow_empty: Allow empty clusters?

    Returns:
      Number of points in each cluster, after being fixed by this function (vector
        of size $c$, which is the same reference than `clu_num_points`).
    """
    # If the allow_empty parameter is set to true, don't do anything and return
    # immediately; this is useful for quick `clusizes_fn` one-liners
    if not allow_empty:
        # Find empty clusters
        empty_clusts = [idx for idx, val in enumerate(clu_num_points) if val == 0]

        # If there are empty clusters and enough points for all clusters...
        if len(empty_clusts) > 0 and np.sum(clu_num_points) >= clu_num_points.size:
            # Go through the empty clusters...
            for i0 in empty_clusts:
                # ...get a point from the largest cluster and assign it to the
                # current empty cluster
                imax = np.argmax(clu_num_points)
                clu_num_points[imax] -= 1
                clu_num_points[i0] += 1

    return clu_num_points

fix_num_points ¶

fix_num_points(clu_num_points: NDArray, num_points: int) -> NDArray

Certifies that the values in the clu_num_points array add up to num_points.

If this is not the case, the clu_num_points array is modified in-place, incrementing the value corresponding to the smallest cluster while sum(clu_num_points) < num_points, or decrementing the value corresponding to the largest cluster while sum(clu_num_points) > num_points.

This function is used internally by clusizes() and might be useful for custom cluster sizing implementations given as the clusizes_fn parameter of the main clugen() function.

Examples:

>>> from numpy import array
>>> from pyclugen import fix_num_points
>>> clusters = array([1, 6, 3])  # 10 total points
>>> fix_num_points(clusters, 12) # But we want 12 total points
array([3, 6, 3])
>>> clusters # Verify that the array was changed in-place
array([3, 6, 3])

Parameters:

Name	Type	Description	Default
`clu_num_points`	`NDArray`	Number of points in each cluster (vector of size \(c\)), where \(c\) is the number of clusters.	required
`num_points`	`int`	The expected total number of points.	required

Returns:

Type	Description
`NDArray`	Number of points in each cluster, after being fixed by this function (vector of size \(c\), which is the same reference than `clu_num_points`).

Source code in pyclugen/helper.py

def fix_num_points(clu_num_points: NDArray, num_points: int) -> NDArray:
    r"""Certifies that the values in the `clu_num_points` array add up to `num_points`.

    If this is not the case, the `clu_num_points` array is modified in-place,
    incrementing the value corresponding to the smallest cluster while
    `sum(clu_num_points) < num_points`, or decrementing the value corresponding to
    the largest cluster while `sum(clu_num_points) > num_points`.

    This function is used internally by [`clusizes()`][pyclugen.module.clusizes]
    and might be useful for custom cluster sizing implementations given as the
    `clusizes_fn` parameter of the main [`clugen()`][pyclugen.main.clugen] function.

    Examples:
        >>> from numpy import array
        >>> from pyclugen import fix_num_points
        >>> clusters = array([1, 6, 3])  # 10 total points
        >>> fix_num_points(clusters, 12) # But we want 12 total points
        array([3, 6, 3])
        >>> clusters # Verify that the array was changed in-place
        array([3, 6, 3])

    Args:
      clu_num_points: Number of points in each cluster (vector of size $c$),
        where $c$ is the number of clusters.
      num_points: The expected total number of points.

    Returns:
      Number of points in each cluster, after being fixed by this function (vector
        of size $c$, which is the same reference than `clu_num_points`).
    """
    while np.sum(clu_num_points) < num_points:
        imin = np.argmin(clu_num_points)
        clu_num_points[imin] += 1
    while np.sum(clu_num_points) > num_points:
        imax = np.argmax(clu_num_points)
        clu_num_points[imax] -= 1

    return clu_num_points

llengths ¶

llengths(
    num_clusters: int,
    llength: float,
    llength_disp: float,
    rng: Generator = _default_rng,
) -> NDArray

Determine length of cluster-supporting lines.

Line lengths are determined using the folded normal distribution ( \(\mu=\)llength, \(\sigma=\)llength_disp).

Examples:

>>> from numpy.random import Generator, MT19937
>>> from pyclugen import llengths
>>> prng = Generator(MT19937(123))
>>> llengths(4, 20, 3.5, rng=prng)
array([19.50968733, 19.92482858, 25.99013804, 18.58029672])

Parameters:

Name	Type	Description	Default
`num_clusters`	`int`	Number of clusters.	required
`llength`	`float`	Average line length.	required
`llength_disp`	`float`	Line length dispersion.	required
`rng`	`Generator`	Optional pseudo-random number generator.	`_default_rng`

Returns:

Type	Description
`NDArray`	Lengths of cluster-supporting lines (vector of size `num_clusters`).

Source code in pyclugen/module.py

def llengths(
    num_clusters: int,
    llength: float,
    llength_disp: float,
    rng: Generator = _default_rng,
) -> NDArray:
    r"""Determine length of cluster-supporting lines.

    Line lengths are determined using the folded normal distribution (
    $\mu=$`llength`, $\sigma=$`llength_disp`).

    Examples:
        >>> from numpy.random import Generator, MT19937
        >>> from pyclugen import llengths
        >>> prng = Generator(MT19937(123))
        >>> llengths(4, 20, 3.5, rng=prng)
        array([19.50968733, 19.92482858, 25.99013804, 18.58029672])

    Args:
      num_clusters: Number of clusters.
      llength: Average line length.
      llength_disp: Line length dispersion.
      rng: Optional pseudo-random number generator.

    Returns:
      Lengths of cluster-supporting lines (vector of size `num_clusters`).
    """
    return np.abs(llength + llength_disp * rng.normal(size=num_clusters))

points_on_line ¶

points_on_line(
    center: NDArray, direction: NDArray, dist_center: NDArray
) -> NDArray

Determine coordinates of points on a line.

Determine coordinates of points on a line with center and direction, based on the distances from the center given in dist_center.

This works by using the vector formulation of the line equation assuming direction is a \(n\)-dimensional unit vector. In other words, considering \(\mathbf{d}=\)direction.reshape(-1,1) ( \(n \times 1\) vector), \(\mathbf{c}=\)center.reshape(-1,1) ( \(n \times 1\) vector), and \(\mathbf{w}=\) dist_center.reshape(-1,1) ( \(p \times 1\) vector), the coordinates of points on the line are given by:

\[ \mathbf{P}=\mathbf{1}\,\mathbf{c}^T + \mathbf{w}\mathbf{d}^T \]

where \(\mathbf{P}\) is the \(p \times n\) matrix of point coordinates on the line, and \(\mathbf{1}\) is a \(p \times 1\) vector with all entries equal to 1.

Examples:

>>> from pyclugen import points_on_line
>>> from numpy import array, linspace
>>> points_on_line(array([5., 5.]),
...                array([1., 0.]),
...                linspace(-4, 4, 5)) # 2D, 5 points
array([[1., 5.],
       [3., 5.],
       [5., 5.],
       [7., 5.],
       [9., 5.]])
>>> points_on_line(array([-2, 0, 0., 2]),
...                array([0., 0, -1, 0]),
...                array([10, -10])) # 4D, 2 points
array([[ -2.,   0., -10.,   2.],
       [ -2.,   0.,  10.,   2.]])

Parameters:

Name	Type	Description	Default
`center`	`NDArray`	Center of the line ( \(n\)-component vector).	required
`direction`	`NDArray`	Line direction ( \(n\)-component unit vector).	required
`dist_center`	`NDArray`	Distance of each point to the center of the line ( \(p\)-component vector, where \(p\) is the number of points).	required

Returns:

Type	Description
`NDArray`	Coordinates of points on the specified line ( \(p \times n\) matrix).

Source code in pyclugen/core.py

def points_on_line(
    center: NDArray, direction: NDArray, dist_center: NDArray
) -> NDArray:
    r"""Determine coordinates of points on a line.

    Determine coordinates of points on a line with `center` and `direction`,
    based on the distances from the center given in `dist_center`.

    This works by using the vector formulation of the line equation assuming
    `direction` is a $n$-dimensional unit vector. In other words, considering
    $\mathbf{d}=$`direction.reshape(-1,1)` ( $n \times 1$ vector),
    $\mathbf{c}=$`center.reshape(-1,1)` ( $n \times 1$ vector), and
    $\mathbf{w}=$ `dist_center.reshape(-1,1)` ( $p \times 1$ vector),
    the coordinates of points on the line are given by:

    $$
    \mathbf{P}=\mathbf{1}\,\mathbf{c}^T + \mathbf{w}\mathbf{d}^T
    $$

    where $\mathbf{P}$ is the $p \times n$ matrix of point coordinates on the
    line, and $\mathbf{1}$ is a $p \times 1$ vector with all entries equal to 1.

    Examples:
        >>> from pyclugen import points_on_line
        >>> from numpy import array, linspace
        >>> points_on_line(array([5., 5.]),
        ...                array([1., 0.]),
        ...                linspace(-4, 4, 5)) # 2D, 5 points
        array([[1., 5.],
               [3., 5.],
               [5., 5.],
               [7., 5.],
               [9., 5.]])
        >>> points_on_line(array([-2, 0, 0., 2]),
        ...                array([0., 0, -1, 0]),
        ...                array([10, -10])) # 4D, 2 points
        array([[ -2.,   0., -10.,   2.],
               [ -2.,   0.,  10.,   2.]])

    Args:
      center: Center of the line ( $n$-component vector).
      direction: Line direction ( $n$-component unit vector).
      dist_center: Distance of each point to the center of the line
        ( $p$-component vector, where $p$ is the number of points).

    Returns:
      Coordinates of points on the specified line ( $p \times n$ matrix).
    """
    return center.reshape(1, -1) + dist_center.reshape(-1, 1) @ direction.reshape(
        (1, -1)
    )

rand_ortho_vector ¶

rand_ortho_vector(u: NDArray, rng: Generator = _default_rng) -> NDArray

Get a random unit vector orthogonal to u.

Note that u is expected to be a unit vector itself.

Examples:

>>> from pyclugen import rand_ortho_vector
>>> from numpy import isclose, dot
>>> from numpy.linalg import norm
>>> from numpy.random import Generator, PCG64
>>> rng = Generator(PCG64(123))
>>> r = rng.random(3) # Get a random vector with 3 components (3D)
>>> r = r / norm(r) # Normalize it
>>> r_ort = rand_ortho_vector(r, rng=rng) # Get random unit vector orth. to r
>>> r_ort
array([-0.1982903 , -0.61401512,  0.76398062])
>>> bool(isclose(dot(r, r_ort), 0)) # Check that vectors are orthogonal
True

Parameters:

Name	Type	Description	Default
`u`	`NDArray`	Unit vector with \(n\) components.	required
`rng`	`Generator`	Optional pseudo-random number generator.	`_default_rng`

Returns:

Type	Description
`NDArray`	A random unit vector with \(n\) components orthogonal to `u`.

Source code in pyclugen/core.py

def rand_ortho_vector(u: NDArray, rng: Generator = _default_rng) -> NDArray:
    r"""Get a random unit vector orthogonal to `u`.

    Note that `u` is expected to be a unit vector itself.

    Examples:
        >>> from pyclugen import rand_ortho_vector
        >>> from numpy import isclose, dot
        >>> from numpy.linalg import norm
        >>> from numpy.random import Generator, PCG64
        >>> rng = Generator(PCG64(123))
        >>> r = rng.random(3) # Get a random vector with 3 components (3D)
        >>> r = r / norm(r) # Normalize it
        >>> r_ort = rand_ortho_vector(r, rng=rng) # Get random unit vector orth. to r
        >>> r_ort
        array([-0.1982903 , -0.61401512,  0.76398062])
        >>> bool(isclose(dot(r, r_ort), 0)) # Check that vectors are orthogonal
        True

    Args:
      u: Unit vector with $n$ components.
      rng: Optional pseudo-random number generator.

    Returns:
      A random unit vector with $n$ components orthogonal to `u`.
    """
    # If 1D, just return a random unit vector
    if u.size == 1:
        return rand_unit_vector(1, rng=rng)

    # Find a random, non-parallel vector to u
    while True:
        # Find normalized random vector
        r = rand_unit_vector(u.size, rng=rng)

        # If not parallel to u we can keep it and break the loop
        if not np.isclose(np.abs(np.dot(u, r)), 1):
            break

    # Get vector orthogonal to u using 1st iteration of Gram-Schmidt process
    v = r - np.dot(u, r) / np.dot(u, u) * u

    # Normalize it
    v = v / norm(v)

    # And return it
    return v

rand_unit_vector ¶

rand_unit_vector(num_dims: int, rng: Generator = _default_rng) -> NDArray

Get a random unit vector with num_dims components.

Examples:

>>> from pyclugen import rand_unit_vector
>>> rand_unit_vector(4)
array([ 0.48653889,  0.50753862,  0.05711487, -0.70881757])

>>> from pyclugen import rand_unit_vector
>>> from numpy.random import Generator, PCG64
>>> rng = Generator(PCG64(123))
>>> rand_unit_vector(2, rng=rng) # Reproducible
array([ 0.3783202 , -0.92567479])

Parameters:

Name	Type	Description	Default
`num_dims`	`int`	Number of components in vector (i.e. vector size).	required
`rng`	`Generator`	Optional pseudo-random number generator.	`_default_rng`

Returns:

Type	Description
`NDArray`	A random unit vector with `num_dims` components.

Source code in pyclugen/core.py

def rand_unit_vector(num_dims: int, rng: Generator = _default_rng) -> NDArray:
    r"""Get a random unit vector with `num_dims` components.

    Examples:
        >>> from pyclugen import rand_unit_vector
        >>> rand_unit_vector(4) # doctest: +SKIP
        array([ 0.48653889,  0.50753862,  0.05711487, -0.70881757])

        >>> from pyclugen import rand_unit_vector
        >>> from numpy.random import Generator, PCG64
        >>> rng = Generator(PCG64(123))
        >>> rand_unit_vector(2, rng=rng) # Reproducible
        array([ 0.3783202 , -0.92567479])

    Args:
      num_dims: Number of components in vector (i.e. vector size).
      rng: Optional pseudo-random number generator.

    Returns:
      A random unit vector with `num_dims` components.
    """
    r = rng.random(num_dims) - 0.5
    r = r / norm(r)
    return r

rand_vector_at_angle ¶

rand_vector_at_angle(
    u: NDArray, angle: float, rng: Generator = _default_rng
) -> NDArray

Get a random unit vector which is at angle radians of vector u.

Note that u is expected to be a unit vector itself.

Examples:

>>> from pyclugen import rand_vector_at_angle
>>> from numpy import arccos, array, degrees, pi, dot
>>> from numpy.linalg import norm
>>> from numpy.random import Generator, PCG64
>>> rng = Generator(PCG64(123))
>>> u = array([ 1.0, 0, 0.5, -0.5 ]) # Define a 4D vector
>>> u = u / norm(u) # Normalize the vector
>>> v = rand_vector_at_angle(u, pi/4, rng=rng) # Get a vector at 45 degrees
>>> v
array([ 0.633066  , -0.50953554, -0.10693823, -0.57285705])
>>> float(degrees(arccos(dot(u, v) / norm(u) * norm(v)))) # u-v angle
45.0

Parameters:

Name	Type	Description	Default
`u`	`NDArray`	Unit vector with \(n\) components.	required
`angle`	`float`	Angle in radians.	required
`rng`	`Generator`	Optional pseudo-random number generator.	`_default_rng`

Returns:

Type	Description
`NDArray`	Random unit vector with \(n\) components which is at `angle` radians with vector `u`.

Source code in pyclugen/core.py

def rand_vector_at_angle(
    u: NDArray, angle: float, rng: Generator = _default_rng
) -> NDArray:
    r"""Get a random unit vector which is at `angle` radians of vector `u`.

    Note that `u` is expected to be a unit vector itself.

    Examples:
        >>> from pyclugen import rand_vector_at_angle
        >>> from numpy import arccos, array, degrees, pi, dot
        >>> from numpy.linalg import norm
        >>> from numpy.random import Generator, PCG64
        >>> rng = Generator(PCG64(123))
        >>> u = array([ 1.0, 0, 0.5, -0.5 ]) # Define a 4D vector
        >>> u = u / norm(u) # Normalize the vector
        >>> v = rand_vector_at_angle(u, pi/4, rng=rng) # Get a vector at 45 degrees
        >>> v
        array([ 0.633066  , -0.50953554, -0.10693823, -0.57285705])
        >>> float(degrees(arccos(dot(u, v) / norm(u) * norm(v)))) # u-v angle
        45.0

    Args:
      u: Unit vector with $n$ components.
      angle: Angle in radians.
      rng: Optional pseudo-random number generator.

    Returns:
      Random unit vector with $n$ components which is at `angle` radians
        with vector `u`.
    """
    if np.isclose(abs(angle), np.pi / 2) and u.size > 1:
        return rand_ortho_vector(u, rng=rng)
    elif -np.pi / 2 < angle < np.pi / 2 and u.size > 1:
        v = u + rand_ortho_vector(u, rng=rng) * tan(angle)
        return v / norm(v)
    else:
        # For |θ| > π/2 or the 1D case, simply return a random vector
        return rand_unit_vector(u.size, rng=rng)

Reference

Clusters ¶

angles instance-attribute ¶

centers instance-attribute ¶

clusters instance-attribute ¶

directions instance-attribute ¶

lengths instance-attribute ¶

points instance-attribute ¶

projections instance-attribute ¶

sizes instance-attribute ¶

angle_btw ¶

angle_deltas ¶

clucenters ¶

clugen ¶

Examples:¶

clumerge ¶

clupoints_n ¶

clupoints_n_1 ¶

clupoints_n_1_template ¶

clusizes ¶

fix_empty ¶

fix_num_points ¶

llengths ¶

points_on_line ¶

rand_ortho_vector ¶

rand_unit_vector ¶

rand_vector_at_angle ¶

angles `instance-attribute` ¶

centers `instance-attribute` ¶

clusters `instance-attribute` ¶

directions `instance-attribute` ¶

lengths `instance-attribute` ¶

points `instance-attribute` ¶

projections `instance-attribute` ¶

sizes `instance-attribute` ¶