Theory
This section presents a general overview of the clugen algorithm. A complete description of the algorithm's theoretical framework is available in the article "Generating multidimensional clusters with support lines" (an open version is available on arXiv).
Clugen is an algorithm for generating multidimensional clusters. Each cluster is supported by a line segment, the position, orientation and length of which guide where the respective points are placed. For brevity, line segments will be referred to as lines.
Given an $n$-dimensional direction vector $\mathbf{d}$ (and a number of additional parameters, which will be discussed shortly), the clugen algorithm works as follows ($^*$ means the algorithm step is stochastic):
- Normalize $\mathbf{d}$.
- $^*$Determine cluster sizes.
- $^*$Determine cluster centers.
- $^*$Determine lengths of cluster-supporting lines.
- $^*$Determine angles between $\mathbf{d}$ and cluster-supporting lines.
- For each cluster:
  - $^*$Determine direction of the cluster-supporting line.
  - $^*$Determine distance of point projections from the center of the cluster-supporting line.
  - Determine coordinates of point projections on the cluster-supporting line.
  - $^*$Determine points from their projections on the cluster-supporting line.
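The steps above can be sketched in a few lines of plain Julia. This is a minimal illustration under stated assumptions, not the CluGen.jl implementation: the function names `clugen_sketch` and `direction_at_angle` are hypothetical, and the distributions used for each stochastic step (random assignment for sizes, uniform for centers and projections, normal for lengths, angles and lateral offsets) are chosen here only for illustration. The sketch also assumes $n \geq 2$.

```julia
using LinearAlgebra, Random

# Hypothetical helper: unit vector making angle θ with the unit vector d (n >= 2).
function direction_at_angle(rng, d, θ)
    # Draw a random vector and remove its component along d (Gram-Schmidt),
    # which leaves a random unit vector orthogonal to d.
    v = randn(rng, length(d))
    o = normalize(v - dot(v, d) * d)
    # d and o are orthonormal, so this combination makes angle θ with d.
    return normalize(cos(θ) * d + sin(θ) * o)
end

# Hypothetical sketch of the algorithm steps; distributions are assumptions.
function clugen_sketch(n, c, p, d, θσ, s, l, lσ, fσ; rng=MersenneTwister(1234))
    d = normalize(float.(d))                              # normalize d

    labels = sort(rand(rng, 1:c, p))                      # * cluster sizes
    sizes = [count(==(k), labels) for k in 1:c]

    centers = [c .* s .* (rand(rng, n) .- 0.5) for _ in 1:c]  # * cluster centers

    lengths = abs.(l .+ lσ .* randn(rng, c))              # * line lengths
    angles = θσ .* randn(rng, c)                          # * angles to d

    points = zeros(p, n)
    i = 0
    for k in 1:c
        dk = direction_at_angle(rng, d, angles[k])        # * line direction
        for _ in 1:sizes[k]
            t = lengths[k] * (rand(rng) - 0.5)            # * distance from line center
            proj = centers[k] .+ t .* dk                  # projection coordinates
            i += 1
            points[i, :] = proj .+ fσ .* randn(rng, n)    # * point from projection
        end
    end
    return (points=points, clusters=labels, sizes=sizes,
            centers=centers, lengths=lengths, angles=angles)
end

# Usage with the same argument order as the parameters discussed in this section.
r = clugen_sketch(2, 4, 200, [1, 1], pi/16, [10, 10], 10, 1.5, 1)
size(r.points)  # (200, 2)
```

Sorting the cluster labels keeps the rows of `points` grouped by cluster, so row `i` of `points` belongs to cluster `clusters[i]`.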
Figure 1 provides a stylized overview of the algorithm's steps.
Figure 1 - Stylized overview of the clugen algorithm. Background tiles are 10 units wide and tall, when applicable.
The example in Figure 1 was generated with the following parameters, the exact meaning of which is described in the documentation for the `clugen()` function and further discussed in the article mentioned above:
Parameter values | Description |
---|---|
$n=2$ | Number of dimensions. |
$c=4$ | Number of clusters. |
$p=200$ | Total number of points. |
$\mathbf{d}=\begin{bmatrix}1 & 1\end{bmatrix}^T$ | Average direction. |
$\theta_\sigma=\pi/16=11.25^{\circ}$ | Angle dispersion. |
$\mathbf{s}=\begin{bmatrix}10 & 10\end{bmatrix}^T$ | Average cluster separation. |
$l=10$ | Average line length. |
$l_\sigma=1.5$ | Line length dispersion. |
$f_\sigma=1$ | Cluster lateral dispersion. |
Additionally, all optional parameters (not listed above) were left at their default values. These will also be discussed next. This example can be reproduced and plotted with the following instructions:
julia> using CluGen, Plots
julia> r = clugen(2, 4, 200, [1, 1], pi/16, [10, 10], 10, 1.5, 1; rng=1234);
julia> plot(r.points[:,1], r.points[:,2], seriestype=:scatter, group=r.clusters)