src.actions.separation module
Data Separation and Balancing Module
This module provides functions for separating, balancing, and organizing neural network training data based on various thresholds and criteria. It contains utilities for:
Window-based data separation into good, bad, and difficult categories
Dataset balancing and concatenation operations
Train/validation/test split functionality
Outlier detection and removal
MVNO and VMAF-specific separation methods
The module is designed to work with TensorFlow tensors and DataManager objects, providing comprehensive data preprocessing capabilities for neural network training.
Functions
- window_to_group
Separate windows into good, bad, and gray categories based on feature thresholds.
- window_to_group_mvno
MVNO-specific window separation function.
- windowsSeparation_mvno
Complete MVNO window separation with class and label assignment.
- windowsSeparation_vmaf
VMAF-based window separation with quality thresholds.
- windowsSeparation
General window separation with multiple feature thresholds.
- flag_windows
Flag windows based on threshold criteria for difficult case identification.
- windowDropOutliers
Remove outlier samples from datasets based on target thresholds.
- TrnValTstSeparation
Split data into training, validation, and test sets.
- balancingConcatenateNG
Balance and concatenate multiple datasets according to specified ratios.
- ConcatenateNG
Concatenate multiple DataManager instances.
- BalanceSeparationNG
Combined portioning and balancing for train/validation/test splits.
Notes
All functions in this module use the @action and @f_logger decorators for consistent logging and action tracking within the SNN2 framework.
- src.actions.separation.BalanceSeparationNG(data: List[augmented_cls], portion: Tuple[Tuple[float, float, float], ...] = ((0.7, 0.2, 0.1), (0.7, 0.2, 0.1), (0.0, 0.0, 0.0)), balancing: Tuple[Tuple[float, float, float], ...] = ((0.98, 0.02, 0.0), (0.5, 0.5, 0.0), (0.5, 0.5, 0.0)), test_get_everything: bool = False, *, logger=None, write_msg=<function f_logger.<locals>.__dummy_log>, **kwargs) -> Tuple[Dict[str, Dict[str, Any]], ...]
Apply portioning and balancing to multiple datasets for train/validation/test splits.
This function performs comprehensive dataset preparation by first splitting each input dataset into training, validation, and test portions, then applying balancing ratios to create final mixed datasets with desired class distributions.
- Parameters:
data (List[DataManager]) – List of DataManager instances representing different dataset categories (e.g., good, bad, difficult samples).
portion (Tuple[Tuple[float, float, float], ...], default=((0.7, 0.2, 0.1), (0.7, 0.2, 0.1), (0.0, 0.0, 0.0))) – Tuple of portion tuples specifying how to split each dataset. Each inner tuple contains (train_portion, validation_portion, test_portion) and must sum to at most 1.0. The length must match the number of input datasets.
balancing (Tuple[Tuple[float, float, float], ...], default=((0.98, 0.02, 0.0), (0.5, 0.5, 0.0), (0.5, 0.5, 0.0))) – Tuple of balancing ratios for the train, validation, and test sets. It contains exactly 3 tuples, (train_balance, val_balance, test_balance), and each inner tuple specifies the desired proportion of each dataset category.
test_get_everything (bool, default=False) – If True, test set includes all unused data from training and validation. If False, test set uses balanced sampling like train and validation.
**kwargs (dict) – Additional keyword arguments containing logger and write_msg from decorators.
- Returns:
A tuple containing three DataManager instances:
- training: Balanced training dataset
- validation: Balanced validation dataset
- test: Balanced test dataset (or everything if test_get_everything=True)
- Return type:
Tuple[Dict[str, Dict[str, Any]], …]
- Raises:
AssertionError – If input validation fails for data/portion/balancing length matching.
Notes
The function first applies TrnValTstSeparation to each input dataset according to the specified portions, then uses balancingConcatenateNG to create balanced final datasets according to the balancing ratios.
When test_get_everything=True, unused samples from training and validation balancing are added to the test set, providing comprehensive test coverage.
Examples
>>> train, val, test = BalanceSeparationNG(
...     data=[good_data, bad_data, gray_data],
...     portion=((0.8, 0.1, 0.1), (0.7, 0.15, 0.15), (0.6, 0.2, 0.2)),
...     balancing=((0.7, 0.3, 0.0), (0.5, 0.5, 0.0), (0.4, 0.4, 0.2)),
...     test_get_everything=True
... )
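The two stages described in the Notes can be sketched in plain Python on lists of sample identifiers. This is a simplified illustration, not the SNN2 implementation: `split` and `balance` are hypothetical stand-ins for TrnValTstSeparation and balancingConcatenateNG, and the way unused samples are appended to the test set is an assumption based on the description of test_get_everything.

```python
import random
from typing import List, Sequence, Tuple

Portion = Tuple[float, float, float]


def split(items: Sequence[str], portion: Portion, rng: random.Random) -> List[List[str]]:
    """Hypothetical helper: shuffle one dataset and cut it into train/val/test parts."""
    shuffled = list(items)
    rng.shuffle(shuffled)
    n_train = int(len(shuffled) * portion[0])
    n_val = int(len(shuffled) * portion[1])
    n_test = int(len(shuffled) * portion[2])
    return [
        shuffled[:n_train],
        shuffled[n_train:n_train + n_val],
        shuffled[n_train + n_val:n_train + n_val + n_test],
    ]


def balance(parts: Sequence[Sequence[str]], ratios: Sequence[float]) -> Tuple[List[str], List[str]]:
    """Hypothetical helper: mix the parts by ratio and return (used, not_used)."""
    # The part that is smallest relative to its ratio bounds the mixed total.
    limits = [len(p) / r for p, r in zip(parts, ratios) if r > 0 and len(p) > 0]
    total = min(limits) if limits else 0.0
    used, not_used = [], []
    for part, ratio in zip(parts, ratios):
        n = min(len(part), int(total * ratio))
        used.extend(part[:n])
        not_used.extend(part[n:])
    return used, not_used


rng = random.Random(0)
data = [[f"good{i}" for i in range(100)], [f"bad{i}" for i in range(20)]]
portion = ((0.7, 0.2, 0.1), (0.5, 0.25, 0.25))        # one tuple per dataset
balancing = ((0.75, 0.25), (0.5, 0.5), (0.5, 0.5))    # one tuple per split

per_dataset = [split(d, p, rng) for d, p in zip(data, portion)]
# Regroup so that each entry holds the same split of every dataset.
per_split = list(zip(*per_dataset))

train, train_rest = balance(per_split[0], balancing[0])
val, val_rest = balance(per_split[1], balancing[1])
test, _ = balance(per_split[2], balancing[2])

# Assumed behaviour for test_get_everything=True: unused train/val samples join the test set.
test_everything = test + train_rest + val_rest
```

Here the 20 bad samples are the limiting category, so the training mix is capped by how many bad samples are available at the requested ratio.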
- src.actions.separation.ConcatenateNG(datasets: List[augmented_cls], label: str = 'MergedData', *, logger=None, write_msg=<function f_logger.<locals>.__dummy_log>, **kwargs) -> augmented_cls
Concatenate multiple datasets into a single DataManager instance.
This function merges a list of DataManager objects into one consolidated dataset and logs the merged data with a specified label.
- Parameters:
datasets (List[DataManager]) – A list of DataManager objects to be concatenated/merged together.
label (str, optional) – Label to use when logging the merged dataset. Default is “MergedData”.
**kwargs – Arbitrary keyword arguments. Must contain:
- logger: Logger instance for logging operations
- write_msg: Message writing function for logging
- Returns:
A new DataManager instance containing the merged data from all input datasets.
- Return type:
DataManager
Notes
The function uses DataManager.merge() to perform the actual concatenation and automatically logs the result using the provided label.
- src.actions.separation.TrnValTstSeparation(data: augmented_cls, training_portion: float = 0.7, validation_portion: float = 0.1, test_portion: float = 0.2, to_dataset: bool = True, *, logger=None, write_msg=<function f_logger.<locals>.__dummy_log>, **kwargs) -> Tuple[Dict[str, Dict[str, Any]], ...]
Split dataset into training, validation, and test sets with specified proportions.
This function randomly splits a dataset into three subsets for machine learning model training, validation, and testing. It ensures no overlap between the sets and provides statistics about positive and negative samples.
- Parameters:
data (DataManager) – The input DataManager instance containing the complete dataset.
training_portion (float, default=0.7) – Fraction of data to use for training (0.0 to 1.0).
validation_portion (float, default=0.1) – Fraction of data to use for validation (0.0 to 1.0).
test_portion (float, default=0.2) – Fraction of data to use for testing (0.0 to 1.0).
to_dataset (bool, default=True) – Whether to convert the resulting data to TensorFlow datasets.
**kwargs (dict) – Additional keyword arguments containing logger and write_msg from decorators.
- Returns:
A tuple containing three DataManager instances:
- training: Training dataset
- validation: Validation dataset
- test: Test dataset
- Return type:
Tuple[Dict[str, Dict[str, Any]], …]
- Raises:
AssertionError – If the sum of portions exceeds 1.0 or if there are overlapping indices.
Notes
The function preserves TensorFlow dataset objects if they exist in the original data and randomly shuffles indices to ensure unbiased splits. It also logs statistics about positive and negative target value ranges.
Examples
>>> train, val, test = TrnValTstSeparation(data, 0.8, 0.1, 0.1)
>>> # Splits data into 80% training, 10% validation, 10% test
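The index handling described above (random shuffle, three slices, no overlap) can be illustrated with a short standalone sketch. `split_indices` and its `seed` parameter are hypothetical names introduced for this example; the real function operates on DataManager objects and may round the slice sizes differently.

```python
import random
from typing import List, Tuple


def split_indices(
    n_samples: int,
    training_portion: float = 0.7,
    validation_portion: float = 0.1,
    test_portion: float = 0.2,
    seed: int = 0,
) -> Tuple[List[int], List[int], List[int]]:
    """Hypothetical helper: shuffle indices and cut them into three disjoint slices."""
    total = training_portion + validation_portion + test_portion
    # Small tolerance so that 0.7 + 0.1 + 0.2 is not rejected by float rounding.
    assert total <= 1.0 + 1e-9, "portions must sum to at most 1.0"

    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)

    n_train = int(n_samples * training_portion)
    n_val = int(n_samples * validation_portion)
    n_test = int(n_samples * test_portion)

    train = indices[:n_train]
    val = indices[n_train:n_train + n_val]
    test = indices[n_train + n_val:n_train + n_val + n_test]

    # The three index sets must not overlap.
    assert not (set(train) & set(val) or set(train) & set(test) or set(val) & set(test))
    return train, val, test


train_idx, val_idx, test_idx = split_indices(100, 0.8, 0.1, 0.1)
```

Because the slices are taken from one shuffled list, disjointness follows from the slicing itself; the assertion only mirrors the overlap check mentioned under Raises.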
- src.actions.separation.balancingConcatenateNG(datasets: List[augmented_cls], balancing: Tuple[float, ...], *, logger=None, write_msg=<function f_logger.<locals>.__dummy_log>, **kwargs) -> Tuple[Dict[str, Dict[str, Any]], ...]
Balance and concatenate multiple datasets according to specified ratios.
This function takes multiple datasets and balances them according to the provided balancing ratios, then concatenates them into merged datasets. It handles cases where some datasets might be empty and ensures proper proportional sampling.
- Parameters:
datasets (List[DataManager]) – List of DataManager objects containing the datasets to be balanced and concatenated.
balancing (Tuple[float, ...]) – Tuple of float values specifying the desired proportion for each dataset. Must have the same length as datasets.
**kwargs (dict) – Additional keyword arguments containing:
- logger: Logger object for logging operations
- write_msg: Function for writing messages/logs
- Returns:
A tuple containing two dictionaries:
- merged_data: Concatenated data from the balanced sampling
- merged_not_used_data: Concatenated data that wasn’t used in balancing
- Return type:
Tuple[Dict[str, Dict[str, Any]], …]
- Raises:
AssertionError – If the length of balancing tuple doesn’t match the number of datasets.
Notes
- The function handles empty datasets by removing them from consideration.
- It uses the minimum dataset size relative to its balancing ratio to determine the total number of elements.
- It randomly shuffles indices before sampling.
- At least 1 element is required from each dataset, even if the balancing ratio suggests 0.
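The sizing rules in these notes can be made concrete with a small sketch. `balanced_counts` is a hypothetical helper that only computes how many samples to draw from each dataset; reading the "at least 1 element" rule as a minimum draw of one sample, and skipping zero ratios when looking for the limiting dataset, are both assumptions.

```python
from typing import List, Sequence


def balanced_counts(sizes: Sequence[int], balancing: Sequence[float]) -> List[int]:
    """Hypothetical helper: number of samples to draw from each dataset."""
    assert len(sizes) == len(balancing), "one ratio per dataset is required"

    # Empty datasets are dropped from consideration.
    active = [i for i, size in enumerate(sizes) if size > 0]

    # The dataset that is smallest relative to its ratio bounds the total.
    # Ratios of 0 are skipped here to avoid dividing by zero (an assumption).
    totals = [sizes[i] / balancing[i] for i in active if balancing[i] > 0]
    total = min(totals) if totals else 0.0

    counts = [0] * len(sizes)
    for i in active:
        wanted = int(total * balancing[i])
        # Draw at least one sample, and never more than the dataset holds.
        counts[i] = min(sizes[i], max(1, wanted))
    return counts


counts = balanced_counts(sizes=[1000, 50, 20], balancing=(0.5, 0.5, 0.0))
empty_case = balanced_counts(sizes=[0, 10], balancing=(0.5, 0.5))
```

In the first call the 50-sample dataset is the limiting one, so the mix totals 100 samples split evenly, plus the single sample taken from the zero-ratio dataset.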
- src.actions.separation.flag_windows(wdw: Tensor, pdr_threshold: float | str | None = None, bdr_threshold: float | str | None = None, avg_ipt_threshold: float | str | None = None, std_ipt_threshold: float | str | None = None, skw_ipt_threshold: float | str | None = None, kur_ipt_threshold: float | str | None = None, log: LogHandler | Callable = None) -> Tensor
Flag windows as positive (1) or negative (0) based on multiple feature thresholds.
This function evaluates windows against multiple feature thresholds and returns binary flags indicating whether each window exceeds any of the specified thresholds. It’s primarily used for labeling difficult/gray windows in the separation process.
- Parameters:
wdw (tf.Tensor) – Input tensor containing window data with shape (n_windows, window_size, n_features).
pdr_threshold (Optional[Union[float, str]], optional) – Threshold for Packet Delivery Ratio feature.
bdr_threshold (Optional[Union[float, str]], optional) – Threshold for Bit Delivery Ratio feature.
avg_ipt_threshold (Optional[Union[float, str]], optional) – Threshold for average Inter-Packet Time feature.
std_ipt_threshold (Optional[Union[float, str]], optional) – Threshold for standard deviation of Inter-Packet Time feature.
skw_ipt_threshold (Optional[Union[float, str]], optional) – Threshold for skewness of Inter-Packet Time feature.
kur_ipt_threshold (Optional[Union[float, str]], optional) – Threshold for kurtosis of Inter-Packet Time feature.
log (Union[LH, Callable], optional) – Logger instance or callable for logging operations.
- Returns:
Binary tensor of shape (n_windows,) where 1 indicates the window exceeds at least one threshold and 0 indicates it doesn’t exceed any threshold.
- Return type:
tf.Tensor
- Raises:
Exception – If any of the required thresholds is None.
Notes
String thresholds are parsed using ast.literal_eval. The function checks each feature column against its corresponding threshold and flags windows that exceed any threshold value.
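As an illustration of the flagging rule, the sketch below works on nested Python lists instead of TensorFlow tensors. `parse_threshold` and `flag` are hypothetical names, and treating a window as flagged when any time step exceeds a threshold is an assumption: the text does not say how the window dimension is reduced.

```python
import ast
from typing import List, Sequence, Union

Threshold = Union[float, str]


def parse_threshold(value: Threshold) -> float:
    """Accept either a number or its string form, e.g. "0.5"."""
    if isinstance(value, str):
        return float(ast.literal_eval(value))
    return float(value)


def flag(windows: Sequence[Sequence[Sequence[float]]],
         thresholds: Sequence[Threshold]) -> List[int]:
    """Hypothetical helper: 1 if any feature value in the window exceeds its threshold."""
    limits = [parse_threshold(t) for t in thresholds]
    flags = []
    for window in windows:  # window has shape (window_size, n_features)
        exceeded = any(
            step[col] > limit
            for step in window
            for col, limit in enumerate(limits)
        )
        flags.append(1 if exceeded else 0)
    return flags


windows = [
    [[0.1, 0.2], [0.2, 0.1]],  # stays below both thresholds
    [[0.1, 0.9], [0.2, 0.1]],  # second feature exceeds 0.5 once
]
flags = flag(windows, thresholds=[0.5, "0.5"])
```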
- src.actions.separation.windowDropOutliers(data: augmented_cls, threshold: float | str = 101.0, *, logger=None, write_msg=<function f_logger.<locals>.__dummy_log>, **kwargs) -> Dict[str, Dict[str, Any]]
Remove outlier samples from a dataset based on target value thresholds.
This function filters out data samples where the target values exceed a specified threshold, effectively removing outliers from the dataset.
- Parameters:
data (DataManager) – The DataManager instance containing the dataset to be filtered.
threshold (Union[float, str], default=101.0) – The threshold value above which samples are considered outliers and removed. If string, it will be parsed using ast.literal_eval.
**kwargs (dict) – Additional keyword arguments containing logger and write_msg from decorators.
- Returns:
The filtered DataManager instance with outliers removed.
- Return type:
Dict[str, Dict[str, Any]]
Notes
The function modifies the input DataManager in-place by selecting only the indices of samples that meet the threshold criteria. The original data structure is preserved but with reduced sample count.
Examples
>>> filtered_data = windowDropOutliers(data, threshold=100.0)
>>> # Removes all samples where target values >= 100.0
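The index selection behind this filter can be sketched without a DataManager. `keep_indices` is a hypothetical helper operating on a plain list of target values; whether a target exactly equal to the threshold is kept or dropped is an assumption here.

```python
import ast
from typing import List, Sequence, Union


def keep_indices(targets: Sequence[float], threshold: Union[float, str] = 101.0) -> List[int]:
    """Hypothetical helper: indices of samples whose target does not exceed the threshold."""
    if isinstance(threshold, str):
        # String thresholds, e.g. read from a configuration file, are parsed first.
        threshold = ast.literal_eval(threshold)
    # Assumption: a target equal to the threshold is kept.
    return [i for i, value in enumerate(targets) if value <= threshold]


kept = keep_indices([12.0, 250.0, 99.5, 101.0], threshold="101.0")
```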
- src.actions.separation.window_to_group(wdw: Tensor, feature_thresholds: List[Tuple[float, float]], *, logger=None, write_msg=<function f_logger.<locals>.__dummy_log>, **kwargs) -> Tuple[ndarray, ndarray, ndarray]
Separate windows into good, bad, and gray categories based on feature thresholds.
This function analyzes window data and categorizes each window based on multiple feature thresholds. Windows are classified as good (below good threshold), bad (above bad threshold), or gray/difficult (between thresholds).
- Parameters:
wdw (tf.Tensor) – Input tensor containing window data with shape (n_windows, window_size, n_features).
feature_thresholds (List[Tuple[float, float]]) – List of threshold tuples for each feature, where each tuple contains (good_threshold, bad_threshold).
**kwargs (dict) – Additional keyword arguments containing logger and write_msg from f_logger decorator.
- Returns:
A tuple containing three numpy arrays:
- good_indexes_intersection: Indices of windows classified as good
- bad_indexes_intersection: Indices of windows classified as bad
- gray_indexes_intersection: Indices of windows classified as gray/difficult
- Return type:
Tuple[np.ndarray, np.ndarray, np.ndarray]
Notes
For the first feature (column 0), the function applies sum reduction across the window dimension before threshold comparison. This is designed for disaggregated mode processing.
The function ensures that all windows are classified into exactly one category with no overlap between categories.
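The three-way classification can be illustrated on already-reduced data, with one value per window and feature. `group_windows` is a hypothetical helper; combining features so that a window is good only if every feature is below its good threshold, bad only if every feature is above its bad threshold, and gray otherwise is an assumption suggested by the "intersection" naming of the returned arrays. It also assumes good_threshold <= bad_threshold for each feature.

```python
from typing import List, Sequence, Tuple


def group_windows(
    features: Sequence[Sequence[float]],
    feature_thresholds: Sequence[Tuple[float, float]],
) -> Tuple[List[int], List[int], List[int]]:
    """Hypothetical helper working on one reduced value per window and feature."""
    good, bad, gray = [], [], []
    for idx, row in enumerate(features):
        pairs = list(zip(row, feature_thresholds))
        if all(value < good_thr for value, (good_thr, _) in pairs):
            good.append(idx)
        elif all(value > bad_thr for value, (_, bad_thr) in pairs):
            bad.append(idx)
        else:
            # Everything that is neither clearly good nor clearly bad.
            gray.append(idx)
    return good, bad, gray


features = [[0.0, 1.0], [9.0, 8.0], [0.0, 8.0], [3.0, 3.0]]
good, bad, gray = group_windows(features, [(1.0, 5.0), (2.0, 6.0)])
```

Because each window goes through a single if/elif/else chain, every index lands in exactly one of the three lists.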
- src.actions.separation.window_to_group_mvno(wdw: Tensor, feature_thresholds: List[Tuple[float, float]], *, logger=None, write_msg=<function f_logger.<locals>.__dummy_log>, **kwargs) -> Tuple[ndarray, ndarray, ndarray]
MVNO-specific window separation into good, bad, and gray categories.
This function is a specialized version of window_to_group designed for MVNO (Mobile Virtual Network Operator) data processing. It handles both 1D and multi-dimensional window data.
- Parameters:
wdw (tf.Tensor) – Input tensor containing window data. Can be 1D or multi-dimensional.
feature_thresholds (List[Tuple[float, float]]) – List of threshold tuples for each feature, where each tuple contains (good_threshold, bad_threshold).
**kwargs (dict) – Additional keyword arguments containing logger and write_msg from f_logger decorator.
- Returns:
A tuple containing three numpy arrays:
- good_indexes_intersection: Indices of windows classified as good
- bad_indexes_intersection: Indices of windows classified as bad
- gray_indexes_intersection: Indices of windows classified as gray/difficult
- Return type:
Tuple[np.ndarray, np.ndarray, np.ndarray]
Notes
This function automatically detects if the input is 1D and adjusts the column processing accordingly. It maintains the same separation logic as window_to_group but is optimized for MVNO-specific data structures.
- src.actions.separation.windowsSeparation(wdw: Tensor, trg: Tensor, pdr_threshold: Tuple[float, float] | str | None = None, bdr_threshold: Tuple[float, float] | str | None = None, avg_ipt_threshold: Tuple[float, float] | str | None = None, std_ipt_threshold: Tuple[float, float] | str | None = None, skw_ipt_threshold: Tuple[float, float] | str | None = None, kur_ipt_threshold: Tuple[float, float] | str | None = None, difficult_pdr_threshold: float | str | None = None, difficult_bdr_threshold: float | str | None = None, difficult_avg_ipt_threshold: float | str | None = None, difficult_std_ipt_threshold: float | str | None = None, difficult_skw_ipt_threshold: float | str | None = None, difficult_kur_ipt_threshold: float | str | None = None, logger: LogHandler | None = None) -> Tuple[List[Tensor], List[Tensor], List[Tensor], List[Tensor]]
Separate windows into three groups using multiple network feature thresholds.
This function performs comprehensive window separation based on multiple network performance features including packet delivery ratios, bit delivery ratios, and inter-packet time statistics. It categorizes windows as good, bad, or difficult based on feature-specific threshold ranges.
- Parameters:
wdw (tf.Tensor) – Input tensor containing window data with shape (n_windows, window_size, n_features). Features are expected in order: PDR, BDR, avg_IPT, std_IPT, skw_IPT, kur_IPT.
trg (tf.Tensor) – Target tensor containing ground truth values for each window.
pdr_threshold (Optional[Union[Tuple[float, float], str]], optional) – Packet Delivery Ratio thresholds as (good_threshold, bad_threshold).
bdr_threshold (Optional[Union[Tuple[float, float], str]], optional) – Bit Delivery Ratio thresholds as (good_threshold, bad_threshold).
avg_ipt_threshold (Optional[Union[Tuple[float, float], str]], optional) – Average Inter-Packet Time thresholds as (good_threshold, bad_threshold).
std_ipt_threshold (Optional[Union[Tuple[float, float], str]], optional) – Standard deviation of Inter-Packet Time thresholds as (good_threshold, bad_threshold).
skw_ipt_threshold (Optional[Union[Tuple[float, float], str]], optional) – Skewness of Inter-Packet Time thresholds as (good_threshold, bad_threshold).
kur_ipt_threshold (Optional[Union[Tuple[float, float], str]], optional) – Kurtosis of Inter-Packet Time thresholds as (good_threshold, bad_threshold).
difficult_pdr_threshold (Optional[Union[float, str]], optional) – PDR threshold for labeling difficult samples.
difficult_bdr_threshold (Optional[Union[float, str]], optional) – BDR threshold for labeling difficult samples.
difficult_avg_ipt_threshold (Optional[Union[float, str]], optional) – Average IPT threshold for labeling difficult samples.
difficult_std_ipt_threshold (Optional[Union[float, str]], optional) – Standard deviation IPT threshold for labeling difficult samples.
difficult_skw_ipt_threshold (Optional[Union[float, str]], optional) – Skewness IPT threshold for labeling difficult samples.
difficult_kur_ipt_threshold (Optional[Union[float, str]], optional) – Kurtosis IPT threshold for labeling difficult samples.
logger (Optional[LH], optional) – Logger instance for logging separation statistics and debug information.
- Returns:
A tuple containing four lists:
- windows: [good_windows, gray_windows, bad_windows]
- targets: [good_targets, gray_targets, bad_targets]
- classes: [good_classes, gray_classes, bad_classes] (0, 2, 1 respectively)
- expectations: [good_expectations, gray_expectations, bad_expectations]
- Return type:
Tuple[List[tf.Tensor], List[tf.Tensor], List[tf.Tensor], List[tf.Tensor]]
- Raises:
Exception – If any required threshold is None, or if window/target tensors are None.
Notes
The function uses feature_eval to parse string thresholds and ensures all required thresholds are provided. Difficult samples use flag_windows to determine their expected labels based on the difficult_* thresholds.
String thresholds are parsed using ast.literal_eval for flexibility.
Examples
>>> windows, targets, classes, expectations = windowsSeparation(
...     wdw, trg,
...     pdr_threshold=(0.8, 0.95),
...     bdr_threshold=(0.7, 0.9),
...     # ... other thresholds
...     logger=my_logger
... )
- src.actions.separation.windowsSeparation_mvno(properties: Dict[str, Dict[str, Tensor]] = {}, anomalous_threshold: Dict[str, Tuple[float, float]] | str | None = None, *, logger=None, write_msg=<function f_logger.<locals>.__dummy_log>, **kwargs) -> Tuple[Dict[str, Dict[str, Tensor]], ...]
MVNO-specific window separation into positive, negative, and difficult groups.
This function separates MVNO (Mobile Virtual Network Operator) network data windows into three categories based on anomalous behavior thresholds. It’s specifically designed for network anomaly detection scenarios.
- Parameters:
properties (Dict[str, Dict[str, tf.Tensor]], default={}) – Dictionary containing window and target data organized by property names. Must contain ‘Windows’ and ‘Targets’ keys with TensorFlow tensors.
anomalous_threshold (Optional[Union[Dict[str, Tuple[float, float]], str]], optional) – Dictionary or string representation of anomalous behavior thresholds. Should contain an ‘anomalous’ key mapping to (good_threshold, bad_threshold).
**kwargs (dict) – Additional keyword arguments containing logger and write_msg from decorators.
- Returns:
A tuple containing three property dictionaries:
- goods_properties: Properties for normal/good samples (class 0, expected label 0)
- bads_properties: Properties for anomalous/bad samples (class 1, expected label 1)
- grays_properties: Properties for difficult/uncertain samples (class 2, mixed labels)
- Return type:
Tuple[Dict[str, Dict[str, tf.Tensor]], …]
- Raises:
Exception – If window or target tensors are None in the properties dictionary.
Notes
Expected labels for good and bad samples are fixed (0 and 1 respectively), while difficult samples get labels based on whether their target values are positive (1) or non-positive (0).
The function adds ‘Classes’ and ‘ExpectedLabel’ properties to each output group and logs detailed statistics about the separation results.
Examples
>>> goods, bads, grays = windowsSeparation_mvno(
...     properties=network_data,
...     anomalous_threshold={'anomalous': (0.1, 0.8)}
... )
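The class and expected-label assignment described in the Notes can be sketched on plain lists. `label_group` is a hypothetical helper and the returned dictionary is a simplification; in the module these values are stored as 'Classes' and 'ExpectedLabel' properties holding tensors.

```python
from typing import Dict, List, Sequence


def label_group(targets: Sequence[float], group: str) -> Dict[str, List[int]]:
    """Hypothetical helper: build 'Classes' and 'ExpectedLabel' for one group."""
    class_id = {"good": 0, "bad": 1, "gray": 2}[group]
    if group == "gray":
        # Difficult samples: positive target -> 1, non-positive target -> 0.
        expected = [1 if t > 0 else 0 for t in targets]
    else:
        # Good and bad samples carry a fixed label equal to their class.
        expected = [class_id] * len(targets)
    return {"Classes": [class_id] * len(targets), "ExpectedLabel": expected}


grays = label_group([0.0, 0.4, -1.0], "gray")
bads = label_group([0.9, 0.95], "bad")
```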
- src.actions.separation.windowsSeparation_vmaf(properties: Dict[str, Dict[str, Tensor]] = {}, vmaf_threshold: float | str | None = None, thresholds: Dict[str, Tuple[float, float]] | str | None = None, *, logger=None, write_msg=<function f_logger.<locals>.__dummy_log>, **kwargs) -> Tuple[Dict[str, Dict[str, Tensor]], ...]
Separate windows into three groups using VMAF-based quality thresholds.
This function separates video quality assessment windows into good, bad, and difficult categories based on VMAF (Video Multimethod Assessment Fusion) thresholds and feature-specific criteria.
- Parameters:
properties (Dict[str, Dict[str, tf.Tensor]], default={}) – Dictionary containing window and target data organized by property names.
vmaf_threshold (Optional[Union[float, str]], optional) – VMAF quality threshold used to determine expected labels for samples. Higher VMAF scores indicate better video quality.
thresholds (Optional[Union[Dict[str, Tuple[float, float]], str]], optional) – Dictionary or string representation of feature thresholds, where each key maps to a tuple of (good_threshold, bad_threshold) values.
**kwargs (dict) – Additional keyword arguments containing logger and write_msg from decorators.
- Returns:
A tuple containing three property dictionaries:
- goods_properties: Properties for samples classified as good (class 0)
- bads_properties: Properties for samples classified as bad (class 1)
- grays_properties: Properties for samples classified as difficult (class 2)
- Return type:
Tuple[Dict[str, Dict[str, tf.Tensor]], …]
- Raises:
Exception – If thresholds parameter is None, or if window/target tensors are None.
Notes
Expected labels are determined by comparing targets to the VMAF threshold:
- Label 0: Target > vmaf_threshold (good quality)
- Label 1: Target <= vmaf_threshold (poor quality)
The function ensures feature threshold count matches window feature dimensions and logs detailed statistics about the separation process.
Examples
>>> goods, bads, grays = windowsSeparation_vmaf(
...     properties=data_props,
...     vmaf_threshold=30.0,
...     thresholds={'feature1': (10, 50), 'feature2': (5, 25)}
... )
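The expected-label rule from the Notes (label 0 above the VMAF threshold, label 1 otherwise) reduces to a single comparison per target. `vmaf_expected_labels` is a hypothetical helper on plain lists; parsing a string threshold with ast.literal_eval is an assumption carried over from the other functions in this module.

```python
import ast
from typing import List, Sequence, Union


def vmaf_expected_labels(targets: Sequence[float],
                         vmaf_threshold: Union[float, str]) -> List[int]:
    """Hypothetical helper: 0 for targets above the threshold, 1 otherwise."""
    if isinstance(vmaf_threshold, str):
        vmaf_threshold = ast.literal_eval(vmaf_threshold)
    # A target exactly equal to the threshold counts as poor quality (label 1).
    return [0 if t > vmaf_threshold else 1 for t in targets]


labels = vmaf_expected_labels([95.0, 30.0, 12.5], vmaf_threshold="30.0")
```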