src.actions.kaggleDst module
Kaggle Dataset Actions Module
This module provides specialized actions for processing and manipulating Kaggle datasets within the SNN2 neural network framework. It contains utilities for:
- Data separation and label extraction from Kaggle datasets
- Dataset shuffling and train/validation/test splitting
- Triplet generation for metric learning tasks
- Categorical target preparation for classification
- Post-processing operations on tensor data
The module is specifically designed to handle common Kaggle dataset formats and workflows, providing seamless integration with the SNN2 data processing pipeline.
Functions
- apply_post_operations
Apply a sequence of post-processing operations to tensor data.
- kaggleDataVsLabelSeparation
Extract and separate data and labels from Kaggle DataFrames.
- kaggleShuffle
Randomly shuffle dataset samples for training preparation.
- kaggleTrnValTestSeparation
Split Kaggle datasets into training, validation, and test sets.
- generateKaggleTriplets
Generate triplet datasets (anchor, positive, negative) for metric learning.
- generateKaggleCategorical
Prepare categorical targets and datasets for classification tasks.
Notes
All functions in this module use the @action and @f_logger decorators for consistent logging and action tracking within the SNN2 framework. The module handles pandas DataFrames, TensorFlow tensors, and DataManager objects for comprehensive data processing.
The module includes specific handling for web service classification tasks commonly found in Kaggle competitions, with built-in data cleaning and preprocessing steps.
Examples
Basic Kaggle dataset processing workflow:
>>> # Load and process Kaggle DataFrame
>>> processed_data = kaggleDataVsLabelSeparation(df, requests=data_requests)
>>>
>>> # Shuffle and split the data
>>> kaggleShuffle(processed_data)
>>> train, val, test = kaggleTrnValTestSeparation(processed_data, trn_portion=0.7)
>>>
>>> # Generate triplets for metric learning
>>> triplet_data = generateKaggleTriplets(train, sample_name="Features")
See also
SNN2.src.core.data.DataManager
DataManager class for data handling
SNN2.src.decorators.decorators
Action and logging decorators
SNN2.src.actions.separation
General data separation utilities
- src.actions.kaggleDst.apply_post_operations(request: Dict[str, Any], tf_values: tf.Tensor, *, logger=None, write_msg=<function f_logger.<locals>.__dummy_log>, **kwargs) → tf.Tensor
Apply a sequence of post-processing operations to tensor data.
This function sequentially applies a list of operations to a TensorFlow tensor, using the operation functions, arguments, and keyword arguments specified in the request dictionary. Each operation is applied in order with its corresponding arguments.
- Parameters:
request (Dict[str, Any]) – Dictionary containing the post-processing configuration, with keys:
  - 'post_operation': list of callable functions to apply
  - 'post_operation_args': list of argument tuples, one per operation
  - 'post_operation_kwargs': list of keyword-argument dicts, one per operation
tf_values (tf.Tensor) – Input tensor to which operations will be applied sequentially.
**kwargs (dict) – Additional keyword arguments containing logger and write_msg from the f_logger decorator.
- Returns:
Transformed tensor after applying all specified post-processing operations.
- Return type:
tf.Tensor
Notes
The function iterates through the operation lists in parallel, applying each operation with its corresponding arguments and keyword arguments. The operations are applied in the order they appear in the lists.
The function logs each operation being applied using the qualified name of the operation function for debugging and tracking purposes.
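The loop can be pictured with a minimal sketch (an assumed reconstruction, not the actual SNN2 source; the print call stands in for the decorator-provided write_msg logging):

from typing import Any, Dict
import tensorflow as tf

def apply_post_operations_sketch(request: Dict[str, Any],
                                 tf_values: tf.Tensor) -> tf.Tensor:
    # Walk the three configuration lists in lockstep and apply each
    # operation, in order, to the running tensor.
    for op, args, op_kwargs in zip(request['post_operation'],
                                   request['post_operation_args'],
                                   request['post_operation_kwargs']):
        print(f"Applying {op.__qualname__}")  # stand-in for write_msg
        tf_values = op(tf_values, *args, **op_kwargs)
    return tf_values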
Examples
>>> request = {
...     'post_operation': [tf.nn.relu, tf.nn.l2_normalize],
...     'post_operation_args': [(), (1,)],
...     'post_operation_kwargs': [{}, {'epsilon': 1e-12}]
... }
>>> processed_tensor = apply_post_operations(request, input_tensor)
- src.actions.kaggleDst.generateKaggleCategorical(anchor: DataManager, sample_name: str = 'Samples', target_name: str = 'Targets', dst_name: str = 'SamplesDst', num_classes: int = 141, *, logger=None, write_msg=<function f_logger.<locals>.__dummy_log>, **kwargs) → DataManager
Prepare categorical targets and datasets for classification tasks.
This function converts integer target labels to one-hot categorical encoding and prepares the data for classification training by creating appropriate TensorFlow datasets with samples and categorical targets.
- Parameters:
anchor (DataManager) – DataManager instance containing the dataset to be prepared for classification.
sample_name (str, default="Samples") – Key name for sample data in the DataManager.
target_name (str, default="Targets") – Key name for target/label data in the DataManager.
dst_name (str, default="SamplesDst") – Key name for the output dataset in the DataManager.
num_classes (int, default=141) – Number of classes for categorical encoding. Should match the maximum class index + 1 in the target data.
**kwargs (dict) – Additional keyword arguments containing logger and write_msg from decorators.
- Returns:
Updated DataManager instance with categorical classification data stored under the dst_name key, containing:
  - a copy of the sample data
  - TfDataset: TensorFlow dataset yielding (samples, categorical_targets) pairs
  - CategoricalTargets: one-hot encoded target vectors
- Return type:
DataManager
Notes
The function uses keras.utils.to_categorical to convert integer labels to one-hot encoded vectors. The num_classes parameter should be set to accommodate all possible class indices in the dataset.
The resulting dataset is suitable for training classification models with categorical crossentropy loss functions.
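The conversion itself can be seen in isolation with plain Keras (a standalone sketch; the sample tensor and its shape are placeholders):

import tensorflow as tf
from tensorflow import keras

# Integer labels in [0, 140] become one-hot vectors of length 141.
targets = tf.constant([0, 3, 140])
one_hot = keras.utils.to_categorical(targets.numpy(), num_classes=141)

# Pair samples with the categorical targets, as the resulting TfDataset does.
samples = tf.random.normal((3, 16))  # placeholder sample tensor
dataset = tf.data.Dataset.from_tensor_slices((samples, one_hot))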
Examples
>>> # Prepare data for 141-class classification
>>> categorical_data = generateKaggleCategorical(
...     training_data,
...     sample_name="Features",
...     target_name="Labels",
...     dst_name="ClassificationDst",
...     num_classes=141
... )
>>>
>>> # Access categorical dataset
>>> dataset = categorical_data['ClassificationDst']['TfDataset']
>>> for samples, targets in dataset.take(1):
...     print(f"Sample shape: {samples.shape}, Target shape: {targets.shape}")
- src.actions.kaggleDst.generateKaggleTriplets(anchor: DataManager, sample_name: str = 'Samples', target_name: str = 'Targets', *, logger=None, write_msg=<function f_logger.<locals>.__dummy_log>, **kwargs) → DataManager
Generate triplet datasets (anchor, positive, negative) for metric learning.
This function creates triplet datasets suitable for metric learning by generating positive and negative sample pairs for each anchor sample based on target labels. For each unique target class, positive samples are from the same class and negative samples are from different classes.
- Parameters:
anchor (DataManager) – DataManager instance containing the base dataset from which triplets are generated.
sample_name (str, default="Samples") – Key name for sample data in the DataManager.
target_name (str, default="Targets") – Key name for target/label data in the DataManager.
**kwargs (dict) – Additional keyword arguments containing logger and write_msg from decorators.
- Returns:
Updated DataManager instance with triplet data in the 'TripletDst' key, containing:
  - a stacked tensor of [positive, anchor, negative] samples
  - TfDataset: TensorFlow dataset yielding (positive, anchor, negative) tuples
- Return type:
DataManager
Notes
The function handles class imbalance by repeating samples as needed:
  - For each target class, positive/negative indices are repeated to match the dataset size
  - Uses tf.repeat with calculated repetition counts to ensure sufficient samples
  - Applies random shuffling to the positive/negative indices for variety
The resulting triplets maintain two properties:
  - Anchor and positive samples share the same target label
  - Anchor and negative samples have different target labels
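Under those constraints, the per-class index construction reads roughly as follows (an illustrative sketch with placeholder tensors, not the SNN2 source):

import tensorflow as tf

def cover(idx, m):
    # Repeat an index pool until it spans m entries, then shuffle and trim.
    reps = tf.cast(tf.math.ceil(m / tf.shape(idx)[0]), tf.int32)
    return tf.random.shuffle(tf.repeat(idx, reps))[:m]

samples = tf.random.normal((8, 4))               # placeholder features
targets = tf.constant([0, 0, 1, 1, 1, 2, 2, 2])  # placeholder labels

cls = 0  # build triplets for one anchor class
anc_idx = tf.where(targets == cls)[:, 0]
m = tf.shape(anc_idx)[0]

anchor = tf.gather(samples, anc_idx)
positive = tf.gather(samples, cover(anc_idx, m))                         # same class
negative = tf.gather(samples, cover(tf.where(targets != cls)[:, 0], m))  # other classes

triplets = tf.stack([positive, anchor, negative])  # matches the stacking order above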
Examples
>>> # Generate triplets for metric learning
>>> triplet_data = generateKaggleTriplets(
...     training_data,
...     sample_name="Features",
...     target_name="Labels"
... )
>>>
>>> # Access triplet dataset
>>> triplet_dataset = triplet_data['TripletDst']['TfDataset']
>>> for pos, anc, neg in triplet_dataset.take(1):
...     print(f"Shapes: pos={pos.shape}, anc={anc.shape}, neg={neg.shape}")
- src.actions.kaggleDst.kaggleDataVsLabelSeparation(df: pd.DataFrame, requests: DataManager | None = None, *, logger=None, write_msg=<function f_logger.<locals>.__dummy_log>, **kwargs) → DataManager
Extract and separate data and labels from Kaggle DataFrames with preprocessing.
This function processes Kaggle DataFrames by performing data cleaning operations (removing low-frequency classes, duplicates) and extracting specified columns into tensors according to the provided requests configuration.
- Parameters:
df (pd.DataFrame) – Input Kaggle DataFrame containing the dataset to be processed. Expected to have a ‘web_service’ column for classification tasks.
requests (Union[DataManager, None], optional) – DataManager instance containing column extraction requests. Each request should specify ‘columns’, ‘dtype’, and optional ‘post_operation’.
**kwargs (dict) – Additional keyword arguments containing logger and write_msg from decorators.
- Returns:
Updated DataManager instance with extracted and processed tensor data. Each request key contains the corresponding processed tensor data.
- Return type:
DataManager
- Raises:
Exception – If the requests parameter is None.
Notes
The function performs several preprocessing steps:
1. Removes web services with fewer than 100 occurrences
2. Removes duplicate rows
3. Logs data quality statistics (missing values, duplicates)
4. Extracts the specified columns and converts them to tensors
5. Applies post-processing operations if specified
6. Concatenates multiple tensor batches if needed
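Steps 1 and 2 can be pictured with a short pandas sketch (assumed behavior, reconstructed from the notes above rather than taken from the actual source):

import pandas as pd

def clean_kaggle_df(df: pd.DataFrame) -> pd.DataFrame:
    # Step 1: drop web services that occur fewer than 100 times.
    counts = df['web_service'].value_counts()
    df = df[df['web_service'].isin(counts[counts >= 100].index)]
    # Step 2: drop duplicate rows.
    return df.drop_duplicates()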
Examples
>>> requests = DataManager()
>>> requests['features'] = {
...     'columns': ['feature1', 'feature2'],
...     'dtype': tf.float32,
...     'post_operation': None
... }
>>> processed_data = kaggleDataVsLabelSeparation(kaggle_df, requests=requests)
- src.actions.kaggleDst.kaggleShuffle(data: DataManager, *, logger=None, write_msg=<function f_logger.<locals>.__dummy_log>, **kwargs) → None
Randomly shuffle dataset samples for training preparation.
This function performs in-place shuffling of all samples in the DataManager by generating random indices and reordering the data accordingly. This is commonly used before training to ensure random sample ordering.
- Parameters:
data (DataManager) – DataManager instance containing the dataset to be shuffled. Must have a ‘Samples’ key with tensor data.
**kwargs (dict) – Additional keyword arguments containing logger and write_msg from decorators.
- Returns:
Nothing; the input DataManager is modified in place.
- Return type:
None
Notes
The function uses tf.random.shuffle to generate random indices based on the number of samples in the ‘Samples’ tensor, then applies these indices to reorder all data in the DataManager using sub_select with inplace=True.
This ensures that all related data (samples, targets, etc.) are shuffled consistently while maintaining their relationships.
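The consistent-index idea can be demonstrated on bare tensors (a standalone sketch; the real function routes the permutation through DataManager.sub_select as described above):

import tensorflow as tf

samples = tf.random.normal((5, 3))       # placeholder samples
targets = tf.constant([0, 1, 2, 3, 4])   # placeholder targets

# One random permutation reorders every related tensor consistently.
indices = tf.random.shuffle(tf.range(tf.shape(samples)[0]))
samples = tf.gather(samples, indices)
targets = tf.gather(targets, indices)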
Examples
>>> # Shuffle training data before model training
>>> kaggleShuffle(training_data)
>>> # Data is now randomly ordered
- src.actions.kaggleDst.kaggleTrnValTestSeparation(data: DataManager, trn_portion: float = 0.6, val_portion: float = 0.2, *, logger=None, write_msg=<function f_logger.<locals>.__dummy_log>, **kwargs) → Tuple[DataManager, DataManager, DataManager]
Split Kaggle datasets into training, validation, and test sets.
This function divides a DataManager instance into three separate datasets for training, validation, and testing based on the specified proportions. The remaining portion (1 - trn_portion - val_portion) is used for testing.
- Parameters:
data (DataManager) – DataManager instance containing the complete dataset to be split. Must have a ‘Samples’ key with tensor data.
trn_portion (float, default=0.6) – Fraction of data to use for training (0.0 to 1.0).
val_portion (float, default=0.2) – Fraction of data to use for validation (0.0 to 1.0).
**kwargs (dict) – Additional keyword arguments containing logger and write_msg from decorators.
- Returns:
A tuple containing three DataManager instances:
  - trn_data: training dataset
  - val_data: validation dataset
  - tst_data: test dataset
- Return type:
Tuple[DataManager, DataManager, DataManager]
Notes
The function uses sequential index slicing (not random) to create the splits:
  - Training: indices [0 : train_n]
  - Validation: indices [train_n : train_n + validation_n]
  - Test: indices [train_n + validation_n : end]
For random splits, use kaggleShuffle before calling this function. The test portion is automatically calculated as the remainder.
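The slicing arithmetic, shown on a plain tensor (a sketch; the actual function slices a DataManager rather than a bare tensor):

import tensorflow as tf

samples = tf.random.normal((100, 8))  # placeholder dataset
trn_portion, val_portion = 0.6, 0.2

n = samples.shape[0]
train_n = int(n * trn_portion)
validation_n = int(n * val_portion)

trn = samples[:train_n]                         # first 60 rows
val = samples[train_n:train_n + validation_n]   # next 20 rows
tst = samples[train_n + validation_n:]          # remaining 20 rows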
Examples
>>> # Split with custom proportions (70% train, 20% val, 10% test)
>>> train, val, test = kaggleTrnValTestSeparation(
...     data, trn_portion=0.7, val_portion=0.2
... )
>>>
>>> # For random splits, shuffle first
>>> kaggleShuffle(data)
>>> train, val, test = kaggleTrnValTestSeparation(data)