cellmil.datamodels.datasets.utils

Functions

apply_permutation(features)

Randomly permute the order of instances (rows).

cell_type_name_to_index(cell_type_names)

Convert cell type names to their corresponding indices.

cell_types_to_tensor(cell_types, cell_indices)

Convert cell types dictionary to a tensor.

centroids_to_tensor(centroids, cell_indices)

Convert centroids dictionary to a tensor.

column_sanity_check(data, label)

Perform sanity checks on the input data.

compute_normalization(features)

correlation_filter(features, ...[, plot])

extract_slide_name(file_path)

Extract slide name from full path.

filter_cells_by_roi(centroids, roi_df)

Filter cells to keep only those within ROI boundaries.

filter_split(data, split)

Filter the DataFrame by the specified split.

get_cell_detection_path(folder, slide_name, ...)

Get the path to the cell detection file for the given slide.

get_cell_features(folder, slide_name, extractor)

Get the features for a specific slide using the specified extractor.

get_cell_types(folder, slide_name, ...)

get_centroids(folder, slide_name, ...)

Get centroids for cells from the segmentation data.

get_feature_path(folder, slide_name, extractor)

Get the path to the feature file for the given slide.

load_precomputed_graph(folder, slide_name, ...)

Load pre-computed graph from disk.

load_roi_for_slide(slide_name, roi_folder, ...)

Load ROI data for a specific slide.

merge_graph_with_features(graph_data, ...)

Merge pre-computed graph structure with extracted features, ensuring proper alignment.

preprocess_row(row, label, folder, extractor)

Process a single slide row to extract slide name and validate features.

subsample_and_pad(features, target_size)

Randomly subsample or pad the bag to a fixed target size by replicating rows.

validate_features(folder, slide_name, extractor)

Check if the feature file(s) exist and contain valid (non-empty) features.

weights_for_sampler(labels)

Compute weights for WeightedRandomSampler to handle class imbalance.

wsl_preprocess(data)

Preprocess the data to ensure paths are correctly formatted.

cellmil.datamodels.datasets.utils.column_sanity_check(data: pandas.core.frame.DataFrame | None, label: Union[str, Tuple[str, str]]) None[source]

Perform sanity checks on the input data.

cellmil.datamodels.datasets.utils.preprocess_row(row: Series, label: Union[str, Tuple[str, str]], folder: Path, extractor: Union[ExtractorType, List[ExtractorType]], graph_creator: Optional[GraphCreatorType] = None, segmentation_model: Optional[ModelType] = None, do_validate_features: bool = True) Union[Tuple[str, Union[int, Tuple[float, int]]], Tuple[None, ...]][source]

Process a single slide row to extract slide name and validate features.

Parameters:
  • row – A pandas Series representing a row from the Excel file

  • label – Either a single string (classification) or tuple of (duration, event) strings (survival)

Returns:
  • For classification – Tuple of (slide_name, label)

  • For survival – Tuple of (slide_name, (duration, event))

  • On error – Tuple of (None, None)

cellmil.datamodels.datasets.utils.validate_features(folder: Path, slide_name: str, extractor: Union[ExtractorType, List[ExtractorType]], graph_creator: Optional[GraphCreatorType] = None, segmentation_model: Optional[ModelType] = None)[source]

Check if the feature file(s) exist and contain valid (non-empty) features.

cellmil.datamodels.datasets.utils.get_feature_path(folder: Path, slide_name: str, extractor: ExtractorType, graph_creator: Optional[GraphCreatorType] = None, segmentation_model: Optional[ModelType] = None) Path[source]

Get the path to the feature file for the given slide.

cellmil.datamodels.datasets.utils.filter_split(data: DataFrame, split: str) DataFrame[source]

Filter the DataFrame by the specified split.

cellmil.datamodels.datasets.utils.apply_permutation(features: Tensor) Tensor[source]

Randomly permute the order of instances (rows).

Parameters:

features – Input feature tensor of shape (n_instances, n_features)

Returns:

Feature tensor with rows permuted
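
As a minimal sketch of the idea, the snippet below shuffles the rows of a bag while leaving each row's contents intact. It uses plain Python lists as a stand-in for the tensor so that it runs without dependencies; the name `permute_rows` and the `seed` argument are hypothetical and are not part of this module. With PyTorch, indexing the tensor with `torch.randperm(n)` would be the usual equivalent.

```python
import random


def permute_rows(features, seed=None):
    """Return a copy of `features` with its rows in random order."""
    rng = random.Random(seed)
    order = list(range(len(features)))
    rng.shuffle(order)  # random permutation of the row indices
    return [features[i] for i in order]


bag = [[0.0, 1.0], [2.0, 3.0], [4.0, 5.0]]
shuffled = permute_rows(bag, seed=0)
```

Only the order of instances changes, which is why this kind of augmentation suits bag-level models that should not depend on instance order.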

cellmil.datamodels.datasets.utils.subsample_and_pad(features: Tensor, target_size: int) Tensor[source]

Randomly subsample or pad the bag to a fixed target size by replicating rows.

Parameters:
  • features – Input feature tensor of shape (n_instances, n_features)

  • target_size – Desired number of instances per bag

Returns:

Tensor of shape (target_size, n_features)
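
The behaviour described above could be sketched as follows. This is an illustration under stated assumptions, not the module's implementation: plain lists stand in for the tensor, the name `subsample_or_pad` and the `seed` argument are hypothetical, large bags are assumed to be subsampled without replacement, and small bags are assumed to keep every original row before replicating randomly chosen ones.

```python
import random


def subsample_or_pad(features, target_size, seed=None):
    """Return exactly `target_size` rows drawn from `features`."""
    rng = random.Random(seed)
    n = len(features)
    if n == 0:
        raise ValueError("cannot pad an empty bag")
    if n >= target_size:
        # Bag is large enough: pick distinct rows at random.
        idx = rng.sample(range(n), target_size)
    else:
        # Bag is too small: keep all rows, then replicate random rows.
        extra = [rng.randrange(n) for _ in range(target_size - n)]
        idx = list(range(n)) + extra
    return [features[i] for i in idx]


small_bag = [[1.0], [2.0]]
big_bag = [[float(i)] for i in range(10)]
padded = subsample_or_pad(small_bag, 5, seed=0)
subsampled = subsample_or_pad(big_bag, 3, seed=0)
```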

cellmil.datamodels.datasets.utils.wsl_preprocess(data: DataFrame) DataFrame[source]

Preprocess the data to ensure paths are correctly formatted.

cellmil.datamodels.datasets.utils.extract_slide_name(file_path: Path) str[source]

Extract slide name from full path.

cellmil.datamodels.datasets.utils.get_cell_detection_path(folder: Path, slide_name: str, segmentation_model: ModelType) Path[source]

Get the path to the cell detection file for the given slide.

cellmil.datamodels.datasets.utils.get_cell_types(folder: Path, slide_name: str, segmentation_model: ModelType) Optional[Dict[int, int]][source]

cellmil.datamodels.datasets.utils.get_centroids(folder: Path, slide_name: str, segmentation_model: ModelType) Optional[Dict[int, Tuple[float, float]]][source]

Get centroids for cells from the segmentation data.

Parameters:
  • folder – Path to the dataset folder

  • slide_name – Name of the slide

  • segmentation_model – Segmentation model used

Returns:

Dictionary mapping cell_id to (x, y) centroid coordinates, or None if data not found

cellmil.datamodels.datasets.utils.centroids_to_tensor(centroids: Dict[int, Tuple[float, float]], cell_indices: Dict[int, int]) Tensor[source]

Convert centroids dictionary to a tensor.

Parameters:
  • centroids – Dictionary mapping cell_id to (x, y) centroid coordinates

  • cell_indices – Dictionary mapping cell_id to its index in the tensor

Returns:

A tensor containing the centroid coordinates [num_cells, 2]
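
To make the role of the two dictionaries concrete, here is a small sketch that places each centroid at the row given by `cell_indices`. It returns a nested list of shape [num_cells, 2] rather than a tensor, and it assumes that the indices are contiguous from 0 and that every cell in `cell_indices` has a centroid. The name `centroids_to_rows` is hypothetical.

```python
def centroids_to_rows(centroids, cell_indices):
    """Build a [num_cells, 2] nested list ordered by tensor index."""
    rows = [[0.0, 0.0] for _ in range(len(cell_indices))]
    for cell_id, idx in cell_indices.items():
        x, y = centroids[cell_id]
        rows[idx] = [float(x), float(y)]  # row position comes from cell_indices
    return rows


centroids = {101: (12.5, 40.0), 205: (3.0, 7.5), 309: (88.0, 1.0)}
cell_indices = {205: 0, 309: 1, 101: 2}
coords = centroids_to_rows(centroids, cell_indices)
```

The row order follows `cell_indices`, not the iteration order of `centroids`, so the coordinates stay aligned with a feature tensor built from the same mapping.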

cellmil.datamodels.datasets.utils.cell_types_to_tensor(cell_types: Dict[int, int], cell_indices: Dict[int, int]) Tensor[source]

Convert cell types dictionary to a tensor.

Parameters:
  • cell_types – Dictionary mapping cell_id to cell_type

  • cell_indices – Dictionary mapping cell_id to its index in the tensor

Returns:

A tensor containing the cell types

cellmil.datamodels.datasets.utils.get_cell_features(folder: Path, slide_name: str, extractor: Union[ExtractorType, List[ExtractorType]], graph_creator: Optional[GraphCreatorType] = None, segmentation_model: Optional[ModelType] = None) Tuple[torch.Tensor | None, Optional[Dict[int, int]], Optional[List[str]]][source]

Get the features for a specific slide using the specified extractor.

Parameters:
  • folder – Path to the dataset folder

  • slide_name – Name of the slide

  • extractor – Feature extractor type or list of types to use for feature extraction.

  • graph_creator – Optional graph creator type, needed for some extractors

  • segmentation_model – Optional Segmentation model type, needed for some extractors

Returns:

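A tuple whose first element is a tensor containing the extracted features, or None if extraction failed. Per the signature, the remaining elements are an optional Dict[int, int] and an optional List[str].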

cellmil.datamodels.datasets.utils.compute_normalization(features: Tensor) Tuple[Tensor, Tensor][source]

cellmil.datamodels.datasets.utils.correlation_filter(features: Tensor, correlation_threshold: float, plot: bool = True)[source]

cellmil.datamodels.datasets.utils.weights_for_sampler(labels: list[int]) Tensor[source]

Compute weights for WeightedRandomSampler to handle class imbalance.

The weight for each sample is computed as 1 / (class_frequency * num_samples_in_class). This gives higher weights to samples from underrepresented classes.

Returns:

Weights for each sample in the dataset, with shape (len(labels),).

These weights can be used directly with torch.utils.data.WeightedRandomSampler.

Return type:

torch.Tensor
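
The weight formula above can be worked through on a toy label list. This sketch reads class_frequency as the fraction of samples belonging to the class, which is an assumption; the helper name `sample_weights` is hypothetical and a plain list is returned instead of a tensor.

```python
from collections import Counter


def sample_weights(labels):
    """Per-sample weight: 1 / (class_frequency * num_samples_in_class)."""
    n = len(labels)
    counts = Counter(labels)
    # class_frequency is taken to be counts[y] / n (assumed interpretation).
    return [1.0 / ((counts[y] / n) * counts[y]) for y in labels]


labels = [0, 0, 0, 1]
weights = sample_weights(labels)
# Class 0: 1 / (0.75 * 3); class 1: 1 / (0.25 * 1)
```

With PyTorch, such weights would typically be passed as `WeightedRandomSampler(weights, num_samples=len(weights), replacement=True)`, so that the single class-1 sample is drawn far more often than any individual class-0 sample.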

cellmil.datamodels.datasets.utils.load_precomputed_graph(folder: Path, slide_name: str, graph_creator: GraphCreatorType, segmentation_model: ModelType) Data[source]

Load pre-computed graph from disk.

Parameters:
  • folder – Base folder containing slide data

  • slide_name – Name of the slide

  • graph_creator – Graph creator type used (string or enum)

  • segmentation_model – Segmentation model used (string or enum)

Returns:

Data object containing the loaded graph

Raises:

ValueError – If graph file doesn’t exist or has invalid format

cellmil.datamodels.datasets.utils.merge_graph_with_features(graph_data: Data, features: Tensor, cell_indices: Dict[int, int], cell_coordinates: Tensor) Data[source]

Merge pre-computed graph structure with extracted features, ensuring proper alignment.

This function ensures features are correctly assigned to their corresponding graph nodes based on cell IDs, creating a proper subgraph with aligned features.

Parameters:
  • graph_data – Data object containing graph structure

  • features – Feature tensor

  • cell_indices – Mapping from cell_id to feature tensor index

  • cell_coordinates – Optional cell coordinates tensor [num_cells, 2]

Returns:

Data object with properly aligned features and graph structure
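
The alignment step can be illustrated with plain Python structures. Everything about the data layout here is an assumption made for the example: the graph is represented by a list of cell IDs (one per node) and a list of edges between node positions, whereas the real function works on a Data object whose internal layout is not described here. The name `align_graph` is hypothetical.

```python
def align_graph(node_cell_ids, edges, features, cell_indices):
    """Keep nodes that have features and gather their rows in node order."""
    # Positions of graph nodes whose cell ID has a feature row.
    kept = [pos for pos, cid in enumerate(node_cell_ids) if cid in cell_indices]
    new_pos = {old: new for new, old in enumerate(kept)}
    # Feature rows reordered to follow the graph's node order.
    x = [features[cell_indices[node_cell_ids[old]]] for old in kept]
    # Keep only edges whose two endpoints survive, with renumbered positions.
    sub_edges = [
        (new_pos[u], new_pos[v])
        for u, v in edges
        if u in new_pos and v in new_pos
    ]
    return x, sub_edges


node_cell_ids = [10, 11, 12]            # cell ID stored at each graph node
edges = [(0, 1), (1, 2), (0, 2)]        # edges between node positions
features = [[0.5], [0.7]]               # rows indexed via cell_indices
cell_indices = {12: 0, 10: 1}           # cell 11 has no features
x, sub_edges = align_graph(node_cell_ids, edges, features, cell_indices)
```

Cell 11 has no feature row, so its node and both of its edges are dropped, and the remaining edge is renumbered to refer to the two surviving nodes.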

cellmil.datamodels.datasets.utils.cell_type_name_to_index(cell_type_names: List[str]) List[int][source]

Convert cell type names to their corresponding indices.

Parameters:

cell_type_names – List of cell type names (case-insensitive)

Returns:

List of cell type indices (1-based, as used in TYPE_NUCLEI_DICT)

Raises:

ValueError – If any cell type name is invalid
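
A case-insensitive lookup of this kind might look like the sketch below. The dictionary `EXAMPLE_TYPES` is an illustrative stand-in, since the actual contents of TYPE_NUCLEI_DICT are not shown here, and the name `names_to_indices` is hypothetical.

```python
# Illustrative 1-based mapping; not the real TYPE_NUCLEI_DICT.
EXAMPLE_TYPES = {1: "Neoplastic", 2: "Inflammatory", 3: "Connective"}


def names_to_indices(names, type_dict=EXAMPLE_TYPES):
    """Map cell type names to indices, ignoring case."""
    lookup = {name.lower(): idx for idx, name in type_dict.items()}
    unknown = [n for n in names if n.lower() not in lookup]
    if unknown:
        raise ValueError(f"Invalid cell type name(s): {unknown}")
    return [lookup[n.lower()] for n in names]


indices = names_to_indices(["inflammatory", "NEOPLASTIC"])
```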

cellmil.datamodels.datasets.utils.load_roi_for_slide(slide_name: str, roi_folder: Path, metadata: DataFrame) Optional[DataFrame][source]

Load ROI data for a specific slide.

Parameters:
  • slide_name – Name of the slide (DIG_PAT_XXXXXXXX format)

  • roi_folder – Path to directory containing ROI CSV files

  • metadata – DataFrame containing ‘ID’, ‘I3LUNG_ID’, and ‘CENTER’ columns

Returns:

DataFrame with ROI coordinates or None if not found

cellmil.datamodels.datasets.utils.filter_cells_by_roi(centroids: Dict[int, Tuple[float, float]], roi_df: DataFrame) set[int][source]

Filter cells to keep only those within ROI boundaries.

Parameters:
  • centroids – Dictionary mapping cell_id to (x, y) centroid coordinates

  • roi_df – DataFrame with ROI coordinates (columns: roi_name, label, x_base, y_base)

Returns:

Set of cell IDs that are within the ROI boundaries
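
One way such a filter could work is a point-in-polygon test per ROI. The sketch below assumes that the rows sharing a roi_name list the vertices of one polygon in order via x_base and y_base; that reading of the columns is an assumption, as is ignoring the label column. A list of dictionaries stands in for the DataFrame, and the names `point_in_polygon` and `cells_in_rois` are hypothetical.

```python
from collections import defaultdict


def point_in_polygon(x, y, polygon):
    """Ray-casting test: count edge crossings to the right of (x, y)."""
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        if (y1 > y) != (y2 > y):  # edge straddles the horizontal ray
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside


def cells_in_rois(centroids, roi_rows):
    """Return the IDs of cells whose centroid lies inside any ROI polygon."""
    polygons = defaultdict(list)
    for row in roi_rows:
        polygons[row["roi_name"]].append((row["x_base"], row["y_base"]))
    return {
        cell_id
        for cell_id, (x, y) in centroids.items()
        if any(point_in_polygon(x, y, poly) for poly in polygons.values())
    }


roi_rows = [
    {"roi_name": "roi_1", "label": "tumor", "x_base": 0, "y_base": 0},
    {"roi_name": "roi_1", "label": "tumor", "x_base": 10, "y_base": 0},
    {"roi_name": "roi_1", "label": "tumor", "x_base": 10, "y_base": 10},
    {"roi_name": "roi_1", "label": "tumor", "x_base": 0, "y_base": 10},
]
centroids = {1: (5.0, 5.0), 2: (15.0, 5.0), 3: (9.5, 0.5)}
kept_ids = cells_in_rois(centroids, roi_rows)
```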