cellmil.datamodels.datasets.utils

Functions

`apply_permutation`(features)	Randomly permute the order of instances (rows).
`cell_type_name_to_index`(cell_type_names)	Convert cell type names to their corresponding indices.
`cell_types_to_tensor`(cell_types, cell_indices)	Convert cell types dictionary to a tensor.
`centroids_to_tensor`(centroids, cell_indices)	Convert centroids dictionary to a tensor.
`column_sanity_check`(data, label)	Perform sanity checks on the input data.
`compute_normalization`(features)
`correlation_filter`(features, ...[, plot])
`extract_slide_name`(file_path)	Extract slide name from full path.
`filter_cells_by_roi`(centroids, roi_df)	Filter cells to keep only those within ROI boundaries.
`filter_split`(data, split)	Filter the DataFrame by the specified split.
`get_cell_detection_path`(folder, slide_name, ...)	Get the path to the cell detection file for the given slide.
`get_cell_features`(folder, slide_name, extractor)	Get the features for a specific slide using the specified extractor.
`get_cell_types`(folder, slide_name, ...)
`get_centroids`(folder, slide_name, ...)	Get centroids for cells from the segmentation data.
`get_feature_path`(folder, slide_name, extractor)	Get the path to the feature file for the given slide.
`load_precomputed_graph`(folder, slide_name, ...)	Load pre-computed graph from disk.
`load_roi_for_slide`(slide_name, roi_folder, ...)	Load ROI data for a specific slide.
`merge_graph_with_features`(graph_data, ...)	Merge pre-computed graph structure with extracted features, ensuring proper alignment.
`preprocess_row`(row, label, folder, extractor)	Process a single slide row to extract slide name and validate features.
`subsample_and_pad`(features, target_size)	Randomly subsample or pad the bag to a fixed target size by replicating rows.
`validate_features`(folder, slide_name, extractor)	Check if the feature file(s) exist and contain valid (non-empty) features.
`weights_for_sampler`(labels)	Compute weights for WeightedRandomSampler to handle class imbalance.
`wsl_preprocess`(data)	Preprocess the data to ensure paths are correctly formatted.

cellmil.datamodels.datasets.utils.column_sanity_check(data: pandas.core.frame.DataFrame | None, label: Union[str, Tuple[str, str]]) → None

Perform sanity checks on the input data.

cellmil.datamodels.datasets.utils.preprocess_row(row: Series, label: Union[str, Tuple[str, str]], folder: Path, extractor: Union[ExtractorType, List[ExtractorType]], graph_creator: Optional[GraphCreatorType] = None, segmentation_model: Optional[ModelType] = None, do_validate_features: bool = True) → Union[Tuple[str, Union[int, Tuple[float, int]]], Tuple[None, ...]]

Process a single slide row to extract slide name and validate features.

Parameters:

row – A pandas Series representing a row from the Excel file
label – Either a single string (classification) or tuple of (duration, event) strings (survival)

Returns:

For classification: Tuple of (slide_name, label)
For survival: Tuple of (slide_name, (duration, event))
On error: Tuple of (None, None)

cellmil.datamodels.datasets.utils.validate_features(folder: Path, slide_name: str, extractor: Union[ExtractorType, List[ExtractorType]], graph_creator: Optional[GraphCreatorType] = None, segmentation_model: Optional[ModelType] = None)

Check if the feature file(s) exist and contain valid (non-empty) features.

cellmil.datamodels.datasets.utils.get_feature_path(folder: Path, slide_name: str, extractor: ExtractorType, graph_creator: Optional[GraphCreatorType] = None, segmentation_model: Optional[ModelType] = None) → Path

Get the path to the feature file for the given slide.

cellmil.datamodels.datasets.utils.filter_split(data: DataFrame, split: str) → DataFrame

Filter the DataFrame by the specified split.

cellmil.datamodels.datasets.utils.apply_permutation(features: Tensor) → Tensor

Randomly permute the order of instances (rows).

Parameters:

features – Input feature tensor of shape (n_instances, n_features)

Returns:

Feature tensor with rows permuted
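The row permutation described above can be sketched without any tensor library. This is a minimal illustration on plain lists, not the cellmil implementation; the name `permute_rows` and the `seed` argument are hypothetical. With a torch tensor, the usual idiom would be indexing with `torch.randperm(features.shape[0])`.

```python
import random
from typing import List, Optional, Sequence


def permute_rows(rows: Sequence[Sequence[float]],
                 seed: Optional[int] = None) -> List[Sequence[float]]:
    """Return the rows in a random order; row contents are left unchanged."""
    rng = random.Random(seed)          # local RNG so the global state is untouched
    order = list(range(len(rows)))
    rng.shuffle(order)                 # random permutation of the row indices
    return [rows[i] for i in order]


bag = [[0.0, 0.1], [1.0, 1.1], [2.0, 2.1]]
shuffled = permute_rows(bag, seed=0)
```

Only the order of instances changes, so any order-invariant aggregation over the bag gives the same result before and after.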

cellmil.datamodels.datasets.utils.subsample_and_pad(features: Tensor, target_size: int) → Tensor

Randomly subsample or pad the bag to a fixed target size by replicating rows.

Parameters:

features – Input feature tensor of shape (n_instances, n_features)
target_size – Desired number of instances per bag

Returns:

Tensor of shape (target_size, n_features)
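A minimal sketch of the subsample-or-pad idea, using plain lists instead of tensors. It assumes that subsampling is done without replacement and that padding replicates randomly chosen existing rows; the actual cellmil function may choose rows differently. `subsample_or_pad` and `seed` are hypothetical names.

```python
import random
from typing import List, Optional, Sequence


def subsample_or_pad(rows: Sequence[Sequence[float]],
                     target_size: int,
                     seed: Optional[int] = None) -> List[Sequence[float]]:
    """Return exactly target_size rows drawn from the input bag."""
    n = len(rows)
    if n == 0:
        raise ValueError("cannot subsample or pad an empty bag")
    rng = random.Random(seed)
    if n >= target_size:
        # Bag too large: keep a random subset, without replacement.
        idx = rng.sample(range(n), target_size)
    else:
        # Bag too small: keep every row, then replicate random rows (assumption).
        idx = list(range(n)) + [rng.randrange(n) for _ in range(target_size - n)]
    return [rows[i] for i in idx]


bag = [[0.0, 0.1], [1.0, 1.1], [2.0, 2.1]]
smaller = subsample_or_pad(bag, 2, seed=0)
larger = subsample_or_pad(bag, 5, seed=0)
```

Fixing the bag size this way lets bags of different slides be stacked into one batch tensor.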

cellmil.datamodels.datasets.utils.wsl_preprocess(data: DataFrame) → DataFrame

Preprocess the data to ensure paths are correctly formatted.

cellmil.datamodels.datasets.utils.extract_slide_name(file_path: Path) → str

Extract slide name from full path.

cellmil.datamodels.datasets.utils.get_cell_detection_path(folder: Path, slide_name: str, segmentation_model: ModelType) → Path

Get the path to the cell detection file for the given slide.

cellmil.datamodels.datasets.utils.get_cell_types(folder: Path, slide_name: str, segmentation_model: ModelType) → Optional[Dict[int, int]]

cellmil.datamodels.datasets.utils.get_centroids(folder: Path, slide_name: str, segmentation_model: ModelType) → Optional[Dict[int, Tuple[float, float]]]

Get centroids for cells from the segmentation data.

Parameters:

folder – Path to the dataset folder
slide_name – Name of the slide
segmentation_model – Segmentation model used

Returns:

Dictionary mapping cell_id to (x, y) centroid coordinates, or None if data not found

cellmil.datamodels.datasets.utils.centroids_to_tensor(centroids: Dict[int, Tuple[float, float]], cell_indices: Dict[int, int]) → Tensor

Convert centroids dictionary to a tensor.

Parameters:

centroids – Dictionary mapping cell_id to (x, y) centroid coordinates
cell_indices – Dictionary mapping cell_id to its index in the tensor

Returns:

A tensor containing the centroid coordinates [num_cells, 2]
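The mapping from the two dictionaries to a [num_cells, 2] array can be sketched as follows. This is an illustration on nested lists, assuming `cell_indices` holds contiguous indices 0..num_cells-1 and that every indexed cell has a centroid; how the real function handles missing cells is not documented here. `centroids_to_rows` is a hypothetical name.

```python
from typing import Dict, List, Tuple


def centroids_to_rows(centroids: Dict[int, Tuple[float, float]],
                      cell_indices: Dict[int, int]) -> List[List[float]]:
    """Place each cell's (x, y) at the row given by cell_indices[cell_id]."""
    rows = [[0.0, 0.0] for _ in range(len(cell_indices))]
    for cell_id, idx in cell_indices.items():
        x, y = centroids[cell_id]      # KeyError if a centroid is missing (assumption)
        rows[idx] = [float(x), float(y)]
    return rows


centroids = {10: (1.0, 2.0), 20: (3.0, 4.0), 30: (5.0, 6.0)}
cell_indices = {20: 0, 10: 1}          # cell 30 is not part of the feature tensor
coords = centroids_to_rows(centroids, cell_indices)
```

Row order follows `cell_indices`, not the iteration order of `centroids`, so row i of the coordinates lines up with row i of the feature tensor.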

cellmil.datamodels.datasets.utils.cell_types_to_tensor(cell_types: Dict[int, int], cell_indices: Dict[int, int]) → Tensor

Convert cell types dictionary to a tensor.

Parameters:

cell_types – Dictionary mapping cell_id to cell_type
cell_indices – Dictionary mapping cell_id to its index in the tensor

Returns:

A tensor containing the cell types

cellmil.datamodels.datasets.utils.get_cell_features(folder: Path, slide_name: str, extractor: Union[ExtractorType, List[ExtractorType]], graph_creator: Optional[GraphCreatorType] = None, segmentation_model: Optional[ModelType] = None) → Tuple[torch.Tensor | None, Optional[Dict[int, int]], Optional[List[str]]]

Get the features for a specific slide using the specified extractor.

Parameters:

folder – Path to the dataset folder
slide_name – Name of the slide
extractor – Feature extractor type or list of types to use for feature extraction.
graph_creator – Optional graph creator type, needed for some extractors
segmentation_model – Optional Segmentation model type, needed for some extractors

Returns:

A tensor containing the extracted features, or None if extraction failed.

cellmil.datamodels.datasets.utils.compute_normalization(features: Tensor) → Tuple[Tensor, Tensor]

cellmil.datamodels.datasets.utils.correlation_filter(features: Tensor, correlation_threshold: float, plot: bool = True)

cellmil.datamodels.datasets.utils.weights_for_sampler(labels: list[int]) → Tensor

Compute weights for WeightedRandomSampler to handle class imbalance.

The weight for each sample is computed as 1 / (class_frequency * num_samples_in_class). This gives higher weights to samples from underrepresented classes.

Returns:

Weights for each sample in the dataset, with shape (len(dataset),). These weights can be used directly with torch.utils.data.WeightedRandomSampler.

Return type:

torch.Tensor
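To illustrate why such weights counter class imbalance, here is a sketch of the simplest variant, plain inverse class count. This is a swapped-in simplification, not the cellmil formula: the library's exact normalization, as stated above, may differ. `inverse_count_weights` is a hypothetical name.

```python
from collections import Counter
from typing import List


def inverse_count_weights(labels: List[int]) -> List[float]:
    """Give each sample the weight 1 / (number of samples in its class)."""
    counts = Counter(labels)                   # class label -> sample count
    return [1.0 / counts[y] for y in labels]   # rarer class -> larger weight


labels = [0, 0, 0, 1]
weights = inverse_count_weights(labels)
```

With these weights every class has the same total weight, so a sampler that draws in proportion to them picks each class about equally often. A per-sample weight list of this kind is what `torch.utils.data.WeightedRandomSampler` takes as its `weights` argument.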

cellmil.datamodels.datasets.utils.load_precomputed_graph(folder: Path, slide_name: str, graph_creator: GraphCreatorType, segmentation_model: ModelType) → Data

Load pre-computed graph from disk.

Parameters:

folder – Base folder containing slide data
slide_name – Name of the slide
graph_creator – Graph creator type used (string or enum)
segmentation_model – Segmentation model used (string or enum)

Returns:

Data object containing the loaded graph

Raises:

ValueError – If graph file doesn’t exist or has invalid format

cellmil.datamodels.datasets.utils.merge_graph_with_features(graph_data: Data, features: Tensor, cell_indices: Dict[int, int], cell_coordinates: Tensor) → Data

Merge pre-computed graph structure with extracted features, ensuring proper alignment.

This function ensures features are correctly assigned to their corresponding graph nodes based on cell IDs, creating a proper subgraph with aligned features.

Parameters:

graph_data – Data object containing graph structure
features – Feature tensor
cell_indices – Mapping from cell_id to feature tensor index
cell_coordinates – Optional cell coordinates tensor [num_cells, 2]

Returns:

Data object with properly aligned features and graph structure

cellmil.datamodels.datasets.utils.cell_type_name_to_index(cell_type_names: List[str]) → List[int]

Convert cell type names to their corresponding indices.

Parameters:

cell_type_names – List of cell type names (case-insensitive)

Returns:

List of cell type indices (1-based, as used in TYPE_NUCLEI_DICT)

Raises:

ValueError – If any cell type name is invalid
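The case-insensitive lookup with a 1-based result can be sketched as below. `EXAMPLE_TYPES` is a hypothetical stand-in for TYPE_NUCLEI_DICT, whose real contents are defined in cellmil and are not reproduced here; `names_to_indices` is likewise an illustrative name.

```python
from typing import Dict, List

# Hypothetical stand-in for TYPE_NUCLEI_DICT (1-based index -> type name).
EXAMPLE_TYPES: Dict[int, str] = {1: "Neoplastic", 2: "Inflammatory", 3: "Connective"}


def names_to_indices(names: List[str],
                     type_dict: Dict[int, str] = EXAMPLE_TYPES) -> List[int]:
    """Map cell type names to their indices, ignoring case."""
    lookup = {name.lower(): idx for idx, name in type_dict.items()}
    result = []
    for name in names:
        key = name.strip().lower()
        if key not in lookup:
            raise ValueError(
                f"Unknown cell type {name!r}; valid names: {sorted(lookup)}"
            )
        result.append(lookup[key])
    return result


indices = names_to_indices(["neoplastic", "CONNECTIVE"])
```

Raising on the first unknown name, rather than skipping it, avoids silently filtering on fewer cell types than the caller asked for.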

cellmil.datamodels.datasets.utils.load_roi_for_slide(slide_name: str, roi_folder: Path, metadata: DataFrame) → Optional[DataFrame]

Load ROI data for a specific slide.

Parameters:

slide_name – Name of the slide (DIG_PAT_XXXXXXXX format)
roi_folder – Path to directory containing ROI CSV files
metadata – DataFrame containing ‘ID’, ‘I3LUNG_ID’, and ‘CENTER’ columns

Returns:

DataFrame with ROI coordinates or None if not found

cellmil.datamodels.datasets.utils.filter_cells_by_roi(centroids: Dict[int, Tuple[float, float]], roi_df: DataFrame) → set[int]

Filter cells to keep only those within ROI boundaries.

Parameters:

centroids – Dictionary mapping cell_id to (x, y) centroid coordinates
roi_df – DataFrame with ROI coordinates (columns: roi_name, label, x_base, y_base)

Returns:

Set of cell IDs that are within the ROI boundaries
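One way to test whether a centroid lies within an ROI is a ray-casting point-in-polygon check. The sketch below assumes each ROI is a polygon whose vertices come, in order, from the (x_base, y_base) rows sharing a roi_name; that reading of roi_df is an assumption, and the ROIs are passed here as a plain dictionary rather than a DataFrame. `point_in_polygon` and `cells_in_rois` are hypothetical names, not cellmil functions.

```python
from typing import Dict, List, Set, Tuple

Point = Tuple[float, float]


def point_in_polygon(point: Point, polygon: List[Point]) -> bool:
    """Ray casting: count the edges crossed by a ray going right from the point."""
    x, y = point
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]          # wrap around to close the polygon
        if (y1 > y) != (y2 > y):               # edge straddles the horizontal ray
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside            # odd number of crossings = inside
    return inside


def cells_in_rois(centroids: Dict[int, Point],
                  rois: Dict[str, List[Point]]) -> Set[int]:
    """Return the IDs of cells whose centroid falls inside at least one ROI."""
    return {
        cell_id
        for cell_id, pt in centroids.items()
        if any(point_in_polygon(pt, poly) for poly in rois.values())
    }


rois = {"roi_1": [(0.0, 0.0), (10.0, 0.0), (10.0, 10.0), (0.0, 10.0)]}
centroids = {1: (5.0, 5.0), 2: (15.0, 5.0), 3: (1.0, 9.0)}
kept = cells_in_rois(centroids, rois)
```

Returning a set of cell IDs, as the documented function does, makes it cheap to intersect the result with the keys of `cell_indices` before building tensors.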