cellmil.datamodels.datasets.utils
Functions
| apply_permutation | Randomly permute the order of instances (rows). |
| cell_type_name_to_index | Convert cell type names to their corresponding indices. |
| cell_types_to_tensor | Convert cell types dictionary to a tensor. |
| centroids_to_tensor | Convert centroids dictionary to a tensor. |
| column_sanity_check | Perform sanity checks on the input data. |
| compute_normalization | |
| correlation_filter | |
| extract_slide_name | Extract slide name from full path. |
| filter_cells_by_roi | Filter cells to keep only those within ROI boundaries. |
| filter_split | Filter the DataFrame by the specified split. |
| get_cell_detection_path | Get the path to the cell detection file for the given slide. |
| get_cell_features | Get the features for a specific slide using the specified extractor. |
| get_cell_types | |
| get_centroids | Get centroids for cells from the segmentation data. |
| get_feature_path | Get the path to the feature file for the given slide. |
| load_precomputed_graph | Load pre-computed graph from disk. |
| load_roi_for_slide | Load ROI data for a specific slide. |
| merge_graph_with_features | Merge pre-computed graph structure with extracted features, ensuring proper alignment. |
| preprocess_row | Process a single slide row to extract slide name and validate features. |
| subsample_and_pad | Randomly subsample or pad the bag to a fixed target size by replicating rows. |
| validate_features | Check if the feature file(s) exist and contain valid (non-empty) features. |
| weights_for_sampler | Compute weights for WeightedRandomSampler to handle class imbalance. |
| wsl_preprocess | Preprocess the data to ensure paths are correctly formatted. |
- cellmil.datamodels.datasets.utils.column_sanity_check(data: pandas.core.frame.DataFrame | None, label: Union[str, Tuple[str, str]]) → None
Perform sanity checks on the input data.
- cellmil.datamodels.datasets.utils.preprocess_row(row: Series, label: Union[str, Tuple[str, str]], folder: Path, extractor: Union[ExtractorType, List[ExtractorType]], graph_creator: Optional[GraphCreatorType] = None, segmentation_model: Optional[ModelType] = None, do_validate_features: bool = True) → Union[Tuple[str, Union[int, Tuple[float, int]]], Tuple[None, ...]]
Process a single slide row to extract slide name and validate features.
- Parameters:
row – A pandas Series representing a row from the Excel file
label – Either a single string (classification) or tuple of (duration, event) strings (survival)
- Returns:
For classification: Tuple of (slide_name, label). For survival: Tuple of (slide_name, (duration, event)). On error: Tuple of (None, None).
- cellmil.datamodels.datasets.utils.validate_features(folder: Path, slide_name: str, extractor: Union[ExtractorType, List[ExtractorType]], graph_creator: Optional[GraphCreatorType] = None, segmentation_model: Optional[ModelType] = None)
Check if the feature file(s) exist and contain valid (non-empty) features.
- cellmil.datamodels.datasets.utils.get_feature_path(folder: Path, slide_name: str, extractor: ExtractorType, graph_creator: Optional[GraphCreatorType] = None, segmentation_model: Optional[ModelType] = None) → Path
Get the path to the feature file for the given slide.
- cellmil.datamodels.datasets.utils.filter_split(data: DataFrame, split: str) → DataFrame
Filter the DataFrame by the specified split.
- cellmil.datamodels.datasets.utils.apply_permutation(features: Tensor) → Tensor
Randomly permute the order of instances (rows).
- Parameters:
features – Input feature tensor of shape (n_instances, n_features)
- Returns:
Feature tensor with rows permuted
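A row permutation of this kind can be sketched with `torch.randperm`. This is a minimal illustration, not the library's implementation; the name `permute_rows` is hypothetical.

```python
import torch


def permute_rows(features: torch.Tensor) -> torch.Tensor:
    """Return `features` with its rows (instances) in random order."""
    # torch.randperm(n) yields a random permutation of the integers 0..n-1.
    perm = torch.randperm(features.shape[0])
    # Indexing with the permutation reorders rows; columns are untouched.
    return features[perm]


bag = torch.arange(12, dtype=torch.float32).reshape(4, 3)
shuffled = permute_rows(bag)
```

Because only the row order changes, the bag keeps the same shape and the same set of instances.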
- cellmil.datamodels.datasets.utils.subsample_and_pad(features: Tensor, target_size: int) → Tensor
Randomly subsample or pad the bag to a fixed target size by replicating rows.
- Parameters:
features – Input feature tensor of shape (n_instances, n_features)
target_size – Desired number of instances per bag
- Returns:
Tensor of shape (target_size, n_features)
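One way to bring a bag to a fixed size is sketched below. It assumes that large bags are subsampled without replacement and that small bags keep all original rows and are padded with randomly replicated ones; the library may choose rows differently. The name `subsample_or_pad` is hypothetical.

```python
import torch


def subsample_or_pad(features: torch.Tensor, target_size: int) -> torch.Tensor:
    """Return exactly `target_size` rows drawn from `features`."""
    n = features.shape[0]
    if n == 0:
        raise ValueError("cannot subsample or pad an empty bag")
    if n >= target_size:
        # Enough instances: pick target_size distinct rows at random.
        idx = torch.randperm(n)[:target_size]
    else:
        # Too few instances: keep every row, then replicate random rows.
        extra = torch.randint(0, n, (target_size - n,))
        idx = torch.cat([torch.arange(n), extra])
    return features[idx]
```

Padding by replication rather than with zeros keeps every row a real instance, so downstream pooling does not need a mask for artificial rows.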
- cellmil.datamodels.datasets.utils.wsl_preprocess(data: DataFrame) → DataFrame
Preprocess the data to ensure paths are correctly formatted.
- cellmil.datamodels.datasets.utils.extract_slide_name(file_path: Path) → str
Extract slide name from full path.
- cellmil.datamodels.datasets.utils.get_cell_detection_path(folder: Path, slide_name: str, segmentation_model: ModelType) → Path
Get the path to the cell detection file for the given slide.
- cellmil.datamodels.datasets.utils.get_cell_types(folder: Path, slide_name: str, segmentation_model: ModelType) → Optional[Dict[int, int]]
- cellmil.datamodels.datasets.utils.get_centroids(folder: Path, slide_name: str, segmentation_model: ModelType) → Optional[Dict[int, Tuple[float, float]]]
Get centroids for cells from the segmentation data.
- Parameters:
folder – Path to the dataset folder
slide_name – Name of the slide
segmentation_model – Segmentation model used
- Returns:
Dictionary mapping cell_id to (x, y) centroid coordinates, or None if data not found
- cellmil.datamodels.datasets.utils.centroids_to_tensor(centroids: Dict[int, Tuple[float, float]], cell_indices: Dict[int, int]) → Tensor
Convert centroids dictionary to a tensor.
- Parameters:
centroids – Dictionary mapping cell_id to (x, y) centroid coordinates
cell_indices – Dictionary mapping cell_id to its index in the tensor
- Returns:
A tensor containing the centroid coordinates [num_cells, 2]
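The dictionary-to-tensor conversion can be sketched as follows. It assumes that `cell_indices` maps each cell ID to a contiguous row index starting at 0, and that cells without a centroid are left as zeros; both are assumptions, and the name `centroids_as_tensor` is hypothetical.

```python
from typing import Dict, Tuple

import torch


def centroids_as_tensor(
    centroids: Dict[int, Tuple[float, float]],
    cell_indices: Dict[int, int],
) -> torch.Tensor:
    """Place each cell's (x, y) centroid in the row given by cell_indices."""
    out = torch.zeros((len(cell_indices), 2), dtype=torch.float32)
    for cell_id, row in cell_indices.items():
        if cell_id in centroids:  # assumption: missing centroids stay at zero
            x, y = centroids[cell_id]
            out[row, 0] = x
            out[row, 1] = y
    return out


coords = centroids_as_tensor(
    {10: (1.0, 2.0), 11: (3.0, 4.0)},
    {11: 0, 10: 1},
)
```

Using `cell_indices` for row placement keeps the coordinates aligned with a feature tensor that was built from the same mapping.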
- cellmil.datamodels.datasets.utils.cell_types_to_tensor(cell_types: Dict[int, int], cell_indices: Dict[int, int]) → Tensor
Convert cell types dictionary to a tensor.
- Parameters:
cell_types – Dictionary mapping cell_id to cell_type
cell_indices – Dictionary mapping cell_id to its index in the tensor
- Returns:
A tensor containing the cell types
- cellmil.datamodels.datasets.utils.get_cell_features(folder: Path, slide_name: str, extractor: Union[ExtractorType, List[ExtractorType]], graph_creator: Optional[GraphCreatorType] = None, segmentation_model: Optional[ModelType] = None) → Tuple[torch.Tensor | None, Optional[Dict[int, int]], Optional[List[str]]]
Get the features for a specific slide using the specified extractor.
- Parameters:
folder – Path to the dataset folder
slide_name – Name of the slide
extractor – Feature extractor type or list of types to use for feature extraction.
graph_creator – Optional graph creator type, needed for some extractors
segmentation_model – Optional Segmentation model type, needed for some extractors
- Returns:
A tensor containing the extracted features, or None if extraction failed.
- cellmil.datamodels.datasets.utils.compute_normalization(features: Tensor) → Tuple[Tensor, Tensor]
- cellmil.datamodels.datasets.utils.correlation_filter(features: Tensor, correlation_threshold: float, plot: bool = True)
- cellmil.datamodels.datasets.utils.weights_for_sampler(labels: list[int]) → Tensor
Compute weights for WeightedRandomSampler to handle class imbalance.
The weight for each sample is computed as 1 / (class_frequency * num_samples_in_class). This gives higher weights to samples from underrepresented classes.
- Returns:
Weights for each sample in the dataset, with shape (len(labels),). These weights can be used directly with torch.utils.data.WeightedRandomSampler.
- Return type:
Tensor
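The idea of weighting samples against class imbalance can be illustrated with the common inverse-class-count scheme. This is a sketch under that assumption: the library's exact normalization may differ, and `inverse_class_weights` is a hypothetical name.

```python
from collections import Counter
from typing import List


def inverse_class_weights(labels: List[int]) -> List[float]:
    """Give each sample the weight 1 / (number of samples in its class)."""
    counts = Counter(labels)
    # Rare classes get larger per-sample weights, so every class
    # carries the same total weight.
    return [1.0 / counts[y] for y in labels]


labels = [0, 0, 0, 1]
weights = inverse_class_weights(labels)
```

A list like this can be passed as the `weights` argument of `torch.utils.data.WeightedRandomSampler`, together with `num_samples` and `replacement=True`, so that minority-class samples are drawn more often.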
- cellmil.datamodels.datasets.utils.load_precomputed_graph(folder: Path, slide_name: str, graph_creator: GraphCreatorType, segmentation_model: ModelType) → Data
Load pre-computed graph from disk.
- Parameters:
folder – Base folder containing slide data
slide_name – Name of the slide
graph_creator – Graph creator type used (string or enum)
segmentation_model – Segmentation model used (string or enum)
- Returns:
Data object containing the loaded graph
- Raises:
ValueError – If graph file doesn’t exist or has invalid format
- cellmil.datamodels.datasets.utils.merge_graph_with_features(graph_data: Data, features: Tensor, cell_indices: Dict[int, int], cell_coordinates: Tensor) → Data
Merge pre-computed graph structure with extracted features, ensuring proper alignment.
This function ensures features are correctly assigned to their corresponding graph nodes based on cell IDs, creating a proper subgraph with aligned features.
- Parameters:
graph_data – Data object containing graph structure
features – Feature tensor
cell_indices – Mapping from cell_id to feature tensor index
cell_coordinates – Optional cell coordinates tensor [num_cells, 2]
- Returns:
Data object with properly aligned features and graph structure
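The alignment step can be sketched with plain tensors. The sketch assumes that the graph records a cell ID for every node (here the hypothetical list `node_cell_ids`) and stores edges as a `[2, num_edges]` index tensor; how the real Data object stores these is not stated here. Nodes without features are dropped and the remaining edges are renumbered.

```python
from typing import Dict, List, Tuple

import torch


def align_graph_to_features(
    node_cell_ids: List[int],
    edge_index: torch.Tensor,
    features: torch.Tensor,
    cell_indices: Dict[int, int],
) -> Tuple[torch.Tensor, torch.Tensor]:
    """Return (node features, edge_index) for the subgraph of nodes with features."""
    # remap[old_node] = new_node, or -1 if the node has no feature row.
    remap = torch.full((len(node_cell_ids),), -1, dtype=torch.long)
    rows: List[int] = []
    for old_pos, cell_id in enumerate(node_cell_ids):
        if cell_id in cell_indices:
            remap[old_pos] = len(rows)
            rows.append(cell_indices[cell_id])
    # Gather feature rows in the new node order.
    x = features[torch.tensor(rows, dtype=torch.long)]
    # Keep only edges whose two endpoints both survived, then renumber them.
    src, dst = remap[edge_index[0]], remap[edge_index[1]]
    keep = (src >= 0) & (dst >= 0)
    return x, torch.stack([src[keep], dst[keep]])


x, new_edges = align_graph_to_features(
    node_cell_ids=[7, 8, 9],
    edge_index=torch.tensor([[0, 1, 2], [1, 2, 0]]),
    features=torch.tensor([[1.0], [2.0]]),
    cell_indices={9: 0, 7: 1},
)
```

In this example cell 8 has no features, so only the edge between cells 9 and 7 survives, renumbered to the new node positions.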
- cellmil.datamodels.datasets.utils.cell_type_name_to_index(cell_type_names: List[str]) → List[int]
Convert cell type names to their corresponding indices.
- Parameters:
cell_type_names – List of cell type names (case-insensitive)
- Returns:
List of cell type indices (1-based, as used in TYPE_NUCLEI_DICT)
- Raises:
ValueError – If any cell type name is invalid
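A case-insensitive lookup of this kind can be sketched as follows. The mapping `EXAMPLE_TYPE_DICT` is a hypothetical stand-in for TYPE_NUCLEI_DICT, whose actual contents are not listed here, and `names_to_indices` is likewise a hypothetical name.

```python
from typing import Dict, List

# Hypothetical 1-based mapping, used only for illustration.
EXAMPLE_TYPE_DICT: Dict[int, str] = {
    1: "Neoplastic",
    2: "Inflammatory",
    3: "Connective",
}


def names_to_indices(
    names: List[str],
    type_dict: Dict[int, str] = EXAMPLE_TYPE_DICT,
) -> List[int]:
    """Map cell type names to indices, ignoring case."""
    lookup = {name.lower(): idx for idx, name in type_dict.items()}
    indices = []
    for name in names:
        key = name.strip().lower()
        if key not in lookup:
            raise ValueError(
                f"Unknown cell type {name!r}; valid names: {sorted(lookup)}"
            )
        indices.append(lookup[key])
    return indices
```

Listing the valid names in the error message makes a misspelled cell type easy to diagnose.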
- cellmil.datamodels.datasets.utils.load_roi_for_slide(slide_name: str, roi_folder: Path, metadata: DataFrame) → Optional[DataFrame]
Load ROI data for a specific slide.
- Parameters:
slide_name – Name of the slide (DIG_PAT_XXXXXXXX format)
roi_folder – Path to directory containing ROI CSV files
metadata – DataFrame containing ‘ID’, ‘I3LUNG_ID’, and ‘CENTER’ columns
- Returns:
DataFrame with ROI coordinates or None if not found
- cellmil.datamodels.datasets.utils.filter_cells_by_roi(centroids: Dict[int, Tuple[float, float]], roi_df: DataFrame) → set[int]
Filter cells to keep only those within ROI boundaries.
- Parameters:
centroids – Dictionary mapping cell_id to (x, y) centroid coordinates
roi_df – DataFrame with ROI coordinates (columns: roi_name, label, x_base, y_base)
- Returns:
Set of cell IDs that are within the ROI boundaries
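As a rough illustration of ROI filtering, the sketch below keeps a cell when its centroid lies inside the axis-aligned bounding box of any ROI. It assumes that the rows sharing a `roi_name` are that ROI's boundary points in the same coordinate system as the centroids. The library may instead use an exact polygon test, and `cells_in_roi_boxes` is a hypothetical name.

```python
from typing import Dict, Set, Tuple

import pandas as pd


def cells_in_roi_boxes(
    centroids: Dict[int, Tuple[float, float]],
    roi_df: pd.DataFrame,
) -> Set[int]:
    """Return IDs of cells whose centroid falls in any ROI bounding box."""
    # One bounding box per ROI, taken from the extremes of its points.
    boxes = roi_df.groupby("roi_name").agg(
        x_min=("x_base", "min"),
        x_max=("x_base", "max"),
        y_min=("y_base", "min"),
        y_max=("y_base", "max"),
    )
    kept: Set[int] = set()
    for cell_id, (x, y) in centroids.items():
        inside = (
            (boxes["x_min"] <= x) & (x <= boxes["x_max"])
            & (boxes["y_min"] <= y) & (y <= boxes["y_max"])
        )
        if inside.any():
            kept.add(cell_id)
    return kept


roi_df = pd.DataFrame({
    "roi_name": ["roi_1"] * 4,
    "label": ["tumor"] * 4,
    "x_base": [0.0, 10.0, 10.0, 0.0],
    "y_base": [0.0, 0.0, 10.0, 10.0],
})
kept = cells_in_roi_boxes({1: (5.0, 5.0), 2: (20.0, 5.0)}, roi_df)
```

A bounding box over-includes cells for non-rectangular ROIs, so an exact point-in-polygon test would be the stricter choice when ROI shapes are irregular.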