cellmil.datamodels.datasets
- cellmil.datamodels.datasets.MILDataset(root: Path, label: str | tuple[str, str], folder: Path, data: DataFrame, extractor: cellmil.interfaces.FeatureExtractorConfig.ExtractorType | list[cellmil.interfaces.FeatureExtractorConfig.ExtractorType], split: Literal['train', 'val', 'test', 'all'] = 'all', **kwargs: Any) -> cellmil.datamodels.datasets.cell_mil_dataset.CellMILDataset | cellmil.datamodels.datasets.patch_mil_dataset.PatchMILDataset
- cellmil.datamodels.datasets.GNNMILDataset(root: Union[str, Path], folder: Union[str, Path], label: str | tuple[str, str], data: DataFrame, extractor: cellmil.interfaces.FeatureExtractorConfig.ExtractorType | list[cellmil.interfaces.FeatureExtractorConfig.ExtractorType], split: Literal['train', 'val', 'test', 'all'] = 'all', transform: Optional[Callable[[Data], Data]] = None, pre_transform: Optional[Callable[[Data], Data]] = None, pre_filter: Optional[Callable[[Data], bool]] = None, force_reload: bool = False, label_transforms: Optional[Any] = None, **kwargs: Any) -> cellmil.datamodels.datasets.cell_gnn_mil_dataset.CellGNNMILDataset | cellmil.datamodels.datasets.patch_gnn_mil_dataset.PatchGNNMILDataset
- class cellmil.datamodels.datasets.NeoplasticDataset(folder: Path, data: DataFrame, extractor: Union[ExtractorType, List[ExtractorType]], graph_creator: Optional[GraphCreatorType] = None, segmentation_model: Optional[ModelType] = None, split: Literal['train', 'val', 'test'] = 'train', max_workers: int = 8, correlation_filter: bool = False, correlation_threshold: float = 0.9, correlation_mask: Optional[Tensor] = None, normalize_feature: bool = False, normalization_params: Optional[Tuple[Tensor, Tensor]] = None)
Bases: Dataset[Tuple[Tensor, Tensor]]
A PyTorch Dataset for neoplastic cell classification tasks. This dataset treats each individual cell as a sample, creating a cell-level dataset where each item is a single cell with its features and binary neoplastic label.
The target is binary:
- Neoplastic cells (type 1): label = 1
- All other cell types (types 2, 3, 4, 5): label = 0
The dataset concatenates all cells from all slides in the split, hiding the slide-level structure from the model.
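As a rough illustration of how concatenating cells across slides can hide the slide-level structure, the sketch below maps a global cell index back to a slide and a local index using cumulative offsets. The slide names, cell counts, and the helper `locate_cell` are hypothetical and are not part of cellmil; the library's internal bookkeeping may differ.

```python
from bisect import bisect_right
from itertools import accumulate

# Hypothetical per-slide cell counts (illustrative values only).
cells_per_slide = {"slide_a": 3, "slide_b": 2, "slide_c": 4}

slide_names = list(cells_per_slide)
# Cumulative end offsets: slide_a covers [0, 3), slide_b [3, 5), slide_c [5, 9).
end_offsets = list(accumulate(cells_per_slide.values()))


def locate_cell(global_idx: int) -> tuple[str, int]:
    """Map a global cell index to (slide_name, index within that slide)."""
    if not 0 <= global_idx < end_offsets[-1]:
        raise IndexError(f"cell index out of range: {global_idx}")
    # The first slide whose end offset exceeds global_idx owns the cell.
    slide_pos = bisect_right(end_offsets, global_idx)
    start = end_offsets[slide_pos - 1] if slide_pos > 0 else 0
    return slide_names[slide_pos], global_idx - start
```

A model that consumes items by global index never sees the slide boundary; only the lookup above does.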
- __init__(folder: Path, data: DataFrame, extractor: Union[ExtractorType, List[ExtractorType]], graph_creator: Optional[GraphCreatorType] = None, segmentation_model: Optional[ModelType] = None, split: Literal['train', 'val', 'test'] = 'train', max_workers: int = 8, correlation_filter: bool = False, correlation_threshold: float = 0.9, correlation_mask: Optional[Tensor] = None, normalize_feature: bool = False, normalization_params: Optional[Tuple[Tensor, Tensor]] = None)
Initialize the dataset with data and cell-level neoplastic labels.
- Parameters:
folder – Path to the dataset folder
data – DataFrame containing metadata
extractor – Feature extractor type or list of types to use for feature extraction.
graph_creator – Optional graph creator type, needed for some extractors
segmentation_model – Segmentation model type, required for cell type information
split – Dataset split (train/val/test)
max_workers – Maximum number of threads for parallel processing
correlation_filter – Whether to apply correlation filtering to remove highly correlated features
correlation_threshold – Correlation threshold above which features will be filtered (default: 0.9)
correlation_mask – Optional precomputed tensor mask to use for correlation filtering
normalize_feature – Whether to apply min-max normalization to features
normalization_params – Optional tuple of (min_values, max_values) for normalization
- _preprocess_row_for_neoplastic(row: Series) -> Optional[str]
Process a single slide row to extract slide name and validate features. Modified version that doesn’t require a label column.
- Parameters:
row – A pandas Series representing a row from the Excel file
- Returns:
slide_name if valid, None otherwise
- _validate_features(folder: Path, slide_name: str) -> bool
Validate that features exist for the given slide.
- _create_neoplastic_labels(cell_types: Dict[int, int], cell_indices: Dict[int, int]) -> Tensor
Create binary neoplastic labels for cells.
- Parameters:
cell_types – Dictionary mapping cell_id to cell_type
cell_indices – Dictionary mapping cell_id to tensor index
- Returns:
Tensor of shape (n_cells,) with binary labels: 1 for neoplastic cells (type 1), 0 for all other cell types (types 2, 3, 4, 5)
- Return type:
Tensor
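The labelling rule can be sketched as follows. This is a minimal illustration, not the library's implementation: it returns a plain Python list rather than a Tensor so that it runs without dependencies, and treating a cell that is missing from `cell_types` as non-neoplastic is an assumption.

```python
def create_neoplastic_labels(
    cell_types: dict[int, int], cell_indices: dict[int, int]
) -> list[int]:
    """Return one binary label per cell, ordered by tensor index."""
    labels = [0] * len(cell_indices)
    for cell_id, index in cell_indices.items():
        # Type 1 is neoplastic; every other type (2-5) maps to 0.
        # Assumption: a cell absent from cell_types is labelled 0.
        labels[index] = 1 if cell_types.get(cell_id) == 1 else 0
    return labels


# Hypothetical cell ids and types, for illustration only.
cell_types = {10: 1, 11: 3, 12: 5, 13: 1}
cell_indices = {10: 0, 11: 1, 12: 2, 13: 3}
labels = create_neoplastic_labels(cell_types, cell_indices)
```

With PyTorch available, `torch.tensor(labels)` would turn the list into a tensor of shape (n_cells,).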
- _compute_correlation_filter() -> None
Compute the correlation filter mask from the training data. For each pair of features whose correlation exceeds the threshold, one of the two features is removed.
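One way such a filter could work is sketched below. Several details are assumptions that the documentation does not confirm: the use of absolute Pearson correlation, the greedy rule that keeps the earlier feature of each correlated pair, and the helper names `pearson` and `correlation_mask`. Plain Python lists stand in for tensors.

```python
import math


def pearson(x, y):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    norm_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    norm_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    if norm_x == 0 or norm_y == 0:
        return 0.0
    return cov / (norm_x * norm_y)


def correlation_mask(rows, threshold=0.9):
    """Return a keep-mask over features; rows is a list of per-cell feature lists."""
    columns = list(zip(*rows))
    keep = [True] * len(columns)
    for i in range(len(columns)):
        if not keep[i]:
            continue
        for j in range(i + 1, len(columns)):
            # Assumption: drop the later feature of any highly correlated pair.
            if keep[j] and abs(pearson(columns[i], columns[j])) > threshold:
                keep[j] = False
    return keep


# Feature 1 is exactly twice feature 0; feature 2 is only weakly related.
train_rows = [
    [1.0, 2.0, 5.0],
    [2.0, 4.0, 1.0],
    [3.0, 6.0, 4.0],
    [4.0, 8.0, 2.0],
]
mask = correlation_mask(train_rows)
```

Computing the mask on the training split and reusing it elsewhere keeps the feature set identical across splits, which is presumably why the constructor accepts a `correlation_mask` argument.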
- _compute_normalization_params() -> None
Compute min-max normalization parameters based on training data. These parameters can be used to normalize validation and test sets consistently.
- get_normalization_params() -> Optional[Tuple[Tensor, Tensor]]
Get the normalization parameters (min, max) computed from training data.
- Returns:
Tuple of (min_values, max_values) tensors, or None if not computed.
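The idea of reusing training statistics on other splits can be sketched as below. Plain Python lists stand in for tensors, and the helper names `compute_min_max` and `normalize` as well as the `eps` guard against division by zero are illustrative assumptions rather than cellmil functions.

```python
def compute_min_max(rows):
    """Per-feature minimum and maximum over a list of feature rows."""
    columns = list(zip(*rows))
    return [min(col) for col in columns], [max(col) for col in columns]


def normalize(row, min_values, max_values, eps=1e-8):
    """Min-max scale one row using precomputed (training) statistics."""
    return [
        (value - lo) / max(hi - lo, eps)
        for value, lo, hi in zip(row, min_values, max_values)
    ]


# Statistics come from the training split only.
train_rows = [[0.0, 10.0], [5.0, 20.0], [10.0, 30.0]]
min_values, max_values = compute_min_max(train_rows)

# A validation row is scaled with the training statistics.
val_row = [2.5, 40.0]
normalized = normalize(val_row, min_values, max_values)
```

Values outside the training range land outside [0, 1], as the second feature does here. In cellmil terms, the tuple returned by `get_normalization_params()` on a training dataset would presumably be passed as `normalization_params` when building the validation and test datasets.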
- get_correlation_mask()
Get the mask for features retained after correlation filtering.
- Returns:
A boolean tensor indicating which features are kept.
- get_class_distribution() -> Dict[str, int]
Get the distribution of neoplastic vs non-neoplastic cells.
- Returns:
Dictionary with counts for each class.
- get_weights_for_sampler() -> Tensor
Compute weights for WeightedRandomSampler to handle class imbalance. Since this is cell-level classification, weights are computed based on the number of neoplastic vs non-neoplastic cells.
- Returns:
Weights for each cell in the dataset, with shape (len(dataset),).
- Return type:
Tensor
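A common way to derive such weights is inverse class frequency, sketched below. Whether cellmil uses exactly this formula is an assumption; the helper name `sampler_weights` is hypothetical, and a plain list stands in for the returned tensor.

```python
from collections import Counter


def sampler_weights(labels):
    """One weight per sample, inversely proportional to its class count."""
    counts = Counter(labels)
    return [1.0 / counts[label] for label in labels]


# One neoplastic cell among three non-neoplastic cells.
labels = [1, 0, 0, 0]
weights = sampler_weights(labels)
```

With these weights each class carries the same total sampling mass, so the rare class is drawn about as often as the common one. With PyTorch available, the weights could be passed to `torch.utils.data.WeightedRandomSampler(weights, num_samples=len(weights), replacement=True)`.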
- __getitem__(idx: int) -> tuple[torch.Tensor, torch.Tensor]
Get a sample from the dataset.
- Parameters:
idx – Index of the cell to retrieve
- Returns:
Tuple of (features, label), where features is a tensor of shape (n_features,) for a single cell and label is the binary neoplastic label (0 or 1)
- Return type:
tuple[torch.Tensor, torch.Tensor]
- class cellmil.datamodels.datasets.CellTypeDataset(root: Union[str, Path], folder: Path, data: DataFrame, extractor: Union[ExtractorType, List[ExtractorType]], graph_creator: Optional[GraphCreatorType] = None, segmentation_model: Optional[ModelType] = None, split: Literal['train', 'val', 'test', 'all'] = 'all', transforms: Optional[Union[TransformPipeline, Transform]] = None, cell_types_to_keep: Optional[List[str]] = None, label_smoothing: Optional[Union[float, Dict[str, float]]] = None, max_workers: int = 8, force_reload: bool = False)
Bases: Dataset[Tuple[Tensor, Tensor]]
A PyTorch Dataset for multi-class cell type classification.
This dataset treats each individual cell as a sample, creating a cell-level dataset where each item is a single cell with its features and one-hot encoded cell type label.
The labels are one-hot encoded for the 5 cell types:
- Type 1: Neoplastic
- Type 2: Inflammatory
- Type 3: Connective
- Type 4: Dead
- Type 5: Epithelial
Supports label smoothing for specific cell types to handle annotation uncertainty.
- Returns:
- (features, label) where:
features is a tensor of shape (n_features,) for a single cell
label is a one-hot encoded tensor of shape (n_cell_types,) with optional label smoothing
- Return type:
Tuple[torch.Tensor, torch.Tensor]
- __init__(root: Union[str, Path], folder: Path, data: DataFrame, extractor: Union[ExtractorType, List[ExtractorType]], graph_creator: Optional[GraphCreatorType] = None, segmentation_model: Optional[ModelType] = None, split: Literal['train', 'val', 'test', 'all'] = 'all', transforms: Optional[Union[TransformPipeline, Transform]] = None, cell_types_to_keep: Optional[List[str]] = None, label_smoothing: Optional[Union[float, Dict[str, float]]] = None, max_workers: int = 8, force_reload: bool = False)
Initialize the CellTypeDataset.
- Parameters:
root – Root directory where the processed dataset will be cached
folder – Path to the dataset folder containing slide data
data – DataFrame containing metadata with at least ‘FULL_PATH’ and ‘SPLIT’ columns
extractor – Feature extractor type or list of types to use for feature extraction
graph_creator – Optional graph creator type, needed for some extractors
segmentation_model – Segmentation model type (‘cellvit’ or ‘hovernet’), required for cell type info
split – Dataset split (‘train’, ‘val’, ‘test’, or ‘all’)
transforms – Optional TransformPipeline or Transform to apply to features at getitem time
cell_types_to_keep – Optional list of cell type names to keep (e.g., [“Neoplastic”, “Connective”]). Valid names: “Neoplastic”, “Inflammatory”, “Connective”, “Dead”, “Epithelial” (case-insensitive). If provided, only cells of these types will be included.
label_smoothing – Label smoothing configuration. Can be:
- None or 0.0: No smoothing applied
- float (0.0 to 1.0): Same smoothing value applied to all cell types
- Dict[str, float]: Custom smoothing value for each cell type, e.g., {“Neoplastic”: 0.0, “Inflammatory”: 0.1, “Dead”: 0.2, “Epithelial”: 0.15}. Cell types not in the dict will have no smoothing applied.
max_workers – Maximum number of threads for parallel processing
force_reload – Whether to force reprocessing even if processed files exist
- _process_label_smoothing(label_smoothing: Optional[Union[float, Dict[str, float]]]) -> Dict[int, float]
Process label smoothing configuration into a per-type dictionary.
- Parameters:
label_smoothing – Either None, a float, or a dict mapping cell type names to smoothing values
- Returns:
Dictionary mapping cell type indices (1-based) to smoothing values
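The three accepted configurations can be normalized into one per-type dictionary roughly as follows. This is a sketch under stated assumptions: the `CELL_TYPES` mapping is copied from the class description rather than from the library, matching dict keys case-insensitively is an assumption, and so is raising `ValueError` for out-of-range values.

```python
from typing import Dict, Optional, Union

# 1-based indices as listed in the class description (stand-in for the library's table).
CELL_TYPES = {
    "neoplastic": 1,
    "inflammatory": 2,
    "connective": 3,
    "dead": 4,
    "epithelial": 5,
}


def process_label_smoothing(
    label_smoothing: Optional[Union[float, Dict[str, float]]],
) -> Dict[int, float]:
    """Expand None, a float, or a name-keyed dict into {type_index: smoothing}."""
    # Default: no smoothing for any type.
    result = {index: 0.0 for index in CELL_TYPES.values()}
    if label_smoothing is None:
        return result
    if isinstance(label_smoothing, (int, float)):
        value = float(label_smoothing)
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"smoothing must be in [0, 1], got {value}")
        return {index: value for index in result}
    for name, value in label_smoothing.items():
        # Assumption: names are matched case-insensitively.
        result[CELL_TYPES[name.lower()]] = float(value)
    return result


per_type = process_label_smoothing(
    {"Neoplastic": 0.0, "Inflammatory": 0.1, "Dead": 0.2, "Epithelial": 0.15}
)
```

Types missing from the dict, such as Connective here, keep a smoothing value of 0.0, matching the documented behaviour.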
- _cell_type_names_to_indices(cell_type_names: Optional[List[str]]) -> Optional[List[int]]
Convert cell type names to their corresponding indices.
- Parameters:
cell_type_names – List of cell type names (case-insensitive)
- Returns:
List of cell type indices (1-based as in TYPE_NUCLEI_DICT), or None if input is None
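A case-insensitive lookup of this kind might look like the sketch below. The `TYPE_NAME_TO_INDEX` table is a stand-in built from the class description, not the library's `TYPE_NUCLEI_DICT`, and raising `ValueError` for unknown names is an assumption.

```python
from typing import List, Optional

# Stand-in for the library's name table; indices are 1-based.
TYPE_NAME_TO_INDEX = {
    "neoplastic": 1,
    "inflammatory": 2,
    "connective": 3,
    "dead": 4,
    "epithelial": 5,
}


def cell_type_names_to_indices(names: Optional[List[str]]) -> Optional[List[int]]:
    """Convert cell type names to 1-based indices; None passes through unchanged."""
    if names is None:
        return None
    indices = []
    for name in names:
        key = name.strip().lower()
        if key not in TYPE_NAME_TO_INDEX:
            # Assumption: unknown names are rejected rather than silently skipped.
            raise ValueError(f"Unknown cell type: {name!r}")
        indices.append(TYPE_NAME_TO_INDEX[key])
    return indices


kept = cell_type_names_to_indices(["neoplastic", "CONNECTIVE"])
```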
- _process_slide(row: Series, start_global_idx: int) -> Optional[str]
Process a single slide and extract cell-level features and labels.
- Parameters:
row – A pandas Series representing a row from the DataFrame
start_global_idx – Starting global index for cells in this slide
- Returns:
slide_name if successful, None otherwise
- _create_label(cell_type: int) -> Tensor
Create a one-hot encoded label with optional label smoothing.
- Parameters:
cell_type – Cell type index (1-based, 1-5)
- Returns:
One-hot encoded tensor of shape (NUM_CELL_TYPES,) with optional smoothing
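The documentation does not state the smoothing formula. The sketch below assumes the common variant that puts `1 - smoothing` on the true class and spreads `smoothing` evenly over the remaining classes, so that the label still sums to 1. It returns a plain list instead of a Tensor, and the `smoothing` argument is passed explicitly here, whereas the real method takes only `cell_type`.

```python
NUM_CELL_TYPES = 5


def create_label(cell_type: int, smoothing: float = 0.0) -> list[float]:
    """One-hot label for a 1-based cell type, optionally smoothed."""
    if not 1 <= cell_type <= NUM_CELL_TYPES:
        raise ValueError(f"cell_type must be in 1..{NUM_CELL_TYPES}, got {cell_type}")
    # Assumption: the smoothing mass is shared equally by the other classes.
    off_value = smoothing / (NUM_CELL_TYPES - 1)
    label = [off_value] * NUM_CELL_TYPES
    label[cell_type - 1] = 1.0 - smoothing
    return label


hard = create_label(1)
smoothed = create_label(4, smoothing=0.2)
```

Here `hard` is a plain one-hot vector for Neoplastic, while `smoothed` places 0.8 on Dead and 0.05 on each of the other four types.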
- get_class_distribution() -> Dict[str, int]
Get the distribution of cell types in the dataset.
- Returns:
Dictionary with cell type names as keys and counts as values
- get_num_classes() -> int
Get the number of cell type classes.
- Returns:
Number of cell type classes (5 for all types, or fewer if filtered)
- get_weights_for_sampler() -> Tensor
Compute weights for WeightedRandomSampler to handle class imbalance.
- Returns:
Weights for each cell in the dataset, shape (len(dataset),)
- Return type:
Tensor
- __getitem__(idx: int) -> Tuple[Tensor, Tensor]
Get a sample from the dataset.
- Parameters:
idx – Index of the cell to retrieve
- Returns:
Tuple of (features, label), where features is a tensor of shape (n_features,) for a single cell and label is a one-hot encoded tensor of shape (n_cell_types,)
- Return type:
Tuple[Tensor, Tensor]
- create_subset(indices: List[int]) -> CellTypeDataset
Create a subset of the dataset using the specified indices.
- Parameters:
indices – List of cell indices to include in the subset
- Returns:
New CellTypeDataset instance containing only the specified cells
- create_train_val_datasets(train_indices: List[int], val_indices: List[int], transforms: Optional[Union[TransformPipeline, Transform]] = None) -> Tuple[CellTypeDataset, CellTypeDataset]
Create train and validation datasets from specified indices.
- Parameters:
train_indices – List of cell indices for training set
val_indices – List of cell indices for validation set
transforms – Optional pre-fitted transforms to apply to both datasets. These should already be fitted on training data before calling this method.
- Returns:
Tuple of (train_dataset, val_dataset) with the provided transforms
- create_train_val_datasets_by_slides(train_slides: List[str], val_slides: List[str], transforms: Optional[Union[TransformPipeline, Transform]] = None) -> Tuple[CellTypeDataset, CellTypeDataset]
Create train and validation datasets based on slide names.
This method is useful when you want to split the CellTypeDataset based on the same slide-level split used for another dataset (e.g., MILDataset). All cells from slides in train_slides go to training, and all cells from slides in val_slides go to validation.
- Parameters:
train_slides – List of slide names for training set
val_slides – List of slide names for validation set
transforms – Optional pre-fitted transforms to apply to both datasets. These should already be fitted on training data before calling this method.
- Returns:
Tuple of (train_dataset, val_dataset) with the provided transforms
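A slide-level split reduces to a cell-level split once each cell's slide of origin is known. The sketch below assumes a hypothetical `cell_slides` list holding one slide name per cell; how cellmil actually stores this mapping is not documented here, and rejecting overlapping slide lists is likewise an assumption.

```python
def split_indices_by_slide(cell_slides, train_slides, val_slides):
    """Return (train_indices, val_indices) of cells grouped by their slide."""
    train_set, val_set = set(train_slides), set(val_slides)
    overlap = train_set & val_set
    if overlap:
        # Assumption: a slide must not appear in both splits.
        raise ValueError(f"slides in both splits: {sorted(overlap)}")
    train_indices = [i for i, slide in enumerate(cell_slides) if slide in train_set]
    val_indices = [i for i, slide in enumerate(cell_slides) if slide in val_set]
    return train_indices, val_indices


# Hypothetical mapping: six cells drawn from three slides.
cell_slides = ["s1", "s1", "s2", "s3", "s3", "s3"]
train_idx, val_idx = split_indices_by_slide(cell_slides, ["s1", "s3"], ["s2"])
```

The resulting index lists have the form that `create_train_val_datasets` takes as `train_indices` and `val_indices`. Splitting by slide keeps every cell of a slide in a single split, which avoids leaking slide-specific features between training and validation.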
Modules
CellTypeDataset: A PyTorch Dataset for multi-class cell type classification.