cellmil.datamodels.datasets.neoplastic_dataset
Classes
NeoplasticDataset | A PyTorch Dataset for neoplastic cell classification tasks.
- class cellmil.datamodels.datasets.neoplastic_dataset.NeoplasticDataset(folder: Path, data: DataFrame, extractor: Union[ExtractorType, List[ExtractorType]], graph_creator: Optional[GraphCreatorType] = None, segmentation_model: Optional[ModelType] = None, split: Literal['train', 'val', 'test'] = 'train', max_workers: int = 8, correlation_filter: bool = False, correlation_threshold: float = 0.9, correlation_mask: Optional[Tensor] = None, normalize_feature: bool = False, normalization_params: Optional[Tuple[Tensor, Tensor]] = None)
Bases: Dataset[Tuple[Tensor, Tensor]]
A PyTorch Dataset for neoplastic cell classification tasks. This dataset treats each individual cell as a sample, creating a cell-level dataset where each item is a single cell with its features and binary neoplastic label.
The target is binary:
- Neoplastic cells (type 1): label = 1
- All other cell types (types 2, 3, 4, 5): label = 0
The dataset concatenates all cells from all slides in the split, hiding the slide-level structure from the model.
- __init__(folder: Path, data: DataFrame, extractor: Union[ExtractorType, List[ExtractorType]], graph_creator: Optional[GraphCreatorType] = None, segmentation_model: Optional[ModelType] = None, split: Literal['train', 'val', 'test'] = 'train', max_workers: int = 8, correlation_filter: bool = False, correlation_threshold: float = 0.9, correlation_mask: Optional[Tensor] = None, normalize_feature: bool = False, normalization_params: Optional[Tuple[Tensor, Tensor]] = None)
Initialize the dataset with data and cell-level neoplastic labels.
- Parameters:
folder – Path to the dataset folder
data – DataFrame containing metadata
extractor – Feature extractor type or list of types to use for feature extraction.
graph_creator – Optional graph creator type, needed for some extractors
segmentation_model – Segmentation model type. Optional in the signature, but required to obtain cell type information
split – Dataset split (train/val/test)
max_workers – Maximum number of threads for parallel processing
correlation_filter – Whether to apply correlation filtering to remove highly correlated features
correlation_threshold – Correlation threshold above which features will be filtered (default: 0.9)
correlation_mask – Optional tensor mask to apply correlation filtering
normalize_feature – Whether to apply min-max normalization to features
normalization_params – Optional tuple of (min_values, max_values) for normalization
- _preprocess_row_for_neoplastic(row: Series) -> Optional[str]
Process a single slide row to extract slide name and validate features. Modified version that doesn’t require a label column.
- Parameters:
row – A pandas Series representing a row from the Excel file
- Returns:
slide_name if valid, None otherwise
- _validate_features(folder: Path, slide_name: str) -> bool
Validate that features exist for the given slide.
- _create_neoplastic_labels(cell_types: Dict[int, int], cell_indices: Dict[int, int]) -> Tensor
Create binary neoplastic labels for cells.
- Parameters:
cell_types – Dictionary mapping cell_id to cell_type
cell_indices – Dictionary mapping cell_id to tensor index
- Returns:
Tensor of shape (n_cells,) with binary labels:
1 for neoplastic cells (type 1)
0 for all other cell types (types 2, 3, 4, 5)
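The mapping from the two dictionaries to a label vector can be sketched as follows. This is a minimal illustration, not the library's implementation: the helper name `make_neoplastic_labels` is hypothetical, a plain list stands in for the tensor, and it assumes that tensor indices run contiguously from 0 and that a cell missing from `cell_types` counts as non-neoplastic.

```python
from typing import Dict, List

NEOPLASTIC_TYPE = 1  # per the docstring above: type 1 is neoplastic


def make_neoplastic_labels(
    cell_types: Dict[int, int],
    cell_indices: Dict[int, int],
) -> List[int]:
    """Return 0/1 labels ordered by tensor index (hypothetical helper)."""
    labels = [0] * len(cell_indices)
    for cell_id, idx in cell_indices.items():
        # Assumption: a cell with no recorded type is treated as non-neoplastic.
        labels[idx] = 1 if cell_types.get(cell_id) == NEOPLASTIC_TYPE else 0
    return labels


# Toy data: four cells, two of them neoplastic (type 1).
cell_types = {101: 1, 102: 3, 103: 5, 104: 1}
cell_indices = {101: 0, 102: 1, 103: 2, 104: 3}
labels = make_neoplastic_labels(cell_types, cell_indices)
print(labels)
```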
- _compute_correlation_filter() -> None
Compute the correlation filter mask from the training data. For each pair of features whose correlation exceeds the threshold, one of the two features is removed.
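One common way to build such a mask is a greedy pass over the features, sketched below. The source does not say which feature of a correlated pair is dropped or whether the absolute correlation is used, so both choices here are assumptions: the later feature is dropped, and `abs()` is applied. The names `pearson` and `correlation_mask` are hypothetical, and feature columns are plain lists rather than tensors.

```python
import math
from typing import List, Sequence


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation of two equally long sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        return 0.0  # assumption: constant features count as uncorrelated
    return cov / math.sqrt(vx * vy)


def correlation_mask(columns: List[List[float]], threshold: float = 0.9) -> List[bool]:
    """Greedy filter: keep a feature unless it correlates with an earlier kept one."""
    keep = [True] * len(columns)
    for j in range(len(columns)):
        for i in range(j):
            if keep[i] and abs(pearson(columns[i], columns[j])) > threshold:
                keep[j] = False  # drop the later feature of the pair
                break
    return keep


# Three feature columns over four cells; the second is almost 2x the first.
features = [
    [1.0, 2.0, 3.0, 4.0],
    [2.0, 4.0, 6.0, 8.1],
    [4.0, 1.0, 3.0, 2.0],
]
mask = correlation_mask(features, threshold=0.9)
print(mask)
```

A mask computed this way on the training split could then be reused for validation and test data, which matches the role the `correlation_mask` parameter appears to play.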
- _compute_normalization_params() -> None
Compute min-max normalization parameters based on training data. These parameters can be used to normalize validation and test sets consistently.
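The idea of fitting min-max parameters on the training split and reusing them elsewhere can be sketched as below. This is an illustration under assumptions, not the class's code: `fit_min_max` and `apply_min_max` are hypothetical helpers, rows are plain lists rather than tensors, and the small `eps` guarding against division by zero is an added safeguard that the source does not mention.

```python
from typing import List, Tuple

Rows = List[List[float]]


def fit_min_max(rows: Rows) -> Tuple[List[float], List[float]]:
    """Per-feature minimum and maximum over all rows (cells)."""
    cols = list(zip(*rows))
    return [min(c) for c in cols], [max(c) for c in cols]


def apply_min_max(rows: Rows, mins: List[float], maxs: List[float],
                  eps: float = 1e-8) -> Rows:
    """Scale each feature with previously fitted (min, max) values."""
    return [
        [(v - lo) / (hi - lo + eps) for v, lo, hi in zip(row, mins, maxs)]
        for row in rows
    ]


train = [[0.0, 10.0], [5.0, 20.0], [10.0, 30.0]]
val = [[2.5, 25.0]]

mins, maxs = fit_min_max(train)               # fitted on training cells only
val_scaled = apply_min_max(val, mins, maxs)   # reused for the validation cells
print(mins, maxs, val_scaled)
```

Fitting on the training data alone keeps validation and test statistics out of the preprocessing, which is the consistency the docstring refers to.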
- get_normalization_params() -> Optional[Tuple[Tensor, Tensor]]
Get the normalization parameters (min, max) computed from training data.
- Returns:
Tuple of (min_values, max_values) tensors, or None if not computed.
- get_correlation_mask()
Get the mask for features retained after correlation filtering.
- Returns:
A boolean tensor indicating which features are kept.
- get_class_distribution() -> Dict[str, int]
Get the distribution of neoplastic vs non-neoplastic cells.
- Returns:
Dictionary with counts for each class.
- get_weights_for_sampler() -> Tensor
Compute weights for WeightedRandomSampler to handle class imbalance. Since this is cell-level classification, weights are computed based on the number of neoplastic vs non-neoplastic cells.
- Returns:
Tensor of weights, one for each cell in the dataset, with shape (len(dataset),).
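A frequent choice for such weights is the inverse class frequency, sketched here. The source does not state the exact formula, so this weighting is an assumption, and `sampler_weights` is a hypothetical helper operating on a plain list of labels.

```python
from collections import Counter
from typing import List


def sampler_weights(labels: List[int]) -> List[float]:
    """Give each cell the weight 1 / (number of cells in its class)."""
    counts = Counter(labels)
    return [1.0 / counts[y] for y in labels]


# One neoplastic cell among three non-neoplastic ones.
labels = [1, 0, 0, 0]
weights = sampler_weights(labels)
print(weights)
```

With this weighting the total weight of each class is equal, so both classes are drawn with the same probability overall. Weights of this kind are what `torch.utils.data.WeightedRandomSampler(weights, num_samples, replacement=True)` expects as its first argument.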
- __getitem__(idx: int) -> tuple[torch.Tensor, torch.Tensor]
Get a sample from the dataset.
- Parameters:
idx – Index of the cell to retrieve
- Returns:
Tuple of (features, label) where
features is a tensor of shape (n_features,) for a single cell
label is an int with binary neoplastic label (0 or 1)
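How a flat cell index can hide the slide structure is sketched below. The source does not describe how the class stores its cells, so this is only one possible layout: `FlatCellIndex` is a hypothetical class that keeps per-slide lists and maps a global index to a (slide, local index) pair through cumulative offsets.

```python
from bisect import bisect_right
from itertools import accumulate
from typing import List, Tuple

Slide = Tuple[List[List[float]], List[int]]  # (features per cell, labels per cell)


class FlatCellIndex:
    """Expose the cells of several slides as one flat, indexable sequence."""

    def __init__(self, slides: List[Slide]) -> None:
        self.slides = slides
        # offsets[k] is the number of cells in slides 0..k inclusive.
        self.offsets = list(accumulate(len(labels) for _, labels in slides))

    def __len__(self) -> int:
        return self.offsets[-1] if self.offsets else 0

    def __getitem__(self, idx: int) -> Tuple[List[float], int]:
        if not 0 <= idx < len(self):
            raise IndexError(idx)
        slide_no = bisect_right(self.offsets, idx)  # slide that contains idx
        start = self.offsets[slide_no - 1] if slide_no else 0
        feats, labels = self.slides[slide_no]
        return feats[idx - start], labels[idx - start]


slide_a = ([[0.1, 0.1], [0.2, 0.2]], [1, 0])
slide_b = ([[0.3, 0.3], [0.4, 0.4], [0.5, 0.5]], [0, 0, 1])
cells = FlatCellIndex([slide_a, slide_b])
print(len(cells), cells[1], cells[2], cells[4])
```

The caller only ever sees a global index and a (features, label) pair, which mirrors the cell-level view described for `__getitem__`.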