cellmil.datamodels.datasets.neoplastic_dataset

Classes

NeoplasticDataset(folder, data, extractor[, ...])

A PyTorch Dataset for neoplastic cell classification tasks.

class cellmil.datamodels.datasets.neoplastic_dataset.NeoplasticDataset(folder: Path, data: DataFrame, extractor: Union[ExtractorType, List[ExtractorType]], graph_creator: Optional[GraphCreatorType] = None, segmentation_model: Optional[ModelType] = None, split: Literal['train', 'val', 'test'] = 'train', max_workers: int = 8, correlation_filter: bool = False, correlation_threshold: float = 0.9, correlation_mask: Optional[Tensor] = None, normalize_feature: bool = False, normalization_params: Optional[Tuple[Tensor, Tensor]] = None)

Bases: Dataset[Tuple[Tensor, Tensor]]

A PyTorch Dataset for neoplastic cell classification tasks. This dataset treats each individual cell as a sample, creating a cell-level dataset where each item is a single cell with its features and binary neoplastic label.

The target is binary:

- Neoplastic cells (type 1): label = 1
- All other cell types (types 2, 3, 4, 5): label = 0

The dataset concatenates all cells from all slides in the split, hiding the slide-level structure from the model.
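The flattening described above can be pictured with a minimal sketch. The per-slide dictionaries and slide names here are hypothetical stand-ins, not the class's internal storage; the point is only that per-slide tensors are joined along the cell axis so that one index addresses one cell.

```python
import torch

# Hypothetical per-slide features, each of shape (n_cells_in_slide, n_features).
slide_features = {
    "slide_a": torch.rand(3, 4),
    "slide_b": torch.rand(5, 4),
}
# Hypothetical per-slide binary labels, each of shape (n_cells_in_slide,).
slide_labels = {
    "slide_a": torch.tensor([1, 0, 0]),
    "slide_b": torch.tensor([0, 1, 1, 0, 0]),
}

# Concatenate along the cell axis; slide boundaries are no longer visible.
slide_names = list(slide_features)
features = torch.cat([slide_features[name] for name in slide_names], dim=0)
labels = torch.cat([slide_labels[name] for name in slide_names], dim=0)

print(features.shape, labels.shape)
```

After concatenation, a model iterating over `features` and `labels` sees 8 independent cells rather than two slides.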

__init__(folder: Path, data: DataFrame, extractor: Union[ExtractorType, List[ExtractorType]], graph_creator: Optional[GraphCreatorType] = None, segmentation_model: Optional[ModelType] = None, split: Literal['train', 'val', 'test'] = 'train', max_workers: int = 8, correlation_filter: bool = False, correlation_threshold: float = 0.9, correlation_mask: Optional[Tensor] = None, normalize_feature: bool = False, normalization_params: Optional[Tuple[Tensor, Tensor]] = None)

Initialize the dataset with data and cell-level neoplastic labels.

Parameters:

folder – Path to the dataset folder
data – DataFrame containing metadata
extractor – Feature extractor type or list of types to use for feature extraction.
graph_creator – Optional graph creator type, needed for some extractors
segmentation_model – Segmentation model type. Defaults to None in the signature, but is required for cell type information
split – Dataset split (train/val/test)
max_workers – Maximum number of threads for parallel processing
correlation_filter – Whether to apply correlation filtering to remove highly correlated features
correlation_threshold – Correlation threshold above which features will be filtered (default: 0.9)
correlation_mask – Optional precomputed tensor mask to use for correlation filtering
normalize_feature – Whether to apply min-max normalization to features
normalization_params – Optional tuple of (min_values, max_values) for normalization

_read_data() → None: Read the data specified in the configuration.

_preprocess_row_for_neoplastic(row: Series) → Optional[str]

Process a single slide row to extract the slide name and validate its features. Modified version that doesn’t require a label column.

Parameters: row – A pandas Series representing a row from the Excel file
Returns: slide_name if valid, None otherwise

_extract_slide_name(file_path: Path) → str: Extract slide name from file path.

_validate_features(folder: Path, slide_name: str) → bool: Validate that features exist for the given slide.

_create_neoplastic_labels(cell_types: Dict[int, int], cell_indices: Dict[int, int]) → Tensor

Create binary neoplastic labels for cells.

Parameters:

cell_types – Dictionary mapping cell_id to cell_type
cell_indices – Dictionary mapping cell_id to tensor index

Returns:

Tensor of shape (n_cells,) with binary labels: 1 for neoplastic cells (type 1), 0 for all other cell types (types 2, 3, 4, 5)

Return type:

Tensor
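The label mapping can be sketched as follows. This is an illustrative stand-alone function, not the library's implementation: the function name, the `long` dtype, the assumption that tensor indices run contiguously from 0, and the treatment of cells missing from `cell_types` as non-neoplastic are all assumptions.

```python
from typing import Dict

import torch

NEOPLASTIC_TYPE = 1  # per the class description, cell type 1 is neoplastic


def create_neoplastic_labels(
    cell_types: Dict[int, int], cell_indices: Dict[int, int]
) -> torch.Tensor:
    """Build a (n_cells,) tensor with 1 for neoplastic cells and 0 otherwise."""
    # Assumption: indices in cell_indices are contiguous, 0 .. n_cells - 1.
    labels = torch.zeros(len(cell_indices), dtype=torch.long)
    for cell_id, index in cell_indices.items():
        # Assumption: a cell with no recorded type is labelled non-neoplastic.
        if cell_types.get(cell_id) == NEOPLASTIC_TYPE:
            labels[index] = 1
    return labels


# Hypothetical cell ids mapped to types and to tensor positions.
cell_types = {10: 1, 11: 3, 12: 1, 13: 5}
cell_indices = {10: 0, 11: 1, 12: 2, 13: 3}
labels = create_neoplastic_labels(cell_types, cell_indices)
print(labels)
```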

_compute_correlation_filter() → None: Compute the correlation filter mask from training data. For each pair of features whose correlation exceeds the threshold, one feature of the pair is removed.
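One way such a filter could work is sketched below. This is not taken from the library's source: the greedy keep-the-first strategy, the use of the absolute correlation, and the function name `correlation_mask` are assumptions made for illustration.

```python
import torch


def correlation_mask(features: torch.Tensor, threshold: float = 0.9) -> torch.Tensor:
    """Return a boolean mask over features (True = keep)."""
    n_features = features.shape[1]
    # torch.corrcoef treats rows as variables, so transpose (n_cells, n_features).
    corr = torch.corrcoef(features.T).abs()  # assumption: absolute correlation
    keep = torch.ones(n_features, dtype=torch.bool)
    for i in range(n_features):
        if not keep[i]:
            continue
        for j in range(i + 1, n_features):
            # Assumption: the earlier feature of a correlated pair is kept.
            if keep[j] and corr[i, j] > threshold:
                keep[j] = False
    return keep


torch.manual_seed(0)
base = torch.rand(100, 1)
# Column 1 is a linear function of column 0; column 2 is independent noise.
features = torch.cat([base, 2.0 * base + 0.5, torch.rand(100, 1)], dim=1)

mask = correlation_mask(features, threshold=0.9)
filtered = features[:, mask]
print(mask, filtered.shape)
```

A mask computed this way on the training split can then be applied unchanged to other splits by boolean indexing, which keeps the feature columns aligned across splits.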

_compute_normalization_params() → None: Compute min-max normalization parameters based on training data. These parameters can be used to normalize validation and test sets consistently.
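A minimal sketch of min-max normalization with parameters taken from the training split only. The helper names and the guard against zero-range features are assumptions, not the library's code.

```python
from typing import Tuple

import torch


def compute_min_max(train_features: torch.Tensor) -> Tuple[torch.Tensor, torch.Tensor]:
    """Per-feature minimum and maximum over the cell axis."""
    return train_features.min(dim=0).values, train_features.max(dim=0).values


def apply_min_max(
    features: torch.Tensor,
    min_values: torch.Tensor,
    max_values: torch.Tensor,
    eps: float = 1e-8,
) -> torch.Tensor:
    # Assumption: clamp the range so constant features do not divide by zero.
    value_range = (max_values - min_values).clamp_min(eps)
    return (features - min_values) / value_range


train = torch.tensor([[0.0, 10.0], [5.0, 20.0], [10.0, 30.0]])
val = torch.tensor([[2.5, 40.0]])

min_values, max_values = compute_min_max(train)
train_norm = apply_min_max(train, min_values, max_values)
val_norm = apply_min_max(val, min_values, max_values)
print(train_norm)
print(val_norm)
```

Because the validation row is scaled with training statistics, its values may fall outside [0, 1], as the second feature does here.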

get_normalization_params() → Optional[Tuple[Tensor, Tensor]]

Get the normalization parameters (min, max) computed from training data.

Returns: Tuple of (min_values, max_values) tensors, or None if not computed.

get_correlation_mask()

Get the mask for features retained after correlation filtering.

Returns: A boolean tensor indicating which features are kept.

get_class_distribution() → Dict[str, int]

Get the distribution of neoplastic vs non-neoplastic cells.

Returns: Dictionary with counts for each class.

get_weights_for_sampler() → Tensor

Compute weights for WeightedRandomSampler to handle class imbalance. Since this is cell-level classification, weights are computed based on the number of neoplastic vs non-neoplastic cells.

Returns: Weights for each cell in the dataset, with shape (len(dataset),).
Return type: torch.Tensor
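Per-cell weights of this kind are commonly derived from inverse class frequency. The sketch below assumes that scheme, which the documentation does not confirm, and assumes both classes are present; the label tensor is made up for illustration.

```python
import torch
from torch.utils.data import WeightedRandomSampler

# Hypothetical imbalanced labels: 6 non-neoplastic cells, 2 neoplastic cells.
labels = torch.tensor([0, 0, 0, 0, 0, 0, 1, 1])

# Count cells per class, then weight each class by the inverse of its count.
# Assumption: both classes occur at least once, so no count is zero.
class_counts = torch.bincount(labels, minlength=2)
class_weights = 1.0 / class_counts.float()

# One weight per cell, shape (len(dataset),).
sample_weights = class_weights[labels]

sampler = WeightedRandomSampler(
    weights=sample_weights,
    num_samples=len(sample_weights),
    replacement=True,
)
print(class_counts, sample_weights)
```

With these weights each class carries the same total sampling mass, so batches drawn through the sampler are balanced in expectation. The sampler is passed to a `DataLoader` via its `sampler` argument.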

__len__() → int: Return the number of cells in the dataset.

__getitem__(idx: int) → tuple[torch.Tensor, torch.Tensor]

Get a sample from the dataset.

Parameters:

idx – Index of the cell to retrieve

Returns:

Tuple of (features, label), where features is a tensor of shape (n_features,) for a single cell and label is the binary neoplastic label (0 or 1), returned as a tensor per the signature

Return type:

tuple[torch.Tensor, torch.Tensor]
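Because each item is a (features, label) pair for one cell, the dataset can be batched with a standard DataLoader. The sketch below uses a `TensorDataset` filled with random data as a stand-in with the same item shape; it does not construct a real NeoplasticDataset.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Stand-in for a cell-level dataset: 10 cells, 4 features, binary labels.
features = torch.rand(10, 4)
labels = torch.randint(0, 2, (10,))
dataset = TensorDataset(features, labels)

# A single item is one cell: features of shape (n_features,) plus its label.
cell_features, cell_label = dataset[0]

# Default collation stacks cells into (batch_size, n_features) and (batch_size,).
loader = DataLoader(dataset, batch_size=4, shuffle=False)
batch_features, batch_labels = next(iter(loader))
print(cell_features.shape, batch_features.shape, batch_labels.shape)
```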