cellmil.datamodels.datasets.neoplastic_dataset

Classes

NeoplasticDataset(folder, data, extractor[, ...])

A PyTorch Dataset for neoplastic cell classification tasks.

class cellmil.datamodels.datasets.neoplastic_dataset.NeoplasticDataset(folder: Path, data: DataFrame, extractor: Union[ExtractorType, List[ExtractorType]], graph_creator: Optional[GraphCreatorType] = None, segmentation_model: Optional[ModelType] = None, split: Literal['train', 'val', 'test'] = 'train', max_workers: int = 8, correlation_filter: bool = False, correlation_threshold: float = 0.9, correlation_mask: Optional[Tensor] = None, normalize_feature: bool = False, normalization_params: Optional[Tuple[Tensor, Tensor]] = None)[source]

Bases: Dataset[Tuple[Tensor, Tensor]]

A PyTorch Dataset for neoplastic cell classification tasks. This dataset treats each individual cell as a sample, creating a cell-level dataset where each item is a single cell with its features and binary neoplastic label.

The target is binary:

  • Neoplastic cells (type 1): label = 1

  • All other cell types (types 2, 3, 4, 5): label = 0

The dataset concatenates all cells from all slides in the split, hiding the slide-level structure from the model.
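The flattening described above can be sketched with a cumulative offset table. This is a minimal illustration in plain Python, not the class's actual implementation: the `slides` dictionary, `offsets`, `total_cells`, and `get_cell` are hypothetical names, and lists stand in for tensors.

```python
from bisect import bisect_right
from itertools import accumulate

# Hypothetical per-slide data: a list of per-cell feature vectors and a
# parallel list of binary labels for each slide.
slides = {
    "slide_a": {"features": [[0.1, 0.2], [0.3, 0.4]], "labels": [1, 0]},
    "slide_b": {"features": [[0.5, 0.6]], "labels": [0]},
    "slide_c": {"features": [[0.7, 0.8], [0.9, 1.0], [1.1, 1.2]], "labels": [1, 1, 0]},
}

slide_names = list(slides)
# offsets[i] is the total number of cells in slides 0..i (inclusive).
offsets = list(accumulate(len(slides[name]["labels"]) for name in slide_names))


def total_cells():
    """Number of cells across all slides, i.e. the length of the flat dataset."""
    return offsets[-1] if offsets else 0


def get_cell(idx):
    """Map a flat cell index to (features, label), ignoring slide boundaries."""
    if not 0 <= idx < total_cells():
        raise IndexError(idx)
    slide_pos = bisect_right(offsets, idx)  # first slide whose offset exceeds idx
    start = offsets[slide_pos - 1] if slide_pos > 0 else 0
    slide = slides[slide_names[slide_pos]]
    local = idx - start
    return slide["features"][local], slide["labels"][local]
```

A caller only ever sees a flat index range from 0 to `total_cells() - 1`; which slide a cell came from is not exposed.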

__init__(folder: Path, data: DataFrame, extractor: Union[ExtractorType, List[ExtractorType]], graph_creator: Optional[GraphCreatorType] = None, segmentation_model: Optional[ModelType] = None, split: Literal['train', 'val', 'test'] = 'train', max_workers: int = 8, correlation_filter: bool = False, correlation_threshold: float = 0.9, correlation_mask: Optional[Tensor] = None, normalize_feature: bool = False, normalization_params: Optional[Tuple[Tensor, Tensor]] = None)[source]

Initialize the dataset with data and cell-level neoplastic labels.

Parameters:
  • folder – Path to the dataset folder

  • data – DataFrame containing metadata

  • extractor – Feature extractor type or list of types to use for feature extraction.

  • graph_creator – Optional graph creator type, needed for some extractors

  • segmentation_model – Segmentation model type; required for cell type information, even though the signature defaults to None

  • split – Dataset split (train/val/test)

  • max_workers – Maximum number of threads for parallel processing

  • correlation_filter – Whether to apply correlation filtering to remove highly correlated features

  • correlation_threshold – Correlation threshold above which features will be filtered (default: 0.9)

  • correlation_mask – Optional precomputed tensor mask selecting which features to keep when correlation filtering is applied

  • normalize_feature – Whether to apply min-max normalization to features

  • normalization_params – Optional tuple of (min_values, max_values) for normalization

_read_data() → None[source]

Read the data specified in the configuration.

_preprocess_row_for_neoplastic(row: Series) → Optional[str][source]

Process a single slide row to extract slide name and validate features. Modified version that doesn’t require a label column.

Parameters:

row – A pandas Series representing a row from the Excel file

Returns:

slide_name if valid, None otherwise

_extract_slide_name(file_path: Path) → str[source]

Extract slide name from file path.

_validate_features(folder: Path, slide_name: str) → bool[source]

Validate that features exist for the given slide.

_create_neoplastic_labels(cell_types: Dict[int, int], cell_indices: Dict[int, int]) → Tensor[source]

Create binary neoplastic labels for cells.

Parameters:
  • cell_types – Dictionary mapping cell_id to cell_type

  • cell_indices – Dictionary mapping cell_id to tensor index

Returns:

Tensor of shape (n_cells,) with binary labels:

  • 1 for neoplastic cells (type 1)

  • 0 for all other cell types (types 2, 3, 4, 5)

Return type:

Tensor
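The label mapping can be illustrated with a short sketch. It assumes that the values in `cell_indices` are contiguous positions 0..n-1 and uses a plain list in place of a tensor; `create_neoplastic_labels` and `NEOPLASTIC_TYPE` are illustrative names, not the library's own code.

```python
NEOPLASTIC_TYPE = 1  # type 1 is the neoplastic class in this documentation


def create_neoplastic_labels(cell_types, cell_indices):
    """Build a binary label list: 1 for neoplastic cells, 0 for every other type.

    Assumption: cell_indices maps each cell_id to a unique position in 0..n-1.
    """
    labels = [0] * len(cell_indices)
    for cell_id, index in cell_indices.items():
        # Cells missing from cell_types fall back to label 0 (an assumption).
        labels[index] = 1 if cell_types.get(cell_id) == NEOPLASTIC_TYPE else 0
    return labels


# Hypothetical example data: four cells with ids 10..13.
example_types = {10: 1, 11: 3, 12: 1, 13: 5}
example_indices = {10: 0, 11: 1, 12: 2, 13: 3}
example_labels = create_neoplastic_labels(example_types, example_indices)
```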

_compute_correlation_filter() → None[source]

Compute the correlation filter mask based on training data. For each pair of features whose correlation exceeds the threshold, one feature of the pair is removed.
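One common way to build such a mask is a greedy pass over pairwise correlations. The sketch below is an assumption about the general technique, not the method's actual code: it uses absolute Pearson correlation, keeps the earlier feature of each correlated pair, and works on plain lists of rows. The names `pearson` and `correlation_mask` are illustrative.

```python
import math


def pearson(x, y):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / math.sqrt(var_x * var_y)


def correlation_mask(rows, threshold=0.9):
    """Return a keep-mask over features; rows is a list of per-cell feature lists."""
    n_features = len(rows[0])
    columns = [[row[j] for row in rows] for j in range(n_features)]
    keep = [True] * n_features
    for i in range(n_features):
        if not keep[i]:
            continue  # already dropped, so it cannot knock out other features
        for j in range(i + 1, n_features):
            if keep[j] and abs(pearson(columns[i], columns[j])) > threshold:
                keep[j] = False  # drop the later feature of the correlated pair
    return keep


# Hypothetical training features: column 1 is exactly twice column 0.
train_rows = [[1, 2, 5], [2, 4, 1], [3, 6, 4], [4, 8, 2]]
mask = correlation_mask(train_rows, threshold=0.9)
```

Computing the mask on the training split only and reusing it for the other splits keeps the feature set identical across splits.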

_compute_normalization_params() → None[source]

Compute min-max normalization parameters based on training data. These parameters can be used to normalize validation and test sets consistently.
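The idea of fitting min-max parameters on the training split and reusing them elsewhere can be sketched as follows. This is a plain-Python illustration under stated assumptions: `compute_min_max` and `apply_min_max` are hypothetical helpers, lists stand in for tensors, and constant features are mapped to 0.0 to avoid division by zero.

```python
def compute_min_max(rows):
    """Per-feature minimum and maximum over a list of feature rows."""
    columns = list(zip(*rows))
    return [min(col) for col in columns], [max(col) for col in columns]


def apply_min_max(rows, min_values, max_values):
    """Scale each feature with the given (min, max); constant features map to 0.0."""
    scaled = []
    for row in rows:
        scaled.append([
            (value - lo) / (hi - lo) if hi != lo else 0.0
            for value, lo, hi in zip(row, min_values, max_values)
        ])
    return scaled


# Parameters are fitted on (hypothetical) training features only...
train_rows = [[0.0, 10.0], [5.0, 20.0], [10.0, 30.0]]
min_values, max_values = compute_min_max(train_rows)

# ...and then reused unchanged for another split.
val_rows = [[2.5, 40.0]]
val_scaled = apply_min_max(val_rows, min_values, max_values)
```

Because the parameters come from the training split, values in other splits can fall outside the [0, 1] range, as the second feature does here.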

get_normalization_params() → Optional[Tuple[Tensor, Tensor]][source]

Get the normalization parameters (min, max) computed from training data.

Returns:

Tuple of (min_values, max_values) tensors, or None if not computed.

get_correlation_mask()[source]

Get the mask for features retained after correlation filtering.

Returns:

A boolean tensor indicating which features are kept.

get_class_distribution() → Dict[str, int][source]

Get the distribution of neoplastic vs non-neoplastic cells.

Returns:

Dictionary with counts for each class.

get_weights_for_sampler() → Tensor[source]

Compute weights for WeightedRandomSampler to handle class imbalance. Since this is cell-level classification, weights are computed based on the number of neoplastic vs non-neoplastic cells.

Returns:

Weights for each cell in the dataset, with shape (len(dataset),).

Return type:

torch.Tensor
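A typical weighting scheme for this purpose is inverse class frequency, sketched below. Whether the method uses exactly this formula is an assumption; `weights_for_sampler` here is a standalone illustrative function operating on a plain list of labels.

```python
from collections import Counter


def weights_for_sampler(labels):
    """Give each cell a weight of 1 / (number of cells in its class)."""
    counts = Counter(labels)
    return [1.0 / counts[label] for label in labels]


# Hypothetical imbalanced label list: one neoplastic cell, three others.
example_labels = [1, 0, 0, 0]
example_weights = weights_for_sampler(example_labels)

# Total weight per class; with inverse-frequency weights both classes sum to 1.
class_totals = {
    cls: sum(w for w, lab in zip(example_weights, example_labels) if lab == cls)
    for cls in set(example_labels)
}
```

With these weights each class carries the same total sampling mass, so a sampler such as `torch.utils.data.WeightedRandomSampler`, drawing with replacement, would pick neoplastic and non-neoplastic cells about equally often.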

__len__() → int[source]

Return the number of cells in the dataset.

__getitem__(idx: int) → tuple[torch.Tensor, torch.Tensor][source]

Get a sample from the dataset.

Parameters:

idx – Index of the cell to retrieve

Returns:

Tuple of (features, label) where

  • features is a tensor of shape (n_features,) for a single cell

  • label is a tensor holding the binary neoplastic label (0 or 1)

Return type:

tuple[torch.Tensor, torch.Tensor]