`mleko.dataset.feature_select.missing_rate_feature_selector`

Module for the missing rate feature selector.

Module Contents

Classes

MissingRateFeatureSelector

Selects features based on the missing rate.

Attributes

logger

A module-level logger for the module.

mleko.dataset.feature_select.missing_rate_feature_selector.logger: A module-level logger for the module.

class mleko.dataset.feature_select.missing_rate_feature_selector.MissingRateFeatureSelector(missing_rate_threshold: float, features: list[str] | tuple[str, ...] | None = None, ignore_features: list[str] | tuple[str, ...] | None = None, cache_directory: str | pathlib.Path = 'data/missing-rate-feature-selector', cache_size: int = 1)

Bases: mleko.dataset.feature_select.base_feature_selector.BaseFeatureSelector

Selects features based on the missing rate.

Initializes the feature selector.

The feature selector will select all features with a missing rate below the specified threshold. The default set of features is all features in the DataFrame.

Note

Works with all types of features.

Warning

Pass any feature that must be kept regardless of its missing rate, such as the target feature or an identifier, in ignore_features.

Parameters:

missing_rate_threshold (float) – The maximum missing rate, as a fraction between 0 and 1, allowed for a feature to be selected.
features (list[str] | tuple[str, ...] | None) – Names of the features to be evaluated by the feature selector. If None, the default set of features is used.
ignore_features (list[str] | tuple[str, ...] | None) – Names of the features to be ignored by the feature selector; ignored features are always kept.
cache_directory (str | pathlib.Path) – Directory where the cache will be stored locally.
cache_size (int) – The maximum number of entries to keep in the cache.
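
The selection rule can be sketched in plain Python, independent of vaex and mleko. This is only an illustration under stated assumptions, not the library's implementation: the helpers `missing_rate` and `features_to_drop` are hypothetical, missing values are represented as `None`, and a feature is assumed to be dropped only when its missing rate is strictly greater than the threshold.

```python
from typing import Optional, Sequence


def missing_rate(values: Sequence[Optional[float]]) -> float:
    """Return the fraction of entries that are None (0.0 for an empty column)."""
    if not values:
        return 0.0
    return sum(v is None for v in values) / len(values)


def features_to_drop(
    columns: dict[str, Sequence[Optional[float]]],
    missing_rate_threshold: float,
    ignore_features: Sequence[str] = (),
) -> set[str]:
    """Hypothetical helper: names of columns whose missing rate exceeds the threshold.

    Assumption: the comparison is strict, so a rate equal to the threshold is kept.
    Ignored features are never evaluated and therefore never dropped.
    """
    ignored = set(ignore_features)
    return {
        name
        for name, values in columns.items()
        if name not in ignored and missing_rate(values) > missing_rate_threshold
    }


columns = {
    "a": [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
    "b": [1, 2, 3, 4, 5, None, None, None, None, None],
    "c": [1, 2, 3, 4, 5, 6, None, None, None, None],
}
dropped = features_to_drop(columns, missing_rate_threshold=0.3, ignore_features=["c"])
kept = [name for name in columns if name not in dropped]
```

In this sketch `b` (50% missing) is dropped, while `c` (40% missing) is kept only because it is listed in `ignore_features`.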

Examples

>>> import vaex
>>> from mleko.dataset.feature_select import MissingRateFeatureSelector
>>> from mleko.dataset.data_schema import DataSchema
>>> df = vaex.from_arrays(
...     a=[1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
...     b=[1, 2, 3, 4, 5, None, None, None, None, None],
...     c=[1, 2, 3, 4, 5, 6, None, None, None, None],
... )
>>> ds = DataSchema(numerical=["a", "b", "c"])
>>> ds, _, df = MissingRateFeatureSelector(
...     ignore_features=["c"],
...     missing_rate_threshold=0.3,
... ).fit_transform(ds, df)
>>> df.get_column_names()
['a', 'c']

_fit(data_schema: mleko.dataset.data_schema.DataSchema, dataframe: vaex.DataFrame) → tuple[mleko.dataset.data_schema.DataSchema, set[str]]

Fits the feature selector on the input data.

Parameters:

data_schema (mleko.dataset.data_schema.DataSchema) – The DataSchema of the DataFrame.
dataframe (vaex.DataFrame) – The DataFrame to fit the feature selector on.

Returns:

Updated DataSchema and the set of features with a missing rate above the threshold.

Return type:

tuple[mleko.dataset.data_schema.DataSchema, set[str]]

_transform(data_schema: mleko.dataset.data_schema.DataSchema, dataframe: vaex.DataFrame) → tuple[mleko.dataset.data_schema.DataSchema, vaex.DataFrame]

Selects features based on the missing rate.

Parameters:

data_schema (mleko.dataset.data_schema.DataSchema) – The DataSchema of the DataFrame.
dataframe (vaex.DataFrame) – The DataFrame to select features from.

Returns:

Updated DataSchema and the DataFrame with the selected features.

Return type:

tuple[mleko.dataset.data_schema.DataSchema, vaex.DataFrame]

_default_features(data_schema: mleko.dataset.data_schema.DataSchema) → tuple[str, ...]

Returns the default set of features.

Parameters:

data_schema (mleko.dataset.data_schema.DataSchema) – The DataSchema of the DataFrame.

Returns:

Tuple of default features.

Return type:

tuple[str, ...]

_fingerprint() → Hashable

Returns the fingerprint of the feature selector.

Appends the missing rate threshold to the fingerprint.

Returns:

The fingerprint of the feature selector.

Return type:

Hashable
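
To illustrate why the threshold belongs in the fingerprint, here is a minimal sketch of the pattern. Both classes are hypothetical stand-ins written for this example; they do not reproduce mleko's BaseFeatureSelector, and the contents of the base fingerprint are an assumption.

```python
from typing import Hashable, Optional, Sequence


class BaseSelectorSketch:
    """Hypothetical base class; NOT mleko's BaseFeatureSelector."""

    def __init__(
        self,
        features: Optional[Sequence[str]] = None,
        ignore_features: Optional[Sequence[str]] = None,
    ) -> None:
        self._features = tuple(features) if features is not None else None
        self._ignore_features = tuple(ignore_features or ())

    def _fingerprint(self) -> Hashable:
        # Assumed base fingerprint: class name plus the configured feature lists.
        return (type(self).__qualname__, self._features, self._ignore_features)


class MissingRateSelectorSketch(BaseSelectorSketch):
    """Hypothetical subclass that extends the base fingerprint."""

    def __init__(
        self,
        missing_rate_threshold: float,
        features: Optional[Sequence[str]] = None,
        ignore_features: Optional[Sequence[str]] = None,
    ) -> None:
        super().__init__(features, ignore_features)
        self._missing_rate_threshold = missing_rate_threshold

    def _fingerprint(self) -> Hashable:
        # Append the threshold so that selectors configured with different
        # thresholds never share a cache key.
        return (super()._fingerprint(), self._missing_rate_threshold)


strict = MissingRateSelectorSketch(0.3, ignore_features=["c"])
lenient = MissingRateSelectorSketch(0.5, ignore_features=["c"])
```

Because the fingerprint is hashable and includes the threshold, `strict` and `lenient` would map to different cache entries even though their feature lists are identical.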

fit(data_schema: mleko.dataset.data_schema.DataSchema, dataframe: vaex.DataFrame, cache_group: str | None = None, force_recompute: bool = False, disable_cache: bool = False) → tuple[mleko.dataset.data_schema.DataSchema, Any]

Fits the feature selector to the specified DataFrame, using the cached result if available.

Parameters:

data_schema (mleko.dataset.data_schema.DataSchema) – DataSchema of the DataFrame.
dataframe (vaex.DataFrame) – DataFrame to be fitted.
cache_group (str | None) – The cache group to use.
force_recompute (bool) – Whether to force the fitting to be recomputed even if the result is cached.
disable_cache (bool) – If set to True, disables the cache.

Returns:

Updated DataSchema and fitted feature selector.

Return type:

tuple[mleko.dataset.data_schema.DataSchema, Any]

transform(data_schema: mleko.dataset.data_schema.DataSchema, dataframe: vaex.DataFrame, cache_group: str | None = None, force_recompute: bool = False, disable_cache: bool = False) → tuple[mleko.dataset.data_schema.DataSchema, vaex.DataFrame]

Extracts the selected features from the DataFrame, using the cached result if available.

Parameters:

data_schema (mleko.dataset.data_schema.DataSchema) – DataSchema of the DataFrame.
dataframe (vaex.DataFrame) – DataFrame to be transformed.
cache_group (str | None) – The cache group to use.
force_recompute (bool) – Whether to force the transformation to be recomputed even if the result is cached.
disable_cache (bool) – If set to True, disables the cache.

Raises:

RuntimeError – If the feature selector has not been fitted.

Returns:

Updated DataSchema and transformed DataFrame.

Return type:

tuple[mleko.dataset.data_schema.DataSchema, vaex.DataFrame]
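
The fit-before-transform contract behind the RuntimeError can be sketched as follows. The class `FitThenTransformSketch` is hypothetical and works on plain dictionaries of columns rather than a vaex DataFrame; it only shows the control flow, with caching and the DataSchema omitted, and it assumes a strict comparison against the threshold.

```python
from typing import Optional, Sequence

Columns = dict[str, Sequence[Optional[float]]]


class FitThenTransformSketch:
    """Hypothetical selector showing the fit/transform ordering only."""

    def __init__(self, missing_rate_threshold: float) -> None:
        self._threshold = missing_rate_threshold
        self._dropped: Optional[set[str]] = None  # stays None until fit() runs

    def fit(self, columns: Columns) -> set[str]:
        # Record which columns exceed the threshold (strict comparison assumed).
        self._dropped = {
            name
            for name, values in columns.items()
            if values and sum(v is None for v in values) / len(values) > self._threshold
        }
        return self._dropped

    def transform(self, columns: Columns) -> Columns:
        if self._dropped is None:
            raise RuntimeError("The feature selector must be fitted before transform.")
        return {name: values for name, values in columns.items() if name not in self._dropped}


data = {"a": [1, 2, 3, 4], "b": [1, None, None, None]}
selector = FitThenTransformSketch(missing_rate_threshold=0.3)

try:
    selector.transform(data)  # not fitted yet
    raised = False
except RuntimeError:
    raised = True

selector.fit(data)
selected = selector.transform(data)
```

Calling `fit_transform` on the real class avoids this ordering concern, since fitting and transformation happen in one call.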

fit_transform(data_schema: mleko.dataset.data_schema.DataSchema, dataframe: vaex.DataFrame, cache_group: str | None = None, force_recompute: bool = False, disable_cache: bool = False) → tuple[mleko.dataset.data_schema.DataSchema, Any, vaex.DataFrame]

Fits the feature selector to the specified DataFrame and extracts the selected features from the DataFrame.

Parameters:

data_schema (mleko.dataset.data_schema.DataSchema) – DataSchema of the DataFrame.
dataframe (vaex.DataFrame) – DataFrame to be fitted and transformed.
cache_group (str | None) – The cache group to use.
force_recompute (bool) – Whether to force the fitting and transformation to be recomputed even if the result is cached.
disable_cache (bool) – If set to True, disables the cache.

Returns:

Tuple of updated DataSchema, fitted feature selector, and transformed DataFrame.

Return type:

tuple[mleko.dataset.data_schema.DataSchema, Any, vaex.DataFrame]

_fit_transform(data_schema: mleko.dataset.data_schema.DataSchema, dataframe: vaex.DataFrame) → tuple[mleko.dataset.data_schema.DataSchema, Any, vaex.DataFrame]

Fits the feature selector to the specified DataFrame and extracts the selected features from the DataFrame.

Parameters:

data_schema (mleko.dataset.data_schema.DataSchema) – DataSchema of the DataFrame.
dataframe (vaex.DataFrame) – DataFrame used for feature selection.

Returns:

Tuple of updated DataSchema, fitted feature selector, and transformed DataFrame.

Return type:

tuple[mleko.dataset.data_schema.DataSchema, Any, vaex.DataFrame]

_assign_feature_selector(feature_selector: Any) → None

Assigns the specified feature selector to the feature_selector attribute.

Can be overridden by subclasses to assign the feature selector using a different method.

Parameters:

feature_selector (Any) – Feature selector to be assigned.

Return type:

None

_feature_set(data_schema: mleko.dataset.data_schema.DataSchema) → list[str]

Returns the list of features to be used by the feature selector.

If the features argument is None, this is the default set of features minus the features to be ignored; otherwise it is the list of names given in the features argument.

Parameters:

data_schema (mleko.dataset.data_schema.DataSchema) – DataSchema of the DataFrame.

Returns:

Sorted list of feature names to be used by the feature selector.

Return type:

list[str]
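
The resolution rule described for _feature_set can be sketched as a standalone function. The name `resolve_feature_set` and its parameters are hypothetical; the sketch follows the description literally and therefore assumes that an explicit features list is used as given, without subtracting the ignored features.

```python
from typing import Optional, Sequence


def resolve_feature_set(
    default_features: Sequence[str],
    features: Optional[Sequence[str]] = None,
    ignore_features: Optional[Sequence[str]] = None,
) -> list[str]:
    """Hypothetical stand-in for the feature-set resolution described above."""
    if features is not None:
        # An explicit feature list takes precedence over the defaults.
        return sorted(features)
    ignored = set(ignore_features or ())
    # Otherwise: the default features minus the ignored ones, sorted by name.
    return sorted(name for name in default_features if name not in ignored)


from_defaults = resolve_feature_set(["c", "a", "b"], ignore_features=["c"])
from_explicit = resolve_feature_set(["a", "b", "c"], features=["c", "a"])
```

Sorting the result makes the feature set independent of the order in which names were supplied.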
