Changelog

v4.3.0 (2024-06-08)

✨ Features

  • model: Add check for fitted model in LGBMModel fingerprint. (f6a0933)

🐛 Bug Fixes

  • tuning: Optional enqueue_trials parameter added to fingerprint of OptunaTuner. (80fa374)

  • transformer: Update LabelEncoder to use PyArrow implementation of unique to prevent vaex bug from crashing the transformer. (85059d7)

v4.2.0 (2024-05-21)

✨ Features

  • transformer: Update ExpressionTransformer to use TypedDict instead of tuples. (3950abd)

v4.1.0 (2024-05-18)

✨ Features

  • tuning: Add support for enqueuing trials in OptunaTuner. (9e0b6b2)

  • data splitting: Add support for stratification on multiple features in the RandomSplitter. (d745434)

  • transformer: Add metadata option for the ExpressionTransformer that allows for creation of meta features not tracked in the DataSchema. (f16ea8b)

  • transformer: Add ExpressionTransformer for creating features using the vaex expression system. (c0faf74)
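
The multi-feature stratification added to RandomSplitter in this release can be pictured as grouping rows by the combination of the chosen columns and splitting each group separately. The sketch below is plain Python and not the mleko API: the function `stratified_split`, its parameters, and the list-of-dictionaries row layout are hypothetical, and the library's actual implementation may differ.

```python
import random
from collections import defaultdict


def stratified_split(rows, stratify_on, test_fraction, seed=0):
    """Split rows so every combination of `stratify_on` values keeps its proportion.

    Hypothetical helper for illustration only; not part of mleko.
    """
    rng = random.Random(seed)

    # Group rows by the tuple of values in the stratification columns.
    groups = defaultdict(list)
    for row in rows:
        groups[tuple(row[column] for column in stratify_on)].append(row)

    train, test = [], []
    # Visit groups in a fixed order so the result depends only on the seed.
    for key in sorted(groups, key=repr):
        members = groups[key]
        rng.shuffle(members)
        n_test = round(len(members) * test_fraction)
        test.extend(members[:n_test])
        train.extend(members[n_test:])
    return train, test
```

Treating the tuple of column values as a single composite class is the simplest way to extend single-column stratification; very small groups may end up entirely in one split.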

v4.0.0 (2024-05-09)

⛔️ BREAKING CHANGES

  • exporter: Add S3Exporter that implements cached S3 exporting of files from the local disk. (d17b2d2)

  • exporter: Add BaseExporter and LocalExporter implementations that support exporting data to disk, along with corresponding Pipeline steps. (6ce13cf)

✨ Features

  • exporter: Add LocalManifest support for LocalExporter which simplifies caching logic and enables S3 manifest translations. (2199ff0)

  • exporter: Add support for multiple data export using LocalExporter. (ff988b6)

  • data source: Add support for reading manifest files from S3 buckets in S3Ingester. (9c68a9b)

  • pipeline: Add disable_cache parameter to Pipeline execution. (da1e31a)

🐛 Bug Fixes

  • data cleaning: Fix newline characters breaking CSV reading using Arrow. (3a7e594)

  • tuning: Delete logging of storage URI to minimize risk of accidentally logging credentials. (054692d)

🛠️ Code Refactoring

  • data source: Extract shared S3 logic to utils which can be then used by S3Exporter. (97a7974)

v3.2.0 (2024-04-18)

✨ Features

  • tuning: Add support for RDSStorage using the OptunaTuner (cc06ddd)

🐛 Bug Fixes

  • data source: Fix bug where dataset_id consisting of path components would break local metadata file creation (17c4866)

  • model: Add verbosity parameter to BaseModel to set log level in the base class. (0a3828f)

v3.1.0 (2024-04-12)

✨ Features

  • model: Add optional memoization to datasets during model training. (#209) (2ca4465)

  • model: Add optional memoization to datasets during model training. (6a955dc)

v3.0.0 (2024-04-05)

⛔️ BREAKING CHANGES

  • model: Update LGBMModel to use dependency injection, now expects a lightgbm.LGBMModel as argument. (7250f34)

🐛 Bug Fixes

  • Switch vaex file format to Arrow instead of HDF5 for better type support. (ac8e500)

  • data cleaning: Fix bug where boolean columns are stored as numerical in the data schema due to int8 conversion. (da358d8)

v2.2.0 (2024-03-22)

✨ Features

  • filter: Add ImblearnResamplingFilter which is a wrapper for imblearn over- and under-samplers. (77a3d7d)

  • filter: Add ExpressionFilter and base class for simple DataFrame filtering using vaex expressions. (dc679ff)

  • cache: Add disable_cache argument to all cached functions to completely bypass all caching functionality. (fbdfc5d)

📝 Documentation

  • Update CHANGELOG.md format to include missing categories. (d97b32c)

v2.1.0 (2024-02-24)

✨ Features

  • Update Titanic dataset to mleko 2.0 API. (62bf991)

  • tuning: Add optuna-dashboard support to OptunaTuner including automatically generated experiment notes. (29d81c2)

  • transformer: Improve flexibility of LabelEncoderTransformer by adding optional null encoding and manual dictionary mapping. (f7b30a9)

  • Set cache_directory as optional argument, with custom default locations. (08e8777)

🐛 Bug Fixes

  • data cleaning: Fix meta_columns not being forcefully cast to correct data type in CSVToVaexConverter. (b42b9ed)

📝 Documentation

  • Update year in Copyright in README.md (#192) (eeb56e1)

🧪 Tests

  • Fix test cases generating cache directory outside temporary directory. (ba57fbf)

v2.0.0 (2024-02-07)

⛔️ BREAKING CHANGES

  • pipeline: Refactor PipelineStep to use TypedDict for both inputs and outputs. (2eb623c)

✨ Features

  • model: Refactor validation_dataframe parameter in BaseModel and LGBMModel to be optional. (d18ed29)

  • cache: Add cache support for None returns on fields using cache handlers not equipped to process None. (a489996)

  • model: Add support for custom evaluation function in LGBMModel. (4e70a55)

🐛 Bug Fixes

  • data cleaning: Rename empty column name to _empty to prevent vaex crashes. (da72b75)

  • data cleaning: Cast boolean columns to int8 during cleaning to reduce label encoding needs. (d94f7c9)

  • Added reserved keyword column name replacement to prevent evaluation errors from vaex. (3969ffd)

🛠️ Code Refactoring

  • Improve error logging messages, and update codebase to new black format. (a29ad45)

  • cache: Break out cache handler retrieval method. (aba9e41)

📝 Documentation

  • Refactor mleko package documentation to format bullet list correctly. (76ee895)

🤖 Continuous Integration

  • Remove TypeGuard and PyUpgrade from build and pre-commit. (d374406)

  • Add custom template for release notes to follow changelog structure. (30518c0)

v1.2.6 (2024-01-25)

🐛 Bug Fixes

v1.2.5 (2024-01-25)

🐛 Bug Fixes

  • Fix CHANGELOG.md template location (141c9b7)

v1.2.4 (2024-01-25)

🐛 Bug Fixes

🏗️ Build

  • semantic versioning: Update CHANGELOG.md template and semantic versioning logic. (1727e09)

v1.2.3 (2024-01-25)

🐛 Bug Fixes

  • Remove coverage from workflow (09eb09d)

v1.2.2 (2024-01-25)

🐛 Bug Fixes

  • Switch to trusted publishing (e84712d)

v1.2.1 (2024-01-25)

🐛 Bug Fixes

  • Experiment with semantic versioning (0942196)

🏗️ Build

  • 🚧 Upgrade python-gitlab to 4.4.0 (15fff07)

  • 🚧 Fix failing builds (79f7d95)

v1.2.0 (2023-10-09)

✨ Features

  • data source: ✨ Add support for pattern matching in *Ingester and add LocalManifest to index fetched files. (75974a4)

🐛 Bug Fixes

  • logging: 🐛 Fix LGBM logging routing to correct log level. (0e5fa77)

🎨 Style

  • Remove unnecessary blank lines (a06edf2)

  • ✏️ Improve logging of CSVToVaexConverter and fix typo in write_vaex_dataframe. (197e56a)

🏗️ Build

  • 🔒️ Bump gitpython to resolve CVE-2023-41040 and CVE-2023-40590. (79627bd)

v1.1.0 (2023-09-27)

✨ Features

  • tuning: ✨ Add hyperparameter tuning functionality, initially including OptunaTuner. (be38c07)

🧪 Tests

  • tuning: 🧪 Add test cases for TuneStep. (d811c7d)

v1.0.0 (2023-09-20)

⛔️ BREAKING CHANGES

  • 📝 Improve README.md with more up to date information. (b388b59)

✨ Features

  • transformer: ✨ Add DataSchema API to transformers fit, transform and fit_transform. (e053c85)

📝 Documentation

  • 📝 Add example notebook for Titanic dataset. (e651af9)

v0.8.1 (2023-09-07)

🐛 Bug Fixes

  • config: 🐛 Fix readthedocs build to only generate html. (13fc207)

v0.8.0 (2023-09-06)

✨ Features

  • model: ✨ Add LGBMModel along with base class which can be extended for all types of future models. (b47a241)

  • ✨ Add DataSchema which tracks dataset features throughout the pipeline and methods. (e03bd2c)

  • feature selection: ✨ Update BaseFeatureSelector and children to use the fit, transform and fit_transform pattern. (62e4dd1)

  • transformer: ✨ Add fit, transform and fit_transform to all Transformers, along with API and caching simplifications. (5cc4ebc)

  • cache: ✨ Add CacheHandler which allows customization of read/write functions for each cached return value individually. (609e084)

🐛 Bug Fixes

  • feature selection: 🐛 Add DataSchema as partial return from all fit methods in feature selectors. (ebf2484)

🛠️ Code Refactoring

  • cache: 🚸 Replace disable_cache with a check if cache_size=0 for LRUCacheMixin. (cfd7592)

v0.7.0 (2023-07-11)

✨ Features

  • ✨ Add fit transform support to all FeatureSelector along with refactoring the LRUCacheMixin. (3df0601)

  • ✨ Add support for separate fitting and transforming inside the pipeline. (bb9b7a4)

🐛 Bug Fixes

  • data cleaning: 🐛 Switched to HDF5 as file format for faster I/O and better SageMaker support. (61f9e42)

v0.6.1 (2023-06-30)

🐛 Bug Fixes

  • data cleaning: 🐛 Fix date32/64[day] not converted to datetime. (98f4b26)

  • data source: 🐛 Fix bug where S3 buckets with no manifest caused crash. (9078845)

🏗️ Build

  • config: 🔧 Switch mypy for pyright and update configuration. (5631aed)

v0.6.0 (2023-06-26)

✨ Features

  • cache: ✨ Add cache_group that can segment an instance cache into different isolated parts. (#66) (5fa8c9c)

  • cache: ✨ Add cache_group that can segment an instance cache into different isolated parts. (b5c3de5)

v0.5.0 (2023-06-17)

✨ Features

  • transformer: ✨ Add MinMaxScalerTransformer for normalizing numerical features. (9b26c00)

  • transformer: ✨ Add MaxAbsScalerTransformer that scales numerical features. (1fd2a93)

  • transformer: ✨ Add CompositeTransformer for chaining together multiple transformers sequentially. (006d741)

  • transformer: ✨ Add LabelEncoderTransformer for ordinal encoding. (41a4c45)

  • transformer: ✨ Add FrequencyEncoderTransformer along with support for pipeline. (465e6db)

🛠️ Code Refactoring

  • 💫 Switch to tqdm.auto to prevent breaking in Jupyter notebooks. (dc139cf)

🧪 Tests

  • ✅ Now _get_local_filenames returns a sorted list of filenames to ensure stability. (774e8eb)

v0.4.2 (2023-06-11)

🚀 Performance improvements

  • ⚡️ Optimize VarianceFeatureSelector when threshold is 0. (906dde3)

🛠️ Code Refactoring

  • ➖ Remove pandas dependency. (40e264c)

🤖 Continuous Integration

  • semantic versioning: 👷 Add more sections to changelog based on conventional commit categories. (e5b1594)

v0.4.1 (2023-06-04)

🐛 Bug Fixes

  • feature selection: 🐛 Fix FeatureSelector cache to use tuple in… (#60) (758cf5e)

  • feature selection: 🐛 Fix FeatureSelector cache to use tuple instead of frozenset to have stable fingerprint. (cd82417)
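
Why a tuple gives a more stable fingerprint than a frozenset can be shown with a small example. The iteration order of a frozenset of strings depends on string hashes, which Python randomizes per process unless PYTHONHASHSEED is fixed, so anything derived from that order can change between runs. The sketch below is illustrative only: the `fingerprint` helper and the feature names are hypothetical, and sorting the names is an assumption rather than a description of what FeatureSelector actually does.

```python
import hashlib


def fingerprint(features):
    """Hash an ordered sequence of feature names into a hex digest.

    Hypothetical helper for illustration only; not part of mleko.
    """
    # The unit separator keeps ("ab", "c") distinct from ("a", "bc").
    joined = "\x1f".join(features)
    return hashlib.md5(joined.encode("utf-8")).hexdigest()


selected = frozenset({"age", "fare", "pclass"})

# Iterating `selected` directly may yield a different order in another
# interpreter run. A sorted tuple pins the order, so the digest is reproducible.
stable_key = tuple(sorted(selected))
stable_digest = fingerprint(stable_key)
```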

v0.4.0 (2023-06-03)

✨ Features

  • feature selection: ✨ Add that filters out invariant features. (798c261)

  • feature selection: ✨ Add PearsonCorrelationFeatureSelector which drops highly correlated features. (66e5cd2)

  • feature selection: ✨ Add CompositeFeatureSelector, for chaining multiple feature selection steps on the same DataFrame. (3d75079)

  • feature selection: ✨ Add standard deviation feature selector. (c56177b)

  • feature selection: ✨ Add missing rate feature selector. (d5ba8b5)

🐛 Bug Fixes

  • 🐛 Fix typeguard breaking changes causing build to fail. (66c6a8e)

🛠️ Code Refactoring

  • 🔥 Unify dataset subpackage naming to verbs and modules to nouns. (3ffb909)

  • 🔥 Rename subpackages in dataset to singular variant. (51a8297)

  • 🔥 Refactor entire project to improve maintainability. (dd1d22c)

v0.3.1 (2023-05-21)

🐛 Bug Fixes

  • :bug: Added notes to pipeline step docstrings. (d94f899)

🛠️ Code Refactoring

  • data source: :bug: Added note to the KaggleDataSource init docstring. (d5f12d3)

🤖 Continuous Integration

  • :rocket: Removed semantic PR workflow and updated test workflow to not run on release commits. (8138745)

v0.3.0 (2023-05-21)

✨ Features

🐛 Bug Fixes

  • data splitting: :bug: Added notes and examples to splitters docstrings. (d162c86)

  • pipeline: :bug: Updated some docstrings. (56b36fd)

🤖 Continuous Integration

  • :rocket: Updated release to only trigger if the commit message does not contain chore(release). (c9f3f3f)

v0.2.0 (2023-05-21)

✨ Features

  • Add data splitting step (#53) (a668b1a)

📝 Documentation

  • Removed duplicate row. (5d77131)

  • Adding pre-commit check for conventional commits. (dd2076e)

v0.1.3 (2023-05-13)

🐛 Bug Fixes

  • cache: :bug: Cache modules exposed in subpackage init. (fd65e9d)

v0.1.2 (2023-05-13)

🐛 Bug Fixes

  • cache: :bug: Fixed LRUCacheMixin eviction test case. (ce5bfc1)

  • :bug: Temporarily disabled failing tests for cache. (9c17960)

📝 Documentation

  • :memo: Fixed sphinx-autoapi build warnings. (040963a)

v0.1.0 (2023-05-12)

✨ Features

  • data source: :sparkles: Add KaggleDataSource to download the dataset from Kaggle by providing a destination directory, owner slug, dataset slug, and necessary API credentials. (3fa07b6)

🐛 Bug Fixes

  • cache: :bug: Fixed test by not testing it… (e3a0ce9)

  • cache: :bug: Try logging using assert to fix GH issue (5e247ec)

  • cache: :bug: Attempting to fix test case failing in GH actions. (4892591)

  • cache: :bug: LRUCacheMixin now relies on file modification time instead of access time due to system limitations. (127d657)

  • :bug: Fixed docstrings for private methods in KaggleDataSource and removed xdoctest from build steps (bb55cf5)
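
The switch of LRUCacheMixin from access time to modification time can be illustrated with a generic sketch of mtime-based eviction. Access times are often unreliable, for example on filesystems mounted with noatime or relatime, so one workaround is to bump a file's modification time on every cache hit and evict the files with the oldest modification time. The functions `touch_on_hit` and `evict_oldest` below are hypothetical and are not taken from mleko; the library's real logic may differ.

```python
import os
from pathlib import Path


def touch_on_hit(path):
    """Mark a cache file as recently used by setting its times to now."""
    os.utime(path, None)


def evict_oldest(cache_dir, max_files):
    """Delete the least recently modified files until at most max_files remain.

    Returns the names of the removed files, oldest first.
    Hypothetical helper for illustration only; not part of mleko.
    """
    files = sorted(
        (p for p in Path(cache_dir).iterdir() if p.is_file()),
        key=lambda p: p.stat().st_mtime,
    )
    excess = max(len(files) - max_files, 0)
    removed = []
    for path in files[:excess]:
        path.unlink()
        removed.append(path.name)
    return removed
```

Calling `touch_on_hit` on every read keeps frequently used entries at the young end of the ordering, which gives LRU behaviour without relying on access-time bookkeeping by the filesystem.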