mleko.cache.fingerprinters.csv_fingerprinter#

This module provides a fingerprinter for CSV files, both raw and gzip-compressed.

Module Contents#

Classes#

CSVFingerprinter

A fingerprinter for raw and gzip-compressed CSV files.

Attributes#

logger

The logger for the module.

mleko.cache.fingerprinters.csv_fingerprinter.logger#

The logger for the module.

class mleko.cache.fingerprinters.csv_fingerprinter.CSVFingerprinter(n_rows: int = 1000)#

Bases: mleko.cache.fingerprinters.base_fingerprinter.BaseFingerprinter

A fingerprinter for raw and gzip-compressed CSV files.

Initialize the CSVFingerprinter.

Warning

The fingerprint is computed from only the first n_rows of each CSV file; rows beyond n_rows are never read. The fingerprint therefore identifies only those first n_rows, not the whole file: two files that agree on their first n_rows but differ afterwards produce the same fingerprint.
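
The collision described in this warning can be illustrated with a small standalone sketch. It does not use mleko's code: `sample_fingerprint` is a hypothetical helper, SHA-256 is an assumed hash, and each line (including the header) is counted as one row purely for illustration.

```python
import hashlib
from itertools import islice


def sample_fingerprint(lines, n_rows):
    """Hash only the first n_rows lines (hypothetical helper, not mleko's code)."""
    digest = hashlib.sha256()
    for line in islice(lines, n_rows):
        digest.update(line.encode("utf-8"))
    return digest.hexdigest()


# Two in-memory "files" that share their first three lines and differ in the fourth.
shared_lines = ["id,value\n", "1,a\n", "2,b\n"]
file_a = shared_lines + ["3,c\n"]
file_b = shared_lines + ["3,CHANGED\n"]

# A 3-line sample never reaches the differing line, so the digests match.
same = sample_fingerprint(file_a, n_rows=3) == sample_fingerprint(file_b, n_rows=3)

# A 4-line sample includes the differing line, so the digests diverge.
different = sample_fingerprint(file_a, n_rows=4) != sample_fingerprint(file_b, n_rows=4)
```

If changes are expected near the end of a file, for example with append-only data, a larger n_rows reduces the chance of such a collision at the cost of reading more data.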

Parameters:

n_rows (int) – The number of rows to sample from each CSV file for fingerprinting.

Examples

>>> from mleko.cache.fingerprinters.csv_fingerprinter import CSVFingerprinter
>>> fingerprinter = CSVFingerprinter(n_rows=1000)
>>> fingerprinter.fingerprint(["data.csv", "data2.csv"])
"fingerprint"

fingerprint(data: list[str] | list[pathlib.Path]) → str#

Generate a fingerprint for the given list of CSV files.

The currently supported file types are .csv, .gz, and .csv.gz.

Parameters:

data (list[str] | list[pathlib.Path]) – A list of file paths to CSV files.

Returns:

The fingerprint as a hexadecimal string.

Return type:

str
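
One plausible way to turn a list of files into a single hexadecimal string is to hash each file's sampled rows and then hash the per-file digests together. The sketch below is an assumption about the general approach, not mleko's implementation: `fingerprint_one` and `fingerprint_many` are hypothetical names, SHA-256 is an assumed hash, and only uncompressed files are handled.

```python
from __future__ import annotations

import hashlib
from itertools import islice
from pathlib import Path


def fingerprint_one(path: Path, n_rows: int) -> str:
    """Hash the first n_rows lines of one uncompressed file (hypothetical helper)."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for line in islice(handle, n_rows):
            digest.update(line)
    return digest.hexdigest()


def fingerprint_many(data: list[str] | list[Path], n_rows: int = 1000) -> str:
    """Combine per-file digests into one hexadecimal string (assumed scheme)."""
    combined = hashlib.sha256()
    # Files are processed in the order given, so in this sketch
    # reordering the list changes the combined digest.
    for item in data:
        combined.update(fingerprint_one(Path(item), n_rows).encode("ascii"))
    return combined.hexdigest()
```

Accepting both `str` and `pathlib.Path` entries, as the documented signature does, only requires normalizing each entry with `Path(item)` before it is opened.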

_fingerprint_csv_file(file_path: pathlib.Path) → str#

Generate a fingerprint for a single CSV file.

Parameters:

file_path (pathlib.Path) – The file path to a CSV file.

Raises:

ValueError – If the file type is unsupported.

Returns:

The fingerprint as a hexadecimal string.

Return type:

str
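
The documented behavior for a single file (accept .csv, .gz, and .csv.gz, and raise ValueError otherwise) could be sketched as a dispatch on the file suffix. This is a minimal illustration under stated assumptions, not the library's code: `fingerprint_csv_sketch` is a hypothetical name, SHA-256 is an assumed hash, and hashing the decompressed lines is an assumed design choice.

```python
import gzip
import hashlib
from itertools import islice
from pathlib import Path


def fingerprint_csv_sketch(file_path: Path, n_rows: int = 1000) -> str:
    """Hash the first n_rows lines of a raw or gzip-compressed CSV file."""
    suffix = file_path.suffix.lower()
    if suffix == ".gz":
        # Path.suffix returns only the last extension, so this covers .gz and .csv.gz.
        opener = gzip.open
    elif suffix == ".csv":
        opener = open
    else:
        raise ValueError(f"Unsupported file type: {file_path.suffix!r}")

    digest = hashlib.sha256()
    # Binary mode yields byte lines; gzip.open decompresses transparently.
    with opener(file_path, "rb") as handle:
        for line in islice(handle, n_rows):
            digest.update(line)
    return digest.hexdigest()
```

Because the sketch hashes decompressed content, a raw CSV file and a gzip-compressed copy of it yield the same digest. Whether mleko behaves this way is not stated in this reference.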