Loading and Saving Objects

py_entitymatching.load_table(file_path, metadata_ext='.pklmetadata')

Loads a pickled DataFrame from a file along with its metadata.

This function loads a DataFrame from a file stored in pickle format.

In addition, this function looks for a metadata file with the same file name but with an extension given by the user (defaults to ‘.pklmetadata’). If the metadata file is present, the function updates the catalog with the metadata for that DataFrame.
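The optional-metadata behaviour described above can be sketched with the standard library alone. This is not the py_entitymatching implementation: the helper `load_with_sidecar`, the module-level `catalog` dict, and the choice to store the metadata as a pickled dict are all assumptions made for illustration, and a list of dicts stands in for the DataFrame.

```python
import os
import pickle
import tempfile

# Hypothetical stand-in for the catalog: maps id(table) -> metadata dict.
catalog = {}

def load_with_sidecar(file_path, metadata_ext='.pklmetadata'):
    """Hypothetical helper: load a pickled table and, if present, its metadata."""
    with open(file_path, 'rb') as fh:
        table = pickle.load(fh)
    # Same file name, different extension.
    meta_path = os.path.splitext(file_path)[0] + metadata_ext
    if os.path.exists(meta_path):
        # Assumption: the metadata file is itself a pickled dict.
        with open(meta_path, 'rb') as fh:
            catalog[id(table)] = pickle.load(fh)
    return table

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'A.pkl')
    with open(path, 'wb') as fh:
        pickle.dump([{'id': 1}, {'id': 2}], fh)

    # No metadata file yet: the table loads, the catalog is untouched.
    t1 = load_with_sidecar(path)
    has_meta_before = id(t1) in catalog

    # Write a metadata file next to the table and load again.
    with open(os.path.join(tmp, 'A.pklmetadata'), 'wb') as fh:
        pickle.dump({'key': 'id'}, fh)
    t2 = load_with_sidecar(path)
    meta_after = catalog.get(id(t2))
```

The point of the sketch is that a missing metadata file is not an error: the table is still returned, and only the catalog update is skipped.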

Parameters:
  • file_path (string) – The file path to load the file from.
  • metadata_ext (string) – The metadata file extension (defaults to ‘.pklmetadata’) used to generate the metadata file name.
Returns:

If the loading is successful, the function will return a pandas DataFrame read from the file. The catalog will be updated with the metadata read from the metadata file (if the file was present).

Raises:
  • AssertionError – If file_path is not of type string.
  • AssertionError – If metadata_ext is not of type string.

Examples

>>> A = em.load_table('./A.pkl')
>>> A = em.load_table('./A.pkl', metadata_ext='.pklmeta')

See also

save_table()

Note

This function differs from read_csv_metadata in two respects. First, it does not currently support reading candidate set tables, which carry more metadata than just ‘key’ (such as ltable and rtable) and for which the user is conceptually expected to provide the ltable and rtable information when calling this function (this support will be added shortly). Second, it loads a table stored in pickle format.

py_entitymatching.save_table(data_frame, file_path, metadata_ext='.pklmetadata')

Saves a DataFrame to disk along with its metadata in a pickle format.

This function saves a DataFrame to disk along with its metadata from the catalog.

Specifically, this function saves the DataFrame in the given file path, and saves the metadata in the same directory (as the file path) but with a different extension. This extension can be optionally given by the user (defaults to ‘.pklmetadata’).
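As a rough illustration of the naming rule described above, the metadata file name can be derived by swapping the extension of the file path. The helper `metadata_path_for` is hypothetical and is not part of py_entitymatching; the sketch only assumes that the metadata file shares the table's base name, as the examples in this section suggest.

```python
import os

def metadata_path_for(file_path, metadata_ext='.pklmetadata'):
    """Hypothetical helper: replace the extension of file_path with metadata_ext."""
    base, _ = os.path.splitext(file_path)
    return base + metadata_ext

default_path = metadata_path_for('./A.pkl')                            # './A.pklmetadata'
custom_path = metadata_path_for('./A.pkl', metadata_ext='.pklmeta')    # './A.pklmeta'
```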

Parameters:
  • data_frame (DataFrame) – The DataFrame that should be saved.
  • file_path (string) – The file path where the DataFrame must be stored.
  • metadata_ext (string) – The metadata extension that should be used while storing the metadata information. The default value is ‘.pklmetadata’.
Returns:

A Boolean value of True is returned if the DataFrame is successfully saved.

Raises:
  • AssertionError – If data_frame is not of type pandas DataFrame.
  • AssertionError – If file_path is not of type string.
  • AssertionError – If metadata_ext is not of type string.
  • AssertionError – If a file cannot be written in the given file_path.

Examples

>>> A = pd.DataFrame({'id' : [1, 2], 'colA':['a', 'b'], 'colB' : [10, 20]})
>>> em.save_table(A, './A.pkl') # will store two files ./A.pkl and ./A.pklmetadata
>>> A = pd.DataFrame({'id' : [1, 2], 'colA':['a', 'b'], 'colB' : [10, 20]})
>>> em.save_table(A, './A.pkl', metadata_ext='.pklmeta') # will store two files ./A.pkl and ./A.pklmeta

See also

load_table()

Note

This function differs from to_csv_metadata, which stores the DataFrame in CSV format. A CSV file can be viewed in a text editor, whereas a DataFrame stored using save_table is written in pickle format, which cannot be viewed in a text editor. save_table exists because, for larger DataFrames, pickling the DataFrame to disk is more efficient than writing it in CSV format.
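The difference between the two formats can be seen with the standard library alone. The sketch below is only an illustration, not how py_entitymatching or pandas write files: a list of dicts stands in for the DataFrame, and the standard `csv` and `pickle` modules stand in for to_csv_metadata and save_table.

```python
import csv
import io
import pickle

rows = [{'id': 1, 'colA': 'a', 'colB': 10},
        {'id': 2, 'colA': 'b', 'colB': 20}]

# CSV: human-readable text, but every value comes back as a string.
buf = io.StringIO()
writer = csv.DictWriter(buf, fieldnames=['id', 'colA', 'colB'])
writer.writeheader()
writer.writerows(rows)
csv_text = buf.getvalue()
from_csv = list(csv.DictReader(io.StringIO(csv_text)))

# Pickle: an opaque byte string, but Python types survive the round trip.
blob = pickle.dumps(rows)
from_pickle = pickle.loads(blob)
```

Here `csv_text` is ordinary text that any editor can display, while `blob` is a `bytes` object. The integer columns come back as strings from the CSV reader and as integers from pickle.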

py_entitymatching.load_object(file_path)

Loads a Python object from disk.

This function loads py_entitymatching objects from disk such as blockers, matchers, feature table, etc.

Parameters:
  • file_path (string) – The file path to load the object from.

Returns:

A Python object read from the file path.

Raises:
  • AssertionError – If file_path is not of type string.
  • AssertionError – If a file does not exist at the given file_path.

Examples

>>> rb = em.load_object('./rule_blocker.pkl')

See also

save_object()

py_entitymatching.save_object(object_to_save, file_path)

Saves a Python object to disk.

This function is intended for saving py_entitymatching objects such as rule-based blockers, feature vectors, etc. Storing these objects on disk lets a user save a workflow and resume it later.

This function takes the object to save and the file path. It pickles the object and stores it at the specified file path.
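The save-and-resume pattern described above amounts to a pickle round trip. The following is a minimal sketch using only the standard library; `save_obj` and `load_obj` are hypothetical stand-ins written for this illustration, not the library's save_object and load_object, and a plain dict stands in for a blocker.

```python
import os
import pickle
import tempfile

def save_obj(obj, file_path):
    """Hypothetical stand-in for save_object: pickle obj to file_path."""
    with open(file_path, 'wb') as fh:
        pickle.dump(obj, fh)
    return True

def load_obj(file_path):
    """Hypothetical stand-in for load_object: unpickle from file_path."""
    with open(file_path, 'rb') as fh:
        return pickle.load(fh)

# A plain dict standing in for a py_entitymatching object.
rules = {'name': 'rule1',
         'conjuncts': ['colA_colA_lev_dist(ltuple, rtuple) > 3']}

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'rule_blocker.pkl')
    saved = save_obj(rules, path)   # end of the first session
    restored = load_obj(path)       # start of a later session
```

The restored object is an equal copy, not the same object, so any state that pickle cannot serialize would have to be rebuilt after loading.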

Parameters:
  • object_to_save (Python object) – The Python object to save. This can be a rule-based blocker, feature vectors, etc.
  • file_path (string) – The file path where the object must be saved.
Returns:

A Boolean value of True is returned if the object was saved successfully.

Raises:
  • AssertionError – If file_path is not of type string.
  • AssertionError – If a file cannot be written in the given file_path.

Examples

>>> import pandas as pd
>>> import py_entitymatching as em
>>> A = pd.DataFrame({'id' : [1, 2], 'colA':['a', 'b'], 'colB' : [10, 20]})
>>> B = pd.DataFrame({'id' : [1, 2], 'colA':['c', 'd'], 'colB' : [30, 40]})
>>> rb = em.RuleBasedBlocker()
>>> block_f = em.get_features_for_blocking(A, B)
>>> rule1 = ['colA_colA_lev_dist(ltuple, rtuple) > 3']
>>> rb.add_rule(rule1, block_f) # the rule refers to features defined in block_f
>>> em.save_object(rb, './rule_blocker.pkl')

See also

load_object()
