Reading and Writing Data

py_entitymatching.read_csv_metadata(file_path, **kwargs)

Reads a CSV (comma-separated values) file into a pandas DataFrame and updates the catalog with the metadata. The CSV files typically contain data for the input tables or a candidate set.

Specifically, this function first reads the CSV file from the given file path into a pandas DataFrame, using pandas’ built-in ‘read_csv’ method. Then, it updates the catalog with the metadata. There are three ways to supply the metadata: (1) using a metadata file, (2) using the key-value parameters supplied in the function call, and (3) using both a metadata file and key-value parameters.

To update the metadata in the catalog using a metadata file, the function looks for a file in the same directory with the same file name but a specific extension. The user can optionally supply this extension (it defaults to ‘.metadata’). If the metadata file is present, the function reads it and updates the catalog accordingly. If the metadata file is not present, the function issues a warning.
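The lookup described above can be sketched with the standard library alone. This is a minimal illustration, not the library's implementation: `metadata_path_for` and `load_metadata_file` are hypothetical helpers, the extension is assumed to replace the CSV extension, and the `#name = value` line format is assumed from the metadata file shown in Example 2.

```python
import os
import warnings


def metadata_path_for(csv_path, metadata_extn=".metadata"):
    """Hypothetical helper: swap the CSV extension for the metadata extension."""
    if not metadata_extn.startswith("."):
        metadata_extn = "." + metadata_extn
    base, _ = os.path.splitext(csv_path)
    return base + metadata_extn


def load_metadata_file(csv_path, metadata_extn=".metadata"):
    """Hypothetical helper: parse '#name = value' lines, or warn if the file is absent."""
    path = metadata_path_for(csv_path, metadata_extn)
    if not os.path.isfile(path):
        # Mirror the documented behavior: a missing file is a warning, not an error.
        warnings.warn("Metadata file is not present: " + path)
        return {}
    metadata = {}
    with open(path) as handle:
        for line in handle:
            line = line.strip()
            # Assumed format: each metadata line looks like '#key = id'.
            if line.startswith("#") and "=" in line:
                name, value = line[1:].split("=", 1)
                metadata[name.strip()] = value.strip()
    return metadata
```

Under these assumptions, a CSV file `A.csv` stored next to `A.metadata` containing `#key = id` would yield a dictionary with a single `key` entry.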

The metadata information can also be given as parameters to the function (see description of arguments for more details). If given, the function will update the catalog with the given information.

Further, the metadata can reside partly in the metadata file and partly in the supplied parameters. The function takes the union of the two and updates the catalog accordingly. If the same metadata is given in both the metadata file and the function call, the value given in the function call takes precedence over the value in the file.
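The precedence rule can be pictured as a dictionary merge. The sketch below is only an illustration of that rule: `merge_metadata` is a hypothetical helper, and every property name other than `key` is made up for the example.

```python
def merge_metadata(file_metadata, keyword_metadata):
    """Hypothetical helper: union of both sources; keyword arguments win on conflict."""
    merged = dict(file_metadata)      # start from what the metadata file provides
    merged.update(keyword_metadata)   # values passed to the function override it
    return merged


# 'key' appears in both sources; the other property names are illustrative.
from_file = {"key": "id", "fk_ltable": "ltable_id"}
from_kwargs = {"key": "ID", "fk_rtable": "rtable_id"}
merged = merge_metadata(from_file, from_kwargs)
```

Here `merged` keeps `fk_ltable` from the file, adds `fk_rtable` from the keyword arguments, and takes `key` from the keyword arguments because that source has priority.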

Parameters:
  • file_path (string) – The CSV file path
  • kwargs (dictionary) – A Python dictionary containing key-value arguments. A few key-value pairs are specific to read_csv_metadata; all other key-value pairs are passed to pandas’ read_csv method.
Returns:

A pandas DataFrame read from the input CSV file.

Raises:
  • AssertionError – If file_path is not of type string.
  • AssertionError – If no file exists at the given file_path.

Examples

Example 1: Read from CSV file and set metadata

>>> import py_entitymatching as em
>>> A = em.read_csv_metadata('path_to_csv_file', key='id')
>>> em.get_key(A)
 # 'id'

Example 2: Read from CSV file (with metadata file in the same directory)

Let the metadata file contain the following contents:

#key = id
>>> A = em.read_csv_metadata('path_to_csv_file')
>>> em.get_key(A)
 # 'id'

py_entitymatching.to_csv_metadata(data_frame, file_path, **kwargs)

Writes the DataFrame contents to a CSV file and the DataFrame’s metadata to a separate text file.

This function writes the DataFrame contents to a CSV file at the given file path. It uses the ‘to_csv’ method from pandas to write the CSV file. The metadata is written to a file in the same directory, with the same file name but a different extension. The user can optionally supply this extension (it defaults to ‘.metadata’).
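The write path can be pictured with a standard-library stand-in. This is a sketch under stated assumptions rather than the library's code: `write_csv_with_metadata` is a hypothetical function, rows are plain dictionaries instead of a DataFrame, and the `#name = value` metadata format is borrowed from the metadata file shown in the read_csv_metadata examples.

```python
import csv
import os


def write_csv_with_metadata(rows, fieldnames, metadata, file_path,
                            metadata_extn=".metadata"):
    """Hypothetical stand-in: write rows to a CSV file and metadata to a sibling file."""
    with open(file_path, "w", newline="") as handle:
        writer = csv.DictWriter(handle, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(rows)
    # Same directory and file name as the CSV file, different extension.
    base, _ = os.path.splitext(file_path)
    metadata_path = base + metadata_extn
    with open(metadata_path, "w") as handle:
        for name, value in metadata.items():
            # Assumed format: one '#name = value' line per metadata property.
            handle.write("#{} = {}\n".format(name, value))
    return metadata_path
```

Calling it with `file_path` set to a hypothetical `out/A.csv` and metadata `{'key': 'id'}` would produce `out/A.csv` and, next to it, `out/A.metadata`.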

Parameters:
  • data_frame (DataFrame) – The DataFrame that should be written to disk.
  • file_path (string) – The file path to which the DataFrame contents should be written. The metadata is written to a file with the same name but with the extension given by the user (defaults to ‘.metadata’).
  • kwargs (dictionary) – A Python dictionary containing key-value pairs. One key-value pair is specific to to_csv_metadata: metadata_extn, the extension with which the metadata file is written (defaults to ‘.metadata’). All other key-value pairs are passed to pandas’ to_csv function.
Returns:

A Boolean value of True is returned if the files were written successfully.

Raises:
  • AssertionError – If data_frame is not of type pandas DataFrame.
  • AssertionError – If file_path is not of type string.
  • AssertionError – If the DataFrame cannot be written to the given file_path.

Examples

>>> import pandas as pd
>>> import py_entitymatching as em
>>> A = pd.DataFrame({'id' : [1, 2], 'colA':['a', 'b'], 'colB' : [10, 20]})
>>> em.set_key(A, 'id')
>>> em.to_csv_metadata(A, 'path_to_csv_file')