Miscellaneous

This section covers some miscellaneous things in py_entitymatching.

CSV Format

py_entitymatching uses the CSV format because it is well known and can be read by numerous external programs. Further, CSV files can be easily inspected and edited by users. You can read more about CSV formats here.

There are two common ways to store data in a CSV file: with attribute names in the first line, or without them. Both formats are supported by py_entitymatching.

An example of a CSV file with attribute names is shown below:

ID, name, birth_year, hourly_wage, zipcode
a1, Kevin Smith, 1989, 30, 94107
a2, Michael Franklin, 1988, 27.5, 94122
a3, William Bridge, 1988, 32, 94321

An example of a CSV file without attribute names is shown below:

a1, Kevin Smith, 1989, 30, 94107
a2, Michael Franklin, 1988, 27.5, 94122
a3, William Bridge, 1988, 32, 94321
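
As a sketch, both formats can be read with read_csv_metadata. The file paths and column names below are illustrative, and the example assumes that extra keyword arguments (such as header and names) are forwarded to pandas.read_csv:

>>> import py_entitymatching as em
>>> # CSV file with attribute names in the first line
>>> A = em.read_csv_metadata('./table_A.csv', key='ID')
>>> # CSV file without attribute names; column names are supplied explicitly
>>> B = em.read_csv_metadata('./table_B.csv', key='ID', header=None,
...                          names=['ID', 'name', 'birth_year', 'hourly_wage', 'zipcode'])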

Metadata File Format

The CSV file can be accompanied by a metadata file containing the metadata of the table, typically information such as the key, foreign keys, etc. The metadata file is expected to have the same name as the CSV file, but with a .metadata extension. For example, if the CSV file table_A.csv contains table A's data, then table_A.metadata contains table A's metadata; the association is made purely on file names. The metadata file contains key-value pairs, one per line, and each line starts with '#'.

An example of a metadata file is shown below:

#key=ID

In the above, the pair key=ID states that ID is the key attribute.
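
For instance, if a table_A.metadata file with the line above sits next to table_A.csv, reading the CSV file also populates the Catalog with the key. A minimal sketch (the file path is illustrative):

>>> A = em.read_csv_metadata('./table_A.csv')
>>> em.get_key(A)   # key attribute picked up from table_A.metadata
'ID'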

Writing a Dataframe to Disk Along With Its Metadata

To write a Dataframe to disk along with its metadata, you can use the to_csv_metadata command in py_entitymatching. An example of using to_csv_metadata is shown below:

>>> em.to_csv_metadata(A, './table_A.csv')

The above command will first write the Dataframe pointed to by A to the file table_A.csv on disk (in CSV format), and then write the metadata of table A stored in the Catalog to the file table_A.metadata on disk.

Please refer to the API reference of to_csv_metadata() for more details.

Note

Once the Dataframe is written to disk along with its metadata, it can be read back using the read_csv_metadata() command.
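
For example, the table written above can be read back, with its key restored from the accompanying .metadata file (a minimal sketch; the path matches the example above):

>>> A1 = em.read_csv_metadata('./table_A.csv')   # picks up table_A.metadata automatically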

Writing/Reading Other Types of py_entitymatching Objects

After creating a blocker or a feature table, it is desirable to persist these objects to disk for future use. py_entitymatching provides two commands for this purpose: save_object and load_object.

An example of using save_object is shown below:

>>> block_f = em.get_features_for_blocking(A, B)
>>> rb = em.RuleBasedBlocker()
>>> rb.add_rule(['name_name_lev(ltuple, rtuple) < 0.4'], block_f)
>>> em.save_object(rb, './rule_based_blocker.pkl')

load_object loads the stored object from disk. An example of using load_object is shown below:

>>> rb = em.load_object('./rule_based_blocker.pkl')
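
The loaded blocker can then be used like any other blocker. A minimal sketch (the output attributes are illustrative and assume A and B each have a name attribute):

>>> C = rb.block_tables(A, B, l_output_attrs=['name'], r_output_attrs=['name'])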

Please refer to the API reference of save_object() and load_object() for more details.