Creating Features for Blocking

Recall that when doing blocking, you can use built-in blockers, blackbox blockers, or rule-based blockers. For rule-based blockers, you have to create a set of features. While creating features, you will have to refer to tokenizers, similarity functions, and attributes of the tables. Currently, in py_entitymatching, there are two ways to create features:

  • Automatically generate a set of features (then you can remove or add some more).
  • Skip the automatic process and generate features manually.

Note that features will also be used in the matching process, as we will discuss later.

If you are interested in just letting the system automatically generate a set of features, please see Generating Features Automatically.

If you want to generate features on your own, please read below.

Available Tokenizers and Similarity Functions

A tokenizer is a function that takes a string and optionally a number of other arguments, then tokenizes the string and returns a set of tokens. Currently, the following tokenizers are provided along with py_entitymatching:

  • Alphabetic
  • Alphanumeric
  • White space
  • Delimiter based
  • Qgram based

A similarity function takes two arguments (which can be strings, numeric values, etc.), typically two attribute values such as two book titles, and returns an output value, typically a similarity score between the two attribute values. Currently, the following similarity functions are provided along with py_entitymatching:

  • Affine
  • Hamming distance
  • Jaro
  • Jaro-Winkler
  • Levenshtein
  • Monge-Elkan
  • Needleman-Wunsch
  • Smith-Waterman
  • Jaccard
  • Cosine
  • Dice
  • Overlap coefficient
  • Exact match
  • Absolute norm

Obtaining Tokenizers and Similarity Functions

First, you need to get the tokenizers and similarity functions so that you can refer to them in features. In py_entitymatching, you can use get_tokenizers_for_blocking to get all the tokenizers available for blocking purposes.

>>> block_t = em.get_tokenizers_for_blocking()

In the above, block_t is a dictionary where keys are tokenizer names and values are tokenizer functions in Python. You can inspect block_t and delete/add tokenizers as appropriate. The above command returns single-argument tokenizers, i.e., those that take a string and produce a set of tokens.
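
For example, you can inspect the returned keys and drop a tokenizer you do not want features for. This is a minimal sketch; the exact set of keys may vary across versions:

>>> sorted(block_t.keys())
['alphabetic', 'alphanumeric', 'dlm_dc0', 'qgm_2', 'qgm_3', 'wspace']
>>> del block_t['qgm_2']  # stop generating 2-gram based features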

Each key of the default dictionary returned by get_tokenizers_for_blocking represents a tokenizer that can be used by similarity functions. The keys and the tokenizers they represent are shown below:

  • alphabetic: Alphabetic tokenizer
  • alphanumeric: Alphanumeric tokenizer
  • dlm_dc0: Delimiter tokenizer using spaces as the delimiter
  • qgm_2: 2-gram tokenizer
  • qgm_3: 3-gram tokenizer
  • wspace: Whitespace tokenizer

Please look at the API reference of get_tokenizers_for_blocking() for more details.

Similarly, you can use get_sim_funs_for_blocking to get all the similarity functions available for blocking purposes.

>>> block_s = em.get_sim_funs_for_blocking()

In the above, block_s is a dictionary where keys are similarity function names and values are similarity functions in Python. Similar to block_t, you can inspect block_s and delete/add similarity functions as appropriate.

Each key of the default dictionary returned by get_sim_funs_for_blocking represents a similarity function. The keys and the similarity functions they represent are shown below:

  • abs_norm: Absolute Norm
  • affine: Affine
  • cosine: Cosine Similarity
  • dice: Dice Similarity Coefficient
  • exact_match: Exact Match
  • hamming_dist: Hamming Distance
  • hamming_sim: Hamming Similarity
  • jaccard: Jaccard Similarity
  • jaro: Jaro Distance
  • jaro_winkler: Jaro-Winkler Distance
  • lev_dist: Levenshtein Distance
  • lev_sim: Levenshtein Similarity
  • monge_elkan: Monge-Elkan Algorithm
  • needleman_wunsch: Needleman-Wunsch Algorithm
  • overlap_coeff: Overlap Coefficient
  • rel_diff: Relative Difference
  • smith_waterman: Smith-Waterman Algorithm

Please look at the API reference of get_sim_funs_for_blocking() for more details.

Obtaining Attribute Types and Correspondences

In the next step, you need to obtain type and correspondence information about A and B so that the features can be generated.

First, you need to obtain the types of attributes in A and B, so that the right tokenizers/similarity functions can be applied to each of them. In py_entitymatching, you can use get_attr_types to get the attribute types. An example of using get_attr_types is shown below:

>>> atypes1 = em.get_attr_types(A)
>>> atypes2 = em.get_attr_types(B)

In the above, atypes1 and atypes2 are dictionaries. They contain the type of each attribute in the respective tables. Note that these types are different from basic Python types. Please look at the API reference of get_attr_types() for more details.
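
For instance, you can peek at the inferred type of an attribute. The attribute names here ('title', 'year') are hypothetical, and the type labels in the comments are only illustrative of py_entitymatching's internal type names:

>>> atypes1['year']     # e.g., 'numeric'
>>> atypes1['title']    # e.g., 'str_bt_1w_5w' (a string type, not a basic Python type)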

Next, you need to obtain correspondences between the attributes of A and B, so that features can be generated based on those correspondences. In py_entitymatching, you can use get_attr_corres to get the attribute correspondences.

An example of using get_attr_corres is shown below:

>>> block_c = em.get_attr_corres(A, B)

In the above, block_c is a dictionary containing attribute correspondences. Currently, py_entitymatching returns attribute correspondences only based on the exact match of attribute names. You can inspect block_c and modify the attribute correspondences. Please look at the API reference of get_attr_corres() for more details.
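
For example, assuming the correspondences are stored under a 'corres' key as a list of (A attribute, B attribute) pairs (a sketch with hypothetical attribute names; see the get_attr_corres() API reference for the exact structure):

>>> block_c['corres']   # e.g., [('ID', 'ID'), ('name', 'name'), ('address', 'address')]
>>> # add a correspondence that exact name matching would miss
>>> block_c['corres'].append(('zipcode', 'zip'))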

Getting a Set of Features

Recall that so far we have obtained:

  • block_t, the set of tokenizers
  • block_s, the set of similarity functions
  • atypes1 and atypes2, the types of attributes in A and B
  • block_c, the correspondences of attributes in A and B

Next, to obtain a set of features, you can use the get_features command. An example is shown below:

>>> block_f = em.get_features(A, B, atypes1, atypes2, block_c, block_t, block_s)

Briefly, this function goes through the correspondences. For each correspondence, it examines the types of the involved attributes, then applies the appropriate tokenizers and similarity functions to generate all appropriate features for that correspondence. The features are returned as a Dataframe. Please look at the API reference of get_features() for more details.
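
For a quick look at what was generated, you can use standard pandas inspection. The 'feature_name' column used below is an assumption about the feature table's schema; check the get_features() API reference to confirm:

>>> len(block_f)              # number of generated features
>>> block_f['feature_name']   # names of the generated features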

Adding/Removing Features

Given the set of features block_f as a pandas Dataframe, you can delete certain features and add new ones.

Deleting a feature is straightforward: all you have to do is delete the row of the feature table corresponding to the feature. You can use the drop command of a pandas Dataframe for this purpose. Please look at the pandas API reference of drop for more details.
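
For example, to delete a feature by name (a sketch that again assumes a 'feature_name' column in the feature table, as in the inspection example above):

>>> to_drop = block_f[block_f['feature_name'] == 'name_name_jac_qgm3_qgm3'].index
>>> block_f = block_f.drop(to_drop)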

There are two ways to create and add a feature: (1) write a blackbox function and add it to the feature table, and (2) define a feature declaratively and add it to the feature table.

Adding a Blackbox Function as Feature

To create and add a blackbox function as a feature, first you must define it. Specifically, the function must take in two tuples as input and return a numeric value. An example of a blackbox function is shown below:

def age_diff(ltuple, rtuple):
    # assume that the tuples have age attribute and values are valid numbers.
    return ltuple['age'] - rtuple['age']

Then add it to the feature table block_f using add_blackbox_feature like this:

>>> status = em.add_blackbox_feature(block_f, 'age_difference', age_diff)

Please look at the API reference of add_blackbox_feature() for more details.
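
As a quick sanity check, you can call the blackbox function directly on two hand-built tuples. Here pandas Series objects stand in for the tuples that py_entitymatching will pass in at runtime:

>>> import pandas as pd
>>> ltuple = pd.Series({'age': 35})
>>> rtuple = pd.Series({'age': 30})
>>> age_diff(ltuple, rtuple)   # returns 5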

Adding a Feature Declaratively

Another way to add features is to write a feature expression in a declarative way. py_entitymatching will then compile it into a feature. For example, you can declaratively create and add a feature like this:

>>> r = em.get_feature_fn('jaccard(qgm_3(ltuple["name"]), qgm_3(rtuple["name"]))', block_t, block_s)
>>> em.add_feature(block_f, 'name_name_jac_qgm3_qgm3', r)

Here block_t and block_s refer to the dictionaries containing the tokenizers and similarity functions for blocking. Additionally, ‘jaccard’ refers to the key in ‘block_s’ that represents the Jaccard similarity function, and ‘qgm_3’ refers to the key in ‘block_t’ that represents the 3-gram tokenizer. The keys in ‘block_t’ and ‘block_s’ and the functions or tokenizers they represent are explained above in the Obtaining Tokenizers and Similarity Functions section.

Conceptually, the first command, get_feature_fn, creates a feature, which is a Python function that takes two tuples ltuple and rtuple, gets the attribute name from each of them, tokenizes the two values, and then computes the Jaccard score.

Note

The feature must refer to the tuple from the left table (say A) as ltuple and the tuple from the right table (say B) as rtuple.

The second command, add_feature, tags the feature with the specified name and adds it to the feature table.

As described, the feature that was just created is independent of any table (e.g., A and B). Instead, it expects two tuples, ltuple and rtuple, as input.
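
To see this, you can evaluate the feature on two hand-built tuples. This sketch assumes that the object returned by get_feature_fn exposes the compiled Python function under a 'function' key; check the get_feature_fn() API reference to confirm:

>>> import pandas as pd
>>> ltuple = pd.Series({'name': 'John Smith'})
>>> rtuple = pd.Series({'name': 'John R. Smith'})
>>> r['function'](ltuple, rtuple)   # Jaccard score over the 3-gram sets of the two names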

You can also create more complex features. Specifically, you are allowed to define arbitrarily complex expressions involving function names from block_t and block_s, and attribute names from ltuple and rtuple.

>>> r = em.get_feature_fn('jaccard(qgm_3(ltuple.address + ltuple.zipcode), qgm_3(rtuple.address + rtuple.zipcode))', block_t, block_s)
>>> em.add_feature(block_f, 'full_address_address_jac_qgm3_qgm3', r)

You can also create your own similarity functions and tokenizers for your custom features. For example, you can create a similarity function that changes all strings to lowercase before checking if they are equivalent.

>>> # This similarity function converts the two strings to lowercase before checking if they are an exact match
>>> def match_lowercase(l_attr, r_attr):
...     l_attr = l_attr.lower()
...     r_attr = r_attr.lower()
...     if l_attr == r_attr:
...         return 1
...     else:
...         return 0

You can then add a feature declaratively with your new similarity function.

>>> # The new similarity function is added to block_s and then a new feature is created
>>> block_t = em.get_tokenizers_for_blocking()
>>> block_s = em.get_sim_funs_for_blocking()
>>> block_s['match_lowercase'] = match_lowercase
>>> r = em.get_feature_fn('match_lowercase(ltuple["name"], rtuple["name"])', block_t, block_s)
>>> em.add_feature(block_f, 'name_name_match_lowercase', r)

It is also possible to create features with your own similarity functions that require tokenizers. The next example shows how to create a custom tokenizer that returns only the first and last words of a string.

>>> # This custom tokenizer returns the first and last words of a string
>>> def first_last_tok(attr):
...     all_toks = attr.split(" ")
...     toks = [all_toks[0], all_toks[-1]]
...     return toks

Next, a similarity function that can utilize the new tokenizer is created. This example shows how to create a similarity function that raises the score by two if the first tokens match and by one if the last tokens match.

>>> # This similarity function compares two tokens from each set.
>>> # Greater weight is placed on the equality of the first token.
>>> def first_last_sim(l_attr, r_attr):
...     score = 0
...     if l_attr[0] == r_attr[0]:
...         score += 2
...     if l_attr[1] == r_attr[1]:
...         score += 1
...     return score
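
With these definitions, the scoring behaves as follows: both the first and last tokens match in the first pair (2 + 1 = 3), while only the last token matches in the second (1):

>>> first_last_sim(first_last_tok('John Q Smith'), first_last_tok('John Smith'))
3
>>> first_last_sim(first_last_tok('John Smith'), first_last_tok('Jane Smith'))
1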

Finally, with the tokenizer and similarity functions defined, the new feature can be created and added.

>>> # The new tokenizer is added to block_t and the new similarity function is added to block_s
>>> # then a new feature is created
>>> block_t = em.get_tokenizers_for_blocking()
>>> block_t['first_last_tok'] = first_last_tok
>>> block_s = em.get_sim_funs_for_blocking()
>>> block_s['first_last_sim'] = first_last_sim
>>> r = em.get_feature_fn('first_last_sim(first_last_tok(ltuple["name"]), first_last_tok(rtuple["name"]))',
...                       block_t, block_s)
>>> em.add_feature(block_f, 'name_name_fls_flt_flt', r)

Please look at the API reference of get_feature_fn() and add_feature() for more details.

Summary of the Manual Feature Generation Process

Here is the summary of commands for the entire manual feature generation process.

To generate features, you must execute the following commands:

>>> block_t = em.get_tokenizers_for_blocking()
>>> block_s = em.get_sim_funs_for_blocking()
>>> atypes1 = em.get_attr_types(A)
>>> atypes2 = em.get_attr_types(B)
>>> block_c = em.get_attr_corres(A, B)
>>> block_f = em.get_features(A, B, atypes1, atypes2, block_c, block_t, block_s)

The variable block_f points to a Dataframe containing features as rows.

Ways to Edit the Manual Feature Generation Process

Here is a summary of ways to edit the variables used in the feature generation process.

  • The variables block_t, block_s, atypes1, atypes2, and block_c are dictionaries. You can modify them as needed to add/remove tokenizers, similarity functions, attribute correspondences, etc.

  • block_f is a Dataframe. You can remove a feature by just deleting the corresponding row from the Dataframe.

  • There are two ways to create and add a feature: (1) write a blackbox function and add it to the feature table, and (2) define the feature declaratively and add it to the feature table. To add a blackbox feature, first write a blackbox function like this:

    def age_diff(ltuple, rtuple):
        # assume that the tuples have age attribute and values are valid numbers.
        return ltuple['age'] - rtuple['age']
    

    Then add it to the table block_f using add_blackbox_feature like this:

    >>> status = em.add_blackbox_feature(block_f, 'age_difference', age_diff)
    

    To add a feature declaratively, first write a feature expression and compile it into a feature using get_feature_fn like this:

    >>> r = em.get_feature_fn('jaccard(qgm_3(ltuple.address + ltuple.zipcode), qgm_3(rtuple.address + rtuple.zipcode))', block_t, block_s)
    

    Then add it to the table block_f using add_feature like this:

    >>> em.add_feature(block_f, 'full_address_address_jac_qgm3_qgm3', r)
    

Generating Features Automatically

Recall that to get the features for blocking, eventually you must execute the following:

>>> block_f = em.get_features(A, B, atypes1, atypes2, block_c, block_t, block_s)

where atypes1/atypes2 are the attribute types of A and B, block_c is the correspondences between their attributes, block_t is the set of tokenizers, and block_s is the set of similarity functions.

If you don’t want to go through the hassle of creating these intermediate variables, then you can execute the following:

>>> block_f = em.get_features_for_blocking(A, B)

The system will automatically generate a set of features and return it as a Dataframe, which you can then use for blocking purposes. This Dataframe contains a few attributes that require further explanation, specifically ‘left_attr_tokenizer’, ‘right_attr_tokenizer’, and ‘simfunction’. There are two types of similarity functions: those that use tokenizers and those that do not. For features whose similarity function uses tokenizers, ‘left_attr_tokenizer’ and ‘right_attr_tokenizer’ designate the tokenizers applied to the left table attribute and the right table attribute, respectively. The ‘simfunction’ attribute is the name of the similarity function and comes from the keys in ‘block_s’. The keys and the functions they correspond to are explained in the Obtaining Tokenizers and Similarity Functions section above.
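
For example, since the ‘simfunction’ values come from the keys in ‘block_s’, you can filter the generated feature table down to a single similarity function:

>>> # keep only the Jaccard-based features
>>> block_f[block_f['simfunction'] == 'jaccard']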

The command get_features_for_blocking will set the following variables: _block_t, _block_s, _atypes1, _atypes2, and _block_c. You can access these variables like this:

>>> em._block_t
>>> em._block_s
>>> em._atypes1
>>> em._atypes2
>>> em._block_c

You can examine these variables, modify them as appropriate, and then re-generate the set of features using the get_features command, as shown below.
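
A minimal sketch of that edit-and-regenerate loop, reusing the get_features command from above:

>>> block_c = em._block_c
>>> # ... modify block_c (or em._block_t, em._block_s, etc.) as needed ...
>>> block_f = em.get_features(A, B, em._atypes1, em._atypes2, block_c, em._block_t, em._block_s)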

Please look at the API reference of get_features_for_blocking() for more details.
