.. _label-create-features-blocking:

==============================
Creating Features for Blocking
==============================

Recall that when doing blocking, you can use built-in blockers, blackbox blockers, or rule-based blockers. For rule-based blockers, you have to create a set of features. While creating features, you will have to refer to tokenizers, similarity functions, and attributes of the tables. Currently, in py_entitymatching, there are two ways to create features:

* Automatically generate a set of features (then you can remove or add some more).
* Skip the automatic process and generate features manually.

Note that features will also be used in the matching process, as we will discuss later.

.. The set of features for blocking and the set of features for matching can be quite different, however. For example, for blocking we may only want features that are inexpensive to compute.

If you are interested in just letting the system automatically generate a set of features, then please see :ref:`label-gen-feats-automatically`. If you want to generate features on your own, please read below.

Available Tokenizers and Similarity Functions
---------------------------------------------

A tokenizer is a function that takes a string (and optionally a number of other arguments), tokenizes the string, and returns a set of tokens. Currently, the following tokenizers are provided along with *py_entitymatching*:

* Alphabetic
* Alphanumeric
* Whitespace
* Delimiter based
* Qgram based

A similarity function takes two arguments (strings, numeric values, etc.), which are typically two attribute values such as two book titles, and returns an output value, typically a similarity score between the two attribute values. Currently, the following similarity functions are provided along with *py_entitymatching*:

* Affine
* Hamming distance
* Jaro
* Jaro-Winkler
* Levenshtein
* Monge-Elkan
* Needleman-Wunsch
* Smith-Waterman
* Jaccard
* Cosine
* Dice
* Overlap coefficient
* Exact match
* Absolute norm

Obtaining Tokenizers and Similarity Functions
---------------------------------------------

First, you need to get tokenizers and similarity functions so that you can refer to them in features. In py_entitymatching, you can use `get_tokenizers_for_blocking` to get all the tokenizers available for blocking purposes.

>>> block_t = em.get_tokenizers_for_blocking()

In the above, `block_t` is a dictionary where the keys are tokenizer names and the values are tokenizer functions in Python. You can inspect `block_t` and delete/add tokenizers as appropriate. The above command returns single-argument tokenizers, i.e., those that take a string and produce a set of tokens. Each key of the dictionary returned to `block_t` by `get_tokenizers_for_blocking` represents a tokenizer that can be used by similarity functions. The keys and the tokenizers they represent are shown below:

* alphabetic: Alphabetic tokenizer
* alphanumeric: Alphanumeric tokenizer
* dlm_dc0: Delimiter tokenizer using spaces as the delimiter
* qgm_2: Two gram tokenizer
* qgm_3: Three gram tokenizer
* wspace: Whitespace tokenizer

Please look at the API reference of :py:meth:`~py_entitymatching.get_tokenizers_for_blocking` for more details.
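For example, you might inspect the returned dictionary and drop a tokenizer you do not want features generated for. A minimal sketch (the exact set of keys may vary across versions):

>>> block_t = em.get_tokenizers_for_blocking()
>>> sorted(block_t.keys())
['alphabetic', 'alphanumeric', 'dlm_dc0', 'qgm_2', 'qgm_3', 'wspace']
>>> del block_t['qgm_2']  # remove the two gram tokenizer from consideration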
Similarly, you can use `get_sim_funs_for_blocking` to get all the similarity functions available for blocking purposes.

>>> block_s = em.get_sim_funs_for_blocking()

In the above, `block_s` is a dictionary where the keys are similarity function names and the values are similarity functions in Python. Similar to `block_t`, you can inspect `block_s` and delete/add similarity functions as appropriate. Each key of the dictionary returned to `block_s` by `get_sim_funs_for_blocking` represents a similarity function. The keys and the similarity functions they represent are shown below:

* abs_norm: Absolute norm
* affine: Affine transformation
* cosine: Cosine similarity
* dice: Dice similarity coefficient
* exact_match: Exact match
* hamming_dist: Hamming distance
* hamming_sim: Hamming similarity
* jaccard: Jaccard similarity
* jaro: Jaro distance
* jaro_winkler: Jaro-Winkler distance
* lev_dist: Levenshtein distance
* lev_sim: Levenshtein similarity
* monge_elkan: Monge-Elkan algorithm
* needleman_wunsch: Needleman-Wunsch algorithm
* overlap_coeff: Overlap coefficient
* rel_diff: Relative difference
* smith_waterman: Smith-Waterman algorithm

Please look at the API reference of :py:meth:`~py_entitymatching.get_sim_funs_for_blocking` for more details.

Obtaining Attribute Types and Correspondences
---------------------------------------------

In the next step, you need to obtain type and correspondence information about A and B so that features can be generated. First, you need to obtain the types of the attributes in A and B, so that the right tokenizers/similarity functions can be applied to each of them. In py_entitymatching, you can use `get_attr_types` to get the attribute types. An example of using `get_attr_types` is shown below:

>>> atypes1 = em.get_attr_types(A)
>>> atypes2 = em.get_attr_types(B)

In the above, `atypes1` and `atypes2` are dictionaries. They contain the type of each attribute in the two tables. Note that this `type` is different from the basic Python types. Please look at the API reference of :py:meth:`~py_entitymatching.get_attr_types` for more details.

Next, you need to obtain correspondences between the attributes of A and B, so that features can be generated based on those correspondences. In py_entitymatching, you can use `get_attr_corres` to get the attribute correspondences. An example of using `get_attr_corres` is shown below:

>>> block_c = em.get_attr_corres(A, B)

In the above, `block_c` is a dictionary containing the attribute correspondences. Currently, py_entitymatching finds attribute correspondences only based on an exact match of attribute names. You can inspect `block_c` and modify the attribute correspondences. Please look at the API reference of :py:meth:`~py_entitymatching.get_attr_corres` for more details.

.. _label-get-a-set-of-features-manual:

Getting a Set of Features
-------------------------

Recall that so far we have obtained:

* `block_t`, the set of tokenizers,
* `block_s`, the set of similarity functions,
* `atypes1` and `atypes2`, the types of attributes in A and B, and
* `block_c`, the correspondences of attributes in A and B.

Next, to obtain a set of features, you can use the `get_features` command. An example of using the `get_features` command is shown below:

>>> block_f = em.get_features(A, B, atypes1, atypes2, block_c, block_t, block_s)

Briefly, this function goes through the correspondences. For each correspondence, it examines the types of the involved attributes, then applies the appropriate tokenizers and similarity functions to generate all appropriate features for that correspondence. The features are returned as a Dataframe.
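As a quick check, you can inspect what was generated. A small sketch, assuming the feature table names each feature in a 'feature_name' column (the tokenizer and similarity-function columns are explained in the Generating Features Automatically section below):

>>> block_f.columns  # includes 'feature_name', 'left_attr_tokenizer', 'right_attr_tokenizer', and 'simfunction'
>>> block_f['feature_name'].head()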
Please look at the API reference of :py:meth:`~py_entitymatching.get_features` for more details.

.. _label-add-remove-features:

Adding/Removing Features
------------------------

Given the set of features `block_f` as a pandas Dataframe, you can delete certain features and add new ones. Deleting a feature is straightforward: all you have to do is delete the corresponding row from the feature table. You can use the `drop` command of a pandas Dataframe for this purpose. Please look at the pandas `API reference <https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.drop.html>`_ for more details.

There are two ways to create and add a feature: (1) write a blackbox function and add it to the feature table, and (2) define a feature declaratively and add it to the feature table.

**Adding a Blackbox Function as a Feature**

To create and add a blackbox function as a feature, first you must define it. Specifically, the function must take two tuples as input and return a numeric value. An example of a blackbox function is shown below:

::

    def age_diff(ltuple, rtuple):
        # assume that the tuples have an age attribute and the values are valid numbers
        return ltuple['age'] - rtuple['age']

Then add it to the feature table `block_f` using `add_blackbox_feature` like this:

>>> status = em.add_blackbox_feature(block_f, 'age_difference', age_diff)

Please look at the API reference of :py:meth:`~py_entitymatching.add_blackbox_feature` for more details.

**Adding a Feature Declaratively**

Another way to add a feature is to write a feature expression in a `declarative` way. py_entitymatching will then compile it into a feature. For example, you can declaratively create and add a feature like this:

>>> r = em.get_feature_fn('jaccard(qgm_3(ltuple["name"]), qgm_3(rtuple["name"]))', block_t, block_s)
>>> em.add_feature(block_f, 'name_name_jac_qgm3_qgm3', r)

Here `block_t` and `block_s` refer to the dictionaries containing the tokenizers and similarity functions for blocking. Additionally, 'jaccard' refers to the key in `block_s` that represents the Jaccard similarity function and 'qgm_3' refers to the key in `block_t` that represents the three gram tokenizer. The keys in `block_t` and `block_s`, and the functions and tokenizers they represent, are explained above in the Obtaining Tokenizers and Similarity Functions section.

Conceptually, the first command, `get_feature_fn`, creates a feature, which is a Python function that takes two tuples `ltuple` and `rtuple`, gets the attribute `name` from each of them, tokenizes the two values into three grams, and computes the Jaccard score between the two token sets.

.. note:: The feature must refer to the tuple from the left table (say A) as **ltuple** and the tuple from the right table (say B) as **rtuple**.

The second command, `add_feature`, tags the feature with the specified name and adds it to the feature table. As described, the feature that was just created is *independent* of any particular tables (e.g., A and B). Instead, it expects two tuples, ltuple and rtuple, as input.

You can also create more complex features. Specifically, you are allowed to define arbitrarily complex expressions involving function names from `block_t` and `block_s` and attribute names from ltuple and rtuple:

>>> r = em.get_feature_fn('jaccard(qgm_3(ltuple.address + ltuple.zipcode), qgm_3(rtuple.address + rtuple.zipcode))', block_t, block_s)
>>> em.add_feature(block_f, 'full_address_address_jac_qgm3_qgm3', r)
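For intuition, the score this feature produces can be computed by hand from the same building blocks. A minimal sketch, assuming `ltuple` and `rtuple` are two tuples (pandas Series) in scope and that the dictionary entries of `block_t` and `block_s` can be called directly, as described above:

>>> # Concatenate address and zipcode on each side, tokenize into three grams,
>>> # then compute the Jaccard score between the two token sets
>>> l_toks = block_t['qgm_3'](ltuple['address'] + ltuple['zipcode'])
>>> r_toks = block_t['qgm_3'](rtuple['address'] + rtuple['zipcode'])
>>> score = block_s['jaccard'](l_toks, r_toks)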
You can also create your own similarity functions and tokenizers for your custom features. For example, you can create a similarity function that converts both strings to lowercase before checking whether they are an exact match:

::

    # This similarity function converts the two strings to lowercase
    # before checking if they are an exact match.
    def match_lowercase(l_attr, r_attr):
        l_attr = l_attr.lower()
        r_attr = r_attr.lower()
        if l_attr == r_attr:
            return 1
        else:
            return 0

You can then add a feature declaratively with your new similarity function:

>>> # The new similarity function is added to block_s and then a new feature is created
>>> block_t = em.get_tokenizers_for_blocking()
>>> block_s = em.get_sim_funs_for_blocking()
>>> block_s['match_lowercase'] = match_lowercase
>>> r = em.get_feature_fn('match_lowercase(ltuple["name"], rtuple["name"])', block_t, block_s)
>>> em.add_feature(block_f, 'name_name_match_lowercase', r)

It is also possible to create features with your own similarity functions that require tokenizers. The next example shows how to create a custom tokenizer that returns only the first and last words of a string:

::

    # This custom tokenizer returns the first and last words of a string.
    def first_last_tok(attr):
        all_toks = attr.split(' ')
        return [all_toks[0], all_toks[-1]]

Next, a similarity function that can utilize the new tokenizer is created. This example shows a similarity function that raises the score by two if the first words match and by one if the last words match:

::

    # This similarity function compares the two tokens from each set.
    # Greater weight is placed on the equality of the first token.
    def first_last_sim(l_attr, r_attr):
        score = 0
        if l_attr[0] == r_attr[0]:
            score += 2
        if l_attr[1] == r_attr[1]:
            score += 1
        return score

Finally, with the tokenizer and similarity function defined, the new feature can be created and added:

>>> # The new tokenizer is added to block_t and the new similarity function is added
>>> # to block_s, then a new feature is created
>>> block_t = em.get_tokenizers_for_blocking()
>>> block_t['first_last_tok'] = first_last_tok
>>> block_s = em.get_sim_funs_for_blocking()
>>> block_s['first_last_sim'] = first_last_sim
>>> r = em.get_feature_fn('first_last_sim(first_last_tok(ltuple["name"]), first_last_tok(rtuple["name"]))', block_t, block_s)
>>> em.add_feature(block_f, 'name_name_fls_flt_flt', r)
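Before wiring such custom functions into a feature, it can help to sanity-check them by hand. A small sketch (the name strings are made up for illustration):

>>> first_last_tok('Michael J. Fox')
['Michael', 'Fox']
>>> first_last_sim(first_last_tok('Michael J. Fox'), first_last_tok('Michael Fox'))
3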
Please look at the API reference of :py:meth:`~py_entitymatching.get_feature_fn` and :py:meth:`~py_entitymatching.add_feature` for more details.

Summary of the Manual Feature Generation Process
------------------------------------------------

Here is a summary of the commands for the entire manual feature generation process. To generate features, you must execute the following commands:

>>> block_t = em.get_tokenizers_for_blocking()
>>> block_s = em.get_sim_funs_for_blocking()
>>> atypes1 = em.get_attr_types(A)
>>> atypes2 = em.get_attr_types(B)
>>> block_c = em.get_attr_corres(A, B)
>>> block_f = em.get_features(A, B, atypes1, atypes2, block_c, block_t, block_s)

The variable `block_f` points to a Dataframe containing the features as rows.

Ways to Edit the Manual Feature Generation Process
--------------------------------------------------

Here is a summary of the ways to edit the variables used in the feature generation process:

* `block_t`, `block_s`, `atypes1`, `atypes2`, and `block_c` are dictionaries. You can modify these variables as needed to add/remove tokenizers, similarity functions, attribute correspondences, etc.

* `block_f` is a Dataframe. You can remove a feature by simply deleting the corresponding row from the Dataframe.

* There are two ways to create and add a feature: (1) write a blackbox function and add it to the feature table, and (2) define the feature declaratively and add it to the feature table.

  To add a blackbox feature, first write a blackbox function like this:

  ::

      def age_diff(ltuple, rtuple):
          # assume that the tuples have an age attribute and the values are valid numbers
          return ltuple['age'] - rtuple['age']

  Then add it to the table `block_f` using `add_blackbox_feature` like this:

  >>> status = em.add_blackbox_feature(block_f, 'age_difference', age_diff)

  To add a feature declaratively, first write a feature expression and compile it into a feature using `get_feature_fn` like this:

  >>> r = em.get_feature_fn('jaccard(qgm_3(ltuple.address + ltuple.zipcode), qgm_3(rtuple.address + rtuple.zipcode))', block_t, block_s)

  Then add it to the table `block_f` using `add_feature` like this:

  >>> em.add_feature(block_f, 'full_address_address_jac_qgm3_qgm3', r)

.. _label-gen-feats-automatically:

Generating Features Automatically
---------------------------------

Recall that to get the features for blocking, eventually you must execute the following:

>>> block_f = em.get_features(A, B, atypes1, atypes2, block_c, block_t, block_s)

where `atypes1`/`atypes2` are the attribute types of A and B, `block_c` is the set of correspondences between their attributes, `block_t` is the set of tokenizers, and `block_s` is the set of similarity functions. If you do not want to go through the hassle of creating these intermediate variables, you can execute the following:

>>> block_f = em.get_features_for_blocking(A, B)

The system will automatically generate a set of features and return them as a Dataframe, which you can then use for blocking purposes. This Dataframe contains a few attributes that require further explanation, specifically 'left_attr_tokenizer', 'right_attr_tokenizer', and 'simfunction'. There are two types of similarity functions: those that use tokenizers and those that do not. For features whose similarity function uses tokenizers, a tokenizer must be designated for the left table attribute in 'left_attr_tokenizer' and for the right table attribute in 'right_attr_tokenizer'. The 'simfunction' attribute is the name of the similarity function and comes from the keys in 'block_s'. The keys and the actual functions they correspond to are explained in the Obtaining Tokenizers and Similarity Functions section above.

The command `get_features_for_blocking` will also set the following variables: `_block_t`, `_block_s`, `_atypes1`, `_atypes2`, and `_block_c`. You can access these variables like this:

>>> em._block_t
>>> em._block_s
>>> em._atypes1
>>> em._atypes2
>>> em._block_c

You can examine these variables, modify them as appropriate, and then re-generate the set of features using the `get_features` command.

Please look at the API reference of :py:meth:`~py_entitymatching.get_features_for_blocking` for more details.
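Putting this together, here is a minimal sketch of that examine-and-regenerate round trip, using only the commands described above (which variables you actually modify depends on your tables):

>>> block_f = em.get_features_for_blocking(A, B)
>>> # Inspect the intermediate variables set by the command
>>> sorted(em._block_t.keys())
>>> sorted(em._block_s.keys())
>>> # After modifying them as needed, re-generate the features
>>> block_f = em.get_features(A, B, em._atypes1, em._atypes2, em._block_c, em._block_t, em._block_s)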