Debugging Blocker Output
py_entitymatching.debug_blocker(candidate_set, ltable, rtable, output_size=200, attr_corres=None, verbose=True, n_jobs=1, n_configs=1)
This function debugs the blocker output and reports a list of potential matches that are discarded by a blocker (or a blocker sequence). Specifically, it takes the two input tables for matching and the candidate set returned by a blocker (or a blocker sequence), and produces a list of tuple pairs that the blocker rejected but that have a high potential of being true matches.
- Parameters
candidate_set (DataFrame) – The candidate set generated by applying the blocker on the ltable and rtable.
ltable (DataFrame) – The left input DataFrame that was used to generate the blocker output.
rtable (DataFrame) – The right input DataFrame that was used to generate the blocker output.
output_size (int) – The number of tuple pairs that will be returned (defaults to 200).
attr_corres (list) – A list of attribute correspondence tuples (defaults to None). When ltable and rtable have different schemas, or the same schema but different names for the attributes, the user needs to specify the attribute correspondence manually. Each element in this list should be a tuple of strings naming the corresponding attributes in ltable and rtable. If the list is not specified, a built-in function for finding the attribute correspondences is called. We highly recommend specifying the attribute correspondences manually unless the schemas of ltable and rtable are identical.
verbose (boolean) – A flag to indicate whether the debug information should be logged (defaults to True).
n_jobs (int) – The number of parallel jobs to be used for computation (defaults to 1). If -1, all CPUs are used. If 0 or 1, no parallel computation is used at all, which is useful for debugging. For n_jobs below -1, (n_cpus + 1 + n_jobs) CPUs are used, where n_cpus is the total number of CPUs in the machine. Thus, for n_jobs = -2, all CPUs but one are used. If (n_cpus + 1 + n_jobs) is less than 1, no parallel computation is used (i.e., equivalent to the default).
n_configs (int) – The maximum number of configs to be used for calculating the top-k list (defaults to 1). If -1, the number of configs is set to the number of CPUs. If -2, all configs are used. If n_configs is less than the number of generated configs, n_configs configs are used. Otherwise, all the generated configs are used.
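The n_jobs rules above can be summarized in a few lines of Python. This is a minimal sketch of the rules as the parameter description states them, not the library's own code: `resolve_n_jobs` and its `n_cpus` argument are hypothetical names used only for illustration, and passing values above 1 through unchanged is an assumption.

```python
import os


def resolve_n_jobs(n_jobs, n_cpus=None):
    """Return the number of workers implied by n_jobs (illustrative only)."""
    if n_cpus is None:
        # os.cpu_count() may return None; fall back to a single CPU.
        n_cpus = os.cpu_count() or 1
    if n_jobs == -1:
        return n_cpus  # use all CPUs
    if n_jobs in (0, 1):
        return 1  # no parallel computation
    if n_jobs < -1:
        # For example, n_jobs = -2 means all CPUs but one;
        # never go below a single (non-parallel) worker.
        return max(1, n_cpus + 1 + n_jobs)
    # Assumption: values above 1 are taken as the job count as given.
    return n_jobs
```

On a machine with 8 CPUs, `resolve_n_jobs(-2, n_cpus=8)` gives 7 under these rules, and a very negative value such as -20 falls back to 1.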
- Returns
A pandas DataFrame with output_size rows. Each row in the DataFrame is a tuple pair that has the potential of being a true match but was rejected by the blocker (meaning that the tuple pair is in the Cartesian product of ltable and rtable but not in the candidate set). The fields in the returned DataFrame come from ltable and rtable and are useful for judging whether the tuples in a pair are similar.
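The set that the returned pairs are drawn from can be illustrated with plain Python. This sketch only shows the definition "Cartesian product minus candidate set" on tuple IDs; `rejected_pairs` is a hypothetical helper, and the sketch does not reproduce how debug_blocker ranks pairs or limits the result to output_size rows.

```python
from itertools import product


def rejected_pairs(l_ids, r_ids, candidate_pairs):
    """Pairs in the Cartesian product of l_ids and r_ids that are not candidates."""
    kept = set(candidate_pairs)
    return [pair for pair in product(l_ids, r_ids) if pair not in kept]


l_ids = ["a1", "a2"]
r_ids = ["b1", "b2"]
candidates = [("a1", "b1")]  # pairs that survived blocking
print(rejected_pairs(l_ids, r_ids, candidates))
# prints [('a1', 'b2'), ('a2', 'b1'), ('a2', 'b2')]
```

Enumerating the full product like this is only practical for tiny inputs; it is meant to make the definition concrete, not to suggest an approach for real tables.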
- Raises
AssertionError – If ltable, rtable or candidate_set is not of type pandas DataFrame.
AssertionError – If ltable or rtable is empty (size of 0).
AssertionError – If the output size parameter is less than or equal to 0.
AssertionError – If the attribute correspondence (attr_corres) list is not in the correct format (a list of tuples).
AssertionError – If the attribute correspondence (attr_corres) cannot be built correctly.
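Because a malformed attr_corres list is reported as an AssertionError, it can help to check the list before calling the debugger. The following is a hypothetical pre-check, not the library's own validation: `check_attr_corres` is an illustrative name, and the requirements that each tuple has exactly two elements and that each name appears among the table's columns are assumptions based on the parameter description.

```python
def check_attr_corres(attr_corres, l_columns, r_columns):
    """Hypothetical pre-check for an attr_corres list (illustrative only)."""
    assert isinstance(attr_corres, list), "attr_corres must be a list"
    for pair in attr_corres:
        # Assumption: each correspondence is an (ltable_attr, rtable_attr) pair.
        assert isinstance(pair, tuple) and len(pair) == 2, \
            "each element must be a tuple of two attribute names"
        l_attr, r_attr = pair
        assert isinstance(l_attr, str) and isinstance(r_attr, str), \
            "attribute names must be strings"
        # Assumption: each name should be a column of the corresponding table.
        assert l_attr in l_columns, f"{l_attr!r} is not an ltable column"
        assert r_attr in r_columns, f"{r_attr!r} is not an rtable column"
    return True
```

With pandas DataFrames A and B, the column arguments could be passed as `list(A.columns)` and `list(B.columns)`.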
Examples
>>> import py_entitymatching as em
>>> ob = em.OverlapBlocker()
>>> C = ob.block_tables(A, B, l_overlap_attr='title', r_overlap_attr='title', overlap_size=3)
>>> corres = [('ID','ssn'), ('name', 'ename'), ('address', 'location'),('zipcode', 'zipcode')]
>>> D = em.debug_blocker(C, A, B, attr_corres=corres)

>>> import py_entitymatching as em
>>> ob = em.OverlapBlocker()
>>> C = ob.block_tables(A, B, l_overlap_attr='name', r_overlap_attr='name', overlap_size=3)
>>> D = em.debug_blocker(C, A, B, output_size=150)
py_entitymatching.backup_debug_blocker(candset, ltable, rtable, output_size=200, attr_corres=None, verbose=False)
This is the old version of the blocker debugger. Using this version is not recommended unless the new blocker debugger is not working properly.
This function debugs the blocker output and reports a list of potential matches that are discarded by a blocker (or a blocker sequence).
Specifically, this function takes the two input tables for matching and the candidate set returned by a blocker (or a blocker sequence), and produces a list of tuple pairs that the blocker rejected but that have a high potential of being true matches.
- Parameters
candset (DataFrame) – The candidate set generated by applying the blocker on the ltable and rtable.
ltable (DataFrame) – The left input DataFrame that was used to generate the blocker output.
rtable (DataFrame) – The right input DataFrame that was used to generate the blocker output.
output_size (int) – The number of tuple pairs that will be returned (defaults to 200).
attr_corres (list) – A list of attribute correspondence tuples (defaults to None). When ltable and rtable have different schemas, or the same schema but different names for the attributes, the user needs to specify the attribute correspondence manually. Each element in this list should be a tuple of strings naming the corresponding attributes in ltable and rtable. If the list is not specified, a built-in function for finding the attribute correspondences is called. We highly recommend specifying the attribute correspondences manually unless the schemas of ltable and rtable are identical.
verbose (boolean) – A flag to indicate whether the debug information should be logged (defaults to False).
- Returns
A pandas DataFrame with output_size rows. Each row in the DataFrame is a tuple pair that has the potential of being a true match but was rejected by the blocker (meaning that the tuple pair is in the Cartesian product of ltable and rtable but not in the candidate set). The fields in the returned DataFrame come from ltable and rtable and are useful for judging whether the tuples in a pair are similar.
- Raises
AssertionError – If ltable, rtable or candset is not of type pandas DataFrame.
AssertionError – If ltable or rtable is empty (size of 0).
AssertionError – If the output size parameter is less than or equal to 0.
AssertionError – If the attribute correspondence (attr_corres) list is not in the correct format (a list of tuples).
AssertionError – If the attribute correspondence (attr_corres) cannot be built correctly.
Examples
>>> import py_entitymatching as em
>>> ob = em.OverlapBlocker()
>>> C = ob.block_tables(A, B, l_overlap_attr='title', r_overlap_attr='title', overlap_size=3)
>>> corres = [('ID','ssn'), ('name', 'ename'), ('address', 'location'),('zipcode', 'zipcode')]
>>> D = em.backup_debug_blocker(C, A, B, attr_corres=corres)

>>> import py_entitymatching as em
>>> ob = em.OverlapBlocker()
>>> C = ob.block_tables(A, B, l_overlap_attr='name', r_overlap_attr='name', overlap_size=3)
>>> D = em.backup_debug_blocker(C, A, B, output_size=150)