Installation

Requirements

  • Python 2.7, 3.5, 3.6, 3.7, or 3.8

Platforms

py_entitymatching has been tested on Linux (Ubuntu Xenial 16.04.6), macOS (High Sierra 10.13.6), and Windows 10.

Dependencies

  • pandas (provides data structures to store and manage tables)

  • scikit-learn (provides implementations for common machine learning algorithms)

  • joblib (provides multiprocessing capabilities)

  • pyqt5 (provides tools to build GUIs)

  • py_stringsimjoin (provides implementations for string similarity joins)

  • py_stringmatching (provides a set of string tokenizers and string similarity functions)

  • cloudpickle (provides functions to serialize Python constructs)

  • pyprind (library to display progress indicators)

  • pyparsing (library to parse strings)

  • six (provides functions to write compatible code across Python 2 and 3)

  • xgboost (provides an implementation for xgboost classifier)

  • pandas-profiling (provides implementation for profiling pandas dataframe)

  • pandas-table (provides data exploration tool for pandas dataframe)

  • openrefine (provides data exploration tool for tables)

  • ipython (provides better tools for displaying tables in notebooks)

  • scipy (dependency for skikit-learn)

See the project README <https://github.com/anhaidgroup/py_entitymatching/blob/master/README.rst> for more information.

C Compiler Required

Installing Using conda Before installing this package, you need to make sure that you have a C compiler installed. This is necessary because this package contains Cython files. Go here for more information about how to check whether you already have a C compiler and how to install a C compiler. After you have confirmed that you have a C compiler installed, you are ready to install py_entitymatching.

Installing Using pip

To install the package using pip, execute the following command:

pip install -U numpy scipy py_entitymatching

The above command will install py_entitymatching and all of its dependencies except XGBoost, pandastable, openrefine, and PyQt5. This is because pip can only install the dependency packages that are available in PyPI and PyQt5, XGBoost, pandastable are not in PyPI for Python 2.

  • To install PyQt5, follow the instructions at this page.

  • To install XGBoost, follow the instructions at this page.

  • To install pandastable follow the instructions at this page.

  • To install openrefine follow the instructions at this page.

Installing from Source Distribution

Clone the py_entitymatching package from GitHub

Then, execute the following commands from the package root:

pip install -U numpy scipy
python setup.py install

which installs py_stringmatching into the default Python directory on your machine. If you do not have installation permission for that directory then you can install the package in your home directory as follows:

python setup.py install --user

For more information see this StackOverflow link.

The above commands will install py_entitymatching and all of its dependencies, except PyQt5 and XGBoost.

This is because, similar to pip, setup.py can only install the dependency packages that are available in PyPI and PyQt5, pandastable, XGBoost are not in PyPI for Python 2.

  • To install PyQt5, follow the instructions at this page.

  • To install XGBoost, follow the instructions at this page.

  • To install pandastable follow the instructions at this page.

  • To install openrefine follow the instructions at this page.

Note

Currently, py_entitymatching supports a set of experimental commands that help users create an EM workflow. Some of these commands will require installing Dask. To install dask refer to this `page <http://dask.pydata.org/en/latest/install.html`_.