Profiling Data¶
Profiling data is used to help users get general information about their data. Before working with the data, it is useful for a user to have a high level understanding of the data because he or she will be able to take advantage of the the general trends to successfully and efficiently complete the rest of the workflow.
Data profiling specifically can show users important statistics such as type, uniqueness, missing values, quartile statistics, mean, mode, standard deviation, sum, median absolute deviation, coefficient of variation, kurtosis, skewness. It can also display information to the user visually such as in a histogram.
We recommend using the python package pandas-profiling because it is simple and easy to use. More information about the package can be found on the github page at https://github.com/JosPolfliet/pandas-profiling
Example Usage¶
After reading in a CSV file into a Dataframe, pandas-profiling shows the user a report containing useful profiling information. For example:
>>> import pandas_profiling
>>> # Read in csv file
>>> A = em.read_csv_metadata('path_to_csv_dir/table.csv', key='ID')
>>> # Use the profiler
>>> pandas_profiling.ProfileReport(A)
The user can also check to see if any variables are highly correlated:
>>> # Read in csv file
>>> import pandas_profiling
>>> A = em.read_csv_metadata('path_to_csv_dir/table.csv', key='ID')
>>> #Use the profiler
>>> profile = pandas_profiling.ProfileReport(A)
>>> # Check for rejected variables
>>> rejected_variables = profile.get_rejected_variables(threshold=0.9)
The report generated can also be saved into an html file:
>>> import pandas_profiling
>>> A = em.read_csv_metadata('path_to_csv_dir/table.csv', key='ID')
>>> # Save report to a variable
>>> profile = pandas_profiling.ProfileReport(A)
>>> # Save report to an html file
>>> profile.to_file(outputfile="/tmp/myoutputfile.html")
For more information about pandas-profiling please go to the github page at https://github.com/JosPolfliet/pandas-profiling