Using numerical and categorical variables together#

In the previous notebooks, we showed the required preprocessing to apply when dealing with numerical and categorical variables. However, we decoupled the process to treat each type individually. In this notebook, we will show how to combine these preprocessing steps.

We will first load the entire adult census dataset.

import pandas as pd

adult_census = pd.read_csv("../datasets/adult-census.csv")

# drop the duplicated column `"education-num"` as stated in the first notebook
adult_census = adult_census.drop(columns="education-num")

target_name = "class"
target = adult_census[target_name]

data = adult_census.drop(columns=[target_name])

Selection based on data types#

We will separate categorical and numerical variables using their data types to identify them: as we saw previously, object corresponds to categorical columns (strings). We make use of the make_column_selector helper to select the corresponding columns.

from sklearn.compose import make_column_selector as selector

numerical_columns_selector = selector(dtype_exclude=object)
categorical_columns_selector = selector(dtype_include=object)

numerical_columns = numerical_columns_selector(data)
categorical_columns = categorical_columns_selector(data)
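
If you want to double-check what the selectors returned, you can print the two lists of column names (with this dataset they should match the column groups visible in the pipeline representation further below):

print("Categorical columns:", categorical_columns)
print("Numerical columns:", numerical_columns)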

Caution

Here, we know that object data type is used to represent strings and thus categorical features. Be aware that this is not always the case. Sometimes object data type could contain other types of information, such as dates that were not properly formatted (strings) and yet relate to a quantity of elapsed time.

In a more general scenario, you should manually introspect the content of your dataframe to avoid using make_column_selector incorrectly.
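
A quick way to perform this introspection is to look at the data types and a few raw values before trusting the selector. This is a minimal sketch using standard pandas calls:

# Inspect the data type of each column
print(data.dtypes)

# Count how many columns there are per data type
print(data.dtypes.value_counts())

# Look at a few raw values of the object columns to confirm they are truly categorical
print(data.select_dtypes(include=object).head())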

Dispatch columns to a specific processor#

In the previous sections, we saw that we need to treat data differently depending on their nature (i.e. numerical or categorical).

Scikit-learn provides a ColumnTransformer class which will send specific columns to a specific transformer, making it easy to fit a single predictive model on a dataset that combines both kinds of variables together (heterogeneously typed tabular data).

We first define the columns depending on their data type:

  • one-hot encoding will be applied to categorical columns. In addition, we use handle_unknown="ignore" to avoid potential issues caused by rare categories, i.e. categories seen at predict time but absent at fit time (see the small example after the next code cell).

  • standard scaling will be applied to numerical columns, i.e. the numerical features will be standardized.

Now, we create our ColumnTransformer by specifying, for each preprocessor, three values: a name, the transformer, and the columns it applies to. First, let's create the preprocessors for the numerical and categorical parts.

from sklearn.preprocessing import OneHotEncoder, StandardScaler

categorical_preprocessor = OneHotEncoder(handle_unknown="ignore")
numerical_preprocessor = StandardScaler()
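
The handle_unknown="ignore" option matters when a category appears at predict time but was absent from the training data. Here is a minimal, self-contained sketch (the toy animals_train and animals_test frames are made up for illustration) showing that an unseen category is encoded as a row of zeros instead of raising an error:

import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Toy data: the category "owl" appears only at transform time
animals_train = pd.DataFrame({"animal": ["cat", "dog", "cat"]})
animals_test = pd.DataFrame({"animal": ["dog", "owl"]})

encoder = OneHotEncoder(handle_unknown="ignore")
encoder.fit(animals_train)

# The unseen category "owl" is encoded as a row of zeros instead of raising an error
print(encoder.transform(animals_test).toarray())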

Now, we create the transformer and associate each of these preprocessors with their respective columns.

from sklearn.compose import ColumnTransformer

preprocessor = ColumnTransformer([
    ('one-hot-encoder', categorical_preprocessor, categorical_columns),
    ('standard_scaler', numerical_preprocessor, numerical_columns)])

We can take a minute to represent graphically the structure of a ColumnTransformer :

[Figure: diagram of a ColumnTransformer dispatching categorical columns to a one-hot encoder and numerical columns to a scaler, then concatenating the outputs]

A ColumnTransformer does the following:

  • It splits the columns of the original dataset based on the column names or indices provided. We will obtain as many subsets as the number of transformers passed into the ColumnTransformer .

  • It transforms each subset. A specific transformer is applied to each subset: it internally calls fit_transform or transform . The output of this step is a set of transformed datasets.

  • It then concatenates the transformed datasets into a single dataset.
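
To make these three steps concrete, here is a hedged sketch reproducing them by hand (it assumes data, categorical_columns and numerical_columns defined above are available, and creates fresh transformer instances so it does not interfere with the pipeline built later); the fitted ColumnTransformer produces an equivalent concatenated array:

import numpy as np
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# 1. Split the columns into one subset per transformer
categorical_subset = data[categorical_columns]
numerical_subset = data[numerical_columns]

# 2. Apply a dedicated transformer to each subset
encoded_categorical = OneHotEncoder(handle_unknown="ignore").fit_transform(
    categorical_subset).toarray()
scaled_numerical = StandardScaler().fit_transform(numerical_subset)

# 3. Concatenate the transformed subsets into a single dataset
combined = np.hstack([encoded_categorical, scaled_numerical])
print(combined.shape)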

The important thing is that ColumnTransformer is like any other scikit-learn transformer. In particular it can be combined with a classifier in a Pipeline :

from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

model = make_pipeline(preprocessor, LogisticRegression(max_iter=500))
model
Pipeline(steps=[('columntransformer',
                 ColumnTransformer(transformers=[('one-hot-encoder',
                                                  OneHotEncoder(handle_unknown='ignore'),
                                                  ['workclass', 'education',
                                                   'marital-status',
                                                   'occupation', 'relationship',
                                                   'race', 'sex',
                                                   'native-country']),
                                                 ('standard_scaler',
                                                  StandardScaler(),
                                                  ['age', 'capital-gain',
                                                   'capital-loss',
                                                   'hours-per-week'])])),
                ('logisticregression', LogisticRegression(max_iter=500))])


The final model is more complex than the previous models but still follows the same API (the same set of methods that can be called by the user):

  • the fit method is called to preprocess the data and then train the classifier on the preprocessed data;

  • the predict method makes predictions on new data;

  • the score method is used to predict on the test data and compare the predictions to the expected test labels to compute the accuracy.

Let's start by splitting our data into train and test sets.

from sklearn.model_selection import train_test_split

data_train, data_test, target_train, target_test = train_test_split(
    data, target, random_state=42)

Caution

Be aware that we use train_test_split here for didactic purposes, to show the scikit-learn API. In a real setting one might prefer to use cross-validation to also be able to evaluate the uncertainty of our estimation of the generalization performance of a model, as previously demonstrated.

Now, we can train the model on the train set.

_ = model.fit(data_train, target_train)

Then, we can send the raw dataset straight to the pipeline. Indeed, we do not need to make any manual preprocessing (calling the transform or fit_transform methods) as it will be handled when calling the predict method. As an example, we predict on the first five samples from the test set.
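
The raw feature values of these samples, presumably displayed with the usual pandas head call, are shown below:

data_test.head()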

age workclass education marital-status occupation relationship race sex capital-gain capital-loss hours-per-week native-country
7762 56 Private HS-grad Divorced Other-service Unmarried White Female 0 0 40 United-States
23881 25 Private HS-grad Married-civ-spouse Transport-moving Own-child Other Male 0 0 40 United-States
30507 43 Private Bachelors Divorced Prof-specialty Not-in-family White Female 14344 0 40 United-States
28911 32 Private HS-grad Married-civ-spouse Transport-moving Husband White Male 0 0 40 United-States
19484 39 Private Bachelors Married-civ-spouse Sales Wife White Female 0 0 30 United-States
model.predict(data_test)[:5]
                array([' <=50K', ' <=50K', ' >50K', ' <=50K', ' >50K'], dtype=object)              
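To compare these predictions with the ground truth, the first five entries of the target were inspected (presumably with the following cell):

target_test[:5]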
7762      <=50K
23881     <=50K
30507      >50K
28911     <=50K
19484     <=50K
Name: class, dtype: object

To get the accuracy score directly, we need to call the score method. Let's compute the accuracy score on the entire test set.

model.score(data_test, target_test)

Evaluation of the model with cross-validation#

As previously stated, a predictive model should be evaluated by cross-validation. Our model is usable with the cross-validation tools of scikit-learn like any other predictor:

from sklearn.model_selection import cross_validate

cv_results = cross_validate(model, data, target, cv=5)
cv_results
{'fit_time': array([0.91902041, 0.95967555, 0.89062309, 0.92098427, 0.92281842]),
 'score_time': array([0.03798079, 0.04145241, 0.041013  , 0.0417161 , 0.03910518]),
 'test_score': array([0.8512642 , 0.8498311 , 0.84756347, 0.8523751 , 0.85524161])}
scores = cv_results["test_score"]
print(
    "The mean cross-validation accuracy is: "
    f"{scores.mean():.3f} ± {scores.std():.3f}"
)
                The mean cross-validation accuracy is: 0.851 ± 0.003              

The compound model has a higher predictive accuracy than the two models that used numerical and categorical variables in isolation.

Fitting a more powerful model#

Linear models are nice because they are usually cheap to train, small to deploy, fast to predict and give a good baseline.

However, it is often useful to check whether more complex models such as an ensemble of decision trees can lead to higher predictive performance. In this section we will use such a model called gradient-boosting trees and evaluate its generalization performance. More precisely, the scikit-learn model we will use is called HistGradientBoostingClassifier . Note that boosting models will be covered in more detail in a future module.

For tree-based models, the handling of numerical and categorical variables is simpler than for linear models:

  • we do not need to scale the numerical features

  • using an ordinal encoding for the categorical variables is fine even if the encoding results in an arbitrary ordering

Therefore, for HistGradientBoostingClassifier , the preprocessing pipeline is slightly simpler than the one we saw earlier for the LogisticRegression :

from sklearn.ensemble import HistGradientBoostingClassifier
from sklearn.preprocessing import OrdinalEncoder

categorical_preprocessor = OrdinalEncoder(handle_unknown="use_encoded_value",
                                          unknown_value=-1)

preprocessor = ColumnTransformer([
    ('categorical', categorical_preprocessor, categorical_columns)],
    remainder="passthrough")

model = make_pipeline(preprocessor, HistGradientBoostingClassifier())

Now that we created our model, we can check its generalization performance.

%%time
_ = model.fit(data_train, target_train)
CPU times: user 897 ms, sys: 16 ms, total: 913 ms
Wall time: 912 ms
model.score(data_test, target_test)
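
As with the linear model, a more reliable comparison would rely on cross-validation rather than a single train-test split. A sketch reusing the same cross_validate call on this new pipeline (scores not reproduced here) could be:

from sklearn.model_selection import cross_validate

cv_results = cross_validate(model, data, target, cv=5)
scores = cv_results["test_score"]
print(
    "The mean cross-validation accuracy is: "
    f"{scores.mean():.3f} ± {scores.std():.3f}"
)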

We can observe that we get significantly higher accuracies with the Gradient Boosting model. This is often what we observe whenever the dataset has a large number of samples and a limited number of informative features (e.g. fewer than 1000) with a mix of numerical and categorical variables.

This explains why Gradient Boosted Machines are very popular among data science practitioners who work with tabular data.

In this notebook we:

  • used a ColumnTransformer to apply different preprocessing for categorical and numerical variables;

  • used a pipeline to chain the ColumnTransformer preprocessing and logistic regression fitting;

  • saw that gradient boosting methods can outperform linear models.