Standardizing Data
1. Standardization
It's possible that you'll come across datasets with lots of numerical noise, perhaps due to feature variance or differently-scaled data. The preprocessing solution for that is standardization.
2. What is standardization?
Standardization is a preprocessing method used to transform continuous data to make it look normally distributed. In scikit-learn, this is often a necessary step, because many models assume that the training data is normally distributed, and if it isn't, we risk biasing the model. Data can be standardized in many different ways, but in this course, we're going to talk about two methods: log normalization and scaling. It's also important to note that standardization is a preprocessing method applied to continuous, numerical data. We'll cover methods for dealing with categorical data later in the course.
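As a quick illustration of the two methods before they're covered in depth, here is a minimal sketch, assuming a hypothetical pandas DataFrame with a single continuous column named "col"; the column name and values are illustrative, not from the course exercises.

import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Hypothetical data spanning several orders of magnitude
df = pd.DataFrame({"col": [1.0, 10.0, 100.0, 1000.0]})

# Log normalization: np.log applies the natural log, compressing large values
df["col_log"] = np.log(df["col"])

# Scaling: StandardScaler transforms a feature to mean 0 and unit variance
scaler = StandardScaler()
df["col_scaled"] = scaler.fit_transform(df[["col"]]).flatten()

print(df)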
3. When to standardize: linear distances
There are a few different scenarios in which we'd want to standardize our data. First, if we're working with any kind of model that uses a linear distance metric or operates in a linear space, like k-nearest neighbors, linear regression, or k-means clustering, the model assumes that the data and features we're giving it are related in a linear fashion, or can be measured with a linear distance metric, which may not always be the case.
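To make the linear-distance point concrete, here is a hedged sketch comparing the same k-nearest neighbors classifier with and without scaling; the synthetic data, the exaggerated feature scale, and the variable names are all illustrative assumptions, not the course's dataset.

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data; inflate one feature so it dominates Euclidean distances
X, y = make_classification(n_samples=500, n_features=5, random_state=42)
X[:, 0] *= 1000

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# KNN on raw features: distances are dominated by the inflated feature
knn = KNeighborsClassifier().fit(X_train, y_train)
print(knn.score(X_test, y_test))

# The same model with StandardScaler applied first, via a pipeline
scaled_knn = make_pipeline(StandardScaler(), KNeighborsClassifier())
scaled_knn.fit(X_train, y_train)
print(scaled_knn.score(X_test, y_test))  # typically noticeably higher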
4. When to standardize: high variance
Standardization should also be used when dataset features have high variance, which is also related to distance metrics. High variance could bias a model that assumes the data is normally distributed. If a feature in our dataset has a variance that's an order of magnitude or more greater than the other features, it could impact the model's ability to learn from the other features in the dataset.
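One way to spot this situation is simply to compare feature variances. Below is a minimal sketch, assuming a hypothetical two-column DataFrame; the column names and values are made up for illustration.

import numpy as np
import pandas as pd

df = pd.DataFrame({"small": [1.2, 0.8, 1.1, 0.9],
                   "large": [100.0, 5000.0, 250.0, 12000.0]})

# The variance of "large" is several orders of magnitude bigger than "small"
print(df.var())

# Log normalization shrinks that spread while preserving relative order
df["large_log"] = np.log(df["large"])
print(df["large_log"].var())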
5. When to standardize: different scales
Modeling a dataset that contains continuous features on different scales is another standardization scenario. For example, consider predicting house prices using two features: the number of bedrooms and the last sale price. These two features are on vastly different scales, which will confuse most models. To compare these features, we must standardize them to put them in the same linear space. All of these scenarios assume we're working with a model that makes some kind of linearity assumption; however, there are a number of models that are perfectly fine operating in a nonlinear space, or that do a certain amount of standardization upon input, but those are outside the scope of this course.
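Here is a sketch of that house-price scenario, assuming hypothetical column names and toy values; StandardScaler rescales both features to mean 0 and unit variance so a linear model or distance metric treats them comparably.

import pandas as pd
from sklearn.preprocessing import StandardScaler

houses = pd.DataFrame({"beds": [2, 3, 4, 3],
                       "last_sale": [220000.0, 310000.0, 450000.0, 295000.0]})

# fit_transform returns both columns rescaled to the same space
scaler = StandardScaler()
scaled = scaler.fit_transform(houses[["beds", "last_sale"]])
houses["beds_ss"] = scaled[:, 0]
houses["last_sale_ss"] = scaled[:, 1]
print(houses)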
6. Let's practice!
Now that you've learned when to standardize your data, let's test your knowledge.
This exercise is part of the course
Preprocessing for Machine Learning in Python
Chapter 1: Introduction to Data Preprocessing
In this chapter you'll learn exactly what it means to preprocess data. You'll take the first steps in any preprocessing journey, including exploring data types and dealing with missing data.
Exercise 1: Introduction to preprocessing
Exercise 2: Exploring missing data
Exercise 3: Dropping missing data
Exercise 4: Working with data types
Exercise 5: Exploring data types
Exercise 6: Converting a column type
Exercise 7: Training and test sets
Exercise 8: Class imbalance
Exercise 9: Stratified sampling
Chapter 2: Standardizing Data
This chapter is all about standardizing data. Often a model will make some assumptions about the distribution or scale of your features. Standardization is a way to make your data fit these assumptions and improve the algorithm's performance.
Exercise 1: Standardization (current exercise)
Exercise 2: When to standardize
Exercise 3: Modeling without normalizing
Exercise 4: Log normalization
Exercise 5: Checking the variance
Exercise 6: Log normalization in Python
Exercise 7: Scaling data for feature comparison
Exercise 8: Scaling data - investigating columns
Exercise 9: Scaling data - standardizing columns
Exercise 10: Standardized data and modeling
Exercise 11: KNN on non-scaled data
Exercise 12: KNN on scaled data
Chapter 3: Feature Engineering
In this section you'll learn about feature engineering. You'll explore different ways to create new, more useful, features from the ones already in your dataset. You'll see how to encode, aggregate, and extract information from both numerical and textual features.
Exercise 1: Feature engineering
Exercise 2: Feature engineering knowledge test
Exercise 3: Identifying areas for feature engineering
Exercise 4: Encoding categorical variables
Exercise 5: Encoding categorical variables - binary
Exercise 6: Encoding categorical variables - one-hot
Exercise 7: Engineering numerical features
Exercise 8: Aggregating numerical features
Exercise 9: Extracting datetime components
Exercise 10: Engineering text features
Exercise 11: Extracting string patterns
Exercise 12: Vectorizing text
Exercise 13: Text classification using tf/idf vectors
Chapter 4: Selecting Features for Modeling
This chapter goes over a few different techniques for selecting the most important features from your dataset. You'll learn how to drop redundant features, work with text vectors, and reduce the number of features in your dataset using principal component analysis (PCA).
Exercise 1: Feature selection
Exercise 2: When to use feature selection
Exercise 3: Identifying areas for feature selection
Exercise 4: Removing redundant features
Exercise 5: Selecting relevant features
Exercise 6: Checking for correlated features
Exercise 7: Selecting features using text vectors
Exercise 8: Exploring text vectors, part 1
Exercise 9: Exploring text vectors, part 2
Exercise 10: Training Naive Bayes with feature selection
Exercise 11: Dimensionality reduction
Exercise 12: Using PCA
Exercise 13: Training a model with PCA
Chapter 5: Putting It All Together
Now that you've learned all about preprocessing, you'll try these techniques out on a dataset that records information on UFO sightings.
Exercise 1: UFOs and preprocessing
Exercise 2: Checking column types
Exercise 3: Dropping missing data
Exercise 4: Categorical variables and standardization
Exercise 5: Extracting numbers from strings
Exercise 6: Identifying features for standardization
Exercise 7: Engineering new features
Exercise 8: Encoding categorical variables
Exercise 9: Features from dates
Exercise 10: Text vectorization
Exercise 11: Feature selection and modeling
Exercise 12: Selecting the ideal dataset
Exercise 13: Modeling the UFO dataset, part 1
Exercise 14: Modeling the UFO dataset, part 2
Exercise 15: Congratulations!