Standardizing Data
1. Standardization
It's possible that you'll come across datasets with lots of numerical noise, perhaps due to feature variance or differently-scaled data. The preprocessing solution for that is standardization.
2. What is standardization?
Standardization is a preprocessing method used to transform continuous data to make it look normally distributed. In scikit-learn, this is often a necessary step, because many models assume that the training data is normally distributed, and if it isn't, we risk biasing the model. Data can be standardized in many different ways, but in this course, we're going to talk about two methods: log normalization and scaling. It's also important to note that standardization is a preprocessing method applied to continuous, numerical data. We'll cover methods for dealing with categorical data later in the course.
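As a quick illustration of the two methods before they're covered in depth, here is a minimal sketch, assuming a hypothetical pandas DataFrame with a single continuous column named "col"; the column name and values are illustrative, not from the course exercises.

import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Hypothetical data spanning several orders of magnitude
df = pd.DataFrame({"col": [1.0, 10.0, 100.0, 1000.0]})

# Log normalization: np.log applies the natural log, compressing large values
df["col_log"] = np.log(df["col"])

# Scaling: StandardScaler transforms a feature to mean 0 and unit variance
scaler = StandardScaler()
df["col_scaled"] = scaler.fit_transform(df[["col"]]).flatten()

print(df)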
3. When to standardize: linear distances
There are a few different scenarios in which we'd want to standardize our data. First, if we're working with any kind of model that uses a linear distance metric or operates in a linear space, like k-nearest neighbors, linear regression, or k-means clustering, the model assumes that the data and features we're giving it are related in a linear fashion, or can be measured with a linear distance metric, which may not always be the case.
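To make the linear-distance point concrete, here is a hedged sketch comparing the same k-nearest neighbors classifier with and without scaling; the synthetic data, the exaggerated feature scale, and the variable names are all illustrative assumptions, not the course's dataset.

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data; inflate one feature so it dominates Euclidean distances
X, y = make_classification(n_samples=500, n_features=5, random_state=42)
X[:, 0] *= 1000

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# KNN on raw features: distances are dominated by the inflated feature
knn = KNeighborsClassifier().fit(X_train, y_train)
print(knn.score(X_test, y_test))

# The same model with StandardScaler applied first, via a pipeline
scaled_knn = make_pipeline(StandardScaler(), KNeighborsClassifier())
scaled_knn.fit(X_train, y_train)
print(scaled_knn.score(X_test, y_test))  # typically noticeably higher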
4. When to standardize: high variance
Standardization should also be used when dataset features have high variance, which is also related to distance metrics. High variance could bias a model that assumes the data is normally distributed. If a feature in our dataset has a variance that's an order of magnitude or more greater than the other features, it could impact the model's ability to learn from the other features in the dataset.
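One way to spot this situation is simply to compare feature variances. Below is a minimal sketch, assuming a hypothetical two-column DataFrame; the column names and values are made up for illustration.

import numpy as np
import pandas as pd

df = pd.DataFrame({"small": [1.2, 0.8, 1.1, 0.9],
                   "large": [100.0, 5000.0, 250.0, 12000.0]})

# The variance of "large" is several orders of magnitude bigger than "small"
print(df.var())

# Log normalization shrinks that spread while preserving relative order
df["large_log"] = np.log(df["large"])
print(df["large_log"].var())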
5. When to standardize: different scales
Modeling a dataset that contains continuous features on different scales is another standardization scenario. For example, consider predicting house prices using two features: the number of bedrooms and the last sale price. These two features are on vastly different scales, which will confuse most models. To compare these features, we must standardize them to put them in the same linear space. All of these scenarios assume we're working with a model that makes some kind of linearity assumption; however, there are a number of models that are perfectly fine operating in a nonlinear space, or that do a certain amount of standardization upon input, but those are outside the scope of this course.
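Here is a sketch of that house-price scenario, assuming hypothetical column names and toy values; StandardScaler rescales both features to mean 0 and unit variance so a linear model or distance metric treats them comparably.

import pandas as pd
from sklearn.preprocessing import StandardScaler

houses = pd.DataFrame({"beds": [2, 3, 4, 3],
                       "last_sale": [220000.0, 310000.0, 450000.0, 295000.0]})

# fit_transform returns both columns rescaled to the same space
scaler = StandardScaler()
scaled = scaler.fit_transform(houses[["beds", "last_sale"]])
houses["beds_ss"] = scaled[:, 0]
houses["last_sale_ss"] = scaled[:, 1]
print(houses)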
6. Let's practice!
Now that you've learned when to standardize your data, let's test your knowledge.
This exercise is part of the course
Preprocessing for Machine Learning in Python
Chapter 1: Introduction to Data Preprocessing
In this chapter you'll learn exactly what it means to preprocess data. You'll take the first steps in any preprocessing journey, including exploring data types and dealing with missing data.
Exercise 1: Introduction to preprocessing
Exercise 2: Exploring missing data
Exercise 3: Dropping missing data
Exercise 4: Working with data types
Exercise 5: Exploring data types
Exercise 6: Converting a column type
Exercise 7: Training and test sets
Exercise 8: Class imbalance
Exercise 9: Stratified sampling
Chapter 2: Standardizing Data
This chapter is all about standardizing data. Often a model will make some assumptions about the distribution or scale of your features. Standardization is a way to make your data fit these assumptions and improve the algorithm's performance.
Exercise 1: Standardization (current exercise)
Exercise 2: When to standardize
Exercise 3: Modeling without normalizing
Exercise 4: Log normalization
Exercise 5: Checking the variance
Exercise 6: Log normalization in Python
Exercise 7: Scaling data for feature comparison
Exercise 8: Scaling data - investigating columns
Exercise 9: Scaling data - standardizing columns
Exercise 10: Standardized data and modeling
Exercise 11: KNN on non-scaled data
Exercise 12: KNN on scaled data
Chapter 3: Feature Engineering
In this section you'll learn about feature engineering. You'll explore different ways to create new, more useful, features from the ones already in your dataset. You'll see how to encode, aggregate, and extract information from both numerical and textual features.
Exercise 1: Feature engineering
Exercise 2: Feature engineering knowledge test
Exercise 3: Identifying areas for feature engineering
Exercise 4: Encoding categorical variables
Exercise 5: Encoding categorical variables - binary
Exercise 6: Encoding categorical variables - one-hot
Exercise 7: Engineering numerical features
Exercise 8: Aggregating numerical features
Exercise 9: Extracting datetime components
Exercise 10: Engineering text features
Exercise 11: Extracting string patterns
Exercise 12: Vectorizing text
Exercise 13: Text classification using tf/idf vectors
Chapter 4: Selecting Features for Modeling
This chapter goes over a few different techniques for selecting the most important features from your dataset. You'll learn how to drop redundant features, work with text vectors, and reduce the number of features in your dataset using principal component analysis (PCA).
Exercise 1: Feature selection
Exercise 2: When to use feature selection
Exercise 3: Identifying areas for feature selection
Exercise 4: Removing redundant features
Exercise 5: Selecting relevant features
Exercise 6: Checking for correlated features
Exercise 7: Selecting features using text vectors
Exercise 8: Exploring text vectors, part 1
Exercise 9: Exploring text vectors, part 2
Exercise 10: Training Naive Bayes with feature selection
Exercise 11: Dimensionality reduction
Exercise 12: Using PCA
Exercise 13: Training a model with PCA
Chapter 5: Putting It All Together
Now that you've learned all about preprocessing, you'll try these techniques out on a dataset that records information on UFO sightings.
Exercise 1: UFOs and preprocessing
Exercise 2: Checking column types
Exercise 3: Dropping missing data
Exercise 4: Categorical variables and standardization
Exercise 5: Extracting numbers from strings
Exercise 6: Identifying features for standardization
Exercise 7: Engineering new features
Exercise 8: Encoding categorical variables
Exercise 9: Features from dates
Exercise 10: Text vectorization
Exercise 11: Feature selection and modeling
Exercise 12: Selecting the ideal dataset
Exercise 13: Modeling the UFO dataset, part 1
Exercise 14: Modeling the UFO dataset, part 2
Exercise 15: Congratulations!