6.2. Feature Extraction — Scikit-learn 1.1.1 Documentation
6.2.1. Loading features from dicts
The class DictVectorizer can be used to convert feature arrays represented as lists of standard Python dict objects to the NumPy/SciPy representation used by scikit-learn estimators.
While not particularly fast to process, Python’s dict has the advantages of being convenient to use, being sparse (absent features need not be stored) and storing feature names in addition to values.
DictVectorizer implements what is called one-of-K or “one-hot” coding for categorical (aka nominal, discrete) features. Categorical features are “attribute-value” pairs where the value is restricted to a list of discrete possibilities without ordering (e.g. topic identifiers, types of objects, tags, names…).
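As a rough illustration of what one-of-K coding does to dict-shaped data, the following pure-Python sketch turns every string-valued attribute into one indicator column per observed value and passes numeric values through unchanged. The helper name `one_hot_encode_dicts` and the sample rows are hypothetical, and this is not how DictVectorizer is implemented internally; it only mirrors the "name=value" column naming and sorted column order shown in this section.

```python
def one_hot_encode_dicts(rows):
    """Hypothetical helper: one-of-K encode a list of feature dicts."""
    # String values become "name=value" indicator columns;
    # numeric values keep a single column named after the feature.
    names = set()
    for row in rows:
        for key, value in row.items():
            names.add(f"{key}={value}" if isinstance(value, str) else key)
    columns = sorted(names)
    index = {name: i for i, name in enumerate(columns)}

    matrix = []
    for row in rows:
        encoded = [0.0] * len(columns)  # absent features stay at zero
        for key, value in row.items():
            if isinstance(value, str):
                encoded[index[f"{key}={value}"]] = 1.0
            else:
                encoded[index[key]] = float(value)
        matrix.append(encoded)
    return columns, matrix


# Illustrative data (an assumption, not taken from the documentation).
rows = [
    {"color": "red", "size": 2.0},
    {"color": "blue", "size": 5.0},
    {"color": "red"},
]
columns, matrix = one_hot_encode_dicts(rows)
print(columns)
for encoded in matrix:
    print(encoded)
```

Each categorical value gets its own column, so no ordering is implied between "red" and "blue", and a row that omits a feature simply leaves the corresponding columns at zero.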
In the following, “city” is a categorical attribute while “temperature” is a traditional numerical feature:
>>> measurements = [
...     {'city': 'Dubai', 'temperature': 33.},
...     {'city': 'London', 'temperature': 12.},
...     {'city': 'San Francisco', 'temperature': 18.},
... ]
>>> from sklearn.feature_extraction import DictVectorizer
>>> vec = DictVectorizer()
>>> vec.fit_transform(measurements).toarray()
array([[ 1.,  0.,  0., 33.],
       [ 0.,  1.,  0., 12.],
       [ 0.,  0.,  1., 18.]])
>>> vec.get_feature_names_out()
array(['city=Dubai', 'city=London', 'city=San Francisco', 'temperature'], ...)

DictVectorizer also accepts multiple string values for one feature, for example multiple categories for a movie.
Assume a database classifies each movie using some categories (not mandatory) and its year of release.
>>> movie_entry = [{'category': ['thriller', 'drama'], 'year': 2003},
...                {'category': ['animation', 'family'], 'year': 2011},
...                {'year': 1974}]
>>> vec.fit_transform(movie_entry).toarray()
array([[0.000e+00, 1.000e+00, 0.000e+00, 1.000e+00, 2.003e+03],
       [1.000e+00, 0.000e+00, 1.000e+00, 0.000e+00, 2.011e+03],
       [0.000e+00, 0.000e+00, 0.000e+00, 0.000e+00, 1.974e+03]])
>>> vec.get_feature_names_out()
array(['category=animation', 'category=drama', 'category=family', 'category=thriller', 'year'], ...)
>>> vec.transform({'category': ['thriller'],
...                'unseen_feature': '3'}).toarray()
array([[0., 0., 0., 1., 0.]])

DictVectorizer is also a useful representation transformation for training sequence classifiers in Natural Language Processing models that typically work by extracting feature windows around a particular word of interest.
For example, suppose that we have a first algorithm that extracts Part of Speech (PoS) tags that we want to use as complementary tags for training a sequence classifier (e.g. a chunker). The following dict could be such a window of features extracted around the word ‘sat’ in the sentence ‘The cat sat on the mat.’:
>>> pos_window = [
...     {
...         'word-2': 'the',
...         'pos-2': 'DT',
...         'word-1': 'cat',
...         'pos-1': 'NN',
...         'word+1': 'on',
...         'pos+1': 'PP',
...     },
...     # in a real application one would extract many such dictionaries
... ]

This description can be vectorized into a sparse two-dimensional matrix suitable for feeding into a classifier (maybe after being piped into a TfidfTransformer for normalization):
>>> vec = DictVectorizer()
>>> pos_vectorized = vec.fit_transform(pos_window)
>>> pos_vectorized
<Compressed Sparse...dtype 'float64' with 6 stored elements and shape (1, 6)>
>>> pos_vectorized.toarray()
array([[1., 1., 1., 1., 1., 1.]])
>>> vec.get_feature_names_out()
array(['pos+1=PP', 'pos-1=NN', 'pos-2=DT', 'word+1=on', 'word-1=cat', 'word-2=the'], ...)

As you can imagine, if one extracts such a context around each individual word of a corpus of documents, the resulting matrix will be very wide (many one-hot features), with most of them being zero most of the time. So that the resulting data structure fits in memory, the DictVectorizer class uses a scipy.sparse matrix by default instead of a numpy.ndarray.
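To make the "very wide, mostly zero" observation concrete, here is a minimal pure-Python sketch that extracts one feature window per word of the example sentence and measures how few entries of the resulting one-hot matrix would be nonzero. The helper `extract_windows`, its `before`/`after` parameters, and the tag 'VBD' for 'sat' are assumptions made for illustration; only the window layout (two words before, one after) follows the dict shown above.

```python
def extract_windows(tagged_sentence, before=2, after=1):
    """Hypothetical helper: build one feature dict per token from its neighbours."""
    windows = []
    for i in range(len(tagged_sentence)):
        features = {}
        for offset in range(-before, after + 1):
            j = i + offset
            if offset == 0 or not 0 <= j < len(tagged_sentence):
                continue  # skip the word of interest and out-of-sentence positions
            word, pos = tagged_sentence[j]
            features[f"word{offset:+d}"] = word.lower()
            features[f"pos{offset:+d}"] = pos
        windows.append(features)
    return windows


# PoS tags are illustrative; 'VBD' for 'sat' is an assumption.
sentence = [("The", "DT"), ("cat", "NN"), ("sat", "VBD"),
            ("on", "PP"), ("the", "DT"), ("mat", "NN")]
windows = extract_windows(sentence)

# One column per distinct "name=value" pair, as one-hot coding would create.
columns = {f"{name}={value}" for window in windows for name, value in window.items()}
nonzero = sum(len(window) for window in windows)
density = nonzero / (len(windows) * len(columns))

print(windows[2])  # the window around 'sat'
print(len(columns), "columns,", f"{density:.0%} of entries are nonzero")
```

The list returned by `extract_windows` has the same shape as `pos_window`, so it could be passed to `DictVectorizer().fit_transform(...)` in the same way. Every additional sentence adds new "name=value" columns while each row still stores at most six values, which is why a sparse representation pays off.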