Saturday, July 20, 2019

How to use the Scikit-learn Python library for data science projects



The Scikit-learn Python library, whose development began in 2007, is commonly used to solve machine learning and data science problems from start to finish. The versatile library offers a clean, consistent and efficient API, along with thorough online documentation.

What is Scikit-learn?

Scikit-learn is an open source Python library with powerful tools for data analysis and data mining. It is available under the BSD license and is built on the following scientific Python libraries:

  • NumPy, a library for manipulating multidimensional arrays and matrices. It also has a large collection of mathematical functions for performing various calculations.
  • SciPy, an ecosystem made up of various libraries for technical computing tasks.
  • matplotlib, a library for drawing various graphs and charts.

Scikit-learn offers a wide range of built-in algorithms that data science projects can take full advantage of.

Here are the main ways in which the Scikit-learn library is used.

1. Classification

The classification tools identify the category that a given piece of data belongs to. For example, they can be used to classify e-mail messages as spam or not spam.

The classification algorithms in Scikit-learn include:

  • Support vector machines (SVM)
  • Nearest neighbors
  • Random forest
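
As a rough sketch of what classification looks like in practice, the snippet below trains a support vector machine on the Iris data set that ships with Scikit-learn and checks its accuracy on held-out samples. The 25% test split, the random_state value and the variable names are choices made for this illustration, not recommendations.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Load the features (X) and the class labels (y).
X, y = load_iris(return_X_y=True)

# Hold out a quarter of the samples to evaluate the model on unseen data.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

# Fit a support vector classifier and measure accuracy on the held-out set.
clf = SVC(kernel="rbf")
clf.fit(X_train, y_train)
accuracy = clf.score(X_test, y_test)
print(f"Test accuracy: {accuracy:.2f}")
```

Swapping SVC for another classifier, such as KNeighborsClassifier or RandomForestClassifier, should only require changing the import and the line that creates clf, since Scikit-learn estimators share the same fit and score interface.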

2. Regression

Regression involves creating a model that seeks to understand the relationship between input and output data. For example, regression tools can be used to understand the behavior of stock prices.

Regression algorithms include:

  • SVM
  • Ridge regression
  • Lasso
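
To make this more concrete, here is a small sketch that fits Ridge and Lasso models. It uses synthetic data from make_regression as a stand-in for real input and output data; the sample sizes, noise level and alpha values are arbitrary assumptions chosen for the illustration.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso, Ridge
from sklearn.model_selection import train_test_split

# Synthetic data: 10 input features, of which only 3 influence the output.
X, y = make_regression(
    n_samples=200, n_features=10, n_informative=3, noise=5.0, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Fit both models on the training split.
ridge = Ridge(alpha=1.0).fit(X_train, y_train)
lasso = Lasso(alpha=1.0).fit(X_train, y_train)

# score() returns the R^2 coefficient of determination on the test split.
ridge_r2 = ridge.score(X_test, y_test)
lasso_r2 = lasso.score(X_test, y_test)
print(f"Ridge R^2: {ridge_r2:.3f}")
print(f"Lasso R^2: {lasso_r2:.3f}")

# Lasso can shrink some coefficients to exactly zero.
n_zero = int(np.sum(lasso.coef_ == 0))
print(f"Lasso coefficients set to zero: {n_zero}")
```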

3. Clustering

The Scikit-learn clustering tools are used to automatically group data with similar characteristics into sets. For example, customer data can be segmented based on location.

Clustering algorithms include:

  • K-means
  • Spectral clustering
  • Mean-shift
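
The following sketch shows K-means grouping points into three clusters. The points are generated with make_blobs and merely stand in for something like customer locations; the number of clusters and the other parameter values are assumptions made for this example.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Generate 300 two-dimensional points scattered around 3 centers.
X, _ = make_blobs(n_samples=300, centers=3, cluster_std=0.6, random_state=0)

# Ask K-means for 3 clusters; n_init=10 reruns it from 10 different starts.
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0)
labels = kmeans.fit_predict(X)

print(labels[:10])              # cluster index assigned to the first 10 points
print(kmeans.cluster_centers_)  # coordinates of the 3 cluster centers
```

Note that no labels are passed to fit_predict: the algorithm only sees the coordinates and has to discover the groups on its own.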

4. Dimensionality reduction

Dimensionality reduction reduces the number of random variables to analyze. For example, to make visualizations more efficient, outlying data may be left out of consideration.

Dimensionality reduction algorithms include:

  • Principal component analysis (PCA)
  • Feature selection
  • Non-negative matrix factorization
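
As an illustration, the sketch below uses PCA to project the four features of the Iris data set bundled with Scikit-learn onto two components. Keeping two components is an assumption made here because two dimensions are easy to plot.

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

# Each Iris sample has 4 features.
X, _ = load_iris(return_X_y=True)

# Project the data onto its 2 directions of greatest variance.
pca = PCA(n_components=2)
X_2d = pca.fit_transform(X)

print(X.shape, "->", X_2d.shape)
# Fraction of the total variance captured by each kept component.
print(pca.explained_variance_ratio_)
```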

5. Model selection

Model selection algorithms offer tools to compare, validate and select the best parameters and models to be used in data science projects.

Model selection modules that can improve accuracy through parameter tuning include:

  • Grid search
  • Cross validation
  • Metrics
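
Here is a minimal sketch that combines these pieces: GridSearchCV tries every combination in a parameter grid, scores each one with five-fold cross validation and uses accuracy as the metric. The grid values below are arbitrary assumptions picked to keep the example small.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

# Candidate parameter values to compare (3 x 2 = 6 combinations).
param_grid = {"C": [0.1, 1, 10], "kernel": ["linear", "rbf"]}

# Evaluate every combination with 5-fold cross validation.
search = GridSearchCV(SVC(), param_grid, cv=5, scoring="accuracy")
search.fit(X, y)

print(search.best_params_)
print(f"Best cross-validated accuracy: {search.best_score_:.3f}")
```

After fitting, search.best_estimator_ holds a model retrained on all the data with the best parameters found.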

6. Preprocessing

The Scikit-learn preprocessing tools are important for feature extraction and normalization during data analysis. For example, you can use these tools to transform input data, such as text, into a form that can be used in the analysis.

Preprocessing modules include:

  • Preprocessing
  • Feature extraction
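
The sketch below shows one tool from each module: StandardScaler normalizes numeric columns, and CountVectorizer turns text into word counts. The numbers and the three short messages are made-up toy data for this illustration.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.preprocessing import StandardScaler

# Normalization: rescale each column to zero mean and unit variance.
X = np.array([[1.0, 200.0], [2.0, 300.0], [3.0, 400.0]])
X_scaled = StandardScaler().fit_transform(X)
print(X_scaled)

# Feature extraction: convert each message into a vector of word counts.
docs = ["free prize inside", "meeting at noon", "claim your free prize"]
vectorizer = CountVectorizer()
counts = vectorizer.fit_transform(docs)

print(sorted(vectorizer.vocabulary_))  # the words found in the messages
print(counts.toarray())                # one row per message, one column per word
```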

An example of the Scikit-learn library

Let's use a simple example to illustrate how you can use the Scikit-learn library in your data science projects.

We will use the Iris flower data set, which is included in the Scikit-learn library. The Iris flower data set contains 150 samples from three species of flowers:

  • Setosa, labeled 0
  • Versicolor, labeled 1
  • Virginica, labeled 2

The data set includes the following measurements for each flower (in centimeters):

  • Sepal length
  • Sepal width
  • Petal length
  • Petal width

Step 1: Import the library

Because the Iris dataset is included in the Scikit-learn data science library, we can load it into our workspace as follows:

from sklearn import datasets
iris = datasets.load_iris()

These commands import the datasets module from sklearn, then use the load_iris() method from datasets to load the data into the workspace.

Step 2: Acquire the characteristics of the data set

The datasets module contains several methods that make it easier to get acquainted with handling data.

In Scikit-learn, a data set is a dictionary-like object that holds all the details about the data. The measurements are stored under the .data key as a two-dimensional array, with one row per sample and one column per feature.

For example, we can use iris.data to view the measurements in the Iris flower data set.

print(iris.data)

Here is the output (the results have been truncated):

[[5.1 3.5 1.4 0.2]
[4.9 3.  1.4 0.2]
[4.7 3.2 1.3 0.2]
[4.6 3.1 1.5 0.2]
[5.  3.6 1.4 0.2]
[5.4 3.9 1.7 0.4]
[4.6 3.4 1.4 0.3]
[5.  3.4 1.5 0.2]
[4.4 2.9 1.4 0.2]
[4.9 3.1 1.5 0.1]
[5.4 3.7 1.5 0.2]
[4.8 3.4 1.6 0.2]
[4.8 3.  1.4 0.1]
[4.3 3.  1.1 0.1]
[5.8 4.  1.2 0.2]
[5.7 4.4 1.5 0.4]
[5.4 3.9 1.3 0.4]
[5.1 3.5 1.4 0.3]

We can also use iris.target to get the label assigned to each flower.

print(iris.target)

Here is the output:

[0000000000000000000000000000000000000[0000000000000000000000000000000000000[0000000000000000000000000000000000000[0000000000000000000000000000000000000
0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 2
2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
2 2]

If we use iris.target_names, we get an array of the label names found in the data set.

print(iris.target_names)

Here is the result after running the Python code:

['setosa' 'versicolor' 'virginica']
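
To tie these attributes together, the short sketch below prints the shape of the data, the feature names and the number of samples per species. It reloads the data set so that it runs on its own; the use of NumPy's bincount to count the labels is a choice made for this illustration.

```python
import numpy as np
from sklearn import datasets

iris = datasets.load_iris()

# (number of samples, number of features)
print(iris.data.shape)

# Names of the four measurements, in column order.
print(iris.feature_names)

# Count how many samples carry each label (0, 1 and 2).
counts = np.bincount(iris.target)
for name, count in zip(iris.target_names, counts):
    print(name, count)
```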

Step 3: Display the data set

We can use a box plot to produce a visual representation of the Iris flower data set. A box plot illustrates how the data are distributed through their quartiles.

Here's how to create one:

import seaborn as sns
box_data = iris.data  # variable representing the array of data
box_target = iris.target  # variable representing the array of labels
sns.boxplot(data=box_data, width=0.5, fliersize=5)
sns.set(rc={'figure.figsize': (2, 15)})  # applies to figures created after this call

Let's see the result:

On the horizontal axis:

  • 0 is the length of the sepal
  • 1 is the width of the sepal
  • 2 is the length of the petal
  • 3 is the width of the petal

The vertical axis shows the measurements in centimeters.
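
The quartiles that each box summarizes can also be computed directly. Here is a minimal sketch using NumPy's percentile function; it reloads the data set so that it runs on its own, and the variable names are assumptions made for this example.

```python
import numpy as np
from sklearn import datasets

iris = datasets.load_iris()

# Rows: 25th, 50th (median) and 75th percentiles; columns: the four features.
quartiles = np.percentile(iris.data, [25, 50, 75], axis=0)

for name, (q1, median, q3) in zip(iris.feature_names, quartiles.T):
    print(f"{name}: Q1={q1:.2f}, median={median:.2f}, Q3={q3:.2f}")
```

In a box plot, the box spans from the first quartile to the third quartile, and the line inside it marks the median.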

Wrapping up

Here is the complete code for this simple Scikit-learn data science tutorial.

from sklearn import datasets
iris = datasets.load_iris()
print(iris.data)
print(iris.target)
print(iris.target_names)
import seaborn as sns
box_data = iris.data  # variable representing the array of data
box_target = iris.target  # variable representing the array of labels
sns.boxplot(data=box_data, width=0.5, fliersize=5)
sns.set(rc={'figure.figsize': (2, 15)})  # applies to figures created after this call

Scikit-learn is a versatile Python library that you can use to efficiently complete data science projects.

If you want to know more, check out the tutorials on LiveEdu, such as the video by Andrey Bulezyuk on the use of the Scikit-learn library to create a machine learning application.

Do you have any questions or comments? Feel free to share them below.

