Basics of Data Science using Python

After reading the title above, what questions come to mind? Most likely: why Python? How can we use Python to implement data science? And what are the advantages and disadvantages?

We will answer these questions and also talk about the libraries that we can use to implement data science. Let’s start with our first question: why Python?

Python is open-source and has numerous libraries available for data science applications. These libraries provide great functionality for mathematical, numerical, statistical, and scientific computing. Python’s easy syntax also makes it one of the best languages for implementation.

Now, let’s talk about the second question: how can we use Python to implement data science?

There are a number of libraries available to us. Let’s look at some of these –

1. NumPy

NumPy (short for Numerical Python) is a fast interface for storing and manipulating dense data buffers. NumPy arrays are similar to Python’s built-in list type in some ways, but NumPy arrays provide much more efficient storage and data operations as the arrays grow in size. NumPy arrays are at the heart of nearly the entire ecosystem of Python data science tools, so learning to use NumPy effectively will be beneficial regardless of which aspect of data science you are interested in.

This is how we implement NumPy in a program – 

import numpy as np

a = np.array((1, 2, 3))        # Creating a one-dimensional array
b = np.zeros((2, 5))           # Creating a 2 x 5 array filled with zeros
c = np.random.random((2, 2))   # Creating a 2 x 2 array with random values

print("A 1D array: {}".format(a))
print()
print("A 2D array: {}".format(b))
print()
print("A random 2 X 2 array: {}".format(c))	
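
To make the comparison with Python’s built-in list more concrete, here is a minimal sketch. The data and the variable names (values_list, values_array, matrix) are made up for this illustration; it only shows the difference in style between looping over a list and operating on a whole array.

```python
import numpy as np

# Hypothetical data: the integers 0..9 as a Python list and as a NumPy array
values_list = list(range(10))
values_array = np.arange(10)

# With a list, element-wise arithmetic needs an explicit loop or comprehension
squared_list = [v ** 2 for v in values_list]

# With an array, the same operation is applied to every element at once
squared_array = values_array ** 2

print("Squares from the list: {}".format(squared_list))
print("Squares from the array: {}".format(squared_array))

# Arrays hold a single data type and carry their own shape
matrix = values_array.reshape(2, 5)
print("Shape: {}, dtype: {}".format(matrix.shape, matrix.dtype))
print("Column sums: {}".format(matrix.sum(axis=0)))
```

Both approaches give the same squares, but the array version avoids the Python-level loop, which is where the speed advantage on large data comes from.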

2. SciPy

SciPy is a Python library for scientific and mathematical computing. It is built on top of the NumPy library and extends it with routines for tasks such as matrix rank, matrix inverse, polynomial equations, LU decomposition, and so on. Using its high-level functions significantly reduces the complexity of the code and aids in better data analysis. Used from an interactive Python session, SciPy serves as a data-processing environment that competes with systems such as MATLAB, Octave, R-Lab, and others. It has a plethora of user-friendly and efficient functions for problems such as numerical integration, interpolation, optimization, linear algebra, and statistics.

This is how we implement linear algebra from SciPy in a program –

import numpy as np
from scipy import linalg

A = np.array([[1,2,3],[4,5,6],[7,8,8]])

# Compute the determinant of a matrix
print("Determinant of matrix A: {}".format(linalg.det(A)))
print()

# LU decomposition with pivoting: A = P @ L @ U
P, L, U = linalg.lu(A)

print("Matrix P: \n {}".format(P))
print()
print("Matrix L: \n {}".format(L))
print()
print("Matrix U: \n {}".format(U))
print()

# Multiply the three factors back together; the product reproduces A
print("Reconstruction of A from P, L and U: \n {}".format(P @ L @ U))
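
Linear algebra is only one of the areas listed above. As a rough sketch of two others, numerical integration and optimization, the example below uses scipy.integrate.quad and scipy.optimize.minimize_scalar. The two functions being integrated and minimized are arbitrary choices made for this illustration.

```python
from scipy import integrate, optimize

# Numerical integration: integrate x^2 from 0 to 3 (the exact answer is 9)
area, abs_error = integrate.quad(lambda x: x ** 2, 0, 3)
print("Integral of x^2 on [0, 3]: {:.4f}".format(area))

# Optimization: find the minimum of (x - 2)^2 + 1, which lies at x = 2
result = optimize.minimize_scalar(lambda x: (x - 2) ** 2 + 1)
print("Minimum found at x = {:.4f}".format(result.x))
```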

3. Pandas

Pandas is an open-source Python library for manipulating tabular data (i.e. exploring, cleaning, and processing it). The name Pandas is derived from “panel data”, a term from econometrics.

At a high level, Pandas functions similarly to a spreadsheet (think Microsoft Excel or Google Sheets) in that you work with rows and columns. Pandas is a pillar library in any data science workflow because it allows you to perform data processing, wrangling, and munging. This is especially important because many people believe that the data pre-processing stage takes up the majority of a data scientist’s time.

This is how we implement Pandas in a program –

import pandas as pd

# Making data frame from csv file 
data = pd.read_csv("abc.csv")

# Retrieving rows by loc method 
row1 = data.loc[3]
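
The example above needs a CSV file on disk. As a self-contained sketch of the explore, clean, and process steps mentioned earlier, the snippet below builds a small DataFrame in memory. The column names and values are made up for this illustration.

```python
import pandas as pd

# Hypothetical data; the column names and values are made up for this sketch
data = pd.DataFrame({
    "city": ["Delhi", "Mumbai", "Delhi", "Pune", "Mumbai"],
    "sales": [250.0, None, 300.0, 150.0, 400.0],
})

# Explore: look at the first rows and count the missing values per column
print(data.head())
print(data.isna().sum())

# Clean: drop the rows where the sales value is missing
cleaned = data.dropna(subset=["sales"])

# Process: total sales per city
totals = cleaned.groupby("city")["sales"].sum()
print(totals)
```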

4. Matplotlib

Matplotlib is a low-level Python library used for data visualization. It is simple to use and emulates MATLAB’s plotting features. The library is built on NumPy arrays and supports many plot types, such as line charts, bar charts, histograms, and so on. It provides a lot of flexibility, but at the expense of having to write more code. It was created in 2002 by John Hunter as a patch to IPython to enable interactive MATLAB-style plotting via Gnuplot from the IPython command line.

This is how we install Matplotlib from the command line (the package name is matplotlib; pyplot is a module inside it) –

pip install matplotlib

Examples –

import matplotlib.pyplot as plt

plt.plot([1, 2, 3, 4], [1, 4, 9, 16])
plt.axis([0, 6, 0, 20])

plt.show()
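
The line chart above is only one of the plot types mentioned. Here is a minimal sketch of a bar chart and a histogram drawn side by side. The data, the output file name charts.png, and the choice of the non-interactive Agg backend (so that the figure is written to a file instead of opening a window) are all assumptions made for this illustration.

```python
import matplotlib
matplotlib.use("Agg")  # Assumption: render to a file, no display needed
import matplotlib.pyplot as plt
import numpy as np

# Hypothetical data for the two plots
categories = ["A", "B", "C"]
counts = [5, 9, 3]
samples = np.random.default_rng(0).normal(size=500)

# One figure with two side-by-side axes
fig, (ax_bar, ax_hist) = plt.subplots(1, 2, figsize=(8, 3))

ax_bar.bar(categories, counts)
ax_bar.set_title("Bar chart")

ax_hist.hist(samples, bins=20)
ax_hist.set_title("Histogram")

fig.tight_layout()
fig.savefig("charts.png")  # Hypothetical output file name
```

In an interactive session you can drop the matplotlib.use("Agg") line and call plt.show() instead of saving the figure.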

5. Scikit-learn

Scikit-learn is one of many scikits (short for SciPy Toolkits), and it specializes in machine learning. A scikit is a package that is too specialized to be included in SciPy itself and is therefore packaged separately. Another popular scikit is scikit-image, a collection of algorithms for image processing.

Scikit-learn is by far one of the most important Python libraries for machine learning, as it allows you to create machine learning models while also providing utility functions for data preparation, post-model analysis, and evaluation.

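Scikit-learn is installed from the command line, and its package name is scikit-learn –

pip install scikit-learn

This is how we load a dataset and train a simple model in Python using Scikit-learn –

from sklearn.datasets import load_iris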
from sklearn.preprocessing import MinMaxScaler
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
import pandas as pd

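# Load the data
X, y = load_iris(return_X_y=True, as_frame=True)

# Split into training & test set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Fit the scaler on the training data only, then apply it to both sets,
# so that no information from the test set leaks into training
min_max_scaler = MinMaxScaler()
X_train = min_max_scaler.fit_transform(X_train)
X_test = min_max_scaler.transform(X_test)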

# Building a simple Logistic Regression (Linear) model
lr_clf = LogisticRegression()
lr_clf.fit(X_train, y_train)

# Print the accuracy score on the train & the test set
train_acc = lr_clf.score(X_train, y_train)
test_acc = lr_clf.score(X_test, y_test)

print('Train Accuracy: {} %'.format(train_acc * 100))
print('Test Accuracy: {} %'.format(test_acc * 100))
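
The utility functions for data preparation and evaluation mentioned above can also be combined. The sketch below is one possible way to do that, not the only one: it chains the scaler and the model with make_pipeline, estimates accuracy with 5-fold cross-validation, and prints a confusion matrix for the test set. The max_iter=1000 setting and the choice of 5 folds are assumptions made for this illustration.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Data preparation and model chained into a single estimator
model = make_pipeline(MinMaxScaler(), LogisticRegression(max_iter=1000))

# Evaluation: 5-fold cross-validation on the training set
cv_scores = cross_val_score(model, X_train, y_train, cv=5)
print("Mean CV accuracy: {:.3f}".format(cv_scores.mean()))

# Post-model analysis: which classes get confused on the test set
model.fit(X_train, y_train)
cm = confusion_matrix(y_test, model.predict(X_test))
print(cm)
```

Because the scaler sits inside the pipeline, it is refitted on the training part of every fold, so the held-out part never influences the scaling.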

6. TensorFlow

TensorFlow is a free and open-source software library. It was created by researchers and engineers on Google’s Machine Intelligence research team for the purpose of conducting machine learning and deep neural network research, but the system is general enough to be applicable in a wide range of other domains as well.

TensorFlow is a software library that uses data flow graphs to perform numerical computations.

  • The graph’s nodes represent mathematical operations.
  • The graph’s edges represent the multidimensional data arrays (called tensors) that are communicated between them.

This is how we import TensorFlow in a program –

import tensorflow as tf

Some examples –

import tensorflow as tf

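# Creating two constant tensors and an addition operation.
# Assumption: TensorFlow 2.x, which runs operations eagerly,
# so node3 already holds the result and .numpy() returns its value.
node1 = tf.constant(3, dtype=tf.int32)
node2 = tf.constant(5, dtype=tf.int32)
node3 = tf.add(node1, node2)

print("Addition of node 1 & node 2 gives: {}".format(node3.numpy()))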
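
To see the data flow graph described above, one option is to wrap the computation in tf.function and inspect the graph that TensorFlow traces for it. This is a minimal sketch that assumes TensorFlow 2.x; the function name add_nodes is made up for this illustration.

```python
import tensorflow as tf

# tf.function traces the Python function into a data flow graph
@tf.function
def add_nodes(a, b):
    return tf.add(a, b)

# Calling the function runs the traced graph
result = add_nodes(tf.constant(3), tf.constant(5))
print("Result: {}".format(result.numpy()))

# Inspect the traced graph: each operation is a node,
# and the tensors passed between operations are the edges
concrete = add_nodes.get_concrete_function(tf.constant(3), tf.constant(5))
op_types = [op.type for op in concrete.graph.get_operations()]
print("Operations in the graph: {}".format(op_types))
```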

7. Keras

Keras is a Python-based deep learning API that runs on top of the TensorFlow machine learning platform. It was created with the goal of allowing for quick experimentation, and it is widely used for training deep learning models. Finding the right dataset for developing models is a common issue in deep learning, so Keras also ships with a few small datasets, such as MNIST, that can be loaded directly.

Many people prefer Keras over TensorFlow because it provides a much better “user experience.” Keras was developed in Python, making it easier for Python developers to understand. It is an easy-to-use library with a lot of power.

This is how we import one of the datasets bundled with Keras –

from keras.datasets import mnist

Examples –

# First neural network with keras tutorial
from numpy import loadtxt
from keras.models import Sequential
from keras.layers import Dense

# Load the dataset
dataset = loadtxt('pima-indians-diabetes.csv', delimiter=',')

# Split into input (X) and output (y) variables
X = dataset[:,0:8]
y = dataset[:,8]

# Define the keras model
model = Sequential()
model.add(Dense(12, input_dim=8, activation='relu'))
model.add(Dense(8, activation='relu'))
model.add(Dense(1, activation='sigmoid'))

# Compile the keras model
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])

# Fit the keras model on the dataset
model.fit(X, y, epochs=150, batch_size=10)

# Evaluate the keras model
_, accuracy = model.evaluate(X, y)
print('Accuracy: %.2f' % (accuracy*100))

8. PyTorch

In many ways, PyTorch tensors behave similarly to the arrays we know and love from NumPy. After all, NumPy arrays are just tensors. PyTorch takes these tensors and makes it simple to move them to GPUs for faster processing during neural network training. It also includes a module for automatically calculating gradients (for backpropagation) and another for building neural networks. Overall, many users find that PyTorch fits more naturally with Python and the NumPy stack than TensorFlow and other frameworks.

Examples –

# Importing torch
import torch

# Creating tensors
t1 = torch.tensor([1, 2, 3, 4])
t2 = torch.tensor([[1, 2, 3, 4],
                   [5, 6, 7, 8],
                   [9, 10, 11, 12]])

# Printing the tensors:
print("Tensor t1: \n", t1)
print("\nTensor t2: \n", t2)

# Rank of tensors
print("\nRank of t1: ", len(t1.shape))
print("Rank of t2: ", len(t2.shape))

# Shape of tensors
print("\nShape of t1: ", t1.shape)
print("Shape of t2: ", t2.shape)
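
The NumPy interoperability, GPU support, and automatic gradients described above can be sketched as follows. This is a minimal illustration, not a training loop. The values are made up, and the code falls back to the CPU when no GPU is available.

```python
import numpy as np
import torch

# NumPy interoperability: create a tensor from an existing array
array = np.array([1.0, 2.0, 3.0])
tensor = torch.from_numpy(array)
print("Tensor from NumPy: ", tensor)

# Move the tensor to a GPU when one is available, otherwise stay on the CPU
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
tensor = tensor.to(device)
print("Tensor lives on: ", tensor.device)

# Automatic gradients: y = sum(x^2), so dy/dx = 2x
x = torch.tensor([1.0, 2.0, 3.0], requires_grad=True)
y = (x ** 2).sum()
y.backward()
print("Gradient of y with respect to x: ", x.grad)
```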

Conclusion

This blog post covered the very basics of Data Science using Python. Some topics that weren’t covered here but are of utmost importance are Statistics, Probability Theory, Linear Algebra & Optimization. These can all be implemented with the Python libraries discussed above. We hope that you find this blog post informative. This blog post is written by Jasmeer Singh. You can connect with him on LinkedIn. If you like our content, please share it within your community & let us know your feedback. Thanks for reading & Happy Learning 🙂
