## Introduction

With the **IPL season** coming up (for those who are not familiar with the IPL, it is to **cricket** what the **Europa League** is to football or the **NBA** is to basketball, followed in India and all the cricketing nations), I wanted to share a use case of **Data Science** in cricket. Data Science and Analytics are used extensively in sports. You can read more about Data Science and **Sports Analytics** over here.

In this blog post, I will walk you through a Data Science use case called the IPL score predictor. Before going into the project, I would like to explain the game of cricket in a jiffy for those who are not familiar with it. If you already know the game, feel free to skip ahead to the code.

## What is Cricket?

Cricket is played with two teams of 11 players each. Each team takes turns batting and playing the field, as in baseball. In cricket, the batter is a batsman and the pitcher is a bowler. The bowler tries to knock down the bail of the wicket. A batsman tries to prevent the bowler from hitting the wicket by hitting the ball. Two batsmen are on the pitch at the same time.

I hope you have understood the basics of the game. Now, let’s just dive into the problem statement.

## IPL Score Predictor

## Problem Statement

The problem statement is:

Using the IPL dataset, predict the score of your favourite team.

Let us read the data file and take an initial look at the data. You can access the data from **this link**.

## Data Analysis and Data Preprocessing

```
import numpy as np
import pandas as pd

# Load the data and preview the first five rows
ipl_data = pd.read_csv("/content/ipl_data.csv")
ipl_data.head()
```

Now we know how the data looks. The next step involves analysing the **basic metrics**, such as the shape of the data, the number of null values and the data type of each variable. All of this can be done with the pandas **info() function**. The code cell below demonstrates this.

`ipl_data.info()`
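The same basic metrics can also be pulled out one by one. The following is a minimal sketch on a small made-up DataFrame (the `toy` frame and its values are assumptions for illustration, not rows from the IPL dataset):

```python
import pandas as pd

# Hypothetical toy frame standing in for ipl_data; the values are made up.
toy = pd.DataFrame({
    "bat_team": ["Mumbai Indians", "Chennai Super Kings", None],
    "runs": [45, 61, 30],
    "overs": [5.1, 7.3, 4.0],
})

n_rows, n_cols = toy.shape        # (number of rows, number of columns)
null_counts = toy.isnull().sum()  # missing values per column
column_types = toy.dtypes         # data type of each column

print(n_rows, n_cols)
print(null_counts)
print(column_types)
```

`info()` reports all three in a single summary; the separate attributes are handy when you need the numbers in code.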

From the above output, it becomes pretty clear that we have roughly **76,014** **non-null entries** and a total of **14 variables/features** in our dataset. The next steps in our data analysis and preprocessing involve finding the relationships/dependencies among the variables, removing unnecessary columns, encoding the categorical columns, etc.

Next, we drop some **unnecessary variables** and move on to further preprocessing.

```
# Drop the unnecessary cols like mid, striker and non striker
columns_to_drop = ['mid', 'striker', 'non-striker']
ipl_data.drop(columns_to_drop, axis = 1, inplace = True)
```

Having dropped the unnecessary variables, we next examine the teams that are present in our dataset.

```
# Print all the teams
print(np.unique(ipl_data['bat_team']))
```

Cricket fans and followers of the IPL will be aware that some of the teams shown in the above output have either dropped out completely or changed their names. So, in the next step, I will drop the teams that are no longer associated with the IPL and consider only the current teams. For example, the Gujarat Lions do not play anymore, so all the data points associated with them will be dropped.

```
# Keep only the teams that are still playing
all_teams = np.unique(ipl_data['bat_team'])
old_teams = ['Deccan Chargers', 'Kochi Tuskers Kerala', 'Gujarat Lions', 'Rising Pune Supergiant', 'Pune Warriors', 'Rising Pune Supergiants']
current_teams = [team for team in all_teams if team not in old_teams]
# Keep the rows where both the batting and the bowling team are current teams
ipl_data = ipl_data[ipl_data['bat_team'].isin(current_teams) & ipl_data['bowl_team'].isin(current_teams)]
```

Having removed the old teams, we examine the correlation between the variables. Correlation is a statistical term describing **the degree to which two variables move in coordination with one another**. This can be done in Python using the `pandas.DataFrame.corr()` function.

There seems to be a **high correlation** between **runs and overs** and between **runs and runs_last_5**. This makes sense: as the number of overs increases, the runs increase as well.
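The correlation check described above can be sketched as follows. This is a minimal example on made-up numbers (the `toy` frame is an assumption for illustration, not the IPL data); it keeps only the numeric columns before calling `corr()`:

```python
import pandas as pd

# Hypothetical toy data: runs grow roughly in step with overs.
toy = pd.DataFrame({
    "overs": [1.0, 5.0, 10.0, 15.0, 20.0],
    "runs": [8, 42, 80, 121, 165],
    "wickets": [0, 1, 2, 4, 5],
    "venue": ["A", "B", "A", "B", "A"],  # non-numeric, excluded below
})

# Pearson correlation is only defined for numeric columns.
numeric = toy.select_dtypes(include="number")
corr_matrix = numeric.corr()

runs_vs_overs = corr_matrix.loc["runs", "overs"]
print(corr_matrix.round(2))
```

Each cell of `corr_matrix` lies between -1 and 1, and a value close to 1 for `runs` against `overs` reflects the pattern mentioned above.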

The next step in data preprocessing is encoding the categorical variables. Categorical variables can be of two types: **Ordinal** and **Nominal**. **Ordinal categorical variables have an underlying order**, for example Good, Better, Best or High, Medium, Low. **Nominal categorical variables do not have any underlying order**, for example Gender.

I will use sklearn’s **LabelEncoder** for ordinal variables and pandas **get_dummies()** for the nominal variables. After encoding the categorical variables, the data looks something like this.
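As a rough illustration of the two encoders, here is a minimal sketch on a made-up frame. The `run_rate_band` column is hypothetical and exists only to show an ordinal-style variable; which IPL columns are treated as ordinal or nominal is not specified here, so treat the column choices as assumptions.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical toy frame; run_rate_band is an invented column.
toy = pd.DataFrame({
    "bat_team": ["Mumbai Indians", "Chennai Super Kings", "Mumbai Indians"],
    "run_rate_band": ["Low", "High", "Medium"],
    "runs": [45, 61, 30],
})

# LabelEncoder maps each label to an integer code.
# Codes follow the sorted label order: High=0, Low=1, Medium=2.
encoder = LabelEncoder()
toy["run_rate_band_code"] = encoder.fit_transform(toy["run_rate_band"])

# get_dummies replaces a nominal column with one indicator column per category.
encoded = pd.get_dummies(toy, columns=["bat_team"])
print(encoded.columns.tolist())
```

Because `LabelEncoder` assigns codes in sorted label order, an explicit mapping (for example `{"Low": 0, "Medium": 1, "High": 2}`) may be preferable when the numeric codes need to follow the real-world order.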

The next step is bringing all the variables to the same scale, which is known as feature scaling. This step is important if you are going to build a **Linear Regression model** or a **KNN** model, as feature scaling can help improve the performance of these models. I will be using sklearn’s **StandardScaler** class to scale the variables. It rescales each variable to a mean of 0 and a standard deviation of 1.
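The effect of standard scaling can be checked on a small array. This is a minimal sketch with made-up numbers (the matrix `X` is an assumption for illustration, not the IPL features):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical feature matrix: two columns on very different scales.
X = np.array([
    [10.0, 1.0],
    [50.0, 5.0],
    [90.0, 10.0],
    [150.0, 20.0],
])

scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)  # subtract column mean, divide by column std

col_means = X_scaled.mean(axis=0)   # approximately 0 for every column
col_stds = X_scaled.std(axis=0)     # approximately 1 for every column
print(col_means, col_stds)
```

When the data is split into training and test sets, a common practice is to call `fit_transform` on the training split only and then `transform` on the test split, so that the test data does not influence the scaling parameters.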

After scaling, the data looks something like this.

After all the preprocessing steps, we are ready for machine learning modelling.

## Machine Learning Modelling

The next step is building a Machine Learning model that predicts the total runs scored in a match. **A decent approach is to start with simple models and move sequentially to more complex ones. Once we have tested a decent number of models, we can shortlist the top performers and start tuning their hyperparameters to further improve the results.** I have chosen the following models, in the given order.

- KNN Regressor
- Linear Regression
- Decision Tree Regressor
- RandomForest Regressor
- XGBoost Regressor
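Trying the models in this order can be sketched as a simple loop. The example below is only an illustration on synthetic data (the features, target and hyperparameters are assumptions, not the settings used in this project), and XGBoost is included only if the `xgboost` package is installed:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsRegressor
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in for the preprocessed features and the total score.
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 4))
y = 150 + 20 * X[:, 0] - 10 * X[:, 1] + rng.normal(scale=5, size=300)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# Models listed from simple to complex, with mostly default settings.
models = {
    "KNN Regressor": KNeighborsRegressor(),
    "Linear Regression": LinearRegression(),
    "Decision Tree Regressor": DecisionTreeRegressor(random_state=0),
    "RandomForest Regressor": RandomForestRegressor(n_estimators=50, random_state=0),
}
try:
    from xgboost import XGBRegressor  # optional dependency
    models["XGBoost Regressor"] = XGBRegressor(n_estimators=50, random_state=0)
except ImportError:
    pass  # skip XGBoost if the package is not available

# Fit each model and record its mean absolute error on the held-out split.
mae_by_model = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    mae_by_model[name] = mean_absolute_error(y_test, model.predict(X_test))

for name, mae in mae_by_model.items():
    print(f"{name}: MAE = {mae:.2f}")
```

Rankings on synthetic data like this say nothing about the IPL data; the point is only the workflow of fitting every candidate on the same split and comparing them with the same metric.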

The following table compares these models on the basis of various regression performance metrics.
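Metrics commonly used for this kind of comparison include MAE, MSE, RMSE and R². The exact set used in the comparison is not listed in this post, so the selection below is an assumption; the sketch shows how each one is computed from true and predicted scores (the numbers are made up):

```python
import numpy as np

def regression_metrics(y_true, y_pred):
    """Return common regression metrics for true vs. predicted values."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    errors = y_true - y_pred

    mae = np.mean(np.abs(errors))                   # mean absolute error
    mse = np.mean(errors ** 2)                      # mean squared error
    rmse = np.sqrt(mse)                             # root mean squared error
    ss_res = np.sum(errors ** 2)                    # residual sum of squares
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # total sum of squares
    r2 = 1.0 - ss_res / ss_tot                      # coefficient of determination

    return {"MAE": mae, "MSE": mse, "RMSE": rmse, "R2": r2}

# Hypothetical true and predicted totals.
metrics = regression_metrics([150, 160, 170, 180], [148, 163, 169, 184])
print(metrics)
```

Lower MAE, MSE and RMSE mean smaller errors, while an R² closer to 1 means the model explains more of the variation in the scores.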

From the above table, it is clear that the **XGBoost** model performs best among all the models. **Linear Regression’s poor performance can be attributed to the multicollinearity among the variables.**

## Conclusion

So, this was a use case of Data Science in Sports. You can access the entire code repository **here**. I hope you found the blog post informative. Please do share it with fellow learners and help the community grow together. Please do let me know your feedback in the comments. You can also reach out to me over **LinkedIn** and follow me on **Twitter**. Happy Learning. 🙂