Introduction

With the IPL season coming up, I wanted to share a use case of Data Science in cricket. (For those who are not familiar with the IPL, it is to cricket, in India and across the cricketing nations, what the Europa League is to football or the NBA is to basketball.) Data Science and Analytics are used extensively in sports. You can read more about Data Science and Sports Analytics over here.

In this blog post, I will walk you through a Data Science use case called the IPL score predictor. Before going into the project, I would like to briefly explain the game of cricket for those who are not familiar with it. If you already know the game, feel free to skip ahead to the code part.

What is Cricket?

Cricket is played between two teams of 11 players each. Each team takes turns batting and fielding, as in baseball. In cricket, the batter is called a batsman and the pitcher is called a bowler. The bowler tries to knock the bails off the wicket. The batsman tries to prevent the bowler from hitting the wicket by hitting the ball. Two batsmen are on the pitch at the same time.

I hope you have understood the basics of the game. Now, let’s just dive into the problem statement.

Problem Statement

The problem statement is:

Using the IPL dataset, predict the score of your favourite team.

Let us read the data file and take an initial look at the data. You can access the data from this link.

Data Analysis and Data Preprocessing

``````# Load the data``````

Now we know how the data looks. The next step is to analyse basic metrics such as the shape of the data, the number of null values and the data type of each variable. All of this can be done with the pandas info() function, as the code cell below demonstrates.

``ipl_data.info()``

From the above output, it becomes pretty clear that we have around 76014 non-null entries and a total of 14 variables/features in our dataset. The next steps in our data analysis and preprocessing involve finding the relationships/dependencies amongst variables, removing unnecessary columns, encoding the categorical columns, etc.
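The individual checks that info() bundles together (shape, null counts, data types) can also be run one at a time. The sketch below is only an illustration on a tiny made-up sample, not the real IPL file: the rows, the values and the subset of column names are assumptions chosen for demonstration.

```python
import io
import pandas as pd

# Tiny made-up sample standing in for the real CSV (values are illustrative only)
sample_csv = io.StringIO(
    "mid,bat_team,bowl_team,runs,wickets,overs,total\n"
    "1,Team A,Team B,1,0,0.1,150\n"
    "1,Team A,Team B,5,0,0.2,150\n"
    "2,Team B,Team A,0,1,0.1,\n"
)
sample = pd.read_csv(sample_csv)

n_rows, n_cols = sample.shape          # number of rows and columns
null_counts = sample.isnull().sum()    # missing values per column
column_types = sample.dtypes           # data type of each column

print(n_rows, n_cols)
print(null_counts)
print(column_types)
```

On the real dataset the same three calls would be made on ipl_data instead of the toy frame.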

Next, we drop some unnecessary variables and move on to further preprocessing.

``````# Drop the unnecessary cols like mid, striker and non striker

columns_to_drop = ['mid', 'striker', 'non-striker']
ipl_data.drop(columns_to_drop, axis = 1, inplace = True)``````

Having dropped the unnecessary variables, we next examine the teams that are present in our dataset.

``````# Print all the teams
print(np.unique(ipl_data['bat_team']))``````

Cricket fans and followers of the IPL will be aware that some of the teams shown in the above output have either dropped out completely or changed their names. So, in the next step I will drop the teams that are no longer associated with the IPL and consider only the current teams. For example, the Gujarat Lions do not play anymore, so all the data points associated with them will be dropped.

``````# Exclude the teams that no longer play in the IPL

all_teams = np.unique(ipl_data['bat_team'])
old_teams = ['Deccan Chargers', 'Kochi Tuskers Kerala', 'Gujarat Lions', 'Rising Pune Supergiant', 'Pune Warriors', 'Rising Pune Supergiants']
current_teams = [team for team in all_teams if team not in old_teams]

# Keep only the rows where both the batting and the bowling team are current teams

ipl_data = ipl_data[ipl_data['bat_team'].isin(current_teams) & ipl_data['bowl_team'].isin(current_teams)]``````

Having removed the old teams, we examine the correlation between the variables. Correlation is a statistical term describing the degree to which two variables move in coordination with one another. This can be done in Python using the pandas.DataFrame.corr() function.

There seems to be a high correlation between runs and overs, and between runs and runs_last_5. This makes sense: as the number of overs increases, the runs increase as well.
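As a rough illustration of how such a correlation check works, the sketch below calls pandas.DataFrame.corr() on synthetic data. The data is an assumption made up for this example (runs are generated to grow with overs, and wickets is pure noise), so the numbers say nothing about the real IPL dataset.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in data: runs grow with overs by construction
rng = np.random.default_rng(0)
overs = np.linspace(0.1, 20.0, 120)
runs = 8.0 * overs + rng.normal(0, 5, size=overs.size)
wickets = rng.integers(0, 10, size=overs.size)   # unrelated noise column

demo = pd.DataFrame({"overs": overs, "runs": runs, "wickets": wickets})

# Pairwise Pearson correlation between the numeric columns
corr_matrix = demo.corr()
print(corr_matrix.round(2))
```

Each cell of the resulting matrix holds the correlation between one pair of columns, with 1.0 on the diagonal. On the real data, a frame with non-numeric columns would likely need those columns excluded first.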

The next step in the data preprocessing is encoding the categorical variables. Categorical variables can be of two types: Ordinal and Nominal. Ordinal categorical variables have an underlying order, for example Good, Better, Best or High, Medium, Low. Nominal categorical variables do not have any underlying order, for example Gender.

I will use sklearn’s LabelEncoder for ordinal variables and pandas get_dummies() for the nominal variables. After encoding the categorical variables the data looks something like this.
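The two encoders can be sketched on a toy frame as follows. The 'form' column and all the values here are hypothetical and invented for illustration; only 'bat_team' is a column name taken from the dataset.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical toy frame; the 'form' column and all values are invented
toy = pd.DataFrame({
    "form": ["Low", "High", "Medium", "High"],              # ordinal-style column
    "bat_team": ["Team A", "Team B", "Team A", "Team C"],   # nominal column
})

# LabelEncoder maps each distinct label to an integer code
# (codes follow the alphabetical order of the labels)
encoder = LabelEncoder()
toy["form_encoded"] = encoder.fit_transform(toy["form"])

# get_dummies creates one indicator column per team
encoded = pd.get_dummies(toy, columns=["bat_team"])
print(encoded)
```

Note that LabelEncoder assigns its codes in sorted label order (here High, Low, Medium), which does not necessarily match the intended ranking, so an explicit mapping may be preferable when the order itself matters.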

The next step is to bring all the variables to the same scale, which is known as feature scaling. This step is very important if you are going to build a Linear Regression model or a KNN model, as feature scaling can help improve the performance of these models. I will be using sklearn’s StandardScaler class to scale the variables. It rescales each variable to have a mean of 0 and a standard deviation of 1.
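A minimal sketch of what StandardScaler does, using a small made-up feature matrix (the numbers are assumptions for illustration and are not taken from the IPL data):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up feature matrix: two columns on very different scales
X = np.array([
    [10.0, 0.1],
    [50.0, 5.0],
    [120.0, 12.3],
    [180.0, 19.6],
])

scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)   # (x - column mean) / column std

print(X_scaled.mean(axis=0))  # approximately 0 for each column
print(X_scaled.std(axis=0))   # approximately 1 for each column
```

When a train/test split is involved, a common practice is to fit the scaler on the training data only and then apply transform() to the test data.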

After scaling, the data looks something like this.

After all the preprocessing steps, we are ready for the machine learning modelling.

Machine Learning Modelling

The next step is to build a Machine Learning model that predicts the total runs scored in a match. A decent approach is to start with simple models and move sequentially to more complex ones. Once we have tested a decent number of models, we can shortlist the top performers and start tuning their hyperparameters to further improve the results. I have chosen the following models, in the given order.

1. KNN Regressor
2. Linear Regression
3. Decision Tree Regressor
4. RandomForest Regressor
5. XGBoost Regressor

The following table compares these models on the basis of various regression performance metrics.

From the above table, it is clear that the XGBoost model performs best compared to every other model. Linear Regression’s poor performance may be attributable to the multicollinearity amongst the variables.
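A comparison of this kind could be put together roughly as follows. This is a sketch under stated assumptions, not the code behind the table: the data is synthetic, the metrics (MAE and R2) are my choice, and sklearn's GradientBoostingRegressor is swapped in as a stand-in for the XGBoost regressor to avoid the extra dependency.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, r2_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsRegressor
from sklearn.tree import DecisionTreeRegressor

# Synthetic regression data standing in for the preprocessed IPL features
rng = np.random.default_rng(42)
X = rng.normal(size=(300, 4))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.5, size=300)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42
)

models = {
    "KNN": KNeighborsRegressor(),
    "Linear Regression": LinearRegression(),
    "Decision Tree": DecisionTreeRegressor(random_state=42),
    "Random Forest": RandomForestRegressor(n_estimators=50, random_state=42),
    # Stand-in for the XGBoost regressor (assumption, see the note above)
    "Gradient Boosting": GradientBoostingRegressor(random_state=42),
}

# Fit each model on the training split and score it on the held-out split
results = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    predictions = model.predict(X_test)
    results[name] = {
        "MAE": mean_absolute_error(y_test, predictions),
        "R2": r2_score(y_test, predictions),
    }

for name, scores in results.items():
    print(f"{name}: MAE={scores['MAE']:.2f}, R2={scores['R2']:.2f}")
```

Because the synthetic target here is linear by construction, the ranking of the models will not match the one reported for the IPL data; the point is only the fit, predict and score loop.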

Conclusion

So, this was a use case of Data Science in Sports. You can access the entire code repository here. I hope you found the blog post informative. Please share it with fellow learners and help the community grow together, and let me know your feedback in the comments. You can also reach out to me on LinkedIn and follow me on Twitter. Happy Learning. 🙂