{ "cells": [ { "cell_type": "markdown", "id": "0c7ae510-5d86-4d43-9826-4c75064ac1cb", "metadata": {}, "source": [ "# Exploratory Data Analysis\n", "\n", "The analysis stage of the data lifecycle confirms that the data can answer the questions posed or solve a particular problem. This step can also focus on confirming that a model correctly addresses these questions and problems. This lesson focuses on Exploratory Data Analysis (EDA), a set of techniques for identifying features and relationships within the data that can also be used to prepare the data for modeling.\n", "\n", "We'll use an example dataset from [Kaggle](https://www.kaggle.com/balaka18/email-spam-classification-dataset-csv/version/1) to show how this can be applied with Python and the [Pandas](../pandas/intro_to_pandas) library. This dataset contains counts of some common words found in emails; the sources of these emails are anonymous.\n", "\n", "```{figure} ../images/ds/eda.png\n", "---\n", "name: eda\n", "---\n", "Exploratory Data Analysis\n", "```" ] }, { "cell_type": "code", "execution_count": 1, "id": "764bf506-22ac-43b8-98e7-e622f25c2fb9", "metadata": {}, "outputs": [], "source": [ "import matplotlib.pyplot as plt\n", "import numpy as np\n", "import pandas as pd" ] }, { "cell_type": "markdown", "id": "a1366cf8-1639-4829-8568-043b3ca4fc15", "metadata": {}, "source": [ "## Load the dataset" ] }, { "cell_type": "code", "execution_count": 2, "id": "c248e201-c734-4c13-8014-9f837cae9a9e", "metadata": {}, "outputs": [], "source": [ "email_df = pd.read_csv(\"../../data/emails.csv\")" ] }, { "cell_type": "markdown", "id": "5884e24c-fa03-48de-8f5e-8782eb72a5a5", "metadata": {}, "source": [ "## Data Profiling and Descriptive Statistics\n", "\n", "How do we evaluate whether we have enough data to solve this problem? Data profiling summarizes general information about our dataset using techniques from descriptive statistics.
Data profiling helps us understand what is available to us, and descriptive statistics help us understand how much of it is available.\n", "\n", "We can use Pandas's [`describe()` method](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.describe.html) for this. It provides the count, minimum and maximum values, mean, standard deviation, and quartiles (the 25%, 50%, and 75% rows) of the numerical columns. Descriptive statistics like those returned by `describe()` can help you assess how much data you have and whether you need more." ] }, { "cell_type": "code", "execution_count": 3, "id": "337c8aed-724b-457d-b7db-e7c4d55c486d", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " the to ect and for \\\n", "count 5172.000000 5172.000000 5172.000000 5172.000000 5172.000000 \n", "mean 6.640565 6.188128 5.143852 3.075599 3.124710 \n", "std 11.745009 9.534576 14.101142 6.045970 4.680522 \n", "min 0.000000 0.000000 1.000000 0.000000 0.000000 \n", "25% 0.000000 1.000000 1.000000 0.000000 1.000000 \n", "50% 3.000000 3.000000 1.000000 1.000000 2.000000 \n", "75% 8.000000 7.000000 4.000000 3.000000 4.000000 \n", "max 210.000000 132.000000 344.000000 89.000000 47.000000 \n", "\n", " of a you hou in ... \\\n", "count 5172.000000 5172.000000 5172.000000 5172.000000 5172.000000 ... \n", "mean 2.627030 55.517401 2.466551 2.024362 10.600155 ... \n", "std 6.229845 87.574172 4.314444 6.967878 19.281892 ... \n", "min 0.000000 0.000000 0.000000 0.000000 0.000000 ... \n", "25% 0.000000 12.000000 0.000000 0.000000 1.000000 ... \n", "50% 1.000000 28.000000 1.000000 0.000000 5.000000 ... \n", "75% 2.000000 62.250000 3.000000 1.000000 12.000000 ... \n", "max 77.000000 1898.000000 70.000000 167.000000 223.000000 ... 
\n", "\n", " connevey jay valued lay infrastructure \\\n", "count 5172.000000 5172.000000 5172.000000 5172.000000 5172.000000 \n", "mean 0.005027 0.012568 0.010634 0.098028 0.004254 \n", "std 0.105788 0.199682 0.116693 0.569532 0.096252 \n", "min 0.000000 0.000000 0.000000 0.000000 0.000000 \n", "25% 0.000000 0.000000 0.000000 0.000000 0.000000 \n", "50% 0.000000 0.000000 0.000000 0.000000 0.000000 \n", "75% 0.000000 0.000000 0.000000 0.000000 0.000000 \n", "max 4.000000 7.000000 2.000000 12.000000 3.000000 \n", "\n", " military allowing ff dry Prediction \n", "count 5172.000000 5172.000000 5172.000000 5172.000000 5172.000000 \n", "mean 0.006574 0.004060 0.914733 0.006961 0.290023 \n", "std 0.138908 0.072145 2.780203 0.098086 0.453817 \n", "min 0.000000 0.000000 0.000000 0.000000 0.000000 \n", "25% 0.000000 0.000000 0.000000 0.000000 0.000000 \n", "50% 0.000000 0.000000 0.000000 0.000000 0.000000 \n", "75% 0.000000 0.000000 1.000000 0.000000 1.000000 \n", "max 4.000000 3.000000 114.000000 4.000000 1.000000 \n", "\n", "[8 rows x 3001 columns]" ] }, "execution_count": 3, "metadata": {}, "output_type": "execute_result" } ], "source": [ "email_df.describe()" ] }, { "cell_type": "markdown", "id": "dccdf488-c18d-476d-a690-4eb49c928312", "metadata": {}, "source": [ "## Sampling and Querying\n", "\n", "Exploring everything in a large dataset can be very time-consuming and is usually a task left to a computer. However, sampling is a helpful tool for understanding what’s in the dataset and what it represents. With a sample, you can apply probability and statistics to reach some general conclusions about your data. While there’s no defined rule for how much data you should sample, the more data you sample, the more precise the generalizations you can make about the data.
Pandas provides the [`sample()` method](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.sample.html), which takes an argument specifying how many random rows you’d like to draw.\n", "\n", "Querying the data can help you answer general questions and test theories you may have. In contrast to sampling, queries give you control and let you focus on the specific parts of the data you have questions about. The [`query()` method](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.query.html) in the Pandas library lets you filter rows with a boolean expression over the columns and get simple answers about the data from the rows retrieved." ] }, { "cell_type": "code", "execution_count": 4, "id": "59dd83c5-b7b2-4544-895f-d7ac3ba69cbf", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " Email No. the to ect and for of a you hou ... connevey \\\n", "323 Email 324 4 6 1 1 3 2 56 3 0 ... 0 \n", "1615 Email 1616 6 1 1 2 1 1 26 1 0 ... 0 \n", "315 Email 316 6 5 2 3 1 1 34 8 0 ... 0 \n", "250 Email 251 22 10 1 3 4 4 85 1 1 ... 0 \n", "3414 Email 3415 2 2 3 0 3 1 31 2 2 ... 0 \n", "4725 Email 4726 10 12 4 4 4 1 87 2 0 ... 0 \n", "2999 Email 3000 0 0 1 0 1 0 5 0 0 ... 0 \n", "4131 Email 4132 3 4 2 1 1 1 13 3 0 ... 0 \n", "2353 Email 2354 0 0 1 1 0 0 9 0 0 ... 0 \n", "1359 Email 1360 1 1 1 0 1 0 3 0 0 ... 0 \n", "\n", " jay valued lay infrastructure military allowing ff dry \\\n", "323 0 0 0 0 0 0 1 0 \n", "1615 0 0 0 0 0 0 0 0 \n", "315 0 0 0 0 0 0 0 0 \n", "250 0 0 0 0 0 0 0 0 \n", "3414 0 0 0 0 0 0 0 0 \n", "4725 0 0 0 0 0 0 0 0 \n", "2999 0 0 0 0 0 0 0 0 \n", "4131 0 0 0 0 0 0 1 0 \n", "2353 0 0 0 0 0 0 0 0 \n", "1359 0 0 0 0 0 0 0 0 \n", "\n", " Prediction \n", "323 1 \n", "1615 0 \n", "315 0 \n", "250 0 \n", "3414 0 \n", "4725 0 \n", "2999 0 \n", "4131 1 \n", "2353 1 \n", "1359 0 \n", "\n", "[10 rows x 3002 columns]" ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Sampling 10 emails\n", "email_df.sample(10)" ] }, { "cell_type": "code", "execution_count": 5, "id": "3939e97e-d391-49ad-9f12-251a750b8de2", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " Email No. the to ect and for of a you hou ... connevey \\\n", "1 Email 2 8 13 24 6 6 2 102 1 27 ... 0 \n", "3 Email 4 0 5 22 0 5 1 51 2 10 ... 0 \n", "5 Email 6 4 5 1 4 2 3 45 1 0 ... 0 \n", "7 Email 8 0 2 2 3 1 2 21 6 0 ... 0 \n", "13 Email 14 4 5 7 1 5 1 37 1 3 ... 0 \n", "... ... ... .. ... ... ... .. ... ... ... ... ... \n", "5156 Email 5157 4 13 1 0 3 1 48 2 0 ... 0 \n", "5159 Email 5160 2 13 1 0 2 1 38 2 0 ... 0 \n", "5162 Email 5163 2 3 1 2 1 2 32 0 0 ... 0 \n", "5170 Email 5171 2 7 1 0 2 1 28 2 0 ... 0 \n", "5171 Email 5172 22 24 5 1 6 5 148 8 2 ... 0 \n", "\n", " jay valued lay infrastructure military allowing ff dry \\\n", "1 0 0 0 0 0 0 1 0 \n", "3 0 0 0 0 0 0 0 0 \n", "5 0 0 0 0 0 0 0 0 \n", "7 0 0 0 0 0 0 1 0 \n", "13 0 0 0 0 0 0 2 0 \n", "... ... ... ... ... ... ... .. ... \n", "5156 0 0 0 0 0 0 1 0 \n", "5159 0 0 0 0 0 0 1 0 \n", "5162 0 0 0 0 0 0 0 0 \n", "5170 0 0 0 0 0 0 1 0 \n", "5171 0 0 0 0 0 0 0 0 \n", "\n", " Prediction \n", "1 0 \n", "3 0 \n", "5 1 \n", "7 1 \n", "13 0 \n", "... ... \n", "5156 1 \n", "5159 1 \n", "5162 1 \n", "5170 1 \n", "5171 0 \n", "\n", "[2033 rows x 3002 columns]" ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Returns rows where there are more occurrences of \"to\" than \"the\"\n", "email_df.query(\"the < to\")" ] }, { "cell_type": "markdown", "id": "8cf8ab3b-7d67-4697-afd9-22fdeaa25c8f", "metadata": {}, "source": [ "## Exploring to identify inconsistencies\n", "\n", "All the topics in this lesson can help identify missing or inconsistent values, but Pandas provides functions to check for some of these. [`isna()` or `isnull()`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.isna.html) can check for missing values. One important piece of exploring for these values within your data is to explore why they ended up that way in the first place." 
] }, { "cell_type": "code", "execution_count": 6, "id": "0163b7bf-6fa0-4664-b887-c650130571a3", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " Email No. the to ect and for of a you \\\n", "0 False False False False False False False False False \n", "1 False False False False False False False False False \n", "2 False False False False False False False False False \n", "3 False False False False False False False False False \n", "4 False False False False False False False False False \n", "... ... ... ... ... ... ... ... ... ... \n", "5167 False False False False False False False False False \n", "5168 False False False False False False False False False \n", "5169 False False False False False False False False False \n", "5170 False False False False False False False False False \n", "5171 False False False False False False False False False \n", "\n", " hou ... connevey jay valued lay infrastructure military \\\n", "0 False ... False False False False False False \n", "1 False ... False False False False False False \n", "2 False ... False False False False False False \n", "3 False ... False False False False False False \n", "4 False ... False False False False False False \n", "... ... ... ... ... ... ... ... ... \n", "5167 False ... False False False False False False \n", "5168 False ... False False False False False False \n", "5169 False ... False False False False False False \n", "5170 False ... False False False False False False \n", "5171 False ... False False False False False False \n", "\n", " allowing ff dry Prediction \n", "0 False False False False \n", "1 False False False False \n", "2 False False False False \n", "3 False False False False \n", "4 False False False False \n", "... ... ... ... ... 
\n", "5167 False False False False \n", "5168 False False False False \n", "5169 False False False False \n", "5170 False False False False \n", "5171 False False False False \n", "\n", "[5172 rows x 3002 columns]" ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "email_df.isna()" ] }, { "cell_type": "code", "execution_count": 7, "id": "12ad6c70-f5ef-473e-aeb0-4fbedcc4fd83", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "Email No. 0\n", "the 0\n", "to 0\n", "ect 0\n", "and 0\n", " ..\n", "military 0\n", "allowing 0\n", "ff 0\n", "dry 0\n", "Prediction 0\n", "Length: 3002, dtype: int64" ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "email_df.isna().sum()" ] }, { "cell_type": "code", "execution_count": 8, "id": "6dcf5172-83d3-4cc2-ae58-7a9c0a8f0001", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " Email No. the to ect and for of a you \\\n", "0 False False False False False False False False False \n", "1 False False False False False False False False False \n", "2 False False False False False False False False False \n", "3 False False False False False False False False False \n", "4 False False False False False False False False False \n", "... ... ... ... ... ... ... ... ... ... \n", "5167 False False False False False False False False False \n", "5168 False False False False False False False False False \n", "5169 False False False False False False False False False \n", "5170 False False False False False False False False False \n", "5171 False False False False False False False False False \n", "\n", " hou ... connevey jay valued lay infrastructure military \\\n", "0 False ... False False False False False False \n", "1 False ... False False False False False False \n", "2 False ... False False False False False False \n", "3 False ... False False False False False False \n", "4 False ... False False False False False False \n", "... ... ... ... ... ... ... ... ... \n", "5167 False ... False False False False False False \n", "5168 False ... False False False False False False \n", "5169 False ... False False False False False False \n", "5170 False ... False False False False False False \n", "5171 False ... False False False False False False \n", "\n", " allowing ff dry Prediction \n", "0 False False False False \n", "1 False False False False \n", "2 False False False False \n", "3 False False False False \n", "4 False False False False \n", "... ... ... ... ... 
\n", "5167 False False False False \n", "5168 False False False False \n", "5169 False False False False \n", "5170 False False False False \n", "5171 False False False False \n", "\n", "[5172 rows x 3002 columns]" ] }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" } ], "source": [ "email_df.isnull()" ] }, { "cell_type": "code", "execution_count": 9, "id": "498c6a1c-cfa5-45de-95ee-82fbec91ae63", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "Email No. 0\n", "the 0\n", "to 0\n", "ect 0\n", "and 0\n", " ..\n", "military 0\n", "allowing 0\n", "ff 0\n", "dry 0\n", "Prediction 0\n", "Length: 3002, dtype: int64" ] }, "execution_count": 9, "metadata": {}, "output_type": "execute_result" } ], "source": [ "email_df.isnull().sum()" ] }, { "cell_type": "markdown", "id": "1bbfbcaa-5411-41f7-8918-444258c6c219", "metadata": {}, "source": [ "## More exploration and analysis" ] }, { "cell_type": "code", "execution_count": 10, "id": "9c173f90-5e9e-48b6-9476-b5c3c09ca86d", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Number of spam emails: 1500\n", "Number of non-spam emails: 3672\n", "Total number of emails: 5172\n" ] } ], "source": [ "# Count the number of spam and non-spam emails\n", "spam_count = email_df[email_df[\"Prediction\"] == 1].shape[0]\n", "non_spam_count = email_df[email_df[\"Prediction\"] == 0].shape[0]\n", "total_emails = email_df.shape[0]\n", "\n", "# Print the count of spam and non-spam emails\n", "print(\"Number of spam emails:\", spam_count)\n", "print(\"Number of non-spam emails:\", non_spam_count)\n", "print(\"Total number of emails:\", total_emails)" ] }, { "cell_type": "code", "execution_count": 11, "id": "966d0096-a395-4ff3-a633-9afb1ccb072b", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Percentage of spam emails: 29.00%\n", "Percentage of non-spam emails: 71.00%\n" ] } ], "source": [ "# Percentage of spam and non-spam emails\n", 
"spam_percentage = (spam_count / total_emails) * 100\n", "non_spam_percentage = (non_spam_count / total_emails) * 100\n", "\n", "# Print the percentage of spam and non-spam emails\n", "print(\"Percentage of spam emails: {:.2f}%\".format(spam_percentage))\n", "print(\"Percentage of non-spam emails: {:.2f}%\".format(non_spam_percentage))" ] }, { "cell_type": "code", "execution_count": 12, "id": "99602377-6e87-4d17-ba13-1e38f6db8bfb", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " total_words\n", "0 53\n", "1 2203\n", "2 113\n", "3 1019\n", "4 1075" ] }, "execution_count": 12, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Sum up all column values for each row\n", "numeric_columns = email_df.select_dtypes(include=np.number)\n", "\n", "email_df[\"total_words\"] = numeric_columns.sum(axis=1)\n", "email_df[[\"total_words\"]].head()" ] }, { "cell_type": "markdown", "id": "c2d7e66a-5b6c-47a4-b9a2-ff56525974e7", "metadata": {}, "source": [ "## Exploring with Visualizations\n", "\n", "You don’t have to wait until the data is thoroughly cleaned and analyzed to start creating visualizations. In fact, having a visual representation while exploring can help identify patterns, relationships, and problems in the data. Furthermore, visualizations provide a means of communication with those who are not involved with managing the data and can be an opportunity to share and clarify additional questions that were not addressed in the capture stage. Refer to the [section on Visualizations](../data_visualization/intro_to_matplotlib) to learn more about some popular ways to explore visually." ] }, { "cell_type": "code", "execution_count": 13, "id": "4058b61b-b087-448a-a2a9-77def9e88915", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "Prediction\n", "0 3672\n", "1 1500\n", "Name: count, dtype: int64" ] }, "execution_count": 13, "metadata": {}, "output_type": "execute_result" } ], "source": [ "predictions = email_df[\"Prediction\"].value_counts()\n", "predictions" ] }, { "cell_type": "code", "execution_count": 14, "id": "a7134aac-2ea8-494b-91d1-5ed4104e61ae", "metadata": {}, "outputs": [ { "data": { "image/png": "", "text/plain": [ "
" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "predictions.plot(kind=\"bar\")\n", "\n", "plt.xticks([0, 1], [\"Not Spam\", \"Spam\"])\n", "plt.ylabel(\"Quantities\")\n", "plt.title(\"Non-spam vs Spam Emails\")\n", "plt.show()" ] }, { "cell_type": "code", "execution_count": 15, "id": "e753dfa0-93ec-45df-88c1-cc2c801aadb9", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " 
\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
Email No.thetoectandforofayouhou...jayvaluedlayinfrastructuremilitaryallowingffdryPredictiontotal_words
5Email 64514234510...0000000011307
7Email 80223122160...000000101565
16Email 173122011700...000000101219
17Email 183621614717194255...0000003014323
25Email 261253214181428702...0000006014927
..................................................................
5162Email 51632312123200...000000001667
5163Email 5164001000100...00000000164
5166Email 5167101100400...000000001102
5169Email 51700011001100...000000001179
5170Email 51712710212820...000000101788
\n", "

1500 rows × 3003 columns

\n", "
" ], "text/plain": [ " Email No. the to ect and for of a you hou ... jay valued \\\n", "5 Email 6 4 5 1 4 2 3 45 1 0 ... 0 0 \n", "7 Email 8 0 2 2 3 1 2 21 6 0 ... 0 0 \n", "16 Email 17 3 1 2 2 0 1 17 0 0 ... 0 0 \n", "17 Email 18 36 21 6 14 7 17 194 25 5 ... 0 0 \n", "25 Email 26 12 53 2 14 18 14 287 0 2 ... 0 0 \n", "... ... ... .. ... ... ... .. ... ... ... ... ... ... \n", "5162 Email 5163 2 3 1 2 1 2 32 0 0 ... 0 0 \n", "5163 Email 5164 0 0 1 0 0 0 1 0 0 ... 0 0 \n", "5166 Email 5167 1 0 1 1 0 0 4 0 0 ... 0 0 \n", "5169 Email 5170 0 0 1 1 0 0 11 0 0 ... 0 0 \n", "5170 Email 5171 2 7 1 0 2 1 28 2 0 ... 0 0 \n", "\n", " lay infrastructure military allowing ff dry Prediction \\\n", "5 0 0 0 0 0 0 1 \n", "7 0 0 0 0 1 0 1 \n", "16 0 0 0 0 1 0 1 \n", "17 0 0 0 0 3 0 1 \n", "25 0 0 0 0 6 0 1 \n", "... ... ... ... ... .. ... ... \n", "5162 0 0 0 0 0 0 1 \n", "5163 0 0 0 0 0 0 1 \n", "5166 0 0 0 0 0 0 1 \n", "5169 0 0 0 0 0 0 1 \n", "5170 0 0 0 0 1 0 1 \n", "\n", " total_words \n", "5 1307 \n", "7 565 \n", "16 219 \n", "17 4323 \n", "25 4927 \n", "... ... \n", "5162 667 \n", "5163 64 \n", "5166 102 \n", "5169 179 \n", "5170 788 \n", "\n", "[1500 rows x 3003 columns]" ] }, "execution_count": 15, "metadata": {}, "output_type": "execute_result" } ], "source": [ "spam_emails = email_df.query(\"Prediction == 1\")\n", "spam_emails" ] }, { "cell_type": "code", "execution_count": 16, "id": "15d4ce82-f698-4101-a7d9-8c4e80b24887", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
Email No.thetoectandforofayouhou...jayvaluedlayinfrastructuremilitaryallowingffdryPredictiontotal_words
5Email 64514234510...0000000011307
7Email 80223122160...000000101565
16Email 173122011700...000000101219
17Email 183621614717194255...0000003014323
25Email 261253214181428702...0000006014927
\n", "

5 rows × 3003 columns

\n", "
" ], "text/plain": [ " Email No. the to ect and for of a you hou ... jay valued \\\n", "5 Email 6 4 5 1 4 2 3 45 1 0 ... 0 0 \n", "7 Email 8 0 2 2 3 1 2 21 6 0 ... 0 0 \n", "16 Email 17 3 1 2 2 0 1 17 0 0 ... 0 0 \n", "17 Email 18 36 21 6 14 7 17 194 25 5 ... 0 0 \n", "25 Email 26 12 53 2 14 18 14 287 0 2 ... 0 0 \n", "\n", " lay infrastructure military allowing ff dry Prediction total_words \n", "5 0 0 0 0 0 0 1 1307 \n", "7 0 0 0 0 1 0 1 565 \n", "16 0 0 0 0 1 0 1 219 \n", "17 0 0 0 0 3 0 1 4323 \n", "25 0 0 0 0 6 0 1 4927 \n", "\n", "[5 rows x 3003 columns]" ] }, "execution_count": 16, "metadata": {}, "output_type": "execute_result" } ], "source": [ "spam_emails.head()" ] }, { "cell_type": "code", "execution_count": 17, "id": "cf339d1f-e82a-49bb-85dc-fd142e2f1290", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "this 2303\n", "your 2096\n", "with 2006\n", "that 1361\n", "here 1295\n", " ... \n", "intrastate 0\n", "hakemack 0\n", "heather 0\n", "gomes 0\n", "payback 0\n", "Length: 2574, dtype: int64" ] }, "execution_count": 17, "metadata": {}, "output_type": "execute_result" } ], "source": [ "long_word_columns = [word for word in spam_emails.columns[1:-2] if len(word) > 3]\n", "word_totals = spam_emails[long_word_columns].sum().sort_values(ascending=False)\n", "word_totals" ] }, { "cell_type": "code", "execution_count": 18, "id": "fb04e4aa-79f0-431e-a483-faac1649343f", "metadata": {}, "outputs": [ { "data": { "image/png": "", "text/plain": [ "
" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "# Plot the top 20 most common words\n", "plt.figure(figsize=(10, 8))\n", "word_totals.head(20).plot(kind=\"bar\")\n", "plt.title(\"Top 20 Most Common Words in Emails\")\n", "plt.xlabel(\"Words\")\n", "plt.ylabel(\"Total Occurrences\")\n", "plt.xticks(rotation=45)\n", "plt.show()" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.12.4" } }, "nbformat": 4, "nbformat_minor": 5 }