{
"cells": [
{
"cell_type": "markdown",
"id": "49019be7-2986-40c3-aeb5-56ff875cba32",
"metadata": {},
"source": [
"# Visual Introduction to Pandas\n",
"\n",
"```{contents}\n",
":local:\n",
"```"
]
},
{
"cell_type": "code",
"execution_count": 1,
"id": "d27c4667-0d89-4ebd-a820-329c8f2932d7",
"metadata": {},
"outputs": [],
"source": [
"import pandas as pd"
]
},
{
"cell_type": "markdown",
"id": "398c4c90-6dff-41ac-998d-cc1f7e6e7e99",
"metadata": {},
"source": [
"## Loading Data\n",
"\n",
"Dengan menggunakan Pandas, kamu bisa memuat data spreadsheet (tabel) lalu memanipulasinya menggunakan Python. Konsep utama dari Pandas adalah objek **DataFrame**, yaitu tipe data yang merepresentastikan tabel beserta isinya dan memiliki label pada setiap baris dan kolom.\n",
"\n",
""
]
},
{
"cell_type": "code",
"execution_count": 2,
"id": "947e4668-10a1-470b-a513-1ca486c07189",
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"
\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" Artist | \n",
" Genre | \n",
" Listeners | \n",
" Plays | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" Billie Holiday | \n",
" Jazz | \n",
" 1300000 | \n",
" 27000000 | \n",
"
\n",
" \n",
" 1 | \n",
" Jimi Hendrix | \n",
" Rock | \n",
" 2700000 | \n",
" 70000000 | \n",
"
\n",
" \n",
" 2 | \n",
" Miles Davis | \n",
" Jazz | \n",
" 1500000 | \n",
" 48000000 | \n",
"
\n",
" \n",
" 3 | \n",
" SIA | \n",
" Pop | \n",
" 2000000 | \n",
" 74000000 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" Artist Genre Listeners Plays\n",
"0 Billie Holiday Jazz 1300000 27000000\n",
"1 Jimi Hendrix Rock 2700000 70000000\n",
"2 Miles Davis Jazz 1500000 48000000\n",
"3 SIA Pop 2000000 74000000"
]
},
"execution_count": 2,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df = pd.read_csv(\"../../data/music.csv\")\n",
"df"
]
},
{
"cell_type": "markdown",
"id": "f51dada6-ad12-478a-84f2-1e2b1f71a4fc",
"metadata": {},
"source": [
"Variabel `df` adalah Pandas DataFrame:\n",
"\n",
""
]
},
{
"cell_type": "markdown",
"id": "d69f560d-bbed-49b3-9ae6-96cd414f2f89",
"metadata": {},
"source": [
"## Selection\n",
"\n",
"Kita bisa memilih data suatu kolom menggunakan labelnya:"
]
},
{
"cell_type": "code",
"execution_count": 3,
"id": "f612dbfa-d9ba-4f8c-97df-2e31c0ecbf41",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"0 Billie Holiday\n",
"1 Jimi Hendrix\n",
"2 Miles Davis\n",
"3 SIA\n",
"Name: Artist, dtype: object"
]
},
"execution_count": 3,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df[\"Artist\"]"
]
},
{
"cell_type": "markdown",
"id": "8c8a15ea-e409-4afe-82ad-17c17680504b",
"metadata": {},
"source": [
"\n",
"\n",
"Kita juga bisa memilih baris menggunakan nomor barisnya:"
]
},
{
"cell_type": "code",
"execution_count": 4,
"id": "0058c60c-ab1b-43ef-ae81-d7df55bf422d",
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" Artist | \n",
" Genre | \n",
" Listeners | \n",
" Plays | \n",
"
\n",
" \n",
" \n",
" \n",
" 1 | \n",
" Jimi Hendrix | \n",
" Rock | \n",
" 2700000 | \n",
" 70000000 | \n",
"
\n",
" \n",
" 2 | \n",
" Miles Davis | \n",
" Jazz | \n",
" 1500000 | \n",
" 48000000 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" Artist Genre Listeners Plays\n",
"1 Jimi Hendrix Rock 2700000 70000000\n",
"2 Miles Davis Jazz 1500000 48000000"
]
},
"execution_count": 4,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df[1:3]"
]
},
{
"cell_type": "markdown",
"id": "ef921b6e-c357-4414-9e25-03a48655f80e",
"metadata": {},
"source": [
"\n",
"\n",
"Kita juga bisa men-slice table menggunakan nomor baris dan kolom menggunakan method `loc()`"
]
},
{
"cell_type": "code",
"execution_count": 5,
"id": "12014542-583a-4d1d-bb06-e1a609a3602b",
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" Artist | \n",
"
\n",
" \n",
" \n",
" \n",
" 1 | \n",
" Jimi Hendrix | \n",
"
\n",
" \n",
" 2 | \n",
" Miles Davis | \n",
"
\n",
" \n",
" 3 | \n",
" SIA | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" Artist\n",
"1 Jimi Hendrix\n",
"2 Miles Davis\n",
"3 SIA"
]
},
"execution_count": 5,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df.loc[1:3, [\"Artist\"]]"
]
},
{
"cell_type": "markdown",
"id": "0274cc56-eedf-467a-9a9e-dc996fd20a75",
"metadata": {},
"source": [
""
]
},
{
"cell_type": "markdown",
"id": "dd3a9d5d-244c-4051-bf00-66a24c26edfa",
"metadata": {},
"source": [
"## Filtering\n",
"\n",
"> Ini semakin seru atau perasaan saya saja? - Desta\n",
"\n",
"Dengan Pandas, kita bisa dengan mudah mem-filter baris. Misalnya:"
]
},
{
"cell_type": "code",
"execution_count": 6,
"id": "d8700361-26cc-4729-9c45-fb88f159dddf",
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" Artist | \n",
" Genre | \n",
" Listeners | \n",
" Plays | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" Billie Holiday | \n",
" Jazz | \n",
" 1300000 | \n",
" 27000000 | \n",
"
\n",
" \n",
" 2 | \n",
" Miles Davis | \n",
" Jazz | \n",
" 1500000 | \n",
" 48000000 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" Artist Genre Listeners Plays\n",
"0 Billie Holiday Jazz 1300000 27000000\n",
"2 Miles Davis Jazz 1500000 48000000"
]
},
"execution_count": 6,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df[df[\"Genre\"] == \"Jazz\"]"
]
},
{
"cell_type": "markdown",
"id": "f7cbccf3-5004-46d8-a80e-dce7cdaafeb4",
"metadata": {},
"source": [
""
]
},
{
"cell_type": "code",
"execution_count": 7,
"id": "da17b4e2-7ad7-4087-a88f-cc893a516723",
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" Artist | \n",
" Genre | \n",
" Listeners | \n",
" Plays | \n",
"
\n",
" \n",
" \n",
" \n",
" 1 | \n",
" Jimi Hendrix | \n",
" Rock | \n",
" 2700000 | \n",
" 70000000 | \n",
"
\n",
" \n",
" 3 | \n",
" SIA | \n",
" Pop | \n",
" 2000000 | \n",
" 74000000 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" Artist Genre Listeners Plays\n",
"1 Jimi Hendrix Rock 2700000 70000000\n",
"3 SIA Pop 2000000 74000000"
]
},
"execution_count": 7,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df[df[\"Listeners\"] > 1800000]"
]
},
{
"cell_type": "markdown",
"id": "bf232eb8-43f1-4527-b6ac-984851e3feda",
"metadata": {},
"source": [
""
]
},
{
"cell_type": "markdown",
"id": "309e52df-044d-4b8d-8c50-2e79ffce0b1a",
"metadata": {},
"source": [
"## Dealing with Missing Values\n",
"\n",
"Sering kali dataset yang kamu proses pada proyek data science atau machine learning akan memiliki data yang tidak lengkap. Misalnya seperti ini:"
]
},
{
"cell_type": "code",
"execution_count": 8,
"id": "36bf574d-48c1-4f1b-b389-eaa31218c47b",
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" Artist | \n",
" Genre | \n",
" Listeners | \n",
" Plays | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" Billie Holiday | \n",
" Jazz | \n",
" 1300000 | \n",
" 27000000.0 | \n",
"
\n",
" \n",
" 1 | \n",
" Jimi Hendrix | \n",
" Rock | \n",
" 2700000 | \n",
" NaN | \n",
"
\n",
" \n",
" 2 | \n",
" Miles Davis | \n",
" Jazz | \n",
" 1500000 | \n",
" 48000000.0 | \n",
"
\n",
" \n",
" 3 | \n",
" SIA | \n",
" Pop | \n",
" 2000000 | \n",
" 74000000.0 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" Artist Genre Listeners Plays\n",
"0 Billie Holiday Jazz 1300000 27000000.0\n",
"1 Jimi Hendrix Rock 2700000 NaN\n",
"2 Miles Davis Jazz 1500000 48000000.0\n",
"3 SIA Pop 2000000 74000000.0"
]
},
"execution_count": 8,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df = pd.read_csv(\"../../data/music_with_na.csv\")\n",
"df"
]
},
{
"cell_type": "markdown",
"id": "10ff86a8-c0f5-463a-bba5-62b209a0d29a",
"metadata": {},
"source": [
"\n",
"\n",
"Pandas memiliki beberapa cara untuk mengatasi hal ini, salah satunya dengan menggunakan `dropna()` dimana kita akan menghapus baris yang datanya tidak lengkap:"
]
},
{
"cell_type": "code",
"execution_count": 9,
"id": "f21239d1-f17e-4533-a810-b807e311ffec",
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" Artist | \n",
" Genre | \n",
" Listeners | \n",
" Plays | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" Billie Holiday | \n",
" Jazz | \n",
" 1300000 | \n",
" 27000000.0 | \n",
"
\n",
" \n",
" 2 | \n",
" Miles Davis | \n",
" Jazz | \n",
" 1500000 | \n",
" 48000000.0 | \n",
"
\n",
" \n",
" 3 | \n",
" SIA | \n",
" Pop | \n",
" 2000000 | \n",
" 74000000.0 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" Artist Genre Listeners Plays\n",
"0 Billie Holiday Jazz 1300000 27000000.0\n",
"2 Miles Davis Jazz 1500000 48000000.0\n",
"3 SIA Pop 2000000 74000000.0"
]
},
"execution_count": 9,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df.dropna()"
]
},
{
"cell_type": "markdown",
"id": "e3fe3aaa-93a2-4498-a8a1-23d62756a627",
"metadata": {},
"source": [
"\n",
"\n",
"Atau, kita juga bisa menggunakan `fillna()` dimana nilai yang tidak ada akan digantikan dengan `0`."
]
},
{
"cell_type": "code",
"execution_count": 10,
"id": "94ce8f19-506c-4f2c-86b9-d12a3ee3fcb8",
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" Artist | \n",
" Genre | \n",
" Listeners | \n",
" Plays | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" Billie Holiday | \n",
" Jazz | \n",
" 1300000 | \n",
" 27000000.0 | \n",
"
\n",
" \n",
" 1 | \n",
" Jimi Hendrix | \n",
" Rock | \n",
" 2700000 | \n",
" 0.0 | \n",
"
\n",
" \n",
" 2 | \n",
" Miles Davis | \n",
" Jazz | \n",
" 1500000 | \n",
" 48000000.0 | \n",
"
\n",
" \n",
" 3 | \n",
" SIA | \n",
" Pop | \n",
" 2000000 | \n",
" 74000000.0 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" Artist Genre Listeners Plays\n",
"0 Billie Holiday Jazz 1300000 27000000.0\n",
"1 Jimi Hendrix Rock 2700000 0.0\n",
"2 Miles Davis Jazz 1500000 48000000.0\n",
"3 SIA Pop 2000000 74000000.0"
]
},
"execution_count": 10,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df.fillna(0)"
]
},
{
"cell_type": "markdown",
"id": "79c93669-71e0-4f5c-8406-a673a73d4f27",
"metadata": {},
"source": [
"## Grouping\n",
"\n",
"Kita juga bisa melakukan pengelompokan dan agregasi pada data kita menggunakan Pandas."
]
},
{
"cell_type": "code",
"execution_count": 11,
"id": "b535406e-c442-4eb0-80fd-42602d5c4833",
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" Artist | \n",
" Listeners | \n",
" Plays | \n",
"
\n",
" \n",
" Genre | \n",
" | \n",
" | \n",
" | \n",
"
\n",
" \n",
" \n",
" \n",
" Jazz | \n",
" Billie HolidayMiles Davis | \n",
" 2800000 | \n",
" 75000000 | \n",
"
\n",
" \n",
" Pop | \n",
" SIA | \n",
" 2000000 | \n",
" 74000000 | \n",
"
\n",
" \n",
" Rock | \n",
" Jimi Hendrix | \n",
" 2700000 | \n",
" 70000000 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" Artist Listeners Plays\n",
"Genre \n",
"Jazz Billie HolidayMiles Davis 2800000 75000000\n",
"Pop SIA 2000000 74000000\n",
"Rock Jimi Hendrix 2700000 70000000"
]
},
"execution_count": 11,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df = pd.read_csv(\"../../data/music.csv\")\n",
"df.groupby(\"Genre\").sum()"
]
},
{
"cell_type": "markdown",
"id": "ab0a7a57-605c-4f9e-af1d-142ef5e75e04",
"metadata": {},
"source": [
"\n",
"\n",
"Pandas mengelompokkan dua artis \"Jazz\" menjadi satu baris, dan karena kita menggunakan `sum()` untuk agregasi, Pandas menjumlahkan total listeneres dan plays untuk dua artis Jass tersebut.\n",
"\n",
"Fitur-fitur ini akan sangat berguna untuk data analisis. Grouping memungkinkan kita untuk mengelompokkan data dan menemukan insight, dan agregasi adalah pilar pertama dan fundamental untuk statistika.\n",
"\n",
"Selain `sum()`, Pandas juga memiliki method agregasi lain seperti `mean()`, `min()`, `max()`, dan lainnya.\n",
"\n",
"Untuk mempelejari lebih dalam tentang `groupby()`, kamu bisa membacanya di [Group By User Guide](https://pandas.pydata.org/pandas-docs/stable/user_guide/groupby.html)."
]
},
{
"cell_type": "markdown",
"id": "79bda923-ee7f-4e34-95bb-f4f2afbadcbf",
"metadata": {},
"source": [
"## Creating New Columns from Existing Columns\n",
"\n",
"Sering kali dalam proses analisa data, kita diharuskan untuk membuat kolum baru."
]
},
{
"cell_type": "code",
"execution_count": 12,
"id": "0279e403-33d3-46e0-b3f0-cebe86ec9dc1",
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" Artist | \n",
" Genre | \n",
" Listeners | \n",
" Plays | \n",
" Avg Plays | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" Billie Holiday | \n",
" Jazz | \n",
" 1300000 | \n",
" 27000000 | \n",
" 20.769231 | \n",
"
\n",
" \n",
" 1 | \n",
" Jimi Hendrix | \n",
" Rock | \n",
" 2700000 | \n",
" 70000000 | \n",
" 25.925926 | \n",
"
\n",
" \n",
" 2 | \n",
" Miles Davis | \n",
" Jazz | \n",
" 1500000 | \n",
" 48000000 | \n",
" 32.000000 | \n",
"
\n",
" \n",
" 3 | \n",
" SIA | \n",
" Pop | \n",
" 2000000 | \n",
" 74000000 | \n",
" 37.000000 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" Artist Genre Listeners Plays Avg Plays\n",
"0 Billie Holiday Jazz 1300000 27000000 20.769231\n",
"1 Jimi Hendrix Rock 2700000 70000000 25.925926\n",
"2 Miles Davis Jazz 1500000 48000000 32.000000\n",
"3 SIA Pop 2000000 74000000 37.000000"
]
},
"execution_count": 12,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df[\"Avg Plays\"] = df[\"Plays\"] / df[\"Listeners\"]\n",
"df"
]
},
{
"cell_type": "markdown",
"id": "3f5ca26f-2f91-4990-bc57-fc95f840e2df",
"metadata": {},
"source": [
""
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.12.4"
}
},
"nbformat": 4,
"nbformat_minor": 5
}