{ "cells": [ { "cell_type": "markdown", "id": "49019be7-2986-40c3-aeb5-56ff875cba32", "metadata": {}, "source": [ "# Visual Introduction to Pandas\n", "\n", "```{contents}\n", ":local:\n", "```" ] }, { "cell_type": "code", "execution_count": 1, "id": "d27c4667-0d89-4ebd-a820-329c8f2932d7", "metadata": {}, "outputs": [], "source": [ "import pandas as pd" ] }, { "cell_type": "markdown", "id": "398c4c90-6dff-41ac-998d-cc1f7e6e7e99", "metadata": {}, "source": [ "## Loading Data\n", "\n", "Dengan menggunakan Pandas, kamu bisa memuat data spreadsheet (tabel) lalu memanipulasinya menggunakan Python. Konsep utama dari Pandas adalah objek **DataFrame**, yaitu tipe data yang merepresentastikan tabel beserta isinya dan memiliki label pada setiap baris dan kolom.\n", "\n", "![excel_to_pandas](../images/pandas/excel_to_pandas.png)" ] }, { "cell_type": "code", "execution_count": 2, "id": "947e4668-10a1-470b-a513-1ca486c07189", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
ArtistGenreListenersPlays
0Billie HolidayJazz130000027000000
1Jimi HendrixRock270000070000000
2Miles DavisJazz150000048000000
3SIAPop200000074000000
\n", "
" ], "text/plain": [ " Artist Genre Listeners Plays\n", "0 Billie Holiday Jazz 1300000 27000000\n", "1 Jimi Hendrix Rock 2700000 70000000\n", "2 Miles Davis Jazz 1500000 48000000\n", "3 SIA Pop 2000000 74000000" ] }, "execution_count": 2, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df = pd.read_csv(\"../../data/music.csv\")\n", "df" ] }, { "cell_type": "markdown", "id": "f51dada6-ad12-478a-84f2-1e2b1f71a4fc", "metadata": {}, "source": [ "Variabel `df` adalah Pandas DataFrame:\n", "\n", "![view_pandas_dataframe](../images/pandas/view_pandas_dataframe.png)" ] }, { "cell_type": "markdown", "id": "d69f560d-bbed-49b3-9ae6-96cd414f2f89", "metadata": {}, "source": [ "## Selection\n", "\n", "Kita bisa memilih data suatu kolom menggunakan labelnya:" ] }, { "cell_type": "code", "execution_count": 3, "id": "f612dbfa-d9ba-4f8c-97df-2e31c0ecbf41", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0 Billie Holiday\n", "1 Jimi Hendrix\n", "2 Miles Davis\n", "3 SIA\n", "Name: Artist, dtype: object" ] }, "execution_count": 3, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df[\"Artist\"]" ] }, { "cell_type": "markdown", "id": "8c8a15ea-e409-4afe-82ad-17c17680504b", "metadata": {}, "source": [ "![select_column](../images/pandas/select_column.png)\n", "\n", "Kita juga bisa memilih baris menggunakan nomor barisnya:" ] }, { "cell_type": "code", "execution_count": 4, "id": "0058c60c-ab1b-43ef-ae81-d7df55bf422d", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
ArtistGenreListenersPlays
1Jimi HendrixRock270000070000000
2Miles DavisJazz150000048000000
\n", "
" ], "text/plain": [ " Artist Genre Listeners Plays\n", "1 Jimi Hendrix Rock 2700000 70000000\n", "2 Miles Davis Jazz 1500000 48000000" ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df[1:3]" ] }, { "cell_type": "markdown", "id": "ef921b6e-c357-4414-9e25-03a48655f80e", "metadata": {}, "source": [ "![select_rows](../images/pandas/select_rows.png)\n", "\n", "Kita juga bisa men-slice table menggunakan nomor baris dan kolom menggunakan method `loc()`" ] }, { "cell_type": "code", "execution_count": 5, "id": "12014542-583a-4d1d-bb06-e1a609a3602b", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
Artist
1Jimi Hendrix
2Miles Davis
3SIA
\n", "
" ], "text/plain": [ " Artist\n", "1 Jimi Hendrix\n", "2 Miles Davis\n", "3 SIA" ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.loc[1:3, [\"Artist\"]]" ] }, { "cell_type": "markdown", "id": "0274cc56-eedf-467a-9a9e-dc996fd20a75", "metadata": {}, "source": [ "![select_column_and_rows](../images/pandas/select_column_and_rows.png)" ] }, { "cell_type": "markdown", "id": "dd3a9d5d-244c-4051-bf00-66a24c26edfa", "metadata": {}, "source": [ "## Filtering\n", "\n", "> Ini semakin seru atau perasaan saya saja? - Desta\n", "\n", "Dengan Pandas, kita bisa dengan mudah mem-filter baris. Misalnya:" ] }, { "cell_type": "code", "execution_count": 6, "id": "d8700361-26cc-4729-9c45-fb88f159dddf", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
ArtistGenreListenersPlays
0Billie HolidayJazz130000027000000
2Miles DavisJazz150000048000000
\n", "
" ], "text/plain": [ " Artist Genre Listeners Plays\n", "0 Billie Holiday Jazz 1300000 27000000\n", "2 Miles Davis Jazz 1500000 48000000" ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df[df[\"Genre\"] == \"Jazz\"]" ] }, { "cell_type": "markdown", "id": "f7cbccf3-5004-46d8-a80e-dce7cdaafeb4", "metadata": {}, "source": [ "![filter_1](../images/pandas/filter_1.png)" ] }, { "cell_type": "code", "execution_count": 7, "id": "da17b4e2-7ad7-4087-a88f-cc893a516723", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
ArtistGenreListenersPlays
1Jimi HendrixRock270000070000000
3SIAPop200000074000000
\n", "
" ], "text/plain": [ " Artist Genre Listeners Plays\n", "1 Jimi Hendrix Rock 2700000 70000000\n", "3 SIA Pop 2000000 74000000" ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df[df[\"Listeners\"] > 1800000]" ] }, { "cell_type": "markdown", "id": "bf232eb8-43f1-4527-b6ac-984851e3feda", "metadata": {}, "source": [ "![filter_2](../images/pandas/filter_2.png)" ] }, { "cell_type": "markdown", "id": "309e52df-044d-4b8d-8c50-2e79ffce0b1a", "metadata": {}, "source": [ "## Dealing with Missing Values\n", "\n", "Sering kali dataset yang kamu proses pada proyek data science atau machine learning akan memiliki data yang tidak lengkap. Misalnya seperti ini:" ] }, { "cell_type": "code", "execution_count": 8, "id": "36bf574d-48c1-4f1b-b389-eaa31218c47b", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
ArtistGenreListenersPlays
0Billie HolidayJazz130000027000000.0
1Jimi HendrixRock2700000NaN
2Miles DavisJazz150000048000000.0
3SIAPop200000074000000.0
\n", "
" ], "text/plain": [ " Artist Genre Listeners Plays\n", "0 Billie Holiday Jazz 1300000 27000000.0\n", "1 Jimi Hendrix Rock 2700000 NaN\n", "2 Miles Davis Jazz 1500000 48000000.0\n", "3 SIA Pop 2000000 74000000.0" ] }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df = pd.read_csv(\"../../data/music_with_na.csv\")\n", "df" ] }, { "cell_type": "markdown", "id": "10ff86a8-c0f5-463a-bba5-62b209a0d29a", "metadata": {}, "source": [ "![set_missing_value](../images/pandas/set_missing_value.png)\n", "\n", "Pandas memiliki beberapa cara untuk mengatasi hal ini, salah satunya dengan menggunakan `dropna()` dimana kita akan menghapus baris yang datanya tidak lengkap:" ] }, { "cell_type": "code", "execution_count": 9, "id": "f21239d1-f17e-4533-a810-b807e311ffec", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
ArtistGenreListenersPlays
0Billie HolidayJazz130000027000000.0
2Miles DavisJazz150000048000000.0
3SIAPop200000074000000.0
\n", "
" ], "text/plain": [ " Artist Genre Listeners Plays\n", "0 Billie Holiday Jazz 1300000 27000000.0\n", "2 Miles Davis Jazz 1500000 48000000.0\n", "3 SIA Pop 2000000 74000000.0" ] }, "execution_count": 9, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.dropna()" ] }, { "cell_type": "markdown", "id": "e3fe3aaa-93a2-4498-a8a1-23d62756a627", "metadata": {}, "source": [ "![filter_missing_values](../images/pandas/filter_missing_values.png)\n", "\n", "Atau, kita juga bisa menggunakan `fillna()` dimana nilai yang tidak ada akan digantikan dengan `0`." ] }, { "cell_type": "code", "execution_count": 10, "id": "94ce8f19-506c-4f2c-86b9-d12a3ee3fcb8", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
ArtistGenreListenersPlays
0Billie HolidayJazz130000027000000.0
1Jimi HendrixRock27000000.0
2Miles DavisJazz150000048000000.0
3SIAPop200000074000000.0
\n", "
" ], "text/plain": [ " Artist Genre Listeners Plays\n", "0 Billie Holiday Jazz 1300000 27000000.0\n", "1 Jimi Hendrix Rock 2700000 0.0\n", "2 Miles Davis Jazz 1500000 48000000.0\n", "3 SIA Pop 2000000 74000000.0" ] }, "execution_count": 10, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.fillna(0)" ] }, { "cell_type": "markdown", "id": "79c93669-71e0-4f5c-8406-a673a73d4f27", "metadata": {}, "source": [ "## Grouping\n", "\n", "Kita juga bisa melakukan pengelompokan dan agregasi pada data kita menggunakan Pandas." ] }, { "cell_type": "code", "execution_count": 11, "id": "b535406e-c442-4eb0-80fd-42602d5c4833", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
ArtistListenersPlays
Genre
JazzBillie HolidayMiles Davis280000075000000
PopSIA200000074000000
RockJimi Hendrix270000070000000
\n", "
" ], "text/plain": [ " Artist Listeners Plays\n", "Genre \n", "Jazz Billie HolidayMiles Davis 2800000 75000000\n", "Pop SIA 2000000 74000000\n", "Rock Jimi Hendrix 2700000 70000000" ] }, "execution_count": 11, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df = pd.read_csv(\"../../data/music.csv\")\n", "df.groupby(\"Genre\").sum()" ] }, { "cell_type": "markdown", "id": "ab0a7a57-605c-4f9e-af1d-142ef5e75e04", "metadata": {}, "source": [ "![group_by](../images/pandas/group_by.png)\n", "\n", "Pandas mengelompokkan dua artis \"Jazz\" menjadi satu baris, dan karena kita menggunakan `sum()` untuk agregasi, Pandas menjumlahkan total listeneres dan plays untuk dua artis Jass tersebut.\n", "\n", "Fitur-fitur ini akan sangat berguna untuk data analisis. Grouping memungkinkan kita untuk mengelompokkan data dan menemukan insight, dan agregasi adalah pilar pertama dan fundamental untuk statistika.\n", "\n", "Selain `sum()`, Pandas juga memiliki method agregasi lain seperti `mean()`, `min()`, `max()`, dan lainnya.\n", "\n", "Untuk mempelejari lebih dalam tentang `groupby()`, kamu bisa membacanya di [Group By User Guide](https://pandas.pydata.org/pandas-docs/stable/user_guide/groupby.html)." ] }, { "cell_type": "markdown", "id": "79bda923-ee7f-4e34-95bb-f4f2afbadcbf", "metadata": {}, "source": [ "## Creating New Columns from Existing Columns\n", "\n", "Sering kali dalam proses analisa data, kita diharuskan untuk membuat kolum baru." ] }, { "cell_type": "code", "execution_count": 12, "id": "0279e403-33d3-46e0-b3f0-cebe86ec9dc1", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
ArtistGenreListenersPlaysAvg Plays
0Billie HolidayJazz13000002700000020.769231
1Jimi HendrixRock27000007000000025.925926
2Miles DavisJazz15000004800000032.000000
3SIAPop20000007400000037.000000
\n", "
" ], "text/plain": [ " Artist Genre Listeners Plays Avg Plays\n", "0 Billie Holiday Jazz 1300000 27000000 20.769231\n", "1 Jimi Hendrix Rock 2700000 70000000 25.925926\n", "2 Miles Davis Jazz 1500000 48000000 32.000000\n", "3 SIA Pop 2000000 74000000 37.000000" ] }, "execution_count": 12, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df[\"Avg Plays\"] = df[\"Plays\"] / df[\"Listeners\"]\n", "df" ] }, { "cell_type": "markdown", "id": "3f5ca26f-2f91-4990-bc57-fc95f840e2df", "metadata": {}, "source": [ "![create_new_column](../images/pandas/create_new_column.png)" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.12.4" } }, "nbformat": 4, "nbformat_minor": 5 }