
W3schools - Python_Pandas

This page is a study record on the topic below
I wrote it while reviewing what I learned in the lecture


Pandas

Is a Python library used for working with data sets

It has functions for analyzing, cleaning, exploring, and manipulating data

Allows us to analyze big data and draw conclusions based on statistical theories

Can clean messy data sets, and make them readable and relevant

  • Relevant data is very important in data science

  • Data Science: a branch of computer science where we study how to store, use and analyze data to derive information from it

Start

Install it using this command

pip install pandas
# ready to use
import pandas as pd

#check version
print(pd.__version__)

Series

Is like a column in a table

Is a one-dimensional array holding data of any type

import pandas as pd
a = [1,7,2]
var = pd.Series(a)
print(var)
output
0	1
1	7
2	2
dtype: int64
# If nothing else is specified, the values are labeled with their index number
# With the index arg, you can name your own labels
var2 = pd.Series(a, index = ["x", "y", "z"])
print(var2)
output
x	1
y	7
z	2
dtype: int64
# Can access an item by referring to its label (use .iloc for a position)
print(var2["x"])		# output 1

# When a dictionary is used, the keys of the dict become the labels
b = {"day1": 400, "day2": 500, "day3": 300}
var3 = pd.Series(b)
print(var3)
output
day1	400
day2	500
day3	300
dtype: int64
# To select only some of the items in the dict, use the index arg and specify the items
var4 = pd.Series(b, index=["day1", "day2"])
print(var4)
output
day1	400
day2	500
dtype: int64

DataFrames

Data sets in Pandas are usually multi-dimensional tables, called DataFrames

Series is like a column, DataFrame is the whole table

import pandas as pd
data = {
    "a": [1, 2, 3],
    "b": [4, 5, 6]
}

var5 = pd.DataFrame(data)
print(var5)
output
a	b
0	1	4
1	2	5
2	3	6
# Use the loc attribute to return one or more specified row(s)
print(var5.loc[0])
# This returns a Series
output
a	1
b	4
Name: 0, dtype: int64
print(var5.loc[[0,1]])
# This returns a DataFrame
output
a	b
0	1	4
1	2	5
# Can also pass the index arg when creating the DataFrame, then use the named labels with the loc attribute
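
The named-index case could look like the following minimal sketch. The labels day1 to day3 and the column names are made up for illustration and are not from the lecture.

```python
import pandas as pd

# Hypothetical data; the labels and column names are illustrative only
data = {"calories": [420, 380, 390], "duration": [50, 40, 45]}

# The index argument assigns a label to each row
df = pd.DataFrame(data, index=["day1", "day2", "day3"])

# A single label returns that row as a Series
row = df.loc["day2"]
print(row)

# A list of labels returns a DataFrame
subset = df.loc[["day1", "day3"]]
print(subset)
```

As with the numeric index, a single label gives a Series and a list of labels gives a DataFrame.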

Read file

If data sets are stored in a file, Pandas can load them into a DataFrame

import pandas as pd

# Read csv
df = pd.read_csv('data.csv')

# If the DataFrame has more rows than pd.options.display.max_rows (60 by default), printing it shows only the first 5 rows and the last 5 rows
print(df)
# Use to_string() to print the entire DataFrame
print(df.to_string())
# Can change the maximum number of rows that are displayed
pd.options.display.max_rows = 999
print(df)

# Read json
df2 = pd.read_json('data.json')
print(df2.to_string())
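
To try the reading step without a real file, a small sketch like the one below may help. It assumes an in-memory string standing in for data.csv; the column names Duration and Pulse are invented for the example.

```python
import io
import pandas as pd

# In-memory text standing in for a hypothetical data.csv file
csv_text = "Duration,Pulse\n60,110\n45,117\n30,103\n"

# read_csv accepts a file path or a file-like object
df = pd.read_csv(io.StringIO(csv_text))
print(df.shape)

# option_context changes display options only inside the with block
with pd.option_context("display.max_rows", 999):
    print(df)
```

Using option_context instead of assigning to pd.options keeps the display setting from leaking into the rest of the script, which is one possible design choice rather than a requirement.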

Analyzing DataFrame

head() : Returns the headers and a specified number of rows, starting from the top

  • If the number of rows is not specified, will return the top 5 rows

tail() : Returns the headers and a specified number of rows, starting from the bottom

info() : Prints more information about the data set (RangeIndex, memory usage, data columns, etc.)

  • Also tells us how many Non-Null values are present in each column
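
The three methods above can be tried on a small table. This is only a sketch with invented values; the columns Duration and Pulse are assumptions, not data from the lecture.

```python
import pandas as pd

# Hypothetical data set with 10 rows; one Pulse value is missing
df = pd.DataFrame({
    "Duration": [60, 60, 45, 45, 30, 60, 45, 30, 60, 45],
    "Pulse": [110, 117, 103, 109, None, 102, 110, 104, 98, 100],
})

first_rows = df.head()    # no argument: the first 5 rows
last_rows = df.tail(3)    # the last 3 rows
print(first_rows)
print(last_rows)

# info() prints a summary: index range, dtypes, non-null counts, memory usage
df.info()

# count() returns the non-null count per column as a Series
non_null = df.count()
print(non_null)
```

count() is shown here because info() only prints its summary, so the non-null counts are easier to reuse from count().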

Cleaning Data

Cleaning data means fixing bad data (empty cells, wrong format, wrong data, duplicates, etc.) in a data set

Empty cells

dropna() : Returns a new DataFrame without the rows that contain empty cells

  • By default, it does not change the original. To change the original DataFrame, use the inplace=True arg

fillna(A) : Replaces empty cells with a value A; it also accepts the inplace arg

  • dataframename["columnname"].fillna(A) : Fills only the specified column

  • A common way to replace empty cells is to use the mean, median or mode value of the column
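
A minimal sketch of both approaches, assuming a made-up table with one empty cell in a Calories column:

```python
import pandas as pd

# Hypothetical data with one empty cell in the "Calories" column
df = pd.DataFrame({
    "Duration": [60, 45, 30, 60],
    "Calories": [400.0, None, 200.0, 300.0],
})

# dropna() returns a new DataFrame without the rows that hold empty cells
dropped = df.dropna()
print(len(dropped))

# Fill the empty cell with the column mean; assigning the result back
# avoids relying on inplace=True for a single column
mean_value = df["Calories"].mean()
df["Calories"] = df["Calories"].fillna(mean_value)
print(df)
```

median() or mode()[0] could be used in place of mean() in the same way.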

Wrong Format

# Convert all cells in the column to dates
import pandas as pd
df = pd.read_csv('data.csv')
df['column_name'] = pd.to_datetime(df['column_name'])
print(df.to_string())
# And then, remove the rows where the date is empty (NaT)
df.dropna(subset=['column_name'], inplace=True)

Fix Wrong Data

# Replace value
dataframe_name.loc[row, column_name] = fix_value
# Or remove value
for x in dataframe_name.index:
    if dataframe_name.loc[x, column_name] > boundaries:
        dataframe_name.drop(x, inplace = True)
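
The loop above could also be written with a boolean mask, which is a swapped-in alternative and not the form used in the lecture. In this sketch the Duration column and the boundary of 120 are assumptions chosen only for illustration.

```python
import pandas as pd

# Hypothetical workout data; 450 is treated as a wrong value
df = pd.DataFrame({"Duration": [60, 450, 45, 30]})
upper_bound = 120  # assumed boundary, chosen only for this example

# Option 1: replace every value above the boundary with the boundary
capped = df.copy()
capped.loc[capped["Duration"] > upper_bound, "Duration"] = upper_bound

# Option 2: keep only the rows that are within the boundary
filtered = df[df["Duration"] <= upper_bound]

print(capped)
print(filtered)
```

The mask handles all rows in one step, so no explicit loop over the index is needed.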

Remove Duplicates

# Find duplicate
print(dataframe_name.duplicated())		# output: True for every row that is a duplicate, otherwise False
# Remove duplicate
dataframe_name.drop_duplicates(inplace=True)
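
A small self-contained sketch of the two calls, using invented values in which one row repeats another:

```python
import pandas as pd

# Hypothetical data; the row at index 2 repeats the row at index 1
df = pd.DataFrame({
    "Duration": [60, 45, 45, 30],
    "Pulse": [110, 117, 117, 103],
})

# duplicated() marks every repeat of an earlier row as True
flags = df.duplicated()
print(flags)

# drop_duplicates() returns a new DataFrame unless inplace=True is used
unique_rows = df.drop_duplicates()
print(unique_rows)
```

The first occurrence of a row is kept and only its later repeats are flagged and removed.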

corr()

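A useful feature of the Pandas module

It calculates the relationship (correlation) between each pair of columns in the data set

It only works on numeric columns: older Pandas versions ignore non-numeric columns, while Pandas 2.0 and later need corr(numeric_only=True) to skip them

The result of corr() is a table of numbers that represent how strong the relationship between two columns is (each number ranges from -1 to 1)

  • 1 means that there is a 1 to 1 relationship (perfect correlation); -1 is equally strong, but one value goes down when the other goes up

  • 0 means that there is no linear relationship

As a rule of thumb, a value of at least 0.6 (or at most -0.6) is needed to call it a good correlation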
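
A minimal sketch of a correlation table, assuming invented data in which Calories rises exactly in step with Duration. The non-numeric column is dropped with select_dtypes so that the sketch does not depend on the Pandas version.

```python
import pandas as pd

# Hypothetical data: Calories is an exact multiple of Duration
df = pd.DataFrame({
    "Duration": [30, 45, 60, 90],
    "Calories": [200, 300, 400, 600],
    "Pulse": [110, 100, 105, 98],
    "Note": ["a", "b", "c", "d"],  # non-numeric column
})

# Keep only the numeric columns, then compute the pairwise correlations
corr_table = df.select_dtypes(include="number").corr()
print(corr_table)

# Look up the value for one pair of columns
duration_calories = corr_table.loc["Duration", "Calories"]
print(duration_calories)
```

Each column correlates perfectly with itself, so the diagonal of the table is always 1.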

plot()

Used to create diagrams

Can use Pyplot, a submodule of the Matplotlib library, to show the diagram on the screen

import pandas as pd
import matplotlib.pyplot as plt
df = pd.read_csv('data.csv')
df.plot()
plt.show()

# Specify that you want a scatter plot with the kind arg
# A scatter plot needs an x- and y- axis
df.plot(kind='scatter', x='column_name1', y='column_name2')
plt.show()

# A histogram shows the frequency of each interval
df['Duration'].plot(kind='hist')
plt.show()