W3schools - Python_Pandas
Pandas
Is a Python library used for working with data sets
It has functions for analyzing, cleaning, exploring, and manipulating data
Lets you analyze large data sets and draw conclusions based on statistical theories
Can clean messy data sets, and make them readable and relevant
- Relevant data is very important in data science
- Data science : a branch of computer science where we study how to store, use, and analyze data in order to derive information from it
Start
Install it using this command
pip install pandas
# ready to use
import pandas as pd
#check version
print(pd.__version__)
Series
Is like a column in a table
Is a one-dimensional array holding data of any type
import pandas as pd
a = [1,7,2]
var = pd.Series(a)
print(var)
output
0 1
1 7
2 2
dtype: int64
# If nothing else is specified, the values are labeled with their index number
# With the index arg, you can name your own labels
var2 = pd.Series(a, index=["x", "y", "z"])
print(var2)
output
x 1
y 7
z 2
dtype: int64
# You can access an item by its label (or by position with .iloc)
print(var2["x"]) # output 1
# Create a Series from a dictionary; the keys of the dict become the labels
b = {"day1": 400, "day2": 500, "day3": 300}
var3 = pd.Series(b)
print(var3)
output
day1 400
day2 500
day3 300
dtype: int64
# To select only some of the items in the dict, use the index arg and specify the keys you want
var4 = pd.Series(b, index=["day1", "day2"])
print(var4)
output
day1 400
day2 500
dtype: int64
DataFrames
Data sets in Pandas are usually two-dimensional tables, called DataFrames
A Series is like a column; a DataFrame is the whole table
import pandas as pd
data = {
    "a": [1, 2, 3],
    "b": [4, 5, 6]
}
var5 = pd.DataFrame(data)
print(var5)
output
a b
0 1 4
1 2 5
2 3 6
# Use the loc attribute to return one or more specified row(s)
print(var5.loc[0])
# This returns a Series
output
a 1
b 4
Name: 0, dtype: int64
print(var5.loc[[0,1]])
# This returns DataFrame
output
a b
0 1 4
1 2 5
# You can also pass the index arg when creating the DataFrame, then use those named labels with the loc attribute
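A minimal sketch of the named-index case mentioned above. The labels "day1" to "day3" and the variable names are made up for illustration; they are not part of the original notes.

```python
import pandas as pd

data = {"a": [1, 2, 3], "b": [4, 5, 6]}

# "day1".."day3" are illustrative row labels (an assumption, not from the notes)
df_named = pd.DataFrame(data, index=["day1", "day2", "day3"])

# loc now looks rows up by label instead of the default 0, 1, 2 numbering
row = df_named.loc["day2"]                # a Series holding a=2, b=5
subset = df_named.loc[["day1", "day3"]]   # a DataFrame with two rows

print(row)
print(subset)
```

Passing a single label gives back a Series, and passing a list of labels gives back a DataFrame, the same as with the numeric labels.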
Read file
If data sets are stored in a file, Pandas can load them into a DataFrame
import pandas as pd
# Read csv
df = pd.read_csv('data.csv')
# If the DataFrame has more rows than pd.options.display.max_rows (60 by default), print() shows only the first 5 rows and the last 5 rows
print(df)
# Use to_string() to print the entire DataFrame
print(df.to_string())
# You can change the maximum number of rows that are displayed
pd.options.display.max_rows = 999
print(df)
# Read json
df2 = pd.read_json('data.json')
print(df2.to_string())
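Since data.csv and data.json are not included with these notes, here is a sketch that runs without any files by reading from in-memory text via io.StringIO. The column names and values are invented for illustration.

```python
import io
import pandas as pd

# Build 100 rows of CSV text in memory; the column names are made up
csv_text = "Duration,Pulse\n" + "\n".join(f"{i},{100 + i}" for i in range(100))
df = pd.read_csv(io.StringIO(csv_text))

short_view = str(df)         # what print(df) shows: abbreviated for long tables
full_view = df.to_string()   # header line plus every row

print(len(short_view.splitlines()), "lines vs", len(full_view.splitlines()), "lines")

# read_json accepts a file-like object in the same way
json_text = '{"Duration": {"0": 60, "1": 45}, "Pulse": {"0": 110, "1": 117}}'
df2 = pd.read_json(io.StringIO(json_text))
print(df2)
```

The line counts make the difference between the abbreviated view and to_string() visible without scrolling through the whole table.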
Analyzing DataFrame
head() : Returns the headers and a specified number of rows, starting from the top
- If the number of rows is not specified, it returns the top 5 rows
tail() : Returns the headers and a specified number of rows, starting from the bottom (also 5 by default)
info() : Prints more information about the data set (RangeIndex, memory usage, data columns, etc.)
- Also tells us how many non-null values are present in each column
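The three methods above can be tried on a small table. This is only a sketch; the column names and values are placeholders, not data from the original tutorial.

```python
import pandas as pd

# Small made-up table; one Pulse value is missing on purpose
df = pd.DataFrame({
    "Duration": [60, 45, 30, 60, 45, 50, 40],
    "Pulse": [110, 117, None, 109, 117, 102, 98],
})

print(df.head())    # top 5 rows (the default)
print(df.head(3))   # top 3 rows
print(df.tail(2))   # bottom 2 rows
df.info()           # prints the index range, column dtypes, non-null counts, memory usage
```

In the info() output, the Pulse column reports one fewer non-null value than Duration, which is how an empty cell shows up.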
Cleaning Data
Fixing bad data (empty cells, wrong format, wrong data, duplicates, etc.) in a data set
Empty cells
dropna() : Returns a new DataFrame in which every row containing an empty cell has been removed
- By default, it does not change the original. If you want to change the original DataFrame, use the inplace=True arg
fillna(A) : Replaces empty cells with the value A; it also accepts the inplace arg
- dataframename["columnname"].fillna(A) : Replaces empty cells in the specified column only (assign the result back to the column to keep the change)
- A common way to replace empty cells is to use the mean, median, or mode value of the column
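A short sketch of both approaches to empty cells, using made-up columns and values. Assigning the filled column back is one way to do it, chosen here so the example does not depend on inplace behavior.

```python
import pandas as pd

df = pd.DataFrame({
    "Duration": [60, 45, None, 30],
    "Pulse": [110, None, 100, 120],
})

# dropna() returns a new DataFrame without the rows that contain empty cells
cleaned = df.dropna()

# Fill one column with its mean and assign the result back
filled = df.copy()
filled["Pulse"] = filled["Pulse"].fillna(filled["Pulse"].mean())

# median() works the same way; mode() returns a Series, so you would take its first entry
filled["Duration"] = filled["Duration"].fillna(filled["Duration"].median())

print(cleaned)
print(filled)
```

The original df is left untouched in both cases, matching the default behavior described above.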
Wrong Format
# Convert all cells in the column to dates
import pandas as pd
df = pd.read_csv('data.csv')
df['column_name'] = pd.to_datetime(df['column_name'])
print(df.to_string())
# Then remove the rows whose date is empty (NaT)
df.dropna(subset=['column_name'], inplace=True)
Fix Wrong Data
# Replace a value
dataframe_name.loc[row, column_name] = fix_value
# Or remove the rows whose value is out of bounds
for x in dataframe_name.index:
    if dataframe_name.loc[x, column_name] > boundaries:
        dataframe_name.drop(x, inplace=True)
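The same two fixes can also be written without a loop by using a boolean mask. Note that this is a different technique from the loop in the notes, shown only as an alternative sketch; the column name, the values, and the limit of 120 are assumptions for illustration.

```python
import pandas as pd

df = pd.DataFrame({"Duration": [60, 450, 45, 30, 300]})
limit = 120  # hypothetical upper bound for a plausible value

# Option 1: replace every out-of-range value with the limit
capped = df.copy()
capped.loc[capped["Duration"] > limit, "Duration"] = limit

# Option 2: keep only the rows that are within range
kept = df[df["Duration"] <= limit]

print(capped)
print(kept)
```

Both options leave df itself unchanged, which makes it easy to compare the results before deciding which fix to keep.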
Remove Duplicates
# Find duplicates
print(dataframe_name.duplicated()) # outputs True for every row that duplicates an earlier row, otherwise False
# Remove duplicate
dataframe_name.drop_duplicates(inplace=True)
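A small runnable sketch of both calls on made-up data, where the third row repeats the first one.

```python
import pandas as pd

df = pd.DataFrame({
    "Duration": [60, 45, 60, 30],
    "Pulse": [110, 117, 110, 120],
})

flags = df.duplicated()          # boolean Series: True where a row repeats an earlier row
deduped = df.drop_duplicates()   # new DataFrame; the first occurrence is kept

print(flags)
print(deduped)
```

Without inplace=True, drop_duplicates() returns a new DataFrame and keeps the original row labels, so the result here has the labels 0, 1, and 3.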
corr()
A great aspect of the Pandas module
It calculates the relationship (correlation) between each pair of columns in the data set
It ignores non-numeric columns (since pandas 2.0 you must pass numeric_only=True for this; otherwise text columns raise an error)
The result of corr() is a table of numbers that represent how strong the relationship is between two columns (each number ranges from -1 to 1)
- 1 means that there is a 1 to 1 relationship (perfect correlation); -1 is an equally perfect relationship in the opposite direction
- 0 means there is no linear relationship
As a rule of thumb, you need at least 0.6 (or at most -0.6) to call it a good correlation
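A sketch of corr() on invented data, built so the expected values are easy to check by eye. It selects the numeric columns explicitly with select_dtypes, an assumption made here so the example does not depend on the pandas version; on recent versions, df.corr(numeric_only=True) should give the same table.

```python
import pandas as pd

df = pd.DataFrame({
    "Duration": [30, 45, 60, 75, 90],
    "Calories": [150, 225, 300, 375, 450],   # exactly 5 x Duration
    "Rest": [90, 75, 60, 45, 30],            # falls as Duration rises
    "Label": ["a", "b", "c", "d", "e"],      # non-numeric column
})

# Keep only the numeric columns, then correlate every pair of them
corr = df.select_dtypes(include="number").corr()
print(corr)
```

Duration and Calories rise together in a straight line, so their correlation is 1, while Duration and Rest move in exactly opposite directions and give -1.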
plot()
To create diagrams
Can use Pyplot, a submodule of the Matplotlib library to visualize the diagram on the screen
import pandas as pd
import matplotlib.pyplot as plt
df = pd.read_csv('data.csv')
df.plot()
plt.show()
# Specify that you want a scatter plot with the kind arg
# A scatter plot needs an x- and y-axis
df.plot(kind='scatter', x='column_name1', y='column_name2')
plt.show()
# Histogram that shows us the frequency of each interval
df["Duration"].plot(kind='hist')
plt.show()