61. Pandas - Panel Data and Python Data Analysis
Pandas is a Python library used for working with data sets. It has functions for analyzing, cleaning, exploring, and manipulating data. The name "Pandas" refers to both "Panel Data" and "Python Data Analysis".
Pandas is an open-source library built on top of the NumPy library. It is a Python package that offers various data structures and operations for manipulating numerical data and time series. It is popular mainly because it makes importing and analyzing data much easier, and it offers high performance and productivity for users.
Pandas was initially developed by Wes McKinney in 2008 while he was working at AQR Capital Management. He convinced AQR to allow him to open-source the library. Another AQR employee, Chang She, joined as the second major contributor in 2012.
The source code for Pandas is located at this GitHub repository: https://github.com/pandas-dev/pandas
If you have Python and pip already installed on your system, installing Pandas is very easy.
Install it using this command:
pip install pandas
If this command fails, use a Python distribution that already has Pandas installed, such as Anaconda or Spyder.
Once Pandas is installed, import it in your applications with the import keyword. Pandas is usually imported under the pd alias:
import pandas as pd
Now Pandas is imported and ready to use. You can check the installed version with:
print(pd.__version__)
Pandas provides two primary data structures for manipulating data:
Series
DataFrame
Advantages
- Fast and efficient for manipulating and analyzing data.
- Data from different file objects can be loaded.
- Easy handling of missing data (represented as NaN) in both floating point and non-floating point data.
- Size mutability: columns can be inserted into and deleted from DataFrames and higher-dimensional objects.
- Data set merging and joining.
- Flexible reshaping and pivoting of data sets
- Provides time-series functionality.
- Powerful group by functionality for performing split-apply-combine operations on data sets.
Note: Pandas is widely used in data analysis, data cleaning, and data preprocessing tasks because of its powerful features and easy-to-use API. It is an essential library for anyone working with data in Python, particularly when dealing with tabular or structured data.
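Some of the advantages listed above, such as flexible reshaping and pivoting, can be illustrated with a short sketch. The sales data below is hypothetical and chosen only to show the idea:

```python
import pandas as pd

# Hypothetical sales data: one row per (city, year) pair
df = pd.DataFrame({
    'city':  ['Kochi', 'Kochi', 'Delhi', 'Delhi'],
    'year':  [2022, 2023, 2022, 2023],
    'sales': [100, 120, 90, 110],
})

# Pivot: cities become rows, years become columns
table = df.pivot_table(values='sales', index='city', columns='year')
print(table)
```

Each cell of the pivoted table holds the sales figure for that city and year, which is much easier to scan than the original long format.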
Pandas Series:
A Series is a one-dimensional labeled array that can hold data of any type (integers, strings, floats, etc.).
It is similar to a column in a spreadsheet or a single column in a database table.
A Series consists of two parts: the data and the index. The data contains the actual values, and the index provides labels for the data points.
You can create a Series using the pandas.Series() constructor, passing in a list, array, or dictionary as the data.
A Series is a one-dimensional array-like object containing an array of data (of any NumPy data type) and an associated array of data labels, called its index. The simplest Series is formed from only an array of data:
import pandas as pd
obj=pd.Series([3,5,-8,7,9])
print(obj)
0    3
1    5
2   -8
3    7
4    9
dtype: int64
print(obj.index)
RangeIndex(start=0, stop=5, step=1)
print(obj.values)
[ 3  5 -8  7  9]
Often it will be desirable to create a Series with an index identifying each data point:
obj2 = pd.Series([4, 7, -5, 3], index=['d', 'b', 'a', 'c'])
print(obj2)
d    4
b    7
a   -5
c    3
dtype: int64
NumPy array operations, such as filtering with a boolean array, scalar multiplication, or applying math functions, will preserve the index-value link:
print(obj2[obj2>0])
d    4
b    7
c    3
dtype: int64
print(obj2*2)
d     8
b    14
a   -10
c     6
dtype: int64
Another way to think about a Series is as a fixed-length, ordered dict, as it is a mapping of index values to data values. It can be substituted into many functions that expect a dict:
'b' in obj2
True
'e' in obj2
False
If you have data contained in a Python dict, you can create a Series from it by passing the dict:
sdata = {'Ohio': 35000, 'Texas': 71000, 'Oregon': 16000, 'Utah': 5000}
obj3=pd.Series(sdata)
print(obj3)
Ohio      35000
Texas     71000
Oregon    16000
Utah       5000
dtype: int64
states = ['California', 'Ohio', 'Oregon', 'Texas']
obj4=pd.Series(sdata,index=states)
print(obj4)
California        NaN
Ohio          35000.0
Oregon        16000.0
Texas         71000.0
dtype: float64
Three values found in sdata were placed in the appropriate locations, but since no value for 'California' was found, it appears as NaN (not a number), which pandas uses to mark missing or NA values. The isnull and notnull functions in pandas can be used to detect missing data:
obj4.isnull()
California     True
Ohio          False
Oregon        False
Texas         False
dtype: bool
obj4.notnull()
California    False
Ohio           True
Oregon         True
Texas          True
dtype: bool
A critical Series feature for many applications is that it automatically aligns differently
indexed data in arithmetic operations:
print(obj3)
Ohio      35000
Texas     71000
Oregon    16000
Utah       5000
dtype: int64
print(obj4)
California        NaN
Ohio          35000.0
Oregon        16000.0
Texas         71000.0
dtype: float64
print(obj3+obj4)
California         NaN
Ohio           70000.0
Oregon         32000.0
Texas         142000.0
Utah               NaN
dtype: float64
obj4.name='Population'
obj4.index.name='states'
print(obj4)
states
California        NaN
Ohio          35000.0
Oregon        16000.0
Texas         71000.0
Name: Population, dtype: float64
Pandas DataFrame:
A Pandas DataFrame is a two-dimensional, size-mutable, potentially heterogeneous tabular data structure with labeled axes (rows and columns). That is, data is aligned in a tabular fashion in rows and columns. A Pandas DataFrame consists of three principal components: the data, the rows, and the columns.
Features:
- A DataFrame is a two-dimensional, tabular data structure that is similar to a spreadsheet or a SQL table.
- It consists of rows and columns, where each column can be of a different data type.
- You can think of a DataFrame as a collection of Series, where each Series represents a column, and the columns share the same index.
- DataFrames are commonly used for data analysis and are very versatile for handling and analyzing structured data.
- You can create a DataFrame using the pandas.DataFrame() constructor, passing in data (usually a dictionary, a list of dictionaries, or a NumPy array) with column labels.
We will get a brief insight into all the basic operations that can be performed on a Pandas DataFrame:
Creating a DataFrame
Dealing with Rows and Columns
Indexing and Selecting Data
Working with Missing Data
Manipulating Data
Iterating over rows and columns
Exporting and Saving data
Creating a DataFrame
In the real world, a Pandas DataFrame will be created by loading datasets from existing storage such as a SQL database, a CSV file, or an Excel file. A Pandas DataFrame can also be created from lists, a dictionary, a list of dictionaries, etc. Here are some ways to create a DataFrame:
import pandas as pd
# list of strings
lst = ['mec', 'minor', 'stud', 'eee', 'bio']
# Calling DataFrame constructor on list
df = pd.DataFrame(lst)
print(df)
0
0 mec
1 minor
2 stud
3 eee
4 bio
Creating a DataFrame from a dict of ndarrays/lists: To create a DataFrame from a dict of ndarrays/lists, all the ndarrays must be of the same length. If an index is passed, then the length of the index should be equal to the length of the arrays. If no index is passed, then by default the index will be range(n), where n is the array length.
import pandas as pd
# initialise data of lists.
data = {'Name':['Tom', 'nick', 'krish', 'jack'], 'Age':[20, 21, 19, 18]}
# Create DataFrame
df = pd.DataFrame(data)
# Print the output.
print(df)
Name Age
0 Tom 20
1 nick 21
2 krish 19
3 jack 18
Dealing with Rows and Columns- selecting data
A Data frame is a two-dimensional data structure, i.e., data is aligned in a tabular fashion in rows and columns. We can perform basic operations on rows/columns like selecting, deleting, adding, and renaming.
Column Selection: In order to select a column in a Pandas DataFrame, we can access the columns by calling them by their column names.
import pandas as pd
# Define a dictionary containing employee data
data = {'Name':['Jai', 'Princi', 'Gaurav', 'Anuj'],
'Age':[27, 24, 22, 32],
'Address':['Delhi', 'Kanpur', 'Allahabad', 'Kannauj'],
'Qualification':['Msc', 'MA', 'MCA', 'Phd']}
# Convert the dictionary into DataFrame
df = pd.DataFrame(data)
#print the data frame
print(df)
Name Age Address Qualification
0     Jai   27      Delhi           Msc
1  Princi   24     Kanpur            MA
2  Gaurav   22  Allahabad           MCA
3    Anuj   32    Kannauj           Phd
#select one column
print(df['Name'])
0       Jai
1    Princi
2    Gaurav
3      Anuj
Name: Name, dtype: object
# select two columns
print(df[['Name', 'Qualification']])
Name Qualification
0     Jai           Msc
1  Princi            MA
2  Gaurav           MCA
3    Anuj           Phd
Pandas provides a unique method to retrieve rows from a DataFrame. The DataFrame.loc[] method is used to retrieve rows by index label. Rows can also be selected by passing an integer location to the iloc[] function.
Create a data file using Excel and save it in CSV (Comma Separated Values) format as shown below:
# Import pandas package
import pandas as pd
# making data frame from csv file
data = pd.read_csv("stud.csv", index_col ="rollno")
print(data)
name place mark
rollno
101 binu ernkulam 45
103 ashik alleppey 35
102 faisal kollam 48
105 biju kotayam 25
106 ann thrisur 30
print("retrieving row by loc method")
print(data.loc[101])
retrieving row by loc method
name binu
place ernkulam
mark 45
Name: 101, dtype: object
print("retrieving row by iloc method")
print(data.iloc[1])
retrieving row by iloc method
name ashik
place alleppey
mark 35
Name: 103, dtype: object
#select rows based on conditions ( three different methods)
print(df[df['Age']>=25])
print(df[df.Age >= 25])
print(df.query("Age >= 25 "))
Name Age Address Qualification
0 Jai 27 Delhi Msc
3 Anuj 32 Kannauj Phd
Working with Missing Data
Missing data can occur when no information is provided for one or more items or for a whole unit. Missing data is a very big problem in real-life scenarios. Missing data is also referred to as NA (Not Available) values in pandas.
Checking for missing values using isnull() and notnull():
In order to check for missing values in a Pandas DataFrame, we use the functions isnull() and notnull(). Both functions help in checking whether a value is NaN or not. These functions can also be used on a Pandas Series in order to find null values in a series.
import pandas as pd
import numpy as np
# dictionary of lists
dict = {'First Score':[100, 90, np.nan, 95],
'Second Score': [30, 45, 56, np.nan],
'Third Score':[np.nan, 40, 80, 98]}
# creating a dataframe from list
df = pd.DataFrame(dict)
# using isnull() function
print(df.isnull())
   First Score  Second Score  Third Score
0        False         False         True
1        False         False        False
2         True         False        False
3        False          True        False
#using notnull() function
print(df.notnull())
   First Score  Second Score  Third Score
0         True          True        False
1         True          True         True
2        False          True         True
3         True         False         True
Filling missing values using fillna(), replace() and interpolate():
In order to fill null values in a dataset, we use the fillna(), replace() and interpolate() functions; these functions replace NaN values with some value of their own. All these functions help in filling null values in the dataset of a DataFrame. The interpolate() function is also used to fill NA values in the DataFrame, but it uses various interpolation techniques to fill the missing values rather than hard-coding a value.
import pandas as pd
import numpy as np
# dictionary of lists
dict = {'First Score':[100, 90, np.nan, 95],
'Second Score': [30, 45, 56, np.nan],
'Third Score':[np.nan, 40, 80, 98]}
# creating a dataframe from dictionary
df = pd.DataFrame(dict)
print(df)
First Score Second Score Third Score
0 100.0 30.0 NaN
1 90.0 45.0 40.0
2 NaN 56.0 80.0
3 95.0 NaN 98.0
# filling missing value using fillna()
print(df.fillna(0))
   First Score  Second Score  Third Score
0        100.0          30.0          0.0
1         90.0          45.0         40.0
2          0.0          56.0         80.0
3         95.0           0.0         98.0
#filling the empty First Score with the mean value
m=df['First Score'].mean()
df['First Score'].fillna(m, inplace=True)
#filling the empty First Score with the median value
m=df['First Score'].median()
df['First Score'].fillna(m, inplace=True)
#filling the empty First Score with the mode value
m=df['First Score'].mode()[0]
df['First Score'].fillna(m, inplace=True)
# Note: these are alternatives - once the NaN is filled, later fillna calls have no effect
print(df.interpolate())
First Score Second Score Third Score
0 100.0 30.0 NaN
1 90.0 45.0 40.0
2 92.5 56.0 80.0
3 95.0 56.0 98.0
#replacing the nan values with -1
print(df.replace(np.nan,-1))
First Score Second Score Third Score 0 100.0 30.0 -1.0 1 90.0 45.0 40.0 2 -1.0 56.0 80.0 3 95.0 -1.0 98.0
#dropping the rows containing null values
print(df.dropna())
First Score Second Score Third Score
1 90.0 45.0 40.0
Removing Duplicates
To remove duplicates, use the drop_duplicates() method.
Example:
df.drop_duplicates(inplace = True)
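Since the example above only shows the call itself, here is a self-contained sketch with hypothetical data containing one duplicated row:

```python
import pandas as pd

# A small frame where the first and last rows are identical (hypothetical data)
df = pd.DataFrame({'name': ['asha', 'binu', 'asha'],
                   'mark': [40, 45, 40]})

print(len(df))             # 3 rows before
df.drop_duplicates(inplace=True)
print(len(df))             # 2 rows after the duplicate is dropped
print(df)
```

By default drop_duplicates() compares entire rows; you can restrict the comparison with the subset parameter if only some columns should decide what counts as a duplicate.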
Iterating over rows and columns
Iteration is a general term for taking each item of something, one after another. Pandas DataFrame consists of rows and columns so, in order to iterate over dataframe, we have to iterate a dataframe like a dictionary.
In order to iterate, we can use three functions: iterrows() and itertuples() iterate over rows, while items() (called iteritems() in older pandas versions) iterates over columns. These three functions will help in iterating over a DataFrame.
import pandas as pd
# dictionary of lists
dict = {'name':["aparna", "pankaj", "sudhir", "Geeku"],
'degree': ["MBA", "BCA", "M.Tech", "MBA"],
'score':[90, 40, 80, 98]}
# creating a dataframe from a dictionary
df = pd.DataFrame(dict)
print(df)
for i in df.itertuples(): # this will get each row as a tuple
print(i)
print()
Output
     name  degree  score
0  aparna     MBA     90
1  pankaj     BCA     40
2  sudhir  M.Tech     80
3   Geeku     MBA     98
Pandas(Index=0, name='aparna', degree='MBA', score=90)
Pandas(Index=1, name='pankaj', degree='BCA', score=40)
Pandas(Index=2, name='sudhir', degree='M.Tech', score=80)
Pandas(Index=3, name='Geeku', degree='MBA', score=98)
for i,j in df.iterrows(): # this will get each index and each row values
print(i,j)
print()
0 name      aparna
degree       MBA
score         90
Name: 0, dtype: object

1 name      pankaj
degree       BCA
score         40
Name: 1, dtype: object

2 name      sudhir
degree    M.Tech
score         80
Name: 2, dtype: object

3 name      Geeku
degree       MBA
score         98
Name: 3, dtype: object
for i,j in df.items(): # this will extract each column separately (use iteritems() in older pandas versions)
print(i,j)
print()
name 0    aparna
1    pankaj
2    sudhir
3     Geeku
Name: name, dtype: object

degree 0       MBA
1       BCA
2    M.Tech
3       MBA
Name: degree, dtype: object

score 0    90
1    40
2    80
3    98
Name: score, dtype: int64
You can convert a column to a list and then process the list easily:
sc=df['score'].to_list()
#sc is now a plain Python list of score values
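As a self-contained sketch of that idea, the list returned by to_list() can be processed with ordinary Python code (the data below is hypothetical):

```python
import pandas as pd

df = pd.DataFrame({'score': [90, 40, 80, 98]})
sc = df['score'].to_list()      # plain Python list: [90, 40, 80, 98]

# process the list with ordinary Python, no pandas required
print(sum(sc) / len(sc))        # average score -> 77.0
```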
Data Manipulation - Inserting/Deleting rows and columns
# importing pandas as pd
import pandas as pd
# dictionary of lists
dict = {'name':["aparna", "pankaj", "sudhir", "Geeku"],
'degree': ["MBA", "BCA", "M.Tech", "MBA"],
'score':[90, 40, 80, 98]}
# creating a dataframe from a dictionary
df = pd.DataFrame(dict)
print(df)
name degree score
0  aparna     MBA     90
1  pankaj     BCA     40
2  sudhir  M.Tech     80
3   Geeku     MBA     98
df.loc[len(df.index)]=['binu','Phd',47] #adding a new row
print(df)
     name  degree  score
0  aparna     MBA     90
1  pankaj     BCA     40
2  sudhir  M.Tech     80
3   Geeku     MBA     98
4    binu     Phd     47
lst=[47,45,26,34,45]
df['age']=lst # adding a new column age at the end
print(df)
name degree score age
0 aparna MBA 90 47
1 pankaj BCA 40 45
2 sudhir M.Tech 80 26
3 Geeku MBA 98 34
4 binu Phd 47 45
lst=[2002,2003,2004,2005,2017]
df.insert(1,'year',lst) # adding a new column at a particular position
print(df)
name year degree score age
0 aparna 2002 MBA 90 47
1 pankaj 2003 BCA 40 45
2 sudhir 2004 M.Tech 80 26
3 Geeku 2005 MBA 98 34
4 binu 2017 Phd 47 45
df.drop([0,1],inplace=True) # use the index values to remove the rows
print(df)
name year degree score age
2 sudhir 2004 M.Tech 80 26
3 Geeku 2005 MBA 98 34
4 binu 2017 Phd 47 45
df.drop(['score'],axis=1,inplace=True) # use the column name to drop a column
print(df)
     name  year  degree  age
2  sudhir  2004  M.Tech   26
3   Geeku  2005     MBA   34
4    binu  2017     Phd   45
Updating a particular value
The following will change the score in the row whose index label is 3. The at accessor takes an index label and a column name.
df.at[3,'score']=100
print(df)
This will add the value 2 to all values in the age column:
df['age'] +=2
print(df)
Different methods for updating values
Using Indexing: You can use DataFrame indexing to access and modify specific values. For example:
import pandas as pd
# Create a sample DataFrame
data = {'Name': ['Alice', 'Bob', 'Charlie'],
'Age': [25, 30, 35]}
df = pd.DataFrame(data)
print(df)
      Name  Age
0    Alice   25
1      Bob   30
2  Charlie   35
# Update a specific value
df.at[1, 'Age'] = 31
print(df)
      Name  Age
0    Alice   25
1      Bob   31
2  Charlie   35
Using loc or iloc: You can use the loc or iloc accessor to select and update values based on labels or integer-location. For example:
# Update using loc
df.loc[0, 'Age'] = 26
print(df)
      Name  Age
0    Alice   26
1      Bob   31
2  Charlie   35
# Update using iloc
df.iloc[2, 1] = 36
print(df)
Name Age
0 Alice 26
1 Bob 31
2 Charlie 36
Using Boolean Indexing: You can use boolean indexing to update values based on a condition. For example:
# Update values based on a condition
df.loc[df['Name'] == 'Bob', 'Age'] = 32
print(df)
      Name  Age
0    Alice   26
1      Bob   32
2  Charlie   36
Using apply or applymap: If you want to update values based on a function, you can use apply or applymap. For example:
# Update values using apply
df['Age'] = df['Age'].apply(lambda x: x + 1)
print(df)
Name Age
0    Alicia   27
1       Bob   32
2   Charlie   37
# Update all string values using applymap
df = df.applymap(lambda x: x.upper() if isinstance(x, str) else x)
print(df)
Name Age
0 ALICIA 27
1 BOB 32
2 CHARLIE 37
Using replace: You can use the replace method to replace specific values with other values. For example:
# Replace specific values
df.replace({'Alice': 'Alicia', 25: 26}, inplace=True)
print(df)
Name Age
0   Alicia   26
1      Bob   31
2  Charlie   36
Using assign: The assign method allows you to add new columns or modify existing ones. For example:
df2 = df.assign(Address=['Delhi', 'Bangalore', 'Chennai'])
print(df2)
Name Age Address
0    Alice   26      Delhi
1      Bob   31  Bangalore
2  Charlie   36    Chennai
Handling data from a data file
Create a data file in Excel and save it in CSV format.
The following are the various operations you can perform on this data file:
# importing pandas as pd
import pandas as pd
df=pd.read_csv('stud.csv',index_col='rollno')
print("data frame stud")
print(df)
#output
data frame stud
        name     place  mark
rollno
101     binu  ernkulam    45
103    ashik  alleppey    35
102   faisal    kollam    48
105     biju   kotayam    25
106      ann   thrisur    25
107    padma      kylm    25
print("info")
print(df.info())
info
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6 entries, 0 to 5
Data columns (total 4 columns):
 #   Column  Non-Null Count  Dtype
---  ------  --------------  -----
 0   rollno  6 non-null      int64
 1   name    6 non-null      object
 2   place   6 non-null      object
 3   mark    6 non-null      int64
dtypes: int64(2), object(2)
memory usage: 208.0+ bytes
None
print("statistical info of numerical column")
print(df.describe())
statistical info of numerical column
             mark
count    6.000000
mean    33.833333
std     10.590877
min     25.000000
25%     25.000000
50%     30.000000
75%     42.500000
max     48.000000
print("columns")
print(df.columns)
columns
Index(['name', 'place', 'mark'], dtype='object')
print("size")
print(df.size)
size
18
print("data types")
print(df.dtypes)
data types
name     object
place    object
mark      int64
dtype: object
print("shapes")
print(df.shape)
shapes
(6, 3)
print("index and length of index")
print(df.index,len(df.index))
index and length of index
Int64Index([101, 103, 102, 105, 106, 107], dtype='int64', name='rollno') 6
Viewing Data:
Use methods like head(), tail(), sample(), and info() to view the first few rows, last few rows, random rows, and summary information about the DataFrame.
print("top 2 rows")
print(df.head(2))
top 2 rows
         name     place  mark
rollno
101      binu  ernkulam    45
103     ashik  alleppey    35
print("last 2 rows")
print(df.tail(2))
last 2 rows
        name    place  mark
rollno
106      ann  thrisur    25
107    padma     kylm    25
print("Random samples of 2 records")
df.sample(2)
Random samples of 2 records
        name    place  mark
rollno
106      ann  thrisur    25
107    padma     kylm    25
print("statistical functions")
print("sum=",df['mark'].sum())
print("mean=",df['mark'].mean())
print("max=",df['mark'].max())
print("min=",df['mark'].min())
print("var=",df['mark'].var())
print("standard deviation=",df['mark'].std())
print(df.std())
statistical functions
sum= 203
mean= 33.833333333333336
max= 48
min= 25
var= 112.16666666666667
standard deviation= 10.59087657687817
mark    10.590877
dtype: float64
print("data from rows 0,1,2")
print(df[0:3])
data from rows 0,1,2
name place mark
rollno
101      binu  ernkulam    45
103     ashik  alleppey    35
102    faisal    kollam    48
print("mark column values")
print(df['mark'])
mark column values
rollno
101    45
103    35
102    48
105    25
106    25
107    25
Name: mark, dtype: int64
print("rows where mark >40")
print(df[df['mark']>40])
rows where mark >40
        name     place  mark
rollno
101     binu  ernkulam    45
102   faisal    kollam    48
print("rows 0,1,2 columns 0,2")
print(df.iloc[0:3,[0,2]])
data from rows 0,1,2 columns 0,2
          name  mark
rollno
101       binu    45
103      ashik    35
102     faisal    48
Sorting
print("sorting in the descending order of marks")
print(df.sort_values(by='mark',ascending=False))
sorting in the descending order of marks
         name     place  mark
rollno
102    faisal    kollam    48
101      binu  ernkulam    45
103     ashik  alleppey    35
105      biju   kotayam    25
106       ann   thrisur    25
107     padma      kylm    25
Grouping and Aggregation:
Pandas allows you to group data based on one or more columns and perform aggregations on those groups.
print("use agg function to compute all the values")
print(df['mark'].agg(['min','max','mean']))
use agg function to compute all the values
min     25.000000
max     48.000000
mean    33.833333
Name: mark, dtype: float64
print("median of marks")
print("Median", df['mark'].median())
median of marks
Median 30.0
print("mode of marks")
print("Mode", df['mark'].mode())
mode of marks
Mode 0    25
dtype: int64
print("count of marks")
print(df['mark'].value_counts())
count of marks
25    3
45    1
35    1
48    1
Name: mark, dtype: int64
print("grouping data based on column value")
print(df.groupby('mark')['mark'].mean())
grouping data based on column value
mark
25    25
35    35
45    45
48    48
Name: mark, dtype: int64
print("mean mark in each place")
place_groups = df.groupby('place')
print(place_groups['mark'].mean())
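Since the snippet above depends on the stud.csv file, here is a self-contained sketch of the same split-apply-combine idea with hypothetical student data:

```python
import pandas as pd

# Hypothetical student data: two students per place
df = pd.DataFrame({'name':  ['binu', 'ashik', 'faisal', 'biju'],
                   'place': ['kollam', 'kollam', 'kochi', 'kochi'],
                   'mark':  [45, 35, 48, 25]})

# split by place, apply mean, combine into one Series (one value per place)
print(df.groupby('place')['mark'].mean())

# several aggregates at once with agg()
print(df.groupby('place')['mark'].agg(['min', 'max', 'mean']))
```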
Plotting Data
print("plotting .....")
import matplotlib.pyplot as plt
plt.figure(1)
plt.hist(df['mark'])
plt.figure(2)
plt.scatter(df['name'],df['mark'])
plt.figure(3)
plt.pie(df['mark'])
plt.show()
Outputs:
Plotting using the dataframe object method plot()
import pandas as pd
df=pd.read_csv('stud.csv')
print("data frame stud")
print(df)
df['mark'].plot(kind='hist') # df.hist(column='mark') #df.hist() will choose all columns
df.plot(x='rollno',y='mark')
df.plot(kind='scatter',x='name',y='mark')
df.boxplot(column='mark')
Output:
Merging and Joining DataFrames:
You can combine multiple DataFrames by merging or joining them based on common columns or keys.
# Merge two DataFrames based on a common column
merged_df = pd.merge(df1, df2, on='CommonColumn')
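A runnable sketch of the same idea, using two hypothetical frames that share a rollno key column:

```python
import pandas as pd

# Two hypothetical frames sharing a 'rollno' key column
marks  = pd.DataFrame({'rollno': [101, 102, 103], 'mark': [45, 48, 35]})
places = pd.DataFrame({'rollno': [101, 102, 104],
                       'place': ['kollam', 'kochi', 'thrisur']})

# inner join (the default): keep only rollnos present in both frames
merged = pd.merge(marks, places, on='rollno')
print(merged)        # rows for 101 and 102 only
```

Passing how='left', how='right' or how='outer' keeps unmatched rows from one or both sides, with NaN filling the missing columns.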
Writing Data to CSV file using pandas
The process of creating or writing a CSV file through Pandas can be a little more complicated than reading one, but it's still relatively simple. We use the to_csv() function to perform this task. However, you have to create a Pandas DataFrame first, and then write that DataFrame to the CSV file.
The columns to include can be specified via the keyword argument columns, and a different delimiter via the sep argument. The default delimiter is a comma, ','.
Here is a simple example showing how to export a DataFrame to a CSV file via to_csv():
# importing pandas as pd
import pandas as pd
# dictionary of lists
dict = {'name':["aparna", "pankaj", "sudhir", "Geeku"],
'degree': ["MBA", "BCA", "M.Tech", "MBA"],
'score':[90, 40, 80, 98]}
# creating a dataframe from a dictionary
df = pd.DataFrame(dict)
print(df)
df.to_csv('studdata.csv')
#open the studdata.csv and see the data written
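The columns and sep keyword arguments mentioned earlier can be sketched like this (scores.csv is a hypothetical file name):

```python
import pandas as pd

df = pd.DataFrame({'name': ['aparna', 'pankaj'],
                   'degree': ['MBA', 'BCA'],
                   'score': [90, 40]})

# write only two of the three columns, semicolon-separated, without the index
df.to_csv('scores.csv', columns=['name', 'score'], sep=';', index=False)

print(open('scores.csv').read())
# scores.csv now contains:
#   name;score
#   aparna;90
#   pankaj;40
```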
#Using another format with columns
import pandas as pd
#initialize a dataframe
df = pd.DataFrame(
[[21, 'Amol', 72, 67],
[23, 'Lini', 78, 69],
[32, 'Kiku', 74, 56],
[52, 'Ajit', 54, 76]],
columns=['rollno', 'name', 'physics', 'chemistry'])
print('DataFrame with default index\n')
print(df)
#set column as index
df = df.set_index('rollno')
print('\nDataFrame with column as index\n')
print(df)
df.to_csv("studmarks.csv")
Reading data from a csv file using pandas
# importing pandas as pd
import pandas as pd
df=pd.read_csv('stud.csv',index_col='rollno')
print("data frame stud")
print(df)
Writing data to a csv file using csv module
To create a CSV file and write data into it using Python, we have to follow a certain set of instructions:
- Open the CSV file in writing (w mode) with the help of open() function
- Create a CSV writer object by calling the writer() function of the csv module
- Write data to CSV file by calling either the writerow() or writerows() method of the CSV writer object
- Finally, close the CSV file
Example:
import csv
datarows =[['ABC123','Ganesh Kumar','S8','ABC',9.8],
['ECH265','John Mathew','S7','ECH',9.9],
['FET345','Reena K','S6','FET',9.7],
['GMT734','Adil M','S5','GMT',9.75]]
titles=['Roll.No','Name','Semester','College','CGPA']
file=open('cstoppers.csv',"w")
csvwriter=csv.writer(file)
csvwriter.writerow(titles)
csvwriter.writerows(datarows)
print("CSV file cstoppers.csv created")
file.close()
#using DictWriter
The objects created using the csv.DictWriter() class are used to write to a CSV file from a Python dictionary. Its parameters are:
csvfile - a file object having a write() method
fieldnames - a sequence of keys that identifies the order in which values in the dictionary should be passed
writeheader(): This method writes the first row of the CSV file using the pre-defined field names.
writerows(): This method writes all the rows; in each row, only the values (not the keys) are written to the CSV file.
Example:
import csv
# data rows as dictionary objects
mydict =[{'Name': 'Binu', 'SNo': '1', 'Subject': 'English'},
{'Name': 'Padma', 'SNo': '2', 'Subject': 'Mathematics'},
{'Name': 'Aditi', 'SNo': '3', 'Subject': 'Physics'}]
# field names
fields = ['SNo', 'Name', 'Subject']
csvfile=open('students.csv', 'w', newline='')
csvwriter = csv.DictWriter(csvfile, fieldnames = fields)
csvwriter.writeheader()
csvwriter.writerows(mydict)
csvfile.close()
Reading data from a CSV file using csv module
import csv
file=open("students.csv", 'r')
csvreader = csv.reader(file)
for row in csvreader:
print(row)
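The csv module also offers DictReader, the counterpart of DictWriter, which yields each row as a dictionary keyed by the header names. A small self-contained sketch (it recreates a tiny students.csv so the snippet runs on its own):

```python
import csv

# recreate a tiny students.csv so this snippet is self-contained
with open('students.csv', 'w', newline='') as f:
    f.write('SNo,Name,Subject\n1,Binu,English\n2,Padma,Mathematics\n')

# each row comes back as a dict keyed by the header fields
with open('students.csv', newline='') as f:
    for row in csv.DictReader(f):
        print(row['SNo'], row['Name'], row['Subject'])
```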
Reading and Writing Excel Files
Pandas is a very powerful and scalable tool for data analysis. It supports multiple file formats, as we might get data in any format. Pandas also has support for the Excel file format.
We first need to import Pandas and load the Excel file, and then parse the Excel file's sheets as Pandas DataFrames.
Example:
#read excel file into a data frame
import pandas as pd
# read by default 1st sheet of an excel file
dataframe1 = pd.read_excel('stud.xlsx')
print(dataframe1)
Output:
   rno name  mark
0  101  bvp    45
1  102  abc    34
2  103   xz    47
#Read second sheet of an excel file
import pandas as pd
# read 2nd sheet of an excel file
dataframe2 = pd.read_excel('stud.xlsx', sheet_name = 1)
print(dataframe2)
#Parsing an excel sheet and read the data
import pandas as pd
# Import the excel file and call it excel_file
excel_file = pd.ExcelFile('stud.xlsx')
# View the excel_file's sheet names
print(excel_file.sheet_names)
# Load the excel_file's Sheet1 as a dataframe
df = excel_file.parse('Sheet1')
print(df)
Output:
['Sheet1', 'Sheet2', 'Sheet3']
   rno name  mark
0  101  bvp    45
1  102  abc    34
2  103   xz    47
# reading specific columns of an excel file
import pandas as pd
require_cols = [0, 2]
# only read specific columns from an excel file
required_df = pd.read_excel('stud.xlsx', usecols = require_cols)
print(required_df)
Output:
rno mark
0 101 45
1 102 34
2 103 47
Skip starting rows when reading an Excel file using the 'skiprows' parameter of the read_excel() method.
# skipping starting rows while reading
import pandas as pd
# read 2nd sheet of an excel file after
# skipping starting two rows
df = pd.read_excel('stud.xlsx', sheet_name = 1, skiprows = 2)
print(df)
Reading multiple Excel sheets using the 'sheet_name' parameter of the read_excel() method.
#read multiple sheets
# import pandas lib as pd
import pandas as pd
# read both 1st and 2nd sheet.
df = pd.read_excel('stud.xlsx', na_values = "Missing",sheet_name =[0, 1])
print(df)
Reading all Sheets of the excel file together using ‘sheet_name’ parameter of the read_excel() method.
#reading all sheets together
all_sheets_df = pd.read_excel('stud.xlsx', na_values = "Missing",sheet_name = None)
print(all_sheets_df)
#Reading data from an excel file using the xlrd package
# Program to extract a particular row value
# (note: xlrd 2.0+ supports only the older .xls format; for .xlsx use an older xlrd or openpyxl)
import xlrd
loc = ("stud.xlsx")
wb = xlrd.open_workbook(loc)
sheet = wb.sheet_by_index(0)
#extracting column names
print(sheet.cell_value(0, 0),sheet.cell_value(0, 1),sheet.cell_value(0, 2))
for i in range(1,sheet.nrows):
print(sheet.row_values(i))
#Writing a data frame to an excel file using pandas
import pandas as pd
# create dataframe
df = pd.DataFrame({'name': ['Somu', 'Kiku', 'Amol', 'Lini'],
'physics': [68, 74, 77, 78],
'chemistry': [84, 56, 73, 69],
'algebra': [78, 88, 82, 87]})
# create excel writer object
writer = pd.ExcelWriter('marks.xlsx')
# write dataframe to excel
df.to_excel(writer,sheet_name='Sheet1')
# save the excel
writer.save()
print('DataFrame is written successfully to Excel File.')
Output:
Writing data frame to excel file without header and index
import pandas as pd
# create dataframe
df = pd.DataFrame({'name': ['Somu', 'Kiku', 'Amol', 'Lini'],
'physics': [68, 74, 77, 78],
'chemistry': [84, 56, 73, 69],
'algebra': [78, 88, 82, 87]})
# create excel writer object
writer = pd.ExcelWriter('marks.xlsx')
# write dataframe to excel
df.to_excel(writer,sheet_name='Sheet1',header=False,index=False)
# save the excel
writer.save()
print('DataFrame is written successfully to Excel File.')
Output:
# Create some Pandas dataframes from some data.
df1 = pd.DataFrame({'Data1': [11, 12, 13, 14]})
df2 = pd.DataFrame({'Data2': [21, 22, 23, 24]})
df3 = pd.DataFrame({'Data3': [31, 32, 33, 34]})
# Create a Pandas Excel writer object
# using XlsxWriter as the engine.
writer = pd.ExcelWriter('data.xlsx',engine ='xlsxwriter')
# Write each dataframe to a different worksheet.
df1.to_excel(writer, sheet_name ='Sheet1')
df2.to_excel(writer, sheet_name ='Sheet2')
df3.to_excel(writer, sheet_name ='Sheet3')
# Close the Pandas Excel writer object
# and output the Excel file.
writer.save()
# Create some Pandas dataframes from some data.
df1 = pd.DataFrame({'Data1': [11, 12, 13, 14]})
df2 = pd.DataFrame({'Data2': [21, 22, 23, 24]})
df3 = pd.DataFrame({'Data3': [31, 32, 33, 34]})
df4 = pd.DataFrame({'Data4': [35, 40, 43, 34]})
# Create a Pandas Excel writer object
# using XlsxWriter as the engine.
writer = pd.ExcelWriter('posdata.xlsx',engine ='xlsxwriter')
# write and Positioning the dataframes in the worksheet.
# Default position, cell A1.
df1.to_excel(writer, sheet_name ='Sheet1')
df2.to_excel(writer, sheet_name ='Sheet1', startcol = 3)
df3.to_excel(writer, sheet_name ='Sheet1', startrow = 6)
df4.to_excel(writer, sheet_name ='Sheet1',startrow = 7, startcol = 4)
# Close the Pandas Excel writer object
# and output the Excel file.
writer.save()
Output: