61. Pandas - Panel Data and Python Data Analysis


Pandas is a Python library used for working with data sets. It has functions for analyzing, cleaning, exploring, and manipulating data. The name "Pandas" refers to both "Panel Data" and "Python Data Analysis"; the library was created by Wes McKinney in 2008.

Pandas is an open-source library built on top of the NumPy library. It is a Python package that offers various data structures and operations for manipulating numerical data and time series. It is popular mainly because it makes importing and analyzing data much easier. Pandas is fast and offers high performance and productivity for its users.

Pandas was initially developed by Wes McKinney in 2008 while he was working at AQR Capital Management. He convinced AQR to allow him to open source the library. Another AQR employee, Chang She, joined as the second major contributor in 2012.

The source code for Pandas is hosted in this GitHub repository: https://github.com/pandas-dev/pandas

Installation and use of Pandas

If you have Python and pip already installed on a system, then installing Pandas is very easy.

Install it using this command:
pip install pandas

If this command fails, use a Python distribution that already has Pandas installed, such as Anaconda or Spyder.

Once Pandas is installed, import it in your applications with the import keyword. It is conventionally imported under the alias pd:
import pandas as pd

Now Pandas is imported and ready to use.
The version string is stored in the __version__ attribute.
print(pd.__version__)

Advantages
  • Fast and efficient for manipulating and analyzing data.
  • Data can be loaded from many different file formats.
  • Easy handling of missing data (represented as NaN) in both floating point and non-floating point data.
  • Size mutability: columns can be inserted into and deleted from DataFrames and higher-dimensional objects.
  • Data set merging and joining.
  • Flexible reshaping and pivoting of data sets.
  • Time-series functionality.
  • Powerful group-by functionality for performing split-apply-combine operations on data sets.
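The reshaping and pivoting mentioned above can be sketched with a small example; the table and column names here (city, year, pop) are invented purely for illustration:

```python
import pandas as pd

# long-format data: one row per (city, year) observation
long_df = pd.DataFrame({'city': ['NY', 'NY', 'LA', 'LA'],
                        'year': [2020, 2021, 2020, 2021],
                        'pop': [8.3, 8.4, 3.9, 4.0]})

# pivot to wide format: one row per city, one column per year
wide = long_df.pivot(index='city', columns='year', values='pop')
print(wide)
```

The same data is now laid out as a 2x2 table, which is often the shape needed for reporting or plotting.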

Note: Pandas is widely used in data analysis, data cleaning, and data preprocessing because of its powerful features and easy-to-use API. It is an essential library for anyone working with data in Python, particularly when dealing with tabular or structured data.

Pandas provides two primary data structures for manipulating data:
Series
DataFrame

Pandas Series:
A Series is a one-dimensional labeled array that can hold data of any type (integers, strings, floats, etc.).
It is similar to a column in a spreadsheet or a single column in a database table.
A Series consists of two parts: the data and the index. The data contains the actual values, and the index provides labels for the data points.
You can create a Series using the pandas.Series() constructor, passing in a list, array, or dictionary as the data.
More precisely, a Series is a one-dimensional array-like object containing an array of data (of any NumPy data type) and an associated array of data labels, called its index. The simplest Series is formed from only an array of data:

import pandas as pd
obj=pd.Series([3,5,-8,7,9])
print(obj)
0    3
1    5
2   -8
3    7
4    9
dtype: int64
print(obj.index) 
RangeIndex(start=0, stop=5, step=1)
print(obj.values)
[ 3  5 -8  7  9]
Often it will be desirable to create a Series with an index identifying each data point:
obj2 = pd.Series([4, 7, -5, 3], index=['d', 'b', 'a', 'c'])
print(obj2)
d    4
b    7
a   -5
c    3
dtype: int64
NumPy array operations, such as filtering with a boolean array, scalar multiplication, or applying math functions, will preserve the index-value link:
print(obj2[obj2>0])
d    4
b    7
c    3
dtype: int64
print(obj2*2)
d     8
b    14
a   -10
c     6
dtype: int64
Another way to think about a Series is as a fixed-length, ordered dict, as it is a mapping of index values to data values. It can be substituted into many functions that expect a dict:
'b' in obj2
True
'e' in obj2
False
If you have data contained in a Python dict, you can create a Series from it by passing the dict:
sdata = {'Ohio': 35000, 'Texas': 71000, 'Oregon': 16000, 'Utah': 5000}
obj3=pd.Series(sdata)
print(obj3)
Ohio      35000
Texas     71000
Oregon    16000
Utah       5000
dtype: int64
states = ['California', 'Ohio', 'Oregon', 'Texas'] 
obj4=pd.Series(sdata,index=states) 
print(obj4)
California        NaN
Ohio          35000.0
Oregon        16000.0
Texas         71000.0
dtype: float64
The 3 values found in sdata were placed in the appropriate locations, but since no value for 'California' was found, it appears as NaN (not a number), which pandas uses to mark missing or NA values. The isnull and notnull functions in pandas can be used to detect missing data:
obj4.isnull()
California     True
Ohio          False
Oregon        False
Texas         False
dtype: bool
obj4.notnull()
California    False
Ohio           True
Oregon         True
Texas          True
dtype: bool

A critical Series feature for many applications is that it automatically aligns differently indexed data in arithmetic operations:
print(obj3)
Ohio      35000
Texas     71000
Oregon    16000
Utah       5000
dtype: int64 
print(obj4)
California        NaN
Ohio          35000.0
Oregon        16000.0
Texas         71000.0
dtype: float64 
print(obj3+obj4)
California         NaN
Ohio           70000.0
Oregon         32000.0
Texas         142000.0
Utah               NaN
dtype: float64
obj4.name='Population'
obj4.index.name='states'

print(obj4)
states
California        NaN
Ohio          35000.0
Oregon        16000.0
Texas         71000.0
Name: Population, dtype: float64

Pandas DataFrame
A Pandas DataFrame is a two-dimensional, size-mutable, potentially heterogeneous tabular data structure with labeled axes (rows and columns); that is, data is aligned in a tabular fashion in rows and columns. A DataFrame consists of three principal components: the data, the rows, and the columns.

Features:
  • A DataFrame is a two-dimensional, tabular data structure that is similar to a spreadsheet or a SQL table.
  • It consists of rows and columns, where each column can be of a different data type.
  • You can think of a DataFrame as a collection of Series, where each Series represents a column, and the columns share the same index.
  • DataFrames are commonly used for data analysis and are very versatile for handling and analyzing structured data.
  • You can create a DataFrame using the pandas.DataFrame() constructor, passing in data (usually a dictionary, a list of dictionaries, or a NumPy array) with column labels.

We will get a brief insight into the basic operations that can be performed on a Pandas DataFrame:
Creating a DataFrame
Dealing with Rows and Columns
Indexing and Selecting Data
Working with Missing Data
Manipulating Data
Iterating over rows and columns
Exporting and Saving data

Creating a DataFrame
In the real world, a Pandas DataFrame is usually created by loading a dataset from existing storage such as a SQL database, a CSV file, or an Excel file. A DataFrame can also be created from lists, a dictionary, a list of dictionaries, etc. Here are some of the ways to create a DataFrame:

import pandas as pd
 
# list of strings
lst = ['mec', 'minor', 'stud', 'eee',  'bio']
 
# Calling DataFrame constructor on list
df = pd.DataFrame(lst)
print(df)
      0
0    mec
1  minor
2   stud
3    eee
4    bio
Creating a DataFrame from a dict of ndarrays/lists: to create a DataFrame from a dict of ndarrays/lists, all the arrays must be of the same length. If an index is passed, then the length of the index should be equal to the length of the arrays. If no index is passed, then by default the index will be range(n), where n is the array length.

import pandas as pd
# initialise data of lists.
data = {'Name':['Tom', 'nick', 'krish', 'jack'], 'Age':[20, 21, 19, 18]}
# Create DataFrame
df = pd.DataFrame(data)
 # Print the output.
print(df)
    Name  Age
0    Tom   20
1   nick   21
2  krish   19
3   jack   18

Dealing with Rows and Columns - selecting data
A DataFrame is a two-dimensional data structure, i.e., data is aligned in a tabular fashion in rows and columns. We can perform basic operations on rows/columns such as selecting, deleting, adding, and renaming.
Column Selection: to select a column in a Pandas DataFrame, we access the column by its column name.

import pandas as pd
 # Define a dictionary containing employee data
data = {'Name':['Jai', 'Princi', 'Gaurav', 'Anuj'],
        'Age':[27, 24, 22, 32],
        'Address':['Delhi', 'Kanpur', 'Allahabad', 'Kannauj'],
        'Qualification':['Msc', 'MA', 'MCA', 'Phd']}
 # Convert the dictionary into DataFrame 
df = pd.DataFrame(data)
#print the data frame
print(df)
     Name  Age    Address Qualification
0     Jai   27      Delhi           Msc
1  Princi   24     Kanpur            MA
2  Gaurav   22  Allahabad           MCA
3    Anuj   32    Kannauj           Phd
#select one column
print(df['Name'])
0       Jai
1    Princi
2    Gaurav
3      Anuj
Name: Name, dtype: object
# select two columns
print(df[['Name', 'Qualification']])
     Name Qualification
0     Jai           Msc
1  Princi            MA
2  Gaurav           MCA
3    Anuj           Phd


Row Selection
Pandas provides dedicated methods to retrieve rows from a DataFrame. The DataFrame.loc[] method retrieves rows by index label, and rows can also be selected by passing an integer location to the iloc[] method.

Create a data file using Excel and save it in CSV (Comma Separated Values) format as shown below.


# Import pandas package
import pandas as pd
# making data frame from csv file
data = pd.read_csv("stud.csv", index_col ="rollno")
print(data)
           name     place  mark
rollno                        
101       binu  ernkulam    45
103      ashik  alleppey    35
102     faisal    kollam    48
105       biju   kotayam    25
106        ann   thrisur    30
print("retrieving row by loc method")
print(data.loc[101])
retrieving row by loc method
name         binu
place    ernkulam
mark           45
Name: 101, dtype: object
print("retrieving row by iloc method")
print(data.iloc[1])
retrieving row by iloc method
name        ashik
place    alleppey
mark           35
Name: 103, dtype: object
Selecting only the name and mark columns:
print(data[['name','mark']])

#select rows based on conditions (three equivalent methods, using the employee DataFrame df created above)
print(df[df['Age']>=25])
print(df[df.Age >= 25])
print(df.query("Age >= 25 "))

    Name  Age  Address Qualification
0   Jai   27    Delhi           Msc
3  Anuj   32  Kannauj           Phd


Working with Missing Data

Missing data can occur when no information is provided for one or more items or for a whole unit. Missing data is a very big problem in real-life scenarios. Missing data is also referred to as NA (Not Available) values in pandas.

Checking for missing values using isnull() and notnull():
To check for missing values in a Pandas DataFrame, we use the functions isnull() and notnull(). Both functions help in checking whether a value is NaN or not. These functions can also be used on a Pandas Series to find null values in a series.

import pandas as pd
import numpy as np
# dictionary of lists
dict = {'First Score':[100, 90, np.nan, 95],
        'Second Score': [30, 45, 56, np.nan],
        'Third Score':[np.nan, 40, 80, 98]}
# creating a dataframe from list
df = pd.DataFrame(dict)

# using isnull() function  
print(df.isnull())
   First Score  Second Score  Third Score
0        False         False         True
1        False         False        False
2         True         False        False
3        False          True        False
#using notnull() function
print(df.notnull())
   First Score  Second Score  Third Score
0         True          True        False
1         True          True         True
2        False          True         True
3         True         False         True


Filling missing values using fillna(), replace() and interpolate():
To fill null values in a dataset, we use the fillna(), replace() and interpolate() functions; these functions replace NaN values with some value of their own. The interpolate() function also fills NA values in the DataFrame, but it uses various interpolation techniques to fill the missing values rather than hard-coding a value.

import pandas as pd
import numpy as np
 
# dictionary of lists
dict = {'First Score':[100, 90, np.nan, 95],
        'Second Score': [30, 45, 56, np.nan],
        'Third Score':[np.nan, 40, 80, 98]}
 
# creating a dataframe from dictionary
df = pd.DataFrame(dict)
print(df)
    First Score  Second Score  Third Score
0        100.0          30.0          NaN
1         90.0          45.0         40.0
2          NaN          56.0         80.0
3         95.0           NaN         98.0

# filling missing value using fillna()  
print(df.fillna(0))
   First Score  Second Score  Third Score
0        100.0          30.0          0.0
1         90.0          45.0         40.0
2          0.0          56.0         80.0
3         95.0           0.0         98.0
#filling the empty First Score values with the mean value
m=df['First Score'].mean()
df['First Score']=df['First Score'].fillna(m)

#filling the empty First Score values with the median value
m=df['First Score'].median()
df['First Score']=df['First Score'].fillna(m)

#filling the empty First Score values with the mode value
m=df['First Score'].mode()[0]
df['First Score']=df['First Score'].fillna(m)
(Each of these is an alternative; once the NaNs are filled, the later calls have no effect.)

#filling the NaN values by interpolation
print(df.interpolate())
     First Score  Second Score  Third Score
0        100.0          30.0          NaN
1         90.0          45.0         40.0
2         92.5          56.0         80.0
3         95.0          56.0         98.0

#replacing the nan values with -1
print(df.replace(np.nan,-1))
   First Score  Second Score  Third Score
0        100.0          30.0         -1.0
1         90.0          45.0         40.0
2         -1.0          56.0         80.0
3         95.0          -1.0         98.0

#dropping the rows containing null values
print(df.dropna())
    First Score  Second Score  Third Score
1         90.0          45.0         40.0


Removing Duplicates

To remove duplicates, use the drop_duplicates() method.

Example:
df.drop_duplicates(inplace = True)
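A minimal self-contained sketch of drop_duplicates(); the names and marks below are invented for illustration:

```python
import pandas as pd

df = pd.DataFrame({'name': ['binu', 'ashik', 'binu'],
                   'mark': [45, 35, 45]})
df.drop_duplicates(inplace=True)  # keeps the first occurrence of each duplicated row
print(df)
```

Only rows where every column value repeats an earlier row are dropped; use the subset parameter to consider only certain columns.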

Iterating over rows and columns
Iteration is a general term for taking each item of something, one after another. A Pandas DataFrame consists of rows and columns, so to iterate over a DataFrame we iterate over it much like a dictionary.
We can use three methods: items() (named iteritems() before pandas 2.0, where the old name was removed), iterrows() and itertuples(). These three methods help in iterating over a DataFrame.

# importing pandas as pd 
import pandas as pd 
# dictionary of lists 
dict = {'name':["aparna", "pankaj", "sudhir", "Geeku"],
 'degree': ["MBA", "BCA", "M.Tech", "MBA"], 
'score':[90, 40, 80, 98]} 

# creating a dataframe from a dictionary 
df = pd.DataFrame(dict) 
print(df)
for i in df.itertuples(): # this will get each row as a tuple
    print(i) 
    print()

Output
    name  degree  score
0  aparna     MBA     90
1  pankaj     BCA     40
2  sudhir  M.Tech     80
3   Geeku     MBA     98
Pandas(Index=0, name='aparna', degree='MBA', score=90)

Pandas(Index=1, name='pankaj', degree='BCA', score=40)

Pandas(Index=2, name='sudhir', degree='M.Tech', score=80)

Pandas(Index=3, name='Geeku', degree='MBA', score=98)

for i,j in df.iterrows(): # this will get each index and each row values
    print(i,j)
    print()
0 name      aparna
degree       MBA
score         90
Name: 0, dtype: object

1 name      pankaj
degree       BCA
score         40
Name: 1, dtype: object

2 name      sudhir
degree    M.Tech
score         80
Name: 2, dtype: object

3 name      Geeku
degree      MBA
score        98
Name: 3, dtype: object

for i,j in df.items():# this will extract each column separately (iteritems() in older pandas)
    print(i,j)
    print()
name 0    aparna
1    pankaj
2    sudhir
3     Geeku
Name: name, dtype: object

degree 0       MBA
1       BCA
2    M.Tech
3       MBA
Name: degree, dtype: object

score 0    90
1    40
2    80
3    98
Name: score, dtype: int64

You can convert a column to a list and then process the list easily:
sc=df['score'].to_list()
Now sc is a plain Python list of the score values.
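A runnable sketch of this conversion, with invented scores, showing ordinary Python processing of the resulting list:

```python
import pandas as pd

df = pd.DataFrame({'score': [90, 40, 80, 98]})
sc = df['score'].to_list()   # sc is now a plain Python list
print(sc)
print(sum(sc) / len(sc))     # average computed with plain Python
```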

Data Manipulation Inserting/Deleting rows and columns
# importing pandas as pd 
import pandas as pd 
# dictionary of lists 
dict = {'name':["aparna", "pankaj", "sudhir", "Geeku"],
 'degree': ["MBA", "BCA", "M.Tech", "MBA"], 
'score':[90, 40, 80, 98]} 
# creating a dataframe from a dictionary 
df = pd.DataFrame(dict) 
print(df)
    name  degree  score
0  aparna     MBA     90
1  pankaj     BCA     40
2  sudhir  M.Tech     80
3   Geeku     MBA     98

df.loc[len(df.index)]=['binu','Phd',47] #adding a new row
print(df)
 name  degree  score
0  aparna     MBA     90
1  pankaj     BCA     40
2  sudhir  M.Tech     80
3   Geeku     MBA     98
4    binu     Phd     47

lst=[47,45,26,34,45]
df['age']=lst # adding a new column age at the end
print(df)
     name  degree  score  age
0  aparna     MBA     90   47
1  pankaj     BCA     40   45
2  sudhir  M.Tech     80   26
3   Geeku     MBA     98   34
4    binu     Phd     47   45
lst=[2002,2003,2004,2005,2017]
df.insert(1,'year',lst) # adding a new column at a particular position
print(df)
      name  year  degree  score  age
0  aparna  2002     MBA     90   47
1  pankaj  2003     BCA     40   45
2  sudhir  2004  M.Tech     80   26
3   Geeku  2005     MBA     98   34
4    binu  2017     Phd     47   45
df.drop([0,1],inplace=True) # use the index values to remove the rows
print(df)
     name  year  degree  score  age
2  sudhir  2004  M.Tech     80   26
3   Geeku  2005     MBA     98   34
4    binu  2017     Phd     47   45
df.drop(['score'],axis=1,inplace=True) # use the column name to drop a column
print(df)
  
     name  year  degree  age
2  sudhir  2004  M.Tech   26
3   Geeku  2005     MBA   34
4    binu  2017     Phd   45

Updating a particular value
The following will change the score in the row with index label 3. You can also use other index labels with the at accessor.
df.at[3,'score']=100
print(df)


This will add 2 to every value in the age column:
df['age'] +=2
print(df)

Different methods for updating values
Using Indexing: You can use DataFrame indexing to access and modify specific values. 
For example:
import pandas as pd

# Create a sample DataFrame
data = {'Name': ['Alice', 'Bob', 'Charlie'],
        'Age': [25, 30, 35]}
df = pd.DataFrame(data)
print(df)
      Name  Age
0    Alice   25
1      Bob   30
2  Charlie   35
# Update a specific value
df.at[1, 'Age'] = 31

print(df)
  Name  Age
0    Alice   25
1      Bob   31
2  Charlie   35
Using loc or iloc: You can use the loc or iloc accessor to select and update values based on labels or integer-location. For example:
# Update using loc 
df.loc[0, 'Age'] = 26 
print(df)
        Name  Age
0    Alice   26
1      Bob   31
2  Charlie   35
# Update using iloc
df.iloc[2, 1] = 36
print(df)
      Name  Age
0    Alice   26
1      Bob   31
2  Charlie   36

Using Boolean Indexing: You can use boolean indexing to update values based on a condition. For example:
# Update values based on a condition
df.loc[df['Name'] == 'Bob', 'Age'] = 32
print(df)
  Name  Age
0    Alice   26
1      Bob   32
2  Charlie   36
Using apply or applymap: if you want to update values based on a function, you can use apply (on a Series or along DataFrame rows/columns) or applymap (element-wise on a DataFrame; renamed DataFrame.map in pandas 2.1). For example:
# Update values using apply
df['Age'] = df['Age'].apply(lambda x: x + 1)
print(df)
      Name  Age
0    Alice   27
1      Bob   32
2  Charlie   37
# Update all string values using applymap
df = df.applymap(lambda x: x.upper() if isinstance(x, str) else x)
print(df)
      Name  Age
0    ALICE   27
1      BOB   32
2  CHARLIE   37
Using replace: you can use the replace method to replace specific values with other values. For example (this snippet was run on the DataFrame as it was after the iloc update above):
# Replace specific values
df.replace({'Alice': 'Alicia', 25: 26}, inplace=True)
print(df)
      Name  Age
0   Alicia   26
1      Bob   31
2  Charlie   36
Using assign: the assign method returns a new DataFrame with added or modified columns; new columns are passed as keyword arguments. For example:
df2 = df.assign(Address=['Delhi', 'Bangalore', 'Chennai'])
print(df2)
      Name  Age    Address
0    Alice   26      Delhi
1      Bob   31  Bangalore
2  Charlie   36    Chennai

Handling data from a data file
Create a data file in excel and save it in CSV format

The following are various operations you can perform on this data file.
# importing pandas as pd
import pandas as pd
df=pd.read_csv('stud.csv',index_col='rollno')
print("data frame stud")
print(df)
#output
data frame stud
          name     place    mark
rollno                        
101       binu  ernkulam    45
103      ashik  alleppey    35
102     faisal    kollam    48
105       biju   kotayam    25
106        ann   thrisur    25
107      padma      kylm    25

print("info")
print(df.info())
info
<class 'pandas.core.frame.DataFrame'>
Int64Index: 6 entries, 101 to 107
Data columns (total 3 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   name    6 non-null      object
 1   place   6 non-null      object
 2   mark    6 non-null      int64 
dtypes: int64(1), object(2)
memory usage: 192.0+ bytes
None

print("statistical info of numerical column")
print(df.describe())
statistical info of numerical column
            mark
count   6.000000
mean   33.833333
std    10.590877
min    25.000000
25%    25.000000
50%    30.000000
75%    42.500000
max    48.000000

print("columns")
print(df.columns)
columns
Index(['name', 'place', 'mark'], dtype='object')

print("size")
print(df.size)
size
18
print("data types")
print(df.dtypes)
data types
name     object
place    object
mark      int64
dtype: object

print("shapes")
print(df.shape)
shapes
(6, 3)

print("index and length of index")
print(df.index,len(df.index))
index and length of index
Int64Index([101, 103, 102, 105, 106, 107], dtype='int64', name='rollno') 6

Viewing Data:
Use methods like head(), tail(), sample(), and info() to view the first few rows, last few rows, random rows, and summary information about the DataFrame.
print("top 2 rows")
print(df.head(2))
top 2 rows
         name     place  mark
rollno                       
101      binu  ernkulam    45
103     ashik  alleppey    35
print("last 2 rows")
print(df.tail(2))
last 2 rows
         name    place  mark
rollno                      
106       ann  thrisur    25
107     padma     kylm    25
print("Random samples of 2 records")
df.sample(2)
Random samples of 2 records
         name    place  mark
rollno                      
106       ann  thrisur    25
107     padma     kylm    25
print("statistical functions")
print("sum=",df['mark'].sum())
print("mean=",df['mark'].mean())
print("max=",df['mark'].max())
print("min=",df['mark'].min())
print("var=",df['mark'].var())
print("standard deviation=",df['mark'].std())
print(df.std(numeric_only=True)) # restrict to numeric columns
statistical functions
sum= 203
mean= 33.833333333333336
max= 48
min= 25
var= 112.16666666666667
standard deviation= 10.59087657687817
mark    10.590877
dtype: float64

print("data from rows 0,1,2")
print(df[0:3])
data from rows 0,1,2
          name     place  mark
rollno                        
101       binu  ernkulam    45
103      ashik  alleppey    35
102     faisal    kollam    48

print("mark column values")
print(df['mark'])
mark column values
rollno
101    45
103    35
102    48
105    25
106    25
107    25
Name: mark, dtype: int64

print("rows where mark >40")
print(df[df['mark']>40])
rows where mark >40
          name     place  mark
rollno                        
101       binu  ernkulam    45
102     faisal    kollam    48

print("rows 0,1,2 columns 0,2")
print(df.iloc[0:3,[0,2]])
rows 0,1,2 columns 0,2
          name  mark
rollno              
101       binu    45
103      ashik    35
102     faisal    48


Sorting

print("sorting in the descending order of marks")
print(df.sort_values(by='mark',ascending=False))
sorting in the descending order of marks
          name     place  mark
rollno                        
102     faisal    kollam    48
101       binu  ernkulam    45
103      ashik  alleppey    35
105       biju   kotayam    25
106        ann   thrisur    25
107      padma      kylm    25

Grouping and Aggregation:
Pandas allows you to group data based on one or more columns and perform aggregations on those groups.
print("use agg function to compute all the values")
print(df['mark'].agg(['min','max','mean']))
use agg function to compute all the values
min     25.000000
max     48.000000
mean    33.833333
Name: mark, dtype: float64

print("median of marks")
print("Median",df['mark'].median())
median of marks
Median 30.0

print("mode of marks")
print("Mode",df['mark'].mode())
mode of marks
Mode 0    25
dtype: int64

print("count of marks")
print(df['mark'].value_counts())
count of marks
25    3
45    1
35    1
48    1
Name: mark, dtype: int64

print("grouping data based on column value")
print(df.groupby('mark')['mark'].mean())
grouping data based on column value
mark
25    25
35    35
45    45
48    48
Name: mark, dtype: int64

print("mean mark in each place")
place_groups = df.groupby('place')
mean_mark = place_groups['mark'].mean()
print(mean_mark)

Plotting Data
print("plotting .....")
import matplotlib.pyplot as plt

plt.figure(1)
plt.hist(df['mark'])

plt.figure(2)
plt.scatter(df['name'],df['mark'])

plt.figure(3)
plt.pie(df['mark'],labels=df['name'])
plt.show()

Outputs:
plotting .....
(the histogram, scatter plot, and pie chart are displayed)

Plotting using dataframe object method plot()

import pandas as pd

df=pd.read_csv('stud.csv')
print("data frame stud")
print(df)

import matplotlib.pyplot as plt
df['mark'].plot(kind='hist') # df.hist(column='mark')  #df.hist() will choose all numeric columns
df.plot(x='rollno',y='mark')
df.plot(kind='scatter',x='name',y='mark')
df.plot.box(column='mark')
plt.show()

Output:
(the histogram, line plot, scatter plot, and box plot are displayed)

Merging and Joining DataFrames:
You can combine multiple DataFrames by merging or joining them based on common columns or keys.
# Merge two DataFrames based on a common column 

merged_df = pd.merge(df1, df2, on='CommonColumn')
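A runnable sketch of merge(), using two small invented tables that share a rollno key; how='outer' keeps keys found on only one side and fills the missing side with NaN:

```python
import pandas as pd

marks = pd.DataFrame({'rollno': [101, 102, 103],
                      'mark': [45, 48, 35]})
places = pd.DataFrame({'rollno': [101, 103, 105],
                       'place': ['ernkulam', 'alleppey', 'kotayam']})

inner = pd.merge(marks, places, on='rollno')               # only rollnos present in both
outer = pd.merge(marks, places, on='rollno', how='outer')  # all rollnos; missing side -> NaN
print(inner)
print(outer)
```

The how parameter also accepts 'left' and 'right' to keep all keys from one side only, analogous to SQL joins.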

Writing Data to CSV file using pandas

The process of creating or writing a CSV file with Pandas can be a little more complicated than reading a CSV, but it is still relatively simple. We use the to_csv() function for this task. You first create a Pandas DataFrame and then write that DataFrame to the CSV file.

The columns to write can be specified via the keyword argument columns, and a different delimiter via the sep argument. The default delimiter is a comma, ','.

Here is a simple example showing how to export a DataFrame to a CSV file via to_csv():
# importing pandas as pd 
import pandas as pd 
# dictionary of lists 
dict = {'name':["aparna", "pankaj", "sudhir", "Geeku"],
 'degree': ["MBA", "BCA", "M.Tech", "MBA"], 
'score':[90, 40, 80, 98]} 
# creating a dataframe from a dictionary 
df = pd.DataFrame(dict) 
print(df)
df.to_csv('studdata.csv')
#open the studdata.csv and see the data written
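The columns and sep arguments mentioned above can be seen in action without touching the disk by writing to an in-memory buffer (StringIO stands in for a file here; the data is invented):

```python
import pandas as pd
from io import StringIO

df = pd.DataFrame({'name': ['aparna', 'pankaj'],
                   'score': [90, 40]})

buf = StringIO()
# write only the name column, tab-separated, without the row index
df.to_csv(buf, sep='\t', columns=['name'], index=False)
print(buf.getvalue())
```

Passing a real filename instead of the buffer writes the same content to disk.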

#Using another format with columns 
import pandas as pd

#initialize a dataframe
df = pd.DataFrame(
[[21, 'Amol', 72, 67],
[23, 'Lini', 78, 69],
[32, 'Kiku', 74, 56],
[52, 'Ajit', 54, 76]],
columns=['rollno', 'name', 'physics', 'chemistry'])

print('DataFrame with default index\n')
print(df)

#set column as index
df = df.set_index('rollno')

print('\nDataFrame with column as index\n')
print(df)
df.to_csv("studmarks.csv")

Reading data from a CSV file using pandas
# importing pandas as pd
import pandas as pd
df=pd.read_csv('stud.csv',index_col='rollno')
print("data frame stud")
print(df)

Writing data to a csv file using csv module
To create a CSV file and write data into it using Python, we have to follow a certain set of instructions:
  • Open the CSV file in write ('w') mode using the open() function
  • Create a CSV writer object by calling the writer() function of the csv module
  • Write data to the CSV file by calling either the writerow() or writerows() method of the CSV writer object
  • Finally, close the CSV file
Example:
import csv
datarows =[['ABC123','Ganesh Kumar','S8','ABC',9.8],
          ['ECH265','John Mathew','S7','ECH',9.9],
          ['FET345','Reena K','S6','FET',9.7],
          ['GMT734','Adil M','S5','GMT',9.75]]
titles=['Roll.No','Name','Semester','College','CGPA']
file=open('cstoppers.csv',"w",newline='')
csvwriter=csv.writer(file)
csvwriter.writerow(titles)
csvwriter.writerows(datarows)
print("CSV file cstoppers.csv created")
file.close()

#using DictWriter 
The objects created using the csv.DictWriter() class are used to write to a CSV file from Python dictionaries.
csvfile - a file object having a write() method
fieldnames - a sequence of keys that identifies the order in which values from the dictionary are written
writeheader(): writes the first row of the CSV file using the pre-defined field names.
writerows(): writes all the rows; for each row only the values (not the keys) are written to the CSV file.
Example:
import csv 

# data rows as dictionary objects 
mydict =[{'Name': 'Binu', 'SNo': '1', 'Subject': 'English'}, 
         {'Name': 'Padma', 'SNo': '2', 'Subject': 'Mathematics'}, 
         {'Name': 'Aditi', 'SNo': '3', 'Subject': 'Physics'}]

# field names 
fields = ['SNo', 'Name', 'Subject'] 

csvfile=open('students.csv', 'w', newline='') 
csvwriter = csv.DictWriter(csvfile, fieldnames = fields)
csvwriter.writeheader() 
csvwriter.writerows(mydict)
csvfile.close()

Reading data from a CSV file using csv module
import csv

file=open("students.csv", 'r')
csvreader = csv.reader(file)
for row in csvreader:
    print(row)
file.close()

Reading and Writing Excel Files

Pandas is a very powerful and scalable tool for data analysis. It supports multiple file formats, since we might get data in any format, and this includes the Excel file format.

We first need to import Pandas and load the Excel file, and then parse the Excel sheets as Pandas DataFrames.

Example:
#read excel file into a data frame
import pandas as pd
# read by default 1st sheet of an excel file
dataframe1 = pd.read_excel('stud.xlsx')

print(dataframe1)
Output:
  rno name  mark
0  101  bvp    45
1  102  abc    34
2  103   xz    47

#Read second sheet of an excel file

import pandas as pd

# read 2nd sheet of an excel file
dataframe2 = pd.read_excel('stud.xlsx', sheet_name = 1)

print(dataframe2)

#Parsing an excel sheet and read the data
import pandas as pd

# Import the excel file and call it excel_file
excel_file = pd.ExcelFile('stud.xlsx')

# View the excel_file's sheet names
print(excel_file.sheet_names)

# Load the excel_file's Sheet1 as a dataframe
df = excel_file.parse('Sheet1')
print(df)

Output:
['Sheet1', 'Sheet2', 'Sheet3']
   rno name  mark
0  101  bvp    45
1  102  abc    34
2  103   xz    47

One can also read specific columns using the usecols parameter of the read_excel() method.
# reading specific columns of an excel file

import pandas as pd
require_cols = [0, 2]
# only read specific columns from an excel file
required_df = pd.read_excel('stud.xlsx', usecols = require_cols)
print(required_df)

Output:
    rno  mark
0  101    45
1  102    34
2  103    47
Skip starting rows when reading an Excel file using the skiprows parameter of the read_excel() method.
# skipping starting rows while reading

import pandas as pd
# read 2nd sheet of an excel file after
# skipping starting two rows
df = pd.read_excel('stud.xlsx', sheet_name = 1, skiprows = 2)

print(df)

Reading multiple Excel sheets using the sheet_name parameter of the read_excel() method.
#read multiple sheets

# import pandas lib as pd
import pandas as pd

# read both 1st and 2nd sheet.
df = pd.read_excel('stud.xlsx', na_values = "Missing",sheet_name =[0, 1])

print(df)

Reading all sheets of the Excel file together by passing sheet_name=None to the read_excel() method.
#reading all sheets together

import pandas as pd
all_sheets_df = pd.read_excel('stud.xlsx', na_values = "Missing",sheet_name = None)
print(all_sheets_df)

#Reading data from an excel file using the xlrd package
# (note: xlrd 2.0 and later reads only the old .xls format; for .xlsx use xlrd<2.0 or openpyxl)
# Program to extract a particular row value
import xlrd
loc = ("stud.xlsx")
wb = xlrd.open_workbook(loc)
sheet = wb.sheet_by_index(0)
#extracting column names
print(sheet.cell_value(0, 0),sheet.cell_value(0, 1),sheet.cell_value(0, 2))
for i in range(1,sheet.nrows):
    print(sheet.row_values(i))

#Writing data frame to an excel file using pandas
import pandas as pd
# create dataframe
df = pd.DataFrame({'name': ['Somu', 'Kiku', 'Amol', 'Lini'],
'physics': [68, 74, 77, 78],
'chemistry': [84, 56, 73, 69],
'algebra': [78, 88, 82, 87]})

# create excel writer object
writer = pd.ExcelWriter('marks.xlsx')

# write dataframe to excel
df.to_excel(writer,sheet_name='Sheet1')

# save the excel (writer.save() was removed in pandas 2.0; close() saves the file)
writer.close()
print('DataFrame is written successfully to Excel File.')

Output:
DataFrame is written successfully to Excel File.
Writing data frame to excel file without header and index
import pandas as pd
# create dataframe
df = pd.DataFrame({'name': ['Somu', 'Kiku', 'Amol', 'Lini'],
'physics': [68, 74, 77, 78],
'chemistry': [84, 56, 73, 69],
'algebra': [78, 88, 82, 87]})


# create excel writer object
writer = pd.ExcelWriter('marks.xlsx')

# write dataframe to excel
df.to_excel(writer,sheet_name='Sheet1',header=False,index=False)

# save the excel (writer.save() was removed in pandas 2.0; close() saves the file)
writer.close()
print('DataFrame is written successfully to Excel File.')

Output:
DataFrame is written successfully to Excel File.
#writing data to three different sheets
import pandas as pd
# Create some Pandas dataframes from some data.
df1 = pd.DataFrame({'Data1': [11, 12, 13, 14]})
df2 = pd.DataFrame({'Data2': [21, 22, 23, 24]})
df3 = pd.DataFrame({'Data3': [31, 32, 33, 34]})
# Create a Pandas Excel writer object
# using XlsxWriter as the engine.
writer = pd.ExcelWriter('data.xlsx',engine ='xlsxwriter')
# Write each dataframe to a different worksheet.
df1.to_excel(writer, sheet_name ='Sheet1')
df2.to_excel(writer, sheet_name ='Sheet2')
df3.to_excel(writer, sheet_name ='Sheet3')
# Close the Pandas Excel writer object
# and output the Excel file (writer.save() was removed in pandas 2.0)
writer.close()
#Positioning data frame in a work sheet
import pandas as pd

# Create some Pandas dataframes from some data.
df1 = pd.DataFrame({'Data1': [11, 12, 13, 14]})
df2 = pd.DataFrame({'Data2': [21, 22, 23, 24]})
df3 = pd.DataFrame({'Data3': [31, 32, 33, 34]})
df4 = pd.DataFrame({'Data4': [35, 40, 43, 34]})
# Create a Pandas Excel writer object
# using XlsxWriter as the engine.
writer = pd.ExcelWriter('posdata.xlsx',engine ='xlsxwriter')

# write and Positioning the dataframes in the worksheet.
# Default position, cell A1.
df1.to_excel(writer, sheet_name ='Sheet1')
df2.to_excel(writer, sheet_name ='Sheet1', startcol = 3)
df3.to_excel(writer, sheet_name ='Sheet1', startrow = 6)
df4.to_excel(writer, sheet_name ='Sheet1',startrow = 7, startcol = 4)

# Close the Pandas Excel writer object
# and output the Excel file (writer.save() was removed in pandas 2.0)
writer.close()

Output:
(posdata.xlsx contains the four dataframes placed at the specified positions in Sheet1)