DataFrame Blog 2 CSV Files

 Unit 1 – Data Handling I

Chapter 2


DataFrames:


1. Creation - from a dictionary of Series, a list of dictionaries, and text/CSV files
2. Display of the dataframe
3. Iteration over rows and columns
4. Addition of a row or a column to a dataframe
5. Deletion of a row or a column from a dataframe
6. Merging of rows / columns / dataframes with other dataframes
7. Indexing using labels, Boolean indexing


CSV file - Comma Separated Values file. These files are ordinary text files where data is stored in rows
and the column values are separated by a comma, a blank space or a tab. This separation is also
termed delimitation, i.e. we say the data is delimited with a comma or a tab.
When the columns are instead separated by a fixed number of character positions (each column has a fixed width), the file is said to be fixed-width.

Creating a CSV File - 

Stage  1 --> 
1. Open any text editor, say Notepad or TextEdit
2. Enter the data like this - 

Roll_No,Name,Percentage
1,Abhishek,99.97
2,Amitej,98.96
3,Naman,99.98
4,Prakhar,99.99

3. Copy the above data (Ctrl + A and then Ctrl + C)
4. Save the file     

Stage  2 --> 
1. Open any spreadsheet software say MS Excel.
2. Open a new File > Paste the data (Ctrl + V)
(You will see that all the data has been pasted into the first column itself.)
3. Select the Data tab > Text to Columns. A wizard box will open which helps you choose whether
the data is delimited or fixed-width.


Since the data here is separated by commas, the Delimited option should be selected (that is the default
selection); click the Next button.
In the next window the default selected delimiter is Tab.
4. Select the correct delimiter from the given options; since in our file the values are separated by
commas, tick the Comma check box and then click the Next button.

















5. If the data format is General (the default), click the Finish button.


Now you can see that the data in the spreadsheet has automatically spread into rows and their respective columns.

6. Save the file with the extension .csv from MS Excel.

Usage of CSV Files -
CSV files are simple and compact and can hold large amounts of data; that is why organisations which need
continuous data support and storage allow their data to be downloaded in CSV format.

Almost every field or area that needs to publish statistical data offers a CSV-compatible format,
such as share market statistics, population survey sites, CBSE results and data uploading sites, the NEET data
bank, etc.

Creating & Displaying a DataFrame using CSV files.

 

Method 1: Using pandas.read_csv( csvfile_path, skiprows, names, na_values) 

 

csvfile_path = "location of the CSV file" (saved on your system or available online)

 

skiprows=value is an argument that allows you to specify the number of lines to skip at the start of the file.

 

names = ['column1 name', 'column2 name', 'column3 name', ...]

The names argument assigns this list as the column labels of the DataFrame. It is useful when the CSV file has no header row, or when you want to replace the header read from the file (in which case the original header line should be skipped with skiprows).

 

na_values=['value1', 'value2', ...]

A CSV file may contain a string value in an otherwise numeric column (like 'AB' in a marks column). Passing such strings to na_values makes read_csv treat them as missing data (NaN), so the column can still be loaded as int64 or float64 and statistics or plots can be produced in a missing-value-friendly manner.
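For instance, a small sketch (the file name marks.csv and its Marks column are assumed here just for illustration):

import pandas as pd

# suppose marks.csv stores the string 'AB' (absent) inside the numeric Marks column
df = pd.read_csv("marks.csv", na_values=['AB'])   # every 'AB' is read as NaN

print(df.dtypes)            # Marks stays float64 instead of becoming object (string)
print(df['Marks'].mean())   # statistical functions skip NaN values by default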


Example 1  

import pandas

Mydf=pandas.read_csv( "/Users/vineeta/Desktop/Rough.csv")

print(" The tabular data from the file Rough is - \n  ")

print(Mydf)



Example 2 

import pandas

Mydf=pandas.read_csv("https://www.nseindia.com/live_market/dynaContent/liv                              e_watch/equities_stock_watch.htm", names= [ ‘Symbol’, ‘Open’, ‘High’, ‘Low’,                             ‘Close’] )

print(" The Stock_data from NSE web site for August 24, 2020 is - \n \n ")

print(Mydf)




The data is of the National Stock Exchange of India's stock listings for October 13, 2019, taken from the NSE website.

The stock data is saved in E:\Stock_Data.csv

Example - 3

import pandas as pd

import numpy as np

Mydf=pd.read_csv( "E:\Stock_Data.csv", skiprows=1, names=[ 'Symbol', 'Open', 'High', 'Low', 'Close' ] )

print(" The Stock_data from NSE web site for October 13, 2019 is - \n \n ", Mydf)

highe=Mydf.max( )

lowe= Mydf.min( )

print(" \n The highest stock for the day is = \n", highe)

print(" \n The lowest stock for the day is = \n", lowe)




The Stock_data from NSE web site for October 13, 2019 is -

 

          Symbol       Open       High        Low      Close

0     NIFTY 50  11,919.45  11,958.85  11,881.75  11,940.10

1     INFRATEL     227.95     259.85      227.7     251.65

2   BHARTIARTL      426.9     445.65     421.45     444.65

3     AXISBANK        725      750.2        725      749.8

4     RELIANCE   1,467.00   1,514.90   1,465.00   1,511.55

5    POWERGRID        191        196      190.2     195.95

6        CIPLA      461.2      477.5     459.55      473.3

7       GRASIM        775      793.5        772        789

8         SBIN      325.8      331.8      322.1      331.4

9        TECHM        754        766        752     763.25

10  INDUSINDBK   1,382.00   1,409.00   1,374.55   1,395.00

11   EICHERMOT  21,600.00  21,799.95  21,424.20  21,780.05

12        INFY      709.9        714        705      712.8

13    HDFCBANK   1,265.30   1,275.00   1,261.10   1,274.75

14         IOC      132.4      133.6     131.75      133.5

15  BAJAJFINSV   9,070.00   9,189.00   9,066.30   9,163.00

16     DRREDDY   2,741.00   2,760.45   2,724.50   2,751.00

17       TITAN   1,158.80   1,167.80   1,153.10   1,164.00

18       WIPRO      248.2     251.15      247.7     250.45

19        GAIL      124.7      125.8     123.15      125.3

20          LT   1,364.50   1,378.20   1,361.50   1,369.50

21         UPL        551     556.75      538.1      550.2

22   SUNPHARMA        425      429.9      421.6      425.2

23   NESTLEIND  14,110.00  14,211.90  14,009.60  14,100.00

24        NTPC     117.75      118.3      116.8      117.5

25  BAJAJ-AUTO   3,156.00   3,171.50   3,122.10   3,156.30

26    HINDALCO      193.6        194     190.65     193.25

27        BPCL     520.65      521.9      513.5      520.7

28   KOTAKBANK   1,629.90   1,631.10   1,614.15   1,623.00

29   COALINDIA     200.95      201.3     198.25     199.75

30  BAJFINANCE   4,165.00   4,182.15   4,126.25   4,140.00

31  ADANIPORTS      367.5     368.15     361.55        364

32   BRITANNIA   3,140.90   3,158.35   3,110.00   3,124.50

33        ONGC        135      135.1      132.8      133.1

34         ITC     251.75      251.8      248.8     249.25

35     HCLTECH   1,139.00   1,143.50   1,128.65   1,133.90

36      MARUTI   7,146.00   7,155.00   7,032.00   7,054.00

37    JSWSTEEL      249.4     250.85      246.6        248

38  ULTRACEMCO   4,130.00   4,130.00   4,085.25   4,097.00

39        VEDL     143.75      143.9      140.4      142.3

40   ICICIBANK     497.95        502     489.85      494.3

41  HINDUNILVR   2,060.05   2,060.05   2,030.00   2,039.00

42        HDFC   2,235.60   2,238.75   2,202.20   2,215.70

43  ASIANPAINT   1,747.95   1,748.00   1,718.50   1,723.00

44  HEROMOTOCO   2,520.00   2,520.00   2,463.10   2,473.00

45  TATAMOTORS      170.8     171.75      167.1      167.9

46   TATASTEEL        414     414.85        400        403

47         TCS   2,153.00   2,154.65   2,105.00   2,110.00

48        ZEEL     293.75     293.75      282.5        288

49         M&M     573.75        575     559.65        561

50     YESBANK         65      67.25      63.05       64.2




The highest stock for the day is =

 Symbol  EICHERMOT    

Open      21,780.05 

High      21,799.95 

Low       21,424.20

Close     21,600.00

dtype: object

 

 The lowest stock for the day is =

 Symbol    YESBANK         

Open        64.2     

High        67.25     

Low         63.05        

Close       65

dtype: object



Sample DataFrame - (Mydf1) 














3. Operations on Data Frame











Selection / Iteration - is an operation performed over a dataframe to show / extract
values from the dataframe either row wise or column wise, using methods such as -
1. for loop -

import pandas as pd
sample={'Employee' : ['Amitej', 'Prakhar', 'Naman', 'Amitej', 'Prakhar'],
        'Payable Amount':[10000, 12000, 14000, 20000, 15000]}                                                
mydf=pd.DataFrame(sample)                  
print(mydf) 
print("\nIterating over rows using index attribute :\n") 
for i in mydf.index: 
     print(mydf['Employee'][i]) 


print()
for i in range(len(mydf)) :
    print(mydf.loc[i, "Employee"])
print()
for i in range(len(mydf)) :
    print(mydf.iloc[i, 0])

2. iterrows( ):
3. iteritems( ):

iterrows( ) - is the method / function which shows the data of a dataframe row wise.

Syntax - for (row, rowSeries) in dataframeobjectname.iterrows( ):

                                 executable statement(s)

# row is a loop variable that receives the index label of the current row (it moves on to the next row on every iteration)

# rowSeries is a loop variable that receives the data of the current row as a Series, with the column headings as its index


Eg    for (row, rowSeries) in mydf1.iterrows( ):

                print("Row Index - ", row, " & its record is - ", rowSeries)
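Since the sample dataframe Mydf1 is shown only as an image above, here is a small runnable sketch of iterrows( ) using the mydf dataframe built earlier in this post:

import pandas as pd

sample={'Employee' : ['Amitej', 'Prakhar', 'Naman'],
        'Payable Amount':[10000, 12000, 14000]}
mydf=pd.DataFrame(sample)

# each iteration gives (index label, row data as a Series)
for (row, rowSeries) in mydf.iterrows():
    print("Row Index - ", row, " & its record is - \n", rowSeries)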




iteritems( ) - is the method / function which shows the data of a dataframe column / item wise.


Syntax - for (col, colSeries) in dataframeobjectname.iteritems( ):

                      executable statement(s)

#col is a loop variable that receives the label of the current column on every iteration

#colSeries is a loop variable that receives the data of the current column as a Series, with the row headings (index labels) as its index

#the loop iterates from the first column to the last column

Eg --         Same main program code to continue (Mydf1)

                  for (col, colSeries) in Mydf1.iteritems( ):

                            print("Column Index - ", col, " & its each data item is - ", colSeries)
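Again, a small runnable sketch (note: recent pandas releases, 2.0 and above, have removed iteritems( ) in favour of the equivalent items( ) method):

import pandas as pd

sample={'Employee' : ['Amitej', 'Prakhar', 'Naman'],
        'Payable Amount':[10000, 12000, 14000]}
mydf=pd.DataFrame(sample)

# each iteration gives (column label, column data as a Series);
# on older pandas the same loop can be written with mydf.iteritems()
for (col, colSeries) in mydf.items():
    print("Column Index - ", col, " & its each data item is - \n", colSeries)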


















pandas.pivot_table( ) - is a pandas method which creates a spreadsheet-style summary (a pivot) of a dataframe, much like pandas.pivot( ).

The advantages of using pandas.pivot_table( ) are


* it removes the chances of duplicity of data, because duplicate entries are aggregated rather than causing an error (as they would with pandas.pivot( ))
* it supports count(), sum() and other table-related aggregation functions
* it converts rows to columns (and vice versa)

Syntax - pandas.pivot_table(DataFrame, values=None, index=None, columns=None, aggfunc='funckeyword', fill_value=None, dropna=True)


Primary parameters / arguments are - index, values and aggfunc

       #values - the column(s) whose data is to be aggregated

       #index - is assigned with a column / array / list whose values become the row labels of the pivoted table

       #columns - is assigned with a column / array / list whose values become the columns of the pivoted table

       #aggfunc - the aggregation function (keyword) to be applied, e.g. numpy.sum, 'mean', 'count'

       #fill_value - a default value which appears in place of missing entries in the pivoted table

       #dropna - True (default) / False; when True, columns whose entries are all missing (NaN) are dropped from the result

 Example 1 -       

import pandas as pd

import numpy

sample={
                'Employee' : ['Amitej', 'Prakhar', 'Naman', 'Amitej', 'Prakhar'],

                'Payable Amount' :[10000, 12000, 14000, 20000, 15000]

              }                                                      #creating a dictionary named sample

mydf2=pd.DataFrame(sample)                    #dataframe is created through the dictionary

print(mydf2)               # O/P 1

newtable=pd.pivot_table(mydf2, index=['Employee'], aggfunc=numpy.sum)

print(newtable)           # O/P 2


O/P 1

  Employee  Payable Amount
0   Amitej           10000
1  Prakhar           12000
2    Naman           14000
3   Amitej           20000
4  Prakhar           15000

O/P 2

          Payable Amount
Employee
Amitej             30000
Naman              14000
Prakhar            27000

** the row index of the pivoted table is now made up of the Employee column's values


Example 2 

Data of the CSV  File Employee.csv
















        









Code to create a pivoted table from the Employee.csv file

       import pandas as pd

       import numpy

       mydf1=pd.read_csv(r"..\DataFrames\Employee.csv", skiprows=1,
                         names=['Code', 'Name', 'Week', 'Amount Payable'])

       print(mydf1)

       pivotedtable=pd.pivot_table(mydf1, index=['Name'], aggfunc=numpy.sum)

       print(pivotedtable)




**   Creating a DataFrame without typing in any data - using randomly generated numbers


Generate an array / a table of random numbers - 

numpy.random.randint(0,100)

array = numpy.random.randint(0,100,size=(10,4))


Convert this table / matrix / array into a dataframe 

Mydf1 = pandas.DataFrame(array)


Convert this dataframe to a CSV file, i.e. from this dataframe create a CSV file named try.csv

Mydf1.to_csv('try.csv')


Example 1 - 

import pandas

import numpy

array = numpy.random.randint(0,100,size=(4,4))

print(array)

Mydf1 = pandas.DataFrame(array, columns = ['Eng', 'Maths', 'Chem','Acc'])

print("\n", Mydf1)

Mydf1.to_csv('try.csv')














Create a second dataframe likewise


import pandas

import numpy

array = numpy.random.randint(0,100,size=(4,5))

print(array)

Mydf2 = pandas.DataFrame(array, columns = ['Eng', 'Maths', 'Chem','Acc', 'Eco'])

print("\n", Mydf2)

Mydf2.to_csv('try2.csv')















2. Add / Insert a Dataframe 

dataframe_object.append(dataframe2_object, ignore_index=True)

This method is used to append the rows of another dataframe to the end of the given dataframe, returning a new dataframe object.
Columns not in the original dataframe are added as new columns and the new cells appear with the default value NaN.

ignore_index is an argument which is False by default, so each appended row keeps the index label it had in its own dataframe (labels may therefore repeat). When set to True, the resulting dataframe is re-indexed 0, 1, 2, ... in order. See the sketch below.
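A small sketch with made-up dataframes (note that append( ) was removed in pandas 2.0, where pandas.concat( ) gives the same result):

import pandas as pd

df1 = pd.DataFrame({'A': [1, 2], 'B': [3, 4]})
df2 = pd.DataFrame({'A': [5, 6], 'C': [7, 8]})    # column C does not exist in df1

# df1.append(df2, ignore_index=True) appends the rows of df2 to df1 (pandas < 2.0);
# pd.concat below is the newer equivalent - column B of the appended rows and
# column C of the original rows are filled with NaN
appended = pd.concat([df1, df2], ignore_index=True)
print(appended)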










Limitation with a dictionary of series: when ignore_index is not set to True, a plain dictionary (i.e. a row given directly as a dictionary rather than as a dataframe) cannot be appended and a TypeError is raised (see the sketch below).
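A small sketch of this limitation (made-up data; works on pandas versions that still provide append( )):

import pandas as pd

df = pd.DataFrame({'Employee': ['Amitej'], 'Payable Amount': [10000]})
new_row = {'Employee': 'Naman', 'Payable Amount': 14000}    # a plain dictionary, not a dataframe

# df.append(new_row)                          # raises TypeError unless ignore_index=True
df = df.append(new_row, ignore_index=True)    # works: the new row gets the next integer index
print(df)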


























































































Inserting or appending a new dataframe to an existing dataframe created from a list of dictionaries, as in the sketch below.

* without ignore_index
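Since the screenshots are not reproduced here, a small sketch with made-up data (append( ) shown for pandas < 2.0; the pd.concat( ) calls in the comments are the newer equivalent):

import pandas as pd

list1 = [{'Name': 'Abhishek', 'Marks': 90}, {'Name': 'Amitej', 'Marks': 95}]
list2 = [{'Name': 'Naman', 'Marks': 99}, {'Name': 'Prakhar', 'Marks': 98}]

df1 = pd.DataFrame(list1)
df2 = pd.DataFrame(list2)

# without ignore_index the appended rows keep their own labels, so index 0 and 1 repeat
print(df1.append(df2))                     # or pd.concat([df1, df2])

# with ignore_index=True the result is re-indexed 0, 1, 2, 3
print(df1.append(df2, ignore_index=True))  # or pd.concat([df1, df2], ignore_index=True)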















Either of the dataframes can be appended to the other.

















Adding a row, or other dataframes, to a dataframe created from a CSV file


pandas.concat( ) 

 

pandas.concat([row_value, dataframe_object]).reset_index(drop=True)

OR

pandas.concat([dataframe_objects], axis, join, join_axes=[ ], ignore_index)


row_value - is the new row (itself a one-row dataframe) which has to be added to a predefined dataframe.

dataframe_object - is the dataframe to which the new row has to be added.

reset_index(drop=True) - is a dataframe method chained after concat( ) which resets the index of the new dataframe; drop=True discards the old index instead of keeping it as a column.

axis - default value is 0 (0 / 1); axis=0 means the dataframes are concatenated row wise (stacked vertically), while axis=1 concatenates them column wise.

join - default value is 'outer' ('outer' / 'inner'), where 'outer' keeps the union of the labels of the dataframe objects on the other axis and 'inner' keeps their intersection.

join_axes=[ ] - replaces the indexes of the dataframes with a given set of indexes, ignoring their actual index; if one of the dataframes is longer than the given index, the extra rows are truncated in the resultant dataframe. (Note: join_axes was removed in newer pandas releases; see the reindex sketch further below.)

 


import pandas as pd 

Mydf = pd.read_csv("https://www.nseindia.com/live_market/dynaContent/live_watch/equities_stock_watch.htm")


new_row = pd.DataFrame( { 'Symbol':'SIB', 'Open':1200.50, 'High':350, 'Low':220, 'Age':33, 'Close': 250}, index =[0])


# simply concatenate both dataframes

Mydf2 = pd.concat([new_row, Mydf]).reset_index(drop = True)

print(Mydf2)



To concatenate the dataframes as one dataframe -


Mydf1

Mydf2

Mydf3

Listdf = [ Mydf1, Mydf2, Mydf3 ]

Finaldf = pandas.concat(Listdf)



Let's try to answer some questions based on the three dataframes df1, df2 and df3 below.

df1
df2
df3
Listdf=[df1, df2, df3]
Result=pandas.concat(Listdf)

Now suppose we want the result dataframe (output) to carry an extra index level that identifies which of the original (chopped up) dataframes each piece of rows came from, somewhat like this


This can be achieved by the argument 'keys' of the pandas.concat( )
result = pd.concat( Listdf, keys=['x', 'y', 'z'] )


Now if we want to select or show only the rows of, say, dataframe2, i.e. the records with the key 'y', from the result dataframe.

result.loc[ 'y' ]
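Since the images are not reproduced here, a small runnable sketch with made-up df1, df2 and df3:

import pandas as pd

df1 = pd.DataFrame({'A': ['A0', 'A1'], 'B': ['B0', 'B1']})
df2 = pd.DataFrame({'A': ['A2', 'A3'], 'B': ['B2', 'B3']})
df3 = pd.DataFrame({'A': ['A4', 'A5'], 'B': ['B4', 'B5']})

# keys adds an outer index level that records which dataframe each block of rows came from
result = pd.concat([df1, df2, df3], keys=['x', 'y', 'z'])
print(result)

# select only the rows that came from df2 (the key 'y')
print(result.loc['y'])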







Now, when we concatenate the dataframes column wise and the joining is the intersection (inner join) of the dataframes -

result=pandas.concat([ df1 , df4 ], axis =1, join ='inner')


Now, if we want to concatenate the dataframes column wise and show the new dataframe with exactly the index labels of the first dataframe (df1).

result=pandas.concat( [ df1, df4 ], axis=1, join_axes=[df1.index] )
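Note: join_axes was removed from pandas.concat( ) in pandas 1.0. A small sketch (with made-up df1 and df4) of the equivalent using reindex( ):

import pandas as pd

df1 = pd.DataFrame({'A': ['A0', 'A1', 'A2']}, index=[0, 1, 2])
df4 = pd.DataFrame({'D': ['D1', 'D2', 'D3']}, index=[1, 2, 3])

# same effect as join_axes=[df1.index]: keep exactly df1's index labels
result = pd.concat([df1, df4], axis=1).reindex(df1.index)
print(result)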




Add New Column to Dataframe




Pandas allows you to add a new column to an existing dataframe in various ways -

1. By directly assigning the new column with a value to the dataframe.

    a. Mydf1['Percentage'] = 90

    b. Mydf1['Rank'] = pandas.Series(['Third', 'Fifth' , 'First', 'Second', 'Fourth'])

    c. Mydf1['Total Marks'] = Mydf1['M1'] + Mydf1['M2'] + Mydf1['M3']


2. pandas_object.assign(Column_name = List_obj)


We want to add a new column named Rank with the values 'Third', 'Fifth', 'First', 'Second', 'Fourth' respectively to our existing dataframe.


rank=['Third', 'Fifth' , 'First', 'Second', 'Fourth']

Mydf1 = Mydf1.assign(Rank=rank)

(assign( ) returns a new dataframe rather than modifying the existing one, so the result has to be captured.)

 


Add Multiple Columns to a Dataframe


Let's add these two lists (Date, City) as columns to the existing dataframe, using assign( ) with keyword arguments of column names and values

Date = ['1/9/2017', '2/6/2018', '7/12/2018', '5/11/2018', '3/7/2018']

City = ['SFO', 'Chicago', 'Charlotte', 'Santiago', 'Hollywood']


Mydf1 = Mydf1.assign(City=City, Date=Date)

 

 

3. DataFrame.insert( )


The pandas insert( ) method allows the user to insert a column into a dataframe or series (1-D dataframe) at a specified location.

A column can also be added manually by direct assignment (method 1 above), but there isn't much freedom there: the column location can't be chosen, so the new column is always added at the last position. insert( ) removes this limitation.



DataFrameName.insert(loc, column, value, allow_duplicates = False)

 

loc: an integer giving the position at which the new column is to be inserted. Existing columns at and after that position are shifted to the right.
column: a string which is the name of the column to be inserted.
value: the value(s) to be inserted. It can be an int, string, float, or a Series / list of values. Providing a single scalar sets the same value for all rows.
allow_duplicates: a boolean; when False (the default), an error is raised if a column with the same name already exists.


          

# inserting new column with values of list made above        

date_of_birth=['04/04/2004', '05/05/2005', .......]

Mydf1.insert(2, "D-o-B", date_of_birth)

  



Create a new column in Pandas DataFrame based on the existing columns

 

Create a dataframe containing data about an event; we would like to create a new column called 'Discounted_Price', which is calculated by applying a discount of 10% to the ticket price in the 'Cost' column.


DataFrame.apply() function  can be used to achieve this task.

 

# importing pandas as pd

import pandas as pd

  

# Creating the DataFrame

df = pd.DataFrame({'Date':['10/2/2011', '11/2/2011', '12/2/2011', '13/2/2011'],

                                 'Event':['Music', 'Poetry', 'Theatre', 'Comedy'],

                                 'Cost':[10000, 5000, 15000, 2000]})

  

# Print the dataframe

print(df)

 

Now we will create a new column called ‘Discounted_Price’ after applying a 10% discount on the existing ‘Cost’ column.



# using apply function to create a new column

df['Discounted_Price'] = df.apply(lambda row: row.Cost -(row.Cost * 0.1), axis = 1)

  

# Print the DataFrame after addition

# of new column

print(df)



 

** The pandas DataFrame.columns attribute returns the column labels of the given dataframe.

print(dataframe_obj.columns)


** Pandas DataFrame.sort_values( )

# Sort the rows of dataframe by 'Name' column

rslt_df = dataframe_obj.sort_values(by = 'Name')

rslt_df = dataframe_obj.sort_values(by = ['Name', 'Age'])


# column in Descending Order

rslt_df = dataframe_obj.sort_values(by = 'Name', ascending = False)
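A small runnable sketch (the dataframe here is made up just to show the calls):

import pandas as pd

dataframe_obj = pd.DataFrame({'Name': ['Naman', 'Amitej', 'Prakhar'],
                              'Age': [17, 18, 17]})

print(dataframe_obj.sort_values(by = 'Name'))                    # ascending by Name
print(dataframe_obj.sort_values(by = ['Name', 'Age']))           # by Name, then Age
print(dataframe_obj.sort_values(by = 'Name', ascending = False)) # descending by Name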



Create an empty DataFrame with some column names and indices, and then append rows one by one to it using the loc[ ] indexer.



# import pandas library as pd

import pandas as pd

  

# create an Empty DataFrame object with

# column names and indices 

df = pd.DataFrame(columns = ['Name', 'Articles', 'Improved'],  index = ['a', 'b', 'c'])

  

print("Empty DataFrame With NaN values : \n\n", df)

  

# adding rows to an empty 

# dataframe at existing index

df.loc['a'] = ['Ankita', 50, 100]

df.loc['b'] = ['Ankit', 60, 120]

df.loc['c'] = ['Harsh', 30, 60]

  

print(df)

 


Get minimum of each column from a dataframe


df = pd.DataFrame(data, index = ['a', 'b', 'c', 'd', 'e'],

                     columns = ['x', 'y', 'z'])

minvalue_series = df.min()

 

Use the min() function on a dataframe with 'axis = 1' to find the minimum value across the columns for each row.


minvalue_series = df.min(axis = 1)

 



4. pandas.merge(dataframe_object1, dataframe_object2, how='inner', on=None, sort=False)


Dataframe objects 

On - Column or index level names to join on. These must be found in both DataFrame objects. If not passed, and left_index and right_index are False, the intersection of the columns in the DataFrames will be inferred to be the join keys.

how - The how argument to merge specifies how to determine which keys are to be included in the resulting table. If a key combination does not appear in either the left or right tables, the values in the joined table will be NA. Here is a summary of the how options and their SQL equivalent names:


Merge Value   SQL Join Name      Description
left          LEFT OUTER JOIN    Use keys from the first dataframe only
right         RIGHT OUTER JOIN   Use keys from the second dataframe only
outer         FULL OUTER JOIN    Use the union of keys from both dataframes
inner         INNER JOIN         Use the intersection of keys from both dataframes



Now, suppose we want to merge two dataframes, say left and right, on a common key column.


result=pandas.merge(left, right, on='key1')
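The left and right dataframes referred to in this section are shown only as images; a small sketch of such data so that the merge calls can be tried out (column names are assumed):

import pandas as pd

left = pd.DataFrame({'key1': ['K0', 'K1', 'K2'],
                     'A': ['A0', 'A1', 'A2']})
right = pd.DataFrame({'key1': ['K0', 'K1', 'K3'],
                      'B': ['B0', 'B1', 'B3']})

# default how='inner': only key1 values present in both dataframes (K0, K1) are kept
result = pd.merge(left, right, on='key1')
print(result)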




 


 

 





Now, when the dataframes have more than one key column each, they can be merged on all the keys -

result=pandas.merge(left, right, on=['key1', 'key2'])












Now, merging the dataframes with different 'how' values, for example merging on the keys of the first dataframe only.

result=pandas.merge(left, right, how='left', on=['key1', 'key2'])













Now, merging the dataframes on the keys of the second dataframe only

result=pandas.merge(left, right, how='right', on=['key1', 'key2'])













Now, merging the dataframes on the keys of both the dataframes as a union operation (all the keys from both)

result=pandas.merge(left, right, how='outer', on=['key1', 'key2'])















Now, merging the dataframes on the keys of both the dataframes as an inner operation (intersection of the keys)

result=pandas.merge(left, right, how='inner', on=['key1', 'key2'])














Deleting a Column from a Dataframe


del dataframe_obj['Column_name'] - the del keyword deletes the entire specified column from the dataframe.



dataframe_obj.pop('Column_name') - deletes the particular specified column and also returns the deleted column with its values (as a Series).


dataframe_obj.drop(labels, axis) / dataframe_obj.drop(index=..., columns=...) - returns a new dataframe containing only the rows / columns which have not been dropped.


axis-{0 or ‘index’, 1 or ‘columns’}, default 0

Whether to drop labels from the index (0 or ‘index’) or columns (1 or ‘columns’).

index-single label or list-like

Alternative to specifying axis (labels, axis=0 is equivalent to index=labels).

columns-single label or list-like

Alternative to specifying axis (labels, axis=1 is equivalent to columns=labels).




Example -


import pandas as pd

DictofSeries={ 'Subject1': pd.Series([90, 100, 90, 99], index=['Abhishek', 'Amitej', 'Prakhar', 'Bhavya']),

               'Subject2': pd.Series([80, 90, 85, 90],  index=['Abhishek', 'Amitej', 'Prakhar',   'Bhavya']),

               'Subject3': pd.Series([100, 80, 95, 89],  index=['Abhishek', 'Amitej', 'Prakhar',   'Bhavya']),

               'Subject4': pd.Series([85, 70, 75, 70],  index=['Abhishek', 'Amitej', 'Prakhar',   'Bhavya']),

               'Subject5': pd.Series([90, 80, 80, 95],  index=['Abhishek', 'Amitej', 'Prakhar',   'Bhavya']),

              }

Mydf=pd.DataFrame(DictofSeries)

print("\nDataframe from Dictionary of Series \n",Mydf)  

d1=Mydf.drop(['Abhishek'])

print("\n", d1)

d2=d1.drop(['Subject5'], axis=1)

print("\n", d2)

d3=d2.drop(columns='Subject4')

print("\n", d3)




























Example 2 - attempting to drop a column that is not present in the dataframe (for example 'Subject5', which was already dropped above) raises:

KeyError: "['Subject5'] not found in axis"



 Binary Operations on Dataframes -


import pandas as pd

Term1={ 'Subject1': pd.Series([90, 100, 90, 99], index=['Abhishek', 'Amitej', 'Prakhar', 'Bhavya']),

               'Subject2': pd.Series([80, 90, 85, 90],  index=['Abhishek', 'Amitej', 'Prakhar',   'Bhavya']),

               'Subject3': pd.Series([100, 80, 95, 89],  index=['Abhishek', 'Amitej', 'Prakhar',   'Bhavya']),

               'Subject4': pd.Series([85, 70, 75, 70],  index=['Abhishek', 'Amitej', 'Prakhar',   'Bhavya']),

               'Grades': pd.Series(['A', 'B', 'B', 'C'],  index=['Abhishek', 'Amitej', 'Prakhar',   'Bhavya']),

              }


T1df=pd.DataFrame(Term1)

print("\nDataframe1  from Dictionary of Series \n", T1df) 

Term2={ 'Subject1': pd.Series([100, 100, 100, 99], index=['Abhishek', 'Amitej', 'Prakhar', 'Bhavya']),

               'Subject2': pd.Series([90, 90, 100, 90],  index=['Abhishek', 'Amitej', 'Prakhar',   'Bhavya']),

               'Subject3': pd.Series([90, 80, 95, 89],  index=['Abhishek', 'Amitej', 'Prakhar',   'Bhavya']),

               'Subject4': pd.Series([85, 75, 80, 70],  index=['Abhishek', 'Amitej', 'Prakhar',   'Bhavya']),

               'Grades': pd.Series(['A', 'A', 'B', 'C'],  index=['Abhishek', 'Amitej', 'Prakhar',   'Bhavya']),

     }


T2df=pd.DataFrame(Term2)

print("\nDataframe2 from Dictionary of Series \n", T2df)  


Basic Mathematical Operations 


1. Addition (+) - performs the addition mathematical operation on two or more dataframes.

















dataframe_obj1.add( dataframe_obj2)  - 
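A short sketch using the T1df and T2df dataframes created in the code above - the numeric subject columns are added element-wise, while the string Grades column gets concatenated:

print(T1df + T2df)       # '+' operator
print(T1df.add(T2df))    # add( ) gives the same result and also accepts a fill_value argument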



































Adding two columns of two different dataframes - 


import pandas as pd

Term1={ 'Subject1': pd.Series([90, 100, 90, 99], index=['Abhishek', 'Amitej', 'Prakhar', 'Bhavya']),

               'Subject2': pd.Series([80, 90, 85, 90],  index=['Abhishek', 'Amitej', 'Prakhar',   'Bhavya']),

               'Subject3': pd.Series([100, 80, 95, 89],  index=['Abhishek', 'Amitej', 'Prakhar',   'Bhavya']),

               'Subject5': pd.Series([85, 70, 75, 70],  index=['Abhishek', 'Amitej', 'Prakhar',   'Bhavya']),

               'Grades': pd.Series(['A', 'B', 'B', 'C'],  index=['Abhishek', 'Amitej', 'Prakhar',   'Bhavya']),

              }

T1df=pd.DataFrame(Term1)

print("\nDataframe1  from Dictionary of Series \n", T1df) 


Term2={ 'Subject1': pd.Series([100, 100, 100, 99], index=['Abhishek', 'Amitej', 'Prakhar', 'Bhavya']),

               'Subject2': pd.Series([90, 90, 100, 90],  index=['Abhishek', 'Amitej', 'Prakhar',   'Bhavya']),

               'Subject3': pd.Series([90, 80, 95, 89],  index=['Abhishek', 'Amitej', 'Prakhar',   'Bhavya']),

               'Subject4': pd.Series([85, 75, 80, 70],  index=['Abhishek', 'Amitej', 'Prakhar',   'Bhavya']),

               'Grades': pd.Series(['A', 'A', 'B', 'C'],  index=['Abhishek', 'Amitej', 'Prakhar',   'Bhavya']),

     }

T2df=pd.DataFrame(Term2)

print("\nDataframe2 from Dictionary of Series \n", T2df)  


dfnew=T1df['Subject5']+T2df['Subject4']

print("The 2 added column is \n", dfnew)



The O/P --  


The 2 added column is 
 Abhishek    170
Amitej      145
Prakhar     155
Bhavya      140
dtype: int64
























Further, you can use this column in any of the existing dataframes - add it as a new column or replace an existing one:

T2df['Subject4']=dfnew
print(T2df)




































dfnew=T1df['Subject5']+T2df['Grades']

print("The difference of the columns = \n", dfnew)


TypeError: unsupported operand type(s) for +: 'int' and 'str'


Reminder

If you want to create a third column by adding two columns -


1. df['C'] = df['A'] + df['B']


2. df['C'] = df.sum(axis=1)


3. df['C'] = df.apply(lambda row: row['A'] + row['B'], axis=1)


4. df = df.assign(C = df.A + df.B)





2. Subtraction (-) - to find or calculate the differences.

' - ' - operator 


print(T1df-T2df)

O/P-


TypeError: unsupported operand type(s) for -: 'str' and 'str'


sub( ) - method


TypeError: unsupported operand type(s) for -: 'str' and 'str'




dfnew=T1df['Subject5']-T2df['Subject4']


print("The difference of the columns = \n", dfnew)

























3. product / multiplication - 

'*'

mul(  )


4. division / quotient - 

'/'

div( )


5. right side addition  - 

radd( )


print(T1df.radd(T2df))




















6. right side subtraction - 


rsub( )


print(T1df.rsub(T2df))



TypeError: unsupported operand type(s) for -: 'str' and 'str'



Example 2 - All the columns are integer

 























Indexing using Labels

In pandas dataframes, each row also has a name (label). By default, this label is just the row number.
However, you can set one of your columns to be the index of your DataFrame, which means
that its values will be used as row labels, e.g. we can set the column 'name' as our index.

To select a column by its label, we use the .loc[ ] attribute. 

** Integers are valid labels, but they refer to the label and not the position.

Indexing all rows of a specific column of a dataframe -

loc[ ] takes two selectors (each a single label, a list of labels, or a label range) separated by ','. The first one selects the rows and the second one selects the columns.

import pandas as pd
import numpy as np

df = pd.DataFrame(np.random.randn(8, 4),
index = ['a','b','c','d','e','f','g','h'], columns = ['A', 'B', 'C', 'D'])

#select all rows for a specific column
print(df.loc[:,'A'])


a   0.391548
b  -0.070649
c  -0.317212
d  -2.162406
e   2.202797
f   0.613709
g   1.050559
h   1.122680
Name: A, dtype: float64

Indexing all rows for multiple columns of a dataframe -

# import the pandas library and aliasing as pd
import pandas as pd
import numpy as np

df = pd.DataFrame(np.random.randn(8, 4),
index = ['a','b','c','d','e','f','g','h'], columns = ['A', 'B', 'C', 'D'])

# Select all rows for multiple columns, say list[]
print(df.loc[:,['A','C']])

Output  - 

            A           C
a    0.391548    0.745623
b   -0.070649    1.620406
c   -0.317212    1.448365
d   -2.162406   -0.873557
e    2.202797    0.528067
f    0.613709    0.286414
g    1.050559    0.216526
h    1.122680   -1.621420

Indexing few rows for multiple columns of a dataframe -

# import the pandas library and aliasing as pd
import pandas as pd
import numpy as np

df = pd.DataFrame(np.random.randn(8, 4),
index = ['a','b','c','d','e','f','g','h'], columns = ['A', 'B', 'C', 'D'])

# Select few rows for multiple columns, say list[]
print(df.loc[['a','b','f','h'],['A','C']])

Output −

           A          C
a   0.391548   0.745623
b  -0.070649   1.620406
f   0.613709   0.286414
h   1.122680  -1.621420

Indexing range of rows for all columns of a dataframe -

# import the pandas library and aliasing as pd
import pandas as pd
import numpy as np

df = pd.DataFrame(np.random.randn(8, 4),
index = ['a','b','c','d','e','f','g','h'], columns = ['A', 'B', 'C', 'D'])

# Select range of rows for all columns
print(df.loc['a':'h'])

Output -

            A           B          C          D
a    0.391548   -0.224297   0.745623   0.054301
b   -0.070649   -0.880130   1.620406   1.419743
c   -0.317212   -1.929698   1.448365   0.616899
d   -2.162406    0.614256  -0.873557   1.093958
e    2.202797   -2.315915   0.528067   0.612482
f    0.613709   -0.157674   0.286414  -0.500517
g    1.050559   -2.272099   0.216526   0.928449
h    1.122680    0.324368  -1.621420  -0.741470


iloc[row_dimension, column_dimension ] -
To select rows and columns by their position within a dataframe we use .iloc[ ]. The
first dimension range selects the rows and the second dimension range selects the
columns of the dataframe, as in the sketch below.

Eg -1 
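The example image is not reproduced here, so a small sketch of positional selection with iloc[ ]:

import pandas as pd
import numpy as np

df = pd.DataFrame(np.random.randn(8, 4),
index = ['a','b','c','d','e','f','g','h'], columns = ['A', 'B', 'C', 'D'])

print(df.iloc[0, 1])       # single value: first row, second column
print(df.iloc[0:3, 0:2])   # first three rows of the first two columns
print(df.iloc[[1, 3], :])  # second and fourth rows, all columns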































Boolean Indexing

Indexing - searching for a data item or an element in the given data structure.

Boolean Indexing - a dataframe is created with boolean values (True / False) as its index labels,
and the data is then searched / filtered using these boolean values, as in the sketch below.
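A small sketch with made-up data showing both forms - a dataframe whose index labels are boolean values, and the more common filtering with a boolean condition:

import pandas as pd

# dataframe whose index labels are boolean values
df = pd.DataFrame({'Name': ['Abhishek', 'Amitej', 'Prakhar', 'Naman'],
                   'Marks': [90, 100, 95, 99]},
                  index=[True, False, True, False])

print(df.loc[True])      # rows whose index label is True

# filtering the rows with a boolean condition
df2 = df.reset_index(drop=True)
print(df2[df2['Marks'] > 95])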














