DataFrame Blog 1

Unit 1 – Data Handling I

DataFrames:

1.Creation –

▪ from numpy, list of dictionary, dictionary of list, series and dictionary of Series

▪ Text/CSV files;

2.Display;

3.Iteration;

4.Operations on rows and columns: add, select, delete, rename;

5.Head and Tail functions;

6.Indexing using Labels, Boolean Indexing;

7.Importing/Exporting Data between CSV files and Data Frames.

This blog's objectives are

* to explain what a dataframe is in Python pandas: its purpose, usage and applications.

* to show the various ways in which a dataframe can be created in Python pandas.

* to understand the ways in which a dataframe can be displayed, and what output to expect when it is supplied with different types of values.

Data Frame is a 2-dimensional labeled data structure with columns of potentially different data types.
You can think of it like a spreadsheet or SQL table, or a dict of Series objects.



It is generally the most commonly used pandas object.
Like Series, Data Frame accepts many different kinds of input:

✔  dictionary of Series,
✔  list of dictionaries,
✔  Text/CSV files
❖     Dict of 1D ndarrays,
❖     Lists
❖     Series
❖     2-D numpy.ndarray
❖     Another DataFrame

Along with the data, you can optionally pass index (row labels) and columns (column labels) arguments to the DataFrame constructor.
If you pass an index and / or columns, you are guaranteeing the index and / or columns of the resulting DataFrame.

** When the data is a dict, and columns are not specified, the DataFrame columns will be ordered by the dict's insertion order.

When a table is created through the DataFrame( ) constructor, the dataframe is displayed with its row index as the leftmost column.

** The row index is added to the table automatically by default and is displayed as the very first column. It is not part of the data, and the row indexing starts from the first record (0), not from the column headings.
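The effect of the index and columns arguments can be sketched as follows. The marks data and the variable names are made up for illustration only.

```python
import pandas as pd

# Hypothetical marks data, used only for illustration
marks = {"Sub1": [90, 80], "Sub2": [99, 85]}

# Without index/columns: default row labels 0, 1 and columns in insertion order
df_default = pd.DataFrame(marks)

# With index/columns: the row labels and the column order are the ones passed
df_labelled = pd.DataFrame(marks, index=["Aman", "Naman"], columns=["Sub2", "Sub1"])

print(df_default)
print(df_labelled)
```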

1. Creation of a Data Frame & 2. Display of a Data Frame

pandas.DataFrame(data_object ) – is the constructor used to create a 2-D table from the data_object passed as an argument. The size of this table is mutable.

Steps to create a dataframe using the constructor pandas.DataFrame( ) are

Step 1 Call the pandas library (by import statement)
Step 2 Define / Create the data_object
Step 3 Create the dataframe (by calling the DataFrame( ) constructor of pandas)
Step 4 Perform the Operation on the dataframe

I. Creating an Empty Data Frame  

Step 1     Call the pandas library (by import statement)
DataFrame( ) is the constructor of the DataFrame class defined within the pandas library, so the code to import the library is

import pandas             OR                   import pandas as <alias_name>

Step 2      Define a dictionary of Series (Dos) / dicts (dictionary variable)
Ooops!!!!   I don’t have a supply of any dictionary of Series

Step 3      Create the dataframe by calling the DataFrame( ) constructor of pandas and assign to a variable
import pandas
Mydf1 = pandas.DataFrame( )         # Calling the DataFrame( ) constructor directly through the library name

OR

import pandas as pd
Mydf1 = pd.DataFrame( )              # Calling the DataFrame( ) constructor through the alias pd

Display a dataframe - print(dataframe_object), or just dataframe_object written as the last statement in an interactive session

Step 4     Display the dataframe
import pandas
Mydf1 = pandas.DataFrame( )
print(Mydf1)

OR Display the dataframe
import pandas
pandas.DataFrame( )
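As a quick check, the sketch below creates an empty dataframe and inspects it. The variable name empty_df is only illustrative.

```python
import pandas as pd

# An empty dataframe has no rows and no columns
empty_df = pd.DataFrame()

print(empty_df)        # reports an empty DataFrame with no columns and no index
print(empty_df.empty)  # True
print(empty_df.shape)  # (0, 0)
```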

II. Creation of DataFrame from NumPy ndarrays

Eg. 1 --

import numpy as np

import pandas as pd

array1 = np.array(["Aman",90,99])

df1 = pd.DataFrame(array1)

print("1D - DataFrame")

print("DataFrame 1")

print(df1)

array2 = np.array(["Naman",80,85])

df2= pd.DataFrame(array2)

print("\n \n DataFrame 2")

print(df2)

array3 = np.array(["Yuvi",75,70])

df3=pd.DataFrame(array3)

print("\n \n DataFrame 3")

df3

O/P - >

	0
0	Yuvi
1	75
2	70
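One detail the output above does not make obvious: a NumPy array holds a single data type, so the marks in these arrays are stored as text. A minimal sketch of how this can be checked, reusing array1 from the example:

```python
import numpy as np
import pandas as pd

# NumPy arrays hold one data type, so the numbers are converted to strings
array1 = np.array(["Aman", 90, 99])
df1 = pd.DataFrame(array1)

print(array1.dtype.kind)       # U, the code for a Unicode string array
print(df1.shape)               # (3, 1): three rows, one column
print(df1.iloc[1, 0] == "90")  # True: the value is the text "90", not the number 90
```

If the marks are needed as numbers, a list or a dictionary is a safer data object than a mixed NumPy array.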

2D DataFrame -

import numpy as np
import pandas as pd
array1 = np.array(["Aman",90,99])
array2 = np.array(["Naman",80,85])
array3 = np.array(["Yuvi",75,70])
# Each array becomes one row of the dataframe
df1 = pd.DataFrame([array1, array2, array3])
print("2D - DataFrame 1 \n")
print(df1)

2D - DataFrame 1 

       0   1   2
0   Aman  90  99
1  Naman  80  85
2   Yuvi  75  70

import numpy as np
import pandas as pd
array1 = np.array(["Aman",90,99])
array2 = np.array(["Naman",80,85])
array3 = np.array(["Yuvi",75,70])
# np.array( ) has no index argument; the row labels are passed to DataFrame( ) instead
df1 = pd.DataFrame([array1, array2, array3], index=[1,2,3])
print("2D - DataFrame 1 \n")
print(df1)

III. Creation of DataFrame from List of Dictionaries

import pandas as pd

listDict = [{'a':10, 'b':20}, {'a':5, 'b':10, 'c':20}]

dFrameListDict = pd.DataFrame(listDict)

dFrameListDict
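What this example produces can be sketched as below: each dictionary becomes a row, each key becomes a column, and a key missing from a dictionary leaves NaN in that row.

```python
import pandas as pd

listDict = [{'a': 10, 'b': 20}, {'a': 5, 'b': 10, 'c': 20}]
dFrameListDict = pd.DataFrame(listDict)

print(dFrameListDict)
# The first dictionary has no key 'c', so row 0 holds NaN in column 'c'
print(dFrameListDict['c'].isna().tolist())   # [True, False]
```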

IV. Creation of DataFrame from Dictionary of Lists

import pandas as pd

dictForest = {'State': ['Assam', 'Delhi', 'Kerala'],

'GArea': [78438, 1483, 38852] ,

'VDF' : [2797, 6.72,1663] }

dFrameForest= pd.DataFrame(dictForest, columns = ['State','VDF', 'GArea'] )

dFrameForest

V. Creation of DataFrame from Series

import pandas as pd

seriesA = pd.Series([1,2,3,4,5], index = ['a', 'b', 'c', 'd', 'e'])

seriesB = pd.Series ([1000,2000,-1000,-5000,1000], index = ['a', 'b', 'c', 'd', 'e'])

seriesC = pd.Series([10,20,-10,-50,100], index = ['z', 'y', 'a', 'c', 'e'])

dFrame5 = pd.DataFrame(seriesA)

dFrameV = pd.DataFrame( [ seriesA, seriesC] )

dFrame5

print(dFrameV)
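The difference between passing one Series and passing a list of Series can be sketched as follows. The names single and rows are illustrative, not from the example above.

```python
import pandas as pd

seriesA = pd.Series([1, 2, 3, 4, 5], index=['a', 'b', 'c', 'd', 'e'])
seriesC = pd.Series([10, 20, -10, -50, 100], index=['z', 'y', 'a', 'c', 'e'])

# A single Series becomes one column; its index labels become the row labels
single = pd.DataFrame(seriesA)

# A list of Series becomes one row per Series; the columns are the union of
# all index labels, and labels missing from a Series are filled with NaN
rows = pd.DataFrame([seriesA, seriesC])

print(single.shape)   # (5, 1)
print(rows.shape)     # (2, 7)
print(rows)
```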

import pandas as pd
seriesA = pd.Series(["Aman",90,95,96,98])
seriesB = pd.Series (["Naman",80,85,82,99])
seriesC = pd.Series (["Celin",70,86,96,98])
dFrame5 = pd.DataFrame(seriesA)
dFrameV = pd.DataFrame( [ seriesA, seriesB, seriesC] )
dFrameV.columns=['Name','Sub1','Sub2','Elec1','Elec2']
dFrameV.index=[1,2,3]
dFrameV.index.name='Roll No'
dFrameV

VI. Creation of DataFrame from Dictionary of Series (DoS)

dos_variable_name= { 'series1': pandas.Series( list_of_values ) , 'series2':pandas.Series( list_of_values ) ...}

Eg - Mydos={ 'Series1': pandas.Series( [10, 20, 30, 40, 50 ] ), 'Series2':pandas.Series( [ 1, 2, 3, 4, 5 ] ) }

Eg 1 -->

import pandas as pd

Mydos={ 'Name':pd.Series(["Aman", "Naman","Celin"] ),

'Sub1':pd.Series( [90,80,70] ),

'Sub2':pd.Series( [95,85,86] ),

'Elec1':pd.Series([96,82,96]),

'Elec2':pd.Series([98,99,98])}

dFrameVI = pd.DataFrame(Mydos)

dFrameVI.index=[1,2,3]

dFrameVI.index.name="Roll No"

dFrameVI

Code to create a dataframe from a dictionary of series with the given values is -

Series 1 Values: 90, 100, 90, 99

Series 2 Values: 80, 90, 85, 99

So let's do in steps -->

Step 1 🡪 import pandas as pd

Step 2 🡪 MyDoS = { 'Series1': pd.Series([90, 100, 90, 99]),

'Series2': pd.Series([80, 90, 85, 99])

}

Step 3 🡪 Mydf=pd.DataFrame(MyDoS)

Step 4 🡪 print(Mydf)

Let's see the output
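Put together as one program, the four steps might look like this sketch. Running it prints a table with two columns and the default row index.

```python
import pandas as pd

MyDoS = {'Series1': pd.Series([90, 100, 90, 99]),
         'Series2': pd.Series([80, 90, 85, 99])}
Mydf = pd.DataFrame(MyDoS)

# Each dictionary key becomes a column; the default row index runs from 0 to 3
print(Mydf)
print(Mydf.shape)   # (4, 2)
```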

Eg - 2 Code to create a dataframe from the given dictionary of series (with the given row index value)

Series 1
Values    Index Address
90        Abhishek
100       Amitej
90        Prakhar
99        Bhavya

Series 2
Values    Index Address
80        Abhishek
90        Amitej
85        Prakhar
99        Bhavya

Use the index argument of pandas.Series( ) to label the index with the desired values instead of the default index values 0, 1, 2, ...

Step 1 🡪 <'Series_object_variable'>: pd.Series([90, 100, 90, 99], index=['Abhishek', 'Amitej', 'Prakhar', 'Bhavya'])

The above code creates only a single Series, and we have to create a dictionary of Series, so the second Series is defined in the same way

<'Series_object_variable'>: pd.Series([80, 90, 85, 99], index=['Abhishek', 'Amitej', 'Prakhar', 'Bhavya'])

Step 2 🡪 DictofSeries={ 'Series1': pd.Series([90, 100, 90, 99], index=['Abhishek', 'Amitej',

'Prakhar', 'Bhavya']),

'Series2': pd.Series([80, 90, 85, 99], index=['Abhishek', 'Amitej',

'Prakhar', 'Bhavya'])

}

After creating the dictionary of Series --> Create the dataframe

Step 3 🡪 Create the dataframe from DoS !!!!

Mydf=pd.DataFrame(DictofSeries) # Using DataFrame( ) constructor of pandas (imported as pd)

After creating the dataframe from DoS

Step 4 🡪 Display the dataframe

print( "The dataframe created from DoS" , Mydf)

# The program to create a dataframe from the given dictionary of series and to display is -

import pandas as pd

DictofSeries= { 'Series1': pd.Series([90, 100, 90, 99], index=['Abhishek', 'Amitej', 'Prakhar',

'Bhavya']),

'Series2': pd.Series([80, 90, 85, 99], index=['Abhishek', 'Amitej', 'Prakhar',

'Bhavya'])

}

Mydf=pd.DataFrame(DictofSeries)

print("\nDataframe from Dictionary of Series \n", Mydf)

What after creating a dataframe??????

You can perform certain operations on it like – Selection, Display, Iteration, Mathematical and statistical operations etc.

The row and column labels can be accessed through the index and columns attributes of the dataframe respectively.
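A minimal sketch of reading the labels back, using a shortened two-row version of the dictionary of series from this blog:

```python
import pandas as pd

Mydf = pd.DataFrame({
    'Series1': pd.Series([90, 100], index=['Abhishek', 'Amitej']),
    'Series2': pd.Series([80, 90], index=['Abhishek', 'Amitej']),
})

print(Mydf.index.tolist())    # ['Abhishek', 'Amitej']
print(Mydf.columns.tolist())  # ['Series1', 'Series2']
```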

Eg 3 - Code to extract from a dataframe row wise or column wise

import pandas as pd

DictofSeries={ 'Series1': pd.Series([90, 100, 90, 99], index=['Abhishek', 'Amitej', 'Prakhar',

'Bhavya']),

'Series2': pd.Series([80, 90, 85, 99], index=['Abhishek', 'Amitej', 'Prakhar',

'Bhavya'])

}

Mydf=pd.DataFrame(DictofSeries)

print("\nDataframe from Dictionary of Series \n",Mydf)

Selective=pd.DataFrame(Mydf, index=['Prakhar', 'Bhavya', 'Amitej'], columns=['Series1', 'Series2'])

print("The new subset of dataframe by index and column is \n", Selective)

** The values assigned to the index and columns arguments should be the same as the labels used in the dictionary of series.

Eg - 4 Code to extract a data / subset from a dataframe where a column name does not match / exist

In the above code when the dictionary of series is used to create a dataframe it has two columns Series1 and Series2.

Now, when the data from dataframe is filtered or extracted using row index and columns argument then the columns argument is given three values as

1. Series1 (which exists in dataframe)

2. Series2 (which exists in dataframe)

3. Term1 (which does not exist in dataframe)

In such a case, where a column is requested which does not actually exist in the main dataframe, the column will be displayed with the new column name as heading and all its values will be NaN.

NaN - Not a Number, is the default marker pandas uses for a missing value. It appears in any cell for which no value was supplied, such as a cell under a column or row label that does not exist in the source data.
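A sketch of the situation described for Eg - 4 could look like the code below. The choice of rows is an assumption; only the column names Series1, Series2 and Term1 come from the description.

```python
import pandas as pd

names = ['Abhishek', 'Amitej', 'Prakhar', 'Bhavya']
DictofSeries = {'Series1': pd.Series([90, 100, 90, 99], index=names),
                'Series2': pd.Series([80, 90, 85, 99], index=names)}
Mydf = pd.DataFrame(DictofSeries)

# 'Term1' is not a column of Mydf, so it appears filled with NaN
Selective = pd.DataFrame(Mydf, index=['Prakhar', 'Bhavya'],
                         columns=['Series1', 'Series2', 'Term1'])
print(Selective)
```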

Eg - 5 Code to extract a data / subset from a dataframe where a row index does not match / exist

In the above code the row index 'Ananya' does not exist in the original dataframe. So in the extracted dataframe the column values for non matching row index 'Ananya' is shown with the default value NaN.
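The row case can be sketched the same way. Only the label 'Ananya' comes from the description; the other selected row is an assumption.

```python
import pandas as pd

names = ['Abhishek', 'Amitej', 'Prakhar', 'Bhavya']
Mydf = pd.DataFrame({'Series1': pd.Series([90, 100, 90, 99], index=names),
                     'Series2': pd.Series([80, 90, 85, 99], index=names)})

# 'Ananya' is not a row label of Mydf, so that row is filled with NaN
Selective = pd.DataFrame(Mydf, index=['Prakhar', 'Ananya'],
                         columns=['Series1', 'Series2'])
print(Selective)
```

Because NaN is a floating-point value, the marks in the extracted dataframe are shown as 90.0 and 85.0 rather than 90 and 85.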

Creating a Data Frame using the data_object - from list of dictionaries

import pandas as pd

Mylod=[ {'Name':"Aman",'Sub1':90,'Sub2':95,'Elec1':96,'Elec2':98 },

{'Name':"Naman",'Sub1':80,'Sub2':85,'Elec1':82,'Elec2':99 },

{'Name':"Celin",'Sub1':70,'Sub2':86,'Elec1':96,'Elec2':98}

]

dFrameVII = pd.DataFrame(Mylod)

dFrameVII.index=[1,2,3]

dFrameVII.index.name="Roll No"

dFrameVII

Method 1 pandas.DataFrame(data_object ) – is the method which is used create a 2-D table from the specified data_object as an argument and the size of this table is mutable.

In order to create a dataframe you need to

Step 1 🡪 Call the pandas library (by import statement)

as DataFrame is a class defined within the pandas library.

import pandas OR import pandas as <alias_name>

Step 2 🡪 Define a list of dictionaries

Ooops!!!! I don’t have a supply of any list of dictionaries

Step 3 🡪 Create the dataframe by calling the DataFrame( ) constructor of pandas and assign to a variable

import pandas

Mydf1 = pandas.DataFrame( ) # Calling the DataFrame( ) constructor directly through the library name

OR

import pandas as pd

Mydf1 = pd.DataFrame( ) # Calling the DataFrame( ) constructor through the alias pd

Step 4 🡪 Display the dataframe

import pandas

Mydf1 = pandas.DataFrame( )

print(Mydf1)

In the above code no list of dictionaries is passed to the constructor, so the output is empty: the dataframe has no values, no index values and no columns.

Now when the values of dictionaries are given

Dictionary 1 - { 'Abhisek' : 99, 'Amitej' : 98, 'Aman' : 99 }

Dictionary 2 - { 'Abhisek' : 96, 'Amitej' : 99, 'Aman' : 79 }

Then the list of the given dictionaries can be created by just enclosing these dictionaries in square brackets (as a list)

data_object = [ { dictionary 1}, {dictionary 2} , {.......... }, {........} ]

listofdict_variable = [ { 'Abhisek' : 99, 'Amitej' : 98, 'Aman' : 99 }, { 'Abhisek' : 96, 'Amitej' : 99, 'Aman' : 79 } ]

Eg - 2 Code to create and display a dataframe from the given list of dictionaries -
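A minimal sketch of such a program, using the two dictionaries given above with the keys written without surrounding spaces:

```python
import pandas as pd

listofdict_variable = [{'Abhisek': 99, 'Amitej': 98, 'Aman': 99},
                       {'Abhisek': 96, 'Amitej': 99, 'Aman': 79}]

# Each dictionary becomes one row; the keys become the column headings
Mydf = pd.DataFrame(listofdict_variable)
print(Mydf)
```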

Eg - 3 Code to create and display a dataframe from the given list of dictionaries without the print statement

Eg - 4 Code to create a dataframe from list of dictionaries and display it with both the statements -

When print( ) is the second statement

In the above code the final output is in the format of print( )

When print( ) is the first statement

Eg - 5 Code to create and display a dataframe from a list of dictionaries where the column headings do not match in all the dictionaries.

In the above example the list is created from 4 dictionaries where each dictionary has its own set of keys (no two dictionaries share a key). So in the output, every column whose heading is not a key of the respective row's dictionary shows the default value NaN (Not a Number).
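The behaviour can be reproduced with a sketch like the one below. The four dictionaries are made up for illustration and are not the original example's data.

```python
import pandas as pd

# Hypothetical data: four dictionaries with no key in common
lod = [{'Abhishek': 99}, {'Amitej': 98}, {'Aman': 79}, {'Bhavya': 85}]
Mydf = pd.DataFrame(lod)

print(Mydf)
# Four rows and four columns, but only one real value per row; the rest is NaN
print(int(Mydf.notna().sum().sum()))   # 4
```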

Eg - 6 Code to create and display a dataframe from a list of dictionaries with the changed row index values.

In the above example the default row index values have been changed to subject names using the index argument of the DataFrame( ) constructor.
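A sketch of changing the row index in this way follows. The subject names 'Maths' and 'Physics' are assumptions chosen for illustration.

```python
import pandas as pd

lod = [{'Abhisek': 99, 'Amitej': 98, 'Aman': 99},
       {'Abhisek': 96, 'Amitej': 99, 'Aman': 79}]

# One label per dictionary: the labels replace the default index 0, 1
Mydf = pd.DataFrame(lod, index=['Maths', 'Physics'])
print(Mydf)
```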

Eg 7 - Code to create and display a new dataframe from a list of dictionaries with changed row index values, while also supplying an extra row label for which no dictionary exists.

In the above code the list is created from two dictionaries.

So in the dataframe there are two rows with their default row index values 0 and 1.

But when a new dataframe is created from the existing list of dictionaries and the row index values are supplied through the argument index = [ row value1 , row value2 , .... ], three row index values are actually supplied.

So the original data has 2 rows whereas the new dataframe is supplied with 3 row index values, and the program gives a

ValueError: Shape of passed values is (2, 3), indices imply (3, 3)

2 rows and 3 columns are the actual dimension of the original data object (list of dictionaries)

3 rows and 3 columns are the dimensions implied by the index argument for the new dataframe.

The dimensions do not match. So, the output is a ValueError.
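The mismatch can be reproduced with a sketch like this one. The subject names are hypothetical, and the exact wording of the error message may differ between pandas versions.

```python
import pandas as pd

lod = [{'Abhisek': 99, 'Amitej': 98, 'Aman': 99},
       {'Abhisek': 96, 'Amitej': 99, 'Aman': 79}]

# Two dictionaries (rows) but three index labels: pandas refuses to build it
try:
    pd.DataFrame(lod, index=['Maths', 'Physics', 'Chemistry'])
    raised = False
except ValueError as err:
    raised = True
    print("ValueError:", err)
```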

Method 2 pandas.DataFrame.from_dict(data_object , orient ) – is the method of DataFrame construct and not of pandas and is used to create a 2-D table from the specified data_object of type list of dictionaries.

orient is the argument which is supplied with either of the two values 'columns' or 'index'

The steps to create a dataframe from a list of dictionaries using the method / function pandas.DataFrame.from_dict( ) is same as of pandas.DataFrame( ).

Eg - 1 Code to create a dataframe using the from_dict( ) with the given values

In the above code the dataframe is created from a list of 2 dictionaries with the from_dict( ) method of DataFrame and then displayed.

And the dataframe which is created here with the method from_dict( ) is exactly the same in structure as when the dataframe was created with pandas.DataFrame( )
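The claim that both routes give the same dataframe can be checked with a short sketch. The two dictionaries are the ones used earlier in this blog.

```python
import pandas as pd

lod = [{'Abhisek': 99, 'Amitej': 98, 'Aman': 99},
       {'Abhisek': 96, 'Amitej': 99, 'Aman': 79}]

df_constructor = pd.DataFrame(lod)
df_from_dict = pd.DataFrame.from_dict(lod)

print(df_from_dict)
# equals( ) compares shape, labels and values of the two dataframes
print(df_from_dict.equals(df_constructor))   # True
```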

Eg - 2 Code to create and display a dataframe using the from_dict( ) with the given values where the columns (keys) are not matching in each row (dictionary)

In the above code the dataframe is created with the data object of type list of dictionaries and this list does not have matching keys in all the dictionaries. So, the unmatched keys (columns) have been shown with the default value NaN.

Sample reference program

Before we start with the argument orient let us once again have a look over the output of the dataframe created from a list of 3 dictionaries through from_dict( ) as in the above program.

Eg - 3 Code to create and display a dataframe created from a list of 3 dictionaries using the from_dict( ) with the argument orient and its value as columns

So, in the above code the argument orient is used and is assigned the value 'columns', which gives the same output as the previous code (Eg 2) without the argument orient.

This means the default value of orient is 'columns', and it also means that the dataframe made from the list of dictionaries is oriented key wise: the keys (Abhishek, Amitej, ...) become the columns.

Eg - 4 Code to create and display a dataframe created from a list of dictionaries using the from_dict( ) with the argument orient value as index

When the dataframe is created from a list of dictionaries with orient='index' (row wise), the program returns an AttributeError. With this orientation from_dict( ) expects a dictionary and reads its contents through the dictionary's values( ) method; a list has no such method, so the data cannot be picked row wise.
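Both orient values can be tried in one sketch. The AttributeError is the behaviour reported above; a different pandas version might report the problem differently.

```python
import pandas as pd

lod = [{'Abhisek': 99, 'Amitej': 98, 'Aman': 99},
       {'Abhisek': 96, 'Amitej': 99, 'Aman': 79}]

# orient='columns' is the default, so both calls build the same dataframe
df_default = pd.DataFrame.from_dict(lod)
df_columns = pd.DataFrame.from_dict(lod, orient='columns')
print(df_default.equals(df_columns))   # True

# orient='index' expects a dictionary; a list has no values( ) method
try:
    pd.DataFrame.from_dict(lod, orient='index')
    error_name = None
except AttributeError as err:
    error_name = type(err).__name__
    print(error_name, ":", err)
```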

Display of the dataframe created from a list of dictionaries -

* with print( ) wrapped directly around the creation statement, the dataframe is created and displayed but not stored anywhere

* without print( ), as a direct statement, the dataframe is created and is displayed in a tabular structure

* with print( ) as the first statement, the dataframe is shown in the normal text format by print( ) and then in a tabular structure by the direct statement of conversion / creation.

* with print( ) as the second statement, only the output of print( ) is shown: the value of a direct statement is displayed only when it is the last statement, and that is why the tabular structure is not shown.

Method 3 pandas.DataFrame.from_records(data_object, index, columns ) – is the method of DataFrame construct and not of pandas and is used to create a 2-D table from the specified data_object of type list of dictionaries.

index argument can be used to specify the alternative index value/ row name as a list of an array

columns argument is used to specify the columns or keys which should be extracted to show the dataframe. If such a column name is passed which is not a matching key in the dictionaries or does not exist in the dictionary then the values of such keys will be shown as NaN.

The creation of the dataframe is same as the previous two methods. Every output structure is also the same as of the above two methods.
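A minimal sketch of from_records( ) with the index argument, assuming the same two dictionaries as before and hypothetical subject names as row labels:

```python
import pandas as pd

lod = [{'Abhisek': 99, 'Amitej': 98, 'Aman': 99},
       {'Abhisek': 96, 'Amitej': 99, 'Aman': 79}]

# The index list supplies one row label per dictionary
Mydf = pd.DataFrame.from_records(lod, index=['Maths', 'Physics'])
print(Mydf)
```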

Eg 1 -