DataFrame Blog 1

 Unit 1 – Data Handling I


DataFrames:


1.Creation – from NumPy, list of dictionaries, dictionary of lists, Series and dictionary of Series; Text/CSV files;
2.Display;
3.Iteration;
4.Operations on rows and columns: add, select, delete, rename;
5.Head and Tail functions;
6.Indexing using Labels, Boolean Indexing;
7.Importing/Exporting Data between CSV files and Data Frames.


This blog's objectives are 
*   to explain what a dataframe is in Python pandas: its purpose, usage and applications.
*   to show the various ways in which a dataframe can be created in Python pandas.
*   to understand the ways in which a dataframe can be displayed, and what the dataframe looks like when it is supplied with different types of values.


Data Frame is a 2-dimensional labeled data structure with columns of potentially different data types.

You can think of it like a spreadsheet or SQL table, or a dict of Series objects.

 

  














It is generally the most commonly used pandas object.

Like Series, Data Frame accepts many different kinds of input:

 

*   Dictionary of Series
*   List of dictionaries
*   Text/CSV files
*   Dict of 1-D ndarrays
*   Lists
*   Series
*   2-D numpy.ndarray
*   Another DataFrame

 

Along with the data, you can optionally pass index (row labels) and columns (column labels) arguments to the DataFrame constructor.

If you pass an index and / or columns, you are guaranteeing the index and / or columns of the resulting DataFrame.

 

** When the data is a dict, and columns are not specified, the DataFrame columns will be ordered by the dict’s insertion order.
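A minimal sketch of the two points above: column order when no columns argument is given, and the effect of passing index and columns explicitly. The dictionary `marks` and the labels 101 and 102 are hypothetical values chosen only for illustration.

```python
import pandas as pd

# Hypothetical marks data, used only for illustration
marks = {"Name": ["Aman", "Naman"], "Sub1": [90, 80], "Sub2": [99, 85]}

# No columns argument: the columns follow the dict's insertion order
df_default = pd.DataFrame(marks)
print(list(df_default.columns))

# Explicit index and columns: the result carries exactly these labels
df_labelled = pd.DataFrame(marks, index=[101, 102],
                           columns=["Name", "Sub2", "Sub1"])
print(df_labelled)
```

Passing columns in a different order only rearranges the existing data; it does not change any values.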


When this table is created through the DataFrame(  ) constructor, this is how the dataframe appears:















** By default the row index is added to the table automatically and is displayed to the left of the first column. The row numbering does not start at the column heading; it starts from the first record onwards (with 0).

 

 

1. Creation of a Data Frame &   2. Display of a Data Frame


pandas.DataFrame(data_object ) – is the constructor which is used to create a 2-D table from the data_object passed as an argument. The size of this table is mutable.

 

The steps to create a dataframe using the constructor pandas.DataFrame( ) are

 

Step 1      Call the pandas library (by import statement)

Step 2      Define / Create the data_object

Step 3      Create the dataframe  (by calling the DataFrame( ) constructor of pandas)

Step 4      Perform the Operation on the dataframe
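The four steps can be sketched in a few lines. The data object used here (a small list of lists) and the column names `RollNo` and `Name` are assumptions made for this example; any of the supported data objects could be used in Step 2.

```python
# Step 1: import the pandas library
import pandas as pd

# Step 2: define the data object (hypothetical values)
data_object = [[1, "Aman"], [2, "Naman"]]

# Step 3: create the dataframe by calling the DataFrame( ) constructor
df = pd.DataFrame(data_object, columns=["RollNo", "Name"])

# Step 4: perform an operation on the dataframe, here simply display it
print(df)
```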

 

 

I. Creating an Empty Data Frame     


Step 1     Call the pandas library (by import statement)

DataFrame( ) is a constructor defined within the pandas library, so the library has to be imported first. The code to import the library is

 

 import pandas             OR                   import pandas as <alias>

 

Step 2      Define a dictionary of Series (DoS) / dicts (dictionary variable)

Oops! There is no dictionary of Series to supply for an empty dataframe, so nothing is defined in this step.

 

 

Step 3      Create the dataframe by calling the DataFrame( ) constructor of pandas and assign to a variable

import pandas

# Calling the DataFrame( ) constructor directly through the library name
Mydf1 = pandas.DataFrame( )

 

OR

 

import pandas as pd

# Calling the DataFrame( ) constructor through the alias pd
Mydf1 = pd.DataFrame( )

 

Display a dataframe - print( dataframe_object ) or just dataframe_object (as the last line of a notebook cell)


Step 4     Display the dataframe

import pandas

Mydf1 = pandas.DataFrame( )

print(Mydf1)























OR                Display the dataframe

import pandas

pandas.DataFrame( )
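As a rough check of what an empty dataframe looks like, the sketch below prints it and inspects two of its attributes. The variable name `empty_df` is an assumption for this example. Note that a bare expression such as pandas.DataFrame( ) is displayed only in an interactive session (for example a Jupyter notebook); in a script, print( ) is needed.

```python
import pandas as pd

empty_df = pd.DataFrame()

print(empty_df)         # prints a short "Empty DataFrame" summary
print(empty_df.empty)   # True when the dataframe holds no data
print(empty_df.shape)   # (number of rows, number of columns)
```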













II. Creation of DataFrame from NumPy ndarrays

Eg. 1 --


import numpy as np
import pandas as pd
array1 = np.array(["Aman",90,99])
df1 = pd.DataFrame(array1)
print("1D - DataFrame")
print("DataFrame 1")
print(df1)

array2 = np.array(["Naman",80,85]) 
df2= pd.DataFrame(array2)
print("\n \n DataFrame 2")
print(df2)

array3 = np.array(["Yuvi",75,70])
df3=pd.DataFrame(array3)
print("\n \n DataFrame 3")
df3

O/P - >

2D DataFrame - 

import numpy as np
import pandas as pd
array1 = np.array(["Aman",90,99])
array2 = np.array(["Naman",80,85]) 
array3 = np.array(["Yuvi",75,70])
# Each array becomes one row of the dataframe
df1 = pd.DataFrame([array1, array2, array3])
print("2D - DataFrame 1 \n")
print(df1)

2D - DataFrame 1 

       0   1   2
0   Aman  90  99
1  Naman  80  85
2   Yuvi  75  70


import numpy as np
import pandas as pd
array1 = np.array(["Aman",90,99])
array2 = np.array(["Naman",80,85]) 
array3 = np.array(["Yuvi",75,70])
# np.array( ) has no index argument, so the row labels are passed to DataFrame( )
df1 = pd.DataFrame([array1, array2, array3], index=[1,2,3])
print("2D - DataFrame 1 \n")
print(df1)

III. Creation of DataFrame from List of Dictionaries 

import pandas as pd
listDict = [{'a':10, 'b':20}, {'a':5, 'b':10, 'c':20}]
dFrameListDict = pd.DataFrame(listDict) 
dFrameListDict



IV. Creation of DataFrame from Dictionary of Lists

import pandas as pd 
dictForest = {'State': ['Assam', 'Delhi', 'Kerala'], 
                       'GArea': [78438, 1483, 38852] , 
                        'VDF' : [2797, 6.72,1663] }
dFrameForest= pd.DataFrame(dictForest, columns = ['State','VDF', 'GArea'] )
dFrameForest

V. Creation of DataFrame from Series

import pandas as pd
seriesA = pd.Series([1,2,3,4,5], index = ['a', 'b', 'c', 'd', 'e']) 
seriesB = pd.Series ([1000,2000,-1000,-5000,1000], index = ['a', 'b', 'c', 'd', 'e'])
seriesC = pd.Series([10,20,-10,-50,100], index = ['z', 'y', 'a', 'c', 'e'])
dFrame5 = pd.DataFrame(seriesA)
dFrameV = pd.DataFrame( [ seriesA, seriesC] )
dFrame5
print(dFrameV)


import pandas as pd
seriesA = pd.Series(["Aman",90,95,96,98])
seriesB = pd.Series (["Naman",80,85,82,99])
seriesC = pd.Series (["Celin",70,86,96,98])
dFrame5 = pd.DataFrame(seriesA)
dFrameV = pd.DataFrame( [ seriesA, seriesB, seriesC] )
dFrameV.columns=['Name','Sub1','Sub2','Elec1','Elec2']
dFrameV.index=[1,2,3]
dFrameV.index.name='Roll No'
dFrameV


VI. Creation of a dataframe from Dictionary of Series (DoS)

dos_variable_name= { 'series1': pandas.Series( list_of_values ) , 'series2': pandas.Series( list_of_values ) ...}

Eg - Mydos={ 'Series1': pandas.Series( [10, 20, 30, 40, 50 ] ), 'Series2':pandas.Series( [ 1, 2, 3, 4, 5 ] ) }


Eg 1 --> 


import pandas as pd

Mydos={ 'Name':pd.Series(["Aman", "Naman","Celin"] ), 
        'Sub1':pd.Series( [90,80,70] ),
        'Sub2':pd.Series( [95,85,86] ),
        'Elec1':pd.Series([96,82,96]),
        'Elec2':pd.Series([98,99,98])}

dFrameVI = pd.DataFrame(Mydos)
dFrameVI.index=[1,2,3]
dFrameVI.index.name="Roll No"
dFrameVI



Code to create a dictionary of Series with the given values is - 

Series 1 Values    Series 2 Values
       90                 80
      100                 90
       90                 85
       99                 99

So let's do in steps -->

Step 1 🡪                  import pandas as pd

Step 2 🡪                  MyDoS =  { 'Series1': pd.Series([90, 100, 90, 99]),
                                     'Series2': pd.Series([80, 90, 85, 99])
                                   }

Step 3 🡪                  Mydf=pd.DataFrame(MyDoS)

Step 4 🡪                  print(Mydf)


Let's see the output 
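One way to check what these four steps produce, without depending on the exact printed layout, is to inspect the labels and the shape of the result. This is only a sketch built from the values given above; the checks at the end are an addition for illustration.

```python
import pandas as pd

MyDoS = {"Series1": pd.Series([90, 100, 90, 99]),
         "Series2": pd.Series([80, 90, 85, 99])}
Mydf = pd.DataFrame(MyDoS)
print(Mydf)

# Each dictionary key becomes a column heading
print(list(Mydf.columns))
# No index was given to the Series, so the default row index 0, 1, 2, 3 is used
print(list(Mydf.index))
print(Mydf.shape)
```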



















Eg - 2 Code to create a dataframe from the given dictionary of series (with the given row index value)


Index Address    Series 1 Values    Series 2 Values
Abhishek                90                 80
Amitej                 100                 90
Prakhar                 90                 85
Bhavya                  99                 99


Use the index argument of pandas.Series( ) to label the rows with the desired values instead of the default index values 0, 1, 2, ...

Step 1 🡪         <'Series_object_variable'>: pd.Series([90, 100, 90, 99], index=['Abhishek', 'Amitej', 'Prakhar', 'Bhavya'])


The above entry defines only a single Series, whereas we have to create a dictionary of Series, so a second entry is needed:


                        <'Series_object_variable'>: pd.Series([80, 90, 85, 99], index=['Abhishek', 'Amitej', 'Prakhar', 'Bhavya'])

Step 2 🡪         DictofSeries={ 'Series1': pd.Series([90, 100, 90, 99], index=['Abhishek', 'Amitej', 'Prakhar', 'Bhavya']),
                                 'Series2': pd.Series([80, 90, 85, 99], index=['Abhishek', 'Amitej', 'Prakhar', 'Bhavya'])
                               }



After creating the dictionary of Series --> Create the dataframe

Step 3 🡪         Create the dataframe from DoS !!!!

                                                                               

                        Mydf=pandas.DataFrame(DictofSeries)   # Using DataFrame( ) constructor of pandas                                                          

        

After creating the dataframe from DoS 


Step 4 🡪           Display the dataframe

                          print( "The dataframe created from DoS" , Mydf)



# The program to create a dataframe from the given dictionary of series and to display is -


import pandas as pd

DictofSeries= { 'Series1': pd.Series([90, 100, 90, 99], index=['Abhishek', 'Amitej', 'Prakhar', 'Bhavya']),
                'Series2': pd.Series([80, 90, 85, 99], index=['Abhishek', 'Amitej', 'Prakhar', 'Bhavya'])
              }

Mydf=pd.DataFrame(DictofSeries)

print("\nDataframe from Dictionary of Series \n", Mydf)






















What comes after creating a dataframe?

You can perform certain operations on it like – Selection, Display, Iteration, Mathematical and statistical operations etc.


The row and column labels can be accessed through the index and columns attributes of the dataframe, respectively.
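A short sketch of reading the row and column labels back from a dataframe, reusing the dictionary of Series from the earlier example. The helper names `names`, `row_labels` and `col_labels` are assumptions introduced only for this illustration.

```python
import pandas as pd

names = ["Abhishek", "Amitej", "Prakhar", "Bhavya"]
Mydf = pd.DataFrame({
    "Series1": pd.Series([90, 100, 90, 99], index=names),
    "Series2": pd.Series([80, 90, 85, 99], index=names),
})

print(Mydf.index)     # the row labels
print(Mydf.columns)   # the column labels

# Both can be converted to plain Python lists when needed
row_labels = list(Mydf.index)
col_labels = list(Mydf.columns)
```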




Eg 3 - Code to extract from a dataframe row wise or column wise 


import pandas as pd

DictofSeries={ 'Series1': pd.Series([90, 100, 90, 99], index=['Abhishek', 'Amitej', 'Prakhar', 'Bhavya']),
               'Series2': pd.Series([80, 90, 85, 99],  index=['Abhishek', 'Amitej', 'Prakhar', 'Bhavya'])
             }

Mydf=pd.DataFrame(DictofSeries)

print("\nDataframe from Dictionary of Series \n",Mydf)

Selective=pd.DataFrame(Mydf, index=['Prakhar', 'Bhavya', 'Amitej'], columns=['Series1', 'Series2'])

print("The new subset of dataframe by index and column is \n", Selective)



** The values passed to the index and columns arguments should be the same as the labels used in the dictionary of Series.










Eg - 4   Code to extract a data / subset from a dataframe where a column name does not match / exist


In the above code, the dictionary of Series used to create the dataframe has two columns, Series1 and Series2.

Now, when the data from the dataframe is filtered or extracted using the index and columns arguments, the columns argument is given three values: 

1. Series1  (which exists in the dataframe)

2. Series2 (which exists in the dataframe)

3. Term1 (which does not exist in the dataframe)

When a column is requested which does not actually exist in the main dataframe, the column is displayed with the new column name as its heading and all of its values are NaN.

NaN (Not a Number) is the default value pandas uses for missing data. It appears as the cell value wherever a requested row or column label does not exist in the source, or where no value has been assigned.
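The case described here can be sketched as follows, using the same dictionary of Series as before and asking for the extra column 'Term1'. The helper name `names` is an assumption for this example.

```python
import pandas as pd

names = ["Abhishek", "Amitej", "Prakhar", "Bhavya"]
Mydf = pd.DataFrame({
    "Series1": pd.Series([90, 100, 90, 99], index=names),
    "Series2": pd.Series([80, 90, 85, 99], index=names),
})

# 'Term1' is not a column of Mydf, so it comes back filled with NaN
Selective = pd.DataFrame(Mydf, index=["Prakhar", "Bhavya", "Amitej"],
                         columns=["Series1", "Series2", "Term1"])
print(Selective)
```

The existing columns keep their values for the selected rows; only the unknown column is filled with NaN.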


Eg - 5 Code to extract a data / subset from a dataframe where a row index does not match / exist













In the above code the row index 'Ananya' does not exist in the original dataframe. So in the extracted dataframe the column values for the non-matching row index 'Ananya' are shown with the default value NaN.
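A minimal sketch of this situation, assuming the same dictionary of Series as in the earlier examples and a selection that asks for the rows 'Abhishek' and 'Ananya' (the exact rows chosen are an assumption).

```python
import pandas as pd

names = ["Abhishek", "Amitej", "Prakhar", "Bhavya"]
Mydf = pd.DataFrame({
    "Series1": pd.Series([90, 100, 90, 99], index=names),
    "Series2": pd.Series([80, 90, 85, 99], index=names),
})

# 'Ananya' is not a row label of Mydf, so her whole row is NaN
Selective = pd.DataFrame(Mydf, index=["Abhishek", "Ananya"],
                         columns=["Series1", "Series2"])
print(Selective)

# isna( ) marks the missing cells with True
print(Selective.isna())
```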


Creating a Data Frame using the data_object - from list of dictionaries

Mylod=[  {'Name':"Aman",'Sub1':90,'Sub2':95,'Elec1':96,'Elec2':98 },

                {'Name':"Naman",'Sub1':80,'Sub2':85,'Elec1':82,'Elec2':99 },

                {'Name':"Celin",'Sub1':70,'Sub2':86,'Elec1':96,'Elec2':98}

             ]

dFrameVII = pd.DataFrame(Mylod)

dFrameVII.index=[1,2,3]

dFrameVII.index.name="Roll No"

dFrameVII

Method 1 pandas.DataFrame(data_object ) – is the constructor which is used to create a 2-D table from the data_object passed as an argument. The size of this table is mutable.


In order to create a dataframe you need to

  Step 1 🡪    Import the pandas library   (by import statement)

as DataFrame is a class defined within the pandas library.

  import pandas  OR  import pandas as <alias>

  Step 2 🡪  Define a list of dictionaries

Oops! There is no list of dictionaries to supply yet, so nothing is defined in this step.

  Step 3 🡪 Create the dataframe by calling the DataFrame( ) constructor of pandas and assign it to a variable

           import pandas

           # Calling the DataFrame( ) constructor directly through the library name
           Mydf1 = pandas.DataFrame( )

      OR

           import pandas as pd

           # Calling the DataFrame( ) constructor through the alias pd
           Mydf1 = pd.DataFrame( )

   Step 4 🡪 Display the dataframe

          import pandas

          Mydf1 = pandas.DataFrame( )

          print(Mydf1)



In the above code no list of dictionaries is passed to the constructor, so the output is empty: the dataframe has no data, no index values and no columns.

Now suppose the following dictionaries are given 

Dictionary 1 - { 'Abhisek' : 99, 'Amitej' : 98, 'Aman' : 99 }

Dictionary 2 - { 'Abhisek' : 96, 'Amitej' : 99, 'Aman' : 79 }

Then the list of dictionaries can be created by enclosing these dictionaries in square brackets (as a list) 

data_object = [ { dictionary 1 }, { dictionary 2 }, {.......... }, {........} ]

listofdict_variable = [ { 'Abhisek' : 99, 'Amitej' : 98, 'Aman' : 99 }, { 'Abhisek' : 96, 'Amitej' : 99, 'Aman' : 79 } ]


Eg - 2 Code to create and display a dataframe from the given list of dictionaries - 
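A minimal sketch of such a program, assuming the two dictionaries given earlier with the keys written without surrounding spaces:

```python
import pandas as pd

listofdict_variable = [{"Abhisek": 99, "Amitej": 98, "Aman": 99},
                       {"Abhisek": 96, "Amitej": 99, "Aman": 79}]

# Each dictionary becomes one row; the keys become the column headings
Mydf = pd.DataFrame(listofdict_variable)
print(Mydf)
```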











Eg - 3 Code to create and display a dataframe from the given list of dictionaries without the print statement











Eg - 4 Code to create a dataframe from list of dictionaries and display it with both the statements - 

When print( ) is the second statement 








In the above code the final output is in the format of print( )

When print( ) is the first statement 











Eg - 5 Code to create and display a dataframe from a list of dictionaries where the column headings do not match in all the dictionaries.











In the above example the list is made of 4 dictionaries where each dictionary has its own set of keys (none of the dictionary keys are identical to each other). So in the output, every column heading that has no matching key in a given row shows the default value NaN (Not a Number) for that row.
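The behaviour can be reproduced with a small sketch. The four dictionaries, their keys and their marks are hypothetical values picked for illustration; the only point is that no key is shared between them.

```python
import pandas as pd

# Hypothetical data: four dictionaries with no key in common
lod = [{"Abhishek": 99}, {"Amitej": 98}, {"Aman": 79}, {"Bhavya": 88}]

df = pd.DataFrame(lod)
print(df)

# Count the missing (NaN) cells in each column
print(df.isna().sum())
```

Every column holds one real value and three NaN cells, because each key appears in only one of the four rows.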

Eg - 6  Code to create and display a dataframe from a list of dictionaries with the changed row index values.

In the above example the default row index values have now been changed to subject names using the index argument of the DataFrame( ) constructor.
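A sketch of this idea, assuming the two dictionaries used earlier and two hypothetical subject names ('Maths' and 'Science') as row labels:

```python
import pandas as pd

lod = [{"Abhisek": 99, "Amitej": 98, "Aman": 99},
       {"Abhisek": 96, "Amitej": 99, "Aman": 79}]

# One label per dictionary; the subject names are hypothetical
df = pd.DataFrame(lod, index=["Maths", "Science"])
print(df)

# A single cell can now be reached by its row and column label
print(df.loc["Science", "Aman"])
```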

Eg 7 - Code to create and display a new dataframe from a list of dictionaries with changed row index values, while also trying to show an extra row which does not actually exist in the data.



In the above code the list is made of two dictionaries, so the dataframe built from it has two rows, with the default row index values 0 and 1.

But when a new dataframe is created from the same list of dictionaries and the row index is supplied through the argument index = [ row value1 , row value2 , .... ], three row index values are passed. The data has only 2 rows whereas 3 row labels are supplied, so this program gives a 

ValueError: Shape of passed values is (2, 3), indices imply (3, 3)

(2, 3): 2 rows and 3 columns are the actual dimensions of the original data object (list of dictionaries).

(3, 3): 3 rows and 3 columns are the dimensions implied by the index and columns of the new dataframe.

The dimensions do not match, so the output is a ValueError.
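The mismatch can be sketched as below, with the error caught so that the program keeps running. The three subject names are hypothetical, and the exact wording of the error message may differ between pandas versions.

```python
import pandas as pd

lod = [{"Abhisek": 99, "Amitej": 98, "Aman": 99},
       {"Abhisek": 96, "Amitej": 99, "Aman": 79}]

message = None
try:
    # Two rows of data but three row labels: the lengths do not match
    pd.DataFrame(lod, index=["Maths", "Science", "English"])
except ValueError as err:
    message = str(err)
    print("ValueError:", message)
```

Supplying exactly one label per dictionary avoids the error.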

Method 2 pandas.DataFrame.from_dict(data_object , orient ) – is a method of the DataFrame class (not a top-level pandas function) and is used here to create a 2-D table from a data_object of type list of dictionaries. 

orient is the argument which is supplied with either of the two values 'columns' or 'index'; the default is 'columns'.

The steps to create a dataframe from a list of dictionaries using pandas.DataFrame.from_dict( ) are the same as for pandas.DataFrame( ).

Eg  - 1 Code to create a dataframe using the from_dict( ) with the given values 

In the above code the dataframe is created from a list of 2 dictionaries and displayed using the from_dict( ) method of the DataFrame class.

And the dataframe which is created here with the method from_dict( )  is exactly the same in structure as when the dataframe was created with pandas.DataFrame( )


Eg - 2  Code to create and display a dataframe using the from_dict( ) with the given values where the columns (keys) are not matching in each row (dictionary)

In the above code the dataframe is created with the data object of type list of dictionaries and this list does not have matching keys in all the dictionaries. So, the unmatched keys (columns) have been shown with the default value NaN.

Sample reference program












Before we start with the argument orient let us once again have a look over the output of the dataframe created from a list of 3 dictionaries through from_dict( ) as in the above program.

Eg - 3  Code to create and display a dataframe created from a list of 3 dictionaries using the from_dict( ) with the argument orient and its value as columns









So, in the above code the argument orient is used and is assigned the value 'columns', which gives the same output as the previous code in Eg 2 without the argument orient. 

This means that the default value of orient is 'columns', and also that the dataframe made from the list of dictionaries is oriented key wise (column wise): the keys (Abhishek, Amitej, ...) become the column headings.
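A sketch comparing the three calls, assuming the two dictionaries used earlier. Note that from_dict( ) is documented for dictionaries; with orient='columns' it hands the data on to the DataFrame( ) constructor, which is why a list of dictionaries is accepted here.

```python
import pandas as pd

lod = [{"Abhisek": 99, "Amitej": 98, "Aman": 99},
       {"Abhisek": 96, "Amitej": 99, "Aman": 79}]

df_plain = pd.DataFrame(lod)                                  # constructor
df_default = pd.DataFrame.from_dict(lod)                      # orient not given
df_columns = pd.DataFrame.from_dict(lod, orient="columns")    # orient given

# equals( ) is True when two dataframes have the same labels and values
print(df_default.equals(df_columns))
print(df_plain.equals(df_default))
```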

Eg - 4  Code to create and display a dataframe created from a list of dictionaries using the from_dict( ) with the argument orient value as index











When the dataframe is created from a list of dictionaries and is oriented row wise (orient = 'index'), the program raises an AttributeError, because from_dict( ) with orient = 'index' treats the data as a dictionary and calls dictionary methods on it, which a list does not have.
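The sketch below reproduces the error and, for contrast, shows orient='index' working on a dictionary of dictionaries. The outer keys 'Test1' and 'Test2' are hypothetical row labels added for this example.

```python
import pandas as pd

lod = [{"Abhisek": 99, "Amitej": 98, "Aman": 99},
       {"Abhisek": 96, "Amitej": 99, "Aman": 79}]

error_message = None
try:
    # A list has no dictionary methods, so this call fails
    pd.DataFrame.from_dict(lod, orient="index")
except AttributeError as err:
    error_message = str(err)
    print("AttributeError:", error_message)

# With a dictionary of dictionaries the outer keys become the row labels
dod = {"Test1": lod[0], "Test2": lod[1]}
df_index = pd.DataFrame.from_dict(dod, orient="index")
print(df_index)
```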

Display of the dataframe created from a list of dictionaries  - 

* with print( ) the dataframe is created, not stored anywhere, and then displayed as plain text 



* without print( ) the dataframe is created and is displayed in a tabular structure









* with print( ) as the first statement, the dataframe is first shown in the plain print( ) format, and the direct creation statement that follows shows it in a tabular structure.










* with print( ) as the second statement, only the print( ) output is displayed, so the tabular structure from the direct statement is not shown.







Method 3 pandas.DataFrame.from_records(data_object, index, columns  ) – is a method of the DataFrame class (not a top-level pandas function) and is used to create a 2-D table from the specified data_object of type list of dictionaries.

The index argument can be used to specify alternative row index values (row names) as a list or an array.

The columns argument is used to specify the columns (keys) which should be extracted into the dataframe. If a column name is passed which does not exist as a key in the dictionaries, then the values of that column will be shown as NaN.

The creation of the dataframe is the same as with the previous two methods, and the output structure is also the same.
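A minimal sketch of from_records( ) with both arguments, assuming the two dictionaries used earlier. The row labels 'Maths' and 'Science' and the extra column 'Bhavya' are hypothetical; 'Bhavya' is deliberately not a key in either dictionary.

```python
import pandas as pd

lod = [{"Abhisek": 99, "Amitej": 98, "Aman": 99},
       {"Abhisek": 96, "Amitej": 99, "Aman": 79}]

# Keep two existing keys and ask for one key that does not exist
df = pd.DataFrame.from_records(lod,
                               index=["Maths", "Science"],
                               columns=["Abhisek", "Aman", "Bhavya"])
print(df)
```

The key 'Amitej' is left out because it is not listed in columns, while 'Bhavya' appears as a column of NaN values.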

Eg 1 - 










Eg 2 -















Eg 3 - 




Eg 4 - 










