<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom">
  <channel>
    <title>Reticulate on Laminar Insight</title>
    <link>https://laminarinsight.com/tags/reticulate/</link>
    <description>Recent content in Reticulate on Laminar Insight</description>
    <generator>Hugo -- gohugo.io</generator>
    <language>en-us</language>
    <lastBuildDate>Tue, 12 Jan 2021 00:00:00 +0000</lastBuildDate>
    
        <atom:link href="https://laminarinsight.com/tags/reticulate/index.xml" rel="self" type="application/rss+xml" />
    
    
    <item>
      <title>Who&#39;s Missing?</title>
      <link>https://laminarinsight.com/post/who-s-missing/</link>
      <pubDate>Tue, 12 Jan 2021 00:00:00 +0000</pubDate>
      
      <guid>https://laminarinsight.com/post/who-s-missing/</guid>
      <description>
&lt;script src=&#34;https://laminarinsight.com/post/who-s-missing/index_files/header-attrs/header-attrs.js&#34;&gt;&lt;/script&gt;
&lt;script src=&#34;https://laminarinsight.com/post/who-s-missing/index_files/kePrint/kePrint.js&#34;&gt;&lt;/script&gt;
&lt;link href=&#34;https://laminarinsight.com/post/who-s-missing/index_files/lightable/lightable.css&#34; rel=&#34;stylesheet&#34; /&gt;

&lt;div id=&#34;TOC&#34;&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&#34;#motivation&#34;&gt;Motivation&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#preparation&#34;&gt;Preparation&lt;/a&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&#34;#tools-and-libraries&#34;&gt;Tools and Libraries&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#python-function&#34;&gt;Python Function&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#demo&#34;&gt;Demo&lt;/a&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&#34;#data&#34;&gt;Data&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#application&#34;&gt;Application&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#what-did-we-improve&#34;&gt;What Did We Improve?&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/div&gt;

&lt;p&gt;Finding out what’s missing in the data is one of those grinding data wrangling tasks of data science. Missing values are easy to overlook, but if they are not pointed out and handled properly they can come back to haunt you in the long run!&lt;/p&gt;
&lt;p&gt;While working on a project in Python, I found that finding missing data is easy but presenting it nicely in a Notebook can be messy. This post is my effort to make missing data exploration a bit easier and more useful.&lt;/p&gt;
&lt;div id=&#34;motivation&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Motivation&lt;/h2&gt;
&lt;p&gt;In Python we can use the simple one-liner &lt;code&gt;data.isnull().sum()&lt;/code&gt; to list the missing data in a data set. The problem is that it returns a list of all the feature names, whether they have any missing values or not. This can look very messy in a Notebook when the data set in question is a large one with lots of features.&lt;/p&gt;
&lt;p&gt;To overcome that, in this tutorial we will write a simple utility function that calculates missing values for only the features with missing observations and stores the result in a data frame, which can later be used for reporting or visualization.&lt;/p&gt;
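&lt;p&gt;As a minimal sketch of the idea, assuming a small made-up data frame (the names &lt;code&gt;data&lt;/code&gt;, &lt;code&gt;missing&lt;/code&gt; and &lt;code&gt;missing_only&lt;/code&gt; and all the values are illustrative only), the per-column counts can be filtered so that only the columns with missing values remain:&lt;/p&gt;

```python
import numpy as np
import pandas as pd

# hypothetical data: only column "b" contains missing values
data = pd.DataFrame({
    "a": [1, 2, 3, 4],
    "b": [1.0, np.nan, np.nan, 4.0],
    "c": ["w", "x", "y", "z"],
})

# isnull().sum() reports every column, including the complete ones
missing = data.isnull().sum()

# keep only the columns with at least one missing value
missing_only = missing[missing > 0]
print(missing_only)
```

&lt;p&gt;This already trims the output, but it returns a pandas Series with counts only, which is why a small function that also reports percentages in a data frame can still be useful.&lt;/p&gt;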
&lt;/div&gt;
&lt;div id=&#34;preparation&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Preparation&lt;/h2&gt;
&lt;div id=&#34;tools-and-libraries&#34; class=&#34;section level3&#34;&gt;
&lt;h3&gt;Tools and Libraries&lt;/h3&gt;
&lt;p&gt;In this tutorial I will be using RStudio as the IDE, so I will use the R package &lt;a href=&#34;https://rstudio.github.io/reticulate/&#34;&gt;Reticulate&lt;/a&gt; to run Python code.&lt;/p&gt;
&lt;p&gt;I will be using a &lt;em&gt;Miniconda&lt;/em&gt; virtual environment, &lt;em&gt;curious-joe&lt;/em&gt;, as the Python environment in the back end. To reproduce this tutorial, you may want to create your own virtual environment in Conda and pass its name to the &lt;code&gt;reticulate::use_condaenv()&lt;/code&gt; function. To learn more about creating and managing Conda environments, you can visit this &lt;a href=&#34;https://docs.conda.io/projects/conda/en/latest/user-guide/tasks/manage-environments.html&#34;&gt;document&lt;/a&gt;.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# loading libraries

library(reticulate)
library(dplyr)
library(kableExtra)
library(knitr)

# setting up virtual python environment
reticulate::use_condaenv(&amp;quot;curious-joe&amp;quot;)&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;div id=&#34;python-function&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Python Function&lt;/h2&gt;
&lt;p&gt;The function we will see depends on the Python library &lt;em&gt;Pandas&lt;/em&gt;, a library commonly used for data wrangling and analysis.&lt;/p&gt;
&lt;p&gt;The function is simple enough to understand as you read through it. But for people brand new to Python programming, here is a breakdown of how the logic flows inside the function:&lt;/p&gt;
&lt;ol style=&#34;list-style-type: decimal&#34;&gt;
&lt;li&gt;Creates a list of strings, &lt;em&gt;colNames&lt;/em&gt;, to store the column names,&lt;/li&gt;
&lt;li&gt;Creates an empty data frame, &lt;em&gt;df&lt;/em&gt;, with the values from &lt;em&gt;colNames&lt;/em&gt; as its column names,&lt;/li&gt;
&lt;li&gt;Runs a for loop over each column of the input data frame and performs the following tasks:&lt;/li&gt;
&lt;/ol&gt;
&lt;ul&gt;
&lt;li&gt;Calculates the percentage of missing values in the column and saves it in an object called &lt;em&gt;p&lt;/em&gt;,&lt;/li&gt;
&lt;li&gt;Calculates the total count of missing values in the column and saves it in an object called &lt;em&gt;q&lt;/em&gt;,&lt;/li&gt;
&lt;li&gt;Checks whether &lt;em&gt;p&lt;/em&gt;, the percentage of missing values, is larger than zero and, if it is, adds a row to &lt;em&gt;df&lt;/em&gt; with the column name and its count and percentage of missing values.&lt;/li&gt;
&lt;/ul&gt;
&lt;ol start=&#34;4&#34; style=&#34;list-style-type: decimal&#34;&gt;
&lt;li&gt;Once the loop has finished, sorts &lt;em&gt;df&lt;/em&gt; in descending order of the percentage of missing values,&lt;/li&gt;
&lt;li&gt;Returns &lt;em&gt;df&lt;/em&gt;, the data frame with the names, missing counts and missing percentages of the features with missing values.&lt;/li&gt;
&lt;/ol&gt;
&lt;pre class=&#34;python&#34;&gt;&lt;code&gt;# Python library
import pandas as pd

# @ countMissing
# Returns the columns of the specified data set that contain missing values,
# along with the count and percentage of missing values in each of them
# @param dataFrame A pandas data frame object

def countMissing(dataFrame):
    colNames = [&amp;#39;Featuers&amp;#39;, &amp;#39;Missing_Value&amp;#39;, &amp;#39;Percentage_Missing&amp;#39;]
    # empty data frame that will hold one row per column with missing values
    df = pd.DataFrame(columns = colNames)
    for i in dataFrame.columns:
        # percentage of missing values in column i
        p = round((dataFrame[i].isnull().sum()/dataFrame.shape[0]) * 100, 2)
        # count of missing values in column i
        q = round(dataFrame[i].isnull().sum(), 0)
        # keeping only the columns that have missing values
        if p &amp;gt; 0:
            df.loc[len(df)] = [i, q, p]
    # sorting the result by percentage of missing values, largest first
    df = df.sort_values([&amp;#39;Percentage_Missing&amp;#39;], ascending = False).reset_index(drop=True)
    return df&lt;/code&gt;&lt;/pre&gt;
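&lt;p&gt;For comparison, here is a hedged sketch of how the same summary could be built without an explicit loop. This is not the function used in the rest of this post: the name &lt;code&gt;count_missing_vectorized&lt;/code&gt;, the column name &lt;code&gt;Feature&lt;/code&gt; and the example data are assumptions made for illustration.&lt;/p&gt;

```python
import numpy as np
import pandas as pd

def count_missing_vectorized(data_frame):
    # count of missing values per column
    counts = data_frame.isnull().sum()
    # share of missing values per column, as a percentage
    percentages = (data_frame.isnull().mean() * 100).round(2)
    result = pd.DataFrame({
        "Feature": counts.index,
        "Missing_Value": counts.to_numpy(),
        "Percentage_Missing": percentages.to_numpy(),
    })
    # keep only the columns that have missing values, largest share first
    result = result[result["Missing_Value"] > 0]
    return result.sort_values("Percentage_Missing", ascending=False).reset_index(drop=True)

# hypothetical example data
example = pd.DataFrame({
    "x": [1.0, np.nan, 3.0, 4.0],
    "y": [np.nan, np.nan, np.nan, 4.0],
    "z": [1, 2, 3, 4],
})
summary = count_missing_vectorized(example)
print(summary)
```

&lt;p&gt;One design difference in this sketch is that it filters on the count rather than on the rounded percentage, so a column with a very small share of missing values is still reported.&lt;/p&gt;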
&lt;/div&gt;
&lt;div id=&#34;demo&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Demo&lt;/h2&gt;
&lt;div id=&#34;data&#34; class=&#34;section level3&#34;&gt;
&lt;h3&gt;Data&lt;/h3&gt;
&lt;p&gt;To demonstrate how the function works, I will use the &lt;em&gt;iris&lt;/em&gt; data set and introduce some &lt;em&gt;NA&lt;/em&gt; values (missing values in R’s language) into the data.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# preparing data
data &amp;lt;- iris
data = data %&amp;gt;% mutate(Sepal.Width = ifelse(Sepal.Length &amp;gt;7, NA, Sepal.Width))
data = data %&amp;gt;% mutate(Sepal.Length = ifelse(Sepal.Length &amp;gt;7, NA, Sepal.Length))&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;In the code above we have removed the values of the Sepal.Width and Sepal.Length features wherever the Sepal.Length value is larger than 7. This results in 12 rows with missing values, each missing both features, for 24 missing values in total.&lt;/p&gt;
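&lt;p&gt;For readers working purely in Python, the same masking step could be sketched with pandas. The snippet below assumes a small hand-made sample standing in for &lt;em&gt;iris&lt;/em&gt;; the name &lt;code&gt;sample&lt;/code&gt; and its values are illustrative only.&lt;/p&gt;

```python
import numpy as np
import pandas as pd

# hypothetical sample standing in for the iris data
sample = pd.DataFrame({
    "Sepal.Length": [5.1, 7.2, 6.3, 7.7],
    "Sepal.Width": [3.5, 3.6, 3.3, 3.8],
    "Petal.Length": [1.4, 6.1, 6.0, 6.7],
})

# rows where Sepal.Length is larger than 7
mask = sample["Sepal.Length"] > 7

# blank out both sepal measurements in those rows
sample.loc[mask, ["Sepal.Length", "Sepal.Width"]] = np.nan

print(sample.isnull().sum())
```

&lt;p&gt;The mask is computed once, before any value is replaced, so both columns are blanked in the same rows.&lt;/p&gt;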
&lt;/div&gt;
&lt;div id=&#34;application&#34; class=&#34;section level3&#34;&gt;
&lt;h3&gt;Application&lt;/h3&gt;
&lt;p&gt;The following code chunk applies &lt;em&gt;countMissing()&lt;/em&gt;, the function that we have just created, and prints the output data frame.&lt;/p&gt;
&lt;pre class=&#34;python&#34;&gt;&lt;code&gt;# calculating missing value using countMissing()
table = countMissing(r.data)
table&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;##        Featuers Missing_Value  Percentage_Missing
## 0  Sepal.Length            12                 8.0
## 1   Sepal.Width            12                 8.0&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Let’s use some R markdown packages to make the output look nicer!&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;knitr::kable(py$table, caption = &amp;quot;Missing Values&amp;quot;) %&amp;gt;%
  kable_classic(full_width = F, html_font = &amp;quot;Cambria&amp;quot;)&lt;/code&gt;&lt;/pre&gt;
&lt;table class=&#34; lightable-classic&#34; style=&#34;font-family: Cambria; width: auto !important; margin-left: auto; margin-right: auto;&#34;&gt;
&lt;caption&gt;
&lt;span id=&#34;tab:unnamed-chunk-2&#34;&gt;Table 1: &lt;/span&gt;Missing Values
&lt;/caption&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th style=&#34;text-align:left;&#34;&gt;
Featuers
&lt;/th&gt;
&lt;th style=&#34;text-align:left;&#34;&gt;
Missing_Value
&lt;/th&gt;
&lt;th style=&#34;text-align:right;&#34;&gt;
Percentage_Missing
&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td style=&#34;text-align:left;&#34;&gt;
Sepal.Length
&lt;/td&gt;
&lt;td style=&#34;text-align:left;&#34;&gt;
12
&lt;/td&gt;
&lt;td style=&#34;text-align:right;&#34;&gt;
8
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td style=&#34;text-align:left;&#34;&gt;
Sepal.Width
&lt;/td&gt;
&lt;td style=&#34;text-align:left;&#34;&gt;
12
&lt;/td&gt;
&lt;td style=&#34;text-align:right;&#34;&gt;
8
&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;div id=&#34;what-did-we-improve&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;What Did We Improve?&lt;/h2&gt;
&lt;p&gt;If you look inside the &lt;em&gt;countMissing()&lt;/em&gt; function you will see that it uses &lt;em&gt;isnull().sum()&lt;/em&gt;, the same call that we could use directly to get the missing count. The only reason we created &lt;em&gt;countMissing()&lt;/em&gt; was to produce the missing count in a more presentable and usable form. Though the difference is more obvious on a wider data set, the following code chunks show how the outputs of the two approaches differ.&lt;/p&gt;
&lt;pre class=&#34;python&#34;&gt;&lt;code&gt;r.data.isnull().sum()&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## Sepal.Length    12
## Sepal.Width     12
## Petal.Length     0
## Petal.Width      0
## Species          0
## dtype: int64&lt;/code&gt;&lt;/pre&gt;
&lt;center&gt;
&lt;strong&gt;VS&lt;/strong&gt;
&lt;/center&gt;
&lt;pre class=&#34;python&#34;&gt;&lt;code&gt;countMissing(r.data)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;##        Featuers Missing_Value  Percentage_Missing
## 0  Sepal.Length            12                 8.0
## 1   Sepal.Width            12                 8.0&lt;/code&gt;&lt;/pre&gt;
&lt;center&gt;
&lt;strong&gt;Or Even Better&lt;/strong&gt;
&lt;/center&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;knitr::kable(py$table, caption = &amp;quot;Missing Values&amp;quot;) %&amp;gt;%
  kable_classic(full_width = F, html_font = &amp;quot;Cambria&amp;quot;)&lt;/code&gt;&lt;/pre&gt;
&lt;table class=&#34; lightable-classic&#34; style=&#34;font-family: Cambria; width: auto !important; margin-left: auto; margin-right: auto;&#34;&gt;
&lt;caption&gt;
&lt;span id=&#34;tab:unnamed-chunk-5&#34;&gt;Table 2: &lt;/span&gt;Missing Values
&lt;/caption&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th style=&#34;text-align:left;&#34;&gt;
Featuers
&lt;/th&gt;
&lt;th style=&#34;text-align:left;&#34;&gt;
Missing_Value
&lt;/th&gt;
&lt;th style=&#34;text-align:right;&#34;&gt;
Percentage_Missing
&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td style=&#34;text-align:left;&#34;&gt;
Sepal.Length
&lt;/td&gt;
&lt;td style=&#34;text-align:left;&#34;&gt;
12
&lt;/td&gt;
&lt;td style=&#34;text-align:right;&#34;&gt;
8
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td style=&#34;text-align:left;&#34;&gt;
Sepal.Width
&lt;/td&gt;
&lt;td style=&#34;text-align:left;&#34;&gt;
12
&lt;/td&gt;
&lt;td style=&#34;text-align:right;&#34;&gt;
8
&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;hr /&gt;
&lt;p&gt;In this tutorial we had a brief introduction to writing functions in Python and saw how we can write our own little utility functions to solve our unique problems.&lt;/p&gt;
&lt;center&gt;
***
&lt;/center&gt;
&lt;center&gt;
&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Thank you for reading so far!&lt;/strong&gt;&lt;br /&gt;
If you enjoyed this, please feel free to browse through my blog, follow me on &lt;a href=&#34;https://medium.com/@curious-joe&#34;&gt;Medium&lt;/a&gt; or connect with me on &lt;a href=&#34;https://www.linkedin.com/in/arafath-hossain/&#34;&gt;LinkedIn&lt;/a&gt;.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;/center&gt;
&lt;/div&gt;
</description>
    </item>
    
  </channel>
</rss>
