Data Visualisation: A Visual Approach to Understanding the Data - I

Visualising data is critical for data analysis because it helps you understand the data. It matters for machine learning too, as it helps you select an algorithm that suits your data. For instance, if the relationship between the features and the target is non-linear, you must choose an algorithm that can model non-linear relationships. An algorithm such as linear regression would fit such data poorly because it assumes a linear relationship between the features and the target.

Given the importance of data visualisation, this post will examine a few data visualisation techniques that can assist in better understanding the data. The emphasis is on how to create the visualisation rather than its interpretation, which is beyond the scope of this post. We will use the pull request data from Kaggle's GitHub programming languages dataset for this work.

After loading the data, the first step is to inspect it. This can be done with the pandas DataFrame head() method, as demonstrated below. By default, head() displays the first five rows of the dataset.

df_pr.head()

         name  year  quarter  count
0        Ruby  2011        3    632
1         PHP  2011        3    484
2      Python  2011        3    423
3  JavaScript  2011        3    367
4        Java  2011        3    216
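
The preview above assumes that df_pr has already been loaded, a step the post does not show. The following is a minimal sketch of that step, assuming the data is a CSV file with the four columns shown in the preview. The in-memory sample stands in for the real file so that the snippet runs on its own, and the file name mentioned in the comment is an assumption rather than something stated in this post.

```python
import io

import pandas as pd

# Stand-in for the real file: the five rows shown in the preview above.
# With the downloaded dataset this would be something like
# pd.read_csv("prs.csv"); that file name is an assumption.
sample_csv = io.StringIO(
    "name,year,quarter,count\n"
    "Ruby,2011,3,632\n"
    "PHP,2011,3,484\n"
    "Python,2011,3,423\n"
    "JavaScript,2011,3,367\n"
    "Java,2011,3,216\n"
)

df_pr = pd.read_csv(sample_csv)

# Inspect the first rows and the inferred column types.
print(df_pr.head())
print(df_pr.dtypes)
```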

The first question we want to answer about the data is how it is distributed. There are several options for this, such as a histogram, but we will use the cumulative distribution function (CDF) plot. Why the CDF? I would suggest reading the article “why we love CDF and do not like histograms that much?”, which goes into greater detail about the benefits of the CDF. A CDF is defined as a function F(x) that gives the probability that a random variable X, from a particular distribution, is less than or equal to x [1].
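
To make the definition concrete, here is a small sketch of an empirical CDF computed with NumPy. It is not the code used for the figures in this post; the helper name ecdf is hypothetical, and the input is simply the five counts from the preview above.

```python
import numpy as np


def ecdf(values):
    """Return sorted values and the fraction of observations <= each value.

    Hypothetical helper for illustration. With repeated values, the last
    occurrence of a value carries the correct cumulative fraction.
    """
    x = np.sort(np.asarray(values, dtype=float))
    y = np.arange(1, x.size + 1) / x.size
    return x, y


# The five counts from the preview of the dataset.
counts = [632, 484, 423, 367, 216]
x, y = ecdf(counts)

for value, fraction in zip(x, y):
    print(f"F({value:.0f}) = {fraction:.1f}")
```

Reading off one point: three of the five counts are less than or equal to 423, so the empirical CDF at 423 is 3/5.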

We will plot the CDF of the mean, min, and max values to better understand the distribution of the data. The function shown below computes the minimum, mean, and maximum of the count column for the data grouped by the year and name columns. In other words, we want to understand how the data is distributed by year and by name, where name is the programming language.

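def make_group(df, group_by, column_name):
    # Aggregate column_name per group into its mean, minimum, and maximum.
    grouped_multiple = df.groupby(group_by).agg({column_name: ['mean', 'min', 'max']})
    # Flatten the two-level column index into plain column names.
    grouped_multiple.columns = [f"{column_name}_mean", f"{column_name}_min", f"{column_name}_max"]
    # Turn the group keys back into regular columns and drop incomplete rows.
    grouped_multiple = grouped_multiple.reset_index()
    grouped_multiple = grouped_multiple.dropna()
    return grouped_multiple

grouped_df = make_group(df_pr, ['year', 'name'], 'count')
grouped_df.head()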

The table below shows a snapshot of the aggregation result, which we will use in the following step to compute and visualise CDF.

   year    name  count_mean  count_min  count_max
0  2011       C       623.0        623        623
1  2011      C#       291.0        291        291
2  2011     C++       506.0        181        831
3  2011  Erlang       149.0        149        149
4  2011    HTML       149.0        149        149
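
The same aggregation can also be written with pandas named aggregation, which sets the output column names directly instead of renaming them afterwards. The sketch below is an alternative for illustration, not the function used in this post: make_group_named is a hypothetical name, it omits the dropna step, and the sample rows use counts from the snapshot above with assumed quarter values.

```python
import pandas as pd

# Illustrative rows only: the counts match the snapshot above, but the
# quarter assignments are assumed for the sake of the example.
df_small = pd.DataFrame({
    "name": ["C", "C++", "C++"],
    "year": [2011, 2011, 2011],
    "quarter": [4, 3, 4],
    "count": [623, 181, 831],
})


def make_group_named(df, group_by, column_name):
    """Hypothetical variant of make_group that uses named aggregation."""
    # Map each output column name to an (input column, aggregation) pair.
    named = {
        f"{column_name}_{stat}": (column_name, stat)
        for stat in ("mean", "min", "max")
    }
    return df.groupby(group_by, as_index=False).agg(**named)


grouped_small = make_group_named(df_small, ["year", "name"], "count")
print(grouped_small)
```

For C++ in 2011 the two counts 181 and 831 give a minimum of 181, a maximum of 831, and a mean of 506, which agrees with the corresponding row in the snapshot.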

The function cdf_plot_mean_max_min below computes the CDF and plots it. Figures 1 and 2 show the CDF plots of pull requests for the C and Erlang programming languages. The difference between the two distributions is evident, and the result is intuitive, even to non-experts. Without the plot, one would have to exert considerable effort to discern the difference from the computed CDF values alone.

import matplotlib.pyplot as plt
import seaborn as sns

def cdf_plot_mean_max_min(grouped_multiple, column_name, xlabel, title):
    sns.set(rc={'figure.figsize': (20, 8)})
    # Plot one cumulative-percentage curve per aggregated column.
    for stat, label in [('mean', 'Mean'), ('max', 'Max'), ('min', 'Min')]:
        column = f"{column_name}_{stat}"
        # Frequency of each distinct value; groupby sorts the values in ascending order.
        freq = grouped_multiple.groupby(column).size().rename('Freq').reset_index()
        # Percentage of rows whose value is less than or equal to each distinct value.
        freq['cum'] = freq['Freq'].cumsum() / freq['Freq'].sum() * 100
        plt.plot(freq[column], freq['cum'], linestyle='--', label=label)
    plt.legend(fontsize=30)
    plt.xlabel(xlabel, fontsize=30)
    plt.title(title, fontsize=30)
    plt.ylabel('CDF (%)', fontsize=30)
    plt.xticks(fontsize=30)
    plt.yticks(fontsize=30)
    plt.show()
    plt.close()

cdf_plot_mean_max_min(make_group(df_pr[df_pr["name"] == "C"], ['year'], 'count'),
                      "count", "Number of Samples",
                      "CDF distribution of C programming language grouped by Year")

Figure 1: CDF distribution of C programming language grouped by year

cdf_plot_mean_max_min(make_group(df_pr[df_pr["name"]=="Erlang"],['year'],'count'),
                      "count","Number of Samples",
                     "CDF distribution of Erlang programming language grouped by Year")

Figure 2: CDF distribution of Erlang programming language grouped by year
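
The cumulative percentages behind these plots can be checked without drawing anything. The sketch below repeats the frequency-then-cumulative-sum idea on a handful of values; cumulative_percent is a hypothetical helper, and the input values are picked from the aggregation snapshot purely for illustration.

```python
import pandas as pd


def cumulative_percent(values):
    """Percentage of observations less than or equal to each distinct value.

    Hypothetical helper that mirrors the frequency table and cumulative sum
    used inside the plotting function.
    """
    # Count how often each distinct value occurs, ordered by value.
    freq = pd.Series(values).value_counts().sort_index()
    # Running total of the frequencies as a share of all observations.
    return freq.cumsum() / freq.sum() * 100


# 149 occurs twice, so the first step of the curve jumps to 40%.
cum = cumulative_percent([149, 149, 181, 623, 831])
print(cum)
```

Because repeated values are counted first, each distinct value appears once on the x-axis and the curve always ends at 100%.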

The next thing we'd like to know is how the number of pull requests for a given programming language varies by year. The function pull_request_language_by_year can be used for this purpose. Figures 3, 4, 5, and 6 show the visualisations generated using this function.

def pull_request_language_by_year(df, language_name, xlabel, ylabel, title):
    if not isinstance(language_name, str):
        raise TypeError("language_name must be a string")
    # Total pull requests per year for the selected language.
    data_to_plot = df[df["name"] == language_name].groupby(["year"]).agg({"count": ['sum']})
    data_to_plot.columns = ["count_total_pr"]
    data_to_plot = data_to_plot.reset_index()
    plt.plot(data_to_plot["year"], data_to_plot["count_total_pr"])
    plt.xlabel(xlabel, fontsize=30)
    plt.title(title, fontsize=30)
    plt.ylabel(ylabel, fontsize=30)
    plt.xticks(fontsize=30)
    plt.yticks(fontsize=30)
    plt.show()
    plt.close()

pull_request_language_by_year(df_pr, "Ruby", "Year", "Number of pull requests",
                              "Ruby programming language pull requests by year")

Figure 3: Visualisation of the year-to-year variation in the number of pull requests for the Ruby programming language

pull_request_language_by_year(df_pr,"C","Year","Number of pull requests",
                              "C programming language pull requests by year")

Figure 4: Visualisation of the year-to-year variation in the number of pull requests for the C programming language

pull_request_language_by_year(df_pr,"C++","Year","Number of pull requests",
                              "C++ programming language pull requests by year")

Figure 5: Visualisation of the year-to-year variation in the number of pull requests for the C++ programming language

pull_request_language_by_year(df_pr,"Erlang","Year","Number of pull requests",
                              "Erlang programming language pull requests by year")

Figure 6: Visualisation of the year-to-year variation in the number of pull requests for the Erlang programming language

Next comes a slight variation on the pull requests by year shown above. Instead of following one language over the years, we fix a year and a quarter and visualise the number of pull requests for each programming language. The results generated by the function pull_request_language_by_year_quarter are shown in Figures 7 and 8.

def pull_request_language_by_year_quarter(df,
                                          filterby_year,
                                          filterby_quarter,
                                          xlabel,
                                          ylabel,
                                          title,
                                          rotate_xticks=False):
    # Keep only the rows for the requested year and quarter.
    data_to_plot = df[(df["year"] == filterby_year) & (df["quarter"] == filterby_quarter)]
    plt.plot(data_to_plot["name"], data_to_plot["count"])
    if rotate_xticks:
        # Vertical labels help when many language names share the x-axis.
        plt.xticks(rotation='vertical')
    plt.xlabel(xlabel, fontsize=30)
    plt.title(title, fontsize=30)
    plt.ylabel(ylabel, fontsize=30)
    plt.xticks(fontsize=30)
    plt.yticks(fontsize=30)
    plt.show()
    plt.close()
pull_request_language_by_year_quarter(df_pr,
                                     2011,
                                     3,
                                     "Programming language",
                                     "Number of pull requests",
                                     "Pull requests by programming language for the third quarter of 2011")

Figure 7: Visualisation of the number of pull requests for different programming languages for the third quarter of the year 2011

pull_request_language_by_year_quarter(df_pr,
                                     2011,
                                     4,
                                     "Programming language",
                                     "Number of pull requests",
                                     "Pull requests by programming language for the fourth quarter of 2011",
                                     True)

Figure 8: Visualisation of the number of pull requests for different programming languages for the fourth quarter of the year 2011

Finally, we visualise the total number of pull requests grouped by programming language, as shown in Figure 9. The function pull_request_groupby_filter_all can be used for this purpose; it plots the totals on a log10 scale.

import numpy as np

def pull_request_groupby_filter_all(df, group_by, xlabel, ylabel, title, rotate_xticks=False):
    if not isinstance(group_by, str):
        raise TypeError("group_by must be a string")
    sns.set(rc={'figure.figsize': (50, 10)})
    # Total pull requests per group.
    data_to_plot = df.groupby(group_by).agg({"count": ['sum']})
    data_to_plot.columns = ["total_pr"]
    data_to_plot = data_to_plot.reset_index()
    # Plot the log10 of the totals, matching the "(log10)" y-axis label passed in below.
    plt.bar(data_to_plot[group_by], np.log10(data_to_plot["total_pr"]))
    if rotate_xticks:
        # Rotate after plotting so the rotation applies to the final tick labels.
        plt.xticks(rotation=65)
    plt.xlabel(xlabel, fontsize=30)
    plt.title(title, fontsize=30)
    plt.ylabel(ylabel, fontsize=30)
    plt.xticks(fontsize=20)
    plt.yticks(fontsize=30)
    plt.show()
    plt.close()
pull_request_groupby_filter_all(df_pr,
                               "name",
                               "Programming languages",
                               "Total pull requests (log10)",
                               "Total pull requests based on programming language",
                               True)

Figure 9: Visualisation of the total number of pull requests by programming languages

To summarise, we saw various methods for visualising data as well as the code necessary to perform such visualisations.

References

[1] McClarren, R., 2017. Computational Nuclear Engineering and Radiological Science Using Python. Academic Press.

Code: Jupyter Notebook