Agile Data Science - Quick Guides



Agile Data Science Study Notes

Agile is a software development methodology that helps in building software through incremental sessions using short iterations of one week to about a month, so that development stays aligned with changing business needs. Agile data science combines the agile methodology with data science. In these study notes, we have used appropriate examples to help you understand agile development and data science in a general and quick way.

Audience

These study notes have been prepared for engineers and project managers to help them understand the basics of agile principles and their implementation. After finishing these study notes, you will find yourself at a moderate level of expertise, from where you can progress further with the implementation of data science and agile methodology.

Prerequisites

It is essential to have basic knowledge of data science modules and of software development concepts such as software requirements, coding and testing.


Agile Data Science - Introduction

Agile data science is a methodology that applies data science together with agile methodology to web application development. It focuses on the output of the data science process and on making that output suitable for effecting change in an organization. Data science incorporates building applications that describe the research process, with analysis, interactive visualization and, now, applied machine learning as well.

The major objective of agile data science is to −

document and guide exploratory data analysis to discover and follow the critical path to a compelling product.

Agile data science is organized around the following set of principles −

Continuous Iteration

This process involves continuous iteration over the creation of tables, charts, reports and predictions. Building predictive models requires many iterations of feature engineering, with extraction and creation of insight.

Intermediate Output

This is the tracked list of outputs generated. Even failed experiments have output. Tracking the output of every iteration helps create better output in the next iteration.

Prototype Experiments

Prototype experiments involve assigning tasks and generating output as per the experiment. In a given task, we iterate to achieve insight, and these iterations are best described as experiments.

Integration of data

The software development life cycle spans various stages, with data essential for −

  • clients

  • engineers, and

  • the business

The integration of data paves the way for better possibilities and outputs.

The data value pyramid

 

[Figure: The Data Value Pyramid]

 

The above pyramid depicts the layers required for “agile data science” development. It begins with the collection of records based on the requirements and the plumbing of individual records. Charts are created after the cleaning and aggregation of data, and the aggregated data can then be used for data visualization. Reports are generated with appropriate structure, metadata and tags of data. The second layer of the pyramid from the top is prediction analysis. The prediction layer is where more value is created, and good predictions here rest on focused feature engineering.

The topmost layer involves actions, where the value of data is driven successfully. The best representation of this implementation is “Artificial Intelligence”.


Agile Data Science - Methodology Concepts

In this section, we will focus on the concepts of the software development life cycle called “agile”. The agile software development methodology helps in building software through incremental sessions in short iterations of one week to about a month, so that development stays aligned with changing business requirements.

There are 12 principles that describe the Agile methodology in detail −

Customer satisfaction

The highest priority is given to customer satisfaction, achieved through early and continuous delivery of valuable software.

Welcoming new changes

Changes are acceptable even late in software development. Agile processes work with change to preserve the client's competitive advantage.

Delivery

Working software is delivered to customers frequently, within a span of one week to about a month.

Collaboration

Business analysts, quality experts and developers must cooperate during the whole life cycle of the project.

Motivation

Projects should be built around motivated individuals and should provide an environment that supports individual team members.

Personal conversation

Face-to-face conversation is the most productive and effective method of conveying information to and within a development team.

Measuring progress

Working software is the key measure that helps characterize the progress of the project and of software development.

Maintaining constant pace

The agile process focuses on sustainable development. The business, the engineers and the users should be able to maintain a constant pace with the project.

Monitoring

Regular attention to technical excellence and good design is required to enhance agility.

Simplicity

The agile process keeps everything simple, maximizing the amount of work not done.

Self-organized teams

An agile team should be self-organized and independent; the best architectures, requirements and designs emerge from self-organized teams.

Review the work

It is important to review the work at regular intervals so that the team can reflect on how the work is advancing. Timely reviews of a module improve its execution.

Daily Stand-up

Daily stand-up refers to the daily status meeting among team members. It gives updates related to the software development and addresses obstacles in project development.

Daily stand-up is a required practice, no matter how an agile team is established, regardless of its office location.

The features of a daily stand-up are as follows −

  • The daily stand-up meeting should last roughly 15 minutes. It should not extend for a longer duration.

  • Stand-up should include discussions on status updates.

  • Participants of this meeting normally stand, with the intention of ending the meeting quickly.

User Story

A user story is a requirement, formulated in a few sentences in simple language, and it should be completed within an iteration. A user story should incorporate the following characteristics −

  • All the related code should have related check-ins.

  • Unit test cases should be written for the specified iteration.

  • All the acceptance test cases should be defined.

  • Acceptance from the product owner while defining the story.

 

[Figure: Agile Scrum methodology]

 

What is Scrum?

Scrum can be considered a subset of the agile methodology. It is a lightweight process and incorporates the following features −

  • It is a process framework, which incorporates a set of practices that need to be followed in a consistent order. The best representation of Scrum is its iterations, or sprints.

  • It is a “lightweight” process, meaning that the process is kept as small as possible to maximize the productive output in the given duration.

The Scrum process is distinguished from other methodologies of the traditional agile approach by its specific practices. It is divided into the following three categories −

  • Roles

  • Artifacts

  • Time Boxes

Roles define the team members and their responsibilities throughout the process. The Scrum Team consists of the following three roles −

  • Scrum Master

  • Product Owner

  • Team

The Scrum artifacts provide key information that every member should be aware of. The information includes details of the product, planned activities and finished activities. The artifacts defined in the Scrum framework are as follows −

  • Product backlog

  • Sprint backlog

  • Burn down chart

  • Increment

Time boxes scope the user stories that are planned for each iteration. These user stories help describe the product features that form part of the Scrum artifacts. The product backlog is a list of user stories; these user stories are prioritized and taken to planning meetings to decide which ones should be taken up.

Why Scrum Master?

The Scrum Master collaborates with every member of the team. Let us now see how the Scrum Master interacts with the other roles and assets.

Product Owner

The Scrum Master interacts with the product owner in the following ways −

  • Discovering techniques for building and managing an effective product backlog of user stories.

  • Helping the team understand the need for clear and concise product backlog items.

  • Product planning in a specific environment.

  • Ensuring that the product owner knows how to increase the value of the product.

  • Facilitating Scrum events as and when required.

Scrum Team

The Scrum Master interacts with the team in several ways −

  • Coaching the organization in its Scrum adoption.

  • Planning Scrum implementations for the particular organization.

  • Helping employees and stakeholders understand the requirements and phases of product development.

  • Working with the Scrum Masters of other teams to increase the effectiveness of Scrum in the specified team.

Organization

The Scrum Master communicates with the organization in several ways. A few are mentioned below −

  • Coaching the Scrum team in self-organization and cross-functionality.

  • Coaching the organization and teams in areas where Scrum is not yet fully adopted or understood.

Benefits of Scrum

Scrum helps clients, team members and stakeholders work together. It incorporates a timeboxed approach and continuous feedback from the product owner, ensuring that the product is in working condition. Scrum provides benefits to the different roles of a project.

Client

Sprints or iterations are kept to a shorter duration, and user stories are designed according to need and taken up at sprint planning. This ensures that client requirements are fulfilled in every sprint delivery. If not, the requirements are noted, planned and taken up in a later sprint.

Organization

With the help of Scrum and Scrum Masters, the organization can focus its efforts on the development of user stories, thereby reducing work overload and avoiding rework. This also helps maintain the efficiency of the development team and client satisfaction, and it increases the organization's capability to reach the market.

Product Managers

The fundamental responsibility of product managers is to ensure that the quality of the product is maintained. With the help of Scrum Masters, it becomes easy to facilitate work, gather quick responses and absorb changes. Product managers also verify, in every sprint, that the designed product is aligned with the client requirements.

Development Team

With the time-boxed nature of sprints and their short duration, the development team stays enthusiastic about seeing the work reflected and delivered appropriately. The working product increments after every iteration, or “sprint”. The user stories designed for each sprint reflect client needs, adding more value to the iteration.

Conclusion

Scrum is a productive framework within which software can be developed through teamwork. It is completely structured on agile principles. The ScrumMaster is there to help and cooperate with the Scrum team in every possible way. He acts like a personal trainer who helps you stick to the designed plan and perform all the activities as planned. The authority of the ScrumMaster should never extend beyond the process. He or she should potentially be capable of managing every situation.


Agile Data Science - Data Science Process

In this section, we will understand the data science process and terminologies required to understand the process.

“Data science is the blend of data inference, algorithm development and technology in order to solve analytically complex problems”.

 

[Figure: The data science process]

 

Data science is an interdisciplinary field incorporating scientific methods, processes and systems, with categories such as machine learning, math and statistics knowledge, and traditional research. It additionally includes a combination of hacking skills with substantive expertise. Data science draws principles from mathematics, statistics, information science, computer science, data mining and predictive analytics.

The various roles that form part of the data science team are mentioned below −

Clients

Clients are the people who use the product. Their interest determines the success of the project, and their feedback is very significant in data science.

Business Development

This part of the data science team signs up early clients, either firsthand or through the creation of landing pages and promotions. The business development team communicates the value of the product.

Product Managers

Product managers take in what is significant for creating the best product, one that is valuable in the market.

Interaction designers

They design interactions around data models so that users find appropriate value.

Data scientists

Data scientists investigate and transform data in new ways to create and publish new features. They also combine data from diverse sources to create new value. They play a significant role in creating visualizations together with researchers, engineers and web developers.

Researchers

As the name suggests, researchers are engaged in research activities. They solve complicated problems that data scientists cannot. These problems demand intense focus and time with machine learning and statistics modules.

Adapting to Change

All members of a data science team are required to adapt to new changes and work on the basis of requirements. A few changes should be made when adopting agile methodology with data science, as follows −

  • Choosing generalists over specialists.

  • Preference of small teams over large teams.

  • Using high-level tools and platforms.

  • Continuous and iterative sharing of intermediate work.

Note

In the Agile data science team, a small team of generalists utilizes high-level tools that are scalable and refine data through iterations into increasingly higher states of value.

Consider the following examples related to the work of data science team members −

  • Designers deliver CSS.

  • Web developers build entire applications and understand the user experience and interface design.

  • Data scientists should work on both research and building web services, including web applications.

  • Researchers work in the code base and share results that explain intermediate findings.

  • Product managers try to identify and understand the flaws in all the related areas.


Agile Data Science - Tools And Installation

In this section, we will learn about the different Agile tools and their installation. The development stack of agile methodology incorporates the following set of components −

Events

An event is an occurrence that happens and is logged along with its features and a timestamp.

An event can come in many forms: servers, sensors, financial transactions, or actions that our users take in our application. Throughout these study notes, we will use JSON files to facilitate data exchange among different tools and languages.
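For instance, here is a minimal sketch of logging one such event as a JSON line (the field names are made up for illustration) −

import json
from datetime import datetime, timezone

# a made-up user-action event with its features and a timestamp
event = {
   "type": "click",
   "user_id": 42,
   "page": "/pricing",
   "timestamp": datetime.now(timezone.utc).isoformat()
}

# append the event to a newline-delimited JSON log
with open("events.json", "a") as f:
   f.write(json.dumps(event) + "\n")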

Collectors

Collectors are event aggregators. They collect events in a systematic way, storing and aggregating bulky data and queuing it for action by real-time workers.

Distributed document

These stores are distributed across multiple nodes and keep documents in a particular format. We will concentrate on MongoDB in these study notes.

Web application server

A web application server exposes data as JSON to the client for visualization, with minimal overhead. This means the web application server helps test and deploy the projects built with the agile methodology.

Modern Browser

A modern browser enables the application to present data as an interactive tool to our users.

Local Environmental Setup

For managing data sets, we will concentrate on the Anaconda framework for Python, which incorporates tools for managing Excel, CSV and many more file types. The dashboard of the Anaconda framework, once installed, is shown below. It is also called the “Anaconda Navigator” −

 

[Screenshot: Anaconda Navigator dashboard]

 

The navigator incorporates the “Jupyter framework”, a notebook system that helps manage datasets. When you launch the framework, it is hosted in the browser as shown below −

 

[Screenshot: Jupyter notebook hosted in the browser]


Agile Data Science - Data Processing In Agile

In this section, we will concentrate on the difference between structured, semi-structured and unstructured data.

Structured data

Structured data refers to data stored in SQL format, in tables with rows and columns. It includes a relational key that maps into pre-designed fields. Structured data is used at larger scale.

Structured data represents only 5-10 percent of all informatics data.

Semi-structured data

Semi-structured data is data that does not reside in a relational database but has some organizational properties that make it simpler to analyse. With some processing, it can be stored in a relational database. Examples of semi-structured data are CSV files and XML and JSON documents. NoSQL databases are considered semi-structured.

Unstructured data

Unstructured data represents 80 percent of data. It frequently includes text and multimedia content. The best examples of unstructured data include audio files, presentations and web pages. The examples of machine generated unstructured data are satellite pictures, scientific data, photos and video, radar and sonar data.

 

[Figure: Pyramid of data types by volume]

 

The above pyramid structure focuses specifically on the amount of data and the ratio in which it is distributed.

Semi-structured data sits between unstructured and structured data. In these study notes, we will focus on semi-structured data, which is helpful for the agile methodology and data science research.

Semi-structured data doesn't have a formal data model, but it has a clear, self-describing pattern and structure that can be discovered by analysis.
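For example, a semi-structured JSON record carries its own field names, so its structure can be discovered simply by inspecting it. A small sketch (the record is made up) −

import json

record = '{"name": "Agile Data Science", "tags": ["agile", "data"], "views": 120}'
doc = json.loads(record)

# the keys describe the structure; no external schema is required
print(list(doc.keys()))   # ['name', 'tags', 'views']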


Agile Data Science - SQL Vs NoSQL

The complete focus of these study notes is to follow the agile methodology with fewer steps and more useful tools. To achieve this, it is important to know the difference between SQL and NoSQL databases.

Most users are aware of SQL databases and have decent knowledge of MySQL, Oracle or other SQL databases. Over the last several years, NoSQL databases have been widely adopted to solve various business problems and project requirements.

 

[Figure: SQL vs NoSQL]

 

The following comparison shows the differences between SQL and NoSQL databases −

  • SQL databases are primarily called Relational Database Management Systems (RDBMS). NoSQL databases are also called document-oriented databases; they are non-relational and distributed.

  • SQL databases are structured as tables with rows and columns; a collection of tables and other schema structures is called a database. NoSQL databases use the document as the significant structure, and a set of documents is called a collection.

  • SQL databases have a predefined schema. NoSQL databases hold dynamic and unstructured data.

  • SQL databases are vertically scalable. NoSQL databases are horizontally scalable.

  • SQL databases are a good match for complex query environments. NoSQL databases do not have standard interfaces for complex query development.

  • SQL databases are not well suited for hierarchical data storage. NoSQL databases fit hierarchical data storage better.

  • SQL databases are the best fit for heavy transactional applications. NoSQL databases are still not considered comparable under high load for complex transactional applications.

  • SQL databases offer excellent support from their vendors. NoSQL databases still rely on community support, and only a few specialists are available for setting up and deploying large-scale NoSQL installations.

  • SQL databases focus on the ACID properties − Atomicity, Consistency, Isolation and Durability. NoSQL databases focus on the CAP properties − Consistency, Availability and Partition tolerance.

  • SQL databases can be classified as open source or closed source, depending on the vendor. NoSQL databases are classified based on the storage type; NoSQL databases are open source by default.

Why NoSQL for agile?

The above comparison shows that NoSQL document databases thoroughly support agile development. NoSQL is schema-less and does not focus entirely on upfront data modelling; instead, it defers those decisions to applications and services, so developers get a better idea of how data can be modelled. NoSQL defines the data model as the application model.

 


 

MongoDB Installation

Throughout these study notes, we will focus more on examples in MongoDB, as it is widely considered the best “NoSQL” choice.

 

[Screenshots: MongoDB installation steps]


Agile Data Science - NoSQL & Dataflow Programming

There are times when the data is unavailable in relational format and we need to keep it transactional with the help of NoSQL databases.

In this section, we will concentrate on the dataflow of NoSQL. We will also learn how it is operational with a mix of agile and data science.

One of the significant reasons to use NoSQL with agile is to increase speed in the face of market competition. The following reasons show why NoSQL is the best fit for agile software methodology −

Fewer Barriers

Changing a model mid-stream has some genuine costs, even in agile development. With NoSQL, users work with aggregate data instead of wasting time normalizing data. The main point is to get something done and working, with the goal of perfecting the data model later.

Increased Scalability

Whenever an organization is creating a product, it lays more focus on scalability. NoSQL is known for its scalability, but it works best when structured for horizontal scalability.

Ability to leverage data

NoSQL is a schema-less data model that permits the user to readily use volumes of data with several parameters of variability and velocity. When making a technology decision, you should always consider the one that leverages the data at greater scale.

Dataflow of NoSQL

Let us consider the following example, wherein we show how a data model is focused on creating an RDBMS schema.

Following are the various requirements of the schema −

  • User identification should be listed.

  • Each user should have at least one mandatory skill.

  • The details of each user's experience should be maintained appropriately.

 

[Figure: Normalized RDBMS schema for users, skills and experience]

 

The user data is normalized into three separate tables −

  • Users

  • User skills

  • User experience

The complexity increases when querying the database, and the time taken grows with increased normalization, which is not good for agile methodology. The same schema can be designed in a NoSQL database as shown below −

 

[Figure: The same schema as a NoSQL document]

 

NoSQL maintains the structure in JSON format, which is lightweight. With JSON, applications can store objects with nested data as single documents.
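As a hedged sketch, the three normalized tables above could collapse into one nested document like the following (the field names are illustrative, not taken from a real schema) −

# a single user document embedding skills and experience
user = {
   "user_id": 1,
   "name": "Tom",
   "skills": ["python", "mongodb"],   # previously a separate skills table
   "experience": [                    # previously a separate experience table
      {"company": "Acme", "years": 2},
      {"company": "Initech", "years": 3}
   ]
}
print(user["skills"])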


Agile Data Science - Collecting & Displaying Records

In this section, we will concentrate on the JSON structure, which forms part of the “agile methodology”. MongoDB is a widely used NoSQL data store and operates effectively for collecting and displaying records.

 

[Figure: JSON structure in MongoDB]

 

Step 1

This step involves establishing a connection with MongoDB in order to create the collection and the determined data model. All you have to execute is the “mongod” command to start the server and the “mongo” command to connect to the specified terminal.

 

[Screenshot: starting mongod and connecting with mongo]

 

Step 2

Create a new database for creating records in JSON format. For now, we are creating a dummy database named “mydatabase”.

>use mydatabase
switched to db mydatabase
>db
mydatabase
>show dbs
local 0.78125GB
test 0.23012GB
>db.user.insert({"name":"Agile Data Science"})
>show dbs
local 0.78125GB
mydatabase 0.23012GB
test 0.23012GB

Step 3

Creating a collection is obligatory to get the list of records. This feature is beneficial for data science research and outputs.

>use test
switched to db test
>db.createCollection("mycollection")
{ "ok" : 1 }
>show collections
mycollection
system.indexes
>db.createCollection("mycol", { capped : true, autoIndexId : true, size :
 6142800, max : 10000 } )
{ "ok" : 1 }
>db.agiledatascience.insert({"name" : "demoname"})
>show collections
agiledatascience
mycol
mycollection
system.indexes
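Step 4 − To display the collected records from Python, a minimal sketch with the pymongo driver is shown below (it assumes pymongo is installed and MongoDB is running on the default local port) −

from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017/")   # default local MongoDB
db = client["test"]

# iterate over and display the stored records
for doc in db["agiledatascience"].find():
   print(doc)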

Agile Data Science - Data Visualization

Data visualization plays a significant role in data science; we can consider it a module of data science. Data science incorporates more than building predictive models; it includes the explanation of models and using them to understand data and make decisions. Data visualization is an essential part of presenting data in the most convincing way.

From the data science perspective, data visualization is a highlighting feature which shows the changes and trends.

Consider the following guidelines for effective data visualization −

  • Position data along a common scale.

  • Bars are more effective than circles and squares.

  • Proper colors should be used for scatter plots.

  • Use a pie chart to show proportions.

  • Sunburst visualization is more effective for hierarchical plots.

Agile needs a simple scripting language for data visualization, and with data science in collaboration, “Python” is the suggested language for data visualization.

Example 1

The following example demonstrates the visualization of GDP figures for specific years. “Matplotlib” is the best library for data visualization in Python. Its installation is shown below −

 

[Screenshot: Matplotlib installation]
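If you prefer the command line to the Navigator, a typical installation command is the following (your environment may differ) −

pip install matplotlib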

 

Consider the following code to understand this −

import matplotlib.pyplot as plt
years = [1950, 1960, 1970, 1980, 1990, 2000, 2010]
gdp = [300.2, 543.3, 1075.9, 2862.5, 5979.6, 10289.7, 14958.3]

# create a line chart, years on x-axis, gdp on y-axis
plt.plot(years, gdp, color='green', marker='o', linestyle='solid')

# add a title
plt.title("Nominal GDP")
# add a label to the y-axis
plt.ylabel("Billions of $")
plt.show()

Output

The above code generates the following output −

 

[Output: line chart of nominal GDP]

 

There are many ways to customize charts with axis labels, line styles and point markers. The next example shows better data visualization; its results can be used for better output.

Example 2

import datetime
import random
import matplotlib.pyplot as plt

# make up some data
x = [datetime.datetime.now() + datetime.timedelta(hours=i) for i in range(12)]
y = [i+random.gauss(0,1) for i,_ in enumerate(x)]

# plot
plt.plot(x,y)

# beautify the x-labels
plt.gcf().autofmt_xdate()
plt.show()

Output

The above code generates the following output −

 

[Output: time-series line plot]


Agile Data Science - Data Enrichment

Data enrichment refers to a range of processes used to upgrade, refine and improve raw data. It refers to useful data transformation (raw data into valuable information). The process of data enrichment focuses on making data a valuable asset for a modern business or enterprise.

The most common data enrichment process includes the correction of spelling mistakes or typographical errors in a database through the use of specific decision algorithms. Data enrichment tools add useful information to simple data tables.

Consider the following code for spell correction of words −

import re
from collections import Counter
def words(text): return re.findall(r'\w+', text.lower())
WORDS = Counter(words(open('big.txt').read()))

def P(word, N=sum(WORDS.values())):
   "Probabilities of words"
   return WORDS[word] / N
	
def correction(word):
   "Spelling correction of word"
   return max(candidates(word), key=P)
	
def candidates(word):
   "Generate possible spelling corrections for word."
   return (known([word]) or known(edits1(word)) or known(edits2(word)) or [word])
	
def known(words):
   "The subset of `words` that appear in the dictionary of WORDS."
   return set(w for w in words if w in WORDS)
	
def edits1(word):
   "All edits that are one edit away from `word`."
   letters = 'abcdefghijklmnopqrstuvwxyz'
   splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
   deletes = [L + R[1:] for L, R in splits if R]
   transposes = [L + R[1] + R[0] + R[2:] for L, R in splits if len(R)>1]
   replaces = [L + c + R[1:] for L, R in splits if R for c in letters]
   inserts = [L + c + R for L, R in splits for c in letters]
   return set(deletes + transposes + replaces + inserts)
	
def edits2(word):
   "All edits that are two edits away from `word`."
   return (e2 for e1 in edits1(word) for e2 in edits1(e1))

print(correction('speling'))
print(correction('korrectud'))

In this program, we match against “big.txt”, which contains the corpus of correctly spelled words. Candidate words are matched against the words included in the text file, and the most probable corrections are printed accordingly.

Output

The above code will generate the following output −

 

[Output: corrected spellings]


Agile Data Science - Working With Reports

In this section, we will learn about report creation, which is a significant module of agile methodology. In reports, agile turns the chart pages created by visualization into full-blown deliverables: charts become interactive, static pages become dynamic, and data becomes network related. The characteristics of the reports phase of the data value pyramid are shown below −

 

[Figure: The Data Value Pyramid, reports layer]

 

We will lay more stress on creating a csv file, which can be used as a report for data science analysis and for drawing conclusions. Although agile focuses on less documentation, generating reports that mention the progress of product development is always worthwhile.

import csv
#----------------------------------------------------------------------
def csv_writer(data, path):
   """
      Write data to a CSV file path
   """
   # open in text mode with newline="" (the Python 3 idiom for csv)
   with open(path, "w", newline="") as csv_file:
      writer = csv.writer(csv_file, delimiter=',')
      for line in data:
         writer.writerow(line)
#----------------------------------------------------------------------
if __name__ == "__main__":
   data = ["first_name,last_name,city".split(","),
      "Tyrese,Hirthe,Strackeport".split(","),
      "Jules,Dicki,Lake Nickolasville".split(","),
      "Dedric,Medhurst,Stiedemannberg".split(",")
   ]

   path = "output.csv"
   csv_writer(data, path)

The above code will help you generate the “csv file” as shown below −

 

[Output: the generated CSV file]

 

Let us consider the following benefits of csv (comma-separated values) reports −

  • It is human friendly and simple to edit manually.
  • It is easy to implement and parse.
  • CSV can be handled in all applications.
  • It is smaller and faster to handle.
  • CSV follows a standard format.
  • It gives a straightforward schema to data researchers.
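To use the generated report in further analysis, it can be read back, for instance with pandas (a brief sketch) −

import pandas as pd

report = pd.read_csv("output.csv")   # the file written by csv_writer above
print(report.head())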

Agile Data Science - Role Of Predictions

In this section, we will learn about the role of predictions in agile data science. Interactive reports expose various aspects of data. Predictions form the fourth layer of the agile sprint.

 

[Figure: The Data Value Pyramid, predictions layer]

 

When making predictions, we refer to past data and use it as the basis of inference for future iterations. In this process, we move from batch processing of historical data to real-time data about the future.

The role of predictions incorporates the following −

  • Predictions help in forecasting. Some forecasts are based on statistical inference; some are based on the opinions of pundits.

  • Statistical inference is involved in predictions of all kinds.

  • Sometimes forecasts are accurate, while sometimes they are inaccurate.

Predictive Analytics

Predictive analytics incorporates a variety of statistical techniques from predictive modeling, machine learning and data mining which analyze current and historical facts to make predictions about future and unknown events.

Predictive analytics requires training data. Training data incorporates independent and dependent features. Dependent features are the values a user is trying to predict. Independent features are the features describing the things from which we want to predict the dependent features.

The study of features is called feature engineering; it is crucial to making predictions. Data visualization and exploratory data analysis are parts of feature engineering; these form the core of agile data science.

 

[Figure: Three phases of predictions]

 

Making Predictions

There are two ways of making predictions in agile data science −

  • Regression

  • Classification

Whether to build a regression or a classification model depends entirely on the business requirements and their analysis. Forecasting a continuous variable leads to a regression model, while predicting categorical variables leads to a classification model.

Regression

Regression considers examples that comprise features and thereby produces a numeric output.

Classification

Classification takes the input and produces a categorical characterization.

Note − The example dataset that defines the input to statistical prediction and enables the machine to learn is called “training data”.
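As a quick illustration of the two approaches, consider the following minimal scikit-learn sketch; the toy numbers are made up purely for demonstration −

import numpy as np
from sklearn.linear_model import LinearRegression, LogisticRegression

# training data: one independent feature, five examples (made-up values)
X = np.array([[1], [2], [3], [4], [5]])

# regression: predict a continuous target
y_continuous = np.array([1.2, 1.9, 3.1, 4.2, 4.8])
reg = LinearRegression().fit(X, y_continuous)
print(reg.predict([[6]]))   # numeric output

# classification: predict a categorical target
y_categorical = np.array([0, 0, 0, 1, 1])
clf = LogisticRegression().fit(X, y_categorical)
print(clf.predict([[6]]))   # categorical output (0 or 1)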


Agile Data Science - Extracting Features With PySpark

In this section, we will learn about extracting features with PySpark in agile data science.

Overview of Spark

Apache Spark can be characterized as a fast, real-time processing framework. It performs computations to analyze data in real time. Apache Spark was introduced as a real-time stream processing system, but it can also take care of batch processing. Apache Spark supports interactive queries and iterative algorithms.

Spark is written in “Scala programming language”.

PySpark can be considered a blend of Python and Spark. PySpark offers the PySpark shell, which links the Python API to the Spark core and initializes the Spark context. Most data researchers use PySpark for tracking features, as discussed in the previous section.

In this example, we will concentrate on the transformations needed to build a dataset called counts and save it to a specific file.

text_file = sc.textFile("hdfs://...")
counts = text_file.flatMap(lambda line: line.split(" ")) \
   .map(lambda word: (word, 1)) \
   .reduceByKey(lambda a, b: a + b)
counts.saveAsTextFile("hdfs://...")

Using PySpark, a user can work with RDDs in the Python programming language. The built-in Py4J library, which bridges Python and the JVM-based Spark core, makes this possible.
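A minimal, self-contained sketch of working with an RDD follows (it assumes a local Spark installation; the event pairs are made up) −

from pyspark import SparkContext

sc = SparkContext("local", "feature-extraction-sketch")   # local Spark context

# made-up event data: (user, action) pairs
events = sc.parallelize([("alice", "click"), ("bob", "view"), ("alice", "view")])

# extract a simple feature: the number of events per user
per_user = events.map(lambda e: (e[0], 1)).reduceByKey(lambda a, b: a + b)
print(per_user.collect())   # e.g. [('alice', 2), ('bob', 1)]

sc.stop()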


Agile Data Science - Building A Regression Model

Logistic regression is a machine learning algorithm used to predict the probability of a categorical dependent variable. In logistic regression, the dependent variable is a binary variable, with data coded as 1 or 0 (the Boolean values of true and false).

In this section, we will concentrate on developing a regression model in Python. The example will focus on data exploration from a CSV file.

The classification goal is to anticipate whether the customer will subscribe (1/0) to a term deposit.

import pandas as pd
import numpy as np
from sklearn import preprocessing
import matplotlib.pyplot as plt

plt.rc("font", size=14)
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

import seaborn as sns
sns.set(style="white")
sns.set(style="whitegrid", color_codes=True)
data = pd.read_csv('bank.csv', header=0)
data = data.dropna()
print(data.shape)
print(list(data.columns))

Follow these steps to execute the above code in Anaconda Navigator with “Jupyter Notebook” −

Step 1 − Launch the Jupyter Notebook with Anaconda Navigator.

 

[Screenshots: launching Jupyter Notebook from Anaconda Navigator]

 

Step 2 − Upload the csv file to get the output of the regression model in a systematic manner.

 

[Screenshot: uploading the csv file]

 

Step 3 − Create a new file and execute the above-mentioned code to get the desired output.

 

[Screenshots: executing the code and the resulting output]
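Beyond the exploration above, a hedged sketch of actually fitting the classifier is shown below; it assumes bank.csv has the usual “y” subscription column, which may differ in your copy of the data −

# one-hot encode the categorical columns; keep 'y' as the target
X = pd.get_dummies(data.drop('y', axis=1))
y = data['y']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

logreg = LogisticRegression()
logreg.fit(X_train, y_train)
print("Test accuracy:", logreg.score(X_test, y_test))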


Agile Data Science - Deploying A Predictive System

In this example, we will learn how to create and deploy a predictive model that helps in the prediction of house prices using a Python script. The important framework used for the deployment of the predictive system incorporates Anaconda and “Jupyter Notebook”.

Follow these steps to deploy a predictive system −

Step 1 − Implement the following code to load the csv file and inspect its first records.

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import mpl_toolkits

%matplotlib inline
data = pd.read_csv("kc_house_data.csv")
data.head()

The above code generates the following output −

 

[Output: first rows of kc_house_data.csv]

 

Step 2 − Execute the describe function to get the data types and summary statistics of the attributes of the csv file.

data.describe()

 

describe_function

 

Step 3 − We can drop the values that are not needed as features for the predictive model we are building (here, the 'id' column and the 'price' target).

train1 = data.drop(['id', 'price'],axis=1)
train1.head()

 

[Output: first rows of the remaining feature columns]

 

Step 4 − You can visualize the data according to the records. The data can be used for data science analysis and in the output of white papers.

data.floors.value_counts().plot(kind='bar')

 

[Output: bar chart of floor value counts]
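To complete the picture, here is a hedged sketch of fitting a simple price model on these features; it reuses train1 and data from the steps above and is a minimal linear regression, not a tuned model −

from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# keep only numeric feature columns; 'price' is the target in kc_house_data.csv
X = train1.select_dtypes(include=[np.number]).fillna(0)
y = data['price']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
model = LinearRegression().fit(X_train, y_train)
print("R^2 on test data:", model.score(X_test, y_test))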


Agile Data Science - SparkML

The machine learning library, also called “SparkML” or “MLlib”, consists of common learning algorithms, including classification, regression, clustering and collaborative filtering.

Why learn SparkML for Agile?

Spark is becoming the de-facto platform for building machine learning algorithms and applications. Developers work on Spark to implement machine learning algorithms in a scalable and concise manner within the Spark framework. We will become familiar with the concepts of machine learning, its uses and its algorithms with this framework. Agile always opts for a framework that delivers short and quick results.

ML Algorithms

ML Algorithms incorporate basic learning algorithms such as classification, regression, clustering and collaborative filtering.

Features

It incorporates feature extraction, transformation, dimension reduction and selection.

Pipelines

Pipelines provide tools for developing, evaluating and tuning machine-learning pipelines.

Popular Algorithms

Following are a few popular algorithms −

  • Basic Statistics

  • Regression

  • Classification

  • Recommendation System

  • Clustering

  • Dimensionality Reduction

  • Feature Extraction

  • Optimization

Recommendation System

A recommendation system is a subclass of information filtering system that seeks to predict the “rating” or “preference” that a user would give to a particular item.

Recommendation systems use various filtering approaches, as follows −

Collaborative Filtering

It involves building a model based on a user's past behavior as well as similar decisions made by other users. This filtering model is used to predict items that a user may be interested in.
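A minimal collaborative filtering sketch with Spark's ALS implementation is given below; the tiny ratings table is made up, and a running Spark installation is assumed −

from pyspark.sql import SparkSession
from pyspark.ml.recommendation import ALS

spark = SparkSession.builder.appName("als-sketch").getOrCreate()

# made-up (user, item, rating) triples
ratings = spark.createDataFrame(
   [(0, 0, 4.0), (0, 1, 2.0), (1, 1, 3.0), (1, 2, 4.0), (2, 0, 5.0)],
   ["userId", "itemId", "rating"])

als = ALS(userCol="userId", itemCol="itemId", ratingCol="rating", rank=5, maxIter=5)
model = als.fit(ratings)

# top-2 item recommendations for every user
model.recommendForAllUsers(2).show(truncate=False)

spark.stop()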

Content based Filtering

It involves filtering the discrete characteristics of an item in order to recommend new items with comparable properties.

In the subsequent sections, we will concentrate on the use of a recommendation system for solving a specific problem and on improving prediction performance from the agile methodology perspective.


Agile Data Science - Fixing Prediction Problem

In this section, we will concentrate on fixing a prediction problem with the help of a particular situation.

Consider a company that wants to automate loan eligibility decisions based on customer details provided through an online application form. The details include the name of the customer, gender, marital status, loan amount and other mandatory details.

The details are recorded in the CSV document as shown below −

 

[Screenshot: loan application records in CSV]

 

Execute the following code to evaluate the prediction problem −

import pandas as pd
from sklearn import ensemble
import numpy as np

from scipy.stats import mode
from sklearn import preprocessing,model_selection
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import LabelEncoder

#loading the dataset
data=pd.read_csv('train.csv',index_col='Loan_ID')
def num_missing(x):
   return sum(x.isnull())
 
#imputing the missing values in the data
data['Gender'].fillna(mode(list(data['Gender'])).mode[0], inplace=True)
data['Married'].fillna(mode(list(data['Married'])).mode[0], inplace=True)
data['Self_Employed'].fillna(mode(list(data['Self_Employed'])).mode[0], inplace=True)

# print (data.apply(num_missing, axis=0))
# #imputing mean for the missing value
data['LoanAmount'].fillna(data['LoanAmount'].mean(), inplace=True)
mapping={'0':0,'1':1,'2':2,'3+':3}
data = data.replace({'Dependents':mapping})
data['Dependents'].fillna(data['Dependents'].mean(), inplace=True)
data['Loan_Amount_Term'].fillna(method='ffill',inplace=True)
data['Credit_History'].fillna(method='ffill',inplace=True)
print (data.apply(num_missing,axis=0))

#converting the categorical data to numbers using the label encoder
var_mod = ['Gender','Married','Education','Self_Employed','Property_Area','Loan_Status']
le = LabelEncoder()
for i in var_mod:
   le.fit(list(data[i].values))
   data[i] = le.transform(list(data[i]))
 
#Train test split
x=['Gender','Married','Education','Self_Employed','Property_Area','LoanAmount', 'Loan_Amount_Term','Credit_History','Dependents']
y=['Loan_Status']
print(data[x])
X_train,X_test,y_train,y_test=model_selection.train_test_split(data[x],data[y], test_size=0.2)

# Random forest classifier
# clf = ensemble.RandomForestClassifier(n_estimators=100, criterion='gini',
#    max_depth=3, max_features='auto', n_jobs=-1)
clf = ensemble.RandomForestClassifier(n_estimators=200, max_features=3,
   min_samples_split=5, oob_score=True, n_jobs=-1, criterion='entropy')

clf.fit(X_train, y_train.values.ravel())
accuracy = clf.score(X_test, y_test)
print(accuracy)

Output

The above code generates the following output.

 

[Output: missing-value counts and model accuracy]


Agile Data Science - Improving Prediction Performance

In this section, we will concentrate on building a model that helps predict a student's performance, using a number of attributes. The focus is on displaying the failure results of students in an examination.

Process

The target value of the assessment is G3. This value can be binned and further classified as failure or success. If the G3 value is greater than or equal to 10, the student passes the examination.

Example

Consider the following example, wherein code is executed to predict the performance of students −

import pandas as pd
import numpy as np

""" Read data file as DataFrame """
df = pd.read_csv("student-mat.csv", sep=";")

""" Import ML helpers """
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import GridSearchCV, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.svm import LinearSVC # Support Vector Machine Classifier model

""" Split Data into Training and Testing Sets """
def split_data(X, Y):
   return train_test_split(X, Y, test_size=0.2, random_state=17)

""" Confusion Matrix """
def confuse(y_true, y_pred):
   cm = confusion_matrix(y_true=y_true, y_pred=y_pred)
   # print("\nConfusion Matrix: \n", cm)
   fpr(cm)
   ffr(cm)

""" False Pass Rate """
def fpr(confusion_matrix):
   fp = confusion_matrix[0][1]
   tf = confusion_matrix[0][0]
   rate = float(fp) / (fp + tf)
   print("False Pass Rate: ", rate)

""" False Fail Rate """
def ffr(confusion_matrix):
   ff = confusion_matrix[1][0]
   tp = confusion_matrix[1][1]
   rate = float(ff) / (ff + tp)
   print("False Fail Rate: ", rate)
   return rate

""" Train Model and Print Score """
def train_and_score(X, y):
   X_train, X_test, y_train, y_test = split_data(X, y)
   clf = Pipeline([
      ('reduce_dim', SelectKBest(chi2, k=2)),
      ('train', LinearSVC(C=100))
   ])
   scores = cross_val_score(clf, X_train, y_train, cv=5, n_jobs=2)
   print("Mean Model Accuracy:", np.array(scores).mean())
   clf.fit(X_train, y_train)
   confuse(y_test, clf.predict(X_test))
   print()

""" Main Program """
def main():
   print("\nStudent Performance Prediction")
   # For each feature, encode to categorical values
   class_le = LabelEncoder()
   for column in df[["school", "sex", "address", "famsize", "Pstatus", "Mjob",
         "Fjob", "reason", "guardian", "schoolsup", "famsup", "paid", "activities",
         "nursery", "higher", "internet", "romantic"]].columns:
      df[column] = class_le.fit_transform(df[column].values)
   # Encode G1, G2, G3 as pass or fail binary values
   for i, row in df.iterrows():
      df.loc[i, "G1"] = 1 if row["G1"] >= 10 else 0
      df.loc[i, "G2"] = 1 if row["G2"] >= 10 else 0
      df.loc[i, "G3"] = 1 if row["G3"] >= 10 else 0
   # Target values are G3
   y = df.pop("G3")
   # Feature set is remaining features
   X = df
   print("\n\nModel Accuracy Knowing G1 & G2 Scores")
   print("=====================================")
   train_and_score(X, y)
   # Remove grade report 2
   X.drop(["G2"], axis=1, inplace=True)
   print("\n\nModel Accuracy Knowing Only G1 Score")
   print("=====================================")
   train_and_score(X, y)
   # Remove grade report 1
   X.drop(["G1"], axis=1, inplace=True)
   print("\n\nModel Accuracy Without Knowing Scores")
   print("=====================================")
   train_and_score(X, y)

main()

Output

The above code generates the output as shown below.

The prediction is evaluated with reference to only one variable. With reference to one variable, the student performance prediction is as shown below −

 

[Output: student performance prediction results]


Agile Data Science - Creating Better Scene With Agile & Data Science

Agile methodology helps organizations adapt to change, compete in the market and build high quality products. Organizations are seen to mature with agile methodology, responding to increasing change in requirements from customers. Compiling and synchronizing data with an organization's agile teams is significant for rolling up data across the necessary portfolio.

Build a better plan

Standardized agile performance depends exclusively on the plan. An ordered data schema empowers the productivity, quality and responsiveness of the organization's progress. The level of data consistency is maintained across historical and real-time scenarios.

Consider the following diagram to understand the data science experiment cycle −

 

[Figure: The data science experiment cycle]

 

Data science involves the analysis of requirements, followed by the creation of algorithms based on them. Once the algorithms are designed along with the environmental setup, a user can create experiments and collect data for better analysis.

This ideology maps to the last sprint of agile, which is designated “actions”.

 

[Figure: The Data Value Pyramid, actions layer]

 

Actions include all the mandatory tasks for the last sprint, or level, of the agile methodology. The data science phases (with respect to the life cycle) can be tracked with story cards as action items.

Predictive Analysis and Big data

The future of planning lies in customizing data reports with the data collected from analysis. It will also include manipulation through big data analysis. With the help of big data, discrete pieces of information can be analyzed effectively by slicing and dicing the organization's metrics. Analysis is always considered a better solution.


Agile Data Science - Implementation Of Agile

There are different methodologies used in the agile development process. These methodologies can be used for the data science research process as well.

The flowchart given below shows the different methodologies −

 

[Figure: Types of agile project management]

 

Scrum

In software development terms, Scrum means managing work with a small team and managing a particular project to reveal the strengths and weaknesses of the project.

Crystal methodologies

Crystal methodologies incorporate innovative techniques for product management and execution. With this strategy, teams can approach similar tasks in different ways. The Crystal family is one of the least demanding approaches to apply.

Dynamic Systems Development Method

This delivery framework is basically used to implement the current knowledge system in software methodology.

Feature Driven Development

The focus of this development life cycle is the features of the project. It works best with domain object modeling, developing by feature and code ownership.

Lean Software development

This method aims at increasing the speed of software development at low cost and focuses the team on delivering specific value to the client.

 

Extreme Programming

Extreme programming is a unique software development methodology that focuses on improving software quality. It is effective when the client is not sure about the functionality of the project.

Agile methodologies are taking root in the data science stream, and agile is considered a significant software methodology. With agile, self-organizing, cross-functional teams can work together effectively. As mentioned, there are six primary categories of agile development, and each of them can be streamed with data science as per the requirements. Data science involves an iterative process for statistical insights. Agile helps break down data science modules and process iterations and sprints effectively.

The process of agile data science is an astonishing way of understanding how and why a data science module is executed. It solves problems in creative ways.




