**Are you planning to build your career in Data Science this year?**

**Do you know that the average salary of a Data Scientist is $100,000/yr?**

Do you know that over **10 Million+ new jobs** will be created in the Data Science field in just the **next 3 years**?

If you are a student, a job holder or a job seeker, then this is the right time for you to go for Data Science!

Did you ever wonder why Data Science was the “**Hottest**” job globally in 2018 – 2019?

Above, we gave you just a few examples of why you should move into Data Science and test one of the most in-demand job markets ever created!

The good news is that from this Hands-On Data Science and Machine Learning in R course you will learn everything you need to be a MASTER in Data Science.

Why is Data Science a MUST HAVE nowadays?

Fully explaining why Data Science is a must-have nowadays would take a lot of time. Instead, let’s look at the companies that are using Data Science and Machine Learning. Then you will get an idea of how it can BOOST your salary if you have in-depth knowledge of Data Science & Machine Learning!

Here we list just a few companies:

**Google:** Ad serving, ad targeting, self-driving cars, supercomputers, Google Home etc. Google uses Data Science + ML + AI to make decisions.

**Apple:** Apple uses Data Science in different places, such as Siri and face detection.

**Facebook:** Data Science, Machine Learning and AI are used in graph algorithms for Find a Friend, photo tagging, ad targeting, chatbots, face detection etc.

**NASA:** Uses Data Science for different purposes.

**Microsoft:** Amplifying human ingenuity with Data Science.

So from this list of companies you can understand that everyone, from the big giants to very small startups, is chasing Data Science and Artificial Intelligence, and this is the opportunity for you!

**Why Choose This Data Science with R Course?**

We cover not only “**HOW**” to do it but also “**WHY**” to do it

Theory explained by **Hands On Examples!**

**15+ Hours** long Data Science course

**100+ Study Materials** on each and every topic of Data Science!

**Code Templates** are ready to download! Save a lot of time

**What You Will Learn From The Data Science MASTERCLASS Course:**

Learn what Data Science is and how Data Science is helping the modern world!

What the benefits of Data Science, Machine Learning and Artificial Intelligence are

Be able to **solve Data Science related problems** with the help of R programming

Why R is a must-have for Data Science, AI and Machine Learning!

The right guidance on the path if you want to be a Data Scientist + Data Science interview preparation guide

**How to switch career to Data Science?**

R data structures: **Matrix, Array, Data Frame, Factor, List**

Work with R’s conditional statements, functions, and loops

Systematically explore data in R

Data Science packages: **dplyr, ggplot2**

**Index, slice, and subset data**

Get your data in and out of R: **CSV, Excel, Database, Web, Text Data**

Data Science – Data Visualization: **plot different types of data & draw insights, e.g. Line Chart, Bar Plot, Pie Chart, Histogram, Density Plot, Box Plot, 3D Plot, Mosaic Plot**

Data Science – Data Manipulation: **apply functions, mutate(), filter(), arrange(), summarise(), group_by(), dates** in R

Statistics: a must-have for Data Science

Data Science – Hypothesis Testing

### Section Zero

### Introduction to Data Science

This course will introduce you to Data Science with R. This lecture is an introduction to Data Science.

**Introduction to Data Science:**

Meaning of Business Analytics

Evolution Of Business Analytics

Different Types Of Analytics

Application Of Business Analytics

**Need or use of Business Analytics:**

Before you learn about Machine Learning, Data Science or Business Analytics, you have to know why Data Science is needed and how it is used. Day to day, organizations try to answer different business questions in order to grow.

Such business questions include:

How much should you stock in inventory?

Which seller may miss their order target?

Which factors can influence customers’ preferences?

What is the customers’ sentiment about your product?

What will the customers demand next?

All those business-oriented queries can be solved in two different ways: the **Traditional Approach** and the **Data Driven** approach.

**Traditional Approach:**

In the traditional approach, you hire a senior person who has a lot of knowledge about the traditional way of running the business, and decisions rest on that person’s experience. The data-driven approach, by contrast, is the one being adopted by organizations in today’s world.

**Data Driven:**

Companies collect data, store it and process it. They then analyze the data from multiple sources using a range of statistical techniques, and once the analysis is complete the results are shown in dashboards and reports.

That is the road from data to insight: a roadmap that starts with data collection and ends with insight. In this process the analysis step is the most vital, which is why we ultimately focus on the analytical part.

**Meaning of Business Analytics:**

Business Analytics (BA) is the practice of iterative, methodical exploration of an organization’s data through statistical analysis.

Business Analytics is used by companies committed to data-driven decision making.

Organizations in every industry focus on exploiting data for competitive advantage.

**Evolution of Analytics:**

Around 2008-2010 there was a huge change in data. The volume of data, such as tweets, Facebook posts and YouTube videos, grew enormously. To store that data we needed cheap commodity hardware, and the answer was **Hadoop**, where all of that data can be stored. Then came **MapReduce**. Using the MapReduce technique we can process that data in parallel across many machines, far faster than a single machine could.

Looking at the trend over the last ten years, the cost of storing data has gradually decreased.

And processing power has increased over the last decade.

Combining these two factors, we can now store data and analyze it.

The third stage is machine learning. Using machine learning techniques we can understand data and extract insight from it.

For example, Decision Trees and Neural Networks are machine learning algorithms that were invented decades ago. Back then, running a decision tree on 1 GB of data might have taken half an hour, whereas today, with a system like **Hadoop**, it can take a matter of seconds. These are the three reasons why we are using analytics now: cheap storage, more processing power, and machine learning.

**Types of business Analytics:**

Descriptive Analytics

Diagnostic Analytics

Predictive Analytics

Prescriptive Analytics

Moving from Descriptive to Prescriptive, the complexity gets higher, and so do the business value and the skill required.

**Descriptive Analytics:**

If you want to know what happened, or how a business event happened, that is Descriptive Analytics.

As it requires minimal to no coding, Descriptive Analytics is the easiest technique for data analytics.

It is the analysis of past (historical) data to understand trends and evaluate metrics over time.

There are many sophisticated tools that can handle Descriptive Analytics, such as Tableau, QlikView, MicroStrategy, Google Analytics etc.

EXAMPLE:

Analyzing the past 6 months of sales data and identifying the top 10 selling products

Analyzing customers’ comments on Twitter and counting the positive and negative comments
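The two examples above can be sketched in a few lines of code. The following is a minimal Python sketch for illustration only (the course itself works in R); the sales records, the pre-labelled comments and the function name `top_products` are all made up, not taken from a real data set.

```python
from collections import Counter

# Hypothetical sales records: (product, units_sold). Made-up data.
sales_records = [
    ("notebook", 12), ("pen", 30), ("stapler", 4),
    ("pen", 25), ("notebook", 8), ("marker", 15),
]

def top_products(records, n=10):
    """Sum the units sold per product and return the n best sellers."""
    totals = Counter()
    for product, units in records:
        totals[product] += units
    return totals.most_common(n)

# Hypothetical comments that have already been labelled by sentiment.
labelled_comments = [
    ("Love this product", "positive"),
    ("Stopped working after a week", "negative"),
    ("Great value", "positive"),
]
sentiment_counts = Counter(label for _, label in labelled_comments)

print(top_products(sales_records, n=2))
print(dict(sentiment_counts))
```

Both results only summarize what has already happened, which is what makes them descriptive.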

**Diagnostic Analytics:**

If you then go one level deeper and ask why it happened in the past, that is called Diagnostic Analytics.

It involves techniques that explain why something happened.

By using statistical methods and basic data exploration we can understand the reason behind an event.

EXAMPLE:

Analyzing why sales were down in a particular region

Analyzing why customers leave your organization
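A first diagnostic step for the sales example is often to break the totals down and see where the change comes from. Here is a minimal Python sketch (the course itself works in R), assuming made-up regional sales figures and a hypothetical helper named `change_by_region`:

```python
from collections import defaultdict

# Hypothetical records: (region, quarter, sales_amount). Made-up data.
records = [
    ("north", "Q1", 100), ("north", "Q2", 105),
    ("south", "Q1", 90), ("south", "Q2", 60),
    ("east", "Q1", 80), ("east", "Q2", 78),
]

def change_by_region(rows, before, after):
    """Return the change in total sales per region between two periods."""
    totals = defaultdict(lambda: defaultdict(int))
    for region, period, amount in rows:
        totals[region][period] += amount
    return {region: t[after] - t[before] for region, t in totals.items()}

changes = change_by_region(records, "Q1", "Q2")
# The region with the most negative change is where to look first.
worst_region = min(changes, key=changes.get)

print(changes)
print(worst_region)
```

This only localizes the drop. Explaining it still means exploring what changed in that region.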

**Predictive Analytics:**

Predictive Analytics tells you what will happen or what might happen. It is more complex, and it gives you more business value.

These analytics predict future outcomes.

They can also predict the impact of one variable (weather) on another variable (sales).

It can be further categorized into different techniques such as predictive modelling, data mining and forecasting.

EXAMPLE:

Predicting future sales based on past historical data

Predicting whether a customer will take a product or not

Predicting whether a customer will leave your organization or not
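As a rough illustration of the first example, one of the simplest ways to forecast is to fit a straight-line trend to past sales and extend it one step. The Python sketch below (the course itself works in R) uses ordinary least squares; the monthly figures and the names `fit_line` and `forecast_month_7` are assumptions made for the example, and real forecasting would usually need to handle seasonality and noise as well.

```python
def fit_line(xs, ys):
    """Ordinary least squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Hypothetical sales for the past 6 months. Made-up data.
months = [1, 2, 3, 4, 5, 6]
monthly_sales = [100, 110, 120, 130, 140, 150]

slope, intercept = fit_line(months, monthly_sales)

# Extend the fitted trend one month into the future.
forecast_month_7 = slope * 7 + intercept
print(round(forecast_month_7, 1))
```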

**Prescriptive Analytics:**

The final one is Prescriptive Analytics. It tells you what will happen and also what you have to do about it, just like a doctor’s prescription.

Prescriptive Analytics specifies the best course of action for a business activity, in the form of the output of a prescriptive model.

It uses optimization algorithms to create the final output, or prescription.

Prescriptive Analytics relies on sophisticated tools and techniques as well.

EXAMPLE:

Supply chain: finding the best route to deliver a product

Marketing: finding the optimum budget for marketing expenditure in each channel

Retail: using a price markdown model that tells you when to give discounts or offers and how much discount to give
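To make the supply chain example concrete, here is a minimal Python sketch (the course itself works in R) that picks the shortest delivery route by trying every possible order of stops. The distances, stop names and function names are hypothetical, and brute force is only workable for a handful of stops; real routing tools rely on dedicated optimization algorithms.

```python
from itertools import permutations

# Hypothetical symmetric distances (km) between a depot and three customers.
distances = {
    ("depot", "A"): 4, ("depot", "B"): 7, ("depot", "C"): 3,
    ("A", "B"): 2, ("A", "C"): 5, ("B", "C"): 6,
}

def dist(a, b):
    """Look up a distance regardless of the direction of travel."""
    return distances[(a, b)] if (a, b) in distances else distances[(b, a)]

def route_length(route):
    """Total length of depot -> stops in the given order -> depot."""
    stops = ["depot", *route, "depot"]
    return sum(dist(a, b) for a, b in zip(stops, stops[1:]))

def best_route(customers):
    """Try every visiting order and keep the shortest one."""
    return min(permutations(customers), key=route_length)

route = best_route(["A", "B", "C"])
print(route, route_length(route))
```

The output is the prescription itself: an order of stops to follow, not just a prediction.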

**Decision Automation:**

Using these four processes, we can automate the whole process. But how?

We are dealing with lots of data, applying some technique, and taking a decision based on the result.

In Descriptive Analytics, a human, such as the marketing manager or you, has to take the decision based on the output of the descriptive model.

In Diagnostic Analytics, the human input may be a bit less, but you still have to take the decision based on the output.

In Predictive Analytics, you take the decision based on the data, or rather on the predicted output.

With a prescriptive model, you hardly need to do anything: the model provides the recommended action, and you simply take the decision.

This is all about the introduction to Data Science with R. This brings an end to this post. I encourage you to re-read the post to understand it completely if you haven’t, and **THANK YOU**.

**Application of Business Analytics:**

This is a complete overview of the applications, which we can think about in terms of Industry, Business Function and Technique.

**Telecom:**

Here we try to understand whether a customer will churn or not, that is, whether a customer in the telecom industry will leave the organization. Viewed by business function, this is a case of Customer Analytics; viewed by technique, it is Predictive Modeling.

Similarly, each of the use cases can be looked at by Industry, by Business Function and by Technique. Here we give some analytics examples one by one.

**BFSI:**

Credit scoring: This is based on the customer’s purchasing behavior. The bank can score a customer and, based on that credit score, decide whether the customer deserves a loan or credit card.

Fraud detection: The bank can detect fraud by monitoring money transactions and automatically send a warning alert to the customer that there may be fraud, so that the customer can take action immediately.

**Manufacturing:**

In manufacturing, the use case is predictive maintenance. Traditionally, when a machine fails you have to fix it. With a machine learning approach we instead predict where the next failure is going to happen, at what time, and which machine is going to fail.

**The Transport & Logistics:**

In Transport & Logistics we have TCO (transport cost optimization). For instance, if you have a lot of products to deliver on the same day by different routes, to different places and at different times, a machine learning model can tell you which route to take, which product to deliver where, and at what time, so that you cover everything.

**Retail:**

There are lots of examples here. The recommendation engine is one of them. Whenever you call customer care, the customer care agent already knows your last purchase, latest updates and interests, and based on this they point you to other products you might buy.

**Healthcare:**

Based on many attributes of a tumor, we can predict cancer.

**Insurance:**

Money claims can be used to predict increases in price.

**Education:**

If you check your work for plagiarism again and again, your score will automatically increase.

**By Business Function**

**Customer Analytics:** In retail, we try to understand more about the **customer**: **customer churn, customer models, customer lifetime value.**

**Sales Analytics:** If you are more focused on sales, we try to understand what your leads will be and what your demand might be, and we try to forecast it.

**Marketing:** Similarly, in marketing you can create a marketing mix model, which tells you which marketing channels to invest in and how much to invest.

So the point is that there are a large number of example use cases across all industries and all business functions; we simply use different machine learning techniques in different scenarios.

**By Industry and Retail Analytics**

**Customer Analytics:**

**Customer Segmentation:** The model tells you in detail who your premium customers are, which customers want to leave your organization and which customers are OK. Based on those insights you can set up different types of marketing offer.

**Churn Prediction:** The model tells you which customers might leave your organization.

**Propensity to Buy:** The model tells you which products a customer might be interested in, so that you can send them a mail or message, or proactively call them, to encourage a purchase.

**Sales Analytics:**

**Sales Forecasting:** We can predict your future sales. For example, based on the last 6 years of sales data we can predict sales for the next 3 or 4 quarters, and knowing that is very beneficial.

**Inventory Planning Analysis**, **Store Analytics** & **Up-sell or Cross-sell:** All of these help your growth, and predictive analysis makes you much more strategic in your future planning.

**Price Analytics:**

**Markdown Optimization Model:** This model tells you when to give a discount and how much discount to give.

**Dynamic Pricing:** It tells you the right price to set so that your profit is maximized.

**Discount Analytics** & **What-if Scenario Modelling:** These also come under Price Analytics and help you decide on discounts.

**Marketing Analytics:** has lots of different use cases. You will be surprised to learn about them. The same techniques are used here in different use cases and different industries.

**Market Basket Analysis**

**Marketing Mix Modelling**

**Personalized Customer Offer**

**Social Media Analytics**

These help you in the marketing process.

**By Process**

By Process is about **Sales**, Marketing, Supply Chain and so on.

If you have a manufacturing company, your supply chain manager will always be thinking about the **forecasted demand**, so a machine learning model can help you forecast it. We can also predict the price of raw materials. Similarly, **MRP**, which is **Materials Requirements Planning**, decides when to increase your planning and by how much. **Inventory Optimization** tells you how much inventory you should keep, so that it is neither too low nor too high. **Transport Cost Optimization** helps you find the right route. Finally, using all of these predictive models you can enhance your revenue or decrease your cost.

So, from supplier to customer it is a huge process. **Demand Forecasting**, **Price Prediction**, **Procurement Planning**, **Vendor Selection and Risk Assessment** and **Contextual Intelligence** all come under **Sourcing and Procurement**. **MRP**, **Preventive Maintenance**, **Production and Routing Optimization** and **Inventory Optimization** all come under Production, and **Outbound Logistics: Transport Cost Optimization** comes under Sales and Distribution.

**By Technique**

Take a telecom company. In the **Explore** stage, a customer wants to find a preferable telecom provider. He then decides whether to **Join** provider 1 or provider 2 and joins. Thereafter, in the **In Life** stage, he uses the service, but sometimes he may face problems and call customer care, which is the **Help** stage. Finally he decides whether or not to leave the organization, which is the **Leave** stage.

At each stage there are four kinds of analytical use case:

Descriptive

Diagnostic

Predictive

Prescriptive

The framework tells you that, whatever industry you belong to, you can think of all those different kinds of use cases. Maybe you are doing descriptive analytics, creating charts in Excel day to day, without thinking about how you could predict something or do prescriptive analytics. This Machine Learning course gives you more: using the same data, without requiring any other data, you can create predictive analytics use cases.

To give a couple of examples: take the **campaign conversion rate**. When you run a campaign or promotion and create an Excel chart to see how it is working, that is descriptive but not predictive. On the predictive side, we can work out where the next network failure will happen. We can also work out which customers are more likely to churn, so that you can immediately call them and offer a special discount or promotion.

So this is a good framework for understanding how use cases from **descriptive to prescriptive** can be applied across the customer lifecycle.


In this session we’ll talk about

**Data To Insight**

**What is Machine Learning?**

**How Machine Learning Works?**

**What is Big Data?**

**What is Data Science?**

Data Science is nothing but the technique by which we can achieve data-driven decisions. Whenever we take a decision based on data, instead of only on our own knowledge, that is a data-driven decision, and the science used to achieve this is Data Science.

**Data Driven Decision (DDD) Making:**

Every organization collects data, either internal data (internal to the organization) or external data. After collecting the data they store it: in a big data system such as Hadoop, in a SQL database, in a NoSQL database, in a streaming database if it is streaming data, or in a time series database if it is time series data. So there are different options. At a high level, any organization collects all this data and stores it depending on the type of data, and then processes it. Sometimes the requirement is to process it in real time, and sometimes the requirement is to process it in an overnight batch.

After collecting, storing and processing, you have to analyze. Here we use machine learning algorithms: sometimes forecasting, sometimes predictive modeling, optimization, simulation or artificial intelligence. Whatever we use, we ultimately come up with a system.

The system may be a recommendation system that provides action items, like a prescription you can simply follow, or contextual intelligence that tells you, for example, that since the price of raw materials is high you should not buy right now. Everything is associated with an action item, and from the action you can achieve two things, Cost Reduction and Profit Maximization, which are your actual objectives.

**What is Machine Learning?**

Suppose you are trying to create a system that can detect spam. You look at each mail and try to figure out whether it is spam or not. If it is spam, you put it into a spam folder. One way is to write the rules yourself, for example as if-else statements, and pass each new mail through them.

**Traditional Programming:**

If you try this the traditional programming way, you create different rules, so you write if-else statements, while loops and so on. New data arrives, which is a new mail, and based on your program you get an output: spam or not spam. This is a rule-based approach. The program accepts or rejects based on the rules, and it does not adapt to new situations.

Suppose you have created a rule that any mail containing more than ten images is spam. Unfortunately, you receive a good mail from your friend which contains more than 10 images but is not actually spam. Your program marks it as spam, and even when you say that it is not spam, the program does not learn from that. But look at what happens in Gmail: when you say a mail is not spam, you are training the model, and the next time you receive a mail with more than 10 images from your friend it will not be marked as spam. Why? Because the Gmail spam system is ML based, not rule based.
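The ten-image rule can be written down directly, which makes its weakness easy to see. This is a minimal Python sketch (the course itself works in R); the mail dictionary, its field names and the function name `rule_based_is_spam` are hypothetical.

```python
def rule_based_is_spam(mail):
    """Fixed rule from the text: more than ten images means spam."""
    return mail["image_count"] > 10

# Hypothetical mail from a friend with 12 holiday photos attached.
friend_mail = {"sender": "friend@example.com", "image_count": 12}

first_verdict = rule_based_is_spam(friend_mail)

# The user says "this is not spam", but the rule-based program has nowhere
# to use that feedback, so the same mail gets the same verdict again.
user_feedback = {"is_spam": False}
second_verdict = rule_based_is_spam(friend_mail)

print(first_verdict, second_verdict)
```

The only way to change the verdict is for a programmer to edit the rule by hand.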

**Machine Learning:**

What is the difference between rule based and ML based?

An ML-based system adapts to new situations and improves with each new piece of data. A rule-based system, on the other hand, accepts or rejects based on its rules and does not adapt to the situation.

We have historical data, which here is simply historical email data: the messages from the previous 6 months to 1 year. Whatever mail came in is the historical data, and it has an output as well, namely whether each mail was spam or not. Whenever you marked a mail as not spam, it was placed in your inbox.

This data is used to train a machine learning algorithm. We pick an algorithm, SVM or another one (there are lots of algorithms), pass in the data together with the output, and the machine learns by itself and creates an algorithm of its own. An algorithm is just a set of rules to be followed by a computer in calculations or other problem-solving operations. The learned algorithm might say that a mail with more than 10 images is normally spam, but a mail with more than ten images that comes from your friend is not spam. Because you have passed the historical data and output to the machine, your system can learn using the machine learning algorithm. So the next time such a mail comes from your friend, it won’t be marked as spam.
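To illustrate the idea of learning the rule from labelled history, here is a deliberately tiny Python sketch (the course itself works in R). It is not SVM: it is a simple majority-vote lookup table over two made-up features, used only to show the shape of "pass in data plus output, get a rule back". The history, the feature names and the functions `train` and `predict` are all hypothetical.

```python
from collections import Counter, defaultdict

# Hypothetical labelled history: (many_images, from_friend, is_spam).
history = [
    (True, False, True),
    (True, False, True),
    (True, True, False),
    (True, True, False),
    (False, False, False),
    (False, True, False),
]

def train(examples):
    """Learn, for each feature combination, the label seen most often."""
    votes = defaultdict(Counter)
    for many_images, from_friend, is_spam in examples:
        votes[(many_images, from_friend)][is_spam] += 1
    return {features: counts.most_common(1)[0][0]
            for features, counts in votes.items()}

def predict(model, many_images, from_friend, default=False):
    """Apply the learned table; fall back to 'not spam' if unseen."""
    return model.get((many_images, from_friend), default)

model = train(history)
print(predict(model, many_images=True, from_friend=False))
print(predict(model, many_images=True, from_friend=True))
```

Nobody wrote the "friend" exception by hand here; it comes out of the labelled examples, and it would change if the examples changed.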

**What is Machine Learning? How it works?**

Here we don’t write the rules ourselves. Rather, we pass the data to an algorithm and the system learns by itself. All the emails from your last 6 months are the historical data, which is divided into two parts: Train (Training Set) and Test (Evaluation Set). For example, five months of data form the training set and one month of data forms the evaluation set. Then we try different algorithms, such as Linear Regression, Logistic Regression and SVM, evaluate each model, and based on the output choose the one with the highest accuracy. That is how machine learning works.

Tomorrow you will have new data. It becomes part of the historical data and in the same way enters the training data set, so the data set the model learns from keeps changing, and the model predicts based on that. Each model is tested on unseen data (the Evaluation Set) and a score is calculated for each of the algorithms. Then the best algorithm is chosen.
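The train, evaluate and choose loop described above can be sketched as follows. This is a minimal Python illustration (the course itself works in R) with made-up email data and two toy candidate "algorithms", a majority-class baseline and a learned image-count threshold, standing in for real ones such as Logistic Regression or SVM. All names and numbers are assumptions made for the example.

```python
# Hypothetical history: (month, image_count, is_spam). Made-up data.
emails = [
    (1, 0, False), (1, 12, True),
    (2, 1, False), (2, 15, True),
    (3, 2, False), (3, 11, True),
    (4, 0, False), (4, 3, False),
    (5, 14, True), (5, 1, False),
    (6, 13, True), (6, 2, False), (6, 0, False),
]

# Five months for training, the sixth month held out for evaluation.
training_set = [(x, y) for month, x, y in emails if month <= 5]
evaluation_set = [(x, y) for month, x, y in emails if month == 6]

def fit_majority(data):
    """Baseline: always predict the most common training label."""
    labels = [y for _, y in data]
    majority = labels.count(True) > len(labels) / 2
    return lambda x: majority

def fit_threshold(data):
    """Pick the image-count cut-off that is right most often in training."""
    def correct(t):
        return sum((x > t) == y for x, y in data)
    best_t = max(sorted({x for x, _ in data}), key=correct)
    return lambda x: x > best_t

def accuracy(model, data):
    """Share of examples the model labels correctly."""
    return sum(model(x) == y for x, y in data) / len(data)

candidates = {"majority": fit_majority, "threshold": fit_threshold}
# Train each candidate on the training set, score it on unseen data.
scores = {name: accuracy(fit(training_set), evaluation_set)
          for name, fit in candidates.items()}
best_algorithm = max(scores, key=scores.get)

print(scores)
print(best_algorithm)
```

Scoring on the held-out month rather than on the training months is what tells you how each candidate behaves on mail it has not seen.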

**What is Big Data and the Connection between Big Data and Data Science**

Big data is commonly described in terms of 4 Vs, but here we discuss 3 Vs.

**Volume:**

If your data volume is so large that you cannot store it in a simple SQL database, that is a big data challenge: you have to store terabytes and petabytes of data. Much of this data is of unknown value, such as Twitter data feeds, click streams on a webpage or mobile app, or readings from sensor-enabled equipment.

**Velocity:**

Data arrives with great velocity, maybe from GPS, from Twitter or from Facebook. Every second there are so many tweets and so many Facebook posts, and you have to store all of them. That again is a big data challenge, which is why you use a system like Hadoop. Velocity deals with the speed at which data flows in from sources such as business processes, application logs, networks, social media sites, sensors, mobile devices etc.

**Variety:**

Similarly, Variety refers to the many types of data, which differ in nature. Earlier, data was stored in a very structured format, but today much of the data is unstructured: videos and images, for example, are unstructured data. The data can also be inconsistent at times. Since big data is so varied in nature, getting insight out of it is very difficult, and it is not easy to extract that insight manually. That is why we need Decision Science, and why we need machine learning algorithms that learn by themselves.


**Career Transition To ML**

**Types of Analytics Companies:**

What different kinds of analytics companies are there? Who actually wants to hire Data Science professionals? Data Science is not done only by Data Scientists; there are different kinds of roles.

**Pure Play Analytics:**

These companies mostly do only analytics projects and are purely Data Science focused. The work can be predictive, descriptive or prescriptive, or they may be working with AI and lots of other things, but they are Pure Play Analytics companies.

**IT MNC:**

The second kind of company is the **IT MNC**. They already had Big Data, Cloud and IoT, and now they have Data Science as well. The name may differ between companies (Data Science, Analytics, Business Analytics or Machine Learning), and the profile may differ, but they are hiring a large number of Data Science professionals. Why are they hiring such a large number? Because these companies get a lot of big projects, and some client companies want to change completely and become data-driven organizations.

**BPO/BPS Analytics:**

Generally, these BPS companies take on outsourced back-end processes. These processes are very interesting because they are back-end processes that can be automated, and achieving that automation is why these companies hire lots of consultants and lots of Data Scientists. If you get a call from TCS BPO or CTS BPO, don’t be put off because it is coming from a BPS; rather, you should be happy that the call is coming from the analytics stream.

**In House Analytics:**

Lots of companies and industries think that they should not give their data to any third party, because data is the new soil: it is important and crucial for them, and they do not want to share it with any other company. That’s why they are hiring data scientists to build their own data science teams within their organizations.

**Analytics Product Company:**

These companies are creating some kind of analytics product which they are trying to sell in the market. In future, when the domain has stabilized, they will be able to sell these products to a lot of customers, but right now they are hiring two kinds of people: those who can build the products and those who can sell them.

So, you might be interested in those companies as well. Maybe you are in IT, you don’t have that level of industry focus, and you are worried about where you would fit in. You don’t need to worry: learn the techniques, learn a couple of use cases, build a couple of models in an industry, then gradually move into that industry and slowly become a Data Science professional. That is all about analytics companies.

**Analytics Roles-Track, Skillset, Technology:**

**Track-Role:**

**Data Engineer:** In analytics, the Data Scientist is not the only role; another one is the Data Engineer. Who is the Data Engineer? Previously, we said that data is first collected, stored and processed. The work in these three steps is done by the Data Engineer, who collects the data from the various systems, so that the Data Scientist can extract insight out of it. Starting as a Data Engineer you can become a Data Architect, and the skill set required is ETL, Data Warehouse, Data Lake, Data Modelling and Distributed Computing.

**Data Scientist:** This course deals with the Data Scientist track. You start as a Data Analyst, where your work is creating good dashboards, doing descriptive analytics, cleaning data and other routine work, while the modeling part is taken care of by a Data Scientist. Slowly, you can become a Data Scientist yourself, and then move on to Senior Data Scientist, Data Analytics Manager and Chief Data Scientist. The skill set for this track is Math, Stat, Machine Learning, Coding, Business and BI, and the technologies are R, Python, SAS, Revo R, SPSS and Azure ML. Here we discuss only R.

**Business Consultant:** In Data Science the consultant role is very different. You need a very good understanding of your business and of the customers in that industry, as well as a very good understanding of Data Science projects, so that you not only help your customer identify an analytics use case but also know how to implement it. The skill set required is Business + Technology.

**Presale Consultant:** Here pre-sales is very different from normal pre-sales. In normal pre-sales we respond to RFPs and RFIs, perhaps without much technical acumen, but for this kind of profile you need a skill set with which you can build an end-to-end solution for a Data Science project. You need More Business + Less Technology, Client Presentation, Latest Technology and Market Trends, so that you can get on client calls, convince them that you know these techniques, and build the solution for them.

**Data Visualization & BI Analyst:** They create very good dashboards, and can slowly become Data Scientists by learning these techniques.

Finally, we come to the original Statistician. The statistician track roles are Quality Assurance and Quality Manager, the skill set is Basic Stat, Project Management and Business Process, and the technologies are Excel, PPT and SPSS.

Whatever your past experience, you are now learning this skill set on top of it, and you can pick and choose the role, and the CTC, based on your preference.

**Analytics Role-Data Scientist:**

Now, let’s talk about the Data Scientist. If you want to be a Data Scientist you need three skills. The first is programming knowledge, in any language you are comfortable with: it can be R, Python, SAS or anything else. The second is Statistics: you need to understand Machine Learning concepts and Statistics. The third is substantive expertise, such as business or marketing skill.

You need these three skills to become a Data Science professional, but you also need a good balance between them. If you are a fresher, you should focus more on programming and statistics skills; the business skill may not be required that much. If you are senior, you need a very good understanding of the business along with statistics, and programming may not be required that much. If you are a mid-level IT professional, you need a balance of all three. If you have already learned R, you do not also need to learn Python, SQL or something else. Once you pick R, you can move across all three circles and get stronger in any one of them.

**Roadmap to Become A Data Scientist:**

**Step 1: Learn R/Python/SAS:**

Here we choose R as the preferred programming language for this course. We learn the basics of R: how to import data, how to clean data, how to manipulate data, and how to visualize data, along with the basics of Machine Learning concepts.

- **Importing Data**
- **Data Visualization**
- **Data Manipulation**
- **Modeling / Machine Learning**
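As a rough illustration of the importing and manipulation topics listed above, here is a minimal sketch in base R. It is not taken from the course: it assumes the built-in mtcars data set as stand-in data and a temporary CSV file in place of a real file on your drive.

```r
# Importing: write a built-in data set to a temporary CSV, then read it back.
# (The temporary file stands in for a real file on your drive.)
csv_path <- tempfile(fileext = ".csv")
write.csv(mtcars, csv_path, row.names = FALSE)
cars <- read.csv(csv_path)

# Manipulation: keep only the 4-cylinder cars.
four_cyl <- subset(cars, cyl == 4)

# Manipulation: average miles per gallon for each number of gears.
mpg_by_gear <- aggregate(mpg ~ gear, data = cars, FUN = mean)
print(mpg_by_gear)
```

The same pattern (read, filter, summarise) applies to any CSV once csv_path points at your own file.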

**Step 2: Statistics:**

Next we learn statistics. Statistics has been in use for a long time. If you have a problem statement, what do you do? You analyze your past data, build a hypothesis, predict a future result, and check that you actually get the predicted result.

Here we will be learning:

- **How to test a Hypothesis?**
- **What is Univariate Analysis?**
- **What is Bivariate Analysis?**
- **What is Multivariate Analysis?**
- **What is Correlation?**
- **What is regression?**
- **What is Chi-Square and ANOVA?**

After these points you will understand the basics of statistics and know how to apply that knowledge in a programming language, which here is R.
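To give a flavour of the topics above, the following sketch runs a hypothesis test, a correlation, a chi-square test, and an ANOVA in base R. The mtcars data set and the particular questions asked of it are assumptions chosen for illustration, not examples from the course.

```r
# Hypothesis test: do automatic (am = 0) and manual (am = 1) cars
# differ in mean miles per gallon? (Welch two-sample t-test)
t_result <- t.test(mpg ~ am, data = mtcars)
print(t_result$p.value)

# Correlation between weight and fuel efficiency.
wt_mpg_cor <- cor(mtcars$wt, mtcars$mpg)
print(wt_mpg_cor)

# Chi-square test of independence between two categorical variables.
# Some cell counts are small, so R warns that the approximation may be
# inexact; the warning is suppressed here to keep the output short.
chi_result <- suppressWarnings(chisq.test(table(mtcars$cyl, mtcars$am)))
print(chi_result$p.value)

# One-way ANOVA: does mean mpg differ across cylinder groups?
anova_fit <- aov(mpg ~ factor(cyl), data = mtcars)
print(summary(anova_fit))
```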

**Step 3: Machine Learning:**

After that, we learn Machine Learning concepts. Machine Learning is a type of Artificial Intelligence that teaches a system to learn from data and make decisions.

We approach Machine Learning through questions such as:

- **What is a supervised algorithm?**
- **What is an unsupervised algorithm?**
- **What are the different kinds of regression?**

We start with a regression algorithm, **Linear Regression**, then a classification algorithm, **Logistic Regression**, and after those two algorithms we move to **Advanced Algorithms**.
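The two algorithms named above can be sketched in a few lines of base R. This is only an illustration on the built-in mtcars data set; the choice of variables (predicting mpg and the transmission type am from weight wt) is an assumption made for the example.

```r
# Linear regression: predict a numeric outcome (mpg) from weight.
lin_fit <- lm(mpg ~ wt, data = mtcars)
print(coef(lin_fit))

# Logistic regression: predict a binary outcome (am: 0 = automatic,
# 1 = manual) from weight, using a binomial family.
log_fit <- glm(am ~ wt, data = mtcars, family = binomial)

# Predicted probabilities that each car has a manual transmission.
predicted_prob <- predict(log_fit, type = "response")
print(head(predicted_prob))
```

lm() returns a continuous prediction, while glm() with family = binomial returns a probability between 0 and 1 that can be thresholded into a class.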

**Step 4: Advanced Analytics:**

The advanced algorithms and techniques are:

- **Tree based Algorithms like Decision Tree**
- **Bagging & Boosting**
- **Cross Validation**
- **Model Tuning**
- **Deep Learning**
- **NLP**

All of these are included in Advanced Analytics. We will then learn more about Random Forest and other advanced algorithms. Advanced Analytics is mainly used to predict future events across industries.

**Step 5: Compete in Kaggle:**

Once you are done with these, we will work through a couple of examples using real datasets from **Kaggle**. Kaggle is a platform for:

- **Predictive modelling and practice problems**
- **Open challenges**

It is a competition platform where data scientists, statisticians, and data miners from all over the world compete to produce the best models for predicting and describing datasets uploaded by companies and users, and learn from each other. Kagglers come from a wide variety of backgrounds, such as computer science, computer vision, biology, medicine, and glaciology. We will take a couple of problems from there and try to solve a business problem using Machine Learning algorithms.

**Step 6: Analytics Project Life Cycle:**

Once step 5 is done, we join the dots. Up to this point we will have learned a lot of concepts. Now we learn how those concepts are used in industry to actually deliver a project: how does an **Analytics Project Lifecycle** go? If you are an experienced professional trying to switch your career to Machine Learning, this is probably very important for you to understand, so that you can deliver a project and lead a team of data scientists as well. The Analytics Project Life Cycle steps include:

- **Business Problem Understanding**
- **Data Collection**
- **Data Cleaning**
- **Data Exploration**
- **Feature Engineering**
- **Model Building**
- **Model Evaluation and Tuning**
- **Model Deployment**

**Step 7: High Level-Digital Transformation Services:**

Once the sixth step is done, we discuss a few concepts from other technologies under **High Level-Digital Transformation services**, such as:

- **Big Data Basics**
- **Cloud Basics**
- **IoT Basics**
- **Automation Basics**

But mostly we will focus on steps one to six, from Learn R/Python/SAS to the Analytics Project Life Cycle.

### Course Curriculum Overview

### INTRODUCTION TO R

**Introduction to R**

**R** is a programming language created by statisticians. It provides objects, operators, and functions that allow the user to explore, model, and visualize data. R is an implementation of the S language, which was developed at AT&T Bell Labs; R itself was created by Ross Ihaka and Robert Gentleman at the University of Auckland.

It is a free, open-source language, so anyone can use and modify it. R is licensed under the GNU General Public License, with copyright held by The R Foundation for Statistical Computing. There are no subscription charges.

**R** has a huge, active community. If you have a question about any function or library, you can search for it online and usually get a proper answer right away.

Because it is open source, you, I, and lots of other Data Scientists can build functions, bundle them into packages, and upload them to a repository called CRAN, from which anyone can download them. Over 7800 packages are listed on CRAN, including some of the most powerful and commonly used R packages.

**R** is cross-platform. It runs on different operating systems and hardware. It is generally used on GNU/Linux, Macintosh, and Microsoft Windows, on both 32-bit and 64-bit processors.

**R** is mainly used for statistical analysis and analytics. You might wonder why you should learn yet another language if you already know Java or other programming languages. The reason is that R is built for statistical analysis, and after doing this course you will see that its results are easy to interpret.

**R** is a leading tool for statistics, data analysis, and machine learning. It is more than a statistical package: you can build your own objects, functions, and packages.

It is easy to use, and the coding style is quite simple.

**R** lets you interact with many data sources, including ODBC-compliant databases (Excel, Access). R can also handle CSV, SAS, SPSS, XML, and many other file types.

It can also create very good visualizations. It can produce graphics output in PDF, JPG, PNG, and SVG formats, and table output for LaTeX and HTML. It has a lot of built-in functions (packages and libraries), and the results are easy to interpret, which is why lots of companies, big and small, use R. Companies like Microsoft and Google use R actively. One big reason is that it is free, so you can do a proof of concept (POC) with it.

So be confident that R is worth learning: it is hugely popular, and knowing R raises your market value in Data Science.

**Steps for R Installation:**

**Step 1: Download and install R base:**

Let's discuss how to download and install R. We will actually be working in RStudio, but R base has to be installed first. Download the R base installer for your operating system (Windows or Mac) and run it.

**Step 2: Download and Install R Studio:**

RStudio is a friendlier interface on top of R base, not a replacement for it. Back around 2010 we had only R base, which was not good looking at all: it has a console where you type a command and get the answer quickly. RStudio is what we are going to use in this course, but to use it you first have to install R base and only then install and open RStudio.

R is updated all the time, and you might get warning messages saying that a package or library was built under a newer version of R. That is a reminder that you are probably using an old version, and a good time to check your R version and update it.

**Opening R Console:**

**Step 3: Open R Console (R GUI from your Desktop)**

After installing R base you get an R icon on your desktop. Double-click it to open a very simple-looking console. We will not be working in it: after installing, just close it. It only needs to be there for RStudio.

**Opening RStudio:**

**Step 4: Open RStudio (RStudio GUI from your desktop)**

First download and install RStudio, then search for the RStudio icon and open it. With RStudio open you can create a file and type commands, and as you type RStudio prompts you with suggestions. Press Ctrl+Enter to run the current line, or select particular commands and press Ctrl+Enter to run the selection. But before you actually code or learn the basics of R, let me walk you through the interface.

First, go to the Source panel. R base has only the console, while RStudio has a Source panel as well. In the Source panel you can write all your commands without running anything, and then run them one by one when you are ready. That is how RStudio is better than R base.

In the Files panel you can see your files, and in the Plots panel you can see the plots you create. There are also tabs for Packages (libraries), Help, and the Viewer.

Then you have the Environment panel: whatever you store is kept in your Global Environment.

So RStudio consists of the Source panel, the Console panel, the Environment panel, and a last panel with Files, Plots, Packages, Help, and Viewer.

**Setting up R Studio:**

Previously we talked about downloading and installing R base and RStudio. As mentioned, R base is mandatory for running RStudio, so first download, install, open, and close R base.

Then open RStudio. Now we will learn how to set it up. Suppose you want to import a data set that is on your drive: you also need to know where your working directory is.

The command for this is getwd() (get working directory).

Suppose you want to set your working directory to some other location, maybe on the D drive. Copy the folder path and paste it inside setwd(). When pasting, keep in mind that R does not accept single backslashes in a path: replace each backslash with a forward slash, or double it. Then your working directory is set to that folder.

If you run getwd() again, you will see that it has changed: if your working directory was on the C drive earlier, it now points to the new folder. You can now save files to, or import files from, that folder.
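The working-directory steps can be sketched as follows. The example uses tempdir() as a stand-in folder so that it runs on any machine; the D-drive paths in the comments are hypothetical and only show how a pasted Windows path would need to be written.

```r
# Show the current working directory and remember it.
old_dir <- getwd()
print(old_dir)

# Change the working directory. tempdir() stands in for your own folder.
# A pasted Windows path would have to be written with forward slashes,
#   setwd("D:/projects/data")        # hypothetical path
# or with doubled backslashes,
#   setwd("D:\\projects\\data")      # hypothetical path
setwd(tempdir())

# Confirm that the working directory has changed.
new_dir <- getwd()
print(new_dir)

# Go back to where we started.
setwd(old_dir)
```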

### R Programming

**Conditional Statement:**

R has two kinds of conditional statement: the if-else statement and the nested if-else statement. They work like in any other programming language.

If-else statements are a very important part of R programming, alongside its many powerful packages for data manipulation. The condition produces a logical value, and the statement that follows is carried out only when that value is TRUE.

If statement:

If the condition is true, the code in the if block runs; if it is not met, control goes to the else statement.

For example, you can check whether a is less than 4.

If it is less than 4, the condition is satisfied, so control enters the block and prints "a is less than 4". Otherwise, inside the else, you check again whether a == 4.

If a == 4, it prints "a has the value of 4"; otherwise it goes on to another statement, and so on. So in an if statement you write a condition, it is checked, and if it is satisfied control enters the block and does whatever you ordered; otherwise it goes to the else statement.

The same thing can be done with the nested if-else. Here you say that if a is less than 4, print "a is less than 4", exactly as before, but instead of writing multiple else blocks you write else if (a == 4) to print "a has the value of 4", and a final else to print "a is greater than 4".

Now look at the output, which is the same in both cases. Since we stored 4 in a, the if-else statement prints "a has the value of 4".

The same happens for the nested if-else statement.

Another variant is the ifelse() function. With ifelse() you can say that when a == 4 the output is "Yes", otherwise "No". It does not give the level of control of the if-else statement, but it does the job of checking a condition: if it is met you return one value, and if it is not met you return another.
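The three variants described above might look like this in code. This is a reconstruction from the description, with a set to 4; the variable names result_nested, result_chain, and answer are made up for the example.

```r
a <- 4

# Variant 1: if-else with a second check inside the else block.
if (a < 4) {
  result_nested <- "a is less than 4"
} else {
  if (a == 4) {
    result_nested <- "a has the value of 4"
  } else {
    result_nested <- "a is greater than 4"
  }
}
print(result_nested)

# Variant 2: the same logic written as an else-if chain.
if (a < 4) {
  result_chain <- "a is less than 4"
} else if (a == 4) {
  result_chain <- "a has the value of 4"
} else {
  result_chain <- "a is greater than 4"
}
print(result_chain)

# Variant 3: the ifelse() function, one condition and two possible values.
answer <- ifelse(a == 4, "Yes", "No")
print(answer)
```

Unlike if, ifelse() is vectorised: given a vector of conditions it returns a vector of results, which makes it handy for creating new columns.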

**Loops-**

**For Loop in R:**

Suppose you store 1, 2, 3, 4 in a vector using the combine function c(). Then you write for, with the condition that i takes each value in the vector, so the loop runs 4 times, and inside the loop you ask it to **print(i)**.

So i starts at 1, then 2, then 3, then 4, and the output you see is 1, 2, 3, 4. That is how a for loop works; it is very similar to other languages.
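A minimal sketch of the loop just described, assuming the vector is called numbers:

```r
# Store 1, 2, 3, 4 in a vector with the combine function c().
numbers <- c(1, 2, 3, 4)

# i takes each value of the vector in turn, so the body runs 4 times.
for (i in numbers) {
  print(i)
}
```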

**While Loop:**

A while loop repeats a statement or group of statements while the given condition is true. It tests the condition before executing the loop body, and the code inside the loop keeps executing as long as the test expression remains true.

The loop runs as long as the condition, here x less than 6, is met. Suppose we start with x equal to 1. We check that it is less than 6, print x, and before leaving the loop body increase x by 1. So it prints 1, x becomes 2, the condition is checked again, it prints 2, x increases to 3, and so on up to 5. After printing 5, x is increased to 6, so it won't enter the while loop again. That is how a while loop works. It is pretty simple.
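The while loop walked through above can be sketched as:

```r
x <- 1

# The condition is tested before every pass through the body.
while (x < 6) {
  print(x)
  x <- x + 1   # increase x before the next check
}
```

Once x reaches 6 the condition is FALSE, so the body is skipped and the loop ends.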

**Repeat Loop:**

Now let's discuss the repeat loop. A repeat loop executes a sequence of statements multiple times and abbreviates the code that manages the loop variable.

In the earlier cases, the for loop and the while loop, we give the condition before the loop even starts, but in a repeat loop we do not state any condition up front. Here we start with x equal to 1 (x = 1), print x, increase x by 1, and then give the condition: if x == 6 then break. break simply means that you exit the loop. That is how we write the condition in a repeat loop.

So you keep executing the statements and check the condition at the end of each pass. Until the condition is satisfied it prints 1, 2, 3, 4, 5; then x is increased by 1 to become 6 (x = 6), and finally you break out of the loop.
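A sketch of the repeat loop described above:

```r
x <- 1

# repeat has no condition of its own; the exit test lives inside the body.
repeat {
  print(x)
  x <- x + 1
  if (x == 6) {
    break   # leave the loop once x reaches 6
  }
}
```

Because the test sits at the end of the body, a repeat loop always runs at least once.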

**Loops in R:**

Now we discuss the **break statement** and the **next statement**.

There are situations where we have to terminate a loop without executing all its statements. In that case we can use the break and next statements. In any loop, just as in the while and repeat loops, you can break out completely with the break statement. If you instead want to skip the current iteration and continue the loop, use the next statement.

**Break Statement:**

In a repeat loop, if you want to leave the loop based on a condition, you write a break statement. break is used inside a loop to stop the iteration and pass control outside the loop. Unlike in C-like languages, it is not needed to end a case in R's switch statement (covered in the next chapter). break can also be used inside the else branch of an if-else statement.

For example:

Suppose you have 15 statements inside the loop and you want to exit the loop when a certain condition is true; otherwise all of them must execute. In that case, use an if statement to check the expression and place the break statement inside the if block. If the condition is true, R executes the break and control leaves the loop completely; otherwise all the statements are executed.
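As a small illustration of the pattern above, the sketch below places break inside an if block. The loop over 1 to 15 and the exit condition i > 5 are hypothetical choices made only for the example.

```r
visited <- c()

for (i in 1:15) {
  # Hypothetical exit condition: stop as soon as i goes past 5.
  if (i > 5) {
    break
  }
  # Only reached while the condition above is FALSE.
  visited <- c(visited, i)
}

print(visited)
```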

**Next Statement:**

The next statement is useful when we want to skip the current iteration of a loop without terminating it. On encountering next, R skips further evaluation and starts the next iteration of the loop. It behaves like continue in other languages: it discontinues a particular iteration and jumps to the next cycle, that is, to the evaluation of the condition controlling the current loop. next can also be used inside the else branch of an if-else statement.

Suppose that in a for loop i runs from 1 to 5 and we print the value of i, but whenever i equals 3 (i == 3) we want to skip that iteration. What happens? It prints all the values except i == 3: it prints 1 and 2, skips 3, then prints 4 and 5.
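The skip described above can be sketched as follows; the kept vector is an addition for the example so that the result can be inspected after the loop.

```r
kept <- c()

for (i in 1:5) {
  if (i == 3) {
    next   # skip the rest of this iteration only
  }
  print(i)
  kept <- c(kept, i)
}
```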

**Function in R:**

What is a function?

A function is exactly the same concept as in any other language. It is like a black box: you give it input, and based on its definition it works on that input and gives you the output.

For example, R has a function called mean(): it takes whatever input it is given and returns the output. If you give it 1, 5, 6, 7, combined with the combine function c(), you get 4.75.

An R function is created using the keyword 'function'.

**Function Structure & Documentation:**

R has a lot of built-in functions: numeric functions, statistical functions, character functions, and many others. Before we dive deep into them, let's understand the structure of a function.

A function has a body, where you write what it does. Suppose we create a function for addition: if you pass two arguments x and y, it adds them and returns the value. You can also set default values. If you declare the arguments as x, y, there is no default value, but if you declare them as x, y = 1, you are saying that y is 1 whenever nothing is passed for it. So if you pass only 2 to this function, it gives the result 2 + 1 = 3.

We will learn how to write a function of our own, but before that we will learn a couple of built-in functions in R, which are very popular and which we use in our subsequent courses.
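The addition function with a default value, as described above, might be written like this. The name add is an assumption; the original text does not name the function.

```r
# A function is created with the keyword function.
# y has a default value of 1, x has none.
add <- function(x, y = 1) {
  return(x + y)
}

add(2)       # y falls back to its default, so this is 2 + 1
add(2, 5)    # both arguments supplied

# A built-in function works the same way: input in, output out.
mean(c(1, 5, 6, 7))
```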

**Function Arguments Matching:**

An argument in the fifth or sixth position is very difficult to remember. That is why, whenever you pass values to a function, you can pass them in two ways. One is by position, where, for the standard deviation function for example, the first position is the value and the second is na.rm. If you forget the positions, the other way is by name.

Suppose there is an argument in the fifth position that you want to change. Instead of using the position you can pass it by name: you can say na.rm equal to FALSE or TRUE, and R will understand.
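Here is a short sketch of both matching styles using sd(), whose arguments are x and na.rm. The values vector is made up for the example.

```r
values <- c(2, 4, 4, NA, 6)

# By position: first argument is the data, second is na.rm.
by_position <- sd(values, TRUE)

# By name: no need to remember where na.rm sits.
by_name <- sd(values, na.rm = TRUE)

# Named arguments can even be given in a different order.
reordered <- sd(na.rm = TRUE, x = values)

print(c(by_position, by_name, reordered))

# Without na.rm = TRUE the missing value propagates.
print(sd(values))
```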

**Introduction to Function:**

Now we talk about built-in functions. There are a couple of numeric functions, a couple of statistical functions, and a couple of character functions.

First, we discuss numeric functions.

**Numeric Function:**

We will try the numeric functions in RStudio and go through them one by one.

We start with an easy one, **sqrt(x)**.

**sqrt(x):** x is a numeric value or a valid numeric expression whose square root you want. If x is positive, sqrt() returns its square root.

If x is negative, sqrt() returns NaN (with a warning).

If x is NaN (not a number) or negative infinity, sqrt() returns NaN.

If x is positive infinity, sqrt() returns positive infinity.

**ceiling(x):** One of R's math functions. It returns the smallest integer value that is greater than or equal to a number or expression. x is the numeric value whose ceiling you want. If x is a positive or negative number, ceiling() returns its ceiling value.

If x is positive or negative zero, ceiling() returns zero.

If x is NaN (not a number), ceiling() returns NaN.

If x is positive or negative infinity, ceiling() returns the same value.

**floor(x):** One of R's math functions. It returns the largest integer value that is less than or equal to a number or expression.

x is the numeric value whose floor you want. If x is a positive or negative number, floor() returns its floor value.

If x is positive or negative zero, floor() returns zero.

If x is NaN (not a number), floor() returns NaN.

If x is positive or negative infinity, floor() returns the same value.
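The cases listed above can be checked directly in the console. This is a small sketch; suppressWarnings() is used only to hide the warning that sqrt() raises for negative input.

```r
# sqrt(): positive, negative, and infinite input.
root_pos <- sqrt(16)                      # 4
root_neg <- suppressWarnings(sqrt(-1))    # NaN
root_inf <- sqrt(Inf)                     # Inf

# ceiling(): smallest integer greater than or equal to x.
ceil_pos <- ceiling(2.3)                  # 3
ceil_neg <- ceiling(-2.3)                 # -2

# floor(): largest integer less than or equal to x.
floor_pos <- floor(2.7)                   # 2
floor_neg <- floor(-2.3)                  # -3

# NaN and infinity pass through ceiling() and floor() unchanged.
print(c(ceiling(NaN), floor(NaN), ceiling(Inf), floor(-Inf)))
```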

**exp(x):** Computes the exponential function, e raised to the power x. It should not be confused with EXP() from the GAMLSS packages, which defines the exponential distribution, a one-parameter distribution, as a gamlss.family object for GAMLSS fitting with gamlss(); there the mu parameter is the mean of the distribution, and dEXP, pEXP, qEXP, and rEXP give the density, distribution function, quantile function, and random generation.

**log(x):** log computes logarithms, by default natural logarithms; log10 computes common (base 10) logarithms; and log2 computes binary (base 2) logarithms. The general form log(x, base) computes logarithms with the given base.
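A quick sketch of the base R exp() and log() family; the input values are arbitrary and chosen so the results are easy to verify by hand.

```r
e_value   <- exp(1)             # Euler's number, about 2.718
natural   <- log(exp(2))        # natural log undoes exp(): about 2
common    <- log10(1000)        # base 10: about 3
binary    <- log2(8)            # base 2: about 3
any_base  <- log(81, base = 3)  # general form: about 4

print(c(e_value, natural, common, binary, any_base))
```

Results of floating-point logarithms can differ from the exact value in the last digits, so compare them with all.equal() rather than ==.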

**round(x, digits = n):** round() belongs to a family of rounding functions:

ceiling: takes a single numeric argument x and returns a numeric vector containing the smallest integers not less than the corresponding elements of x.

floor: takes a single numeric argument x and returns a numeric vector containing the largest integers not greater than the corresponding elements of x.

trunc: takes a single numeric argument x and returns a numeric vector containing the integers formed by truncating the values in x toward 0.

round: rounds the values in its first argument to the specified number of decimal places (default 0).

signif: rounds the values in its first argument to the specified number of significant digits.
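The five rounding functions can be compared side by side on one vector. The sample values below are arbitrary.

```r
x <- c(-2.7, 2.345, 123.456)

up       <- ceiling(x)            # -2    3  124
down     <- floor(x)              # -3    2  123
toward_0 <- trunc(x)              # -2    2  123
one_dec  <- round(x, digits = 1)  # -2.7  2.3  123.5
two_sig  <- signif(x, digits = 2) # -2.7  2.3  120

print(rbind(up, down, toward_0, one_dec, two_sig))
```

Note how floor() and trunc() differ for negative numbers: floor(-2.7) moves away from zero, while trunc(-2.7) moves toward it.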

**append():** Appends values to a vector x, possibly inserting them in the middle of x. It is convenient because its after argument lets you choose the insertion point, which plain concatenation with c() does not.

**identical():** The safe and reliable way to test whether two objects are exactly equal. It returns TRUE in that case and FALSE in every other case.

**length():** Gets the length of vectors and factors, and of any other R object for which a method has been defined.

**range():** Returns a vector containing the minimum and maximum of all the given arguments. range is a generic function: methods can be defined for it directly or via the Summary group generic. Its arguments should be unnamed, and dispatch is on the first argument.

**rep(x, n):** Replicates the values in x. It is a generic function.

**rev():** Returns a reversed version of its argument. It is a generic function with a default method for vectors and one for dendrograms.

**seq(x, y, n):** Generates regular sequences. seq is a standard generic with a default method. seq.int is a primitive that can be much faster but has a few restrictions. seq_along and seq_len are very fast primitives for two common cases.

**unique():** Returns a vector, data frame, or array like x but with duplicate elements removed.
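The vector helpers above in action, on a small made-up vector:

```r
x <- c(3, 1, 2, 3)

appended <- append(x, 99, after = 2)   # insert 99 after the 2nd element
same     <- identical(c(1, 2), c(1, 2))
n        <- length(x)
min_max  <- range(x)                   # minimum and maximum
repeated <- rep(c(1, 2), times = 3)    # 1 2 1 2 1 2
reversed <- rev(x)
stepped  <- seq(1, 10, by = 3)         # from 1 to 10 in steps of 3
counted  <- seq_len(4)                 # 1 2 3 4
distinct <- unique(x)                  # duplicates removed

print(appended)
print(distinct)
```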

Now we will learn a couple of statistical functions and character functions, and then how to write our own function.

**Statistical Function:**

Now we are learning the Statistical function in R. The first function which we will be talking now is the mean() function.

**mean() function:**The function mean() is mainly used to calculate in R, it is calculated by taking the sum of the values and dividing with a number of values in data series. On the other word, mean() of an observation variable is a numerical measure of the central area of the data values.

The keyword is “Univar”. Mean() function is the arithmetic average and is a common statistic used with ratio data. Mean can be calculated on an isolated variable via the mean(VAR) command. VAR is the name of the variable.

**Median(x):**The middle most value in a data series is called Median. The median of an observation variable is the middle value when the data is sorted in ascending order.

It is an ordinal measure of the central area of the data values. This is a generic function where methods can be written. The median is called a reasonable concept for its default method, which will work for most classes.

**Sd(x):**The Standard Deviation of an observation variable is a square root of its variance. This function computes the standard deviation of the values in x. If na.rm is TRUE therefore missing the values are removed before computation proceeds. In R Standard Deviations are calculated in the same as the mean.

The Standard Deviation of a single variable can be computed with the sd command, where VAR is the name of the variable. A Standard Deviation can be calculated for each of the variables in a dataset by using the SD (DATAVAR) command, where DATAVAR is the name of the variable containing the data.

**Range(x):**The range of an observation variable is the difference between its largest and smallest data values. This is a measure of how far apart the entire data spread in value.

Range returns a vector which is containing the minimum and maximum of all the given arguments. The keyword is “:arith”, “Univar”. It is recommended that ranges also be computed on individual variables.

**Sum(x):**Sum function in R is used to calculate the sum of vector elements.

Sum returns the sum of all the values present in its arguments. These generic function methods can be defined for it directly or in via the summary group generic.

**min(x):** The min() function computes the minimum value of a vector.

The minimum of a single variable is computed with the min(VAR) command.

**max(x):** The max() function computes the maximum value of a vector.

The maximum is computed in the same way, via max(VAR).
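The statistical functions above can be tried together on one small vector. This is only a sketch: the values in x are made up for illustration, and the NA is included to show what na.rm does.

```r
# Hypothetical sample data; the NA shows the effect of na.rm
x <- c(4, 8, 15, 16, 23, 42, NA)

avg    <- mean(x, na.rm = TRUE)     # arithmetic average of the non-missing values
mid    <- median(x, na.rm = TRUE)   # middle value after sorting
spread <- sd(x, na.rm = TRUE)       # square root of the variance
r      <- range(x, na.rm = TRUE)    # vector holding the minimum and the maximum
width  <- diff(r)                   # largest value minus smallest value
total  <- sum(x, na.rm = TRUE)
lowest <- min(x, na.rm = TRUE)
top    <- max(x, na.rm = TRUE)

# Standard deviation of each selected column of a data set
col_sd <- sapply(mtcars[, c("mpg", "hp")], sd)
```

Without na.rm = TRUE, every one of these calls on x would return NA, because one value is missing.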

**Character Function:**

Now we will deal with character variables. Suppose you have customer data such as names, locations and other attributes that are mainly character in nature. Very often we have to manipulate and clean such data before we use it in a model, so here are a couple of simple examples.

Here we discuss some built-in character functions, starting with tolower().

**tolower():** Converts a string to lowercase letters.

**toupper():** Converts a string to uppercase letters.

**substr(x, start=n1, stop=n2):** Extracts or replaces substrings in a character vector. How does it extract?

It takes a starting position and an ending position and returns the characters of x between them. It can also be used on the left side of an assignment to overwrite part of a character string.

**grep(pattern, x, ignore.case=FALSE):** Searches for a pattern in x.

Whereas substr() extracts by position, grep() looks for a particular pattern in each element of a vector x and returns the indices of the elements that match.

**sub(pattern, replacement, x, ignore.case=FALSE, fixed=FALSE):** Finds a pattern in x and replaces it with the replacement text. There is a small difference between sub() and gsub().

sub() replaces only the first match in each element, whereas gsub() replaces every match it finds.

**paste(..., sep=" "):** Pastes two or more words or letters together.

It converts its arguments to character strings and concatenates them, separated by sep (a single space by default).
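Here is a short sketch of these character functions in action. The customer names and the code string are invented for the example.

```r
customer <- c("Alice Smith", "bob JONES", "Carol Smith")   # hypothetical names

lower  <- tolower(customer)                        # "alice smith" "bob jones" "carol smith"
upper  <- toupper(customer)                        # "ALICE SMITH" "BOB JONES" "CAROL SMITH"
first3 <- substr(customer, start = 1, stop = 3)    # "Ali" "bob" "Car"

# substr() on the left side of an assignment overwrites part of a string
code <- "AB-123"
substr(code, 1, 2) <- "XY"                         # code is now "XY-123"

# grep() returns the positions of the elements that match the pattern
hits <- grep("smith", customer, ignore.case = TRUE)   # 1 3

s <- "a-b-c"
first_only <- sub("-", "+", s, fixed = TRUE)       # "a+b-c"  (first match only)
every_one  <- gsub("-", "+", s, fixed = TRUE)      # "a+b+c"  (all matches)

spaced <- paste("Data", "Science")                 # "Data Science"
joined <- paste("Data", "Science", sep = "")       # "DataScience"
```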

Now we will learn how to write our own functions. First, when should we create one? Whenever you find yourself copying and pasting the same code again and again with only minimal changes, it is time to create your own function.

Why create a function? Writing your own functions is one way to reduce duplication in your code. A function is easier to use and easier to understand, and it removes a lot of repeated work.

Structure of a function: first you name your function. Like mean, sd or range, it should have a meaningful name.

Suppose you want a function that takes whatever value is passed to it, multiplies it by 3 and returns the result. A meaningful name here is triple. How do we write it? Start with the name, triple, followed by the assignment operator <-, then the keyword function, and then the arguments in parentheses. Here there is only one argument, so we write x. Then comes the body of the function in curly braces: take the input x, multiply it by 3 and return it. The complete definition is triple <- function(x) { 3 * x }, and if you pass 6 to triple you get 18. Working with one argument is pretty easy; next we use two arguments.
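Written out in full, the triple function described above looks like this:

```r
# One argument, x; the value of the last expression is what the function returns
triple <- function(x) {
  3 * x
}

result <- triple(6)   # 18
```

Because R works on whole vectors, the same function also accepts several values at once, for example triple(c(1, 2, 3)).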

For the second example we create a new function called math_magic, which takes two arguments, a and b, and computes a multiplied by b plus a divided by b, that is, a*b + a/b. If you pass a=2 and b=1 you get 4 as a result: 2 times 1 is 2, 2 divided by 1 is 2, and 2 plus 2 is 4 (2*1 + 2/1 = 4).

You can pass the arguments by position, as in math_magic(2, 1), or by name, as in math_magic(a = 2, b = 1); both give the same result. But if you forget to give the second value, the one for b, you get an error. That is why, whenever you create a function, you should try to give its arguments meaningful default values.
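The two-argument function can be sketched as follows. The name math_magic is our spelling of the function described above (R names cannot contain spaces), and tryCatch() is used here only so that the example keeps running after the error.

```r
math_magic <- function(a, b) {
  a * b + a / b
}

by_position <- math_magic(2, 1)           # 2*1 + 2/1 = 4
by_name     <- math_magic(a = 2, b = 1)   # same call, arguments given by name
swapped     <- math_magic(b = 1, a = 2)   # named arguments may come in any order

# Leaving out b fails because b has no default value
outcome <- tryCatch(
  math_magic(2),
  error = function(e) "error: b is missing"
)
```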

Here we set a default value, b equal to 1, and keep the rest of the code the same. If we pass 2 and 1 the result is the same as before, and if we forget to give a value for b we still get a result. Earlier the missing argument caused an error, but now the function automatically takes the default value of 1.
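A version with a default value might look like this (again using math_magic as our spelling of the function name):

```r
# b defaults to 1 when the caller does not supply it
math_magic <- function(a, b = 1) {
  a * b + a / b
}

with_both    <- math_magic(2, 1)   # 4
with_default <- math_magic(2)      # 4, because b falls back to 1
other_b      <- math_magic(2, 4)   # 2*4 + 2/4 = 8.5
```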

That is how you create your own function. The examples above are very simple, but think of data manipulation on a dataset with many columns and attributes, where you want to perform a similar operation on each attribute. Instead of writing the same code again and again, you can create a function that handles each of those columns one by one, so the code is reused.

### R Data Structure

**Type Data Structure:**

A data structure can be defined as a specific form of organizing and storing data. R supports five basic types of data structure: vector, matrix, array, data frame, and list.

**Vector:**

A vector is a sequence of data elements of the same basic type. Members of a vector are formally called components. Whenever we store a numerical value in a or b, that a or b is nothing but a vector, so a vector is a one-dimensional array used to store a collection of data of the same type.

You can store all numeric data (data type numeric) in one vector, and similarly complex (data type complex), logical (data type logical) or character (data type character) data. There are six types of atomic vector: logical, integer, double, complex, character, and raw.

**Matrices:**

A matrix is a collection of data elements of the same mode arranged in a two-dimensional rectangular layout. We can create a matrix containing only characters or only logical values, but these are not of much use. We mostly use matrices containing numeric elements in mathematical calculations. Elements are accessed by two integer indices.

Three common matrix operations are:

*Matrix multiplication*

*Matrix transpose*

*Matrix power*

**Arrays:**

Arrays are similar to matrices but can be multi-dimensional (more than two dimensions). The array() function takes vectors as input and uses the values in the dim parameter to create an array.

For example, numeric measurements such as age and salary recorded across several dimensions fit in an array. But if you need to store numeric values together with a character variable such as gender (male or female), an array is not suitable, because all of its elements must share one type; use a data frame instead.

**Data Frames:**

A data frame is a generalization of a matrix in which different columns can have different modes. It is a list of vectors of equal length. A data frame is a table or two-dimensional array-like structure in which each column contains the values of one variable and each row contains one set of values from each column. It can store different data types together, for example numerical with categorical and logical.

**Lists:**

A list can store all of the other four: data frames, arrays, matrices and vectors. A list is an ordered collection of objects whose elements can be of different types, such as numbers, strings, vectors and even another list. Lists are created using the list() function.

**Types of Data Structure:**

Vector, matrix and array are homogeneous: each stores data of a single type (numeric, character or logical), and they differ only in dimension. A vector is one-dimensional, a matrix is two-dimensional and an array is multidimensional.

Data frames and lists, on the other hand, are heterogeneous, so they can store different data types.

**Vector**

**Numeric Vector:**

Now we will learn about vectors. Suppose you want to store 42 in vec1 (vec1 <- 42); printing vec1 then shows the result. Similarly, you can store more than one value, such as 1 to 5 (vec1 <- c(1,2,3,4,5)).

There are two ways to do this: use the combine function c() with the values separated by commas (vec1 <- c(1,2,3,4,5)), or use the colon operator for the sequence 1 to 5 (vec1 <- c(1:5)). The advantage of the combine function is that it also works for values that do not form a sequence, such as 1,7,9,3,5.

If you check the class of vec1 created with 1:5, it is integer; created with c(1,2,3,4,5), it is numeric. To access a particular element, use square brackets: with 1 to 5 stored, vec1[2] returns the second element, which is 2. Likewise, you can access the first and third elements with vec1[c(1,3)].
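The numeric vector steps above can be sketched like this. The names vec_seq and vec_any are ours, added so that each variant stays visible.

```r
vec1 <- 42                       # a single value is already a vector of length 1
vec1 <- c(1, 2, 3, 4, 5)         # combine function with explicit values
class_c <- class(vec1)           # "numeric"

vec_seq <- 1:5                   # colon operator builds the sequence 1 to 5
class_seq <- class(vec_seq)      # "integer"

vec_any <- c(1, 7, 9, 3, 5)      # c() also handles values that are not a sequence

second <- vec_seq[2]             # 2
first_third <- vec_seq[c(1, 3)]  # 1 3
```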

**Character Vector:**

Within double quotes, you pass the value you want to store. Suppose we want to store “Universe” in vec2 (vec2 <- "Universe").

To store more than one character value, use the combine function (vec2 <- c("Universe", "sun", "moon")). If you check its class, you get character.

If you mix character and numeric values, all the values are converted to character.
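A short sketch of a character vector, and of what happens when a number is mixed in (the name mixed is ours):

```r
vec2 <- "Universe"
vec2 <- c("Universe", "sun", "moon")
class_chr <- class(vec2)        # "character"

mixed <- c("Universe", 42)      # the number 42 is converted to the string "42"
class_mixed <- class(mixed)     # "character"
```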

**Logical Vector:**

A logical vector stores TRUE and FALSE values. Suppose we use vec3 and store FALSE in it (vec3 <- FALSE). You can also store more than one value, such as TRUE and FALSE (vec3 <- c(TRUE, FALSE)), and if you check its class, it is logical.

If you mix logical values with numeric values, numeric takes precedence, and TRUE and FALSE become 1 and 0. If you mix character with logical and numeric values, all values are converted to character, so character takes precedence over both.
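The order of precedence can be checked with a few lines. The names num_mix and chr_mix are ours.

```r
vec3 <- c(TRUE, FALSE)
class_lgl <- class(vec3)          # "logical"

num_mix <- c(TRUE, FALSE, 5)      # logical values become 1 and 0
class_num <- class(num_mix)       # "numeric"

chr_mix <- c(TRUE, 5, "a")        # everything becomes a string
class_chr <- class(chr_mix)       # "character"
```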

**Matrices:**

A matrix is an R object in which the data elements are arranged in a two-dimensional rectangular layout. Although we can create a matrix containing only characters or only logical values, these are not of much use. We mostly use matrices containing numeric elements in mathematical calculations.

A matrix is created using the matrix() function.

Its arguments are matrix(data=NA, nrow=1, ncol=1, byrow=FALSE, dimnames=NULL). data is the data from which we want to create the matrix; suppose it is the numbers 1 to 4. nrow is the number of rows, 1 by default, which we change to 2. ncol is the number of columns, also 1 by default, which we also change to 2. byrow, FALSE by default, controls how the values are laid out. With byrow equal to TRUE the matrix is filled row by row, so the first row holds 1 and 2 and the second row holds 3 and 4.

If byrow is FALSE the matrix is filled column by column, so the first column holds 1 and 2 and the second column holds 3 and 4. In our example we keep all the values numeric, use 2 rows and 2 columns, and set byrow to TRUE, which means we fill the first row and then the second row.
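The effect of byrow is easiest to see side by side. The names m_row and m_col are ours.

```r
# Filled row by row
m_row <- matrix(1:4, nrow = 2, ncol = 2, byrow = TRUE)
#      [,1] [,2]
# [1,]    1    2
# [2,]    3    4

# Filled column by column (byrow = FALSE is the default)
m_col <- matrix(1:4, nrow = 2, ncol = 2)
#      [,1] [,2]
# [1,]    1    3
# [2,]    2    4
```

One layout is the transpose of the other, so t(m_col) gives the same matrix as m_row.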

You can access the elements of a matrix with square brackets containing row, column. Suppose we have created a matrix called matrix2. Writing 1 in the row position and leaving the column position empty gives the first row, and 2 in the row position gives the second row.

If we leave the row position empty and write 1 in the column position, we get all the rows but only the first column. Similarly, 2 in the column position gives all the rows but only the second column.
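The row, column indexing described above, as a sketch on a small 2 by 2 matrix (the name matrix2 follows the text; its contents are assumed to be 1 to 4 filled by row):

```r
matrix2 <- matrix(1:4, nrow = 2, byrow = TRUE)

row1 <- matrix2[1, ]    # first row, all columns:  1 2
row2 <- matrix2[2, ]    # second row, all columns: 3 4
col1 <- matrix2[, 1]    # all rows, first column:  1 3
col2 <- matrix2[, 2]    # all rows, second column: 2 4
cell <- matrix2[2, 1]   # a single element: row 2, column 1 gives 3
```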

**Array:**

Now we discuss the array. An array is similar to a matrix but can have more than two dimensions; for example, you can create a 2*3*4 array. An array can store only one data type. It is created using the array() function, which takes vectors as input and uses the values in the dim parameter to create the array.

An array is a multidimensional, rectangular data storage structure. “Rectangular” means that every row has the same length, and likewise every column and every other dimension. The arguments are array(data=NA, dim=length(data), dimnames=NULL).

We first pass the data; here we pass 1 to 27, and from this data we want to create the array. What are the dimensions? Here they are 3*3*3. Such an array could hold numeric values such as salary or age. It is pretty simple, and you can see the result as well.
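The 3*3*3 array described above can be created and indexed like this (the name arr is ours):

```r
arr <- array(1:27, dim = c(3, 3, 3))

dims <- dim(arr)            # 3 3 3
layer1 <- arr[, , 1]        # first 3 x 3 layer, holding the values 1 to 9
one_value <- arr[1, 2, 3]   # row 1, column 2, layer 3 gives 22
```

The values are filled column by column within each layer, which is why position [1, 2, 3] holds 22.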

**Dataframe:**

A data frame is a table or two-dimensional array-like structure in which each column contains the values of one variable and each row contains one set of values from each column.

A data frame is used for storing data tables. It is a list of vectors of equal length. We have already learned about numeric, character and logical vectors; a data frame lets you mix all three. Suppose you have three employees. One attribute, such as age or salary, is numeric. You also want to store their gender, male or female, and whether they left the company or not, TRUE or FALSE. Because there are three employees, all of these vectors have the same length. To hold this kind of dataset you use a data frame.

To create it we use the data.frame() function. Before that we need three vectors: suppose we store 2, 3, 5 in a vector called num, then "aa", "bb", "cc" in char, and TRUE, FALSE, TRUE in logical. We then pass them to data.frame() and name the resulting data frame df. To see what is stored in df, type df and the result is printed.
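Put together, the steps above look like this:

```r
# Three vectors of equal length, one per column
num     <- c(2, 3, 5)
char    <- c("aa", "bb", "cc")
logical <- c(TRUE, FALSE, TRUE)

# The column names are taken from the variable names
df <- data.frame(num, char, logical)

df        # print the whole table
str(df)   # one line per column with its type and values
```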

Think about a business scenario where you have a lot of employees: this is where you use a data frame. That is why the data frame is one of the most important structures in R. R ships with many built-in data frames that you can explore. One of the most popular is mtcars, a dataset about cars. It is already available in R, which is why we call it a built-in dataset. mtcars has 32 car models. To see the first several rows, use the head() function; similarly, use the tail() function to see the last several rows. Another function is str(), which shows the structure of a dataset: the type and first few values of every column, so you get a sense of what your dataset looks like. Likewise, summary() gives a compact overview that is often easier to read than the full data: for each attribute it reports values such as the minimum and maximum.
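These exploration functions can be tried directly on mtcars:

```r
size <- dim(mtcars)           # 32 rows and 11 columns

top    <- head(mtcars)        # first 6 rows by default
bottom <- tail(mtcars, 3)     # last 3 rows

str(mtcars)                   # type and first values of each column
mpg_summary <- summary(mtcars$mpg)   # minimum, quartiles, mean and maximum of one column
```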

**Factor:**

In versions of R before 4.0.0, character variables in a data frame are automatically converted into factors; from R 4.0.0 onward this happens only if you ask for it, for example with stringsAsFactors = TRUE. The number of levels is the number of different values in the vector.

A factor takes a limited number of different values; such variables are referred to as categorical variables. So a factor represents categorical data. Factors can be ordered or unordered, and they are an important class for statistical analysis and for plotting. Factor variables are also very useful in many different types of graphics.

Storing data as factors ensures that the modeling functions will treat such data correctly. A factor can be built from both integers and strings. Factors are very useful for columns that have a limited number of unique values, such as “Male, Female” or “True, False”.

Factors in R come in two varieties:

ordered

unordered.

Factors are stored as a vector of integer values, with a corresponding set of character values to use when the factor is displayed. The factor() function is used to create a factor. Its only required argument is a vector of values, which is returned as a vector of factor values. Both numeric and character variables can be made into factors, but a factor’s levels will always be character values.
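A small sketch of both varieties follows. The sizes vector and all variable names are made up for the example.

```r
sizes <- c("small", "large", "medium", "small")   # hypothetical data

# Unordered factor: levels are sorted alphabetically by default
f_unordered <- factor(sizes)
lv    <- levels(f_unordered)        # "large" "medium" "small"
codes <- as.integer(f_unordered)    # 3 1 2 3, the integers stored behind the labels

# Ordered factor: the level order is given explicitly
f_ordered <- factor(sizes,
                    levels = c("small", "medium", "large"),
                    ordered = TRUE)
smaller <- f_ordered[1] < f_ordered[2]   # TRUE, because small comes before large

# Levels are character values even when the input is numeric
num_levels <- levels(factor(c(4, 6, 8, 4)))   # "4" "6" "8"
```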

**Factor levels**

When you get a dataset, you will often find that it contains factors with specific factor levels. Sometimes you will want to change the names of these levels for clarity or for other reasons. R lets you do this with the function levels().

Example with the mtcars data:

> str(mtcars)

$ cyl : num 6 6 4 6 8 6 8 4 4 6 ...

Here mtcars is the data, with 32 car models. For each of those cars we have eleven attributes, such as horsepower, cylinders, displacement and mileage. If you look at the cylinder column, each car has either 4, 6 or 8 cylinders. An attribute with only a small number of unique values is an ideal candidate for a factor: it never takes values such as 6.5, 4.32 or 5.67, only 4, 6 or 8. So we want to change it to a factor.

str(mtcars)

mtcars$cyl = as.factor(mtcars$cyl)

How do you convert a column to a factor? Just use the as.factor() function. The first question is which column you want to change; here it is mtcars$cyl. You can access any column using the dollar sign ($). The as.factor() function converts mtcars$cyl into a factor, and assigning the result back to mtcars$cyl stores it in the data frame.

str(mtcars)

$ cyl : Factor w/ 3 levels "4","6","8": 2 2 1 2 3 2 3 1 1 2 ...

Now look carefully at the structure of mtcars after the change: the other columns are still numeric, but cyl has been converted to a factor. It now has three levels: 4, 6 and 8.

The same conversion can be applied to am, gear and carb, so that four columns in total change from numeric to factor. Looking at the structure of mtcars again, you see all of those columns as factors.

str(mtcars)

$ am : Factor w/ 2 levels "0","1": 2 2 2 1 1 1 1 1 1 1 ...

am records whether the transmission is automatic or manual, so it has two levels.

$ gear : Factor w/ 3 levels "3","4","5": 2 2 2 1 1 1 1 2 2 2 ...

gear is the number of gears; it has three levels.

$ carb : Factor w/ 6 levels "1","2","3","4",..: 4 4 1 1 2 1 4 2 2 4 ...

Similarly, carb, the number of carburetors, has 6 levels. That is how you can change columns to factors.
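One way to convert all four columns in a single step is sketched below. It works on a copy named cars_f (our name) so that the built-in mtcars stays unchanged; using lapply() here is our choice and not necessarily how the conversion was done originally.

```r
cars_f <- mtcars                         # work on a copy
cols <- c("cyl", "am", "gear", "carb")   # columns with few unique values

# Apply as.factor() to each selected column and store the results back
cars_f[cols] <- lapply(cars_f[cols], as.factor)

str(cars_f[cols])                        # all four columns now show as Factor
level_counts <- sapply(cars_f[cols], nlevels)   # 3, 2, 3 and 6 levels
```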

How do you change the names of the levels?

First create a new variable called gender_vector and store "Male", "Female", "Female", "Male", "Male" in it. This could be the data for five employees. Printing it shows that the five values have been recorded, and its class is character.

gender_vector <- c("Male", "Female", "Female", "Male", "Male")

gender_vector

class(gender_vector)

Next, convert gender_vector to a factor using as.factor():

# Convert gender_vector to a factor

factor_gender_vector <- as.factor(gender_vector)

factor_gender_vector # the factor has two levels, Female and Male

It has changed to a factor, and the levels are shown as Female Male.

Now change the names of the levels using levels():

levels(factor_gender_vector) <- c("F", "M")

Earlier the levels were Female and Male; now they are F and M.

The levels() function does the work: pass it the factor whose levels you want to change, here factor_gender_vector, and assign the new names, here F and M for Female and Male. If you print the factor again, you see that the levels have changed.
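One detail worth checking when renaming levels is their order: the new names are matched to the existing levels by position, and the existing levels are sorted alphabetically. A short sketch:

```r
gender_vector <- c("Male", "Female", "Female", "Male", "Male")
factor_gender_vector <- as.factor(gender_vector)

old_levels <- levels(factor_gender_vector)   # "Female" "Male" (alphabetical)

# "F" replaces the first level (Female), "M" replaces the second (Male)
levels(factor_gender_vector) <- c("F", "M")

renamed <- as.character(factor_gender_vector)   # "M" "F" "F" "M" "M"
```

Writing c("M", "F") instead would silently swap the labels, so it is worth printing levels() before assigning new names.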

**Accessing Column from data frame:**

Accessing a column of a data frame is quite easy. Write the name of the data frame followed by a dollar sign ($) and the column name, and you see whatever is in that column.

There is another way to do it: use square brackets [] with row, column, as discussed earlier.

Similarly, you can give the column name inside the brackets to see that column. But what if you want more than one column? In that case, use the combine function c() with the numbers of all those columns, and you can see those columns as well.
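The different ways of accessing columns, sketched on mtcars. We assume here that df is simply a copy of mtcars.

```r
df <- mtcars                          # assumption: df holds the mtcars data

by_dollar <- df$mpg                   # dollar sign and column name
by_number <- df[, 1]                  # all rows, first column
by_name   <- df[, "mpg"]              # all rows, column selected by name

two_by_number <- df[, c(1, 3)]        # first and third columns
two_by_name   <- df[, c("mpg", "hp")] # several columns selected by name
```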

**Accessing Row:**

Similarly, accessing a row is also easy. As said, use square brackets with row, column between them.

For example: df[2,]

Here the row index is 2 and the column index is left empty, which means all columns. The expression returns the second row.
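A sketch of row access, again assuming df is a copy of mtcars:

```r
df <- mtcars                     # assumption: df holds the mtcars data

second_row <- df[2, ]            # row 2, all columns
car_name <- rownames(second_row) # "Mazda RX4 Wag"

some_rows <- df[c(1, 3), ]       # rows 1 and 3
one_cell  <- df[2, "mpg"]        # row 2 of the mpg column: 21
```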

**Dropping column:**

If you want to drop a column, that is, exclude it from your data frame, simply put a minus sign (-) before its number. Suppose you want to drop the third column, which in mtcars is the displacement column.

For example: df[,-3]

df[,-c(2,3)]

Within the square brackets, leaving the row position empty keeps all the rows, and -3 drops the third column. Similarly, -c(2,3) drops the second and third columns, which are cylinder and displacement.
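Dropping columns can be verified by looking at the column names before and after. As before, df is assumed to be a copy of mtcars, and the result names are ours.

```r
df <- mtcars                        # assumption: df holds the mtcars data

first_four <- names(df)[1:4]        # "mpg" "cyl" "disp" "hp"

no_disp     <- df[, -3]             # every column except the third (disp)
no_cyl_disp <- df[, -c(2, 3)]       # every column except cyl and disp
```

The original df is not modified; each expression returns a new data frame.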

**Subset**:

The fourth operation is taking a subset. What if you don’t want to see all the rows? Suppose you want only those observations, or car models, where the number of cylinders is more than 6 or the horsepower is more than 50.

For instance: car1 <- subset(df, cyl > 6)

car2 <- subset(df, hp > 50)

This creates a new data frame called car1 using the subset() function, to which you pass the data and then the condition. In the resulting data frame, the cylinder column contains only values greater than 6. Similarly, car2 contains only the cars with horsepower above 50.
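The two subsets, sketched with df assumed to be a copy of mtcars:

```r
df <- mtcars                       # assumption: df holds the mtcars data

car1 <- subset(df, cyl > 6)        # only the cars with more than 6 cylinders
car2 <- subset(df, hp > 50)        # only the cars with horsepower above 50

cyl_values <- unique(car1$cyl)     # 8
n_car1 <- nrow(car1)               # 14

# The same filter written with square brackets
car1_brackets <- df[df$cyl > 6, ]
```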

If you have two data frames, you can combine them by row. Suppose the first data frame has twenty car models and the second has twelve. You can combine the two using the rbind() function.

Likewise, you can combine columns. Suppose that for all 32 car models you have the mileage as one column and the number of cylinders as another. You can combine these columns using the cbind() function.
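Both operations can be sketched by splitting mtcars and putting it back together. All the variable names below are ours.

```r
# Combine by row: twenty cars plus twelve cars
first20 <- mtcars[1:20, ]
last12  <- mtcars[21:32, ]
all_cars <- rbind(first20, last12)     # 32 rows again

# Combine by column: mileage and cylinders side by side
mileage   <- mtcars["mpg"]             # one-column data frame
cylinders <- mtcars["cyl"]
both <- cbind(mileage, cylinders)
```

rbind() expects the data frames to have the same columns, and cbind() expects them to have the same number of rows.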

**Factor:**

In a data frame, character vectors are automatically converted into factors in versions of R before 4.0.0 (from 4.0.0 onward, only when stringsAsFactors = TRUE is set), and the number of levels is the number of different values in such a vector. You can create a data frame using more than one data type. It can be a mixture of numeric, character and logical columns.

Suppose you have three vectors, name, age and gender, and you create a new data frame from them using the data.frame() function. If you check its class, you get data.frame. The data frame name followed by a dollar sign and name gives you all the names. But if the conversion has taken place and you check the class of that column, you may be surprised: it is no longer a character variable but a factor. So this is another new data type.
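The conversion can be made explicit with the stringsAsFactors argument, so the result no longer depends on the R version. The names and values below are invented for the example.

```r
name   <- c("Ann", "Bob", "Cara")        # hypothetical employees
age    <- c(31, 45, 28)
gender <- c("Female", "Male", "Female")

# Ask for the conversion explicitly
people <- data.frame(name, age, gender, stringsAsFactors = TRUE)
class_df   <- class(people)              # "data.frame"
class_name <- class(people$name)         # "factor"
n_gender   <- nlevels(people$gender)     # 2

# Keep the strings as they are
people_chr <- data.frame(name, age, gender, stringsAsFactors = FALSE)
class_chr  <- class(people_chr$name)     # "character"
```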

**List:**

A list is the most complex data structure. A list may contain a combination of vectors, matrices, data frames and even other lists. It is created using the list() function in R. A list is a generic vector containing other objects, so it can hold mixed data types. A vector whose elements are all of the same type is called an atomic vector, while a vector whose elements can be of various types is called a list.

Before creating a list, create a vector with the numbers one to ten (1-10).

Then create a matrix, which is a two-dimensional array.

Then create a data frame from mtcars, the built-in data frame, taking only the first three observations, and call it my_df.

Finally, store this vector, matrix and data frame in a list called my_list, using the list() function.

Once the list is created, print my_list to see the result. The output starts with the first element, my_vector, which holds one to ten.

Creating a list:

# vector with numerics from 1 up to 10

> my_vector <- 1:10

In the printed list, this vector appears under

[[1]]

Next comes the matrix, a two-dimensional array called my_matrix.

# matrix with numerics from 1 up to 9

> my_matrix <- matrix(1:9, ncol = 3)

In the printed list, it appears under

[[2]]

Then comes the data frame.

# first 3 rows of the built-in data frame "mtcars"

> my_df <- mtcars[1:3,]

In the printed list, it appears under

[[3]]

That is how the list is laid out.

In this output the elements have no names; they are labelled only [[1]], [[2]] and [[3]]. You can give them names using the names() function.

# give names using names()

> names(my_list) <- c("vec", "mat", "df")

If you print the list again, you see that the names have changed.

How does list() work?

First create the vector

my_vector <- 1:10

Then use the matrix() function

my_matrix <- matrix(1:9, ncol = 3)

Then create the data frame

my_df <- mtcars[1:3,]

Here mtcars is the data with all 32 car models and eleven attributes, but only the first 3 rows are used for this data frame.

Then create the list using the list() function

my_list <- list(my_vector, my_matrix, my_df)

Printing my_list then shows my_vector, my_matrix and my_df, in that order.
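The whole list example, including naming the elements and reading them back, can be run as one piece:

```r
my_vector <- 1:10                       # vector with the numbers 1 to 10
my_matrix <- matrix(1:9, ncol = 3)      # 3 x 3 matrix
my_df     <- mtcars[1:3, ]              # first 3 rows of mtcars

my_list <- list(my_vector, my_matrix, my_df)

# Without names, elements are reached by position with double brackets
first_element <- my_list[[1]]           # the vector 1 to 10

# After naming, elements can also be reached by name
names(my_list) <- c("vec", "mat", "df")
same_vector <- my_list$vec
one_cell    <- my_list[["mat"]][2, 3]   # row 2, column 3 of the matrix: 8
mpg_values  <- my_list$df$mpg           # 21.0 21.0 22.8
```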

### Import and Export in R

You might find that loading data into R can be quite frustrating. Almost every single type of file that you want to get into R seems to require its own function, and even then you might get lost in the functions’ arguments. In short, you might agree that it can be fairly easy to mix up things from time to time, whether you are a beginner or a more advanced R user.

**Types of files that we'll import**

Importing CSV file

Importing Text file

Importing Excel file

Importing files from Database

Importing files from Web

Importing files from Statistical Tool

And lastly Exporting the Data

**Importing CSV file**

The utils package, which is automatically loaded in your R session on startup, can import CSV files with the read.csv() function.

Use read.csv() to import a data frame

**Now use these commands to import CSV files**

#Importing csv file

# read.csv()

titanic_train<- read.csv(file.choose())

class(titanic_train)

titanic <- read.csv("titanic_train.csv")

str(titanic)

#Using readr package

install.packages("readr")

library(readr)

titanic <- read_csv("titanic_train.csv")

titanic

All the code used in this video is given at the end of this chapter. The CSV files used here are available in the resource section of this lecture.
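
If you do not have titanic_train.csv at hand, the following minimal sketch shows the same read.csv() workflow on a tiny file it creates itself. The file contents and the names csv_path and passengers are made up for illustration only.

```r
# Write a tiny CSV to a temporary file (hypothetical data, for illustration only)
csv_path <- tempfile(fileext = ".csv")
writeLines(c("name,age,survived",
             "Anna,29,1",
             "Bob,41,0"), csv_path)

# Import it with base R; stringsAsFactors = FALSE keeps text columns as character
passengers <- read.csv(csv_path, stringsAsFactors = FALSE)

class(passengers)   # the class of the imported object
str(passengers)     # column names and types
```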

**Importing Text File**

The utils package, which is automatically loaded in your R session on startup, can import text files with the read.table() function.

Use read.table() to import a data frame

**Now use these commands to import text files**

If you have a .txt or a tab-delimited text file, you can easily import it with the basic R function read.table(). In other words, the contents of your file will look similar to this and can be imported as follows:

# Importing table/text

# read.table ()

# Import the hotdogs.txt file: hotdogs

?read.table

hotdogs <- read.table("hotdog.txt", sep = "\t", header = TRUE)

# Call head() on hotdogs

head(hotdogs)

All the code used in this video is given at the end of this chapter. The text files used here are available in the resource section of this lecture.
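
As a self-contained sketch of the same idea, the code below writes a small tab-delimited file and reads it back with read.table(). The column names and values are invented for illustration; sep = "\t" tells read.table() that the fields are separated by tab characters.

```r
# Create a small tab-delimited file (hypothetical data)
txt_path <- tempfile(fileext = ".txt")
writeLines(c("type\tcalories\tsodium",
             "Beef\t186\t495",
             "Meat\t173\t458"), txt_path)

# header = TRUE uses the first line as column names
hotdogs_demo <- read.table(txt_path, sep = "\t", header = TRUE)

head(hotdogs_demo)
```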

This brings this post to an end. I encourage you to re-read it if you have not understood it completely. Thank you.

**Importing Of Excel Files**

As most of you know, Excel is a spreadsheet application developed by Microsoft. It is an easily accessible tool for organizing, analyzing and storing data in tables and is widely used in many different application fields all over the world. It should come as no surprise that R offers several ways to read, write and manipulate Excel files (and spreadsheets in general).

**How To Import Excel Files**

Before you start thinking about how to load your Excel files and spreadsheets into R, you need to first make sure that your data is well prepared to be imported. If you neglect to do this, you might experience problems when using the R functions.

The readxl package, which is not loaded automatically and must first be installed and then loaded with library(), can import Excel files with the read_excel() function.

Use read_excel() to import a data frame

**Use these commands to import an Excel file in R**

#Importing xls file using readxl package - read_excel()

# Install the readxl package

install.packages("readxl")

# Load the readxl package

library(readxl)

# Print out the names of both spreadsheets

excel_sheets("urbanpop.xlsx")

# Read the sheets, one by one

pop_1 <- read_excel("urbanpop.xlsx", sheet = 1)

pop_2 <- read_excel("urbanpop.xlsx", sheet = 2)

pop_3 <- read_excel("urbanpop.xlsx", sheet = 3)

# Put pop_1, pop_2 and pop_3 in a list: pop_list

pop_list <- list(pop_1,pop_2,pop_3)

# Display the structure of pop_list

str(pop_list)

# Explore other packages - XLConnect, xlsx, gdata

All the code used in this video is given at the end of this chapter.

**Export Data in R - Text,CSV,Excel**

In this tutorial, we will learn how to export data from R environment to different formats.

To export data to the hard drive, you need the file path and an extension. First of all, the path is the location where the data will be stored.

**Exporting Text File**

You can export text files with the write.table(mydata, "Path../../mydata.txt", sep="\t") function.

**Now use these commands to export text files**

# Export data in a text file

# Use forward slashes (or doubled backslashes) in Windows paths; a single backslash starts an escape sequence in an R string

write.table(hotdogs, "D:/Rajib Backup/Project/Innovation/Analytics/Machine Learning/Tutorial/EduCBA/Chap5 -Import and Export/NewHotdog.txt", sep = "\t")

**Exporting CSV File**

You can export CSV files with the write.csv(mydata, "Path../../mydata.csv") function.

**Now use these commands to export CSV files**

#Export data in csv

write.csv(my_df, "D:/Rajib Backup/Project/Innovation/Analytics/Machine Learning/Tutorial/EduCBA/Chap5 -Import and Export/my_df.csv")

**Exporting Excel File**

You can export Excel files with the write_xlsx(mydata, "Path../../mydata.xlsx") function from the writexl package.

**Now use these commands to export Excel files**

# Export data in excel

install.packages("writexl")

library(writexl)

my_df <- mtcars[1:3,]

write_xlsx(my_df, "D:/Rajib Backup/Project/Innovation/Analytics/Machine Learning/Tutorial/EduCBA/Chap5 -Import and Export/Newmtcars.xlsx")

All the code used in this video is given at the end of this chapter. The text, CSV and Excel files used here are available in the resource section of this lecture.
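
The export calls can also be tried without a fixed drive path. The sketch below is only an illustration: it writes to R's temporary directory (tempdir()), uses base R functions only, and reads the CSV back to check the round trip. The file names are arbitrary.

```r
my_df <- mtcars[1:3, ]

# Build portable paths inside the session's temporary directory
txt_file <- file.path(tempdir(), "my_df.txt")
csv_file <- file.path(tempdir(), "my_df.csv")

write.table(my_df, txt_file, sep = "\t")   # tab-delimited text
write.csv(my_df, csv_file)                 # comma-separated values

file.exists(c(txt_file, csv_file))         # check that both files were written

# Read the CSV back; row.names = 1 restores the car names as row names
back <- read.csv(csv_file, row.names = 1)
identical(dim(back), dim(my_df))
```

Using file.path() avoids hand-written path separators, so the same code runs on Windows, macOS and Linux.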

### Data Manipulation in R

**Data Manipulation**

The apply() functions form the basis of more complex combinations and help you perform operations with very few lines of code. More specifically, the family is made up of the **apply()**, **lapply()**, **sapply()**, **tapply()** and **by** functions.

**How To Use apply() in R**

Let’s start with the apply(), which operates on arrays.

The R base manual tells you that it’s called as follows: apply**(X, MARGIN, FUN)**

where:

**X** is an array, or a matrix if the dimension of the array is 2;

**MARGIN** is a variable defining how the function is applied: when **MARGIN=1**, it applies over rows, whereas with **MARGIN=2**, it works over columns;

**FUN** is the function that you want to apply to the data. It can be any R function, including a User Defined Function (UDF).

**Use these commands to try the apply() function**

# Topic 1: Apply Function

###################################################################################

# apply() applies a function to the rows or columns of a matrix and returns a vector, array or list

# Syntax: apply(X, MARGIN, FUN), where MARGIN indicates whether the function is applied to rows or columns

# MARGIN = 1 indicates that the function is applied to each row

# MARGIN = 2 indicates that the function is applied to each column

# FUN can be any function, such as mean, median or sum

m <- matrix(c(1,2,3,4),2,2)

m

apply(m, 1, sum)

apply(m, 2, sum)

apply(m, 1, mean)

apply(m, 2, mean)
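
Because the third argument can be a user-defined function, here is a small sketch of that case. The helper name spread is hypothetical and simply returns the difference between the largest and smallest value it receives.

```r
m <- matrix(1:6, nrow = 2)      # 2 rows, 3 columns: rows are (1,3,5) and (2,4,6)

# User-defined function: largest value minus smallest value
spread <- function(v) max(v) - min(v)

row_spread <- apply(m, 1, spread)   # MARGIN = 1: one result per row
col_spread <- apply(m, 2, spread)   # MARGIN = 2: one result per column

row_spread
col_spread
```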

**The lapply() Function**

You want to apply a given function to every element of a list and obtain a list as result. When you execute ?lapply, you see that the syntax looks like the apply() function.

The differences are that:

• It can be used for other objects like data frames, lists or vectors; and

• The output returned is a list (which explains the “l” in the function name), which has the same number of elements as the object passed to it.

**Use these commands to try the lapply() function**

################################################

# Using sapply and lapply

################################################

# lapply() function

# lapply is similar to apply, but it takes a list as input and returns a list as output.

# syntax is lapply(list, function)

# example 1:

data <- list(x = 1:5, y = 6:10, z = 11:15)

data

lapply(data, FUN = median)

# example 2:

data2 <- list(a=c(1,1), b=c(2,2), c=c(3,3))

data2

lapply(data2, sum)

lapply(data2, mean)
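
Since a data frame is a list of columns, lapply() also works on it column by column. The following sketch illustrates this with three columns of the built-in mtcars data; the variable names cars and col_means are arbitrary.

```r
# Keep three numeric columns of mtcars
cars <- mtcars[, c("mpg", "wt", "hp")]

# Apply mean() to every column; the result is a list with one element per column
col_means <- lapply(cars, mean)

class(col_means)    # "list"
names(col_means)    # same names as the columns
col_means$mpg       # mean of the mpg column
```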

**The sapply() Function**

The sapply() function works like lapply(), but it tries to simplify the output to the most elementary data structure that is possible. And indeed, sapply() is a ‘wrapper’ function for lapply().

An example may help to understand this: suppose you want to extract a single value, such as the first element of the second row, from each matrix in a list.

Applying the lapply() function would give you a list, whereas sapply() returns a vector, unless you pass simplify=FALSE as a parameter to sapply(). Then, a list will be returned.

**Use these commands to try the sapply() function**

# sapply function

# sapply is the same as lapply, but it tries to return a vector or matrix instead of a list.

# syntax is sapply(list, function)

# example 1:

data <- list(x = 1:5, y = 6:10, z = 11:15)

data

lapply(data, FUN = sum)

lapply(data, FUN = median)

unlist(lapply(data, FUN = median))

sapply(data, FUN = sum)

sapply(data, FUN = median)

# Note: if the results are all scalars, a vector is returned;

# if the results all have the same length (> 1), a matrix is returned. Otherwise, the result is returned as a list.

sapply(data, FUN = range)
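
The simplification rules can be checked side by side. This is a minimal sketch using the same data list as above; the result names are arbitrary.

```r
data <- list(x = 1:5, y = 6:10, z = 11:15)

as_list   <- lapply(data, sum)                     # always a list
as_vector <- sapply(data, sum)                     # scalars -> named vector
as_matrix <- sapply(data, range)                   # length-2 results -> 2 x 3 matrix
kept_list <- sapply(data, sum, simplify = FALSE)   # simplification switched off

class(as_list)
as_vector
dim(as_matrix)
class(kept_list)
```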

**The vapply() Function**

And lastly the vapply() function, whose arguments are shown below.

**Arguments**

**X:** A vector (atomic or list).

**FUN:** A function to be applied.

**FUN.VALUE:** A (generalized) vector; a template for the return value from FUN.

**...:** Optional arguments to FUN.

**USE.NAMES:** Logical; if TRUE and if X is character, use X as names for the result unless it had names already.

**Use these commands to try the vapply() function**

# vapply function

# vapply() is similar to sapply(), but it explicitly specifies the type of the return value (integer, double, character).

vapply(data, sum, FUN.VALUE = double(1))

vapply(data, range, FUN.VALUE = double(2))
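
The practical benefit of the template is that a mismatch is reported as an error instead of silently changing the shape of the result. The sketch below illustrates this; the try() wrapper and the name failed are only there to keep the script running.

```r
data <- list(x = 1:5, y = 6:10, z = 11:15)

# Each call to sum() returns one number, which matches the template numeric(1)
totals <- vapply(data, sum, FUN.VALUE = numeric(1))
totals

# range() returns two numbers, so the template numeric(1) does not match
# and vapply() stops with an error; try() captures it
failed <- inherits(
  try(vapply(data, range, FUN.VALUE = numeric(1)), silent = TRUE),
  "try-error"
)
failed
```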

**Use these commands to try the tapply() and mapply() functions**

################################################

# Using tapply() and mapply()

################################################

# tapply() works on a vector:

# it applies the function to groups defined by a factor.

# syntax is tapply(x, factor, function)

# example 1:

age <- c(23,33,28,21,20,19,34)

gender <- c("m", "m", "m", "f", "f", "f", "m")

f <- factor(gender)

f

tapply(age, f, mean)

tapply(age, gender, mean)

# example 2

# load the datasets package

library(datasets)

# you can view all the datasets

data()

View(mtcars)

class(mtcars)

mtcars$wt

mtcars$cyl

f <- factor(mtcars$cyl)

f

tapply(mtcars$wt, f, mean)

##############################################################################

# mapply() - mapply is a multivariate version of sapply. It applies the specified function

# to the first element of each argument first, followed by the second element, and so on.

# syntax is mapply(function, ...)

## example 1

# create a list: rep(1,4) rep(2,3) rep(3,2) rep(4,1)

a <- list(rep(1,4), rep(2,3), rep(3,2), rep(4,1))

a

# We can see that we are calling the same function (rep) where the first argument

# varies from 1 to 4 and the second argument varies from 4 to 1.

# Instead we can use the mapply function

b <- mapply(rep, 1:4, 4:1)

b

#####################################################################################
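
As a compact recap of both functions, the sketch below stores the results so they can be inspected. The anonymous function passed to mapply() is a made-up example that simply adds its two arguments.

```r
age    <- c(23, 33, 28, 21, 20, 19, 34)
gender <- c("m", "m", "m", "f", "f", "f", "m")

# tapply(): mean age per gender group
mean_age <- tapply(age, gender, mean)
mean_age

# mapply(): call the function with (1, 4), then (2, 5), then (3, 6)
pair_sums <- mapply(function(x, y) x + y, 1:3, 4:6)
pair_sums
```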

**Dplyr Package**

**Load the dplyr and hflights package:**

Welcome to the interactive exercises part of your dplyr course. Here you will learn the ins and outs of working with dplyr. dplyr is an R package, a collection of functions and data sets that enhance the R language.

Throughout this course you will use dplyr to analyze a data set of airline flight data containing flights that departed from Houston. This data is stored in a package called hflights.

If dplyr and hflights are already installed, loading them with library() will get you up and running; otherwise, install them first with install.packages().

**Instructions:**

• Load the dplyr package.

• Load the hflights package. A variable called hflights will become available, a data.frame representing the data set.

• Use both head() and summary() on the hflights data frame to get to know the data.

**Use This Command To Perform The Above Mentioned Function**

#install.packages("dplyr")

# Load the dplyr package

library(dplyr)

# Load the hflights package

library(hflights)

# Call both head() and summary() on hflights

str(hflights)

head(hflights)

summary(hflights)

# tbl - tibble (Special type of dataframe)

# Convert the hflights data.frame into a hflights tbl

hflights <- as_tibble(hflights)  # tbl_df() is deprecated in current dplyr; as_tibble() replaces it

class(hflights)

# Glimpse at hflights

glimpse(hflights)

dplyr::glimpse(hflights)

**The five verbs and their meaning**

The dplyr package contains five key data manipulation functions, also called verbs:

• select(), which returns a subset of the columns,

• filter(), that is able to return a subset of the rows,

• arrange(), that reorders the rows according to single or multiple variables,

• mutate(), used to add columns from existing data,

• summarise(), which reduces each group to a single row by calculating aggregate measures.

**Use This Command To Perform The Above Mentioned Function**

# Five verbs of dplyr - select, filter, arrange, mutate, summarize

# The dplyr package contains five key data manipulation functions, also called verbs:

# 1. select(), -> select specific column from a tbl,

# 2. filter(), -> filter specific rows which matches the logical condition

# 3. arrange(), -> that reorders the rows according to single or multiple variables,

# 4. mutate(), -> add columns from existing data,

# 5. summarise(), -> reduces each group to a single row by calculating aggregate measures.

**Selecting columns using select()**

select() keeps only the variables you mention

**Use This Command To Perform The Above Mentioned Function**

#######################################

#select(): Select specific column from tbl

#######################################

tbl <- select (hflights, ActualElapsedTime, AirTime, ArrDelay, DepDelay )

glimpse(tbl)

#starts_with("X"): every name that starts with "X",

#ends_with("X"): every name that ends with "X",

#contains("X"): every name that contains "X",

#matches("X"): every name that matches "X", where "X" can be a regular expression,

#num_range("x", 1:5): the variables named x1, x2, x3, x4 and x5 (add width = 2 to match x01, x02, ...),

#one_of(x): every name that appears in x, which should be a character vector.

#Example: print out only the UniqueCarrier, FlightNum, TailNum, Cancelled, and CancellationCode columns of hflights

select(hflights, ends_with("Num"))

select(hflights, starts_with("Cancel"))

select(hflights, UniqueCarrier, ends_with("Num"), starts_with("Cancel"))
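
If hflights is not installed, the same helper functions can be tried on the built-in mtcars data. This is only an illustrative sketch: it assumes the dplyr package is installed, and the object name cars is arbitrary.

```r
library(dplyr)

cars <- as_tibble(mtcars)   # columns: mpg, cyl, disp, hp, drat, wt, qsec, vs, am, gear, carb

by_name   <- select(cars, mpg, cyl)           # columns listed by name
d_columns <- select(cars, starts_with("d"))   # disp and drat
t_columns <- select(cars, ends_with("t"))     # drat and wt
ar_cols   <- select(cars, contains("ar"))     # gear and carb

names(d_columns)
names(t_columns)
names(ar_cols)
```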

**Create new columns using mutate()**

mutate() is the second of five data manipulation functions you will get familiar with in this course. mutate() creates new columns which are added to a copy of the dataset.

**Use This Command To Perform The Above Mentioned Function**

#######################################

#mutate(): Add columns from existing data

#######################################

g2 <- mutate(hflights, loss = ArrDelay - DepDelay)

g2

g1 <- mutate(hflights, ActualGroundTime = ActualElapsedTime - AirTime)

g1

#hflights$ActualGroundTime <- hflights$ActualElapsedTime - hflights$AirTime

#######################################

**Selecting rows using filter()**

Filtering data is one of the most basic operations when you work with data. You may want to remove a part of the data that is invalid or that you’re simply not interested in, or zero in on a particular part of the data you want to know more about. dplyr has the filter() function for this, and it lets you express filters that could be hard or complicated to construct with tools like SQL and traditional BI tools in a simple and intuitive way.

**R comes with a set of logical operators that you can use inside filter():**

• <

• <=

• ==

• !=

• >=

• >

**Use This Command To Perform The Above Mentioned Function**

#filter() : Filter specific rows which matches the logical condition

#######################################

#R comes with a set of logical operators that you can use inside filter():

#x < y, TRUE if x is less than y

#x <= y, TRUE if x is less than or equal to y

#x == y, TRUE if x equals y

#x != y, TRUE if x does not equal y

#x >= y, TRUE if x is greater than or equal to y

#x > y, TRUE if x is greater than y

#x %in% c(a, b, c), TRUE if x is in the vector c(a, b, c)

# All flights that traveled 3000 miles or more

long_flight <- filter(hflights, Distance >= 3000)

View(long_flight)

glimpse(long_flight)

# All flights where taxiing took longer than flying

long_journey <- filter(hflights, TaxiIn + TaxiOut > AirTime)

View(long_journey)

# All flights that departed before 5am or arrived after 10pm

All_Day_Journey <- filter(hflights, DepTime < 500 | ArrTime > 2200)

# All flights that departed late but arrived ahead of schedule

Early_Flight <- filter(hflights, DepDelay > 0, ArrDelay < 0)

glimpse(Early_Flight)

# All flights that were cancelled after being delayed

Cancelled_Delay <- filter(hflights, Cancelled == 1, DepDelay > 0)

#How many weekend flights flew a distance of more than 1000 miles but

#had a total taxiing time below 15 minutes?

w <- filter(hflights, DayOfWeek == 6 |DayOfWeek == 7, Distance >1000, TaxiIn + TaxiOut <15)

nrow(w)

y <- filter(hflights, DayOfWeek %in% c(6,7), Distance > 1000, TaxiIn + TaxiOut < 15)

nrow(y)

#######################################
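
The way conditions combine can be sketched on the built-in mtcars data, assuming dplyr is installed. Conditions separated by commas must all hold (AND), while | means that either one may hold (OR). The result names are arbitrary.

```r
library(dplyr)

# Comma = AND: heavy cars (wt above 3.5) with automatic transmission (am == 0)
heavy_auto <- filter(mtcars, wt > 3.5, am == 0)

# %in% tests membership in a set of values
four_or_six <- filter(mtcars, cyl %in% c(4, 6))

# | = OR: very economical or very powerful cars
either <- filter(mtcars, mpg > 30 | hp > 300)

nrow(heavy_auto)
nrow(four_or_six)
nrow(either)
```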

**Arrange or re-order rows using arrange()**

To arrange (or re-order) rows by a particular column, pass the name of that column to arrange().

**Use This Command To Perform The Above Mentioned Function**

#######################################

#arrange(): reorders the rows according to single or multiple variables,

#######################################

dtc <- filter(hflights, Cancelled == 1, !is.na(DepDelay)) #Delay not equal to NA

glimpse(dtc)

# Arrange dtc by departure delays

d <- arrange(dtc, DepDelay)

# Arrange dtc so that cancellation reasons are grouped

c <- arrange(dtc,CancellationCode )

#By default, arrange() arranges the rows from smallest to largest.

#Rows with the smallest value of the variable will appear at the top of the data set.

#You can reverse this behavior with the desc() function.

# Arrange according to carrier and decreasing departure delays

des_Flight <- arrange(hflights, UniqueCarrier, desc(DepDelay))

# Arrange flights by total delay (normal order).

arrange(hflights, ArrDelay + DepDelay)

#######################################

**Create summaries of the data frame using summarise()**

The summarise() function will create summary statistics for a given column in the data frame such as finding the mean.

**Use This Command To Perform The Above Mentioned Function**

#######################################

#summarise(): reduces each group to a single row by calculating aggregate measures.

#######################################

#summarise(), follows the same syntax as mutate(),

#but the resulting dataset consists of a single row instead of an entire new column in the case of mutate()

#min(x) - minimum value of vector x.

#max(x) - maximum value of vector x.

#mean(x) - mean value of vector x.

#median(x) - median value of vector x.

#quantile(x, p) - pth quantile of vector x.

#sd(x) - standard deviation of vector x.

#var(x) - variance of vector x.

#IQR(x) - Inter Quartile Range (IQR) of vector x.

#diff(range(x)) - total range of vector x.

# Print out a summary with variables

# min_dist, the shortest distance flown, and max_dist, the longest distance flown

summarise(hflights, max_dist = max(Distance),min_dist = min(Distance))

# Print out a summary of hflights with max_div: the longest Distance for diverted flights.

# Print out a summary with variable max_div

div <- filter(hflights, Diverted ==1 )

summarise(div, max_div = max(Distance))

summarise(filter(hflights, Diverted == 1), max_div = max(Distance))

###########################################################
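
The phrase "reduces each group to a single row" becomes clearer when summarise() is combined with group_by(). Note that group_by() is not covered in the text above; it is brought in here only as an illustration, on the built-in mtcars data and assuming dplyr is installed.

```r
library(dplyr)

# Without groups: the whole data set collapses to one row
overall <- summarise(mtcars,
                     min_wt   = min(wt),
                     max_wt   = max(wt),
                     mean_mpg = mean(mpg))

# With groups: one row per number of cylinders
by_cyl <- summarise(group_by(mtcars, cyl),
                    n_cars   = n(),
                    mean_mpg = mean(mpg))

overall
by_cyl
```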

**Pipe operator: %>%**

Before we go any further, let’s introduce the pipe operator: %>%. dplyr imports this operator from another package (magrittr). This operator allows you to pipe the output from one function to the input of another function. Instead of nesting functions (reading from the inside to the outside), the idea of piping is to read the functions from left to right.

**Use This Command To Perform The Above Mentioned Function**

#######################################

#Chaining function using Pipe Operators

#######################################

hflights %>%

filter(DepDelay>240) %>%

mutate(TaxingTime = TaxiIn + TaxiOut) %>%

arrange(TaxingTime)%>%

select(TailNum )

# Write the 'piped' version of the English sentences.

# Use dplyr functions and the pipe operator to transform the following English sentences into R code:

# Take the hflights data set and then ...

# Add a variable named diff that is the result of subtracting TaxiIn from TaxiOut, and then ...

# Pick all of the rows whose diff value does not equal NA, and then ...

# Summarise the data set with a value named avg that is the mean diff value.

hflights %>%

mutate(diff = TaxiOut - TaxiIn) %>%

filter(!is.na(diff)) %>%

summarise(avg = mean(diff))

# mutate() the hflights dataset and add two variables:

# RealTime: the actual elapsed time plus 100 minutes (for the overhead that flying involves) and

# mph: calculated as Distance / RealTime * 60, then

# filter() to keep observations that have an mph that is not NA and that is below 70, finally

# summarise() the result by creating four summary variables:

# n_less, the number of observations,

# n_dest, the number of destinations,

# min_dist, the minimum distance and

# max_dist, the maximum distance.

# Chain together mutate(), filter() and summarise()

hflights %>%

mutate(RealTime = ActualElapsedTime + 100, mph = Distance / RealTime * 60) %>%

filter(!is.na(mph), mph < 70) %>%

summarise(n_less = n(),

n_dest = n_distinct(Dest),

min_dist = min(Distance),

max_dist = max(Distance))

#######################################
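
To see that piping only changes how the code reads, the sketch below writes the same computation twice, once nested and once piped, using the built-in mtcars data (dplyr assumed to be installed). The names nested and piped are arbitrary.

```r
library(dplyr)

# Nested: read from the inside out
nested <- summarise(filter(mtcars, cyl == 4), avg_mpg = mean(mpg))

# Piped: read from left to right, top to bottom
piped <- mtcars %>%
  filter(cyl == 4) %>%
  summarise(avg_mpg = mean(mpg))

nested
piped
```

Both versions call the same functions with the same arguments; x %>% f(y) is simply another way of writing f(x, y).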

Dates can be imported from character or numeric formats using the *as.Date* function from the **base** package.

If your data were exported from Excel, they will possibly be in numeric format. Otherwise, they will most likely be stored in character format. If your dates are stored as characters, you simply need to provide *as.Date* with your vector of dates and the format they are currently stored in.
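
A short sketch of both cases follows. The date values are invented for illustration, and the numeric case assumes the numbers came from Excel on Windows, for which the as.Date help page suggests the origin "1899-12-30".

```r
# Character dates in the default ISO format (year-month-day) need no format string
d1 <- as.Date(c("2018-01-15", "2018-02-28"))

# Other layouts need a format: %m = month, %d = day, %Y = four-digit year
d2 <- as.Date(c("01/15/2018", "02/28/2018"), format = "%m/%d/%Y")

# Numeric dates: days counted from an origin (assumed here: Excel on Windows)
d3 <- as.Date(c(43101, 43115), origin = "1899-12-30")

d1
d2
d3
```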

### Data Visualization

**Data Visualization **

**Basic Visualization**

Scatter Plot

Line Chart

Bar Plot

Pie Chart

Histogram

Density plot

Box Plot

**Advanced Visualization**

Mosaic Plot

Heat Map

3D charts

Correlation Plot

Word Cloud

**Scatter Plot**

Scatterplots use a collection of points placed using Cartesian Coordinates to display values from two variables. By displaying a variable in each axis, you can detect if a relationship or correlation between the two variables exists.
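
As a minimal illustration in base R, the sketch below plots car weight against fuel economy from the built-in mtcars data and computes the correlation that the plot suggests. The choice of variables is only an example.

```r
# Scatter plot: weight on the x-axis, miles per gallon on the y-axis
plot(mtcars$wt, mtcars$mpg,
     xlab = "Weight (1000 lbs)",
     ylab = "Miles per gallon",
     main = "Fuel economy vs. weight")

# The downward pattern corresponds to a negative correlation
r <- cor(mtcars$wt, mtcars$mpg)
r
```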

**Data Visualization – mfrow**

Create a multi-paneled plotting window. Setting par(mfrow = c(rows, columns)) is handy for creating a simple multi-paneled plot, while layout() should be used for customized panel plots of varying sizes.
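
A small sketch of a one-row, two-column layout, using the built-in mtcars data as an example. Saving the value returned by par() makes it possible to restore the previous settings afterwards.

```r
# Split the plotting window into 1 row and 2 columns; keep the old settings
old_par <- par(mfrow = c(1, 2))

plot(mtcars$wt, mtcars$mpg, main = "Scatter plot")   # drawn in the left panel
hist(mtcars$mpg, main = "Histogram")                 # drawn in the right panel

# Restore the previous single-panel layout
par(old_par)
```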

**Data Visualization - pch**

Different **plotting symbols** are available in **R**. The **graphical** argument used to specify **point shapes** is **pch**.

**Data Visualization – Color**

**Data visualization** **(visualisation),** or the visual communication of data, is the study or creation of data represented visually. A good graph is easy to read. A goal when creating data visualizations is to convey information in a clear and concise way. One of the most prominent features of most data visualizations is color. Color is important because it lets you set the mood, and it lets you guide the viewer’s eye, draw attention to something and therefore tell a story. Both aspects are important for data visualizations.

When working with color in R:

There are 657 built-in color names

R uses hexadecimal to represent colors

You can create vectors of *n* contiguous colors using rainbow(*n*), heat.colors(*n*), terrain.colors(*n*), topo.colors(*n*) and cm.colors(*n*).
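
These points can be checked directly in the console. The sketch below is illustrative only; the variable names are arbitrary.

```r
# Number of built-in color names
n_named <- length(colors())
n_named

# Hexadecimal representation of pure red
red_hex <- rgb(255, 0, 0, maxColorValue = 255)
red_hex

# A palette of 5 contiguous colors, shown as a bar plot
pal <- rainbow(5)
barplot(rep(1, 5), col = pal, names.arg = pal, las = 2, main = "rainbow(5)")
```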

**Data Visualization -Line Chart**

Line charts display information as a series of data points connected by straight line segments on an X-Y axis. They are best used to track changes over time, using equal intervals of time between each data point.

**CHARACTERISTICS**

INCLUDE A ZERO BASELINE IF POSSIBLE

DON’T PLOT MORE THAN 4 LINES

USE SOLID LINES ONLY

USE THE RIGHT HEIGHT

LABEL THE LINES DIRECTLY

**When to use a line chart**

Line graphs are useful in that they show data variables and trends very clearly.

It helps to make predictions about the results of data not yet recorded. If seeing the trend of your data is the goal, then this is the chart to use.

Line charts show time-series relationships using continuous data.

They allow a quick assessment of acceleration (lines curving upward), deceleration (lines curving downward), and volatility (up/down frequency).

They are excellent for tracking multiple data sets on the same chart to see any correlation in trends.

They can also be used to display several dependent variables against one independent variable.

Line charts are great visualizations to see how a metric changes over time. For example, the exchange rate for GBP to USD.
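
A minimal base R sketch of such a chart follows. The monthly exchange-rate values are made up for illustration and are not real GBP to USD data.

```r
month   <- 1:12
# Hypothetical monthly GBP to USD rates (invented numbers)
gbp_usd <- c(1.38, 1.40, 1.40, 1.41, 1.35, 1.33,
             1.32, 1.29, 1.30, 1.30, 1.29, 1.27)

# type = "l" connects the points with straight line segments
plot(month, gbp_usd, type = "l", lwd = 2,
     xlab = "Month", ylab = "USD per GBP",
     main = "Exchange rate over time (illustrative data)")
points(month, gbp_usd, pch = 16)   # mark the individual observations
```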

**Data Visualization - Bar Chart**

The classic Bar Chart uses either horizontal or vertical bars (column chart) to show discrete, numerical comparisons across categories. One axis of the chart shows the specific categories being compared and the other axis represents a discrete value scale.

Bar Charts are distinguished from Histograms, as they do not display continuous developments over an interval. A Bar Chart's discrete data is categorical data.

**CHARACTERISTICS**

START THE Y-AXIS VALUE AT 0

USE HORIZONTAL LABELS

ORDER DATA APPROPRIATELY

SPACE BARS APPROPRIATELY

USE CONSISTENT COLORS
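
As an illustration in base R, the sketch below counts the cars in mtcars by number of cylinders and draws one bar per category. barplot() starts the value axis at 0 by default, and a single color is used for all bars.

```r
# Count the cars in each cylinder category
cyl_counts <- table(mtcars$cyl)
cyl_counts

barplot(cyl_counts,
        col  = "steelblue",
        xlab = "Number of cylinders",
        ylab = "Number of cars",
        main = "Cars by cylinder count")
```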

**Data Visualization - Pie Chart**

A pie chart shows a static number and how categories represent part of a whole -- the composition of something. A pie chart represents numbers in percentages, and the total sum of all segments needs to equal 100%.

**Design Best Practices for Pie Charts:**

Don't illustrate too many categories to ensure differentiation between slices.

Ensure that the slice values add up to 100%.

Order slices according to their size.

Pie charts are best used for making part-to-whole comparisons with discrete or continuous data. They are most impactful with a small data set.

**CHARACTERISTICS**

VISUALIZE NO MORE THAN 5 CATEGORIES PER CHART

DON’T USE MULTIPLE PIE CHARTS FOR COMPARISON

MAKE SURE ALL DATA ADDS UP TO 100%

ORDER SLICES CORRECTLY
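
A minimal base R sketch that follows these guidelines: three categories only, slices ordered by size, and labels showing each share of the whole. The mtcars gear counts are used purely as an example.

```r
# Count cars by number of gears and order the slices from largest to smallest
gear_counts <- sort(table(mtcars$gear), decreasing = TRUE)

# Convert the counts to percentages of the whole
pct <- round(100 * gear_counts / sum(gear_counts))

pie(gear_counts,
    labels = paste0(names(gear_counts), " gears: ", pct, "%"),
    main   = "Share of cars by number of gears")
```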

**Data Visualization - Histogram **

A Histogram visualizes the distribution of data over a continuous interval or certain time period. Each bar in a histogram represents the tabulated frequency at each interval/bin.

Histograms help give an estimate as to where values are concentrated, what the extremes are and whether there are any gaps or unusual values.

They are also useful for giving a rough view of the probability distribution.

A histogram is a common chart type used to present the distribution of a single variable across a set of intervals (bins).
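
In base R a histogram is drawn with hist(). The sketch below uses the mpg column of mtcars as an example; breaks = 10 is only a suggestion for the number of bins, and the returned object holds the bin counts.

```r
h <- hist(mtcars$mpg,
          breaks = 10,
          col    = "grey80",
          xlab   = "Miles per gallon",
          main   = "Distribution of fuel economy")

h$breaks   # bin boundaries chosen by hist()
h$counts   # number of cars in each bin
```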

**Data Visualization - Density Plot**

A Density Plot visualizes the distribution of data over a continuous interval or time period.

This chart is a variation of a Histogram that uses kernel smoothing to plot values, allowing for smoother distributions by smoothing out the noise.

The peaks of a Density Plot help display where values are concentrated over the interval.

A Histogram comprising only 4 bins wouldn't produce as distinguishable a distribution shape as a 20-bin Histogram would. However, with Density Plots, this isn't an issue.

An advantage Density Plots have over Histograms is that they're better at determining the shape because they're not affected by the number of bins used (each bar used in a typical histogram).
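
The relationship between the two chart types can be sketched in base R by drawing a kernel density estimate on top of a histogram. The mpg column of mtcars is used as example data, and density() is called with its default settings.

```r
# Kernel density estimate of fuel economy
d <- density(mtcars$mpg)

# freq = FALSE puts the histogram on the density scale so both fit one axis
hist(mtcars$mpg, freq = FALSE, col = "grey90",
     xlab = "Miles per gallon", main = "Histogram with density curve")
lines(d, lwd = 2)
```

The smoothness of the curve is controlled by the bandwidth (the bw argument of density()) rather than by a number of bins.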