How to Merge Data in R

16

This tutorial will discuss how to merge data in R data frames. DataFrame in R is a data structure that stores data in rows and columns with multiple data type elements. We can create a dataframe in R by using data.frame() function.

Also ReadMatrices in R | How to Create, Update, and Modify R Matrices

How to merge data in R

We can use data.frame() to create a dataframe.

Syntax:

data.frame(data)

Where data can be a vector with items.

Example 1: We will create a data frame with 4 rows and 3 columns in this example.

#create dataframe with plant details
my_dataframe1=data.frame(plant_id=c(1,2,3,4), 
               plant_name=c("rose","lotus","jasmine","rose"),
               plant_color=c("pink","white","white","red"))

#display
print(my_dataframe1)

Output:

  plant_id plant_name plant_color
1        1       rose        pink
2        2      lotus       white
3        3    jasmine       white
4        4       rose         red

Also Read10 Best R Programming Courses On Udemy

Merging data

Merging is also known as joining. Based on columns, we can merge/join the data present in the data frame.

Syntax:

merge(x=dataframe_input1,y=dataframe_input2)

Where,

  1. dataframe_input1 is the first input dataframe. Also known as the left dataframe.
  2. dataframe_input2 is the second input dataframe. Also known as the right dataframe.

There are 4 types of merges. Let’s discuss them one by one in detail.

Also ReadHow To Learn Data Science [Beginner’s Guide]

1. Inner Join

This will join both the dataframes where matched column elements are present in both the dataframes, like the elements present in the columns. It will not join the unmatched columns.

Syntax:

merge(x=dataframe_input1,y=dataframe_input2)

where,

  1. dataframe_input1 is the first input dataframe. Also known as left dataframe
  2. dataframe_input2, is the second input dataframe. Also known as the right dataframe.

Example: We will create 2 dataframes with 4 rows and 3 columns each in this example.

first dataframe has columns – [plant_id,plant_name,plant_color] and second dataframe has columns – [plant_id,plant_name,plant_area] and perform inner join.

#create first dataframe with plant details
my_dataframe1=data.frame(plant_id=c(1,2,3,4), 
               plant_name=c("rose","lotus","jasmine","rose"),
               plant_color=c("pink","white","white","red"))

#display
print(my_dataframe1)

#create second dataframe with plant details
my_dataframe2=data.frame(plant_id=c(1,2,8,5), 
               plant_name=c("rose","lotus","marigold","datura"),
               plant_area=c("TS","AP","AP","Delhi"))

#display
print(my_dataframe2)

#perform inner join
print(merge(x=my_dataframe1,y=my_dataframe2))

Output:

  plant_id plant_name plant_color
1        1       rose        pink
2        2      lotus       white
3        3    jasmine       white
4        4       rose         red
  plant_id plant_name plant_area
1        1       rose         TS
2        2      lotus         AP
3        8   marigold         AP
4        5     datura      Delhi
  plant_id plant_name plant_color plant_area
1        1       rose        pink         TS
2        2      lotus       white         AP

In the above output, we have seen that only two rows match in both the dataframes – plant_id 1 and 2. hence they are merged.

Also Read15 Best Python Libraries for Data Science and Analysis

2. Outer Join

This will join both the dataframes where column elements are either matched/unmatched.

Suppose the columns are unmatched concerning the dataframe. It will replace the NA value in the other dataframe. In this way, it will return both the joined dataframes.

Syntax:

merge(x=dataframe_input1,y=dataframe_input2,all=TRUE)

where,

  1. dataframe_input1 is the first input dataframe. Also known as left dataframe
  2. dataframe_input2, is the second input dataframe. Also known as the right dataframe.
  3. All parameter is used to join all dataframes, and it is set to TRUE.

Example: We will create 2 dataframes with 4 rows and 3 columns each in this example.

first dataframe has columns – [plant_id,plant_name,plant_color] and second dataframe has columns – [plant_id,plant_name,plant_area] and perform outer join.

#create first dataframe with plant details
my_dataframe1=data.frame(plant_id=c(1,2,3,4), 
               plant_name=c("rose","lotus","jasmine","rose"),
               plant_color=c("pink","white","white","red"))

#display
print(my_dataframe1)

#create second dataframe with plant details
my_dataframe2=data.frame(plant_id=c(1,2,8,5), 
               plant_name=c("rose","lotus","marigold","datura"),
               plant_area=c("TS","AP","AP","Delhi"))

#display
print(my_dataframe2)

#perform outer join
print(merge(x=my_dataframe1,y=my_dataframe2,all=TRUE))

Output:

  plant_id plant_name plant_color
1        1       rose        pink
2        2      lotus       white
3        3    jasmine       white
4        4       rose         red
  plant_id plant_name plant_area
1        1       rose         TS
2        2      lotus         AP
3        8   marigold         AP
4        5     datura      Delhi
  plant_id plant_name plant_color plant_area
1        1       rose        pink         TS
2        2      lotus       white         AP
3        3    jasmine       white       <NA>
4        4       rose         red       <NA>
5        5     datura        <NA>      Delhi
6        8   marigold        <NA>         AP

In the above output, we have seen that all the rows from both the dataframes are merged.

Also Read10 Best Data Science Coursera Courses For Beginners

3. Left Join

It will join both the dataframe, where it will return all rows from the left/first dataframe and matched columns from the second(right) dataframe concerning the first dataframe.

If the values in the second dataframe are not matched concerning the first dataframe, then NA is filled in the second dataframe.

Syntax:

merge(x=dataframe_input1,y=dataframe_input2,all.x=TRUE)

where,

  1. dataframe_input1 is the first input dataframe. Also known as left dataframe
  2. dataframe_input2, is the second input dataframe. Also known as the right dataframe.
  3. all.x parameter represents the left dataframe, which is set to TRUE.

Example: We will create 2 dataframes with 4 rows and 3 columns each in this example.

first dataframe has columns – [plant_id,plant_name,plant_color] and second dataframe has columns – [plant_id,plant_name,plant_area] and perform left join.

#create first dataframe with plant details
my_dataframe1=data.frame(plant_id=c(1,2,3,4), 
               plant_name=c("rose","lotus","jasmine","rose"),
               plant_color=c("pink","white","white","red"))

#display
print(my_dataframe1)

#create second dataframe with plant details
my_dataframe2=data.frame(plant_id=c(1,2,8,5), 
               plant_name=c("rose","lotus","marigold","datura"),
               plant_area=c("TS","AP","AP","Delhi"))

#display
print(my_dataframe2)

#perform left join
print(merge(x=my_dataframe1,y=my_dataframe2,all.x=TRUE))

Output:

  plant_id plant_name plant_color
1        1       rose        pink
2        2      lotus       white
3        3    jasmine       white
4        4       rose         red
  plant_id plant_name plant_area
1        1       rose         TS
2        2      lotus         AP
3        8   marigold         AP
4        5     datura      Delhi
  plant_id plant_name plant_color plant_area
1        1       rose        pink         TS
2        2      lotus       white         AP
3        3    jasmine       white       <NA>
4        4       rose         red       <NA>

Also Read11 Best Free Android Apps To Learn Data Science

4. Right Join

It will join both the dataframe, where it will return all rows from the right/second dataframe and matched columns from the first(left) dataframe concerning the second dataframe.

If the values in the first dataframe are not matched for the second dataframe, then NA is filled in the first dataframe.

Syntax:

merge(x=dataframe_input1,y=dataframe_input2,all.y=TRUE)

where,

  1. dataframe_input1 is the first input dataframe. Also known as left dataframe
  2. dataframe_input2, is the second input dataframe. Also known as the right dataframe.
  3. all.y parameter represents the right dataframe and is set to TRUE.

Example: We will create 2 dataframes with 4 rows and 3 columns each in this example.

first dataframe has columns – [plant_id,plant_name,plant_color] and second dataframe has columns – [plant_id,plant_name,plant_area] and perform right join.

#create first dataframe with plant details
my_dataframe1=data.frame(plant_id=c(1,2,3,4), 
               plant_name=c("rose","lotus","jasmine","rose"),
               plant_color=c("pink","white","white","red"))

#display
print(my_dataframe1)

#create second dataframe with plant details
my_dataframe2=data.frame(plant_id=c(1,2,8,5), 
               plant_name=c("rose","lotus","marigold","datura"),
               plant_area=c("TS","AP","AP","Delhi"))

#display
print(my_dataframe2)

#perform right join
print(merge(x=my_dataframe1,y=my_dataframe2,all.y=TRUE))

Output:

  plant_id plant_name plant_color
1        1       rose        pink
2        2      lotus       white
3        3    jasmine       white
4        4       rose         red
  plant_id plant_name plant_area
1        1       rose         TS
2        2      lotus         AP
3        8   marigold         AP
4        5     datura      Delhi
  plant_id plant_name plant_color plant_area
1        1       rose        pink         TS
2        2      lotus       white         AP
3        5     datura        <NA>      Delhi
4        8   marigold        <NA>         AP

Also ReadWhat Is a Business Intelligence Analyst?

Merge row-wise

We can join rows by appending the second dataframe to the first dataframe using the rbind() function.

Syntax:

rbind(my_dataframe1,my_dataframe2)

where,

my_dataframe1 is the first dataframe and my_dataframe2 is the second dataframe.

Example:

In this example, we will join two dataframes using rbind().

#create first dataframe with plant details
my_dataframe1=data.frame(plant_id=c(1,2,3,4), 
               plant_name=c("rose","lotus","jasmine","rose"),
               plant_color=c("pink","white","white","red"))

#display
print(my_dataframe1)

#create second dataframe with plant details
my_dataframe2=data.frame(plant_id=c(5,6,7,8), 
               plant_name=c("rose","lotus","marigold","datura"),
               plant_color=c("pink","white","white","red"))

#display
print(my_dataframe2)

#join rows
print(rbind(my_dataframe1,my_dataframe2))

Output:

  plant_id plant_name plant_color
1        1       rose        pink
2        2      lotus       white
3        3    jasmine       white
4        4       rose         red
  plant_id plant_name plant_color
1        5       rose        pink
2        6      lotus       white
3        7   marigold       white
4        8     datura         red
  plant_id plant_name plant_color
1        1       rose        pink
2        2      lotus       white
3        3    jasmine       white
4        4       rose         red
5        5       rose        pink
6        6      lotus       white
7        7   marigold       white
8        8     datura         red

Also Read10 Best Data Science Courses On Udemy [2022]

Merge columns-wise.

We can join columns by appending the second dataframe to the first dataframe using the cbind() function.

Syntax:

cbind(my_dataframe1,my_dataframe2)

where,

my_dataframe1 is the first dataframe and my_dataframe2 is the second dataframe.

Example:

In this example, we will join two dataframes using cbind().

#create first dataframe with plant details
my_dataframe1=data.frame(plant_id=c(1,2,3,4), 
               plant_name=c("rose","lotus","jasmine","rose"),
               plant_color=c("pink","white","white","red"))

#display
print(my_dataframe1)

#create second dataframe with plant details
my_dataframe2=data.frame(plant_id=c(5,6,7,8), 
               plant_name=c("rose","lotus","marigold","datura"),
               plant_color=c("pink","white","white","red"))

#display
print(my_dataframe2)

#join by columns
print(cbind(my_dataframe1,my_dataframe2))

Output:

  plant_id plant_name plant_color
1        1       rose        pink
2        2      lotus       white
3        3    jasmine       white
4        4       rose         red
  plant_id plant_name plant_color
1        5       rose        pink
2        6      lotus       white
3        7   marigold       white
4        8     datura         red
  plant_id plant_name plant_color plant_id plant_name plant_color
1        1       rose        pink        5       rose        pink
2        2      lotus       white        6      lotus       white
3        3    jasmine       white        7   marigold       white
4        4       rose         red        8     datura         red

Also Read10 Best Books On Data Science For Beginners [2022]

Summary

From this tutorial, we discussed how to merge dataframes by row and column. We have also seen different types of joins to merge two dataframes in R.