Introduction

Identifying meaningful locations, such as home or work, from mobility data is an essential step in the field of mobility analysis. The homelocator library is designed to identify the home locations of users based on the spatio-temporal features contained within mobility data, such as social media data, mobile phone data, and mobile app data.

Although a myriad of different approaches to inferring meaningful locations has been proposed over the last 10 years, a consensus or single best approach has not emerged in the current state-of-the-art. The actual algorithms used are not always discussed in detail in publications, and the source code is seldom released publicly, which makes comparing algorithms, as well as reproducing work, difficult.

The main objective in developing this package is therefore to provide a consistent framework and interface for the adoption of different approaches, so that researchers are able to write structured, algorithmic ‘recipes’, or use the existing embedded ‘recipes’, to identify meaningful locations according to their research requirements. With this package, users are able to make an apples-to-apples comparison across approaches.

We hope that through packages like homelocator, future work that relies on the inference of meaningful locations becomes less ‘custom’ (with each researcher writing their own algorithm) and instead uses common, comparable algorithms, in order to increase transparency and reproducibility in the field.

Load Library

# Load homelocator library
library(homelocator)
#> Welcome to homelocator package!

Test Data

To explore the basic data manipulation functions of homelocator, we’ll use a test_sample dataset.

# load data
data("test_sample", package = "homelocator")

# show test sample 
test_sample %>% head(3)
#> # A tibble: 3 × 3
#>       u_id created_at          grid_id
#>      <int> <dttm>                <int>
#> 1 92298565 2016-04-17 22:43:06    1581
#> 2 33908340 2014-10-03 16:29:48    1461
#> 3 92298565 2014-02-07 07:26:15    1136

Note: if you use your own dataset, please make sure that you have converted the timestamps to the appropriate local time zone.
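As a minimal base-R sketch of such a conversion (the UTC source and the Asia/Singapore target zone here are just assumptions for illustration):

```r
# a timestamp recorded in UTC
ts_utc <- as.POSIXct("2016-04-17 14:43:06", tz = "UTC")

# reinterpret the same instant in the local time zone
ts_local <- ts_utc
attr(ts_local, "tzone") <- "Asia/Singapore"

format(ts_local, "%Y-%m-%d %H:%M:%S")  # "2016-04-17 22:43:06"
```

The underlying instant is unchanged; only the display time zone attribute is swapped, which is what matters for deriving local hours and weekdays later on.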

Usage

Functions

Validate input dataset

The validate_dataset() function ensures that your input dataset contains the three necessary variables: user, location, and timestamp. The function takes four arguments:

  • user: name of column that holds a unique identifier for each user.
  • timestamp: name of column that holds the timestamp for each data point. This timestamp should be in POSIXct format.
  • location: name of column that holds a unique identifier for each location.
  • keep_other_vars: option to keep or remove any other variables of the input dataset. The default is FALSE.
df_validated <- validate_dataset(test_sample, user = "u_id", timestamp = "created_at", location = "grid_id", keep_other_vars = FALSE)
#> 🎉 Congratulations!! Your dataset has passed validation.
#> 👤 There are 100 unique users in your dataset.
#> 🌏 Now start your journey identifying their meaningful location(s)!
#> 👏 Good luck!
#> 

# show result 
df_validated %>% head(3)
#> # A tibble: 3 × 3
#>       u_id grid_id created_at         
#>      <int>   <int> <dttm>             
#> 1 92298565    1581 2016-04-17 22:43:06
#> 2 33908340    1461 2014-10-03 16:29:48
#> 3 92298565    1136 2014-02-07 07:26:15

Nest and Unnest

The nest_verbose() and unnest_verbose() functions work in the same way as the nest and unnest functions in the tidyverse, but print some additional status information, such as the elapsed running time.

# nest data
df_nested <- nest_verbose(df_validated, c("created_at", "grid_id"))
#> 🛠 Start nesting...
#> ✅ Finish nesting!
#> ⌛ Nesting time: 0.019 secs
#> 

# show result 
df_nested %>% head(3)
#> # A tibble: 3 × 2
#>       u_id data                
#>      <int> <list>              
#> 1 92298565 <tibble [1,291 × 2]>
#> 2 33908340 <tibble [1,170 × 2]>
#> 3 11616678 <tibble [938 × 2]>

# show result 
df_nested$data[[1]] %>% head(3)
#> # A tibble: 3 × 2
#>   created_at          grid_id
#>   <dttm>                <int>
#> 1 2016-04-17 22:43:06    1581
#> 2 2014-02-07 07:26:15    1136
#> 3 2012-08-18 19:26:31    1038

# unnest data
df_unnested <- unnest_verbose(df_nested, data)
#> 🛠 Start unnesting...
#> ✅ Finish unnesting!
#> ⌛ Unnesting time: 0.015 secs
#> 

# show result 
df_unnested %>% head(3)
#> # A tibble: 3 × 3
#>       u_id created_at          grid_id
#>      <int> <dttm>                <int>
#> 1 92298565 2016-04-17 22:43:06    1581
#> 2 92298565 2014-02-07 07:26:15    1136
#> 3 92298565 2012-08-18 19:26:31    1038

Double nest and Double unnest

The nest_nested() and unnest_double_nested() functions work in a similar way to the nest_verbose() and unnest_verbose() functions, but they apply to an already nested data frame. In other words, they map the nest and unnest functions over each element of a list-column, creating a double-nested data frame, or vice versa. This is an essential step in many home location algorithms, as they often operate ‘per user, per location’.

  • df: a nested dataframe
  • ...: for nest_nested() refers to the selected columns to nest; for unnest_double_nested() refers to the list-column to unnest.
# double nest data (e.g., nesting column: created_at)
df_double_nested <- nest_nested(df_nested %>% head(100), c("created_at"))
#> Warning: `progress_estimated()` was deprecated in dplyr 1.0.0.
#>  The deprecated feature was likely used in the homelocator package.
#>   Please report the issue to the authors.
#> This warning is displayed once every 8 hours.
#> Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
#> generated.
#> 🛠 Start nesting...
#> 
#> ✅ Finish nesting!
#> ⌛ Nesting time: 0.79 secs
#> 

# show result 
df_double_nested %>% head(3)
#> # A tibble: 3 × 2
#>       u_id data              
#>      <int> <list>            
#> 1 92298565 <tibble [200 × 2]>
#> 2 33908340 <tibble [83 × 2]> 
#> 3 11616678 <tibble [61 × 2]>

# show result 
df_double_nested$data[[1]] %>% head(3)
#> # A tibble: 3 × 2
#>   grid_id data            
#>     <int> <list>          
#> 1    1581 <tibble [3 × 1]>
#> 2    1136 <tibble [7 × 1]>
#> 3    1038 <tibble [8 × 1]>

df_double_nested$data[[1]]$data[[1]] %>% head(3)
#> # A tibble: 3 × 1
#>   created_at         
#>   <dttm>             
#> 1 2016-04-17 22:43:06
#> 2 2014-12-28 03:26:34
#> 3 2014-12-03 09:26:36

# unnest nested data 
df_double_unnested <- df_double_nested %>% 
  head(100) %>%  ## take first 100 rows for example
  unnest_double_nested(., data) 
#> 🛠 Start unnesting...
#> ✅ Finish unnesting!
#> ⌛ Unnesting time: 0.702 secs
#> 

# show result 
df_double_unnested %>% head(3)
#> # A tibble: 3 × 3
#>       u_id grid_id created_at         
#>      <int>   <int> <dttm>             
#> 1 92298565    1581 2016-04-17 22:43:06
#> 2 92298565    1581 2014-12-28 03:26:34
#> 3 92298565    1581 2014-12-03 09:26:36
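Conceptually, double nesting is a split within a split: first group by user, then group by location within each user. A base-R sketch of the same ‘per user, per location’ structure, using made-up ids:

```r
df <- data.frame(u_id    = c(1, 1, 1, 2),
                 grid_id = c(10, 10, 11, 20),
                 hour    = c(8, 22, 9, 14))

# first level: one data frame per user
by_user <- split(df[c("grid_id", "hour")], df$u_id)

# second level: within each user, one data frame per location
double_nested <- lapply(by_user, function(d) split(d["hour"], d$grid_id))

double_nested[["1"]][["10"]]$hour  # 8 22
```

This makes it easy to compute statistics per location for each user, which is exactly what the double-nested tibble enables in the tidyverse idiom.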

Enrich timestamp

The enrich_timestamp() function creates additional variables derived from the timestamp column: the year, month, day, day of the week, and hour of the day. These are often needed as intermediate variables in home location algorithms.

  • df: a nested dataframe
  • timestamp: name of column that holds the specific timestamp for each data point. This timestamp should be in POSIXct format.
#original variables 
df_nested[1, ] %>% unnest(cols = c(data)) %>% head(3)
#> # A tibble: 3 × 3
#>       u_id created_at          grid_id
#>      <int> <dttm>                <int>
#> 1 92298565 2016-04-17 22:43:06    1581
#> 2 92298565 2014-02-07 07:26:15    1136
#> 3 92298565 2012-08-18 19:26:31    1038

# create new variables from "created_at" timestamp column 
df_enriched <- enrich_timestamp(df_nested, timestamp = "created_at")
#> 🛠 Enriching variables from timestamp...
#> 
#> ✅ Finish enriching! New added variables: year, month, day, wday, hour, ymd.
#> ⌛ Enriching time: 0.262 secs
#> 

# show result 
df_enriched[1, ] %>% unnest(cols = c(data)) %>% head(3)
#> # A tibble: 3 × 9
#>       u_id created_at          grid_id  year month   day  wday  hour ymd       
#>      <int> <dttm>                <int> <dbl> <dbl> <int> <dbl> <int> <chr>     
#> 1 92298565 2016-04-17 22:43:06    1581  2016     4    17     1    22 2016-04-17
#> 2 92298565 2014-02-07 07:26:15    1136  2014     2     7     6     7 2014-02-07
#> 3 92298565 2012-08-18 19:26:31    1038  2012     8    18     7    19 2012-08-18
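The enriched variables correspond to standard components of a POSIXct timestamp. A base-R sketch of the same derivation for the first row above (the time zone is an assumption; note that wday follows the 1 = Sunday convention seen in the output):

```r
ts <- as.POSIXct("2016-04-17 22:43:06", tz = "Asia/Singapore")
lt <- as.POSIXlt(ts)

year  <- lt$year + 1900   # POSIXlt counts years from 1900
month <- lt$mon + 1       # POSIXlt months are 0-based
day   <- lt$mday
wday  <- lt$wday + 1      # POSIXlt: 0 = Sunday, so add 1 to get 1 = Sunday
hour  <- lt$hour
ymd   <- format(ts, "%Y-%m-%d")

c(year, month, day, wday, hour)  # 2016 4 17 1 22
```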

Summarize in nested and double nested dataframe

The summarise_nested() function works similarly to dplyr’s regular summarise function but operates within a nested tibble; summarise_double_nested(), in turn, operates within a double-nested tibble.

  • df: a nested dataframe
  • nest_cols: a selection of columns to nest in existing list-column
  • ...: name-value pairs of summary functions
# summarize in nested dataframe 
# e.g., summarise total number of tweets and total number of places per user
df_summarize_nested <- summarise_nested(df_enriched, 
                                        n_tweets = n(),
                                        n_locs = n_distinct(grid_id))
#> 🛠 Start summarising values...
#> 
#> ✅ Finish summarising! There are 2 new added variables: n_tweets, n_locs
#> ⌛ Summarising time: 0.101 secs
#> 

# show result 
df_summarize_nested %>% head(3)
#> # A tibble: 3 × 4
#>       u_id data                 n_tweets n_locs
#>      <int> <list>                  <int>  <int>
#> 1 92298565 <tibble [1,291 × 8]>     1291    200
#> 2 33908340 <tibble [1,170 × 8]>     1170     83
#> 3 11616678 <tibble [938 × 8]>        938     61

# summarize in double nested dataframe 
# take first 100 users for example
# e.g., summarise total number of tweets and total number of distinct days 
df_summarize_double_nested <- summarise_double_nested(df_enriched %>% head(100), 
                    nest_cols = c("created_at", "ymd", "year", "month", "day", "wday", "hour"), 
                    n_tweets = n(), n_days = n_distinct(ymd))
#> 🛠 Start summarising values...
#> ✅ Finish summarising! There are 2 new added variables: n_tweets, n_days
#> ⌛ Summarising time: 3.919 secs
#> 

# show result 
df_summarize_double_nested[1, ]
#> # A tibble: 1 × 2
#>       u_id data              
#>      <int> <list>            
#> 1 92298565 <tibble [200 × 4]>
df_summarize_double_nested[1, ]$data[[1]] %>% head(3)
#> # A tibble: 3 × 4
#>   grid_id data             n_tweets n_days
#>     <int> <list>              <int>  <int>
#> 1    1581 <tibble [3 × 7]>        3      3
#> 2    1136 <tibble [7 × 7]>        7      6
#> 3    1038 <tibble [8 × 7]>        8      3

Remove active users

The remove_top_users() function allows you to remove the top N percent of active users, based on the total number of data points per user. Although the majority of users are real people, some accounts are run by algorithms or ‘bots’, whereas others can be considered spam accounts. Removing a certain top N percent of active users is an oft-used approach to reduce the number of such accounts in the final dataset.

  • df: a dataframe with columns of user id, and data point counts
  • user: name of column that holds unique identifier for each user
  • counts: name of column that holds the data points frequency for each user
  • topNpct_user: a number that represents the percentage of top active users to remove
# remove top 1% of active users (e.g., based on the frequency of tweets sent by users)
df_removed_active_users <- remove_top_users(df_summarize_nested, user = "u_id", 
                                            counts = "n_tweets", topNpct_user = 1) 
#> 👤 There are 100 users at this moment.
#> 👤 Skip removing active users

# show result 
df_removed_active_users %>% head(3)
#> # A tibble: 3 × 4
#>       u_id data                 n_tweets n_locs
#>      <int> <list>                  <int>  <int>
#> 1 40153763 <tibble [1,688 × 8]>     1688    233
#> 2 92298565 <tibble [1,291 × 8]>     1291    200
#> 3 31088399 <tibble [1,210 × 8]>     1210     79
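Removing the top N percent boils down to a cutoff on the per-user counts. A small sketch with hypothetical counts, assuming a simple quantile rule (the package’s exact rule may differ):

```r
# hypothetical tweet counts per user
counts <- c(u1 = 500, u2 = 120, u3 = 90, u4 = 80)

topNpct_user <- 25  # remove the top 25% most active users
cutoff <- quantile(counts, 1 - topNpct_user / 100)

kept_users <- names(counts)[counts <= cutoff]
kept_users  # "u2" "u3" "u4"
```

In the example above the threshold never triggers (with 100 users, the top 1% cutoff leaves all of them in), which is why the output reports “Skip removing active users”.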

Filter

The filter_verbose() function works in the same way as the filter function in the tidyverse, but prints additional information about the number of users remaining in the dataset after filtering. The filter_nested() function works in a similar way to filter_verbose() but is applied within a nested tibble.

  • df: a dataframe with columns for user id and the variables to which you are going to apply the filter. If a column is not in the dataset, you need to add it before applying the filter.
  • user: name of column that holds unique identifier for each user
  • ...: logical predicates defined in terms of the variables in df. Only rows matching all conditions are kept.
# keep users with more than 10 tweets sent at more than 10 places 
df_filtered <- filter_verbose(df_removed_active_users, user = "u_id", 
                              n_tweets > 10 & n_locs > 10)
#> 👤 There are 100 users at this moment.
#> 🛠 Start filtering users...
#> ✅ Finish filtering! Filterred 36 users!
#> 👤 There are 64 users left.
#> ⌛ Filtering time: 0.002 secs
#> 

# show result 
df_filtered %>% head(3)
#> # A tibble: 3 × 4
#>       u_id data                 n_tweets n_locs
#>      <int> <list>                  <int>  <int>
#> 1 40153763 <tibble [1,688 × 8]>     1688    233
#> 2 92298565 <tibble [1,291 × 8]>     1291    200
#> 3 31088399 <tibble [1,210 × 8]>     1210     79

# filter out tweets sent on weekends or during daytime (8am to 6pm)
df_filter_nested <- filter_nested(df_filtered, user = "u_id", 
                                  !wday %in% c(1, 7), # 1 means Sunday and 7 means Saturday
                                  !hour %in% seq(8, 18, 1)) 
#> 👤 There are 64 users at this moment.
#> 🛠 Start filtering user...
#> 
#> ✅ Finish Filtering! Filterred 0 users!
#> 👤 There are 64 users left!
#> ⌛ Filtering time: 0.072 secs
#> 

# show result 
df_filter_nested %>% head(3)
#> # A tibble: 3 × 4
#>       u_id data               n_tweets n_locs
#>      <int> <list>                <int>  <int>
#> 1 40153763 <tibble [495 × 8]>     1688    233
#> 2 92298565 <tibble [424 × 8]>     1291    200
#> 3 31088399 <tibble [463 × 8]>     1210     79

df_filter_nested$data[[1]] %>% head(3)
#> # A tibble: 3 × 8
#>   created_at          grid_id  year month   day  wday  hour ymd       
#>   <dttm>                <int> <dbl> <dbl> <int> <dbl> <int> <chr>     
#> 1 2016-06-23 00:36:25     759  2016     6    23     5     0 2016-06-23
#> 2 2015-02-17 19:14:10     778  2015     2    17     3    19 2015-02-17
#> 3 2015-03-31 23:21:51     759  2015     3    31     3    23 2015-03-31

Mutate

The mutate_verbose() function works in the same way as the mutate function in the tidyverse, but prints additional information about the elapsed running time.

  • df: a dataframe
  • ...: name-value pairs of expressions

The mutate_nested() function works in a similar way to mutate_verbose(), but adds new variables inside a nested tibble.

  • df: a nested dataframe
  • ...: name-value pairs of expressions
## let's use the previously discussed functions to filter some users first 
colnames_data <- df_filtered$data[[1]] %>% names()
colnames_to_nest <- colnames_data[-which(colnames_data == "grid_id")] 

df_cleaned <- df_filtered %>% 
  summarise_double_nested(., nest_cols = colnames_to_nest,
                          n_tweets_loc = n(), # number of tweets sent at each location
                          n_hrs_loc = n_distinct(hour), # number of unique hours of sent tweets 
                          n_days_loc = n_distinct(ymd), # number of unique days of sent tweets 
                          period_loc = as.numeric(max(created_at) - min(created_at), "days")) %>% # period of tweeting 
  unnest_verbose(data) %>% 
  filter_verbose(., user = "u_id",
                 n_tweets_loc > 10 & n_hrs_loc > 10 & n_days_loc > 10 & period_loc > 10) 
#> 🛠 Start summarising values...
#> ✅ Finish summarising! There are 4 new added variables: n_tweets_loc, n_hrs_loc, n_days_loc, period_loc
#> ⌛ Summarising time: 4.545 secs
#> 
#> 🛠 Start unnesting...
#> ✅ Finish unnesting!
#> ⌛ Unnesting time: 0.009 secs
#> 
#> 👤 There are 64 users at this moment.
#> 🛠 Start filtering users...
#> ✅ Finish filtering! Filterred 35 users!
#> 👤 There are 29 users left.
#> ⌛ Filtering time: 0.002 secs
#> 
  
# show cleaned dataset 
df_cleaned %>% head(3)
#> # A tibble: 3 × 9
#>       u_id n_tweets n_locs grid_id data     n_tweets_loc n_hrs_loc n_days_loc
#>      <int>    <int>  <int>   <int> <list>          <int>     <int>      <int>
#> 1 40153763     1688    233     759 <tibble>          467        23        366
#> 2 40153763     1688    233     842 <tibble>           57        16         51
#> 3 40153763     1688    233    1212 <tibble>           71        13         59
#> # ℹ 1 more variable: period_loc <dbl>

# ok, then let's apply the mutate_nested function to add four new variables: wd_or_wk, time_numeric, rest_or_work, wk.am_or_wk.pm
df_expanded <- df_cleaned %>% 
  mutate_nested(wd_or_wk = if_else(wday %in% c(1,7), "weekend", "weekday"),
                time_numeric = lubridate::hour(created_at) + lubridate::minute(created_at) / 60 + lubridate::second(created_at) / 3600, 
                rest_or_work = if_else(time_numeric >= 9 & time_numeric <= 18, "work", "rest"), 
                wk.am_or_wk.pm = if_else(time_numeric >= 6 & time_numeric <= 12 & wd_or_wk == "weekend", "weekend_am", "weekend_pm")) 
#> 🛠 Start adding variable(s)...
#> ✅ Finish adding! There are 4 new added variables: wd_or_wk, time_numeric, rest_or_work, wk.am_or_wk.pm
#> ⌛ Adding time: 0.255 secs
#> 

# show result 
df_expanded %>% head(3)
#> # A tibble: 3 × 9
#>       u_id n_tweets n_locs grid_id data     n_tweets_loc n_hrs_loc n_days_loc
#>      <int>    <int>  <int>   <int> <list>          <int>     <int>      <int>
#> 1 40153763     1688    233     759 <tibble>          467        23        366
#> 2 40153763     1688    233     842 <tibble>           57        16         51
#> 3 40153763     1688    233    1212 <tibble>           71        13         59
#> # ℹ 1 more variable: period_loc <dbl>

The prop_factor_nested() function calculates the proportion of each category of a variable inside the list-column and expands each category into a new variable added to the dataframe. For example, inside the list-column, the variable wd_or_wk has two categories, named weekend and weekday; when you call the prop_factor_nested() function, it calculates the proportions of weekend and weekday separately, and the results are expanded into two new columns, called weekday and weekend, added to the dataframe.

  • df: a nested dataframe
  • var: name of column to calculate inside the list-column
# categories for a variable inside the list-column: e.g., weekend or weekday
df_expanded$data[[1]] %>% head(3)
#> # A tibble: 3 × 11
#>   created_at           year month   day  wday  hour ymd    wd_or_wk time_numeric
#>   <dttm>              <dbl> <dbl> <int> <dbl> <int> <chr>  <chr>           <dbl>
#> 1 2016-06-23 00:36:25  2016     6    23     5     0 2016-… weekday         0.607
#> 2 2014-09-03 15:46:45  2014     9     3     4    15 2014-… weekday        15.8  
#> 3 2015-03-31 23:21:51  2015     3    31     3    23 2015-… weekday        23.4  
#> # ℹ 2 more variables: rest_or_work <chr>, wk.am_or_wk.pm <chr>

# calculate proportion of categories for three variables: wd_or_wk, rest_or_work, wk.am_or_wk.pm
df_expanded <- df_expanded %>% 
  prop_factor_nested(wd_or_wk, rest_or_work, wk.am_or_wk.pm) 
#> 🛠 Start calculating proportion...
#> 
#> ✅ Finish calculating! There are 6 new calculated variables: rest, weekday, weekend, weekend_am, weekend_pm, work
#> ⌛ Calculating time: 1.839 secs
#> 

# show result 
df_expanded %>% head(3)
#> # A tibble: 3 × 15
#>       u_id n_tweets n_locs grid_id data     n_tweets_loc n_hrs_loc n_days_loc
#>      <int>    <int>  <int>   <int> <list>          <int>     <int>      <int>
#> 1 40153763     1688    233     759 <tibble>          467        23        366
#> 2 40153763     1688    233     842 <tibble>           57        16         51
#> 3 40153763     1688    233    1212 <tibble>           71        13         59
#> # ℹ 7 more variables: period_loc <dbl>, rest <dbl>, weekday <dbl>,
#> #   weekend <dbl>, weekend_am <dbl>, weekend_pm <dbl>, work <dbl>
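The proportions themselves are plain relative frequencies over the rows of each nested tibble. A base-R sketch with a toy wd_or_wk vector:

```r
wd_or_wk <- c("weekday", "weekday", "weekend", "weekday")

# relative frequency of each category
props <- prop.table(table(wd_or_wk))
props[["weekday"]]  # 0.75
props[["weekend"]]  # 0.25
```

Each category’s proportion becomes one new numeric column per location, which is what makes these variables usable as scoring inputs later.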

Score

The score_nested() function allows you to give a weighted value for one or more variables in a nested dataframe.

  • df: a nested dataframe by user
  • user: name of column that holds unique identifier for each user
  • location: name of column that holds unique identifier for each location
  • keep_original_vars: option to keep or drop the original variables
  • ...: name-value pairs of expressions

The score_summary() function summarises all scored columns and returns a single summary score per row.

  • df: a dataframe
  • user: name of column that holds unique identifier for each user
  • location: name of column that holds unique identifier for each location
  • ...: name of scored columns
## let's add two more variables before we do the scoring 
df_expanded <- df_expanded %>% 
  summarise_nested(n_wdays_loc = n_distinct(wday),
                   n_months_loc = n_distinct(month))
#> 🛠 Start summarising values...
#> 
#> ✅ Finish summarising! There are 2 new added variables: n_wdays_loc, n_months_loc
#> ⌛ Summarising time: 0.113 secs
#> 

df_expanded %>% head(3)
#> # A tibble: 3 × 17
#>       u_id n_tweets n_locs grid_id data     n_tweets_loc n_hrs_loc n_days_loc
#>      <int>    <int>  <int>   <int> <list>          <int>     <int>      <int>
#> 1 40153763     1688    233     759 <tibble>          467        23        366
#> 2 40153763     1688    233     842 <tibble>           57        16         51
#> 3 40153763     1688    233    1212 <tibble>           71        13         59
#> # ℹ 9 more variables: period_loc <dbl>, rest <dbl>, weekday <dbl>,
#> #   weekend <dbl>, weekend_am <dbl>, weekend_pm <dbl>, work <dbl>,
#> #   n_wdays_loc <int>, n_months_loc <int>

# when calculating scores, you can give different weights to different variables, but the total weight should add up to 1
df_scored <- df_expanded %>% 
  score_nested(., user = "u_id", location = "grid_id", keep_original_vars = F,
               s_n_tweets_loc = 0.1 * (n_tweets_loc/max(n_tweets_loc)),
               s_n_hrs_loc = 0.1 * (n_hrs_loc/24), 
               s_n_days_loc = 0.1 * (n_days_loc/max(n_days_loc)),
               s_period_loc = 0.1 * (period_loc/max(period_loc)),
               s_n_wdays_loc = 0.1 * (n_wdays_loc/7),
               s_n_months_loc = 0.1 * (n_months_loc/12),
               s_weekend = 0.1 * (weekend),
               s_rest = 0.2 * (rest),
               s_wk_am = 0.1 * (weekend_am))
#> 🛠 Start scoring ...
#> ✅ Finish scoring! There are 9 new added variables: s_n_tweets_loc, s_n_hrs_loc, s_n_days_loc, s_period_loc, s_n_wdays_loc, s_n_months_loc, s_weekend, s_rest, s_wk_am
#> ⌛ Scoring time: 0.199 secs
#> 

df_scored %>% head(3)
#> # A tibble: 3 × 2
#>       u_id data              
#>      <int> <list>            
#> 1 40153763 <tibble [12 × 10]>
#> 2 92298565 <tibble [11 × 10]>
#> 3 31088399 <tibble [4 × 10]>
df_scored$data[[1]]
#> # A tibble: 12 × 10
#>    grid_id s_n_tweets_loc s_n_hrs_loc s_n_days_loc s_period_loc s_n_wdays_loc
#>      <int>          <dbl>       <dbl>        <dbl>        <dbl>         <dbl>
#>  1     759        0.1          0.0958      0.1           0.0932        0.1   
#>  2     842        0.0122       0.0667      0.0139        0.1           0.1   
#>  3    1212        0.0152       0.0542      0.0161        0.0225        0.1   
#>  4     593        0.0133       0.0583      0.0142        0.0863        0.1   
#>  5    1125        0.00600      0.0542      0.00656       0.0778        0.1   
#>  6     797        0.00835      0.0667      0.0104        0.0851        0.1   
#>  7     843        0.0274       0.0583      0.0301        0.0692        0.1   
#>  8     862        0.0278       0.0583      0.0295        0.0722        0.1   
#>  9    1709        0.00343      0.0542      0.00437       0.0870        0.1   
#> 10    1174        0.00835      0.0542      0.0101        0.0752        0.1   
#> 11     614        0.00321      0.0542      0.00410       0.0986        0.0714
#> 12    1038        0.00428      0.0458      0.00519       0.0784        0.1   
#> # ℹ 4 more variables: s_n_months_loc <dbl>, s_weekend <dbl>, s_rest <dbl>,
#> #   s_wk_am <dbl>
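Two properties are worth checking in a scoring recipe like this: the weights sum to 1, and each variable is scaled into [0, 1] before weighting. A sketch using the same weights, and the n_hrs_loc values from the first three rows above:

```r
weights <- c(s_n_tweets_loc = 0.1, s_n_hrs_loc = 0.1, s_n_days_loc = 0.1,
             s_period_loc = 0.1, s_n_wdays_loc = 0.1, s_n_months_loc = 0.1,
             s_weekend = 0.1, s_rest = 0.2, s_wk_am = 0.1)
sum(weights)  # 1

# e.g., n_hrs_loc is scaled by its maximum possible value (24 hours)
n_hrs_loc <- c(23, 16, 13)
s_n_hrs_loc <- 0.1 * (n_hrs_loc / 24)
round(s_n_hrs_loc, 4)  # 0.0958 0.0667 0.0542
```

These reproduce the s_n_hrs_loc values shown in the first three rows of df_scored$data[[1]] above.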

### alternatively, we can replace the score function with the nest and mutate functions 
df_scored_2 <- df_expanded %>% 
  nest_verbose(-u_id) %>% 
  mutate_nested(s_n_tweets_loc = 0.1 * (n_tweets_loc/max(n_tweets_loc)),
                s_n_hrs_loc = 0.1 * (n_hrs_loc/24), 
                s_n_days_loc = 0.1 * (n_days_loc/max(n_days_loc)),
                s_period_loc = 0.1 * (period_loc/max(period_loc)),
                s_n_wdays_loc = 0.1 * (n_wdays_loc/7),
                s_n_months_loc = 0.1 * (n_months_loc/12),
                s_weekend = 0.1 * (weekend),
                s_rest = 0.2 * (rest),
                s_wk_am = 0.1 * (weekend_am))
#> 🛠 Start nesting...
#> ✅ Finish nesting!
#> ⌛ Nesting time: 0.009 secs
#> 
#> 🛠 Start adding variable(s)...
#> 
#> ✅ Finish adding! There are 9 new added variables: s_n_tweets_loc, s_n_hrs_loc, s_n_days_loc, s_period_loc, s_n_wdays_loc, s_n_months_loc, s_weekend, s_rest, s_wk_am
#> ⌛ Adding time: 0.069 secs
#> 

df_scored_2 %>% head(3)
#> # A tibble: 3 × 2
#>       u_id data              
#>      <int> <list>            
#> 1 40153763 <tibble [12 × 25]>
#> 2 92298565 <tibble [11 × 25]>
#> 3 31088399 <tibble [4 × 25]>
df_scored_2$data[[1]]
#> # A tibble: 12 × 25
#>    n_tweets n_locs grid_id data     n_tweets_loc n_hrs_loc n_days_loc period_loc
#>       <int>  <int>   <int> <list>          <int>     <int>      <int>      <dbl>
#>  1     1688    233     759 <tibble>          467        23        366      1211.
#>  2     1688    233     842 <tibble>           57        16         51      1300.
#>  3     1688    233    1212 <tibble>           71        13         59       292.
#>  4     1688    233     593 <tibble>           62        14         52      1121.
#>  5     1688    233    1125 <tibble>           28        13         24      1011.
#>  6     1688    233     797 <tibble>           39        16         38      1106.
#>  7     1688    233     843 <tibble>          128        14        110       900.
#>  8     1688    233     862 <tibble>          130        14        108       938.
#>  9     1688    233    1709 <tibble>           16        13         16      1130.
#> 10     1688    233    1174 <tibble>           39        13         37       977.
#> 11     1688    233     614 <tibble>           15        13         15      1281.
#> 12     1688    233    1038 <tibble>           20        11         19      1018.
#> # ℹ 17 more variables: rest <dbl>, weekday <dbl>, weekend <dbl>,
#> #   weekend_am <dbl>, weekend_pm <dbl>, work <dbl>, n_wdays_loc <int>,
#> #   n_months_loc <int>, s_n_tweets_loc <dbl>, s_n_hrs_loc <dbl>,
#> #   s_n_days_loc <dbl>, s_period_loc <dbl>, s_n_wdays_loc <dbl>,
#> #   s_n_months_loc <dbl>, s_weekend <dbl>, s_rest <dbl>, s_wk_am <dbl>
# sum variable scores for each location 
df_score_summed <- df_scored %>% 
  score_summary(., user = "u_id", location = "grid_id", starts_with("s_"))
#> 🛠 Start summing scores...
#> 
#> ✅ Finish summing! There are 1 new added variables: score
#> ⌛ Summing time: 0.088 secs
#> 

df_score_summed %>% head(3)
#> # A tibble: 3 × 2
#>       u_id data              
#>      <int> <list>            
#> 1 40153763 <tibble [12 × 11]>
#> 2 92298565 <tibble [11 × 11]>
#> 3 31088399 <tibble [4 × 11]>
df_score_summed$data[[1]]
#> # A tibble: 12 × 11
#>    grid_id s_n_tweets_loc s_n_hrs_loc s_n_days_loc s_period_loc s_n_wdays_loc
#>      <int>          <dbl>       <dbl>        <dbl>        <dbl>         <dbl>
#>  1     759        0.1          0.0958      0.1           0.0932        0.1   
#>  2     842        0.0122       0.0667      0.0139        0.1           0.1   
#>  3    1212        0.0152       0.0542      0.0161        0.0225        0.1   
#>  4     593        0.0133       0.0583      0.0142        0.0863        0.1   
#>  5    1125        0.00600      0.0542      0.00656       0.0778        0.1   
#>  6     797        0.00835      0.0667      0.0104        0.0851        0.1   
#>  7     843        0.0274       0.0583      0.0301        0.0692        0.1   
#>  8     862        0.0278       0.0583      0.0295        0.0722        0.1   
#>  9    1709        0.00343      0.0542      0.00437       0.0870        0.1   
#> 10    1174        0.00835      0.0542      0.0101        0.0752        0.1   
#> 11     614        0.00321      0.0542      0.00410       0.0986        0.0714
#> 12    1038        0.00428      0.0458      0.00519       0.0784        0.1   
#> # ℹ 5 more variables: s_n_months_loc <dbl>, s_weekend <dbl>, s_rest <dbl>,
#> #   s_wk_am <dbl>, score <dbl>

Extract locations

The extract_location() function allows you to sort the locations of each user in descending order of a selected column, and returns the top N location(s) for each user. The extracted location(s) then represent the most likely home location of the user.

  • df: a nested dataframe by user
  • user: name of column that holds unique identifier for each user
  • location: name of column that holds unique identifier for each location
  • show_n_loc: a single number that determines the number of locations to be extracted
  • keep_score: option to keep or remove the columns with scored values
  • ...: name of column(s) used to sort the locations
# extract homes for users based on score value (return the 1 most likely home per user)
df_home <- df_score_summed %>% 
  extract_location(., user = "u_id", location = "grid_id", show_n_loc = 1, keep_score = F, desc(score))
#> 🛠 Start extracting homes for users...
#> Warning: There was 1 warning in `dplyr::mutate()`.
#>  In argument: `home = purrr::map_chr(df[[colname_nested_data]],
#>   ~get_loc_with_progress(.))`.
#> Caused by warning:
#> ! Automatic coercion from integer to character was deprecated in purrr 1.0.0.
#>  Please use an explicit call to `as.character()` within `map_chr()` instead.
#>  The deprecated feature was likely used in the homelocator package.
#>   Please report the issue to the authors.
#> 
#> 🎉 Congratulations!! Your have found 29 users' potential home(s).
#> ⌛ Extracting time: 0.194 secs
#> 

df_home %>% head(3)
#> # A tibble: 3 × 2
#>       u_id home 
#>      <int> <chr>
#> 1 40153763 759  
#> 2 92298565 1582 
#> 3 31088399 1113

# extract homes for users and keep scores of locations
df_home <- df_score_summed %>% 
  extract_location(., user = "u_id", location = "grid_id", show_n_loc = 1, keep_score = T, desc(score))
#> 🛠 Start extracting homes for users...
#> 🎉 Congratulations!! Your have found 29 users' potential home(s).
#> ⌛ Extracting time: 0.105 secs
#> 

df_home %>% head(3)
#> # A tibble: 3 × 3
#>       u_id data               home 
#>      <int> <list>             <chr>
#> 1 40153763 <tibble [12 × 11]> 759  
#> 2 92298565 <tibble [11 × 11]> 1582 
#> 3 31088399 <tibble [4 × 11]>  1113
df_home$data[[3]]
#> # A tibble: 4 × 11
#>   grid_id s_n_tweets_loc s_n_hrs_loc s_n_days_loc s_period_loc s_n_wdays_loc
#>     <int>          <dbl>       <dbl>        <dbl>        <dbl>         <dbl>
#> 1     880        0.0774       0.1         0.0577        0.0796        0.1   
#> 2    1113        0.1          0.0917      0.1           0.1           0.1   
#> 3     862        0.0519       0.1         0.0477        0.0716        0.1   
#> 4    1594        0.00501      0.05        0.00846       0.0621        0.0714
#> # ℹ 5 more variables: s_n_months_loc <dbl>, s_weekend <dbl>, s_rest <dbl>,
#> #   s_wk_am <dbl>, score <dbl>
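The extraction itself amounts to a sort-and-slice per user: order the locations by the chosen column and keep the top show_n_loc rows. A base-R sketch using the four grid ids of the third user above, with hypothetical summed scores:

```r
scores <- data.frame(grid_id = c(880, 1113, 862, 1594),
                     score   = c(0.71, 0.93, 0.65, 0.33))  # hypothetical totals

show_n_loc <- 1
home <- scores$grid_id[order(-scores$score)][seq_len(show_n_loc)]
home  # 1113
```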

Spread

The spread_nested() function works in the same way as the spread function in the tidyverse, but operates inside a nested data frame. It spreads a key-value pair across multiple columns.

  • df: a nested dataframe
  • var_key: column name or position
  • var_value: column name or position
# first add a timeframe variable, then count the data points in each timeframe
df_timeframe <- df_enriched %>% 
  mutate_nested(timeframe = if_else(hour >= 2 & hour < 8, "Rest", if_else(hour >= 8 & hour < 19, "Active", "Leisure")))
#> 🛠 Start adding variable(s)...
#> 
#> ✅ Finish adding! There are 1 new added variables: timeframe
#> ⌛ Adding time: 0.124 secs
#> 

colnames_nested_data <- df_timeframe$data[[1]] %>% names()
colnames_to_nest <- colnames_nested_data[-which(colnames_nested_data %in% c("grid_id", "timeframe"))]

df_timeframe <- df_timeframe %>%
  head(20) %>% # take first 20 users as example
  summarise_double_nested(., nest_cols = colnames_to_nest, 
                            n_points_timeframe = n()) 
#> 🛠 Start summarising values...
#> ✅ Finish summarising! There are 1 new added variables: n_points_timeframe
#> ⌛ Summarising time: 1.961 secs
#> 
df_timeframe$data[[2]]
#> # A tibble: 102 × 4
#>    grid_id timeframe data               n_points_timeframe
#>      <int> <chr>     <list>                          <int>
#>  1    1461 Active    <tibble [280 × 7]>                280
#>  2    1461 Leisure   <tibble [173 × 7]>                173
#>  3    1461 Rest      <tibble [418 × 7]>                418
#>  4    1239 Active    <tibble [1 × 7]>                    1
#>  5    1460 Leisure   <tibble [53 × 7]>                  53
#>  6    1449 Leisure   <tibble [21 × 7]>                  21
#>  7    1629 Leisure   <tibble [1 × 7]>                    1
#>  8    1260 Active    <tibble [2 × 7]>                    2
#>  9    1399 Rest      <tibble [2 × 7]>                    2
#> 10    1345 Active    <tibble [1 × 7]>                    1
#> # ℹ 92 more rows

# spread the nested dataframe with timeframe as the key and n_points_timeframe as the value
df_timeframe_spreaded <- df_timeframe %>% 
    spread_nested(., key_var = "timeframe", value_var = "n_points_timeframe") 
#> 🛠 Start spreading timeframe variable...
#> ✅ Finish spreading! There are 3 new added variables: Active, Leisure, Rest
#> ⌛ Spreading time: 0.278 secs
#> 

df_timeframe_spreaded$data[[2]]
#> # A tibble: 102 × 5
#>    grid_id data             Active Leisure  Rest
#>      <int> <list>            <int>   <int> <int>
#>  1     354 <tibble [1 × 7]>      0       1     0
#>  2     355 <tibble [1 × 7]>      1       0     0
#>  3     374 <tibble [1 × 7]>      0       1     0
#>  4     597 <tibble [1 × 7]>      0       1     0
#>  5     611 <tibble [1 × 7]>      1       0     0
#>  6     612 <tibble [1 × 7]>      1       0     0
#>  7     613 <tibble [1 × 7]>      1       0     0
#>  8     614 <tibble [1 × 7]>      1       0     0
#>  9     645 <tibble [3 × 7]>      3       0     0
#> 10     645 <tibble [1 × 7]>      0       0     1
#> # ℹ 92 more rows
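Conceptually, the spreading step above amounts to mapping a spread over each user's nested tibble. The following is an illustrative sketch of that idea in plain tidyverse, not the package's internals; the toy column names mirror the example above:

```r
library(dplyr)
library(tidyr)
library(purrr)

# toy data: counts of data points per user, location and timeframe
df <- tibble(
  u_id      = c(1, 1, 2),
  grid_id   = c(10, 10, 20),
  timeframe = c("Rest", "Active", "Rest"),
  n_points  = c(5, 3, 7)
) %>%
  nest(data = -u_id)

# spread the key-value pair inside each user's nested tibble,
# filling absent timeframes with 0
df_spread <- df %>%
  mutate(data = map(data, ~ spread(.x, key = timeframe,
                                   value = n_points, fill = 0)))
```

After this, each nested tibble has one column per timeframe (e.g. `Active`, `Rest`) holding the counts, which is the same shape `spread_nested()` produces.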

Arrange

The arrange_nested() function works in the same way as the arrange() function in tidyverse but operates inside a nested data frame. It sorts rows by the given variables.

  • df: a nested dataframe
  • ...: comma separated list of unquoted variable names
df_enriched$data[[3]]  
#> # A tibble: 938 × 8
#>    created_at          grid_id  year month   day  wday  hour ymd       
#>    <dttm>                <int> <dbl> <dbl> <int> <dbl> <int> <chr>     
#>  1 2014-07-18 10:08:21    1375  2014     7    18     6    10 2014-07-18
#>  2 2013-11-24 23:16:24    1375  2013    11    24     1    23 2013-11-24
#>  3 2013-06-12 00:59:42    1375  2013     6    12     4     0 2013-06-12
#>  4 2013-06-07 19:56:56    1375  2013     6     7     6    19 2013-06-07
#>  5 2013-06-12 01:08:29    1375  2013     6    12     4     1 2013-06-12
#>  6 2013-06-02 20:53:09    1375  2013     6     2     1    20 2013-06-02
#>  7 2013-06-10 20:57:17    1375  2013     6    10     2    20 2013-06-10
#>  8 2013-06-04 17:42:06    1375  2013     6     4     3    17 2013-06-04
#>  9 2014-07-10 01:00:45    1375  2014     7    10     5     1 2014-07-10
#> 10 2013-06-13 01:48:27    1375  2013     6    13     5     1 2013-06-13
#> # ℹ 928 more rows

df_arranged <- df_enriched %>% 
  arrange_nested(desc(hour)) # arrange the hour in descending order
#> 🛠 Start sorting...
#> 
#> ✅ Finish sorting!
#> ⌛ Sorting time: 0.173 secs
#> 

df_arranged$data[[3]]
#> # A tibble: 938 × 8
#>    created_at          grid_id  year month   day  wday  hour ymd       
#>    <dttm>                <int> <dbl> <dbl> <int> <dbl> <int> <chr>     
#>  1 2013-11-24 23:16:24    1375  2013    11    24     1    23 2013-11-24
#>  2 2013-06-08 23:41:19    1375  2013     6     8     7    23 2013-06-08
#>  3 2013-06-02 23:21:04    1375  2013     6     2     1    23 2013-06-02
#>  4 2013-06-05 23:31:59    1375  2013     6     5     4    23 2013-06-05
#>  5 2013-06-02 23:29:43    1375  2013     6     2     1    23 2013-06-02
#>  6 2013-06-06 23:51:50    1375  2013     6     6     5    23 2013-06-06
#>  7 2014-04-27 23:40:59    1375  2014     4    27     1    23 2014-04-27
#>  8 2013-06-05 23:47:27    1375  2013     6     5     4    23 2013-06-05
#>  9 2013-06-02 23:02:04    1375  2013     6     2     1    23 2013-06-02
#> 10 2013-06-06 23:07:49    1375  2013     6     6     5    23 2013-06-06
#> # ℹ 928 more rows
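Sorting inside a nested column boils down to mapping arrange() over the list-column. A minimal sketch of that idea, illustrative only and not the package's implementation:

```r
library(dplyr)
library(tidyr)
library(purrr)

# toy nested data: one row per user, hours stored in a list-column
df <- tibble(
  u_id = c(1, 1, 1, 2, 2),
  hour = c(10, 23, 0, 7, 19)
) %>%
  nest(data = -u_id)

# sort each user's rows by hour, in descending order
df_sorted <- df %>%
  mutate(data = map(data, ~ arrange(.x, desc(hour))))
```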

The arrange_double_nested() function works in a similar way to arrange_nested(), but applies to a double-nested data frame: you choose the columns to nest inside the already nested dataframe and then sort the rows of those inner tibbles by the selected variables.

  • df: a nested dataframe
  • nest_cols: name of columns to nest in existing list-column
  • ...: comma separated list of unquoted variable names
# get the name of columns to nest 
colnames_nested_data <- df_enriched$data[[1]] %>% names()
colnames_to_nest <- colnames_nested_data[-which(colnames_nested_data %in% c("grid_id"))]

df_double_arranged <- df_enriched %>% 
  head(20) %>% # take the first 20 users for example
  arrange_double_nested(., nest_cols = colnames_to_nest, 
                        desc(created_at)) # sort by time in descending order
#> 🛠 Start sorting...
#> ✅ Finish sorting!
#> ⌛ Sorting time: 2.544 secs
#> 


# original third user
df_enriched[3, ]
#> # A tibble: 1 × 2
#>       u_id data              
#>      <int> <list>            
#> 1 11616678 <tibble [938 × 8]>
# third user data points 
df_enriched$data[[3]]
#> # A tibble: 938 × 8
#>    created_at          grid_id  year month   day  wday  hour ymd       
#>    <dttm>                <int> <dbl> <dbl> <int> <dbl> <int> <chr>     
#>  1 2014-07-18 10:08:21    1375  2014     7    18     6    10 2014-07-18
#>  2 2013-11-24 23:16:24    1375  2013    11    24     1    23 2013-11-24
#>  3 2013-06-12 00:59:42    1375  2013     6    12     4     0 2013-06-12
#>  4 2013-06-07 19:56:56    1375  2013     6     7     6    19 2013-06-07
#>  5 2013-06-12 01:08:29    1375  2013     6    12     4     1 2013-06-12
#>  6 2013-06-02 20:53:09    1375  2013     6     2     1    20 2013-06-02
#>  7 2013-06-10 20:57:17    1375  2013     6    10     2    20 2013-06-10
#>  8 2013-06-04 17:42:06    1375  2013     6     4     3    17 2013-06-04
#>  9 2014-07-10 01:00:45    1375  2014     7    10     5     1 2014-07-10
#> 10 2013-06-13 01:48:27    1375  2013     6    13     5     1 2013-06-13
#> # ℹ 928 more rows
# arranged third user
df_double_arranged[3, ]
#> # A tibble: 1 × 2
#>       u_id data             
#>      <int> <list>           
#> 1 11616678 <tibble [61 × 2]>
# arranged time 
df_double_arranged$data[[3]]$data[[2]]
#> # A tibble: 3 × 7
#>   created_at           year month   day  wday  hour ymd       
#>   <dttm>              <dbl> <dbl> <int> <dbl> <int> <chr>     
#> 1 2013-06-07 09:44:12  2013     6     7     6     9 2013-06-07
#> 2 2013-06-04 09:42:57  2013     6     4     3     9 2013-06-04
#> 3 2013-06-02 13:40:22  2013     6     2     1    13 2013-06-02

Top N

The top_n_nested() function works in the same way as the top_n() function in tidyverse but for a nested dataframe. It selects the top n rows of each nested tibble, ordered by wt.

  • df: a nested dataframe
  • n: number of rows to select
  • wt: name of the variable used for ordering

df_enriched$data[[2]]
#> # A tibble: 1,170 × 8
#>    created_at          grid_id  year month   day  wday  hour ymd       
#>    <dttm>                <int> <dbl> <dbl> <int> <dbl> <int> <chr>     
#>  1 2014-10-03 16:29:48    1461  2014    10     3     6    16 2014-10-03
#>  2 2014-01-29 23:29:27    1461  2014     1    29     4    23 2014-01-29
#>  3 2014-12-07 04:54:01    1461  2014    12     7     1     4 2014-12-07
#>  4 2014-09-24 00:27:25    1461  2014     9    24     4     0 2014-09-24
#>  5 2014-05-05 09:19:41    1239  2014     5     5     2     9 2014-05-05
#>  6 2014-12-17 16:36:28    1461  2014    12    17     4    16 2014-12-17
#>  7 2014-01-29 23:00:14    1461  2014     1    29     4    23 2014-01-29
#>  8 2014-12-15 07:47:08    1461  2014    12    15     2     7 2014-12-15
#>  9 2014-09-28 01:28:52    1461  2014     9    28     1     1 2014-09-28
#> 10 2015-01-20 16:24:24    1461  2015     1    20     3    16 2015-01-20
#> # ℹ 1,160 more rows

## get the top row(s) based on hour (ties are kept)
df_top_1 <- df_enriched %>% 
  top_n_nested(., n = 1, wt = "hour")
#> 🛠 Start selecting top 1 row(s)...
#> 
#> ✅ Finish selecting top 1 row(s)!
#> ⌛ Selecting time: 0.286 secs
#> 

df_top_1$data[[2]]
#> # A tibble: 46 × 8
#>    created_at          grid_id  year month   day  wday  hour ymd       
#>    <dttm>                <int> <dbl> <dbl> <int> <dbl> <int> <chr>     
#>  1 2014-01-29 23:29:27    1461  2014     1    29     4    23 2014-01-29
#>  2 2014-01-29 23:00:14    1461  2014     1    29     4    23 2014-01-29
#>  3 2014-05-07 23:15:11    1460  2014     5     7     4    23 2014-05-07
#>  4 2014-06-20 23:00:19    1461  2014     6    20     6    23 2014-06-20
#>  5 2014-05-07 23:05:08    1425  2014     5     7     4    23 2014-05-07
#>  6 2014-08-15 23:40:17    1460  2014     8    15     6    23 2014-08-15
#>  7 2014-04-28 23:27:11    1461  2014     4    28     2    23 2014-04-28
#>  8 2015-03-06 23:04:42    1471  2015     3     6     6    23 2015-03-06
#>  9 2014-12-21 23:01:58    1461  2014    12    21     1    23 2014-12-21
#> 10 2014-05-07 23:23:21    1460  2014     5     7     4    23 2014-05-07
#> # ℹ 36 more rows
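Note that, like top_n(), selecting the top row by a maximum keeps ties, which is why the call above returns 46 rows for this user: every row in the user's top hour (23) is kept. A plain-dplyr sketch of the same tie-keeping behaviour, for illustration only:

```r
library(dplyr)
library(tidyr)
library(purrr)

# toy nested data: user 1 has two rows tied at the top hour
df <- tibble(
  u_id = c(1, 1, 1, 2, 2),
  hour = c(23, 10, 23, 7, 19)
) %>%
  nest(data = -u_id)

# keep all rows tied for the top hour within each nested tibble
df_top <- df %>%
  mutate(data = map(data, ~ filter(.x, hour == max(hour))))
```

Here user 1 keeps both rows with hour 23, mirroring the tie-keeping seen in the package output above.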

Identify location(s) with embedded recipes

To use the embedded recipes to identify home locations for users, use the identify_location() function.

  • df: a dataframe with columns for the user id, location, timestamp
  • user: name of column that holds unique identifier for each user
  • timestamp: name of timestamp column. Should be POSIXct
  • location: name of column that holds unique identifier for each location
  • recipe: embedded algorithms to identify the most possible home locations for users
  • show_n_loc: number of potential homes to extract
  • keep_score: option to keep or remove calculated result/score per user per location

Current available recipes:

  • recipe_HMLC:
    • Weighs data points across multiple time frames to ‘score’ potentially meaningful locations for each user
  • recipe_FREQ:
    • Selects the most frequently ‘visited’ location, assuming a user is active mainly around their home location
  • recipe_OSNA: Efstathiades et al. 2015
    • Finds the most ‘popular’ location during ‘rest’, ‘active’ and ‘leisure’ time. Here we focus on ‘rest’ and ‘leisure’ time to find the most likely home location for each user
  • recipe_APDM: Ahas et al. 2010
    • Calculates the average and standard deviation of the start times of data points by a single user in a single location
# recipe: homelocator -- HMLC
identify_location(test_sample, user = "u_id", timestamp = "created_at", 
                  location = "grid_id", show_n_loc = 1, recipe = "HMLC")

# recipe: Frequency -- FREQ
identify_location(test_sample, user = "u_id", timestamp = "created_at", 
                  location = "grid_id", show_n_loc = 1, recipe = "FREQ")

# recipe: Online Social Network Activity -- OSNA
identify_location(test_sample, user = "u_id", timestamp = "created_at", 
                  location = "grid_id", show_n_loc = 1, recipe = "OSNA")

# recipe: APDM (Ahas et al. 2010)
## The APDM recipe strictly returns the single most likely home location.
## Important: load the neighbors table before using this recipe!
## example: st_queen <- function(a, b = a) st_relate(a, b, pattern = "F***T****")
##          neighbors <- st_queen(df_sf) # then convert the result to a dataframe
data("df_neighbors", package = "homelocator")
identify_location(test_sample, user = "u_id", timestamp = "created_at", 
                  location = "grid_id", recipe = "APDM", keep_score = FALSE)
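As a rough illustration of the statistic behind recipe_APDM, the mean and standard deviation of activity times per user and location can be computed with a plain summarise(). The column names below are invented for the sketch, and the real recipe additionally uses the neighbors table loaded above:

```r
library(dplyr)

# toy data: activity hours per user and location
df <- tibble(
  u_id    = c(1, 1, 1, 1),
  grid_id = c(10, 10, 10, 11),
  hour    = c(22, 23, 21, 9)
)

# average and spread of activity hours per user-location pair
apdm_stats <- df %>%
  group_by(u_id, grid_id) %>%
  summarise(mean_hour = mean(hour),
            sd_hour   = sd(hour),
            .groups   = "drop")
```

A location where a user is active at consistent (late) hours will show a late mean with a small standard deviation, which is the kind of signal the APDM approach exploits.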