homelocator.Rmd
Identifying meaningful locations, such as home and/or work locations, from mobile technology is an essential step in the field of mobility analysis. The homelocator library is designed for identifying the home locations of users based on the spatio-temporal features contained within mobility data, such as social media data, mobile phone data, mobile app data, and so on.
Although a myriad of different approaches to inferring meaningful locations has emerged over the last 10 years, a consensus or single best approach has not formed in the current state-of-the-art. The actual algorithms used are not always discussed in detail in publications, and the source code is seldom released publicly, which makes comparing algorithms, as well as reproducing work, difficult.
The main objective for developing this package is therefore to provide a consistent framework and interface for the adoption of different approaches, so that researchers are able to write structured, algorithmic 'recipes', or use the existing embedded 'recipes', to identify meaningful locations according to their research requirements. With this package, users are able to achieve an apples-to-apples comparison across approaches.
We hope that through packages like homelocator, future work that relies on the inference of meaningful locations becomes less 'custom' (with each researcher writing their own algorithm) and instead uses common, comparable algorithms, in order to increase transparency and reproducibility in the field.
# Load homelocator library
library(homelocator)
#> Welcome to homelocator package!
To explore the basic data manipulation functions of homelocator, we'll use a test_sample dataset.
# load data
data("test_sample", package = "homelocator")
# show test sample
test_sample %>% head(3)
#> # A tibble: 3 × 3
#> u_id created_at grid_id
#> <int> <dttm> <int>
#> 1 92298565 2016-04-17 22:43:06 1581
#> 2 33908340 2014-10-03 16:29:48 1461
#> 3 92298565 2014-02-07 07:26:15 1136
Note: if you use your own dataset, please make sure that you have converted the timestamps to the appropriate local time zone.
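If your timestamps were parsed in UTC, for example, you could convert them with lubridate before validating the dataset. The sketch below is illustrative: `my_data` is a hypothetical dataframe, and the Singapore time zone is an assumed study area.

```r
library(dplyr)
library(lubridate)

# hypothetical input: `my_data` with a `created_at` column parsed as UTC
my_data <- my_data %>%
  mutate(created_at = with_tz(created_at, tzone = "Asia/Singapore"))
```

Note that `with_tz()` changes the displayed time zone without altering the underlying instant; use `force_tz()` instead if the clock times are already local but mislabeled.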
The validate_dataset() function makes sure your input dataset contains all three necessary variables: user, location, and timestamp. There are 4 arguments in this function:

- `user`: name of the column that holds a unique identifier for each user.
- `timestamp`: name of the column that holds the specific timestamp for each data point. This timestamp should be in POSIXct format.
- `location`: name of the column that holds a unique identifier for each location.
- `keep_other_vars`: option to keep or remove any other variables of the input dataset. The default is FALSE.
df_validated <- validate_dataset(test_sample, user = "u_id", timestamp = "created_at", location = "grid_id", keep_other_vars = FALSE)
#> 🎉 Congratulations!! Your dataset has passed validation.
#> 👤 There are 100 unique users in your dataset.
#> 🌏 Now start your journey identifying their meaningful location(s)!
#> 👏 Good luck!
#>
# show result
df_validated %>% head(3)
#> # A tibble: 3 × 3
#> u_id grid_id created_at
#> <int> <int> <dttm>
#> 1 92298565 1581 2016-04-17 22:43:06
#> 2 33908340 1461 2014-10-03 16:29:48
#> 3 92298565 1136 2014-02-07 07:26:15
The nest_verbose() and unnest_verbose() functions work in the same way as the nest and unnest functions in the tidyverse but print some additional status information, such as the elapsed running time.

- `df`: a dataframe
- `...`: for nest_verbose(), the selected columns to nest; for unnest_verbose(), the list-column to unnest.
# nest data
df_nested <- nest_verbose(df_validated, c("created_at", "grid_id"))
#> 🛠 Start nesting...
#> ✅ Finish nesting!
#> ⌛ Nesting time: 0.019 secs
#>
# show result
df_nested %>% head(3)
#> # A tibble: 3 × 2
#> u_id data
#> <int> <list>
#> 1 92298565 <tibble [1,291 × 2]>
#> 2 33908340 <tibble [1,170 × 2]>
#> 3 11616678 <tibble [938 × 2]>
# show result
df_nested$data[[1]] %>% head(3)
#> # A tibble: 3 × 2
#> created_at grid_id
#> <dttm> <int>
#> 1 2016-04-17 22:43:06 1581
#> 2 2014-02-07 07:26:15 1136
#> 3 2012-08-18 19:26:31 1038
# unnest data
df_unnested <- unnest_verbose(df_nested, data)
#> 🛠 Start unnesting...
#> ✅ Finish unnesting!
#> ⌛ Unnesting time: 0.015 secs
#>
# show result
df_unnested %>% head(3)
#> # A tibble: 3 × 3
#> u_id created_at grid_id
#> <int> <dttm> <int>
#> 1 92298565 2016-04-17 22:43:06 1581
#> 2 92298565 2014-02-07 07:26:15 1136
#> 3 92298565 2012-08-18 19:26:31 1038
The nest_nested() and unnest_double_nested() functions work in a similar way to nest_verbose() and unnest_verbose() but apply to an already nested data frame. In other words, they map the nest and unnest functions over each element of a list-column, creating a double-nested data frame, or vice versa. This is an essential step in many home location algorithms as they often operate 'per user, per location'.

- `df`: a nested dataframe
- `...`: for nest_nested(), the selected columns to nest; for unnest_double_nested(), the list-column to unnest.
# double nest data (e.g., nesting column: created_at)
df_double_nested <- nest_nested(df_nested %>% head(100), c("created_at"))
#> Warning: `progress_estimated()` was deprecated in dplyr 1.0.0.
#> ℹ The deprecated feature was likely used in the homelocator package.
#> Please report the issue to the authors.
#> This warning is displayed once every 8 hours.
#> Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
#> generated.
#> 🛠 Start nesting...
#>
#> ✅ Finish nesting!
#> ⌛ Nesting time: 0.79 secs
#>
# show result
df_double_nested %>% head(3)
#> # A tibble: 3 × 2
#> u_id data
#> <int> <list>
#> 1 92298565 <tibble [200 × 2]>
#> 2 33908340 <tibble [83 × 2]>
#> 3 11616678 <tibble [61 × 2]>
# show result
df_double_nested$data[[1]] %>% head(3)
#> # A tibble: 3 × 2
#> grid_id data
#> <int> <list>
#> 1 1581 <tibble [3 × 1]>
#> 2 1136 <tibble [7 × 1]>
#> 3 1038 <tibble [8 × 1]>
df_double_nested$data[[1]]$data[[1]] %>% head(3)
#> # A tibble: 3 × 1
#> created_at
#> <dttm>
#> 1 2016-04-17 22:43:06
#> 2 2014-12-28 03:26:34
#> 3 2014-12-03 09:26:36
# unnest nested data
df_double_unnested <- df_double_nested %>%
head(100) %>% ## take first 100 rows for example
unnest_double_nested(., data)
#> 🛠 Start unnesting...
#> ✅ Finish unnesting!
#> ⌛ Unnesting time: 0.702 secs
#>
# show result
df_double_unnested %>% head(3)
#> # A tibble: 3 × 3
#> u_id grid_id created_at
#> <int> <int> <dttm>
#> 1 92298565 1581 2016-04-17 22:43:06
#> 2 92298565 1581 2014-12-28 03:26:34
#> 3 92298565 1581 2014-12-03 09:26:36
The enrich_timestamp() function creates additional variables that are derived from the timestamp column: the year, month, day, day of the week, and hour of the day. These are often used/needed as intermediate variables in home location algorithms.

- `df`: a nested dataframe
- `timestamp`: name of the column that holds the specific timestamp for each data point. This timestamp should be in POSIXct format.
# original variables
df_nested[1, ] %>% unnest(cols = c(data)) %>% head(3)
#> # A tibble: 3 × 3
#> u_id created_at grid_id
#> <int> <dttm> <int>
#> 1 92298565 2016-04-17 22:43:06 1581
#> 2 92298565 2014-02-07 07:26:15 1136
#> 3 92298565 2012-08-18 19:26:31 1038
# create new variables from "created_at" timestamp column
df_enriched <- enrich_timestamp(df_nested, timestamp = "created_at")
#> 🛠 Enriching variables from timestamp...
#>
#> ✅ Finish enriching! New added variables: year, month, day, wday, hour, ymd.
#> ⌛ Enriching time: 0.262 secs
#>
# show result
df_enriched[1, ] %>% unnest(cols = c(data)) %>% head(3)
#> # A tibble: 3 × 9
#> u_id created_at grid_id year month day wday hour ymd
#> <int> <dttm> <int> <dbl> <dbl> <int> <dbl> <int> <chr>
#> 1 92298565 2016-04-17 22:43:06 1581 2016 4 17 1 22 2016-04-17
#> 2 92298565 2014-02-07 07:26:15 1136 2014 2 7 6 7 2014-02-07
#> 3 92298565 2012-08-18 19:26:31 1038 2012 8 18 7 19 2012-08-18
The summarise_nested() function works similarly to dplyr's regular summarise function but operates within a nested tibble. summarise_double_nested(), in turn, operates within a double-nested tibble.

- `df`: a nested dataframe
- `nest_cols`: a selection of columns to nest in the existing list-column
- `...`: name-value pairs of summary functions
# summarize in nested dataframe
# e.g., summarise total number of tweets and total number of places per user
df_summarize_nested <- summarise_nested(df_enriched,
n_tweets = n(),
n_locs = n_distinct(grid_id))
#> 🛠 Start summarising values...
#>
#> ✅ Finish summarising! There are 2 new added variables: n_tweets, n_locs
#> ⌛ Summarising time: 0.101 secs
#>
# show result
df_summarize_nested %>% head(3)
#> # A tibble: 3 × 4
#> u_id data n_tweets n_locs
#> <int> <list> <int> <int>
#> 1 92298565 <tibble [1,291 × 8]> 1291 200
#> 2 33908340 <tibble [1,170 × 8]> 1170 83
#> 3 11616678 <tibble [938 × 8]> 938 61
# summarize in double nested dataframe
# take first 100 users for example
# e.g., summarise total number of tweets and total number of distinct days
df_summarize_double_nested <- summarise_double_nested(df_enriched %>% head(100),
nest_cols = c("created_at", "ymd", "year", "month", "day", "wday", "hour"),
n_tweets = n(), n_days = n_distinct(ymd))
#> 🛠 Start summarising values...
#> ✅ Finish summarising! There are 2 new added variables: n_tweets, n_days
#> ⌛ Summarising time: 3.919 secs
#>
# show result
df_summarize_double_nested[1, ]
#> # A tibble: 1 × 2
#> u_id data
#> <int> <list>
#> 1 92298565 <tibble [200 × 4]>
df_summarize_double_nested[1, ]$data[[1]] %>% head(3)
#> # A tibble: 3 × 4
#> grid_id data n_tweets n_days
#> <int> <list> <int> <int>
#> 1 1581 <tibble [3 × 7]> 3 3
#> 2 1136 <tibble [7 × 7]> 7 6
#> 3 1038 <tibble [8 × 7]> 8 3
The remove_top_users() function allows you to remove the top N percent of active users based on the total number of data points per user. Although the majority of users are real people, some accounts are run by algorithms or 'bots', whereas others can be considered spam accounts. Removing a certain top N percent of active users is an oft-used approach to remove such accounts and reduce the number of such users in the final dataset.

- `df`: a dataframe with columns for user id and data point counts
- `user`: name of the column that holds a unique identifier for each user
- `counts`: name of the column that holds the data point frequency for each user
- `topNpct_user`: a decimal number that represents the percentage of users to remove
# remove top 1% of active users (e.g., based on the frequency of tweets sent by users)
df_removed_active_users <- remove_top_users(df_summarize_nested, user = "u_id",
counts = "n_tweets", topNpct_user = 1)
#> 👤 There are 100 users at this moment.
#> 👤 Skip removing active users
# show result
df_removed_active_users %>% head(3)
#> # A tibble: 3 × 4
#> u_id data n_tweets n_locs
#> <int> <list> <int> <int>
#> 1 40153763 <tibble [1,688 × 8]> 1688 233
#> 2 92298565 <tibble [1,291 × 8]> 1291 200
#> 3 31088399 <tibble [1,210 × 8]> 1210 79
The filter_verbose() function works in the same way as the filter function in the tidyverse but prints additional information about the number of (filtered) users remaining in the dataset. The filter_nested() function works in a similar way to filter_verbose() but is applied within a nested tibble.

- `df`: a dataframe with columns for user id and the variables that you are going to filter on. If a column is not in the dataset, you need to add it before applying the filter.
- `user`: name of the column that holds a unique identifier for each user
- `...`: logical predicates defined in terms of the variables in df. Only rows matching the conditions are kept.
# keep users with more than 10 tweets sent at more than 10 places
df_filtered <- filter_verbose(df_removed_active_users, user = "u_id",
n_tweets > 10 & n_locs > 10)
#> 👤 There are 100 users at this moment.
#> 🛠 Start filtering users...
#> ✅ Finish filtering! Filterred 36 users!
#> 👤 There are 64 users left.
#> ⌛ Filtering time: 0.002 secs
#>
# show result
df_filtered %>% head(3)
#> # A tibble: 3 × 4
#> u_id data n_tweets n_locs
#> <int> <list> <int> <int>
#> 1 40153763 <tibble [1,688 × 8]> 1688 233
#> 2 92298565 <tibble [1,291 × 8]> 1291 200
#> 3 31088399 <tibble [1,210 × 8]> 1210 79
# remove tweets sent on weekends or during the daytime (8am to 6pm)
df_filter_nested <- filter_nested(df_filtered, user = "u_id",
!wday %in% c(1, 7), # 1 means Sunday and 7 means Saturday
!hour %in% seq(8, 18, 1))
#> 👤 There are 64 users at this moment.
#> 🛠 Start filtering user...
#>
#> ✅ Finish Filtering! Filterred 0 users!
#> 👤 There are 64 users left!
#> ⌛ Filtering time: 0.072 secs
#>
# show result
df_filter_nested %>% head(3)
#> # A tibble: 3 × 4
#> u_id data n_tweets n_locs
#> <int> <list> <int> <int>
#> 1 40153763 <tibble [495 × 8]> 1688 233
#> 2 92298565 <tibble [424 × 8]> 1291 200
#> 3 31088399 <tibble [463 × 8]> 1210 79
df_filter_nested$data[[1]] %>% head(3)
#> # A tibble: 3 × 8
#> created_at grid_id year month day wday hour ymd
#> <dttm> <int> <dbl> <dbl> <int> <dbl> <int> <chr>
#> 1 2016-06-23 00:36:25 759 2016 6 23 5 0 2016-06-23
#> 2 2015-02-17 19:14:10 778 2015 2 17 3 19 2015-02-17
#> 3 2015-03-31 23:21:51 759 2015 3 31 3 23 2015-03-31
The mutate_verbose() function works in the same way as the mutate function in the tidyverse but prints additional information about the elapsed running time.

- `df`: a dataframe
- `...`: name-value pairs of expressions

The mutate_nested() function works in a similar way to mutate_verbose() but adds new variables inside a nested tibble.

- `df`: a nested dataframe
- `...`: name-value pairs of expressions
## let's use pre-discussed functions to filter some users first
colnames_data <- df_filtered$data[[1]] %>% names()
colnames_to_nest <- colnames_data[-which(colnames_data == "grid_id")]
df_cleaned <- df_filtered %>%
summarise_double_nested(., nest_cols = colnames_to_nest,
n_tweets_loc = n(), # number of tweets sent at each location
n_hrs_loc = n_distinct(hour), # number of unique hours of sent tweets
n_days_loc = n_distinct(ymd), # number of unique days of sent tweets
period_loc = as.numeric(max(created_at) - min(created_at), "days")) %>% # period of tweeting
unnest_verbose(data) %>%
filter_verbose(., user = "u_id",
n_tweets_loc > 10 & n_hrs_loc > 10 & n_days_loc > 10 & period_loc > 10)
#> 🛠 Start summarising values...
#> ✅ Finish summarising! There are 4 new added variables: n_tweets_loc, n_hrs_loc, n_days_loc, period_loc
#> ⌛ Summarising time: 4.545 secs
#>
#> 🛠 Start unnesting...
#> ✅ Finish unnesting!
#> ⌛ Unnesting time: 0.009 secs
#>
#> 👤 There are 64 users at this moment.
#> 🛠 Start filtering users...
#> ✅ Finish filtering! Filterred 35 users!
#> 👤 There are 29 users left.
#> ⌛ Filtering time: 0.002 secs
#>
# show cleaned dataset
df_cleaned %>% head(3)
#> # A tibble: 3 × 9
#> u_id n_tweets n_locs grid_id data n_tweets_loc n_hrs_loc n_days_loc
#> <int> <int> <int> <int> <list> <int> <int> <int>
#> 1 40153763 1688 233 759 <tibble> 467 23 366
#> 2 40153763 1688 233 842 <tibble> 57 16 51
#> 3 40153763 1688 233 1212 <tibble> 71 13 59
#> # ℹ 1 more variable: period_loc <dbl>
# ok, then let's apply the mutate_nested function to add four new variables: wd_or_wk, time_numeric, rest_or_work, wk.am_or_wk.pm
df_expanded <- df_cleaned %>%
mutate_nested(wd_or_wk = if_else(wday %in% c(1,7), "weekend", "weekday"),
time_numeric = lubridate::hour(created_at) + lubridate::minute(created_at) / 60 + lubridate::second(created_at) / 3600,
rest_or_work = if_else(time_numeric >= 9 & time_numeric <= 18, "work", "rest"),
wk.am_or_wk.pm = if_else(time_numeric >= 6 & time_numeric <= 12 & wd_or_wk == "weekend", "weekend_am", "weekend_pm"))
#> 🛠 Start adding variable(s)...
#> ✅ Finish adding! There are 4 new added variables: wd_or_wk, time_numeric, rest_or_work, wk.am_or_wk.pm
#> ⌛ Adding time: 0.255 secs
#>
# show result
df_expanded %>% head(3)
#> # A tibble: 3 × 9
#> u_id n_tweets n_locs grid_id data n_tweets_loc n_hrs_loc n_days_loc
#> <int> <int> <int> <int> <list> <int> <int> <int>
#> 1 40153763 1688 233 759 <tibble> 467 23 366
#> 2 40153763 1688 233 842 <tibble> 57 16 51
#> 3 40153763 1688 233 1212 <tibble> 71 13 59
#> # ℹ 1 more variable: period_loc <dbl>
The prop_factor_nested() function allows you to calculate the proportion of categories for a variable inside the list-column and convert each category to a new variable added to the dataframe. For example, inside the list-column, the variable wd_or_wk has two categories named weekend and weekday; when you call prop_factor_nested(), it calculates the proportion of weekend and weekday separately, and the results are then expanded into two new columns called weekday and weekend added to the dataframe.

- `df`: a nested dataframe
- `var`: name of the column to calculate inside the list-column
# categories for a variable inside the list-column: e.g., weekend or weekday
df_expanded$data[[1]] %>% head(3)
#> # A tibble: 3 × 11
#> created_at year month day wday hour ymd wd_or_wk time_numeric
#> <dttm> <dbl> <dbl> <int> <dbl> <int> <chr> <chr> <dbl>
#> 1 2016-06-23 00:36:25 2016 6 23 5 0 2016-… weekday 0.607
#> 2 2014-09-03 15:46:45 2014 9 3 4 15 2014-… weekday 15.8
#> 3 2015-03-31 23:21:51 2015 3 31 3 23 2015-… weekday 23.4
#> # ℹ 2 more variables: rest_or_work <chr>, wk.am_or_wk.pm <chr>
# calculate proportion of categories for four variables: wd_or_wk, rest_or_work, wk.am_or_wk.pm
df_expanded <- df_expanded %>%
prop_factor_nested(wd_or_wk, rest_or_work, wk.am_or_wk.pm)
#> 🛠 Start calculating proportion...
#>
#> ✅ Finish calculating! There are 6 new calculated variables: rest, weekday, weekend, weekend_am, weekend_pm, work
#> ⌛ Calculating time: 1.839 secs
#>
# show result
df_expanded %>% head(3)
#> # A tibble: 3 × 15
#> u_id n_tweets n_locs grid_id data n_tweets_loc n_hrs_loc n_days_loc
#> <int> <int> <int> <int> <list> <int> <int> <int>
#> 1 40153763 1688 233 759 <tibble> 467 23 366
#> 2 40153763 1688 233 842 <tibble> 57 16 51
#> 3 40153763 1688 233 1212 <tibble> 71 13 59
#> # ℹ 7 more variables: period_loc <dbl>, rest <dbl>, weekday <dbl>,
#> # weekend <dbl>, weekend_am <dbl>, weekend_pm <dbl>, work <dbl>
The score_nested() function allows you to give a weighted value to one or more variables in a nested dataframe.

- `df`: a dataframe nested by user
- `user`: name of the column that holds a unique identifier for each user
- `location`: name of the column that holds a unique identifier for each location
- `keep_ori_vars`: option to keep or drop the original variables
- `...`: name-value pairs of expressions

The score_summary() function summarises all scored columns and returns a single summary score per row.

- `df`: a dataframe
- `user`: name of the column that holds a unique identifier for each user
- `location`: name of the column that holds a unique identifier for each location
- `...`: names of the scored columns
## let's add two more variables before we do the scoring
df_expanded <- df_expanded %>%
summarise_nested(n_wdays_loc = n_distinct(wday),
n_months_loc = n_distinct(month))
#> 🛠 Start summarising values...
#>
#> ✅ Finish summarising! There are 2 new added variables: n_wdays_loc, n_months_loc
#> ⌛ Summarising time: 0.113 secs
#>
df_expanded %>% head(3)
#> # A tibble: 3 × 17
#> u_id n_tweets n_locs grid_id data n_tweets_loc n_hrs_loc n_days_loc
#> <int> <int> <int> <int> <list> <int> <int> <int>
#> 1 40153763 1688 233 759 <tibble> 467 23 366
#> 2 40153763 1688 233 842 <tibble> 57 16 51
#> 3 40153763 1688 233 1212 <tibble> 71 13 59
#> # ℹ 9 more variables: period_loc <dbl>, rest <dbl>, weekday <dbl>,
#> # weekend <dbl>, weekend_am <dbl>, weekend_pm <dbl>, work <dbl>,
#> # n_wdays_loc <int>, n_months_loc <int>
# when calculating scores, you can give weight to different variables, but the total weight should add up to 1
df_scored <- df_expanded %>%
score_nested(., user = "u_id", location = "grid_id", keep_original_vars = F,
s_n_tweets_loc = 0.1 * (n_tweets_loc/max(n_tweets_loc)),
s_n_hrs_loc = 0.1 * (n_hrs_loc/24),
s_n_days_loc = 0.1 * (n_days_loc/max(n_days_loc)),
s_period_loc = 0.1 * (period_loc/max(period_loc)),
s_n_wdays_loc = 0.1 * (n_wdays_loc/7),
s_n_months_loc = 0.1 * (n_months_loc/12),
s_weekend = 0.1 * (weekend),
s_rest = 0.2 * (rest),
s_wk_am = 0.1 * (weekend_am))
#> 🛠 Start scoring ...
#> ✅ Finish scoring! There are 9 new added variables: s_n_tweets_loc, s_n_hrs_loc, s_n_days_loc, s_period_loc, s_n_wdays_loc, s_n_months_loc, s_weekend, s_rest, s_wk_am
#> ⌛ Scoring time: 0.199 secs
#>
df_scored %>% head(3)
#> # A tibble: 3 × 2
#> u_id data
#> <int> <list>
#> 1 40153763 <tibble [12 × 10]>
#> 2 92298565 <tibble [11 × 10]>
#> 3 31088399 <tibble [4 × 10]>
df_scored$data[[1]]
#> # A tibble: 12 × 10
#> grid_id s_n_tweets_loc s_n_hrs_loc s_n_days_loc s_period_loc s_n_wdays_loc
#> <int> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 759 0.1 0.0958 0.1 0.0932 0.1
#> 2 842 0.0122 0.0667 0.0139 0.1 0.1
#> 3 1212 0.0152 0.0542 0.0161 0.0225 0.1
#> 4 593 0.0133 0.0583 0.0142 0.0863 0.1
#> 5 1125 0.00600 0.0542 0.00656 0.0778 0.1
#> 6 797 0.00835 0.0667 0.0104 0.0851 0.1
#> 7 843 0.0274 0.0583 0.0301 0.0692 0.1
#> 8 862 0.0278 0.0583 0.0295 0.0722 0.1
#> 9 1709 0.00343 0.0542 0.00437 0.0870 0.1
#> 10 1174 0.00835 0.0542 0.0101 0.0752 0.1
#> 11 614 0.00321 0.0542 0.00410 0.0986 0.0714
#> 12 1038 0.00428 0.0458 0.00519 0.0784 0.1
#> # ℹ 4 more variables: s_n_months_loc <dbl>, s_weekend <dbl>, s_rest <dbl>,
#> # s_wk_am <dbl>
### alternatively, we can replace the score function with the mutate function
df_scored_2 <- df_expanded %>%
nest_verbose(-u_id) %>%
mutate_nested(s_n_tweets_loc = 0.1 * (n_tweets_loc/max(n_tweets_loc)),
s_n_hrs_loc = 0.1 * (n_hrs_loc/24),
s_n_days_loc = 0.1 * (n_days_loc/max(n_days_loc)),
s_period_loc = 0.1 * (period_loc/max(period_loc)),
s_n_wdays_loc = 0.1 * (n_wdays_loc/7),
s_n_months_loc = 0.1 * (n_months_loc/12),
s_weekend = 0.1 * (weekend),
s_rest = 0.2 * (rest),
s_wk_am = 0.1 * (weekend_am))
#> 🛠 Start nesting...
#> ✅ Finish nesting!
#> ⌛ Nesting time: 0.009 secs
#>
#> 🛠 Start adding variable(s)...
#>
#> ✅ Finish adding! There are 9 new added variables: s_n_tweets_loc, s_n_hrs_loc, s_n_days_loc, s_period_loc, s_n_wdays_loc, s_n_months_loc, s_weekend, s_rest, s_wk_am
#> ⌛ Adding time: 0.069 secs
#>
df_scored_2 %>% head(3)
#> # A tibble: 3 × 2
#> u_id data
#> <int> <list>
#> 1 40153763 <tibble [12 × 25]>
#> 2 92298565 <tibble [11 × 25]>
#> 3 31088399 <tibble [4 × 25]>
df_scored_2$data[[1]]
#> # A tibble: 12 × 25
#> n_tweets n_locs grid_id data n_tweets_loc n_hrs_loc n_days_loc period_loc
#> <int> <int> <int> <list> <int> <int> <int> <dbl>
#> 1 1688 233 759 <tibble> 467 23 366 1211.
#> 2 1688 233 842 <tibble> 57 16 51 1300.
#> 3 1688 233 1212 <tibble> 71 13 59 292.
#> 4 1688 233 593 <tibble> 62 14 52 1121.
#> 5 1688 233 1125 <tibble> 28 13 24 1011.
#> 6 1688 233 797 <tibble> 39 16 38 1106.
#> 7 1688 233 843 <tibble> 128 14 110 900.
#> 8 1688 233 862 <tibble> 130 14 108 938.
#> 9 1688 233 1709 <tibble> 16 13 16 1130.
#> 10 1688 233 1174 <tibble> 39 13 37 977.
#> 11 1688 233 614 <tibble> 15 13 15 1281.
#> 12 1688 233 1038 <tibble> 20 11 19 1018.
#> # ℹ 17 more variables: rest <dbl>, weekday <dbl>, weekend <dbl>,
#> # weekend_am <dbl>, weekend_pm <dbl>, work <dbl>, n_wdays_loc <int>,
#> # n_months_loc <int>, s_n_tweets_loc <dbl>, s_n_hrs_loc <dbl>,
#> # s_n_days_loc <dbl>, s_period_loc <dbl>, s_n_wdays_loc <dbl>,
#> # s_n_months_loc <dbl>, s_weekend <dbl>, s_rest <dbl>, s_wk_am <dbl>
# sum the variable scores for each location
df_score_summed <- df_scored %>%
score_summary(., user = "u_id", location = "grid_id", starts_with("s_"))
#> 🛠 Start summing scores...
#>
#> ✅ Finish summing! There are 1 new added variables: score
#> ⌛ Summing time: 0.088 secs
#>
df_score_summed %>% head(3)
#> # A tibble: 3 × 2
#> u_id data
#> <int> <list>
#> 1 40153763 <tibble [12 × 11]>
#> 2 92298565 <tibble [11 × 11]>
#> 3 31088399 <tibble [4 × 11]>
df_score_summed$data[[1]]
#> # A tibble: 12 × 11
#> grid_id s_n_tweets_loc s_n_hrs_loc s_n_days_loc s_period_loc s_n_wdays_loc
#> <int> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 759 0.1 0.0958 0.1 0.0932 0.1
#> 2 842 0.0122 0.0667 0.0139 0.1 0.1
#> 3 1212 0.0152 0.0542 0.0161 0.0225 0.1
#> 4 593 0.0133 0.0583 0.0142 0.0863 0.1
#> 5 1125 0.00600 0.0542 0.00656 0.0778 0.1
#> 6 797 0.00835 0.0667 0.0104 0.0851 0.1
#> 7 843 0.0274 0.0583 0.0301 0.0692 0.1
#> 8 862 0.0278 0.0583 0.0295 0.0722 0.1
#> 9 1709 0.00343 0.0542 0.00437 0.0870 0.1
#> 10 1174 0.00835 0.0542 0.0101 0.0752 0.1
#> 11 614 0.00321 0.0542 0.00410 0.0986 0.0714
#> 12 1038 0.00428 0.0458 0.00519 0.0784 0.1
#> # ℹ 5 more variables: s_n_months_loc <dbl>, s_weekend <dbl>, s_rest <dbl>,
#> # s_wk_am <dbl>, score <dbl>
The extract_location() function allows you to sort the locations of each user by the descending value of a selected column and return the top N location(s) for each user. The extracted location(s) then represent the most likely home location of the user.

- `df`: a dataframe nested by user
- `user`: name of the column that holds a unique identifier for each user
- `location`: name of the column that holds a unique identifier for each location
- `show_n_loc`: a single number that decides the number of locations to be extracted
- `keep_score`: option to keep or remove the columns with scored values
- `...`: name of the column(s) used to sort the locations
# extract homes for users based on the score value (returning the 1 most likely home per user)
df_home <- df_score_summed %>%
extract_location(., user = "u_id", location = "grid_id", show_n_loc = 1, keep_score = F, desc(score))
#> 🛠 Start extracting homes for users...
#> Warning: There was 1 warning in `dplyr::mutate()`.
#> ℹ In argument: `home = purrr::map_chr(df[[colname_nested_data]],
#> ~get_loc_with_progress(.))`.
#> Caused by warning:
#> ! Automatic coercion from integer to character was deprecated in purrr 1.0.0.
#> ℹ Please use an explicit call to `as.character()` within `map_chr()` instead.
#> ℹ The deprecated feature was likely used in the homelocator package.
#> Please report the issue to the authors.
#>
#> 🎉 Congratulations!! Your have found 29 users' potential home(s).
#> ⌛ Extracting time: 0.194 secs
#>
df_home %>% head(3)
#> # A tibble: 3 × 2
#> u_id home
#> <int> <chr>
#> 1 40153763 759
#> 2 92298565 1582
#> 3 31088399 1113
# extract homes for users and keep scores of locations
df_home <- df_score_summed %>%
extract_location(., user = "u_id", location = "grid_id", show_n_loc = 1, keep_score = T, desc(score))
#> 🛠 Start extracting homes for users...
#> 🎉 Congratulations!! Your have found 29 users' potential home(s).
#> ⌛ Extracting time: 0.105 secs
#>
df_home %>% head(3)
#> # A tibble: 3 × 3
#> u_id data home
#> <int> <list> <chr>
#> 1 40153763 <tibble [12 × 11]> 759
#> 2 92298565 <tibble [11 × 11]> 1582
#> 3 31088399 <tibble [4 × 11]> 1113
df_home$data[[3]]
#> # A tibble: 4 × 11
#> grid_id s_n_tweets_loc s_n_hrs_loc s_n_days_loc s_period_loc s_n_wdays_loc
#> <int> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 880 0.0774 0.1 0.0577 0.0796 0.1
#> 2 1113 0.1 0.0917 0.1 0.1 0.1
#> 3 862 0.0519 0.1 0.0477 0.0716 0.1
#> 4 1594 0.00501 0.05 0.00846 0.0621 0.0714
#> # ℹ 5 more variables: s_n_months_loc <dbl>, s_weekend <dbl>, s_rest <dbl>,
#> # s_wk_am <dbl>, score <dbl>
The spread_nested() function works in the same way as the spread function in the tidyverse but works inside a nested data frame. It allows you to spread a key-value pair across multiple columns.

- `df`: a nested dataframe
- `key_var`: column name or position of the key
- `value_var`: column name or position of the value
# let's add one timeframe variable first and calculate the number of data points at each timeframe
df_timeframe <- df_enriched %>%
mutate_nested(timeframe = if_else(hour >= 2 & hour < 8, "Rest", if_else(hour >= 8 & hour < 19, "Active", "Leisure")))
#> 🛠 Start adding variable(s)...
#>
#> ✅ Finish adding! There are 1 new added variables: timeframe
#> ⌛ Adding time: 0.124 secs
#>
colnames_nested_data <- df_timeframe$data[[1]] %>% names()
colnames_to_nest <- colnames_nested_data[-which(colnames_nested_data %in% c("grid_id", "timeframe"))]
df_timeframe <- df_timeframe %>%
head(20) %>% # take first 20 users as example
summarise_double_nested(., nest_cols = colnames_to_nest,
n_points_timeframe = n())
#> 🛠 Start summarising values...
#> ✅ Finish summarising! There are 1 new added variables: n_points_timeframe
#> ⌛ Summarising time: 1.961 secs
#>
df_timeframe$data[[2]]
#> # A tibble: 102 × 4
#> grid_id timeframe data n_points_timeframe
#> <int> <chr> <list> <int>
#> 1 1461 Active <tibble [280 × 7]> 280
#> 2 1461 Leisure <tibble [173 × 7]> 173
#> 3 1461 Rest <tibble [418 × 7]> 418
#> 4 1239 Active <tibble [1 × 7]> 1
#> 5 1460 Leisure <tibble [53 × 7]> 53
#> 6 1449 Leisure <tibble [21 × 7]> 21
#> 7 1629 Leisure <tibble [1 × 7]> 1
#> 8 1260 Active <tibble [2 × 7]> 2
#> 9 1399 Rest <tibble [2 × 7]> 2
#> 10 1345 Active <tibble [1 × 7]> 1
#> # ℹ 92 more rows
# spread timeframe in the nested dataframe, with timeframe as the key and n_points_timeframe as the value
df_timeframe_spreaded <- df_timeframe %>%
spread_nested(., key_var = "timeframe", value_var = "n_points_timeframe")
#> 🛠 Start spreading timeframe variable...
#> ✅ Finish spreading! There are 3 new added variables: Active, Leisure, Rest
#> ⌛ Spreading time: 0.278 secs
#>
df_timeframe_spreaded$data[[2]]
#> # A tibble: 102 × 5
#> grid_id data Active Leisure Rest
#> <int> <list> <int> <int> <int>
#> 1 354 <tibble [1 × 7]> 0 1 0
#> 2 355 <tibble [1 × 7]> 1 0 0
#> 3 374 <tibble [1 × 7]> 0 1 0
#> 4 597 <tibble [1 × 7]> 0 1 0
#> 5 611 <tibble [1 × 7]> 1 0 0
#> 6 612 <tibble [1 × 7]> 1 0 0
#> 7 613 <tibble [1 × 7]> 1 0 0
#> 8 614 <tibble [1 × 7]> 1 0 0
#> 9 645 <tibble [3 × 7]> 3 0 0
#> 10 645 <tibble [1 × 7]> 0 0 1
#> # ℹ 92 more rows
The arrange_nested() function works in the same way as the arrange function in the tidyverse but works inside a nested data frame. It allows you to sort rows by variables.

- `df`: a nested dataframe
- `...`: comma-separated list of unquoted variable names
df_enriched$data[[3]]
#> # A tibble: 938 × 8
#> created_at grid_id year month day wday hour ymd
#> <dttm> <int> <dbl> <dbl> <int> <dbl> <int> <chr>
#> 1 2014-07-18 10:08:21 1375 2014 7 18 6 10 2014-07-18
#> 2 2013-11-24 23:16:24 1375 2013 11 24 1 23 2013-11-24
#> 3 2013-06-12 00:59:42 1375 2013 6 12 4 0 2013-06-12
#> 4 2013-06-07 19:56:56 1375 2013 6 7 6 19 2013-06-07
#> 5 2013-06-12 01:08:29 1375 2013 6 12 4 1 2013-06-12
#> 6 2013-06-02 20:53:09 1375 2013 6 2 1 20 2013-06-02
#> 7 2013-06-10 20:57:17 1375 2013 6 10 2 20 2013-06-10
#> 8 2013-06-04 17:42:06 1375 2013 6 4 3 17 2013-06-04
#> 9 2014-07-10 01:00:45 1375 2014 7 10 5 1 2014-07-10
#> 10 2013-06-13 01:48:27 1375 2013 6 13 5 1 2013-06-13
#> # ℹ 928 more rows
df_arranged <- df_enriched %>%
arrange_nested(desc(hour)) # arrange the hour in descending order
#> 🛠 Start sorting...
#>
#> ✅ Finish sorting!
#> ⌛ Sorting time: 0.173 secs
#>
df_arranged$data[[3]]
#> # A tibble: 938 × 8
#> created_at grid_id year month day wday hour ymd
#> <dttm> <int> <dbl> <dbl> <int> <dbl> <int> <chr>
#> 1 2013-11-24 23:16:24 1375 2013 11 24 1 23 2013-11-24
#> 2 2013-06-08 23:41:19 1375 2013 6 8 7 23 2013-06-08
#> 3 2013-06-02 23:21:04 1375 2013 6 2 1 23 2013-06-02
#> 4 2013-06-05 23:31:59 1375 2013 6 5 4 23 2013-06-05
#> 5 2013-06-02 23:29:43 1375 2013 6 2 1 23 2013-06-02
#> 6 2013-06-06 23:51:50 1375 2013 6 6 5 23 2013-06-06
#> 7 2014-04-27 23:40:59 1375 2014 4 27 1 23 2014-04-27
#> 8 2013-06-05 23:47:27 1375 2013 6 5 4 23 2013-06-05
#> 9 2013-06-02 23:02:04 1375 2013 6 2 1 23 2013-06-02
#> 10 2013-06-06 23:07:49 1375 2013 6 6 5 23 2013-06-06
#> # ℹ 928 more rows
The arrange_double_nested() function works in a similar way to arrange_nested() but applies to a double-nested data frame. You choose the columns to nest inside the already nested dataframe, and the rows are then sorted by the selected column.
- `df`: a nested dataframe
- `nest_cols`: name of columns to nest in existing list-column
- `...`: comma separated list of unquoted variable names
# get the name of columns to nest
colnames_nested_data <- df_enriched$data[[1]] %>% names()
colnames_to_nest <- colnames_nested_data[-which(colnames_nested_data %in% c("grid_id"))]
df_double_arranged <- df_enriched %>%
head(20) %>% # take the first 20 users for example
arrange_double_nested(., nest_cols = colnames_to_nest,
desc(created_at)) # sort by time in descending order
#> 🛠 Start sorting...
#> ✅ Finish sorting!
#> ⌛ Sorting time: 2.544 secs
#>
# original third user
df_enriched[3, ]
#> # A tibble: 1 × 2
#> u_id data
#> <int> <list>
#> 1 11616678 <tibble [938 × 8]>
# third user data points
df_enriched$data[[3]]
#> # A tibble: 938 × 8
#> created_at grid_id year month day wday hour ymd
#> <dttm> <int> <dbl> <dbl> <int> <dbl> <int> <chr>
#> 1 2014-07-18 10:08:21 1375 2014 7 18 6 10 2014-07-18
#> 2 2013-11-24 23:16:24 1375 2013 11 24 1 23 2013-11-24
#> 3 2013-06-12 00:59:42 1375 2013 6 12 4 0 2013-06-12
#> 4 2013-06-07 19:56:56 1375 2013 6 7 6 19 2013-06-07
#> 5 2013-06-12 01:08:29 1375 2013 6 12 4 1 2013-06-12
#> 6 2013-06-02 20:53:09 1375 2013 6 2 1 20 2013-06-02
#> 7 2013-06-10 20:57:17 1375 2013 6 10 2 20 2013-06-10
#> 8 2013-06-04 17:42:06 1375 2013 6 4 3 17 2013-06-04
#> 9 2014-07-10 01:00:45 1375 2014 7 10 5 1 2014-07-10
#> 10 2013-06-13 01:48:27 1375 2013 6 13 5 1 2013-06-13
#> # ℹ 928 more rows
# arranged third user
df_double_arranged[3, ]
#> # A tibble: 1 × 2
#> u_id data
#> <int> <list>
#> 1 11616678 <tibble [61 × 2]>
# arranged time
df_double_arranged$data[[3]]$data[[2]]
#> # A tibble: 3 × 7
#> created_at year month day wday hour ymd
#> <dttm> <dbl> <dbl> <int> <dbl> <int> <chr>
#> 1 2013-06-07 09:44:12 2013 6 7 6 9 2013-06-07
#> 2 2013-06-04 09:42:57 2013 6 4 3 9 2013-06-04
#> 3 2013-06-02 13:40:22 2013 6 2 1 13 2013-06-02
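The resulting structure can also be built by hand with two `nest()` calls, which may clarify what "double nested" means here; a toy sketch with illustrative column names:

```r
library(dplyr)
library(tidyr)
library(purrr)

toy <- tibble(
  u_id = c(1, 1, 1),
  grid_id = c(10, 10, 20),
  hour = c(5, 9, 3)
)

# Nest by user first, then nest each user's rows by location,
# mirroring the u_id -> grid_id -> data points structure above
double_nested <- toy %>%
  nest(data = -u_id) %>%
  mutate(data = map(data, ~ nest(.x, data = -grid_id)))

double_nested$data[[1]]$data[[1]]$hour
#> [1] 5 9
```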
The `top_n_nested()` function works like dplyr's `top_n()`, but on a nested dataframe: it selects the top entries within each group, ordered by `wt`.
df_enriched$data[[2]]
#> # A tibble: 1,170 × 8
#> created_at grid_id year month day wday hour ymd
#> <dttm> <int> <dbl> <dbl> <int> <dbl> <int> <chr>
#> 1 2014-10-03 16:29:48 1461 2014 10 3 6 16 2014-10-03
#> 2 2014-01-29 23:29:27 1461 2014 1 29 4 23 2014-01-29
#> 3 2014-12-07 04:54:01 1461 2014 12 7 1 4 2014-12-07
#> 4 2014-09-24 00:27:25 1461 2014 9 24 4 0 2014-09-24
#> 5 2014-05-05 09:19:41 1239 2014 5 5 2 9 2014-05-05
#> 6 2014-12-17 16:36:28 1461 2014 12 17 4 16 2014-12-17
#> 7 2014-01-29 23:00:14 1461 2014 1 29 4 23 2014-01-29
#> 8 2014-12-15 07:47:08 1461 2014 12 15 2 7 2014-12-15
#> 9 2014-09-28 01:28:52 1461 2014 9 28 1 1 2014-09-28
#> 10 2015-01-20 16:24:24 1461 2015 1 20 3 16 2015-01-20
#> # ℹ 1,160 more rows
## get the top 1 row based on hour
df_top_1 <- df_enriched %>%
top_n_nested(., n = 1, wt = "hour")
#> 🛠 Start selecting top 1 row(s)...
#>
#> ✅ Finish selecting top 1 row(s)!
#> ⌛ Selecting time: 0.286 secs
#>
df_top_1$data[[2]]
#> # A tibble: 46 × 8
#> created_at grid_id year month day wday hour ymd
#> <dttm> <int> <dbl> <dbl> <int> <dbl> <int> <chr>
#> 1 2014-01-29 23:29:27 1461 2014 1 29 4 23 2014-01-29
#> 2 2014-01-29 23:00:14 1461 2014 1 29 4 23 2014-01-29
#> 3 2014-05-07 23:15:11 1460 2014 5 7 4 23 2014-05-07
#> 4 2014-06-20 23:00:19 1461 2014 6 20 6 23 2014-06-20
#> 5 2014-05-07 23:05:08 1425 2014 5 7 4 23 2014-05-07
#> 6 2014-08-15 23:40:17 1460 2014 8 15 6 23 2014-08-15
#> 7 2014-04-28 23:27:11 1461 2014 4 28 2 23 2014-04-28
#> 8 2015-03-06 23:04:42 1471 2015 3 6 6 23 2015-03-06
#> 9 2014-12-21 23:01:58 1461 2014 12 21 1 23 2014-12-21
#> 10 2014-05-07 23:23:21 1460 2014 5 7 4 23 2014-05-07
#> # ℹ 36 more rows
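Note that selecting the "top 1" row returned 46 rows above: like `top_n()`, ties on `wt` are all kept. The same behaviour can be reproduced by mapping dplyr's `slice_max()` (which keeps ties by default) over the list-column; a sketch on toy data:

```r
library(dplyr)
library(tidyr)
library(purrr)

toy <- tibble(
  u_id = c(1, 1, 1, 2, 2),
  hour = c(23, 23, 7, 15, 9)
) %>%
  nest(data = -u_id)

# Keep the row(s) with the highest hour per user; both hour-23 rows
# survive for user 1, so "n = 1" yields two rows there
toy_top <- toy %>%
  mutate(data = map(data, ~ slice_max(.x, hour, n = 1)))

nrow(toy_top$data[[1]])
#> [1] 2
```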
To use the embedded recipes to identify users' home locations, use the `identify_location()` function.
- `df`: a dataframe with columns for the user id, location and timestamp
- `user`: name of column that holds a unique identifier for each user
- `timestamp`: name of timestamp column; should be `POSIXct`
- `location`: name of column that holds a unique identifier for each location
- `recipe`: embedded algorithm used to identify the most likely home locations for users
- `show_n_loc`: number of potential homes to extract
- `keep_score`: option to keep or remove the calculated result/score per user per location

Currently available recipes:
- `recipe_HMLC`
- `recipe_FREQ`
- `recipe_OSNA`: Efstathiades et al. 2015
- `recipe_APDM`: Ahas et al. 2010
# recipe: homelocator -- HMLC
identify_location(test_sample, user = "u_id", timestamp = "created_at",
location = "grid_id", show_n_loc = 1, recipe = "HMLC")
# recipe: Frequency -- FREQ
identify_location(test_sample, user = "u_id", timestamp = "created_at",
location = "grid_id", show_n_loc = 1, recipe = "FREQ")
# recipe: Online Social Network Activity -- OSNA
identify_location(test_sample, user = "u_id", timestamp = "created_at",
location = "grid_id", show_n_loc = 1, recipe = "OSNA")
# recipe: Ahas et al. 2010 -- APDM
## The APDM recipe strictly returns the single most likely home location
## Important: load the neighbors table before using this recipe!
## example (using sf): st_queen <- function(a, b = a) st_relate(a, b, pattern = "F***T****")
## neighbors <- st_queen(df_sf), then convert the result to a dataframe
data("df_neighbors", package = "homelocator")
identify_location(test_sample, user = "u_id", timestamp = "created_at",
location = "grid_id", recipe = "APDM", keep_score = FALSE)