R - More efficient strategy for which() or match()
I have a vector of positive and negative numbers
vec<-c(seq(-100,-1), rep(0,20), seq(1,100))
The vector in my actual use is much larger than this example and takes on a random set of values. I have to repeatedly find the number of negative numbers in the vector, and I am finding this quite inefficient.
Since I only need to find the number of negative numbers, and the vector is sorted, I just need to know the index of the first 0 or positive number (there may be no 0s in the actual random vectors).
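For instance, in the small vector above the first non-negative element is at index 101, so there are 100 negatives. Purely to make that index arithmetic concrete (this one-liner still scans the whole vector, so it is an illustration, not a solution):

which(vec >= 0)[1] - 1   # 100: index of the first non-negative, minus 1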
I am currently using this code to find the length:
length(which(vec<0))
but this forces R to go through the entire vector and, since it is sorted, there is no need.
I could use
match(0, vec)
but the vector may not have any 0s.
So the question is: is there some kind of match() function that applies a condition instead of finding a specific value? Or is there a more efficient way to run the which() code?
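One near-miss worth noting: base R's Position() does accept a predicate, like a condition-based match(), but it still walks the vector linearly from the front, so it does not exploit the sorting (and it returns NA when nothing matches):

Position(function(x) x >= 0, vec) - 1   # 100 here; NA if no element matches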
Thank you.
The solutions offered so far all imply creating a logical(length(vec)) and doing a full or partial scan on this. As you note, the vector is sorted. We can exploit this by doing a binary search. I started out thinking I'd be super-clever and implement it in C for greater speed, but had trouble debugging the indexing of the algorithm (which is the tricky part!), so I wrote it in R:
f3 <- function(x) {
    ## binary search for the index of the last negative element
    imin <- 1L
    imax <- length(x)
    while (imax >= imin) {
        imid <- as.integer(imin + (imax - imin) / 2)
        if (x[imid] >= 0)
            imax <- imid - 1L
        else
            imin <- imid + 1L
    }
    imax   # 0 when there are no negative values
}
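As a quick spot check on the small vector from the question:

> vec <- c(seq(-100,-1), rep(0,20), seq(1,100))
> f3(vec)
[1] 100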
For comparison with the other suggestions:
f0 <- function(v) length(which(v < 0))   # the original approach
f1 <- function(v) sum(v < 0)             # count the TRUEs directly
f2 <- function(v) which.min(v < 0) - 1L  # index of the first FALSE (first non-negative), minus 1
and for fun:
library(compiler)
f3.c <- cmpfun(f3)
leading to
> vec <- c(seq(-100,-1,length.out=1e6), rep(0,20), seq(1,100,length.out=1e6))
> identical(f0(vec), f1(vec))
[1] TRUE
> identical(f0(vec), f2(vec))
[1] TRUE
> identical(f0(vec), f3(vec))
[1] TRUE
> identical(f0(vec), f3.c(vec))
[1] TRUE
> microbenchmark(f0(vec), f1(vec), f2(vec), f3(vec), f3.c(vec))
Unit: microseconds
      expr       min        lq     median         uq       max neval
   f0(vec) 15274.275 15347.870 15406.1430 15605.8470 19890.903   100
   f1(vec) 15513.807 15575.229 15651.2970 17064.8830 18326.293   100
   f2(vec) 21473.814 21558.989 21679.3210 22733.1710 27435.889   100
   f3(vec)    51.715    56.050    75.4495    78.5295   100.730   100
 f3.c(vec)    11.612    17.147    28.5570    31.3160    49.781   100
Probably there are tricky edge cases I've got wrong! Moving to C, I did:
library(inline)
f4 <- cfunction(c(x = "numeric"), "
    int imin = 0, imax = Rf_length(x) - 1, imid;
    while (imax >= imin) {
        imid = imin + (imax - imin) / 2;
        if (REAL(x)[imid] >= 0)
            imax = imid - 1;
        else
            imin = imid + 1;
    }
    return ScalarInteger(imax + 1);
")
with
> identical(f3(vec), f4(vec))
[1] TRUE
> microbenchmark(f3(vec), f3.c(vec), f4(vec))
Unit: nanoseconds
      expr   min      lq  median      uq   max neval
   f3(vec) 52096 53192.0 54918.5 55539.0 69491   100
 f3.c(vec) 10924 12233.5 12869.0 13410.0 20038   100
   f4(vec)   553   796.0   893.5  1004.5  2908   100
findInterval
findInterval came up when a similar question was asked on the R-help list. It is slow but safe, checking that vec
is sorted and dealing with NA values. If one wants to live on the edge (arguably no worse than implementing f3 or f4), then
f5.i <- function(v) .Internal(findInterval(v, 0 - .Machine$double.neg.eps, FALSE, FALSE))
is as fast as the C implementation, but more robust, and vectorized (i.e., it takes a vector of values in its second argument, which makes range-like calculations easy).
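If one would rather keep the safety checks, a sketch using the exported interface (the name f5 is just mine, for symmetry; note that the exported findInterval() takes the query values first and the sorted vector second, the reverse of the .Internal call):

f5 <- function(v) findInterval(0 - .Machine$double.neg.eps, v)  # counts elements < 0
f5(vec)   # same count as f3(vec), with sortedness and NA checks included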