R - More efficient strategy for which() or match()
I have a vector of positive and negative numbers
vec<-c(seq(-100,-1), rep(0,20), seq(1,100))
The vector in my actual use is much larger than this example and takes on a random set of values. I have to repeatedly find the number of negative numbers in the vector, and I am finding this quite inefficient.
Since I only need to find the number of negative numbers, and the vector is sorted, I just need to know the index of the first 0 or positive number (there may be no 0s in the actual random vectors).
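For instance, in the small vector above the first non-negative element is at index 101, so there are 100 negatives. Purely to make that index arithmetic concrete (this one-liner still scans the whole vector, so it is an illustration, not a solution):

which(vec >= 0)[1] - 1   # 100: index of the first non-negative, minus 1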
I am currently using this code to find the length:
length(which(vec<0))
but this forces R to go through the entire vector and, since it is sorted, there is no need.
I could use
match(0, vec)
but the vector may not have any 0s.
So the question is: is there some kind of match() function that applies a condition instead of finding a specific value? Or is there a more efficient way to run the which() code?
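One near-miss worth noting: base R's Position() does accept a predicate, like a condition-based match(), but it still walks the vector linearly from the front, so it does not exploit the sorting (and it returns NA when nothing matches):

Position(function(x) x >= 0, vec) - 1   # 100 here; NA if no element matches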
Thank you.
The solutions offered so far all imply creating a logical(length(vec)) and doing a full or partial scan on this. As you note, the vector is sorted. We can exploit this by doing a binary search. I started out thinking I'd be super-clever and implement it in C for greater speed, but had trouble debugging the indexing of the algorithm (which is the tricky part!), so I wrote it in R:
f3 <- function(x) {
    ## binary search for the index of the last negative element
    imin <- 1L
    imax <- length(x)
    while (imax >= imin) {
        imid <- as.integer(imin + (imax - imin) / 2)
        if (x[imid] >= 0)
            imax <- imid - 1L
        else
            imin <- imid + 1L
    }
    imax   # 0 when there are no negative values
}
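As a quick spot check on the small vector from the question:

> vec <- c(seq(-100,-1), rep(0,20), seq(1,100))
> f3(vec)
[1] 100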
For comparison with the other suggestions:
f0 <- function(v) length(which(v < 0))   # the original approach
f1 <- function(v) sum(v < 0)             # count the TRUEs directly
f2 <- function(v) which.min(v < 0) - 1L  # index of the first FALSE (first non-negative), minus 1
and for fun:
library(compiler)
f3.c <- cmpfun(f3)
leading to
> vec <- c(seq(-100,-1,length.out=1e6), rep(0,20), seq(1,100,length.out=1e6))
> identical(f0(vec), f1(vec))
[1] TRUE
> identical(f0(vec), f2(vec))
[1] TRUE
> identical(f0(vec), f3(vec))
[1] TRUE
> identical(f0(vec), f3.c(vec))
[1] TRUE
> microbenchmark(f0(vec), f1(vec), f2(vec), f3(vec), f3.c(vec))
Unit: microseconds
      expr       min        lq     median         uq       max neval
   f0(vec) 15274.275 15347.870 15406.1430 15605.8470 19890.903   100
   f1(vec) 15513.807 15575.229 15651.2970 17064.8830 18326.293   100
   f2(vec) 21473.814 21558.989 21679.3210 22733.1710 27435.889   100
   f3(vec)    51.715    56.050    75.4495    78.5295   100.730   100
 f3.c(vec)    11.612    17.147    28.5570    31.3160    49.781   100
Probably there are tricky edge cases I've got wrong! Moving to C, I did:
library(inline)
f4 <- cfunction(c(x = "numeric"), "
    int imin = 0, imax = Rf_length(x) - 1, imid;
    while (imax >= imin) {
        imid = imin + (imax - imin) / 2;
        if (REAL(x)[imid] >= 0)
            imax = imid - 1;
        else
            imin = imid + 1;
    }
    return ScalarInteger(imax + 1);
")
with
> identical(f3(vec), f4(vec))
[1] TRUE
> microbenchmark(f3(vec), f3.c(vec), f4(vec))
Unit: nanoseconds
      expr   min      lq  median      uq   max neval
   f3(vec) 52096 53192.0 54918.5 55539.0 69491   100
 f3.c(vec) 10924 12233.5 12869.0 13410.0 20038   100
   f4(vec)   553   796.0   893.5  1004.5  2908   100
findInterval
findInterval came up when a similar question was asked on the R-help list. It is slow but safe, checking that vec
is sorted and dealing with NA values. If one wants to live on the edge (arguably no worse than implementing f3 or f4), then
f5.i <- function(v) .Internal(findInterval(v, 0 - .Machine$double.neg.eps, FALSE, FALSE))
is as fast as the C implementation, but more robust, and vectorized (i.e., it takes a vector of values in its second argument, which makes range-like calculations easy).
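If one would rather keep the safety checks, a sketch using the exported interface (the name f5 is just mine, for symmetry; note that the exported findInterval() takes the query values first and the sorted vector second, the reverse of the .Internal call):

f5 <- function(v) findInterval(0 - .Machine$double.neg.eps, v)  # counts elements < 0
f5(vec)   # same count as f3(vec), with sortedness and NA checks included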