java - Why the 20x ratio Thread sweet spot for IO? [formerly: Which ExecutionContext to use in playframework?]
I know how to create my own ExecutionContext or import the Play Framework global one, but I must admit I am far from being an expert on how multiple contexts/ExecutorServices work behind the scenes.
So my question is: for better performance/behaviour of my service, which ExecutionContext should I use?
I tested two options:
import play.api.libs.concurrent.Execution.defaultContext
and
implicit val executionContext = ExecutionContext.fromExecutorService(Executors.newFixedThreadPool(Runtime.getRuntime().availableProcessors()))
with both resulting in comparable performance.
The action I use is implemented as follows in Play Framework 2.1.x. SedisPool is my own object, with extra Future wrapping around a normal Sedis/Jedis client pool.
def testaction(application: String, platform: String) = Action {
  Async(
    SedisPool.withAsyncClient[Result] { client =>
      client.get(StringBuilder.newBuilder.append(application).append('-').append(platform).toString) match {
        case Some(x) => Ok(x)
        case None => Results.NoContent
      }
    })
}
Performance-wise, this behaves as well as, or slightly slower than, the exact same function in Node.js and Go. It is still slower than PyPy, but way faster than the same thing in Java (using blocking calls to Redis through Jedis in that case). We load tested with Gatling. We were doing a "competition" of techs for simple services on top of Redis, and the criterion was "with the same amount of effort from the coders". I also tested this using Fyrie (and apart from the fact that I do not like the API) it behaved almost the same as this Sedis implementation.
But that's beside my question. I just want to learn more about this part of Play Framework/Scala.
Is there an advised behaviour? Or could someone point me in a better direction? I am starting to use Scala now and am far from an expert, but I can walk myself through code answers.
Thanks for any help.
Update - More questions!
After tampering with the number of threads in the pool, I found out that: Runtime.getRuntime().availableProcessors() * 20
gives around a 15% to 20% performance boost to my service (measured in requests per second and average response time), which actually makes it better than Node.js and Go (barely, though). So I now have more questions:
- I tested 15x and 25x, and 20x seems to be a sweet spot. Why? Any ideas?
- Are there other settings that might be better? Other "sweet spots"?
- Is 20x the sweet spot, or is it dependent on other parameters of the machine/JVM I am running on?
Update - More docs on the subject
I found more information in the Play Framework docs: http://www.playframework.com/documentation/2.1.0/threadpools
For IO, they advise something close to what I've done, but give a way to do it through Akka.dispatchers, which are configurable through *.conf files (this should make my ops happy).
So now I am using
implicit val redis_lookup_context: ExecutionContext = Akka.system.dispatchers.lookup("simple-redis-lookup")
with the dispatcher configured by
akka {
  event-handlers = ["akka.event.slf4j.Slf4jEventHandler"]
  loglevel = WARNING
  actor {
    simple-redis-lookup = {
      fork-join-executor {
        parallelism-factor = 20.0
        #parallelism-min = 40
        #parallelism-max = 400
      }
    }
  }
}
It gave me around a 5% boost (eyeballing it for now), and more stable performance once the JVM was "hot". And my sysops are happy to play with those settings without rebuilding the service.
My questions are still there, though. Why these numbers?
The way I think about this optimization is to:
- Take a look at single-threaded performance, then
- See how things parallelise, then
- Rinse and repeat until I have the performance I need or I give up.
Single Threaded Optimization
The performance of a single thread will typically be gated on a single component or section of your code, and it might be:
- A CPU-bound section, which may in fact be bound on reading from RAM (this is not paging). The JVM and higher-level tools often cannot distinguish between CPU and RAM. A performance profiler (eg JProfiler) is really useful for locating the code hotspots.
  - You can improve performance here by optimizing the code to decrease CPU usage or RAM read/write rates
- A paging problem, where the application has run out of memory and is paging to or from disk
  - You can improve performance by adding RAM, reducing memory usage, allocating more physical RAM to the process or reducing the memory load on the OS
- A latency problem, where the thread is waiting to read from a socket, disk or similar, or waiting while data is committed to disk.
  - You can improve single-threaded performance by using faster disks (eg spinning rust -> SSD), using a faster network (1GE -> 10GE) or improving the responsiveness of the network app you are using (tune your DB)
However, latencies in a single thread are not so worrisome if you can run multiple threads. While one thread is blocked, another can use the CPU (for the overhead of swapping out the context and replacing most of the items in the CPU cache). So how many threads should you run?
Multi-threading
Let's assume the thread spends about 50% of its time on the CPU and 50% waiting for IO. In that case, each CPU can be fully utilized by 2 threads, and you see a 2x throughput improvement. If the thread spends about 1% of its time using the CPU, you should (all things being equal) be able to run 100 threads concurrently.
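The arithmetic in that paragraph can be written down as a small Java sketch. The class and method names are hypothetical (they are not from Play, Akka or the JDK), and the result is only the idealized figure: it ignores context-switch cost, locks and resource saturation.

```java
// Hypothetical helper for illustration; not part of any library.
final class ThreadSizing {
    /**
     * Idealized thread count: if each thread is on the CPU for only
     * cpuFraction of its time, one core can be shared by 1 / cpuFraction threads.
     */
    static int idealThreads(int cores, double cpuFraction) {
        if (cores < 1 || cpuFraction <= 0.0 || cpuFraction > 1.0) {
            throw new IllegalArgumentException("need cores >= 1 and 0 < cpuFraction <= 1");
        }
        return (int) Math.round(cores / cpuFraction);
    }
}
```

For example, `ThreadSizing.idealThreads(1, 0.5)` gives 2 and `ThreadSizing.idealThreads(1, 0.01)` gives 100, matching the two cases in the text. Treat the number as a starting point for measurement, not as a prediction.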
However, this is where a lot of weird effects can occur:
- Context switching has (some) cost, so ideally you need to minimize switches. You'll get greater overall system performance if your periods of latency are few and large rather than frequent and small. This effect means that increasing threads by Nx will never quite give an Nx throughput improvement. And after a critical point, if you increase N, performance will decrease.
- Synchronization, semaphores and mutexes. Small areas of code often acquire semaphores or mutexes to ensure that only one (or a limited number) of threads can enter at one time. While there are few threads, this seldom impacts performance. However, if this code block takes an appreciable time, and there are many threads, it can become the gating factor for system performance. For example, imagine you had a guarded, single-threaded block that takes 10ms to execute, for example querying a database. Because only one thread at a time can enter, the block can complete at most 1000ms/10ms, or 100, executions per second, however many threads you run. All other threads will end up behind each other in a queue on this block.
- Resources: as you increase parallelism, you are loading all manner of previously lightly loaded components. As these become more heavily loaded, other threads end up blocked waiting on data from them. So, ultimately, the parallelism ends up creating latency in all the threads on the computer. These components include:
  - RAM
  - Disk channels
  - Network
  - Network services (such as a DB). I can't tell you how many times I have optimized the Java to the point where the DB is limiting throughput.
If this happens, you need to either rethink your algorithm, change the server, network or network services, or decrease parallelism.
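To make the guarded-block example concrete, here is a minimal Java sketch of the ceiling such a section imposes. The names are hypothetical, and the second method adds an assumption of my own for illustration: that each request spends a known total time, of which a known part is inside the guard.

```java
// Hypothetical helper for illustration; not part of any library.
final class GuardedSectionLimit {
    /** Upper bound on executions per second of a section that admits `permits` threads at once. */
    static double maxPerSecond(double holdMillis, int permits) {
        return permits * 1000.0 / holdMillis;
    }

    /**
     * Rough thread count at which a single-entry guard saturates, assuming each
     * request takes requestMillis in total and holdMillis of that inside the guard.
     * Threads beyond this number mostly wait in the queue.
     */
    static int saturatingThreads(double requestMillis, double holdMillis) {
        return (int) Math.ceil(requestMillis / holdMillis);
    }
}
```

With the 10ms block from the text, `GuardedSectionLimit.maxPerSecond(10, 1)` is 100.0 executions per second. If a whole request took 50ms, `saturatingThreads(50, 10)` suggests that around 5 threads already keep the guard busy all the time.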
Factors that affect how many threads can run
From the above, you can see there are a metric ton of factors involved. As a result, the sweet spot of threads/core is an accident of multiple causes, including:
- The performance of the CPU you use, especially:
  - Number of cores
  - SMT or not SMT
  - Amount of cache
  - Speed
- How much RAM you have and the speed of the memory bus
- The operating system and environment:
  - How much other work is being executed on the processors
  - Windows/Linux/BSD/etc all have different multitasking characteristics
  - The JVM version (each version has different characteristics, some more different than others)
- Traffic and congestion on the network, and the effect on the switches and routers involved
- Your code
  - Your algorithm
  - The libraries you use
From experience, there is no magic formula to compute a priori the best number of threads. This problem is best tackled empirically (as I show above), as you have done. If you need to generalize, you will need a sampling of performance on different CPU architectures, memory and networks on your operating system of choice.
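An empirical sweep of the kind described can be sketched in Java as below. This is only a skeleton under loud assumptions: `PoolSweep` is a hypothetical name, and `Thread.sleep` stands in for the blocking Redis call. Because a sleeping task uses almost no CPU, this stand-in keeps getting faster as threads are added and will not reproduce a real sweet spot; to find one, replace the sleep with the actual service call and drive it with realistic load.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

// Hypothetical measurement harness for illustration only.
final class PoolSweep {
    /** Runs `tasks` simulated blocking calls on a fixed pool and returns tasks per second. */
    static double throughput(int threads, int tasks, long ioMillis) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Callable<Void>> work = new ArrayList<>();
            for (int i = 0; i < tasks; i++) {
                work.add(() -> {
                    Thread.sleep(ioMillis); // stand-in for a blocking I/O call
                    return null;
                });
            }
            long start = System.nanoTime();
            pool.invokeAll(work); // blocks until every task has finished
            long elapsedNanos = System.nanoTime() - start;
            return tasks / (elapsedNanos / 1e9);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while measuring", e);
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        int cores = Runtime.getRuntime().availableProcessors();
        // Try the multipliers of interest and compare the measured rates.
        for (int factor : new int[] {1, 5, 10, 15, 20, 25}) {
            double rate = throughput(cores * factor, 2000, 5);
            System.out.printf("%dx -> %.0f tasks/s%n", factor, rate);
        }
    }
}
```

Run each setting several times once the JVM is warm, and compare both throughput and response time, since a single cold run can easily be off by more than the differences being measured.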
Several easily observed metrics are useful here:
- CPU utilization per core - to detect whether the process is CPU bound or not
- Load average - this reports how many processes (or threads, if using LWP) are waiting for the CPU. If this creeps up to a figure larger than the number of CPU cores, the CPU cores are definitely CPU bound.
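The load-average check can also be done from inside the JVM. The sketch below uses the standard `java.lang.management.OperatingSystemMXBean`; the class name `LoadCheck` and the simple "load greater than cores" rule are my own illustration of the rule of thumb above, not an established threshold.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.OperatingSystemMXBean;

// Hypothetical helper for illustration only.
final class LoadCheck {
    /** Rule of thumb from the text: more runnable work than cores suggests the CPU is the bottleneck. */
    static boolean looksCpuBound(double loadAverage, int cores) {
        return loadAverage > cores;
    }

    public static void main(String[] args) {
        OperatingSystemMXBean os = ManagementFactory.getOperatingSystemMXBean();
        double load = os.getSystemLoadAverage(); // last-minute average; negative if unavailable
        int cores = os.getAvailableProcessors();
        if (load < 0) {
            System.out.println("Load average is not available on this platform");
        } else {
            System.out.printf("load=%.2f cores=%d cpuBound=%b%n", load, cores, looksCpuBound(load, cores));
        }
    }
}
```

Per-core CPU utilization is not exposed by this bean, so for that metric an operating system tool is still needed.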
If you need to optimize, get the best profiling tools you can. You will need a specific tool for monitoring the operating system (eg DTrace for Solaris), and one for the JVM (I personally love JProfiler). These tools will allow you to zoom in on precisely the areas I describe above.
Conclusions
It just so happens that your particular code, on your particular Scala library version, JVM version, OS, server and Redis server, runs so that each thread is waiting for I/O about 95% of the time. (If running single threaded, you'd find the CPU load at about 5%.)
This allows about 20 threads to share each CPU optimally in this configuration.
This is the sweet spot because:
- If you have fewer threads actively running, you are wasting CPU cycles waiting for data
- If you run more threads, then either:
  - one component of your architecture saturates (eg the disk or the CPU<->RAM bus), blocking additional throughput (in this case you'd see CPU utilization lower or much lower than ~90%), or
  - the thread context switch cost starts to exceed the incremental gain of adding threads (and you'd see CPU utilization hit > ~95%)