ruby - Big task or multiple small tasks with Sidekiq
I'm writing a worker that adds lots of users to a group. I'm wondering whether it's better to run one big task with all the users, to batch them 100 at a time, or to run one task per user.
For the moment, here is my code:
class AddUsersToGroupWorker
  include Sidekiq::Worker
  sidekiq_options :queue => :group_utility

  def perform(store_id, group_id, user_ids_to_add)
    begin
      store = Store.find store_id
      group = Group.find group_id
    rescue ActiveRecord::RecordNotFound => e
      Airbrake.notify e
      return
    end

    users_to_process = store.users.where(id: user_ids_to_add)
                            .where.not(id: group.user_ids)
    group.users += users_to_process

    users_to_process.map(&:id).each do |user_to_process_id|
      UpdateLastUpdatesForUserWorker.perform_async store.id, user_to_process_id
    end
  end
end
Maybe it's better to have something like this in my perform method:
def add_users
  users_to_process = store.users.where(id: user_ids_to_add)
                          .where.not(id: group.user_ids)

  users_to_process.map(&:id).each do |user_to_process_id|
    AddUserToGroupWorker.perform_async group_id, user_to_process_id
    UpdateLastUpdatesForUserWorker.perform_async store.id, user_to_process_id
  end
end
But that would make many find requests. What do you think?
I have a Sidekiq Pro licence if needed (for Sidekiq Batches, for example).
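For instance, with Sidekiq Pro I could group the per-user jobs into a batch with a completion callback. A rough sketch of what I have in mind (AddUsersCallback is a made-up name):

batch = Sidekiq::Batch.new
batch.description = "Add users to group #{group_id}"
batch.on(:success, AddUsersCallback, 'group_id' => group_id)
batch.jobs do
  user_ids_to_add.each do |user_id|
    AddUserToGroupWorker.perform_async group_id, user_id
  end
end

class AddUsersCallback
  def on_success(status, options)
    # Called once, after every job in the batch has succeeded
    Rails.logger.info "All users added to group #{options['group_id']}"
  end
end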
Here are my thoughts.
1. Single SQL query instead of N queries
This line:
group.users += users_to_process
will produce N SQL queries (where N = users_to_process.count). I assume you have a many-to-many association between users and groups (with a user_groups join table/model), so you should use a mass data-insertion technique:
users_to_process_ids = store.users.where(id: user_ids_to_add)
                            .where.not(id: group.user_ids)
                            .pluck(:id)

sql_values = users_to_process_ids.map { |i| "(#{i.to_i}, #{group.id.to_i}, NOW(), NOW())" }

Group.connection.execute("
  INSERT INTO groups_users (user_id, group_id, created_at, updated_at)
  VALUES #{sql_values.join(',')}
")
Yes, it's raw SQL. And it's fast.
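If you'd rather avoid hand-built SQL, Rails 6+ offers insert_all, which produces the same single multi-row INSERT. A sketch, assuming the join table is backed by a GroupsUser model (an assumption; adjust the name to your schema):

now = Time.current
rows = users_to_process_ids.map do |user_id|
  { user_id: user_id, group_id: group.id, created_at: now, updated_at: now }
end
# insert_all issues one multi-row INSERT and skips validations/callbacks,
# which is why the timestamps are set by hand
GroupsUser.insert_all(rows) if rows.any?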
2. Use pluck(:id) instead of map(&:id)
pluck is quicker, because:
- it selects only the id column, so less data is transferred from the DB;
- more importantly, it won't create an ActiveRecord object for each row.

Doing SQL is cheap. Creating Ruby objects is expensive.
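To make the difference concrete, compare (illustrative):

# map(&:id) loads full rows and builds a User object per row...
user_ids = store.users.where(id: user_ids_to_add).map(&:id)

# ...while pluck(:id) runs SELECT "users"."id" FROM ... and returns plain integers
user_ids = store.users.where(id: user_ids_to_add).pluck(:id)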
3. Use horizontal parallelization instead of vertical parallelization
What I mean here is: if you need to run sequential tasks A -> B -> C over dozens of records, there are two major ways to split the work:

- Vertical segmentation: AWorker processes the A(1), A(2), A(3) jobs; BWorker processes B(1), etc.; CWorker processes all the C(i) jobs.
- Horizontal segmentation: a UniversalWorker processes A(1)+B(1)+C(1) as one job.

Use the latter (horizontal) way; see the sketch below.
This is a statement from experience, not a theoretical point of view (in theory both ways are feasible).
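Here's a minimal sketch of both layouts (the worker and helper names like do_a are illustrative, not from your code):

# Vertical: one worker per step, chained with perform_async
class AWorker
  include Sidekiq::Worker
  def perform(record_id)
    do_a(record_id)
    BWorker.perform_async(record_id) # hand-off point where flaky errors creep in
  end
end

# Horizontal: one worker runs the whole A -> B -> C chain for a record
class UniversalWorker
  include Sidekiq::Worker
  def perform(record_id)
    do_a(record_id)
    do_b(record_id)
    do_c(record_id)
  end
end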
Why should you do that?

- When you use vertical segmentation, you get errors when jobs are handed off from one worker down to the next. Errors of that kind will make you pull your hair out, because they aren't persistent and reproducible: sometimes they happen and sometimes they don't. Is it possible to write code that passes work down the chain without errors? Sure it is. But it's better to keep things simple.
- Imagine your server at rest, when suddenly new jobs arrive. Your B and C workers will just waste RAM while the A workers do their job; then the A and C workers will waste RAM while the B workers are at work, and so on. With horizontal segmentation, the resource drain evens out.
Applying this advice to your specific case: for starters, don't call perform_async from inside another async task.
4. Process in batches
Answering your original question: yes, process in batches. Creating and managing an async task takes some resources by itself, so there's no need to create too many of them.
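On the same note, when you do need to enqueue many jobs at once, Sidekiq's Sidekiq::Client.push_bulk pushes them in a single Redis round trip instead of one per job. A sketch:

# One round trip to Redis instead of one call per user
Sidekiq::Client.push_bulk(
  'class' => UpdateLastUpdatesForUserWorker,
  'args'  => user_ids_to_add.map { |user_id| [store_id, user_id] }
)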
TL;DR: in the end, your code could look something like this:
# Model code
BATCH_SIZE = 100

def add_users
  users_to_process_ids = store.users.where(id: user_ids_to_add)
                              .where.not(id: group.user_ids)
                              .pluck(:id)

  # Even with 100,000 users the performance of this query should be
  # acceptable enough to run it in a synchronous fashion
  sql_values = users_to_process_ids.map { |i| "(#{i.to_i}, #{group.id.to_i}, NOW(), NOW())" }
  Group.connection.execute("
    INSERT INTO groups_users (user_id, group_id, created_at, updated_at)
    VALUES #{sql_values.join(',')}
  ")

  users_to_process_ids.each_slice(BATCH_SIZE) do |batch|
    AddUserToGroupWorker.perform_async store.id, group_id, batch
  end
end

# add_user_to_group_worker.rb
def perform(store_id, group_id, user_ids_to_add)
  group = Group.find group_id

  # Do the heavy lifting for the batch as a whole here
  # ...
  # If nothing is left to do here, call UpdateLastUpdatesForUserWorker
  # from the model instead
  user_ids_to_add.each do |id|
    # Run it synchronously - the job was already parallelized
    # by splitting it into slices in the model above
    UpdateLastUpdatesForUserWorker.new.perform store_id, id
  end
end