Performance comparison of Ruby's Array and Set with strings and symbols
20 Mar 2017Yesterday I was refactoring a piece of code checking if we can handle an uploaded document. Users can upload different documents which sometimes aren't well described - some miss an extension, some have an extension that doesn't match the content (pdf files which are not pdfs, etc.). For some documents we want to generate previews so it's really important to run this job on files that we know we can handle. So instead of relaying on the extension I added a simple content check using mimemagic.
So later I was extending a function that decides can we generate a preview for a file, that's a first version:
def needs_preview?
return true if %w{
application/vnd.ms-excel
application/vnd.openxmlformats-officedocument.spreadsheetml.sheet
application/vnd.ms-powerpoint
application/vnd.openxmlformats-officedocument.presentationml.presentation
application/pdf
application/msword
application/vnd.openxmlformats-officedocument.wordprocessingml.document
text/plain
application/rtf
image/jpeg
}.include? document_content_type
%w{.xls .xlsx .ppt .pptx .pdf .doc .docx .txt .rtf .jpg .jpeg}.include? extension
end
But as I was writing this I thought - hmm, why don't I be a good programmer and use a better structure for this - Set
. An array has this problem that you need to compare every element to check if it's in the array, it's O(n)
. Set is based on hashes so in theory can find an element in O(1)
(constant time).
So I rewrote the code and was happy about it. But later got really curious and decided to actually benchmark it.
creating a new
Warming up --------------------------------------
array 189.396k i/100ms
set 16.874k i/100ms
Calculating -------------------------------------
array 4.090M (± 7.2%) i/s - 20.455M in 5.027083s
set 172.724k (± 5.3%) i/s - 877.448k in 5.094293s
Comparison:
array: 4090134.6 i/s
set: 172724.4 i/s - 23.68x slower
include?
miss
Warming up --------------------------------------
array 173.707k i/100ms
array symbols 158.803k i/100ms
array symbols + to_sym
131.336k i/100ms
set 182.543k i/100ms
set symbols 199.911k i/100ms
set symbols + to_sym 170.006k i/100ms
Calculating -------------------------------------
array 3.818M (± 6.6%) i/s - 19.108M in 5.026220s
array symbols 2.891M (± 6.0%) i/s - 14.451M in 5.015273s
array symbols + to_sym
2.312M (± 6.0%) i/s - 11.558M in 5.016827s
set 4.726M (± 6.0%) i/s - 23.731M in 5.039017s
set symbols 5.874M (± 6.8%) i/s - 29.387M in 5.025485s
set symbols + to_sym 3.677M (± 7.1%) i/s - 18.361M in 5.018556s
Comparison:
set symbols: 5873938.4 i/s
set: 4726223.9 i/s - 1.24x slower
array: 3818309.1 i/s - 1.54x slower
set symbols + to_sym: 3676572.9 i/s - 1.60x slower
array symbols: 2891349.9 i/s - 2.03x slower
array symbols + to_sym: 2312024.8 i/s - 2.54x slower
hit
Warming up --------------------------------------
array 198.282k i/100ms
array symbols 188.585k i/100ms
array symbols + to_sym
155.503k i/100ms
set 187.467k i/100ms
set symbols 216.999k i/100ms
set symbols + to_sym 216.249k i/100ms
Calculating -------------------------------------
array 4.777M (± 6.2%) i/s - 23.794M in 4.999723s
array symbols 4.271M (± 6.3%) i/s - 21.310M in 5.008877s
array symbols + to_sym
3.043M (± 6.2%) i/s - 15.239M in 5.026281s
set 4.611M (± 6.6%) i/s - 23.058M in 5.022503s
set symbols 6.434M (± 8.2%) i/s - 32.116M in 5.024084s
set symbols + to_sym 6.318M (± 7.2%) i/s - 31.572M in 5.022345s
Comparison:
set symbols: 6434032.0 i/s
set symbols + to_sym: 6317710.7 i/s - same-ish: difference falls within error
array: 4776713.5 i/s - 1.35x slower
set: 4610506.9 i/s - 1.40x slower
array symbols: 4270964.5 i/s - 1.51x slower
array symbols + to_sym: 3043429.0 i/s - 2.11x slower
So that was interesting. First of all these are not scientific results, ran them on my MacBook Pro doing a lot of other things in the background but I repeated the benchmark and got similar results, that's from one of the runs (not a median of the runs, not going to bother with it).
What surpised me is the time I takes to create the set.
As expected set symbols
for miss scenario is the fastest, what's interesting that creating a symbol from string takes a lot of time and makes it even slower than browsing the array. This all depends on the size of the array of course. I ran this benchmark with a smaller array first and then the browsing through the array was faster than computing the hash.
What I don't understand is why there's such a difference between array
and array symbols
? Or why is set
that slower than set symbols
. I compared hash
cost and there is some difference that could influence the set:
hash
Warming up --------------------------------------
hash 227.244k i/100ms
hash symbol 238.749k i/100ms
Calculating -------------------------------------
hash 7.063M (± 4.7%) i/s - 35.450M in 5.030842s
hash symbol 8.551M (± 4.5%) i/s - 42.736M in 5.008114s
Comparison:
hash symbol: 8550770.0 i/s
hash: 7062788.4 i/s - 1.21x slower
For hit scenario the story is similar - not sure why it looks like it looks. So I learned that I don't understand how Ruby works :-)
But I know how it compares to Java
Benchmark Mode Cnt Score Error Units
ContainsHit.arrayContains thrpt 15 104613718,801 ± 1303381,183 ops/s
ContainsHit.setContains thrpt 15 192043168,043 ± 1822039,485 ops/s
ContainsMiss.arrayContains thrpt 15 47823394,222 ± 516549,078 ops/s
ContainsMiss.setContains thrpt 15 341901148,094 ± 3934969,687 ops/s
Create.createArray thrpt 15 51892174,423 ± 593297,573 ops/s
Create.createSet thrpt 15 4830737,889 ± 83773,408 ops/s
Now, creating a set is ~10 times slower.
Comparing to Ruby going through the whole array is ~12 times faster in Java (miss scenario). Now comparing sets: miss scenario is ~58 times faster in Java, and hit scenario is ~29 times faster in Java.
Does it matter to me? Not at all :-)
Was it fun? Yeah!
You want to try it out? There's the code available which you can run on your machine. You can also easily use it to compare other things.
Have comments? Share them with me @devonsteroids
PS
Looking at Ruby's source code array.c
you can find it uses an optimized rb_equal_opt
which is defined in vm_insnhelper.c
which finally uses obj_eq_func
- it has special handling for integers, floats, numbers and strings, but doesn't have any for symbols. Haven't debugged it yet but that could be the reason.
This article is Part 1 in a 2-Part Series.
- Part 1 - This Article
- Part 2 - Making Ruby's Array.include? faster for symbols