Conversation

@bertlebee
Contributor Author

adjustRead was returning the sum of all read locks in the map, but both places it's used document it as returning the number of locks held by the current fiber. Replaced summing an iterable of the map's values with a single map lookup.
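A rough sketch of the shape of that change; the class, field, and method names here are illustrative stand-ins for the lock's internal read-lock state, not the actual TReentrantLock code:

```scala
import zio.Fiber

// Illustrative only: `readLocks` stands in for the internal map of
// fiber IDs to the number of read locks each fiber holds.
final case class ReadLockState(readLocks: Map[Fiber.Id, Int]) {

  // Before: summed every fiber's count, i.e. the total number of read locks.
  def totalReadLocks: Int = readLocks.values.sum

  // After: a single lookup returning only the current fiber's count,
  // which is what the callers actually need.
  def readLocksHeldBy(fiberId: Fiber.Id): Int =
    readLocks.getOrElse(fiberId, 0)
}
```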

@bertlebee bertlebee force-pushed the improve-treentrantlock-performance branch from e613f55 to cf5fe94 on March 14, 2020 at 20:43
@bertlebee
Contributor Author

@mijicd I've taken a reasonably detailed look at the code (just within TReentrantLock, haven't looked at TRef at all). The adjustRead changes mentioned above seem to have improved performance a bit, but not as much as I originally thought. I'm not really trusting the benchmarks: I was getting some inconsistent results, so I decided to up the warm-up and measurement runs to 20 each and ran it 3 times in a row overnight with nothing else open. As you can see below, the fluctuation in some of the tests is significantly more than the reported error. Any ideas what may be causing this? I have a theory that sometimes contention spirals out of control and may in rare cases significantly slow down throughput (potentially for the rest of the run), but have no idea how I'd test that.
[screenshot: benchmark results for the three overnight runs]

Apart from the 'fix' I've done with adjustRead, the only other idea I've come up with is to return Unit from all the acquire/release functions. This seems to improve performance; the question is, do we need the number of read/write locks when we acquire/release them? I imagine the usual way of using the TReentrantLock is to use writeLock and readLock and not actually care how many read/write locks there are, only that you have one. If you do care for some reason, you could still get the value fairly easily.

*/
def writeLocks: STM[Nothing, Int] = data.get.map(_.fold(_ => 0, _.writeLocks))

/**
Member

Maybe not put the scaladoc on private method.

@mijicd
Member

mijicd commented Mar 17, 2020

My intuition is always to run 15+ iterations for both phases; it can help reduce the noise. Regarding the signature change, I don't think it's unreasonable to use Unit (although a penalty will be paid in obtaining the count).

@bertlebee
Contributor Author

@jdegoes what are your thoughts on swapping the return type to Unit? It seems to improve performance in most (but not all) cases (this is showing totals of 3 runs of 20 iterations each, run overnight with nothing else open).
[screenshot: benchmark totals across the 3 runs]
@mijicd is working on some other magic at a deeper level which will hopefully help a lot more.

@mijicd
Member

mijicd commented Mar 18, 2020

@jdegoes The magic @unclebob418 is referring to is the trick with locks we spoke about. My hunch here is that collect is being dreadful due to retries, but I have no proof of that. Perhaps benchmarking it separately might confirm that (or not)?

@jdegoes
Member

jdegoes commented Mar 18, 2020

I am fine with swapping return types to Any to avoid the overhead of flatMapping to Unit.

@mijicd Yes, we should create a ticket for dealing with "retry storms".

Honestly, optimizing this lock is going to come down to two things:

  1. Minimizing allocations in the happy path, which will require careful inspection and translation from one form to another.
  2. Core STM performance work, which has degraded a bit in recent times due to necessary features like trampolining.
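A minimal sketch of the return-type change under discussion; the trait and method names are illustrative only, not the actual TReentrantLock API:

```scala
import zio.stm.STM

// Illustrative shapes only, not the real TReentrantLock code.
trait AcquireShapes {
  // Current shape: every acquire computes and returns the updated lock count.
  def acquireWrite: STM[Nothing, Int]

  // Proposed shape: returning Any (or Unit) lets callers that only need
  // "I now hold the lock" skip computing the count, and avoids tacking a
  // discarding step onto every transaction just to erase the Int.
  def acquireWriteUncounted: STM[Nothing, Any]
}
```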

@mijicd
Member

mijicd commented Mar 18, 2020

I'm already on retry storming. I'd say point 1 should be handled here, together with changing the signatures as you suggested.

@CLAassistant

CLAassistant commented Mar 20, 2020

CLA assistant check
All committers have signed the CLA.

@bertlebee bertlebee force-pushed the improve-treentrantlock-performance branch 4 times, most recently from 309d7c7 to 02e48d6 on April 21, 2020 at 02:28
@bertlebee
Contributor Author

@mijicd @jdegoes I've done some more work on this using the unsafe get and set that @mijicd recently added. It gives a nice performance boost over and above swapping to return Unit (and I could do it without flatMapping to Unit). The unsafe changes also provide cheaper access to the number of locks. I guess the question is how much performance we are willing to sacrifice for a nicer (and more performant) experience when a user cares about the number of locks.

Two alternatives:

  1. Provide both options (which would require code duplication).
  2. A middle ground where we provide the number of write locks but not read locks (the WriteLock count is available for free; it's the ReadLock count that costs). I don't like the inconsistency here.

Benchmark results:
[screenshot: benchmark results]

@bertlebee
Contributor Author

@KamalKang any thoughts on the above? Do you use the lock counts?

@jdegoes
Member

jdegoes commented Apr 24, 2020

@unclebob418 How do these compare from baseline, pre-optimization?

@bertlebee
Contributor Author

@jdegoes that's the blue bars; I probably should have called it baseline instead of benchmark, but I didn't :) I reran that when I rebased. It's a significant percentage, but still nothing compared to the Java reentrant lock (which I also added a benchmark for) or stamped lock.

@jdegoes
Member

jdegoes commented Apr 25, 2020

@unclebob418 Are you open to improving the benchmark in this pull request?

There are some problems with it:

  1. First, writeLock and readLock are used directly, which utilize ZManaged. That gives you safety, ensuring you release what you acquire, but it also adds overhead, and Java's locks don't have this feature, so it's not a fair comparison. So instead of using functions that use ZManaged, we should acquire and release manually.
  2. Second, for purely functional code, it does not make sense to benchmark one operation in isolation and have JMH re-run it a lot. That's because for JMH to execute a purely functional test, it must call unsafeRun. This function creates a fiber (with a lot of resources) and uses a thread, so you end up measuring the overhead of fibers, and not the overhead of the operation itself.

To solve these problems: both Java and ZIO implementations should read/write lock manually and explicitly, in sequence; and these should be repeated a lot inside the benchmark methods (e.g. 10,000 times). This is separate from JMH iterations.

For the STM one, you can stay inside the STM monad, and just commit the repetition.
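A rough sketch of what that amortised shape might look like for the write lock; this is illustrative only (not the benchmark in this PR), and details such as using `Runtime.default` and building the repetition with a fold are assumptions:

```scala
import org.openjdk.jmh.annotations.{ Benchmark, Scope, State }
import zio.Runtime
import zio.stm.{ STM, TReentrantLock }

@State(Scope.Benchmark)
class AmortizedLockBenchmark {
  private val runtime = Runtime.default
  private val lock    = runtime.unsafeRun(TReentrantLock.make.commit)

  // Acquire and release explicitly (no ZManaged), and repeat the pair many
  // times inside a single STM program, so one unsafeRun/commit is amortised
  // over all 10,000 lock cycles instead of paying fiber start-up per cycle.
  @Benchmark
  def zioWriteLockCycles(): Unit = {
    val once: STM[Nothing, Unit] =
      (lock.acquireWrite *> lock.releaseWrite).unit
    val repeated =
      (1 to 10000).foldLeft[STM[Nothing, Unit]](STM.succeed(()))((acc, _) => acc *> once)
    runtime.unsafeRun(repeated.commit)
  }
}
```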

Do these changes make sense?

@bertlebee
Contributor Author

@jdegoes I should be able to manage that. Once I've done that, I'll compare Unit/Int return types with the new benchmark and we can make a call on whether we're willing to sacrifice a bit of performance for better ergonomics. I don't really want to rewrite all the tests unless we're committed to that.

@bertlebee
Contributor Author

My gut feel is that with fiber overhead reduced in the benchmarks, the disparity will be greater, but it's better to measure these things!

@mijicd
Member

mijicd commented Apr 25, 2020

@unclebob418 you can check out the TMap benchmarks, where we added the amortisation @jdegoes mentioned for all single-element operations.

def reentrantLockRead(): Unit =
  for (_ <- calls) {
    reentrantLock.lock()
    doWork()
    reentrantLock.unlock()
  }
Member

I would delete doWork and doWorkM; we don't want to measure the cost of doing any work, even a function call. We want to measure only the lock/unlock overhead.

Contributor Author

Won't this have the effect of significantly reducing contention? The locks will be held for such a short time that they'll rarely conflict.

Member

OK, you're right about that. Let's do ZIO.yieldTo for doWorkM.

For doWork, we are going to have to do a sleep, but the question is how long? If we want it to be fair, it would be the same "size" as doWorkM (which involves submitting work to a thread pool, and having a thread pick up and execute that work).

Contributor Author

I did fiddle with this a bit; the main issue is that even the smallest sleep (1 ms) kills so many iterations that we'd probably need to run the benchmark longer to get an accurate measure. I think we probably need to do something like actually update a data structure. Something like the ZIO "compute pi" exercise: both the readers and writers have to do something to create some level of contention (updating the state, or computing in vs out for the readers), but that will still take less than 1 ms, and it's a somewhat legitimate use case; the whole reason for having locks is to prevent concurrent updates in this sort of situation (I'm always suspicious of benchmarks that don't do anything somewhat useful in the real world).
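One purely illustrative sketch of that kind of workload, in the spirit of the "compute pi" exercise; this is not the benchmark code in this PR (see #3469 for the follow-up), and all names here are hypothetical:

```scala
import java.util.concurrent.ThreadLocalRandom
import java.util.concurrent.locks.ReentrantReadWriteLock

// Writers add a random sample under the write lock; readers compute the
// current pi estimate under the read lock. Both sides do a small amount of
// real work while holding the lock, so the benchmark keeps some contention.
final class PiEstimate {
  private val rwLock = new ReentrantReadWriteLock()
  private var inside = 0L
  private var total  = 0L

  def addSample(): Unit = {
    rwLock.writeLock().lock()
    try {
      val x = ThreadLocalRandom.current().nextDouble()
      val y = ThreadLocalRandom.current().nextDouble()
      if (x * x + y * y <= 1.0) inside += 1
      total += 1
    } finally rwLock.writeLock().unlock()
  }

  def estimate(): Double = {
    rwLock.readLock().lock()
    try if (total == 0L) 0.0 else 4.0 * inside.toDouble / total
    finally rwLock.readLock().unlock()
  }
}
```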

@bertlebee
Contributor Author

I'm not going to have time to work on this for the rest of the week, so how about we merge these changes (still a significant performance improvement) and revisit the benchmarks when I have time?

Created #3469 for benchmark improvements.

@bertlebee bertlebee force-pushed the improve-treentrantlock-performance branch from 3e9778b to 0447dc0 on April 26, 2020 at 23:48
@bertlebee bertlebee changed the title from "WIP Improve performance of TReentrantLock" to "Improve performance of TReentrantLock" on Apr 27, 2020
@bertlebee bertlebee force-pushed the improve-treentrantlock-performance branch 3 times, most recently from 2c69f82 to 475d1ee on May 2, 2020 at 00:00
@bertlebee bertlebee force-pushed the improve-treentrantlock-performance branch from 475d1ee to 42d6d04 on May 2, 2020 at 23:09
@bertlebee bertlebee requested a review from mijicd May 3, 2020 02:59
@mijicd
Member

mijicd commented May 3, 2020

Can you post the results before and after the change, just for reference? Besides that, the changes you made look good, and I agree with the idea to tweak the benchmark in #3469.

@bertlebee
Contributor Author

bertlebee commented May 3, 2020

Sure, here's before and after @mijicd
[chart: before/after throughput comparison]

and in text

[info] Benchmark                                                            Mode  Cnt        Score       Error  Units      Run        Change
[info] TReentrantLockBenchmark.ZioLockBasic                                thrpt   20   522139.251   11491.685  ops/s        1        before
[info] TReentrantLockBenchmark.ZioLockBasic:zioLockReadGroup               thrpt   20   230120.090    3995.350  ops/s        1        before
[info] TReentrantLockBenchmark.ZioLockBasic:zioLockWriteGroup              thrpt   20   292019.160    8303.931  ops/s        1        before
[info] TReentrantLockBenchmark.ZioLockHighContention                       thrpt   20   629125.356   22723.223  ops/s        1        before
[info] TReentrantLockBenchmark.ZioLockHighContention:zioLockReadGroup3     thrpt   20   304528.013   11244.138  ops/s        1        before
[info] TReentrantLockBenchmark.ZioLockHighContention:zioLockWriteGroup3    thrpt   20   324597.343   11495.718  ops/s        1        before
[info] TReentrantLockBenchmark.ZioLockLowContention                        thrpt   20   600956.357   25310.179  ops/s        1        before
[info] TReentrantLockBenchmark.ZioLockLowContention:zioLockReadGroup1      thrpt   20   475434.587   18639.291  ops/s        1        before
[info] TReentrantLockBenchmark.ZioLockLowContention:zioLockWriteGroup1     thrpt   20   125521.770    6680.379  ops/s        1        before
[info] TReentrantLockBenchmark.ZioLockMediumContention                     thrpt   20   628972.304   10211.665  ops/s        1        before
[info] TReentrantLockBenchmark.ZioLockMediumContention:zioLockReadGroup2   thrpt   20   409846.666    7833.818  ops/s        1        before
[info] TReentrantLockBenchmark.ZioLockMediumContention:zioLockWriteGroup2  thrpt   20   219125.638    2442.258  ops/s        1        before
[info] TReentrantLockBenchmark.ZioLockBasic                                thrpt   20   599344.626   25511.630  ops/s        2         after
[info] TReentrantLockBenchmark.ZioLockBasic:zioLockReadGroup               thrpt   20   299051.943   13191.109  ops/s        2         after
[info] TReentrantLockBenchmark.ZioLockBasic:zioLockWriteGroup              thrpt   20   300292.684   12327.459  ops/s        2         after
[info] TReentrantLockBenchmark.ZioLockHighContention                       thrpt   20   715508.996    9512.984  ops/s        2         after
[info] TReentrantLockBenchmark.ZioLockHighContention:zioLockReadGroup3     thrpt   20   360064.931    4667.825  ops/s        2         after
[info] TReentrantLockBenchmark.ZioLockHighContention:zioLockWriteGroup3    thrpt   20   355444.065    5042.625  ops/s        2         after
[info] TReentrantLockBenchmark.ZioLockLowContention                        thrpt   20   726110.785   20294.538  ops/s        2         after
[info] TReentrantLockBenchmark.ZioLockLowContention:zioLockReadGroup1      thrpt   20   581371.937   15294.006  ops/s        2         after
[info] TReentrantLockBenchmark.ZioLockLowContention:zioLockWriteGroup1     thrpt   20   144738.848    5064.187  ops/s        2         after
[info] TReentrantLockBenchmark.ZioLockMediumContention                     thrpt   20   697996.173   15483.251  ops/s        2         after
[info] TReentrantLockBenchmark.ZioLockMediumContention:zioLockReadGroup2   thrpt   20   468654.310    9765.954  ops/s        2         after
[info] TReentrantLockBenchmark.ZioLockMediumContention:zioLockWriteGroup2  thrpt   20   229341.862    5751.922  ops/s        2         after

@bertlebee
Contributor Author

@jdegoes @mijicd can we please merge this?

@mijicd
Member

mijicd commented May 5, 2020

@unclebob418 yes :)

@mijicd mijicd merged commit b999532 into zio:master May 5, 2020
@bertlebee bertlebee deleted the improve-treentrantlock-performance branch September 21, 2020 23:24