
Create database indices for unread items and deleted urls #2057

Open

noctux wants to merge 1 commit into newsboat:master from noctux:additional-sqlite-indices

Conversation

@noctux (Contributor) commented May 6, 2022

As reported by user Evil_Bob on IRC, we have several SQL queries on our
critical startup path that query the rss_items table on 1) unread
items and 2) a combination of feedurl and the deleted flag.
He suggests that this can be sped up significantly by introducing two
additional indices, which this commit adds.
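
For reference, a minimal sketch of what two such indices look like in SQLite, as run in the sqlite3 shell; the index names here are made up, and the actual table/column names in the commit may differ from the rss_items/unread/feedurl/deleted naming used above:

  CREATE INDEX IF NOT EXISTS idx_rss_items_unread ON rss_items(unread);
  CREATE INDEX IF NOT EXISTS idx_rss_items_feedurl_deleted ON rss_items(feedurl, deleted);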

Some initial (not very scientific) measurements on my personal database
(for i in {0..9}; do time newsboat -x print-unread; done):

Without indices:

newsboat -x print-unread  5,46s user 5,80s system 95% cpu 11,830 total
newsboat -x print-unread  5,44s user 5,83s system 95% cpu 11,827 total
newsboat -x print-unread  5,37s user 5,76s system 95% cpu 11,691 total
newsboat -x print-unread  5,33s user 5,84s system 95% cpu 11,748 total
newsboat -x print-unread  5,54s user 5,77s system 95% cpu 11,889 total
newsboat -x print-unread  5,36s user 5,86s system 94% cpu 11,931 total
newsboat -x print-unread  5,32s user 6,02s system 95% cpu 11,912 total
newsboat -x print-unread  5,22s user 5,95s system 95% cpu 11,735 total
newsboat -x print-unread  5,48s user 5,82s system 95% cpu 11,854 total
newsboat -x print-unread  5,34s user 5,85s system 95% cpu 11,756 total

With indices:

newsboat -x print-unread  1,27s user 0,34s system 74% cpu 2,179 total
newsboat -x print-unread  1,22s user 0,35s system 74% cpu 2,113 total
newsboat -x print-unread  1,21s user 0,34s system 73% cpu 2,099 total
newsboat -x print-unread  1,22s user 0,34s system 73% cpu 2,120 total
newsboat -x print-unread  1,24s user 0,35s system 72% cpu 2,201 total
newsboat -x print-unread  1,25s user 0,31s system 73% cpu 2,125 total
newsboat -x print-unread  1,19s user 0,38s system 74% cpu 2,124 total
newsboat -x print-unread  1,23s user 0,32s system 73% cpu 2,115 total
newsboat -x print-unread  1,21s user 0,33s system 73% cpu 2,095 total
newsboat -x print-unread  1,19s user 0,36s system 73% cpu 2,099 total

Of course, on the other hand, the indices consume a bit more space in the
.db file:

Before: ./cache.db: 911M
After: ./cache.db: 924M

Still, this seems like an adequate trade-off to make.

Please note: the release number embedded in the SQL migration has to be bumped for the next release. I don't know what the release-engineering process is at this point :)

@coveralls

Coverage Status

Coverage increased (+0.009%) to 59.291% when pulling 3cc6bbc on noctux:additional-sqlite-indices into 78506f2 on newsboat:master.

@Minoru (Member) commented May 9, 2022

Weird, I can't reproduce the speedup. For me, this PR is consistently 1% faster than the current master, but that's it. This might be due to some settings I have in my config, or maybe I should experiment on a real disk instead of tmpfs. I'll try different configurations later and report back.

What I can reproduce is the size increase :) 4.4% for my database. It's okay as long as we get a comparable speed increase, I think.

@Minoru (Member) commented May 9, 2022

Ideas from IRC discussion with noctux and Evil_Bob:

  • try this on a real disk, not tmpfs
  • try applying the patch to the latest release, not master
    • noctux's measurements above weren't even done with this PR; the indices were added manually. I wonder: could our DB versioning be the culprit? (A sketch of how to check which indices a database actually contains follows this list.)
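
A quick way to see which indices a database already has (a hedged sketch for the sqlite3 shell; the table name follows the description above and may differ in the real schema):

  SELECT name, sql FROM sqlite_master WHERE type = 'index' AND tbl_name = 'rss_items';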

@Minoru (Member) commented May 11, 2022

Nope, can't reproduce. Things I tried:

  • running the tests on HDD rather than tmpfs (with plenty of free RAM for the page cache, and one warm-up run before each 10-run test)
  • cherry-picking this PR onto r2.27 tag (i.e. applying it to the latest release rather than master)
  • adding the indices manually (actually, I ran the same binary on two different databases, one "stock", the other one already converted by an earlier invocation of the code from this PR)
  • running with default settings (--config-file=/dev/null)

Each test run shows that this PR is 1% faster than the baseline, which is slightly comforting :)

I'm starting to wonder if my data is somehow non-representative. @noctux, does your cache.db contain anything private? :) Would it be possible to share it with me (privately, under a promise that I won't pass it on and will destroy my copy once the mystery here is solved)?

Also, if you have a minute, can you please re-run your tests with --config-file=/dev/null to make sure that your config doesn't affect this? (It's not very likely that both you and Evil_Bob have a setting that I don't, but hey, it's easy enough to check.)

@Minoru (Member) commented May 28, 2022

Notes from a couple of discussions we had with Evil_Bob and noctux on IRC:

  • I made a subset of my cache.db, and neither Evil_Bob nor noctux sees the expected speedup on it either! So my data is to blame for my (lack of) results above
  • Evil_Bob dumped my subset cache.db, re-created it as a fresh database, and still couldn't observe the promised speedup: the run went from 17 seconds on my file to 14 on his, but there was no difference between the indexed and non-indexed cases
  • my subset cache.db has a page size of 1024 bytes, whereas newly created databases have 4096 (I found that out using .dbinfo in the sqlite3 shell). If I take my subset cache.db, open the sqlite3 shell, and type pragma page_size=4096; vacuum; (the full sequence is sketched after this list), then echo qy | newsboat goes from 20 seconds to 16.8 (on tmpfs). Still no difference between the indexed and non-indexed cases though. The database with 4K pages is 0.9% smaller
  • Evil_Bob wrote a script that generates a bunch of feeds and a urls file that includes them. With this setup, I can reproduce the speedup promised by this PR: 15s went down to 3.6s (12s to 3.18s for Evil_Bob)
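
The page-size change mentioned above, spelled out as a minimal sketch for the sqlite3 shell (back up cache.db first; VACUUM rewrites the entire file):

  PRAGMA page_size;          -- show the current page size (1024 for the subset database)
  PRAGMA page_size = 4096;   -- request the new page size
  VACUUM;                    -- rebuild the file so the new page size takes effect
  PRAGMA page_size;          -- should now report 4096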

I want to get to the bottom of this, because I feel uneasy merging something that doesn't quite behave the way we expect it to. OTOH we haven't seen a case where this PR is slower than current master, so maybe I'll just cave in and merge it even if I don't understand the behaviour.

I think the next step is to look at what SQL queries should use these indices, and compare the output of SQLite's ANALYZE between my subset DB and the one generated by the script.
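
One way to check whether a given query can actually use the new indices (not necessarily the comparison meant above; the query and index name below are made up for illustration) is EXPLAIN QUERY PLAN in the sqlite3 shell:

  EXPLAIN QUERY PLAN SELECT count(*) FROM rss_items WHERE unread = 1;
  -- with a usable index, the plan reads something like:
  --   SEARCH rss_items USING COVERING INDEX idx_rss_items_unread (unread=?)
  -- without it, it falls back to a full table scan:
  --   SCAN rss_items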
