|
1059 | 1059 | Received: from renoir.op.net ( [email protected] [207.29.195.4]) |
1060 | 1060 | by candle.pha.pa.us (8.9.0/8.9.0) with ESMTP id TAA00672 |
1061 | 1061 | for < [email protected]>; Thu, 20 Jan 2000 19:45:30 -0500 (EST) |
1062 | | -Received: from hub.org (hub.org [216.126.84.1]) by renoir.op.net (o1/$Revision: 1. 15 $) with ESMTP id TAA01989 for < [email protected]>; Thu, 20 Jan 2000 19:39:15 -0500 (EST) |
| 1062 | +Received: from hub.org (hub.org [216.126.84.1]) by renoir.op.net (o1/$Revision: 1. 16 $) with ESMTP id TAA01989 for < [email protected]>; Thu, 20 Jan 2000 19:39:15 -0500 (EST) |
1063 | 1063 | Received: from localhost (majordom@localhost) |
1064 | 1064 | by hub.org (8.9.3/8.9.3) with SMTP id TAA00957; |
1065 | 1065 | Thu, 20 Jan 2000 19:35:19 -0500 (EST) |
@@ -1586,3 +1586,254 @@ support a couple gigs of RAM now.
1586 | 1586 | |
1587 | 1587 | ************ |
1588 | 1588 | |
| 1589 | +From [email protected] Mon Aug 21 11:47:56 2000 |
| 1590 | +Received: from hub.org ( [email protected] [216.126.84.1]) |
| 1591 | + by candle.pha.pa.us (8.9.0/8.9.0) with ESMTP id LAA07289 |
| 1592 | + for < [email protected]>; Mon, 21 Aug 2000 11:47:55 -0400 (EDT) |
| 1593 | +Received: from hub.org (majordom@localhost [127.0.0.1]) |
| 1594 | + by hub.org (8.10.1/8.10.1) with SMTP id e7LFlpT03383; |
| 1595 | + Mon, 21 Aug 2000 11:47:51 -0400 (EDT) |
| 1596 | +Received: from mail.fct.unl.pt (fct1.si.fct.unl.pt [193.136.120.1]) |
| 1597 | + by hub.org (8.10.1/8.10.1) with SMTP id e7LFlaT03243 |
| 1598 | + for < [email protected]>; Mon, 21 Aug 2000 11:47:37 -0400 (EDT) |
| 1599 | +Received: (qmail 7416 invoked by alias); 21 Aug 2000 15:54:33 -0000 |
| 1600 | +Received: (qmail 7410 invoked from network); 21 Aug 2000 15:54:32 -0000 |
| 1601 | +Received: from eros.si.fct.unl.pt (193.136.120.112) |
| 1602 | + by fct1.si.fct.unl.pt with SMTP; 21 Aug 2000 15:54:32 -0000 |
| 1603 | +Date: Mon, 21 Aug 2000 16:48:08 +0100 (WEST) |
| 1604 | +From: =?iso-8859-1?Q?Tiago_Ant=E3o?= < [email protected]> |
| 1605 | + |
| 1606 | + |
| 1607 | + |
| 1608 | +Subject: Re: [HACKERS] Optimisation deficiency: currval('seq')-->seq scan, |
| 1609 | + constant-->index scan |
| 1610 | + |
| 1611 | + |
| 1612 | +MIME-Version: 1.0 |
| 1613 | +Content-Type: TEXT/PLAIN; charset=US-ASCII |
| 1614 | + |
| 1615 | +Precedence: bulk |
| 1616 | + |
| 1617 | +Status: ORr |
| 1618 | + |
| 1619 | +On Mon, 21 Aug 2000, Tom Lane wrote: |
| 1620 | + |
| 1621 | +> > One thing it might be interesting (please tell me if you think |
| 1622 | +> > otherwise) would be to improve pg with better statistical information, by |
| 1623 | +> > using, for example, histograms. |
| 1624 | +> |
| 1625 | +> Yes, that's been on the todo list for a while. |
| 1626 | + |
| 1627 | + If it's ok and nobody is working on that, I'll look into that subject. |
| 1628 | + I'll start by looking at the analyze portion of vacuum. I'm thinking of |
| 1629 | +using arrays for the histogram (I've never used the array data type of |
| 1630 | +postgres). |
| 1631 | + Should I use 7.0.2 or the cvs version? |
| 1632 | + |
| 1633 | + |
| 1634 | +> Interesting article. We do most of what she talks about, but we don't |
| 1635 | +> have anything like the ClusterRatio statistic. We need it --- that was |
| 1636 | +> just being discussed a few days ago in another thread. Do you have any |
| 1637 | +> reference on exactly how DB2 defines that stat? |
| 1638 | + |
| 1639 | + |
| 1640 | + I don't remember seeing that information specifically. From what I've |
| 1641 | +read I can speculate: |
| 1642 | + |
| 1643 | + 1. They have clusterratios for both indexes and the relation itself. |
| 1644 | + 2. They might use an index even if there is no "order by" if the table |
| 1645 | +has a low clusterratio: just to get the RIDs, then sort the RIDs and |
| 1646 | +fetch. |
| 1647 | + 3. One possible way to calculate this ratio: |
| 1648 | + a) for tables |
| 1649 | + SeqScan |
| 1650 | + if tuple points to a next tuple on the same page then it's |
| 1651 | +"good" |
| 1652 | + ratio = # good tuples / # all tuples |
| 1653 | + b) for indexes (high speculation ratio here) |
| 1654 | + foreach pointed RID in index |
| 1655 | + if RID is on the same page as the next RID in the index then mark as |
| 1656 | +"good" |
| 1657 | + |
| 1658 | + I suspect that if a tuple size is big (relative to page size) then the |
| 1659 | +cluster ratio is always low. |
| 1660 | + |
| 1661 | + A tuple might also be "good" if it pointed to the next page. |
| 1662 | + |
| 1663 | +Tiago |
| 1664 | + |
| 1665 | + |
| 1666 | +From [email protected] Wed Aug 23 13:00:33 2000 |
| 1667 | +Received: from hub.org ( [email protected] [216.126.84.1]) |
| 1668 | + by candle.pha.pa.us (8.9.0/8.9.0) with ESMTP id NAA10259 |
| 1669 | + for < [email protected]>; Wed, 23 Aug 2000 13:00:33 -0400 (EDT) |
| 1670 | +Received: from hub.org (majordom@localhost [127.0.0.1]) |
| 1671 | + by hub.org (8.10.1/8.10.1) with SMTP id e7NGsPN83008; |
| 1672 | + Wed, 23 Aug 2000 12:54:25 -0400 (EDT) |
| 1673 | +Received: from mail.fct.unl.pt (fct1.si.fct.unl.pt [193.136.120.1]) |
| 1674 | + by hub.org (8.10.1/8.10.1) with SMTP id e7NGniN81749 |
| 1675 | + for < [email protected]>; Wed, 23 Aug 2000 12:49:44 -0400 (EDT) |
| 1676 | +Received: (qmail 9869 invoked by alias); 23 Aug 2000 15:10:04 -0000 |
| 1677 | +Received: (qmail 9860 invoked from network); 23 Aug 2000 15:10:04 -0000 |
| 1678 | +Received: from eros.si.fct.unl.pt (193.136.120.112) |
| 1679 | + by fct1.si.fct.unl.pt with SMTP; 23 Aug 2000 15:10:04 -0000 |
| 1680 | +Date: Wed, 23 Aug 2000 16:03:42 +0100 (WEST) |
| 1681 | +From: =?iso-8859-1?Q?Tiago_Ant=E3o?= < [email protected]> |
| 1682 | + |
| 1683 | + |
| 1684 | + |
| 1685 | +Subject: Re: [HACKERS] Optimisation deficiency: currval('seq')-->seq scan, |
| 1686 | + constant-->index scan |
| 1687 | + |
| 1688 | + |
| 1689 | +MIME-Version: 1.0 |
| 1690 | +Content-Type: TEXT/PLAIN; charset=US-ASCII |
| 1691 | + |
| 1692 | +Precedence: bulk |
| 1693 | + |
| 1694 | +Status: ORr |
| 1695 | + |
| 1696 | +Hi! |
| 1697 | + |
| 1698 | +On Wed, 23 Aug 2000, Tom Lane wrote: |
| 1699 | + |
| 1700 | +> Yes, we know about that one. We have stats about the most common value |
| 1701 | +> in a column, but no information about how the less-common values are |
| 1702 | +> distributed. We definitely need stats about several top values not just |
| 1703 | +> one, because this phenomenon of a badly skewed distribution is pretty |
| 1704 | +> common. |
| 1705 | + |
| 1706 | + |
| 1707 | + An end-biased histogram has stats on top values and also on the least |
| 1708 | +frequent values. So if there is a selection on a value that is well |
| 1709 | +below average, the selectivity estimation will be more accurate. In some |
| 1710 | +research papers I've read, this is reported to be a better approach |
| 1711 | +than equi-width histograms (which are said to be the "industry" standard). |
| 1712 | + |
| 1713 | + I'm not sure whether to use a table or an array attribute on pg_stat for |
| 1714 | +the histogram; the problem is what could be expected from the size of the |
| 1715 | +attribute (being a text). I'm very afraid of the cost of going through |
| 1716 | +several tuples on a table (pg_histogram?) during the optimization phase. |
| 1717 | + |
| 1718 | + One other idea would be to only have better statistics for special |
| 1719 | +attributes requested by the user... something like "analyze special |
| 1720 | +table(column)". |
| 1721 | + |
| 1722 | +Best Regards, |
| 1723 | +Tiago |
| 1724 | + |
| 1725 | + |
| 1726 | + |
| 1727 | +From [email protected] Thu Aug 24 00:21:39 2000 |
| 1728 | +Received: from hub.org ( [email protected] [216.126.84.1]) |
| 1729 | + by candle.pha.pa.us (8.9.0/8.9.0) with ESMTP id AAA27662 |
| 1730 | + for < [email protected]>; Thu, 24 Aug 2000 00:21:38 -0400 (EDT) |
| 1731 | +Received: from hub.org (majordom@localhost [127.0.0.1]) |
| 1732 | + by hub.org (8.10.1/8.10.1) with SMTP id e7O46w585951; |
| 1733 | + Thu, 24 Aug 2000 00:06:58 -0400 (EDT) |
| 1734 | +Received: from sss2.sss.pgh.pa.us (sss.pgh.pa.us [209.114.166.2]) |
| 1735 | + by hub.org (8.10.1/8.10.1) with ESMTP id e7O3uv583775 |
| 1736 | + for < [email protected]>; Wed, 23 Aug 2000 23:56:57 -0400 (EDT) |
| 1737 | +Received: from sss2.sss.pgh.pa.us (tgl@localhost [127.0.0.1]) |
| 1738 | + by sss2.sss.pgh.pa.us (8.9.3/8.9.3) with ESMTP id XAA20973; |
| 1739 | + Wed, 23 Aug 2000 23:56:35 -0400 (EDT) |
| 1740 | +To: =?iso-8859-1?Q?Tiago_Ant=E3o?= < [email protected]> |
| 1741 | + |
| 1742 | +Subject: Re: [HACKERS] Optimisation deficiency: currval('seq')-->seq scan, constant-->index scan |
| 1743 | + |
| 1744 | + |
| 1745 | +Comments: In-reply-to =?iso-8859-1?Q?Tiago_Ant=E3o?= < [email protected]> |
| 1746 | + message dated "Wed, 23 Aug 2000 16:03:42 +0100" |
| 1747 | +Date: Wed, 23 Aug 2000 23:56:35 -0400 |
| 1748 | + |
| 1749 | +From: Tom Lane < [email protected]> |
| 1750 | + |
| 1751 | +Precedence: bulk |
| 1752 | + |
| 1753 | +Status: OR |
| 1754 | + |
| 1755 | +=?iso-8859-1?Q?Tiago_Ant=E3o?= < [email protected]> writes: |
| 1756 | +> One other idea would be to only have better statistics for special |
| 1757 | +> attributes requested by the user... something like "analyze special |
| 1758 | +> table(column)". |
| 1759 | + |
| 1760 | +This might actually fall out "for free" from the cheapest way of |
| 1761 | +implementing the stats. We've talked before about scanning btree |
| 1762 | +indexes directly to obtain data values in sorted order, which makes |
| 1763 | +it very easy to find the most common values. If you do that, you |
| 1764 | +get good stats for exactly those columns that the user has created |
| 1765 | +indexes on. A tad indirect but I bet it'd be effective... |
| 1766 | + |
| 1767 | + regards, tom lane |
| 1768 | + |
| 1769 | +From [email protected] Thu Aug 24 05:33:02 2000 |
| 1770 | +Received: from hub.org ( [email protected] [216.126.84.1]) |
| 1771 | + by candle.pha.pa.us (8.9.0/8.9.0) with ESMTP id FAA14309 |
| 1772 | + for < [email protected]>; Thu, 24 Aug 2000 05:33:01 -0400 (EDT) |
| 1773 | +Received: from hub.org (majordom@localhost [127.0.0.1]) |
| 1774 | + by hub.org (8.10.1/8.10.1) with SMTP id e7O9X0584670; |
| 1775 | + Thu, 24 Aug 2000 05:33:00 -0400 (EDT) |
| 1776 | +Received: from athena.office.vi.net (office-gwb.fulham.vi.net [194.88.77.158]) |
| 1777 | + by hub.org (8.10.1/8.10.1) with ESMTP id e7O9Ix581216 |
| 1778 | + for < [email protected]>; Thu, 24 Aug 2000 05:19:03 -0400 (EDT) |
| 1779 | +Received: from grommit.office.vi.net [192.168.1.200] (mail) |
| 1780 | + by athena.office.vi.net with esmtp (Exim 3.12 #1 (Debian)) |
| 1781 | + id 13Rt2Y-00073I-00; Thu, 24 Aug 2000 10:11:14 +0100 |
| 1782 | +Received: from jules by grommit.office.vi.net with local (Exim 3.12 #1 (Debian)) |
| 1783 | + id 13Rt2Y-0005GV-00; Thu, 24 Aug 2000 10:11:14 +0100 |
| 1784 | +Date: Thu, 24 Aug 2000 10:11:14 +0100 |
| 1785 | +From: Jules Bean < [email protected]> |
| 1786 | + |
| 1787 | + |
| 1788 | +Subject: Re: [HACKERS] Optimisation deficiency: currval('seq')-->seq scan, constant-->index scan |
| 1789 | + |
| 1790 | + |
| 1791 | +Mime-Version: 1.0 |
| 1792 | +Content-Type: text/plain; charset=us-ascii |
| 1793 | +Content-Disposition: inline |
| 1794 | +User-Agent: Mutt/1.2i |
| 1795 | +In-Reply-To: < [email protected]>; from [email protected] on Wed, Aug 23, 2000 at 10:30:30AM -0400 |
| 1796 | + |
| 1797 | +Precedence: bulk |
| 1798 | + |
| 1799 | +Status: OR |
| 1800 | + |
| 1801 | +On Wed, Aug 23, 2000 at 10:30:30AM -0400, Tom Lane wrote: |
| 1802 | +> Jules Bean < [email protected]> writes: |
| 1803 | +> > I have in a table a 'category' column which takes a small number of |
| 1804 | +> > (basically fixed) values. Here by 'small', I mean ~1000, while the |
| 1805 | +> > table itself has ~10 000 000 rows. Some categories have many, many |
| 1806 | +> > more rows than others. In particular, there's one category which hits |
| 1807 | +> > over half the rows. Because of this (AIUI) postgresql assumes |
| 1808 | +> > that the query |
| 1809 | +> > select ... from thistable where category='something' |
| 1810 | +> > is best served by a seqscan, even though there is an index on |
| 1811 | +> > category. |
| 1812 | +> |
| 1813 | +> Yes, we know about that one. We have stats about the most common value |
| 1814 | +> in a column, but no information about how the less-common values are |
| 1815 | +> distributed. We definitely need stats about several top values not just |
| 1816 | +> one, because this phenomenon of a badly skewed distribution is pretty |
| 1817 | +> common. |
| 1818 | + |
| 1819 | +ISTM that that might be enough, in fact. |
| 1820 | + |
| 1821 | +If you have stats telling you that the most popular value is 'xyz', |
| 1822 | +and that it constitutes 50% of the rows (i.e. 5 000 000) then you can |
| 1823 | +conclude that, on average, other entries constitute a mere 5 000 |
| 1824 | +000/999 ~~ 5000 entries, and it would definitely be enough. |
| 1825 | +(That's assuming you store the number of distinct values somewhere). |
| 1826 | + |
| 1827 | + |
| 1828 | +> BTW, if your highly-popular value is actually a dummy value ('UNKNOWN' |
| 1829 | +> or something like that), a fairly effective workaround is to replace the |
| 1830 | +> dummy entries with NULL. The system does account for NULLs separately |
| 1831 | +> from real values, so you'd then get stats based on the most common |
| 1832 | +> non-dummy value. |
| 1833 | + |
| 1834 | +I can't really do that. Even if I could, the distribution is very |
| 1835 | +skewed -- so the next most common makes up a very high proportion of |
| 1836 | +what's left. I forget the figures exactly. |
| 1837 | + |
| 1838 | +Jules |
| 1839 | + |
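The estimate in Jules Bean's message (one value covering half of ~10 000 000 rows, with ~1000 distinct values in total) can be worked through as a short calculation. This is only a sketch of the arithmetic described in that message, not PostgreSQL's planner code; the function name and parameters are hypothetical, and it assumes the remaining rows are spread evenly over the other values.

```python
def avg_rows_per_other_value(total_rows, mcv_fraction, n_distinct):
    """Average row count for any value other than the most common one.

    Hypothetical helper: assumes the rows not covered by the most common
    value are spread evenly over the remaining (n_distinct - 1) values.
    """
    if n_distinct < 2:
        raise ValueError("need at least two distinct values")
    remaining_rows = total_rows * (1.0 - mcv_fraction)
    return remaining_rows / (n_distinct - 1)


# Figures quoted in the message: ~10 000 000 rows, ~1000 categories,
# and a most common value that covers about half the table.
total_rows = 10_000_000
estimate = avg_rows_per_other_value(total_rows, 0.5, 1000)

# Fraction of the table a query on a non-dominant category would touch.
selectivity = estimate / total_rows
print(f"about {estimate:.0f} rows, selectivity {selectivity:.5f}")
```

Under these assumptions a lookup on a non-dominant category touches roughly 0.05% of the table, which is the case where the message argues an index scan should win. As the message also notes, the real distribution is more skewed than this, so the even-spread assumption is optimistic for the second most common value.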
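The cluster-ratio heuristic that Tiago speculates about in the 21 Aug 2000 message (a tuple is "good" if the next tuple in index order lives on the same page, and the ratio is good tuples over all tuples) can be sketched as follows. This illustrates only that speculation; it is not how DB2 or PostgreSQL actually define the statistic. The function name, the list-of-page-numbers input, and the `allow_next_page` flag (covering the message's remark that pointing to the next page might also count) are all assumptions.

```python
def cluster_ratio(pages_in_index_order, allow_next_page=False):
    """Fraction of tuples whose successor in index order is on the same page.

    pages_in_index_order: heap page number of each tuple, listed in the
    order an index scan would visit them (hypothetical input format).
    allow_next_page: also count a successor on the immediately following
    page as "good".
    """
    n = len(pages_in_index_order)
    if n == 0:
        return 0.0
    good = 0
    # Compare each tuple with its successor; the last tuple has none,
    # so it never counts as "good".
    for cur, nxt in zip(pages_in_index_order, pages_in_index_order[1:]):
        if nxt == cur or (allow_next_page and nxt == cur + 1):
            good += 1
    return good / n


# A well-clustered table: index order follows physical order.
clustered = cluster_ratio([0, 0, 0, 1, 1, 1])
# The same tuples, but index order jumps between pages on every step.
scattered = cluster_ratio([0, 3, 1, 4, 2, 5])
print(clustered, scattered)
```

The sketch also shows the effect suspected in the message: when tuples are large relative to the page size, few consecutive tuples can share a page, so the strict same-page ratio stays low even for a physically ordered table unless the next-page variant is used.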