Web lists-archives.com

Re: help with query to count rows while excluding certain rows




On Fri, Jan 1, 2016 at 9:31 PM, Peter Brawley
<peter.brawley@xxxxxxxxxxxxx> wrote:
> On 1/1/2016 19:24, Larry Martell wrote:
>>
>> On Fri, Jan 1, 2016 at 2:12 PM, Peter Brawley
>> <peter.brawley@xxxxxxxxxxxxx> wrote:
>>>
>>> On 12/31/2015 0:51, Larry Martell wrote:
>>>>
>>>> I need to count the number of rows in a table that are grouped by a
>>>> list of columns, but I also need to exclude rows that have more then
>>>> some count when grouped by a different set of columns. Conceptually,
>>>> this is not hard, but I am having trouble doing this efficiently.
>>>>
>>>> My first counting query would be this:
>>>>
>>>> SELECT count(*)
>>>> FROM cst_rollup
>>>> GROUP BY target_name_id, ep, roiname, recipe_process,
>>>> recipe_product, recipe_layer, f_tag_bottom,
>>>> measname, recipe_id
>>>>
>>>> But from this count I need to subtract the count of rows that have
>>>> more then 50 rows with a different grouping:
>>>>
>>>> SELECT count(*)
>>>> FROM cst_rollup
>>>> GROUP BY target_name_id, ep, wafer_id
>>>> HAVING count(*) >= 50
>>>>
>>>> As you can see, the second query has wafer_id, but the first query does
>>>> not.
>>>>
>>>> Currently I am doing this in python, and it's slow. In my current
>>>> implementation I have one query, and it selects the columns (i.e.
>>>> doesn't just count), and I have added wafer_id:
>>>>
>>>> SELECT target_name_id, ep, roiname, recipe_process,
>>>> recipe_product, recipe_layer, f_tag_bottom,
>>>> measname, recipe_id, wafer_id
>>>> FROM cst_rollup
>>>>
>>>> Then I go through the result set (which can be over 200k rows) and I
>>>> count the number of rows with matching (target_name_id, ep, wafer_id).
>>>> Then I go through the rows again and regroup them without wafer_id,
>>>> but skipping the rows that have more then 50 rows for that row's
>>>> (target_name_id, ep, wafer_id).
>>>>
>>>> Is this clear to everyone what I am trying to do?
>>>
>>>
>>> If I've understand this correctly, the resultset you wish to aggregate on
>>> is
>>> ...
>>>
>>> select target_name_id, ep, wafer_id
>>> from cst_rollup a
>>> left join (               -- exclude rows for which wafer_id count >= 50
>>>    select name_id, ep, wafer, count(*) n
>>>    from cst_rollup
>>>    group by target_name_id, ep, wafer_id
>>>    having n >= 50
>>> ) b using ( target_name_id, ep, wafer_id )
>>> where b.target_name is null ;
>>>
>>> If that's so, you could assemble that resultset in a temp table then run
>>> the
>>> desired aggregate query on it, or you could aggregate on it directly as a
>>> subquery.
>>
>> That query gives:
>>
>> ERROR 1137 (HY000): Can't reopen table: 'a'
>
>
> So, it's a temporary table, and you'll need to make that not so.

Yes, cst_rollup is a temp table. The underlying table is millions of
rows (with 300 columns) so for efficiency a subset of the rows and
columns are selected into the temp table based on some user input.
It's just the rows in the temp table that are of interest for the
current report.

I was able to get this working with a second temp table:

CREATE TEMPORARY TABLE rollup_exclude
SELECT target_name_id, ep, wafer_id, count(*) n
FROM cst_rollup
GROUP BY target_name_id, ep, wafer_id
HAVING n >= 50

And then:

SELECT count(*)
FROM cst_rollup
LEFT JOIN(
    SELECT target_name_id, ep, wafer_id
    FROM rollup_exclude) b
    USING (target_name_id, ep, wafer_id)
WHERE b.target_name_id IS NULL
GROUP by target_name_id, ep, roiname, recipe_process, recipe_product,
recipe_layer, f_tag_bottom, measname, recipe_id

And the rowcount from that query gave me what I needed.

Thanks very much for the help Peter, you gave me a push toward the right path.

-- 
MySQL General Mailing List
For list archives: http://lists.mysql.com/mysql
To unsubscribe:    http://lists.mysql.com/mysql