Why does a GIN index on a JSONB column slow down my query, and what can I do about it?

Initialize test data:

CREATE EXTENSION IF NOT EXISTS pgcrypto;
CREATE TABLE docs (data JSONB NOT NULL DEFAULT '{}');
-- generate 200k documents: roughly half with type "type1" and half with type "type2", each with a unique incrementing index and a random uuid
INSERT INTO docs (data)
SELECT json_build_object('id', gen_random_uuid(), 'type', (CASE WHEN random() > 0.5 THEN 'type1' ELSE 'type2' END) ,'index', n)::JSONB
FROM generate_series(1, 200000) n;
-- insert one more row with an explicit uuid to query by it later
INSERT INTO docs (data) VALUES (json_build_object('id', '30e84646-c5c5-492d-b7f7-c884d77d1e0a', 'type', 'type1' ,'index', 200001)::JSONB);

First query: filter by data->type, with LIMIT:

-- FAST ~19ms
EXPLAIN ANALYZE
SELECT * FROM docs
WHERE data @> '{"type": "type1"}'::JSONB
LIMIT 25;
/* "Limit  (cost=0.00..697.12 rows=25 width=90) (actual time=0.029..0.070 rows=25 loops=1)"
   "  ->  Seq Scan on docs  (cost=0.00..5577.00 rows=200 width=90) (actual time=0.028..0.061 rows=25 loops=1)"
   "        Filter: (data @> '{"type": "type1"}'::jsonb)"
   "        Rows Removed by Filter: 17"
   "Planning time: 0.069 ms"
   "Execution time: 0.098 ms" 
*/

Second query: filter by data->type, order by data->index, with LIMIT:

-- SLOW ~250ms
EXPLAIN ANALYZE
SELECT * FROM docs
WHERE data @> '{"type": "type1"}'::JSONB
ORDER BY data->'index' -- added ORDER BY
LIMIT 25;

/* "Limit  (cost=5583.14..5583.21 rows=25 width=90) (actual time=236.750..236.754 rows=25 loops=1)"
   "  ->  Sort  (cost=5583.14..5583.64 rows=200 width=90) (actual time=236.750..236.750 rows=25 loops=1)"
   "        Sort Key: ((data -> 'index'::text))"
   "        Sort Method: top-N heapsort  Memory: 28kB"
   "        ->  Seq Scan on docs  (cost=0.00..5577.50 rows=200 width=90) (actual time=0.020..170.797 rows=100158 loops=1)"
   "              Filter: (data @> '{"type": "type1"}'::jsonb)"
   "              Rows Removed by Filter: 99842"
   "Planning time: 0.075 ms"
   "Execution time: 236.785 ms"
*/

Third query: same as the second (previous) one, but with a btree index on data->index:

CREATE INDEX docs_data_index_idx ON docs ((data->'index'));

-- FAST ~19ms
EXPLAIN ANALYZE
SELECT * FROM docs
WHERE data @> '{"type": "type1"}'::JSONB
ORDER BY data->'index' -- added BTREE index on this field
LIMIT 25;
/* "Limit  (cost=0.42..2473.98 rows=25 width=90) (actual time=0.040..0.125 rows=25 loops=1)"
   "  ->  Index Scan using docs_data_index_idx on docs  (cost=0.42..19788.92 rows=200 width=90) (actual time=0.038..0.119 rows=25 loops=1)"
   "        Filter: (data @> '{"type": "type1"}'::jsonb)"
   "        Rows Removed by Filter: 17"
   "Planning time: 0.127 ms"
   "Execution time: 0.159 ms"
*/

Fourth query: now filter by data->id, with LIMIT 1:

-- SLOW ~116ms
EXPLAIN ANALYZE
SELECT * FROM docs
WHERE data @> ('{"id": "30e84646-c5c5-492d-b7f7-c884d77d1e0a"}')::JSONB -- querying by "id" field now
LIMIT 1;
/* "Limit  (cost=0.00..27.89 rows=1 width=90) (actual time=97.990..97.990 rows=1 loops=1)"
   "  ->  Seq Scan on docs  (cost=0.00..5577.00 rows=200 width=90) (actual time=97.989..97.989 rows=1 loops=1)"
   "        Filter: (data @> '{"id": "30e84646-c5c5-492d-b7f7-c884d77d1e0a"}'::jsonb)"
   "        Rows Removed by Filter: 189999"
   "Planning time: 0.064 ms"
   "Execution time: 98.012 ms"
*/

Fifth query: same as the fourth, but with a GIN index (jsonb_path_ops) on data:

CREATE INDEX docs_data_idx ON docs USING GIN (data jsonb_path_ops);

-- FAST ~17ms
EXPLAIN ANALYZE
SELECT * FROM docs
WHERE data @> '{"id": "30e84646-c5c5-492d-b7f7-c884d77d1e0a"}'::JSONB -- added gin index with jsonb_path_ops
LIMIT 1;
/* "Limit  (cost=17.55..20.71 rows=1 width=90) (actual time=0.027..0.027 rows=1 loops=1)"
   "  ->  Bitmap Heap Scan on docs  (cost=17.55..649.91 rows=200 width=90) (actual time=0.026..0.026 rows=1 loops=1)"
   "        Recheck Cond: (data @> '{"id": "30e84646-c5c5-492d-b7f7-c884d77d1e0a"}'::jsonb)"
   "        Heap Blocks: exact=1"
   "        ->  Bitmap Index Scan on docs_data_idx  (cost=0.00..17.50 rows=200 width=0) (actual time=0.016..0.016 rows=1 loops=1)"
   "              Index Cond: (data @> '{"id": "30e84646-c5c5-492d-b7f7-c884d77d1e0a"}'::jsonb)"
   "Planning time: 0.095 ms"
   "Execution time: 0.055 ms"
*/

Sixth (and last) query: same as the third (filter by data->type, order by data->index, LIMIT):

-- SLOW AGAIN! ~224ms
EXPLAIN ANALYZE
SELECT * FROM docs
WHERE data @> '{"type": "type1"}'::JSONB
ORDER BY data->'index'
LIMIT 25;
/* "Limit  (cost=656.06..656.12 rows=25 width=90) (actual time=215.927..215.932 rows=25 loops=1)"
   "  ->  Sort  (cost=656.06..656.56 rows=200 width=90) (actual time=215.925..215.925 rows=25 loops=1)"
   "        Sort Key: ((data -> 'index'::text))"
   "        Sort Method: top-N heapsort  Memory: 28kB"
   "        ->  Bitmap Heap Scan on docs  (cost=17.55..650.41 rows=200 width=90) (actual time=33.134..152.618 rows=100158 loops=1)"
   "              Recheck Cond: (data @> '{"type": "type1"}'::jsonb)"
   "              Heap Blocks: exact=3077"
   "              ->  Bitmap Index Scan on docs_data_idx  (cost=0.00..17.50 rows=200 width=0) (actual time=32.468..32.468 rows=100158 loops=1)"
   "                    Index Cond: (data @> '{"type": "type1"}'::jsonb)"
   "Planning time: 0.157 ms"
   "Execution time: 215.992 ms"
*/

So it seems that the sixth query (identical to the third) is much slower once the data column has a GIN index. Is that because the data->type field has very few distinct values (only "type1" or "type2")? What can I do about it? I need the GIN index for other queries that benefit from it...

postgresql postgresql-9.4

— user606521

It looks like you have run into the issue that jsonb columns get a flat 1% selectivity estimate, as reported here: "Working around jsonb's lacking statistics?". If you look at your query plans, the gap between the estimates and the actual executions is huge. The estimates say there are probably 200 rows (about 0.1% of the table), while the query actually returns 100158 rows, which makes the planner favor certain strategies over others.
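To see the mismatch without reading a full ANALYZE plan, you can compare the planner's row estimate with the real match count. This is only a sketch against the docs table from the question; the column aliases are made up for illustration.

```sql
-- Planner's guess: look at "rows=" on the top plan node (no ANALYZE, so nothing is executed).
EXPLAIN
SELECT * FROM docs
WHERE data @> '{"type": "type1"}'::jsonb;

-- Reality: count how many rows actually satisfy the containment predicate.
SELECT count(*) AS total_rows,
       count(CASE WHEN data @> '{"type": "type1"}'::jsonb THEN 1 END) AS matching_rows,
       round(100.0 * count(CASE WHEN data @> '{"type": "type1"}'::jsonb THEN 1 END)
             / NULLIF(count(*), 0), 2) AS matching_pct
FROM docs;
```

If matching_pct is far above what the estimate implies, as it is for the "type" field here, the planner is likely to underestimate the cost of the bitmap scan.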

Since the choice in the sixth query comes down to preferring a bitmap index scan over an index scan, you can nudge the planner with SET enable_bitmapscan=off and try to get it back to the behavior you had in your third example.

Here is how that worked for me:

postgres@[local]:5432:postgres:=# EXPLAIN (ANALYZE, BUFFERS)
SELECT * FROM docs
WHERE data @> '{"type": "type1"}'::JSONB
ORDER BY data->'index'
LIMIT 25;
                                                                QUERY PLAN                                                                 
-------------------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=656.06..656.12 rows=25 width=90) (actual time=117.338..117.343 rows=25 loops=1)
   Buffers: shared hit=3096
   ->  Sort  (cost=656.06..656.56 rows=200 width=90) (actual time=117.336..117.338 rows=25 loops=1)
         Sort Key: ((data -> 'index'::text))
         Sort Method: top-N heapsort  Memory: 28kB
         Buffers: shared hit=3096
         ->  Bitmap Heap Scan on docs  (cost=17.55..650.41 rows=200 width=90) (actual time=12.838..80.584 rows=99973 loops=1)
               Recheck Cond: (data @> '{"type": "type1"}'::jsonb)
               Heap Blocks: exact=3077
               Buffers: shared hit=3096
               ->  Bitmap Index Scan on docs_data_idx  (cost=0.00..17.50 rows=200 width=0) (actual time=12.469..12.469 rows=99973 loops=1)
                     Index Cond: (data @> '{"type": "type1"}'::jsonb)
                     Buffers: shared hit=19
 Planning time: 0.088 ms
 Execution time: 117.405 ms
(15 rows)

Time: 117.813 ms
postgres@[local]:5432:postgres:=# SET enable_bitmapscan = off;
SET
Time: 0.130 ms
postgres@[local]:5432:postgres:=# EXPLAIN (ANALYZE, BUFFERS)
SELECT * FROM docs
WHERE data @> '{"type": "type1"}'::JSONB
ORDER BY data->'index'
LIMIT 25;
                                                               QUERY PLAN                                                               
----------------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=0.42..1320.48 rows=25 width=90) (actual time=0.017..0.050 rows=25 loops=1)
   Buffers: shared hit=4
   ->  Index Scan using docs_data_index_idx on docs  (cost=0.42..10560.94 rows=200 width=90) (actual time=0.015..0.045 rows=25 loops=1)
         Filter: (data @> '{"type": "type1"}'::jsonb)
         Rows Removed by Filter: 27
         Buffers: shared hit=4
 Planning time: 0.083 ms
 Execution time: 0.071 ms
(8 rows)

Time: 0.402 ms
postgres@[local]:5432:postgres:=#

If you go this route, be sure to disable that scan type only for the queries that show this behavior; otherwise you will get poor plans for other queries as well. Doing something like this should work fine:

BEGIN;
SET enable_bitmapscan=off;
SELECT * FROM docs
WHERE data @> '{"type": "type1"}'::JSONB
ORDER BY data->'index'
LIMIT 25;
SET enable_bitmapscan=on;
COMMIT;
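A slightly safer variant of the same idea, as a sketch, uses SET LOCAL. It is a swapped-in technique, not what the answer above tested: SET LOCAL limits the change to the current transaction, so the setting reverts on COMMIT or ROLLBACK and there is no need to switch it back on by hand.

```sql
BEGIN;
-- SET LOCAL lasts only until the end of this transaction,
-- so other queries in the session keep the default planner behavior.
SET LOCAL enable_bitmapscan = off;
SELECT * FROM docs
WHERE data @> '{"type": "type1"}'::jsonb
ORDER BY data->'index'
LIMIT 25;
COMMIT;
```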

Hope that helps =)

— Kassandry

I'm not sure I understand you correctly (I'm not familiar with PG internals): this behavior is caused by the low cardinality of the "type" field in the jsonb column (and, internally, by the flat statistics rate), right? And it also means that if I want my queries to be optimized, I need to know the approximate cardinality of the jsonb fields I query in order to decide whether to turn enable_bitmapscan on or off, right?

— user606521

Yes, you seem to understand this on both counts. The base 1% selectivity favors looking up the field from the WHERE clause in the GIN index, because the planner believes that will return fewer rows, which is not true. Since you can estimate the number of rows better, you can see that, with ORDER BY data->'index' LIMIT 25, scanning the first few entries of the other index (about 50, counting the discarded rows) touches far fewer rows. Telling the planner that it really shouldn't try a bitmap scan therefore leads it to a faster query plan. Hope that clears things up. =)
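If the same low-cardinality lookup is issued from many places, one way to keep the override scoped to that query is to attach it to a function. The sketch below assumes a hypothetical function name (docs_by_type) and parameters (p_type, p_limit); it relies on the SET clause of CREATE FUNCTION, which applies the setting only while the function runs. It has not been benchmarked against the plans above.

```sql
-- Hypothetical helper: the planner setting is changed only for the duration of each call.
CREATE FUNCTION docs_by_type(p_type text, p_limit int)
RETURNS SETOF docs
LANGUAGE sql
STABLE
SET enable_bitmapscan = off
AS $$
    SELECT * FROM docs
    WHERE data @> json_build_object('type', p_type)::jsonb
    ORDER BY data->'index'
    LIMIT p_limit;
$$;

-- Usage: first 25 "type1" documents ordered by index.
SELECT * FROM docs_by_type('type1', 25);
```

Queries on high-cardinality fields such as "id" would keep going through the GIN index as before, since they do not pass through this function.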

— Kassandry

For more background, see databaseasoup.com/2015/01/tag-all-things-part-3.html and also this presentation: thebuild.com/presentations/json2015-pgconfus.pdf.

— Kassandry

The only work I know of is by Oleg Bartunov, Teodor Sigaev, and Alexander Korotkov on the JsQuery extension and its selectivity improvements. With some luck, it will make it into PostgreSQL core in 9.6 or later.

— Kassandry

I quoted the 1% figure from the email in my answer by Josh Berkus, a PostgreSQL Core Team member. Where it comes from requires a much, much deeper understanding of the internals than I currently have, sorry. =( You could try replying on pgsql-performance@postgresql.org or asking in #postgresql on Freenode IRC to find out exactly where that number comes from.

— Kassandry