Neo4j Import Performance – Database Administrators Stack Exchange

I have a Neo4j 3.5.3 installation on my Ubuntu laptop (Intel i5, 4 GB RAM, SSD) and I am trying to import a medium-sized dataset from CSV files into the graph.

Here is the complete Cypher script that I use:

MATCH (x:Shop) DETACH DELETE x RETURN count(*) AS DeletedShops;
MATCH (x:Postal) DETACH DELETE x RETURN count(*) AS DeletedPostal;
MATCH (x:City) DETACH DELETE x RETURN count(*) AS DeletedCities;
MATCH (x:Locator) DETACH DELETE x RETURN count(*) AS DeletedLocators;
MATCH (x:Brand) DETACH DELETE x RETURN count(*) AS DeletedBrands;
MATCH (x:Industry) DETACH DELETE x RETURN count(*) AS DeletedIndustries;

USING PERIODIC COMMIT 10000
LOAD CSV WITH HEADERS FROM 'file:///accounts.csv.gz' AS csv
// locatorschemaname, accountname, clientname, clientindustry, accountindustry, locationcount
MERGE (b:Brand {name: csv.clientname})
FOREACH (n IN (CASE WHEN csv.clientindustry IS NOT NULL AND NOT toLower(csv.clientindustry) IN ['na','unknown','other'] THEN [1] ELSE [] END) |
  MERGE (i:Industry {name: csv.clientindustry})
  MERGE (b)-[:INDUSTRY]->(i)
)
FOREACH (n IN (CASE WHEN csv.accountindustry IS NOT NULL AND NOT toLower(csv.accountindustry) IN ['na','unknown','other'] THEN [1] ELSE [] END) |
  MERGE (i2:Industry {name: csv.accountindustry})
  MERGE (b)-[:INDUSTRY]->(i2)
)
// Brand.shops = max(Locator.shops)
FOREACH (n IN (CASE WHEN csv.locatorschemaname IS NOT NULL AND csv.locatorschemaname = csv.accountname THEN [1] ELSE [] END) |
  MERGE (l:Locator {name: csv.locatorschemaname, shops: toInt(csv.locationcount)})
  MERGE (l)-[:BRAND]->(b)
  SET b.shops = CASE
    WHEN b.shops IS NULL OR b.shops < toInt(csv.locationcount) THEN toInt(csv.locationcount) ELSE b.shops END
);

MATCH (:Industry) RETURN count(*) AS IndustriesCreated;
MATCH (:Brand) RETURN count(*) AS BrandsCreated;
MATCH (:Locator) RETURN count(*) AS LocatorsCreated;

// Industry.shops = sum(Brand.shops)
MATCH (b:Brand)-[:INDUSTRY]->(i:Industry)
WITH i.name AS iname, sum(b.shops) AS isum
MATCH (ii:Industry {name: iname})
SET ii.shops = isum;

// NOTE: this is done in multiple passes to avoid performance issues with Neo4j CE 3.4.7 on Ubuntu
// NOTE: we do not use WITH HEADERS here because it adds 40% overhead or more

// PASS 1. Shops
USING PERIODIC COMMIT 10000
LOAD CSV FROM 'file:///locations.csv.gz' AS csv
// locatorschemaname, clientkey, name, address1, address2, city, region, country, postalcode, latitude, longitude
// look up the existing Locator node
MATCH (l:Locator {name: csv[0]})
CREATE (s:Shop {
  locatorname: csv[0],
  clientkey: csv[1],
  latitude: toFloat(csv[9]),
  longitude: toFloat(csv[10])
})
CREATE (s)-[:LOCATOR]->(l);

CREATE INDEX ON :Shop(locatorname, clientkey);

MATCH (:Shop) RETURN count(*) AS ShopsCreated;

// PASS 2. cities
USING PERIODIC COMMIT 10000
LOAD CSV FROM 'file:///locations.csv.gz' AS csv
WITH csv WHERE csv[5] IS NOT NULL
MATCH (s:Shop {locatorname: csv[0], clientkey: csv[1]})
MERGE (city:City {name: csv[5], country: csv[7]}) ON CREATE SET city.region = csv[6]
MERGE (s)-[:CITY]->(city);

MATCH (:City) RETURN count(*) AS CitiesCreated;

// PASS 3. zip codes
USING PERIODIC COMMIT 10000
LOAD CSV FROM 'file:///locations.csv.gz' AS csv
WITH csv WHERE csv[8] IS NOT NULL
MATCH (s:Shop {locatorname: csv[0], clientkey: csv[1]})
MERGE (postal:Postal {name: csv[8], country: csv[7]}) ON CREATE SET postal.region = csv[6]
MERGE (s)-[:POSTAL]->(postal);

MATCH (:Postal) RETURN count(*) AS PostcodesCreated;

CREATE CONSTRAINT ON (i:Industry) ASSERT i.name IS UNIQUE;
CREATE CONSTRAINT ON (b:Brand) ASSERT b.name IS UNIQUE;
CREATE CONSTRAINT ON (l:Locator) ASSERT l.name IS UNIQUE;

The problem is that the script runs quite quickly up to the end of "PASS 1", then apparently hangs in "PASS 2". The server process keeps the CPU busy, but nothing visible happens. It has been running for at least 120 minutes with no sign of finishing soon.
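For reference, the first batch of PASS 2 can be profiled in isolation. This is only a diagnostic sketch I put together (it assumes the same column layout as above, and adds a `LIMIT` so it terminates); the point is to see whether the `Shop` lookup and the `City` MERGE use an index or fall back to full label scans:

```
// Profile one small PASS 2 batch: look for NodeByLabelScan operators
// in the plan, which would mean every row scans all :Shop / :City nodes.
PROFILE
LOAD CSV FROM 'file:///locations.csv.gz' AS csv
WITH csv WHERE csv[5] IS NOT NULL
WITH csv LIMIT 1000
MATCH (s:Shop {locatorname: csv[0], clientkey: csv[1]})
MERGE (city:City {name: csv[5], country: csv[7]}) ON CREATE SET city.region = csv[6]
MERGE (s)-[:CITY]->(city)
RETURN count(*);
```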

I use the default settings for heap size and so on. But the size of the entire dataset on disk (checked in /var/lib/neo4j/data) is ~500 MB, so this machine should be able to handle it.
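For context, these are the memory settings in neo4j.conf that I have left at their defaults; the values below are only an illustration of what could be set on a 4 GB machine, not what I currently run:

```
# /etc/neo4j/neo4j.conf – example values for a 4 GB machine (illustration only)
dbms.memory.heap.initial_size=1g
dbms.memory.heap.max_size=1g
dbms.memory.pagecache.size=1g
```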

Here is the output so far:

DeletedShops
0
DeletedPostal
0
DeletedCities
0
DeletedLocators
0
DeletedBrands
0
DeletedIndustries
0

IndustriesCreated
18
BrandsCreated
1326
LocatorsCreated
2092

ShopsCreated
937488

// very long wait here

And here is some ps output:

[11:50:56][filip@lap2:~/neo4j]$ ps fuwww `pidof java`
USER       PID %CPU %MEM     VSZ     RSS TTY   STAT START   TIME COMMAND
filip     4966  0.0  0.1 3979904    7412 pts/0 Sl+  08:57   0:09 /usr/lib/jvm/java-8-oracle/bin/java -jar /usr/bin/../share/cypher-shell/lib/cypher-shell-all.jar -u neo4j -p neo --format plain
neo4j     2411 98.1 25.6 4862652 1012488 ?          08:50 177:11 /usr/bin/java -cp /var/lib/neo4j/plugins:/etc/neo4j:/usr/share/neo4j/lib/*:/var/lib/neo4j/plugins/* -server -Xms950m -Xmx950m -XX:+UseG1GC -XX:-OmitStackTraceInFastThrow … -Dunsupported.dbms.udc.source=debian -Dfile.encoding=UTF-8 org.neo4j.server.CommunityEntryPoint --home-dir=/var/lib/neo4j --config-dir=/etc/neo4j
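While it hangs, I can also inspect the server from a second cypher-shell session. A sketch of the built-in procedures I know of for this (both exist in Neo4j 3.5 Community, though the exact YIELD columns are from memory):

```
// List currently running queries and how long they have been running
CALL dbms.listQueries() YIELD queryId, query, elapsedTimeMillis
RETURN queryId, query, elapsedTimeMillis;

// Check which indexes exist and whether they are ONLINE (fully populated)
CALL db.indexes();
```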

How can I rewrite the script to achieve the same result faster?

What can I do to diagnose the bottleneck?

Is it possible at all with a limited Java heap size (1 to 2 GB), and why or why not?