Disclaimer: My expertise is not in networking. My career has mostly been full-stack web development, with a lot more cloud platform usage in the last couple of years, so my inexperience with IPs, subnets, and CIDR (Google it, I had to) probably made this harder than it should have been. Oh well, I sure know a lot more now :)
Let me set the scene here. I had just successfully gotten my local Neo4j Causal Cluster (CC) running with three core servers, and figured that I could set it up quickly in Azure now that I understood the basics behind CC. I was wrong.
I was using external DNS entries (dev-******.cloudapp.azure.com) for everything. All of my ***_advertised_address settings were using DNS entries that mapped to public IPs on my Ubuntu VMs.
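For context, my neo4j.conf looked roughly like this on each core (the hostname below is a placeholder, not my actual DNS entry; 5000/6000/7000 are the default discovery, transaction-shipping, and Raft ports):

```
# neo4j.conf (illustrative - hostname is a placeholder)
causal_clustering.discovery_advertised_address=dev-example.cloudapp.azure.com:5000
causal_clustering.transaction_advertised_address=dev-example.cloudapp.azure.com:6000
causal_clustering.raft_advertised_address=dev-example.cloudapp.azure.com:7000
```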
No matter what, I would always see this on all three of the core CC servers:
INFO [o.n.c.d.HazelcastCoreTopologyService] Attempting to connect to the other cluster members before continuing…
And ten minutes later it would always show this:
WARN [o.n.c.d.HazelcastCoreTopologyService] The server has not been able to connect in a timely fashion to the cluster. Please consult the logs for more details. Rebooting the server may solve the problem.
There were no other errors. The startup logs looked great (I had the log level set to DEBUG). Running lsof -i :5000 showed that the server was indeed listening for connections.
I then telnetted (from another CC server) into one of the other CC servers… it connected, no problem!
I tried connecting to a port that's not open in my Azure Network Security Group (NSG) and, sure enough, it hung as expected. OK, so I did have access to this port from this machine. I then repeated this on every core, telnetting out to every other CC server successfully.
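If you want to script that reachability check instead of telnetting by hand, here's a minimal sketch using Python's standard socket module (the "core2.internal" hostname is a placeholder for a peer's address, and the ports are the Neo4j CC defaults):

```python
import socket

def port_open(host: str, port: int, timeout: float = 5.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and DNS failures alike.
        return False

# Probe the CC ports (discovery, tx shipping, Raft) on a peer core.
# "core2.internal" is a placeholder - swap in your actual peer address.
for port in (5000, 6000, 7000):
    status = "open" if port_open("core2.internal", port, timeout=2.0) else "blocked/filtered"
    print(port, status)
```

A hang in telnet shows up here as a timeout, which `port_open` reports the same way as a refused connection: False.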
A peer of mine was able to successfully get a CC running via Docker, so I immediately started combing over the Docker deployment looking for any clues.
The only difference we could see was that the Docker deployments were using the internal IPs… so I tried that, and BAM, the logs began filling with CC communications.
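The working configuration looked roughly like this (the 10.x address below is a placeholder standing in for each VM's private IP, which you'd vary per core):

```
# neo4j.conf (illustrative - 10.0.0.4 stands in for this VM's private IP)
causal_clustering.discovery_advertised_address=10.0.0.4:5000
causal_clustering.transaction_advertised_address=10.0.0.4:6000
causal_clustering.raft_advertised_address=10.0.0.4:7000

# Peers listed by their private IPs as well (placeholders)
causal_clustering.initial_discovery_members=10.0.0.4:5000,10.0.0.5:5000,10.0.0.6:5000
```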
Here's the rub, though: I still don't know why this communication was blocked. I can speculate that it's either some weird Azure thing, OR that the Neo4j CC setup forces the discovery, tx shipping, and Raft ports to only advertise on the same subnet as the machine it's running on.
Just speculation at this point - more investigation is underway - but it worked.
The upside to all this is - from a security standpoint - you definitely want these ports locked down. Otherwise, someone out there could just spin up a CC server, point it at your public address, and copy your data.
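One way to lock those ports down is an NSG rule that only allows the cluster's own subnet to reach them. A sketch with the Azure CLI (resource group, NSG name, and address prefix are all placeholders for your own values):

```shell
# Allow the CC ports only from the cluster's subnet (placeholder values)
az network nsg rule create \
  --resource-group my-rg \
  --nsg-name my-nsg \
  --name allow-cc-from-subnet \
  --priority 100 \
  --direction Inbound \
  --access Allow \
  --protocol Tcp \
  --source-address-prefixes 10.0.0.0/24 \
  --destination-port-ranges 5000 6000 7000
```

With a lower-priority deny rule behind it (or just the NSG's default deny), nothing outside that subnet can reach discovery, tx shipping, or Raft.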
That is… if the discovery, tx shipping, and Raft processes don't authenticate - I don't know that for sure.
I do know that the Bolt endpoint authenticates by default, which makes sense, because it's the one most likely to be used by external/outside client connectors.
I learned a ton more about Azure Virtual Networks and VNet-to-VNet connections via VNet Gateways as well. I implemented the data center disaster recovery model, seen here, except that mine is cross-region (four cores and two read replicas spread across the East US 2 and Central US Azure regions).
We do professional services for just this sort of thing, so feel free to hit us up at graphstory.com via the chat below.