Background
I have been writing AWS Glue jobs lately using PySpark. One of these jobs needed to perform some very time-consuming validations against some URLs, for reporting purposes. There were a lot of URLs to check, so it took many hours and a lot of network traffic to check them all.
Thankfully, these URLs didn't change, and once a URL had been validated, it was likely to stay that way for a few months. So there was really no need to keep checking them every day; once every few weeks would be enough. On the other hand, new URLs appeared every day, and I did want to check those, so I scheduled the Glue job to run every day.
Initial Solution
I decided to use AWS ElastiCache Memcached to cache the validated URLs, so that they didn't need to be checked again. The cache entries had an expiry time of a few weeks (the maximum relative expiry for Memcached is 30 days). When a cached URL expired, it became a candidate to be checked again. I installed the elasticache-pyclient and python-memcached modules, wrote a simple Python script to test the cache from my local machine, and it looked like it would work.
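The caching idea can be sketched roughly as follows. This is a minimal illustration, not the original script: the names check_url, cache_key and CACHE_TTL_SECONDS, the three-week TTL, and the validate callback are all assumptions. The cache argument is anything with get and set methods shaped like those of python-memcached's Client.

```python
import hashlib

# Hypothetical TTL: three weeks, safely under Memcached's 30-day relative expiry limit.
CACHE_TTL_SECONDS = 21 * 24 * 60 * 60


def cache_key(url):
    """Hash the URL so the key is short and contains no whitespace.

    Memcached keys are limited to 250 bytes, and URLs can be longer than that.
    """
    return "url:" + hashlib.sha256(url.encode("utf-8")).hexdigest()


def check_url(url, cache, validate):
    """Return the validation result for url, consulting the cache first.

    cache    -- any object with get(key) and set(key, value, time=seconds),
                for example memcache.Client(["my-endpoint:11211"]) from
                python-memcached (the endpoint name is a placeholder)
    validate -- the slow validation function (hypothetical), called only on a miss
    """
    key = cache_key(url)
    cached = cache.get(key)  # None on a miss or once the entry has expired
    if cached is not None:
        return cached
    result = validate(url)
    cache.set(key, result, time=CACHE_TTL_SECONDS)
    return result
```

Hashing the URL is a design choice rather than a requirement; it just sidesteps key length and character restrictions. Note that a validate function returning None would never be served from the cache in this sketch.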
Problem
Alas, things did not work as planned on Spark. The problem was that the UDF code running on the Spark executors could not connect to ElastiCache. It seemed that the Spark executors ran in their own private VPC, isolated from the custom VPC that contained all my other services (RDS, ElastiCache).
I went to the VPC -> Peering Connections -> Create Peering Connection page in the console but could not see any option to create a peering connection from the Glue executor VPC.
I then remembered that Glue jobs could connect to some data sources in my VPC, so Glue must be doing some sort of peering behind the scenes. So, before I resorted to more complicated solutions like ECS, I tried a hack to get the Glue job working.
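One way to confirm this kind of network isolation is a plain TCP reachability check run from the executors. The helper below is a sketch using only the standard library; the name can_connect and the default timeout are my own choices, and 11211 is simply the default Memcached port.

```python
import socket


def can_connect(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout seconds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and DNS failures alike.
        return False


# To run the check on the executors rather than the driver, it could be
# called inside a transformation, for example (assuming a SparkContext sc
# and a placeholder endpoint name):
#   sc.parallelize(range(4), 4).map(
#       lambda _: can_connect("my-endpoint", 11211)).collect()
```

A result of False from the executors, alongside True from a machine inside the VPC, would point to the network path rather than the cache client.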
Solution
It would have been nice if I could add ElastiCache as a Glue job connection, but this was not an option. My choices were Redshift, RDS or JDBC. So I added a connection from the Glue job to an existing RDS instance in my VPC, even though the Glue job didn't read anything from it. The idea was that this would make Glue set up its peering to my VPC. I also added the following permissions to my Glue policy:
- ec2:CreateTags
- ec2:DeleteTags
- ec2:DescribeVpcEndpoints
- ec2:DescribeRouteTables
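As a rough sketch, the four permissions above could be written as a single IAM policy statement. The snippet builds the JSON document in Python; the wildcard resource and the helper name extra_glue_statement are assumptions for illustration, and in practice the tagging actions are worth scoping more narrowly.

```python
import json

# The four extra actions from the list above.
EXTRA_ACTIONS = [
    "ec2:CreateTags",
    "ec2:DeleteTags",
    "ec2:DescribeVpcEndpoints",
    "ec2:DescribeRouteTables",
]


def extra_glue_statement(resource="*"):
    """Build one Allow statement for the extra actions (resource "*" is an assumption)."""
    return {"Effect": "Allow", "Action": list(EXTRA_ACTIONS), "Resource": resource}


policy_document = {
    "Version": "2012-10-17",
    "Statement": [extra_glue_statement()],
}

# Print the JSON that could be pasted into the policy editor.
print(json.dumps(policy_document, indent=2))
```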
And it worked! My Glue job was talking to ElastiCache. Here's the picture: