Monday, June 17, 2019

AWS Glue and ElastiCache Memcached

Background

I have been writing AWS Glue jobs lately using PySpark. One of these jobs needed to perform some very time-consuming validations against a set of URLs for reporting purposes. There were a lot of URLs to check, so checking them all took many hours and a lot of network traffic.

Thankfully, these URLs didn't change, and once a URL had been validated, it was likely to stay that way for a few months. So there was really no need to keep checking them every day; once every few weeks would be enough. On the other hand, new URLs appeared every day, so I did want to check those. So I scheduled the Glue job to run every day.

Initial Solution

I decided to use AWS ElastiCache Memcached to cache the validated URLs, so that they didn't need to be checked again. The cache entries had an expiry time of a few weeks (max expiry for Memcached was 30 days). When the cached URL expired, it became a candidate to be checked again. I installed the elasticache-pyclient and python-memcached modules, wrote a simple Python script to test the cache from my local machine, and it looked like it would work.
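
The local test was essentially just a few lines like the sketch below (a rough reconstruction rather than the exact script; the cluster endpoint is a placeholder, and it talks to the node directly with python-memcached rather than through the auto-discovery client):

import memcache

# Placeholder endpoint -- substitute the real ElastiCache node or
# configuration endpoint for the cluster.
CACHE_ENDPOINT = 'my-url-cache.xxxxxx.cfg.apse2.cache.amazonaws.com:11211'
THIRTY_DAYS = 30 * 24 * 60 * 60  # the maximum relative expiry Memcached accepts

mc = memcache.Client([CACHE_ENDPOINT], debug=0)

url = 'https://example.com/some/resource'
if mc.get(url) is None:
    # Cache miss (never checked, or the entry expired): this is where the
    # expensive validation would run, after which we remember the result.
    # (Very long URLs may need hashing to fit Memcached's 250-byte key limit.)
    mc.set(url, 'VALID', time=THIRTY_DAYS)

print(mc.get(url))  # prints 'VALID' once the URL has been cached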

Problem

Alas, things did not work as planned on Spark. The problem was that the UDF code running on the Spark executors could not connect to ElastiCache. It seemed that the Spark executors ran in their own private VPC, isolated from the custom VPC that contained all my other services (RDS, ElastiCache).
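
For reference, the validation ran inside a PySpark UDF roughly like the sketch below (a simplified reconstruction, not the actual job code; the endpoint, the urls_df DataFrame and the expensive_http_check helper are placeholders). Each executor opens its own connection to Memcached, which is exactly the part that was failing:

import memcache
from pyspark.sql import functions as F
from pyspark.sql.types import BooleanType

CACHE_ENDPOINT = 'my-url-cache.xxxxxx.cfg.apse2.cache.amazonaws.com:11211'  # placeholder
THIRTY_DAYS = 30 * 24 * 60 * 60

def check_url(url):
    # This runs on the executors, not the driver.
    # (A client per call keeps the sketch short; a real job would reuse one per partition.)
    mc = memcache.Client([CACHE_ENDPOINT])
    if mc.get(url) is not None:
        return True                     # validated recently, skip the slow check
    ok = expensive_http_check(url)      # placeholder for the real validation logic
    if ok:
        mc.set(url, 'VALID', time=THIRTY_DAYS)
    return ok

check_url_udf = F.udf(check_url, BooleanType())
report_df = urls_df.withColumn('url_is_valid', check_url_udf(F.col('url')))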

I went to the console VPC -> Peering Connections -> Create Peering Connection page, but could not see any option to create a peering connection from the Glue executor VPC.

I then remembered that Glue jobs could connect to some data sources in my VPC, so it must be doing some sort of peering behind the scenes. So, before I resorted to more complicated solutions like ECS, I tried a hack to get the Glue job working.

Solution

It would have been nice if I could add ElastiCache as a Glue job connection, but this was not an option. My choices were Redshift, RDS or JDBC. So I added a connection from the Glue job to an existing RDS instance in my VPC, even though the Glue job didn't read anything from that RDS. The idea was that this would make Glue create a VPC peering into my VPC. I also added the following additional permissions to my Glue policy:

  • ec2:CreateTags
  • ec2:DeleteTags
  • ec2:DescribeVpcEndpoints
  • ec2:DescribeRouteTables

And it worked! My Glue job was talking to ElastiCache.

Thursday, January 31, 2013

Getting Australian POIs from Sensis API

The App

Over the last X-mas/New Year break, I wrote a simple Web app to search for Australian Points-Of-Interest (POIs), such as toilets, petrol stations, rest areas, picnic grounds, hospitals, etc. Other than petrol stations (considered as "business" listings), the rest are mostly non-business entities.

Here is the URL of the app:

http://aus-pois.appspot.com

This web app works on desktop PCs and tablets with the following browsers:- Firefox, Chrome, Safari or the Android default browser. There is also an Android application, Aus Pois, that renders the JSON feed from this web app.

The Algorithm

This app gets the POI data from the Sensis API (SAPI). This is the same API that powers the Australian Yellow Pages website. Here's the URL to get the nearest POIs within 2 km of the location, with category ID 1007397:

http://api.sensis.com.au/ob-20110511/prod/search?location=-37.58317709893027,145.21661812353523&radius=2&sortBy=DISTANCE_ALL_PAGES&sensitiveCategories=true&categoryId=1007397&includePois=true&key=(SAPI key)

As stated above, petrol stations are considered normal business listings rather than POIs in SAPI, but they can be retrieved in the same way as the real POIs, simply by specifying the category ID for "Service Station" in the URL.
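
In code, fetching the POIs is just an HTTP GET with those query parameters. Below is a rough Python sketch of the same request (the coordinates, radius and category ID are taken from the URL above; the key is a placeholder, and the final loop assumes the JSON response carries a top-level results list):

import requests

SAPI_SEARCH_URL = 'http://api.sensis.com.au/ob-20110511/prod/search'

params = {
    'location': '-37.58317709893027,145.21661812353523',  # latitude,longitude of the search centre
    'radius': 2,                           # kilometres
    'sortBy': 'DISTANCE_ALL_PAGES',
    'sensitiveCategories': 'true',
    'categoryId': 1007397,                 # the POI category to fetch
    'includePois': 'true',
    'key': 'YOUR_SAPI_KEY',                # placeholder -- use your own SAPI key
}

response = requests.get(SAPI_SEARCH_URL, params=params)
response.raise_for_status()
data = response.json()

# Assumes the response JSON has a top-level 'results' list of listings.
for listing in data.get('results', []):
    print(listing.get('name'))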

The Technology

The Web front-end is done with Google Maps (plus the geometry library), HTML5 (for geolocation and storage) and lots of jQuery goodness.

The server-side logic is implemented on top of Gaelyk, a really cool Groovy framework that allows for almost seamless deployment of the app to Google App Engine.

Monday, July 2, 2012

Moving large data between Oracle database and CSV

I'm currently working on a project that involves importing and exporting large volumes of data between spreadsheet-friendly comma-separated-value (CSV) files and an Oracle database. Although import/export of data can be done using a GUI tool like SQLDeveloper, it is not practical for large volumes or when such jobs need to be scripted. This blog describes how this can be done using BASH scripts and the command-line tools sqlplus and sqlldr.

Scenario

Suppose I run an online shop and maintain a list of purchase transactions with fields such as item_id, customer_id, delivery_status, etc. On my Oracle database, I have PURCHASE_TRANSACTIONS, ITEMS and CUSTOMERS tables with the appropriate foreign-key relationships.

Exporting from Oracle DB to CSV

To export data from the Oracle database table (normally with additional joins and filtering criteria), create a SQL file with:

  • a query to write out the header row of the CSV
  • a query to get the data and concatenate the columns into CSV format

Something like this to export a list of items pending delivery:

-- suppress headings, feedback and echo, and widen the line/page size so
-- each record is spooled as a single CSV line
set head off;
set feed off;
set trimspool on;
set linesize 32767;
set pagesize 32767;
set echo off;
set termout off;

-- spool the output to the file passed as the first script argument (&1)
spool &1;
-- header row for the CSV
select 'ITEM_ID,ITEM_NAME,PRICE,CUSTOMER_ID,CUSTOMER_NAME,CUSTOMER_ADDRESS' from dual;
-- data rows, concatenated into CSV with textual columns wrapped in double-quotes
select pt.item_id
|| ',"' || itm.name
|| '",' || itm.price
|| ',' || pt.customer_id
|| ',"' || c.name
|| '","' || c.address
|| '"'
from PURCHASE_TRANSACTIONS pt
join CUSTOMERS c on pt.customer_id = c.id
join ITEMS itm on pt.item_id = itm.id
where pt.delivery_status = 'PENDING'
order by pt.customer_id;

spool off;
exit;

Note the double-quotes around textual data. To execute this, save it to a file named export-pending.sql and write a simple BASH script like so:

sqlplus -S MY_DB_USER/MY_DB_PASSWORD@MY_DB_SERVER:1521/DB_SID @export-pending.sql pending-delivery-report.csv

Running this script will write out the file pending-delivery-report.csv with contents like:

ITEM_ID,ITEM_NAME,PRICE,CUSTOMER_ID,CUSTOMER_NAME,CUSTOMER_ADDRESS
432,"Super Mouse Pad",23.5,11,"John Smith","45 Some Street, Sydney, NSW 2000"
877,"USB Fan",14,11,"John Smith","45 Some Street, Sydney, NSW 2000"
344,"NeverDie UPS",125.3,67,"Alex Jones","101 Scary Road, Melbourne, VIC 3000"
...

This export functionality is typically used to generate spreadsheet-friendly reports.

Importing from CSV to Oracle DB

Suppose I have a CSV file items-data.csv containing a new list of items that I want to ADD to the ITEMS table, in the following format:

ID,NAME,PRICE
1001,'Lightbulb',1.95
1002,'Car battery',98
1003,'Super cool camera',450.50
...

To load this into the database, create a control file load-items.ctl with the following:

load data
append
into table ITEMS
-- values are comma-separated and may be wrapped in single quotes
fields terminated by ',' optionally enclosed by "'"
(
    ID,
    NAME,
    PRICE
)

Note that to REPLACE the entire table's contents with the CSV data instead, simply change the append instruction (second line) to replace. Finally, create a simple BASH script with the following to run sqlldr:

sqlldr MY_DB_USER/MY_DB_PASSWORD@MY_DB_SERVER:1521/DB_SID control=load-items.ctl data=items-data.csv direct=true rows=1000 skip=1 errors=1 log=load-items.log bad=load-items.bad discard=load-items.discard

After executing this script, check load-items.log for the progress of the data load. The load-items.bad and load-items.discard files contain any rejected and discarded records respectively.

Sunday, July 1, 2012

Mandriva 2011 : the painful way

Last week, I started setting up a new home desktop computer. As all my other home machines were running Mandriva, I decided to install the latest Mandriva 2011 on this new one. Should only take a couple of days to get everything up and running, or so I thought. This blog describes the "fun" (read: pain) of actually getting everything to work.

Networking

The new system came with an Asus P8H61 motherboard that had a Realtek RTL8111/8168B Ethernet controller. Sounded like very ordinary hardware that would definitely just work with Mandriva, if not for the fact that the kernel picked the wrong driver! It wrongly loaded the r8169 kernel module instead of the required r8168. It took me 3 DAYS of messing about :- checking all my cables, reconfiguring my router, updating my kernel version, trying various network settings, etc., before I realized it was all because of this stupid kernel bug.

To fix it, I first had to get the right r8168 driver from the Realtek website. The file I got was r8168-8.031.00.tar.gz2. After unpacking everything and making sure I had the kernel source and headers, I ran:

$ cd r8168-8.031.00
$ ./autorun.sh

which gave me the following error:

WARNING: Symbol version dump /usr/src/devel/2.6.39.4-5.1-desktop/Module.symvers
           is missing; modules will have no dependencies and modversions.

What happened here? Why would the kernel source not have this file that was required for building kernel modules? I had no choice but to rebuild the whole kernel just to get this file, which I then copied to where the Realtek driver expected to find it:

$ cd /usr/src/linux-2.6.39.4-5
$ make config
$ make
$ cp Module.symvers /usr/src/devel/2.6.39.4-5.1-desktop

After that, the Realtek script worked as expected. It unloaded r8169 and swapped in r8168. Finally, I had network connectivity. To make sure my changes stuck, I modified /etc/modprobe.conf to say:

alias pci:v000010ECd00008168sv00001043sd00008432bc02sc00i00 r8168
alias eth0 r8168

and then added r8169 to the module blacklist files:

$ echo 'blacklist r8169' >> /etc/modprobe.d/blacklist-brcm80211.conf
$ echo 'blacklist r8169' >> /etc/modprobe.d/blacklist-compat
$ echo 'blacklist r8169' >> /etc/modprobe.d/blacklist-mdv

Graphics card

Next challenge:- getting hardware direct rendering working. This system had a low-end (for work, not gaming) 1GB Gigabyte GT210 graphics card based on the nVidia GeForce 210.

Understandably, Mandriva had installed the free Nouveau driver by default. What I didn't expect was that installing the nVidia driver didn't actually replace the nouveau driver. Instead, running nvidia-xconfig simply broke X (thankfully it also backed up the original xorg.conf so I could revert). I wasted several hours of my life before figuring out that I first had to uninstall nouveau and then re-install the nVidia rpms below:

dkms-nvidia-current-295.40-2-mdv2011.0.i586
x11-driver-video-nvidia-current-295.40-2-mdv2011.0.i586

Finally, I ran XFdrake one more time and it prompted me to use the nVidia driver.

Monday, December 5, 2011

My highlights of YOW! 2011 Melbourne

I attended the YOW! 2011 Conference in Melbourne on 1-2 December 2011.

The talks


There were many great talks, and some not-so-great. For me, two talks in particular stood out as special. If I had only gotten to listen to these two and nothing else, it would still have been worth it.

Product Engineering :- Mike Lee


I first saw Mike Lee's talk on InfoQ several months ago and was really inspired by the guy's passion, so I was very happy to discover he was making an appearance at YOW.

He didn't impart any novel technical knowledge to us, nor did he tell us about the latest agile technique. In fact, everything he said at YOW was common sense in hindsight, and virtually identical to the talks on InfoQ. And yet, it was pure pleasure to watch this man share his vision of what constituted a good product. I just did not get bored listening to the same talk for the 3rd time! Maybe it was his costume.

Two things that stood out from the talk:
  • The hook :- the thing that makes a product special
  • Appsterdam :- New IT hub for freedom loving developers


Temporally Quaquaversal Virtual Nanomachine Programming In Multiple Topologically Connected Quantum-Relativistic Parallel Timespaces...Made Easy! :- Damian Conway


Speaking of the hook, the title of this talk did the job. Ever since I watched Fred Simon's JavaPosse lightning talk on Positronic Variables a few months ago, I had been curious about Damian Conway's original work. As luck would have it, Damian came to tell us all about positronic variables.

Damian's Perl (and rod logic) code was not very familiar to me, but his tongue-in-cheek introduction to quantum physics and the multi-verse was pure entertainment. Personally, I was not convinced that an Einstein-Rosen bridge could be harnessed as described for running computer programs, due to the excessive energy requirements. A computer program using positronic variables would probably be unstable and generate incoherent results due to the unpredictable superposition of multiple variations of positronic values from an indeterminate number of future multi-verses.

Other notable talks


Jim Webber gave his usual talk on RESTful systems. I'd listened to his similarly themed talks before, but it was still entertaining to watch Jim Webber's performance. The one lesson that stuck in my mind was: use an Atom feed to publish events so that different clients could independently maintain or reconstruct their own internal state.

Another interesting talk was Dan Ingalls telling us about the Lively Kernel. Dan built a simple online load monitoring service just by using drag-and-drop from the Lively Kernel interface... cool.

The games


As usual, there were lots of fun and games between talks. Sensis developed an online (mobile) quiz game, CodeJuicer, that morphed into a fun hacking game instead. Too bad I did not have a laptop with me to hack on it all day, though I did manage to hack it to zero-time from home at the end of Day 1. Thoughtworks put up a quiz game and a movie title matching game, plus liquid-nitrogen ice cream... all very popular. Atlassian let us shoot Nerf guns... hard to beat that.

The food


The first day's food was not too good, but it got better on the second day. The Beef Stroganoff was delicious.

The best parts were the free (real) coffee paid for by Aconex, and the desserts, especially the mango mousse.

Looking forward to YOW! 2012.