China Migrants Visualisation Experiments

I recently joined Research IT from a Java developer role within the University of Bristol. I was eager to learn new skills, including Python – the primary language used by my new team. I spent a couple of months helping out with existing projects and experimenting with tools like Pandas, Jupyter, and Leaflet, and getting familiar with Visual Studio Code’s native support for Jupyter Notebooks which provides a great environment for Python based data experimentation (REPL style). I was then asked to work on a new data visualisation project for Dr Thomas Larkin with network graph data edge and node data prepared by James Thomas and Matt Chandler of the Jean Golding Institute.

I was given a small collection of CSV files containing business data about Augustine Heard & Company and related companies from the mid-nineteenth century, primarily operating in and around China. The dataset included career histories of employees, partners and companies. The employees in the dataset often worked in a particular role for a period of time before switching to a different role or moving to a new location. In addition, many of the companies we were looking at operated in multiple locations. Furthermore, the dataset revealed that new companies would sometimes appear while old ones would disappear. Our goal was to create a proof-of-concept visualisation to help us analyse the relationships between the various entities, showing migration patterns and social networks, we wanted to demonstrate changes over time and how the networks expanded and contracted.

Network Graphs: NetworkX

I started using NetworkX, following a recommendation from my colleague Mike Jones and worked through tutorials with the Panama Papers dataset (see: Exploring the Panama Papers Network). I gained an understanding of network graph theory and repeated the tutorials with our Augustine Heard & Company data. Pandas helped me split the data into years and create network graph images for each year. Although I was happy with the initial network graph images I created, I felt they could be improved further. Despite trying various layout options in NetworkX, I couldn’t quite get the graphs to look as visually pleasing as I wanted them to.

NetworkX network graph plot

Network Graphs: D3

That’s when I came across a great example of a browser-based network graph using the D3 JavaScript library (See Mike Bostock’s work: Force-Directed Graph and associated GitHub Gist). I realised I could reuse the Jupyter scripts that I had used to create the NetworkX-based network graph images to produce the JSON needed for a D3-based network graph for each year. The D3 versions of the network graphs were a significant improvement over the NetworkX image equivalents and much more visually appealing. I think the layouts in NetworkX spring layout and D3 force directed graphs are both based on simulated force interactions between the nodes. I just found D3 easier to work with.

Equivalent D3 network graph plot

Geospatial Visualisation

Leaflet, Mapbox GL JS, D3 Force simulations, SVG overlays and GeoJSON

Although I didn’t initially plan on making the visualisations interactive or web-browser based, I found that the easiest way to integrate maps into my work was by using a web-based mapping platform such as Google Maps or Leaflet. I started experimenting with the Timeline plugin for Leaflet, which allowed me to create dynamic maps that highlighted active locations during specific time periods. To do this, I used an enhanced version of NetworkX called GeoNetworkX to create static GeoJSON data files. While this resulted in an attractive output that listed person and company details in a panel down the side, it still wasn’t a network graph. Despite this, the overall result was still impressive and served to motivate me further in the direction of maps.

I was eager to combine the network graphs I had created with NetworkX and D3 with a browser-based map, but I was unsure of how to do so effectively. The data presented the challenge of many nodes being located at the same positions, leading me to experiment with clustering using the Leaflet timeline I had previously used. However, clustering did not feel like the right solution.

I began to explore the integration of network graphs with a map with only the company and location nodes, temporarily ignoring the people nodes and time axis. To achieve this, I researched Mapbox GL, which offers interactive 3D maps at low zoom levels. I experimented with different display options for multiple companies operating in the same location, including random and circle-based jitter. Although this approach mostly worked well for company/location data, the inclusion of people nodes made it unworkable.

I then focused on overlaying the D3 network graphs I had previously created onto a map. I experimented with both Leaflet and Mapbox GL JS, both of which support overlaying data onto an SVG layer. However, as I did not have all the geographic data needed upfront to produce static GeoJSON files, I needed to use D3’s force simulation to generate geographic locations for company and people nodes. I was able to pin the location nodes to their positions on the underlying map and use the D3 force simulation to determine the best positions for the remaining nodes. Transforming between the coordinate systems used by D3 and the geographic coordinate system used in the map accurately required some experimentation.

I initially wanted to use Mapbox GL JS’s 3D maps, but overlaying 2D representations of data in an SVG layer onto a 3D globe required complex mathematics that was too challenging for a prototype proof-of-concept. My early efforts had paths connecting nodes passing through the earth’s surface, highlighting the need for a thorough understanding of Geodesy mathematics when dealing with coordinates on a globe’s surface. I then realised that I now had all the geographic coordinates generated by D3, allowing me to produce complete GeoJSON data. Both Leaflet and Mapbox GL JS support GeoJSON, and the 3D geographic mathematics needed to overlay the GeoJSON data onto the map was already incorporated into Mapbox GL JS. Therefore, I shifted from overlaying data on an SVG layer to dynamically generating GeoJSON and using that to display the data on the map. After successfully overlaying the network graph onto the 3D globe with Mapbox GL JS, I further improved the visualisation by adding simple filtering based on some data attributes.

Immediate neighbours

Tom asked if it was possible to make it so that when we select a node, we only see that individual’s immediate network. I found a solution implementing that Interactive Networks with NetworkX and D3 and implemented it on my examples. It provides a useful focus on the individual when there is a large web of overlapping connections.

To better focus on the individual, I removed the company and location nodes and used another D3 force simulation, force clustering, to position the people nodes around their corresponding locations. I was even able to colour the nodes based on the various attributes and generate a dynamic key for better understanding.

Community detection

Finally, I also integrated a couple of community detection algorithms, network clustering and Bron Kerbosch. These algorithms helped cluster nodes together based on their connections, enabling the identification of communities within the data. Overall, these improvements helped make the visualisation more interactive, informative, and pleasing to the eye.

Completed solution showing colour by options

Completed solution showing node highlighting

Final thoughts

I approached the task with an open mind, uncertain about the final form of the data visualisation. Ultimately, I decided to create an interactive web-based approach, which made sense given my past experience. However, I encountered several challenges during the experimentation process. Although there was plenty of helpful information online, much of it pertained to NetworkX 1.x, which was popular about eight years ago. NetworkX 2.x is substantially different, and NetworkX 3.x came out just after I completed this project. Similarly, D3 had a very active community with numerous examples hosted on GitHub Gists and deployed using “bl.ocks.org,” but these have now been replaced by Observable.com. Furthermore, code written for D3 versions prior to version 5 was markedly different from later versions, and most example code I found was for older versions. Despite these version difficulties, I’m pleased with the final result.

https://ilrt.github.io/china-migrants/