Engineering data pipelines, but not code

18 May 2023, 5:15 am

The NSW Government is undertaking a large project to consolidate many *.nsw.gov.au websites into www.nsw.gov.au. We have a team of 10+ developers at any one time, and this brings unique challenges.

One consequence is that many developers solve similar problems in different ways. This adds time to the development lifecycle, makes it harder for others to pick up an implementation (and the code), and reduces reuse across teams and developers.

One of these problems is that we pull data from a tonne of different sources (JSON files over HTTP, CSV files over HTTP, APIs, push/pull, etc.) and from a bunch of different providers. Given the scale of our site, using the data directly from the source wasn’t an option. We always imported the data into our site and either wrapped it in a controller or imported it into a custom entity.
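To make the "many sources, one shape" problem concrete, here is a minimal, language-agnostic sketch (the Data Pipelines module itself is PHP/Drupal; the function names and sample payloads below are illustrative, not the module's API). It normalises a JSON payload and a CSV payload into the same plain record shape:

```python
import csv
import io
import json


def records_from_json(raw: str) -> list[dict]:
    """Parse a JSON payload (a list of objects) into plain dict records."""
    return list(json.loads(raw))


def records_from_csv(raw: str) -> list[dict]:
    """Parse a CSV payload into dict records keyed by the header row."""
    return list(csv.DictReader(io.StringIO(raw)))


# Two hypothetical providers delivering the same kind of data in different formats.
json_payload = '[{"id": "1", "name": "Service Centre"}]'
csv_payload = "id,name\n2,Library\n"

records = records_from_json(json_payload) + records_from_csv(csv_payload)
print(records)
```

Once every source is reduced to the same record shape, the rest of the pipeline (validation, storage, serving) no longer cares where the data came from.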

We reviewed the issues with the existing approach and how to solve them. Data Pipelines (https://www.drupal.org/project/data_pipelines) is the result of that review.

This solved the problem by:
- Identifying that we could store most of the data in unstructured NoSQL storage.
- Removing the need to create a raft of custom entities, saving development time and database overhead.
- Removing the custom controllers implementing custom APIs.
- Removing requests to our application server altogether by serving the data directly from Elasticsearch.
- Removing the security issue of ‘trusting’ the data from the remote source.
- Mitigating the risk of invalid data causing flow-on issues.
- Making the backend and the frontend problem the same for all developers.
- Making sure we don’t inadvertently take down our data providers with too many requests.
- Enabling reuse of frontend React applications.
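The validation and storage steps above can be sketched end to end. This is a hedged illustration, not the module's implementation: the `DocumentStore` below is a hypothetical in-memory stand-in for a NoSQL/Elasticsearch index, and the required-field rule stands in for whatever schema validation a real pipeline would apply to untrusted source data:

```python
from dataclasses import dataclass, field


@dataclass
class DocumentStore:
    """Stand-in for a NoSQL/Elasticsearch index: maps id -> document."""
    docs: dict = field(default_factory=dict)

    def index(self, doc: dict) -> None:
        self.docs[doc["id"]] = doc


def validate(record: dict, required: set[str]) -> bool:
    """Reject untrusted records that are missing required string fields."""
    return required <= record.keys() and all(
        isinstance(record[k], str) for k in required
    )


def run_pipeline(records: list[dict], store: DocumentStore) -> int:
    """Validate each record, index the valid ones, and count the rejects."""
    rejected = 0
    for record in records:
        if validate(record, {"id", "name"}):
            store.index(record)
        else:
            rejected += 1
    return rejected


store = DocumentStore()
bad = run_pipeline([{"id": "1", "name": "Opening hours"}, {"name": "no id"}], store)
print(bad, store.docs)
```

Because invalid records are rejected before indexing, bad data from a provider never reaches the store that the frontend reads from, which is what mitigates the flow-on issues mentioned above.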

I'm not actually sure if 'Drupal Development' is the correct category for this. If you have some insight, I'd be happy to discuss it.
