Python Pattern Matching Examples: ETL and Dataclasses
In Mastering Structural Pattern Matching I walked you through the theory of Structural Pattern Matching, so now it’s time to apply that knowledge and build something practical.
Let’s say you need to process data from one system (a JSON-based REST API) into another (a CSV file for use in Excel). A common task. Extracting, Transforming, and Loading (ETL) data is one of the things Python does especially well, and with pattern matching you can simplify and organize your business logic in such a way that it remains maintainable and understandable.
Let’s get some test data. For this you’ll need the requests library.
>>> import requests
>>> resp = requests.get('https://demo.inspiredpython.com/invoices/')
>>> assert resp.ok
>>> data = resp.json()
>>> data[0]
{'recipient': {'company': 'Trommler',
               'address': 'Annette-Döring-Allee 5\n01231 Grafenau',
               'country_code': 'DE'},
 'invoice_id': 15134,
 'currency': 'JPY',
 'amount': 945.57,
 'sku': 'PROPANE-ACCESSORIES'}
Objectives
The data – feel free to use the demo URL provided in the example above – is a list of invoices for our fictional company that sells propane (and propane accessories.)
As part of any serious ETL process, you must consider the quality of the data. For this, I want to flag entries that may require human intervention:
- Find mismatched payment currencies and country codes. For instance, the example above lists the payment currency as JPY but the country code is German.
- Ensure the invoice IDs are unique and that they are all integers less than 50000.
- Map each invoice to a dedicated Invoice dataclass, and each invoice recipient to a Company dataclass.
Then,
- Write the quality-assured invoices to a CSV file.
- Flag everything that fails those checks and put it in a different CSV file for manual review.
An important note though.
In a real application there would be a validation layer that checks the input data for obvious errors, like integers in a string field, or missing fields. For brevity I won’t include that part, but you should use a package like marshmallow or pydantic to formalize the contract you (the consumer) have with the data producer(s) you interface with, so you can catch (and act on) these mistakes.
For the sake of argument, let’s assume the input data meets these basic standards. It is not, however, the job of a library like marshmallow to validate that, say, the country code and the currency make sense together.
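To give a feel for what such a validation layer checks, here is a minimal standard-library sketch. It is only a stand-in and does not show how marshmallow or pydantic work; the helper name check_record_shape and the REQUIRED_FIELDS table are assumptions made up for this illustration.

```python
# Hypothetical stand-in for a real validation layer (marshmallow, pydantic).
# It only checks that the expected keys exist and hold the expected types.
REQUIRED_FIELDS = {
    "invoice_id": int,
    "currency": str,
    "amount": (int, float),
    "sku": str,
    "recipient": dict,
}


def check_record_shape(record):
    """Return a list of human-readable problems; an empty list means OK."""
    problems = []
    for field, expected_type in REQUIRED_FIELDS.items():
        if field not in record:
            problems.append(f"missing field: {field}")
        elif not isinstance(record[field], expected_type):
            problems.append(f"wrong type for field: {field}")
    return problems


good = {
    "recipient": {"company": "Trommler", "country_code": "DE"},
    "invoice_id": 15134,
    "currency": "JPY",
    "amount": 945.57,
    "sku": "PROPANE-ACCESSORIES",
}
bad = {"invoice_id": "15134"}  # ID sent as a string, everything else missing

print(check_record_shape(good))
print(check_record_shape(bad))
```

Note that the good record passes this check even though it pairs JPY with a German recipient. Catching that kind of mismatch is a business rule, which is what the rest of this article deals with.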
Getting the API Data
Let’s start by formalizing the extraction of the data I did earlier:
import requests


def get_invoices(url):
    response = requests.get(url)
    # Raise requests.HTTPError if the server returned a 4xx or 5xx status.
    response.raise_for_status()
    return response.json()
Here I let requests raise an exception if the server responds with an error status (any 4xx or 5xx code). I also naively assume the response body is JSON, as it’s just a demonstration.
Defining the dataclasses
Now let’s define the dataclasses. Two will suffice: a Company dataclass to hold details about the invoice recipient; and an Invoice dataclass that’ll reference the recipient company and the invoice details themselves:
from dataclasses import dataclass
from typing import Optional


@dataclass
class Company:
    company: str
    address: str
    country_code: str


@dataclass
class Invoice:
    invoice_id: int
    currency: str
    amount: float
    sku: str
    recipient: Optional[Company]
Each dataclass is the canonical representation of either a company or an invoice once the data is transformed from its source format.
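As a quick illustration, this sketch builds both objects by hand from the sample record shown at the start of the article. Nothing here is new API; it only repeats the two dataclass definitions so the snippet runs on its own.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Company:
    company: str
    address: str
    country_code: str


@dataclass
class Invoice:
    invoice_id: int
    currency: str
    amount: float
    sku: str
    recipient: Optional[Company]


recipient = Company(
    company="Trommler",
    address="Annette-Döring-Allee 5\n01231 Grafenau",
    country_code="DE",
)
invoice = Invoice(
    invoice_id=15134,
    currency="JPY",
    amount=945.57,
    sku="PROPANE-ACCESSORIES",
    recipient=recipient,
)

# @dataclass generates __init__, __repr__ and __eq__ for us.
print(invoice)
print(invoice.recipient.country_code)
```

Keep in mind that the type annotations are not enforced at runtime: Invoice(invoice_id="15134", ...) would be accepted without complaint, which is one more reason to validate the input separately.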
Separating your concerns
One thing I want to do – to aid with testing – is separate the processing of the company from that of the invoice:
def process_raw_records(records):
    invoices = []
    for record in records:
        match record:
            case {"recipient": raw_recipient, **raw_invoice}:
                recipient = process_raw_recipient(raw_recipient)
                invoice = process_raw_invoice(raw_invoice)
                invoice.recipient = recipient
                invoices.append(invoice)
            case _:
                raise ValueError(f"Cannot parse structure {record}")
    return invoices
This function loops over every raw record in records. For each record it’ll attempt to match the structure of record against the declared pattern you see in the first case statement. The pattern I wrote is a bit diffuse, so let me explain why it looks the way it does.
I want to split the processing of the invoice and the recipient. To do this I declare a pattern that requires at least the key "recipient" and collects everything else – if there is anything else – into **raw_invoice. If the pattern does not match record it is, of course, skipped; in that case the default pattern _ is triggered, which raises a ValueError.
Recall that **something is the keyword notation in Python that usually expands a dictionary into key=value pairs for use in function calls or inside a dictionary. Here it means the literal opposite: collect key-value pairs and store them in the dictionary something.
The pattern matching engine is clever enough to understand that notation, and it neatly separates the logic that figures out what goes where to each respective function. That has a couple of benefits:
- Separation of Concerns and Ease of Testability
  I can test process_raw_recipient, process_raw_invoice and process_raw_records as a whole, or separately, to induce various test scenarios without having to awkwardly try and come up with a list of records that matches the set of behaviors I expect in my tests.
- Each function is standalone and can be used for other things
  You can invoke – and parse – both invoices and recipients separately. Imagine you had another API endpoint called /companies/ that you wanted to correlate the invoice recipients against. Now you can separately pull that data and seamlessly reuse the process_raw_recipient function.
Now let’s take a look at each processor.
def process_raw_recipient(raw_recipient):
    match raw_recipient:
        case {"company": company, "address": address, "country_code": country_code}:
            return Company(company=company, address=address, country_code=country_code)
        case _:
            raise ValueError(f"Cannot parse invoice recipient {raw_recipient}")


def process_raw_invoice(raw_invoice):
    match raw_invoice:
        case {
            "invoice_id": invoice_id,
            "currency": currency,
            "amount": amount,
            "sku": sku,
        }:
            return Invoice(
                invoice_id=invoice_id,
                currency=currency,
                amount=amount,
                sku=sku,
                recipient=None,
            )
        case _:
            raise ValueError(f"Cannot parse invoice {raw_invoice}")
These two functions each take raw dictionaries containing either an invoice recipient or the invoice itself.
Each respective case statement represents the declarative form of the dictionary I want to match. process_raw_recipient expects three keys: "company", "address" and "country_code".
In process_raw_invoice it’s the same situation but with different keys, of course, though I do specifically set recipient=None when I create the Invoice object. Why? Well, I don’t want this function to worry about the recipient or how it’s created:
- The process_raw_invoice function should only process invoices
  As far as that function’s concerned, it’s none of its business if there is a recipient or not. I could make it call process_raw_recipient and assign the Company instance I get back, but then I’d tightly couple the parsing of an invoice record to that of a company.
- The process_raw_records function is the controller
  Meaning, it is responsible for looping over each raw record; determining what it is; and correctly assembling the final form that we want. It’s very likely that function would grow over time to handle more things: remittance advice, purchase orders, etc.
With that out of the way, the basic extraction and most of the transformation is complete. Running the code works fine, too:
>>> for result in process_raw_records(get_invoices("https://demo.inspiredpython.com/invoices/")):
...     print(result)
...
Invoice(invoice_id=19757, currency='USD', amount=692.3, sku='PROPANE-ACCESSORIES',
        recipient=Company(company='Rosemann Freudenberger GmbH & Co. KGaA',
                          address='Eberthweg 56\n30431 Artern',
                          country_code='DE'))
# ... etc ...
Implementing the Quality Assurance Rules
Now that leaves the final parts of the transformation and loading. Earlier I described a few business rules I want to implement to quality-assure the data. I could do it with just the dictionaries and that would be fine in this example, but if you’re building something like this yourself, you are probably dealing with data that’s far more complex. Having a few simple, structured objects that you can stick properties and other helper methods on makes it a lot easier.
Luckily, using dataclasses does not impair our ability to use pattern matching. So let’s implement the first business rule:
Finding mismatched currencies and country codes
So let’s say I want to flag certain country code and currency combinations for human review in case someone in the accounting department messed up and picked the wrong currency field by mistake. That happens more often than you think.
def validate_currency(invoice: Invoice):
    match invoice:
        case Invoice(currency=currency, recipient=Company(country_code=country_code)):
            match (currency, country_code):
                case ("USD" | "GBP" | "EUR", _):
                    return True
                case ("JPY", "JP"):
                    return True
                case ("JPY", _):
                    return False
                case _:
                    raise ValueError(
                        f"No validation rule matches {(currency, country_code)}"
                    )
        case _:
            raise ValueError(f"Cannot parse structure {invoice}")
The validate_currency function takes a single invoice and returns True or False, depending on whether the currency is acceptable for the recipient’s country. It raises a ValueError if no rule covers the combination, or if the invoice cannot be matched at all.
Remember that you declare a pattern in a case statement. Python works out the nitty-gritty of how to match the subject against the pattern for you. Although Invoice(...) and Company(...) look like constructor calls here, Python does not create instances of either class. Instead it checks that the subject is an instance of the class and then matches the named attributes of the subject against the sub-patterns.
The really neat thing about pattern matching in Python is the ability to pick out attributes from object structures like the code above does. I only specify the things I want to pattern match, and because you can nest structures you are free to specify the full “contract” that your code must have with the data it requires.
Right, so if there’s a match – i.e., we pass an Invoice object with a Company in the recipient attribute – then we can proceed to the actual validation routine.
With the two bound names currency and country_code I fashion them into a tuple for no other reason than to make it easier for us, the humans, to read the intent of the code. I could just as easily turn it into a dictionary or some other structure, but a tuple is nice and easy to read.
The case statements capture the actual business rules and, I must say, in a very clean and readable manner. Let’s look at them piecemeal.
case ("USD" | "GBP" | "EUR", _):
    return True
This rule matches any tuple where the currency part of the tuple is one of "USD", "GBP", or "EUR". The second part of the tuple, the country_code, is _, the wildcard pattern, meaning it does not matter what its value is. It could be anything.
From our fictional business’s perspective the rule means that if you denominated your invoice in any of those three currencies then it does not matter what the recipient’s country is: a lot of multinationals denominate their invoices in one of those three, so the code returns True, indicating it’s valid.
The next two rules relate to the Japanese Yen specifically:
case ("JPY", "JP"):
    return True
case ("JPY", _):
    return False
The first declares that if you’re using Japanese Yen and paying a Japanese company then that’s sensible, as Japanese companies would probably prefer to be paid in their own currency. However, if that is not the case, the first case statement fails to match and the second one matches anything with the wildcard _, which then returns False, indicating the validation check fails.
case statements are tested in the order you wrote them in. Check for the most explicit and specific patterns first, and put the more generic “fallback” cases at the end. Ask yourself: what happens if you invert the order of the two case statements above?
Catching duplicate Invoice IDs
The second and final business rule is checking for duplicate invoice IDs. Another pernicious issue that can cause total mayhem if you’re not careful.
MAX_INVOICE_ID = 50000


def validate_invoice_id(invoice: Invoice, known_invoice_ids):
    match invoice:
        # The rule is "less than 50000", hence a strict comparison.
        case Invoice(
            invoice_id=int() as invoice_id
        ) if invoice_id < MAX_INVOICE_ID and invoice_id not in known_invoice_ids:
            known_invoice_ids.add(invoice_id)
            return True
        case Invoice(invoice_id=_):
            return False
        case _:
            raise ValueError(f"Cannot parse structure {invoice}")
Like the previous business rule, I match just the attributes I care about. Here it’s invoice_id. But I also assert that the named binding must be an integer by writing int() as invoice_id. Python will do some basic type checking to ensure that, indeed, it’s an integer, as our business rule prescribes. Additionally, I added a guard to check that the invoice ID is less than the maximum we can support, and that we haven’t seen it before.
I have opted to make it possible to supply an existing set of known invoice IDs. That is particularly useful, say, if you have a live system full of invoice IDs you want to check against also.
If that case statement matches, we make a note of the invoice ID by adding it to the set of known IDs and return True.
If the rule fails but there’s still an attribute called invoice_id, we simply return False to flag it for review by a human later.
Putting it all together
import csv
from dataclasses import asdict


def retrieve_invoices(url, known_ids=None):
    if known_ids is None:
        known_ids = set()
    validated_invoices = []
    flagged_invoices = []
    for invoice in process_raw_records(get_invoices(url)):
        # Building a list means both validators always run for every invoice.
        if not all(
            [validate_currency(invoice), validate_invoice_id(invoice, known_ids)]
        ):
            flagged_invoices.append(invoice)
        else:
            validated_invoices.append(invoice)
    return validated_invoices, flagged_invoices


def store_invoices(invoices, csv_file):
    fieldnames = [
        # Recipient Company
        "company",
        "address",
        "country_code",
        # Invoice
        "invoice_id",
        "currency",
        "amount",
        "sku",
    ]
    # Each merged row still carries a nested "recipient" key;
    # extrasaction="ignore" tells the writer to skip it.
    w = csv.DictWriter(csv_file, fieldnames=fieldnames, extrasaction="ignore")
    w.writeheader()
    w.writerows(
        [{**asdict(invoice), **asdict(invoice.recipient)} for invoice in invoices]
    )


def main():
    validated, flagged = retrieve_invoices("https://demo.inspiredpython.com/invoices/")
    # The csv module documentation recommends opening files with newline="".
    with open("validated.csv", "w", newline="") as f:
        store_invoices(validated, f)
    with open("flagged.csv", "w", newline="") as f:
        store_invoices(flagged, f)
All that’s left to do is to tie it all together. The retrieve_invoices function fetches the raw invoices and calls out to the processor code I wrote earlier. It also applies the business rules and, based on the outcome of those checks, separates the invoices into flagged_invoices or validated_invoices.
Finally it stores the invoices into two distinct CSV files. Python’s dataclasses module comes with a handy asdict helper function that pulls the typed attributes out of the object into a dictionary again so the CSV writer module knows how to store the data. And that’s it.
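To see what asdict produces and how the merged row reaches the CSV writer, here is a sketch that writes a single invoice to an in-memory buffer instead of a real file. The io.StringIO buffer and the shortened list of field names are choices made for this example only.

```python
import csv
import io
from dataclasses import asdict, dataclass
from typing import Optional


@dataclass
class Company:
    company: str
    address: str
    country_code: str


@dataclass
class Invoice:
    invoice_id: int
    currency: str
    amount: float
    sku: str
    recipient: Optional[Company]


invoice = Invoice(
    invoice_id=19757,
    currency="USD",
    amount=692.3,
    sku="PROPANE-ACCESSORIES",
    recipient=Company(
        "Rosemann Freudenberger GmbH & Co. KGaA", "Eberthweg 56\n30431 Artern", "DE"
    ),
)

# asdict() recurses into nested dataclasses, so "recipient" becomes a dict.
print(asdict(invoice)["recipient"])

# Merge the invoice fields and the recipient fields into one flat row.
row = {**asdict(invoice), **asdict(invoice.recipient)}

buffer = io.StringIO()  # in-memory stand-in for an open CSV file
writer = csv.DictWriter(
    buffer,
    fieldnames=["company", "country_code", "invoice_id", "currency", "amount"],
    extrasaction="ignore",  # skip keys not listed, such as "recipient"
)
writer.writeheader()
writer.writerow(row)
print(buffer.getvalue())
```

Without extrasaction="ignore" the writer would raise a ValueError here, because the merged row contains keys that are not in fieldnames.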
Summary
- Pattern Matching is a natural way of expressing the structure of data and extracting the information you want
  As this demo project showed you, it’s easy to capture business rules that pertain to the structure of your data and extract the information you need from it at the same time. And it’s easy to add or amend rules.
- Patterns are declarative
  Like I mentioned in Mastering Structural Pattern Matching, it’s the most important concept to take away from all of this. Writing Python is imperative. You tell Python what to do and when. But with a pattern you declare the result you want and leave the thinking to Python. For instance, I did not write any existence checks in validate_currency to check if an invoice has a recipient at all! I leave that to Python so I can focus on writing the actual business logic.