A field guide for reading relational algebra in database papers

A field guide for reading relational algebra in database papers

If you are into databases, you will see many interesting algebraic expressions, signs, and descriptions that make content hard to understand for a newcomer. These things are mostly coming out while looking at the pipeline figures, algorithm definitions, and so on.

This post is not a complete coverage of all descriptions, but it serves as a guide for understanding what authors are saying and decoding them. Mind that, even it is nice to use detailed descriptions in the papers, some authors try to evade that as much as possible to gain space for explaining things more in-depth. And mind that formal description models tend to serve this purpose.

Here I will explain the relational algebra semantics in papers:

\(\Gamma\) – Gamma

Gamma represents group by. Which is an operation that groups data based on one or more property to later apply sort, or aggregation over it.

You can see group by operation notation like this too:

$$\Gamma_{s\_nationkey,\ count \left(^{*}\right)}$$

which expresses grouping by both s_nationkey and count(*) together, in that operation application order.

\(\bowtie\) – Bowtie

Mighty hash join, it's taking tuple with cardinality of 2 and works out it's way to filter on the logical pattern given to the combiner. It is basically a root operation for joins and always come with it's condition or similar representation for it.

$$\bowtie_{column\_1\ =\ column\_2}$$

⇞ and ⇟ – Double stroke arrows

Both of them are representing sort by and arrow's direction shows the ordering rule, which are ascending and descending.

Sometimes first sign ⇞ commonly used no matter what direction the ordering is but ordering is explicitly mentioned like:

$$⇞_{desc}$$

inversely:

$$⇞_{asc}\ or\ ⇞$$

since most of the databases defaults to ascending order, expression for ordering is shortcuted.

\(\sigma\) – Sigma

Sigma represents filter conditions and selection procedures, on SQL representations it is WHERE and combination of logical operators like AND, OR and such or directly column/s to select:

$$\sigma_{order\_value}$$

Mind that, sigma, for the sake of clarity also have a condition coming as subscript.

$$\sigma_{order\_value\ <\ 30.0}$$

Moreover, combined filters are also a thing in a single sigma expr:

$$\sigma_{order\_value\ <\ 30.0\ \cap\ order\_value > 20.0}$$

is implicit representation of WHERE order_value < 30.0 AND order_value > 20.0.

Aggregations

Most of the aggregations are written in plain texts they are not represented in signature forms since there are a lot of aggregation variants in different forms in many databases. This topic is also not heavily searched in terms of it's specifics too. I think we need more targeted papers for aggregations anyway. So example aggregation can be:

$$M_{max\\(visits\\)}customer$$

Things that are not mentioned a lot are semijoins and antijoins, it is nice to know their representations too:

\(\ltimes\) – Left Semidirect Product

This represents left semijoin. Left semijoins are hash joins without left dataset's non-join columns. That's the difference between semijoin and natural join:

$$L \ltimes R$$

\(\rtimes\) – Right Semidirect Product

Right semijoin is the same approach but joining over the right dataset. Represented as:

$$L \rtimes R$$

\(\triangleright\) – Right Triangle

Antijoin representation, it is a complement of left semijoin that discards joined rows and returns disjunction of it. You might see right facing triangles also to represent right antijoin.

$$L \triangleright R$$

Summary

With these explained, we have got a basic idea of the relational algebra in database papers. Mind that most of the papers are trying to hit on hash joins and projection optimizations, it is nice to know the different join types, filtering, sorting, and other representations you might encounter in database research papers. I hope you find this small guide useful. Have a good day!

Show Comments