Big O notation describes the *time complexity* of algorithms: the relation between the *cardinality* of the input (the number of items) and the time it takes to process that input.

Lack of attention to the scalability of algorithms (in the design, development and deployment phases) is a common root cause of performance (and, by extension, stability) problems that may reveal themselves only over time.

- developing against low-cardinality input (e.g. unrealistically small test datasets) prevents developers from being confronted with future reality.
- in the age of plenty, efficiency can initially appear to be of lesser importance, until poor scalability catches up and stops the system dead in its tracks.

Big O notation describes the completion time of algorithms as the cardinality grows without bound: the limiting behaviour of the algorithm. Real-world time complexities can be complicated functions that are hard to work with, but they are bounded from above by simpler functions that serve as an approximation. For example, a full table scan (bulk reading all of a table's data) depends linearly on the size of the table being scanned: if the table is twice as big, the runtime doubles. So the time complexity function might be something like \(T(n) = a \cdot n + b\). No matter what the constants \(a\) and \(b\) are, as \(n\) grows ever larger, the \(a \cdot n\) term dominates the function, and we write that '\(T(n)\) is a member of the family of functions that scale as \(n\)', or in short, \(T(n) \in O(n)\), or loosely 'a full table scan is \(O(n)\)'.
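A full table scan can be sketched as a linear pass over all rows. This is an illustrative Python sketch (the table layout and names are assumptions, not any particular database's implementation): each row costs a constant amount of work, giving exactly the \(T(n) = a \cdot n + b\) shape described above.

```python
# Sketch: a 'full table scan' as a linear pass over a list of rows.
# The table representation (list of dicts) is illustrative only.
def full_table_scan(rows, predicate):
    """Return all rows matching predicate; visits every row, so O(n)."""
    matches = []
    for row in rows:          # n iterations: runtime grows linearly with len(rows)
        if predicate(row):    # constant work per row (the 'a' in T(n) = a*n + b)
            matches.append(row)
    return matches

table = [{"id": i, "name": f"user{i}"} for i in range(1000)]
hits = full_table_scan(table, lambda r: r["id"] == 42)
```

Doubling `len(table)` doubles the number of loop iterations, which is the linear behaviour in question.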

An algorithm with time complexity not depending on the cardinality at all is said to run in constant time, and denoted as \(O(1)\).
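A common example of \(O(1)\) behaviour is a hash-based lookup, as in a Python dict: on average, the lookup cost does not grow with the number of entries. A minimal sketch:

```python
# Sketch: constant-time lookup via a hash map (Python dict).
# Average lookup cost is O(1): it does not grow with the number of entries.
by_id = {i: f"user{i}" for i in range(100_000)}
name = by_id[42]   # one hash computation + one bucket access, regardless of dict size
```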

Algorithms with better scalability can still be outperformed by algorithms with poorer scalability when the input is (very) small; this knowledge can be used by a wrapper algorithm that selects the optimal algorithm internally. Most database systems do this for 'small tables', where an index-based table lookup would be slightly more expensive than just reading all the table data directly.
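Such a wrapper could be sketched as follows. The cutoff value of 8 is purely illustrative (real systems tune this empirically): below it, a plain \(O(n)\) scan wins because it has no setup overhead; above it, the \(O(\log n)\) binary search takes over.

```python
import bisect

# Hypothetical wrapper that picks the algorithm by input size.
# The cutoff of 8 is an illustrative assumption, not a tuned value.
SMALL_TABLE_CUTOFF = 8

def contains(sorted_values, target):
    """Membership test on a sorted list, choosing the algorithm by size."""
    if len(sorted_values) <= SMALL_TABLE_CUTOFF:
        # O(n) linear scan: cheapest for tiny inputs, no per-call overhead
        return target in sorted_values
    # O(log n) binary search for larger inputs
    i = bisect.bisect_left(sorted_values, target)
    return i < len(sorted_values) and sorted_values[i] == target
```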

### Some common algorithms and their scalability

application | algorithm | scalability | cardinality |
---|---|---|---|
database | full table scan | \(O(n)\) | \(n\) = number of table rows |
database | btree index lookup | \(O(\log n)\) | \(n\) = number of table rows |
database | nested loops join, 1 set has constant cardinality | \(O(n)\) | \(n\) = number of rows in the scaling set |
database | nested loops join, both sets have scaling cardinality | \(O(n_1 n_2)\) | \(n_{1,2}\) = number of rows in set 1, 2 |
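The nested loops join from the table above can be sketched in a few lines of Python (the table and column names are illustrative): every row of one set is compared against every row of the other, giving \(O(n_1 n_2)\) work.

```python
# Sketch of a nested loops join: each row of 'left' is compared against
# each row of 'right', so the total work is O(n1 * n2).
def nested_loops_join(left, right, key):
    joined = []
    for l in left:            # n1 iterations
        for r in right:       # n2 iterations per left row
            if l[key] == r[key]:
                joined.append({**l, **r})
    return joined

users = [{"uid": 1, "name": "ann"}, {"uid": 2, "name": "bob"}]
orders = [{"uid": 1, "total": 10}, {"uid": 1, "total": 7}]
result = nested_loops_join(users, orders, "uid")
```

When one of the two sets has constant cardinality, the inner loop runs a fixed number of times and the join degrades gracefully to \(O(n)\), matching the table's first nested-loops row.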

For database systems, spectacular performance improvements can be achieved by creating the proper indexes for the load (the access paths and cardinalities of the queries). A lookup in a B-tree index is \(O(\log n)\), which scales *much* better than the full table scan's \(O(n)\). A 'missing index' is a common root cause of database performance problems, but may manifest itself only over time, when the number of rows gets big.
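The size of that difference is easy to quantify. The sketch below counts the worst-case number of comparisons for a balanced binary lookup structure; it assumes a perfectly balanced tree (real B-tree nodes have a much larger fan-out, so actual indexes need even fewer levels):

```python
import math

# Worst-case comparisons for a balanced binary lookup over n items.
# Assumes a perfectly balanced structure; real B-trees do even better
# because each node holds many keys.
def max_lookup_steps(n):
    return math.ceil(math.log2(n)) if n > 1 else 1

# A table of 1,000,000 rows: a full scan may touch 1,000,000 rows,
# while an index lookup needs only about 20 comparisons.
steps = max_lookup_steps(1_000_000)
```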

### other scalability examples

- This excellent article demonstrates the better scalability of ipset+iptables (hash lookups, \(O(1)\) on average) versus iptables-only (a linear walk of the rule list, \(O(n)\)).
