From eac73c9a7bd680785eb90db18fdc55773f7d9dde Mon Sep 17 00:00:00 2001 From: Lennart Spitzner Date: Thu, 11 May 2017 17:36:28 +0200 Subject: [PATCH] Add more high-level documentation: theory.md --- docs/implementation/theory.md | 313 ++++++++++++++++++++++++++++++++++ 1 file changed, 313 insertions(+) create mode 100644 docs/implementation/theory.md diff --git a/docs/implementation/theory.md b/docs/implementation/theory.md new file mode 100644 index 0000000..7795a2a --- /dev/null +++ b/docs/implementation/theory.md @@ -0,0 +1,313 @@ +# Introduction + +[The readme](../../master/README.md) mentions a couple of goals for this +project, including the following two: + +- Be clever about using the available horizontal space while not overflowing + it if it cannot be avoided; +- Have linear complexity in the size of the input. + +These two goals stand in conflict, and this project chooses a solution +to this that distinguishes it from (all) other existing formatters. This +idea was the motivation to write Brittany in the first place, and it might +even be applicable to more general-purposes structured-document formatters +(although we will not expand on that here). Before explaining the idea, we +will first consider + +## The problem + +Every haskell module can be written in a single line - of course, in most +cases, an unwieldly long one. We humans prefer our lines limitted to some +laughingly small limit like 80 or 160 or whatever. Further, we generally +prefer the indentation of our expressions(, statements etc.) line up with +its syntactic structure. This preferences (and the layouting rule which +already enforces it partially) still leaves a good amount of choice for +where exactly to insert newlines. For example, these are all consistent +in their newline/indentation usage: + +~~~~.hs +myRecord = MyRecord { abc = "abc", def = "def"} +myRecord = MyRecord + { abc = "abc" + , def = "def" + } +myList = [ "abc", "def"] +myList = + [ "abc" + , "def" + ] +~~~~ + +While consistency has the first priority, we also prefer short code: If it +fits, we prefer the version/layout with less lines of code. So we wish to trade +more lines for less columns, but only until things fit. + +For simple cases we can give a trivial rule: If there is space +for the one-line layout, use it; otherwise use the indented-multiline +layout (this applies to `myRecord` example). The +straight-forward approach works well: Calculate the length of the +one-line layout for each node, then recurse top-down and make the choices +according to that rule. Linear runtime with good results. + +Things get more interesting with nesting, and with structures with more than +two alternatives. Consider the following four alternative layouts of the +same expression: + +~~~~.hs +-- 10 20 30 40 +-- 1) -- | lets assume the user wants +nestedCaseExpr = case e1 of -- | 40 columns max. + Left x -> if func x then "good" else "bad" -- too long +-- 2) -- +nestedCaseExpr = case e1 of -- + Left x -> if func x -- + then "good" -- + else "bad" -- +-- 3) -- +nestedCaseExpr = case e1 of -- + Left x -> -- + if func x then "good" else "bad" -- +-- 4) -- +nestedCaseExpr = case e1 of -- + Left x -> -- + if func x -- + then "good" -- + else "bad" -- +~~~~ + +- We nest a case-expression and an if-expression; for both we consider two + alternative layouts, meaning that +- There are a total of four, and in general an exponential amount of + possible layouts; +- With a top-down approach, one would choose 2), as "if func x" has space in + the current line. But.. +- Layout 3) is the optimal solution considering lines-of-code and the 40-column + limit. + +So our question: Is there an algorithm which, given some input syntax tree, +returns the/an optimal (least lines-of-code while respecting max-column limit) +layout (out of a potentially exponential number of valid layouts). Further, +this algorithm's time/space usage should be linear in the size of the input +(we might weaken this to some polynomial upper bound - but that should be the +limit). + +If we pessimistically assume that such algorithm does not exist, we may ask +alternatively: What linear algorithm returns solutions which, +on average, are closest to the optimal solution? + +In the following we will describe _one_ such algorithm that seems to return +the optimal solution in several non-trivial cases, and near-optimal solutions +in many other. We won't try to prove that it is the best algorithm, but we will +consider the circumstances for which a non-optimal solution is returned. + +## The Reasoning + +A top-down approach is so bad, because when there are exponentially many +layouts to consider, there information passed down from the parents does +not help at all in pruning the alternatives on a given layer. In the above +`nestedCaseExpr` example, we might obtain a better solution by looking not +at just the first, but the first n possible layouts, but against an exponential +search-space, this does not scale: Just consider the possibility that there +are exponentially many sub-solutions for layout 2) (replace "good" with some +slightly more complex expression). You basically always end up with either +"the current line is not yet full, try to fill it up" or +"more than n columns used, abort". + +But a (pure) bottom-up approach does not work either: If we have no clue about +the "current" indentation while layouting some node of our syntax tree, +information about the (potential) size of (the layout of) child-nodes does +not allow us to make good decisions. + +So we need information to flow bottom-to-top to allow for pruning whole trees +of possible layouts, and top-to-bottom for making the actual decisions.. well, +this can be arranged. + +# The Algorithm + +This algorithm works in two passes over the input syntax tree. The first pass +is bottom-up and adds a label to each node. This label contains (a set of) +"spacings" - certain information regarding the number of columns/lines needed +for the textual representation of potential layouts of that node. +For example we might return two possible spacings for the following (same) +expression: + +~~~~.hs +if func x then "good" else "bad" +-- => 1 line, 32 columns used +if func x + then "good" + else "bad" +-- => 3 lines, 13 columns used +~~~~ + +This is heavily simplified; in Brittany spacing information is (as usual) a +bit more complex. + +We restrict the size of these sets. Given the sets of spacings for the +child-nodes in the syntax-tree, we generate a limited number of possible +spacings in the current node. We then prune nodes that already violate desired +properties, e.g. any spacing that already uses more columns locally than +globally available. + +The second pass is top-down and uses the spacing-information to decide on one +of the possible layouts for the current node. It passes the current +"cursor position" (indentation, etc.) downwards, allowing to check that the +layout fits (e.g. ensure "current indentation + columns used < max columns"). + +This algorithm is trivially linear - two traversals and only linear space +required per node of the input. + +### Consequences + +- when calculating spacings during bottom-up, several spacings are combined + into a single one via min/max/some other associative operation. Perhaps + consider how in a multi-line list layout the "columns used" will be derived + from the element spacings. For additional information embedded into the + spacings, we will need at least one such Semigroup instance. + +- We require an order on the alternatives for each syntactical construct. + The first alternative where we find some combination of spacings for the + children which is acceptable will be used. + + In unlucky cases, all spacings might get pruned. In that case, we will + default to the last alternative, which therefor should be the most + "conservative" choice, i.e. the one that gives the child-nodes the most + space. + + The first alternative should always be the "one-liner" layout, if it exists; + this alternative is preferable in general and will be filtered first should + it not fit. + + In between those two extremes can be other choices that gradually trade + "columns used" for "lines used". + +- Sometimes there exist several (near) optimal solutions. E.g. when there exist + multiple nodes where we can trade lines for columns in such a way that the + whole result fits. + + In such cases, Brittany makes the following choice: We prefer "trading" + in the node higher up in the syntax tree. The reasoning is that we rather + spend one line early on to gain additional horizontal space for _all_ the + children than the other way around. (In those cases where there are + alternatives to choose from, there are often several children.) We can apply + this to the `nestedCaseExpr` example above: The options are to either + put the right-hand-side of the case-alternative into a new line, or + split up the if-then-else. The "case" is the outer one, so Brittany will + prefer 3) over 2), which proves to be the right choice (here). + + As a consequence, we are most interested in the maximum spacings; yet, we do + not have a total order because we cannot generally prefer one of two spacings + where the first uses more columns, the second more lines. + +- The number of syntactical constructs in haskell is large. If we were to + work directly on the syntax tree to do our traversals, we'd have to write two + (or even three - one to generate spacings, one to make the choices, one to + do the output and insert comments) different functions each respecting every + syntactical construct in haskell. It would be an incredible amount of code + (and work) times three. + + Instead we can use some recursive data-type describing a structured document, + which abstracts of different syntactical constructs and only considers the + things relevant for layouting. This data-type is called `BriDoc`. + +- The `BriDoc` tree has an exponential number of nodes, but it is linear when + sharing is considered - the child-nodes can (and must) be re-used across + different alternatives. In the `nestedCaseExpr` example above, note how + there are four layouts, but essentially only two ways in which the "if" is + layouted. Either as a single line or with then/else on new lines. We can + handle spacings in such a way that we can share them for 1/3 and 2/4. This + already hints at how "columns used" will need to be redesigned slightly so + that 2/4 really have the same spacing label at the "if". + +### Concessions/non-Optimality + +- Prefering to trade in the higher node can give non-optimal results, e.g. + when there is only one child. + +- Spacings and the pruning of spacings happens in bottom-up fashion, meaning + that we might not prune a "70 columns used" spacing even though we later + notice that the current indentation already needs to be 20 and we only have + 60 columns left. If all spacings found in one label have this property, + the top-down traversal might be forced to make the "conservative" choice, + trading more lines for less indentation to prevent potential (but unknown) + overflow during child-node layouting. + +- We can increase the limits arbitrarily (i.e. increase the constant factor + while remaining linear) to get better (optimal) results in more cases. We can + choose constant factors in such a way that we get optimal results for all + inputs up to a certain size. Of course calling them "constant factors" is + delusional if they grow exponentially - but it is still nice that this + heuristical algorithm can trivially be transformed into the perfect + (but exponential) algorithm. + +### Examples of non-Optimality and Border-Cases in General + +TODO + +# Practical details + +## Comments + +Comments don't affect semantics and thus are free-form; as the user we almost +never want them reformatted in any way. The tempting approach is to keep them +entirely separate: Don't layout comments, and don't let comments affect +layout. However, this not possible always: + +~~~~.hs +val = f -- useful comment here + x +-- reformatted to +val = f -- useful comment here x +~~~~ + +Does not work too well, even when the one-liner "f x" would certainly be the +layout of choice. + +Brittany handles this in two ways: + +1) Some alternatives are pruned based on the existence of comments; +2) Always insert newlines after end-of-line comments, and restore the + current indetation to prevent violation of the layouting rule: + + ~~~~.hs + val = do + myAction -- useful comment here + x + g + ~~~~ + +## CPP + +Conditional text-based source-code insertion is horrible. +"#if"-guarded code in one line can easily affect in which way all other code +is parsed. This makes it more or less impossible to layout code involving +the preprocessor. One might be able to re-implement the preprocessor to then +determine when things are safe to layout, but that is no fun, and Brittany +does not bother. + +Brittany allows one thing: Working on the already pre-processed code, treating +any disabled sections like comments. This is not safe in general, but is nice +when the user makes only responsible usage of CPP (e.g.: only put "#if"-guards +around module top-level constructs). + +## Horizontal alignment + +Sometimes horizontal alignment can make things more readable. However there +are good reasons against using such whitespace: It can cause larger diffs +on simple changes and it is rather subjective in which cases things are +"more readable" rather than "annoyingly spaced". + +Nonetheless Brittany has a fully-featured implementation for horizontal +alignment: + +~~~~.hs +func (MyLongFoo abc def) = 1 +func (Bar a d ) = 2 +func _ = 3 +~~~~ + +Again we have to ask the question: Does alignment affect the layouting choices, +and does layouting affect alignment? The answer is clear in this case: No. +First we make layouting-choices, then, independently, we add alignment but only +in those cases where it does not cause overflows. +