Add more high-level documentation: theory.md

pull/35/head
Lennart Spitzner 2017-05-11 17:36:28 +02:00 committed by GitHub
parent 7775812cfd
commit eac73c9a7b
1 changed files with 313 additions and 0 deletions

View File

@ -0,0 +1,313 @@
# Introduction
[The readme](../../master/README.md) mentions a couple of goals for this
project, including the following two:
- Be clever about using the available horizontal space while not overflowing
it if it cannot be avoided;
- Have linear complexity in the size of the input.
These two goals stand in conflict, and this project chooses a solution
to this that distinguishes it from (all) other existing formatters. This
idea was the motivation to write Brittany in the first place, and it might
even be applicable to more general-purposes structured-document formatters
(although we will not expand on that here). Before explaining the idea, we
will first consider
## The problem
Every haskell module can be written in a single line - of course, in most
cases, an unwieldly long one. We humans prefer our lines limitted to some
laughingly small limit like 80 or 160 or whatever. Further, we generally
prefer the indentation of our expressions(, statements etc.) line up with
its syntactic structure. This preferences (and the layouting rule which
already enforces it partially) still leaves a good amount of choice for
where exactly to insert newlines. For example, these are all consistent
in their newline/indentation usage:
~~~~.hs
myRecord = MyRecord { abc = "abc", def = "def"}
myRecord = MyRecord
{ abc = "abc"
, def = "def"
}
myList = [ "abc", "def"]
myList =
[ "abc"
, "def"
]
~~~~
While consistency has the first priority, we also prefer short code: If it
fits, we prefer the version/layout with less lines of code. So we wish to trade
more lines for less columns, but only until things fit.
For simple cases we can give a trivial rule: If there is space
for the one-line layout, use it; otherwise use the indented-multiline
layout (this applies to `myRecord` example). The
straight-forward approach works well: Calculate the length of the
one-line layout for each node, then recurse top-down and make the choices
according to that rule. Linear runtime with good results.
Things get more interesting with nesting, and with structures with more than
two alternatives. Consider the following four alternative layouts of the
same expression:
~~~~.hs
-- 10 20 30 40
-- 1) -- | lets assume the user wants
nestedCaseExpr = case e1 of -- | 40 columns max.
Left x -> if func x then "good" else "bad" -- too long
-- 2) --
nestedCaseExpr = case e1 of --
Left x -> if func x --
then "good" --
else "bad" --
-- 3) --
nestedCaseExpr = case e1 of --
Left x -> --
if func x then "good" else "bad" --
-- 4) --
nestedCaseExpr = case e1 of --
Left x -> --
if func x --
then "good" --
else "bad" --
~~~~
- We nest a case-expression and an if-expression; for both we consider two
alternative layouts, meaning that
- There are a total of four, and in general an exponential amount of
possible layouts;
- With a top-down approach, one would choose 2), as "if func x" has space in
the current line. But..
- Layout 3) is the optimal solution considering lines-of-code and the 40-column
limit.
So our question: Is there an algorithm which, given some input syntax tree,
returns the/an optimal (least lines-of-code while respecting max-column limit)
layout (out of a potentially exponential number of valid layouts). Further,
this algorithm's time/space usage should be linear in the size of the input
(we might weaken this to some polynomial upper bound - but that should be the
limit).
If we pessimistically assume that such algorithm does not exist, we may ask
alternatively: What linear algorithm returns solutions which,
on average, are closest to the optimal solution?
In the following we will describe _one_ such algorithm that seems to return
the optimal solution in several non-trivial cases, and near-optimal solutions
in many other. We won't try to prove that it is the best algorithm, but we will
consider the circumstances for which a non-optimal solution is returned.
## The Reasoning
A top-down approach is so bad, because when there are exponentially many
layouts to consider, there information passed down from the parents does
not help at all in pruning the alternatives on a given layer. In the above
`nestedCaseExpr` example, we might obtain a better solution by looking not
at just the first, but the first n possible layouts, but against an exponential
search-space, this does not scale: Just consider the possibility that there
are exponentially many sub-solutions for layout 2) (replace "good" with some
slightly more complex expression). You basically always end up with either
"the current line is not yet full, try to fill it up" or
"more than n columns used, abort".
But a (pure) bottom-up approach does not work either: If we have no clue about
the "current" indentation while layouting some node of our syntax tree,
information about the (potential) size of (the layout of) child-nodes does
not allow us to make good decisions.
So we need information to flow bottom-to-top to allow for pruning whole trees
of possible layouts, and top-to-bottom for making the actual decisions.. well,
this can be arranged.
# The Algorithm
This algorithm works in two passes over the input syntax tree. The first pass
is bottom-up and adds a label to each node. This label contains (a set of)
"spacings" - certain information regarding the number of columns/lines needed
for the textual representation of potential layouts of that node.
For example we might return two possible spacings for the following (same)
expression:
~~~~.hs
if func x then "good" else "bad"
-- => 1 line, 32 columns used
if func x
then "good"
else "bad"
-- => 3 lines, 13 columns used
~~~~
This is heavily simplified; in Brittany spacing information is (as usual) a
bit more complex.
We restrict the size of these sets. Given the sets of spacings for the
child-nodes in the syntax-tree, we generate a limited number of possible
spacings in the current node. We then prune nodes that already violate desired
properties, e.g. any spacing that already uses more columns locally than
globally available.
The second pass is top-down and uses the spacing-information to decide on one
of the possible layouts for the current node. It passes the current
"cursor position" (indentation, etc.) downwards, allowing to check that the
layout fits (e.g. ensure "current indentation + columns used < max columns").
This algorithm is trivially linear - two traversals and only linear space
required per node of the input.
### Consequences
- when calculating spacings during bottom-up, several spacings are combined
into a single one via min/max/some other associative operation. Perhaps
consider how in a multi-line list layout the "columns used" will be derived
from the element spacings. For additional information embedded into the
spacings, we will need at least one such Semigroup instance.
- We require an order on the alternatives for each syntactical construct.
The first alternative where we find some combination of spacings for the
children which is acceptable will be used.
In unlucky cases, all spacings might get pruned. In that case, we will
default to the last alternative, which therefor should be the most
"conservative" choice, i.e. the one that gives the child-nodes the most
space.
The first alternative should always be the "one-liner" layout, if it exists;
this alternative is preferable in general and will be filtered first should
it not fit.
In between those two extremes can be other choices that gradually trade
"columns used" for "lines used".
- Sometimes there exist several (near) optimal solutions. E.g. when there exist
multiple nodes where we can trade lines for columns in such a way that the
whole result fits.
In such cases, Brittany makes the following choice: We prefer "trading"
in the node higher up in the syntax tree. The reasoning is that we rather
spend one line early on to gain additional horizontal space for _all_ the
children than the other way around. (In those cases where there are
alternatives to choose from, there are often several children.) We can apply
this to the `nestedCaseExpr` example above: The options are to either
put the right-hand-side of the case-alternative into a new line, or
split up the if-then-else. The "case" is the outer one, so Brittany will
prefer 3) over 2), which proves to be the right choice (here).
As a consequence, we are most interested in the maximum spacings; yet, we do
not have a total order because we cannot generally prefer one of two spacings
where the first uses more columns, the second more lines.
- The number of syntactical constructs in haskell is large. If we were to
work directly on the syntax tree to do our traversals, we'd have to write two
(or even three - one to generate spacings, one to make the choices, one to
do the output and insert comments) different functions each respecting every
syntactical construct in haskell. It would be an incredible amount of code
(and work) times three.
Instead we can use some recursive data-type describing a structured document,
which abstracts of different syntactical constructs and only considers the
things relevant for layouting. This data-type is called `BriDoc`.
- The `BriDoc` tree has an exponential number of nodes, but it is linear when
sharing is considered - the child-nodes can (and must) be re-used across
different alternatives. In the `nestedCaseExpr` example above, note how
there are four layouts, but essentially only two ways in which the "if" is
layouted. Either as a single line or with then/else on new lines. We can
handle spacings in such a way that we can share them for 1/3 and 2/4. This
already hints at how "columns used" will need to be redesigned slightly so
that 2/4 really have the same spacing label at the "if".
### Concessions/non-Optimality
- Prefering to trade in the higher node can give non-optimal results, e.g.
when there is only one child.
- Spacings and the pruning of spacings happens in bottom-up fashion, meaning
that we might not prune a "70 columns used" spacing even though we later
notice that the current indentation already needs to be 20 and we only have
60 columns left. If all spacings found in one label have this property,
the top-down traversal might be forced to make the "conservative" choice,
trading more lines for less indentation to prevent potential (but unknown)
overflow during child-node layouting.
- We can increase the limits arbitrarily (i.e. increase the constant factor
while remaining linear) to get better (optimal) results in more cases. We can
choose constant factors in such a way that we get optimal results for all
inputs up to a certain size. Of course calling them "constant factors" is
delusional if they grow exponentially - but it is still nice that this
heuristical algorithm can trivially be transformed into the perfect
(but exponential) algorithm.
### Examples of non-Optimality and Border-Cases in General
TODO
# Practical details
## Comments
Comments don't affect semantics and thus are free-form; as the user we almost
never want them reformatted in any way. The tempting approach is to keep them
entirely separate: Don't layout comments, and don't let comments affect
layout. However, this not possible always:
~~~~.hs
val = f -- useful comment here
x
-- reformatted to
val = f -- useful comment here x
~~~~
Does not work too well, even when the one-liner "f x" would certainly be the
layout of choice.
Brittany handles this in two ways:
1) Some alternatives are pruned based on the existence of comments;
2) Always insert newlines after end-of-line comments, and restore the
current indetation to prevent violation of the layouting rule:
~~~~.hs
val = do
myAction -- useful comment here
x
g
~~~~
## CPP
Conditional text-based source-code insertion is horrible.
"#if"-guarded code in one line can easily affect in which way all other code
is parsed. This makes it more or less impossible to layout code involving
the preprocessor. One might be able to re-implement the preprocessor to then
determine when things are safe to layout, but that is no fun, and Brittany
does not bother.
Brittany allows one thing: Working on the already pre-processed code, treating
any disabled sections like comments. This is not safe in general, but is nice
when the user makes only responsible usage of CPP (e.g.: only put "#if"-guards
around module top-level constructs).
## Horizontal alignment
Sometimes horizontal alignment can make things more readable. However there
are good reasons against using such whitespace: It can cause larger diffs
on simple changes and it is rather subjective in which cases things are
"more readable" rather than "annoyingly spaced".
Nonetheless Brittany has a fully-featured implementation for horizontal
alignment:
~~~~.hs
func (MyLongFoo abc def) = 1
func (Bar a d ) = 2
func _ = 3
~~~~
Again we have to ask the question: Does alignment affect the layouting choices,
and does layouting affect alignment? The answer is clear in this case: No.
First we make layouting-choices, then, independently, we add alignment but only
in those cases where it does not cause overflows.