Sunday, November 9, 2014

How do computers work?

Motivation

If you do a google search for "how do computers work?" you'll find that it's a pretty common question.  You'll get results like howstuffworks.  These explanations tend to be pretty high level.  They're stuff I've understood for over a decade from just using computers heavily, and I expect most other nerds know as well.  I understand that a computer has a BIOS which loads the operating system which then runs programs.  I understand how you can translate a human's wishes into lines of computer code.  But how do those lines of code actually perform calculations using electrons in the hardware?  That is the question I wanted answered.

I've seen hardware explanations before, mostly at informal levels which usually meant they were quite short and focused on one particular aspect.  As I work my way through a computer engineering degree, one day it occurred to me that I largely had the the ability to answer this question.

I decided to write out an outline for how a computer worked on a hardware level, mostly as a summary for my own gratification (and to identify gaps).  After I was done I figured that I might as well get one good blog post out of my degree.

My aim will be to explain how something works and then explain how we can build the next level up with that.  Keep in mind this won't always correspond to how things are actually built in practice, as it is often much more complex.  My goal is just to show how you could build it if you weren't concerned with performance and price.


What you should already know

I'll be writing this for an audience that has at least a bit of exposure to programming, and for who the above howstuffworks article was largely a review.  If you don't know anything about programming it won't matter too much, as we won't be getting to anything like that for quite a while.  For the very beginning I'll assume you're at least mildly familiar with chemistry and electricity.  I have to start at some level, and frankly don't feel comfortable explaining things at a lower level than this.

Before we begin, I'd also like to remind my readers that I have no idea what I am talking about, and am probably wrong about everything here.



Semiconductors

Wikipedia says "A semiconductor is a material which has electrical conductivity between that of a conductor such as copper and that of an insulator such as glass.", this definition obviously follows from the name, but it never really explained things to me.  It would seem that a semiconductor is just a resistor, which clearly isn't true.  

The key fact about semiconductors is that their conductivity can be changed easily.  Instead of just having a resistance which might vary slightly with temperature, a semiconductor can go from being an insulator to a conductor with the application of an electric field.


How do semiconductors work?

Silicon is a semiconductor.  Look at a periodic table, and you'll see that silicon sits 4 elements away from the left, and 4 elements away from the right (discounting the noble gases).  This means silicon has an equal number (4 specifically) of electrons and empty electrons spots in its outer shell, and therefore, bonds readily with itself.  The result is that silicon forms a nice, stable, crystalline structure.

However, this structure can be quickly messed up by even slight impurities.  Phosphorus is to the right of silicon, and has 5 electrons in its outer shell (ie, 5 valance electrons).  Putting a tiny bit of phosphorus in some otherwise pure silicon will create a crystal that has the occasional extra electron which doesn't really have a proper home.  On the other hand, boron is to the left (and up but ignore that) and has 3 valance electrons.  Putting some boron in silicon will create "holes" where there would otherwise be electrons but aren't.  It turns out that while pure silicon crystal doesn't really like to let its electrons move around (ie electrical current), if you have some of these extra electrons or holes they will move around with a little motivation.

The process of adding a bit of impurities to silicon is called doping.  Doping silicon with phosphorus will give a n-type semiconductor.  N for negative since electrons are negative and that is what we have a surplus of.  Dope with boron and we have p-type, for positive since the absence of electrons is usually viewed as positive.

I feel compelled to note here that in all cases the overall charge is neutral (ie, there is equal positive and negative charges).  This is because when we add electrons we are also adding protons in the nuclei.  But protons don't move, and we are only concerned with the sea of electrons and how the number deviates from what we would expect from pure silicon.



The Transistor

The transistor is the basis for all modern computers.  You could make a strong case for it being the most important invention of the 20th century.  The transistor is made possible by semiconductors.

If you take some p and n type semiconductors and layer them, the excess electrons in the n type will drift towards the holes in the p type.  An equilibrium is reached when the negative charge of the extra electrons taking up residence in the p type semiconductor repels further electrons from coming over to fill more holes.  If we now apply a positive voltage to the p-type layer that will give those extra electrons a place to go (they are attracted to the positive charge and will flow towards it).  With those electrons gone, more electrons will flow from the n-type layer.

Sandwich the p-type layer between two n-type layers and when the voltage is applied to the middle p-type layer a current will be able to flow from the one n-type end to the other n-type end.  In addition to applying a voltage at the middle layer, we also apply a voltage across the entire 3 layers, and that causes electrons to flow through the entire thing.

The end result is two types of transistors.  They are called the npn and pnp type.  As you can probably guess, the npn type is n-type semiconductor on either end and p-type in the middle, and the pnp is the opposite.  The npn type needs a positive voltage applied in the middle in order to conduct, and the pnp type needs a negative voltage.

This figure, which I stole from here, shows the two types of transistors.  On the left are the n and p layers.  The right shows the electronics symbol used for both.  Note that the arrow points in for pnp (Pointing iN Permanently), and points out for npn (Not Pointing iN).  The base is where the voltage is applied to control the transistor.  We are ignoring the emitter and collector; just know that they are connected when the correct voltage (negative or positive) is applied to the base.

We now have a transistor which will normally not allow current to pass through it (ie it is an insulator), however, when a voltage is applied to the middle, then current will flow through it (ie a conductor).

For the rest of this post we can think of transistors as electrically controlled switches.  Normally they are open, but apply the correct voltage and they close and allow current to pass.

If this hasn't been clear, and you want a better overview of semiconductors and transistors this video is very well done:
https://www.youtube.com/watch?v=IcrBqCFLHIY



Logic Gates

What are logic gates?

In a digital system, a high voltage (maybe 5 volts) will represent a 1, and a low voltage (close to 0) will represent a 0.  We need to be able to take these voltages and perform logical operations based on them.  Gates are devices that can be thought of as taking a number of inputs and producing a single output based on some simple rules.

As a simple example, an OR gate might take 2 inputs, and will output a 1 if any of the inputs are a 1, and produce a 0 if all the inputs are 0.  AND gates will produce a 1 if all the inputs are 1, and a 0 if any of the inputs are 0.  There is also a single input NOT or inverter gate that will just toggle the input bit, ie, it will output a 1 if the input is 0 and will output a 0 if the input is 1.

Each of the AND and OR gates have inverted versions that are the same but with a NOT after them, which are called NAND and NOR respectively.  I think NOR is interesting because it works the way its English usage would imply.  If neither the inputs are 1 then it will be 1, if either of them are 1 then it will be 0.  This corresponds to a OR followed by a NOT.

There is also the XOR or exclusive or, which will produce a 0 if the inputs are the same, and a 1 if they are different.


Each gate is represented by a symbol.  Note that the NOT gate is sometimes represented as just the bubble without the triangle.  With that in mind, the rest of the inverted gates are literally just the original symbol followed by a NOT bubble.


How to build logic gates with transistors

Gates can be thought of as just black boxes, taking inputs and giving outputs.  However, my aim to to connect the dots from electrons to code, so I will explain how we can make these gates using transistors as switches.

It turns out there are multiple ways to make gates, but what I am aiming for is more showing that it is possible rather than explaining how it is done in practice.

An AND gate is quite easy to build using some NPN transistors that we will treat as switches that connect when a positive voltage (ie, a logical 1), is applied to the middle.  Since we want a 1 at the output when there are two 1s at the inputs we need to create a situation where the output is normally a 0 but will be connected to a 1 when there are two 1s present.

This figure shows the above situation.  The inputs are A and B, applying a positive voltage to them will cause both NPN transistors to allow current to flow.  If they both have a 1 as an input then the 6V at the top will be directly connected to the output, this will mean there is 6V at the output and that will represent a 1.  The 4.7K resistor at the bottom is a pull-down resistor and is needed so that the 6V drains away when the output is disconnected from the 6V source.  4.7K is rather high resistance and it will have little effect when the output is connected to the voltage source.

In an OR gate we just have to add a bypass to each transistor so that if either has a 1 it will connect the source to the output.  A NAND is made by moving the output to the top (along with the 4.7K resistor which is now a pull-up resistor).  If you want to see them the above image came from this site.  This site shows an alternate method for constructing the gates from transistors.

In practice, a NAND gate is the fastest, and cheapest gate to produce.  All other gates can be made from NAND gates.  This Wikipedia article pretty much just gives examples of how this is done if you are interested.


Using gates to do things

So we can make these logic gates using transistors, but what does that get us?  We know we can AND or OR things, but how do we perform more complex logic using gates?

The basis for designing a circuit with gates is a truth table.  A truth table is just a listing of every possible combination of inputs along with the desired output.  In the above figure that shows the gate symbols, there is a truth table next to each gate.

There is also an algebraic way to represent logic gates.  OR is + and AND is *, NOT is a bar over top or a apostrophe afterwards.  For example if we have two inputs named A and B, A+B = Output is A OR B.  AB + BC = Output is (A AND B) OR (B AND C). 

DecBinf
0000000
0100010
0200100
0300111
0401000
0501010
0601100
0701111
0810000
0910010
1010100
1110110
1211000
1311010
1411100
1511111

Here is a truth table for 4 binary inputs.  There are `2^4 = 16` combinations.  The output is true only when the input is 3, 7, or 15.  We can create an algebraic representation using a technique called sum of products.

We will default to an output of 0.  Then, in each case where we have a 1 as an output, we will force a 1 instead.  We need to determine if we have either a 3, 7, or 15.  If any of those are the case we must output a 1, which is a perfect use for an OR gate.  To test if one specific condition is met (eg, if we have a 3) we test if the each correct bit is given as an input.  For 3 that means we need to test for 0011.  We do this by inverting the first two inputs, and then feeding all 4 into an AND gate.  That AND gate will output a 1 if and only if the input is exactly 0011.  Repeat that process for the other conditions and we have 3 AND gates, which are then fed into an OR gate.  The output of the OR gate is the output of the system.

I won't go into more detail than this, as this post is already way too long, and going into way more detail than is needed.  If you want to know more this is called digital logic design.  I made a truth table generator that uses algebraic input if you want to play around with this stuff.


Two's Complement

I will discuss how to make a binary adder using logic gates, but first I must take a detour and explain how negative numbers are represented in binary.  You can probably skip over this section if you are growing weary of all this.

You are probably aware of how positive numbers are represented in binary, if not refer to the above truth table.  If we only ever have to store positive numbers this is all well and good, but how should we store negative numbers?  A simple method would be to use one bit to represent the sign.  Since positive is the norm, we make 0 = positive, and 1 = negative.  We can then say the leftmost bit (AKA most significant bit) is the sign bit.  We lose one bit's worth of data, so our maximum value is cut in half, but that is the trade off for negative numbers.

There is a problem with this simple method though.  It has a value that represents -0.  With 4 bits this would be `1000_2 = -0_10`.  This isn't that big of a deal at first glance.  -0 is just 0, so we have two ways to represent 0.  The problem is that we can no longer simply increment from -1 to 0 to 1.  Instead, we need complex logic to detect if we are crossing from -1 to -0, and to instead jump to 0.  There is also the problem that adding these numbers doesn't work the same if there are negative numbers involved, and handling them is much more complicated.

If you have programmed a bit you are probably aware that if you increase a value that is already at its maximum value it will rollover and become the minimum value.  With 4 unsigned bits the math is: 1111 + 1 = 0000.  But, in the case of 4 signed bits it is: 0111 + 1 = 1000.  Here, 0111 represents 7, but 1000 would be the -0 value we introduced with our simple signed system.  You should be aware that this is not what actually happens.  The result will be -8, and it will increment normally from there.  This is because `1000_2 = -8_10`, but how?

The system actually used is called two's complement.  Positive numbers are stored exactly the same as above.  For negative numbers the procedure is this: Take the binary representation of its positive value, invert all the bits, and then add 1.  For example, `3_10 = 0011_2`, flip the bits to be `1100`, then add 1 to get `1101_2 = -3_10`.  For -8 we take 1000, flip them to 0111, then add one to get 1000.  This process can be repeated to go from negative back to positive.

BinaryDecimal
0111 1111127
0111 1110126
0000 00113
0000 00102
0000 00011
0000 00000
1111 1111−1
1111 1110−2
1111 1101−3
1111 1100−4
1000 0010−126
1000 0001−127
1000 0000−128

I know this seems pretty crazy, but it makes the math work out much nicer.  Addition works the same regardless of if the numbers are positive or negative.  I won't talk about this more, and if you want you can pretty much forget about it.  If you want to play around with it more, here is a two's complement converter.



Adding with gates

Now we will make an adder using logic gates.  The first step is adding two 1 bit numbers, and getting a single 1 bit number as the output.  This is just an XOR gate, as 0+0=0, 1+0=1, and 1+1=10, and we ignore the first 1 so the output is just 0.  The problem is we can't just ignore that bit, so we actually need 2 bits of output.  That extra bit is called the carry bit.  Since we will be adding multiple pairs of bits down the line we will need to input that carry bit into the next stage.  The result is 3 inputs and 2 outputs.


This is called a full adder.  The inputs are A0 and B0, Carry 0 would be the result of the previous stage's carry bit, or 0 if this is the first stage.  The output is Out 0, and the carry which is fed to the next stage.  If we want to combine these stages it's actually quite easy.  We just need to connect the carries and we can run as many bits as we need to.

We can add one more gate per stage and get subtraction out of the same hardware.  We create a subtraction bit; it is set to 0 for addition, and 1 for subtraction.  We feed that bit into an XOR gate along with the B bit.  If the subtraction bit is 0 the XOR will do nothing, just pass B along as it was before.  If subtraction is 1 then it will flip B.

You may note this is pretty much exactly how two's complement works, which makes sense as subtraction is just adding a negative number.  But wait, don't we need to also add 1 for two's complement?  Yes we do.  We do this by also feeding the subtraction bit in as the very first carry bit.  When that subtraction bit is 0, nothing changes, but when it is 1 it is the same as adding 1 to the result, and that completes the two's complement process.

Here is an example of a 4 bit adder-subtracter.  I added the numbers above and below to make the process of mental math easier.  If that bit is set to 1, then add that value to the decimal number, however, remember that the leftmost bit is the sign bit and that if it is 1 you need to convert.  The brighter lines are the ones that are carrying a logical 1, and the dark lines have a logical 0.

The picture shows 5 - 3 = 2.  Note the subtraction bit is set to 1 in the top left.  Also note that some of the carry bits are set to 1 along the bottom, including the last bit, however we do not use the carry bits outside the adder.  To check for overflow we look to see if the last two carry bits are the same via an XOR gate.

This is only 4 bits, and while it is complicated at first, it isn't really that bad.  It would seem simple to extend this to be 32 or 64 bits, and indeed it would be simple to do.  The problem is that each stage has to wait for the previous stage to finish before it can start.  As you add stages you increase the time the adder takes dramatically.  This is called a ripple carry, and is only used to demonstrate how carry can work in adders.  In practice we actually use look-ahead adders, which use some extra hardware to calculate the carries before the earlier stages of the adder produce them.  This increases hardware cost, but greatly reduces time.  It's hard to find stuff about look-ahead adders that doesn't assume you are taking a class in it and have a plenty of math background, but here is some more details, and these notes linked to at the bottom are pretty simple.

The above gate diagrams were drawn in Logisim, which is a fun program to play around with.  Here is the file with the above 4 bit adder-subtracter.


Memory with gates

So we now know how to add numbers using gates, which means we can add using transistors.  This is pretty impressive in my opinion, but there is still a long ways to go.  We must be able to store these values being fed into and out of the adder or it isn't of much use.  We do this with latches and flip-flops.

I'll just show you how a latch works with this animation stolen from Wikipedia:

The animation here makes this a lot simpler to understand.  Those are NOR gates, and they feed into each other.  Refreshing our memory, a NOR gate has an output of 1 only when both inputs are 0.  This means that if at least one input is 1, we know the output is 0.  We then know the input to the other NOR gate.

This is called a SR latch.  S and R are the inputs, S stands for set (equal to 1), and R for reset (back to 0), Q is the standard output name, and the Q with a bar over it is the inverse of Q.

In an SR latch, if S=R=0 then Q remains its last value.  S=0, R=1 resets Q to 0.  S=1, R=0 sets Q to 1.  S=R=1 is undefined.

I hope that animation explains things well enough, because I honestly don't think I can do a better job via text alone.

If you think about the operation for a while it may dawn on you that it isn't that useful by itself.  You can't just hook S or R up to an output as they will change the value of Q.  The solution is called a gated latch.  We add a pair of AND gates before the inputs S and R.  We add a third input called E for enable, and feed it into both AND gates along with either S or R.  The AND gates will always output a 0 if E is 0.  If the value of E is 1 then the AND gates will pass the other value through without change.  Thus, S and R can change all they want, but the value of Q will only change when E is set to 1.

We can take an input D, feed it into S as is, and feed the inverted version of it into R, and we have a much more useful form of memory.  Now, whenever E is set to 1, the value of D will be stored as Q.  When E is 0 the value of Q will remain as whatever value D last had when E was 1.

Remember that S and R should never both equal 1, that is why the inverter is there.  They can both equal 0 when we want to just hold the current value of Q, but the AND gates controlled by E take care of that case.  So, if E=1, then we should either have: `S=0, R=1 -> Q = 0` or `S=1, R=0 -> Q = 1`.

A flip-flop is a latch that is controlled by a clock.  For example, hooking E up to a clock would mean that the value of D was grabbed every clock cycle.  This would mean that Q was essentially a buffer storing the previous cycle's value of D.  There's more to be said about flip-flops but I'll allow someone else to say it.  Wikipedia has a good list of the various types of latches and flip-flops and how they function.




Multiplexers from gates

There is one more important piece that I want to explain how to build using gates.  A multiplexer (AKA a mux) can be thought of as a digitally controlled selector switch.  A mux will have some number of input pins, some number of selector pins, and a single output pin.  One single input pin will be connected to the output pin, the rest will be not be connected to anything.  The value represented on the selector pins will determine which input pin is connected.

For example, if we had 64 latches to store values in, we could have a mux with 64 inputs, and 6 selector pins (`2^6 = 64`).  If the value `000100_2 = 4_10` is sent to the selector pins, then input pin number 4 will be connected to the output pin.  In this case the output pin would be connected to the adder, and the input to the latch (so the names input and output are backwards).

Here is an illustration of how to build a 4 input mux with gates.  Note that the triangles in the middle are actually tri-state buffers.  This just means that if the input on the side is set to 1 they will pass the value through, if it is set to 0 they will be disconnected.  This is exactly how I've described transistors to work, so just pretend they are transistors.

We have a few AND gates and inverters at the bottom, which just serve to send a 1 to the correct transistor up top.  The rest of the transistors will have a 0, and thus be an open switch.  The transistor that gets a 1 will be a closed switch and connect the corresponding input to the output.

Wikipedia has more on muxes if you want to read some more, but they are relatively simple, albeit very useful.



Combining logic, memory, muxes

At this point we can actually perform useful functions.  Imagine we had a set of 8 latches, each would be storing 1 bit, and so the set would represent a single 8-bit byte.  Now let's take 64 of those sets of 8 latches, and call that a bank of memory.  It would be 64 single byte numbers.  Create 3 of those banks.  Banks A and B are inputs, bank C is the output.  We then need 8 muxes, each with 64 inputs.  These muxes will work together to route one of the bytes somewhere.  There will be a set of 8 muxes for each bank.  Banks A and B will be routed to an 8 bit adder, and the result of that 8 bit adder will be routed to bank C.  Each of these 3 sets of 8 muxes will be controlled by a counter which is just 6 bits starting at 000000 and incrementing to 111111.  A clock pulse will cause the counter to increment and set the enable pin on the correct set of latches in bank C.

This would allow us to add 64 pairs of 1 byte numbers, and store the results.  Admittedly, that is not going to give you any cat pics, but it could still be reasonably called a computer.



Arithmetic Logic Unit (ALU)

We already have an adder and a subtracter.  We can make a few more functions simply as well.  Bitwise AND and OR are both useful.  This means just taking each bit of two inputs and giving the result from an AND or OR between them.  Shifting all the bits either to the left or right allows us to divide or multiple by 2 (consider that 12 * 10 = 120 in base 10, likewise 101 * 2 = 1010 in base 2).

We can combine all these functions into the same package, reusing hardware when possible (as in the case of addition and subtraction) and just using independent hardware when needed.  We then use a mux to determine which hardware gets fed the inputs and then another mux to take the result from the hardware and route it to the output for the ALU as a whole.

With that, we've introduced the concept of operation codes (opcodes).  Each operation we want our processor to be able to perform will have a binary code corresponding to it.  That code will largely serve to tell muxes where to route the inputs and what to do with the outputs.

Instead of just working down our memory banks sequentially as above, we can control which set of latches is being accessed or written to by giving their control muxes the correct binary number (ie, a memory address).  Since we aren't just working down a list, we can combine memory banks A, B, and C from above into just one memory bank.  There will still be 3 groups of muxes, 2 will hooked up to the outputs of the latches (Q) and route it to the input of the ALU.  The third mux will route the output of the ALU into a set of latches.  Each byte in the memory will be numbered and called a register.

So we now have 4 binary numbers, an opcode that determines what operation is being performed, a memory address for input A, a memory address for input B, and a memory address for the output.  Each of these 4 numbers is controlling a mux.

These 4 numbers combine to form a line of machine code.  This code is listed sequentially the same as the numbers were listed before in our simple adder machine.  A clock increments a counter, and that counter tells a mux which line to route into the 4 muxes from the previous paragraph.


Machine Code and Instruction Sets

The number which controls which operation is being performed is called an opcode.  The collection of possible opcodes we have to choose from is an instruction set.  While any processor is going to be able to add, there are various other features that may or may not be available depending on the complexity of the instruction set it uses.

MIPS is an instruction set which aims to achieve high performance through simplicity of operations.  As such, it is a good first example to use.  Here is the actual MIPS machine code for an addition:

opcode rs    rt     rd   shamt funct
000000 01001 01010 01000 00000 100000

rs, rt, and rd are registers, or memory locations.  rs is source, rt is the second source, and rd is destination.  shamt stands for shift amount, and can be ignored.  The opcodes in MIPS are broken between opcode and funct.  The opcode of 000000 means that we will look to funct to determine the function.  Really you can just mentally combine opcode and funct into one number.

Function 100000 is add, so the values in registers listed in rs and rt will be sent to the ALU and inside the ALU they will be routed to the adder.  The output of the ALU will be routed to the register in rd.


Assembly

Assembly is a slightly more human readable way to represent machine code.  The above addition in assembly is:
add $t0, $t1, $t2

The first part is the operation, addition in this case.  The second part is the destination register.  Registers generally begin with a $, just as a convention in assembly.  t registers are temperary ones intended to be used to store the intermediate steps in a larger problem.  There are also 8 s registers which are in the first 8 memory locations.  Therefore, $t0 is mapped to address 8, or 01000.

Likewise, the last two parts are the two source registers.  They are mapped to locations 01001, and 01010 respectively.

It is worth noting here that assembly corresponds one to one with actual operations in the processor.  Each line will consume a clock cycle (this stops being true at a more complicated level).

This line of C code might be compiled into the above assembly:
c = a + b

That is a pretty simple line of C, but even it may need more than the above 1 line of assembly.  The number of registers in a processor are extremely limited.  As such, memory values constantly need to be swapped out between the main memory (RAM), and the CPU's memory (cache).  Swapping out memory values is another type of operation, one that doesn't really use the ALU.  That C code could require 4 addition assembly lines, for taking the two values from RAM and then storing the result in RAM.

The point is that you have no idea how many actual cycles of the CPU a single line of C will take, whereas with assembly you know that each line maps to one actual CPU operation.

Here is a list of MIPS operations.  It may seem complicated, but considering that is everything a MIPS CPU can do, it is kind of amazing that those operations are enough to perform any function a computer can be called upon to do.

If you still think the MIPS instruction set is complex, feel free to contrast MIPS with the instruction set used on desktop PCs.

In addition to the arithmetic operations we've talked a lot about, there are also the store and load operations I mentioned.  These store or load values to or from main memory (RAM).  There are also comparison operations that allow you to compare two values and see if they are equal, or if one is greater than.  Then there are jump instructions which let you jump to a new location in the list of operations being performed (like a GOTO).  Branch instructions are the same as jumps but will only jump if some condition is true.


From C to Assembly

C code:
int  x,y,z;
if(y < x) z = z - 3;
else z = 6; 

The assembly will assume x, y, z are already stored in registers $s4, $s5, $s6.
Assembly Code:

slt $t0, $s5, $s4     # if y<x, then set $t0=1
beq $t0, $zero, L     # if condition (y<x) is false then jump to L
addi $s6, $s6, -3     # if condition (y<x) is true then z=z-3
j EXIT                # jump to end to avoid code for false condition

L:                    # this is a label, used to jump to
addi $s6, $zero, 6    # z=6, this code is only reached if we don't jump to exit

EXIT: 


Closing

I feel compelled to put in some sort of closing here.  Reading through this I think parts are over simplified and some of the tangents could be removed.  I'd be amazed if anyone ever read through this all, but hopefully certain sections will prove useful to someone, and then perhaps they continue reading from there to see how it fits in to a larger picture.

If you're wondering, this post is about 6000 words and I can't believe how much stuff I left out.