1. The `raw` directory is almost never edited. It is preserved as a "source of truth" and is only edited when there are flaws in
the original pre-processing.
2. `normalization/truths` acts as a second layer of truth. It is the most basic processed XML file available.
a. All characters are extracted as 'Speakers', meaning 'Michael' and 'Andy & Dwight' are still valid speakers.
b. Extracted speakers are placed in `speaker_mapping.xml`, which allows misspellings and similar errors to be merged together.
- This step has a direct impact on script data. For example, while we do want
"Bob Vance Refrigeration Worker #1" and "Bob Vance Refrigeration Worker #2" to show up internally as the same character, we still want them to
display differently on the script page.
- Thus, at this stage, we do not merge names of background workers; we only correct misspellings.
c. After this, speakers are translated into an 'identification' file that gives them short, web-friendly slug identifiers.
- For example, "Michael" becomes `michael`, and "Bob Vance" becomes `bob-vance`.
- Characters get the most familiar and simple ID available; Phyllis, although her full name is "Phyllis Vance", gets
`phyllis`.
- This step is entirely for internal data referencing. As noted above, the Bob Vance Refrigeration workers
will all map to the same `bob-vance-refrigeration-worker` ID internally.
- Additionally, compound speakers (those that do not name a single character directly), like `Kevin's Computer`,
`Kevin and Oscar`, or `Dwight, Kelly, Andy and Pam`, will be broken up and, ideally, properly
annotated.
```xml
Phyllis
phyllis
Kevin's Computer
{Kevin}'s Computer
kevin
```
- `` elements will be used only for compound speakers. Warnings should show in the console in the next step
when compound speakers are not annotated, when a `Characters` tag is used while containing only one element,
or when `AnnotatedText` appears in a `Speaker` element's children but `annotated` is false.
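The slug step described above could be sketched roughly as follows. This is only an illustration under stated assumptions: the function names `slugify` and `internal_id` and the `OVERRIDES` table are hypothetical and are not taken from the project. The override table stands in for the manual choice of "familiar" IDs, such as `phyllis` for "Phyllis Vance".

```python
import re

# Hypothetical manual overrides for characters whose familiar ID
# differs from a slug of their full name.
OVERRIDES = {"Phyllis Vance": "phyllis"}


def slugify(name: str) -> str:
    """Lowercase the name and join alphanumeric runs with hyphens."""
    return re.sub(r"[^a-z0-9]+", "-", name.lower()).strip("-")


def internal_id(name: str) -> str:
    """Return the internal ID for a speaker name.

    A trailing "#N" is dropped so numbered background workers
    share one ID, while their display text stays untouched elsewhere.
    """
    if name in OVERRIDES:
        return OVERRIDES[name]
    return slugify(re.sub(r"\s*#\d+\s*$", "", name))
```

With this sketch, `internal_id("Bob Vance Refrigeration Worker #1")` and `internal_id("Bob Vance Refrigeration Worker #2")` both yield `bob-vance-refrigeration-worker`, matching the behavior described in step c.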
3. `normalization/characters` acts as the character data layer. Here, characters have their metadata assigned, such as whether
they are a main, recurring, background, or meta character.
a. Michael, Dwight and Jim are **main** characters. Main characters are defined by a very large number of quotes, a continued and prolonged
presence in the show, and so forth.
b. David Wallace, Bob Vance and Esther are **recurring** characters. While they may play a hefty role in the show, they don't appear often enough
to qualify as a main character.
c. Captain Jack, Pizza Guy and Bob Vance Refrigeration Worker are **background** characters. These are characters that appear only once
or have so little impact that including them would dilute the meaning of a *recurring* character. The line between a
*background* character and a *recurring* character may be pretty thin at times, so I anticipate some characters will be difficult to classify.
d. "Everyone" and "None" are **meta** characters (these speakers won't be searchable, but the quote text will be, as usual).
This type is reserved for lines that have no real character, for more abstract speakers, and for scene descriptions.
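The four character types above could be modeled along these lines. This is a minimal sketch, not the project's actual schema: the names `CharacterType`, `Character`, and `searchable` are assumptions made for illustration, and the only rule encoded is the one stated above, that meta speakers are not searchable.

```python
from dataclasses import dataclass
from enum import Enum


class CharacterType(Enum):
    MAIN = "main"
    RECURRING = "recurring"
    BACKGROUND = "background"
    META = "meta"


@dataclass(frozen=True)
class Character:
    id: str    # slug identifier, e.g. "michael"
    name: str  # display name, e.g. "Michael"
    type: CharacterType

    @property
    def searchable(self) -> bool:
        # Meta speakers ("Everyone", "None") are excluded from speaker
        # search; their quote text is still searchable elsewhere.
        return self.type is not CharacterType.META


# A few entries taken from the examples in the list above.
CHARACTERS = [
    Character("michael", "Michael", CharacterType.MAIN),
    Character("david-wallace", "David Wallace", CharacterType.RECURRING),
    Character("pizza-guy", "Pizza Guy", CharacterType.BACKGROUND),
    Character("everyone", "Everyone", CharacterType.META),
]
```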
4. `normalization/compiled` is the final stage, where all data is *compiled* into a single dataset.
a. `episodes/{season}-{episode}.xml` contains each episode's data.
```xml
{Michael}
michael
People say I am the best boss. They go, "God we've never worked in a place like this before. You're hilarious."
"And you get the best out of us." [shows the camera his WORLD'S BEST BOSS mug] I think that pretty much sums it up.
I found it at Spencer Gifts.
{Dwight} and {Andy}
dwight
andy
[singing] Shall I play for you? Pa rum pump um pum [Imitates heavy drumming] I have no gifts for you.
Pa rum pump um pum [Imitates heavy drumming]
```
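The braced annotations shown above, such as `{Dwight} and {Andy}`, could be read back with something like the sketch below. It assumes that each `{...}` span names one character and that the listed IDs should match those names; the helpers `referenced_ids` and `annotation_warnings` are hypothetical and only illustrate the kind of consistency warning described in step 2.

```python
import re

# Matches one braced character reference, e.g. "{Dwight}".
BRACED = re.compile(r"\{([^{}]+)\}")


def referenced_ids(annotated_text: str) -> list:
    """Return lowercase hyphenated IDs for every braced name, in order."""
    return [
        re.sub(r"[^a-z0-9]+", "-", name.lower()).strip("-")
        for name in BRACED.findall(annotated_text)
    ]


def annotation_warnings(annotated_text: str, character_ids: list) -> list:
    """Return warning strings when braces and listed IDs disagree."""
    warnings = []
    found = referenced_ids(annotated_text)
    if not found:
        warnings.append("no braced character references found")
    for missing in sorted(set(found) - set(character_ids)):
        warnings.append(f"'{missing}' is annotated but not listed")
    for extra in sorted(set(character_ids) - set(found)):
        warnings.append(f"'{extra}' is listed but not annotated")
    return warnings
```

For the second speaker in the example, `referenced_ids("{Dwight} and {Andy}")` gives `["dwight", "andy"]`, and `annotation_warnings` returns an empty list when the listed IDs are `dwight` and `andy`.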