update README with information on quote processing, basic installation/dependencies

2025-12-15 06:13:30 -06:00 · 2020-08-07 18:30:54 -05:00
parent 16b673ec08
commit 11a82525fd
1 changed files with 110 additions and 0 deletions
--- a/README.md
+++ b/README.md
@@ -19,10 +19,120 @@ A Vue.js and Flask Web Application designed to provide a quick way to search for
 - Instant Search provided by Algolia
 - Sleek, responsive design that is easy on the eyes

+## Quote Data
+
+### Credit
+
+Credit to [officequotes.net](https://www.officequotes.net/) for providing all quote data.
+
+Credit to [imdb.com](https://www.imdb.com/title/tt0386676/) for episode descriptions.
+
+### Processing
+
+Currently, quotes are scraped directly from the website.
+
+This repository will hold the current pre-processed raw quote data, but the application has the ability to fetch and parse
+HTML pages directly as needed.
+
+```
+python server/cli.py fetch
+    --season SEASON          Fetches all episodes from a specific season.
+    --episode EPISODE        Fetches a specific episode. Requires SEASON to be specified.
+    --all                    Fetches data for every episode from every season.
+    --skip SEASON:EPISODE    When specified, it will skip a given episode.
+```
+
+The data has to be parsed, but due to high irregularity (at least too much for me to handle), the files will have to be
+inspected and manually processed.
+
+```
+python server/cli.py preprocess
+    --season SEASON     Pre-processes all episodes from a specific season.
+    --episode EPISODE   Pre-processes a specific episode. Requires SEASON to be specified.
+    --all               Pre-processes all episodes from every season.
+    --overwrite         DANGER: Will overwrite files. Manually processed files may be lost forever.
+```
+
+Once all files have been pre-processed, you will have to begin the long, annoying process of editing them into my custom format.
+
+These raw pre-processed files are located in `./server/data/processed/`.
+
+Each section (barring the first) is preceded by a hyphen.
+
+```
+CharacterName: Text that character says.
+OtherCharacter: More text that other character says.
+-
+ThirdCharacter: Text that character says in a second scene/section.
+-!1
+Fourth Character With Spaces In Name: Text that fourth character says in a deleted scene.
+Fifth-Character: Which deleted scene? Deleted scene number one.
+```
+
+Deleted scenes are marked by an exclamation mark after the hyphen, followed by one or more digits identifying which deleted scene they belong to.
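The section format described above can be read back with a few lines of Python. This is only a minimal sketch under stated assumptions, not the parser in `server/cli.py`: the function name `split_sections` and the dictionary keys `deleted` and `lines` are hypothetical, and stray marker text such as 'Deleted Scenes 3' is not filtered out.

```python
import re
from typing import Dict, List

# A separator line is "-" or "-!<digits>"; the digits are the deleted scene ID.
SEPARATOR = re.compile(r"^-(?:!(\d+))?$")


def split_sections(text: str) -> List[Dict]:
    """Split hand-edited quote text into sections (hypothetical helper)."""
    sections: List[Dict] = []
    # The first section has no leading hyphen, so start with an open section.
    current: Dict = {"deleted": None, "lines": []}
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue  # ignore blank lines
        match = SEPARATOR.match(line)
        if match:
            # A separator closes the current section and opens a new one.
            if current["lines"]:
                sections.append(current)
            deleted = int(match.group(1)) if match.group(1) else None
            current = {"deleted": deleted, "lines": []}
        else:
            current["lines"].append(line)
    if current["lines"]:
        sections.append(current)
    return sections


sample = """CharacterName: Text that character says.
OtherCharacter: More text that other character says.
-
ThirdCharacter: Text that character says in a second scene/section.
-!1
Fourth Character With Spaces In Name: Text that fourth character says in a deleted scene.
Fifth-Character: Which deleted scene? Deleted scene number one.
"""

print([(s["deleted"], len(s["lines"])) for s in split_sections(sample)])
# prints [(None, 2), (None, 1), (1, 2)]
```

Regular scenes carry `None` as their deleted scene ID, so sections that share the same number can be grouped afterwards.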
+
+Please note that extra text like 'Deleted Scenes 3' might appear before a hyphen. This is expected, and it helps when deciding
+which scene goes with which Deleted Scene ID. If you don't know, do what I did: go look at the web page it's based on.
+Otherwise, read the quotes and figure it out from context.
+
+This concept is rather loose, slow, and dumb; it simply allows me to mark which deleted scenes go together while working
+with an incredibly inconsistent, human-curated data format.
+
+To ease text processing, I did come up with a regular expression and replacement string for search and replace:
+
+```
+^([\w\s]+\-*[\w\s]*):\s+
+$1|
+```
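For anyone who prefers a script over an editor's search-and-replace, the same substitution can be sketched in Python. This is an illustration under assumptions, not part of `server/cli.py`: the function name `mark_speakers` is hypothetical, and the editor-style `$1|` replacement is written as `\1|` for Python's `re` module.

```python
import re

# The pattern quoted above; "$1|" becomes r"\1|" in Python's re syntax.
SPEAKER = re.compile(r"^([\w\s]+\-*[\w\s]*):\s+")


def mark_speakers(text: str) -> str:
    """Replace the leading 'Name: ' of each line with 'Name|' (hypothetical helper)."""
    # Work line by line so that \s can never match across a line break.
    return "\n".join(SPEAKER.sub(r"\1|", line) for line in text.splitlines())


sample = "CharacterName: Text that character says.\n-!1\nFifth-Character: Which deleted scene?"
print(mark_speakers(sample))
```

Separator lines such as `-` and `-!1` are left untouched because the pattern requires at least one word or space character before the colon.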
+
+From then on, the process becomes much simpler; 95% of the work needed to process quotes is already done.
+
+Now that quotes are in a consistent (although custom) format, they need to be processed into individual episodes. In reality,
+the output is just a JSON representation of the previous stage.
+
+```
+python server/cli.py process
+    --season SEASON     Processes all episodes from a specific season.
+    --episode EPISODE   Processes a specific episode. Requires SEASON to be specified.
+    --all               Processes all episodes from all seasons.
+```
+
+Now that they're all in individual files, the final commands can be run to compile them into one file, a static
+'database' of sorts. Technically, they could be kept scattered, but I decided to make it simpler with just one big file.
+
+This is also where Algolia comes in.
+
+```
+python server/cli.py build [algolia|final]
+```
+
+Each command is run with no special arguments (as of now), generating an `algolia.json` or `data.json` file in the `./server/data/` folder.
+
+The `data.json` file is loaded by the Flask server, and `algolia.json` can be uploaded to your primary Algolia index.
+
 ## Setup

 This project was built on Python 3.7 and Node v12.18.3 / npm 6.14.6.

+### Installation
+
+To install all Node/NPM dependencies, run
+
+```
+npm install
+```
+
+To install the Python dependencies, run
+
+```
+pip install -r ./requirements.txt
+```
+
+I recommend that you use a virtualenv in order to keep dependencies separate from other projects, as I do.
+Personally, I use PyCharm Professional to maintain virtualenvs, just because it's easy to start, use, update and maintain
+them.
+
+### Running
+
 - Vue.js can be run via `npm run serve`.
    - Run this in `./client/`.
 - Flask can be run via `flask run`.