AWS Builder Center Article

Build a Grammar and Pronunciation Enrichment Pipeline for Tagalog Cards for AWS Manila Community Day

The Tagalog learning app already had article pages and sentence cards. The next development challenge was enrichment: add better grammar breakdowns and pronunciation guides to every extra example without manually editing hundreds of repeated HTML blocks. The Python scripts show a practical batch-processing pattern for that work.

Tagalog Learning App Article Series

Disclaimer

Purpose: This static app is an educational prototype for language practice and developer sharing. It helps learners prepare polite Tagalog phrases for AWS Manila Community Day.

Non-Commercial: This project has no paid feature, no advertising, no registration requirement, and no commercial purpose. It is intended for learning, experimentation, and community preparation.

No Guarantees: Generated language content may contain mistakes. Tagalog translations, grammar explanations, pronunciation guides, and cultural notes should be checked by native speakers before production use.

Scope: This app is not an official AWS product, not an official translation tool, and not a substitute for human language instruction. It is a technical demo with useful learning content.

Community Respect: The app should avoid stereotypes and should teach polite language carefully. Words like po, opo, kayo, and ninyo should be explained as tools for respect, not decoration.


Demo

Paki-check po kung pumasok ang bayad.

Grammatical Breakdown

  • Paki-check: Filipino-English request phrase meaning "please check."
  • po: Respect marker used for polite speech.
  • kung: Means "if" or "whether."
  • pumasok: Means "entered" or "came in" in this payment context.
  • ang: Focus marker placed before the main noun or idea.
  • bayad: Means "payment."

Pronunciation Guide

It is pronounced word by word as:

pah-kee-chehk poh koong poo-mah-sohk ahng bah-yahd.

  • Paki-check: pah-kee-chehk.
  • po: poh.
  • kung: koong.
  • pumasok: poo-mah-sohk.
  • bayad: bah-yahd.

Content Snapshot

Core enrichment tasks:

  • Find every div.extra-example block
  • Read the Tagalog sentence from span
  • Generate grammar breakdown list items
  • Generate pronunciation guide text and chunk list
  • Replace old sections in the HTML
  • Write updated article files
  • Print sanity checks

Table of Contents

Part 1: Use Batch Scripts For Article Groups

This section explains why the enrichment work is split across multiple scripts instead of one giant file.

Part 2: Treat Glossaries As Small Local Knowledge Bases

This section shows how the scripts use dictionary entries for beginner-friendly grammar meanings and local loanword explanations.

Part 3: Generate Pronunciation From Known Words And Fallback Rules

This section explains the pronunciation map, token handling, vowel fallback, and chunked beginner output.

Part 4: Patch HTML With BeautifulSoup

This section shows the code pattern that finds existing sections and replaces only the content after a target heading.

Part 5: Validate Enriched Output

This section explains why the scripts print counts for extra examples, pronunciation phrases, local tips, and grammar breakdowns.


Part 1: Use Batch Scripts For Article Groups

Goal

Improve grammar and pronunciation across 24 article pages without hand-editing every card.

Development skill

The key skill is controlled batch processing. Each script owns a small group of article files and a glossary tuned to that group.

files = [
    "article-22-manila-daily-home-laundry-bills-and-errands.html",
    "article-23-manila-daily-work-study-and-social-plans.html",
    "article-24-manila-daily-health-safety-weather-and-money.html",
]

This pattern is easier to review than one huge script because each batch can carry topic-specific words. Community Day pages need words such as registration, workshop, badge, and volunteer. Manila Daily pages need words such as laundry, delivery, battery, cash, clinic, and medicine.
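One way to keep the batch scope reviewable is to describe each batch as data. The sketch below is an assumption, not the project's actual code: the ArticleBatch class, the find_duplicate_files helper, and the sample glossary entry are hypothetical names introduced for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class ArticleBatch:
    """One enrichment batch: a topic name, its files, and its glossary."""
    topic: str
    files: list[str]
    glossary: dict[str, str] = field(default_factory=dict)


def find_duplicate_files(batches: list[ArticleBatch]) -> list[str]:
    """Return file names that are listed more than once across the batches."""
    seen: set[str] = set()
    duplicates: list[str] = []
    for batch in batches:
        for name in batch.files:
            if name in seen and name not in duplicates:
                duplicates.append(name)
            seen.add(name)
    return duplicates


# The file names come from the article; the glossary entry is illustrative.
manila_daily = ArticleBatch(
    topic="Manila Daily",
    files=[
        "article-22-manila-daily-home-laundry-bills-and-errands.html",
        "article-23-manila-daily-work-study-and-social-plans.html",
        "article-24-manila-daily-health-safety-weather-and-money.html",
    ],
    glossary={"laundry": "English loanword used locally; means laundry."},
)

print(find_duplicate_files([manila_daily]))
```

A check like this catches the case where two scripts would rewrite the same article with different glossaries.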

Prompt

For each article group, update every extra example.
Keep the existing Tagalog sentence.
Replace the grammar breakdown with beginner-friendly word meanings.
Replace the pronunciation guide with word-by-word pronunciation.
Write an updated HTML file and print a summary.

Result

The project gets a repeatable enrichment workflow:

article group
  -> topic glossary
  -> visible Tagalog sentence
  -> grammar list
  -> pronunciation guide
  -> updated HTML
  -> sanity check

Tips

  • Split scripts by article topic when the vocabulary changes.
  • Keep the file list explicit so reviewers know the script scope.
  • Use the same function names across batches.
  • Prefer deterministic output over runtime AI generation.
  • Print a summary after every batch.

Part 2: Treat Glossaries As Small Local Knowledge Bases

Goal

Turn each script into a small, inspectable language helper.

Development skill

The scripts use dictionaries as local knowledge bases. A word such as po or saan receives a stable beginner explanation, while unknown words fall back to a generic local-use explanation.

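# Glossary excerpt: normalized word -> short beginner explanation.
defs = {
    "po": "Respect marker used for polite speech.",
    "saan": "Means where.",
    "workshop": "English loanword used locally; means workshop.",
    "badge": "English loanword used locally; means badge.",
}

def get_def(word):
    # token_key is defined elsewhere in the script; it maps a word to its glossary key.
    key = token_key(word)
    if key in defs:
        return defs[key]
    # Unknown words get a generic explanation that quotes the word as written.
    return f'English loanword or useful word used locally; means "{word}" in this context.'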

Why this matters

This approach is not a full grammar parser, but it is useful for a static learning prototype. The learner sees consistent meanings, and the developer can update one dictionary entry when a meaning needs improvement.

Example

Sentence:
Saan po ang registration area?

Generated breakdown:
- Saan: Means where.
- po: Respect marker used for polite speech.
- ang: Focus marker placed before the main noun or idea.
- registration: English loanword used locally; means registration.
- area: English loanword or useful word used locally; means "area" in this context.
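A breakdown like the one above could be produced with a few lines of Python. This is a minimal sketch under stated assumptions: the article does not show token_key, so the version here (strip edge punctuation, then lowercase) is a guess, and breakdown_lines is a hypothetical helper name. The glossary holds only the entries needed for this sentence.

```python
import string

defs = {
    "saan": "Means where.",
    "po": "Respect marker used for polite speech.",
    "ang": "Focus marker placed before the main noun or idea.",
    "registration": "English loanword used locally; means registration.",
}


def token_key(word):
    """Assumed normalizer: drop edge punctuation, then lowercase."""
    return word.strip(string.punctuation).lower()


def get_def(word):
    key = token_key(word)
    if key in defs:
        return defs[key]
    return f'English loanword or useful word used locally; means "{word}" in this context.'


def breakdown_lines(sentence):
    """Return one 'word: meaning' line per word in the sentence."""
    lines = []
    for raw in sentence.split():
        word = raw.strip(string.punctuation)
        if word:  # skip tokens that were only punctuation
            lines.append(f"{word}: {get_def(word)}")
    return lines


for line in breakdown_lines("Saan po ang registration area?"):
    print("-", line)
```

Because only edge punctuation is stripped, a hyphenated request such as Paki-check stays one token.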

Tips

  • Keep definitions short.
  • Explain loanwords honestly instead of pretending every word is pure Tagalog.
  • Use beginner wording.
  • Add topic-specific glossary entries when fallback text appears too often.
  • Treat dictionary entries as reviewable content.
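To decide which glossary entries to add next, it can help to count which words hit the fallback most often. The sketch below assumes a simple whitespace tokenizer; fallback_counts is a hypothetical helper, not part of the article's scripts.

```python
import string
from collections import Counter


def fallback_counts(sentences, glossary):
    """Count how often each word is missing from the glossary."""
    counts = Counter()
    for sentence in sentences:
        for raw in sentence.split():
            key = raw.strip(string.punctuation).lower()
            if key and key not in glossary:
                counts[key] += 1
    return counts


glossary = {"saan": "Means where.", "po": "Respect marker used for polite speech."}
sentences = ["Saan po ang badge?", "Saan po ang workshop?"]

# Words at the top of this list are the best candidates for new entries.
for word, count in fallback_counts(sentences, glossary).most_common(3):
    print(f"{word}: {count}")
```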

Part 3: Generate Pronunciation From Known Words And Fallback Rules

Goal

Give every extra example a pronounceable guide even when not every word is in the pronunciation map.

Development skill

The scripts combine two strategies:

Known word:
  use a curated pronunciation and optional syllable chunks.

Unknown word:
  use a simple vowel fallback so the learner still gets a readable guide.

Code pattern

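# Curated entries: word -> (full guide, [(syllable, how to say it), ...]).
pron = {
    "salamat": ("sah-lah-maht", [("sa", "sah"), ("la", "lah"), ("mat", "maht")]),
    "kayo": ("kah-yoh", [("ka", "kah"), ("yo", "yoh")]),
    "bayad": ("bah-yahd", [("ba", "bah"), ("yad", "yahd")]),
}

# Fallback vowel sounds for words that are not in pron.
vmap = {"a": "ah", "e": "eh", "i": "ee", "o": "oh", "u": "oo"}

def fallback_pron(word):
    # Replace each vowel with its sound, keep other letters, drop everything else.
    # The result has no syllable hyphens, so curated entries read better.
    output = []
    for character in word.lower():
        if character in vmap:
            output.append(vmap[character])
        elif character.isalpha():
            output.append(character)
    # A token with no letters at all (for example "24") is returned unchanged.
    return "".join(output) or word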
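The two strategies can be combined in a small lookup function. This is a simplified sketch, not the scripts' actual code: the curated map here stores only the full guide (no chunk list), and pronounce_word and pronounce_sentence are hypothetical names.

```python
import string

# Simplified curated map: word -> full guide only.
pron = {"po": "poh", "kayo": "kah-yoh", "bayad": "bah-yahd"}
vmap = {"a": "ah", "e": "eh", "i": "ee", "o": "oh", "u": "oo"}


def fallback_pron(word):
    """Vowel fallback as in the article: map vowels, keep other letters."""
    output = []
    for character in word.lower():
        if character in vmap:
            output.append(vmap[character])
        elif character.isalpha():
            output.append(character)
    return "".join(output) or word


def pronounce_word(word):
    """Use the curated entry when there is one, else the vowel fallback."""
    key = word.strip(string.punctuation).lower()
    return pron.get(key) or fallback_pron(key)


def pronounce_sentence(sentence):
    """Join per-word guides into one line, skipping punctuation-only tokens."""
    words = [w for w in sentence.split() if w.strip(string.punctuation)]
    return " ".join(pronounce_word(w) for w in words)


print(pronounce_sentence("Bayad po!"))  # both words are curated
print(pronounce_word("tubig"))          # not curated, so the fallback runs
```

The fallback result for tubig has no hyphens, which is one reason to curate the words that learners will meet most often.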

Example output

Tagalog:
Uminom po kayo ng tubig dahil mainit.

Pronunciation:
oo-mee-nohm poh kah-yoh nahng too-beeg dah-heel mah-ee-neet.

Chunks:
- Uminom: oo-mee-nohm.
- po: poh.
- kayo: kah-yoh.
- tubig: too-beeg.
- dahil: dah-heel.
- mainit: mah-ee-neet.

Tips

  • Curate common words first.
  • Keep fallback simple and transparent.
  • Preserve acronyms and technical words carefully.
  • Aim for pronunciation that is useful for practice rather than phonetically exact.
  • Ask native speakers to review important phrases.
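Acronyms are a case where the vowel fallback does harm: it lowercases the word and rewrites its vowels, so AWS would come out as "ahws". One possible guard, sketched here with hypothetical helper names, is to detect all-uppercase words and spell them letter by letter instead.

```python
def is_acronym(word):
    """Treat words with two or more letters, all uppercase, as acronyms."""
    letters = [c for c in word if c.isalpha()]
    return len(letters) >= 2 and all(c.isupper() for c in letters)


def spell_acronym(word):
    """Return the characters separated by hyphens, for example A-W-S."""
    return "-".join(c for c in word if c.isalnum())


def pronounce_or_spell(word, pronounce):
    """Spell acronyms letter by letter; pass other words to `pronounce`."""
    return spell_acronym(word) if is_acronym(word) else pronounce(word)


# str.lower stands in for the real pronunciation function in this demo.
print(pronounce_or_spell("AWS", str.lower))
print(pronounce_or_spell("badge", str.lower))
```

Whether learners should see A-W-S or a sounded-out form is a content decision for reviewers; the sketch only keeps the fallback from mangling the word.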

Part 4: Patch HTML With BeautifulSoup

Goal

Update the generated HTML without rewriting the whole article page.

Development skill

The enrichment scripts parse the page with BeautifulSoup, find each div.extra-example, read the Tagalog span, and replace the content after specific headings.

from pathlib import Path

from bs4 import BeautifulSoup

for fname in files:
    soup = BeautifulSoup(Path(fname).read_text(encoding="utf-8"), "html.parser")
    divs = soup.find_all("div", class_="extra-example")

    for div in divs:
        # The Tagalog sentence lives in the card's <span lang="tl">.
        span = div.find("span", lang="tl")
        if span is None:
            continue  # no sentence to enrich on this card

        # Collapse line breaks and repeated spaces into single spaces.
        sentence = " ".join(span.get_text(" ", strip=True).split())
        replace_after_heading(div, "Grammatical Breakdown:", [make_breakdown_ul(soup, sentence)])
        replace_after_heading(div, "Pronunciation Guide:", make_pronunciation(soup, sentence))
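The article does not show make_breakdown_ul, so the following is only a sketch of what such a helper might look like. The markup (a bold word followed by its meaning in each list item) and the stand-in get_def are assumptions made for illustration.

```python
from bs4 import BeautifulSoup

# Stand-in glossary lookup so the sketch runs on its own.
defs = {"saan": "Means where.", "po": "Respect marker used for polite speech."}


def get_def(word):
    return defs.get(word.lower(), f'Means "{word}" in this context.')


def make_breakdown_ul(soup, sentence):
    """Build a <ul> with one <li> per word: bold word, then its meaning."""
    ul = soup.new_tag("ul")
    for raw in sentence.split():
        word = raw.strip(".,!?;:")
        if not word:
            continue
        li = soup.new_tag("li")
        strong = soup.new_tag("strong")
        strong.string = f"{word}:"
        li.append(strong)
        li.append(f" {get_def(word)}")  # plain text after the bold word
        ul.append(li)
    return ul


soup = BeautifulSoup("<div class='extra-example'></div>", "html.parser")
print(make_breakdown_ul(soup, "Saan po?"))
```

Creating nodes with soup.new_tag keeps the new list in the same document tree, so it can be inserted into the card without re-parsing.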

The helper replace_after_heading is important because it avoids replacing the entire card. It removes only the old content between one heading and the next known heading.

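# Headings that mark the start of a generated section inside a card.
# Add any other section heading that can follow these two; otherwise the
# content after the last known heading is removed as well.
KNOWN_HEADINGS = ("Grammatical Breakdown:", "Pronunciation Guide:")

def replace_after_heading(div, heading_text, new_nodes):
    # Find the direct-child <p> whose <strong> contains the target heading.
    heading = None
    for candidate in div.find_all("p", recursive=False):
        strong = candidate.find("strong")
        if strong and heading_text in strong.get_text():
            heading = candidate
            break

    if heading is None:
        return False

    # Remove old siblings until the next known heading or the end of the card.
    sibling = heading.find_next_sibling()
    while sibling:
        next_sibling = sibling.find_next_sibling()
        if sibling.name == "p":
            strong = sibling.find("strong")
            if strong and any(h in strong.get_text() for h in KNOWN_HEADINGS):
                break
        sibling.extract()
        sibling = next_sibling

    # Insert the new nodes in order, directly after the heading.
    last = heading
    for node in new_nodes:
        last.insert_after(node)
        last = node
    return True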
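A quick way to build trust in this helper is to run it against a tiny card and confirm that only the targeted section changes. The sketch below uses a compact variant of the helper that stops at any heading in an assumed KNOWN_HEADINGS tuple; the sample markup is invented for the demo and may not match the real article pages.

```python
from bs4 import BeautifulSoup

KNOWN_HEADINGS = ("Grammatical Breakdown:", "Pronunciation Guide:")


def is_heading(node):
    """True for a <p> whose <strong> text contains a known heading."""
    if node.name != "p":
        return False
    strong = node.find("strong")
    return strong is not None and any(h in strong.get_text() for h in KNOWN_HEADINGS)


def replace_after_heading(div, heading_text, new_nodes):
    """Compact variant: swap the nodes between one heading and the next."""
    heading = next(
        (p for p in div.find_all("p", recursive=False)
         if is_heading(p) and heading_text in p.get_text()),
        None,
    )
    if heading is None:
        return False
    for sibling in list(heading.find_next_siblings()):
        if is_heading(sibling):
            break
        sibling.extract()
    last = heading
    for node in new_nodes:
        last.insert_after(node)
        last = node
    return True


html = """<div class="extra-example">
<p><span lang="tl">Saan po?</span></p>
<p><strong>Grammatical Breakdown:</strong></p>
<ul><li>old breakdown</li></ul>
<p><strong>Pronunciation Guide:</strong></p>
<p>old guide</p>
</div>"""

soup = BeautifulSoup(html, "html.parser")
div = soup.find("div", class_="extra-example")

new_ul = soup.new_tag("ul")
new_li = soup.new_tag("li")
new_li.string = "new breakdown"
new_ul.append(new_li)

replace_after_heading(div, "Grammatical Breakdown:", [new_ul])
print(div.get_text(" ", strip=True))
```

The pronunciation section and the learner sentence should survive untouched, which is the property the helper exists to protect.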

Tips

  • Parse HTML instead of doing blind string replacement when the structure matters.
  • Keep the search scope narrow.
  • Preserve the learner sentence.
  • Replace only the generated helper sections.
  • Write output to a separate file when testing a risky batch.
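Writing to a separate file can be as small as a path helper. This sketch assumes a "-enriched" suffix and a dry_run flag; both names are hypothetical and not taken from the project's scripts.

```python
from pathlib import Path


def output_path_for(source, suffix="-enriched"):
    """Return a sibling path such as article-22-enriched.html."""
    path = Path(source)
    return path.with_name(f"{path.stem}{suffix}{path.suffix}")


def write_output(source, html, dry_run=True):
    """Write to a separate file on a dry run, else overwrite the source."""
    target = output_path_for(source) if dry_run else Path(source)
    target.write_text(html, encoding="utf-8")
    return target


print(output_path_for("article-22.html"))
```

Keeping the original file untouched makes it easy to diff the two versions before committing to a batch.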

Part 5: Validate Enriched Output

Goal

Prove that the batch update touched the expected content.

Development skill

The scripts print summary rows and sanity checks after writing files.

# One row per processed file: source, output, cards found, cards updated, cards missing a section.
print("Update summary:")
for source, out, total, updated, missing in summary:
    print(f"{source} -> {out}: extra_examples={total}, updated={updated}, missing={missing}")

# Re-read each written file and count the generated sections per card.
print("Sanity check:")
for out in outputs:
    soup = BeautifulSoup(Path(out).read_text(encoding="utf-8"), "html.parser")
    divs = soup.find_all("div", class_="extra-example")
    phrase = sum(1 for div in divs if "It is pronounced word by word as:" in div.get_text())
    breakdown = sum(1 for div in divs if "Grammatical Breakdown:" in div.get_text())
    print(f"{out}: extra_examples={len(divs)}, pron_phrase={phrase}, has_breakdown={breakdown}")
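Printed counts still rely on someone reading them. A possible next step, sketched here with a hypothetical find_mismatches helper and made-up sample counts, is to compare the numbers and report only the files where they disagree.

```python
def find_mismatches(rows):
    """Return a message for each file whose section counts look wrong.

    Each row is (file_name, extra_examples, pron_phrase, has_breakdown).
    """
    problems = []
    for name, total, phrase, breakdown in rows:
        if total == 0:
            problems.append(f"{name}: no extra-example blocks found")
        if phrase != total:
            problems.append(f"{name}: {total - phrase} card(s) lack a pronunciation phrase")
        if breakdown != total:
            problems.append(f"{name}: {total - breakdown} card(s) lack a grammar breakdown")
    return problems


# Illustrative numbers only; real rows would come from the sanity-check loop.
rows = [("article-22.html", 12, 12, 12), ("article-23.html", 12, 11, 12)]
problems = find_mismatches(rows)
for problem in problems:
    print(problem)
# A script could end with: raise SystemExit(1 if problems else 0)
```

A non-zero exit code lets the same check run unattended, for example before publishing the updated pages.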

Result

The developer can explain the enrichment pipeline as a measurable process, not a manual cleanup.

Input:
article HTML files

Transformation:
grammar and pronunciation regeneration

Output:
updated HTML files

Evidence:
counts for extra examples, pronunciation phrases, and grammar breakdowns

Tips

  • Count the exact blocks you changed.
  • Print missing sections instead of silently skipping them.
  • Keep validation close to the script.
  • Use the generated counts as evidence when presenting the pipeline to other developers.
  • Review a few updated cards visually after the script passes.

Field Note 1: Enrichment Is A Product Layer

Background: A phrase pair is useful, but grammar and pronunciation turn it into a learning card.

Goal: Add repeatable learning support without changing the sentence-card layout.

Prompt: Generate grammar and pronunciation from the visible Tagalog sentence.

Result: Every extra example becomes more useful for beginners.

Review check: Does the generated helper content explain the actual sentence on the card?

Field Note 2: Batch Scripts Are Reviewable

Background: One script for all 24 articles would be large and hard to tune.

Goal: Keep each batch close to its vocabulary domain.

Prompt: Process only three article files per script.

Result: Community Day, Friendship, and Manila Daily content can each have better local glossaries.

Review check: Can a reviewer understand the vocabulary scope from the file list and comments?

Field Note 3: Fallbacks Need Humility

Background: The pronunciation fallback is useful, but it is not a native-speaker guarantee.

Goal: Give learners a starting point while keeping the review requirement visible.

Prompt: Use curated pronunciation when available and simple fallback when needed.

Result: The site stays useful even before every word has a perfect pronunciation entry.

Review check: Are important event phrases curated instead of relying only on fallback?

Technical Sharing Angle

For a developer talk, this enrichment pipeline is a strong example of practical content engineering:

HTML article files
      ->
BeautifulSoup parser
      ->
extra-example blocks
      ->
Tagalog sentence extraction
      ->
glossary definitions
      ->
pronunciation map and fallback
      ->
section replacement
      ->
updated HTML files
      ->
sanity checks

The lesson is simple: AI-assisted learning content still needs deterministic tools. Small scripts can turn generated pages into reviewable educational material.

Closing Reflection

The grammar and pronunciation scripts show a useful middle ground between hand editing and overbuilding. The project does not need a database or a language engine to improve every card. It needs clear article batches, topic glossaries, pronunciation helpers, careful HTML patching, and validation output. That makes the app better for learners and easier to explain to developers.