AWS Builder Center Article

Make Extra Examples Unique and Reviewable in a Tagalog Learning App for AWS Manila Community Day

After the Tagalog learning app grew to 24 article pages, the biggest content risk was repetition. A page can look complete while many of its cards repeat the same extra examples. Three Python scripts show how to rewrite, deduplicate, and validate example content so the app is easier to review and more credible as a technical demo.

Tagalog Learning App Article Series

Disclaimer

Purpose: This static app is an educational prototype for language practice and developer sharing. It helps learners prepare polite Tagalog phrases for AWS Manila Community Day.

Non-Commercial: This project has no paid feature, no advertising, no registration requirement, and no commercial purpose. It is intended for learning, experimentation, and community preparation.

No Guarantees: Generated language content may contain mistakes. Tagalog translations, grammar explanations, pronunciation guides, and cultural notes should be checked by native speakers before production use.

Scope: This app is not an official AWS product, not an official translation tool, and not a substitute for human language instruction. It is a technical demo with useful learning content.

Community Respect: The app should avoid stereotypes and should teach polite language carefully. Words like po, opo, kayo, and ninyo should be explained as tools for respect, not decoration.


Demo

Natural Tagalog:

Nakalimutan ko ang payong ko.

English:

I forgot my umbrella.

Polite Tagalog:

Nakalimutan ko po ang payong ko.

Friendly Filipino-English:

Nakalimutan ko ang payong ko, okay po.

Playful Filipino-English:

Uy, nakalimutan ko ang payong ko, all right.

Tone:

daily

Cultural Context:

Use this for rain. Start with the polite form when talking to guards, drivers, vendors, staff, elders, and people you meet for the first time.

Context Use:

Useful for daily Manila situations. Short Tagalog sentences plus polite markers sound natural, warm, and practical in public places.

Grammatical Breakdown

  • Nakalimutan: Means forgot; a completed-aspect verb built on the root limot (forget).
  • ko: Can mean my, me, or I depending on the sentence pattern.
  • ang: Focus marker placed before the main noun or idea.
  • payong: Noun meaning umbrella.
  • po: Respect marker used for polite speech.

Pronunciation Guide

It is pronounced word by word as: nah-kah-lee-moo-tahn koh ahng pah-yohng poh.

  • Nakalimutan: break it into na: nah + ka: kah + li: lee + mu: moo + tan: tahn.
  • ko: say it as koh.
  • ang: say it as ahng.
  • payong: break it into pa: pah + yong: yohng.
  • po: say it as poh.

Content Snapshot

Content QA scripts:
- rewrite_extras_from_main_examples.py
- force_rewrite_all_extra_examples.py
- dedupe_tagalog_examples.py

Main jobs:
- Read each sentence-card article block
- Extract Natural Tagalog
- Make Natural Tagalog unique when duplicated
- Create three extra examples for each card
- Add English, Natural Tagalog, and Polite Tagalog versions
- Regenerate grammar and pronunciation sections
- Assert every article still has 40 cards
- Print updated file counts

Table of Contents

Part 1: Find The Repetition Problem

This section explains why extra examples need their own QA pass after the main article pages are generated.

Part 2: Rewrite Extra Examples From The Main Sentence

This section shows how extra examples can be generated from the current card's Natural Tagalog line.

Part 3: Force Consistent Rewrites Across All Cards

This section explains why a second pass can be useful when card boundaries or existing HTML structure vary.

Part 4: Deduplicate With Traceable Lesson Context

This section explains how article number, sentence number, and example number can make repeated content reviewable.

Part 5: Validate Card Counts And Updated Files

This section shows how simple assertions protect the static site from broken article pages.


Part 1: Find The Repetition Problem

Goal

Make every card feel useful, not copied.

Development skill

The skill is content QA for generated static pages. A generated site can pass layout checks but still fail learning quality if every extra example repeats the same phrase.

Prompt

Review every sentence card.
If extra examples are repeated or too generic, rewrite them.
Every card should keep three extra examples.
Each extra example should include:
- Tagalog
- English
- Natural Tagalog
- Polite Tagalog
- Grammatical Breakdown
- Pronunciation Guide

Review signal

Repetition can hide inside valid HTML:

<div class="extra-example">
  <p><strong>Tagalog:</strong><br><span lang="tl">Ayos, salamat.</span></p>
  <p><strong>English:</strong><br>All right, thank you.</p>
</div>

This block is structurally fine, but if hundreds of cards use the same sentence, the learning value drops.

Tips

  • Count cards and examples separately.
  • Search for repeated Tagalog text.
  • Review examples by article group.
  • Keep the main sentence stable unless it is duplicated.
  • Make dedupe changes traceable instead of random.
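
The "search for repeated Tagalog text" tip can be sketched as a small counter. This is a minimal sketch, not one of the project scripts: the function name repeated_tagalog and the in-memory pages dictionary are assumptions, and it assumes Tagalog text sits in span lang="tl" elements as in the block above.

```python
import re
from collections import Counter

# Assumed markup: Tagalog text sits in <span lang="tl">...</span>.
TAGALOG_SPAN = re.compile(r'<span lang="tl">(.*?)</span>', re.DOTALL)


def repeated_tagalog(pages: dict[str, str], min_count: int = 2) -> list[tuple[str, int]]:
    """Return (text, count) pairs for Tagalog spans seen at least min_count times."""
    counts: Counter[str] = Counter()
    for html_text in pages.values():
        for raw in TAGALOG_SPAN.findall(html_text):
            # Collapse whitespace and case so near-identical copies count together.
            counts[" ".join(raw.split()).casefold()] += 1
    return [(text, n) for text, n in counts.most_common() if n >= min_count]


# Hypothetical sample pages, keyed by file name.
sample = {
    "article-1.html": '<span lang="tl">Ayos, salamat.</span><span lang="tl">Kumusta po?</span>',
    "article-2.html": '<span lang="tl">Ayos,  salamat.</span>',
}
print(repeated_tagalog(sample))
```

Sorting by count puts the most-repeated lines at the top of the review list, which is usually where a reviewer wants to start.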

Part 2: Rewrite Extra Examples From The Main Sentence

Goal

Use the card's current Natural Tagalog sentence as the source for related examples.

Development skill

rewrite_extras_from_main_examples.py extracts the main sentence and builds three example blocks around it.

def extra_block(index: int, main: str) -> str:
    # Each tuple holds (Tagalog, English, Natural Tagalog, Polite Tagalog).
    examples = [
        (
            f'Gagamitin ko rin ang linyang "{main}" mamaya.',
            f'I will also use the line "{main}" later.',
            f'Uulitin ko ang linyang "{main}" nang dahan-dahan.',
            f'Pakisuyo, uulitin ko po ang linyang "{main}" nang dahan-dahan.',
        ),
        (
            f'Sasabihin ko ang linyang "{main}" sa kausap ko.',
            f'I will say the line "{main}" to the person I am talking to.',
            f'Ipapaliwanag ko ang linyang "{main}" sa simpleng paraan.',
            f'Pakisuyo, ipapaliwanag ko po ang linyang "{main}" sa simpleng paraan.',
        ),
        (
            f'Magsanay tayo gamit ang linyang "{main}" ngayon.',
            f'Let us practice using the line "{main}" now.',
            f'Isusulat ko ang linyang "{main}" sa notes ko.',
            f'Pakisuyo, isusulat ko po ang linyang "{main}" sa notes ko.',
        ),
    ]
    # Excerpt ends here; the part that turns a tuple into an HTML block is not shown.

Why this matters

The examples are connected to the card. If the card teaches:

Natural Tagalog:
Paki-check kung pumasok ang bayad.

The extra examples can talk about using, repeating, explaining, or writing that line. That is more useful than unrelated filler.

Generated example shape

<div class="extra-example">
  <p><strong>Tagalog:</strong><br><span lang="tl">Gagamitin ko rin ang linyang "Paki-check kung pumasok ang bayad" mamaya.</span></p>
  <p><strong>English:</strong><br>I will also use the line "Paki-check kung pumasok ang bayad" later.</p>
  <p><strong>Natural Tagalog:</strong><br><span lang="tl">Uulitin ko ang linyang "Paki-check kung pumasok ang bayad" nang dahan-dahan.</span></p>
  <p><strong>Polite Tagalog:</strong><br><span lang="tl">Pakisuyo, uulitin ko po ang linyang "Paki-check kung pumasok ang bayad" nang dahan-dahan.</span></p>
</div>

Tips

  • Base examples on the current card.
  • Keep three examples per card.
  • Include both natural and polite versions.
  • Reuse grammar and pronunciation helpers after rewriting examples.
  • Escape injected text before writing HTML.
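
The last tip, escaping injected text, can be sketched with the standard library. This is an illustration under assumptions rather than the script's own renderer: render_extra is a hypothetical helper, and the markup copies the generated example shape shown above.

```python
from html import escape


def render_extra(tagalog: str, english: str, natural: str, polite: str) -> str:
    """Render one extra-example block, escaping every injected string."""
    rows = [
        ("Tagalog", tagalog, True),
        ("English", english, False),
        ("Natural Tagalog", natural, True),
        ("Polite Tagalog", polite, True),
    ]
    lines = ['<div class="extra-example">']
    for label, value, is_tagalog in rows:
        safe = escape(value)  # turns &, <, >, and quotes into entities
        body = f'<span lang="tl">{safe}</span>' if is_tagalog else safe
        lines.append(f"  <p><strong>{label}:</strong><br>{body}</p>")
    lines.append("</div>")
    return "\n".join(lines)


print(render_extra('Gagamitin ko ang "A & B".', "I will use A & B.", "n", "p"))
```

With the default quote=True, html.escape also converts double quotes to entities. Browsers render those the same as the raw quotes in the example above, so the visible text does not change.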

Part 3: Force Consistent Rewrites Across All Cards

Goal

Handle article pages where regex card matching misses cards or extra-example boundaries vary, so a targeted rewrite is not enough.

Development skill

force_rewrite_all_extra_examples.py scans for sentence-card start tags, slices each card, rewrites extra examples, refreshes helper sections, and checks the card count.

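import re
from pathlib import Path

# Start tag that opens every generated sentence card.
ARTICLE_START = re.compile(r'<article class="sentence-card" id="sentence-\d+">')

def rewrite_cards(path: Path, text: str) -> tuple[str, int]:
    # field_text, rewrite_extras, and refresh_main_sections are helpers defined elsewhere in the script.
    starts = [match.start() for match in ARTICLE_START.finditer(text)]
    if not starts:
        return text, 0

    parts = []
    cursor = 0
    count = 0

    for index, start in enumerate(starts):
        if index + 1 < len(starts):
            end = starts[index + 1]
        else:
            # The last card runs to the closing </section>; fail instead of slicing with -1.
            end = text.find("</section>", start)
            if end == -1:
                raise RuntimeError(f"{path.name}: no </section> after the last card")
        card = text[start:end]
        natural = field_text(card, "Natural Tagalog:")
        card = rewrite_extras(card, natural)
        card = refresh_main_sections(card)
        parts.append(text[cursor:start])  # keep everything between cards untouched
        parts.append(card)
        cursor = end
        count += 1

    parts.append(text[cursor:])
    return "".join(parts), count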

Why this matters

Generated HTML can have long lines, repeated sections, or nested blocks that make simple replacements fragile. A card-slicing pass gives the script a clearer unit of work.

Result

In effect, the script enforces this contract:

For every article page:
- identify 40 sentence cards
- rewrite each card
- refresh grammar and pronunciation
- fail if the page does not have 40 cards

Tips

  • Slice by stable article-card markers.
  • Fail loudly when expected counts are wrong.
  • Keep the main card text and helper-section refresh in the same pass.
  • Use a force rewrite when earlier targeted rewrites leave inconsistent blocks.
  • Review generated HTML after a force pass.
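
The card-slicing code above calls field_text, which this article does not show. One way such a helper could work is sketched below. The name field_text_sketch and the assumed markup (a strong label, a br, then the text) are illustrative guesses based on the extra-example blocks shown earlier, not the script's actual implementation.

```python
import re
from html import unescape


def field_text_sketch(card: str, label: str) -> str:
    """Return the text that follows a <strong>label</strong> heading in one card, or ''."""
    pattern = re.compile(
        r"<strong>" + re.escape(label) + r"</strong>\s*<br>\s*(.*?)</p>",
        re.DOTALL,
    )
    match = pattern.search(card)
    if match is None:
        return ""
    # Drop inner tags such as <span lang="tl">, then decode entities.
    inner = re.sub(r"<[^>]+>", "", match.group(1))
    return unescape(" ".join(inner.split()))


card = (
    '<p><strong>Natural Tagalog:</strong><br>'
    '<span lang="tl">Paki-check kung pumasok ang bayad.</span></p>'
)
print(field_text_sketch(card, "Natural Tagalog:"))
```

Returning an empty string for a missing field lets the caller decide whether to skip the card or fail the page.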

Part 4: Deduplicate With Traceable Lesson Context

Goal

Remove repeated Natural Tagalog content and make examples unique enough for review.

Development skill

dedupe_tagalog_examples.py adds two important ideas:

1. Track duplicate Natural Tagalog lines across all article pages.
2. Generate traceable extra examples using article number, sentence number, and extra example number.

Code pattern

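import re

def unique_main_text(text: str, article: int, sentence: int, occurrence: int) -> str:
    # number_word is a helper defined elsewhere in the script; it spells a number as a Tagalog word.
    context = f"sa aralin {number_word(article)}, pangungusap {number_word(sentence)}"
    clean = text.strip()
    if occurrence <= 1:
        return clean  # the first sighting stays unchanged
    if clean.endswith("?"):
        return re.sub(r"\?$", f" {context}?", clean)
    # count=1 stops the extra empty match at end-of-string from appending the context twice (Python 3.7+).
    return re.sub(r"[.!]?$", f" {context}.", clean, count=1)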

Why this matters

If the same Natural Tagalog sentence appears more than once, the script can make the repeated copy traceable:

Original:
Salamat sa tulong.

Duplicate-safe:
Salamat sa tulong sa aralin lima, pangungusap dalawa.

This is not only a language change. It is a QA signal. The reviewer can locate the duplicate and decide whether the generated uniqueness is acceptable or whether the source content should be rewritten by hand.
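
The occurrence argument above has to come from somewhere. The sketch below shows one way to count sightings across pages while remembering where each one was seen. OccurrenceTracker and normalize are hypothetical names, and ignoring case and end punctuation is an assumption about what should count as the same line.

```python
from collections import defaultdict


def normalize(text: str) -> str:
    """Comparison key: single-spaced, case-insensitive, without end punctuation."""
    return " ".join(text.split()).rstrip(".!?").casefold()


class OccurrenceTracker:
    """Counts how often each Natural Tagalog line has been seen, and where."""

    def __init__(self) -> None:
        self.seen: dict[str, list[tuple[int, int]]] = defaultdict(list)

    def record(self, text: str, article: int, sentence: int) -> int:
        """Store the location and return the 1-based occurrence number."""
        locations = self.seen[normalize(text)]
        locations.append((article, sentence))
        return len(locations)

    def duplicates(self) -> dict[str, list[tuple[int, int]]]:
        """Lines seen more than once, with every (article, sentence) location."""
        return {key: locs for key, locs in self.seen.items() if len(locs) > 1}


tracker = OccurrenceTracker()
print(tracker.record("Salamat sa tulong.", 1, 3))  # first sighting
print(tracker.record("Salamat sa tulong!", 5, 2))  # same line in a later article
print(tracker.duplicates())
```

Keeping the locations, not only the count, lets a reviewer jump straight to every copy of a repeated line.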

Topic-aware example generation

The dedupe script also chooses example scenes based on article group:

def group_for(path: Path) -> str:
    if "friendship" in path.name:
        return "friendship"
    if "manila-daily" in path.name:
        return "manila"
    return "community"

Then each group gets examples that fit its domain:

Community:
- session
- workshop
- community table

Friendship:
- mahinahong usapan
- tapat na mensahe
- ligtas na pag-uusap

Manila Daily:
- commute
- palengke
- araw-araw na errand

Example generated extra blocks

Community example:
Sa aralin apat, pangungusap sampu, gagamitin ko ito sa workshop.

Friendship example:
Sa aralin labing-anim, pangungusap pito, aalagaan ko ang tapat na mensahe.

Manila Daily example:
Sa aralin dalawampu't dalawa, pangungusap tatlo, gagamitin ko ito sa palengke.
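
Scene selection by group can be sketched as a lookup table. The scene names come from the lists above; the SCENES mapping, the scene_for helper, and the choice to cycle by example index are assumptions made for illustration.

```python
# Scene names copied from the group lists above; the mapping itself is an assumption.
SCENES = {
    "community": ["session", "workshop", "community table"],
    "friendship": ["mahinahong usapan", "tapat na mensahe", "ligtas na pag-uusap"],
    "manila": ["commute", "palengke", "araw-araw na errand"],
}


def scene_for(group: str, example_index: int) -> str:
    """Pick a scene for the Nth extra example, cycling if the index runs past the list."""
    # Unknown groups fall back to "community", mirroring the default in group_for.
    scenes = SCENES.get(group, SCENES["community"])
    return scenes[example_index % len(scenes)]


for i in range(3):
    print(scene_for("manila", i))
```

Because the choice depends only on the group and the index, rerunning the script produces the same scenes, which keeps diffs between runs small.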

Tips

  • Make duplicate fixes traceable.
  • Use article and sentence numbers when reviewing generated examples.
  • Choose example scenes by content group.
  • Use polite forms that differ from natural forms.
  • Treat dedupe output as review material, not final linguistic authority.

Part 5: Validate Card Counts And Updated Files

Goal

Protect the generated site from broken structure during mass rewrites.

Development skill

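The scripts enforce a hard check on card count and print a summary of the files they updated.

import re

# ROOT, CARD_RE, article_number, and rewrite_one_card are defined elsewhere in the script.
def main() -> None:
    for path in sorted(ROOT.glob("article-*.html"), key=lambda p: article_number(p)):
        text = path.read_text(encoding="utf-8")
        card_count = 0

        def replace_card(match: re.Match[str]) -> str:
            # nonlocal needs an enclosing function scope, which main() provides.
            nonlocal card_count
            card_count += 1
            return rewrite_one_card(match.group(0))

        updated = CARD_RE.sub(replace_card, text)

        # Check before writing so a page with the wrong card count is never saved.
        if card_count != 40:
            raise RuntimeError(f"{path.name}: expected 40 cards, found {card_count}")
        path.write_text(updated, encoding="utf-8")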

Result

The rewrite process becomes safer:

Expected:
24 article pages
40 sentence cards per article
3 extra examples per card

Failure mode:
raise an error instead of writing a partial page

Review checklist

  • Does every article still have 40 cards?
  • Does every card still have Natural Tagalog and Polite Tagalog?
  • Are extra examples separate and readable?
  • Are repeated Natural Tagalog lines traceable?
  • Were grammar and pronunciation sections regenerated after example rewrites?
  • Are generated examples suitable for the article group?
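
The structural items in this checklist can be automated. The following is a minimal sketch, assuming the sentence-card start tag from Part 3 and the extra-example class shown earlier. The check_page function and its message wording are hypothetical.

```python
import re

# Same start-tag pattern as the Part 3 excerpt.
CARD_START = re.compile(r'<article class="sentence-card" id="sentence-\d+">')


def check_page(html_text: str, expected_cards: int = 40, expected_extras: int = 3) -> list[str]:
    """Return human-readable problems; an empty list means the page passed."""
    problems: list[str] = []
    starts = [m.start() for m in CARD_START.finditer(html_text)]
    if len(starts) != expected_cards:
        problems.append(f"expected {expected_cards} cards, found {len(starts)}")
    for i, start in enumerate(starts):
        end = starts[i + 1] if i + 1 < len(starts) else len(html_text)
        card = html_text[start:end]
        for label in ("Natural Tagalog:", "Polite Tagalog:"):
            if label not in card:
                problems.append(f"card {i + 1}: missing {label}")
        extras = card.count('class="extra-example"')
        if extras != expected_extras:
            problems.append(f"card {i + 1}: expected {expected_extras} extra examples, found {extras}")
    return problems
```

Collecting every problem before failing gives the reviewer one complete report per page instead of one error per run. The linguistic items on the checklist still need a human reader.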

Field Note 1: Extra Examples Are Product Content

Background: Extra examples are not decorative. They teach how to reuse a phrase.

Goal: Make examples connected to the current card.

Prompt: Build extra examples from the card's Natural Tagalog sentence.

Result: Learners see how the same line can be practiced, repeated, explained, or used in a scene.

Review check: Does each example support the main phrase instead of distracting from it?

Field Note 2: Dedupe Should Be Explainable

Background: Generated content can repeat across many cards.

Goal: Make repeated lines visible and traceable.

Prompt: Track Natural Tagalog strings across all article pages and add lesson context when a duplicate appears.

Result: Reviewers can quickly identify where repeated content came from.

Review check: Can a reviewer map the generated example back to article number and sentence number?

Field Note 3: Polite Variants Need Real Difference

Background: A polite field that repeats the natural field does not teach much.

Goal: Add po, Pakisuyo, or another respectful pattern when appropriate.

Prompt: If Natural Tagalog and Polite Tagalog are identical, create a distinct polite version.

Result: Learners can compare natural and respectful speech more clearly.

Review check: Does the polite example sound respectful without becoming awkward?
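
Detecting a polite field that teaches nothing new can be sketched as a flagging helper. It only reports; writing the better polite line is left to a person. The marker set reuses words named in this article and is not a complete list, and polite_review_flags is a hypothetical name.

```python
import re

# Markers named in this article; the set is illustrative, not exhaustive.
POLITE_MARKERS = {"po", "opo", "kayo", "ninyo", "pakisuyo"}


def _key(text: str) -> str:
    """Comparison key: single-spaced, case-insensitive, without end punctuation."""
    return " ".join(text.split()).rstrip(".!?").casefold()


def polite_review_flags(natural: str, polite: str) -> list[str]:
    """Return review hints for a natural/polite pair; an empty list means no hint."""
    flags: list[str] = []
    if _key(natural) == _key(polite):
        flags.append("identical to natural")
    words = set(re.findall(r"[a-zñ'-]+", polite.casefold()))
    if not words & POLITE_MARKERS:
        flags.append("no known polite marker")
    return flags


print(polite_review_flags("Salamat sa tulong.", "Salamat sa tulong."))
print(polite_review_flags("Salamat sa tulong.", "Salamat po sa tulong."))
```

A missing marker is only a hint. A sentence can be respectful without any of these words, so flagged pairs belong in the native-speaker review queue and should not be rewritten automatically.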

Field Note 4: Regex Is Useful But Needs Guardrails

Background: HTML regex can be fragile if used carelessly.

Goal: Use regex only around stable generated markers and validate counts afterward.

Prompt: Match known sentence-card blocks, rewrite them, then assert 40 cards.

Result: The script stays practical for generated static HTML while avoiding silent corruption.

Review check: Does the script fail when expected card counts are missing?

Technical Sharing Angle

This content QA flow is useful for a developer talk because it shows the hidden work behind a believable generated demo:

Generated article pages
      ->
detect sentence cards
      ->
extract Natural Tagalog
      ->
rewrite three extra examples
      ->
make duplicates traceable
      ->
regenerate grammar and pronunciation
      ->
assert 40 cards
      ->
write updated files
      ->
print JSON or count summary
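
The final step of the flow above, printing a summary, can be sketched as follows. The build_summary function and its field names are assumptions. The expected figures (24 pages, 40 cards per page, 3 extra examples per card) come from this article.

```python
import json


def build_summary(card_counts: dict[str, int], expected_cards: int = 40, extras_per_card: int = 3) -> dict:
    """Summarize per-file card counts into one reviewable record."""
    wrong = {name: n for name, n in card_counts.items() if n != expected_cards}
    total_cards = sum(card_counts.values())
    return {
        "articles": len(card_counts),
        "cards": total_cards,
        # Expected, not measured: assumes every card carries extras_per_card examples.
        "expected_extra_examples": total_cards * extras_per_card,
        "pages_with_wrong_card_count": wrong,
        "ok": not wrong,
    }


# Hypothetical counts for 24 pages that each passed the 40-card check.
counts = {f"article-{n}.html": 40 for n in range(1, 25)}
print(json.dumps(build_summary(counts), indent=2))
```

A machine-readable summary can be saved after each run and compared with the previous one, which turns "the page looks fine" into a number a reviewer can check.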

The strongest lesson is that generated educational content needs a QA pipeline, not only a prompt.

Example Talk Script

The first version generated many cards, but card count is not the same as learning quality.
So we added a content QA pass.

The script reads every sentence-card block, extracts the Natural Tagalog line, and rebuilds the extra examples.
If the line has already appeared, the dedupe step adds article and sentence context.

After rewriting, the script regenerates grammar and pronunciation helper sections.
Then it asserts that each article still has exactly 40 cards.

This gives us two wins:
1. The learner gets more useful examples.
2. The developer gets measurable output for review.

Closing Reflection

The extra-example rewrite scripts show that content QA is an engineering problem. The project uses small Python scripts, stable HTML markers, escaped output, group-aware example scenes, duplicate tracking, grammar refresh, pronunciation refresh, and hard card-count checks. That workflow makes the Tagalog learning app more useful for learners and more convincing as a technical sharing demo.