Data lab: What we can (and can’t) say about r/NBA postgame thread volume
82 PullPush playoff threads: no significant margin, scoring, or market-tier slopes; team effects are heterogeneous but confounded; OT regression is misleading — read why.
Data lab · sports + platform
This is an honest read of one reproducible slice: [Post Game Thread] submissions from the PullPush archive (2025-04-15 → 2025-05-25 UTC, overlapping the 2025 playoffs). We parse scores and team names from thread titles (no live Reddit API), attach coarse market tiers, and test whether score-line features (margin, combined points) or market tiers predict how big the thread gets, measured as log(1 + comment count) on the submission.
Frozen numbers: /data/nba-reddit-pgt-pullpush-lab.json. Code: data-projects/nba-r-postgame-threads/scripts/run_pullpush_archive_analysis.py.
The short answer: significant relationships?
At conventional levels (α = 0.05), we do not find robust associations between comment volume and:
- Final margin (Spearman ρ ≈ 0.07, p ≈ 0.54)
- Combined points (ρ ≈ −0.01, p ≈ 0.90)
- Coarse market tier sum (ρ ≈ 0.10, p ≈ 0.37)
So the scatterplots below are not hiding a strong linear or monotonic story at this sample size (N = 82); they are mostly noise plus heavy tails (a few threads dominate the scale).
That is still a useful result: in this playoff window, thread volume does not line up cleanly with “close game,” “shootout,” or our hand-built market tier once you look at the data without wishful thinking.
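For readers who want to see what a rank correlation like the ones above involves, here is a minimal standard-library sketch. It is an illustration under stated assumptions, not the project's script: the function names are hypothetical, the toy numbers are made up, and the p-value comes from a permutation test, which may differ from whatever routine produced the published figures.

```python
import math
import random


def average_ranks(values):
    """Rank values from 1..n; tied values share the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j across the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1  # positions are 0-based, ranks 1-based
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation; assumes neither input is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def spearman_rho(x, y):
    """Spearman's rho is the Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


def permutation_p_value(x, y, n_perm=2000, seed=0):
    """Two-sided permutation p-value for rho (an assumed choice of test)."""
    rng = random.Random(seed)
    observed = abs(spearman_rho(x, y))
    shuffled = list(y)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        if abs(spearman_rho(x, shuffled)) >= observed:
            hits += 1
    return (hits + 1) / (n_perm + 1)


# Made-up toy rows, NOT values from the archive.
margins = [3, 12, 7, 25, 1, 9, 15, 4]
num_comments = [5200, 4100, 0, 6100, 3900, 12000, 2500, 4800]
volume = [math.log1p(c) for c in num_comments]

rho = spearman_rho(margins, volume)
p = permutation_p_value(margins, volume)
print(f"rho={rho:.2f}, permutation p={p:.2f}")
```

Because Spearman works on ranks, the log transform does not change ρ; it matters for the OLS models and the plots, where a few very large threads would otherwise dominate the scale.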
What looked “significant” — and how to read it
1) Team fixed effects (home + away) — the overall F-test for the model is significant (p ≈ 0.002). Some individual team dummies are significant too. That does not mean “Boston away threads are always bigger” in general. It means which matchup and which round you are in (Knicks–Celtics Game 1 vs. a Tuesday first-round game) moves volume — and those dimensions are bundled into team labels here. Treat it as heterogeneity, not a clean causal team ranking.
2) “Overtime” in the regression — core OLS gives OT a negative coefficient with p ≈ 0.08. That is not evidence that overtime games get less discussion. In our 82 rows, only 4 titles mention OT — and 3 of those 4 have num_comments = 0 in the archive (likely removed threads, lock states, or ingestion gaps). The negative sign is selection: we are comparing mostly normal threads to a handful of empty OT threads. Ignore the OT coefficient until comments are recovered for those posts or OT is labeled from a box-score join.
3) MixedLM — the estimated variance of the home-team random intercept sits on the boundary (~0). So within this slice, extra team-level random intercepts do not explain residual variance beyond the fixed predictors — consistent with small groups and noise.
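The overtime caveat in point 2 can be illustrated with a small sketch. The rows below are made up and only mimic the shape of the problem (4 OT rows, 3 of them with zero comments); the field names `is_ot` and `num_comments` and the helper `ot_gap` are assumptions for illustration. With a single 0/1 predictor, the OLS slope equals the difference in group means, so this is a simplified stand-in for the post's multi-predictor model rather than a reproduction of it.

```python
import math
from statistics import mean


def ot_gap(rows, drop_zero_comment=False):
    """Mean log1p(num_comments) of OT rows minus that of non-OT rows.

    With one 0/1 predictor, this difference equals the OLS slope on the
    dummy. Returns None if either group is empty.
    """
    if drop_zero_comment:
        rows = [r for r in rows if r["num_comments"] > 0]
    ot = [math.log1p(r["num_comments"]) for r in rows if r["is_ot"]]
    reg = [math.log1p(r["num_comments"]) for r in rows if not r["is_ot"]]
    if not ot or not reg:
        return None
    return mean(ot) - mean(reg)


# Made-up rows, NOT archive values: 6 regulation threads and 4 OT threads,
# 3 of which have zero recorded comments.
rows = (
    [{"is_ot": False, "num_comments": c}
     for c in (3200, 4100, 5000, 2800, 6100, 3900)]
    + [{"is_ot": True, "num_comments": c} for c in (0, 0, 0, 9000)]
)

gap_all = ot_gap(rows)
gap_nonzero = ot_gap(rows, drop_zero_comment=True)
print(f"all rows: {gap_all:+.2f}, zero-comment rows dropped: {gap_nonzero:+.2f}")
```

In this toy data the sign of the OT gap flips once the empty threads are dropped, and the "dropped" estimate then rests on a single OT thread. That is the pattern to check for before reading anything into the coefficient.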
What we measured
- Outcome: log(1 + num_comments) on the submission (thread volume, not per-comment scores).
- From titles: absolute margin, combined points, rough OT if the title says “OT” / “overtime.”
- Market: tier_sum from a coarse internal 1–3 scale per team (not Nielsen).
N = 82 threads with a parsable scoreline and two matched teams.
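As a rough sketch of how title-derived features like these could be computed, the example below parses a score line and an OT flag from a title. The title format, both regular expressions, and the function name `title_features` are assumptions for illustration; the project's actual parser may differ, and team matching is omitted here.

```python
import math
import re

# Assumed pattern: two 2-3 digit scores joined by a hyphen or en dash.
# Single-digit pairs such as a "(1-0)" series record do not match.
SCORE_RE = re.compile(r"(?<!\d)(\d{2,3})\s*[-–]\s*(\d{2,3})(?!\d)")
# Assumed OT markers: "OT", "2OT", "overtime" as whole words.
OT_RE = re.compile(r"\b(?:\d?OT|overtime)\b", re.IGNORECASE)


def title_features(title, num_comments):
    """Return margin, combined points, OT flag and log volume, or None.

    If several score-like pairs appear, the last one is used (an assumption).
    """
    scores = SCORE_RE.findall(title)
    if not scores:
        return None
    a, b = (int(s) for s in scores[-1])
    return {
        "margin": abs(a - b),
        "combined_points": a + b,
        "is_ot": bool(OT_RE.search(title)),
        "log_volume": math.log1p(num_comments),
    }


# Hypothetical title, written for this example only.
example = title_features(
    "[Post Game Thread] The New York Knicks (1-0) defeat the Boston Celtics "
    "(0-1) in OT, 108-105",
    num_comments=0,
)
print(example)
```

A thread with zero recorded comments gets log_volume = 0.0, which is why removed or empty threads sit at the very bottom of the scale instead of dropping out of the sample.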
Figures (exploratory, not proof)
Margin vs volume — visually flat; confirms weak correlation.
Combined score vs volume — same story on a different axis.
Regulation vs OT — only four OT-labeled threads; the boxplot is not a fair OT test (see above).
Market tier vs volume — exploratory; not significant in OLS or Spearman.
Limitations (still)
- PullPush is a third-party archive — not Reddit’s official API.
- Title parsing ≠ schedule join; scores can appear in multiple places in a title.
- Playoffs only — do not generalize to regular-season cadence or moderation patterns.
- 2025–26 live games need either archive catch-up or OAuth ingestion for a season-long panel.
Reproduce
cd data-projects/nba-r-postgame-threads
pip install -e .
python3 scripts/run_pullpush_archive_analysis.py
Figures write to apps/web/public/images/blog/; the JSON above is the canonical summary for this run.