Dear Reader,
Welcome to the November 7th edition of the Data Science Briefing!
We're proud to announce that a brand-new Data Visualization with Python on-demand video is now available on the O'Reilly website: Python Data Visualization: Create impactful visuals, animations and dashboards. This in-depth tutorial is almost 7 hours long and covers fundamental and advanced usage of matplotlib, seaborn, plotly, and bokeh, as well as tips on how to use Jupyter widgets. Check it out!
The latest blog post in the Epidemiology series is also out:
Demographic Processes. In this post, we explore how to include birth and death rates in your epidemik models. Check it out!
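If you want a feel for what demographic processes change, here is a minimal sketch of an SIR model with equal per-capita birth and death rates, written in plain numpy for illustration rather than with epidemik's actual API (the function and parameter names here are ours):

    import numpy as np

    def sir_with_vital_dynamics(beta=0.3, gamma=0.1, mu=0.01,
                                s0=0.99, i0=0.01, days=365, dt=0.1):
        """Forward-Euler integration of an SIR model with births and deaths.

        All newborns enter S; every compartment dies at rate mu, so the
        (normalized) population stays approximately constant. Rates are per day.
        """
        steps = int(days / dt)
        S, I, R = s0, i0, 1.0 - s0 - i0
        trajectory = np.empty((steps, 3))
        for t in range(steps):
            trajectory[t] = (S, I, R)
            dS = mu - beta * S * I - mu * S          # births minus infection and death
            dI = beta * S * I - gamma * I - mu * I   # infection minus recovery and death
            dR = gamma * I - mu * R                  # recovery minus death
            S, I, R = S + dS * dt, I + dI * dt, R + dR * dt
        return trajectory

    # With mu > 0 the susceptible pool is replenished, so the epidemic can
    # settle into an endemic equilibrium instead of simply burning out.
    print(sir_with_vital_dynamics()[-1])  # long-run (S, I, R) fractions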
This week’s links trace the full AI stack, from gritty data plumbing to big-picture ethics. Netflix’s write-ahead log post is a masterclass in designing resilient pipelines (think exactly-once semantics, fast recovery, and taming out-of-order events). At the same time, Karpathy’s “Yes, you should understand backprop” is the clearest case yet for owning the math that actually shapes your gradients.
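In that spirit, here is a toy example of our own (not from Karpathy's post) that writes the backward pass of a single sigmoid unit by hand and sanity-checks it against a finite difference:

    import numpy as np

    # Forward pass for a single sigmoid neuron with a squared-error loss.
    def forward(w, x, y):
        z = np.dot(w, x)              # pre-activation
        a = 1.0 / (1.0 + np.exp(-z))  # sigmoid
        loss = 0.5 * (a - y) ** 2
        return z, a, loss

    # Backward pass, applying the chain rule by hand:
    # dL/dw = (a - y) * sigmoid'(z) * x, with sigmoid'(z) = a * (1 - a).
    def grad_w(w, x, y):
        _, a, _ = forward(w, x, y)
        return (a - y) * a * (1.0 - a) * x

    rng = np.random.default_rng(0)
    w, x, y = rng.normal(size=3), rng.normal(size=3), 1.0

    # Check against a centered finite difference on the first weight.
    eps = 1e-6
    w_plus, w_minus = w.copy(), w.copy()
    w_plus[0] += eps
    w_minus[0] -= eps
    numeric = (forward(w_plus, x, y)[2] - forward(w_minus, x, y)[2]) / (2 * eps)
    print(grad_w(w, x, y)[0], numeric)  # should agree to ~6 decimal places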
On the product side, Fly.io’s “You Should Write an Agent” argues for shipping small, focused agents now, wiring tools, memory, and guardrails instead of chasing grand abstractions. Meanwhile, the data economy behind those agents is under scrutiny: The Atlantic’s look at Common Crawl spotlights the invisible infrastructure (and messy provenance) that feeds modern models, and Lee Fang tracks how copyright enforcement has faded just as AI scraping has surged, shifting old piracy debates into the training-data era.
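To make the "small, focused agent" idea concrete, here is a deliberately tiny agent loop of our own, with a stub standing in for a real LLM call; the tool names and guardrails are illustrative, not Fly.io's recipe:

    def calculator(expression: str) -> str:
        # Guardrail: whitelist characters before evaluating anything.
        if not set(expression) <= set("0123456789+-*/(). "):
            return "refused: unsupported characters"
        return str(eval(expression))

    TOOLS = {"calculator": calculator}

    def fake_model(memory):
        # A real agent would send `memory` to an LLM and parse its reply;
        # this stub hard-codes a one-tool plan ending in a final answer.
        if len(memory) == 1:
            return {"tool": "calculator", "input": "6 * 7"}
        return {"final": memory[-1].split(" -> ")[-1]}

    def run_agent(task: str, max_steps: int = 5):
        memory = [f"task: {task}"]           # running memory of the episode
        for _ in range(max_steps):           # guardrail: bounded step budget
            action = fake_model(memory)
            if "final" in action:
                return action["final"]
            result = TOOLS[action["tool"]](action["input"])
            memory.append(f"{action['tool']}({action['input']}) -> {result}")
        return "gave up: step budget exhausted"

    print(run_agent("multiply 6 by 7"))  # -> 42

The whole point is that the loop itself is boring: the leverage lives in the tools you expose, the memory you keep, and the limits you enforce.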
Zoom out to the grid, and the New Yorker’s tour of AI data centers underscores the real-world costs of scale. And if you need a north star for why all this machinery matters, Quanta reports on language models matching expert-level analysis on specific tasks, an achievement that’s as much about disciplined engineering and data stewardship as it is about model size.
From micro-level ties to macro-level contagion, this week’s papers sketch a unified playbook for modeling how information and misinformation move. Work on intersectional inequalities in social networks shows that who connects to whom still gates opportunity and exposure, setting the initial conditions for any diffusion. A physics-inspired take on news, rumors, and opinions then models how those signals propagate.
At the same time, human mobility studies remind us that geography and movement patterns still serve as the hidden coupling layer across communities. Methodologically, neural symbolic regression enables the direct discovery of governing equations from network data, and new null models for information decomposition provide principled baselines for separating synergy from redundancy when multiple signals interact.
On the language-model front, evidence that accumulating context shifts model “beliefs” raises sharp questions about prompt hygiene and evaluation drift, just as continuous autoregressive models push beyond token clocks toward smoother dynamics. Together, they argue for pipeline designs that respect structure (inequality and mobility), dynamics (learned equations, robust baselines), and cognition (context-sensitive LMs) if we want forecasting, moderation, and AI-assistant behavior to hold up outside the lab.
Our current book recommendation is Sinan Ozdemir’s “Quick Start Guide to Large Language Models”. You can find all the previous book reviews on our website. In this week’s video, we have an overview of “But what is a Laplace Transform?”.
Data shows that the best way for a newsletter to grow is by word of mouth, so if you think one of your friends or colleagues would enjoy this newsletter, go ahead and forward this email to them. This will help us spread the word!
Semper discentes,
The D4S Team
Sinan Ozdemir’s “Quick Start Guide to Large Language Models” lives up to its name. It moves quickly from core concepts (tokens, context windows, and prompt structure) to working patterns like chat apps, RAG, summarization, and lightweight agents. The sequencing is pragmatic: read a chapter, ship a prototype.
The standout value for DS/ML folks is its treatment of embeddings and retrieval. Ozdemir shows when embeddings beat fine-tuning, how to chunk and index, and how to trade off accuracy, latency, and cost with clear, reusable checklists. His sections on prompt patterns, tool use/function-calling, and interface design treat prompting like API design (constrain inputs, structure outputs, plan for failure modes), making it easy to slot into existing services.
In short: an excellent on-ramp and onboarding text. Pair it with heavier resources for evaluation, alignment, and production-grade deployments.
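To make the embeddings-and-retrieval pattern concrete, here is a minimal chunk/embed/retrieve sketch; the hashed bag-of-words embedding is a stand-in for a real embedding model, and none of this is code from the book:

    import numpy as np

    def embed(text: str, dim: int = 256) -> np.ndarray:
        """Toy embedding: hashed bag-of-words, normalized to unit length.
        A stand-in for a real model; only the retrieval logic matters here."""
        v = np.zeros(dim)
        for token in text.lower().split():
            v[hash(token) % dim] += 1.0
        norm = np.linalg.norm(v)
        return v / norm if norm else v

    def chunk(document: str, size: int = 40, overlap: int = 10):
        """Fixed-size word chunks with overlap, one common chunking baseline."""
        words = document.split()
        step = size - overlap
        return [" ".join(words[i:i + size])
                for i in range(0, max(len(words) - overlap, 1), step)]

    def retrieve(query: str, chunks, k: int = 2):
        index = np.stack([embed(c) for c in chunks])  # (n_chunks, dim)
        scores = index @ embed(query)                 # cosine similarity (unit vectors)
        return [chunks[i] for i in np.argsort(scores)[::-1][:k]]

    doc = ("RAG systems split documents into chunks, embed each chunk, "
           "and retrieve the nearest chunks to ground the model's answer.")
    for hit in retrieve("how does retrieval ground answers",
                        chunk(doc, size=8, overlap=2)):
        print(hit)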
- Building a Resilient Data Platform with Write-Ahead Log at Netflix [netflixtechblog.com]
- You Should Write An Agent [fly.io]
- Common Crawl Is Doing the AI Industry’s Dirty Work [theatlantic.com]
- Yes you should understand backprop [karpathy.medium.com]
- What Happened to Piracy? Copyright Enforcement Fades as AI Giants Rise [leefang.com]
- Inside the Data Centers That Train A.I. and Drain the Electrical Grid [newyorker.com]
- In a First, AI Models Analyze Language As Well As a Human Expert [quantamagazine.org]
- Intersectional inequalities in social ties (S. Martin-Gutierrez, M. N. C. van Dissel, F. Karimi)
- Discovering network dynamics with neural symbolic regression (Z. Yu, J. Ding, Y. Li)
- Null models for comparing information decomposition across complex systems (A. Liardi, F. E. Rosas, R. L. Carhart-Harris, G. Blackburne, D. Bor, P. A. M. Mediano)
- Accumulating Context Changes the Beliefs of Language Models (J. Geng, H. Chen, R. Liu, M. H. Ribeiro, R. Willer, G. Neubig, T. L. Griffiths)
- Continuous Autoregressive Language Models (C. Shao, D. Li, F. Meng, J. Zhou)
- The Physics of News, Rumors, and Opinions (G. Caldarelli, O. Artime, G. Fischetti, S. Guarino, A. Nowak, F. Saracco, P. Holme, M. de Domenico)
- Human Mobility in Epidemic Modeling (X. Lu, J. Feng, S. Lai, P. Holme, S. Liu, Z. Du, X. Yuan, S. Wang, Y. Li, X. Zhang, Y. Bai, X. Duan, W. Mei, H. Yu, S. Tan, F. Liljeros)
But what is a Laplace Transform?
All the videos of the week are now available in our YouTube playlist.
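If you just want the one-line definition the video expands on: the Laplace transform maps a function of time f(t) to a function of a complex variable s,

    F(s) = \int_0^\infty f(t) \, e^{-st} \, dt,

turning differentiation in t into multiplication by s, which is why it is such a workhorse for solving differential equations.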
Upcoming Events:
Opportunities to learn from us
On-Demand Videos:
Long-form tutorials
- Natural Language Processing 7h, covering basic and advanced techniques using NLTK and PyTorch.
- Python Data Visualization 7h, covering basic and advanced visualization with matplotlib, ipywidgets, seaborn, plotly, and bokeh.
- Time Series Analysis for Everyone 6h, covering data pre-processing, visualization, ARIMA, ARCH, and Deep Learning models.