Vol. I · No. 52 · WED, JUN 10, 2026
Source · Community

r/MachineLearning

Reddit · COMMUNITY

Last updated May 28, 2026, 6:00 PM

A new dataset with more than 100M high-quality, curated images, with captions and metadata! [P]

Hello everyone. The new dataset is named MONET, is Apache 2.0-licensed, and is available on HF: [https://huggingface.co/datasets/jasperai/monet](https://huggingface.co/datasets/jasperai/monet) **MONET is an open, Apache 2.0-licensed image–text dataset. It was built from 2.9 billion images and refined to 104.9 million high-quality samples.** We are also publishing [a paper](https://arxiv.org/abs/2605.21272) that explains how the dataset was created, if you are curious, and 3 companion projects: * [A UMAP to visualize the distribution](https://huggingface.co/spaces/jasperai/monet-umap) * [A retrieval tool ...

··

AI-generated CUDA kernels silently break training and inference [R]

Last month NVIDIA released [SOL-ExecBench](https://research.nvidia.com/benchmarks/sol-execbench), a new benchmark of 235 production CUDA kernels lifted from DeepSeek, Qwen, Gemma, and Kimi. We took several top-ranked AI-generated submissions and tried using them in production workloads. Many of them broke, sometimes in surprising ways. One of those kernels is the fused embedding-gradient + RMSNorm backward pass, which runs at the end of every transformer training step. We took the fastest submission on the benchmark for it, and dropped it into the training loop of a small transformer. The ke...
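One generic way to catch this kind of silent breakage is to compare a kernel's gradients against a slow reference. The sketch below is not the benchmark's kernel or the authors' test harness: it covers only a plain RMSNorm backward pass with respect to the input (no embedding-gradient fusion), in pure Python, and all function names, the `eps` value, and the sample vectors are assumptions made for illustration.

```python
import math

def rmsnorm_forward(x, weight, eps=1e-6):
    """y_i = x_i * w_i / sqrt(mean(x^2) + eps)."""
    n = len(x)
    rms = math.sqrt(sum(v * v for v in x) / n + eps)
    return [xi * wi / rms for xi, wi in zip(x, weight)]

def rmsnorm_backward_input(x, weight, grad_out, eps=1e-6):
    """Analytic gradient of sum(grad_out * y) with respect to x."""
    n = len(x)
    rms = math.sqrt(sum(v * v for v in x) / n + eps)
    dot = sum(g * w * xi for g, w, xi in zip(grad_out, weight, x))
    return [g * w / rms - xi * dot / (n * rms ** 3)
            for g, w, xi in zip(grad_out, weight, x)]

def finite_difference_grad(x, weight, grad_out, h=1e-5):
    """Central-difference reference gradient, one coordinate at a time."""
    def loss(values):
        return sum(g * y for g, y in zip(grad_out, rmsnorm_forward(values, weight)))
    grads = []
    for i in range(len(x)):
        plus, minus = list(x), list(x)
        plus[i] += h
        minus[i] -= h
        grads.append((loss(plus) - loss(minus)) / (2 * h))
    return grads

# Hypothetical sample inputs, chosen only to exercise the check.
x = [0.5, -1.2, 2.0, 0.3]
weight = [1.0, 0.5, -0.7, 1.5]
grad_out = [0.1, -0.4, 0.25, 0.9]

analytic = rmsnorm_backward_input(x, weight, grad_out)
numeric = finite_difference_grad(x, weight, grad_out)
error = max(abs(a - b) for a, b in zip(analytic, numeric))
```

A candidate kernel would take the place of `rmsnorm_backward_input` here. A check like this only covers the inputs it is given, so it can miss failures that depend on shape, dtype, or memory layout.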

··

Where do you go for serious AI research discussion online? [D]

Looking for communities where people actually dig into ML/AI research, not hype, not "look what I built with an LLM API," but discussions about papers, training dynamics, debugging real models, infra problems, that kind of thing. I'm specifically interested in places where you can post something like "I'm seeing X behaviour in my SSL training, here's the loss curve, anyone seen this before?" and get thoughtful replies instead of generic advice.

··

NuExtract3 released: open-weight 4B VLM for Markdown, OCR and structured extraction (self-hostable) [P]

Disclaimer: I work for Numind, the company behind this open-weight model. We just released a 4B model based on Qwen3.5-4B, under the Apache-2.0 license. The goal is to make information extraction from complex documents more practical with an open model: PDFs, screenshots, forms, tables, receipts, invoices, multi-page documents, and other visually structured inputs. Try it: we have a Hugging Face Space that is completely free (you don't even have to sign up): [https://huggingface.co/spaces/numind/NuExtract3](https://huggingface.co/spaces/numind/NuExtract3) If you ever used [NuMarkdown](https://hu...

··

Novel Problems in VLA [R]

Reddit discussion: researcher seeking novel VLA directions after discovering concurrent work on equivariant approaches.

··

Machine Learning on Spherical Manifold [R]

Hi, I'm interested in geometric deep learning (thanks to Michael M. Bronstein's book and Maurice Weiler's PhD thesis), and so that my projects don't go nowhere, I decided to keep a technical blog. I started with a short note about machine learning on spherical manifolds, but it's a pretty simple thing. Is there a list of open problems in GDL? Or maybe some of you work in this direction and can suggest which GDL problems are relevant to the research community?

··

What do you think about Tabular Foundation Models [D]

I've seen TabPFN-3's recent results, and there is a lot of buzz about foundation models for tabular data (TabICL, TabPFN). The performance that those models achieve is really amazing. What makes me a little suspicious about them? They can analyze small datasets only, so a few MB of data, and you need to have a large GPU machine and download a few GB of model to predict on a few MB of data. That doesn't sound rational ... I really miss the old school approach of running a single decision tree or a linear model on the data. What do you think about it? Do you think feature engineering + class...

··

Released a free 9.8M doc Indic multilingual corpus — Hindi, Bengali, Tamil, Telugu + 7 more (CC0, HuggingFace) [P]

Built this over the past few weeks as part of a multilingual research project. Figured I'd share it here. Check it out! \~9.8M web documents across 11 languages — hi, bn, ta, te, mr, gu, kn, ml, pa, ur, en. \~8.4B tokens. CC0 license. 🤗 [https://huggingface.co/datasets/AM0908/indic-hplt-v1](https://huggingface.co/datasets/AM0908/indic-hplt-v1)

··

Sub-JEPA: a simple fix to LeCun group's LeWorldModel that consistently improves performance [P]

**World models** learn compact latent representations for planning without pixel reconstruction. LeWorldModel (LeWM), from LeCun's group at NYU, achieves stable end-to-end JEPA training by enforcing an isotropic Gaussian prior over the full latent space. **The flaw:** real environment dynamics live on low-dimensional manifolds, so a global high-dimensional Gaussian is an overly rigid prior — mismatched to the task geometry. LeWM itself struggles most on low-intrinsic-dimension tasks like Two-Room. **Our fix (Sub-JEPA):** apply the Gaussian regularization inside multiple frozen random orthog...

··

Human-level performance via ML was *not* proven impossible with complexity theory [D]

Van Rooij, Guest, Adolfi, Kolokolova, and Rich [claimed to have proven that AGI via ML is impossible](https://link.springer.com/article/10.1007/s42113-024-00217-5) in *Computational Brain & Behavior* in 2024. The basic idea was to try to reduce a known NP-hard problem to the problem of learning a human-level classifier from data. The purported result, called "Ingenia Theorem" by the authors, made some noise on the internet, including here. My paper showing that the proof is irreparably broken is now [also out in CBB](https://link.springer.com/article/10.1007/s42113-026-00284-w) (ungated ...

··

Built a Support Vector Machine (SVM) from scratch in Rust [P]

Built my own SVM classifier from scratch in Rust. It uses SMO optimization, has linear and RBF kernels, and uses grid search to tune the hyperparameters. I tested it on two datasets, one with the linear kernel and the other with the RBF kernel. These were the results:

| Dataset | Kernel | Accuracy | Recall | F1 |
|:-|:-|:-|:-|:-|
| Banknote Auth | Linear | 96% | 94% | 95% |
| Breast Cancer | RBF | 93% | 100% | 92% |

https://preview.redd.it/uw26u1uo0w0h1.jpg?width=720&format=pjpg&auto=webp&s=1784e1d7d310a26fa67efc63fa5191f45433a695 https://preview.redd.it/o0ahkq7p0w0h1.jpg?width=720&format=pjpg&auto=webp&s=dcb1053c3...
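For readers unfamiliar with the two kernels named in the post, here is a minimal sketch of their standard textbook definitions. It is written in Python rather than Rust and is not taken from the author's project; the function names and the `gamma` default are assumptions.

```python
import math

def linear_kernel(a, b):
    """Linear kernel: the plain dot product of two feature vectors."""
    return sum(ai * bi for ai, bi in zip(a, b))

def rbf_kernel(a, b, gamma=0.5):
    """RBF kernel: exp(-gamma * squared Euclidean distance)."""
    squared_distance = sum((ai - bi) ** 2 for ai, bi in zip(a, b))
    return math.exp(-gamma * squared_distance)

# Identical points have RBF similarity 1; it decays as the points move apart.
same = rbf_kernel([1.0, 2.0], [1.0, 2.0])
apart = rbf_kernel([0.0, 0.0], [1.0, 1.0], gamma=0.5)
```

In a grid search of the kind the post mentions, `gamma` would typically be one of the tuned hyperparameters, alongside the SVM's regularization constant.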

··

Elastic Attention Cores for Scalable Vision Transformers [R]

Wanted to share our latest paper on an alternative building block for Vision Transformers. [Illustration of our model's accuracy and dense features](https://preview.redd.it/x4acnx478w0h1.png?width=2457&format=png&auto=webp&s=3ce49caf2b0cdea5d35141aebb7297862fdc6a7d) Traditional ViTs utilize dense ($N^2$) self-attention, which can become pretty costly at higher resolutions. In this work, we propose an alternative backbone with a core-periphery block-sparse attention structure that scales as ($2NC + C^2$) for $C$ core tokens. We further train this usin...
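To get a feel for the two scaling expressions quoted in the post, the sketch below evaluates them at face value. The token counts are hypothetical, and the functions only count attention pairs; they say nothing about constant factors or about how the paper actually implements the block-sparse structure.

```python
def dense_attention_pairs(num_tokens):
    """Pair count for dense self-attention: N^2."""
    return num_tokens * num_tokens

def core_periphery_pairs(num_tokens, num_core):
    """Pair count quoted in the post for the core-periphery structure: 2NC + C^2."""
    return 2 * num_tokens * num_core + num_core * num_core

# Hypothetical sizes: 4096 tokens and 64 core tokens.
n, c = 4096, 64
dense = dense_attention_pairs(n)     # 16,777,216
sparse = core_periphery_pairs(n, c)  # 528,384
ratio = dense / sparse               # roughly 32x fewer pairs
```

With $C$ held fixed, the sparse count grows linearly in $N$, so the gap widens as resolution increases.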

··

How do you create a memorable poster for top-tier conferences (ICML/ICLR/NeurIPS, etc.) [D]

Hello everyone, I'm presenting at a top-tier conference for the first time and am having a very hard time coming up with an appropriate design for my poster. Everything I do seems basic and banal. My paper is more theory-oriented, and apart from putting math formulas in bold in the middle, I am not sure what the best way is to design the poster. Even the sizing choice is complicated, as ICML gives 3 different recommendations to pick from, and somehow from my computer, I can’t see what the PowerPoint slide will look like printed at those dimensions. And printing a poster is nearly $100 CAD, ...

··

Steam Recommender using similarity! (Undergraduate Student Project) [P]

(DISCLAIMER: I accidentally deleted the last post on this subreddit; my apologies if this is your second time seeing it.) Last year I made a [post](https://www.reddit.com/r/datascience/comments/1lkjxmr/steam_recommender_using_vectors_student_project/) about my Steam recommender. The last one was great and served its purpose of showing many people new games, but this new version is much more functional! I love making recommendation systems that tell the user WHY they got the recommendation. During a Steam sale event, I always find myself trying to look for new video games to play. If I wanted ...

··

TabPFN-3 just released: a pre-trained tabular foundation model for up to 1M rows [R][N]

TabPFN-3 was released today, the next iteration of the tabular foundation model, originally published in Nature. Quick recap for anyone new to TabPFN: TabPFN predicts on tabular data in a single forward pass - no training, no hyperparameter search, no tuning. Built on TabPFN-2.5 (Nov 2025) and TabPFNv2 (Nature, Jan 2025), which together crossed 3M downloads and 200+ published applications. What's new: * Scale: 1M rows on a single H100 (10x larger than 2.5). A reduced KV cache (\~8GB per million rows per estimator) and row-chunked inference make this practical on a single GPU * Speed: 10x-10...

··

ICML Author Removal [D]

PhD student. Need advice. After the ICML abstract deadline, industry coauthors asked to be removed because they had missed their employer's internal approval window. They had contributed (discussions and written feedback) but I hadn't explicitly asked before adding them. January: wrote to PC chairs, got written confirmation from all coauthors, got explicit written approval. Chairs said they'd implement. Never happened. Paper accepted four months later with original author list. At camera-ready we followed up. Chairs reversed: blanket policy, no exceptions, keep the list or withdraw. What do you t...

··

Interactive Jensen–Shannon Divergence Visualisation [P]

An interactive visualisation of Jensen–Shannon divergence - the symmetric, always-finite cousin of KL. Shape two distributions and watch JSD, its ceiling of one bit, and the per-point contribution respond in real time. https://robotchinwag.com/posts/jensen-shannon-divergence-visualisation/ Feedback welcome.
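The properties named above (symmetry, finiteness, a ceiling of one bit) can be checked numerically. This is a minimal sketch of the standard definition for discrete distributions using base-2 logarithms. It is not code from the linked visualisation, and the function names and example distributions are illustrative.

```python
import math

def kl_bits(p, q):
    """KL(p || q) in bits; infinite if q is zero somewhere p is positive."""
    total = 0.0
    for pi, qi in zip(p, q):
        if pi == 0.0:
            continue  # terms with p_i = 0 contribute nothing
        if qi == 0.0:
            return math.inf
        total += pi * math.log2(pi / qi)
    return total

def jsd_bits(p, q):
    """Jensen-Shannon divergence: average KL of p and q to their mixture m."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl_bits(p, m) + 0.5 * kl_bits(q, m)

disjoint_p = [1.0, 0.0]
disjoint_q = [0.0, 1.0]
ceiling = jsd_bits(disjoint_p, disjoint_q)  # disjoint supports hit the one-bit ceiling
plain_kl = kl_bits(disjoint_p, disjoint_q)  # plain KL is infinite for the same pair

p = [0.7, 0.2, 0.1]
q = [0.1, 0.3, 0.6]
forward = jsd_bits(p, q)
backward = jsd_bits(q, p)                   # equal to forward by symmetry
```

JSD stays finite because the mixture `m` is positive wherever `p` or `q` is, so neither KL term ever divides by zero.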

··

My experience interviewing with Huawei Vancouver for an ML research role: strong mismatch between how it was pitched and how it was evaluated [D]

I want to share an interview experience anonymously in case it helps others on the job market. I was approached about a Vancouver ML role that was presented to me as research-oriented. The recruiter told me the team had looked at my research and that I should be ready to discuss my projects, so I expected a conversation about modelling, research ideas, and fit. That is not how the interview felt. It was much more focused on trivia-style and coding-style questioning, with very little real engagement with my research or how I think about problems. The overall process felt much narrower and mo...

··

Interactive KL Divergence Visualisation [P]

I built a small interactive explorer for building intuition about KL divergence: https://robotchinwag.com/posts/kl-divergence-visualisation/ You control two skew-normal distributions and can see the KL integrand and the KL metric. It’s good for exploring how it changes with a mean offset, skew, truncation and discretisation. It runs entirely client-side. Feedback is welcome.
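The effect of a mean offset and of discretisation can also be reproduced offline. The sketch below is not the explorer's code. As a simplifying assumption it uses plain Gaussians instead of skew-normals, because they have a closed-form KL to compare against, and the grid range, step, and parameters are arbitrary choices.

```python
import math

def gaussian_pmf(mean, std, grid):
    """Discretise a Gaussian onto a grid and normalise to sum to 1."""
    weights = [math.exp(-0.5 * ((x - mean) / std) ** 2) for x in grid]
    total = sum(weights)
    return [w / total for w in weights]

def kl_nats(p, q):
    """Discrete KL(p || q) in nats; infinite if q is zero where p is positive."""
    total = 0.0
    for pi, qi in zip(p, q):
        if pi == 0.0:
            continue
        if qi == 0.0:
            return math.inf
        total += pi * math.log(pi / qi)
    return total

def gaussian_kl_closed_form(m1, s1, m2, s2):
    """KL(N(m1, s1^2) || N(m2, s2^2)) in nats."""
    return math.log(s2 / s1) + (s1 ** 2 + (m1 - m2) ** 2) / (2 * s2 ** 2) - 0.5

# Grid from -20 to 20 in steps of 0.01 (an arbitrary truncation and resolution).
grid = [-20.0 + 0.01 * i for i in range(4001)]
p = gaussian_pmf(0.0, 1.0, grid)  # P = N(0, 1)
q = gaussian_pmf(1.0, 2.0, grid)  # Q = N(1, 4): mean offset 1, wider spread

forward = kl_nats(p, q)           # KL(P || Q)
reverse = kl_nats(q, p)           # KL(Q || P), not equal to forward
exact_forward = gaussian_kl_closed_form(0.0, 1.0, 1.0, 2.0)
exact_reverse = gaussian_kl_closed_form(1.0, 2.0, 0.0, 1.0)
```

On this fine grid the discretised values closely track the closed forms, and the gap between `forward` and `reverse` shows the asymmetry of KL. Coarsening the step or narrowing the range is a simple way to explore the discretisation and truncation effects the post mentions.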

··

Stop letting LLMs edit your .bib [D]

Research community reports frequent LLM hallucinations in bibliography generation, with incorrect author attributions despite correct titles, raising integrity concerns.

··
50 stories