GitHub Releases Dataset for Multilingual Developer Content
Researchers studying multilingual developer collaboration now have a repository-level dataset from GitHub for identifying public repositories with evidence of non-English natural-language content. According to a GitHub Blog post by Kevin Xu, staff software engineer, CELA, the GitHub Multilingual Repositories Dataset was released on 15 June 2026 under the CC0-1.0 licence. The release follows GitHub’s 2025 commitment under Microsoft’s European Digital Commitments to make multilingual data more accessible to open source AI developers.
The dataset is intended as a discovery layer, not a dump of repository content. It covers more than 80 million classification rows across more than 40 million repositories, using signals from README files, the most-commented issue, and the most-commented pull request. It also includes repository metadata such as creation timestamp, disk usage, stars, forks, primary programming language, SPDX licence, issue counts, pull request counts, and snapshot date.
Language classifications come from three classifiers, fastText, gcld3, and lingua-py, and each classification carries a confidence score above 0.5. GitHub says it did not collapse the classifiers into a single label because language-identification tools vary in coverage and confidence, especially for lower-resource languages. Keeping the outputs separate lets users choose their own confidence thresholds, trading precision against recall depending on the research question.
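As a rough illustration of that trade-off, the sketch below filters classification rows by a minimum confidence and by how many classifiers agree on a language. The record layout (the `Classification` fields), the function name `agreed_languages`, and the sample rows are all hypothetical, invented for this example; the dataset's actual schema and file format are not described here and may differ.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Classification:
    # Hypothetical row layout; the real column names are an assumption.
    repo: str
    source: str       # e.g. "readme", "issue", "pull_request"
    classifier: str   # e.g. "fasttext", "gcld3", "lingua-py"
    language: str     # language code reported by the classifier
    confidence: float


def agreed_languages(rows, min_confidence, min_classifiers):
    """Return the set of (repo, source, language) triples supported by at
    least `min_classifiers` distinct classifiers, counting only rows whose
    confidence is at or above `min_confidence`."""
    votes = {}
    for row in rows:
        if row.confidence >= min_confidence:
            key = (row.repo, row.source, row.language)
            votes.setdefault(key, set()).add(row.classifier)
    return {key for key, names in votes.items() if len(names) >= min_classifiers}


# Invented sample rows, not taken from the dataset.
sample_rows = [
    Classification("org/a", "readme", "fasttext", "pt", 0.97),
    Classification("org/a", "readme", "gcld3", "pt", 0.91),
    Classification("org/a", "readme", "lingua-py", "pt", 0.88),
    Classification("org/b", "readme", "fasttext", "ko", 0.62),
    Classification("org/b", "readme", "gcld3", "ja", 0.55),
]

# Strict setting: all three classifiers must agree with high confidence.
high_precision = agreed_languages(sample_rows, min_confidence=0.85, min_classifiers=3)

# Loose setting: any single classifier above the floor is enough.
high_recall = agreed_languages(sample_rows, min_confidence=0.5, min_classifiers=1)
```

With these invented rows, the strict setting keeps only the repository where every classifier agrees, while the loose setting also keeps the repository where two classifiers disagree. A study that needs clean labels might prefer the former; one that wants broad coverage of lower-resource languages might accept the latter.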
GitHub says the dataset can support studies of non-English developer documentation, issue discussions, pull request activity, and evaluation sets for AI coding tools. The company says Portuguese is the most common non-English language in README classifications, while Korean is the most common non-English language in issue text. These findings point to differences in how language appears across documentation and collaboration channels.
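A per-channel comparison of this kind can be approximated with a simple tally. The sketch below is a minimal example that assumes each classification has already been reduced to a hypothetical `(source, language)` pair; the function name `top_non_english` and the sample pairs are invented for illustration and do not reproduce GitHub's figures.

```python
from collections import Counter, defaultdict


def top_non_english(pairs):
    """Return, for each source, the most frequent language other than English.

    `pairs` is an iterable of (source, language) tuples; this shape is an
    assumption made for the example, not the dataset's actual schema.
    """
    counts = defaultdict(Counter)
    for source, language in pairs:
        if language != "en":
            counts[source][language] += 1
    return {source: counter.most_common(1)[0][0] for source, counter in counts.items()}


# Invented sample pairs, not taken from the dataset.
sample_pairs = [
    ("readme", "en"),
    ("readme", "es"),
    ("readme", "es"),
    ("readme", "ja"),
    ("issue", "ja"),
    ("issue", "ja"),
    ("issue", "es"),
    ("issue", "en"),
]

leaders = top_non_english(sample_pairs)
```

Running the same tally separately per source is what makes differences between documentation and discussion channels visible.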
The post also sets limits on interpretation. GitHub says the dataset should not be treated as a ground-truth benchmark for language identification because repository text can be short, mixed-language, code-heavy, or shaped by templates and badges. It also says the dataset should not be used to infer sensitive attributes about repository owners, contributors, or communities because the signals are repository-level metadata, not person-level attributes.
