Beware: Over Half of the GitHub Public Repositories are Not Open Source Licensed!

Open Weaver
Dec 10, 2020

We discussed different open source licenses and their implications for your development needs in our previous post. We also highlighted that code with no license is not free and open source; instead, it remains under strict copyright. As a user, you should not use such code in your work. As a creator, you must review and choose an open source license type, based on your intent, to make your work usable for others. It is also crucial for you as a creator to choose a license that protects you upfront from lawsuits through appropriate ‘without warranty’ and ‘zero liability’ clauses.

Public Repositories

Many seem to understand this issue, and GitHub also clarifies that you cannot use a public repository without an appropriate license¹. However, this is a nontrivial issue considering the scale of public repositories today and the criticality of intellectual property protection. Our deep-dive analysis of public repositories gives a view of how widespread this issue of unlicensed software assets is. Here are insights from GitHub public repositories. We will review other repositories soon.

Over half of repositories in the top 1 million do not have an appropriate license

We focused our analysis on the most useful repositories and hence picked the top 1 million public repositories on GitHub, across multiple languages, with at least 5 GitHub Stars. These 1 million public repositories were spread across 39 licenses. The results showed that the unlicensed software problem is huge. 46% of the public repositories have no license attributed to them. Another 7% have a license type of ‘other’, implying that they do not follow a standard license. Together, that is 53% of repositories, and using any of them is a risk. The MIT License and the Apache License 2.0, both permissive licenses, are the most popular open source licenses among GitHub public repositories.

Licenses across the top 1 million public repositories
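
The tally described above can be sketched in a few lines of Python. This is only an illustration, not the tooling used for the analysis. It assumes each record is shaped like a GitHub REST API repository object, where `license` is empty when no license was detected and `spdx_id` is `NOASSERTION` when GitHub reports the license as ‘Other’. The function names and the sample records are hypothetical.

```python
from collections import Counter


def license_bucket(repo):
    """Map one repository record to a license label.

    Assumption: `repo` resembles a GitHub REST API repository object,
    with `license` set to None when no license was detected and
    `spdx_id` set to "NOASSERTION" for a nonstandard ('other') license.
    """
    lic = repo.get("license")
    if lic is None:
        return "unlicensed"
    spdx = lic.get("spdx_id")
    if spdx in (None, "NOASSERTION"):
        return "other"
    return spdx


def license_shares(repos):
    """Return {label: percentage of repositories}, rounded to one decimal."""
    counts = Counter(license_bucket(r) for r in repos)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {label: round(100 * n / total, 1) for label, n in counts.items()}


def risky_share(shares):
    """Combined percentage of unlicensed and 'other' repositories."""
    return shares.get("unlicensed", 0) + shares.get("other", 0)


# Toy records for illustration only; not the data behind this article.
sample = [
    {"name": "a", "license": {"spdx_id": "MIT"}},
    {"name": "b", "license": None},
    {"name": "c", "license": {"spdx_id": "NOASSERTION"}},
    {"name": "d", "license": {"spdx_id": "Apache-2.0"}},
]

shares = license_shares(sample)
print(shares)
print(risky_share(shares))
```

Applied to the shares reported in this post, `risky_share({"unlicensed": 46, "other": 7})` adds up to the 53% figure.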

The following sections each look at a subset of these top 1 million public repositories.

The most active repositories fare slightly better but still have a high percentage of unlicensed software

After finding that over half of the 1 million most popular repositories lack an appropriate license, we reviewed multiple slices of the repositories to understand the issue better. The trend is prevalent across all types of repositories, though the more active ones have a higher percentage of licensed open source software.

Among repositories with more than 10 issues, totaling over 93,500 repositories, the MIT License edges into the number one slot with a 32% share, while unlicensed repositories stand at 23%.

Licenses across public repositories with more than 10 issues

Under a more stringent measure, repositories with more than 10 releases (over 43,400 repositories in total), the MIT License strengthens its number one position with a 42% share. Unlicensed and ‘other’ are tied for the number two position with the Apache License 2.0.

Licenses across public repositories with more than 10 releases

License Violations in Public Repositories Could be Rampant

GitHub’s Terms of Service grant all users the right to view and fork public repositories within GitHub². However, we believe forking may signal an intent to reuse code or know-how, so we analyzed this slice, and the results were alarming. We picked all repositories that have forks, totaling over 865,000 repositories. Unlicensed topped the category with a 45% share. The MIT License followed with a 29% share.

Licenses across public repositories with forks

Further, we analyzed the public repositories that are being watched. Again, we believe the potential for derivative work is high when someone watches a public repository to use it or learn from it. We picked all repositories that have watchers, totaling over 935,000 repositories. Unlicensed topped the category with a 45% share. The MIT License followed with a 30% share.

Licenses across public repositories with watchers
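
The slices discussed in these sections (issues, releases, forks, watchers) amount to filtering the same set of repositories by different criteria and ranking the licenses within each subset. Below is a minimal Python sketch of that idea. The field names (`license_label`, `issues`, `releases`, `forks`, `watchers`), the helper `top_labels`, and the sample records are all hypothetical, and the numbers it prints are not the figures reported in this post.

```python
from collections import Counter


def top_labels(repos, predicate, n=2):
    """Return the n most common license labels among repositories that
    match `predicate`, as (label, percentage) pairs."""
    selected = [r for r in repos if predicate(r)]
    if not selected:
        return []
    counts = Counter(r["license_label"] for r in selected)
    return [
        (label, round(100 * count / len(selected), 1))
        for label, count in counts.most_common(n)
    ]


# Hypothetical slice definitions mirroring the criteria used in this post.
SLICES = {
    "more than 10 issues": lambda r: r["issues"] > 10,
    "more than 10 releases": lambda r: r["releases"] > 10,
    "has forks": lambda r: r["forks"] > 0,
    "has watchers": lambda r: r["watchers"] > 0,
}

# Toy records for illustration only; not the data behind this article.
repos = [
    {"license_label": "MIT", "issues": 25, "releases": 12, "forks": 4, "watchers": 9},
    {"license_label": "MIT", "issues": 14, "releases": 0, "forks": 1, "watchers": 2},
    {"license_label": "unlicensed", "issues": 0, "releases": 0, "forks": 2, "watchers": 1},
    {"license_label": "unlicensed", "issues": 11, "releases": 0, "forks": 0, "watchers": 3},
    {"license_label": "Apache-2.0", "issues": 3, "releases": 15, "forks": 0, "watchers": 0},
]

for name, predicate in SLICES.items():
    print(name, top_labels(repos, predicate))
```

Keeping each slice as a named predicate makes it easy to add another criterion without touching the ranking logic.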

While the MIT License and the Apache License 2.0 are the most popular and propagate a permissive open source culture, it is alarming to see 46% of popular public repositories unlicensed, and therefore fully copyrighted, and another 7% under nonstandard licenses, both of which prevent reuse. GitHub should proactively suggest open source licenses to developers when they create public repositories to alleviate this issue. As developers, it is vital to understand the license implications before reuse. More importantly, as creators, please be aware of the different license choices so the world can use your knowledge in the way you intended.

Happy Reuse!

Sources

While every effort has been made to provide accurate and updated information, we regret any omission or error.

[1] Licensing a repository — https://docs.github.com/en/free-pro-team@latest/github/creating-cloning-and-archiving-repositories/licensing-a-repository

[2] GitHub Terms of Service — https://docs.github.com/en/free-pro-team@latest/github/site-policy/github-terms-of-service

Open Weaver

Open Weaver is a SaaS tech company, changing the way the world builds digital. Learn more at www.openweaver.com