[Overseas Legal Resources] How we passed the AI conundrums

Date posted:
2025.03.31
Author:
Administrator

[Original text]

How we passed the AI conundrums

Some people believe that full unfettered access to all training data is paramount. This group argues that anything less than all the data would compromise the Open Source principles, forever removing full reproducibility of AI systems, transparency, security and other outcomes. We’ve heard them and we’ve provided a solution rooted in decades of Open Source practice.

To have the chance for powerful Open Source AI systems to exist in any domain, the OSI community has incorporated in the Definition this principle:

An Open Source AI needs to make available three kinds of components: the software used to create the dataset and run the training, the model parameters and the code to run inference, and finally all the data that can be made available legally.

Recognizing that there are four kinds of “data”, each with its own legal framework allowing different freedoms of distribution, we bypass what Stephen O’Grady called the “AI conundrums” and give Open Source AI builders a chance to build freedom-respecting alternatives to pretty much any proprietary AI.

Limiting Open Source AI only to systems trainable on freely distributable data would relegate Open Source AI to a niche, and there are abundant reasons to reject this limitation. One is that the amount of freely and legally shareable data is a tiny fraction of what is necessary to train powerful systems. Another is that it would exclude Open Source AI from areas where data cannot be shared, such as medicine or anything dealing with personal or private data. What remains for “Open Source AI” would be tiny.

The fact is, mixing openly distributable and non-distributable data is very similar to a reality we are very familiar with: Open Source software built with proprietary compilers and system libraries.

Is GNU Emacs Open Source software?

I’m sure you’d answer yes (and some of you will say “well, actually it’s free software”) and we’ll all agree. Below is a rough diagram of Emacs built for the GNOME desktop on a modern Linux distribution. Emacs depends on a few system libraries that GNOME provides with OSI-Approved Licenses. The whole stack is Open Source these days and one can distribute Emacs on a disk with all its dependencies without too much legal trouble. Imagine scientists who want to freeze the whole environment of an experiment they made; they could package all the pieces of a system like this without trouble and distribute it all with their paper. No problem here.

Now let’s go back to an age when Linux systems weren’t ready. When Stallman started writing Emacs, there was no GNOME and no Linux, no gcc and no glibc. He thought very early on that in order to have more freedom, he had to create a wedge to allow Emacs to run on proprietary software.

Emacs on the latest Solaris versions would look something like this: some pieces, like X11 and GStreamer, are Open Source. Others, like libc, aren’t. The hypothetical scientists from before couldn’t really freeze their full scientific environment. All they could say in their paper was: “We used Emacs from this CVS version, built with gcc version X with this makefile; tar.gz attached” and list the versions of the operating system and libraries they used. That’s because they have the right to distribute only Emacs, X11 and some libraries, not the rest of Solaris.

Is Emacs on Solaris Open Source? Of course it is, even though the source code for the system libraries is not available.

One more question, Emacs on Mac OS: it can only be built with a proprietary compiler on proprietary GUI and other proprietary libraries.

Is Emacs on Mac Open Source? Of course it is. Can you fully study Emacs on Mac OS? For Emacs, yes. For the Mac OS components, no. There are many programs that run only on Mac OS or Windows: for OSI, those are Open Source. Would someone argue that they’re not “really Open Source” because you can’t see “everything”? Some people might, but we’ve learned to live with that, adding governance rules on top of those of the Open Source Definition. Debian, for example, requires that programs be Open Source and support multiple hardware platforms; the ASF graduates only projects that are Open Source and have a diverse community of contributors. If you only want to use Open Source applications running on Open Source stacks, you can decide that! Just as you can decide that your company will only acquire Open Source software whose copyright is owned by multiple entities.

These are all additional requirements built on top of the base floor set by the Open Source Definition.

For AI, you can do the same: You can say “I will only use Open Source AI built with open data, because I don’t want to trust anything less than that.” A large organization could say “I will buy only Open Source AI that allows me to audit their full dataset, including unshareable data.” You can do all that. Open Source AI is the floor that you can build on, like the OSD.

Bypassing the conundrums

We’ve looked for a solution for almost three years and this is it: Require all the data that is legally shareable, and for the other data provide all the details. It’s exactly what we’ve been doing for Open Source software:

You developed a text editor for Mac OS but you can’t share the system libraries? Fine, we’ll fork it: give us all the code you can legally share with an OSI-Approved License and we’ll rip out the dependencies and “liberate” it to run on GNU. The editor will be slightly different, just as code that runs on some ARM+Linux systems behaves differently on Intel+Windows because of the different capabilities of the underlying hardware and OS, but it’s still Open Source.

For Open Source AI it’s a similar dance: You can’t legally give us all the data? Fine, we’ll fork it. For example, you made an AI that recognizes bone cancer in humans but the data can’t be shared. We’ll fork it! Tell us exactly how you built the system, how you trained it, share the code you used, and an anonymized sample of the data you used so we can train on our X-ray images. The system will be slightly different but it’s still Open Source AI.

If we want to have broad availability of powerful alternatives to proprietary AI systems that respect the freedoms of users and deployers, we must recognize conditions that make sense for the domain of AI. These examples of proprietary compilers and system libraries used to build Open Source software prove that there is room for similar conditions when talking about Code, Data and Parameters within the definition of Open Source AI.


[Source] https://opensource.org/blog/how-we-passed-the-ai-conundrums

※ This work, written by opensource.org (https://opensource.org/), is available under the Creative Commons Attribution-ShareAlike 4.0 International License.
