Bits from /me: A humble draft policy on "deep learning vs. freedom"




Hi people,

A year ago I raised a topic on -devel, pointing out the
"deep learning vs. software freedom" issue. We drew no
conclusion at that time, and Linux distros that care about
software freedom may still be unsure about some fundamental
questions, e.g. "is this piece of deep learning software
really free?"

People have been putting this problem off. Now that a
related package has entered my packaging radar, I think
I'd better write a draft and shed some light on a safe
area. So here is the first humble attempt:

  https://salsa.debian.org/lumin/deeplearning-policy
  (issue tracker is enabled)

This draft is conservative, even overkill, and currently
focuses only on software freedom. That's exactly where we
start, right?

Specifically, I defined 3 types of pre-trained machine
learning models / deep learning models:

  Free Model, ToxicCandy Model, Non-free Model

Developers who'd like to touch DL software should be
cautious about "ToxicCandy" models. Details can be
found in my draft.

Apart from that, I pointed out in the draft that software
associated with any critical task should be considered
carefully, as deep neural networks introduce a new kind
of vulnerability: a network's response can be
disrupted or even controlled by carefully designed
perturbations added to the network input.
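
To make that kind of vulnerability concrete, here is a minimal toy
sketch. It is not taken from the draft: the "network" is just a
hand-written linear classifier, and the weights, bias, input and
epsilon are hypothetical values picked for the demonstration. The
perturbation follows the idea of the fast gradient sign method: nudge
every input feature by a small, bounded amount in the direction that
pushes the score toward the attacker's desired answer.

```python
# Toy illustration only: a hand-written linear classifier, not a real model.

def score(weights, bias, x):
    """Return the raw decision score w.x + b."""
    return sum(w * xi for w, xi in zip(weights, x)) + bias

def predict(weights, bias, x):
    """Return 1 ("accept") if the score is positive, else 0 ("reject")."""
    return 1 if score(weights, bias, x) > 0 else 0

def sign(v):
    """Return -1, 0 or 1 depending on the sign of v."""
    return (v > 0) - (v < 0)

def perturb(weights, x, epsilon):
    """Shift each feature by at most epsilon in the direction that raises
    the score. For a linear model the gradient of the score with respect
    to the input is simply the weight vector."""
    return [xi + epsilon * sign(w) for w, xi in zip(weights, x)]

# Hypothetical values, chosen by hand for this demonstration.
weights = [0.5, -0.25, 0.75, -0.5]
bias = -0.5
x = [0.5, 0.5, 0.25, 0.5]

x_adv = perturb(weights, x, epsilon=0.25)

original = predict(weights, bias, x)          # decision on the clean input
adversarial = predict(weights, bias, x_adv)   # decision on the perturbed input
max_change = max(abs(a - b) for a, b in zip(x, x_adv))

print("clean input     ->", original)
print("perturbed input ->", adversarial)
print("largest per-feature change:", max_change)
```

With these numbers the decision flips from "reject" to "accept" even
though no single feature moves by more than epsilon. Real networks are
nonlinear and need an actual gradient computation, but the principle
is the same.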

Hence, I suggest that packaging any piece of intelligent
software must be discussed on -devel if the software is
associated with any kind of critical task, including but
not limited to

  * authentication (e.g. login via face verification or
    identification)
  * program execution (e.g. intelligent voice assistants:
    "Hey, Siri! sudo rm -rf / --no-preserve-root")
  * physical object manipulation (e.g. mechanical
    arms outside educational settings,
    cars, i.e. autopilot), etc.

See my draft for details.

The package that entered my packaging radar is nltk_data.
https://github.com/nltk/nltk_data
The two most widely used Python-based computational
linguistics toolkits, NLTK and spaCy, require data of
this kind (datasets + models) to enable most of their
functionality.

Best,
Mo.