Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Possible race condition between threading.local() and GIL acquisition on Linux #100892

Closed
tkoeppe opened this issue Jan 9, 2023 · 8 comments
Closed
Assignees
Labels
3.9 only security fixes 3.10 only security fixes 3.11 only security fixes 3.12 bugs and security fixes interpreter-core (Objects, Python, Grammar, and Parser dirs) type-bug An unexpected behavior, bug, or error type-crash A hard crash of the interpreter, possibly with a core dump

Comments

@tkoeppe
Copy link

tkoeppe commented Jan 9, 2023

Bug report

TSAN reports a race condition from a Python program that both uses threading.local() and also a native extension that attempts to acquire the GIL in a separate, native thread.

To reproduce, build both the Python interpreter and the native module with TSAN. An example native module is like this, but note that all that matters is that it spawns a native thread that attempts to acquire the GIL:

thread_haver.cc:

#define PY_SSIZE_T_CLEAN
#include <Python.h>
#include <pthread.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

extern "C" {

static pthread_t t;

static void* DoWork(void* arg) {
  printf("Thread trying to acquire GIL\n");
  fflush(stdout);
  PyGILState_STATE py_threadstate = PyGILState_Ensure();   // Race here!!
  printf("Thread called with arg %p\n", arg);
  PyGILState_Release(py_threadstate);
  printf("Thread has released GIL\n");
  return arg;
}

static PyObject* SomeNumber(PyObject* module, PyObject* object) {
  if (pthread_create(&t, nullptr, DoWork, nullptr) != 0) {
    fprintf(stderr, "pthread_create failed\n");
    abort();
  }
  return PyLong_FromLong(reinterpret_cast<uintptr_t>(object));
}

static PyObject* AnotherNumber(PyObject* module, PyObject* object) {
  if (pthread_join(t, nullptr) != 0) {
    fprintf(stderr, "pthread_join failed\n");
    abort();
  }
  return PyLong_FromLong(reinterpret_cast<uintptr_t>(object));
}

}  // extern "C"

static PyMethodDef thbmod_methods[] = {
    {"some_number", SomeNumber, METH_O, "Makes a number."},
    {"another_number", AnotherNumber, METH_O, "Makes a number."},
    {nullptr, nullptr, 0, nullptr}        /* Sentinel */
};

static struct PyModuleDef thbmod = {
    PyModuleDef_HEAD_INIT,
    "thread_haver",   /* name of module */
    nullptr,          /* module documentation, may be NULL */
    1024,             /* size of per-interpreter state of the module,
                         or -1 if the module keeps state in global variables. */
    thbmod_methods,
};

PyMODINIT_FUNC PyInit_thread_haver() {
  return PyModule_Create(&thbmod);
}

Compile this into a shared object and using TSAN with, say, ${CC} -fPIC -shared -O2 -fsanitize=thread -o thread_haver.so thread_haver.cc.

Now the data race happens if we create and destroy a threading.local() object in Python:

demo.py:

import threading
import thread_haver

print("Number: {}".format(thread_haver.some_number(thread_haver)))  # starts a thread
for _ in range(10000):
  _ = threading.local()   # race here (?)
print("Number: {}".format(thread_haver.another_number(threading)))  # joins the thread

Concretely, here is the TSAN output, for Python 3.9:

Thread trying to acquire GIL
Number: 135325830047024
==================
WARNING: ThreadSanitizer: data race (pid=[...])
  Read of size 8 at 0x7b4400019f98 by main thread:
    #0 local_clear [...]/Modules/_threadmodule.c:819:25 (python+0xcbc14e)
    #1 local_dealloc [...]/Modules/_threadmodule.c:838:5 (python+0xcbbd1d)
    #2 _Py_DECREF [...]/Include/object.h:447:9 (python+0x104efaa)
    #3 _Py_XDECREF [...]/Include/object.h:514:9 (python+0x104efaa)
    #4 insertdict [...]/Objects/dictobject.c:1123:5 (python+0x104efaa)

  Previous write of size 8 at 0x7b4400019f98 by thread T1:
    #0 malloc [...]/tsan/rtl/tsan_interceptors_posix.cpp:683:5 (python+0xbd28f1)
    #1 _PyMem_RawMalloc [...]/Objects/obmalloc.c:116:11 (python+0x1083956)

  Location is heap block of size 264 at 0x7b4400019f00 allocated by thread T1:
    #0 malloc [...]/tsan/rtl/tsan_interceptors_posix.cpp:683:5 (python+0xbd28f1)
    #1 _PyMem_RawMalloc [...]/Objects/obmalloc.c:116:11 (python+0x1083956)

  Thread T1 (tid=3039745, running) created by main thread at:
    #0 pthread_create [...]/tsan/rtl/tsan_interceptors_posix.cpp:1038:3 (python+0xbd4679)
    #1 SomeNumber(_object*, _object*) thread_haver.cc:23:7 (thread_haver.so+0xb58)
    #2 cfunction_vectorcall_O [...]/Objects/methodobject.c:516:24 (python+0x107cb3d)


SUMMARY: ThreadSanitizer: data race [...]/Modules/_threadmodule.c:819:25 in local_clear
==================
Thread called with arg (nil)
Thread has released GIL
Number: 135325829912464
ThreadSanitizer: reported 1 warnings

Unfortunately, the backtrace does not go into the details of PyGILState_Ensure(), but the race seems to be on tstate->dict in

if (tstate->dict && PyDict_GetItem(tstate->dict, self->key)) {
from the threading.local() deallocation function, and on the tstate struct being allocated by malloc (by the GIL acquisition?).

Your environment

  • CPython versions tested on: 3.9 and 3.10
  • Operating system and architecture: Linux (Debian-derived)

I suspect that there may be some TLS access that both the threading.local() deallocation function and the GIL acquisition perform and that may not be sufficiently synchronised. It could also be a bug in TSAN that it does not track TLS access correctly.

Linked PRs

@tkoeppe tkoeppe added the type-bug An unexpected behavior, bug, or error label Jan 9, 2023
@kumaraditya303
Copy link
Contributor

Looks like we are missing taking the runtime lock while iterating over the tstates. PyThreadState_Next is not thread-safe.

cc @pitrou @pablogsal

@kumaraditya303 kumaraditya303 added interpreter-core (Objects, Python, Grammar, and Parser dirs) type-crash A hard crash of the interpreter, possibly with a core dump labels Jan 10, 2023
@kumaraditya303
Copy link
Contributor

kumaraditya303 commented Jan 10, 2023

@tkoeppe Can you test this branch?

@tkoeppe
Copy link
Author

tkoeppe commented Jan 10, 2023

That seems to work, thank you! Tested on Py 3.10. I was thinking something similar myself, but couldn't figure out the right thing to lock!

@tkoeppe
Copy link
Author

tkoeppe commented Jan 10, 2023

@kumaraditya303: You seem to be making a semantic change: previously, if tstate->dict was null, you would skip to the next round, whereas now you break the loop entirely. Is that intended?

@kumaraditya303
Copy link
Contributor

You seem to be making a semantic change: previously, if tstate->dict was null, you would skip to the next round, whereas now you break the loop entirely. Is that intended?

No, it was an oversight, will fix it in the PR.

@kumaraditya303 kumaraditya303 self-assigned this Jan 10, 2023
@kumaraditya303 kumaraditya303 added 3.11 only security fixes 3.10 only security fixes 3.12 bugs and security fixes labels Jan 10, 2023
@kumaraditya303
Copy link
Contributor

See #100922

@tkoeppe
Copy link
Author

tkoeppe commented Jan 10, 2023

Wonderful, thank you!

kumaraditya303 added a commit to kumaraditya303/cpython that referenced this issue Jan 11, 2023
…nGH-100922).

(cherry picked from commit 762745a)

Co-authored-by: Kumar Aditya <59607654+kumaraditya303@users.noreply.github.com>
kumaraditya303 added a commit to kumaraditya303/cpython that referenced this issue Jan 11, 2023
…nGH-100922).

(cherry picked from commit 762745a)

Co-authored-by: Kumar Aditya <59607654+kumaraditya303@users.noreply.github.com>
kumaraditya303 added a commit to kumaraditya303/cpython that referenced this issue Jan 11, 2023
…ythonGH-100922).

(cherry picked from commit 762745a)

Co-authored-by: Kumar Aditya <59607654+kumaraditya303@users.noreply.github.com>.
(cherry picked from commit 683e9fe)

Co-authored-by: Kumar Aditya <59607654+kumaraditya303@users.noreply.github.com>
@kumaraditya303 kumaraditya303 added the 3.9 only security fixes label Jan 11, 2023
kumaraditya303 added a commit to kumaraditya303/cpython that referenced this issue Jan 11, 2023
…ythonGH-100922).

(cherry picked from commit 762745a)

Co-authored-by: Kumar Aditya <59607654+kumaraditya303@users.noreply.github.com>.
(cherry picked from commit 683e9fe)

Co-authored-by: Kumar Aditya <59607654+kumaraditya303@users.noreply.github.com>
kumaraditya303 added a commit that referenced this issue Jan 11, 2023
…#100937)

(cherry picked from commit 762745a)

Co-authored-by: Kumar Aditya <59607654+kumaraditya303@users.noreply.github.com>
kumaraditya303 added a commit that referenced this issue Jan 11, 2023
…#100938)

(cherry picked from commit 762745a)

Co-authored-by: Kumar Aditya <59607654+kumaraditya303@users.noreply.github.com>
ambv pushed a commit that referenced this issue Jan 20, 2023
…100939)

[3.9] [3.10] GH-100892: Fix race in clearing `threading.local` (GH-100922).
(cherry picked from commit 762745a)

Co-authored-by: Kumar Aditya <59607654+kumaraditya303@users.noreply.github.com>.
(cherry picked from commit 683e9fe)

Co-authored-by: Kumar Aditya <59607654+kumaraditya303@users.noreply.github.com>
@tkoeppe
Copy link
Author

tkoeppe commented Jan 21, 2023

Many thanks, this is great!

gentoo-bot pushed a commit to gentoo/cpython that referenced this issue May 21, 2024
…GH-100922) (python#100939)

[3.9] [3.10] pythonGH-100892: Fix race in clearing `threading.local` (pythonGH-100922).
(cherry picked from commit 762745a)

Co-authored-by: Kumar Aditya <59607654+kumaraditya303@users.noreply.github.com>.
(cherry picked from commit 683e9fe)

Co-authored-by: Kumar Aditya <59607654+kumaraditya303@users.noreply.github.com>

[adjusted for 3.8 by Michał Górny]
gentoo-bot pushed a commit to gentoo/cpython that referenced this issue Sep 19, 2024
…GH-100922) (python#100939)

[3.9] [3.10] pythonGH-100892: Fix race in clearing `threading.local` (pythonGH-100922).
(cherry picked from commit 762745a)

Co-authored-by: Kumar Aditya <59607654+kumaraditya303@users.noreply.github.com>.
(cherry picked from commit 683e9fe)

Co-authored-by: Kumar Aditya <59607654+kumaraditya303@users.noreply.github.com>

[adjusted for 3.8 by Michał Górny]
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
3.9 only security fixes 3.10 only security fixes 3.11 only security fixes 3.12 bugs and security fixes interpreter-core (Objects, Python, Grammar, and Parser dirs) type-bug An unexpected behavior, bug, or error type-crash A hard crash of the interpreter, possibly with a core dump
2 participants