TLDExtract Registered Domain is Empty using Custom S3 Domain

I’m trying to integrate Label Studio with a custom S3-compatible storage endpoint hosted at http://s3.company.internal, but Label Studio isn’t parsing the domain name correctly (see label_studio/io_storages/s3/utils.py on the develop branch of HumanSignal/label-studio). At that line, registered_domain comes back as an empty string, which makes the next line fail and raises the “unrecognized S3 domain” exception.

Here’s the code:

self.s3_endpoint = 'http://s3.company.internal'

parsed = urlparse(self.s3_endpoint)
parsed
ParseResult(scheme='http', netloc='s3.company.internal', path='', params='', query='', fragment='')

extracted_lib = extractor.extract_urllib(parsed)
extracted_lib
ExtractResult(subdomain='s3.company', domain='internal', suffix='', is_private=False)
extracted_lib.registered_domain
''

I’ve verified that the URL is correct, and other tools can access it without issues.
I’ve also run urlparse and the extractor by hand, as shown above: tldextract puts internal in domain and s3.company in subdomain, but suffix is empty, so registered_domain is empty too.

Has anyone faced similar issues with custom S3 domains or endpoints? Any advice on how to make Label Studio correctly handle custom domains like s3.company.internal?

I am using Label Studio 1.16.0. I’d prefer to keep the custom domain as it follows a similar format to all of my other endpoints.

Hi Anna,

Thanks for writing in with this error, and great work digging into the issue! One thing to note: if the tldextract code is being reached at all, an exception has already been raised while Label Studio was trying to interact with the storage endpoint. If you’re able to see the logs of the running Label Studio instance, you can read the details of the full logged exception on stderr.

By reading these logs, it should be possible to figure out what’s causing the exception and remedy the issue.

As for a general in-UI solution for this scenario, I think the ideal approach would be to expose an environment variable setting in base.py for tldextract’s extra suffixes feature (see the john-kurkowski/tldextract README on GitHub). Similarly, if you’re comfortable modifying Label Studio’s code and building it, you could add internal to an extra_suffixes kwarg on this line: label_studio/io_storages/s3/utils.py at commit 827e3a6edc66b4c3425be1fde5291674808ca732 in HumanSignal/label-studio.

In the meantime, we’ll track implementing extra suffixes support as a feature request on our side.

Cheers,
Jo