API Reference¶

impscan¶

Command line tool to identify a minimal imports list and repository sources by parsing package dependency trees.

—

Scan imports in a directory, determine which are non-standard library, and then (tentatively) determine the package dependency tree and prune the requirements accordingly, as well as determining which can be obtained from Conda (and on which channels) and which from PyPI.

Unlike some other refactoring tools, impscan does not need to operate on a package: it can also scan standalone scripts.

Currently, requirements (AKA “root packages”), imported module name (“site packages” name) and other features are computed from one build of every package on conda’s anaconda and conda-forge channels (over 20,000 packages).

Workflow¶

Identify imports
Identify total dependency tree
Prune dependency tree
Identify sources (obeying source preferences if specified)
Save artifacts: CONDA_SETUP.md and requirements.txt
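
The pruning step in the workflow above can be illustrated with a minimal sketch. It assumes a hypothetical `deps` mapping from each package to its direct dependencies and an acyclic dependency graph; the function name and the example data are illustrative and are not taken from impscan itself.

```python
from __future__ import annotations


def prune_requirements(imports: set[str], deps: dict[str, set[str]]) -> set[str]:
    """Drop every requirement that another requirement already pulls in."""

    def reachable(pkg: str) -> set[str]:
        # Walk the (assumed acyclic) dependency graph from pkg iteratively
        seen: set[str] = set()
        stack = list(deps.get(pkg, ()))
        while stack:
            dep = stack.pop()
            if dep not in seen:
                seen.add(dep)
                stack.extend(deps.get(dep, ()))
        return seen

    covered: set[str] = set()
    for pkg in imports:
        covered |= reachable(pkg) - {pkg}
    return imports - covered


# Hypothetical data: pandas already depends on numpy, so numpy is pruned
example_deps = {"pandas": {"numpy", "python-dateutil"}, "python-dateutil": {"six"}}
minimal = prune_requirements({"pandas", "numpy", "requests"}, example_deps)
```

Here `minimal` keeps only the packages that must be listed explicitly; anything installed transitively is left out.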

Conda metadata¶

This class represents a file being streamed as a sequence of non-overlapping ranges.

async impscan.conda_meta.async_utils.async_fetch_urlset(urls, archives: list, pbar=None)[source]¶

async impscan.conda_meta.async_utils.fetch(session: httpx.AsyncClient, url: str, can_raise: bool = False) → httpx.Response[source]¶

async impscan.conda_meta.async_utils.process_archive(resp: httpx.Response, lst: list, pbar=None)[source]¶

class impscan.conda_meta.formats.CondaArchive(source_url: str, defer_pull: bool = False)[source]¶

Bases: object

about_info = 'info/about.json'¶

about_json = None¶

property archive¶

check_bz2_info_dir() → None[source]¶: Validate the members for assignment to instance attributes. Note: ‘members’ means the filenames within the compressed .tar.bz2 archive.

determine_site_package_name() → str | None[source]¶: Identify the package(s) which can be imported after the conda package is installed, by inspecting the /site-packages/ paths it creates. Multiple names are comma-separated in alphabetical order. Returns None if no such names are found.
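
As a rough illustration of the behaviour described for determine_site_package_name, the sketch below derives importable names from a list of installed file paths. The function name, the input format (plain path strings) and the skipped suffixes are assumptions made for the example, not the library's actual implementation.

```python
from __future__ import annotations

# Metadata entries that are not importable (an assumption for this sketch)
SKIPPED_SUFFIXES = (".dist-info", ".egg-info", ".pth")


def site_package_names(paths: list[str]) -> str | None:
    """Return comma-separated, alphabetically sorted importable names, or None."""
    names = set()
    for path in paths:
        parts = path.split("/")
        if "site-packages" not in parts[:-1]:
            continue
        # The first component after site-packages/ is the importable name
        name = parts[parts.index("site-packages") + 1]
        if name.endswith(SKIPPED_SUFFIXES) or name == "__pycache__":
            continue
        if name.endswith(".py"):
            name = name[: -len(".py")]
        names.add(name)
    return ",".join(sorted(names)) or None
```

A package that installs nothing under site-packages/ yields None, matching the documented return value.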

property filename: str¶

index_info = 'info/index.json'¶

index_json = None¶

property info_fields: list¶

info_is_read = False¶

property is_bz2¶

property is_zstd¶

property members¶

parse_to_db_entry() → dict[source]¶

path_info = 'info/paths.json'¶

path_json = None¶

pull() → None[source]¶

read_info()[source]¶: Load the JSON files from the info archive (otherwise all attempts to access the JSON-parsed dict attributes’ keys will fail) and set the info_is_read flag to show this.
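
To make the idea behind read_info concrete for the .tar.bz2 case, here is a minimal standard-library sketch that parses whichever of the three info members are present in an archive. The helper names and the in-memory demo archive are hypothetical; only the member paths (info/about.json, info/index.json, info/paths.json) come from the class attributes listed here.

```python
import io
import json
import tarfile

INFO_MEMBERS = ("info/about.json", "info/index.json", "info/paths.json")


def read_info_json(archive_bytes: bytes) -> dict:
    """Return {member name: parsed JSON} for the info files that are present."""
    parsed = {}
    with tarfile.open(fileobj=io.BytesIO(archive_bytes), mode="r:bz2") as tar:
        present = set(tar.getnames())
        for member in INFO_MEMBERS:
            if member in present:
                parsed[member] = json.load(tar.extractfile(member))
    return parsed


def make_demo_archive() -> bytes:
    """Build a tiny .tar.bz2 in memory holding only info/index.json (demo data)."""
    payload = json.dumps({"name": "demo", "depends": ["python"]}).encode()
    buffer = io.BytesIO()
    with tarfile.open(fileobj=buffer, mode="w:bz2") as tar:
        info = tarfile.TarInfo("info/index.json")
        info.size = len(payload)
        tar.addfile(info, io.BytesIO(payload))
    return buffer.getvalue()


info = read_info_json(make_demo_archive())
```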

summarise_root_pkgs() → str[source]¶: Rather than store full spec for each root package, just store the names (as a space-separated string). Note: should not be used to ‘follow’ dependency chains without checking versions.
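
A minimal sketch of the name-only summary, assuming the root packages arrive as conda-style spec strings such as "numpy >=1.19,<2" (the function name and input format are assumptions for illustration):

```python
def summarise_names(depends: list) -> str:
    """Keep only the package name (first token) of each dependency spec."""
    return " ".join(spec.split()[0] for spec in depends if spec.strip())


summary = summarise_names(["python >=3.8", "numpy >=1.19,<2", "six"])
```

Because the version constraints are discarded, the result is only a coarse summary, which is why the note above warns against following dependency chains with it.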

zst_meta_and_tarballs() → tuple[source]¶: Validate the members for assignment to instance attributes. Note: ‘members’ means the filenames within the compressed .conda archive. (Validate package tarball but don’t return as not used.)

class impscan.conda_meta.streaming_formats.CondaArchiveStream(source_url: str, defer_pull: bool = True)[source]¶

Bases: object

about_info = 'info/about.json'¶

about_json = None¶

property archive¶

check_bz2_info_dir() → None[source]¶: Validate the members for assignment to instance attributes. Note: ‘members’ means the filenames within the compressed .tar.bz2 archive.

determine_site_package_name() → str | None[source]¶: Identify the package(s) which can be imported after the conda package is installed, by inspecting the /site-packages/ paths it creates. Multiple names are comma-separated in alphabetical order. Returns None if no such names are found.

property filename: str¶

index_info = 'info/index.json'¶

index_json = None¶

inflate_archive(db: impscan.db.db_utils.CondaPackageDB)[source]¶

Pull and parse the archive to a database entry, and insert it.

Parameters: db – The database to insert the entry into.

property info_fields: list¶

info_is_read = False¶

property is_bz2¶

property is_zstd¶

property members¶

parse_to_db_entry() → dict[source]¶

path_info = 'info/paths.json'¶

path_json = None¶

pull() → None[source]¶

read_info()[source]¶: Load the JSON files from the info archive (otherwise all attempts to access the JSON-parsed dict attributes’ keys will fail) and set the info_is_read flag to show this.

read_zst(filename: str, paths: list) → list[source]¶

Extract the bytes from a CondaStream’s internal tar.zst archive. Requires downloading the entire tarball range (but not the entire CondaStream).

Parameters

filename – Name of the tar.zst file within the CondaStream
paths – Paths within the tar.zst archive to return bytes from

summarise_root_pkgs() → str[source]¶: Rather than store full spec for each root package, just store the names (as a space-separated string). Note: should not be used to ‘follow’ dependency chains without checking versions.

zst_meta_and_tarballs() → tuple[source]¶: Validate the members for assignment to instance attributes. Note: ‘members’ means the filenames within the compressed .conda archive. (Validate package tarball but don’t return as not used.)

impscan.conda_meta.so_utils.verify_exported_module_name(conda_archive, so_path: str) → set[str] | None[source]¶

class impscan.conda_meta.url_utils.ArchiveType(value)[source]¶

Bases: enum.Enum

An enumeration of conda archive formats, keyed by file extension.

Bz2 = '.tar.bz2'¶

Zstd = '.conda'¶

impscan.conda_meta.url_utils.detect_archive_type_from_url(url: str) → impscan.conda_meta.url_utils.ArchiveType[source]¶
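
One plausible way to detect the archive type is to match the URL's suffix against the enum values shown above. The sketch below redefines the enum locally so that it is self-contained; the matching logic, the error handling and the example URLs are assumptions, not the library's code.

```python
from enum import Enum


class ArchiveType(Enum):
    Bz2 = ".tar.bz2"
    Zstd = ".conda"


def detect_archive_type(url: str) -> ArchiveType:
    """Match the URL's file extension against the known archive formats."""
    for archive_type in ArchiveType:
        if url.endswith(archive_type.value):
            return archive_type
    raise ValueError(f"Unrecognised archive extension: {url}")


# Illustrative URLs only
bz2_type = detect_archive_type("https://example.org/channel/noarch/demo-1.0-0.tar.bz2")
zstd_type = detect_archive_type("https://example.org/channel/noarch/demo-1.0-0.conda")
```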

impscan.conda_meta.url_utils.detect_channel_from_url(url: str) → str[source]¶

impscan.conda_meta.url_utils.read_raw_stream(url: str)[source]¶

impscan.conda_meta.zip_utils.open_zipfile_from_url(url: str) → zipfile.ZipFile[source]¶

impscan.conda_meta.zip_utils.read_zipped_zst(zf: zipfile.ZipFile, zst_tar_fn: str, zst_paths: list) → list[source]¶: Given the ZipFile zf, tarball filename zst_tar_fn, and path(s) within the zst tarball zst_paths, return a list of one or more bytestrings from decompressing those paths.

class impscan.conda_meta.zstd_utils.ZstdTarFile(name, mode='r', *, level_or_option=None, zstd_dict=None, **kwargs)[source]¶

Bases: tarfile.TarFile

close()[source]¶: Close the TarFile. In write-mode, two finishing zero blocks are appended to the archive.

impscan.conda_meta.zstd_utils.extract_zst(zst: bytes, file_paths: list) → list[source]¶

Database handling¶

Set up a database to store the package archive listings in.

class impscan.db.CondaArchiveListings(start_from_pkg: str | None = None)[source]¶

Bases: object

Synchronous listings, using CondaStream to efficiently look at conda archives.

fetch_archives(verbose: bool = False, n_retries: int = 3)[source]¶

inflate_all_archives(show_progress: bool = False)[source]¶

make_archive(source_url: str, defer_pull: bool = True) → impscan.conda_meta.formats.CondaArchive[source]¶: Create CondaArchive object; includes channel and format detection

make_archives(defer_pull: bool = True)[source]¶: Make and return a list of CondaArchive objects and pull their URLs collectively in an efficient async procedure (not serially).

property urlset: Generator[str, None, None]¶: Generator of URLs for async fetching

class impscan.db.PackageDB(dir=PosixPath('/home/docs/checkouts/readthedocs.org/user_builds/impscan/envs/latest/lib/python3.9/site-packages/impscan/assets'), filename='package_catalogue.db', create=True, no_touch=False)[source]¶

Bases: object

connect()[source]¶

directory = PosixPath('/home/docs/checkouts/readthedocs.org/user_builds/impscan/envs/latest/lib/python3.9/site-packages/impscan/assets')¶

exists()[source]¶

filename = 'package_catalogue.db'¶

has_package(package_name)[source]¶

property path¶

class impscan.db.db_utils.CondaPackageDB(dir=PosixPath('/home/docs/checkouts/readthedocs.org/user_builds/impscan/envs/latest/lib/python3.9/site-packages/impscan/assets'), filename='package_catalogue.db', create=True, no_touch=False)[source]¶

Bases: impscan.db.db_utils.PackageDB

create(no_touch=False)[source]¶

insert_entry(pkgname: str, impname: str, channel: str, depends: str, fn: str, url: str, version: str, rootpkgs: str)[source]¶

retrieve_filename(fn, fetch_all=False)[source]¶

retrieve_package(package_name, fetch_all=True)[source]¶

class impscan.db.db_utils.PackageDB(dir=PosixPath('/home/docs/checkouts/readthedocs.org/user_builds/impscan/envs/latest/lib/python3.9/site-packages/impscan/assets'), filename='package_catalogue.db', create=True, no_touch=False)[source]¶

Bases: object

connect()[source]¶

directory = PosixPath('/home/docs/checkouts/readthedocs.org/user_builds/impscan/envs/latest/lib/python3.9/site-packages/impscan/assets')¶

exists()[source]¶

filename = 'package_catalogue.db'¶

has_package(package_name)[source]¶

property path¶

impscan.db.version_utils.sort_package_json_by_version(j: list) → list[source]¶

Lookup¶

Look up requirements against the Conda and PyPI package listings.

impscan.lookup.lookup_requirements(requirements)[source]¶

impscan.lookup.conda_util.conda_search_reqs(requirements) → set[source]¶

impscan.lookup.listings_xref.lookup_requirements(requirements)[source]¶

impscan.lookup.pypi_util.pypi_search_reqs(requirements) → set[source]¶

class impscan.lookup.req_spec.CondaReqSpec(package: str, channel: list, constraints: list)[source]¶: Bases: impscan.lookup.req_spec.ReqSpec

class impscan.lookup.req_spec.PyPIReqSpec(package: str, channel: list, constraints: list)[source]¶: Bases: impscan.lookup.req_spec.ReqSpec

class impscan.lookup.req_spec.ReqSpec(package: str, repository: impscan.lookup.req_spec.Repository, channel: list, constraints: list)[source]¶: Bases: object

Scanning¶

The scanner subpackage handles import module name identification.

class impscan.scanner.ast_parsing.ParsedPy(py_file_path: pathlib.Path, env_config: impscan.config.EnvConfig)[source]¶

Bases: object

property allowed_imports¶

ast_parse()[source]¶

property banned_imports¶

impscan.scanner.ast_utils.retrieve_imported_modules(py_file_path: pathlib.Path) → set[source]¶: Return a set of imported names (excluding stdlib modules) by parsing the AST for import statements (ignoring relative imports).
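
The AST-based approach described above can be sketched with the standard library alone. This version takes source text rather than a file path, and accepts the set of stdlib names as a parameter (falling back to sys.stdlib_module_names where the interpreter provides it); both choices are assumptions made to keep the example self-contained.

```python
from __future__ import annotations

import ast
import sys


def imported_names(source: str, stdlib: frozenset[str] | None = None) -> set[str]:
    """Return top-level imported module names, minus stdlib and relative imports."""
    if stdlib is None:
        # Available on Python 3.10+; otherwise nothing is filtered out
        stdlib = getattr(sys, "stdlib_module_names", frozenset())
    names = set()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            names.update(alias.name.split(".")[0] for alias in node.names)
        elif isinstance(node, ast.ImportFrom) and node.level == 0 and node.module:
            # level > 0 marks a relative import, which is ignored
            names.add(node.module.split(".")[0])
    return names - set(stdlib)


example_source = "import os\nimport numpy as np\nfrom . import sib\nfrom pandas.core import frame\n"
found = imported_names(example_source, stdlib=frozenset({"os"}))
```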

impscan.scanner.build_utils.check_for_build_reqs(toml_file) → set[source]¶

impscan.scanner.import_utils.get_imported_name_sources(trunk: list) → set[source]¶

impscan.scanner.import_utils.get_sibling_module_names(target_module_path: pathlib.Path) → set[source]¶: Given a source module at target_module_path, determine the names of any modules it may import in the local directory: either those files ending in .py or directories (which do not need to contain an __init__.py due to implicit namespaces).
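
A minimal pathlib sketch of sibling detection follows. Skipping the target module itself and skipping names that are not valid identifiers are choices made for this example, not documented behaviour.

```python
from pathlib import Path


def sibling_module_names(target_module_path: Path) -> set:
    """Names of .py files and directories next to the target module."""
    names = set()
    for entry in target_module_path.parent.iterdir():
        if entry == target_module_path:
            continue
        # Directories count even without __init__.py (implicit namespaces)
        name = entry.name if entry.is_dir() else entry.stem
        if (entry.is_dir() or entry.suffix == ".py") and name.isidentifier():
            names.add(name)
    return names
```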

impscan.scanner.module_utils.stdlib_dynload_module_names(stdlib_path: pathlib.Path) → set[source]¶

Given the path to the standard library, extend it to the lib-dynload/ directory and collect the module names of all dynamic libraries within it.

Return a set of all the modules loaded dynamically in the standard library.

impscan.scanner.module_utils.stdlib_module_names() → set[source]¶

Get the path to the standard library by using the sys.modules mapping, specifically the filepath stored for a non-builtin library (pathlib), and use this path to detect all standard library module names rather than hard-code them.

Return a set of all the modules in the standard library.
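
The detection strategy described above might look roughly like the sketch below. It locates the standard library directory from pathlib's file path, allowing for pathlib being either a single file or a package, and adds the interpreter's built-in module names; including the built-ins is an assumption of this sketch, and the function name is illustrative.

```python
import pathlib
import sys
from pathlib import Path


def stdlib_names() -> set:
    """Collect standard library module names from the filesystem."""
    origin = Path(pathlib.__file__)
    # pathlib may be a single module file or a package with __init__.py
    stdlib_dir = origin.parent.parent if origin.name == "__init__.py" else origin.parent
    names = set(sys.builtin_module_names)
    for entry in stdlib_dir.iterdir():
        if entry.suffix == ".py":
            names.add(entry.stem)
        elif entry.is_dir() and (entry / "__init__.py").exists():
            names.add(entry.name)
    return names
```

Directories without an `__init__.py`, such as site-packages/, are skipped, so third-party packages are not mistaken for standard library modules.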

class impscan.scanner.requirement.EnvReqs(env_config: impscan.config.EnvConfig)[source]¶

Bases: object

register(python_file: pathlib.Path)[source]¶

impscan.scanner.sanitiser.is_ignored_path(path: pathlib.Path)[source]¶: Check each part of a path for matches against the list of filters given in ignore_part_names.
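
The part-by-part check can be sketched in a couple of lines. The contents of the ignore list below are hypothetical; the documentation only names the list ignore_part_names.

```python
from pathlib import Path

# Hypothetical filter list for illustration
IGNORE_PART_NAMES = frozenset({".git", "__pycache__", ".ipynb_checkpoints"})


def is_ignored(path: Path, ignore_part_names=IGNORE_PART_NAMES) -> bool:
    """True if any component of the path matches an ignored name."""
    return any(part in ignore_part_names for part in path.parts)
```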

impscan.scanner.scan.scan_imports(source_path: pathlib.Path, env_config) → impscan.scanner.requirement.EnvReqs[source]¶: Execute the scan of import statements below source_path (either a Python file or a directory to be walked recursively to find them), identifying the dependency graphs within the repositories given in env_config and returning the list(s) of requirements for each.

Miscellaneous shared utils¶

These are the commonly used parts of the library.

impscan.share.http_utils.GET(url, raise_for_status=True)[source]¶

impscan.share.multiproc_utils.batch_multiprocess(funcs: list[Callable], n_cores: int = 2, show_progress: bool = True, tqdm_desc: str | None = None) → None[source]¶: Run a list of functions on n_cores (default: 2), with the option to show a progress bar using tqdm (default: shown).

impscan.share.multiproc_utils.batch_multiprocess_with_return(funcs: list[Callable], pool_results: list | None = None, n_cores: int = 2, show_progress: bool = True, tqdm_desc: str | None = None) → list[source]¶: Run a list of functions on n_cores (default: 2) and return their results as a list, with the option to show a progress bar using tqdm (default: shown).

Streaming¶

Stream handling.

impscan.streams.conda_unbox.get_info_stream(stream_url: str) → dict[source]¶: Given the URL of a conda packaged binary (either .tar.bz2 or .conda) obtain its info from decompressing the stream.

CLI¶

The command-line tool impscan is made available as an entrypoint to impscan.__main__.main(), in turn a thin interface to impscan.cli.

—

impscan.cli.main()[source]¶

Config¶

Configuration handling.

—

class impscan.config.EnvConfig(**kwargs)[source]¶

Bases: object

__init__(**kwargs)[source]¶

set_config(setting, value)[source]¶