Architecture

This page documents the public API of the three core modules that power reproscreener.

paper_analyzer module

The paper_analyzer module is responsible for analyzing individual research papers hosted on arXiv. It can:

  • Parse canonical arXiv identifiers from arbitrary arXiv URLs.
  • Download either the TeX source bundle (e-print) or the PDF of the paper.
  • Optionally convert PDFs to Markdown via docling for easier text processing.
  • Extract reproducibility variables (problem statements, dataset mentions, hypotheses, etc.).
  • Detect external links such as source-code or data repositories contained in the manuscript.
  • Return the results as a convenient pandas.DataFrame.

analyze_arxiv_paper(arxiv_url, download_dir, url_type='tex')

Main function to download, extract, and analyze an arXiv paper.

Parameters:

  • arxiv_url (str): The arXiv URL of the paper.
  • download_dir (Path): Directory in which to download and store the paper.
  • url_type (str): Type of arXiv URL, either "tex" or "pdf".

Returns:

pd.DataFrame: Analysis results as a DataFrame.

analyze_content(folder_path, paper_id, title)

Evaluate a paper by extracting variables and URLs from its files. Returns a DataFrame with evaluation results.

combine_files_in_folder(folder_path, file_extensions=['.tex', '.md', '.txt'])

Combine all files with specified extensions in a given directory into a single file.

download_extract_source(arxiv_url, path_download)

Downloads and extracts the source code of an arXiv paper from its URL. Also retrieves the paper title from the arXiv API. Returns the paper title and the download directory path.

download_pdf_and_convert(arxiv_url, path_download)

Downloads a PDF from arXiv and converts it to markdown using docling. Returns the paper title and the path to the markdown file.

extract_urls(combined_path)

Extract URLs from the combined file, keeping only those that belong to allowed domains.
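A sketch of how URL extraction with a domain allow-list can work. The domain set, regex, and function name below are illustrative assumptions, not reproscreener's actual implementation:

```python
import re
from urllib.parse import urlparse

# Hypothetical allow-list; the actual domains used by reproscreener may differ.
ALLOWED_DOMAINS = {"github.com", "gitlab.com", "bitbucket.org", "zenodo.org"}

URL_RE = re.compile(r"""https?://[^\s<>"')\],]+""")

def extract_allowed_urls(text: str) -> list[str]:
    """Return URLs in `text` whose host is in the allow-list (illustrative sketch)."""
    return [
        u for u in URL_RE.findall(text)
        if urlparse(u).netloc.lower().removeprefix("www.") in ALLOWED_DOMAINS
    ]
```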

find_variables(combined_path)

Return a list of (variable_category, matched_phrase) pairs found in the paper.
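The shape of the result can be sketched as follows. The categories and regexes here are placeholders, not the actual Gunderson metrics from the keywords module:

```python
import re

# Illustrative patterns only; the real categories/regexes live in the keywords module.
PATTERNS = {
    "problem": re.compile(r"\bproblem (?:statement|formulation)\b", re.I),
    "dataset": re.compile(r"\bdata ?set\b", re.I),
    "hypothesis": re.compile(r"\bhypothes[ie]s\b", re.I),
}

def find_variables_sketch(text: str) -> list[tuple[str, str]]:
    """Return (variable_category, matched_phrase) pairs found in `text`."""
    pairs = []
    for category, pattern in PATTERNS.items():
        for m in pattern.finditer(text):
            pairs.append((category, m.group(0)))
    return pairs
```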

parse_arxiv_id(arxiv_url)

Extract the canonical arXiv identifier (without version or extension) from arxiv_url, which may be any of:

  • https://arxiv.org/abs/1909.00066v1
  • https://arxiv.org/pdf/1909.00066.pdf
  • https://arxiv.org/pdf/1909.00066v2.pdf
  • https://arxiv.org/src/1909.00066
  • https://arxiv.org/e-print/1909.00066

Returns the bare identifier, e.g. 1909.00066.
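The URL forms listed above can be handled with a single regex. This is a minimal sketch covering new-style identifiers only, not the function's actual implementation:

```python
import re

# Matches the path segment after abs/, pdf/, src/, or e-print/ and captures
# the bare new-style identifier (version suffix and .pdf extension are dropped).
ARXIV_ID_RE = re.compile(r"(?:abs|pdf|src|e-print)/(\d{4}\.\d{4,5})")

def parse_arxiv_id_sketch(url: str) -> str:
    m = ARXIV_ID_RE.search(url)
    if not m:
        raise ValueError(f"not a recognized arXiv URL: {url}")
    return m.group(1)
```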

repo_analyzer module

The repo_analyzer module evaluates the structure of a Git repository that claims to implement the research. Its main tasks are:

  • Cloning public repositories (GitHub, GitLab, Bitbucket, …).
  • Searching for dependency specification files (e.g. requirements.txt, environment.yml, pyproject.toml, Dockerfile, …).
  • Detecting wrapper scripts or entry-point files (run.py, main.sh, Makefile, …).
  • Parsing the project's README for sections that describe installation or requirements.
  • Aggregating the findings into a tabular report.

analyze_github_repo(repo_url, clone_dir)

Main function to clone and analyze a GitHub repository.

analyze_repository_structure(repo_path)

Evaluate a repository by checking the existence of certain files and sections in README. Returns a DataFrame with the evaluation results.

check_files(dir_path, files_to_check, current_ext_mapping)

Check if the given files exist in the directory based on ext_mapping.

clone_repo(repo_url, cloned_path, overwrite=False)

Clone a repository from the given URL to the given path. If the repository already exists, it won't be overwritten unless specified.

keywords module

The keywords module generates the lists of keywords and regular-expression patterns that the analyzer modules use to identify important concepts in paper text. Currently it implements the metrics from [1].

generate_gunderson_dict()

Generate a dictionary of Gunderson variables with regex patterns.

Returns:

dict: A dictionary of keywords and regex patterns.
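A hypothetical illustration of the dictionary shape such a generator might return: category names mapped to compiled regex patterns. The category names and patterns below are invented for the example; the real ones are defined inside the keywords module:

```python
import re

def generate_keyword_dict_sketch() -> dict[str, re.Pattern]:
    """Build a {category: compiled regex} dictionary (illustrative sketch)."""
    raw = {
        "research_questions": r"\bresearch question",
        "method_source_code": r"\b(?:source|official) code\b",
        "experiment_setup": r"\bexperiment(?:al)? setup\b",
    }
    return {name: re.compile(pattern, re.IGNORECASE) for name, pattern in raw.items()}
```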


  1. Bhaskar, A. and Stodden, V. 2024. Reproscreener: Leveraging LLMs for Assessing Computational Reproducibility of Machine Learning Pipelines. Proceedings of the 2nd ACM Conference on Reproducibility and Replicability (New York, NY, USA, Jul. 2024), 101–109.