# Architecture

This page documents the public API of the three core modules that power reproscreener.
## `paper_analyzer` module

The `paper_analyzer` module is responsible for analysing individual research papers hosted on arXiv. It can:
- Parse canonical arXiv identifiers from arbitrary arXiv URLs.
- Download either the TeX source bundle (e-print) or the PDF of the paper.
- Optionally convert PDFs to Markdown via `docling` for easier text processing.
- Extract reproducibility variables (problem statements, dataset mentions, hypotheses, etc.).
- Detect external links such as source-code or data repositories contained in the manuscript.
- Return the results as a convenient `pandas.DataFrame`.
### `analyze_arxiv_paper(arxiv_url, download_dir, url_type='tex')`

Main function to download, extract, and analyze an arXiv paper.

Parameters:

- `arxiv_url` (`str`): The arXiv URL of the paper.
- `download_dir` (`Path`): Directory to download and store the paper.
- `url_type` (`str`): Type of arXiv URL, either `"tex"` or `"pdf"`.

Returns:

- `pd.DataFrame`: Analysis results as a DataFrame.
### `analyze_content(folder_path, paper_id, title)`

Evaluate a paper by extracting variables and URLs from its files. Returns a DataFrame with evaluation results.
### `combine_files_in_folder(folder_path, file_extensions=['.tex', '.md', '.txt'])`

Combine all files with the specified extensions in a given directory into a single file.
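A minimal sketch of how such a helper might work (the function name `combine_files_sketch` and the output filename `combined.txt` are illustrative assumptions, not the library's actual behaviour):

```python
from pathlib import Path

def combine_files_sketch(folder_path, file_extensions=(".tex", ".md", ".txt")):
    # Concatenate every matching file in the folder into one combined file.
    # The output name "combined.txt" is an assumption for illustration.
    folder_path = Path(folder_path)
    combined = folder_path / "combined.txt"
    parts = [
        f.read_text(encoding="utf-8", errors="ignore")
        for f in sorted(folder_path.iterdir())
        if f.suffix in file_extensions and f.name != "combined.txt"
    ]
    combined.write_text("\n".join(parts), encoding="utf-8")
    return combined
```

Merging sources first means later steps (variable and URL extraction) only ever scan a single file.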
### `download_extract_source(arxiv_url, path_download)`

Download and extract the source bundle of an arXiv paper from its URL, retrieving the paper title from the arXiv API. Returns the paper title and the download directory path.
### `download_pdf_and_convert(arxiv_url, path_download)`

Download a PDF from arXiv and convert it to Markdown using `docling`. Returns the paper title and the path to the Markdown file.
### `extract_urls(combined_path)`

Extract URLs from the combined file.
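A sketch of a naive extraction approach, assuming URLs are pulled out with a regular expression (the actual pattern the library uses may differ):

```python
import re

def extract_urls_sketch(text):
    # Naive URL matcher: take everything after http(s):// up to whitespace
    # or a closing brace/paren/bracket, which covers URLs embedded in TeX
    # source such as \url{...} and \href{...}{...}.
    return re.findall(r"https?://[^\s}\)\]]+", text)
```

Trailing punctuation (a sentence-ending period, for instance) can cling to a matched URL, so real pipelines usually post-process the matches.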
### `find_data_repository_links(url_list, allowed_domains=['github', 'gitlab', 'bitbucket', 'zenodo'])`

Find URLs belonging to allowed domains.
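The filtering step can be sketched as a hostname check against the allowed platform names (the function name here is hypothetical):

```python
from urllib.parse import urlparse

def find_repo_links_sketch(url_list, allowed_domains=("github", "gitlab", "bitbucket", "zenodo")):
    # Keep only URLs whose hostname mentions one of the allowed platforms.
    found = []
    for url in url_list:
        host = urlparse(url).netloc.lower()
        if any(domain in host for domain in allowed_domains):
            found.append(url)
    return found
```

Matching on the hostname rather than the whole URL avoids false positives such as a paper URL that merely mentions "github" in its path.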
### `find_variables(combined_path)`

Return a list of `(variable_category, matched_phrase)` pairs found in the paper.
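A minimal sketch of the matching loop, assuming the keyword patterns arrive as a category-to-regex mapping (as the `keywords` module below suggests); names are illustrative:

```python
import re

def find_variables_sketch(text, patterns):
    # patterns maps a variable category to a compiled regex; return every
    # (category, matched phrase) pair found in the paper text.
    hits = []
    for category, pattern in patterns.items():
        for match in pattern.finditer(text):
            hits.append((category, match.group(0)))
    return hits
```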
### `parse_arxiv_id(arxiv_url)`

Extract the canonical arXiv identifier (without version suffix or file extension) from `arxiv_url`, which may be any of:

- `https://arxiv.org/abs/1909.00066v1`
- `https://arxiv.org/pdf/1909.00066.pdf`
- `https://arxiv.org/pdf/1909.00066v2.pdf`
- `https://arxiv.org/src/1909.00066`
- `https://arxiv.org/e-print/1909.00066`

Returns the bare identifier, e.g. `1909.00066`.
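One way to sketch this parsing, assuming a regex over the known arXiv routes (the library's actual implementation may differ, and old-style identifiers like `math/0101001` are not covered here):

```python
import re

def parse_arxiv_id_sketch(arxiv_url):
    # Match the new-style identifier (YYMM.NNNNN) after any of the known
    # arXiv routes; version suffixes and file extensions simply fall
    # outside the captured group.
    m = re.search(r"arxiv\.org/(?:abs|pdf|src|e-print)/(\d{4}\.\d{4,5})", arxiv_url)
    if m is None:
        raise ValueError(f"unrecognised arXiv URL: {arxiv_url}")
    return m.group(1)
```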
## `repo_analyzer` module

The `repo_analyzer` module evaluates the structure of a Git repository that claims to implement the research. Its main tasks are:
- Cloning public repositories (GitHub, GitLab, Bitbucket, …).
- Searching for dependency specification files (e.g. `requirements.txt`, `environment.yml`, `pyproject.toml`, `Dockerfile`, …).
- Detecting wrapper scripts or entry-point files (`run.py`, `main.sh`, `Makefile`, …).
- Parsing the project's `README` for sections that describe installation or requirements.
- Aggregating the findings into a tabular report.
### `analyze_github_repo(repo_url, clone_dir)`

Main function to clone and analyze a GitHub repository.
### `analyze_repository_structure(repo_path)`

Evaluate a repository by checking for the existence of expected files and README sections. Returns a DataFrame with the evaluation results.
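The aggregation into a DataFrame might look like the following sketch, assuming the structure checks produce a mapping from item name to a found/not-found flag (names are hypothetical):

```python
import pandas as pd

def structure_report_sketch(found):
    # found: {item name: bool} from the structure checks; one row per item,
    # mirroring the kind of tabular report the module aggregates.
    return pd.DataFrame({"item": list(found), "found": list(found.values())})
```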
### `check_files(dir_path, files_to_check, current_ext_mapping)`

Check whether the given files exist in the directory, based on `current_ext_mapping`.
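A simplified sketch of such a check, ignoring the extension mapping and just testing for each expected filename anywhere in the tree (the function name is hypothetical):

```python
from pathlib import Path

def check_files_sketch(dir_path, files_to_check):
    # Report which of the expected files exist anywhere under the directory.
    dir_path = Path(dir_path)
    found = {name: False for name in files_to_check}
    for p in dir_path.rglob("*"):
        if p.name in found:
            found[p.name] = True
    return found
```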
### `clone_repo(repo_url, cloned_path, overwrite=False)`

Clone a repository from the given URL to the given path. If the repository already exists, it is not overwritten unless `overwrite=True`.
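The overwrite logic can be sketched as follows, assuming the clone itself shells out to `git` (the shallow `--depth 1` clone is an illustrative choice, not necessarily what the library does):

```python
import shutil
import subprocess
from pathlib import Path

def clone_repo_sketch(repo_url, cloned_path, overwrite=False):
    # Skip cloning when the target already exists, unless overwrite is set;
    # otherwise shell out to git for a shallow clone.
    cloned_path = Path(cloned_path)
    if cloned_path.exists():
        if not overwrite:
            return cloned_path
        shutil.rmtree(cloned_path)
    subprocess.run(["git", "clone", "--depth", "1", repo_url, str(cloned_path)], check=True)
    return cloned_path
```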
## `keywords` module

The `keywords` module generates the lists of keywords and regular-expression patterns used by the analyser modules to identify important concepts inside paper text. Currently it implements the metrics from Bhaskar and Stodden [1].
### `generate_gunderson_dict()`

Generate a dictionary of Gunderson variables with regex patterns.

Returns:

- `dict`: A dictionary of keywords and regex patterns.
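The general shape of such a dictionary can be sketched as follows; the category names and patterns below are purely illustrative assumptions, not the actual Gunderson variables:

```python
import re

def generate_keyword_dict_sketch():
    # Illustrative categories only; the real Gunderson dictionary defines
    # its own variable set and patterns.
    return {
        "problem": re.compile(r"\b(problem statement|we address|we study)\b", re.I),
        "dataset": re.compile(r"\b(dataset|corpus|benchmark)\b", re.I),
        "hypothesis": re.compile(r"\b(hypothes[ie]s|we hypothesi[sz]e)\b", re.I),
    }
```

Pre-compiling the patterns once here lets the analyser modules run them repeatedly over paper text without recompilation.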
1. Bhaskar, A. and Stodden, V. 2024. Reproscreener: Leveraging LLMs for Assessing Computational Reproducibility of Machine Learning Pipelines. Proceedings of the 2nd ACM Conference on Reproducibility and Replicability (New York, NY, USA, Jul. 2024), 101–109.