# Architecture

This page documents the public API of the three core modules that power reproscreener.
## `paper_analyzer` module

The `paper_analyzer` module is responsible for analysing individual research papers hosted on arXiv. It can:
- Parse canonical arXiv identifiers from arbitrary arXiv URLs.
- Download either the TeX source bundle (e-print) or the PDF of the paper.
- Optionally convert PDFs to Markdown via `docling` for easier text processing.
- Extract reproducibility variables (problem statements, dataset mentions, hypotheses, etc.).
- Detect external links such as source-code or data repositories contained in the manuscript.
- Return the results as a convenient `pandas.DataFrame`.
### `analyze_arxiv_paper(arxiv_url, download_dir, url_type='tex')`

Main function to download, extract, and analyze an arXiv paper.

Parameters:

- `arxiv_url` (`str`): The arXiv URL of the paper.
- `download_dir` (`Path`): Directory to download and store the paper.
- `url_type` (`str`): Type of arXiv URL, either `"tex"` or `"pdf"`.

Returns:

- `pd.DataFrame`: Analysis results as a DataFrame.
### `analyze_content(folder_path, paper_id, title)`

Evaluate a paper by extracting variables and URLs from its files. Returns a DataFrame with evaluation results.
### `combine_files_in_folder(folder_path, file_extensions=['.tex', '.md', '.txt'])`

Combine all files with the specified extensions in a given directory into a single file.
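A minimal sketch of how such a helper might work (the function name `combine_files_sketch` and the output filename `combined.txt` are illustrative assumptions, not the library's actual behaviour):

```python
from pathlib import Path

def combine_files_sketch(folder_path, file_extensions=(".tex", ".md", ".txt")):
    # Concatenate every matching file in the folder into one combined file.
    # The output name "combined.txt" is an assumption for illustration.
    folder_path = Path(folder_path)
    combined = folder_path / "combined.txt"
    parts = [
        f.read_text(encoding="utf-8", errors="ignore")
        for f in sorted(folder_path.iterdir())
        if f.suffix in file_extensions and f.name != "combined.txt"
    ]
    combined.write_text("\n".join(parts), encoding="utf-8")
    return combined
```

Merging sources first means later steps (variable and URL extraction) only ever scan a single file.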
### `download_extract_source(arxiv_url, path_download)`

Download and extract the source bundle of an arXiv paper from its URL, retrieving the paper title from the arXiv API. Returns the paper title and the download directory path.
### `download_pdf_and_convert(arxiv_url, path_download)`

Download a PDF from arXiv and convert it to Markdown using `docling`. Returns the paper title and the path to the Markdown file.
### `extract_urls(combined_path)`

Extract URLs from the combined file.
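A sketch of a naive extraction approach, assuming URLs are pulled out with a regular expression (the actual pattern the library uses may differ):

```python
import re

def extract_urls_sketch(text):
    # Naive URL matcher: take everything after http(s):// up to whitespace
    # or a closing brace/paren/bracket, which covers URLs embedded in TeX
    # source such as \url{...} and \href{...}{...}.
    return re.findall(r"https?://[^\s}\)\]]+", text)
```

Trailing punctuation (a sentence-ending period, for instance) can cling to a matched URL, so real pipelines usually post-process the matches.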
### `find_data_repository_links(url_list, allowed_domains=['github', 'gitlab', 'bitbucket', 'zenodo'])`

Find URLs belonging to allowed domains.
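The filtering step can be sketched as a hostname check against the allowed platform names (the function name here is hypothetical):

```python
from urllib.parse import urlparse

def find_repo_links_sketch(url_list, allowed_domains=("github", "gitlab", "bitbucket", "zenodo")):
    # Keep only URLs whose hostname mentions one of the allowed platforms.
    found = []
    for url in url_list:
        host = urlparse(url).netloc.lower()
        if any(domain in host for domain in allowed_domains):
            found.append(url)
    return found
```

Matching on the hostname rather than the whole URL avoids false positives such as a paper URL that merely mentions "github" in its path.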
### `find_variables(combined_path)`

Return a list of `(variable_category, matched_phrase)` pairs found in the paper.
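A minimal sketch of the matching loop, assuming the keyword patterns arrive as a category-to-regex mapping (as the `keywords` module below suggests); names are illustrative:

```python
import re

def find_variables_sketch(text, patterns):
    # patterns maps a variable category to a compiled regex; return every
    # (category, matched phrase) pair found in the paper text.
    hits = []
    for category, pattern in patterns.items():
        for match in pattern.finditer(text):
            hits.append((category, match.group(0)))
    return hits
```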
### `parse_arxiv_id(arxiv_url)`

Extract the canonical arXiv identifier (without version suffix or file extension) from `arxiv_url`, which may be any of:

- `https://arxiv.org/abs/1909.00066v1`
- `https://arxiv.org/pdf/1909.00066.pdf`
- `https://arxiv.org/pdf/1909.00066v2.pdf`
- `https://arxiv.org/src/1909.00066`
- `https://arxiv.org/e-print/1909.00066`

Returns the bare identifier, e.g. `1909.00066`.
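One way to sketch this parsing, assuming a regex over the known arXiv routes (the library's actual implementation may differ, and old-style identifiers like `math/0101001` are not covered here):

```python
import re

def parse_arxiv_id_sketch(arxiv_url):
    # Match the new-style identifier (YYMM.NNNNN) after any of the known
    # arXiv routes; version suffixes and file extensions simply fall
    # outside the captured group.
    m = re.search(r"arxiv\.org/(?:abs|pdf|src|e-print)/(\d{4}\.\d{4,5})", arxiv_url)
    if m is None:
        raise ValueError(f"unrecognised arXiv URL: {arxiv_url}")
    return m.group(1)
```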
## `repo_analyzer` module

The `repo_analyzer` module evaluates the structure of a Git repository that claims to implement the research. Its main tasks are:
- Cloning public repositories (GitHub, GitLab, Bitbucket, …).
- Searching for dependency specification files (e.g. `requirements.txt`, `environment.yml`, `pyproject.toml`, `Dockerfile`, …).
- Detecting wrapper scripts or entry-point files (`run.py`, `main.sh`, `Makefile`, …).
- Parsing the project's `README` for sections that describe installation or requirements.
- Aggregating the findings into a tabular report.
### `analyze_github_repo(repo_url, clone_dir)`

Main function to clone and analyze a GitHub repository.
### `analyze_repository_structure(repo_path)`

Evaluate a repository by checking for the existence of expected files and README sections. Returns a DataFrame with the evaluation results.
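The aggregation into a DataFrame might look like the following sketch, assuming the structure checks produce a mapping from item name to a found/not-found flag (names are hypothetical):

```python
import pandas as pd

def structure_report_sketch(found):
    # found: {item name: bool} from the structure checks; one row per item,
    # mirroring the kind of tabular report the module aggregates.
    return pd.DataFrame({"item": list(found), "found": list(found.values())})
```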
### `check_files(dir_path, files_to_check, current_ext_mapping)`

Check whether the given files exist in the directory, based on `current_ext_mapping`.
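A simplified sketch of such a check, ignoring the extension mapping and just testing for each expected filename anywhere in the tree (the function name is hypothetical):

```python
from pathlib import Path

def check_files_sketch(dir_path, files_to_check):
    # Report which of the expected files exist anywhere under the directory.
    dir_path = Path(dir_path)
    found = {name: False for name in files_to_check}
    for p in dir_path.rglob("*"):
        if p.name in found:
            found[p.name] = True
    return found
```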
### `clone_repo(repo_url, cloned_path, overwrite=False)`

Clone a repository from the given URL to the given path. If the repository already exists, it is not overwritten unless `overwrite=True`.
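The overwrite logic can be sketched as follows, assuming the clone itself shells out to `git` (the shallow `--depth 1` clone is an illustrative choice, not necessarily what the library does):

```python
import shutil
import subprocess
from pathlib import Path

def clone_repo_sketch(repo_url, cloned_path, overwrite=False):
    # Skip cloning when the target already exists, unless overwrite is set;
    # otherwise shell out to git for a shallow clone.
    cloned_path = Path(cloned_path)
    if cloned_path.exists():
        if not overwrite:
            return cloned_path
        shutil.rmtree(cloned_path)
    subprocess.run(["git", "clone", "--depth", "1", repo_url, str(cloned_path)], check=True)
    return cloned_path
```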
## `keywords` module

The `keywords` module generates the lists of keywords and regular-expression patterns used by the analyser modules to identify important concepts inside paper text. Currently it implements the metrics from Bhaskar and Stodden [1].
### `generate_gunderson_dict()`

Generate a dictionary of Gunderson variables with regex patterns.

Returns:

- `dict`: A dictionary of keywords and regex patterns.
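The general shape of such a dictionary can be sketched as follows; the category names and patterns below are purely illustrative assumptions, not the actual Gunderson variables:

```python
import re

def generate_keyword_dict_sketch():
    # Illustrative categories only; the real Gunderson dictionary defines
    # its own variable set and patterns.
    return {
        "problem": re.compile(r"\b(problem statement|we address|we study)\b", re.I),
        "dataset": re.compile(r"\b(dataset|corpus|benchmark)\b", re.I),
        "hypothesis": re.compile(r"\b(hypothes[ie]s|we hypothesi[sz]e)\b", re.I),
    }
```

Pre-compiling the patterns once here lets the analyser modules run them repeatedly over paper text without recompilation.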
1. Bhaskar, A. and Stodden, V. 2024. Reproscreener: Leveraging LLMs for Assessing Computational Reproducibility of Machine Learning Pipelines. Proceedings of the 2nd ACM Conference on Reproducibility and Replicability (New York, NY, USA, Jul. 2024), 101–109.